Twatter: .Net Regular Expressions and accented / unicode characters

I was trying to strip any non-letter characters from a string using regular expressions, which turned out to be a bit of a pain once Unicode / accented characters were involved.

I ended up matching the stuff that I wanted and removing everything else. Regular expressions aren't really set up for this, as there isn't a general "not" operator that can be applied to a whole sub-expression.

This page was very useful:

http://www.regular-expressions.info/unicode.html

This expression did the trick for me. It matches every character, but the replacement only puts back the matches that I wanted to keep.

Regex.Replace(authors, @"(?<all>(?<allowed>\p{L}\p{M}*|[ ,;|-])|(?<notallowed>.))", "${allowed}", RegexOptions.Compiled | RegexOptions.Multiline);

\p{L} matches any single letter character (without a separate accent)
\p{M} matches any combining mark, i.e. a separate accent
\p{L}\p{M}* matches any letter character followed by any number of accents (it is possible to have more than one)
[ ,;|-] matches the special characters that I wanted to keep
the all group matches everything
the allowed group matches characters that I want to keep
the not allowed group matches anything else
The first alternative in an or (|) group that matches is the one that is used, so if the allowed group matches then the not allowed group doesn't.
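
To see which alternative wins for each character, here is a minimal sketch that lists every match and marks whether the allowed group took part in it. The group names are assumptions reconstructed from the description above ("notallowed" in particular is a guess), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Assumed group names: all, allowed, notallowed.
    const string Pattern =
        @"(?<all>(?<allowed>\p{L}\p{M}*|[ ,;|-])|(?<notallowed>.))";

    // Returns one label per match: "+" if the allowed group succeeded, "-" otherwise.
    public static string[] Classify(string input)
    {
        MatchCollection matches = Regex.Matches(input, Pattern);
        var labels = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            labels[i] = (m.Groups["allowed"].Success ? "+" : "-") + m.Value;
        }
        return labels;
    }

    static void Main()
    {
        // "e\u0301" is an e followed by a combining acute accent.
        foreach (string label in Classify("e\u0301!a-1"))
            Console.WriteLine(label);
    }
}
```

The letter plus its accent comes back as a single allowed match, while "!" and "1" fall through to the second alternative.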

"${allowed}" in the replace string replaces each match with the contents of the allowed group. Since everything is matched, nothing remains of the original string except what the replacement puts back. When a not allowed match is replaced, the allowed group is empty, so that character simply disappears.
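
Putting it together, a rough self-contained version of the replace might look like the following. The group names are again my reconstruction from the description, and AuthorCleaner / Clean are hypothetical names used only for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

static class AuthorCleaner
{
    // Assumed group names: all, allowed, notallowed.
    static readonly Regex Filter = new Regex(
        @"(?<all>(?<allowed>\p{L}\p{M}*|[ ,;|-])|(?<notallowed>.))",
        RegexOptions.Compiled);

    // Keeps letters (with their accents) and the listed separators, drops the rest.
    public static string Clean(string authors)
    {
        // Every match is replaced by whatever the allowed group captured,
        // which is an empty string when the notallowed alternative matched.
        return Filter.Replace(authors, "${allowed}");
    }

    static void Main()
    {
        // \u00FC is a precomposed u with diaeresis.
        Console.WriteLine(Clean("M\u00FCller, J. (2008)"));
    }
}
```

One thing to be aware of: "." does not match a newline unless RegexOptions.Singleline is set, so line breaks are never matched and pass through unchanged.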

Some notes:
Accented characters in unicode can be represented by a single character (for legacy reasons) or as a combination of a base character and one or more accent characters.
Thus a single character on screen such as é can be represented by either one or two unicode characters.
It is thus not possible to match accented characters in the usual way using square brackets [].
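
As a small illustration of the two representations, the sketch below compares a precomposed é with the decomposed form and shows how a square-bracket class behaves against each. It assumes string.Normalize is available on the platform in use; the class and helper names are made up for the example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class AccentForms
{
    // Splits precomposed characters into base letter + combining marks.
    public static string Decompose(string s) => s.Normalize(NormalizationForm.FormD);

    // Recombines base letter + marks into precomposed characters where they exist.
    public static string Compose(string s) => s.Normalize(NormalizationForm.FormC);

    static void Main()
    {
        string composed = "\u00E9";     // one character
        string decomposed = "e\u0301";  // e + combining acute accent

        Console.WriteLine(composed.Length);    // 1
        Console.WriteLine(decomposed.Length);  // 2

        // A bracket class containing the precomposed form misses the decomposed one.
        Console.WriteLine(Regex.IsMatch(composed, @"^[\u00E9]$"));    // True
        Console.WriteLine(Regex.IsMatch(decomposed, @"^[\u00E9]$"));  // False

        // \p{L}\p{M}* covers both forms.
        Console.WriteLine(Regex.IsMatch(composed, @"^\p{L}\p{M}*$"));    // True
        Console.WriteLine(Regex.IsMatch(decomposed, @"^\p{L}\p{M}*$"));  // True
    }
}
```

Normalizing the input to one form before matching is another option, although not every base-plus-accent combination has a precomposed equivalent.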

Monday, April 28, 2008
