Regular Expressions Tutorial

Here are my short notes on regular expressions.

Wildcard metacharacters

Single character . (dot)

n.t matches => not ✔ net ✔ nut ✔ n#t ✔ n.t ✔ , neat ✘


Note:
5.00 => 5.00 ✔ 5500 ✔ 5-00 ✔
So we need to escape the dot with a backslash to match a literal dot:

5\.00 => 5.00 ✔ 5500 ✘ 5-00 ✘
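
The difference can be checked with a small Python sketch. Python's re module is my choice here (the notes themselves are not tied to one engine), and the variable names are only for illustration.

```python
import re

# Unescaped dot: any single character may sit between "5" and "00".
loose = re.compile(r"5.00")
# Escaped dot: only a literal "." is accepted.
strict = re.compile(r"5\.00")

samples = ["5.00", "5500", "5-00"]
loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)   # ['5.00', '5500', '5-00']
print(strict_hits)  # ['5.00']
```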

Other special characters
  • Spaces
  • Tabs (\t)
  • Line returns (\r, \n, \r\n)
  • Non-printable characters: bell (\a), escape (\e), form feed (\f), vertical tab (\v)
  • ASCII or ANSI codes: 0xA9 = \xA9

Character sets [ ]


[aeiou] any vowel

b[aeiou]t => bat ✔ but ✔  bait ✘


Character Ranges -


[0-9]
[A-Za-z]

[0-9][0-9][0-9] => 576
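
As a rough illustration in Python (assuming the standard re module; the sample words and the "room 576" text are made up for the example), sets and ranges behave like this:

```python
import re

# A set matches exactly one character out of those listed.
vowel_word = re.compile(r"b[aeiou]t")
set_hits = [w for w in ["bat", "but", "bait"] if vowel_word.fullmatch(w)]
print(set_hits)  # ['bat', 'but']

# Three ranges in a row match three consecutive digits.
three_digits = re.compile(r"[0-9][0-9][0-9]")
digit_match = three_digits.search("room 576")
print(digit_match.group())  # 576
```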


Negate a character set ^


[^aeiou] not a vowel

see[^mn] => seek ✔  seem ✘ seen ✘

 Note:
A negated set still has to match one character, and that character can be a space:
see[^mn] => 'see ' ✔ 'see' ✘
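
A quick Python check of the negated set, as a sketch using the standard re module (the dictionary is just a convenient way to show each result):

```python
import re

not_m_or_n = re.compile(r"see[^mn]")

# [^mn] must consume one character, so a bare "see" cannot match.
samples = ["seek", "seem", "seen", "see ", "see"]
results = {s: bool(not_m_or_n.search(s)) for s in samples}
print(results)
# {'seek': True, 'seem': False, 'seen': False, 'see ': True, 'see': False}
```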

The following characters must be escaped inside sets:
] - ^ \
(Other metacharacters, such as ., are treated as literals inside sets and need no escaping. Escape sequences such as \n keep their meaning.)


Shorthand character sets  

 

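  • \d -> digit [0-9]
  • \w -> word character [a-zA-Z0-9_]
  • \s -> whitespace [ \t\r\n]
  • \D -> not digit [^0-9]
  • \W -> not word character [^a-zA-Z0-9_]
  • \S -> not whitespace [^ \t\r\n]
Note:
[^\d\s] is not the same as [\D\S].
[^\d\s] matches a character that is neither a digit nor whitespace.
[\D\S] matches a character that is a non-digit or a non-whitespace, which is any character.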
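
The note about [^\d\s] and [\D\S] is easy to confirm with a short Python sketch (standard re module assumed; the sample text "a1 b" is made up):

```python
import re

text = "a1 b"

# Neither a digit nor whitespace: only the letters survive.
neither = re.findall(r"[^\d\s]", text)
# Non-digit OR non-whitespace: every character qualifies.
either = re.findall(r"[\D\S]", text)

print(neither)  # ['a', 'b']
print(either)   # ['a', '1', ' ', 'b']
```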


Bracket Expressions


[:alpha:] = A-Za-z
[:digit:] = 0-9

[[:alpha:]]  (the class name must be used inside a set)
Not supported in Java, JavaScript, .NET, Python
Supported in PHP, Perl, Ruby, Unix tools


Repetition metacharacters


*  zero or more
+  one or more
?  zero or one

bananas* => banana ✔  bananas ✔  bananasssss ✔
bananas+ =>  banana ✘  bananas ✔  bananasssss ✔
bananas? =>  banana ✔  bananas ✔  bananasssss ✘
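
Here is a small Python sketch of the three quantifiers, assuming the standard re module. fullmatch is used so that the whole word has to fit the pattern, which is how the ✔ and ✘ marks above are meant.

```python
import re

words = ["banana", "bananas", "bananasssss"]

# * allows zero or more of the final "s".
star = [w for w in words if re.fullmatch(r"bananas*", w)]
# + requires at least one "s".
plus = [w for w in words if re.fullmatch(r"bananas+", w)]
# ? allows at most one "s".
optional = [w for w in words if re.fullmatch(r"bananas?", w)]

print(star)      # ['banana', 'bananas', 'bananasssss']
print(plus)      # ['bananas', 'bananasssss']
print(optional)  # ['banana', 'bananas']
```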


Quantified repetition { }


{min,max}  min is required, max is optional.

\d{3,5} 3 to 5 digits
\d{3}  exactly 3 digits
\d{3,} 3 or more digits

\d{0,}  = \d*
\d{1,} = \d+

\d{2}-\d{3}-\d{4} => 35-345-7896 ✔
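
A short Python sketch of quantified repetition (standard re module assumed; the digit strings are made-up test inputs):

```python
import re

# {3,5} accepts between three and five digits, nothing shorter or longer.
candidates = ["12", "123", "12345", "123456"]
in_range = [c for c in candidates if re.fullmatch(r"\d{3,5}", c)]
print(in_range)  # ['123', '12345']

# Exact counts can be chained to describe a fixed layout.
number = re.fullmatch(r"\d{2}-\d{3}-\d{4}", "35-345-7896")
print(number is not None)  # True
```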


Greedy and Lazy Expressions

.*[0-9]+  => page 120 //greedy: .* takes 'page 12', [0-9]+ gets only '0'
.*?[0-9]+ => page 120 //lazy: .*? takes 'page ', [0-9]+ gets '120'

? after a quantifier makes it lazy:
  • *?
  • +?
  • {min,max}?
  • ??
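
To see which part of "page 120" each half of the pattern takes, here is a Python sketch. The capturing groups are my addition for illustration only; the patterns are otherwise the same as above.

```python
import re

text = "page 120"

# Greedy: .* grabs as much as it can and gives back one digit.
greedy = re.match(r"(.*)([0-9]+)", text)
# Lazy: .*? grabs as little as it can, leaving all digits to [0-9]+.
lazy = re.match(r"(.*?)([0-9]+)", text)

print(greedy.groups())  # ('page 12', '0')
print(lazy.groups())    # ('page ', '120')
```

Both patterns match the whole string; only the split between the two parts changes.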



Grouping metacharacter ( ) 


(abc)+  => abc ✔ abcabc ✔
gun(s)?  => gun  ✔ guns ✔


Alternation metacharacter | 

  
mango|apple => apple ✔ mango ✔
 r(u|a)n => run ✔  ran ✔
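
Grouping and alternation together, as a minimal Python sketch (standard re module assumed; "grape" and "rin" are extra made-up inputs that should not match):

```python
import re

# A quantifier after a group applies to the whole group.
repeated = bool(re.fullmatch(r"(abc)+", "abcabc"))
optional_s = [w for w in ["gun", "guns"] if re.fullmatch(r"gun(s)?", w)]

# Alternation picks one of the listed options.
fruits = [w for w in ["apple", "mango", "grape"] if re.fullmatch(r"mango|apple", w)]

# A group limits the alternation to one part of the pattern.
verbs = [m.group() for m in re.finditer(r"r(u|a)n", "run ran rin")]

print(repeated)    # True
print(optional_s)  # ['gun', 'guns']
print(fruits)      # ['apple', 'mango']
print(verbs)       # ['run', 'ran']
```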


Start and End anchors 


^ Start of string/line
$ End of string/line
\A Start of string only
\Z End of string only

\A and \Z are supported in Java, PHP, .NET, Perl, Python, Ruby (not in JavaScript)

^mango or \Amango   Beginning of string
mango$ or mango\Z  End of string
^mango$ or \Amango\Z

 Note:
^[A-Z]  = ^ outside a set anchors to the beginning of the string
[^A-Z]  = ^ at the start of a set negates the set
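
The string/line distinction shows up once the text has more than one line. A Python sketch, assuming the standard re module, where line mode is switched on with the re.MULTILINE flag (other engines use their own flag for this):

```python
import re

text = "apple\nmango"  # two lines, "mango" starts the second one

# Without MULTILINE, ^ only matches at the start of the string.
caret_default = re.search(r"^mango", text)
# With MULTILINE, ^ also matches at the start of each line.
caret_multiline = re.search(r"^mango", text, re.MULTILINE)
# \A ignores MULTILINE and only matches at the very start.
start_only = re.search(r"\Amango", text, re.MULTILINE)
# \Z matches at the end of the string.
end_only = re.search(r"mango\Z", text)

print(caret_default)            # None
print(caret_multiline.group())  # mango
print(start_only)               # None
print(end_only.group())         # mango
```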


Word Boundaries


\b word boundary (start/end of word)
\B not a word boundary

\b\w+\b => test string. I'm a boy.
                   matches  test, string, I, m, a, boy
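
The same example in Python, as a sketch with the standard re module. The apostrophe is not a word character, which is why "I'm" is split into two matches.

```python
import re

sentence = "test string. I'm a boy."

# \b sits between a word character and a non-word character.
words = re.findall(r"\b\w+\b", sentence)
print(words)  # ['test', 'string', 'I', 'm', 'a', 'boy']
```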


Back References


\1 to \9  (some regex engines use $1 to $9)

(mango) to \1 => mango to mango
<(i|em).+?</\1>  => <i>test</i> ✔
                                 <em>test</em>  ✔
                                 <i>test</em> ✘
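
A Python sketch of the back reference examples (standard re module assumed). Python uses the \1 form inside patterns.

```python
import re

# \1 must repeat exactly what group 1 captured.
same_word = bool(re.fullmatch(r"(mango) to \1", "mango to mango"))
print(same_word)  # True

# The closing tag has to use the same name as the opening tag.
tag = re.compile(r"<(i|em).+?</\1>")
samples = ["<i>test</i>", "<em>test</em>", "<i>test</em>"]
balanced = [s for s in samples if tag.fullmatch(s)]
print(balanced)  # ['<i>test</i>', '<em>test</em>']
```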


Non-capturing group expression ?:


 (?:orange) and (apples) to \1  => orange and apples to apples
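
In Python (standard re module assumed) the effect on group numbering can be seen directly: the non-capturing group is skipped, so (apples) becomes group 1.

```python
import re

pattern = re.compile(r"(?:orange) and (apples) to \1")
match = pattern.fullmatch("orange and apples to apples")

# Only one group was captured, and \1 refers to it.
print(match.groups())  # ('apples',)
```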



Look ahead assertions ?=  ?!


 Positive look ahead ?=
(?=regex)

(?=seashore)sea => 'sea' in seashore ✔ 'sea' in seaside ✘
sea(?=shore) matches the same text

e.g. find words that are followed by a comma
\b[A-Za-z']+?\b(?=,)

Negative look ahead ?!
(?!regex)

(?!seashore)sea => 'sea' in seaside ✔ 'sea' in seashore ✘
sea(?!shore) matches the same text
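
A Python sketch of both look ahead forms (standard re module assumed; the sample sentences are made up). Match positions are printed to show which 'sea' was found, since the matched text is 'sea' either way.

```python
import re

text = "seashore seaside"  # 'sea' starts at index 0 and at index 9

# Positive look ahead: 'sea' only when 'shore' follows.
positive = [m.start() for m in re.finditer(r"sea(?=shore)", text)]
# Negative look ahead: 'sea' only when 'shore' does not follow.
negative = [m.start() for m in re.finditer(r"sea(?!shore)", text)]
print(positive)  # [0]
print(negative)  # [9]

# Words followed by a comma; the comma itself is not part of the match.
before_comma = re.findall(r"\b[A-Za-z']+?\b(?=,)", "apples, pears, and plums")
print(before_comma)  # ['apples', 'pears']
```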


Look behind assertions ?<=  ?<!


Positive look behind ?<= 

 (?<=base)ball  =>  'ball' in baseball ✔  'ball' in football ✘

Negative Look Behind ?<!

 (?<!base)ball  =>  'ball' in football ✔ 'ball' in baseball ✘

 Note:
Look behind assertions were not supported in JavaScript before ES2018
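
The look behind examples as a Python sketch (standard re module assumed, which supports look behind as long as the pattern inside it has a fixed width, as 'base' does):

```python
import re

text = "baseball football"  # 'ball' starts at index 4 and at index 13

# Positive look behind: 'ball' only when preceded by 'base'.
after_base = [m.start() for m in re.finditer(r"(?<=base)ball", text)]
# Negative look behind: 'ball' only when not preceded by 'base'.
not_after_base = [m.start() for m in re.finditer(r"(?<!base)ball", text)]

print(after_base)      # [4]
print(not_after_base)  # [13]
```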


Matching Unicode


Unicode indicator \u
caf\u00E9 => café ✔  cafe ✘

Unicode wildcard  \X
caf\X => café ✔  cafe ✔    //Only works in PHP and Perl

Unicode property \p{property}, negated property \P{property}
  • L -> Letter
  • M -> Mark
  • Z -> Separator
  • S -> Symbol
  • N -> Number
  • P -> Punctuation
  • C -> Other
 caf\P{M}\p{M} => café ✔  (when é is stored as e followed by a combining accent)

Supported in Java, .NET, Perl, PHP, Ruby
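
Python's built-in re module is not in the list above: as far as I know it supports the \u escape but not \X or \p{...}. The sketch below therefore only covers \u, and uses unicodedata.normalize (my addition, not something from the course) to handle the case where é is stored as e plus a combining accent.

```python
import re
import unicodedata

composed = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

pattern = re.compile(r"caf\u00E9")

composed_ok = bool(pattern.fullmatch(composed))
decomposed_ok = bool(pattern.fullmatch(decomposed))
# Normalizing to NFC merges e + accent into the single code point.
normalized_ok = bool(pattern.fullmatch(unicodedata.normalize("NFC", decomposed)))

print(composed_ok)    # True
print(decomposed_ok)  # False
print(normalized_ok)  # True
```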


Reference: Using Regular Expressions - Lynda.com
