| 1 | .TH PCRESYNTAX 3 |
| 2 | .SH NAME |
| 3 | PCRE - Perl-compatible regular expressions |
| 4 | .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY" |
| 5 | .rs |
| 6 | .sp |
| 7 | The full syntax and semantics of the regular expressions that are supported by |
| 8 | PCRE are described in the |
| 9 | .\" HREF |
| 10 | \fBpcrepattern\fP |
| 11 | .\" |
| 12 | documentation. This document contains just a quick-reference summary of the |
| 13 | syntax. |
| 14 | . |
| 15 | . |
| 16 | .SH "QUOTING" |
| 17 | .rs |
| 18 | .sp |
| 19 | \ex where x is non-alphanumeric is a literal x |
| 20 | \eQ...\eE treat enclosed characters as literal |
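The quoting rule can be tried out with a minimal sketch. It uses Python's re module as a stand-in, which is an assumption of this example: re is not PCRE, though it treats a backslash before a non-alphanumeric character the same way. It does not accept \eQ...\eE, so re.escape() is used here to approximate that form.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
literal_dot_star = re.compile(r"a\.b\*")

# re.escape() backslash-escapes the special characters in a string,
# standing in here for the \Q...\E form, which `re` does not accept.
quoted = re.compile(re.escape("1+1=2?"))

matches_literal = literal_dot_star.fullmatch("a.b*") is not None
rejects_wildcard = literal_dot_star.fullmatch("axb*") is None
matches_quoted = quoted.fullmatch("1+1=2?") is not None
```

All three flags come out True: the escaped dot and star match only themselves, and the escaped string matches only its own text.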
| 21 | . |
| 22 | . |
| 23 | .SH "CHARACTERS" |
| 24 | .rs |
| 25 | .sp |
| 26 | \ea alarm, that is, the BEL character (hex 07) |
| 27 | \ecx "control-x", where x is any ASCII character |
| 28 | \ee escape (hex 1B) |
| 29 | \ef formfeed (hex 0C) |
| 30 | \en newline (hex 0A) |
| 31 | \er carriage return (hex 0D) |
| 32 | \et tab (hex 09) |
| 33 | \eddd character with octal code ddd, or backreference |
| 34 | \exhh character with hex code hh |
| 35 | \ex{hhh..} character with hex code hhh.. |
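The note that a backslash followed by digits is either an octal code or a backreference is easier to see in a small example. The sketch below assumes Python's re module as a stand-in for PCRE; for the escapes used here it behaves the same way, but it does not accept every escape in the list above (the brace form of the hex escape, for instance).

```python
import re

# \x41 is hex for "A"; \102 is three octal digits, so it means "B"
# (octal 102), not a reference to a group numbered 102.
hex_and_octal = re.fullmatch(r"\x41\102\t", "AB\t")

# A single digit after a backslash refers to capture group 1.
backref = re.fullmatch(r"(a)\1", "aa")
```

Both calls return a match object. The second one succeeds only because the text after the group repeats what the group captured.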
| 36 | . |
| 37 | . |
| 38 | .SH "CHARACTER TYPES" |
| 39 | .rs |
| 40 | .sp |
| 41 | . any character except newline; |
| 42 | in dotall mode, any character whatsoever |
| 43 | \eC one byte, even in UTF-8 mode (best avoided) |
| 44 | \ed a decimal digit |
| 45 | \eD a character that is not a decimal digit |
| 46 | \eh a horizontal whitespace character |
| 47 | \eH a character that is not a horizontal whitespace character |
| 48 | \eN a character that is not a newline |
| 49 | \ep{\fIxx\fP} a character with the \fIxx\fP property |
| 50 | \eP{\fIxx\fP} a character without the \fIxx\fP property |
| 51 | \eR a newline sequence |
| 52 | \es a whitespace character |
| 53 | \eS a character that is not a whitespace character |
| 54 | \ev a vertical whitespace character |
| 55 | \eV a character that is not a vertical whitespace character |
| 56 | \ew a "word" character |
| 57 | \eW a "non-word" character |
| 58 | \eX an extended Unicode sequence |
| 59 | .sp |
| 60 | In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII |
| 61 | characters, even in UTF-8 mode. However, this can be changed by setting the |
| 62 | PCRE_UCP option. |
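The difference between ASCII-only and Unicode-aware digit matching can be sketched as follows. This is an illustration in Python's re module, not in PCRE itself, and the defaults are reversed: re is Unicode-aware for str patterns unless re.ASCII is given, whereas PCRE is ASCII-only unless PCRE_UCP is set.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# ASCII-only matching, comparable to PCRE's default behaviour.
ascii_only = re.fullmatch(r"\d", arabic_three, flags=re.ASCII)

# Unicode-aware matching, comparable to what PCRE_UCP enables.
unicode_aware = re.fullmatch(r"\d", arabic_three)
```

Here ascii_only is None while unicode_aware is a match object, which is the same contrast PCRE_UCP introduces.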
| 63 | . |
| 64 | . |
| 65 | .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" |
| 66 | .rs |
| 67 | .sp |
| 68 | C Other |
| 69 | Cc Control |
| 70 | Cf Format |
| 71 | Cn Unassigned |
| 72 | Co Private use |
| 73 | Cs Surrogate |
| 74 | .sp |
| 75 | L Letter |
| 76 | Ll Lower case letter |
| 77 | Lm Modifier letter |
| 78 | Lo Other letter |
| 79 | Lt Title case letter |
| 80 | Lu Upper case letter |
| 81 | L& Ll, Lu, or Lt |
| 82 | .sp |
| 83 | M Mark |
| 84 | Mc Spacing mark |
| 85 | Me Enclosing mark |
| 86 | Mn Non-spacing mark |
| 87 | .sp |
| 88 | N Number |
| 89 | Nd Decimal number |
| 90 | Nl Letter number |
| 91 | No Other number |
| 92 | .sp |
| 93 | P Punctuation |
| 94 | Pc Connector punctuation |
| 95 | Pd Dash punctuation |
| 96 | Pe Close punctuation |
| 97 | Pf Final punctuation |
| 98 | Pi Initial punctuation |
| 99 | Po Other punctuation |
| 100 | Ps Open punctuation |
| 101 | .sp |
| 102 | S Symbol |
| 103 | Sc Currency symbol |
| 104 | Sk Modifier symbol |
| 105 | Sm Mathematical symbol |
| 106 | So Other symbol |
| 107 | .sp |
| 108 | Z Separator |
| 109 | Zl Line separator |
| 110 | Zp Paragraph separator |
| 111 | Zs Space separator |
| 112 | . |
| 113 | . |
| 114 | .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP" |
| 115 | .rs |
| 116 | .sp |
| 117 | Xan Alphanumeric: union of properties L and N |
| 118 | Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| 119 | Xsp Perl space: property Z or tab, NL, FF, CR |
| 120 | Xwd Perl word: property Xan or underscore |
| 121 | . |
| 122 | . |
| 123 | .SH "SCRIPT NAMES FOR \ep AND \eP" |
| 124 | .rs |
| 125 | .sp |
| 126 | Arabic, |
| 127 | Armenian, |
| 128 | Avestan, |
| 129 | Balinese, |
| 130 | Bamum, |
| 131 | Bengali, |
| 132 | Bopomofo, |
| 133 | Braille, |
| 134 | Buginese, |
| 135 | Buhid, |
| 136 | Canadian_Aboriginal, |
| 137 | Carian, |
| 138 | Cham, |
| 139 | Cherokee, |
| 140 | Common, |
| 141 | Coptic, |
| 142 | Cuneiform, |
| 143 | Cypriot, |
| 144 | Cyrillic, |
| 145 | Deseret, |
| 146 | Devanagari, |
| 147 | Egyptian_Hieroglyphs, |
| 148 | Ethiopic, |
| 149 | Georgian, |
| 150 | Glagolitic, |
| 151 | Gothic, |
| 152 | Greek, |
| 153 | Gujarati, |
| 154 | Gurmukhi, |
| 155 | Han, |
| 156 | Hangul, |
| 157 | Hanunoo, |
| 158 | Hebrew, |
| 159 | Hiragana, |
| 160 | Imperial_Aramaic, |
| 161 | Inherited, |
| 162 | Inscriptional_Pahlavi, |
| 163 | Inscriptional_Parthian, |
| 164 | Javanese, |
| 165 | Kaithi, |
| 166 | Kannada, |
| 167 | Katakana, |
| 168 | Kayah_Li, |
| 169 | Kharoshthi, |
| 170 | Khmer, |
| 171 | Lao, |
| 172 | Latin, |
| 173 | Lepcha, |
| 174 | Limbu, |
| 175 | Linear_B, |
| 176 | Lisu, |
| 177 | Lycian, |
| 178 | Lydian, |
| 179 | Malayalam, |
| 180 | Meetei_Mayek, |
| 181 | Mongolian, |
| 182 | Myanmar, |
| 183 | New_Tai_Lue, |
| 184 | Nko, |
| 185 | Ogham, |
| 186 | Old_Italic, |
| 187 | Old_Persian, |
| 188 | Old_South_Arabian, |
| 189 | Old_Turkic, |
| 190 | Ol_Chiki, |
| 191 | Oriya, |
| 192 | Osmanya, |
| 193 | Phags_Pa, |
| 194 | Phoenician, |
| 195 | Rejang, |
| 196 | Runic, |
| 197 | Samaritan, |
| 198 | Saurashtra, |
| 199 | Shavian, |
| 200 | Sinhala, |
| 201 | Sundanese, |
| 202 | Syloti_Nagri, |
| 203 | Syriac, |
| 204 | Tagalog, |
| 205 | Tagbanwa, |
| 206 | Tai_Le, |
| 207 | Tai_Tham, |
| 208 | Tai_Viet, |
| 209 | Tamil, |
| 210 | Telugu, |
| 211 | Thaana, |
| 212 | Thai, |
| 213 | Tibetan, |
| 214 | Tifinagh, |
| 215 | Ugaritic, |
| 216 | Vai, |
| 217 | Yi. |
| 218 | . |
| 219 | . |
| 220 | .SH "CHARACTER CLASSES" |
| 221 | .rs |
| 222 | .sp |
| 223 | [...] positive character class |
| 224 | [^...] negative character class |
| 225 | [x-y] range (can be used for hex characters) |
| 226 | [[:xxx:]] positive POSIX named set |
| 227 | [[:^xxx:]] negative POSIX named set |
| 228 | .sp |
| 229 | alnum alphanumeric |
| 230 | alpha alphabetic |
| 231 | ascii 0-127 |
| 232 | blank space or tab |
| 233 | cntrl control character |
| 234 | digit decimal digit |
| 235 | graph printing, excluding space |
| 236 | lower lower case letter |
| 237 | print printing, including space |
| 238 | punct printing, excluding alphanumeric |
| 239 | space whitespace |
| 240 | upper upper case letter |
| 241 | word same as \ew |
| 242 | xdigit hexadecimal digit |
| 243 | .sp |
| 244 | In PCRE, POSIX character set names recognize only ASCII characters by default, |
| 245 | but some of them use Unicode properties if PCRE_UCP is set. You can use |
| 246 | \eQ...\eE inside a character class. |
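The positive, negative, and range forms at the top of this section can be sketched as below. The example assumes Python's re module as a stand-in for PCRE and deliberately avoids the POSIX named sets, which re does not implement.

```python
import re

text = "x=7F3A, y=00ff"

# Positive class with ranges: runs of hexadecimal digits.
hex_runs = re.findall(r"[0-9A-Fa-f]+", text)

# Negative class: runs of anything that is not a letter or digit.
separators = re.findall(r"[^0-9A-Za-z]+", text)

# Range endpoints written as hex escapes: \x30-\x39 is the same as 0-9.
digits = re.findall(r"[\x30-\x39]+", text)
```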
| 247 | . |
| 248 | . |
| 249 | .SH "QUANTIFIERS" |
| 250 | .rs |
| 251 | .sp |
| 252 | ? 0 or 1, greedy |
| 253 | ?+ 0 or 1, possessive |
| 254 | ?? 0 or 1, lazy |
| 255 | * 0 or more, greedy |
| 256 | *+ 0 or more, possessive |
| 257 | *? 0 or more, lazy |
| 258 | + 1 or more, greedy |
| 259 | ++ 1 or more, possessive |
| 260 | +? 1 or more, lazy |
| 261 | {n} exactly n |
| 262 | {n,m} at least n, no more than m, greedy |
| 263 | {n,m}+ at least n, no more than m, possessive |
| 264 | {n,m}? at least n, no more than m, lazy |
| 265 | {n,} n or more, greedy |
| 266 | {n,}+ n or more, possessive |
| 267 | {n,}? n or more, lazy |
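The greedy and lazy variants differ only in how much text they take when more than one match length is possible. A minimal sketch, assuming Python's re module as a stand-in for PCRE (possessive forms are left out because older Python versions do not accept them):

```python
import re

subject = "<a><b>"

greedy = re.match(r"<.+>", subject).group()   # takes as much as possible
lazy = re.match(r"<.+?>", subject).group()    # takes as little as possible

# {n,m} bounds the repeat count; greedy by default, lazy with a trailing "?".
bounded_greedy = re.match(r"\d{2,4}", "123456").group()
bounded_lazy = re.match(r"\d{2,4}?", "123456").group()
```

The greedy pattern returns the whole subject and the lazy one stops at the first closing bracket. The bounded forms return four digits and two digits respectively.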
| 268 | . |
| 269 | . |
| 270 | .SH "ANCHORS AND SIMPLE ASSERTIONS" |
| 271 | .rs |
| 272 | .sp |
| 273 | \eb word boundary |
| 274 | \eB not a word boundary |
| 275 | ^ start of subject |
| 276 | also after internal newline in multiline mode |
| 277 | \eA start of subject |
| 278 | $ end of subject |
| 279 | also before newline at end of subject |
| 280 | also before internal newline in multiline mode |
| 281 | \eZ end of subject |
| 282 | also before newline at end of subject |
| 283 | \ez end of subject |
| 284 | \eG first matching position in subject |
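The effect of multiline mode on ^ and the behaviour of $ before a final newline can be checked with a short sketch. It assumes Python's re module as a stand-in for PCRE and shows only anchors that behave the same in both; \eZ and \ez are left out because Python's \eZ matches only at the very end of the subject, which is the PCRE \ez behaviour.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)
# In multiline mode, ^ also matches after each internal newline.
each_line = re.findall(r"^\w+", subject, flags=re.MULTILINE)

# $ matches before a newline that ends the subject.
before_final_newline = re.search(r"two$", subject) is not None

# \b matches between a word character and a non-word character.
whole_word = re.findall(r"\bcat\b", "cat concat cat.")
```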
| 285 | . |
| 286 | . |
| 287 | .SH "MATCH POINT RESET" |
| 288 | .rs |
| 289 | .sp |
| 290 | \eK reset start of match |
| 291 | . |
| 292 | . |
| 293 | .SH "ALTERNATION" |
| 294 | .rs |
| 295 | .sp |
| 296 | expr|expr|expr... |
| 297 | . |
| 298 | . |
| 299 | .SH "CAPTURING" |
| 300 | .rs |
| 301 | .sp |
| 302 | (...) capturing group |
| 303 | (?<name>...) named capturing group (Perl) |
| 304 | (?'name'...) named capturing group (Perl) |
| 305 | (?P<name>...) named capturing group (Python) |
| 306 | (?:...) non-capturing group |
| 307 | (?|...) non-capturing group; reset group numbers for |
| 308 | capturing groups in each alternative |
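Numbered, named, and non-capturing groups can be combined in one pattern. The sketch below assumes Python's re module as a stand-in for PCRE, so it uses the Python spelling of named groups; the group-number reset form (?|...) is not shown because re does not accept it. The date-like subject string is made up for illustration.

```python
import re

# One named group, one numbered group, and one non-capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(\d{2})(?:T\d{2})?")
match = pattern.fullmatch("2013-11T16")

year = match.group("year")       # access by name
month = match.group(2)           # access by number
captured = match.groups()        # the (?:...) group contributes nothing
group_count = pattern.groups     # number of capturing groups
```

The named group is also group 1, so the pattern has two capturing groups in total.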
| 309 | . |
| 310 | . |
| 311 | .SH "ATOMIC GROUPS" |
| 312 | .rs |
| 313 | .sp |
| 314 | (?>...) atomic, non-capturing group |
| 315 | . |
| 316 | . |
| 319 | .SH "COMMENT" |
| 320 | .rs |
| 321 | .sp |
| 322 | (?#....) comment (not nestable) |
| 323 | . |
| 324 | . |
| 325 | .SH "OPTION SETTING" |
| 326 | .rs |
| 327 | .sp |
| 328 | (?i) caseless |
| 329 | (?J) allow duplicate names |
| 330 | (?m) multiline |
| 331 | (?s) single line (dotall) |
| 332 | (?U) default ungreedy (lazy) |
| 333 | (?x) extended (ignore white space) |
| 334 | (?-...) unset option(s) |
| 335 | .sp |
| 336 | The following are recognized only at the start of a pattern or after one of the |
| 337 | newline-setting options with similar syntax: |
| 338 | .sp |
| 339 | (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) |
| 340 | (*UTF8) set UTF-8 mode (PCRE_UTF8) |
| 341 | (*UCP) set PCRE_UCP (use Unicode properties for \ed etc) |
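The single-letter options (?i), (?m), (?s), and (?x) can be exercised with a minimal sketch. It assumes Python's re module as a stand-in for PCRE; re accepts these four letters with the same meaning when they appear at the start of the pattern, but not (?J), (?U), or the (*...) items.

```python
import re

caseless = re.fullmatch(r"(?i)pcre", "PcRe") is not None

# Dot matches a newline only in dotall mode.
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
no_dotall = re.fullmatch(r"a.b", "a\nb") is None

# In multiline mode, ^ matches at the start of every line.
multiline = re.findall(r"(?m)^\w", "ab\ncd")

# Extended mode ignores white space and allows # comments.
extended_pattern = r"""(?x)
    \d+   # first number
    -     # literal hyphen
    \d+   # second number
"""
extended = re.fullmatch(extended_pattern, "10-20") is not None
```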
| 342 | . |
| 343 | . |
| 344 | .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" |
| 345 | .rs |
| 346 | .sp |
| 347 | (?=...) positive look ahead |
| 348 | (?!...) negative look ahead |
| 349 | (?<=...) positive look behind |
| 350 | (?<!...) negative look behind |
| 351 | .sp |
| 352 | Each top-level branch of a look behind must be of a fixed length. |
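The four assertion forms can be compared side by side in a short sketch. It assumes Python's re module as a stand-in for PCRE. Note that re is stricter about look behind: it requires the whole assertion to be fixed length rather than each top-level branch, so the last check below shows only that a variable-length look behind is rejected.

```python
import re

prices = "10 USD, 20 EUR"
followed_by_usd = re.findall(r"\d+(?= USD)", prices)          # positive look ahead
not_followed_by_usd = re.findall(r"\b\d+\b(?! USD)", prices)  # negative look ahead

amounts = "cost $15 or 20"
after_dollar = re.findall(r"(?<=\$)\d+", amounts)         # positive look behind
not_after_dollar = re.findall(r"(?<!\$)\b\d+", amounts)   # negative look behind

# A look behind containing an unbounded repeat has no fixed length.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True
```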
| 353 | . |
| 354 | . |
| 355 | .SH "BACKREFERENCES" |
| 356 | .rs |
| 357 | .sp |
| 358 | \en reference by number (can be ambiguous) |
| 359 | \egn reference by number |
| 360 | \eg{n} reference by number |
| 361 | \eg{-n} relative reference by number |
| 362 | \ek<name> reference by name (Perl) |
| 363 | \ek'name' reference by name (Perl) |
| 364 | \eg{name} reference by name (Perl) |
| 365 | \ek{name} reference by name (.NET) |
| 366 | (?P=name) reference by name (Python) |
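A backreference matches the same text that the referenced group captured, not the group's pattern. The sketch below assumes Python's re module as a stand-in for PCRE and therefore uses only the numeric form and the Python named form; the \eg and \ek forms are not accepted by re in patterns.

```python
import re

# Numeric reference: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

# Named reference: the closing quote must be the same as the opening one.
quoted = re.search(r"(?P<q>['\"])(.*?)(?P=q)", "say 'hi' now")

repeated_word = doubled.group(1)
quoted_text = quoted.group(2)
```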
| 367 | . |
| 368 | . |
| 369 | .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)" |
| 370 | .rs |
| 371 | .sp |
| 372 | (?R) recurse whole pattern |
| 373 | (?n) call subpattern by absolute number |
| 374 | (?+n) call subpattern by relative number |
| 375 | (?-n) call subpattern by relative number |
| 376 | (?&name) call subpattern by name (Perl) |
| 377 | (?P>name) call subpattern by name (Python) |
| 378 | \eg<name> call subpattern by name (Oniguruma) |
| 379 | \eg'name' call subpattern by name (Oniguruma) |
| 380 | \eg<n> call subpattern by absolute number (Oniguruma) |
| 381 | \eg'n' call subpattern by absolute number (Oniguruma) |
| 382 | \eg<+n> call subpattern by relative number (PCRE extension) |
| 383 | \eg'+n' call subpattern by relative number (PCRE extension) |
| 384 | \eg<-n> call subpattern by relative number (PCRE extension) |
| 385 | \eg'-n' call subpattern by relative number (PCRE extension) |
| 386 | . |
| 387 | . |
| 388 | .SH "CONDITIONAL PATTERNS" |
| 389 | .rs |
| 390 | .sp |
| 391 | (?(condition)yes-pattern) |
| 392 | (?(condition)yes-pattern|no-pattern) |
| 393 | .sp |
| 394 | (?(n)... absolute reference condition |
| 395 | (?(+n)... relative reference condition |
| 396 | (?(-n)... relative reference condition |
| 397 | (?(<name>)... named reference condition (Perl) |
| 398 | (?('name')... named reference condition (Perl) |
| 399 | (?(name)... named reference condition (PCRE) |
| 400 | (?(R)... overall recursion condition |
| 401 | (?(Rn)... specific group recursion condition |
| 402 | (?(R&name)... specific recursion condition |
| 403 | (?(DEFINE)... define subpattern for reference |
| 404 | (?(assert)... assertion condition |
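The absolute reference condition can be illustrated with a pattern that requires a closing parenthesis only when an opening one was matched. This is a sketch in Python's re module, assumed here as a stand-in for PCRE; re accepts the numbered and plain-name conditions but not the recursion, DEFINE, or assertion conditions.

```python
import re

# Group 1 is an optional "(". The condition (?(1)\)) demands ")" only
# if group 1 took part in the match; there is no no-pattern branch.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

subjects = ["123", "(123)", "(123", "123)"]
results = {s: balanced.match(s) is not None for s in subjects}
```

Only the first two subjects match: a bare number, and a number with both parentheses.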
| 405 | . |
| 406 | . |
| 407 | .SH "BACKTRACKING CONTROL" |
| 408 | .rs |
| 409 | .sp |
| 410 | The following act as soon as they are reached: |
| 411 | .sp |
| 412 | (*ACCEPT) force successful match |
| 413 | (*FAIL) force backtrack; synonym (*F) |
| 414 | .sp |
| 415 | The following act only when a subsequent match failure causes a backtrack to |
| 416 | reach them. They all force a match failure, but they differ in what happens |
| 417 | afterwards. Those that advance the start-of-match point do so only if the |
| 418 | pattern is not anchored. |
| 419 | .sp |
| 420 | (*COMMIT) overall failure, no advance of starting point |
| 421 | (*PRUNE) advance to next starting character |
| 422 | (*SKIP) advance start to current matching position |
| 423 | (*THEN) local failure, backtrack to next alternation |
| 424 | . |
| 425 | . |
| 426 | .SH "NEWLINE CONVENTIONS" |
| 427 | .rs |
| 428 | .sp |
| 429 | These are recognized only at the very start of the pattern or after a |
| 430 | (*BSR_...) or (*UTF8) or (*UCP) option. |
| 431 | .sp |
| 432 | (*CR) carriage return only |
| 433 | (*LF) linefeed only |
| 434 | (*CRLF) carriage return followed by linefeed |
| 435 | (*ANYCRLF) all three of the above |
| 436 | (*ANY) any Unicode newline sequence |
| 437 | . |
| 438 | . |
| 439 | .SH "WHAT \eR MATCHES" |
| 440 | .rs |
| 441 | .sp |
| 442 | These are recognized only at the very start of the pattern or after a |
| 443 | (*...) option that sets the newline convention or UTF-8 or UCP mode. |
| 444 | .sp |
| 445 | (*BSR_ANYCRLF) CR, LF, or CRLF |
| 446 | (*BSR_UNICODE) any Unicode newline sequence |
| 447 | . |
| 448 | . |
| 449 | .SH "CALLOUTS" |
| 450 | .rs |
| 451 | .sp |
| 452 | (?C) callout |
| 453 | (?Cn) callout with data n |
| 454 | . |
| 455 | . |
| 456 | .SH "SEE ALSO" |
| 457 | .rs |
| 458 | .sp |
| 459 | \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3), |
| 460 | \fBpcrematching\fP(3), \fBpcre\fP(3). |
| 461 | . |
| 462 | . |
| 463 | .SH AUTHOR |
| 464 | .rs |
| 465 | .sp |
| 466 | .nf |
| 467 | Philip Hazel |
| 468 | University Computing Service |
| 469 | Cambridge CB2 3QH, England. |
| 470 | .fi |
| 471 | . |
| 472 | . |
| 473 | .SH REVISION |
| 474 | .rs |
| 475 | .sp |
| 476 | .nf |
| 477 | Last updated: 21 November 2010 |
| 478 | Copyright (c) 1997-2010 University of Cambridge. |
| 479 | .fi |