.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         formfeed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or backreference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one byte, even in UTF-8 mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal whitespace character
  \eH         a character that is not a horizontal whitespace character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a whitespace character
  \eS         a character that is not a whitespace character
  \ev         a vertical whitespace character
  \eV         a character that is not a vertical whitespace character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         an extended Unicode sequence
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting the
PCRE_UCP option.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
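.sp
As a rough illustration of the greedy and lazy rows above, the sketch below
uses Python's standard re module. That choice is an assumption made for
convenience: re is not PCRE, although it accepts the same syntax for these two
quantifier forms. The subject string is a made-up example.
.sp
.nf
```python
import re

# Hypothetical subject string, chosen only for illustration.
subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.search("<.+>", subject).group()

# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.search("<.+?>", subject).group()

print(greedy)
print(lazy)
```
.fi
.sp
The greedy form returns the whole subject, while the lazy form returns only the
first bracketed item. A possessive quantifier matches like a greedy one but
never gives characters back when the rest of the pattern fails.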
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
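.sp
The Python-style named group and the non-capturing group listed above can be
sketched with Python's standard re module. This is an assumption made for
illustration only: re is not PCRE, and it does not accept the Perl-style
quoted form or the (?|...) form. The pattern and the subject are hypothetical.
.sp
.nf
```python
import re

# Hypothetical pattern: two named groups around one non-capturing group.
pattern = "(?P<year>[0-9]{4})-(?:[0-9]{2})-(?P<day>[0-9]{2})"

match = re.match(pattern, "2010-11-21")

year = match.group("year")   # text matched by the group named "year"
day = match.group("day")     # text matched by the group named "day"
groups = match.groups()      # the (?:...) group does not appear here

print(year, day, groups)
```
.fi
.sp
Only the two named groups are numbered and returned; the middle group takes
part in the match but captures nothing.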
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
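.sp
The fixed-length rule can be sketched with Python's standard re module, which
is used here only as a convenient stand-in and is not PCRE. One difference to
keep in mind: re requires the whole look behind to have a single width, which
is stricter than the per-branch rule stated above. The patterns and the subject
are hypothetical.
.sp
.nf
```python
import re

# Fixed-length look behind: "USD " is always four characters long.
amount = re.search("(?<=USD )[0-9]+", "cost: USD 42").group()

# A look behind with no fixed length is rejected when it is compiled.
try:
    re.compile("(?<=a+)b")
    rejected = False
except re.error:
    rejected = True

print(amount, rejected)
```
.fi
.sp
The first pattern matches the digits without including the currency prefix in
the result; the second never compiles because a+ can match any number of
characters.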
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
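.sp
As a small sketch of the Python-style named reference in the last row, the
example below uses Python's standard re module. This is an assumption for
illustration: re is not PCRE and does not accept the Perl or .NET forms listed
above. The group name and the subject strings are hypothetical.
.sp
.nf
```python
import re

# (?P<ch>...) captures one letter; (?P=ch) must then match that same letter
# again, so the pattern finds a doubled letter.
pattern = "(?P<ch>[a-z])(?P=ch)"

doubled = re.search(pattern, "letter").group()
none_found = re.search(pattern, "abc") is None

print(doubled, none_found)
```
.fi
.sp
A backreference matches the text that the group actually captured, not the
group's pattern, which is why "ab" does not satisfy it.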
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
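.sp
The absolute reference condition can be sketched with Python's standard re
module, which supports the (?(n)...) and (?(name)...) forms. The module is an
assumption chosen for illustration and is not PCRE; the recursion, DEFINE, and
assertion conditions listed above are not available in it. The pattern and the
subjects are hypothetical.
.sp
.nf
```python
import re

# Group 1 optionally matches "<". The condition (?(1)>) demands a closing
# ">" only when group 1 took part in the match.
pattern = "(<)?[a-z]+(?(1)>)"

balanced = re.fullmatch(pattern, "<abc>") is not None
bare = re.fullmatch(pattern, "abc") is not None
unclosed = re.fullmatch(pattern, "<abc") is not None

print(balanced, bare, unclosed)
```
.fi
.sp
The bracketed and the bare word both match, while the word with an opening
bracket and no closing bracket does not.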
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act immediately when they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 21 November 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi