<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains just a quick-reference summary of the
syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
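<P>
The quoting rule above can be sketched with Python's standard re module. This
is an assumption made for illustration only: Python's re is not PCRE, and it
has no \Q...\E, so re.escape is used here as a rough stand-in for quoting a
whole string.
</P>

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
literal_dot = re.compile(r"3\.14")
matches_pi = literal_dot.fullmatch("3.14") is not None
matches_other = literal_dot.fullmatch("3x14") is not None

# re.escape quotes every special character in a string. It is Python's
# nearest equivalent to \Q...\E, which Python's re does not support.
quoted = re.escape("a+b*c")
quoted_match = re.fullmatch(quoted, "a+b*c") is not None
```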
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
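<P>
A small sketch of a few of these escapes, again assuming Python's re module
as a stand-in. Only \t, \xhh and \ddd are shown; \e, \cx and \x{hhh..} are
left out because Python's re does not accept them.
</P>

```python
import re

# \x41 is the character with hex code 41 ("A"); \t is a tab (hex 09).
pattern = re.compile(r"\x41\tB")
tab_match = pattern.fullmatch("A\tB") is not None

# \101 is the same character "A" written with its three-digit octal code.
octal_match = re.fullmatch(r"\101", "A") is not None
```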
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
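<P>
The ASCII-only default described above can be approximated in Python's re
module, used here purely as an illustrative stand-in for PCRE. Python defaults
to Unicode matching for string patterns, so the re.ASCII flag is passed to
mimic PCRE's behaviour without PCRE_UCP.
</P>

```python
import re

text = "id 42\tok"

# With re.ASCII, \d, \w and \s recognize only ASCII characters.
digits = re.findall(r"\d+", text, re.ASCII)
words = re.findall(r"\w+", text, re.ASCII)
gaps = re.findall(r"\s", text, re.ASCII)

# ARABIC-INDIC DIGIT ONE (U+0661) is a decimal digit in Unicode,
# but it is not matched by \d when matching is restricted to ASCII.
ascii_only = re.fullmatch(r"\d", "\u0661", re.ASCII) is None
```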
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
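<P>
Positive classes, negative classes and ranges can be sketched as follows,
assuming Python's re module as a stand-in. The POSIX named sets such as
[[:alpha:]] are not shown because Python's re does not support them.
</P>

```python
import re

# A positive class built from three ranges: hexadecimal digits.
hex_runs = re.findall(r"[0-9a-fA-F]+", "ff, 7G, c0de")

# A negative class: delete every character that is not a lower case vowel.
vowels_only = re.sub(r"[^aeiou]", "", "character class")
```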
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
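<P>
The difference between greedy and lazy quantifiers is easiest to see on a
subject with several possible end points. The sketch below assumes Python's re
module as a stand-in; possessive quantifiers are omitted because older Python
versions do not accept them.
</P>

```python
import re

subject = "a1b2b3b"

# Greedy: .* takes as much as it can, so the match ends at the last "b".
greedy = re.search(r"a.*b", subject).group()

# Lazy: .*? takes as little as it can, so the match ends at the first "b".
lazy = re.search(r"a.*?b", subject).group()

# {n,m}: at least 2 and no more than 3 digits, greedy.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")
```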
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
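<P>
A sketch of ^, $ and \b, assuming Python's re module as a stand-in for PCRE.
\Z, \z and \G are deliberately left out: Python's \Z behaves like PCRE's \z,
and Python has no \z or \G, so they would be misleading here.
</P>

```python
import re

subject = "first line\nsecond line\n"

# By default ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", subject)

# In multiline mode ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches at the end of the subject and also before a final newline,
# so "line" on the last line is found even though "\n" follows it.
dollar = re.search(r"line$", subject)

# \b matches at a word boundary, so "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```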
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
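<P>
A brief sketch of alternation, assuming Python's re module as a stand-in. It
also shows that alternatives are tried from left to right, so an earlier
alternative wins even when a later one would match more text.
</P>

```python
import re

animal = re.compile(r"cat|dog|bird")
found = animal.findall("dog chases cat, bird watches")

# "ab" is tried first and succeeds, so "abc" is never attempted.
first_wins = re.match(r"ab|abc", "abc").group()
```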
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
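<P>
The difference between capturing and non-capturing groups can be sketched as
follows, assuming Python's re module as a stand-in. Named groups and the (?|...)
form are not shown; Python accepts only the (?P...) spelling for names and has
no (?|...) at all.
</P>

```python
import re

# Two capturing groups: each one records the text it matched.
pair = re.fullmatch(r"(\w+)=(\w+)", "key=value")
pair_groups = pair.groups()

# (?:...) groups the alternatives without recording them,
# so only the digits are captured.
size = re.fullmatch(r"(\d+)(?:px|em)", "12px")
size_groups = size.groups()
```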
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
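<P>
A sketch of the (?i), (?s) and (?x) options, assuming Python's re module as a
stand-in. Python accepts these inline options only at the very start of a
pattern, and it has no (?J), (?U) or (*...) items, so those are not shown.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.fullmatch(r"(?i)pcre", "PcRe") is not None

# Without (?s) the dot does not match a newline; with it, it does.
dot_default = re.fullmatch(r"a.b", "a\nb") is None
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored, so this is \d+\s*kg.
extended = re.fullmatch(r"(?x) \d+ \s* kg", "75 kg") is not None
```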
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
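<P>
The two look ahead forms can be sketched as follows, assuming Python's re
module as a stand-in. An assertion tests what follows the current position
without making it part of the match. Look behind is omitted here only to keep
the example short.
</P>

```python
import re

# Positive look ahead: digits that are followed by "px".
# The "px" itself is not included in the matched text.
px_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative look ahead: "foo" that is not followed by "bar".
not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
```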
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
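<P>
A sketch of a numbered backreference, assuming Python's re module as a
stand-in. Only the \n form is shown; Python's re does not accept the \g and \k
forms inside a pattern.
</P>

```python
import re

# \1 must match the same text that group 1 captured,
# so this finds words that are immediately repeated.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a a test")
```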
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
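<P>
The absolute reference condition can be sketched as follows, assuming Python's
re module as a stand-in. Python supports only the (?(n)...) and (?(name)...)
conditions; the recursion, DEFINE and assertion conditions are not available
there.
</P>

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
balanced = re.compile(r"(\()?\d+(?(1)\))")

ok_plain = balanced.fullmatch("42") is not None
ok_wrapped = balanced.fullmatch("(42)") is not None
bad_open = balanced.fullmatch("(42") is None
bad_close = balanced.fullmatch("42)") is None
```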
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 21 November 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>