Blame - jni/libpcre/sources/doc/html/pcrepattern.html - jami-client-android

blob: aa39d63066e73767b9a5c9e9addd6cff44ae4c2a

Tristan Matthews	0461646	2013-11-14 16:09:34 -0500	[diff] [blame]	1	<html>
				2	<head>
				3	<title>pcrepattern specification</title>
				4	</head>
				5	<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
				6	<h1>pcrepattern man page</h1>
				7	<p>
				8	Return to the <a href="index.html">PCRE index page</a>.
				9	</p>
				10	<p>
				11	This page is part of the PCRE HTML documentation. It was generated automatically
				12	from the original man page. If there is any nonsense in it, please consult the
				13	man page, in case the conversion went wrong.
				14	<br>
				15	<ul>
				16	<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
				17	<li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a>
				18	<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a>
				19	<li><a name="TOC4" href="#SEC4">BACKSLASH</a>
				20	<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a>
				21	<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a>
				22	<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a>
				23	<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a>
				24	<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a>
				25	<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a>
				26	<li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a>
				27	<li><a name="TOC12" href="#SEC12">SUBPATTERNS</a>
				28	<li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a>
				29	<li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a>
				30	<li><a name="TOC15" href="#SEC15">REPETITION</a>
				31	<li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
				32	<li><a name="TOC17" href="#SEC17">BACK REFERENCES</a>
				33	<li><a name="TOC18" href="#SEC18">ASSERTIONS</a>
				34	<li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a>
				35	<li><a name="TOC20" href="#SEC20">COMMENTS</a>
				36	<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a>
				37	<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a>
				38	<li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a>
				39	<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
				40	<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a>
				41	<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
				42	<li><a name="TOC27" href="#SEC27">AUTHOR</a>
				43	<li><a name="TOC28" href="#SEC28">REVISION</a>
				44	</ul>
				45	<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
				46	<P>
				47	The syntax and semantics of the regular expressions that are supported by PCRE
				48	are described in detail below. There is a quick-reference syntax summary in the
				49	<a href="pcresyntax.html"><b>pcresyntax</b></a>
				50	page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
				51	also supports some alternative regular expression syntax (which does not
				52	conflict with the Perl syntax) in order to provide some compatibility with
				53	regular expressions in Python, .NET, and Oniguruma.
				54	</P>
				55	<P>
				56	Perl's regular expressions are described in its own documentation, and
				57	regular expressions in general are covered in a number of books, some of which
				58	have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
				59	published by O'Reilly, covers regular expressions in great detail. This
				60	description of PCRE's regular expressions is intended as reference material.
				61	</P>
				62	<P>
				63	The original operation of PCRE was on strings of one-byte characters. However,
				64	there is now also support for UTF-8 character strings. To use this,
				65	PCRE must be built to include UTF-8 support, and you must call
				66	<b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There
				67	is also a special sequence that can be given at the start of a pattern:
				68	<pre>
				69	(*UTF8)
				70	</pre>
				71	Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
				72	option. This feature is not Perl-compatible. How setting UTF-8 mode affects
				73	pattern matching is mentioned in several places below. There is also a summary
				74	of UTF-8 features in the
				75	<a href="pcreunicode.html"><b>pcreunicode</b></a>
				76	page.
				77	</P>
				78	<P>
				79	Another special sequence that may appear at the start of a pattern or in
				80	combination with (*UTF8) is:
				81	<pre>
				82	(*UCP)
				83	</pre>
				84	This has the same effect as setting the PCRE_UCP option: it causes sequences
				85	such as \d and \w to use Unicode properties to determine character types,
				86	instead of recognizing only characters with codes less than 128 via a lookup
				87	table.
				88	</P>
				89	<P>
				90	If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
				91	PCRE_NO_START_OPTIMIZE option either at compile or matching time. Further
				92	special sequences of this kind control the handling of newlines; they are
				93	described below.
				94	</P>
				95	<P>
				96	The remainder of this document discusses the patterns that are supported by
				97	PCRE when its main matching function, <b>pcre_exec()</b>, is used.
				98	From release 6.0, PCRE offers a second matching function,
				99	<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not
				100	Perl-compatible. Some of the features discussed below are not available when
				101	<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the
				102	alternative function, and how it differs from the normal function, are
				103	discussed in the
				104	<a href="pcrematching.html"><b>pcrematching</b></a>
				105	page.
				106	<a name="newlines"></a></P>
				107	<br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br>
				108	<P>
				109	PCRE supports five different conventions for indicating line breaks in
				110	strings: a single CR (carriage return) character, a single LF (linefeed)
				111	character, the two-character sequence CRLF, any of the three preceding, or any
				112	Unicode newline sequence. The
				113	<a href="pcreapi.html"><b>pcreapi</b></a>
				114	page has
				115	<a href="pcreapi.html#newlines">further discussion</a>
				116	about newlines, and shows how to set the newline convention in the
				117	<i>options</i> arguments for the compiling and matching functions.
				118	</P>
				119	<P>
				120	It is also possible to specify a newline convention by starting a pattern
				121	string with one of the following five sequences:
				122	<pre>
				123	(*CR) carriage return
				124	(*LF) linefeed
				125	(*CRLF) carriage return, followed by linefeed
				126	(*ANYCRLF) any of the three above
				127	(*ANY) all Unicode newline sequences
				128	</pre>
				129	These override the default and the options given to <b>pcre_compile()</b> or
				130	<b>pcre_compile2()</b>. For example, on a Unix system where LF is the default
				131	newline sequence, the pattern
				132	<pre>
				133	(*CR)a.b
				134	</pre>
				135	changes the convention to CR. That pattern matches "a\nb" because LF is no
				136	longer a newline. Note that these special settings, which are not
				137	Perl-compatible, are recognized only at the very start of a pattern, and that
				138	they must be in upper case. If more than one of them is present, the last one
				139	is used.
				140	</P>
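<P>
The (*CR) example above can be imitated informally in Python. Python's <b>re</b>
module is not PCRE and does not recognize (*CR), so this sketch emulates the
dot by hand with a negated character class. The helper name <b>dot_for</b> is
hypothetical, and only the single-character CR and LF conventions are covered.
</P>

```python
import re

# Hypothetical helper: a character class that behaves like the PCRE dot
# (PCRE_DOTALL not set) under a single-character newline convention.
def dot_for(convention):
    newline = {"CR": r"\r", "LF": r"\n"}[convention]
    return "[^" + newline + "]"

# Emulates (*CR)a.b : CR is the newline, so the dot is free to match LF.
cr_pattern = re.compile("a" + dot_for("CR") + "b")
# The same pattern under the LF convention: the dot refuses LF.
lf_pattern = re.compile("a" + dot_for("LF") + "b")

print(cr_pattern.fullmatch("a\nb") is not None)  # True
print(lf_pattern.fullmatch("a\nb") is not None)  # False
```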
				141	<P>
				142	The newline convention affects the interpretation of the dot metacharacter when
				143	PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not
				144	affect what the \R escape sequence matches. By default, this is any Unicode
				145	newline sequence, for Perl compatibility. However, this can be changed; see the
				146	description of \R in the section entitled
				147	<a href="#newlineseq">"Newline sequences"</a>
				148	below. A change of \R setting can be combined with a change of newline
				149	convention.
				150	</P>
				151	<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
				152	<P>
				153	A regular expression is a pattern that is matched against a subject string from
				154	left to right. Most characters stand for themselves in a pattern, and match the
				155	corresponding characters in the subject. As a trivial example, the pattern
				156	<pre>
				157	The quick brown fox
				158	</pre>
				159	matches a portion of a subject string that is identical to itself. When
				160	caseless matching is specified (the PCRE_CASELESS option), letters are matched
				161	independently of case. In UTF-8 mode, PCRE always understands the concept of
				162	case for characters whose values are less than 128, so caseless matching is
				163	always possible. For characters with higher values, the concept of case is
				164	supported if PCRE is compiled with Unicode property support, but not otherwise.
				165	If you want to use caseless matching for characters 128 and above, you must
				166	ensure that PCRE is compiled with Unicode property support as well as with
				167	UTF-8 support.
				168	</P>
				169	<P>
				170	The power of regular expressions comes from the ability to include alternatives
				171	and repetitions in the pattern. These are encoded in the pattern by the use of
				172	<i>metacharacters</i>, which do not stand for themselves but instead are
				173	interpreted in some special way.
				174	</P>
				175	<P>
				176	There are two different sets of metacharacters: those that are recognized
				177	anywhere in the pattern except within square brackets, and those that are
				178	recognized within square brackets. Outside square brackets, the metacharacters
				179	are as follows:
				180	<pre>
				181	  \      general escape character with several uses
				182	  ^      assert start of string (or line, in multiline mode)
				183	  $      assert end of string (or line, in multiline mode)
				184	  .      match any character except newline (by default)
				185	  [      start character class definition
				186	  |      start of alternative branch
				187	  (      start subpattern
				188	  )      end subpattern
				189	  ?      extends the meaning of (
				190	         also 0 or 1 quantifier
				191	         also quantifier minimizer
				192	  *      0 or more quantifier
				193	  +      1 or more quantifier
				194	         also "possessive quantifier"
				195	  {      start min/max quantifier
				196	</pre>
				197	Part of a pattern that is in square brackets is called a "character class". In
				198	a character class the only metacharacters are:
				199	<pre>
				200	\ general escape character
				201	^ negate the class, but only if the first character
				202	- indicates character range
				203	[ POSIX character class (only if followed by POSIX syntax)
				204	] terminates the character class
				205	</pre>
				206	The following sections describe the use of each of the metacharacters.
				207	</P>
				208	<br><a name="SEC4" href="#TOC1">BACKSLASH</a><br>
				209	<P>
				210	The backslash character has several uses. Firstly, if it is followed by a
				211	character that is not a number or a letter, it takes away any special meaning
				212	that character may have. This use of backslash as an escape character applies
				213	both inside and outside character classes.
				214	</P>
				215	<P>
				216	For example, if you want to match a * character, you write \* in the pattern.
				217	This escaping action applies whether or not the following character would
				218	otherwise be interpreted as a metacharacter, so it is always safe to precede a
				219	non-alphanumeric with backslash to specify that it stands for itself. In
				220	particular, if you want to match a backslash, you write \\.
				221	</P>
				222	<P>
				223	In UTF-8 mode, only ASCII numbers and letters have any special meaning after a
				224	backslash. All other characters (in particular, those whose codepoints are
				225	greater than 127) are treated as literals.
				226	</P>
				227	<P>
				228	If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
				229	pattern (other than in a character class) and characters between a # outside
				230	a character class and the next newline are ignored. An escaping backslash can
				231	be used to include a whitespace or # character as part of the pattern.
				232	</P>
				233	<P>
				234	If you want to remove the special meaning from a sequence of characters, you
				235	can do so by putting them between \Q and \E. This is different from Perl in
				236	that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
				237	Perl, $ and @ cause variable interpolation. Note the following examples:
				238	<pre>
				239	  Pattern            PCRE matches   Perl matches
				240	
				241	  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
				242	  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
				243	  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
				244	</pre>
				245	The \Q...\E sequence is recognized both inside and outside character classes.
				246	An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
				247	by \E later in the pattern, the literal interpretation continues to the end of
				248	the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
				249	a character class, this causes an error, because the character class is not
				250	terminated.
				251	<a name="digitsafterbackslash"></a></P>
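<P>
For engines that lack \Q...\E (Python's <b>re</b> module is one), the rules
above can be approximated by rewriting the pattern before compiling it. The
following is a minimal sketch, not PCRE's own implementation: the function
name <b>expand_quotes</b> is hypothetical, and the character-class cases
mentioned above are not handled.
</P>

```python
import re

def expand_quotes(pattern):
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                # No \E later in the pattern: literal text runs to the end.
                out.append(re.escape(pattern[i + 2:]))
                i = len(pattern)
            else:
                out.append(re.escape(pattern[i + 2:end]))
                i = end + 2
        elif pattern.startswith(r"\E", i):
            # An isolated \E is ignored.
            i += 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Keep any other escape pair intact (so \\Q is not a quote start).
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# The three rows of the table, checked against the "PCRE matches" column.
print(re.fullmatch(expand_quotes(r"\Qabc$xyz\E"), "abc$xyz") is not None)
print(re.fullmatch(expand_quotes(r"\Qabc\$xyz\E"), r"abc\$xyz") is not None)
print(re.fullmatch(expand_quotes(r"\Qabc\E\$\Qxyz\E"), "abc$xyz") is not None)
```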
				252	<br><b>
				253	Non-printing characters
				254	</b><br>
				255	<P>
				256	A second use of backslash provides a way of encoding non-printing characters
				257	in patterns in a visible manner. There is no restriction on the appearance of
				258	non-printing characters, apart from the binary zero that terminates a pattern,
				259	but when a pattern is being prepared by text editing, it is often easier to use
				260	one of the following escape sequences than the binary character it represents:
				261	<pre>
				262	\a alarm, that is, the BEL character (hex 07)
				263	\cx "control-x", where x is any ASCII character
				264	\e escape (hex 1B)
				265	\f formfeed (hex 0C)
				266	\n linefeed (hex 0A)
				267	\r carriage return (hex 0D)
				268	\t tab (hex 09)
				269	\ddd character with octal code ddd, or back reference
				270	\xhh character with hex code hh
				271	\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
				272	\uhhhh character with hex code hhhh (JavaScript mode only)
				273	</pre>
				274	The precise effect of \cx is as follows: if x is a lower case letter, it
				275	is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
				276	Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while
				277	\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater
				278	than 127, a compile-time error occurs. This locks out non-ASCII characters in
				279	both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte
				280	values are valid. A lower case letter is converted to upper case, and then the
				281	0xc0 bits are flipped.)
				282	</P>
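<P>
The \cx arithmetic described above (for the non-EBCDIC case) is small enough
to check by hand. This sketch reproduces it in Python; the function name
<b>control_code</b> is hypothetical and a ValueError stands in for PCRE's
compile-time error.
</P>

```python
def control_code(x):
    # Code that \cx stands for, following the rule in the text (ASCII only).
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are upper-cased first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted

print(hex(control_code("z")))   # 0x1a
print(hex(control_code("{")))   # 0x3b
print(hex(control_code(";")))   # 0x7b
```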
				283	<P>
				284	By default, after \x, from zero to two hexadecimal digits are read (letters
				285	can be in upper or lower case). Any number of hexadecimal digits may appear
				286	between \x{ and }, but the value of the character code must be less than 256
				287	in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
				288	value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
				289	Unicode code point, which is 10FFFF.
				290	</P>
				291	<P>
				292	If characters other than hexadecimal digits appear between \x{ and }, or if
				293	there is no terminating }, this form of escape is not recognized. Instead, the
				294	initial \x will be interpreted as a basic hexadecimal escape, with no
				295	following digits, giving a character whose value is zero.
				296	</P>
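<P>
The default (non-JavaScript) reading of \x described in the last two
paragraphs can be sketched as a small parser. This is an illustration of the
stated rules only; <b>parse_x</b> and its <b>utf8</b> parameter are
hypothetical names, and the treatment of empty braces is an assumption
because the text does not cover it.
</P>

```python
import re

_BRACED = re.compile(r"\{([0-9A-Fa-f]+)\}")   # \x{hhh..} with a closing brace
_PLAIN = re.compile(r"[0-9A-Fa-f]{0,2}")      # zero to two hex digits

def parse_x(rest, utf8=False):
    """rest is the pattern text just after backslash-x.

    Returns (character code, number of characters consumed from rest).
    """
    braced = _BRACED.match(rest)
    if braced:
        code = int(braced.group(1), 16)
        limit = 2 ** 31 if utf8 else 256
        if code >= limit:
            raise ValueError("character value too large")
        return code, braced.end()
    # Not a well-formed \x{...}: fall back to the basic form. Empty braces
    # also land here, which is an assumption of this sketch.
    digits = _PLAIN.match(rest).group(0)
    return (int(digits, 16) if digits else 0), len(digits)

print(parse_x("dc"))     # (220, 2)
print(parse_x("{dc}"))   # (220, 4)
print(parse_x("{dz}"))   # (0, 0)
```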
				297	<P>
				298	If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
				299	as just described only when it is followed by two hexadecimal digits.
				300	Otherwise, it matches a literal "x" character. In JavaScript mode, support for
				301	code points greater than 256 is provided by \u, which must be followed by
				302	four hexadecimal digits; otherwise it matches a literal "u" character.
				303	</P>
				304	<P>
				305	Characters whose value is less than 256 can be defined by either of the two
				306	syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
				307	way they are handled. For example, \xdc is exactly the same as \x{dc} (or
				308	\u00dc in JavaScript mode).
				309	</P>
				310	<P>
				311	After \0 up to two further octal digits are read. If there are fewer than two
				312	digits, just those that are present are used. Thus the sequence \0\x\07
				313	specifies two binary zeros followed by a BEL character (code value 7). Make
				314	sure you supply two digits after the initial zero if the pattern character that
				315	follows is itself an octal digit.
				316	</P>
				317	<P>
				318	The handling of a backslash followed by a digit other than 0 is complicated.
				319	Outside a character class, PCRE reads it and any following digits as a decimal
				320	number. If the number is less than 10, or if there have been at least that many
				321	previous capturing left parentheses in the expression, the entire sequence is
				322	taken as a <i>back reference</i>. A description of how this works is given
				323	<a href="#backreferences">later,</a>
				324	following the discussion of
				325	<a href="#subpattern">parenthesized subpatterns.</a>
				326	</P>
				327	<P>
				328	Inside a character class, or if the decimal number is greater than 9 and there
				329	have not been that many capturing subpatterns, PCRE re-reads up to three octal
				330	digits following the backslash, and uses them to generate a data character. Any
				331	subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
				332	character specified in octal must be less than \400. In UTF-8 mode, values up
				333	to \777 are permitted. For example:
				334	<pre>
				335	\040 is another way of writing a space
				336	\40 is the same, provided there are fewer than 40 previous capturing subpatterns
				337	\7 is always a back reference
				338	\11 might be a back reference, or another way of writing a tab
				339	\011 is always a tab
				340	\0113 is a tab followed by the character "3"
				341	\113 might be a back reference, otherwise the character with octal code 113
				342	\377 might be a back reference, otherwise the byte consisting entirely of 1 bits
				343	\81 is either a back reference, or a binary zero followed by the two characters "8" and "1"
				344	</pre>
				345	Note that octal values of 100 or greater must not be introduced by a leading
				346	zero, because no more than three octal digits are ever read.
				347	</P>
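<P>
Because the rules for a backslash followed by digits are easy to misread, the
sketch below restates them as a Python function and checks it against the
examples in the table. It follows only the rules given in this section; the
names <b>interpret</b> and <b>groups_so_far</b> are hypothetical, and the
UTF-8 and non-UTF-8 value limits are not enforced.
</P>

```python
OCTAL_DIGITS = "01234567"

def read_octal(digits, limit=3):
    # Read up to `limit` leading octal digits; return (value, unread digits).
    value, used = 0, 0
    while used < limit and used < len(digits) and digits[used] in OCTAL_DIGITS:
        value = value * 8 + int(digits[used])
        used += 1
    return value, digits[used:]

def interpret(digits, groups_so_far, in_class=False):
    """digits: the run of decimal digits that follows the backslash.

    groups_so_far: number of capturing left parentheses seen so far.
    Returns ("backref", number, "") or ("char", code, literal digits left over).
    """
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Leading zero, inside a class, or not enough groups: octal character.
    code, rest = read_octal(digits)
    return ("char", code, rest)

print(interpret("40", groups_so_far=0))    # ('char', 32, '')
print(interpret("0113", groups_so_far=0))  # ('char', 9, '3')
print(interpret("81", groups_so_far=0))    # ('char', 0, '81')
```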
				348	<P>
				349	All the sequences that define a single character value can be used both inside
				350	and outside character classes. In addition, inside a character class, \b is
				351	interpreted as the backspace character (hex 08).
				352	</P>
				353	<P>
				354	\N is not allowed in a character class. \B, \R, and \X are not special
				355	inside a character class. Like other unrecognized escape sequences, they are
				356	treated as the literal characters "B", "R", and "X" by default, but cause an
				357	error if the PCRE_EXTRA option is set. Outside a character class, these
				358	sequences have different meanings.
				359	</P>
				360	<br><b>
				361	Unsupported escape sequences
				362	</b><br>
				363	<P>
				364	In Perl, the sequences \l, \L, \u, and \U are recognized by its string
				365	handler and used to modify the case of following characters. By default, PCRE
				366	does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
				367	option is set, \U matches a "U" character, and \u can be used to define a
				368	character by code point, as described in the previous section.
				369	</P>
				370	<br><b>
				371	Absolute and relative back references
				372	</b><br>
				373	<P>
				374	The sequence \g followed by an unsigned or a negative number, optionally
				375	enclosed in braces, is an absolute or relative back reference. A named back
				376	reference can be coded as \g{name}. Back references are discussed
				377	<a href="#backreferences">later,</a>
				378	following the discussion of
				379	<a href="#subpattern">parenthesized subpatterns.</a>
				380	</P>
				381	<br><b>
				382	Absolute and relative subroutine calls
				383	</b><br>
				384	<P>
				385	For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
				386	a number enclosed either in angle brackets or single quotes, is an alternative
				387	syntax for referencing a subpattern as a "subroutine". Details are discussed
				388	<a href="#onigurumasubroutines">later.</a>
				389	Note that \g{...} (Perl syntax) and \g&lt;...&gt; (Oniguruma syntax) are <i>not</i>
				390	synonymous. The former is a back reference; the latter is a
				391	<a href="#subpatternsassubroutines">subroutine</a>
				392	call.
				393	<a name="genericchartypes"></a></P>
				394	<br><b>
				395	Generic character types
				396	</b><br>
				397	<P>
				398	Another use of backslash is for specifying generic character types:
				399	<pre>
				400	\d any decimal digit
				401	\D any character that is not a decimal digit
				402	\h any horizontal whitespace character
				403	\H any character that is not a horizontal whitespace character
				404	\s any whitespace character
				405	\S any character that is not a whitespace character
				406	\v any vertical whitespace character
				407	\V any character that is not a vertical whitespace character
				408	\w any "word" character
				409	\W any "non-word" character
				410	</pre>
				411	There is also the single sequence \N, which matches a non-newline character.
				412	This is the same as
				413	<a href="#fullstopdot">the "." metacharacter</a>
				414	when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;
				415	PCRE does not support this.
				416	</P>
				417	<P>
				418	Each pair of lower and upper case escape sequences partitions the complete set
				419	of characters into two disjoint sets. Any given character matches one, and only
				420	one, of each pair. The sequences can appear both inside and outside character
				421	classes. They each match one character of the appropriate type. If the current
				422	matching point is at the end of the subject string, all of them fail, because
				423	there is no character to match.
				424	</P>
				425	<P>
				426	For compatibility with Perl, \s does not match the VT character (code 11).
				427	This makes it different from the POSIX "space" class. The \s characters
				428	are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
				429	included in a Perl script, \s may match the VT character. In PCRE, it never
				430	does.
				431	</P>
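<P>
The five \s characters listed above can be written out as an explicit class.
The Python sketch below does so and contrasts it with Python's own \s, which
is a different engine and does accept VT; the name <b>is_pcre_space</b> is
hypothetical.
</P>

```python
import re

# HT, LF, FF, CR and space, as listed in the text; VT (11) is left out.
PCRE_S_CODES = (9, 10, 12, 13, 32)
PCRE_S = "[" + "".join("\\x%02x" % code for code in PCRE_S_CODES) + "]"

def is_pcre_space(ch):
    return re.fullmatch(PCRE_S, ch) is not None

print(is_pcre_space("\t"))                        # True
print(is_pcre_space("\x0b"))                      # False
print(re.fullmatch(r"\s", "\x0b") is not None)    # True (Python's own \s)
```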
				432	<P>
				433	A "word" character is an underscore or any character that is a letter or digit.
				434	By default, the definition of letters and digits is controlled by PCRE's
				435	low-valued character tables, and may vary if locale-specific matching is taking
				436	place (see
				437	<a href="pcreapi.html#localesupport">"Locale support"</a>
				438	in the
				439	<a href="pcreapi.html"><b>pcreapi</b></a>
				440	page). For example, in a French locale such as "fr_FR" in Unix-like systems,
				441	or "french" in Windows, some character codes greater than 128 are used for
				442	accented letters, and these are then matched by \w. The use of locales with
				443	Unicode is discouraged.
				444	</P>
				445	<P>
				446	By default, in UTF-8 mode, characters with values greater than 127 never match
				447	\d, \s, or \w, and always match \D, \S, and \W. These sequences retain
				448	their original meanings from before UTF-8 support was available, mainly for
				449	efficiency reasons. However, if PCRE is compiled with Unicode property support,
				450	and the PCRE_UCP option is set, the behaviour is changed so that Unicode
				451	properties are used to determine character types, as follows:
				452	<pre>
				453	\d any character that \p{Nd} matches (decimal digit)
				454	\s any character that \p{Z} matches, plus HT, LF, FF, CR
				455	\w any character that \p{L} or \p{N} matches, plus underscore
				456	</pre>
				457	The upper case escapes match the inverse sets of characters. Note that \d
				458	matches only decimal digits, whereas \w matches any Unicode digit, as well as
				459	any Unicode letter, and underscore. Note also that PCRE_UCP affects \b, and
				460	\B because they are defined in terms of \w and \W. Matching these sequences
				461	is noticeably slower when PCRE_UCP is set.
				462	</P>
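<P>
The PCRE_UCP definitions in the table above can be mirrored with Python's
<b>unicodedata</b> module, which exposes the general category of a character.
This is only a sketch of the definitions as stated: the function names are
hypothetical, and Python's Unicode database may be a different version from
the one a given PCRE build uses.
</P>

```python
import unicodedata

def ucp_digit(ch):
    # \d under PCRE_UCP: whatever \p{Nd} matches.
    return unicodedata.category(ch) == "Nd"

def ucp_space(ch):
    # \s under PCRE_UCP: whatever \p{Z} matches, plus HT, LF, FF, CR.
    return unicodedata.category(ch).startswith("Z") or ch in ("\t", "\n", "\f", "\r")

def ucp_word(ch):
    # \w under PCRE_UCP: whatever \p{L} or \p{N} matches, plus underscore.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")

roman_eight = "\u2167"   # ROMAN NUMERAL EIGHT, general category Nl
print(ucp_digit(roman_eight), ucp_word(roman_eight))   # False True
```

The last line shows the distinction drawn in the text: a letter number is a
Unicode digit for \w, but not a decimal digit for \d.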
				463	<P>
				464	The sequences \h, \H, \v, and \V are features that were added to Perl at
				465	release 5.10. In contrast to the other sequences, which match only ASCII
				466	characters by default, these always match certain high-valued codepoints in
				467	UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
				468	are:
				469	<pre>
				470	U+0009 Horizontal tab
				471	U+0020 Space
				472	U+00A0 Non-break space
				473	U+1680 Ogham space mark
				474	U+180E Mongolian vowel separator
				475	U+2000 En quad
				476	U+2001 Em quad
				477	U+2002 En space
				478	U+2003 Em space
				479	U+2004 Three-per-em space
				480	U+2005 Four-per-em space
				481	U+2006 Six-per-em space
				482	U+2007 Figure space
				483	U+2008 Punctuation space
				484	U+2009 Thin space
				485	U+200A Hair space
				486	U+202F Narrow no-break space
				487	U+205F Medium mathematical space
				488	U+3000 Ideographic space
				489	</pre>
				490	The vertical space characters are:
				491	<pre>
				492	U+000A Linefeed
				493	U+000B Vertical tab
				494	U+000C Formfeed
				495	U+000D Carriage return
				496	U+0085 Next line
				497	U+2028 Line separator
				498	U+2029 Paragraph separator
				499	<a name="newlineseq"></a></PRE>
				500	</P>
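<P>
An engine without \h and \v (Python's <b>re</b> module, for example) can be
given equivalent character classes built from the two lists above. A minimal
sketch, with hypothetical names throughout:
</P>

```python
import re

# Code points copied from the two lists above.
H_POINTS = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
            *range(0x2000, 0x200B),          # U+2000 .. U+200A
            0x202F, 0x205F, 0x3000]
V_POINTS = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]

def char_class(points, negate=False):
    body = "".join("\\u%04X" % cp for cp in points)
    return "[" + ("^" if negate else "") + body + "]"

H, NOT_H = char_class(H_POINTS), char_class(H_POINTS, negate=True)   # \h, \H
V, NOT_V = char_class(V_POINTS), char_class(V_POINTS, negate=True)   # \v, \V

print(re.fullmatch(H + "+", "\t \u00a0\u3000") is not None)   # True
print(re.fullmatch(H, "\n") is not None)                      # False
print(re.fullmatch(V, "\u2028") is not None)                  # True
```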
				501	<br><b>
				502	Newline sequences
				503	</b><br>
				504	<P>
				505	Outside a character class, by default, the escape sequence \R matches any
				506	Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following:
				507	<pre>
				508	(?&gt;\r\n|\n|\x0b|\f|\r|\x85)
				509	</pre>
				510	This is an example of an "atomic group", details of which are given
				511	<a href="#atomicgroup">below.</a>
				512	This particular group matches either the two-character sequence CR followed by
				513	LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
				514	U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
				515	line, U+0085). The two-character sequence is treated as a single unit that
				516	cannot be split.
				517	</P>
				518	<P>
				519	In UTF-8 mode, two additional characters whose codepoints are greater than 255
				520	are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
				521	Unicode character property support is not needed for these characters to be
				522	recognized.
				523	</P>
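<P>
The following Python sketch emulates the default \R with an ordinary
alternation, including the two extra characters that apply in UTF-8 mode.
Python's <b>re</b> module is not PCRE, and the variable names are
hypothetical. The non-atomic version is shown deliberately: it can be split
by backtracking, which is exactly what the atomic group prevents. Atomic
groups are accepted by Python 3.11 and later, so that part is guarded by a
version check.
</P>

```python
import re
import sys

R_ALTERNATIVES = r"\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029"
R_LOOSE = "(?:" + R_ALTERNATIVES + ")"      # non-atomic approximation

text = "one\r\ntwo\nthree\rfour\u2028five"
print(re.findall(R_LOOSE, text))            # ['\r\n', '\n', '\r', '\u2028']

# Without atomicity the CRLF pair can be split: the group gives back the LF
# so that the trailing \n can match it.
print(re.fullmatch(R_LOOSE + r"\n", "\r\n") is not None)        # True

if sys.version_info >= (3, 11):
    R_ATOMIC = "(?>" + R_ALTERNATIVES + ")"
    # The atomic group keeps CRLF as one unit, so nothing is left for \n.
    print(re.fullmatch(R_ATOMIC + r"\n", "\r\n") is not None)   # False
```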
				524	<P>
				525	It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
				526	complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
				527	either at compile time or when the pattern is matched. (BSR is an abbreviation
				528	for "backslash R".) This can be made the default when PCRE is built; if this is
				529	the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
				530	It is also possible to specify these settings by starting a pattern string with
				531	one of the following sequences:
				532	<pre>
				533	(*BSR_ANYCRLF) CR, LF, or CRLF only
				534	(*BSR_UNICODE) any Unicode newline sequence
				535	</pre>
				536	These override the default and the options given to <b>pcre_compile()</b> or
				537	<b>pcre_compile2()</b>, but they can be overridden by options given to
				538	<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings,
				539	which are not Perl-compatible, are recognized only at the very start of a
				540	pattern, and that they must be in upper case. If more than one of them is
				541	present, the last one is used. They can be combined with a change of newline
				542	convention; for example, a pattern can start with:
				543	<pre>
				544	(*ANY)(*BSR_ANYCRLF)
				545	</pre>
				546	They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside
				547	a character class, \R is treated as an unrecognized escape sequence, and so
				548	matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
				549	<a name="uniextseq"></a></P>
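<P>
The rules for these leading settings (upper case only, recognized only at the
very start, last one of each kind wins) can be sketched as a small Python
helper that peels them off a pattern string. This is not how PCRE parses
patterns internally; <b>split_leading_settings</b> is a hypothetical name,
and accepting the items in any order is an assumption of the sketch.
</P>

```python
import re

NEWLINE_NAMES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY"}
BSR_NAMES = {"BSR_ANYCRLF", "BSR_UNICODE"}
FLAG_NAMES = {"UTF8", "UCP"}
_ITEM = re.compile(r"\(\*([A-Z0-9_]+)\)")   # upper case names only

def split_leading_settings(pattern):
    """Return (settings, rest of pattern) for a pattern string."""
    settings = {"newline": None, "bsr": None, "flags": set()}
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None:
            break
        name = item.group(1)
        if name in NEWLINE_NAMES:
            settings["newline"] = name       # a later setting replaces an earlier one
        elif name in BSR_NAMES:
            settings["bsr"] = name
        elif name in FLAG_NAMES:
            settings["flags"].add(name)
        else:
            break                            # anything else is left in the pattern
        pos = item.end()
    return settings, pattern[pos:]

settings, rest = split_leading_settings("(*ANY)(*BSR_ANYCRLF)a.b")
print(settings["newline"], settings["bsr"], rest)   # ANY BSR_ANYCRLF a.b
```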
				550	<br><b>
				551	Unicode character properties
				552	</b><br>
				553	<P>
				554	When PCRE is built with Unicode character property support, three additional
				555	escape sequences that match characters with specific properties are available.
				556	When not in UTF-8 mode, these sequences are of course limited to testing
				557	characters whose codepoints are less than 256, but they do work in this mode.
				558	The extra escape sequences are:
				559	<pre>
				560	\p{<i>xx</i>} a character with the <i>xx</i> property
				561	\P{<i>xx</i>} a character without the <i>xx</i> property
				562	\X an extended Unicode sequence
				563	</pre>
				564	The property names represented by <i>xx</i> above are limited to the Unicode
				565	script names, the general category properties, "Any", which matches any
				566	character (including newline), and some special PCRE properties (described
				567	in the
				568	<a href="#extraprops">next section).</a>
				569	Other Perl properties such as "InMusicalSymbols" are not currently supported by
				570	PCRE. Note that \P{Any} does not match any characters, so always causes a
				571	match failure.
				572	</P>
				573	<P>
				574	Sets of Unicode characters are defined as belonging to certain scripts. A
				575	character from one of these sets can be matched using a script name. For
				576	example:
				577	<pre>
				578	\p{Greek}
				579	\P{Han}
				580	</pre>
				581	Those that are not part of an identified script are lumped together as
				582	"Common". The current list of scripts is:
				583	</P>
				584	<P>
				585	Arabic,
				586	Armenian,
				587	Avestan,
				588	Balinese,
				589	Bamum,
				590	Bengali,
				591	Bopomofo,
				592	Braille,
				593	Buginese,
				594	Buhid,
				595	Canadian_Aboriginal,
				596	Carian,
				597	Cham,
				598	Cherokee,
				599	Common,
				600	Coptic,
				601	Cuneiform,
				602	Cypriot,
				603	Cyrillic,
				604	Deseret,
				605	Devanagari,
				606	Egyptian_Hieroglyphs,
				607	Ethiopic,
				608	Georgian,
				609	Glagolitic,
				610	Gothic,
				611	Greek,
				612	Gujarati,
				613	Gurmukhi,
				614	Han,
				615	Hangul,
				616	Hanunoo,
				617	Hebrew,
				618	Hiragana,
				619	Imperial_Aramaic,
				620	Inherited,
				621	Inscriptional_Pahlavi,
				622	Inscriptional_Parthian,
				623	Javanese,
				624	Kaithi,
				625	Kannada,
				626	Katakana,
				627	Kayah_Li,
				628	Kharoshthi,
				629	Khmer,
				630	Lao,
				631	Latin,
				632	Lepcha,
				633	Limbu,
				634	Linear_B,
				635	Lisu,
				636	Lycian,
				637	Lydian,
				638	Malayalam,
				639	Meetei_Mayek,
				640	Mongolian,
				641	Myanmar,
				642	New_Tai_Lue,
				643	Nko,
				644	Ogham,
				645	Old_Italic,
				646	Old_Persian,
				647	Old_South_Arabian,
				648	Old_Turkic,
				649	Ol_Chiki,
				650	Oriya,
				651	Osmanya,
				652	Phags_Pa,
				653	Phoenician,
				654	Rejang,
				655	Runic,
				656	Samaritan,
				657	Saurashtra,
				658	Shavian,
				659	Sinhala,
				660	Sundanese,
				661	Syloti_Nagri,
				662	Syriac,
				663	Tagalog,
				664	Tagbanwa,
				665	Tai_Le,
				666	Tai_Tham,
				667	Tai_Viet,
				668	Tamil,
				669	Telugu,
				670	Thaana,
				671	Thai,
				672	Tibetan,
				673	Tifinagh,
				674	Ugaritic,
				675	Vai,
				676	Yi.
				677	</P>
				678	<P>
				679	Each character has exactly one Unicode general category property, specified by
				680	a two-letter abbreviation. For compatibility with Perl, negation can be
				681	specified by including a circumflex between the opening brace and the property
				682	name. For example, \p{^Lu} is the same as \P{Lu}.
				683	</P>
				684	<P>
				685	If only one letter is specified with \p or \P, it includes all the general
				686	category properties that start with that letter. In this case, in the absence
				687	of negation, the curly brackets in the escape sequence are optional; these two
				688	examples have the same effect:
				689	<pre>
				690	\p{L}
				691	\pL
				692	</pre>
				693	The following general category property codes are supported:
				694	<pre>
				695	  C     Other
				696	  Cc    Control
				697	  Cf    Format
				698	  Cn    Unassigned
				699	  Co    Private use
				700	  Cs    Surrogate
				701
				702	  L     Letter
				703	  Ll    Lower case letter
				704	  Lm    Modifier letter
				705	  Lo    Other letter
				706	  Lt    Title case letter
				707	  Lu    Upper case letter
				708
				709	  M     Mark
				710	  Mc    Spacing mark
				711	  Me    Enclosing mark
				712	  Mn    Non-spacing mark
				713
				714	  N     Number
				715	  Nd    Decimal number
				716	  Nl    Letter number
				717	  No    Other number
				718
				719	  P     Punctuation
				720	  Pc    Connector punctuation
				721	  Pd    Dash punctuation
				722	  Pe    Close punctuation
				723	  Pf    Final punctuation
				724	  Pi    Initial punctuation
				725	  Po    Other punctuation
				726	  Ps    Open punctuation
				727
				728	  S     Symbol
				729	  Sc    Currency symbol
				730	  Sk    Modifier symbol
				731	  Sm    Mathematical symbol
				732	  So    Other symbol
				733
				734	  Z     Separator
				735	  Zl    Line separator
				736	  Zp    Paragraph separator
				737	  Zs    Space separator
				738	</pre>
				739	The special property L& is also supported: it matches a character that has
				740	the Lu, Ll, or Lt property, in other words, a letter that is not classified as
				741	a modifier or "other".
				742	</P>
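<P>
The relationship between one-letter and two-letter property codes can be
sketched outside PCRE. The following Python fragment is only an illustration: it
uses Python's <b>unicodedata</b> module rather than PCRE's own tables (which may
be built from a different Unicode version), and the helper name
<i>has_property</i> is hypothetical.
</P>

```python
import unicodedata


def has_property(ch: str, code: str) -> bool:
    """Check one character against a general category code.

    A one-letter code such as "L" covers every category that starts with
    that letter. "L&" is treated as shorthand for Lu, Ll, or Lt.
    (Hypothetical helper, not part of any PCRE API.)
    """
    category = unicodedata.category(ch)
    if code == "L&":
        return category in ("Lu", "Ll", "Lt")
    return category.startswith(code)


print(has_property("A", "Lu"))        # "A" is an upper case letter
print(has_property("a", "L"))         # any letter category satisfies "L"
print(has_property("\u02b0", "L&"))   # a modifier letter (Lm) is not L&
```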
				743	<P>
				744	The Cs (Surrogate) property applies only to characters in the range U+D800 to
				745	U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
				746	cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
				747	(see the discussion of PCRE_NO_UTF8_CHECK in the
				748	<a href="pcreapi.html"><b>pcreapi</b></a>
				749	page). Perl does not support the Cs property.
				750	</P>
				751	<P>
				752	The long synonyms for property names that Perl supports (such as \p{Letter})
				753	are not supported by PCRE, nor is it permitted to prefix any of these
				754	properties with "Is".
				755	</P>
				756	<P>
				757	No character that is in the Unicode table has the Cn (unassigned) property.
				758	Instead, this property is assumed for any code point that is not in the
				759	Unicode table.
				760	</P>
				761	<P>
				762	Specifying caseless matching does not affect these escape sequences. For
				763	example, \p{Lu} always matches only upper case letters.
				764	</P>
				765	<P>
				766	The \X escape matches any number of Unicode characters that form an extended
				767	Unicode sequence. \X is equivalent to
				768	<pre>
				769	(?>\PM\pM*)
				770	</pre>
				771	That is, it matches a character without the "mark" property, followed by zero
				772	or more characters with the "mark" property, and treats the sequence as an
				773	atomic group
				774	<a href="#atomicgroup">(see below).</a>
				775	Characters with the "mark" property are typically accents that affect the
				776	preceding character. None of them have codepoints less than 256, so in
				777	non-UTF-8 mode \X matches any one character.
				778	</P>
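<P>
As a rough illustration of what repeated matches of this \X definition would
pick out, the Python sketch below walks a string and collects each non-mark
character together with the marks that follow it. It assumes Python's
<b>unicodedata</b> categories as a stand-in for PCRE's property tables, and the
function names are hypothetical.
</P>

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if the character has a general category starting with M."""
    return unicodedata.category(ch).startswith("M")


def find_extended_sequences(text: str) -> list:
    """Collect substrings shaped like \\PM\\pM*, scanning left to right."""
    sequences = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            # \PM cannot match a mark, so no sequence starts here.
            i += 1
            continue
        j = i + 1
        # Take every following mark; like the atomic group, never give any back.
        while j < len(text) and is_mark(text[j]):
            j += 1
        sequences.append(text[i:j])
        i = j
    return sequences


# "e" followed by U+0301 (combining acute accent), then a plain "a".
print(find_extended_sequences("e\u0301a"))
```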
				779	<P>
				780	Note that recent versions of Perl have changed \X to match what Unicode calls
				781	an "extended grapheme cluster", which has a more complicated definition.
				782	</P>
				783	<P>
				784	Matching characters by Unicode property is not fast, because PCRE has to search
				785	a structure that contains data for over fifteen thousand characters. That is
				786	why the traditional escape sequences such as \d and \w do not use Unicode
				787	properties in PCRE by default, though you can make them do so by setting the
				788	PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with
				789	(*UCP).
				790	<a name="extraprops"></a></P>
				791	<br><b>
				792	PCRE's additional properties
				793	</b><br>
				794	<P>
				795	As well as the standard Unicode properties described in the previous
				796	section, PCRE supports four more that make it possible to convert traditional
				797	escape sequences such as \w and \s and POSIX character classes to use Unicode
				798	properties. PCRE uses these non-standard, non-Perl properties internally when
				799	PCRE_UCP is set. They are:
				800	<pre>
				801	  Xan   Any alphanumeric character
				802	  Xps   Any POSIX space character
				803	  Xsp   Any Perl space character
				804	  Xwd   Any Perl "word" character
				805	</pre>
				806	Xan matches characters that have either the L (letter) or the N (number)
				807	property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
				808	carriage return, and any other character that has the Z (separator) property.
				809	Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
				810	same characters as Xan, plus underscore.
				811	<a name="resetmatchstart"></a></P>
				812	<br><b>
				813	Resetting the match start
				814	</b><br>
				815	<P>
				816	The escape sequence \K causes any previously matched characters not to be
				817	included in the final matched sequence. For example, the pattern:
				818	<pre>
				819	foo\Kbar
				820	</pre>
				821	matches "foobar", but reports that it has matched "bar". This feature is
				822	similar to a lookbehind assertion
				823	<a href="#lookbehind">(described below).</a>
				824	However, in this case, the part of the subject before the real match does not
				825	have to be of fixed length, as lookbehind assertions do. The use of \K does
				826	not interfere with the setting of
				827	<a href="#subpattern">captured substrings.</a>
				828	For example, when the pattern
				829	<pre>
				830	(foo)\Kbar
				831	</pre>
				832	matches "foobar", the first substring is still set to "foo".
				833	</P>
				834	<P>
				835	Perl documents that the use of \K within assertions is "not well defined". In
				836	PCRE, \K is acted upon when it occurs inside positive assertions, but is
				837	ignored in negative assertions.
				838	<a name="smallassertions"></a></P>
				839	<br><b>
				840	Simple assertions
				841	</b><br>
				842	<P>
				843	The final use of backslash is for certain simple assertions. An assertion
				844	specifies a condition that has to be met at a particular point in a match,
				845	without consuming any characters from the subject string. The use of
				846	subpatterns for more complicated assertions is described
				847	<a href="#bigassertions">below.</a>
				848	The backslashed assertions are:
				849	<pre>
				850	  \b     matches at a word boundary
				851	  \B     matches when not at a word boundary
				852	  \A     matches at the start of the subject
				853	  \Z     matches at the end of the subject
				854	          also matches before a newline at the end of the subject
				855	  \z     matches only at the end of the subject
				856	  \G     matches at the first matching position in the subject
				857	</pre>
				858	Inside a character class, \b has a different meaning; it matches the backspace
				859	character. If any other of these assertions appears in a character class, by
				860	default it matches the corresponding literal character (for example, \B
				861	matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
				862	escape sequence" error is generated instead.
				863	</P>
				864	<P>
				865	A word boundary is a position in the subject string where the current character
				866	and the previous character do not both match \w or \W (i.e. one matches
				867	\w and the other matches \W), or the start or end of the string if the
				868	first or last character matches \w, respectively. In UTF-8 mode, the meanings
				869	of \w and \W can be changed by setting the PCRE_UCP option. When this is
				870	done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start
				871	of word" or "end of word" metasequence. However, whatever follows \b normally
				872	determines which it is. For example, the fragment \ba matches "a" at the start
				873	of a word.
				874	</P>
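<P>
The \ba fragment can be tried in any engine with the same \b semantics. The
sketch below uses Python's <b>re</b> module, which is not PCRE but treats \b and
\B the same way for the plain ASCII subject shown here. The subject string is
made up for the illustration.
</P>

```python
import re

subject = "a banana and data"

# \ba matches "a" only where a word starts.
word_starts = [m.start() for m in re.finditer(r"\ba", subject)]
print(word_starts)  # [0, 9]

# \Ba matches every "a" that is not at the start of a word.
inner = [m.start() for m in re.finditer(r"\Ba", subject)]
print(len(inner))  # 5
```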
				875	<P>
				876	The \A, \Z, and \z assertions differ from the traditional circumflex and
				877	dollar (described in the next section) in that they only ever match at the very
				878	start and end of the subject string, whatever options are set. Thus, they are
				879	independent of multiline mode. These three assertions are not affected by the
				880	PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
				881	circumflex and dollar metacharacters. However, if the <i>startoffset</i>
				882	argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
				883	at a point other than the beginning of the subject, \A can never match. The
				884	difference between \Z and \z is that \Z matches before a newline at the end
				885	of the string as well as at the very end, whereas \z matches only at the end.
				886	</P>
				887	<P>
				888	The \G assertion is true only when the current matching position is at the
				889	start point of the match, as specified by the <i>startoffset</i> argument of
				890	<b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
				891	non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
				892	arguments, you can mimic Perl's /g option; this is the kind of
				893	implementation in which \G can be useful.
				894	</P>
				895	<P>
				896	Note, however, that PCRE's interpretation of \G, as the start of the current
				897	match, is subtly different from Perl's, which defines it as the end of the
				898	previous match. In Perl, these can be different when the previously matched
				899	string was empty. Because PCRE does just one match at a time, it cannot
				900	reproduce this behaviour.
				901	</P>
				902	<P>
				903	If all the alternatives of a pattern begin with \G, the expression is anchored
				904	to the starting match position, and the "anchored" flag is set in the compiled
				905	regular expression.
				906	</P>
				907	<br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
				908	<P>
				909	Outside a character class, in the default matching mode, the circumflex
				910	character is an assertion that is true only if the current matching point is
				911	at the start of the subject string. If the <i>startoffset</i> argument of
				912	<b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
				913	option is unset. Inside a character class, circumflex has an entirely different
				914	meaning
				915	<a href="#characterclass">(see below).</a>
				916	</P>
				917	<P>
				918	Circumflex need not be the first character of the pattern if a number of
				919	alternatives are involved, but it should be the first thing in each alternative
				920	in which it appears if the pattern is ever to match that branch. If all
				921	possible alternatives start with a circumflex, that is, if the pattern is
				922	constrained to match only at the start of the subject, it is said to be an
				923	"anchored" pattern. (There are also other constructs that can cause a pattern
				924	to be anchored.)
				925	</P>
				926	<P>
				927	A dollar character is an assertion that is true only if the current matching
				928	point is at the end of the subject string, or immediately before a newline
				929	at the end of the string (by default). Dollar need not be the last character of
				930	the pattern if a number of alternatives are involved, but it should be the last
				931	item in any branch in which it appears. Dollar has no special meaning in a
				932	character class.
				933	</P>
				934	<P>
				935	The meaning of dollar can be changed so that it matches only at the very end of
				936	the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
				937	does not affect the \Z assertion.
				938	</P>
				939	<P>
				940	The meanings of the circumflex and dollar characters are changed if the
				941	PCRE_MULTILINE option is set. When this is the case, a circumflex matches
				942	immediately after internal newlines as well as at the start of the subject
				943	string. It does not match after a newline that ends the string. A dollar
				944	matches before any newlines in the string, as well as at the very end, when
				945	PCRE_MULTILINE is set. When newline is specified as the two-character
				946	sequence CRLF, isolated CR and LF characters do not indicate newlines.
				947	</P>
				948	<P>
				949	For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
				950	\n represents a newline) in multiline mode, but not otherwise. Consequently,
				951	patterns that are anchored in single line mode because all branches start with
				952	^ are not anchored in multiline mode, and a match for circumflex is possible
				953	when the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero. The
				954	PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
				955	</P>
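<P>
The "def\nabc" example can be reproduced with Python's <b>re</b> module, used
here only as a stand-in. Its ^ and $ agree with PCRE for this subject, although
the two engines differ in some edge cases (such as a newline that ends the
string), which are deliberately not shown. The <i>pos</i> argument of
<b>search()</b> is assumed to play a role similar to <i>startoffset</i>.
</P>

```python
import re

subject = "def\nabc"

# Default mode: ^ is true only at the very start of the subject.
single_line = re.search(r"^abc$", subject)
print(single_line)  # None

# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)
print(multi_line.span())  # (4, 7)

# Starting the search part-way through does not make ^ match there
# when multiline mode is off.
offset_single = re.compile(r"^abc$").search(subject, 4)
offset_multi = re.compile(r"^abc$", re.MULTILINE).search(subject, 4)
print(offset_single)  # None
```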
				956	<P>
				957	Note that the sequences \A, \Z, and \z can be used to match the start and
				958	end of the subject in both modes, and if all branches of a pattern start with
				959	\A it is always anchored, whether or not PCRE_MULTILINE is set.
				960	<a name="fullstopdot"></a></P>
				961	<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
				962	<P>
				963	Outside a character class, a dot in the pattern matches any one character in
				964	the subject string except (by default) a character that signifies the end of a
				965	line. In UTF-8 mode, the matched character may be more than one byte long.
				966	</P>
				967	<P>
				968	When a line ending is defined as a single character, dot never matches that
				969	character; when the two-character sequence CRLF is used, dot does not match CR
				970	if it is immediately followed by LF, but otherwise it matches all characters
				971	(including isolated CRs and LFs). When any Unicode line endings are being
				972	recognized, dot does not match CR or LF or any of the other line ending
				973	characters.
				974	</P>
				975	<P>
				976	The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
				977	option is set, a dot matches any one character, without exception. If the
				978	two-character sequence CRLF is present in the subject string, it takes two dots
				979	to match it.
				980	</P>
				981	<P>
				982	The handling of dot is entirely independent of the handling of circumflex and
				983	dollar, the only relationship being that they both involve newlines. Dot has no
				984	special meaning in a character class.
				985	</P>
				986	<P>
				987	The escape sequence \N behaves like a dot, except that it is not affected by
				988	the PCRE_DOTALL option. In other words, it matches any character except one
				989	that signifies the end of a line. Perl also uses \N to match characters by
				990	name; PCRE does not support this.
				991	</P>
				992	<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
				993	<P>
				994	Outside a character class, the escape sequence \C matches any one byte, both
				995	in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
				996	characters. The feature is provided in Perl in order to match individual bytes
				997	in UTF-8 mode, but it is unclear how it can usefully be used. Because \C
				998	breaks up characters into individual bytes, matching one byte with \C in UTF-8
				999	mode means that the rest of the string may start with a malformed UTF-8
				1000	character. This has undefined results, because PCRE assumes that it is dealing
				1001	with valid UTF-8 strings (and by default it checks this at the start of
				1002	processing unless the PCRE_NO_UTF8_CHECK option is used).
				1003	</P>
				1004	<P>
				1005	PCRE does not allow \C to appear in lookbehind assertions
				1006	<a href="#lookbehind">(described below)</a>
				1007	in UTF-8 mode, because this would make it impossible to calculate the length of
				1008	the lookbehind.
				1009	</P>
				1010	<P>
				1011	In general, the \C escape sequence is best avoided in UTF-8 mode. However, one
				1012	way of using it that avoids the problem of malformed UTF-8 characters is to
				1013	use a lookahead to check the length of the next character, as in this pattern
				1014	(ignore white space and line breaks):
				1015	<pre>
				1016	(?| (?=[\x00-\x7f])(\C) |
				1017	(?=[\x80-\x{7ff}])(\C)(\C) |
				1018	(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
				1019	(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
				1020	</pre>
				1021	A group that starts with (?| resets the capturing parentheses numbers in each
				1022	alternative (see
				1023	<a href="#dupsubpatternnumber">"Duplicate Subpattern Numbers"</a>
				1024	below). The assertions at the start of each branch check the next UTF-8
				1025	character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
				1026	character's individual bytes are then captured by the appropriate number of
				1027	groups.
				1028	<a name="characterclass"></a></P>
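<P>
The code point ranges used in the four lookaheads correspond to the UTF-8
encoded lengths 1 to 4. A quick way to check the boundaries is the Python sketch
below. The helper name <i>utf8_length</i> is hypothetical, and because Python
strings stop at U+10FFFF, that value is used as the top of the last range instead
of the 1fffff written in the pattern.
</P>

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))


# First and last code point of each range tested by the lookaheads.
RANGES = [(0x00, 0x7F), (0x80, 0x7FF), (0x800, 0xFFFF), (0x10000, 0x10FFFF)]

lengths = [(utf8_length(low), utf8_length(high)) for low, high in RANGES]
print(lengths)  # [(1, 1), (2, 2), (3, 3), (4, 4)]
```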
				1029	<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
				1030	<P>
				1031	An opening square bracket introduces a character class, terminated by a closing
				1032	square bracket. A closing square bracket on its own is not special by default.
				1033	However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
				1034	bracket causes a compile-time error. If a closing square bracket is required as
				1035	a member of the class, it should be the first data character in the class
				1036	(after an initial circumflex, if present) or escaped with a backslash.
				1037	</P>
				1038	<P>
				1039	A character class matches a single character in the subject. In UTF-8 mode, the
				1040	character may be more than one byte long. A matched character must be in the
				1041	set of characters defined by the class, unless the first character in the class
				1042	definition is a circumflex, in which case the subject character must not be in
				1043	the set defined by the class. If a circumflex is actually required as a member
				1044	of the class, ensure it is not the first character, or escape it with a
				1045	backslash.
				1046	</P>
				1047	<P>
				1048	For example, the character class [aeiou] matches any lower case vowel, while
				1049	[^aeiou] matches any character that is not a lower case vowel. Note that a
				1050	circumflex is just a convenient notation for specifying the characters that
				1051	are in the class by enumerating those that are not. A class that starts with a
				1052	circumflex is not an assertion; it still consumes a character from the subject
				1053	string, and therefore it fails if the current pointer is at the end of the
				1054	string.
				1055	</P>
				1056	<P>
				1057	In UTF-8 mode, characters with values greater than 255 can be included in a
				1058	class as a literal string of bytes, or by using the \x{ escaping mechanism.
				1059	</P>
				1060	<P>
				1061	When caseless matching is set, any letters in a class represent both their
				1062	upper case and lower case versions, so for example, a caseless [aeiou] matches
				1063	"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
				1064	caseful version would. In UTF-8 mode, PCRE always understands the concept of
				1065	case for characters whose values are less than 128, so caseless matching is
				1066	always possible. For characters with higher values, the concept of case is
				1067	supported if PCRE is compiled with Unicode property support, but not otherwise.
				1068	If you want to use caseless matching in UTF-8 mode for characters 128 and above,
				1069	you must ensure that PCRE is compiled with Unicode property support as well as
				1070	with UTF-8 support.
				1071	</P>
				1072	<P>
				1073	Characters that might indicate line breaks are never treated in any special way
				1074	when matching character classes, whatever line-ending sequence is in use, and
				1075	whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is used. A class
				1076	such as [^a] always matches one of these characters.
				1077	</P>
				1078	<P>
				1079	The minus (hyphen) character can be used to specify a range of characters in a
				1080	character class. For example, [d-m] matches any letter between d and m,
				1081	inclusive. If a minus character is required in a class, it must be escaped with
				1082	a backslash or appear in a position where it cannot be interpreted as
				1083	indicating a range, typically as the first or last character in the class.
				1084	</P>
				1085	<P>
				1086	It is not possible to have the literal character "]" as the end character of a
				1087	range. A pattern such as [W-]46] is interpreted as a class of two characters
				1088	("W" and "-") followed by a literal string "46]", so it would match "W46]" or
				1089	"-46]". However, if the "]" is escaped with a backslash it is interpreted as
				1090	the end of range, so [W-\]46] is interpreted as a class containing a range
				1091	followed by two other characters. The octal or hexadecimal representation of
				1092	"]" can also be used to end a range.
				1093	</P>
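<P>
The two readings of the closing bracket can be checked in an engine that parses
these classes the same way. The sketch below uses Python's <b>re</b> module as a
stand-in for PCRE; the variable names are only for illustration.
</P>

```python
import re

# "]" unescaped: the class is just "W" and "-", followed by literal "46]".
two_char_class = re.compile(r"[W-]46]")
print(bool(two_char_class.fullmatch("W46]")))
print(bool(two_char_class.fullmatch("-46]")))

# "]" escaped: the class is the range W to ] plus "4" and "6",
# so it matches exactly one character.
range_class = re.compile(r"[W-\]46]")
print(bool(range_class.fullmatch("X")))   # inside the range W to ]
print(bool(range_class.fullmatch("-")))   # hyphen is not a member here
```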
				1094	<P>
				1095	Ranges operate in the collating sequence of character values. They can also be
				1096	used for characters specified numerically, for example [\000-\037]. In UTF-8
				1097	mode, ranges can include characters whose values are greater than 255, for
				1098	example [\x{100}-\x{2ff}].
				1099	</P>
				1100	<P>
				1101	If a range that includes letters is used when caseless matching is set, it
				1102	matches the letters in either case. For example, [W-c] is equivalent to
				1103	[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
				1104	tables for a French locale are in use, [\xc8-\xcb] matches accented E
				1105	characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
				1106	characters with values greater than 127 only when it is compiled with Unicode
				1107	property support.
				1108	</P>
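<P>
The membership of a caseless [W-c] can be enumerated by testing each ASCII
character in turn. This sketch relies on Python's <b>re</b> module, assumed to
behave like PCRE for ASCII letters in a caseless range. It does not model locale
tables or the French example.
</P>

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Collect every ASCII character that the caseless class accepts.
members = "".join(
    ch for ch in map(chr, range(128)) if caseless_range.fullmatch(ch)
)
print(members)
```

<P>
The result contains the thirteen characters of the range itself plus the
opposite-case partners of its letters, which is the same set as the explicit
class given above.
</P>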
				1109	<P>
				1110	The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
				1111	\V, \w, and \W may appear in a character class, and add the characters that
				1112	they match to the class. For example, [\dABCDEF] matches any hexadecimal
				1113	digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w
				1114	and their upper case partners, just as it does when they appear outside a
				1115	character class, as described in the section entitled
				1116	<a href="#genericchartypes">"Generic character types"</a>
				1117	above. The escape sequence \b has a different meaning inside a character
				1118	class; it matches the backspace character. The sequences \B, \N, \R, and \X
				1119	are not special inside a character class. Like any other unrecognized escape
				1120	sequences, they are treated as the literal characters "B", "N", "R", and "X" by
				1121	default, but cause an error if the PCRE_EXTRA option is set.
				1122	</P>
				1123	<P>
				1124	A circumflex can conveniently be used with the upper case character types to
				1125	specify a more restricted set of characters than the matching lower case type.
				1126	For example, the class [^\W_] matches any letter or digit, but not underscore,
				1127	whereas [\w] includes underscore. A positive character class should be read as
				1128	"something OR something OR ..." and a negative class as "NOT something AND NOT
				1129	something AND NOT ...".
				1130	</P>
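<P>
The "NOT something AND NOT something" reading of [^\W_] can be confirmed with a
short test. Python's <b>re</b> module is used as a stand-in here, with its
ASCII flag so that \w means letters, digits, and underscore only.
</P>

```python
import re

# NOT a non-word character AND NOT underscore: letters and digits only.
alnum_only = re.compile(r"[^\W_]", re.ASCII)
# The positive class still includes underscore.
word_char = re.compile(r"[\w]", re.ASCII)

for ch in ("a", "7", "_", "-"):
    print(repr(ch), bool(alnum_only.fullmatch(ch)), bool(word_char.fullmatch(ch)))
```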
				1131	<P>
				1132	The only metacharacters that are recognized in character classes are backslash,
				1133	hyphen (only where it can be interpreted as specifying a range), circumflex
				1134	(only at the start), opening square bracket (only when it can be interpreted as
				1135	introducing a POSIX class name - see the next section), and the terminating
				1136	closing square bracket. However, escaping other non-alphanumeric characters
				1137	does no harm.
				1138	</P>
				1139	<br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
				1140	<P>
				1141	Perl supports the POSIX notation for character classes. This uses names
				1142	enclosed by [: and :] within the enclosing square brackets. PCRE also supports
				1143	this notation. For example,
				1144	<pre>
				1145	[01[:alpha:]%]
				1146	</pre>
				1147	matches "0", "1", any alphabetic character, or "%". The supported class names
				1148	are:
				1149	<pre>
				1150	  alnum    letters and digits
				1151	  alpha    letters
				1152	  ascii    character codes 0 - 127
				1153	  blank    space or tab only
				1154	  cntrl    control characters
				1155	  digit    decimal digits (same as \d)
				1156	  graph    printing characters, excluding space
				1157	  lower    lower case letters
				1158	  print    printing characters, including space
				1159	  punct    printing characters, excluding letters and digits and space
				1160	  space    white space (not quite the same as \s)
				1161	  upper    upper case letters
				1162	  word     "word" characters (same as \w)
				1163	  xdigit   hexadecimal digits
				1164	</pre>
				1165	The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
				1166	space (32). Notice that this list includes the VT character (code 11). This
				1167	makes "space" different to \s, which does not include VT (for Perl
				1168	compatibility).
				1169	</P>
				1170	<P>
				1171	The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
				1172	5.8. Another Perl extension is negation, which is indicated by a ^ character
				1173	after the colon. For example,
				1174	<pre>
				1175	[12[:^digit:]]
				1176	</pre>
				1177	matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
				1178	syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
				1179	supported, and an error is given if they are encountered.
				1180	</P>
				1181	<P>
				1182	By default, in UTF-8 mode, characters with values greater than 127 do not match
				1183	any of the POSIX character classes. However, if the PCRE_UCP option is passed
				1184	to <b>pcre_compile()</b>, some of the classes are changed so that Unicode
				1185	character properties are used. This is achieved by replacing the POSIX classes
				1186	by other sequences, as follows:
				1187	<pre>
				1188	  [:alnum:]  becomes  \p{Xan}
				1189	  [:alpha:]  becomes  \p{L}
				1190	  [:blank:]  becomes  \h
				1191	  [:digit:]  becomes  \p{Nd}
				1192	  [:lower:]  becomes  \p{Ll}
				1193	  [:space:]  becomes  \p{Xps}
				1194	  [:upper:]  becomes  \p{Lu}
				1195	  [:word:]   becomes  \p{Xwd}
				1196	</pre>
				1197	Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX
				1198	classes are unchanged, and match only characters with code points less than
				1199	128.
				1200	</P>
				1201	<br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br>
				1202	<P>
				1203	Vertical bar characters are used to separate alternative patterns. For example,
				1204	the pattern
				1205	<pre>
				1206	gilbert|sullivan
				1207	</pre>
				1208	matches either "gilbert" or "sullivan". Any number of alternatives may appear,
				1209	and an empty alternative is permitted (matching the empty string). The matching
				1210	process tries each alternative in turn, from left to right, and the first one
				1211	that succeeds is used. If the alternatives are within a subpattern
				1212	<a href="#subpattern">(defined below),</a>
				1213	"succeeds" means matching the rest of the main pattern as well as the
				1214	alternative in the subpattern.
				1215	</P>
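<P>
The left-to-right, first-success rule is easy to observe with a pair of
overlapping alternatives. The sketch uses Python's <b>re</b> module, a
backtracking matcher that is assumed to follow the same rule as PCRE for these
patterns. The "a|ab" pattern is an extra illustration, not taken from the text.
</P>

```python
import re

names = re.compile(r"gilbert|sullivan")
print(bool(names.fullmatch("gilbert")), bool(names.fullmatch("sullivan")))

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group(0)
print(first_wins)  # a

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|", "")
print(empty_ok is not None)  # True
```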
				1216	<br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br>
				1217	<P>
				1218	The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
				1219	PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
				1220	the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
				1221	The option letters are
				1222	<pre>
				1223	  i  for PCRE_CASELESS
				1224	  m  for PCRE_MULTILINE
				1225	  s  for PCRE_DOTALL
				1226	  x  for PCRE_EXTENDED
				1227	</pre>
				1228	For example, (?im) sets caseless, multiline matching. It is also possible to
				1229	unset these options by preceding the letter with a hyphen, and a combined
				1230	setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
				1231	PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
				1232	permitted. If a letter appears both before and after the hyphen, the option is
				1233	unset.
				1234	</P>
				1235	<P>
				1236	The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
				1237	changed in the same way as the Perl-compatible options by using the characters
				1238	J, U and X respectively.
				1239	</P>
				1240	<P>
				1241	When one of these option changes occurs at top level (that is, not inside
				1242	subpattern parentheses), the change applies to the remainder of the pattern
				1243	that follows. If the change is placed right at the start of a pattern, PCRE
				1244	extracts it into the global options (and it will therefore show up in data
				1245	extracted by the <b>pcre_fullinfo()</b> function).
				1246	</P>
				1247	<P>
				1248	An option change within a subpattern (see below for a description of
				1249	subpatterns) affects only that part of the subpattern that follows it, so
				1250	<pre>
				1251	(a(?i)b)c
				1252	</pre>
				1253	matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
				1254	By this means, options can be made to have different settings in different
				1255	parts of the pattern. Any changes made in one alternative do carry on
				1256	into subsequent branches within the same subpattern. For example,
				1257	<pre>
				1258	(a(?i)b|c)
				1259	</pre>
				1260	matches "ab", "aB", "c", and "C", even though when matching "C" the first
				1261	branch is abandoned before the option setting. This is because the effects of
				1262	option settings happen at compile time. There would be some very weird
				1263	behaviour otherwise.
				1264	</P>
				1265	<P>
				1266	<b>Note:</b> There are other PCRE-specific options that can be set by the
				1267	application when the compile or match functions are called. In some cases the
				1268	pattern can contain special leading sequences such as (*CRLF) to override what
				1269	the application has set or what has been defaulted. Details are given in the
				1270	section entitled
				1271	<a href="#newlineseq">"Newline sequences"</a>
				1272	above. There are also the (*UTF8) and (*UCP) leading sequences that can be used
				1273	to set UTF-8 and Unicode property modes; they are equivalent to setting the
				1274	PCRE_UTF8 and the PCRE_UCP options, respectively.
				1275	<a name="subpattern"></a></P>
				1276	<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br>
				1277	<P>
				1278	Subpatterns are delimited by parentheses (round brackets), which can be nested.
				1279	Turning part of a pattern into a subpattern does two things:
				1280	<br>
				1281	<br>
				1282	1. It localizes a set of alternatives. For example, the pattern
				1283	<pre>
				1284	cat(aract|erpillar|)
				1285	</pre>
				1286	matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
				1287	match "cataract", "erpillar" or an empty string.
				1288	<br>
				1289	<br>
				1290	2. It sets up the subpattern as a capturing subpattern. This means that, when
				1291	the whole pattern matches, that portion of the subject string that matched the
				1292	subpattern is passed back to the caller via the <i>ovector</i> argument of
				1293	<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
				1294	from 1) to obtain numbers for the capturing subpatterns. For example, if the
				1295	string "the red king" is matched against the pattern
				1296	<pre>
				1297	the ((red|white) (king|queen))
				1298	</pre>
				1299	the captured substrings are "red king", "red", and "king", and are numbered 1,
				1300	2, and 3, respectively.
				1301	</P>
				1302	<P>
				1303	The fact that plain parentheses fulfil two functions is not always helpful.
				1304	There are often times when a grouping subpattern is required without a
				1305	capturing requirement. If an opening parenthesis is followed by a question mark
				1306	and a colon, the subpattern does not do any capturing, and is not counted when
				1307	computing the number of any subsequent capturing subpatterns. For example, if
				1308	the string "the white queen" is matched against the pattern
				1309	<pre>
				1310	the ((?:red|white) (king|queen))
				1311	</pre>
				1312	the captured substrings are "white queen" and "queen", and are numbered 1 and
				1313	2. The maximum number of capturing subpatterns is 65535.
				1314	</P>
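<P>
The numbering rules above can be checked with a short sketch. It uses Python's
<b>re</b> module rather than PCRE itself (an assumption made only so that the
example is easy to run; the two engines agree on this syntax), and the variable
names are illustrative.
</P>

```python
import re

# Plain parentheses both group and capture. Numbers are assigned by
# counting opening parentheses from left to right, starting at 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: ... ) groups without capturing, so it is skipped when numbering
# and the following capturing group moves down by one.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
non_capturing_groups = non_capturing.groups()

print(capturing_groups)
print(non_capturing_groups)
```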
				1315	<P>
				1316	As a convenient shorthand, if any option settings are required at the start of
				1317	a non-capturing subpattern, the option letters may appear between the "?" and
				1318	the ":". Thus the two patterns
				1319	<pre>
				1320	(?i:saturday|sunday)
				1321	(?:(?i)saturday|sunday)
				1322	</pre>
				1323	match exactly the same set of strings. Because alternative branches are tried
				1324	from left to right, and options are not reset until the end of the subpattern
				1325	is reached, an option setting in one branch does affect subsequent branches, so
				1326	the above patterns match "SUNDAY" as well as "Saturday".
				1327	<a name="dupsubpatternnumber"></a></P>
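<P>
A minimal sketch of the shorthand form, again using Python's <b>re</b> module
as a stand-in for PCRE (an assumption). Only the (?i: form is shown, because
recent Python versions reject an option setting such as (?i) that is not at the
very start of the pattern, whereas PCRE accepts it.
</P>

```python
import re

# The option letter sits between "?" and ":", so the whole
# non-capturing group, including both branches, is caseless.
shorthand = re.compile(r"(?i:saturday|sunday)")

subjects = ["Saturday", "SUNDAY", "sunday", "monday"]
matched = [s for s in subjects if shorthand.fullmatch(s)]

print(matched)
```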
				1328	<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
				1329	<P>
				1330	Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
				1331	the same numbers for its capturing parentheses. Such a subpattern starts with
				1332	(?| and is itself a non-capturing subpattern. For example, consider this
				1333	pattern:
				1334	<pre>
				1335	(?|(Sat)ur|(Sun))day
				1336	</pre>
				1337	Because the two alternatives are inside a (?| group, both sets of capturing
				1338	parentheses are numbered one. Thus, when the pattern matches, you can look
				1339	at captured substring number one, whichever alternative matched. This construct
				1340	is useful when you want to capture part, but not all, of one of a number of
				1341	alternatives. Inside a (?| group, parentheses are numbered as usual, but the
				1342	number is reset at the start of each branch. The numbers of any capturing
				1343	parentheses that follow the subpattern start after the highest number used in
				1344	any branch. The following example is taken from the Perl documentation. The
				1345	numbers underneath show in which buffer the captured content will be stored.
				1346	<pre>
				1347	# before  ---------------branch-reset----------- after
				1348	/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
				1349	# 1            2         2  3        2     3     4
				1350	</pre>
				1351	A back reference to a numbered subpattern uses the most recent value that is
				1352	set for that number by any subpattern. The following pattern matches "abcabc"
				1353	or "defdef":
				1354	<pre>
				1355	/(?|(abc)|(def))\1/
				1356	</pre>
				1357	In contrast, a subroutine call to a numbered subpattern always refers to the
				1358	first one in the pattern with the given number. The following pattern matches
				1359	"abcabc" or "defabc":
				1360	<pre>
				1361	/(?|(abc)|(def))(?1)/
				1362	</pre>
				1363	If a
				1364	<a href="#conditions">condition test</a>
				1365	for a subpattern's having matched refers to a non-unique number, the test is
				1366	true if any of the subpatterns of that number have matched.
				1367	</P>
				1368	<P>
				1369	An alternative approach to using this "branch reset" feature is to use
				1370	duplicate named subpatterns, as described in the next section.
				1371	</P>
				1372	<br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br>
				1373	<P>
				1374	Identifying capturing parentheses by number is simple, but it can be very hard
				1375	to keep track of the numbers in complicated regular expressions. Furthermore,
				1376	if an expression is modified, the numbers may change. To help with this
				1377	difficulty, PCRE supports the naming of subpatterns. This feature was not
				1378	added to Perl until release 5.10. Python had the feature earlier, and PCRE
				1379	introduced it at release 4.0, using the Python syntax. PCRE now supports both
				1380	the Perl and the Python syntax. Perl allows identically numbered subpatterns to
				1381	have different names, but PCRE does not.
				1382	</P>
				1383	<P>
				1384	In PCRE, a subpattern can be named in one of three ways: (?<name>...) or
				1385	(?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing
				1386	parentheses from other parts of the pattern, such as
				1387	<a href="#backreferences">back references,</a>
				1388	<a href="#recursion">recursion,</a>
				1389	and
				1390	<a href="#conditions">conditions,</a>
				1391	can be made by name as well as by number.
				1392	</P>
				1393	<P>
				1394	Names consist of up to 32 alphanumeric characters and underscores. Named
				1395	capturing parentheses are still allocated numbers as well as names, exactly as
				1396	if the names were not present. The PCRE API provides function calls for
				1397	extracting the name-to-number translation table from a compiled pattern. There
				1398	is also a convenience function for extracting a captured substring by name.
				1399	</P>
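<P>
The relationship between names and numbers can be sketched as follows. This is
not the PCRE API: it uses Python's <b>re</b> module, whose <i>groupindex</i>
attribute plays a role comparable to the name-to-number table mentioned above.
The pattern and the group names "year" and "month" are invented for the
illustration, and the Python (?P form is used because it is the one syntax that
both engines accept.
</P>

```python
import re

# Named groups still receive ordinary numbers, counted as if the
# names were absent.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Python's counterpart of a name-to-number translation table.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2013-11")
by_name = m.group("year")   # extraction by name
by_number = m.group(1)      # the same substring, by number

print(name_to_number, by_name, by_number)
```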
				1400	<P>
				1401	By default, a name must be unique within a pattern, but it is possible to relax
				1402	this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
				1403	names are also always permitted for subpatterns with the same number, set up as
				1404	described in the previous section.) Duplicate names can be useful for patterns
				1405	where only one instance of the named parentheses can match. Suppose you want to
				1406	match the name of a weekday, either as a 3-letter abbreviation or as the full
				1407	name, and in both cases you want to extract the abbreviation. This pattern
				1408	(ignoring the line breaks) does the job:
				1409	<pre>
				1410	(?<DN>Mon|Fri|Sun)(?:day)?|
				1411	(?<DN>Tue)(?:sday)?|
				1412	(?<DN>Wed)(?:nesday)?|
				1413	(?<DN>Thu)(?:rsday)?|
				1414	(?<DN>Sat)(?:urday)?
				1415	</pre>
				1416	There are five capturing substrings, but only one is ever set after a match.
				1417	(An alternative way of solving this problem is to use a "branch reset"
				1418	subpattern, as described in the previous section.)
				1419	</P>
				1420	<P>
				1421	The convenience function for extracting the data by name returns the substring
				1422	for the first (and in this example, the only) subpattern of that name that
				1423	matched. This saves searching to find which numbered subpattern it was.
				1424	</P>
				1425	<P>
				1426	If you make a back reference to a non-unique named subpattern from elsewhere in
				1427	the pattern, the one that corresponds to the first occurrence of the name is
				1428	used. In the absence of duplicate numbers (see the previous section) this is
				1429	the one with the lowest number. If you use a named reference in a condition
				1430	test (see the
				1431	<a href="#conditions">section about conditions</a>
				1432	below), either to check whether a subpattern has matched, or to check for
				1433	recursion, all subpatterns with the same name are tested. If the condition is
				1434	true for any one of them, the overall condition is true. This is the same
				1435	behaviour as testing by number. For further details of the interfaces for
				1436	handling named subpatterns, see the
				1437	<a href="pcreapi.html"><b>pcreapi</b></a>
				1438	documentation.
				1439	</P>
				1440	<P>
				1441	<b>Warning:</b> You cannot use different names to distinguish between two
				1442	subpatterns with the same number because PCRE uses only the numbers when
				1443	matching. For this reason, an error is given at compile time if different names
				1444	are given to subpatterns with the same number. However, you can give the same
				1445	name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
				1446	</P>
				1447	<br><a name="SEC15" href="#TOC1">REPETITION</a><br>
				1448	<P>
				1449	Repetition is specified by quantifiers, which can follow any of the following
				1450	items:
				1451	<pre>
				1452	a literal data character
				1453	the dot metacharacter
				1454	the \C escape sequence
				1455	the \X escape sequence (in UTF-8 mode with Unicode properties)
				1456	the \R escape sequence
				1457	an escape such as \d or \pL that matches a single character
				1458	a character class
				1459	a back reference (see next section)
				1460	a parenthesized subpattern (including assertions)
				1461	a subroutine call to a subpattern (recursive or otherwise)
				1462	</pre>
				1463	The general repetition quantifier specifies a minimum and maximum number of
				1464	permitted matches, by giving the two numbers in curly brackets (braces),
				1465	separated by a comma. The numbers must be less than 65536, and the first must
				1466	be less than or equal to the second. For example:
				1467	<pre>
				1468	z{2,4}
				1469	</pre>
				1470	matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
				1471	character. If the second number is omitted, but the comma is present, there is
				1472	no upper limit; if the second number and the comma are both omitted, the
				1473	quantifier specifies an exact number of required matches. Thus
				1474	<pre>
				1475	[aeiou]{3,}
				1476	</pre>
				1477	matches at least 3 successive vowels, but may match many more, while
				1478	<pre>
				1479	\d{8}
				1480	</pre>
				1481	matches exactly 8 digits. An opening curly bracket that appears in a position
				1482	where a quantifier is not allowed, or one that does not match the syntax of a
				1483	quantifier, is taken as a literal character. For example, {,6} is not a
				1484	quantifier, but a literal string of four characters.
				1485	</P>
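<P>
The three quantifier forms above can be exercised with a small sketch. It uses
Python's <b>re</b> module as a stand-in for PCRE (an assumption), and the
helper name <i>full</i> is hypothetical. The {,6} case is deliberately left
out, because Python treats {,6} as a quantifier meaning "zero to six", which
differs from the PCRE behaviour described above.
</P>

```python
import re


def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


# {2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {3,}: at least three repetitions, with no upper limit.
many_vowels = full(r"[aeiou]{3,}", "aeiouaeiou")
too_few_vowels = full(r"[aeiou]{3,}", "ae")

# {8}: exactly eight repetitions.
eight_digits = full(r"\d{8}", "20131114")
seven_digits = full(r"\d{8}", "2013111")

print(bounded, many_vowels, too_few_vowels, eight_digits, seven_digits)
```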
				1486	<P>
				1487	In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
				1488	bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
				1489	which is represented by a two-byte sequence. Similarly, when Unicode property
				1490	support is available, \X{3} matches three Unicode extended sequences, each of
				1491	which may be several bytes long (and they may be of different lengths).
				1492	</P>
				1493	<P>
				1494	The quantifier {0} is permitted, causing the expression to behave as if the
				1495	previous item and the quantifier were not present. This may be useful for
				1496	subpatterns that are referenced as
				1497	<a href="#subpatternsassubroutines">subroutines</a>
				1498	from elsewhere in the pattern (but see also the section entitled
				1499	<a href="#subdefine">"Defining subpatterns for use by reference only"</a>
				1500	below). Items other than subpatterns that have a {0} quantifier are omitted
				1501	from the compiled pattern.
				1502	</P>
				1503	<P>
				1504	For convenience, the three most common quantifiers have single-character
				1505	abbreviations:
				1506	<pre>
				1507	* is equivalent to {0,}
				1508	+ is equivalent to {1,}
				1509	? is equivalent to {0,1}
				1510	</pre>
				1511	It is possible to construct infinite loops by following a subpattern that can
				1512	match no characters with a quantifier that has no upper limit, for example:
				1513	<pre>
				1514	(a?)*
				1515	</pre>
				1516	Earlier versions of Perl and PCRE used to give an error at compile time for
				1517	such patterns. However, because there are cases where this can be useful, such
				1518	patterns are now accepted, but if any repetition of the subpattern does in fact
				1519	match no characters, the loop is forcibly broken.
				1520	</P>
				1521	<P>
				1522	By default, the quantifiers are "greedy", that is, they match as much as
				1523	possible (up to the maximum number of permitted times), without causing the
				1524	rest of the pattern to fail. The classic example of where this gives problems
				1525	is in trying to match comments in C programs. These appear between /* and */
				1526	and within the comment, individual * and / characters may appear. An attempt to
				1527	match C comments by applying the pattern
				1528	<pre>
				1529	/\*.*\*/
				1530	</pre>
				1531	to the string
				1532	<pre>
				1533	/* first comment */ not comment /* second comment */
				1534	</pre>
				1535	fails, because it matches the entire string owing to the greediness of the .*
				1536	item.
				1537	</P>
				1538	<P>
				1539	However, if a quantifier is followed by a question mark, it ceases to be
				1540	greedy, and instead matches the minimum number of times possible, so the
				1541	pattern
				1542	<pre>
				1543	/\*.*?\*/
				1544	</pre>
				1545	does the right thing with the C comments. The meaning of the various
				1546	quantifiers is not otherwise changed, just the preferred number of matches.
				1547	Do not confuse this use of question mark with its use as a quantifier in its
				1548	own right. Because it has two uses, it can sometimes appear doubled, as in
				1549	<pre>
				1550	\d??\d
				1551	</pre>
				1552	which matches one digit by preference, but can match two if that is the only
				1553	way the rest of the pattern matches.
				1554	</P>
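<P>
The difference between greedy and minimizing quantifiers can be seen in a short
sketch. It assumes Python's <b>re</b> module in place of PCRE; both engines
treat these quantifiers in the same way. The subject string is a C fragment
containing two comments.
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so one match swallows everything.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# A doubled question mark: an optional digit that prefers to be absent.
prefers_one = re.match(r"\d??\d", "12").group()
can_take_two = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(prefers_one, can_take_two)
```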
				1555	<P>
				1556	If the PCRE_UNGREEDY option is set (an option that is not available in Perl),
				1557	the quantifiers are not greedy by default, but individual ones can be made
				1558	greedy by following them with a question mark. In other words, it inverts the
				1559	default behaviour.
				1560	</P>
				1561	<P>
				1562	When a parenthesized subpattern is quantified with a minimum repeat count that
				1563	is greater than 1 or with a limited maximum, more memory is required for the
				1564	compiled pattern, in proportion to the size of the minimum or maximum.
				1565	</P>
				1566	<P>
				1567	If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
				1568	to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
				1569	implicitly anchored, because whatever follows will be tried against every
				1570	character position in the subject string, so there is no point in retrying the
				1571	overall match at any position after the first. PCRE normally treats such a
				1572	pattern as though it were preceded by \A.
				1573	</P>
				1574	<P>
				1575	In cases where it is known that the subject string contains no newlines, it is
				1576	worth setting PCRE_DOTALL in order to obtain this optimization, or
				1577	alternatively using ^ to indicate anchoring explicitly.
				1578	</P>
				1579	<P>
				1580	However, there is one situation where the optimization cannot be used. When .*
				1581	is inside capturing parentheses that are the subject of a back reference
				1582	elsewhere in the pattern, a match at the start may fail where a later one
				1583	succeeds. Consider, for example:
				1584	<pre>
				1585	(.*)abc\1
				1586	</pre>
				1587	If the subject is "xyz123abc123" the match point is the fourth character. For
				1588	this reason, such a pattern is not implicitly anchored.
				1589	</P>
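<P>
A brief check of the example above, using Python's <b>re</b> module as a
stand-in for PCRE (an assumption). Offsets are counted from zero, so the fourth
character is at offset 3.
</P>

```python
import re

# No match is possible at offsets 0, 1 or 2, because the text captured
# before "abc" would have to reappear after it. The search succeeds only
# when it starts at "123".
m = re.search(r"(.*)abc\1", "xyz123abc123")

match_offset = m.start()
captured = m.group(1)

print(match_offset, captured)
```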
				1590	<P>
				1591	When a capturing subpattern is repeated, the value captured is the substring
				1592	that matched the final iteration. For example, after
				1593	<pre>
				1594	(tweedle[dume]{3}\s*)+
				1595	</pre>
				1596	has matched "tweedledum tweedledee" the value of the captured substring is
				1597	"tweedledee". However, if there are nested capturing subpatterns, the
				1598	corresponding captured values may have been set in previous iterations. For
				1599	example, after
				1600	<pre>
				1601	/(a|(b))+/
				1602	</pre>
				1603	matches "aba" the value of the second captured substring is "b".
				1604	<a name="atomicgroup"></a></P>
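<P>
The two capture examples above can be reproduced with the following sketch. It
assumes Python's <b>re</b> module, which for these two patterns behaves in the
way described here for PCRE.
</P>

```python
import re

# A repeated capturing group reports only its final iteration.
repeated = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
final_iteration = repeated.group(1)

# The inner group took part only in the second iteration ("b"), yet its
# value is still set after the third iteration matches "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
nested_groups = nested.groups()

print(final_iteration)
print(nested_groups)
```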
				1605	<br><a name="SEC16" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
				1606	<P>
				1607	With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
				1608	repetition, failure of what follows normally causes the repeated item to be
				1609	re-evaluated to see if a different number of repeats allows the rest of the
				1610	pattern to match. Sometimes it is useful to prevent this, either to change the
				1611	nature of the match, or to cause it to fail earlier than it otherwise might, when
				1612	the author of the pattern knows there is no point in carrying on.
				1613	</P>
				1614	<P>
				1615	Consider, for example, the pattern \d+foo when applied to the subject line
				1616	<pre>
				1617	123456bar
				1618	</pre>
				1619	After matching all 6 digits and then failing to match "foo", the normal
				1620	action of the matcher is to try again with only 5 digits matching the \d+
				1621	item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
				1622	(a term taken from Jeffrey Friedl's book) provides the means for specifying
				1623	that once a subpattern has matched, it is not to be re-evaluated in this way.
				1624	</P>
				1625	<P>
				1626	If we use atomic grouping for the previous example, the matcher gives up
				1627	immediately on failing to match "foo" the first time. The notation is a kind of
				1628	special parenthesis, starting with (?> as in this example:
				1629	<pre>
				1630	(?>\d+)foo
				1631	</pre>
				1632	This kind of parenthesis "locks up" the part of the pattern it contains once
				1633	it has matched, and a failure further into the pattern is prevented from
				1634	backtracking into it. Backtracking past it to previous items, however, works as
				1635	normal.
				1636	</P>
				1637	<P>
				1638	An alternative description is that a subpattern of this type matches the string
				1639	of characters that an identical standalone pattern would match, if anchored at
				1640	the current point in the subject string.
				1641	</P>
				1642	<P>
				1643	Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
				1644	the above example can be thought of as a maximizing repeat that must swallow
				1645	everything it can. So, while both \d+ and \d+? are prepared to adjust the
				1646	number of digits they match in order to make the rest of the pattern match,
				1647	(?>\d+) can only match an entire sequence of digits.
				1648	</P>
				1649	<P>
				1650	Atomic groups in general can of course contain arbitrarily complicated
				1651	subpatterns, and can be nested. However, when the subpattern for an atomic
				1652	group is just a single repeated item, as in the example above, a simpler
				1653	notation, called a "possessive quantifier" can be used. This consists of an
				1654	additional + character following a quantifier. Using this notation, the
				1655	previous example can be rewritten as
				1656	<pre>
				1657	\d++foo
				1658	</pre>
				1659	Note that a possessive quantifier can be used with an entire group, for
				1660	example:
				1661	<pre>
				1662	(abc|xyz){2,3}+
				1663	</pre>
				1664	Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
				1665	option is ignored. They are a convenient notation for the simpler forms of
				1666	atomic group. However, there is no difference in the meaning of a possessive
				1667	quantifier and the equivalent atomic group, though there may be a performance
				1668	difference; possessive quantifiers should be slightly faster.
				1669	</P>
				1670	<P>
				1671	The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
				1672	Jeffrey Friedl originated the idea (and the name) in the first edition of his
				1673	book. Mike McCloskey liked it, so implemented it when he built Sun's Java
				1674	package, and PCRE copied it from there. It ultimately found its way into Perl
				1675	at release 5.10.
				1676	</P>
				1677	<P>
				1678	PCRE has an optimization that automatically "possessifies" certain simple
				1679	pattern constructs. For example, the sequence A+B is treated as A++B because
				1680	there is no point in backtracking into a sequence of A's when B must follow.
				1681	</P>
				1682	<P>
				1683	When a pattern contains an unlimited repeat inside a subpattern that can itself
				1684	be repeated an unlimited number of times, the use of an atomic group is the
				1685	only way to avoid some failing matches taking a very long time indeed. The
				1686	pattern
				1687	<pre>
				1688	(\D+|<\d+>)*[!?]
				1689	</pre>
				1690	matches an unlimited number of substrings that either consist of non-digits, or
				1691	digits enclosed in <>, followed by either ! or ?. When it matches, it runs
				1692	quickly. However, if it is applied to
				1693	<pre>
				1694	aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
				1695	</pre>
				1696	it takes a long time before reporting failure. This is because the string can
				1697	be divided between the internal \D+ repeat and the external * repeat in a
				1698	large number of ways, and all have to be tried. (The example uses [!?] rather
				1699	than a single character at the end, because both PCRE and Perl have an
				1700	optimization that allows for fast failure when a single character is used. They
				1701	remember the last single character that is required for a match, and fail early
				1702	if it is not present in the string.) If the pattern is changed so that it uses
				1703	an atomic group, like this:
				1704	<pre>
				1705	((?>\D+)|<\d+>)*[!?]
				1706	</pre>
				1707	sequences of non-digits cannot be broken, and failure happens quickly.
				1708	<a name="backreferences"></a></P>
				1709	<br><a name="SEC17" href="#TOC1">BACK REFERENCES</a><br>
				1710	<P>
				1711	Outside a character class, a backslash followed by a digit greater than 0 (and
				1712	possibly further digits) is a back reference to a capturing subpattern earlier
				1713	(that is, to its left) in the pattern, provided there have been that many
				1714	previous capturing left parentheses.
				1715	</P>
				1716	<P>
				1717	However, if the decimal number following the backslash is less than 10, it is
				1718	always taken as a back reference, and causes an error only if there are not
				1719	that many capturing left parentheses in the entire pattern. In other words, the
				1720	parentheses that are referenced need not be to the left of the reference for
				1721	numbers less than 10. A "forward back reference" of this type can make sense
				1722	when a repetition is involved and the subpattern to the right has participated
				1723	in an earlier iteration.
				1724	</P>
				1725	<P>
				1726	It is not possible to have a numerical "forward back reference" to a subpattern
				1727	whose number is 10 or more using this syntax because a sequence such as \50 is
				1728	interpreted as a character defined in octal. See the subsection entitled
				1729	"Non-printing characters"
				1730	<a href="#digitsafterbackslash">above</a>
				1731	for further details of the handling of digits following a backslash. There is
				1732	no such problem when named parentheses are used. A back reference to any
				1733	subpattern is possible using named parentheses (see below).
				1734	</P>
				1735	<P>
				1736	Another way of avoiding the ambiguity inherent in the use of digits following a
				1737	backslash is to use the \g escape sequence. This escape must be followed by an
				1738	unsigned number or a negative number, optionally enclosed in braces. These
				1739	examples are all identical:
				1740	<pre>
				1741	(ring), \1
				1742	(ring), \g1
				1743	(ring), \g{1}
				1744	</pre>
				1745	An unsigned number specifies an absolute reference without the ambiguity that
				1746	is present in the older syntax. It is also useful when literal digits follow
				1747	the reference. A negative number is a relative reference. Consider this
				1748	example:
				1749	<pre>
				1750	(abc(def)ghi)\g{-1}
				1751	</pre>
				1752	The sequence \g{-1} is a reference to the most recently started capturing
				1753	subpattern before \g, that is, it is equivalent to \2 in this example.
				1754	Similarly, \g{-2} would be equivalent to \1. The use of relative references
				1755	can be helpful in long patterns, and also in patterns that are created by
				1756	joining together fragments that contain references within themselves.
				1757	</P>
				1758	<P>
				1759	A back reference matches whatever actually matched the capturing subpattern in
				1760	the current subject string, rather than anything matching the subpattern
				1761	itself (see
				1762	<a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
				1763	below for a way of doing that). So the pattern
				1764	<pre>
				1765	(sens|respons)e and \1ibility
				1766	</pre>
				1767	matches "sense and sensibility" and "response and responsibility", but not
				1768	"sense and responsibility". If caseful matching is in force at the time of the
				1769	back reference, the case of letters is relevant. For example,
				1770	<pre>
				1771	((?i)rah)\s+\1
				1772	</pre>
				1773	matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
				1774	capturing subpattern is matched caselessly.
				1775	</P>
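<P>
That a back reference repeats the matched text, and not the subpattern, can be
confirmed with a small sketch. It assumes Python's <b>re</b> module in place of
PCRE; the \1 syntax is common to both.
</P>

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat exactly what group 1 captured in this subject, so the
# mixed third string is rejected.
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```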
				1776	<P>
				1777	There are several different ways of writing back references to named
				1778	subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
				1779	\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
				1780	back reference syntax, in which \g can be used for both numeric and named
				1781	references, is also supported. We could rewrite the above example in any of
				1782	the following ways:
				1783	<pre>
				1784	(?<p1>(?i)rah)\s+\k<p1>
				1785	(?'p1'(?i)rah)\s+\k{p1}
				1786	(?P<p1>(?i)rah)\s+(?P=p1)
				1787	(?<p1>(?i)rah)\s+\g{p1}
				1788	</pre>
				1789	A subpattern that is referenced by name may appear in the pattern before or
				1790	after the reference.
				1791	</P>
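<P>
A sketch of a named back reference, under two stated assumptions. First, it
uses Python's <b>re</b> module, which accepts only the Python forms (?P and
(?P=name) of the syntaxes listed above. Second, the caseless part is written as
(?i:rah), because recent Python versions do not accept (?i) in the middle of a
pattern.
</P>

```python
import re

# Group p1 is matched caselessly, but the back reference lies outside
# the (?i: ... ) scope and therefore compares letters casefully.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

subjects = ["rah rah", "RAH RAH", "RAH rah"]
results = [pattern.fullmatch(s) is not None for s in subjects]

print(results)
```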
				1792	<P>
				1793	There may be more than one back reference to the same subpattern. If a
				1794	subpattern has not actually been used in a particular match, any back
				1795	references to it always fail by default. For example, the pattern
				1796	<pre>
				1797	(a|(bc))\2
				1798	</pre>
				1799	always fails if it starts to match "a" rather than "bc". However, if the
				1800	PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
				1801	unset value matches an empty string.
				1802	</P>
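<P>
The default treatment of a back reference to an unset group can be sketched as
follows. The example assumes Python's <b>re</b> module, which behaves like the
PCRE default in this respect; it has no counterpart of PCRE_JAVASCRIPT_COMPAT.
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" branch is taken, so only then can
# \2 succeed.
bc_branch = pattern.fullmatch("bcbc") is not None

# When the "a" branch is taken, group 2 is unset and \2 fails.
a_branch = pattern.match("a") is not None
a_then_bc = pattern.match("abc") is not None

print(bc_branch, a_branch, a_then_bc)
```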
				1803	<P>
				1804	Because there may be many capturing parentheses in a pattern, all digits
				1805	following a backslash are taken as part of a potential back reference number.
				1806	If the pattern continues with a digit character, some delimiter must be used to
				1807	terminate the back reference. If the PCRE_EXTENDED option is set, this can be
				1808	whitespace. Otherwise, the \g{ syntax or an empty comment (see
				1809	<a href="#comments">"Comments"</a>
				1810	below) can be used.
				1811	</P>
				1812	<br><b>
				1813	Recursive back references
				1814	</b><br>
				1815	<P>
				1816	A back reference that occurs inside the parentheses to which it refers fails
				1817	when the subpattern is first used, so, for example, (a\1) never matches.
				1818	However, such references can be useful inside repeated subpatterns. For
				1819	example, the pattern
				1820	<pre>
				1821	(a|b\1)+
				1822	</pre>
				1823	matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
				1824	the subpattern, the back reference matches the character string corresponding
				1825	to the previous iteration. In order for this to work, the pattern must be such
				1826	that the first iteration does not need to match the back reference. This can be
				1827	done using alternation, as in the example above, or by a quantifier with a
				1828	minimum of zero.
				1829	</P>
				1830	<P>
				1831	Back references of this type cause the group that they reference to be treated
				1832	as an
				1833	<a href="#atomicgroup">atomic group.</a>
				1834	Once the whole group has been matched, a subsequent matching failure cannot
				1835	cause backtracking into the middle of the group.
				1836	<a name="bigassertions"></a></P>
				1837	<br><a name="SEC18" href="#TOC1">ASSERTIONS</a><br>
				1838	<P>
				1839	An assertion is a test on the characters following or preceding the current
				1840	matching point that does not actually consume any characters. The simple
				1841	assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
				1842	<a href="#smallassertions">above.</a>
				1843	</P>
				1844	<P>
				1845	More complicated assertions are coded as subpatterns. There are two kinds:
				1846	those that look ahead of the current position in the subject string, and those
				1847	that look behind it. An assertion subpattern is matched in the normal way,
				1848	except that it does not cause the current matching position to be changed.
				1849	</P>
				1850	<P>
				1851	Assertion subpatterns are not capturing subpatterns. If such an assertion
				1852	contains capturing subpatterns within it, these are counted for the purposes of
				1853	numbering the capturing subpatterns in the whole pattern. However, substring
				1854	capturing is carried out only for positive assertions, because it does not make
				1855	sense for negative assertions.
				1856	</P>
				1857	<P>
				1858	For compatibility with Perl, assertion subpatterns may be repeated; though
				1859	it makes no sense to assert the same thing several times, the side effect of
				1860	capturing parentheses may occasionally be useful. In practice, there are only three
				1861	cases:
				1862	<br>
				1863	<br>
				1864	(1) If the quantifier is {0}, the assertion is never obeyed during matching.
				1865	However, it may contain internal capturing parenthesized groups that are called
				1866	from elsewhere via the
				1867	<a href="#subpatternsassubroutines">subroutine mechanism.</a>
				1868	<br>
				1869	<br>
				1870	(2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
				1871	were {0,1}. At run time, the rest of the pattern match is tried with and
				1872	without the assertion, the order depending on the greediness of the quantifier.
				1873	<br>
				1874	<br>
				1875	(3) If the minimum repetition is greater than zero, the quantifier is ignored.
				1876	The assertion is obeyed just once when encountered during matching.
				1877	</P>
				1878	<br><b>
				1879	Lookahead assertions
				1880	</b><br>
				1881	<P>
				1882	Lookahead assertions start with (?= for positive assertions and (?! for
				1883	negative assertions. For example,
				1884	<pre>
				1885	\w+(?=;)
				1886	</pre>
				1887	matches a word followed by a semicolon, but does not include the semicolon in
				1888	the match, and
				1889	<pre>
				1890	foo(?!bar)
				1891	</pre>
				1892	matches any occurrence of "foo" that is not followed by "bar". Note that the
				1893	apparently similar pattern
				1894	<pre>
				1895	(?!foo)bar
				1896	</pre>
				1897	does not find an occurrence of "bar" that is preceded by something other than
				1898	"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
				1899	(?!foo) is always true when the next three characters are "bar". A
				1900	lookbehind assertion is needed to achieve the other effect.
				1901	</P>
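<P>
The three lookahead patterns above can be tried out with a short script. This
is only a sketch: it uses Python's <b>re</b> module rather than PCRE, on the
assumption that the two engines treat these particular constructs alike, and
the subject strings and variable names are made up for illustration.
</P>

```python
import re

# \w+(?=;) matches a word only when a semicolon follows; the semicolon
# itself is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "total;").group()

# foo(?!bar) matches "foo" unless "bar" comes next.
foo_not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar": where "bar" starts, the next three
# characters are "bar", so they can never be "foo".
bar_anywhere = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

print(word_before_semicolon, foo_not_before_bar, bar_anywhere)
```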
				1902	<P>
				1903	If you want to force a matching failure at some point in a pattern, the most
				1904	convenient way to do it is with (?!) because an empty string always matches, so
				1905	an assertion that requires there not to be an empty string must always fail.
				1906	The backtracking control verb (FAIL) or (F) is a synonym for (?!).
				1907	<a name="lookbehind"></a></P>
				1908	<br><b>
				1909	Lookbehind assertions
				1910	</b><br>
				1911	<P>
				1912	Lookbehind assertions start with (?<= for positive assertions and (?<! for
				1913	negative assertions. For example,
				1914	<pre>
				1915	(?<!foo)bar
				1916	</pre>
				1917	does find an occurrence of "bar" that is not preceded by "foo". The contents of
				1918	a lookbehind assertion are restricted such that all the strings it matches must
				1919	have a fixed length. However, if there are several top-level alternatives, they
				1920	do not all have to have the same fixed length. Thus
				1921	<pre>
				1922	(?<=bullock|donkey)
				1923	</pre>
				1924	is permitted, but
				1925	<pre>
				1926	(?<!dogs?|cats?)
				1927	</pre>
				1928	causes an error at compile time. Branches that match different length strings
				1929	are permitted only at the top level of a lookbehind assertion. This is an
				1930	extension compared with Perl, which requires all branches to match the same
				1931	length of string. An assertion such as
				1932	<pre>
				1933	(?<=ab(c|de))
				1934	</pre>
				1935	is not permitted, because its single top-level branch can match two different
				1936	lengths, but it is acceptable to PCRE if rewritten to use two top-level
				1937	branches:
				1938	<pre>
				1939	(?<=abc|abde)
				1940	</pre>
				1941	In some cases, the escape sequence \K
				1942	<a href="#resetmatchstart">(see above)</a>
				1943	can be used instead of a lookbehind assertion to get round the fixed-length
				1944	restriction.
				1945	</P>
				1946	<P>
				1947	The implementation of lookbehind assertions is, for each alternative, to
				1948	temporarily move the current position back by the fixed length and then try to
				1949	match. If there are insufficient characters before the current position, the
				1950	assertion fails.
				1951	</P>
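<P>
The step-back procedure just described can be modelled in a few lines. The
sketch below is not PCRE's code: the function name is hypothetical, and each
alternative is assumed to be a literal string so that its fixed length is
simply the length of the string.
</P>

```python
def lookbehind_matches(subject, pos, alternatives):
    """Model a positive lookbehind at `pos` over literal alternatives."""
    for alt in alternatives:
        start = pos - len(alt)          # move back by this branch's fixed length
        if start < 0:
            continue                    # too few characters before pos: branch fails
        if subject.startswith(alt, start):
            return True                 # the branch matched and ends exactly at pos
    return False


# Models (?<=bullock|donkey): the branches have different fixed lengths.
print(lookbehind_matches("donkey cart", 6, ["bullock", "donkey"]))
```

Because each top-level branch carries its own length, branches of different
lengths cause no difficulty, whereas a single branch of variable length gives
no single distance to step back.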
				1952	<P>
				1953	In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte,
				1954	even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
				1955	impossible to calculate the length of the lookbehind. The \X and \R escapes,
				1956	which can match different numbers of bytes, are also not permitted.
				1957	</P>
				1958	<P>
				1959	<a href="#subpatternsassubroutines">"Subroutine"</a>
				1960	calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
				1961	as the subpattern matches a fixed-length string.
				1962	<a href="#recursion">Recursion,</a>
				1963	however, is not supported.
				1964	</P>
				1965	<P>
				1966	Possessive quantifiers can be used in conjunction with lookbehind assertions to
				1967	specify efficient matching of fixed-length strings at the end of subject
				1968	strings. Consider a simple pattern such as
				1969	<pre>
				1970	abcd$
				1971	</pre>
				1972	when applied to a long string that does not match. Because matching proceeds
				1973	from left to right, PCRE will look for each "a" in the subject and then see if
				1974	what follows matches the rest of the pattern. If the pattern is specified as
				1975	<pre>
				1976	^.*abcd$
				1977	</pre>
				1978	the initial .* matches the entire string at first, but when this fails (because
				1979	there is no following "a"), it backtracks to match all but the last character,
				1980	then all but the last two characters, and so on. Once again the search for "a"
				1981	covers the entire string, from right to left, so we are no better off. However,
				1982	if the pattern is written as
				1983	<pre>
				1984	^.*+(?<=abcd)
				1985	</pre>
				1986	there can be no backtracking for the .*+ item; it can match only the entire
				1987	string. The subsequent lookbehind assertion does a single test on the last four
				1988	characters. If it fails, the match fails immediately. For long strings, this
				1989	approach makes a significant difference to the processing time.
				1990	</P>
				1991	<br><b>
				1992	Using multiple assertions
				1993	</b><br>
				1994	<P>
				1995	Several assertions (of any sort) may occur in succession. For example,
				1996	<pre>
				1997	(?<=\d{3})(?<!999)foo
				1998	</pre>
				1999	matches "foo" preceded by three digits that are not "999". Notice that each of
				2000	the assertions is applied independently at the same point in the subject
				2001	string. First there is a check that the previous three characters are all
				2002	digits, and then there is a check that the same three characters are not "999".
				2003	This pattern does <i>not</i> match "foo" preceded by six characters, the first
				2004	three of which are digits and the last three of which are not "999". For example, it
				2005	doesn't match "123abcfoo". A pattern to do that is
				2006	<pre>
				2007	(?<=\d{3}...)(?<!999)foo
				2008	</pre>
				2009	This time the first assertion looks at the preceding six characters, checking
				2010	that the first three are digits, and then the second assertion checks that the
				2011	preceding three characters are not "999".
				2012	</P>
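<P>
The difference between the two patterns can be checked with a small script.
As a hedged sketch, it uses Python's <b>re</b> module instead of PCRE, assuming
that fixed-length lookbehinds behave the same way in both; the variable names
and subject strings are illustrative only.
</P>

```python
import re

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six characters, the second only the last three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          same_three.search(subject) is not None,
          six_before.search(subject) is not None)
```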
				2013	<P>
				2014	Assertions can be nested in any combination. For example,
				2015	<pre>
				2016	(?<=(?<!foo)bar)baz
				2017	</pre>
				2018	matches an occurrence of "baz" that is preceded by "bar" which in turn is not
				2019	preceded by "foo", while
				2020	<pre>
				2021	(?<=\d{3}(?!999)...)foo
				2022	</pre>
				2023	is another pattern that matches "foo" preceded by three digits and any three
				2024	characters that are not "999".
				2025	<a name="conditions"></a></P>
				2026	<br><a name="SEC19" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
				2027	<P>
				2028	It is possible to cause the matching process to obey a subpattern
				2029	conditionally or to choose between two alternative subpatterns, depending on
				2030	the result of an assertion, or whether a specific capturing subpattern has
				2031	already been matched. The two possible forms of conditional subpattern are:
				2032	<pre>
				2033	(?(condition)yes-pattern)
				2034	(?(condition)yes-pattern|no-pattern)
				2035	</pre>
				2036	If the condition is satisfied, the yes-pattern is used; otherwise the
				2037	no-pattern (if present) is used. If there are more than two alternatives in the
				2038	subpattern, a compile-time error occurs. Each of the two alternatives may
				2039	itself contain nested subpatterns of any form, including conditional
				2040	subpatterns; the restriction to two alternatives applies only at the level of
				2041	the condition. This pattern fragment is an example where the alternatives are
				2042	complex:
				2043	<pre>
				2044	(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
				2045
				2046	</PRE>
				2047	</P>
				2048	<P>
				2049	There are four kinds of condition: references to subpatterns, references to
				2050	recursion, a pseudo-condition called DEFINE, and assertions.
				2051	</P>
				2052	<br><b>
				2053	Checking for a used subpattern by number
				2054	</b><br>
				2055	<P>
				2056	If the text between the parentheses consists of a sequence of digits, the
				2057	condition is true if a capturing subpattern of that number has previously
				2058	matched. If there is more than one capturing subpattern with the same number
				2059	(see the earlier
				2060	<a href="#SEC13">section about duplicate subpattern numbers),</a>
				2061	the condition is true if any of them have matched. An alternative notation is
				2062	to precede the digits with a plus or minus sign. In this case, the subpattern
				2063	number is relative rather than absolute. The most recently opened parentheses
				2064	can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
				2065	loops it can also make sense to refer to subsequent groups. The next
				2066	parentheses to be opened can be referenced as (?(+1), and so on. (The value
				2067	zero in any of these forms is not used; it provokes a compile-time error.)
				2068	</P>
				2069	<P>
				2070	Consider the following pattern, which contains non-significant white space to
				2071	make it more readable (assume the PCRE_EXTENDED option) and to divide it into
				2072	three parts for ease of discussion:
				2073	<pre>
				2074	( \( )? [^()]+ (?(1) \) )
				2075	</pre>
				2076	The first part matches an optional opening parenthesis, and if that
				2077	character is present, sets it as the first captured substring. The second part
				2078	matches one or more characters that are not parentheses. The third part is a
				2079	conditional subpattern that tests whether or not the first set of parentheses
				2080	matched. If they did, that is, if the subject started with an opening parenthesis,
				2081	the condition is true, and so the yes-pattern is executed and a closing
				2082	parenthesis is required. Otherwise, since no-pattern is not present, the
				2083	subpattern matches nothing. In other words, this pattern matches a sequence of
				2084	non-parentheses, optionally enclosed in parentheses.
				2085	</P>
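<P>
A quick way to see the conditional at work is to run the pattern against a
few subjects. This sketch relies on Python's <b>re</b> module, which is assumed
here to handle a numeric condition such as (?(1)...) in the same way as PCRE;
re.VERBOSE stands in for PCRE_EXTENDED, and the helper name is hypothetical.
</P>

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optionally_parenthesized = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """True if the whole subject matches the pattern."""
    return optionally_parenthesized.fullmatch(subject) is not None


for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(repr(subject), accepts(subject))
```

The whole subject is matched here because an unanchored search would still
find the unparenthesized "abc" inside "(abc".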
				2086	<P>
				2087	If you were embedding this pattern in a larger one, you could use a relative
				2088	reference:
				2089	<pre>
				2090	...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
				2091	</pre>
				2092	This makes the fragment independent of the parentheses in the larger pattern.
				2093	</P>
				2094	<br><b>
				2095	Checking for a used subpattern by name
				2096	</b><br>
				2097	<P>
				2098	Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
				2099	subpattern by name. For compatibility with earlier versions of PCRE, which had
				2100	this facility before Perl, the syntax (?(name)...) is also recognized. However,
				2101	there is a possible ambiguity with this syntax, because subpattern names may
				2102	consist entirely of digits. PCRE looks first for a named subpattern; if it
				2103	cannot find one and the name consists entirely of digits, PCRE looks for a
				2104	subpattern of that number, which must be greater than zero. Using subpattern
				2105	names that consist entirely of digits is not recommended.
				2106	</P>
				2107	<P>
				2108	Rewriting the above example to use a named subpattern gives this:
				2109	<pre>
				2110	(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
				2111	</pre>
				2112	If the name used in a condition of this kind is a duplicate, the test is
				2113	applied to all subpatterns of the same name, and is true if any one of them has
				2114	matched.
				2115	</P>
				2116	<br><b>
				2117	Checking for pattern recursion
				2118	</b><br>
				2119	<P>
				2120	If the condition is the string (R), and there is no subpattern with the name R,
				2121	the condition is true if a recursive call to the whole pattern or any
				2122	subpattern has been made. If digits or a name preceded by ampersand follow the
				2123	letter R, for example:
				2124	<pre>
				2125	(?(R3)...) or (?(R&name)...)
				2126	</pre>
				2127	the condition is true if the most recent recursion is into a subpattern whose
				2128	number or name is given. This condition does not check the entire recursion
				2129	stack. If the name used in a condition of this kind is a duplicate, the test is
				2130	applied to all subpatterns of the same name, and is true if any one of them is
				2131	the most recent recursion.
				2132	</P>
				2133	<P>
				2134	At "top level", all these recursion test conditions are false.
				2135	<a href="#recursion">The syntax for recursive patterns</a>
				2136	is described below.
				2137	<a name="subdefine"></a></P>
				2138	<br><b>
				2139	Defining subpatterns for use by reference only
				2140	</b><br>
				2141	<P>
				2142	If the condition is the string (DEFINE), and there is no subpattern with the
				2143	name DEFINE, the condition is always false. In this case, there may be only one
				2144	alternative in the subpattern. It is always skipped if control reaches this
				2145	point in the pattern; the idea of DEFINE is that it can be used to define
				2146	subroutines that can be referenced from elsewhere. (The use of
				2147	<a href="#subpatternsassubroutines">subroutines</a>
				2148	is described below.) For example, a pattern to match an IPv4 address such as
				2149	"192.168.23.245" could be written like this (ignore whitespace and line
				2150	breaks):
				2151	<pre>
				2152	(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
				2153	\b (?&byte) (\.(?&byte)){3} \b
				2154	</pre>
				2155	The first part of the pattern is a DEFINE group inside which another group
				2156	named "byte" is defined. This matches an individual component of an IPv4
				2157	address (a number less than 256). When matching takes place, this part of the
				2158	pattern is skipped because DEFINE acts like a false condition. The rest of the
				2159	pattern uses references to the named group to match the four dot-separated
				2160	components of an IPv4 address, insisting on a word boundary at each end.
				2161	</P>
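<P>
For comparison, the same "define once, reference several times" idea can be
approximated in an engine that has no DEFINE. The sketch below uses Python's
<b>re</b> module, which does not support (?(DEFINE)...) or (?&amp;name), so the
"byte" fragment is interpolated as text instead; the names BYTE and IPV4 are
hypothetical.
</P>

```python
import re

# The "byte" subpattern from the DEFINE group, kept as a reusable fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three more each preceded by a dot, with word boundaries.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for subject in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(subject, IPV4.fullmatch(subject) is not None)
```

Unlike a DEFINE group, interpolation copies the fragment into the pattern
four times; the set of strings matched is intended to be the same.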
				2162	<br><b>
				2163	Assertion conditions
				2164	</b><br>
				2165	<P>
				2166	If the condition is not in any of the above formats, it must be an assertion.
				2167	This may be a positive or negative lookahead or lookbehind assertion. Consider
				2168	this pattern, again containing non-significant white space, and with the two
				2169	alternatives on the second line:
				2170	<pre>
				2171	(?(?=[^a-z]*[a-z])
				2172	\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
				2173	</pre>
				2174	The condition is a positive lookahead assertion that matches an optional
				2175	sequence of non-letters followed by a letter. In other words, it tests for the
				2176	presence of at least one letter in the subject. If a letter is found, the
				2177	subject is matched against the first alternative; otherwise it is matched
				2178	against the second. This pattern matches strings in one of the two forms
				2179	dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
				2180	<a name="comments"></a></P>
				2181	<br><a name="SEC20" href="#TOC1">COMMENTS</a><br>
				2182	<P>
				2183	There are two ways of including comments in patterns that are processed by
				2184	PCRE. In both cases, the start of the comment must not be in a character class,
				2185	nor in the middle of any other sequence of related characters such as (?: or a
				2186	subpattern name or number. The characters that make up a comment play no part
				2187	in the pattern matching.
				2188	</P>
				2189	<P>
				2190	The sequence (?# marks the start of a comment that continues up to the next
				2191	closing parenthesis. Nested parentheses are not permitted. If the PCRE_EXTENDED
				2192	option is set, an unescaped # character also introduces a comment, which in
				2193	this case continues to immediately after the next newline character or
				2194	character sequence in the pattern. Which characters are interpreted as newlines
				2195	is controlled by the options passed to <b>pcre_compile()</b> or by a special
				2196	sequence at the start of the pattern, as described in the section entitled
				2197	<a href="#newlines">"Newline conventions"</a>
				2198	above. Note that the end of this type of comment is a literal newline sequence
				2199	in the pattern; escape sequences that happen to represent a newline do not
				2200	count. For example, consider this pattern when PCRE_EXTENDED is set, and the
				2201	default newline convention is in force:
				2202	<pre>
				2203	abc #comment \n still comment
				2204	</pre>
				2205	On encountering the # character, <b>pcre_compile()</b> skips along, looking for
				2206	a newline in the pattern. The sequence \n is still literal at this stage, so
				2207	it does not terminate the comment. Only an actual character with the code value
				2208	0x0a (the default newline) does so.
				2209	<a name="recursion"></a></P>
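<P>
The distinction between an escape sequence and a literal newline inside a #
comment can be demonstrated as follows. This is a sketch using Python's
<b>re</b> module with re.VERBOSE in place of PCRE_EXTENDED, on the assumption
that its comment handling matches the behaviour described above; the variable
names are illustrative.
</P>

```python
import re

# Raw string: the comment contains the two characters "\" and "n", which do
# not end it, so the whole pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline character (0x0a), which ends the
# comment, so "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# A (?#...) comment runs to the next closing parenthesis.
inline = re.compile(r"abc(?#ignored)def")

print(escaped.fullmatch("abc") is not None,
      literal.fullmatch("abcdef") is not None,
      inline.fullmatch("abcdef") is not None)
```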
				2210	<br><a name="SEC21" href="#TOC1">RECURSIVE PATTERNS</a><br>
				2211	<P>
				2212	Consider the problem of matching a string in parentheses, allowing for
				2213	unlimited nested parentheses. Without the use of recursion, the best that can
				2214	be done is to use a pattern that matches up to some fixed depth of nesting. It
				2215	is not possible to handle an arbitrary nesting depth.
				2216	</P>
				2217	<P>
				2218	For some time, Perl has provided a facility that allows regular expressions to
				2219	recurse (amongst other things). It does this by interpolating Perl code in the
				2220	expression at run time, and the code can refer to the expression itself. A Perl
				2221	pattern using code interpolation to solve the parentheses problem can be
				2222	created like this:
				2223	<pre>
				2224	$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
				2225	</pre>
				2226	The (?p{...}) item interpolates Perl code at run time, and in this case refers
				2227	recursively to the pattern in which it appears.
				2228	</P>
				2229	<P>
				2230	Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
				2231	supports special syntax for recursion of the entire pattern, and also for
				2232	individual subpattern recursion. After its introduction in PCRE and Python,
				2233	this kind of recursion was subsequently introduced into Perl at release 5.10.
				2234	</P>
				2235	<P>
				2236	A special item that consists of (? followed by a number greater than zero and a
				2237	closing parenthesis is a recursive subroutine call of the subpattern of the
				2238	given number, provided that it occurs inside that subpattern. (If not, it is a
				2239	<a href="#subpatternsassubroutines">non-recursive subroutine</a>
				2240	call, which is described in the next section.) The special item (?R) or (?0) is
				2241	a recursive call of the entire regular expression.
				2242	</P>
				2243	<P>
				2244	This PCRE pattern solves the nested parentheses problem (assume the
				2245	PCRE_EXTENDED option is set so that white space is ignored):
				2246	<pre>
				2247	\( ( [^()]++ | (?R) )* \)
				2248	</pre>
				2249	First it matches an opening parenthesis. Then it matches any number of
				2250	substrings which can either be a sequence of non-parentheses, or a recursive
				2251	match of the pattern itself (that is, a correctly parenthesized substring).
				2252	Finally there is a closing parenthesis. Note the use of a possessive quantifier
				2253	to avoid backtracking into sequences of non-parentheses.
				2254	</P>
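<P>
To make the shape of the recursion concrete, here is a hand-written
equivalent of what the pattern recognizes. It is a sketch only: Python's
<b>re</b> module has no (?R), so an ordinary recursive function plays its part,
and the function name is hypothetical.
</P>

```python
def match_balanced(text, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1                      # the opening parenthesis is required
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1             # the closing parenthesis ends this level
        if ch == "(":
            pos = match_balanced(text, pos)   # plays the part of (?R)
            if pos == -1:
                return -1
        else:
            pos += 1                   # one more non-parenthesis character
    return -1                          # ran out of text before the closing ")"


print(match_balanced("(ab(cd)ef)"), match_balanced("(aaaa()"))
```

The function never revisits a character it has consumed, which is the same
guarantee the possessive quantifier gives the pattern.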
				2255	<P>
				2256	If this were part of a larger pattern, you would not want to recurse the entire
				2257	pattern, so instead you could use this:
				2258	<pre>
				2259	( \( ( [^()]++ | (?1) )* \) )
				2260	</pre>
				2261	We have put the pattern into parentheses, and caused the recursion to refer to
				2262	them instead of the whole pattern.
				2263	</P>
				2264	<P>
				2265	In a larger pattern, keeping track of parenthesis numbers can be tricky. This
				2266	is made easier by the use of relative references. Instead of (?1) in the
				2267	pattern above you can write (?-2) to refer to the second most recently opened
				2268	parentheses preceding the recursion. In other words, a negative number counts
				2269	capturing parentheses leftwards from the point at which it is encountered.
				2270	</P>
				2271	<P>
				2272	It is also possible to refer to subsequently opened parentheses, by writing
				2273	references such as (?+2). However, these cannot be recursive because the
				2274	reference is not inside the parentheses that are referenced. They are always
				2275	<a href="#subpatternsassubroutines">non-recursive subroutine</a>
				2276	calls, as described in the next section.
				2277	</P>
				2278	<P>
				2279	An alternative approach is to use named parentheses instead. The Perl syntax
				2280	for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
				2281	could rewrite the above example as follows:
				2282	<pre>
				2283	(?<pn> \( ( [^()]++ | (?&pn) )* \) )
				2284	</pre>
				2285	If there is more than one subpattern with the same name, the earliest one is
				2286	used.
				2287	</P>
				2288	<P>
				2289	This particular example pattern that we have been looking at contains nested
				2290	unlimited repeats, and so the use of a possessive quantifier for matching
				2291	strings of non-parentheses is important when applying the pattern to strings
				2292	that do not match. For example, when this pattern is applied to
				2293	<pre>
				2294	(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
				2295	</pre>
				2296	it yields "no match" quickly. However, if a possessive quantifier is not used,
				2297	the match runs for a very long time indeed because there are so many different
				2298	ways the + and * repeats can carve up the subject, and all have to be tested
				2299	before failure can be reported.
				2300	</P>
				2301	<P>
				2302	At the end of a match, the values of capturing parentheses are those from
				2303	the outermost level. If you want to obtain intermediate values, a callout
				2304	function can be used (see below and the
				2305	<a href="pcrecallout.html"><b>pcrecallout</b></a>
				2306	documentation). If the pattern above is matched against
				2307	<pre>
				2308	(ab(cd)ef)
				2309	</pre>
				2310	the value for the inner capturing parentheses (numbered 2) is "ef", which is
				2311	the last value taken on at the top level. If a capturing subpattern is not
				2312	matched at the top level, its final captured value is unset, even if it was
				2313	(temporarily) set at a deeper level during the matching process.
				2314	</P>
				2315	<P>
				2316	If there are more than 15 capturing parentheses in a pattern, PCRE has to
				2317	obtain extra memory to store data during a recursion, which it does by using
				2318	<b>pcre_malloc</b>, freeing it via <b>pcre_free</b> afterwards. If no memory can
				2319	be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
				2320	</P>
				2321	<P>
				2322	Do not confuse the (?R) item with the condition (R), which tests for recursion.
				2323	Consider this pattern, which matches text in angle brackets, allowing for
				2324	arbitrary nesting. Only digits are allowed in nested brackets (that is, when
				2325	recursing), whereas any characters are permitted at the outer level.
				2326	<pre>
				2327	< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
				2328	</pre>
				2329	In this pattern, (?(R) is the start of a conditional subpattern, with two
				2330	different alternatives for the recursive and non-recursive cases. The (?R) item
				2331	is the actual recursive call.
				2332	<a name="recursiondifference"></a></P>
				2333	<br><b>
				2334	Differences in recursion processing between PCRE and Perl
				2335	</b><br>
				2336	<P>
				2337	Recursion processing in PCRE differs from Perl in two important ways. In PCRE
				2338	(like Python, but unlike Perl), a recursive subpattern call is always treated
				2339	as an atomic group. That is, once it has matched some of the subject string, it
				2340	is never re-entered, even if it contains untried alternatives and there is a
				2341	subsequent matching failure. This can be illustrated by the following pattern,
				2342	which purports to match a palindromic string that contains an odd number of
				2343	characters (for example, "a", "aba", "abcba", "abcdcba"):
				2344	<pre>
				2345	^(.|(.)(?1)\2)$
				2346	</pre>
				2347	The idea is that it either matches a single character, or two identical
				2348	characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
				2349	it does not if the subject string is longer than three characters. Consider the
				2350	subject string "abcba":
				2351	</P>
				2352	<P>
				2353	At the top level, the first character is matched, but as it is not at the end
				2354	of the string, the first alternative fails; the second alternative is taken
				2355	and the recursion kicks in. The recursive call to subpattern 1 successfully
				2356	matches the next character ("b"). (Note that the beginning and end of line
				2357	tests are not part of the recursion.)
				2358	</P>
				2359	<P>
				2360	Back at the top level, the next character ("c") is compared with what
				2361	subpattern 2 matched, which was "a". This fails. Because the recursion is
				2362	treated as an atomic group, there are now no backtracking points, and so the
				2363	entire match fails. (Perl is able, at this point, to re-enter the recursion and
				2364	try the second alternative.) However, if the pattern is written with the
				2365	alternatives in the other order, things are different:
				2366	<pre>
				2367	^((.)(?1)\2|.)$
				2368	</pre>
				2369	This time, the recursing alternative is tried first, and continues to recurse
				2370	until it runs out of characters, at which point the recursion fails. But this
				2371	time we do have another alternative to try at the higher level. That is the big
				2372	difference: in the previous case the remaining alternative is at a deeper
				2373	recursion level, which PCRE cannot use.
				2374	</P>
				2375	<P>
				2376	To change the pattern so that it matches all palindromic strings, not just
				2377	those with an odd number of characters, it is tempting to change the pattern to
				2378	this:
				2379	<pre>
				2380	^((.)(?1)\2|.?)$
				2381	</pre>
				2382	Again, this works in Perl, but not in PCRE, and for the same reason. When a
				2383	deeper recursion has matched a single character, it cannot be entered again in
				2384	order to match an empty string. The solution is to separate the two cases, and
				2385	write out the odd and even cases as alternatives at the higher level:
				2386	<pre>
				2387	^(?:((.)(?1)\2|)|((.)(?3)\4|.))
				2388	</pre>
				2389	If you want to match typical palindromic phrases, the pattern has to ignore all
				2390	non-word characters, which can be done like this:
				2391	<pre>
				2392	^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
				2393	</pre>
				2394	If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
				2395	man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
				2396	the use of the possessive quantifier *+ to avoid backtracking into sequences of
				2397	non-word characters. Without this, PCRE takes a great deal longer (ten times or
				2398	more) to match typical phrases, and Perl takes so long that you think it has
				2399	gone into a loop.
				2400	</P>
				2401	<P>
				2402	<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
				2403	string does not start with a palindrome that is shorter than the entire string.
				2404	For example, although "abcba" is correctly matched, if the subject is "ababa",
				2405	PCRE finds the palindrome "aba" at the start, then fails at top level because
				2406	the end of the string does not follow. Once again, it cannot jump back into the
				2407	recursion to try other alternatives, so the entire match fails.
				2408	</P>
				2409	<P>
				2410	The second way in which PCRE and Perl differ in their recursion processing is
				2411	in the handling of captured values. In Perl, when a subpattern is called
				2412	recursively or as a subroutine (see the next section), it has no access to any
				2413	values that were captured outside the recursion, whereas in PCRE these values
				2414	can be referenced. Consider this pattern:
				2415	<pre>
				2416	^(.)(\1|a(?2))
				2417	</pre>
				2418	In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
				2419	then in the second group, the back reference \1 fails to match "a", so the
				2420	second alternative matches "a" and then recurses. In the recursion, \1 does
				2421	now match "b" and so the whole match succeeds. In Perl, the pattern fails to
				2422	match because inside the recursive call \1 cannot access the externally set
				2423	value.
				2424	<a name="subpatternsassubroutines"></a></P>
				2425	<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
				2426	<P>
				2427	If the syntax for a recursive subpattern call (either by number or by
				2428	name) is used outside the parentheses to which it refers, it operates like a
				2429	subroutine in a programming language. The called subpattern may be defined
				2430	before or after the reference. A numbered reference can be absolute or
				2431	relative, as in these examples:
				2432	<pre>
				2433	(...(absolute)...)...(?2)...
				2434	(...(relative)...)...(?-1)...
				2435	(...(?+1)...(relative)...
				2436	</pre>
				2437	An earlier example pointed out that the pattern
				2438	<pre>
				2439	(sens|respons)e and \1ibility
				2440	</pre>
				2441	matches "sense and sensibility" and "response and responsibility", but not
				2442	"sense and responsibility". If instead the pattern
				2443	<pre>
				2444	(sens|respons)e and (?1)ibility
				2445	</pre>
				2446	is used, it does match "sense and responsibility" as well as the other two
				2447	strings. Another example is given in the discussion of DEFINE above.
				2448	</P>
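<P>
The contrast between a back reference and a subroutine call can be sketched
in code. Python's <b>re</b> module supports \1 but not (?1), so in this
illustration the subroutine call is imitated by repeating the subpattern text;
the names STEM, back_reference and subroutine_like are hypothetical.
</P>

```python
import re

STEM = r"(?:sens|respons)"

# \1 must repeat the exact text that group 1 captured.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Re-running the subpattern (what (?1) does) lets either branch match again.
subroutine_like = re.compile(rf"{STEM}e and {STEM}ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          back_reference.fullmatch(subject) is not None,
          subroutine_like.fullmatch(subject) is not None)
```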
				2449	<P>
				2450	All subroutine calls, whether recursive or not, are always treated as atomic
				2451	groups. That is, once a subroutine has matched some of the subject string, it
				2452	is never re-entered, even if it contains untried alternatives and there is a
				2453	subsequent matching failure. Any capturing parentheses that are set during the
				2454	subroutine call revert to their previous values afterwards.
				2455	</P>
				2456	<P>
				2457	Processing options such as case-independence are fixed when a subpattern is
				2458	defined, so if it is used as a subroutine, such options cannot be changed for
				2459	different calls. For example, consider this pattern:
				2460	<pre>
				2461	(abc)(?i:(?-1))
				2462	</pre>
				2463	It matches "abcabc". It does not match "abcABC" because the change of
				2464	processing option does not affect the called subpattern.
				2465	<a name="onigurumasubroutines"></a></P>
				2466	<br><a name="SEC23" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
				2467	<P>
				2468	For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
				2469	a number enclosed either in angle brackets or single quotes, is an alternative
				2470	syntax for referencing a subpattern as a subroutine, possibly recursively. Here
				2471	are two of the examples used above, rewritten using this syntax:
				2472	<pre>
				2473	(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
				2474	(sens|respons)e and \g'1'ibility
				2475	</pre>
				2476	PCRE supports an extension to Oniguruma: if a number is preceded by a
				2477	plus or a minus sign it is taken as a relative reference. For example:
				2478	<pre>
				2479	(abc)(?i:\g<-1>)
				2480	</pre>
				2481	Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
				2482	synonymous. The former is a back reference; the latter is a subroutine call.
				2483	</P>
				2484	<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
				2485	<P>
				2486	Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
				2487	code to be obeyed in the middle of matching a regular expression. This makes it
				2488	possible, amongst other things, to extract different substrings that match the
				2489	same pair of parentheses when there is a repetition.
				2490	</P>
				2491	<P>
				2492	PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
				2493	code. The feature is called "callout". The caller of PCRE provides an external
				2494	function by putting its entry point in the global variable <i>pcre_callout</i>.
				2495	By default, this variable contains NULL, which disables all calling out.
				2496	</P>
				2497	<P>
				2498	Within a regular expression, (?C) indicates the points at which the external
				2499	function is to be called. If you want to identify different callout points, you
				2500	can put a number less than 256 after the letter C. The default value is zero.
				2501	For example, this pattern has two callout points:
				2502	<pre>
				2503	(?C1)abc(?C2)def
				2504	</pre>
				2505	If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are
				2506	automatically installed before each item in the pattern. They are all numbered
				2507	255.
				2508	</P>
				2509	<P>
				2510	During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is
				2511	set), the external function is called. It is provided with the number of the
				2512	callout, the position in the pattern, and, optionally, one item of data
				2513	originally supplied by the caller of <b>pcre_exec()</b>. The callout function
				2514	may cause matching to proceed, to backtrack, or to fail altogether. A complete
				2515	description of the interface to the callout function is given in the
				2516	<a href="pcrecallout.html"><b>pcrecallout</b></a>
				2517	documentation.
				2518	<a name="backtrackcontrol"></a></P>
				2519	<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br>
				2520	<P>
				2521	Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
				2522	are described in the Perl documentation as "experimental and subject to change
				2523	or removal in a future version of Perl". It goes on to say: "Their usage in
				2524	production code should be noted to avoid problems during upgrades." The same
				2525	remarks apply to the PCRE features described in this section.
				2526	</P>
				2527	<P>
				2528	Since these verbs are specifically related to backtracking, most of them can be
				2529	used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses
				2530	a backtracking algorithm. With the exception of (*FAIL), which behaves like a
				2531	failing negative assertion, they cause an error if encountered by
				2532	<b>pcre_dfa_exec()</b>.
				2533	</P>
				2534	<P>
				2535	If any of these verbs are used in an assertion or in a subpattern that is
				2536	called as a subroutine (whether or not recursively), their effect is confined
				2537	to that subpattern; it does not extend to the surrounding pattern, with one
				2538	exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
				2539	a successful positive assertion <i>is</i> passed back when a match succeeds
				2540	(compare capturing parentheses in assertions). Note that such subpatterns are
				2541	processed as anchored at the point where they are tested. Note also that Perl's
				2542	treatment of subroutines is different in some cases.
				2543	</P>
				2544	<P>
				2545	The new verbs make use of what was previously invalid syntax: an opening
				2546	parenthesis followed by an asterisk. They are generally of the form
				2547	(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
				2548	depending on whether or not an argument is present. A name is any sequence of
				2549	characters that does not include a closing parenthesis. If the name is empty,
				2550	that is, if the closing parenthesis immediately follows the colon, the effect
				2551	is as if the colon were not there. Any number of these verbs may occur in a
				2552	pattern.
				2553	</P>
				2554	<P>
				2555	PCRE contains some optimizations that are used to speed up matching by running
				2556	some checks at the start of each match attempt. For example, it may know the
				2557	minimum length of a matching subject, or that a particular character must be
				2558	present. When one of these optimizations suppresses the running of a match, any
				2559	included backtracking verbs will not, of course, be processed. You can suppress
				2560	the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
				2561	when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
				2562	pattern with (*NO_START_OPT).
				2563	</P>
				2564	<P>
				2565	Experiments with Perl suggest that it too has similar optimizations, sometimes
				2566	leading to anomalous results.
				2567	</P>
				2568	<br><b>
				2569	Verbs that act immediately
				2570	</b><br>
				2571	<P>
				2572	The following verbs act as soon as they are encountered. They may not be
				2573	followed by a name.
				2574	<pre>
				2575	(*ACCEPT)
				2576	</pre>
				2577	This verb causes the match to end successfully, skipping the remainder of the
				2578	pattern. However, when it is inside a subpattern that is called as a
				2579	subroutine, only that subpattern is ended successfully. Matching then continues
				2580	at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
				2581	far is captured. For example:
				2582	<pre>
				2583	A((?:A|B(*ACCEPT)|C)D)
				2584	</pre>
				2585	This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
				2586	the outer parentheses.
				2587	<pre>
				2588	(*FAIL) or (*F)
				2589	</pre>
				2590	This verb causes a matching failure, forcing backtracking to occur. It is
				2591	equivalent to (?!) but easier to read. The Perl documentation notes that it is
				2592	probably useful only when combined with (?{}) or (??{}). Those are, of course,
				2593	Perl features that are not present in PCRE. The nearest equivalent is the
				2594	callout feature, as for example in this pattern:
				2595	<pre>
				2596	a+(?C)(*FAIL)
				2597	</pre>
				2598	A match with the string "aaaa" always fails, but the callout is taken before
				2599	each backtrack happens (in this example, 10 times).
				2600	</P>
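<P>
The count of 10 can be checked with a small simulation. The Python sketch below
does not call PCRE; it is a hand-written model (the function name
<i>count_callouts</i> is made up for illustration) that assumes start-of-match
optimizations are off, that a+ is greedy, and that every starting position in
the subject is tried in turn.
</P>

```python
def count_callouts(subject):
    """Model a+(?C)(*FAIL): count how often the callout point is reached."""
    calls = 0
    # An unanchored match is attempted at every starting position.
    for start in range(len(subject) + 1):
        # Greedy a+ first takes the longest run of "a" from this position.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Each backtrack gives back one "a"; at least one must remain.
        for _stop in range(end, start, -1):
            calls += 1  # (?C) is reached, then (*FAIL) forces a backtrack
    return calls


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```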
				2601	<br><b>
				2602	Recording which path was taken
				2603	</b><br>
				2604	<P>
				2605	There is one verb whose main purpose is to track how a match was arrived at,
				2606	though it also has a secondary use in conjunction with advancing the match
				2607	starting point (see (*SKIP) below).
				2608	<pre>
				2609	(*MARK:NAME) or (*:NAME)
				2610	</pre>
				2611	A name is always required with this verb. There may be as many instances of
				2612	(*MARK) as you like in a pattern, and their names do not have to be unique.
				2613	</P>
				2614	<P>
				2615	When a match succeeds, the name of the last-encountered (*MARK) on the matching
				2616	path is passed back to the caller via the <i>pcre_extra</i> data structure, as
				2617	described in the
				2618	<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>
				2619	in the
				2620	<a href="pcreapi.html"><b>pcreapi</b></a>
				2621	documentation. Here is an example of <b>pcretest</b> output, where the /K
				2622	modifier requests the retrieval and outputting of (*MARK) data:
				2623	<pre>
				2624	re> /X(*MARK:A)Y|X(*MARK:B)Z/K
				2625	data> XY
				2626	0: XY
				2627	MK: A
				2628	XZ
				2629	0: XZ
				2630	MK: B
				2631	</pre>
				2632	The (*MARK) name is tagged with "MK:" in this output, and in this example it
				2633	indicates which of the two alternatives matched. This is a more efficient way
				2634	of obtaining this information than putting each alternative in its own
				2635	capturing parentheses.
				2636	</P>
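<P>
For comparison, the capturing-parentheses approach mentioned above can be
sketched in Python. This is a stand-in under stated assumptions, not PCRE:
Python's <b>re</b> module has no (*MARK), so each alternative gets its own named
group (the names A and B are chosen here to mirror the mark names) and
<i>lastgroup</i> reports which one took part in the match.
</P>

```python
import re

# Each alternative carries its own named capturing group instead of a mark.
PATTERN = re.compile(r"X(?P<A>Y)|X(?P<B>Z)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = PATTERN.search(subject)
    return m.lastgroup if m else None


print(which_alternative("XY"))  # A
print(which_alternative("XZ"))  # B
print(which_alternative("XP"))  # None
```

<P>
Unlike a mark, this approach reports nothing when the match fails.
</P>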
				2637	<P>
				2638	If (*MARK) is encountered in a positive assertion, its name is recorded and
				2639	passed back if it is the last-encountered. This does not happen for negative
				2640	assertions.
				2641	</P>
				2642	<P>
				2643	After a partial match or a failed match, the name of the last encountered
				2644	(*MARK) in the entire match process is returned. For example:
				2645	<pre>
				2646	re> /X(*MARK:A)Y|X(*MARK:B)Z/K
				2647	data> XP
				2648	No match, mark = B
				2649	</pre>
				2650	Note that in this unanchored example the mark is retained from the match
				2651	attempt that started at the letter "X". Subsequent match attempts starting at
				2652	"P" and then with an empty string do not get as far as the (*MARK) item, but
				2653	nevertheless do not reset it.
				2654	</P>
				2655	<br><b>
				2656	Verbs that act after backtracking
				2657	</b><br>
				2658	<P>
				2659	The following verbs do nothing when they are encountered. Matching continues
				2660	with what follows, but if there is no subsequent match, causing a backtrack to
				2661	the verb, a failure is forced. That is, backtracking cannot pass to the left of
				2662	the verb. However, when one of these verbs appears inside an atomic group, its
				2663	effect is confined to that group, because once the group has been matched,
				2664	there is never any backtracking into it. In this situation, backtracking can
				2665	"jump back" to the left of the entire atomic group. (Remember also, as stated
				2666	above, that this localization also applies in subroutine calls and assertions.)
				2667	</P>
				2668	<P>
				2669	These verbs differ in exactly what kind of failure occurs when backtracking
				2670	reaches them.
				2671	<pre>
				2672	(*COMMIT)
				2673	</pre>
				2674	This verb, which may not be followed by a name, causes the whole match to fail
				2675	outright if the rest of the pattern does not match. Even if the pattern is
				2676	unanchored, no further attempts to find a match by advancing the starting point
				2677	take place. Once (*COMMIT) has been passed, <b>pcre_exec()</b> is committed to
				2678	finding a match at the current starting point, or not at all. For example:
				2679	<pre>
				2680	a+(*COMMIT)b
				2681	</pre>
				2682	This matches "xxaab" but not "aacaab". It can be thought of as a kind of
				2683	dynamic anchor, or "I've started, so I must finish." The name of the most
				2684	recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
				2685	match failure.
				2686	</P>
				2687	<P>
				2688	Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
				2689	unless PCRE's start-of-match optimizations are turned off, as shown in this
				2690	<b>pcretest</b> example:
				2691	<pre>
				2692	re> /(*COMMIT)abc/
				2693	data> xyzabc
				2694	0: abc
				2695	xyzabc\Y
				2696	No match
				2697	</pre>
				2698	PCRE knows that any match must start with "a", so the optimization skips along
				2699	the subject to "a" before running the first match attempt, which succeeds. When
				2700	the optimization is disabled by the \Y escape in the second subject, the match
				2701	starts at "x" and so the (*COMMIT) causes it to fail without trying any other
				2702	starting points.
				2703	<pre>
				2704	(*PRUNE) or (*PRUNE:NAME)
				2705	</pre>
				2706	This verb causes the match to fail at the current starting position in the
				2707	subject if the rest of the pattern does not match. If the pattern is
				2708	unanchored, the normal "bumpalong" advance to the next starting character then
				2709	happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
				2710	reached, or when matching to the right of (*PRUNE), but if there is no match to
				2711	the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
				2712	(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
				2713	but there are some uses of (*PRUNE) that cannot be expressed in any other way.
				2714	The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
				2715	anchored pattern (*PRUNE) has the same effect as (*COMMIT).
				2716	<pre>
				2717	(*SKIP)
				2718	</pre>
				2719	This verb, when given without a name, is like (*PRUNE), except that if the
				2720	pattern is unanchored, the "bumpalong" advance is not to the next character,
				2721	but to the position in the subject where (*SKIP) was encountered. (*SKIP)
				2722	signifies that whatever text was matched leading up to it cannot be part of a
				2723	successful match. Consider:
				2724	<pre>
				2725	a+(*SKIP)b
				2726	</pre>
				2727	If the subject is "aaaac...", after the first match attempt fails (starting at
				2728	the first character in the string), the starting point skips on to start the
				2729	next attempt at "c". Note that a possessive quantifier does not have the same
				2730	effect as this example; although it would suppress backtracking during the
				2731	first match attempt, the second attempt would start at the second character
				2732	instead of skipping on to "c".
				2733	<pre>
				2734	(*SKIP:NAME)
				2735	</pre>
				2736	When (*SKIP) has an associated name, its behaviour is modified. If the
				2737	following pattern fails to match, the previous path through the pattern is
				2738	searched for the most recent (*MARK) that has the same name. If one is found,
				2739	the "bumpalong" advance is to the subject position that corresponds to that
				2740	(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
				2741	matching name is found, the (*SKIP) is ignored.
				2742	<pre>
				2743	(*THEN) or (*THEN:NAME)
				2744	</pre>
				2745	This verb causes a skip to the next innermost alternative if the rest of the
				2746	pattern does not match. That is, it cancels pending backtracking, but only
				2747	within the current alternative. Its name comes from the observation that it can
				2748	be used for a pattern-based if-then-else block:
				2749	<pre>
				2750	( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
				2751	</pre>
				2752	If the COND1 pattern matches, FOO is tried (and possibly further items after
				2753	the end of the group if FOO succeeds); on failure, the matcher skips to the
				2754	second alternative and tries COND2, without backtracking into COND1. The
				2755	behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
				2756	If (*THEN) is not inside an alternation, it acts like (*PRUNE).
				2757	</P>
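<P>
The if-then-else reading can be sketched in Python for a single starting
position. This is a hand-written model rather than PCRE, with assumptions: the
helper <i>match_then</i> and its list of (condition, body) pairs are invented
for illustration, every alternative is taken to have the form
COND (*THEN) BODY, and nothing follows the group.
</P>

```python
import re


def match_then(branches, subject, pos=0):
    """Model ( COND1 (*THEN) BODY1 | COND2 (*THEN) BODY2 | ... ) at one position."""
    for cond, body in branches:
        m = re.compile(cond).match(subject, pos)
        if m is None:
            continue  # condition failed: try the next alternative
        b = re.compile(body).match(subject, m.end())
        if b is not None:
            return subject[pos:b.end()]
        # Body failed after (*THEN): do not retry cond with a shorter
        # match; move straight on to the next alternative.
    return None


# Without (*THEN), a+ can give back one "a" so that "ab" matches.
print(re.match(r"a+ab|a+c", "aab").group())  # aab
# With (*THEN) after each a+, that backtrack is not allowed.
print(match_then([("a+", "ab"), ("a+", "c")], "aab"))  # None
```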
				2758	<P>
				2759	Note that a subpattern that does not contain a | character is just a part of
				2760	the enclosing alternative; it is not a nested alternation with only one
				2761	alternative. The effect of (*THEN) extends beyond such a subpattern to the
				2762	enclosing alternative. Consider this pattern, where A, B, etc. are complex
				2763	pattern fragments that do not contain any | characters at this level:
				2764	<pre>
				2765	A (B(*THEN)C) | D
				2766	</pre>
				2767	If A and B are matched, but there is a failure in C, matching does not
				2768	backtrack into A; instead it moves to the next alternative, that is, D.
				2769	However, if the subpattern containing (*THEN) is given an alternative, it
				2770	behaves differently:
				2771	<pre>
				2772	A (B(*THEN)C | (*FAIL)) | D
				2773	</pre>
				2774	The effect of (*THEN) is now confined to the inner subpattern. After a failure
				2775	in C, matching moves to (*FAIL), which causes the whole subpattern to fail
				2776	because there are no more alternatives to try. In this case, matching does now
				2777	backtrack into A.
				2778	</P>
				2779	<P>
				2780	Note also that a conditional subpattern is not considered as having two
				2781	alternatives, because only one is ever used. In other words, the | character in
				2782	a conditional subpattern has a different meaning. Ignoring white space,
				2783	consider:
				2784	<pre>
				2785	^.*? (?(?=a) a | b(*THEN)c )
				2786	</pre>
				2787	If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
				2788	it initially matches zero characters. The condition (?=a) then fails, the
				2789	character "b" is matched, but "c" is not. At this point, matching does not
				2790	backtrack to .*? as might perhaps be expected from the presence of the |
				2791	character. The conditional subpattern is part of the single alternative that
				2792	comprises the whole pattern, and so the match fails. (If there were a backtrack
				2793	into .*?, allowing it to match "b", the match would succeed.)
				2794	</P>
				2795	<P>
				2796	The verbs just described provide four different "strengths" of control when
				2797	subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
				2798	next alternative. (*PRUNE) comes next, failing the match at the current
				2799	starting position, but allowing an advance to the next character (for an
				2800	unanchored pattern). (*SKIP) is similar, except that the advance may be more
				2801	than one character. (*COMMIT) is the strongest, causing the entire match to
				2802	fail.
				2803	</P>
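<P>
These differences show up in which starting positions are tried. The Python
sketch below is a hand-written model of matching a+(*VERB)b, not a call into
PCRE; the function <i>search</i> and its <i>verb</i> argument are hypothetical,
and start-of-match optimizations are assumed to be off. (*THEN) is left out
because this pattern has no alternation.
</P>

```python
def search(subject, verb=None):
    """Model unanchored a+(*VERB)b; return (span or None, starts tried)."""
    tried = []
    start, n = 0, len(subject)
    while start <= n:
        tried.append(start)
        # Greedy a+ takes the longest run of "a"; the verb sits just after it.
        end = start
        while end < n and subject[end] == "a":
            end += 1
        if end == start:
            start += 1  # a+ failed, verb never reached: normal bumpalong
            continue
        if end < n and subject[end] == "b":
            return (start, end + 1), tried
        # "b" did not match. A shorter a+ would be followed by "a", so the
        # attempt fails with or without a verb; only the next start differs.
        if verb == "COMMIT":
            return None, tried  # no further starting points at all
        if verb == "SKIP":
            start = end  # resume where (*SKIP) was encountered
        else:
            start += 1  # no verb, or PRUNE: advance by one character
    return None, tried


print(search("aacaab", "COMMIT"))  # (None, [0])
print(search("aaaac", "SKIP"))     # (None, [0, 4, 5])
print(search("aaaac", "PRUNE"))    # (None, [0, 1, 2, 3, 4, 5])
print(search("xxaab", "COMMIT"))   # ((2, 5), [0, 1, 2])
```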
				2804	<P>
				2805	If more than one such verb is present in a pattern, the "strongest" one wins.
				2806	For example, consider this pattern, where A, B, etc. are complex pattern
				2807	fragments:
				2808	<pre>
				2809	(A(*COMMIT)B(*THEN)C|D)
				2810	</pre>
				2811	Once A has matched, PCRE is committed to this match, at the current starting
				2812	position. If subsequently B matches, but C does not, the normal (*THEN) action
				2813	of trying the next alternative (that is, D) does not happen because (*COMMIT)
				2814	overrides.
				2815	</P>
				2816	<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
				2817	<P>
				2818	<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),
				2819	<b>pcresyntax</b>(3), <b>pcre</b>(3).
				2820	</P>
				2821	<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
				2822	<P>
				2823	Philip Hazel
				2824	<br>
				2825	University Computing Service
				2826	<br>
				2827	Cambridge CB2 3QH, England.
				2828	<br>
				2829	</P>
				2830	<br><a name="SEC28" href="#TOC1">REVISION</a><br>
				2831	<P>
				2832	Last updated: 29 November 2011
				2833	<br>
				2834	Copyright © 1997-2011 University of Cambridge.
				2835	<br>
				2836	<p>
				2837	Return to the <a href="index.html">PCRE index page</a>.
				2838	</p>