Blame - jni/libpcre/sources/doc/pcrepattern.3 - jami-client-android

.\" Blame: Tristan Matthews, commit 0461646, 2013-11-14 16:09:34 -0500
.TH PCREPATTERN 3
				2	.SH NAME
				3	PCRE - Perl-compatible regular expressions
				4	.SH "PCRE REGULAR EXPRESSION DETAILS"
				5	.rs
				6	.sp
				7	The syntax and semantics of the regular expressions that are supported by PCRE
				8	are described in detail below. There is a quick-reference syntax summary in the
				9	.\" HREF
				10	\fBpcresyntax\fP
				11	.\"
				12	page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
				13	also supports some alternative regular expression syntax (which does not
				14	conflict with the Perl syntax) in order to provide some compatibility with
				15	regular expressions in Python, .NET, and Oniguruma.
				16	.P
				17	Perl's regular expressions are described in its own documentation, and
				18	regular expressions in general are covered in a number of books, some of which
				19	have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
				20	published by O'Reilly, covers regular expressions in great detail. This
				21	description of PCRE's regular expressions is intended as reference material.
				22	.P
				23	The original operation of PCRE was on strings of one-byte characters. However,
				24	there is now also support for UTF-8 character strings. To use this,
				25	PCRE must be built to include UTF-8 support, and you must call
				26	\fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There
				27	is also a special sequence that can be given at the start of a pattern:
				28	.sp
				29	(*UTF8)
				30	.sp
				31	Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
				32	option. This feature is not Perl-compatible. How setting UTF-8 mode affects
				33	pattern matching is mentioned in several places below. There is also a summary
				34	of UTF-8 features in the
				35	.\" HREF
				36	\fBpcreunicode\fP
				37	.\"
				38	page.
				39	.P
				40	Another special sequence that may appear at the start of a pattern or in
				41	combination with (*UTF8) is:
				42	.sp
				43	(*UCP)
				44	.sp
				45	This has the same effect as setting the PCRE_UCP option: it causes sequences
				46	such as \ed and \ew to use Unicode properties to determine character types,
				47	instead of recognizing only characters with codes less than 128 via a lookup
				48	table.
				49	.P
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE_NO_START_OPTIMIZE option either at compile or matching time. Further
special sequences of this kind are concerned with the handling of newlines;
they are described below.
				54	.P
				55	The remainder of this document discusses the patterns that are supported by
				56	PCRE when its main matching function, \fBpcre_exec()\fP, is used.
				57	From release 6.0, PCRE offers a second matching function,
				58	\fBpcre_dfa_exec()\fP, which matches using a different algorithm that is not
				59	Perl-compatible. Some of the features discussed below are not available when
				60	\fBpcre_dfa_exec()\fP is used. The advantages and disadvantages of the
				61	alternative function, and how it differs from the normal function, are
				62	discussed in the
				63	.\" HREF
				64	\fBpcrematching\fP
				65	.\"
				66	page.
				67	.
				68	.
				69	.\" HTML <a name="newlines"></a>
				70	.SH "NEWLINE CONVENTIONS"
				71	.rs
				72	.sp
				73	PCRE supports five different conventions for indicating line breaks in
				74	strings: a single CR (carriage return) character, a single LF (linefeed)
				75	character, the two-character sequence CRLF, any of the three preceding, or any
				76	Unicode newline sequence. The
				77	.\" HREF
				78	\fBpcreapi\fP
				79	.\"
				80	page has
				81	.\" HTML <a href="pcreapi.html#newlines">
				82	.\" </a>
				83	further discussion
				84	.\"
				85	about newlines, and shows how to set the newline convention in the
				86	\fIoptions\fP arguments for the compiling and matching functions.
				87	.P
				88	It is also possible to specify a newline convention by starting a pattern
				89	string with one of the following five sequences:
				90	.sp
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
				96	.sp
				97	These override the default and the options given to \fBpcre_compile()\fP or
				98	\fBpcre_compile2()\fP. For example, on a Unix system where LF is the default
				99	newline sequence, the pattern
				100	.sp
				101	(*CR)a.b
				102	.sp
				103	changes the convention to CR. That pattern matches "a\enb" because LF is no
				104	longer a newline. Note that these special settings, which are not
				105	Perl-compatible, are recognized only at the very start of a pattern, and that
				106	they must be in upper case. If more than one of them is present, the last one
				107	is used.
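.P
The rule that only sequences at the very start of the pattern count, and that
the last one wins, can be sketched in C. This is an illustration only, not
PCRE's own code: the function \fBleading_newline_setting()\fP and the
\fBnewline_kind\fP enumeration are hypothetical names, and the sketch ignores
other leading items such as (*UTF8) that may be mixed in.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; NL_DEFAULT means "no leading sequence found". */
typedef enum { NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } newline_kind;

/* Scan the newline sequences at the very start of a pattern and return
   the convention chosen by the last one. The comparison is case-sensitive,
   so lower-case spellings are not recognized. */
newline_kind leading_newline_setting(const char *pattern)
{
    static const struct { const char *text; newline_kind kind; } table[] = {
        { "(*CR)",      NL_CR },
        { "(*LF)",      NL_LF },
        { "(*CRLF)",    NL_CRLF },
        { "(*ANYCRLF)", NL_ANYCRLF },
        { "(*ANY)",     NL_ANY },
    };
    const size_t count = sizeof table / sizeof table[0];
    newline_kind result = NL_DEFAULT;
    const char *p = pattern;

    for (;;) {
        size_t i;
        for (i = 0; i < count; i++) {
            size_t len = strlen(table[i].text);
            if (strncmp(p, table[i].text, len) == 0) {
                result = table[i].kind;  /* a later sequence replaces an earlier one */
                p += len;
                break;
            }
        }
        if (i == count)
            break;                       /* no further sequence at this position */
    }
    return result;
}
```

With this sketch, "(*CR)(*LF)x" yields NL_LF, while "a(*CR)" yields NL_DEFAULT
because the sequence is not at the start of the pattern.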
				108	.P
				109	The newline convention affects the interpretation of the dot metacharacter when
				110	PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not
				111	affect what the \eR escape sequence matches. By default, this is any Unicode
				112	newline sequence, for Perl compatibility. However, this can be changed; see the
				113	description of \eR in the section entitled
				114	.\" HTML <a href="#newlineseq">
				115	.\" </a>
				116	"Newline sequences"
				117	.\"
				118	below. A change of \eR setting can be combined with a change of newline
				119	convention.
				120	.
				121	.
				122	.SH "CHARACTERS AND METACHARACTERS"
				123	.rs
				124	.sp
				125	A regular expression is a pattern that is matched against a subject string from
				126	left to right. Most characters stand for themselves in a pattern, and match the
				127	corresponding characters in the subject. As a trivial example, the pattern
				128	.sp
				129	The quick brown fox
				130	.sp
				131	matches a portion of a subject string that is identical to itself. When
				132	caseless matching is specified (the PCRE_CASELESS option), letters are matched
				133	independently of case. In UTF-8 mode, PCRE always understands the concept of
				134	case for characters whose values are less than 128, so caseless matching is
				135	always possible. For characters with higher values, the concept of case is
				136	supported if PCRE is compiled with Unicode property support, but not otherwise.
				137	If you want to use caseless matching for characters 128 and above, you must
				138	ensure that PCRE is compiled with Unicode property support as well as with
				139	UTF-8 support.
				140	.P
				141	The power of regular expressions comes from the ability to include alternatives
				142	and repetitions in the pattern. These are encoded in the pattern by the use of
				143	\fImetacharacters\fP, which do not stand for themselves but instead are
				144	interpreted in some special way.
				145	.P
				146	There are two different sets of metacharacters: those that are recognized
				147	anywhere in the pattern except within square brackets, and those that are
				148	recognized within square brackets. Outside square brackets, the metacharacters
				149	are as follows:
				150	.sp
  \e      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
				166	.sp
				167	Part of a pattern that is in square brackets is called a "character class". In
				168	a character class the only metacharacters are:
				169	.sp
  \e      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
.\" JOIN
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
				177	.sp
				178	The following sections describe the use of each of the metacharacters.
				179	.
				180	.
				181	.SH BACKSLASH
				182	.rs
				183	.sp
				184	The backslash character has several uses. Firstly, if it is followed by a
character that is not a digit or a letter, it takes away any special meaning
				186	that character may have. This use of backslash as an escape character applies
				187	both inside and outside character classes.
				188	.P
				189	For example, if you want to match a * character, you write \e* in the pattern.
				190	This escaping action applies whether or not the following character would
				191	otherwise be interpreted as a metacharacter, so it is always safe to precede a
				192	non-alphanumeric with backslash to specify that it stands for itself. In
				193	particular, if you want to match a backslash, you write \e\e.
				194	.P
In UTF-8 mode, only ASCII digits and letters have any special meaning after a
				196	backslash. All other characters (in particular, those whose codepoints are
				197	greater than 127) are treated as literals.
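.P
Because it is always safe to put a backslash before a non-alphanumeric
character, arbitrary text can be turned into a pattern that matches itself. The
following C sketch does this; \fBquote_literal()\fP is a hypothetical helper,
not part of the PCRE API. It leaves bytes above 127 untouched, on the
assumption that they are literals as described above.

```c
#include <stddef.h>

/* Append one character, counting it even when the buffer is full. */
static void put(char *dst, size_t cap, size_t *out, char c)
{
    if (*out + 1 < cap)
        dst[*out] = c;
    (*out)++;
}

/* Copy src to dst, preceding every ASCII character that is not a letter
   or digit with a backslash. Returns the length the full result needs
   (excluding the terminating zero); the output is complete only if the
   return value is less than cap. */
size_t quote_literal(const char *src, char *dst, size_t cap)
{
    size_t out = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int alnum = (c >= '0' && c <= '9') ||
                    (c >= 'A' && c <= 'Z') ||
                    (c >= 'a' && c <= 'z');
        if (!alnum && c < 128)
            put(dst, cap, &out, '\\');
        put(dst, cap, &out, (char)c);
    }
    if (cap > 0)
        dst[out < cap ? out : cap - 1] = '\0';
    return out;
}
```

For the input a*b the result is a\e*b. Whitespace and # are escaped as well, so
the result should also be usable when PCRE_EXTENDED is set.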
				198	.P
				199	If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
				200	pattern (other than in a character class) and characters between a # outside
				201	a character class and the next newline are ignored. An escaping backslash can
				202	be used to include a whitespace or # character as part of the pattern.
				203	.P
				204	If you want to remove the special meaning from a sequence of characters, you
				205	can do so by putting them between \eQ and \eE. This is different from Perl in
				206	that $ and @ are handled as literals in \eQ...\eE sequences in PCRE, whereas in
				207	Perl, $ and @ cause variable interpolation. Note the following examples:
				208	.sp
  Pattern            PCRE matches   Perl matches
.sp
.\" JOIN
  \eQabc$xyz\eE        abc$xyz        abc followed by the
                                      contents of $xyz
  \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
  \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
				216	.sp
				217	The \eQ...\eE sequence is recognized both inside and outside character classes.
				218	An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
				219	by \eE later in the pattern, the literal interpretation continues to the end of
				220	the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
				221	a character class, this causes an error, because the character class is not
				222	terminated.
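.P
One way to picture the \eQ...\eE rules outside a character class is as a
rewrite into ordinary backslash escapes. The C sketch below does this;
\fBexpand_quoted()\fP is a hypothetical name, and the code does not track
character classes, so it does not report the unterminated-class error
mentioned above.

```c
#include <stddef.h>

/* Append one character, counting it even when the buffer is full. */
static void put(char *dst, size_t cap, size_t *out, char c)
{
    if (*out + 1 < cap)
        dst[*out] = c;
    (*out)++;
}

static int is_ascii_alnum(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

/* Rewrite src so that text between \Q and \E is expressed with
   single-character backslash escapes. Returns the length the full
   result needs, excluding the terminating zero. */
size_t expand_quoted(const char *src, char *dst, size_t cap)
{
    size_t out = 0, i = 0;
    int quoting = 0;

    while (src[i] != '\0') {
        unsigned char c = (unsigned char)src[i];
        if (c == '\\' && src[i + 1] == 'E') {
            quoting = 0;              /* ends quoting; an isolated \E is dropped */
            i += 2;
        } else if (quoting) {
            if (c < 128 && !is_ascii_alnum(c))
                put(dst, cap, &out, '\\');
            put(dst, cap, &out, (char)c);
            i++;
        } else if (c == '\\' && src[i + 1] == 'Q') {
            quoting = 1;              /* stays set to the end if no \E follows */
            i += 2;
        } else if (c == '\\' && src[i + 1] != '\0') {
            put(dst, cap, &out, '\\');   /* copy an ordinary escape unchanged */
            put(dst, cap, &out, src[i + 1]);
            i += 2;
        } else {
            put(dst, cap, &out, (char)c);
            i++;
        }
    }
    if (cap > 0)
        dst[out < cap ? out : cap - 1] = '\0';
    return out;
}
```

Applied to the three patterns in the table, the sketch produces abc\e$xyz,
abc\e\e\e$xyz, and abc\e$xyz respectively, which agree with the "PCRE matches"
column.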
				223	.
				224	.
				225	.\" HTML <a name="digitsafterbackslash"></a>
				226	.SS "Non-printing characters"
				227	.rs
				228	.sp
				229	A second use of backslash provides a way of encoding non-printing characters
				230	in patterns in a visible manner. There is no restriction on the appearance of
				231	non-printing characters, apart from the binary zero that terminates a pattern,
				232	but when a pattern is being prepared by text editing, it is often easier to use
				233	one of the following escape sequences than the binary character it represents:
				234	.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         formfeed (hex 0C)
  \en         linefeed (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or back reference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh.. (non-JavaScript mode)
  \euhhhh     character with hex code hhhh (JavaScript mode only)
				246	.sp
				247	The precise effect of \ecx is as follows: if x is a lower case letter, it
				248	is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
				249	Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while
				250	\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater
				251	than 127, a compile-time error occurs. This locks out non-ASCII characters in
				252	both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte
				253	values are valid. A lower case letter is converted to upper case, and then the
				254	0xc0 bits are flipped.)
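.P
The \ecx computation for ASCII builds can be written out as a few lines of C.
This is a sketch of the rule as stated above, not PCRE's source;
\fBcontrol_escape()\fP is a hypothetical name, and the EBCDIC variant is not
covered.

```c
/* Return the character that \cx stands for, or -1 if x is outside the
   ASCII range (the case that causes a compile-time error). */
int control_escape(int x)
{
    if (x < 0 || x > 127)
        return -1;
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';      /* lower case is converted to upper case */
    return x ^ 0x40;            /* invert bit 6 (hex 40) */
}
```

The three cases from the text come out as expected: z gives hex 1A, { gives
hex 3B, and ; gives hex 7B.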
				255	.P
				256	By default, after \ex, from zero to two hexadecimal digits are read (letters
				257	can be in upper or lower case). Any number of hexadecimal digits may appear
				258	between \ex{ and }, but the value of the character code must be less than 256
				259	in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
				260	value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
				261	Unicode code point, which is 10FFFF.
				262	.P
				263	If characters other than hexadecimal digits appear between \ex{ and }, or if
				264	there is no terminating }, this form of escape is not recognized. Instead, the
				265	initial \ex will be interpreted as a basic hexadecimal escape, with no
				266	following digits, giving a character whose value is zero.
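.P
The rules for \ex outside JavaScript mode can be summarized in a short C
sketch. The function \fBparse_x_escape()\fP and its return convention are
assumptions made for illustration; PCRE's own parser is organized differently
and reports errors through its compile-time error codes.

```c
#include <ctype.h>
#include <stddef.h>

static unsigned long hex_value(char c)
{
    if (c >= '0' && c <= '9') return (unsigned long)(c - '0');
    if (c >= 'a' && c <= 'f') return (unsigned long)(c - 'a' + 10);
    return (unsigned long)(c - 'A' + 10);
}

/* s points just after "\x". On success returns 0 and stores the character
   code and the number of characters consumed after "\x". Returns -1 if a
   braced value is too large for the mode. */
int parse_x_escape(const char *s, int utf8, unsigned long *code, size_t *used)
{
    const unsigned long limit = utf8 ? 0x7FFFFFFFUL : 0xFFUL;
    unsigned long v = 0;
    size_t n = 0;

    if (s[0] == '{') {
        size_t i = 1;
        int too_big = 0;
        while (isxdigit((unsigned char)s[i])) {
            unsigned long d = hex_value(s[i]);
            if (v > (limit - d) / 16)
                too_big = 1;          /* the next digit would exceed the limit */
            else
                v = v * 16 + d;
            i++;
        }
        if (s[i] == '}') {
            if (too_big)
                return -1;
            *code = v;
            *used = i + 1;
            return 0;
        }
        v = 0;   /* not a well-formed braced escape: fall back to the basic form */
    }
    while (n < 2 && isxdigit((unsigned char)s[n])) {
        v = v * 16 + hex_value(s[n]);
        n++;
    }
    *code = v;   /* zero when no hexadecimal digits follow \x */
    *used = n;
    return 0;
}
```

For example, the inputs "dc" and "{dc}" both give hex DC, "{100}" is rejected
unless the UTF-8 flag is set, and "{zz}" gives zero with nothing consumed, so
the brace and what follows it remain in the pattern as ordinary characters.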
				267	.P
				268	If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
				269	as just described only when it is followed by two hexadecimal digits.
				270	Otherwise, it matches a literal "x" character. In JavaScript mode, support for
code points greater than 255 is provided by \eu, which must be followed by
				272	four hexadecimal digits; otherwise it matches a literal "u" character.
				273	.P
				274	Characters whose value is less than 256 can be defined by either of the two
				275	syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
				276	way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
				277	\eu00dc in JavaScript mode).
				278	.P
				279	After \e0 up to two further octal digits are read. If there are fewer than two
				280	digits, just those that are present are used. Thus the sequence \e0\ex\e07
				281	specifies two binary zeros followed by a BEL character (code value 7). Make
				282	sure you supply two digits after the initial zero if the pattern character that
				283	follows is itself an octal digit.
				284	.P
				285	The handling of a backslash followed by a digit other than 0 is complicated.
				286	Outside a character class, PCRE reads it and any following digits as a decimal
				287	number. If the number is less than 10, or if there have been at least that many
				288	previous capturing left parentheses in the expression, the entire sequence is
				289	taken as a \fIback reference\fP. A description of how this works is given
				290	.\" HTML <a href="#backreferences">
				291	.\" </a>
				292	later,
				293	.\"
				294	following the discussion of
				295	.\" HTML <a href="#subpattern">
				296	.\" </a>
				297	parenthesized subpatterns.
				298	.\"
				299	.P
				300	Inside a character class, or if the decimal number is greater than 9 and there
				301	have not been that many capturing subpatterns, PCRE re-reads up to three octal
				302	digits following the backslash, and uses them to generate a data character. Any
				303	subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
				304	character specified in octal must be less than \e400. In UTF-8 mode, values up
				305	to \e777 are permitted. For example:
				306	.sp
  \e040   is another way of writing a space
.\" JOIN
  \e40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \e7     is always a back reference
.\" JOIN
  \e11    might be a back reference, or another way of
            writing a tab
  \e011   is always a tab
  \e0113  is a tab followed by the character "3"
.\" JOIN
  \e113   might be a back reference, otherwise the
            character with octal code 113
.\" JOIN
  \e377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
.\" JOIN
  \e81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"
				326	.sp
				327	Note that octal values of 100 or greater must not be introduced by a leading
				328	zero, because no more than three octal digits are ever read.
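.P
The decision between a back reference and an octal character for a backslash
followed by a digit from 1 to 9 can be sketched in C as follows. The names
\fBclassify_digit_escape()\fP and \fBdigit_escape\fP are hypothetical, and the
sketch omits the range check on the resulting octal value.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { ESC_BACKREF, ESC_CHAR } esc_kind;

typedef struct {
    esc_kind kind;   /* back reference or data character */
    int value;       /* group number, or character code */
    size_t length;   /* characters consumed after the backslash */
} digit_escape;

/* s points at the first digit after the backslash, which must be 1-9.
   groups_so_far is the number of capturing left parentheses seen so far;
   in_class is non-zero inside a character class. */
digit_escape classify_digit_escape(const char *s, int groups_so_far, int in_class)
{
    digit_escape r;
    size_t n = 0;
    int decimal = 0;

    /* First reading: all following digits as a decimal number. */
    while (isdigit((unsigned char)s[n])) {
        if (decimal < 1000000)           /* cap the value to avoid overflow */
            decimal = decimal * 10 + (s[n] - '0');
        n++;
    }
    if (!in_class && (decimal < 10 || decimal <= groups_so_far)) {
        r.kind = ESC_BACKREF;
        r.value = decimal;
        r.length = n;
        return r;
    }

    /* Second reading: up to three octal digits; later digits are literals. */
    r.kind = ESC_CHAR;
    r.value = 0;
    for (n = 0; n < 3 && s[n] >= '0' && s[n] <= '7'; n++)
        r.value = r.value * 8 + (s[n] - '0');
    r.length = n;
    return r;
}
```

With no capturing groups, "40" gives character 32 (space), "11" gives
character 9 (tab), and "81" gives a binary zero with no digits consumed, so
the "8" and "1" remain as literal characters. With eleven groups, "11" is a
back reference.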
				329	.P
				330	All the sequences that define a single character value can be used both inside
				331	and outside character classes. In addition, inside a character class, \eb is
				332	interpreted as the backspace character (hex 08).
				333	.P
				334	\eN is not allowed in a character class. \eB, \eR, and \eX are not special
				335	inside a character class. Like other unrecognized escape sequences, they are
				336	treated as the literal characters "B", "R", and "X" by default, but cause an
				337	error if the PCRE_EXTRA option is set. Outside a character class, these
				338	sequences have different meanings.
				339	.
				340	.
				341	.SS "Unsupported escape sequences"
				342	.rs
				343	.sp
				344	In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
				345	handler and used to modify the case of following characters. By default, PCRE
				346	does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
				347	option is set, \eU matches a "U" character, and \eu can be used to define a
				348	character by code point, as described in the previous section.
				349	.
				350	.
				351	.SS "Absolute and relative back references"
				352	.rs
				353	.sp
				354	The sequence \eg followed by an unsigned or a negative number, optionally
				355	enclosed in braces, is an absolute or relative back reference. A named back
				356	reference can be coded as \eg{name}. Back references are discussed
				357	.\" HTML <a href="#backreferences">
				358	.\" </a>
				359	later,
				360	.\"
				361	following the discussion of
				362	.\" HTML <a href="#subpattern">
				363	.\" </a>
				364	parenthesized subpatterns.
				365	.\"
				366	.
				367	.
				368	.SS "Absolute and relative subroutine calls"
				369	.rs
				370	.sp
				371	For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
				372	a number enclosed either in angle brackets or single quotes, is an alternative
				373	syntax for referencing a subpattern as a "subroutine". Details are discussed
				374	.\" HTML <a href="#onigurumasubroutines">
				375	.\" </a>
				376	later.
				377	.\"
				378	Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
				379	synonymous. The former is a back reference; the latter is a
				380	.\" HTML <a href="#subpatternsassubroutines">
				381	.\" </a>
				382	subroutine
				383	.\"
				384	call.
				385	.
				386	.
				387	.\" HTML <a name="genericchartypes"></a>
				388	.SS "Generic character types"
				389	.rs
				390	.sp
				391	Another use of backslash is for specifying generic character types:
				392	.sp
  \ed     any decimal digit
  \eD     any character that is not a decimal digit
  \eh     any horizontal whitespace character
  \eH     any character that is not a horizontal whitespace character
  \es     any whitespace character
  \eS     any character that is not a whitespace character
  \ev     any vertical whitespace character
  \eV     any character that is not a vertical whitespace character
  \ew     any "word" character
  \eW     any "non-word" character
				403	.sp
				404	There is also the single sequence \eN, which matches a non-newline character.
				405	This is the same as
				406	.\" HTML <a href="#fullstopdot">
				407	.\" </a>
				408	the "." metacharacter
				409	.\"
				410	when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
				411	PCRE does not support this.
				412	.P
				413	Each pair of lower and upper case escape sequences partitions the complete set
				414	of characters into two disjoint sets. Any given character matches one, and only
				415	one, of each pair. The sequences can appear both inside and outside character
				416	classes. They each match one character of the appropriate type. If the current
				417	matching point is at the end of the subject string, all of them fail, because
				418	there is no character to match.
				419	.P
				420	For compatibility with Perl, \es does not match the VT character (code 11).
This makes it different from the POSIX "space" class. The \es characters
				422	are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
				423	included in a Perl script, \es may match the VT character. In PCRE, it never
				424	does.
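.P
The difference from the POSIX "space" class can be seen by comparing the \es
set listed above with the C library's \fBisspace()\fP. The two functions below
are hypothetical helpers written for this comparison, and the comparison
assumes the default "C" locale.

```c
#include <ctype.h>

/* The default \s set: HT (9), LF (10), FF (12), CR (13), and space (32). */
int pcre_default_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Non-zero when the \s set above and isspace() disagree about c.
   c must be in the range 0 to 127. */
int differs_from_isspace(int c)
{
    return pcre_default_space(c) != (isspace(c) != 0);
}
```

In the "C" locale the only ASCII character on which the two disagree is VT
(code 11).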
				425	.P
				426	A "word" character is an underscore or any character that is a letter or digit.
				427	By default, the definition of letters and digits is controlled by PCRE's
				428	low-valued character tables, and may vary if locale-specific matching is taking
				429	place (see
				430	.\" HTML <a href="pcreapi.html#localesupport">
				431	.\" </a>
				432	"Locale support"
				433	.\"
				434	in the
				435	.\" HREF
				436	\fBpcreapi\fP
				437	.\"
				438	page). For example, in a French locale such as "fr_FR" in Unix-like systems,
				439	or "french" in Windows, some character codes greater than 128 are used for
				440	accented letters, and these are then matched by \ew. The use of locales with
				441	Unicode is discouraged.
				442	.P
By default, in UTF-8 mode, characters with values greater than 127 never match
				444	\ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
				445	their original meanings from before UTF-8 support was available, mainly for
				446	efficiency reasons. However, if PCRE is compiled with Unicode property support,
				447	and the PCRE_UCP option is set, the behaviour is changed so that Unicode
				448	properties are used to determine character types, as follows:
				449	.sp
  \ed  any character that \ep{Nd} matches (decimal digit)
  \es  any character that \ep{Z} matches, plus HT, LF, FF, CR
  \ew  any character that \ep{L} or \ep{N} matches, plus underscore
				453	.sp
				454	The upper case escapes match the inverse sets of characters. Note that \ed
				455	matches only decimal digits, whereas \ew matches any Unicode digit, as well as
any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb and
				457	\eB because they are defined in terms of \ew and \eW. Matching these sequences
				458	is noticeably slower when PCRE_UCP is set.
				459	.P
				460	The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at
				461	release 5.10. In contrast to the other sequences, which match only ASCII
				462	characters by default, these always match certain high-valued codepoints in
				463	UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
				464	are:
				465	.sp
				466	U+0009 Horizontal tab
				467	U+0020 Space
				468	U+00A0 Non-break space
				469	U+1680 Ogham space mark
				470	U+180E Mongolian vowel separator
				471	U+2000 En quad
				472	U+2001 Em quad
				473	U+2002 En space
				474	U+2003 Em space
				475	U+2004 Three-per-em space
				476	U+2005 Four-per-em space
				477	U+2006 Six-per-em space
				478	U+2007 Figure space
				479	U+2008 Punctuation space
				480	U+2009 Thin space
				481	U+200A Hair space
				482	U+202F Narrow no-break space
				483	U+205F Medium mathematical space
				484	U+3000 Ideographic space
				485	.sp
				486	The vertical space characters are:
				487	.sp
				488	U+000A Linefeed
				489	U+000B Vertical tab
				490	U+000C Formfeed
				491	U+000D Carriage return
				492	U+0085 Next line
				493	U+2028 Line separator
				494	U+2029 Paragraph separator
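.P
The two lists above translate directly into membership tests on code points.
The following C sketch shows one way to write them; the function names are
hypothetical and the code is not taken from PCRE.

```c
/* Non-zero if cp is one of the horizontal space characters listed above. */
int is_horizontal_space(unsigned long cp)
{
    switch (cp) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        return cp >= 0x2000 && cp <= 0x200A;   /* En quad through Hair space */
    }
}

/* Non-zero if cp is one of the vertical space characters listed above. */
int is_vertical_space(unsigned long cp)
{
    switch (cp) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return 1;
    default:
        return 0;
    }
}
```

No code point appears in both lists, so a character can match \eh or \ev but
never both.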
				495	.
				496	.
				497	.\" HTML <a name="newlineseq"></a>
				498	.SS "Newline sequences"
				499	.rs
				500	.sp
				501	Outside a character class, by default, the escape sequence \eR matches any
				502	Unicode newline sequence. In non-UTF-8 mode \eR is equivalent to the following:
				503	.sp
  (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
				505	.sp
				506	This is an example of an "atomic group", details of which are given
				507	.\" HTML <a href="#atomicgroup">
				508	.\" </a>
				509	below.
				510	.\"
				511	This particular group matches either the two-character sequence CR followed by
				512	LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
				513	U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
				514	line, U+0085). The two-character sequence is treated as a single unit that
				515	cannot be split.
				516	.P
				517	In UTF-8 mode, two additional characters whose codepoints are greater than 255
				518	are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
				519	Unicode character property support is not needed for these characters to be
				520	recognized.
				521	.P
				522	It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
				523	complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
either at compile time or when the pattern is matched. (BSR is an abbreviation
				525	for "backslash R".) This can be made the default when PCRE is built; if this is
				526	the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
				527	It is also possible to specify these settings by starting a pattern string with
				528	one of the following sequences:
				529	.sp
  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence
				532	.sp
				533	These override the default and the options given to \fBpcre_compile()\fP or
				534	\fBpcre_compile2()\fP, but they can be overridden by options given to
				535	\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,
				536	which are not Perl-compatible, are recognized only at the very start of a
				537	pattern, and that they must be in upper case. If more than one of them is
				538	present, the last one is used. They can be combined with a change of newline
				539	convention; for example, a pattern can start with:
				540	.sp
  (*ANY)(*BSR_ANYCRLF)
.sp
They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside
				544	a character class, \eR is treated as an unrecognized escape sequence, and so
				545	matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
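.P
For byte strings in non-UTF-8 mode, the behaviour of \eR outside a character
class can be sketched in C as below. The function
\fBmatch_newline_sequence()\fP and its \fIanycrlf\fP flag are hypothetical
stand-ins for the matcher and for the PCRE_BSR_ANYCRLF setting.

```c
#include <stddef.h>

/* Return the number of bytes that \R matches at position pos of the
   subject s (length len), or 0 if it does not match there. When anycrlf
   is non-zero, only CR, LF, and CRLF are accepted. */
size_t match_newline_sequence(const unsigned char *s, size_t len,
                              size_t pos, int anycrlf)
{
    if (pos >= len)
        return 0;
    if (s[pos] == '\r')     /* CRLF is taken as a unit and never split */
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;
    if (s[pos] == '\n')
        return 1;
    if (anycrlf)
        return 0;
    return (s[pos] == 0x0b || s[pos] == 0x0c || s[pos] == 0x85) ? 1 : 0;
}
```

Because the CR branch always consumes a following LF, the sketch never reports
a one-byte match for the CR of a CRLF pair, which mirrors the atomic group in
the equivalent pattern.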
				546	.
				547	.
				548	.\" HTML <a name="uniextseq"></a>
.SS "Unicode character properties"
				550	.rs
				551	.sp
				552	When PCRE is built with Unicode character property support, three additional
				553	escape sequences that match characters with specific properties are available.
				554	When not in UTF-8 mode, these sequences are of course limited to testing
				555	characters whose codepoints are less than 256, but they do work in this mode.
				556	The extra escape sequences are:
				557	.sp
  \ep{\fIxx\fP}   a character with the \fIxx\fP property
  \eP{\fIxx\fP}   a character without the \fIxx\fP property
  \eX       an extended Unicode sequence
				561	.sp
				562	The property names represented by \fIxx\fP above are limited to the Unicode
				563	script names, the general category properties, "Any", which matches any
				564	character (including newline), and some special PCRE properties (described
				565	in the
				566	.\" HTML <a href="#extraprops">
				567	.\" </a>
				568	next section).
				569	.\"
				570	Other Perl properties such as "InMusicalSymbols" are not currently supported by
				571	PCRE. Note that \eP{Any} does not match any characters, so always causes a
				572	match failure.
				573	.P
				574	Sets of Unicode characters are defined as belonging to certain scripts. A
				575	character from one of these sets can be matched using a script name. For
				576	example:
				577	.sp
				578	\ep{Greek}
				579	\eP{Han}
				580	.sp
				581	Those that are not part of an identified script are lumped together as
				582	"Common". The current list of scripts is:
				583	.P
				584	Arabic,
				585	Armenian,
				586	Avestan,
				587	Balinese,
				588	Bamum,
				589	Bengali,
				590	Bopomofo,
				591	Braille,
				592	Buginese,
				593	Buhid,
				594	Canadian_Aboriginal,
				595	Carian,
				596	Cham,
				597	Cherokee,
				598	Common,
				599	Coptic,
				600	Cuneiform,
				601	Cypriot,
				602	Cyrillic,
				603	Deseret,
				604	Devanagari,
				605	Egyptian_Hieroglyphs,
				606	Ethiopic,
				607	Georgian,
				608	Glagolitic,
				609	Gothic,
				610	Greek,
				611	Gujarati,
				612	Gurmukhi,
				613	Han,
				614	Hangul,
				615	Hanunoo,
				616	Hebrew,
				617	Hiragana,
				618	Imperial_Aramaic,
				619	Inherited,
				620	Inscriptional_Pahlavi,
				621	Inscriptional_Parthian,
				622	Javanese,
				623	Kaithi,
				624	Kannada,
				625	Katakana,
				626	Kayah_Li,
				627	Kharoshthi,
				628	Khmer,
				629	Lao,
				630	Latin,
				631	Lepcha,
				632	Limbu,
				633	Linear_B,
				634	Lisu,
				635	Lycian,
				636	Lydian,
				637	Malayalam,
				638	Meetei_Mayek,
				639	Mongolian,
				640	Myanmar,
				641	New_Tai_Lue,
				642	Nko,
				643	Ogham,
				644	Old_Italic,
				645	Old_Persian,
				646	Old_South_Arabian,
				647	Old_Turkic,
				648	Ol_Chiki,
				649	Oriya,
				650	Osmanya,
				651	Phags_Pa,
				652	Phoenician,
				653	Rejang,
				654	Runic,
				655	Samaritan,
				656	Saurashtra,
				657	Shavian,
				658	Sinhala,
				659	Sundanese,
				660	Syloti_Nagri,
				661	Syriac,
				662	Tagalog,
				663	Tagbanwa,
				664	Tai_Le,
				665	Tai_Tham,
				666	Tai_Viet,
				667	Tamil,
				668	Telugu,
				669	Thaana,
				670	Thai,
				671	Tibetan,
				672	Tifinagh,
				673	Ugaritic,
				674	Vai,
				675	Yi.
				676	.P
				677	Each character has exactly one Unicode general category property, specified by
				678	a two-letter abbreviation. For compatibility with Perl, negation can be
				679	specified by including a circumflex between the opening brace and the property
				680	name. For example, \ep{^Lu} is the same as \eP{Lu}.
				681	.P
				682	If only one letter is specified with \ep or \eP, it includes all the general
				683	category properties that start with that letter. In this case, in the absence
				684	of negation, the curly brackets in the escape sequence are optional; these two
				685	examples have the same effect:
				686	.sp
				687	\ep{L}
				688	\epL
				689	.sp
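.P
The syntax rules for \ep and \eP (optional braces for a single letter, and a
circumflex for negation) can be sketched as a small C parser. The names
\fBparse_property_escape()\fP and \fBproperty_escape\fP are hypothetical. The
sketch treats a circumflex inside \eP{...} as a double negation; that is an
assumption, since the text above only describes the \ep{^...} form.

```c
#include <stddef.h>

typedef struct {
    int negated;      /* non-zero if the escape matches characters
                         WITHOUT the property */
    char name[32];    /* property name, for example "Lu" or "L" */
    size_t length;    /* characters consumed, starting at 'p' or 'P' */
} property_escape;

/* s points at the 'p' or 'P' that follows the backslash.
   Returns 0 on success, -1 if the escape is malformed. */
int parse_property_escape(const char *s, property_escape *out)
{
    size_t i = 1, n = 0;

    if (s[0] != 'p' && s[0] != 'P')
        return -1;
    out->negated = (s[0] == 'P');

    if (s[i] == '{') {
        i++;
        if (s[i] == '^') {                 /* Perl-style negation */
            out->negated = !out->negated;
            i++;
        }
        while (s[i] != '\0' && s[i] != '}') {
            if (n + 1 >= sizeof out->name)
                return -1;                 /* name too long for this sketch */
            out->name[n++] = s[i++];
        }
        if (s[i] != '}' || n == 0)
            return -1;                     /* missing brace or empty name */
        i++;
    } else {
        if (s[i] == '\0')
            return -1;
        out->name[n++] = s[i++];           /* single letter without braces */
    }
    out->name[n] = '\0';
    out->length = i;
    return 0;
}
```

Parsing "p{^Lu}" and "P{Lu}" gives the same name and the same negation flag,
and "p{L}" and "pL" give the same name.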
				690	The following general category property codes are supported:
				691	.sp
				692	C Other
				693	Cc Control
				694	Cf Format
				695	Cn Unassigned
				696	Co Private use
				697	Cs Surrogate
				698	.sp
				699	L Letter
				700	Ll Lower case letter
				701	Lm Modifier letter
				702	Lo Other letter
				703	Lt Title case letter
				704	Lu Upper case letter
				705	.sp
				706	M Mark
				707	Mc Spacing mark
				708	Me Enclosing mark
				709	Mn Non-spacing mark
				710	.sp
				711	N Number
				712	Nd Decimal number
				713	Nl Letter number
				714	No Other number
				715	.sp
				716	P Punctuation
				717	Pc Connector punctuation
				718	Pd Dash punctuation
				719	Pe Close punctuation
				720	Pf Final punctuation
				721	Pi Initial punctuation
				722	Po Other punctuation
				723	Ps Open punctuation
				724	.sp
				725	S Symbol
				726	Sc Currency symbol
				727	Sk Modifier symbol
				728	Sm Mathematical symbol
				729	So Other symbol
				730	.sp
				731	Z Separator
				732	Zl Line separator
				733	Zp Paragraph separator
				734	Zs Space separator
				735	.sp
				736	The special property L& is also supported: it matches a character that has
				737	the Lu, Ll, or Lt property, in other words, a letter that is not classified as
				738	a modifier or "other".
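.P
As a rough illustration of how a property code selects characters, the rules above can be sketched in Python with the standard unicodedata module. This is not how PCRE implements the test; the function name has_general_category is hypothetical, and Python's Unicode tables may be from a different Unicode version than PCRE's.

```python
import unicodedata

def has_general_category(ch, code):
    """Return True if the single character ch has the property named by code.

    code may be a two-letter category such as "Lu", a single letter such as
    "L" (every category starting with that letter), or the special "L&".
    """
    category = unicodedata.category(ch)
    if code == "L&":
        # L& covers upper case, lower case and title case letters only.
        return category in ("Lu", "Ll", "Lt")
    if len(code) == 1:
        return category.startswith(code)
    return category == code
```

Negation, as in \eP{Lu} or \ep{^Lu}, corresponds to applying "not" to the result.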
				739	.P
				740	The Cs (Surrogate) property applies only to characters in the range U+D800 to
				741	U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
				742	cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
				743	(see the discussion of PCRE_NO_UTF8_CHECK in the
				744	.\" HREF
				745	\fBpcreapi\fP
				746	.\"
				747	page). Perl does not support the Cs property.
				748	.P
				749	The long synonyms for property names that Perl supports (such as \ep{Letter})
				750	are not supported by PCRE, nor is it permitted to prefix any of these
				751	properties with "Is".
				752	.P
				753	No character that is in the Unicode table has the Cn (unassigned) property.
				754	Instead, this property is assumed for any code point that is not in the
				755	Unicode table.
				756	.P
				757	Specifying caseless matching does not affect these escape sequences. For
				758	example, \ep{Lu} always matches only upper case letters.
				759	.P
				760	The \eX escape matches any number of Unicode characters that form an extended
				761	Unicode sequence. \eX is equivalent to
				762	.sp
				763	(?>\ePM\epM*)
				764	.sp
				765	That is, it matches a character without the "mark" property, followed by zero
				766	or more characters with the "mark" property, and treats the sequence as an
				767	atomic group
				768	.\" HTML <a href="#atomicgroup">
				769	.\" </a>
				770	(see below).
				771	.\"
				772	Characters with the "mark" property are typically accents that affect the
				773	preceding character. None of them have codepoints less than 256, so in
				774	non-UTF-8 mode \eX matches any one character.
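.P
The behaviour of (?>\ePM\epM*) can be sketched in Python as follows. This is only an illustration under the definition given above, not PCRE's code; the names is_mark and match_extended_sequence are hypothetical, and the "mark" test relies on Python's own Unicode tables.

```python
import unicodedata

def is_mark(ch):
    # The general categories Mc, Me and Mn all start with "M".
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos):
    """Return the end offset of the sequence starting at pos, or None."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None  # one character without the "mark" property must come first
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1  # take every following mark; being atomic, none is given back
    return end
```

For the subject "e" followed by U+0301 (combining acute accent) and "x", a match at offset 0 ends at offset 2, and no match is possible starting at the accent itself.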
				775	.P
				776	Note that recent versions of Perl have changed \eX to match what Unicode calls
				777	an "extended grapheme cluster", which has a more complicated definition.
				778	.P
				779	Matching characters by Unicode property is not fast, because PCRE has to search
				780	a structure that contains data for over fifteen thousand characters. That is
				781	why the traditional escape sequences such as \ed and \ew do not use Unicode
				782	properties in PCRE by default, though you can make them do so by setting the
				783	PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with
				784	(*UCP).
				785	.
				786	.
				787	.\" HTML <a name="extraprops"></a>
				788	.SS PCRE's additional properties
				789	.rs
				790	.sp
				791	As well as the standard Unicode properties described in the previous
				792	section, PCRE supports four more that make it possible to convert traditional
				793	escape sequences such as \ew and \es and POSIX character classes to use Unicode
				794	properties. PCRE uses these non-standard, non-Perl properties internally when
				795	PCRE_UCP is set. They are:
				796	.sp
				797	Xan Any alphanumeric character
				798	Xps Any POSIX space character
				799	Xsp Any Perl space character
				800	Xwd Any Perl "word" character
				801	.sp
				802	Xan matches characters that have either the L (letter) or the N (number)
				803	property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
				804	carriage return, and any other character that has the Z (separator) property.
				805	Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
				806	same characters as Xan, plus underscore.
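.P
The four definitions above can be written down directly. The following Python sketch is only a restatement of those definitions using the standard unicodedata module; the function names are hypothetical and the code is not taken from PCRE.

```python
import unicodedata

# tab, linefeed, vertical tab, formfeed, carriage return
POSIX_SPACE_CONTROLS = "\t\n\x0b\x0c\r"

def is_xan(ch):
    # Any character with the L (letter) or N (number) property.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # The five control characters, or anything with the Z (separator) property.
    return ch in POSIX_SPACE_CONTROLS or unicodedata.category(ch).startswith("Z")

def is_xsp(ch):
    # Same as Xps, except that vertical tab is excluded.
    return is_xps(ch) and ch != "\x0b"

def is_xwd(ch):
    # Same as Xan, plus underscore.
    return is_xan(ch) or ch == "_"
```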
				807	.
				808	.
				809	.\" HTML <a name="resetmatchstart"></a>
				810	.SS "Resetting the match start"
				811	.rs
				812	.sp
				813	The escape sequence \eK causes any previously matched characters not to be
				814	included in the final matched sequence. For example, the pattern:
				815	.sp
				816	foo\eKbar
				817	.sp
				818	matches "foobar", but reports that it has matched "bar". This feature is
				819	similar to a lookbehind assertion
				820	.\" HTML <a href="#lookbehind">
				821	.\" </a>
				822	(described below).
				823	.\"
				824	However, in this case, the part of the subject before the real match does not
				825	have to be of fixed length, as lookbehind assertions do. The use of \eK does
				826	not interfere with the setting of
				827	.\" HTML <a href="#subpattern">
				828	.\" </a>
				829	captured substrings.
				830	.\"
				831	For example, when the pattern
				832	.sp
				833	(foo)\eKbar
				834	.sp
				835	matches "foobar", the first substring is still set to "foo".
				836	.P
				837	Perl documents that the use of \eK within assertions is "not well defined". In
				838	PCRE, \eK is acted upon when it occurs inside positive assertions, but is
				839	ignored in negative assertions.
				840	.
				841	.
				842	.\" HTML <a name="smallassertions"></a>
				843	.SS "Simple assertions"
				844	.rs
				845	.sp
				846	The final use of backslash is for certain simple assertions. An assertion
				847	specifies a condition that has to be met at a particular point in a match,
				848	without consuming any characters from the subject string. The use of
				849	subpatterns for more complicated assertions is described
				850	.\" HTML <a href="#bigassertions">
				851	.\" </a>
				852	below.
				853	.\"
				854	The backslashed assertions are:
				855	.sp
				856	\eb   matches at a word boundary
				857	\eB   matches when not at a word boundary
				858	\eA   matches at the start of the subject
				859	\eZ   matches at the end of the subject
				860	      also matches before a newline at the end of the subject
				861	\ez   matches only at the end of the subject
				862	\eG   matches at the first matching position in the subject
				863	.sp
				864	Inside a character class, \eb has a different meaning; it matches the backspace
				865	character. If any other of these assertions appears in a character class, by
				866	default it matches the corresponding literal character (for example, \eB
				867	matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
				868	escape sequence" error is generated instead.
				869	.P
				870	A word boundary is a position in the subject string where the current character
				871	and the previous character do not both match \ew or \eW (i.e. one matches
				872	\ew and the other matches \eW), or the start or end of the string if the
				873	first or last character matches \ew, respectively. In UTF-8 mode, the meanings
				874	of \ew and \eW can be changed by setting the PCRE_UCP option. When this is
				875	done, it also affects \eb and \eB. Neither PCRE nor Perl has a separate "start
				876	of word" or "end of word" metasequence. However, whatever follows \eb normally
				877	determines which it is. For example, the fragment \eba matches "a" at the start
				878	of a word.
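.P
The definition of a word boundary can be restated as a small Python function. This is a sketch under the assumption that \ew has its default meaning of ASCII letters, digits, and underscore (no PCRE_UCP and no special locale tables); the function names are hypothetical.

```python
def is_word_char(ch):
    # Default \w: ASCII letters, digits and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def is_word_boundary(subject, pos):
    """True if a word-boundary assertion would succeed at offset pos."""
    # Positions outside the subject count as "not a word character".
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

def word_boundaries(subject):
    # Every offset, including the very end, at which the assertion holds.
    return [pos for pos in range(len(subject) + 1) if is_word_boundary(subject, pos)]
```

For the subject "a cat_1!" the boundaries are at offsets 0, 1, 2, and 7. There is none at the very end, because the last character is not a word character.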
				879	.P
				880	The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
				881	dollar (described in the next section) in that they only ever match at the very
				882	start and end of the subject string, whatever options are set. Thus, they are
				883	independent of multiline mode. These three assertions are not affected by the
				884	PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
				885	circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
				886	argument of \fBpcre_exec()\fP is non-zero, indicating that matching is to start
				887	at a point other than the beginning of the subject, \eA can never match. The
				888	difference between \eZ and \ez is that \eZ matches before a newline at the end
				889	of the string as well as at the very end, whereas \ez matches only at the end.
				890	.P
				891	The \eG assertion is true only when the current matching position is at the
				892	start point of the match, as specified by the \fIstartoffset\fP argument of
				893	\fBpcre_exec()\fP. It differs from \eA when the value of \fIstartoffset\fP is
				894	non-zero. By calling \fBpcre_exec()\fP multiple times with appropriate
				895	arguments, you can mimic Perl's /g option, and it is in this kind of
				896	implementation where \eG can be useful.
				897	.P
				898	Note, however, that PCRE's interpretation of \eG, as the start of the current
				899	match, is subtly different from Perl's, which defines it as the end of the
				900	previous match. In Perl, these can be different when the previously matched
				901	string was empty. Because PCRE does just one match at a time, it cannot
				902	reproduce this behaviour.
				903	.P
				904	If all the alternatives of a pattern begin with \eG, the expression is anchored
				905	to the starting match position, and the "anchored" flag is set in the compiled
				906	regular expression.
				907	.
				908	.
				909	.SH "CIRCUMFLEX AND DOLLAR"
				910	.rs
				911	.sp
				912	Outside a character class, in the default matching mode, the circumflex
				913	character is an assertion that is true only if the current matching point is
				914	at the start of the subject string. If the \fIstartoffset\fP argument of
				915	\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE
				916	option is unset. Inside a character class, circumflex has an entirely different
				917	meaning
				918	.\" HTML <a href="#characterclass">
				919	.\" </a>
				920	(see below).
				921	.\"
				922	.P
				923	Circumflex need not be the first character of the pattern if a number of
				924	alternatives are involved, but it should be the first thing in each alternative
				925	in which it appears if the pattern is ever to match that branch. If all
				926	possible alternatives start with a circumflex, that is, if the pattern is
				927	constrained to match only at the start of the subject, it is said to be an
				928	"anchored" pattern. (There are also other constructs that can cause a pattern
				929	to be anchored.)
				930	.P
				931	A dollar character is an assertion that is true only if the current matching
				932	point is at the end of the subject string, or immediately before a newline
				933	at the end of the string (by default). Dollar need not be the last character of
				934	the pattern if a number of alternatives are involved, but it should be the last
				935	item in any branch in which it appears. Dollar has no special meaning in a
				936	character class.
				937	.P
				938	The meaning of dollar can be changed so that it matches only at the very end of
				939	the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
				940	does not affect the \eZ assertion.
				941	.P
				942	The meanings of the circumflex and dollar characters are changed if the
				943	PCRE_MULTILINE option is set. When this is the case, a circumflex matches
				944	immediately after internal newlines as well as at the start of the subject
				945	string. It does not match after a newline that ends the string. A dollar
				946	matches before any newlines in the string, as well as at the very end, when
				947	PCRE_MULTILINE is set. When newline is specified as the two-character
				948	sequence CRLF, isolated CR and LF characters do not indicate newlines.
				949	.P
				950	For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
				951	\en represents a newline) in multiline mode, but not otherwise. Consequently,
				952	patterns that are anchored in single line mode because all branches start with
				953	^ are not anchored in multiline mode, and a match for circumflex is possible
				954	when the \fIstartoffset\fP argument of \fBpcre_exec()\fP is non-zero. The
				955	PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
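.P
The /^abc$/ example can be tried with Python's re module, which is a separate engine from PCRE but treats circumflex and dollar in the same way for this case when the newline is a linefeed. In this sketch re.MULTILINE stands in for PCRE_MULTILINE; the variable names are only illustrative.

```python
import re

subject = "def\nabc"

# Default mode: ^ is tied to the start of the subject, so there is no match.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")
```

Here single_line is None, while multi_line covers offsets 4 to 7 of the subject.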
				956	.P
				957	Note that the sequences \eA, \eZ, and \ez can be used to match the start and
				958	end of the subject in both modes, and if all branches of a pattern start with
				959	\eA it is always anchored, whether or not PCRE_MULTILINE is set.
				960	.
				961	.
				962	.\" HTML <a name="fullstopdot"></a>
				963	.SH "FULL STOP (PERIOD, DOT) AND \eN"
				964	.rs
				965	.sp
				966	Outside a character class, a dot in the pattern matches any one character in
				967	the subject string except (by default) a character that signifies the end of a
				968	line. In UTF-8 mode, the matched character may be more than one byte long.
				969	.P
				970	When a line ending is defined as a single character, dot never matches that
				971	character; when the two-character sequence CRLF is used, dot does not match CR
				972	if it is immediately followed by LF, but otherwise it matches all characters
				973	(including isolated CRs and LFs). When any Unicode line endings are being
				974	recognized, dot does not match CR or LF or any of the other line ending
				975	characters.
				976	.P
				977	The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
				978	option is set, a dot matches any one character, without exception. If the
				979	two-character sequence CRLF is present in the subject string, it takes two dots
				980	to match it.
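.P
A short check of the dot behaviour, again using Python's re module as a stand-in. It is an assumption of this sketch that the newline is a single linefeed, which is the only newline convention Python's dot recognizes; re.DOTALL plays the part of PCRE_DOTALL.

```python
import re

def dot_matches(ch, dotall=False):
    # True if a single dot matches the one-character string ch.
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(r".", ch, flags) is not None

# Even with DOTALL, CRLF is two characters and so needs two dots.
crlf_needs_two_dots = (
    re.fullmatch(r".", "\r\n", re.DOTALL) is None
    and re.fullmatch(r"..", "\r\n", re.DOTALL) is not None
)
```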
				981	.P
				982	The handling of dot is entirely independent of the handling of circumflex and
				983	dollar, the only relationship being that they both involve newlines. Dot has no
				984	special meaning in a character class.
				985	.P
				986	The escape sequence \eN behaves like a dot, except that it is not affected by
				987	the PCRE_DOTALL option. In other words, it matches any character except one
				988	that signifies the end of a line. Perl also uses \eN to match characters by
				989	name; PCRE does not support this.
				990	.
				991	.
				992	.SH "MATCHING A SINGLE BYTE"
				993	.rs
				994	.sp
				995	Outside a character class, the escape sequence \eC matches any one byte, both
				996	in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
				997	characters. The feature is provided in Perl in order to match individual bytes
				998	in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
				999	breaks up characters into individual bytes, matching one byte with \eC in UTF-8
				1000	mode means that the rest of the string may start with a malformed UTF-8
				1001	character. This has undefined results, because PCRE assumes that it is dealing
				1002	with valid UTF-8 strings (and by default it checks this at the start of
				1003	processing unless the PCRE_NO_UTF8_CHECK option is used).
				1004	.P
				1005	PCRE does not allow \eC to appear in lookbehind assertions
				1006	.\" HTML <a href="#lookbehind">
				1007	.\" </a>
				1008	(described below)
				1009	.\"
				1010	in UTF-8 mode, because this would make it impossible to calculate the length of
				1011	the lookbehind.
				1012	.P
				1013	In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
				1014	way of using it that avoids the problem of malformed UTF-8 characters is to
				1015	use a lookahead to check the length of the next character, as in this pattern
				1016	(ignore white space and line breaks):
				1017	.sp
				1018	(?| (?=[\ex00-\ex7f])(\eC) |
				1019	    (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
				1020	    (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
				1021	    (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
				1022	.sp
				1023	A group that starts with (?| resets the capturing parentheses numbers in each
				1024	alternative (see
				1025	.\" HTML <a href="#dupsubpatternnumber">
				1026	.\" </a>
				1027	"Duplicate Subpattern Numbers"
				1028	.\"
				1029	below). The assertions at the start of each branch check the next UTF-8
				1030	character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
				1031	character's individual bytes are then captured by the appropriate number of
				1032	groups.
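.P
The four ranges tested by the lookaheads correspond to the UTF-8 encoded lengths. A minimal Python sketch of that mapping follows; utf8_length is a hypothetical helper name. The upper limit 0x1fffff is copied from the pattern, although Unicode itself stops at U+10FFFF.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for code_point, following the four ranges."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")
```

The result can be compared with len(chr(code_point).encode("utf-8")) for any code point that Python is able to encode.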
				1033	.
				1034	.
				1035	.\" HTML <a name="characterclass"></a>
				1036	.SH "SQUARE BRACKETS AND CHARACTER CLASSES"
				1037	.rs
				1038	.sp
				1039	An opening square bracket introduces a character class, terminated by a closing
				1040	square bracket. A closing square bracket on its own is not special by default.
				1041	However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
				1042	bracket causes a compile-time error. If a closing square bracket is required as
				1043	a member of the class, it should be the first data character in the class
				1044	(after an initial circumflex, if present) or escaped with a backslash.
				1045	.P
				1046	A character class matches a single character in the subject. In UTF-8 mode, the
				1047	character may be more than one byte long. A matched character must be in the
				1048	set of characters defined by the class, unless the first character in the class
				1049	definition is a circumflex, in which case the subject character must not be in
				1050	the set defined by the class. If a circumflex is actually required as a member
				1051	of the class, ensure it is not the first character, or escape it with a
				1052	backslash.
				1053	.P
				1054	For example, the character class [aeiou] matches any lower case vowel, while
				1055	[^aeiou] matches any character that is not a lower case vowel. Note that a
				1056	circumflex is just a convenient notation for specifying the characters that
				1057	are in the class by enumerating those that are not. A class that starts with a
				1058	circumflex is not an assertion; it still consumes a character from the subject
				1059	string, and therefore it fails if the current pointer is at the end of the
				1060	string.
				1061	.P
				1062	In UTF-8 mode, characters with values greater than 255 can be included in a
				1063	class as a literal string of bytes, or by using the \ex{ escaping mechanism.
				1064	.P
				1065	When caseless matching is set, any letters in a class represent both their
				1066	upper case and lower case versions, so for example, a caseless [aeiou] matches
				1067	"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
				1068	caseful version would. In UTF-8 mode, PCRE always understands the concept of
				1069	case for characters whose values are less than 128, so caseless matching is
				1070	always possible. For characters with higher values, the concept of case is
				1071	supported if PCRE is compiled with Unicode property support, but not otherwise.
				1072	If you want to use caseless matching in UTF-8 mode for characters 128 and above,
				1073	you must ensure that PCRE is compiled with Unicode property support as well as
				1074	with UTF-8 support.
				1075	.P
				1076	Characters that might indicate line breaks are never treated in any special way
				1077	when matching character classes, whatever line-ending sequence is in use, and
				1078	whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is used. A class
				1079	such as [^a] always matches one of these characters.
				1080	.P
				1081	The minus (hyphen) character can be used to specify a range of characters in a
				1082	character class. For example, [d-m] matches any letter between d and m,
				1083	inclusive. If a minus character is required in a class, it must be escaped with
				1084	a backslash or appear in a position where it cannot be interpreted as
				1085	indicating a range, typically as the first or last character in the class.
				1086	.P
				1087	It is not possible to have the literal character "]" as the end character of a
				1088	range. A pattern such as [W-]46] is interpreted as a class of two characters
				1089	("W" and "-") followed by a literal string "46]", so it would match "W46]" or
				1090	"-46]". However, if the "]" is escaped with a backslash it is interpreted as
				1091	the end of range, so [W-\e]46] is interpreted as a class containing a range
				1092	followed by two other characters. The octal or hexadecimal representation of
				1093	"]" can also be used to end a range.
				1094	.P
				1095	Ranges operate in the collating sequence of character values. They can also be
				1096	used for characters specified numerically, for example [\e000-\e037]. In UTF-8
				1097	mode, ranges can include characters whose values are greater than 255, for
				1098	example [\ex{100}-\ex{2ff}].
				1099	.P
				1100	If a range that includes letters is used when caseless matching is set, it
				1101	matches the letters in either case. For example, [W-c] is equivalent to
				1102	[][\e\e^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
				1103	tables for a French locale are in use, [\exc8-\excb] matches accented E
				1104	characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
				1105	characters with values of 128 and above only when it is compiled with Unicode
				1106	property support.
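.P
The [W-c] equivalence can be checked mechanically. The sketch below uses Python's re module, which is a different engine but handles a caseless ASCII range in the same way; re.IGNORECASE stands in for PCRE_CASELESS, and the variable names are only illustrative.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Every ASCII character that the caseless range accepts.
matched = {chr(c) for c in range(128) if caseless_range.fullmatch(chr(c))}

# The explicit list from the text, written out in both cases.
expected = set("[]\\^_`wxyzabc") | set("WXYZABC")
```

The two sets are equal: the range runs from "W" to "c" in code order, and caseless matching adds the other case of each letter in it.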
				1107	.P
				1108	The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
				1109	\eV, \ew, and \eW may appear in a character class, and add the characters that
				1110	they match to the class. For example, [\edABCDEF] matches any hexadecimal
				1111	digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \ed, \es, \ew
				1112	and their upper case partners, just as it does when they appear outside a
				1113	character class, as described in the section entitled
				1114	.\" HTML <a href="#genericchartypes">
				1115	.\" </a>
				1116	"Generic character types"
				1117	.\"
				1118	above. The escape sequence \eb has a different meaning inside a character
				1119	class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
				1120	are not special inside a character class. Like any other unrecognized escape
				1121	sequences, they are treated as the literal characters "B", "N", "R", and "X" by
				1122	default, but cause an error if the PCRE_EXTRA option is set.
				1123	.P
				1124	A circumflex can conveniently be used with the upper case character types to
				1125	specify a more restricted set of characters than the matching lower case type.
				1126	For example, the class [^\eW_] matches any letter or digit, but not underscore,
				1127	whereas [\ew] includes underscore. A positive character class should be read as
				1128	"something OR something OR ..." and a negative class as "NOT something AND NOT
				1129	something AND NOT ...".
				1130	.P
				1131	The only metacharacters that are recognized in character classes are backslash,
				1132	hyphen (only where it can be interpreted as specifying a range), circumflex
				1133	(only at the start), opening square bracket (only when it can be interpreted as
				1134	introducing a POSIX class name - see the next section), and the terminating
				1135	closing square bracket. However, escaping other non-alphanumeric characters
				1136	does no harm.
				1137	.
				1138	.
				1139	.SH "POSIX CHARACTER CLASSES"
				1140	.rs
				1141	.sp
				1142	Perl supports the POSIX notation for character classes. This uses names
				1143	enclosed by [: and :] within the enclosing square brackets. PCRE also supports
				1144	this notation. For example,
				1145	.sp
				1146	[01[:alpha:]%]
				1147	.sp
				1148	matches "0", "1", any alphabetic character, or "%". The supported class names
				1149	are:
				1150	.sp
				1151	alnum    letters and digits
				1152	alpha    letters
				1153	ascii    character codes 0 - 127
				1154	blank    space or tab only
				1155	cntrl    control characters
				1156	digit    decimal digits (same as \ed)
				1157	graph    printing characters, excluding space
				1158	lower    lower case letters
				1159	print    printing characters, including space
				1160	punct    printing characters, excluding letters and digits and space
				1161	space    white space (not quite the same as \es)
				1162	upper    upper case letters
				1163	word     "word" characters (same as \ew)
				1164	xdigit   hexadecimal digits
				1165	.sp
				1166	The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
				1167	space (32). Notice that this list includes the VT character (code 11). This
				1168	makes "space" different to \es, which does not include VT (for Perl
				1169	compatibility).
				1170	.P
				1171	The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
				1172	5.8. Another Perl extension is negation, which is indicated by a ^ character
				1173	after the colon. For example,
				1174	.sp
				1175	[12[:^digit:]]
				1176	.sp
				1177	matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
				1178	syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
				1179	supported, and an error is given if they are encountered.
				1180	.P
				1181	By default, in UTF-8 mode, characters with values of 128 and above do not match
				1182	any of the POSIX character classes. However, if the PCRE_UCP option is passed
				1183	to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
				1184	character properties are used. This is achieved by replacing the POSIX classes
				1185	by other sequences, as follows:
				1186	.sp
				1187	[:alnum:] becomes \ep{Xan}
				1188	[:alpha:] becomes \ep{L}
				1189	[:blank:] becomes \eh
				1190	[:digit:] becomes \ep{Nd}
				1191	[:lower:] becomes \ep{Ll}
				1192	[:space:] becomes \ep{Xps}
				1193	[:upper:] becomes \ep{Lu}
				1194	[:word:] becomes \ep{Xwd}
				1195	.sp
				1196	Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX
				1197	classes are unchanged, and match only characters with code points less than
				1198	128.
				1199	.
				1200	.
				1201	.SH "VERTICAL BAR"
				1202	.rs
				1203	.sp
				1204	Vertical bar characters are used to separate alternative patterns. For example,
				1205	the pattern
				1206	.sp
				1207	gilbert|sullivan
				1208	.sp
				1209	matches either "gilbert" or "sullivan". Any number of alternatives may appear,
				1210	and an empty alternative is permitted (matching the empty string). The matching
				1211	process tries each alternative in turn, from left to right, and the first one
				1212	that succeeds is used. If the alternatives are within a subpattern
				1213	.\" HTML <a href="#subpattern">
				1214	.\" </a>
				1215	(defined below),
				1216	.\"
				1217	"succeeds" means matching the rest of the main pattern as well as the
				1218	alternative in the subpattern.
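.P
The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be seen in a small experiment. Python's re module is used here only because it follows the same rule for these simple patterns; the function names are hypothetical.

```python
import re

def first_alternative_wins(subject):
    # "a" is tried before "ab"; it succeeds, so it is the one that is used.
    return re.match(r"a|ab", subject).group()

def inside_group(subject):
    # Inside the group, "a" cannot be followed by "c" in "abc", so the
    # matcher moves on to the second alternative, "ab".
    return re.fullmatch(r"(a|ab)c", subject).group(1)
```

With the subject "ab" the first function returns "a", not the longer "ab". With the subject "abc" the second returns "ab", because the rest of the pattern has to match as well.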
				1219	.
				1220	.
				1221	.SH "INTERNAL OPTION SETTING"
				1222	.rs
				1223	.sp
				1224	The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
				1225	PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
				1226	the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
				1227	The option letters are
				1228	.sp
				1229	i for PCRE_CASELESS
				1230	m for PCRE_MULTILINE
				1231	s for PCRE_DOTALL
				1232	x for PCRE_EXTENDED
				1233	.sp
				1234	For example, (?im) sets caseless, multiline matching. It is also possible to
				1235	unset these options by preceding the letter with a hyphen, and a combined
				1236	setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
				1237	PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
				1238	permitted. If a letter appears both before and after the hyphen, the option is
				1239	unset.
				1240	.P
				1241	The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
				1242	changed in the same way as the Perl-compatible options by using the characters
				1243	J, U and X respectively.
				1244	.P
				1245	When one of these option changes occurs at top level (that is, not inside
				1246	subpattern parentheses), the change applies to the remainder of the pattern
				1247	that follows. If the change is placed right at the start of a pattern, PCRE
				1248	extracts it into the global options (and it will therefore show up in data
				1249	extracted by the \fBpcre_fullinfo()\fP function).
				1250	.P
				1251	An option change within a subpattern (see below for a description of
				1252	subpatterns) affects only that part of the subpattern that follows it, so
				1253	.sp
				1254	(a(?i)b)c
				1255	.sp
				1256	matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
				1257	By this means, options can be made to have different settings in different
				1258	parts of the pattern. Any changes made in one alternative do carry on
				1259	into subsequent branches within the same subpattern. For example,
				1260	.sp
				1261	(a(?i)b|c)
				1262	.sp
				1263	matches "ab", "aB", "c", and "C", even though when matching "C" the first
				1264	branch is abandoned before the option setting. This is because the effects of
				1265	option settings happen at compile time. There would be some very weird
				1266	behaviour otherwise.
				1267	.P
				1268	\fBNote:\fP There are other PCRE-specific options that can be set by the
				1269	application when the compile or match functions are called. In some cases the
				1270	pattern can contain special leading sequences such as (*CRLF) to override what
				1271	the application has set or what has been defaulted. Details are given in the
				1272	section entitled
				1273	.\" HTML <a href="#newlineseq">
				1274	.\" </a>
				1275	"Newline sequences"
				1276	.\"
				1277	above. There are also the (*UTF8) and (*UCP) leading sequences that can be used
				1278	to set UTF-8 and Unicode property modes; they are equivalent to setting the
				1279	PCRE_UTF8 and the PCRE_UCP options, respectively.
				1280	.
				1281	.
				1282	.\" HTML <a name="subpattern"></a>
				1283	.SH SUBPATTERNS
				1284	.rs
				1285	.sp
				1286	Subpatterns are delimited by parentheses (round brackets), which can be nested.
				1287	Turning part of a pattern into a subpattern does two things:
				1288	.sp
				1289	1. It localizes a set of alternatives. For example, the pattern
				1290	.sp
				1291	cat(aract|erpillar|)
				1292	.sp
				1293	matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
				1294	match "cataract", "erpillar" or an empty string.
				1295	.sp
				1296	2. It sets up the subpattern as a capturing subpattern. This means that, when
				1297	the whole pattern matches, that portion of the subject string that matched the
				1298	subpattern is passed back to the caller via the \fIovector\fP argument of
				1299	\fBpcre_exec()\fP. Opening parentheses are counted from left to right (starting
				1300	from 1) to obtain numbers for the capturing subpatterns. For example, if the
				1301	string "the red king" is matched against the pattern
				1302	.sp
				1303	the ((red|white) (king|queen))
				1304	.sp
				1305	the captured substrings are "red king", "red", and "king", and are numbered 1,
				1306	2, and 3, respectively.
				1307	.P
				1308	The fact that plain parentheses fulfil two functions is not always helpful.
				1309	There are often times when a grouping subpattern is required without a
				1310	capturing requirement. If an opening parenthesis is followed by a question mark
				1311	and a colon, the subpattern does not do any capturing, and is not counted when
				1312	computing the number of any subsequent capturing subpatterns. For example, if
				1313	the string "the white queen" is matched against the pattern
				1314	.sp
				1315	the ((?:red|white) (king|queen))
				1316	.sp
				1317	the captured substrings are "white queen" and "queen", and are numbered 1 and
				1318	2. The maximum number of capturing subpatterns is 65535.
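.P
The numbering in the two examples above can be reproduced with Python's re module, which counts opening parentheses in the same way. This is only an illustration; in PCRE itself the substrings come back through the ovector argument rather than a groups() call.

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# The (?: group is not counted, so only two capturing subpatterns remain.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captured substrings in numerical order, starting from 1.
first_result = capturing.groups()
second_result = non_capturing.groups()
```

The first result holds "red king", "red", and "king"; the second holds "white queen" and "queen".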
				1319	.P
				1320	As a convenient shorthand, if any option settings are required at the start of
				1321	a non-capturing subpattern, the option letters may appear between the "?" and
				1322	the ":". Thus the two patterns
				1323	.sp
				1324	(?i:saturday|sunday)
				1325	(?:(?i)saturday|sunday)
				1326	.sp
				1327	match exactly the same set of strings. Because alternative branches are tried
				1328	from left to right, and options are not reset until the end of the subpattern
				1329	is reached, an option setting in one branch does affect subsequent branches, so
				1330	the above patterns match "SUNDAY" as well as "Saturday".
				1331	.
				1332	.
				1333	.\" HTML <a name="dupsubpatternnumber"></a>
				1334	.SH "DUPLICATE SUBPATTERN NUMBERS"
				1335	.rs
				1336	.sp
				1337	Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
				1338	the same numbers for its capturing parentheses. Such a subpattern starts with
				1339	(?| and is itself a non-capturing subpattern. For example, consider this
				1340	pattern:
				1341	.sp
				1342	(?|(Sat)ur|(Sun))day
				1343	.sp
				1344	Because the two alternatives are inside a (?| group, both sets of capturing
				1345	parentheses are numbered one. Thus, when the pattern matches, you can look
				1346	at captured substring number one, whichever alternative matched. This construct
				1347	is useful when you want to capture part, but not all, of one of a number of
				1348	alternatives. Inside a (?| group, parentheses are numbered as usual, but the
				1349	number is reset at the start of each branch. The numbers of any capturing
				1350	parentheses that follow the subpattern start after the highest number used in
				1351	any branch. The following example is taken from the Perl documentation. The
				1352	numbers underneath show in which buffer the captured content will be stored.
				1353	.sp
				1354	# before  ---------------branch-reset----------- after
				1355	/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
				1356	# 1            2         2  3        2     3     4
				1357	.sp
				1358	A back reference to a numbered subpattern uses the most recent value that is
				1359	set for that number by any subpattern. The following pattern matches "abcabc"
				1360	or "defdef":
				1361	.sp
				1362	/(?|(abc)|(def))\e1/
				1363	.sp
				1364	In contrast, a subroutine call to a numbered subpattern always refers to the
				1365	first one in the pattern with the given number. The following pattern matches
				1366	"abcabc" or "defabc":
				1367	.sp
				1368	/(?|(abc)|(def))(?1)/
				1369	.sp
				1370	If a
				1371	.\" HTML <a href="#conditions">
				1372	.\" </a>
				1373	condition test
				1374	.\"
				1375	for a subpattern's having matched refers to a non-unique number, the test is
				1376	true if any of the subpatterns of that number have matched.
				1377	.P
				1378	An alternative approach to using this "branch reset" feature is to use
				1379	duplicate named subpatterns, as described in the next section.
				1380	.
				1381	.
				1382	.SH "NAMED SUBPATTERNS"
				1383	.rs
				1384	.sp
				1385	Identifying capturing parentheses by number is simple, but it can be very hard
				1386	to keep track of the numbers in complicated regular expressions. Furthermore,
				1387	if an expression is modified, the numbers may change. To help with this
				1388	difficulty, PCRE supports the naming of subpatterns. This feature was not
				1389	added to Perl until release 5.10. Python had the feature earlier, and PCRE
				1390	introduced it at release 4.0, using the Python syntax. PCRE now supports both
				1391	the Perl and the Python syntax. Perl allows identically numbered subpatterns to
				1392	have different names, but PCRE does not.
				1393	.P
				1394	In PCRE, a subpattern can be named in one of three ways: (?<name>...) or
				1395	(?'name'...) as in Perl, or (?P<name>...) as in Python. References to capturing
				1396	parentheses from other parts of the pattern, such as
				1397	.\" HTML <a href="#backreferences">
				1398	.\" </a>
				1399	back references,
				1400	.\"
				1401	.\" HTML <a href="#recursion">
				1402	.\" </a>
				1403	recursion,
				1404	.\"
				1405	and
				1406	.\" HTML <a href="#conditions">
				1407	.\" </a>
				1408	conditions,
				1409	.\"
				1410	can be made by name as well as by number.
				1411	.P
				1412	Names consist of up to 32 alphanumeric characters and underscores. Named
				1413	capturing parentheses are still allocated numbers as well as names, exactly as
				1414	if the names were not present. The PCRE API provides function calls for
				1415	extracting the name-to-number translation table from a compiled pattern. There
				1416	is also a convenience function for extracting a captured substring by name.
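.P
The relationship between names and numbers can be sketched as follows. This
uses Python's re module as a stand-in (an assumption of this example); the
PCRE functions themselves are described in the pcreapi documentation. The
date pattern and the variable names are hypothetical.
.sp
```python
import re

# Hypothetical date pattern using the Python-style (?P<name>...) syntax.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups still receive ordinary numbers, exactly as if unnamed.
# groupindex is Python's name-to-number table for a compiled pattern.
name_to_number = dict(date.groupindex)

m = date.match("2013-11-14")
by_name = m.group("month")                      # extract by name
by_number = m.group(name_to_number["month"])    # same substring by number

print(name_to_number)      # {'year': 1, 'month': 2, 'day': 3}
print(by_name, by_number)  # 11 11
```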
				1417	.P
				1418	By default, a name must be unique within a pattern, but it is possible to relax
				1419	this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
				1420	names are also always permitted for subpatterns with the same number, set up as
				1421	described in the previous section.) Duplicate names can be useful for patterns
				1422	where only one instance of the named parentheses can match. Suppose you want to
				1423	match the name of a weekday, either as a 3-letter abbreviation or as the full
				1424	name, and in both cases you want to extract the abbreviation. This pattern
				1425	(ignoring the line breaks) does the job:
				1426	.sp
				1427	(?<DN>Mon|Fri|Sun)(?:day)?|
				1428	(?<DN>Tue)(?:sday)?|
				1429	(?<DN>Wed)(?:nesday)?|
				1430	(?<DN>Thu)(?:rsday)?|
				1431	(?<DN>Sat)(?:urday)?
				1431	(?<DN>Sat)(?:urday)?
				1432	.sp
				1433	There are five capturing substrings, but only one is ever set after a match.
				1434	(An alternative way of solving this problem is to use a "branch reset"
				1435	subpattern, as described in the previous section.)
				1436	.P
				1437	The convenience function for extracting the data by name returns the substring
				1438	for the first (and in this example, the only) subpattern of that name that
				1439	matched. This saves searching to find which numbered subpattern it was.
				1440	.P
				1441	If you make a back reference to a non-unique named subpattern from elsewhere in
				1442	the pattern, the one that corresponds to the first occurrence of the name is
				1443	used. In the absence of duplicate numbers (see the previous section) this is
				1444	the one with the lowest number. If you use a named reference in a condition
				1445	test (see the
				1446	.\"
				1447	.\" HTML <a href="#conditions">
				1448	.\" </a>
				1449	section about conditions
				1450	.\"
				1451	below), either to check whether a subpattern has matched, or to check for
				1452	recursion, all subpatterns with the same name are tested. If the condition is
				1453	true for any one of them, the overall condition is true. This is the same
				1454	behaviour as testing by number. For further details of the interfaces for
				1455	handling named subpatterns, see the
				1456	.\" HREF
				1457	\fBpcreapi\fP
				1458	.\"
				1459	documentation.
				1460	.P
				1461	\fBWarning:\fP You cannot use different names to distinguish between two
				1462	subpatterns with the same number because PCRE uses only the numbers when
				1463	matching. For this reason, an error is given at compile time if different names
				1464	are given to subpatterns with the same number. However, you can give the same
				1465	name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
				1466	.
				1467	.
				1468	.SH REPETITION
				1469	.rs
				1470	.sp
				1471	Repetition is specified by quantifiers, which can follow any of the following
				1472	items:
				1473	.sp
				1474	a literal data character
				1475	the dot metacharacter
				1476	the \eC escape sequence
				1477	the \eX escape sequence (in UTF-8 mode with Unicode properties)
				1478	the \eR escape sequence
				1479	an escape such as \ed or \epL that matches a single character
				1480	a character class
				1481	a back reference (see next section)
				1482	a parenthesized subpattern (including assertions)
				1483	a subroutine call to a subpattern (recursive or otherwise)
				1484	.sp
				1485	The general repetition quantifier specifies a minimum and maximum number of
				1486	permitted matches, by giving the two numbers in curly brackets (braces),
				1487	separated by a comma. The numbers must be less than 65536, and the first must
				1488	be less than or equal to the second. For example:
				1489	.sp
				1490	z{2,4}
				1491	.sp
				1492	matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
				1493	character. If the second number is omitted, but the comma is present, there is
				1494	no upper limit; if the second number and the comma are both omitted, the
				1495	quantifier specifies an exact number of required matches. Thus
				1496	.sp
				1497	[aeiou]{3,}
				1498	.sp
				1499	matches at least 3 successive vowels, but may match many more, while
				1500	.sp
				1501	\ed{8}
				1502	.sp
				1503	matches exactly 8 digits. An opening curly bracket that appears in a position
				1504	where a quantifier is not allowed, or one that does not match the syntax of a
				1505	quantifier, is taken as a literal character. For example, {,6} is not a
				1506	quantifier, but a literal string of four characters.
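.P
The three quantifier forms above can be exercised with a small sketch. It
uses Python's re module, which is an assumption of this example rather than
PCRE. The literal-brace case is deliberately left out, because Python treats
{,6} as a quantifier meaning zero to six, unlike the PCRE behaviour described
above. The helper name is hypothetical.
.sp
```python
import re

def fully_matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four z's. Try lengths 1 through 5.
z_results = [fully_matches(r"z{2,4}", "z" * n) for n in range(1, 6)]

# [aeiou]{3,}: at least three vowels, with no upper limit.
many_vowels = fully_matches(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = fully_matches(r"[aeiou]{3,}", "ae")

# \d{8}: exactly eight digits.
eight_digits = fully_matches(r"\d{8}", "20131114")
nine_digits = fully_matches(r"\d{8}", "201311140")

print(z_results)  # [False, True, True, True, False]
```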
				1507	.P
				1508	In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
				1509	bytes. Thus, for example, \ex{100}{2} matches two UTF-8 characters, each of
				1510	which is represented by a two-byte sequence. Similarly, when Unicode property
				1511	support is available, \eX{3} matches three Unicode extended sequences, each of
				1512	which may be several bytes long (and they may be of different lengths).
				1513	.P
				1514	The quantifier {0} is permitted, causing the expression to behave as if the
				1515	previous item and the quantifier were not present. This may be useful for
				1516	subpatterns that are referenced as
				1517	.\" HTML <a href="#subpatternsassubroutines">
				1518	.\" </a>
				1519	subroutines
				1520	.\"
				1521	from elsewhere in the pattern (but see also the section entitled
				1522	.\" HTML <a href="#subdefine">
				1523	.\" </a>
				1524	"Defining subpatterns for use by reference only"
				1525	.\"
				1526	below). Items other than subpatterns that have a {0} quantifier are omitted
				1527	from the compiled pattern.
				1528	.P
				1529	For convenience, the three most common quantifiers have single-character
				1530	abbreviations:
				1531	.sp
				1532	* is equivalent to {0,}
				1533	+ is equivalent to {1,}
				1534	? is equivalent to {0,1}
				1535	.sp
				1536	It is possible to construct infinite loops by following a subpattern that can
				1537	match no characters with a quantifier that has no upper limit, for example:
				1538	.sp
				1539	(a?)*
				1540	.sp
				1541	Earlier versions of Perl and PCRE used to give an error at compile time for
				1542	such patterns. However, because there are cases where this can be useful, such
				1543	patterns are now accepted, but if any repetition of the subpattern does in fact
				1544	match no characters, the loop is forcibly broken.
				1545	.P
				1546	By default, the quantifiers are "greedy", that is, they match as much as
				1547	possible (up to the maximum number of permitted times), without causing the
				1548	rest of the pattern to fail. The classic example of where this gives problems
				1549	is in trying to match comments in C programs. These appear between /* and */
				1550	and within the comment, individual * and / characters may appear. An attempt to
				1551	match C comments by applying the pattern
				1552	.sp
				1553	/\e*.*\e*/
				1554	.sp
				1555	to the string
				1556	.sp
				1557	/* first comment */  not comment  /* second comment */
				1558	.sp
				1559	fails, because it matches the entire string owing to the greediness of the .*
				1560	item.
				1561	.P
				1562	However, if a quantifier is followed by a question mark, it ceases to be
				1563	greedy, and instead matches the minimum number of times possible, so the
				1564	pattern
				1565	.sp
				1566	/\e*.*?\e*/
				1567	.sp
				1568	does the right thing with the C comments. The meaning of the various
				1569	quantifiers is not otherwise changed, just the preferred number of matches.
				1570	Do not confuse this use of question mark with its use as a quantifier in its
				1571	own right. Because it has two uses, it can sometimes appear doubled, as in
				1572	.sp
				1573	\ed??\ed
				1574	.sp
				1575	which matches one digit by preference, but can match two if that is the only
				1576	way the rest of the pattern matches.
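.P
The difference between greedy and minimizing quantifiers can be seen in a
short sketch. It uses Python's re module (an assumption of this example, not
PCRE), which gives the question mark suffix the same meaning. The subject is
the C comment string discussed above.
.sp
```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/", so one match covers the whole subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# A doubled question mark: an optional digit that prefers to be absent.
prefers_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(lazy)         # ['/* first comment */', '/* second comment */']
print(prefers_one)  # 1
print(forced_two)   # 12
```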
				1577	.P
				1578	If the PCRE_UNGREEDY option is set (an option that is not available in Perl),
				1579	the quantifiers are not greedy by default, but individual ones can be made
				1580	greedy by following them with a question mark. In other words, it inverts the
				1581	default behaviour.
				1582	.P
				1583	When a parenthesized subpattern is quantified with a minimum repeat count that
				1584	is greater than 1 or with a limited maximum, more memory is required for the
				1585	compiled pattern, in proportion to the size of the minimum or maximum.
				1586	.P
				1587	If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
				1588	to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
				1589	implicitly anchored, because whatever follows will be tried against every
				1590	character position in the subject string, so there is no point in retrying the
				1591	overall match at any position after the first. PCRE normally treats such a
				1592	pattern as though it were preceded by \eA.
				1593	.P
				1594	In cases where it is known that the subject string contains no newlines, it is
				1595	worth setting PCRE_DOTALL in order to obtain this optimization, or
				1596	alternatively using ^ to indicate anchoring explicitly.
				1597	.P
				1598	However, there is one situation where the optimization cannot be used. When .*
				1599	is inside capturing parentheses that are the subject of a back reference
				1600	elsewhere in the pattern, a match at the start may fail where a later one
				1601	succeeds. Consider, for example:
				1602	.sp
				1603	(.*)abc\e1
				1604	.sp
				1605	If the subject is "xyz123abc123" the match point is the fourth character. For
				1606	this reason, such a pattern is not implicitly anchored.
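.P
The match point in this example can be confirmed with a brief sketch. It uses
Python's re module as a stand-in (an assumption of this example); the
internal anchoring optimization is specific to PCRE and is not visible here,
only the position at which the match is found.
.sp
```python
import re

# With a back reference to the group containing .*, a match that fails at
# the start of the subject can still succeed at a later position.
m = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

start = m.start()      # zero-based offset of the match
captured = m.group(1)  # what (.*) held when the match succeeded

print(start, captured)  # 3 123  (offset 3 is the fourth character)
```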
				1607	.P
				1608	When a capturing subpattern is repeated, the value captured is the substring
				1609	that matched the final iteration. For example, after
				1610	.sp
				1611	(tweedle[dume]{3}\es*)+
				1612	.sp
				1613	has matched "tweedledum tweedledee" the value of the captured substring is
				1614	"tweedledee". However, if there are nested capturing subpatterns, the
				1615	corresponding captured values may have been set in previous iterations. For
				1616	example, after
				1617	.sp
				1618	/(a|(b))+/
				1619	.sp
				1620	matches "aba" the value of the second captured substring is "b".
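.P
Both cases can be reproduced in a small sketch using Python's re module (an
assumption of this example, not PCRE). Python happens to keep captured values
from earlier iterations in the same way for these two patterns; other regex
engines may differ.
.sp
```python
import re

# A repeated capturing group keeps only the text of its final iteration.
last_iteration = re.match(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# The nested group 2 is not touched by the final iteration (which matched
# "a"), so it still holds the "b" captured in the second iteration.
nested = re.match(r"(a|(b))+", "aba").groups()

print(last_iteration)  # tweedledee
print(nested)          # ('a', 'b')
```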
				1621	.
				1622	.
				1623	.\" HTML <a name="atomicgroup"></a>
				1624	.SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
				1625	.rs
				1626	.sp
				1627	With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
				1628	repetition, failure of what follows normally causes the repeated item to be
				1629	re-evaluated to see if a different number of repeats allows the rest of the
				1630	pattern to match. Sometimes it is useful to prevent this, either to change the
				1631	nature of the match, or to cause it to fail earlier than it otherwise might, when
				1632	the author of the pattern knows there is no point in carrying on.
				1633	.P
				1634	Consider, for example, the pattern \ed+foo when applied to the subject line
				1635	.sp
				1636	123456bar
				1637	.sp
				1638	After matching all 6 digits and then failing to match "foo", the normal
				1639	action of the matcher is to try again with only 5 digits matching the \ed+
				1640	item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
				1641	(a term taken from Jeffrey Friedl's book) provides the means for specifying
				1642	that once a subpattern has matched, it is not to be re-evaluated in this way.
				1643	.P
				1644	If we use atomic grouping for the previous example, the matcher gives up
				1645	immediately on failing to match "foo" the first time. The notation is a kind of
				1646	special parenthesis, starting with (?> as in this example:
				1647	.sp
				1648	(?>\ed+)foo
				1649	.sp
				1650	This kind of parenthesis "locks up" the part of the pattern it contains once
				1651	it has matched, and a failure further into the pattern is prevented from
				1652	backtracking into it. Backtracking past it to previous items, however, works as
				1653	normal.
				1654	.P
				1655	An alternative description is that a subpattern of this type matches the string
				1656	of characters that an identical standalone pattern would match, if anchored at
				1657	the current point in the subject string.
				1658	.P
				1659	Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
				1660	the above example can be thought of as a maximizing repeat that must swallow
				1661	everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
				1662	number of digits they match in order to make the rest of the pattern match,
				1663	(?>\ed+) can only match an entire sequence of digits.
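.P
The effect of an atomic group on the outcome of a match, not only on its
speed, can be sketched as follows. This uses Python's re module (an
assumption of this example, not PCRE). The (?> syntax is accepted only from
Python 3.11, so older versions fall back to an emulation that is not PCRE
syntax: a lookahead is never backtracked into, so capturing inside one and
then matching the capture behaves like an atomic group. The pattern ending in
"6bar" is a hypothetical variation on the subject used above.
.sp
```python
import re
import sys

subject = "123456bar"

# Ordinary greedy repeat: \d+ first takes all six digits, then gives one
# back so that the literal "6bar" can match.
backtracking = re.search(r"\d+6bar", subject)

if sys.version_info >= (3, 11):
    # Atomic group: the digits, once matched, are never given back.
    atomic = re.search(r"(?>\d+)6bar", subject)
else:
    # Emulation for older Pythons (assumption of this sketch): capture
    # inside a lookahead, then consume the capture by name. A named
    # reference avoids ambiguity with the digit that follows it.
    atomic = re.search(r"(?=(?P<digits>\d+))(?P=digits)6bar", subject)

print(backtracking.group())  # 123456bar
print(atomic)                # None
```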
				1664	.P
				1665	Atomic groups in general can of course contain arbitrarily complicated
				1666	subpatterns, and can be nested. However, when the subpattern for an atomic
				1667	group is just a single repeated item, as in the example above, a simpler
				1668	notation, called a "possessive quantifier" can be used. This consists of an
				1669	additional + character following a quantifier. Using this notation, the
				1670	previous example can be rewritten as
				1671	.sp
				1672	\ed++foo
				1673	.sp
				1674	Note that a possessive quantifier can be used with an entire group, for
				1675	example:
				1676	.sp
				1677	(abc|xyz){2,3}+
				1678	.sp
				1679	Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
				1680	option is ignored. They are a convenient notation for the simpler forms of
				1681	atomic group. However, there is no difference in the meaning of a possessive
				1682	quantifier and the equivalent atomic group, though there may be a performance
				1683	difference; possessive quantifiers should be slightly faster.
				1684	.P
				1685	The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
				1686	Jeffrey Friedl originated the idea (and the name) in the first edition of his
				1687	book. Mike McCloskey liked it, so implemented it when he built Sun's Java
				1688	package, and PCRE copied it from there. It ultimately found its way into Perl
				1689	at release 5.10.
				1690	.P
				1691	PCRE has an optimization that automatically "possessifies" certain simple
				1692	pattern constructs. For example, the sequence A+B is treated as A++B because
				1693	there is no point in backtracking into a sequence of A's when B must follow.
				1694	.P
				1695	When a pattern contains an unlimited repeat inside a subpattern that can itself
				1696	be repeated an unlimited number of times, the use of an atomic group is the
				1697	only way to avoid some failing matches taking a very long time indeed. The
				1698	pattern
				1699	.sp
				1700	(\eD+|<\ed+>)*[!?]
				1701	.sp
				1702	matches an unlimited number of substrings that either consist of non-digits, or
				1703	digits enclosed in <>, followed by either ! or ?. When it matches, it runs
				1704	quickly. However, if it is applied to
				1705	.sp
				1706	aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
				1707	.sp
				1708	it takes a long time before reporting failure. This is because the string can
				1709	be divided between the internal \eD+ repeat and the external * repeat in a
				1710	large number of ways, and all have to be tried. (The example uses [!?] rather
				1711	than a single character at the end, because both PCRE and Perl have an
				1712	optimization that allows for fast failure when a single character is used. They
				1713	remember the last single character that is required for a match, and fail early
				1714	if it is not present in the string.) If the pattern is changed so that it uses
				1715	an atomic group, like this:
				1716	.sp
				1717	((?>\eD+)|<\ed+>)*[!?]
				1718	.sp
				1719	sequences of non-digits cannot be broken, and failure happens quickly.
				1720	.
				1721	.
				1722	.\" HTML <a name="backreferences"></a>
				1723	.SH "BACK REFERENCES"
				1724	.rs
				1725	.sp
				1726	Outside a character class, a backslash followed by a digit greater than 0 (and
				1727	possibly further digits) is a back reference to a capturing subpattern earlier
				1728	(that is, to its left) in the pattern, provided there have been that many
				1729	previous capturing left parentheses.
				1730	.P
				1731	However, if the decimal number following the backslash is less than 10, it is
				1732	always taken as a back reference, and causes an error only if there are not
				1733	that many capturing left parentheses in the entire pattern. In other words, the
				1734	parentheses that are referenced need not be to the left of the reference for
				1735	numbers less than 10. A "forward back reference" of this type can make sense
				1736	when a repetition is involved and the subpattern to the right has participated
				1737	in an earlier iteration.
				1738	.P
				1739	It is not possible to have a numerical "forward back reference" to a subpattern
				1740	whose number is 10 or more using this syntax because a sequence such as \e50 is
				1741	interpreted as a character defined in octal. See the subsection entitled
				1742	"Non-printing characters"
				1743	.\" HTML <a href="#digitsafterbackslash">
				1744	.\" </a>
				1745	above
				1746	.\"
				1747	for further details of the handling of digits following a backslash. There is
				1748	no such problem when named parentheses are used. A back reference to any
				1749	subpattern is possible using named parentheses (see below).
				1750	.P
				1751	Another way of avoiding the ambiguity inherent in the use of digits following a
				1752	backslash is to use the \eg escape sequence. This escape must be followed by an
				1753	unsigned number or a negative number, optionally enclosed in braces. These
				1754	examples are all identical:
				1755	.sp
				1756	(ring), \e1
				1757	(ring), \eg1
				1758	(ring), \eg{1}
				1759	.sp
				1760	An unsigned number specifies an absolute reference without the ambiguity that
				1761	is present in the older syntax. It is also useful when literal digits follow
				1762	the reference. A negative number is a relative reference. Consider this
				1763	example:
				1764	.sp
				1765	(abc(def)ghi)\eg{-1}
				1766	.sp
				1767	The sequence \eg{-1} is a reference to the most recently started capturing
				1768	subpattern before \eg, that is, it is equivalent to \e2 in this example.
				1769	Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
				1770	can be helpful in long patterns, and also in patterns that are created by
				1771	joining together fragments that contain references within themselves.
				1772	.P
				1773	A back reference matches whatever actually matched the capturing subpattern in
				1774	the current subject string, rather than anything matching the subpattern
				1775	itself (see
				1776	.\" HTML <a href="#subpatternsassubroutines">
				1777	.\" </a>
				1778	"Subpatterns as subroutines"
				1779	.\"
				1780	below for a way of doing that). So the pattern
				1781	.sp
				1782	(sens|respons)e and \e1ibility
				1783	.sp
				1784	matches "sense and sensibility" and "response and responsibility", but not
				1785	"sense and responsibility". If caseful matching is in force at the time of the
				1786	back reference, the case of letters is relevant. For example,
				1787	.sp
				1788	((?i)rah)\es+\e1
				1789	.sp
				1790	matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
				1791	capturing subpattern is matched caselessly.
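.P
Both examples can be tried in a short sketch using Python's re module (an
assumption of this example, not PCRE). The second pattern is written with the
scoped form (?i:rah) instead of (?i)rah, because recent Python versions
reject an unscoped option setting in the middle of a pattern. The back
reference itself stays outside the caseless scope, as in the original.
.sp
```python
import re

# The back reference must repeat the text that was actually captured.
pair = re.compile(r"(sens|respons)e and \1ibility")
phrases = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
pair_results = [pair.fullmatch(p) is not None for p in phrases]

# The group is matched caselessly, but the back reference is caseful, so
# the repeated text must have the same case as the captured text.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [rah.fullmatch(s) is not None for s in ("rah rah", "RAH RAH", "RAH rah")]

print(pair_results)  # [True, True, False]
print(rah_results)   # [True, True, False]
```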
				1792	.P
				1793	There are several different ways of writing back references to named
				1794	subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
				1795	\ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
				1796	back reference syntax, in which \eg can be used for both numeric and named
				1797	references, is also supported. We could rewrite the above example in any of
				1798	the following ways:
				1799	.sp
				1800	(?<p1>(?i)rah)\es+\ek<p1>
				1801	(?'p1'(?i)rah)\es+\ek{p1}
				1802	(?P<p1>(?i)rah)\es+(?P=p1)
				1803	(?<p1>(?i)rah)\es+\eg{p1}
				1804	.sp
				1805	A subpattern that is referenced by name may appear in the pattern before or
				1806	after the reference.
				1807	.P
				1808	There may be more than one back reference to the same subpattern. If a
				1809	subpattern has not actually been used in a particular match, any back
				1810	references to it always fail by default. For example, the pattern
				1811	.sp
				1812	(a|(bc))\e2
				1813	.sp
				1814	always fails if it starts to match "a" rather than "bc". However, if the
				1815	PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
				1816	unset value matches an empty string.
				1817	.P
				1818	Because there may be many capturing parentheses in a pattern, all digits
				1819	following a backslash are taken as part of a potential back reference number.
				1820	If the pattern continues with a digit character, some delimiter must be used to
				1821	terminate the back reference. If the PCRE_EXTENDED option is set, this can be
				1822	whitespace. Otherwise, the \eg{ syntax or an empty comment (see
				1823	.\" HTML <a href="#comments">
				1824	.\" </a>
				1825	"Comments"
				1826	.\"
				1827	below) can be used.
				1828	.
				1829	.SS "Recursive back references"
				1830	.rs
				1831	.sp
				1832	A back reference that occurs inside the parentheses to which it refers fails
				1833	when the subpattern is first used, so, for example, (a\e1) never matches.
				1834	However, such references can be useful inside repeated subpatterns. For
				1835	example, the pattern
				1836	.sp
				1837	(a|b\e1)+
				1838	.sp
				1839	matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
				1840	the subpattern, the back reference matches the character string corresponding
				1841	to the previous iteration. In order for this to work, the pattern must be such
				1842	that the first iteration does not need to match the back reference. This can be
				1843	done using alternation, as in the example above, or by a quantifier with a
				1844	minimum of zero.
				1845	.P
				1846	Back references of this type cause the group that they reference to be treated
				1847	as an
				1848	.\" HTML <a href="#atomicgroup">
				1849	.\" </a>
				1850	atomic group.
				1851	.\"
				1852	Once the whole group has been matched, a subsequent matching failure cannot
				1853	cause backtracking into the middle of the group.
				1854	.
				1855	.
				1856	.\" HTML <a name="bigassertions"></a>
				1857	.SH ASSERTIONS
				1858	.rs
				1859	.sp
				1860	An assertion is a test on the characters following or preceding the current
				1861	matching point that does not actually consume any characters. The simple
				1862	assertions coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
				1863	.\" HTML <a href="#smallassertions">
				1864	.\" </a>
				1865	above.
				1866	.\"
				1867	.P
				1868	More complicated assertions are coded as subpatterns. There are two kinds:
				1869	those that look ahead of the current position in the subject string, and those
				1870	that look behind it. An assertion subpattern is matched in the normal way,
				1871	except that it does not cause the current matching position to be changed.
				1872	.P
				1873	Assertion subpatterns are not capturing subpatterns. If such an assertion
				1874	contains capturing subpatterns within it, these are counted for the purposes of
				1875	numbering the capturing subpatterns in the whole pattern. However, substring
				1876	capturing is carried out only for positive assertions, because it does not make
				1877	sense for negative assertions.
				1878	.P
				1879	For compatibility with Perl, assertion subpatterns may be repeated; though
				1880	it makes no sense to assert the same thing several times, the side effect of
				1881	capturing parentheses may occasionally be useful. In practice, there are only three
				1882	cases:
				1883	.sp
				1884	(1) If the quantifier is {0}, the assertion is never obeyed during matching.
				1885	However, it may contain internal capturing parenthesized groups that are called
				1886	from elsewhere via the
				1887	.\" HTML <a href="#subpatternsassubroutines">
				1888	.\" </a>
				1889	subroutine mechanism.
				1890	.\"
				1891	.sp
				1892	(2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
				1893	were {0,1}. At run time, the rest of the pattern match is tried with and
				1894	without the assertion, the order depending on the greediness of the quantifier.
				1895	.sp
				1896	(3) If the minimum repetition is greater than zero, the quantifier is ignored.
				1897	The assertion is obeyed just once when encountered during matching.
				1898	.
				1899	.
				1900	.SS "Lookahead assertions"
				1901	.rs
				1902	.sp
				1903	Lookahead assertions start with (?= for positive assertions and (?! for
				1904	negative assertions. For example,
				1905	.sp
				1906	\ew+(?=;)
				1907	.sp
				1908	matches a word followed by a semicolon, but does not include the semicolon in
				1909	the match, and
				1910	.sp
				1911	foo(?!bar)
				1912	.sp
				1913	matches any occurrence of "foo" that is not followed by "bar". Note that the
				1914	apparently similar pattern
				1915	.sp
				1916	(?!foo)bar
				1917	.sp
				1918	does not find an occurrence of "bar" that is preceded by something other than
				1919	"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
				1920	(?!foo) is always true when the next three characters are "bar". A
				1921	lookbehind assertion is needed to achieve the other effect.
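.P
The three lookahead patterns above can be compared in a small sketch using
Python's re module (an assumption of this example, not PCRE), which uses the
same lookahead syntax. The subject strings are hypothetical.
.sp
```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "return value;").group()

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar", including the one preceded by "foo",
# because the assertion looks at "bar" itself, which is never "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bar")]

print(word_before_semicolon)  # value
print(foo_not_bar)            # [7]
print(any_bar)                # [3, 7]
```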
				1922	.P
				1923	If you want to force a matching failure at some point in a pattern, the most
				1924	convenient way to do it is with (?!) because an empty string always matches, so
				1925	an assertion that requires there not to be an empty string must always fail.
				1926	The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
				1927	.
				1928	.
				1929	.\" HTML <a name="lookbehind"></a>
				1930	.SS "Lookbehind assertions"
				1931	.rs
				1932	.sp
				1933	Lookbehind assertions start with (?<= for positive assertions and (?<! for
				1934	negative assertions. For example,
				1935	.sp
				1936	(?<!foo)bar
				1937	.sp
				1938	does find an occurrence of "bar" that is not preceded by "foo". The contents of
				1939	a lookbehind assertion are restricted such that all the strings it matches must
				1940	have a fixed length. However, if there are several top-level alternatives, they
				1941	do not all have to have the same fixed length. Thus
				1942	.sp
				1943	(?<=bullock|donkey)
				1944	.sp
				1945	is permitted, but
				1946	.sp
				1947	(?<!dogs?|cats?)
				1948	.sp
				1949	causes an error at compile time. Branches that match different length strings
				1950	are permitted only at the top level of a lookbehind assertion. This is an
				1951	extension compared with Perl, which requires all branches to match the same
				1952	length of string. An assertion such as
				1953	.sp
				1954	(?<=ab(c|de))
				1955	.sp
				1956	is not permitted, because its single top-level branch can match two different
				1957	lengths, but it is acceptable to PCRE if rewritten to use two top-level
				1958	branches:
				1959	.sp
				1960	(?<=abc|abde)
				1961	.sp
				1962	In some cases, the escape sequence \eK
				1963	.\" HTML <a href="#resetmatchstart">
				1964	.\" </a>
				1965	(see above)
				1966	.\"
				1967	can be used instead of a lookbehind assertion to get round the fixed-length
				1968	restriction.
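.P
A negative lookbehind and the fixed-length restriction can be sketched with
Python's re module (an assumption of this example, not PCRE). Python is
stricter than PCRE here: it requires the whole lookbehind to have one fixed
length, so it does not accept top-level alternatives of different lengths
either. The subject string is hypothetical.
.sp
```python
import re

# Only the "bar" that is not immediately preceded by "foo" is found.
bar_not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar")]

# A branch that can match strings of different lengths is rejected when
# the pattern is compiled.
try:
    re.compile(r"(?<!dogs?|cats?)")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(bar_not_after_foo)         # [7]
print(variable_length_rejected)  # True
```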
				1969	.P
				1970	The implementation of lookbehind assertions is, for each alternative, to
				1971	temporarily move the current position back by the fixed length and then try to
				1972	match. If there are insufficient characters before the current position, the
				1973	assertion fails.
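.P
The procedure just described can be sketched in a few lines of Python. This is
an illustration only, not PCRE's code: the function name lookbehind_matches is
hypothetical, and every alternative is assumed to be a plain literal string, so
that its fixed length is simply its character count.

```python
from typing import Sequence


def lookbehind_matches(subject: str, pos: int, alternatives: Sequence[str]) -> bool:
    """Model a positive lookbehind over literal alternatives at offset pos."""
    for alt in alternatives:
        start = pos - len(alt)  # step back by this alternative's fixed length
        if start < 0:
            continue  # too few characters before pos: this alternative fails
        if subject.startswith(alt, start):
            return True  # the text ending at pos equals this alternative
    return False


# Models (?<=bullock|donkey) tested at the end of each subject.
print(lookbehind_matches("a bullock", 9, ["bullock", "donkey"]))
print(lookbehind_matches("my donkey", 9, ["bullock", "donkey"]))
print(lookbehind_matches("key", 3, ["bullock", "donkey"]))
```

The first two calls succeed. The third fails because neither alternative has
enough characters before the current position.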
				1974	.P
				1975	In UTF-8 mode, PCRE does not allow the \eC escape (which matches a single byte,
				1976	even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
				1977	impossible to calculate the length of the lookbehind. The \eX and \eR escapes,
				1978	which can match different numbers of bytes, are also not permitted.
				1979	.P
				1980	.\" HTML <a href="#subpatternsassubroutines">
				1981	.\" </a>
				1982	"Subroutine"
				1983	.\"
				1984	calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
				1985	as the subpattern matches a fixed-length string.
				1986	.\" HTML <a href="#recursion">
				1987	.\" </a>
				1988	Recursion,
				1989	.\"
				1990	however, is not supported.
				1991	.P
				1992	Possessive quantifiers can be used in conjunction with lookbehind assertions to
				1993	specify efficient matching of fixed-length strings at the end of subject
				1994	strings. Consider a simple pattern such as
				1995	.sp
				1996	abcd$
				1997	.sp
				1998	when applied to a long string that does not match. Because matching proceeds
				1999	from left to right, PCRE will look for each "a" in the subject and then see if
				2000	what follows matches the rest of the pattern. If the pattern is specified as
				2001	.sp
				2002	^.*abcd$
				2003	.sp
				2004	the initial .* matches the entire string at first, but when this fails (because
				2005	there is no following "a"), it backtracks to match all but the last character,
				2006	then all but the last two characters, and so on. Once again the search for "a"
				2007	covers the entire string, from right to left, so we are no better off. However,
				2008	if the pattern is written as
				2009	.sp
				2010	^.*+(?<=abcd)
				2011	.sp
				2012	there can be no backtracking for the .*+ item; it can match only the entire
				2013	string. The subsequent lookbehind assertion does a single test on the last four
				2014	characters. If it fails, the match fails immediately. For long strings, this
				2015	approach makes a significant difference to the processing time.
				2016	.
				2017	.
				2018	.SS "Using multiple assertions"
				2019	.rs
				2020	.sp
				2021	Several assertions (of any sort) may occur in succession. For example,
				2022	.sp
				2023	(?<=\ed{3})(?<!999)foo
				2024	.sp
				2025	matches "foo" preceded by three digits that are not "999". Notice that each of
				2026	the assertions is applied independently at the same point in the subject
				2027	string. First there is a check that the previous three characters are all
				2028	digits, and then there is a check that the same three characters are not "999".
				2029	This pattern does \fInot\fP match "foo" preceded by six characters, the first
				2030	three of which are digits and the last three of which are not "999". For example, it
				2031	doesn't match "123abcfoo". A pattern to do that is
				2032	.sp
				2033	(?<=\ed{3}...)(?<!999)foo
				2034	.sp
				2035	This time the first assertion looks at the preceding six characters, checking
				2036	that the first three are digits, and then the second assertion checks that the
				2037	preceding three characters are not "999".
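.P
The difference between the two patterns can be checked with a short script.
The sketch below uses Python's re module as a stand-in, on the assumption that
it treats these particular fixed-length lookbehinds the same way PCRE does; the
variable names are chosen here for illustration.

```python
import re

# Both lookbehinds are tested at the same point: just before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")
# Here the first lookbehind spans six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, bool(same_three.search(subject)), bool(six_back.search(subject)))
```

Only "123foo" satisfies the first pattern, and only "123abcfoo" satisfies the
second.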
				2038	.P
				2039	Assertions can be nested in any combination. For example,
				2040	.sp
				2041	(?<=(?<!foo)bar)baz
				2042	.sp
				2043	matches an occurrence of "baz" that is preceded by "bar" which in turn is not
				2044	preceded by "foo", while
				2045	.sp
				2046	(?<=\ed{3}(?!999)...)foo
				2047	.sp
				2048	is another pattern that matches "foo" preceded by three digits and any three
				2049	characters that are not "999".
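.P
The two nested patterns can be tried in the same way. Again this is a sketch
that assumes Python's re module agrees with PCRE for these cases; it is not
PCRE itself.

```python
import re

# "baz" preceded by "bar" that is itself not preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")
# "foo" preceded by three digits and then three characters that are not "999".
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")

for subject in ("xbarbaz", "foobarbaz"):
    print(subject, bool(nested_behind.search(subject)))
for subject in ("123abcfoo", "123999foo"):
    print(subject, bool(nested_ahead.search(subject)))
```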
				2050	.
				2051	.
				2052	.\" HTML <a name="conditions"></a>
				2053	.SH "CONDITIONAL SUBPATTERNS"
				2054	.rs
				2055	.sp
				2056	It is possible to cause the matching process to obey a subpattern
				2057	conditionally or to choose between two alternative subpatterns, depending on
				2058	the result of an assertion, or whether a specific capturing subpattern has
				2059	already been matched. The two possible forms of conditional subpattern are:
				2060	.sp
				2061	(?(condition)yes-pattern)
				2062	(?(condition)yes-pattern\|no-pattern)
				2063	.sp
				2064	If the condition is satisfied, the yes-pattern is used; otherwise the
				2065	no-pattern (if present) is used. If there are more than two alternatives in the
				2066	subpattern, a compile-time error occurs. Each of the two alternatives may
				2067	itself contain nested subpatterns of any form, including conditional
				2068	subpatterns; the restriction to two alternatives applies only at the level of
				2069	the condition. This pattern fragment is an example where the alternatives are
				2070	complex:
				2071	.sp
				2072	(?(1) (A\|B\|C) \| (D \| (?(2)E\|F) \| E) )
				2073	.sp
				2074	.P
				2075	There are four kinds of condition: references to subpatterns, references to
				2076	recursion, a pseudo-condition called DEFINE, and assertions.
				2077	.
				2078	.SS "Checking for a used subpattern by number"
				2079	.rs
				2080	.sp
				2081	If the text between the parentheses consists of a sequence of digits, the
				2082	condition is true if a capturing subpattern of that number has previously
				2083	matched. If there is more than one capturing subpattern with the same number
				2084	(see the earlier
				2085	.\"
				2086	.\" HTML <a href="#recursion">
				2087	.\" </a>
				2088	section about duplicate subpattern numbers),
				2089	.\"
				2090	the condition is true if any of them have matched. An alternative notation is
				2091	to precede the digits with a plus or minus sign. In this case, the subpattern
				2092	number is relative rather than absolute. The most recently opened parentheses
				2093	can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
				2094	loops it can also make sense to refer to subsequent groups. The next
				2095	parentheses to be opened can be referenced as (?(+1), and so on. (The value
				2096	zero in any of these forms is not used; it provokes a compile-time error.)
				2097	.P
				2098	Consider the following pattern, which contains non-significant white space to
				2099	make it more readable (assume the PCRE_EXTENDED option) and to divide it into
				2100	three parts for ease of discussion:
				2101	.sp
				2102	( \e( )? [^()]+ (?(1) \e) )
				2103	.sp
				2104	The first part matches an optional opening parenthesis, and if that
				2105	character is present, sets it as the first captured substring. The second part
				2106	matches one or more characters that are not parentheses. The third part is a
				2107	conditional subpattern that tests whether or not the first set of parentheses
				2108	matched. If they did, that is, if the subject started with an opening parenthesis,
				2109	the condition is true, and so the yes-pattern is executed and a closing
				2110	parenthesis is required. Otherwise, since no-pattern is not present, the
				2111	subpattern matches nothing. In other words, this pattern matches a sequence of
				2112	non-parentheses, optionally enclosed in parentheses.
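.P
The same pattern can be exercised from Python, whose re module also accepts
the (?(1)...) form. In this sketch re.VERBOSE stands in for PCRE_EXTENDED, and
fullmatch is used so that the whole subject must be consumed; both choices are
assumptions made for the illustration.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the pattern is ignored.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, bool(optional_parens.fullmatch(subject)))
```

The first two subjects match. The last two are rejected because the
parentheses are not paired.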
				2113	.P
				2114	If you were embedding this pattern in a larger one, you could use a relative
				2115	reference:
				2116	.sp
				2117	...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ...
				2118	.sp
				2119	This makes the fragment independent of the parentheses in the larger pattern.
				2120	.
				2121	.SS "Checking for a used subpattern by name"
				2122	.rs
				2123	.sp
				2124	Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
				2125	subpattern by name. For compatibility with earlier versions of PCRE, which had
				2126	this facility before Perl, the syntax (?(name)...) is also recognized. However,
				2127	there is a possible ambiguity with this syntax, because subpattern names may
				2128	consist entirely of digits. PCRE looks first for a named subpattern; if it
				2129	cannot find one and the name consists entirely of digits, PCRE looks for a
				2130	subpattern of that number, which must be greater than zero. Using subpattern
				2131	names that consist entirely of digits is not recommended.
				2132	.P
				2133	Rewriting the above example to use a named subpattern gives this:
				2134	.sp
				2135	(?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
				2136	.sp
				2137	If the name used in a condition of this kind is a duplicate, the test is
				2138	applied to all subpatterns of the same name, and is true if any one of them has
				2139	matched.
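.P
A named version can also be tried in Python, with one change of spelling:
Python's re module writes the group as (?P<OPEN>...) and tests it with
(?(OPEN)...), the bare-name form described above. The sketch assumes the two
engines agree on this pattern.

```python
import re

# (?P<OPEN>...) is Python's named-group spelling; (?(OPEN)...) tests whether it matched.
named_parens = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

match = named_parens.fullmatch("(abc)")
print(match.group("OPEN") if match else None)
print(bool(named_parens.fullmatch("(abc")))
```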
				2140	.
				2141	.SS "Checking for pattern recursion"
				2142	.rs
				2143	.sp
				2144	If the condition is the string (R), and there is no subpattern with the name R,
				2145	the condition is true if a recursive call to the whole pattern or any
				2146	subpattern has been made. If digits or a name preceded by ampersand follow the
				2147	letter R, for example:
				2148	.sp
				2149	(?(R3)...) or (?(R&name)...)
				2150	.sp
				2151	the condition is true if the most recent recursion is into a subpattern whose
				2152	number or name is given. This condition does not check the entire recursion
				2153	stack. If the name used in a condition of this kind is a duplicate, the test is
				2154	applied to all subpatterns of the same name, and is true if any one of them is
				2155	the most recent recursion.
				2156	.P
				2157	At "top level", all these recursion test conditions are false.
				2158	.\" HTML <a href="#recursion">
				2159	.\" </a>
				2160	The syntax for recursive patterns
				2161	.\"
				2162	is described below.
				2163	.
				2164	.\" HTML <a name="subdefine"></a>
				2165	.SS "Defining subpatterns for use by reference only"
				2166	.rs
				2167	.sp
				2168	If the condition is the string (DEFINE), and there is no subpattern with the
				2169	name DEFINE, the condition is always false. In this case, there may be only one
				2170	alternative in the subpattern. It is always skipped if control reaches this
				2171	point in the pattern; the idea of DEFINE is that it can be used to define
				2172	subroutines that can be referenced from elsewhere. (The use of
				2173	.\" HTML <a href="#subpatternsassubroutines">
				2174	.\" </a>
				2175	subroutines
				2176	.\"
				2177	is described below.) For example, a pattern to match an IPv4 address such as
				2178	"192.168.23.245" could be written like this (ignore whitespace and line
				2179	breaks):
				2180	.sp
				2181	(?(DEFINE) (?<byte> 2[0-4]\ed \| 25[0-5] \| 1\ed\ed \| [1-9]?\ed) )
				2182	\eb (?&byte) (\e.(?&byte)){3} \eb
				2183	.sp
				2184	The first part of the pattern is a DEFINE group inside which another group
				2185	named "byte" is defined. This matches an individual component of an IPv4
				2186	address (a number less than 256). When matching takes place, this part of the
				2187	pattern is skipped because DEFINE acts like a false condition. The rest of the
				2188	pattern uses references to the named group to match the four dot-separated
				2189	components of an IPv4 address, insisting on a word boundary at each end.
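.P
Python's re module has no DEFINE group, so the sketch below approximates the
idea by textual reuse: the fragment is stored in a variable (BYTE, a name
chosen here) and substituted wherever the PCRE pattern would write (?&byte).
This is a different mechanism from a PCRE subroutine call and is shown only to
make the structure of the IPv4 pattern concrete.

```python
import re

# Stand-in for the (?<byte>...) definition: a reusable pattern fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
# Each {BYTE} substitution plays the role of a (?&byte) reference.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for subject in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(subject, bool(ipv4.fullmatch(subject)))
```

Only the first subject is accepted. The second has a component that is not
less than 256, and the third has too few components.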
				2190	.
				2191	.SS "Assertion conditions"
				2192	.rs
				2193	.sp
				2194	If the condition is not in any of the above formats, it must be an assertion.
				2195	This may be a positive or negative lookahead or lookbehind assertion. Consider
				2196	this pattern, again containing non-significant white space, and with the two
				2197	alternatives on the second line:
				2198	.sp
				2199	(?(?=[^a-z]*[a-z])
				2200	\ed{2}-[a-z]{3}-\ed{2} \| \ed{2}-\ed{2}-\ed{2} )
				2201	.sp
				2202	The condition is a positive lookahead assertion that matches an optional
				2203	sequence of non-letters followed by a letter. In other words, it tests for the
				2204	presence of at least one letter in the subject. If a letter is found, the
				2205	subject is matched against the first alternative; otherwise it is matched
				2206	against the second. This pattern matches strings in one of the two forms
				2207	dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
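.P
Python's re module does not accept an assertion as a condition, but the logic
can be imitated by guarding each alternative with the assertion or its
negation. The sketch below is such a rewrite, not the PCRE construct itself;
the names HAS_LETTER and date_like are chosen here for illustration.

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"
date_like = re.compile(
    rf"""
    (?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}   # condition true: yes-pattern
      | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}}      # condition false: no-pattern
    )
    """,
    re.VERBOSE,
)

for subject in ("12-abc-34", "12-34-56", "12-34-ab"):
    print(subject, bool(date_like.fullmatch(subject)))
```

The last subject is rejected: it contains a letter, so only the first
alternative is eligible, and that alternative does not fit.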
				2208	.
				2209	.
				2210	.\" HTML <a name="comments"></a>
				2211	.SH COMMENTS
				2212	.rs
				2213	.sp
				2214	There are two ways of including comments in patterns that are processed by
				2215	PCRE. In both cases, the start of the comment must not be in a character class,
				2216	nor in the middle of any other sequence of related characters such as (?: or a
				2217	subpattern name or number. The characters that make up a comment play no part
				2218	in the pattern matching.
				2219	.P
				2220	The sequence (?# marks the start of a comment that continues up to the next
				2221	closing parenthesis. Nested parentheses are not permitted. If the PCRE_EXTENDED
				2222	option is set, an unescaped # character also introduces a comment, which in
				2223	this case continues to immediately after the next newline character or
				2224	character sequence in the pattern. Which characters are interpreted as newlines
				2225	is controlled by the options passed to \fBpcre_compile()\fP or by a special
				2226	sequence at the start of the pattern, as described in the section entitled
				2227	.\" HTML <a href="#newlines">
				2228	.\" </a>
				2229	"Newline conventions"
				2230	.\"
				2231	above. Note that the end of this type of comment is a literal newline sequence
				2232	in the pattern; escape sequences that happen to represent a newline do not
				2233	count. For example, consider this pattern when PCRE_EXTENDED is set, and the
				2234	default newline convention is in force:
				2235	.sp
				2236	abc #comment \en still comment
				2237	.sp
				2238	On encountering the # character, \fBpcre_compile()\fP skips along, looking for
				2239	a newline in the pattern. The sequence \en is still literal at this stage, so
				2240	it does not terminate the comment. Only an actual character with the code value
				2241	0x0a (the default newline) does so.
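.P
Both comment forms, and the distinction between an escape sequence and a real
newline, can be observed with Python's re module, which is assumed here to
follow the same rules with re.VERBOSE in place of PCRE_EXTENDED.

```python
import re

inline = re.compile(r"abc(?#this text is ignored)def")

# Raw string: the pattern holds a backslash and the letter n, not a newline,
# so the # comment swallows everything up to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline character, which ends the comment,
# so "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(inline.fullmatch("abcdef")))
print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abcdef")))
```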
				2242	.
				2243	.
				2244	.\" HTML <a name="recursion"></a>
				2245	.SH "RECURSIVE PATTERNS"
				2246	.rs
				2247	.sp
				2248	Consider the problem of matching a string in parentheses, allowing for
				2249	unlimited nested parentheses. Without the use of recursion, the best that can
				2250	be done is to use a pattern that matches up to some fixed depth of nesting. It
				2251	is not possible to handle an arbitrary nesting depth.
				2252	.P
				2253	For some time, Perl has provided a facility that allows regular expressions to
				2254	recurse (amongst other things). It does this by interpolating Perl code in the
				2255	expression at run time, and the code can refer to the expression itself. A Perl
				2256	pattern using code interpolation to solve the parentheses problem can be
				2257	created like this:
				2258	.sp
				2259	$re = qr{\e( (?: (?>[^()]+) \| (?p{$re}) )* \e)}x;
				2260	.sp
				2261	The (?p{...}) item interpolates Perl code at run time, and in this case refers
				2262	recursively to the pattern in which it appears.
				2263	.P
				2264	Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
				2265	supports special syntax for recursion of the entire pattern, and also for
				2266	individual subpattern recursion. After its introduction in PCRE and Python,
				2267	this kind of recursion was subsequently introduced into Perl at release 5.10.
				2268	.P
				2269	A special item that consists of (? followed by a number greater than zero and a
				2270	closing parenthesis is a recursive subroutine call of the subpattern of the
				2271	given number, provided that it occurs inside that subpattern. (If not, it is a
				2272	.\" HTML <a href="#subpatternsassubroutines">
				2273	.\" </a>
				2274	non-recursive subroutine
				2275	.\"
				2276	call, which is described in the next section.) The special item (?R) or (?0) is
				2277	a recursive call of the entire regular expression.
				2278	.P
				2279	This PCRE pattern solves the nested parentheses problem (assume the
				2280	PCRE_EXTENDED option is set so that white space is ignored):
				2281	.sp
				2282	\e( ( [^()]++ \| (?R) )* \e)
				2283	.sp
				2284	First it matches an opening parenthesis. Then it matches any number of
				2285	substrings which can either be a sequence of non-parentheses, or a recursive
				2286	match of the pattern itself (that is, a correctly parenthesized substring).
				2287	Finally there is a closing parenthesis. Note the use of a possessive quantifier
				2288	to avoid backtracking into sequences of non-parentheses.
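.P
Python's re module has no recursion, so the sketch below spells out by hand
what the recursive pattern does. The function match_balanced is a hypothetical
name and is not part of any library; it mirrors the three steps just described
and calls itself where the pattern has (?R).

```python
def match_balanced(subject: str, pos: int = 0) -> int:
    """Return the offset just past a balanced group starting at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1  # the leading \( did not match
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1  # the closing \) ends this level
        if ch == "(":
            pos = match_balanced(subject, pos)  # plays the role of (?R)
            if pos == -1:
                return -1  # the nested group never closed
        else:
            pos += 1  # one more character of a [^()]++ run
    return -1  # ran out of subject before the closing parenthesis


print(match_balanced("(ab(cd)ef)"))
print(match_balanced("(aaaa()"))
```

The first call consumes the whole ten-character subject. The second reports
failure because the outer parenthesis is never closed.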
				2289	.P
				2290	If this were part of a larger pattern, you would not want to recurse the entire
				2291	pattern, so instead you could use this:
				2292	.sp
				2293	( \e( ( [^()]++ \| (?1) )* \e) )
				2294	.sp
				2295	We have put the pattern into parentheses, and caused the recursion to refer to
				2296	them instead of the whole pattern.
				2297	.P
				2298	In a larger pattern, keeping track of parenthesis numbers can be tricky. This
				2299	is made easier by the use of relative references. Instead of (?1) in the
				2300	pattern above you can write (?-2) to refer to the second most recently opened
				2301	parentheses preceding the recursion. In other words, a negative number counts
				2302	capturing parentheses leftwards from the point at which it is encountered.
				2303	.P
				2304	It is also possible to refer to subsequently opened parentheses, by writing
				2305	references such as (?+2). However, these cannot be recursive because the
				2306	reference is not inside the parentheses that are referenced. They are always
				2307	.\" HTML <a href="#subpatternsassubroutines">
				2308	.\" </a>
				2309	non-recursive subroutine
				2310	.\"
				2311	calls, as described in the next section.
				2312	.P
				2313	An alternative approach is to use named parentheses instead. The Perl syntax
				2314	for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
				2315	could rewrite the above example as follows:
				2316	.sp
				2317	(?<pn> \e( ( [^()]++ \| (?&pn) )* \e) )
				2318	.sp
				2319	If there is more than one subpattern with the same name, the earliest one is
				2320	used.
				2321	.P
				2322	This particular example pattern that we have been looking at contains nested
				2323	unlimited repeats, and so the use of a possessive quantifier for matching
				2324	strings of non-parentheses is important when applying the pattern to strings
				2325	that do not match. For example, when this pattern is applied to
				2326	.sp
				2327	(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
				2328	.sp
				2329	it yields "no match" quickly. However, if a possessive quantifier is not used,
				2330	the match runs for a very long time indeed because there are so many different
				2331	ways the + and * repeats can carve up the subject, and all have to be tested
				2332	before failure can be reported.
				2333	.P
				2334	At the end of a match, the values of capturing parentheses are those from
				2335	the outermost level. If you want to obtain intermediate values, a callout
				2336	function can be used (see below and the
				2337	.\" HREF
				2338	\fBpcrecallout\fP
				2339	.\"
				2340	documentation). If the pattern above is matched against
				2341	.sp
				2342	(ab(cd)ef)
				2343	.sp
				2344	the value for the inner capturing parentheses (numbered 2) is "ef", which is
				2345	the last value taken on at the top level. If a capturing subpattern is not
				2346	matched at the top level, its final captured value is unset, even if it was
				2347	(temporarily) set at a deeper level during the matching process.
				2348	.P
				2349	If there are more than 15 capturing parentheses in a pattern, PCRE has to
				2350	obtain extra memory to store data during a recursion, which it does by using
				2351	\fBpcre_malloc\fP, freeing it via \fBpcre_free\fP afterwards. If no memory can
				2352	be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
				2353	.P
				2354	Do not confuse the (?R) item with the condition (R), which tests for recursion.
				2355	Consider this pattern, which matches text in angle brackets, allowing for
				2356	arbitrary nesting. Only digits are allowed in nested brackets (that is, when
				2357	recursing), whereas any characters are permitted at the outer level.
				2358	.sp
				2359	< (?: (?(R) \ed++ \| [^<>]+) \| (?R)) >
				2360	.sp
				2361	In this pattern, (?(R) is the start of a conditional subpattern, with two
				2362	different alternatives for the recursive and non-recursive cases. The (?R) item
				2363	is the actual recursive call.
				2364	.
				2365	.
				2366	.\" HTML <a name="recursiondifference"></a>
				2367	.SS "Differences in recursion processing between PCRE and Perl"
				2368	.rs
				2369	.sp
				2370	Recursion processing in PCRE differs from Perl in two important ways. In PCRE
				2371	(like Python, but unlike Perl), a recursive subpattern call is always treated
				2372	as an atomic group. That is, once it has matched some of the subject string, it
				2373	is never re-entered, even if it contains untried alternatives and there is a
				2374	subsequent matching failure. This can be illustrated by the following pattern,
				2375	which purports to match a palindromic string that contains an odd number of
				2376	characters (for example, "a", "aba", "abcba", "abcdcba"):
				2377	.sp
				2378	^(.\|(.)(?1)\e2)$
				2379	.sp
				2380	The idea is that it either matches a single character, or two identical
				2381	characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
				2382	it does not if the subject string is longer than three characters. Consider the
				2383	subject string "abcba":
				2384	.P
				2385	At the top level, the first character is matched, but as it is not at the end
				2386	of the string, the first alternative fails; the second alternative is taken
				2387	and the recursion kicks in. The recursive call to subpattern 1 successfully
				2388	matches the next character ("b"). (Note that the beginning and end of line
				2389	tests are not part of the recursion.)
				2390	.P
				2391	Back at the top level, the next character ("c") is compared with what
				2392	subpattern 2 matched, which was "a". This fails. Because the recursion is
				2393	treated as an atomic group, there are now no backtracking points, and so the
				2394	entire match fails. (Perl is able, at this point, to re-enter the recursion and
				2395	try the second alternative.) However, if the pattern is written with the
				2396	alternatives in the other order, things are different:
				2397	.sp
				2398	^((.)(?1)\e2\|.)$
				2399	.sp
				2400	This time, the recursing alternative is tried first, and continues to recurse
				2401	until it runs out of characters, at which point the recursion fails. But this
				2402	time we do have another alternative to try at the higher level. That is the big
				2403	difference: in the previous case the remaining alternative is at a deeper
				2404	recursion level, which PCRE cannot use.
				2405	.P
				2406	To change the pattern so that it matches all palindromic strings, not just
				2407	those with an odd number of characters, it is tempting to change the pattern to
				2408	this:
				2409	.sp
				2410	^((.)(?1)\e2\|.?)$
				2411	.sp
				2412	Again, this works in Perl, but not in PCRE, and for the same reason. When a
				2413	deeper recursion has matched a single character, it cannot be entered again in
				2414	order to match an empty string. The solution is to separate the two cases, and
				2415	write out the odd and even cases as alternatives at the higher level:
				2416	.sp
				2417	^(?:((.)(?1)\e2\|)\|((.)(?3)\e4\|.))
				2418	.sp
				2419	If you want to match typical palindromic phrases, the pattern has to ignore all
				2420	non-word characters, which can be done like this:
				2421	.sp
				2422	^\eW*+(?:((.)\eW*+(?1)\eW*+\e2\|)\|((.)\eW*+(?3)\eW*+\e4\|\eW*+.\eW*+))\eW*+$
				2423	.sp
				2424	If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
				2425	man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
				2426	the use of the possessive quantifier *+ to avoid backtracking into sequences of
				2427	non-word characters. Without this, PCRE takes a great deal longer (ten times or
				2428	more) to match typical phrases, and Perl takes so long that you think it has
				2429	gone into a loop.
				2430	.P
				2431	\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
				2432	string does not start with a palindrome that is shorter than the entire string.
				2433	For example, although "abcba" is correctly matched, if the subject is "ababa",
				2434	PCRE finds the palindrome "aba" at the start, then fails at top level because
				2435	the end of the string does not follow. Once again, it cannot jump back into the
				2436	recursion to try other alternatives, so the entire match fails.
				2437	.P
				2438	The second way in which PCRE and Perl differ in their recursion processing is
				2439	in the handling of captured values. In Perl, when a subpattern is called
				2440	recursively or as a subroutine (see the next section), it has no access to any
				2441	values that were captured outside the recursion, whereas in PCRE these values
				2442	can be referenced. Consider this pattern:
				2443	.sp
				2444	^(.)(\e1\|a(?2))
				2445	.sp
				2446	In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
				2447	then in the second group, when the back reference \e1 fails to match "b", the
				2448	second alternative matches "a" and then recurses. In the recursion, \e1 does
				2449	now match "b" and so the whole match succeeds. In Perl, the pattern fails to
				2450	match because inside the recursive call \e1 cannot access the externally set
				2451	value.
				2452	.
				2453	.
				2454	.\" HTML <a name="subpatternsassubroutines"></a>
				2455	.SH "SUBPATTERNS AS SUBROUTINES"
				2456	.rs
				2457	.sp
				2458	If the syntax for a recursive subpattern call (either by number or by
				2459	name) is used outside the parentheses to which it refers, it operates like a
				2460	subroutine in a programming language. The called subpattern may be defined
				2461	before or after the reference. A numbered reference can be absolute or
				2462	relative, as in these examples:
				2463	.sp
				2464	(...(absolute)...)...(?2)...
				2465	(...(relative)...)...(?-1)...
				2466	(...(?+1)...(relative)...
				2467	.sp
				2468	An earlier example pointed out that the pattern
				2469	.sp
				2470	(sens\|respons)e and \e1ibility
				2471	.sp
				2472	matches "sense and sensibility" and "response and responsibility", but not
				2473	"sense and responsibility". If instead the pattern
				2474	.sp
				2475	(sens\|respons)e and (?1)ibility
				2476	.sp
				2477	is used, it does match "sense and responsibility" as well as the other two
				2478	strings. Another example is given in the discussion of DEFINE above.
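.P
The contrast between a back reference and a subroutine call can be imitated in
Python, whose re module supports the former but not the latter. In the sketch
below the subroutine call is approximated by substituting the group's text a
second time (the name STEM is chosen here); a back reference repeats what was
matched, whereas the substituted copy, like (?1), repeats the pattern.

```python
import re

STEM = r"(?:sens|respons)"
# Back reference: the second occurrence must repeat the text the group matched.
backref = re.compile(r"(sens|respons)e and \1ibility")
# Textual reuse: the second occurrence may choose either alternative again.
reuse = re.compile(rf"{STEM}e and {STEM}ibility")

subject = "sense and responsibility"
print(bool(backref.fullmatch(subject)), bool(reuse.fullmatch(subject)))
```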
				2479	.P
				2480	All subroutine calls, whether recursive or not, are always treated as atomic
				2481	groups. That is, once a subroutine has matched some of the subject string, it
				2482	is never re-entered, even if it contains untried alternatives and there is a
				2483	subsequent matching failure. Any capturing parentheses that are set during the
				2484	subroutine call revert to their previous values afterwards.
				2485	.P
				2486	Processing options such as case-independence are fixed when a subpattern is
				2487	defined, so if it is used as a subroutine, such options cannot be changed for
				2488	different calls. For example, consider this pattern:
				2489	.sp
				2490	(abc)(?i:(?-1))
				2491	.sp
				2492	It matches "abcabc". It does not match "abcABC" because the change of
				2493	processing option does not affect the called subpattern.
				2494	.
				2495	.
				2496	.\" HTML <a name="onigurumasubroutines"></a>
				2497	.SH "ONIGURUMA SUBROUTINE SYNTAX"
				2498	.rs
				2499	.sp
				2500	For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
				2501	a number enclosed either in angle brackets or single quotes, is an alternative
				2502	syntax for referencing a subpattern as a subroutine, possibly recursively. Here
				2503	are two of the examples used above, rewritten using this syntax:
				2504	.sp
				2505	(?<pn> \e( ( (?>[^()]+) \| \eg<pn> )* \e) )
				2506	(sens\|respons)e and \eg'1'ibility
				2507	.sp
				2508	PCRE supports an extension to Oniguruma: if a number is preceded by a
				2509	plus or a minus sign it is taken as a relative reference. For example:
				2510	.sp
				2511	(abc)(?i:\eg<-1>)
				2512	.sp
				2513	Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
				2514	synonymous. The former is a back reference; the latter is a subroutine call.
				2515	.
				2516	.
				2517	.SH CALLOUTS
				2518	.rs
				2519	.sp
				2520	Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
				2521	code to be obeyed in the middle of matching a regular expression. This makes it
				2522	possible, amongst other things, to extract different substrings that match the
				2523	same pair of parentheses when there is a repetition.
				2524	.P
				2525	PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
				2526	code. The feature is called "callout". The caller of PCRE provides an external
				2527	function by putting its entry point in the global variable \fIpcre_callout\fP.
				2528	By default, this variable contains NULL, which disables all calling out.
				2529	.P
				2530	Within a regular expression, (?C) indicates the points at which the external
				2531	function is to be called. If you want to identify different callout points, you
				2532	can put a number less than 256 after the letter C. The default value is zero.
				2533	For example, this pattern has two callout points:
				2534	.sp
				2535	(?C1)abc(?C2)def
				2536	.sp
				2537	If the PCRE_AUTO_CALLOUT flag is passed to \fBpcre_compile()\fP, callouts are
				2538	automatically installed before each item in the pattern. They are all numbered
				2539	255.
				2540	.P
				2541	During matching, when PCRE reaches a callout point (and \fIpcre_callout\fP is
				2542	set), the external function is called. It is provided with the number of the
				2543	callout, the position in the pattern, and, optionally, one item of data
				2544	originally supplied by the caller of \fBpcre_exec()\fP. The callout function
				2545	may cause matching to proceed, to backtrack, or to fail altogether. A complete
				2546	description of the interface to the callout function is given in the
				2547	.\" HREF
				2548	\fBpcrecallout\fP
				2549	.\"
				2550	documentation.
				2551	.
				2552	.
				2553	.\" HTML <a name="backtrackcontrol"></a>
				2554	.SH "BACKTRACKING CONTROL"
				2555	.rs
				2556	.sp
				2557	Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
				2558	are described in the Perl documentation as "experimental and subject to change
				2559	or removal in a future version of Perl". It goes on to say: "Their usage in
				2560	production code should be noted to avoid problems during upgrades." The same
				2561	remarks apply to the PCRE features described in this section.
				2562	.P
				2563	Since these verbs are specifically related to backtracking, most of them can be
				2564	used only when the pattern is to be matched using \fBpcre_exec()\fP, which uses
				2565	a backtracking algorithm. With the exception of (*FAIL), which behaves like a
				2566	failing negative assertion, they cause an error if encountered by
				2567	\fBpcre_dfa_exec()\fP.
				2568	.P
				2569	If any of these verbs are used in an assertion or in a subpattern that is
				2570	called as a subroutine (whether or not recursively), their effect is confined
				2571	to that subpattern; it does not extend to the surrounding pattern, with one
				2572	exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
				2573	a successful positive assertion \fIis\fP passed back when a match succeeds
				2574	(compare capturing parentheses in assertions). Note that such subpatterns are
				2575	processed as anchored at the point where they are tested. Note also that Perl's
				2576	treatment of subroutines is different in some cases.
				2577	.P
				2578	The new verbs make use of what was previously invalid syntax: an opening
				2579	parenthesis followed by an asterisk. They are generally of the form
				2580	(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
				2581	depending on whether or not an argument is present. A name is any sequence of
				2582	characters that does not include a closing parenthesis. If the name is empty,
				2583	that is, if the closing parenthesis immediately follows the colon, the effect
				2584	is as if the colon were not there. Any number of these verbs may occur in a
				2585	pattern.
				2586	.P
				2587	PCRE contains some optimizations that are used to speed up matching by running
				2588	some checks at the start of each match attempt. For example, it may know the
				2589	minimum length of matching subject, or that a particular character must be
				2590	present. When one of these optimizations suppresses the running of a match, any
				2591	included backtracking verbs will not, of course, be processed. You can suppress
				2592	the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
				2593	when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
				2594	pattern with (*NO_START_OPT).
				2595	.P
				2596	Experiments with Perl suggest that it too has similar optimizations, sometimes
				2597	leading to anomalous results.
				2598	.
				2599	.
				2600	.SS "Verbs that act immediately"
				2601	.rs
				2602	.sp
				2603	The following verbs act as soon as they are encountered. They may not be
				2604	followed by a name.
				2605	.sp
				2606	(*ACCEPT)
				2607	.sp
				2608	This verb causes the match to end successfully, skipping the remainder of the
				2609	pattern. However, when it is inside a subpattern that is called as a
				2610	subroutine, only that subpattern is ended successfully. Matching then continues
				2611	at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
				2612	far is captured. For example:
				2613	.sp
				2614	A((?:A|B(*ACCEPT)|C)D)
				2615	.sp
				2616	This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
				2617	the outer parentheses.
				2618	.sp
				2619	(*FAIL) or (*F)
				2620	.sp
				2621	This verb causes a matching failure, forcing backtracking to occur. It is
				2622	equivalent to (?!) but easier to read. The Perl documentation notes that it is
				2623	probably useful only when combined with (?{}) or (??{}). Those are, of course,
				2624	Perl features that are not present in PCRE. The nearest equivalent is the
				2625	callout feature, as for example in this pattern:
				2626	.sp
				2627	a+(?C)(*FAIL)
				2628	.sp
				2629	A match with the string "aaaa" always fails, but the callout is taken before
				2630	each backtrack happens (in this example, 10 times).
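.P
The figure of 10 callouts can be checked by hand with a small simulation. The
following Python sketch is a toy model written for this one pattern, not PCRE
itself: the function name count_callouts is hypothetical, and the model assumes
that no start-of-match optimization suppresses any attempt. It tries each
starting position in turn and counts one callout for every length that the
greedy a+ can take before (*FAIL) forces it to give back a character.

```python
def count_callouts(subject):
    """Count callouts for a+(?C)(*FAIL) in a toy backtracking model."""
    callouts = 0
    # Unanchored matching: try every starting position in turn ("bumpalong").
    for start in range(len(subject) + 1):
        # Greedy a+ first takes the longest run of "a" from this position.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Longest run first, then one "a" fewer each time. a+ needs at
        # least one character, so the loop stops before the run is empty.
        for _length in range(end - start, 0, -1):
            callouts += 1  # (?C) is reached just before (*FAIL)
            # (*FAIL) always fails here, forcing a backtrack into a+.
    return callouts


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```

In this model the four starting positions inside "aaaa" contribute 4, 3, 2, and
1 callouts respectively, and the attempt at the end of the subject contributes
none.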
				2631	.
				2632	.
				2633	.SS "Recording which path was taken"
				2634	.rs
				2635	.sp
				2636	There is one verb whose main purpose is to track how a match was arrived at,
				2637	though it also has a secondary use in conjunction with advancing the match
				2638	starting point (see (*SKIP) below).
				2639	.sp
				2640	(*MARK:NAME) or (*:NAME)
				2641	.sp
				2642	A name is always required with this verb. There may be as many instances of
				2643	(*MARK) as you like in a pattern, and their names do not have to be unique.
				2644	.P
				2645	When a match succeeds, the name of the last-encountered (*MARK) on the matching
				2646	path is passed back to the caller via the \fIpcre_extra\fP data structure, as
				2647	described in the
				2648	.\" HTML <a href="pcreapi.html#extradata">
				2649	.\" </a>
				2650	section on \fIpcre_extra\fP
				2651	.\"
				2652	in the
				2653	.\" HREF
				2654	\fBpcreapi\fP
				2655	.\"
				2656	documentation. Here is an example of \fBpcretest\fP output, where the /K
				2657	modifier requests the retrieval and outputting of (*MARK) data:
				2658	.sp
				2659	    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
				2660	  data> XY
				2661	   0: XY
				2662	  MK: A
				2663	  XZ
				2664	   0: XZ
				2665	  MK: B
				2666	.sp
				2667	The (*MARK) name is tagged with "MK:" in this output, and in this example it
				2668	indicates which of the two alternatives matched. This is a more efficient way
				2669	of obtaining this information than putting each alternative in its own
				2670	capturing parentheses.
				2671	.P
				2672	If (*MARK) is encountered in a positive assertion, its name is recorded and
				2673	passed back if it is the last-encountered. This does not happen for negative
				2674	assertions.
				2675	.P
				2676	After a partial match or a failed match, the name of the last encountered
				2677	(*MARK) in the entire match process is returned. For example:
				2678	.sp
				2679	    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
				2680	  data> XP
				2681	  No match, mark = B
				2682	.sp
				2683	Note that in this unanchored example the mark is retained from the match
				2684	attempt that started at the letter "X". Subsequent match attempts starting at
				2685	"P" and then with an empty string do not get as far as the (*MARK) item, but
				2686	nevertheless do not reset it.
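.P
The way the mark is set and retained can be illustrated with a toy model. The
Python sketch below is hand-written for the two-alternative pattern used in the
examples above and does not call PCRE; the function name match_with_mark and
the tuple layout are assumptions made for illustration. It records a name each
time a (*MARK) would be passed and never resets it between match attempts.

```python
def match_with_mark(subject):
    """Toy model of X(*MARK:A)Y|X(*MARK:B)Z; returns (matched_text, mark)."""
    # Each alternative is (text before the mark, mark name, text after it).
    alternatives = [("X", "A", "Y"), ("X", "B", "Z")]
    mark = None  # last (*MARK) name encountered; kept across attempts
    for start in range(len(subject) + 1):
        for head, name, tail in alternatives:
            if not subject.startswith(head, start):
                continue  # the mark is not reached in this alternative
            mark = name  # (*MARK:NAME) is passed once the head has matched
            if subject.startswith(tail, start + len(head)):
                return head + tail, mark
    return None, mark


print(match_with_mark("XY"))
print(match_with_mark("XZ"))
print(match_with_mark("XP"))
```

For "XP" the model reports no match with the mark left at "B", because the
later attempts starting at "P" and at the end of the subject never reach a
(*MARK) item.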
				2687	.
				2688	.
				2689	.SS "Verbs that act after backtracking"
				2690	.rs
				2691	.sp
				2692	The following verbs do nothing when they are encountered. Matching continues
				2693	with what follows, but if there is no subsequent match, causing a backtrack to
				2694	the verb, a failure is forced. That is, backtracking cannot pass to the left of
				2695	the verb. However, when one of these verbs appears inside an atomic group, its
				2696	effect is confined to that group, because once the group has been matched,
				2697	there is never any backtracking into it. In this situation, backtracking can
				2698	"jump back" to the left of the entire atomic group. (Remember also, as stated
				2699	above, that this localization also applies in subroutine calls and assertions.)
				2700	.P
				2701	These verbs differ in exactly what kind of failure occurs when backtracking
				2702	reaches them.
				2703	.sp
				2704	(*COMMIT)
				2705	.sp
				2706	This verb, which may not be followed by a name, causes the whole match to fail
				2707	outright if the rest of the pattern does not match. Even if the pattern is
				2708	unanchored, no further attempts to find a match by advancing the starting point
				2709	take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to
				2710	finding a match at the current starting point, or not at all. For example:
				2711	.sp
				2712	a+(*COMMIT)b
				2713	.sp
				2714	This matches "xxaab" but not "aacaab". It can be thought of as a kind of
				2715	dynamic anchor, or "I've started, so I must finish." The name of the most
				2716	recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
				2717	match failure.
				2718	.P
				2719	Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
				2720	unless PCRE's start-of-match optimizations are turned off, as shown in this
				2721	\fBpcretest\fP example:
				2722	.sp
				2723	    re> /(*COMMIT)abc/
				2724	  data> xyzabc
				2725	   0: abc
				2726	  xyzabc\eY
				2727	  No match
				2728	.sp
				2729	PCRE knows that any match must start with "a", so the optimization skips along
				2730	the subject to "a" before running the first match attempt, which succeeds. When
				2731	the optimization is disabled by the \eY escape in the second subject, the match
				2732	starts at "x" and so the (*COMMIT) causes it to fail without trying any other
				2733	starting points.
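.P
The interaction between (*COMMIT) and the start-of-match optimization can be
sketched as follows. This Python fragment is a simplified model of the one
pattern shown above, not a description of how PCRE is implemented; the function
name search_commit_abc and the start_optimize parameter are hypothetical, and
the only optimization modelled is skipping ahead to the first "a".

```python
def search_commit_abc(subject, start_optimize=True):
    """Toy model of the unanchored pattern (*COMMIT)abc."""
    start = 0
    if start_optimize:
        # Modelled optimization: any match must begin with "a", so skip
        # along the subject to the first "a" before the first attempt.
        start = subject.find("a")
        if start < 0:
            return None  # no "a" at all, so no match attempt is run
    # (*COMMIT) is passed at the very start of the first attempt, so this
    # is the only starting position that is ever tried.
    if subject.startswith("abc", start):
        return subject[start:start + 3]
    return None


print(search_commit_abc("xyzabc"))                        # abc
print(search_commit_abc("xyzabc", start_optimize=False))  # None
```

With the optimization modelled, the first attempt begins at "a" and succeeds;
without it, the first attempt begins at "x", and the commit prevents any other
starting point from being tried.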
				2734	.sp
				2735	(*PRUNE) or (*PRUNE:NAME)
				2736	.sp
				2737	This verb causes the match to fail at the current starting position in the
				2738	subject if the rest of the pattern does not match. If the pattern is
				2739	unanchored, the normal "bumpalong" advance to the next starting character then
				2740	happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
				2741	reached, or when matching to the right of (*PRUNE), but if there is no match to
				2742	the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
				2743	(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
				2744	but there are some uses of (*PRUNE) that cannot be expressed in any other way.
				2745	The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
				2746	anchored pattern (*PRUNE) has the same effect as (*COMMIT).
				2747	.sp
				2748	(*SKIP)
				2749	.sp
				2750	This verb, when given without a name, is like (*PRUNE), except that if the
				2751	pattern is unanchored, the "bumpalong" advance is not to the next character,
				2752	but to the position in the subject where (*SKIP) was encountered. (*SKIP)
				2753	signifies that whatever text was matched leading up to it cannot be part of a
				2754	successful match. Consider:
				2755	.sp
				2756	a+(*SKIP)b
				2757	.sp
				2758	If the subject is "aaaac...", after the first match attempt fails (starting at
				2759	the first character in the string), the starting point skips on to start the
				2760	next attempt at "c". Note that a possessive quantifier does not have the same
				2761	effect as this example; although it would suppress backtracking during the
				2762	first match attempt, the second attempt would start at the second character
				2763	instead of skipping on to "c".
				2764	.sp
				2765	(*SKIP:NAME)
				2766	.sp
				2767	When (*SKIP) has an associated name, its behaviour is modified. If the
				2768	following pattern fails to match, the previous path through the pattern is
				2769	searched for the most recent (*MARK) that has the same name. If one is found,
				2770	the "bumpalong" advance is to the subject position that corresponds to that
				2771	(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
				2772	matching name is found, the (*SKIP) is ignored.
				2773	.sp
				2774	(*THEN) or (*THEN:NAME)
				2775	.sp
				2776	This verb causes a skip to the next innermost alternative if the rest of the
				2777	pattern does not match. That is, it cancels pending backtracking, but only
				2778	within the current alternative. Its name comes from the observation that it can
				2779	be used for a pattern-based if-then-else block:
				2780	.sp
				2781	( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
				2782	.sp
				2783	If the COND1 pattern matches, FOO is tried (and possibly further items after
				2784	the end of the group if FOO succeeds); on failure, the matcher skips to the
				2785	second alternative and tries COND2, without backtracking into COND1. The
				2786	behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
				2787	If (*THEN) is not inside an alternation, it acts like (*PRUNE).
				2788	.P
				2789	Note that a subpattern that does not contain a | character is just a part of
				2790	the enclosing alternative; it is not a nested alternation with only one
				2791	alternative. The effect of (*THEN) extends beyond such a subpattern to the
				2792	enclosing alternative. Consider this pattern, where A, B, etc. are complex
				2793	pattern fragments that do not contain any | characters at this level:
				2794	.sp
				2795	A (B(*THEN)C) | D
				2796	.sp
				2797	If A and B are matched, but there is a failure in C, matching does not
				2798	backtrack into A; instead it moves to the next alternative, that is, D.
				2799	However, if the subpattern containing (*THEN) is given an alternative, it
				2800	behaves differently:
				2801	.sp
				2802	A (B(*THEN)C | (*FAIL)) | D
				2803	.sp
				2804	The effect of (*THEN) is now confined to the inner subpattern. After a failure
				2805	in C, matching moves to (*FAIL), which causes the whole subpattern to fail
				2806	because there are no more alternatives to try. In this case, matching does now
				2807	backtrack into A.
				2808	.P
				2809	Note also that a conditional subpattern is not considered as having two
				2810	alternatives, because only one is ever used. In other words, the | character in
				2811	a conditional subpattern has a different meaning. Ignoring white space,
				2812	consider:
				2813	.sp
				2814	^.*? (?(?=a) a | b(*THEN)c )
				2815	.sp
				2816	If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
				2817	it initially matches zero characters. The condition (?=a) then fails, the
				2818	character "b" is matched, but "c" is not. At this point, matching does not
				2819	backtrack to .*? as might perhaps be expected from the presence of the |
				2820	character. The conditional subpattern is part of the single alternative that
				2821	comprises the whole pattern, and so the match fails. (If there were a backtrack
				2822	into .*?, allowing it to match "b", the match would succeed.)
				2823	.P
				2824	The verbs just described provide four different "strengths" of control when
				2825	subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
				2826	next alternative. (*PRUNE) comes next, failing the match at the current
				2827	starting position, but allowing an advance to the next character (for an
				2828	unanchored pattern). (*SKIP) is similar, except that the advance may be more
				2829	than one character. (*COMMIT) is the strongest, causing the entire match to
				2830	fail.
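.P
The difference between three of these strengths shows up in which starting
positions are attempted. The Python sketch below is a toy model of a pattern of
the form a+ VERB b, written only for illustration: it does not use PCRE, the
function name starts_tried and the verb strings are assumptions, (*THEN) is not
modelled because the pattern has no alternation, and start-of-match
optimizations are ignored.

```python
def starts_tried(subject, verb=None):
    """Toy model of a+ VERB b: return (start positions tried, match)."""
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        # Greedy a+ takes the longest run of "a" from this position.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            start += 1  # a+ failed before reaching the verb: bumpalong
            continue
        # Without a verb, a+ may give back characters one at a time.
        # With a verb, backtracking cannot cross it, so only the longest
        # run is ever followed by an attempt to match "b".
        shortest = start + 1 if verb is None else end
        for pos in range(end, shortest - 1, -1):
            if subject.startswith("b", pos):
                return tried, subject[start:pos + 1]
        if verb == "COMMIT":
            return tried, None  # no further starting points are tried
        # (*SKIP) restarts where the verb was encountered; otherwise the
        # advance is by one character.
        start = end if verb == "SKIP" else start + 1
    return tried, None


for verb in (None, "PRUNE", "SKIP", "COMMIT"):
    print(verb, starts_tried("aaaac", verb))
```

For the subject "aaaac" the model tries positions 0 to 5 with no verb or with
(*PRUNE), positions 0, 4, and 5 with (*SKIP), and only position 0 with
(*COMMIT). None of them finds a match.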
				2831	.P
				2832	If more than one such verb is present in a pattern, the "strongest" one wins.
				2833	For example, consider this pattern, where A, B, etc. are complex pattern
				2834	fragments:
				2835	.sp
				2836	(A(*COMMIT)B(*THEN)C|D)
				2837	.sp
				2838	Once A has matched, PCRE is committed to this match, at the current starting
				2839	position. If subsequently B matches, but C does not, the normal (*THEN) action
				2840	of trying the next alternative (that is, D) does not happen because (*COMMIT)
				2841	overrides.
				2842	.
				2843	.
				2844	.SH "SEE ALSO"
				2845	.rs
				2846	.sp
				2847	\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
				2848	\fBpcresyntax\fP(3), \fBpcre\fP(3).
				2849	.
				2850	.
				2851	.SH AUTHOR
				2852	.rs
				2853	.sp
				2854	.nf
				2855	Philip Hazel
				2856	University Computing Service
				2857	Cambridge CB2 3QH, England.
				2858	.fi
				2859	.
				2860	.
				2861	.SH REVISION
				2862	.rs
				2863	.sp
				2864	.nf
				2865	Last updated: 29 November 2011
				2866	Copyright (c) 1997-2011 University of Cambridge.
				2867	.fi