Blame - jni/libpcre/sources/HACKING - jami-client-android

Tristan Matthews	0461646	2013-11-14 16:09:34 -0500	1	Technical Notes about PCRE
				2	--------------------------
				3
				4	These are very rough technical notes that record potentially useful information
				5	about PCRE internals. For information about testing PCRE, see the pcretest
				6	documentation and the comment at the head of the RunTest file.
				7
				8
				9	Historical note 1
				10	-----------------
				11
				12	Many years ago I implemented some regular expression functions to an algorithm
				13	suggested by Martin Richards. These were not Unix-like in form, and were quite
				14	restricted in what they could do by comparison with Perl. The interesting part
				15	about the algorithm was that the amount of space required to hold the compiled
				16	form of an expression was known in advance. The code to apply an expression did
				17	not operate by backtracking, as the original Henry Spencer code and current
				18	Perl code do, but instead checked all possibilities simultaneously by keeping
				19	a list of current states and checking all of them as it advanced through the
				20	subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
				21	algorithm", though it was not a traditional Finite State Machine (FSM). When
				22	the pattern was all used up, all remaining states were possible matches, and
				23	the one matching the longest subset of the subject string was chosen. This did
				24	not necessarily maximize the individual wild portions of the pattern, as is
				25	expected in Unix and Perl-style regular expressions.
				26
				27
				28	Historical note 2
				29	-----------------
				30
				31	By contrast, the code originally written by Henry Spencer (which was
				32	subsequently heavily modified for Perl) compiles the expression twice: once in
				33	a dummy mode in order to find out how much store will be needed, and then for
				34	real. (The Perl version probably doesn't do this any more; I'm talking about
				35	the original library.) The execution function operates by backtracking and
				36	maximizing (or, optionally, minimizing in Perl) the amount of the subject that
				37	matches individual wild portions of the pattern. This is an "NFA algorithm" in
				38	Friedl's terminology.
				39
				40
				41	OK, here's the real stuff
				42	-------------------------
				43
				44	For the set of functions that form the "basic" PCRE library (which are
				45	unrelated to those mentioned above), I tried at first to invent an algorithm
				46	that used an amount of store bounded by a multiple of the number of characters
				47	in the pattern, to save on compiling time. However, because of the greater
				48	complexity in Perl regular expressions, I couldn't do this. In any case, a
				49	first pass through the pattern is helpful for other reasons.
				50
				51
				52	Computing the memory requirement: how it was
				53	--------------------------------------------
				54
				55	Up to and including release 6.7, PCRE worked by running a very degenerate first
				56	pass to calculate a maximum store size, and then a second pass to do the real
				57	compile - which might use a bit less than the predicted amount of memory. The
				58	idea was that this would turn out faster than the Henry Spencer code because
				59	the first pass is degenerate and the second pass can just store stuff straight
				60	into the vector, which it knows is big enough.
				61
				62
				63	Computing the memory requirement: how it is
				64	-------------------------------------------
				65
				66	By the time I was working on a potential 6.8 release, the degenerate first pass
				67	had become very complicated and hard to maintain. Indeed one of the early
				68	things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
				69	I had a flash of inspiration as to how I could run the real compile function in
				70	a "fake" mode that enables it to compute how much memory it would need, while
				71	actually only ever using a few hundred bytes of working memory, and without too
				72	many tests of the mode that might slow it down. So I refactored the compiling
				73	functions to work this way. This got rid of about 600 lines of source. It
				74	should make future maintenance and development easier. As this was such a major
				75	change, I never released 6.8, instead upping the number to 7.0 (other quite
				76	major changes were also present in the 7.0 release).
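
The "fake" mode idea can be illustrated with a minimal sketch. This is not the
PCRE compile function: the emitter type, the function names, and the one-opcode
toy encoding are all assumptions made for illustration. The point is only that
the same code path runs twice, once counting bytes and once storing them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical emitter: when buf is NULL we are in the "fake" pass and
   only count bytes; otherwise the bytes are stored as well. */
typedef struct {
    unsigned char *buf;   /* NULL during the sizing pass */
    size_t length;        /* bytes emitted, or that would be emitted */
} toy_emitter;

static void toy_emit(toy_emitter *e, unsigned char b)
{
    if (e->buf != NULL) e->buf[e->length] = b;
    e->length++;
}

/* Toy "compiler": a stand-in opcode byte plus the character itself for
   every pattern character, then a terminator byte. Returns the length. */
static size_t toy_compile(const char *pattern, unsigned char *out)
{
    toy_emitter e = { out, 0 };
    assert(pattern != NULL);
    for (; *pattern != '\0'; pattern++) {
        toy_emit(&e, 1);                        /* stand-in for OP_CHAR */
        toy_emit(&e, (unsigned char)*pattern);
    }
    toy_emit(&e, 0);                            /* stand-in for OP_END */
    return e.length;
}
```

A caller would first call toy_compile(pattern, NULL) to learn the size,
allocate that many bytes, and then call it again with the real buffer.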
				77
				78	A side effect of this work was that the previous limit of 200 on the nesting
				79	depth of parentheses was removed. However, there is a downside: pcre_compile()
				80	runs more slowly than before (30% or more, depending on the pattern) because it
				81	is doing a full analysis of the pattern. My hope was that this would not be a
				82	big issue, and in the event, nobody has commented on it.
				83
				84
				85	Traditional matching function
				86	-----------------------------
				87
				88	The "traditional", and original, matching function is called pcre_exec(), and
				89	it implements an NFA algorithm, similar to the original Henry Spencer algorithm
				90	and the way that Perl works. This is not surprising, since it is intended to be
				91	as compatible with Perl as possible. This is the function most users of PCRE
				92	will use most of the time. From release 8.20, if PCRE is compiled with
				93	just-in-time (JIT) support, and studying a compiled pattern with JIT is
				94	successful, the JIT code is run instead of the normal pcre_exec() code, but the
				95	result is the same.
				96
				97
				98	Supplementary matching function
				99	-------------------------------
				100
				101	From PCRE 6.0, there is also a supplementary matching function called
				102	pcre_dfa_exec(). This implements a DFA matching algorithm that searches
				103	simultaneously for all possible matches that start at one point in the subject
				104	string. (Going back to my roots: see Historical Note 1 above.) This function
				105	interprets the same compiled pattern data as pcre_exec(); however, not all the
				106	facilities are available, and those that are do not always work in quite the
				107	same way. See the user documentation for details.
				108
				109	The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
				110	because it may have a number of states active at one time. More work would be
				111	needed at compile time to produce a traditional FSM where only one state is
				112	ever active at once. I believe some other regex matchers work this way.
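
To make "a number of states active at one time" concrete, here is a small
sketch of that style of matching. It is not the pcre_dfa_exec() code. The toy
pattern syntax (literal characters, each optionally followed by '*'), the
function names, and the 64-character limit are assumptions for illustration.
A state is simply an index into the pattern string.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_PATTERN 64

/* Mark state i as active. A starred item may match zero times, so the
   state after it becomes active as well. */
static void toy_add_state(const char *pat, unsigned char *set, size_t i)
{
    for (;;) {
        if (set[i]) return;
        set[i] = 1;
        if (pat[i] != '\0' && pat[i + 1] == '*') i += 2;
        else return;
    }
}

/* Length of the longest match starting at subject[0], or -1 if none. */
static int toy_longest_match(const char *pat, const char *subject)
{
    unsigned char cur[TOY_MAX_PATTERN + 1] = {0};
    unsigned char next[TOY_MAX_PATTERN + 1];
    size_t len = strlen(pat), pos, i;
    int best = -1;

    assert(len <= TOY_MAX_PATTERN);
    toy_add_state(pat, cur, 0);
    for (pos = 0; ; pos++) {
        int any = 0;
        if (cur[len]) best = (int)pos;    /* end-of-pattern state is active */
        if (subject[pos] == '\0') break;
        memset(next, 0, sizeof next);
        for (i = 0; i < len; i++) {       /* advance every active state */
            if (!cur[i] || pat[i] != subject[pos]) continue;
            if (pat[i + 1] == '*') toy_add_state(pat, next, i);
            else toy_add_state(pat, next, i + 1);
            any = 1;
        }
        if (!any) break;                  /* no state survived */
        memcpy(cur, next, sizeof cur);
    }
    return best;
}
```

Note that the subject is scanned once, left to right, with no backtracking;
for the pattern "a*ab" both "stay in the repeat" and "move on" remain active
until the subject decides between them.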
				113
				114
				115	Changeable options
				116	------------------
				117
				118	The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL) may
				119	change in the middle of patterns. From PCRE 8.13, their processing is handled
				120	entirely at compile time by generating different opcodes for the different
				121	settings. The runtime functions do not need to keep track of an options state
				122	any more.
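
A sketch of what "handled entirely at compile time" means for the caseless
option follows. The opcode values, the function name, and the restriction to
"(?i)" and "(?-i)" in a pattern of plain literals are assumptions for
illustration. The option state exists only as a local variable in the
compiler; the output contains nothing but opcodes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_OP_CHAR = 1, TOY_OP_CHARI = 2 };   /* hypothetical values */

/* Compile literals from a toy pattern in which "(?i)" turns caseless
   matching on and "(?-i)" turns it off. Writes two bytes per literal
   (opcode, character) and returns the number of bytes written. */
static size_t toy_compile_case(const char *p, unsigned char *out)
{
    size_t n = 0;
    int caseless = 0;               /* the option state lives only here */

    assert(p != NULL && out != NULL);
    while (*p != '\0') {
        if (strncmp(p, "(?i)", 4) == 0)  { caseless = 1; p += 4; continue; }
        if (strncmp(p, "(?-i)", 5) == 0) { caseless = 0; p += 5; continue; }
        out[n++] = caseless ? TOY_OP_CHARI : TOY_OP_CHAR;
        out[n++] = (unsigned char)*p++;
    }
    return n;
}
```

A matcher reading this output can dispatch on the opcode alone, which is why
it has no options state to save and restore.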
				123
				124
				125	Format of compiled patterns
				126	---------------------------
				127
				128	The compiled form of a pattern is a vector of bytes, containing items of
				129	variable length. The first byte in an item is an opcode, and the length of the
				130	item is either implicit in the opcode or contained in the data bytes that
				131	follow it.
				132
				133	In many cases below LINK_SIZE data values are specified for offsets within the
				134	compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be
				135	compiled to use 3-byte or 4-byte values for these offsets (impairing the
				136	performance). This is necessary only when patterns whose compiled length is
				137	greater than 64K are going to be processed. In this description, we assume the
				138	"normal" compilation options. Data values that are counts (e.g. for
				139	quantifiers) are always just two bytes long.
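
The 64K figure follows directly from the two-byte link. As a sketch (the
names are hypothetical, and storing the most significant byte first is an
assumption here), a link can be written and read like this:

```c
#include <assert.h>

#define TOY_LINK_SIZE 2   /* the default mentioned above */

/* Store an offset in TOY_LINK_SIZE bytes, most significant byte first. */
static void toy_put_link(unsigned char *p, unsigned int value)
{
    assert(value <= 0xffffu);   /* anything larger needs a bigger link */
    p[0] = (unsigned char)(value >> 8);
    p[1] = (unsigned char)(value & 0xff);
}

/* Read the offset back. */
static unsigned int toy_get_link(const unsigned char *p)
{
    return ((unsigned int)p[0] << 8) | p[1];
}
```

The largest value that fits is 65535, so a compiled pattern longer than that
cannot be linked with two bytes.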
				140
				141	Opcodes with no following data
				142	------------------------------
				143
				144	These items are all just one byte long.
				145
				146	  OP_END                 end of pattern
				147	  OP_ANY                 match any one character other than newline
				148	  OP_ALLANY              match any one character, including newline
				149	  OP_ANYBYTE             match any single byte, even in UTF-8 mode
				150	  OP_SOD                 match start of data: \A
				151	  OP_SOM                 start of match (subject + offset): \G
				152	  OP_SET_SOM             set start of match (\K)
				153	  OP_CIRC                ^ (start of data)
				154	  OP_CIRCM               ^ multiline mode (start of data or after newline)
				155	  OP_NOT_WORD_BOUNDARY   \B
				156	  OP_WORD_BOUNDARY       \b
				157	  OP_NOT_DIGIT           \D
				158	  OP_DIGIT               \d
				159	  OP_NOT_HSPACE          \H
				160	  OP_HSPACE              \h
				161	  OP_NOT_WHITESPACE      \S
				162	  OP_WHITESPACE          \s
				163	  OP_NOT_VSPACE          \V
				164	  OP_VSPACE              \v
				165	  OP_NOT_WORDCHAR        \W
				166	  OP_WORDCHAR            \w
				167	  OP_EODN                match end of data or \n at end: \Z
				168	  OP_EOD                 match end of data: \z
				169	  OP_DOLL                $ (end of data, or before final newline)
				170	  OP_DOLLM               $ multiline mode (end of data or before newline)
				171	  OP_EXTUNI              match an extended Unicode character
				172	  OP_ANYNL               match any Unicode newline sequence
				173
				174	  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
				175	  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
				176	  OP_FAIL                ) parentheses, it may be preceded by one or more
				177	  OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
				178	  OP_SKIP                ) indicating which parentheses must be closed.
				179
				180
				181	Backtracking control verbs with (optional) data
				182	-----------------------------------------------
				183
				184	(*THEN) without an argument generates the opcode OP_THEN and no following data.
				185	OP_MARK is followed by the mark name, preceded by a one-byte length, and
				186	followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
				187	the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name
				188	following in the same format.
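
The layout of a named verb item can be sketched as follows. The opcode value
and the function name are hypothetical; only the byte layout (opcode, one-byte
length, name, binary zero) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOY_OP_MARK = 100 };   /* hypothetical opcode value */

/* Write a (*MARK:name) item into out and return its total length:
   opcode, one-byte name length, the name, and a terminating zero. */
static size_t toy_emit_mark(unsigned char *out, const char *name)
{
    size_t len = strlen(name);
    assert(len <= 255);             /* the length must fit in one byte */
    out[0] = TOY_OP_MARK;
    out[1] = (unsigned char)len;
    memcpy(out + 2, name, len);
    out[2 + len] = 0;
    return len + 3;
}
```

Having both a length and a terminator means the name can be skipped in one
step and also handed to C code as an ordinary string.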
				189
				190
				191	Matching literal characters
				192	---------------------------
				193
				194	The OP_CHAR opcode is followed by a single character that is to be matched
				195	casefully. For caseless matching, OP_CHARI is used. In UTF-8 mode, the
				196	character may be more than one byte long. (Earlier versions of PCRE used
				197	multi-character strings, but this was changed to allow some new features to be
				198	added.)
				199
				200
				201	Repeating single characters
				202	---------------------------
				203
				204	The common repeats (*, +, ?) when applied to a single character use the
				205	following opcodes, which come in caseful and caseless versions:
				206
				207	  Caseful         Caseless
				208	  OP_STAR         OP_STARI
				209	  OP_MINSTAR      OP_MINSTARI
				210	  OP_POSSTAR      OP_POSSTARI
				211	  OP_PLUS         OP_PLUSI
				212	  OP_MINPLUS      OP_MINPLUSI
				213	  OP_POSPLUS      OP_POSPLUSI
				214	  OP_QUERY        OP_QUERYI
				215	  OP_MINQUERY     OP_MINQUERYI
				216	  OP_POSQUERY     OP_POSQUERYI
				217
				218	In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
				219	Those with "MIN" in their name are the minimizing versions. Those with "POS" in
				220	their names are possessive versions. Each is followed by the character that is
				221	to be repeated. Other repeats make use of these opcodes:
				222
				223	  Caseful         Caseless
				224	  OP_UPTO         OP_UPTOI
				225	  OP_MINUPTO      OP_MINUPTOI
				226	  OP_POSUPTO      OP_POSUPTOI
				227	  OP_EXACT        OP_EXACTI
				228
				229	Each of these is followed by a two-byte count (most significant first) and the
				230	repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
				231	non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
				232	OP_UPTO (or OP_MINUPTO or OP_POSUPTO).
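
As a worked example, x{2,5} becomes OP_EXACT 2 x followed by OP_UPTO 3 x. The
sketch below emits that sequence for a greedy repeat of a single-byte
character. The opcode values and the function name are hypothetical, and the
minimizing and possessive variants are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_OP_UPTO = 10, TOY_OP_EXACT = 11 };   /* hypothetical values */

/* Emit a greedy c{min,max} repeat. Each item is an opcode, a two-byte
   count (most significant byte first), and the repeated character. */
static size_t toy_emit_repeat(unsigned char *out, unsigned char c,
                              unsigned int min, unsigned int max)
{
    size_t n = 0;
    assert(max >= min && max > 0 && max <= 0xffffu);
    if (min > 0) {                      /* the compulsory part */
        out[n++] = TOY_OP_EXACT;
        out[n++] = (unsigned char)(min >> 8);
        out[n++] = (unsigned char)(min & 0xff);
        out[n++] = c;
    }
    if (max > min) {                    /* the optional remainder */
        unsigned int extra = max - min;
        out[n++] = TOY_OP_UPTO;
        out[n++] = (unsigned char)(extra >> 8);
        out[n++] = (unsigned char)(extra & 0xff);
        out[n++] = c;
    }
    return n;
}
```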
				233
				234
				235	Repeating character types
				236	-------------------------
				237
				238	Repeats of things like \d are done exactly as for single characters, except
				239	that instead of a character, the opcode for the type is stored in the data
				240	byte. The opcodes are:
				241
				242	OP_TYPESTAR
				243	OP_TYPEMINSTAR
				244	OP_TYPEPOSSTAR
				245	OP_TYPEPLUS
				246	OP_TYPEMINPLUS
				247	OP_TYPEPOSPLUS
				248	OP_TYPEQUERY
				249	OP_TYPEMINQUERY
				250	OP_TYPEPOSQUERY
				251	OP_TYPEUPTO
				252	OP_TYPEMINUPTO
				253	OP_TYPEPOSUPTO
				254	OP_TYPEEXACT
				255
				256
				257	Match by Unicode property
				258	-------------------------
				259
				260	OP_PROP and OP_NOTPROP are used for positive and negative matches of a
				261	character by testing its Unicode property (the \p and \P escape sequences).
				262	Each is followed by two bytes that encode the desired property as a type and a
				263	value.
				264
				265	Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
				266	three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
				267	value.
				268
				269
				270	Character classes
				271	-----------------
				272
				273	If there is only one character, OP_CHAR or OP_CHARI is used for a positive
				274	class, and OP_NOT or OP_NOTI for a negative one (that is, for something like
				275	[^a]). However, in UTF-8 mode, the use of OP_NOT[I] applies only to characters
				276	with values < 128, because OP_NOT[I] is confined to single bytes.
				277
				278	Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for a
				279	repeated, negated, single-character class. The normal single-character opcodes
				280	(OP_STAR, etc.) are used for a repeated positive single-character class.
				281
				282	When there is more than one character in a class and all the characters are
				283	less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
				284	negative one. In either case, the opcode is followed by a 32-byte bit map
				285	containing a 1 bit for every character that is acceptable. The bits are counted
				286	from the least significant end of each byte. In caseless mode, bits for both
				287	cases are set.
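
The 32-byte map holds 256 bits, one per character value. A minimal sketch of
setting and testing bits, counting from the least significant end of each
byte, is shown here; the type and function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>

typedef struct { unsigned char bits[32]; } toy_class;   /* 256 bits */

/* Mark character ch as acceptable. */
static void toy_class_add(toy_class *c, unsigned char ch)
{
    c->bits[ch / 8] |= (unsigned char)(1u << (ch % 8));
}

/* In caseless mode the bits for both cases are set. */
static void toy_class_add_caseless(toy_class *c, unsigned char ch)
{
    toy_class_add(c, (unsigned char)tolower(ch));
    toy_class_add(c, (unsigned char)toupper(ch));
}

/* Return 1 if ch is in the class, 0 otherwise. */
static int toy_class_has(const toy_class *c, unsigned char ch)
{
    return (c->bits[ch / 8] >> (ch % 8)) & 1;
}
```

For example, 'a' has the value 97, so it lands in byte 12 (97 / 8) at bit 1
(97 % 8).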
				288
				289	The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
				290	subject characters with values greater than 255 can be handled correctly. For
				291	OP_CLASS they do not match, whereas for OP_NCLASS they do.
				292
				293	For classes containing characters with values > 255, OP_XCLASS is used. It
				294	optionally uses a bit map (if any characters lie within it), followed by a list
				295	of pairs (for a range) and single characters. In caseless mode, both cases are
				296	explicitly listed. There is a flag character that indicates whether it is a
				297	positive or a negative class.
				298
				299
				300	Back references
				301	---------------
				302
				303	OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes containing the
				304	reference number.
				305
				306
				307	Repeating character classes and back references
				308	-----------------------------------------------
				309
				310	Single-character classes are handled specially (see above). This section
				311	applies to OP_CLASS and OP_REF[I]. In both cases, the repeat information
				312	follows the base item. The matching code looks at the following opcode to see
				313	if it is one of
				314
				315	OP_CRSTAR
				316	OP_CRMINSTAR
				317	OP_CRPLUS
				318	OP_CRMINPLUS
				319	OP_CRQUERY
				320	OP_CRMINQUERY
				321	OP_CRRANGE
				322	OP_CRMINRANGE
				323
				324	All but the last two are just single-byte items. The others are followed by
				325	four bytes of data, comprising the minimum and maximum repeat counts. There are
				326	no special possessive opcodes for these repeats; a possessive repeat is
				327	compiled into an atomic group.
				328
				329
				330	Brackets and alternation
				331	------------------------
				332
				333	A pair of non-capturing (round) brackets is wrapped round each expression at
				334	compile time, so alternation always happens in the context of brackets.
				335
				336	[Note for North Americans: "bracket" to some English speakers, including
				337	myself, can be round, square, curly, or pointy. Hence this usage.]
				338
				339	Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
				340	capturing brackets and it used a different opcode for each one. From release
				341	3.5, the limit was removed by putting the bracket number into the data for
				342	higher-numbered brackets. From release 7.0 all capturing brackets are handled
				343	this way, using the single opcode OP_CBRA.
				344
				345	A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
				346	next alternative OP_ALT or, if there aren't any branches, to the matching
				347	OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
				348	the next one, or to the OP_KET opcode. For capturing brackets, the bracket
				349	number immediately follows the offset, always as a 2-byte item.
				350
				351	OP_KET is used for subpatterns that do not repeat indefinitely, while
				352	OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
				353	maximally respectively (see below for possessive repetitions). All three are
				354	followed by LINK_SIZE bytes giving (as a positive number) the offset back to
				355	the matching bracket opcode.
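
The chain of links can be followed with very little code. The sketch below
walks a hand-assembled vector for something like (?:a|b|c). The opcode values
and names are hypothetical, LINK_SIZE is taken to be 2, and it is an
assumption here that each link is relative to the opcode that carries it.

```c
#include <assert.h>

enum { TOY_OP_CHAR = 1, TOY_OP_BRA = 2, TOY_OP_ALT = 3, TOY_OP_KET = 4 };

/* Hand-assembled group with three alternatives. */
static const unsigned char toy_group[] = {
    TOY_OP_BRA, 0, 5, TOY_OP_CHAR, 'a',   /* offset  0: next is ALT at 5  */
    TOY_OP_ALT, 0, 5, TOY_OP_CHAR, 'b',   /* offset  5: next is ALT at 10 */
    TOY_OP_ALT, 0, 5, TOY_OP_CHAR, 'c',   /* offset 10: next is KET at 15 */
    TOY_OP_KET, 0, 15                     /* offset 15: back to the BRA   */
};

/* Read a two-byte value, most significant byte first. */
static unsigned int toy_get2(const unsigned char *p)
{
    return ((unsigned int)p[0] << 8) | p[1];
}

/* Count the alternatives of the group that starts at bra. */
static int toy_count_branches(const unsigned char *bra)
{
    int n = 1;
    const unsigned char *p = bra + toy_get2(bra + 1);
    while (*p == TOY_OP_ALT) {
        n++;
        p += toy_get2(p + 1);
    }
    assert(*p == TOY_OP_KET);
    assert(p - toy_get2(p + 1) == bra);   /* KET points back to the start */
    return n;
}
```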
				356
				357	If a subpattern is quantified such that it is permitted to match zero times, it
				358	is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
				359	single-byte opcodes that tell the matcher that skipping the following
				360	subpattern entirely is a valid branch. In the case of the first two, not
				361	skipping the pattern is also valid (greedy and non-greedy). The third is used
				362	when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
				363	because it may be called as a subroutine from elsewhere in the regex.
				364
				365	A subpattern with an indefinite maximum repetition is replicated in the
				366	compiled data its minimum number of times (or once with OP_BRAZERO if the
				367	minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
				368	as appropriate.
				369
				370	A subpattern with a bounded maximum repetition is replicated in a nested
				371	fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
				372	before each replication after the minimum, so that, for example, (abc){2,5} is
				373	compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
				374	has the same number.
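
The nesting is easier to see when generated mechanically. The sketch below
produces the textual equivalent of the replication for a group repeated
{min,max} times. It works on strings, not on compiled code, and the function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to out, checking that it fits in a buffer of the given size. */
static void toy_append(char *out, size_t size, const char *s)
{
    assert(strlen(out) + strlen(s) < size);
    strcat(out, s);
}

/* Write the expansion of (group){min,max} into out. */
static void toy_expand_bounded(const char *group, unsigned int min,
                               unsigned int max, char *out, size_t size)
{
    unsigned int i, extras;
    assert(max >= min && size > 0);
    out[0] = '\0';
    extras = max - min;
    for (i = 0; i < min; i++) {             /* compulsory copies */
        toy_append(out, size, "(");
        toy_append(out, size, group);
        toy_append(out, size, ")");
    }
    for (i = 0; i < extras; i++) {          /* optional copies, nested */
        if (i + 1 < extras) toy_append(out, size, "(");
        toy_append(out, size, "(");
        toy_append(out, size, group);
        toy_append(out, size, ")");
        if (i + 1 == extras) toy_append(out, size, "?");
    }
    for (i = 1; i < extras; i++)            /* close the nesting */
        toy_append(out, size, ")?");
}
```

Calling toy_expand_bounded("abc", 2, 5, buf, sizeof buf) yields the string
given in the paragraph above.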
				375
				376	When a repeated subpattern has an unbounded upper limit, it is checked to see
				377	whether it could match an empty string. If this is the case, the opcode in the
				378	final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
				379	that it needs to check for matching an empty string when it hits OP_KETRMIN or
				380	OP_KETRMAX, and if so, to break the loop.
				381
				382	Possessive brackets
				383	-------------------
				384
				385	When a repeated group (capturing or non-capturing) is marked as possessive by
				386	the "+" notation, e.g. (abc)++, different opcodes are used. Their names all
				387	have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCBRAPOS instead
				388	of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum
				389	repetition is zero, the group is preceded by OP_BRAPOSZERO.
				390
				391
				392	Assertions
				393	----------
				394
				395	Forward assertions are just like other subpatterns, but starting with one of
				396	the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
				397	OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
				398	is OP_REVERSE, followed by a two-byte count of the number of characters to move
				399	back the pointer in the subject string. When operating in UTF-8 mode, the count
				400	is a character count rather than a byte count. A separate count is present in
				401	each alternative of a lookbehind assertion, allowing them to have different
				402	fixed lengths.
				403
				404
				405	Once-only (atomic) subpatterns
				406	------------------------------
				407
				408	These are also just like other subpatterns, but they start with the opcode
				409	OP_ONCE. The check for matching an empty string in an unbounded repeat is
				410	handled entirely at runtime, so there is just this one opcode.
				411
				412
				413	Conditional subpatterns
				414	-----------------------
				415
				416	These are like other subpatterns, but they start with the opcode OP_COND, or
				417	OP_SCOND for one that might match an empty string in an unbounded repeat. If
				418	the condition is a back reference, this is stored at the start of the
				419	subpattern using the opcode OP_CREF followed by two bytes containing the
				420	reference number. OP_NCREF is used instead if the reference was generated by
				421	name (so that the runtime code knows to check for duplicate names).
				422
				423	If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
				424	group x" (coded as "(?(Rx)"), the group number is stored at the start of the
				425	subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
				426	zero for "the whole pattern". For a DEFINE condition, just the single byte
				427	OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
				428	always starts with one of the assertions.
				429
				430
				431	Recursion
				432	---------
				433
				434	Recursion either matches the current regex, or some subexpression. The opcode
				435	OP_RECURSE is followed by a value which is the offset to the starting bracket
				436	from the start of the whole pattern. From release 6.5, OP_RECURSE is
				437	automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
				438	broke it). OP_RECURSE is also used for "subroutine" calls, even though they
				439	are not strictly a recursion.
				440
				441
				442	Callout
				443	-------
				444
				445	OP_CALLOUT is followed by one byte of data that holds a callout number in the
				446	range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
				447	cases there follows a two-byte value giving the offset in the pattern to the
				448	start of the following item, and another two-byte item giving the length of the
				449	next item.
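
Decoding a callout item as described is a matter of reading five data bytes.
In this sketch the opcode value and all names are hypothetical, and the
two-byte values are assumed to be stored most significant byte first.

```c
#include <assert.h>

enum { TOY_OP_CALLOUT = 118 };   /* hypothetical opcode value */

typedef struct {
    unsigned int number;        /* 0 to 254 manual, 255 automatic */
    unsigned int next_offset;   /* pattern offset of the following item */
    unsigned int next_length;   /* length of the following item */
} toy_callout;

/* Decode the item starting at p, which must be a callout opcode. */
static toy_callout toy_decode_callout(const unsigned char *p)
{
    toy_callout c;
    assert(p[0] == TOY_OP_CALLOUT);
    c.number = p[1];
    c.next_offset = ((unsigned int)p[2] << 8) | p[3];
    c.next_length = ((unsigned int)p[4] << 8) | p[5];
    return c;
}

/* Automatic callouts all carry the number 255. */
static int toy_callout_is_automatic(const toy_callout *c)
{
    return c->number == 255;
}
```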
				450
				451
				452	Philip Hazel
				453	October 2011