| .TH PCRECPP 3 |
| .SH NAME |
| PCRE - Perl-compatible regular expressions. |
| .SH "SYNOPSIS OF C++ WRAPPER" |
| .rs |
| .sp |
| .B #include <pcrecpp.h> |
| . |
| .SH DESCRIPTION |
| .rs |
| .sp |
| The C++ wrapper for PCRE was provided by Google Inc. Some additional |
| functionality was added by Giuseppe Maxia. This brief man page was constructed |
| from the notes in the \fIpcrecpp.h\fP file, which should be consulted for |
| further details. |
| . |
| . |
| .SH "MATCHING INTERFACE" |
| .rs |
| .sp |
| The "FullMatch" operation checks that supplied text matches a supplied pattern |
| exactly. If pointer arguments are supplied, it copies matched sub-strings that |
| match sub-patterns into them. |
| .sp |
| Example: successful match |
| pcrecpp::RE re("h.*o"); |
| re.FullMatch("hello"); |
| .sp |
| Example: unsuccessful match (requires full match): |
| pcrecpp::RE re("e"); |
| !re.FullMatch("hello"); |
| .sp |
| Example: creating a temporary RE object: |
| pcrecpp::RE("h.*o").FullMatch("hello"); |
| .sp |
| You can pass in a "const char*" or a "string" for "text". The examples below |
| tend to use a const char*. You can, as in the different examples above, store |
| the RE object explicitly in a variable or use a temporary RE object. The |
| examples below use one mode or the other arbitrarily. Either could correctly be |
| used for any of these examples. |
| .P |
| You must supply extra pointer arguments to extract matched subpieces. |
| .sp |
| Example: extracts "ruby" into "s" and 1234 into "i" |
| int i; |
| string s; |
| pcrecpp::RE re("(\e\ew+):(\e\ed+)"); |
| re.FullMatch("ruby:1234", &s, &i); |
| .sp |
| Example: does not try to extract any extra sub-patterns |
| re.FullMatch("ruby:1234", &s); |
| .sp |
| Example: does not try to extract into NULL |
| re.FullMatch("ruby:1234", NULL, &i); |
| .sp |
| Example: integer overflow causes failure |
| !re.FullMatch("ruby:1234567891234", NULL, &i); |
| .sp |
| Example: fails because there aren't enough sub-patterns: |
| !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s); |
| .sp |
| Example: fails because string cannot be stored in integer |
| !pcrecpp::RE("(.*)").FullMatch("ruby", &i); |
| .sp |
| The provided pointer arguments can be pointers to any scalar numeric |
| type, or one of: |
| .sp |
| string (matched piece is copied to string) |
| StringPiece (StringPiece is mutated to point to matched piece) |
| T (where "bool T::ParseFrom(const char*, int)" exists) |
| NULL (the corresponding matched sub-pattern is not copied) |
| .sp |
| The function returns true iff all of the following conditions are satisfied: |
| .sp |
| a. "text" matches "pattern" exactly; |
| .sp |
| b. The number of matched sub-patterns is >= number of supplied |
| pointers; |
| .sp |
| c. The "i"th argument has a suitable type for holding the |
| string captured as the "i"th sub-pattern. If you pass in |
| void * NULL for the "i"th argument, or a non-void * NULL |
| of the correct type, or pass fewer arguments than the |
| number of sub-patterns, "i"th captured sub-pattern is |
| ignored. |
| .sp |
| CAVEAT: An optional sub-pattern that does not exist in the matched |
| string is assigned the empty string. Therefore, the following will |
| return false (because the empty string is not a valid number): |
| .sp |
| int number; |
| pcrecpp::RE::FullMatch("abc", "[a-z]+(\e\ed+)?", &number); |
| .sp |
| The matching interface supports at most 16 arguments per call. |
| If you need more, consider using the more general interface |
| \fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for |
| \fBDoMatch\fP. |
| .P |
| NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a |
| list of optional arguments, as a placeholder for missing arguments, as this can |
| lead to segfaults. |
| . |
| . |
| .SH "QUOTING METACHARACTERS" |
| .rs |
| .sp |
| You can use the "QuoteMeta" operation to insert backslashes before all |
| potentially meaningful characters in a string. The returned string, used as a |
| regular expression, will exactly match the original string. |
| .sp |
| Example: |
| string quoted = RE::QuoteMeta(unquoted); |
| .sp |
| Note that it's legal to escape a character even if it has no special meaning in |
| a regular expression -- so this function does that. (This also makes it |
| identical to the perl function of the same name; see "perldoc -f quotemeta".) |
| For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?". |
| . |
| .SH "PARTIAL MATCHES" |
| .rs |
| .sp |
| You can use the "PartialMatch" operation when you want the pattern |
| to match any substring of the text. |
| .sp |
| Example: simple search for a string: |
| pcrecpp::RE("ell").PartialMatch("hello"); |
| .sp |
| Example: find first number in a string: |
| int number; |
| pcrecpp::RE re("(\e\ed+)"); |
| re.PartialMatch("x*100 + 20", &number); |
| assert(number == 100); |
| . |
| . |
| .SH "UTF-8 AND THE MATCHING INTERFACE" |
| .rs |
| .sp |
| By default, pattern and text are plain text, one byte per character. The UTF8 |
| flag, passed to the constructor, causes both pattern and string to be treated |
| as UTF-8 text, still a byte stream but potentially multiple bytes per |
| character. In practice, the text is likelier to be UTF-8 than the pattern, but |
| the match returned may depend on the UTF8 flag, so always use it when matching |
| UTF8 text. For example, "." will match one byte normally but with UTF8 set may |
| match up to three bytes of a multi-byte character. |
| .sp |
| Example: |
| pcrecpp::RE_Options options; |
| options.set_utf8(); |
| pcrecpp::RE re(utf8_pattern, options); |
| re.FullMatch(utf8_string); |
| .sp |
| Example: using the convenience function UTF8(): |
| pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8()); |
| re.FullMatch(utf8_string); |
| .sp |
| NOTE: The UTF8 flag is ignored if pcre was not configured with the |
| --enable-utf8 flag. |
| . |
| . |
| .SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE" |
| .rs |
| .sp |
| PCRE defines some modifiers to change the behavior of the regular expression |
| engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to |
| pass such modifiers to a RE class. Currently, the following modifiers are |
| supported: |
| .sp |
| modifier description Perl corresponding |
| .sp |
| PCRE_CASELESS case insensitive match /i |
| PCRE_MULTILINE multiple lines match /m |
| PCRE_DOTALL dot matches newlines /s |
| PCRE_DOLLAR_ENDONLY $ matches only at end N/A |
| PCRE_EXTRA strict escape parsing N/A |
| PCRE_EXTENDED ignore whitespaces /x |
| PCRE_UTF8 handles UTF8 chars built-in |
| PCRE_UNGREEDY reverses * and *? N/A |
| PCRE_NO_AUTO_CAPTURE disables capturing parens N/A (*) |
| .sp |
| (*) Both Perl and PCRE allow non capturing parentheses by means of the |
| "?:" modifier within the pattern itself. e.g. (?:ab|cd) does not |
| capture, while (ab|cd) does. |
| .P |
| For a full account on how each modifier works, please check the |
| PCRE API reference page. |
| .P |
| For each modifier, there are two member functions whose name is made |
| out of the modifier in lowercase, without the "PCRE_" prefix. For |
| instance, PCRE_CASELESS is handled by |
| .sp |
| bool caseless() |
| .sp |
| which returns true if the modifier is set, and |
| .sp |
| RE_Options & set_caseless(bool) |
| .sp |
| which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be |
| accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member |
| functions. Setting \fImatch_limit\fP to a non-zero value will limit the |
| execution of pcre to keep it from doing bad things like blowing the stack or |
| taking an eternity to return a result. A value of 5000 is good enough to stop |
| stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables |
| match limiting. Alternatively, you can call \fBmatch_limit_recursion()\fP |
| which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE |
| recurses. \fBmatch_limit()\fP limits the number of matches PCRE does; |
| \fBmatch_limit_recursion()\fP limits the depth of internal recursion, and |
| therefore the amount of stack that is used. |
| .P |
| Normally, to pass one or more modifiers to a RE class, you declare |
| a \fIRE_Options\fP object, set the appropriate options, and pass this |
| object to a RE constructor. Example: |
| .sp |
| RE_Options opt; |
| opt.set_caseless(true); |
| if (RE("HELLO", opt).PartialMatch("hello world")) ... |
| .sp |
| RE_options has two constructors. The default constructor takes no arguments and |
| creates a set of flags that are off by default. The optional parameter |
| \fIoption_flags\fP is to facilitate transfer of legacy code from C programs. |
| This lets you do |
| .sp |
| RE(pattern, |
| RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str); |
| .sp |
| However, new code is better off doing |
| .sp |
| RE(pattern, |
| RE_Options().set_caseless(true).set_multiline(true)) |
| .PartialMatch(str); |
| .sp |
| If you are going to pass one of the most used modifiers, there are some |
| convenience functions that return a RE_Options class with the |
| appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP, |
| \fBMULTILINE()\fP, \fBDOTALL\fP(), and \fBEXTENDED()\fP. |
| .P |
| If you need to set several options at once, and you don't want to go through |
| the pains of declaring a RE_Options object and setting several options, there |
| is a parallel method that give you such ability on the fly. You can concatenate |
| several \fBset_xxxxx()\fP member functions, since each of them returns a |
| reference to its class object. For example, to pass PCRE_CASELESS, |
| PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write: |
| .sp |
| RE(" ^ xyz \e\es+ .* blah$", |
| RE_Options() |
| .set_caseless(true) |
| .set_extended(true) |
| .set_multiline(true)).PartialMatch(sometext); |
| .sp |
| . |
| . |
| .SH "SCANNING TEXT INCREMENTALLY" |
| .rs |
| .sp |
| The "Consume" operation may be useful if you want to repeatedly |
| match regular expressions at the front of a string and skip over |
| them as they match. This requires use of the "StringPiece" type, |
| which represents a sub-range of a real string. Like RE, StringPiece |
| is defined in the pcrecpp namespace. |
| .sp |
| Example: read lines of the form "var = value" from a string. |
| string contents = ...; // Fill string somehow |
| pcrecpp::StringPiece input(contents); // Wrap in a StringPiece |
| .sp |
| string var; |
| int value; |
| pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en"); |
| while (re.Consume(&input, &var, &value)) { |
| ...; |
| } |
| .sp |
| Each successful call to "Consume" will set "var/value", and also |
| advance "input" so it points past the matched text. |
| .P |
| The "FindAndConsume" operation is similar to "Consume" but does not |
| anchor your match at the beginning of the string. For example, you |
| could extract all words from a string by repeatedly calling |
| .sp |
| pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word) |
| . |
| . |
| .SH "PARSING HEX/OCTAL/C-RADIX NUMBERS" |
| .rs |
| .sp |
| By default, if you pass a pointer to a numeric value, the |
| corresponding text is interpreted as a base-10 number. You can |
| instead wrap the pointer with a call to one of the operators Hex(), |
| Octal(), or CRadix() to interpret the text in another base. The |
| CRadix operator interprets C-style "0" (base-8) and "0x" (base-16) |
| prefixes, but defaults to base-10. |
| .sp |
| Example: |
| int a, b, c, d; |
| pcrecpp::RE re("(.*) (.*) (.*) (.*)"); |
| re.FullMatch("100 40 0100 0x40", |
| pcrecpp::Octal(&a), pcrecpp::Hex(&b), |
| pcrecpp::CRadix(&c), pcrecpp::CRadix(&d)); |
| .sp |
| will leave 64 in a, b, c, and d. |
| . |
| . |
| .SH "REPLACING PARTS OF STRINGS" |
| .rs |
| .sp |
| You can replace the first match of "pattern" in "str" with "rewrite". |
| Within "rewrite", backslash-escaped digits (\e1 to \e9) can be |
| used to insert text matching corresponding parenthesized group |
| from the pattern. \e0 in "rewrite" refers to the entire matching |
| text. For example: |
| .sp |
| string s = "yabba dabba doo"; |
| pcrecpp::RE("b+").Replace("d", &s); |
| .sp |
| will leave "s" containing "yada dabba doo". The result is true if the pattern |
| matches and a replacement occurs, false otherwise. |
| .P |
| \fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all |
| occurrences of the pattern in the string with the rewrite. Replacements are |
| not subject to re-matching. For example: |
| .sp |
| string s = "yabba dabba doo"; |
| pcrecpp::RE("b+").GlobalReplace("d", &s); |
| .sp |
| will leave "s" containing "yada dada doo". It returns the number of |
| replacements made. |
| .P |
| \fBExtract\fP is like \fBReplace\fP, except that if the pattern matches, |
| "rewrite" is copied into "out" (an additional argument) with substitutions. |
| The non-matching portions of "text" are ignored. Returns true iff a match |
| occurred and the extraction happened successfully; if no match occurs, the |
| string is left unaffected. |
| . |
| . |
| .SH AUTHOR |
| .rs |
| .sp |
| .nf |
| The C++ wrapper was contributed by Google Inc. |
| Copyright (c) 2007 Google Inc. |
| .fi |
| . |
| . |
| .SH REVISION |
| .rs |
| .sp |
| .nf |
| Last updated: 17 March 2009 |
| Minor typo fixed: 25 July 2011 |
| .fi |