| 1 | <html> |
| 2 | <head> |
| 3 | <title>pcrecpp specification</title> |
| 4 | </head> |
| 5 | <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> |
| 6 | <h1>pcrecpp man page</h1> |
| 7 | <p> |
| 8 | Return to the <a href="index.html">PCRE index page</a>. |
| 9 | </p> |
| 10 | <p> |
| 11 | This page is part of the PCRE HTML documentation. It was generated automatically |
| 12 | from the original man page. If there is any nonsense in it, please consult the |
| 13 | man page, in case the conversion went wrong. |
| 14 | <br> |
| 15 | <ul> |
| 16 | <li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a> |
| 17 | <li><a name="TOC2" href="#SEC2">DESCRIPTION</a> |
| 18 | <li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a> |
| 19 | <li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a> |
| 20 | <li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a> |
| 21 | <li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a> |
| 22 | <li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a> |
| 23 | <li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a> |
| 24 | <li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a> |
| 25 | <li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a> |
| 26 | <li><a name="TOC11" href="#SEC11">AUTHOR</a> |
| 27 | <li><a name="TOC12" href="#SEC12">REVISION</a> |
| 28 | </ul> |
| 29 | <br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br> |
| 30 | <P> |
| 31 | <b>#include &lt;pcrecpp.h&gt;</b> |
| 32 | </P> |
| 33 | <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br> |
| 34 | <P> |
| 35 | The C++ wrapper for PCRE was provided by Google Inc. Some additional |
| 36 | functionality was added by Giuseppe Maxia. This brief man page was constructed |
| 37 | from the notes in the <i>pcrecpp.h</i> file, which should be consulted for |
| 38 | further details. |
| 39 | </P> |
| 40 | <br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br> |
| 41 | <P> |
| 42 | The "FullMatch" operation checks that supplied text matches a supplied pattern |
| 43 | exactly. If pointer arguments are supplied, it copies the sub-strings captured |
| 44 | by sub-patterns into them. |
| 45 | <pre> |
| 46 | Example: successful match |
| 47 | pcrecpp::RE re("h.*o"); |
| 48 | re.FullMatch("hello"); |
| 49 | |
| 50 | Example: unsuccessful match (requires full match): |
| 51 | pcrecpp::RE re("e"); |
| 52 | !re.FullMatch("hello"); |
| 53 | |
| 54 | Example: creating a temporary RE object: |
| 55 | pcrecpp::RE("h.*o").FullMatch("hello"); |
| 56 | </pre> |
| 57 | You can pass in a "const char*" or a "string" for "text". The examples below |
| 58 | tend to use a const char*. You can, as in the different examples above, store |
| 59 | the RE object explicitly in a variable or use a temporary RE object. The |
| 60 | examples below use one mode or the other arbitrarily. Either could correctly be |
| 61 | used for any of these examples. |
| 62 | </P> |
| 63 | <P> |
| 64 | You must supply extra pointer arguments to extract matched subpieces. |
| 65 | <pre> |
| 66 | Example: extracts "ruby" into "s" and 1234 into "i" |
| 67 | int i; |
| 68 | string s; |
| 69 | pcrecpp::RE re("(\\w+):(\\d+)"); |
| 70 | re.FullMatch("ruby:1234", &s, &i); |
| 71 | |
| 72 | Example: does not try to extract any extra sub-patterns |
| 73 | re.FullMatch("ruby:1234", &s); |
| 74 | |
| 75 | Example: does not try to extract into NULL |
| 76 | re.FullMatch("ruby:1234", NULL, &i); |
| 77 | |
| 78 | Example: integer overflow causes failure |
| 79 | !re.FullMatch("ruby:1234567891234", NULL, &i); |
| 80 | |
| 81 | Example: fails because there aren't enough sub-patterns: |
| 82 | !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s); |
| 83 | |
| 84 | Example: fails because string cannot be stored in integer |
| 85 | !pcrecpp::RE("(.*)").FullMatch("ruby", &i); |
| 86 | </pre> |
| 87 | The provided pointer arguments can be pointers to any scalar numeric |
| 88 | type, or one of: |
| 89 | <pre> |
| 90 | string (matched piece is copied to string) |
| 91 | StringPiece (StringPiece is mutated to point to matched piece) |
| 92 | T (where "bool T::ParseFrom(const char*, int)" exists) |
| 93 | NULL (the corresponding matched sub-pattern is not copied) |
| 94 | </pre> |
| 95 | The function returns true iff all of the following conditions are satisfied: |
| 96 | <pre> |
| 97 | a. "text" matches "pattern" exactly; |
| 98 | |
| 99 | b. The number of matched sub-patterns is >= number of supplied |
| 100 | pointers; |
| 101 | |
| 102 | c. The "i"th argument has a suitable type for holding the |
| 103 | string captured as the "i"th sub-pattern. If you pass in |
| 104 | void * NULL for the "i"th argument, or a non-void * NULL |
| 105 | of the correct type, or pass fewer arguments than the |
| 106 | number of sub-patterns, the "i"th captured sub-pattern is |
| 107 | ignored. |
| 108 | </pre> |
| 109 | CAVEAT: An optional sub-pattern that does not exist in the matched |
| 110 | string is assigned the empty string. Therefore, the following will |
| 111 | return false (because the empty string is not a valid number): |
| 112 | <pre> |
| 113 | int number; |
| 114 | pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number); |
| 115 | </pre> |
| 116 | The matching interface supports at most 16 arguments per call. |
| 117 | If you need more, consider using the more general interface |
| 118 | <b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for |
| 119 | <b>DoMatch</b>. |
| 120 | </P> |
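<P>
The failure cases above (empty capture, integer overflow, non-numeric text) can
be pictured with a small stand-alone sketch. ParseCapturedInt is a hypothetical
helper written for this page, not part of pcrecpp, and it makes no claim about
how the wrapper converts captures internally.
</P>

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not pcrecpp code): converts captured text to an int
// and reports failure for the cases the rules above describe.
bool ParseCapturedInt(const std::string& captured, int* out) {
  assert(out != nullptr);  // this sketch does not model the NULL-pointer case
  if (captured.empty()) return false;  // an empty capture is not a number
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(captured.c_str(), &end, 10);
  if (end == captured.c_str() || *end != '\0') return false;  // not all digits
  if (errno == ERANGE) return false;                     // overflows long
  if (value < INT_MIN || value > INT_MAX) return false;  // overflows int
  *out = static_cast<int>(value);
  return true;
}
```

<P>
Under these assumptions, "1234" is stored, while "", "1234567891234" and "ruby"
are all rejected, mirroring the examples above.
</P>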
| 121 | <P> |
| 122 | NOTE: Do not use <b>no_arg</b>, which is used internally to mark the end of a |
| 123 | list of optional arguments, as a placeholder for missing arguments, as this can |
| 124 | lead to segfaults. |
| 125 | </P> |
| 126 | <br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br> |
| 127 | <P> |
| 128 | You can use the "QuoteMeta" operation to insert backslashes before all |
| 129 | potentially meaningful characters in a string. The returned string, used as a |
| 130 | regular expression, will exactly match the original string. |
| 131 | <pre> |
| 132 | Example: |
| 133 | string quoted = pcrecpp::RE::QuoteMeta(unquoted); |
| 134 | </pre> |
| 135 | Note that it's legal to escape a character even if it has no special meaning in |
| 136 | a regular expression, so this function does that. (This also makes it |
| 137 | identical to the Perl function of the same name; see "perldoc -f quotemeta".) |
| 138 | For example, "1.5-2.0?" becomes "1\.5\-2\.0\?". |
| 139 | </P> |
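<P>
As a rough illustration of the escaping rule, the sketch below backslash-escapes
every byte that is not an ASCII letter, digit, or underscore. QuoteMetaSketch is
a hypothetical stand-in, not the library function. Leaving bytes with the high
bit set untouched is an assumption made here so that multi-byte characters
survive, and special cases such as embedded NUL bytes are not handled.
</P>

```cpp
#include <string>

// Hypothetical stand-in for the behaviour described above; not pcrecpp code.
std::string QuoteMetaSketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(unquoted[i]);
    // Assumption: letters, digits and '_' never need escaping, and bytes
    // >= 0x80 are passed through unchanged.
    const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
    if (!plain) quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```

<P>
With this rule, "1.5-2.0?" becomes "1\.5\-2\.0\?", matching the example above.
</P>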
| 140 | <br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br> |
| 141 | <P> |
| 142 | You can use the "PartialMatch" operation when you want the pattern |
| 143 | to match any substring of the text. |
| 144 | <pre> |
| 145 | Example: simple search for a string: |
| 146 | pcrecpp::RE("ell").PartialMatch("hello"); |
| 147 | |
| 148 | Example: find first number in a string: |
| 149 | int number; |
| 150 | pcrecpp::RE re("(\\d+)"); |
| 151 | re.PartialMatch("x*100 + 20", &number); |
| 152 | assert(number == 100); |
| 153 | </PRE> |
| 154 | </P> |
| 155 | <br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br> |
| 156 | <P> |
| 157 | By default, pattern and text are plain text, one byte per character. The UTF8 |
| 158 | flag, passed to the constructor, causes both pattern and string to be treated |
| 159 | as UTF-8 text, still a byte stream but potentially multiple bytes per |
| 160 | character. In practice, the text is likelier to be UTF-8 than the pattern, but |
| 161 | the match returned may depend on the UTF8 flag, so always use it when matching |
| 162 | UTF-8 text. For example, "." will match one byte normally but with UTF8 set may |
| 163 | match all the bytes of a multi-byte character. |
| 164 | <pre> |
| 165 | Example: |
| 166 | pcrecpp::RE_Options options; |
| 167 | options.set_utf8(); |
| 168 | pcrecpp::RE re(utf8_pattern, options); |
| 169 | re.FullMatch(utf8_string); |
| 170 | |
| 171 | Example: using the convenience function UTF8(): |
| 172 | pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8()); |
| 173 | re.FullMatch(utf8_string); |
| 174 | </pre> |
| 175 | NOTE: The UTF8 flag is ignored if pcre was not configured with the |
| 176 | <pre> |
| 177 | --enable-utf8 flag. |
| 178 | </PRE> |
| 179 | </P> |
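<P>
To see why byte-wise and character-wise matching can differ, the following
sketch counts characters in a UTF-8 string by inspecting lead bytes.
Well-formed UTF-8 uses one to four bytes per character. Both function names are
hypothetical and independent of pcrecpp; the sketch assumes well-formed input
and simply steps over stray bytes.
</P>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length in bytes of the UTF-8 sequence introduced by
// `lead`, or 0 if `lead` cannot start a sequence.
std::size_t Utf8SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation or invalid byte
}

// Counts characters rather than bytes, assuming well-formed UTF-8 input.
std::size_t CountUtf8Characters(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    std::size_t len = Utf8SequenceLength(static_cast<unsigned char>(text[i]));
    if (len == 0) len = 1;  // step over a stray byte instead of stalling
    i += len;
    ++count;
  }
  return count;
}
```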
| 180 | <br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br> |
| 181 | <P> |
| 182 | PCRE defines some modifiers to change the behavior of the regular expression |
| 183 | engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to |
| 184 | pass such modifiers to a RE class. Currently, the following modifiers are |
| 185 | supported: |
| 186 | <pre> |
| 187 | modifier              description                Perl equivalent |
| 188 | |
| 189 | PCRE_CASELESS         case insensitive match     /i |
| 190 | PCRE_MULTILINE        multiple lines match       /m |
| 191 | PCRE_DOTALL           dot matches newlines       /s |
| 192 | PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A |
| 193 | PCRE_EXTRA            strict escape parsing      N/A |
| 194 | PCRE_EXTENDED         ignore whitespace          /x |
| 195 | PCRE_UTF8             handles UTF8 chars         built-in |
| 196 | PCRE_UNGREEDY         reverses * and *?          N/A |
| 197 | PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*) |
| 198 | </pre> |
| 199 | (*) Both Perl and PCRE allow non-capturing parentheses by means of the |
| 200 | "?:" modifier within the pattern itself. For example, (?:ab|cd) does not |
| 201 | capture, while (ab|cd) does. |
| 202 | </P> |
| 203 | <P> |
| 204 | For a full account of how each modifier works, please check the |
| 205 | PCRE API reference page. |
| 206 | </P> |
| 207 | <P> |
| 208 | For each modifier, there are two member functions whose names are derived |
| 209 | from the modifier name in lowercase, without the "PCRE_" prefix. For |
| 210 | instance, PCRE_CASELESS is handled by |
| 211 | <pre> |
| 212 | bool caseless() |
| 213 | </pre> |
| 214 | which returns true if the modifier is set, and |
| 215 | <pre> |
| 216 | RE_Options & set_caseless(bool) |
| 217 | </pre> |
| 218 | which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be |
| 219 | accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member |
| 220 | functions. Setting <i>match_limit</i> to a non-zero value will limit the |
| 221 | execution of pcre to keep it from exhausting the stack or taking excessively |
| 222 | long to return a result. A value of 5000 is good enough to stop |
| 223 | stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables |
| 224 | match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b> |
| 225 | which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE |
| 226 | recurses. <b>match_limit()</b> limits the number of internal matching calls PCRE makes; |
| 227 | <b>match_limit_recursion()</b> limits the depth of internal recursion, and |
| 228 | therefore the amount of stack that is used. |
| 229 | </P> |
| 230 | <P> |
| 231 | Normally, to pass one or more modifiers to a RE class, you declare |
| 232 | a <i>RE_Options</i> object, set the appropriate options, and pass this |
| 233 | object to a RE constructor. Example: |
| 234 | <pre> |
| 235 | RE_Options opt; |
| 236 | opt.set_caseless(true); |
| 237 | if (RE("HELLO", opt).PartialMatch("hello world")) ... |
| 238 | </pre> |
| 239 | RE_Options has two constructors. The default constructor takes no arguments and |
| 240 | creates a set of flags that are off by default. The second takes an |
| 241 | <i>option_flags</i> argument, to ease porting of legacy code from C programs. |
| 242 | This lets you do |
| 243 | <pre> |
| 244 | RE(pattern, |
| 245 | RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str); |
| 246 | </pre> |
| 247 | However, new code is better off doing |
| 248 | <pre> |
| 249 | RE(pattern, |
| 250 | RE_Options().set_caseless(true).set_multiline(true)) |
| 251 | .PartialMatch(str); |
| 252 | </pre> |
| 253 | If you are going to pass one of the most used modifiers, there are some |
| 254 | convenience functions that return a RE_Options object with the |
| 255 | appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>, |
| 256 | <b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>. |
| 257 | </P> |
| 258 | <P> |
| 259 | If you need to set several options at once, and you don't want to go through |
| 260 | the trouble of declaring a RE_Options object and setting several options, there |
| 261 | is a parallel method that gives you this ability on the fly. You can chain |
| 262 | several <b>set_xxxxx()</b> member functions, since each of them returns a |
| 263 | reference to its class object. For example, to pass PCRE_CASELESS, |
| 264 | PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write: |
| 265 | <pre> |
| 266 | RE(" ^ xyz \\s+ .* blah$", |
| 267 | RE_Options() |
| 268 | .set_caseless(true) |
| 269 | .set_extended(true) |
| 270 | .set_multiline(true)).PartialMatch(sometext); |
| 271 | |
| 272 | </PRE> |
| 273 | </P> |
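<P>
Chaining works because each setter returns a reference to the object it was
called on. The miniature class below illustrates only that idiom. OptionsSketch
and its flag values are hypothetical and do not reproduce the real RE_Options
class.
</P>

```cpp
// Hypothetical miniature of the setter-chaining idiom; the class name and
// flag values are illustrative only and are not taken from pcrecpp.
class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}

  // Getters report whether a flag is currently set.
  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }

  // Setters return *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return Set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return Set(kMultiline, on); }

 private:
  enum { kCaseless = 1, kMultiline = 2 };

  OptionsSketch& Set(int bit, bool on) {
    if (on) {
      flags_ |= bit;
    } else {
      flags_ &= ~bit;
    }
    return *this;
  }

  int flags_;
};
```

<P>
A temporary such as OptionsSketch().set_caseless(true).set_multiline(true) can
then be passed directly where an options object is expected.
</P>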
| 274 | <br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br> |
| 275 | <P> |
| 276 | The "Consume" operation may be useful if you want to repeatedly |
| 277 | match regular expressions at the front of a string and skip over |
| 278 | them as they match. This requires use of the "StringPiece" type, |
| 279 | which represents a sub-range of a real string. Like RE, StringPiece |
| 280 | is defined in the pcrecpp namespace. |
| 281 | <pre> |
| 282 | Example: read lines of the form "var = value" from a string. |
| 283 | string contents = ...; // Fill string somehow |
| 284 | pcrecpp::StringPiece input(contents); // Wrap in a StringPiece |
| 285 | |
| 286 | string var; |
| 287 | int value; |
| 288 | pcrecpp::RE re("(\\w+) = (\\d+)\n"); |
| 289 | while (re.Consume(&input, &var, &value)) { |
| 290 | ...; |
| 291 | } |
| 292 | </pre> |
| 293 | Each successful call to "Consume" will set "var" and "value", and also |
| 294 | advance "input" so it points past the matched text. |
| 295 | </P> |
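<P>
The anchored match-and-advance loop can also be pictured without the wrapper.
The sketch below uses std::regex from the C++ standard library as a stand-in
engine, which is a different implementation from PCRE. ReadAssignments is a
hypothetical name, and the match_continuous flag plays the role of the
anchoring that "Consume" performs.
</P>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Sketch only: std::regex stands in for pcrecpp. Reads "var = value" lines
// from the front of `contents`, stopping at the first line that does not
// match. Unlike the wrapper, std::stoi throws on values too large for int.
std::vector<std::pair<std::string, int> > ReadAssignments(
    const std::string& contents) {
  static const std::regex line("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  // match_continuous requires the match to start exactly at `pos`.
  while (std::regex_search(pos, contents.end(), m, line,
                           std::regex_constants::match_continuous)) {
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the matched text
  }
  return result;
}
```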
| 296 | <P> |
| 297 | The "FindAndConsume" operation is similar to "Consume" but does not |
| 298 | anchor your match at the beginning of the string. For example, you |
| 299 | could extract all words from a string by repeatedly calling |
| 300 | <pre> |
| 301 | pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word) |
| 302 | </PRE> |
| 303 | </P> |
| 304 | <br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br> |
| 305 | <P> |
| 306 | By default, if you pass a pointer to a numeric value, the |
| 307 | corresponding text is interpreted as a base-10 number. You can |
| 308 | instead wrap the pointer with a call to one of the operators Hex(), |
| 309 | Octal(), or CRadix() to interpret the text in another base. The |
| 310 | CRadix operator interprets C-style "0" (base-8) and "0x" (base-16) |
| 311 | prefixes, but defaults to base-10. |
| 312 | <pre> |
| 313 | Example: |
| 314 | int a, b, c, d; |
| 315 | pcrecpp::RE re("(.*) (.*) (.*) (.*)"); |
| 316 | re.FullMatch("100 40 0100 0x40", |
| 317 | pcrecpp::Octal(&a), pcrecpp::Hex(&b), |
| 318 | pcrecpp::CRadix(&c), pcrecpp::CRadix(&d)); |
| 319 | </pre> |
| 320 | will leave 64 in a, b, c, and d. |
| 321 | </P> |
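<P>
The arithmetic behind that result can be checked with std::strtol from the C
standard library, where a base of 0 honors the same C-style prefixes. This page
does not say whether the wrapper uses strtol internally, so treat ParseInBase as
a hypothetical helper that only reproduces the numbers.
</P>

```cpp
#include <cstdlib>

// Hypothetical helper: parses `text` in the given base using the C library.
// A base of 0 asks strtol to honor "0" (octal) and "0x" (hex) prefixes and
// to fall back to base 10 otherwise.
long ParseInBase(const char* text, int base) {
  return std::strtol(text, nullptr, base);
}
```

<P>
ParseInBase("100", 8), ParseInBase("40", 16), ParseInBase("0100", 0) and
ParseInBase("0x40", 0) all evaluate to 64.
</P>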
| 322 | <br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br> |
| 323 | <P> |
| 324 | The "Replace" operation replaces the first match of "pattern" in "str" with "rewrite". |
| 325 | Within "rewrite", backslash-escaped digits (\1 to \9) can be |
| 326 | used to insert text matching the corresponding parenthesized group |
| 327 | from the pattern. \0 in "rewrite" refers to the entire matching |
| 328 | text. For example: |
| 329 | <pre> |
| 330 | string s = "yabba dabba doo"; |
| 331 | pcrecpp::RE("b+").Replace("d", &s); |
| 332 | </pre> |
| 333 | will leave "s" containing "yada dabba doo". The result is true if the pattern |
| 334 | matches and a replacement occurs, false otherwise. |
| 335 | </P> |
| 336 | <P> |
| 337 | <b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all |
| 338 | occurrences of the pattern in the string with the rewrite. Replacements are |
| 339 | not subject to re-matching. For example: |
| 340 | <pre> |
| 341 | string s = "yabba dabba doo"; |
| 342 | pcrecpp::RE("b+").GlobalReplace("d", &s); |
| 343 | </pre> |
| 344 | will leave "s" containing "yada dada doo". It returns the number of |
| 345 | replacements made. |
| 346 | </P> |
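<P>
The difference between replacing the first match and replacing every match can
be reproduced with std::regex_replace from the C++ standard library. This is a
stand-in engine, not pcrecpp, and the two function names are hypothetical. Note
that std::regex rewrite strings refer to groups as $1 rather than \1.
</P>

```cpp
#include <regex>
#include <string>

// Sketch only: std::regex stands in for pcrecpp.

// Replaces just the first match, in the manner of "Replace".
std::string ReplaceFirstSketch(const std::string& text,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replaces every non-overlapping match, in the manner of "GlobalReplace".
std::string ReplaceAllSketch(const std::string& text,
                             const std::string& pattern,
                             const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite);
}
```

<P>
Applied to "yabba dabba doo" with pattern "b+" and rewrite "d", the first
function yields "yada dabba doo" and the second "yada dada doo".
</P>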
| 347 | <P> |
| 348 | <b>Extract</b> is like <b>Replace</b>, except that if the pattern matches, |
| 349 | "rewrite" is copied into "out" (an additional argument) with substitutions. |
| 350 | The non-matching portions of "text" are ignored. Returns true iff a match |
| 351 | occurred and the extraction happened successfully; if no match occurs, the |
| 352 | string is left unaffected. |
| 353 | </P> |
| 354 | <br><a name="SEC11" href="#TOC1">AUTHOR</a><br> |
| 355 | <P> |
| 356 | The C++ wrapper was contributed by Google Inc. |
| 357 | <br> |
| 358 | Copyright © 2007 Google Inc. |
| 359 | <br> |
| 360 | </P> |
| 361 | <br><a name="SEC12" href="#TOC1">REVISION</a><br> |
| 362 | <P> |
| 363 | Last updated: 17 March 2009 |
| 364 | <br> |
| 365 | Minor typo fixed: 25 July 2011 |
| 366 | <br> |