-----------------------------------------------------------------------------

This file contains a concatenation of the PCRE man pages, converted to
plain text format for ease of searching with a text editor, or for use
on systems that do not have a man page processor. The small individual
files that give synopses of each function in the library have not been
included. Neither has the pcredemo program. There are separate text
files for the pcregrep and pcretest commands.

-----------------------------------------------------------------------------

PCRE(3) PCRE(3)

NAME

PCRE - Perl-compatible regular expressions

INTRODUCTION

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl, with just a few differences. Some features that appeared in
Python and PCRE before they appeared in Perl are also available using
the Python syntax, there is some support for one or two .NET and
Oniguruma syntax items, and there is an option for requesting some
minor changes that give better JavaScript compatibility.

The current implementation of PCRE corresponds approximately with Perl
5.12, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be
explicitly enabled; it is not the default. The Unicode tables
correspond to Unicode release 6.0.0.

In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a
different way. In certain circumstances, the alternative function has
some advantages. For a discussion of the two matching algorithms, see
the pcrematching page.

PCRE is written in C and released as a C library. A number of people
have written wrappers and interfaces of various kinds. In particular,
Google Inc. have provided a comprehensive C++ wrapper. This is now
included as part of the PCRE distribution. The pcrecpp page has details
of this interface. Other people's contributions can be found in the
Contrib directory at the primary FTP site, which is:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the
pcrepattern and pcrecompat pages. There is a syntax summary in the
pcresyntax page.

Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features
themselves are described in the pcrebuild page. Documentation about
building PCRE for various operating systems can be found in the README
and NON-UNIX-USE files in the source distribution.

The library contains a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre_", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.

USER DOCUMENTATION

The user documentation for PCRE comprises a number of different
sections. In the "man" format, each of these is a separate "man page".
In the HTML format, each is a separate page, linked from the index
page. In the plain text format, all the sections, except the pcredemo
section, are concatenated, for ease of searching. The sections are as
follows:

pcre              this document
pcre-config       show PCRE installation configuration information
pcreapi           details of PCRE's native C API
pcrebuild         options for building PCRE
pcrecallout       details of the callout feature
pcrecompat        discussion of Perl compatibility
pcrecpp           details of the C++ wrapper
pcredemo          a demonstration C program that uses PCRE
pcregrep          description of the pcregrep command
pcrejit           discussion of the just-in-time optimization support
pcrelimits        details of size and other limits
pcrematching      discussion of the two matching algorithms
pcrepartial       details of the partial matching facility
pcrepattern       syntax and semantics of supported regular expressions
pcreperform       discussion of performance issues
pcreposix         the POSIX-compatible C API
pcreprecompile    details of saving and re-using precompiled patterns
pcresample        discussion of the pcredemo program
pcrestack         discussion of stack usage
pcresyntax        quick syntax reference
pcretest          description of the pcretest testing command
pcreunicode       discussion of Unicode and UTF-8 support

In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

Putting an actual email address here seems to have been a spam magnet,
so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.

REVISION

Last updated: 24 August 2011

------------------------------------------------------------------------------

PCREBUILD(3) PCREBUILD(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS

This document describes the optional features of PCRE that can be
selected when the library is compiled. It assumes use of the configure
script, where the optional features are selected or deselected by
providing options to configure before running the make command.
However, the same options can be selected in both Unix-like and
non-Unix-like environments using the GUI facility of cmake-gui if you
are using CMake instead of configure to build PCRE.

There is a lot more information about building PCRE in non-Unix-like
environments in the file called NON-UNIX-USE, which is part of the PCRE
distribution. You should consult this file as well as the README file
if you are building in a non-Unix-like environment.

The complete list of options for configure (which includes the standard
ones such as the selection of the installation directory) can be
obtained by running

./configure --help

The following sections include descriptions of options whose names
begin with --enable or --disable. These settings specify changes to the
defaults for the configure command. Because of the way that configure
works, --enable and --disable always come in pairs, so the
complementary option always exists as well, but as it specifies the
default, it is not described.

BUILDING SHARED AND STATIC LIBRARIES

The PCRE building process uses libtool to build both shared and static
Unix libraries by default. You can suppress one of these by adding one
of

--disable-shared

--disable-static

to the configure command, as required.

C++ SUPPORT

By default, the configure script will search for a C++ compiler and C++
header files. If it finds them, it automatically builds the C++ wrapper
library for PCRE. You can disable this by adding

--disable-cpp

to the configure command.

UTF-8 SUPPORT

To build PCRE with support for UTF-8 Unicode character strings, add

--enable-utf8

to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile() or
pcre_compile2() functions.

If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the runtime
option). It is not possible to support both EBCDIC and UTF-8 codes in
the same version of the library. Consequently, --enable-utf8 and
--enable-ebcdic are mutually exclusive.

UNICODE CHARACTER PROPERTY SUPPORT

UTF-8 support allows PCRE to process character values greater than 255
in the strings that it handles. On its own, however, it does not
provide any facilities for accessing the properties of such characters.
If you want to be able to use the pattern escapes \P, \p, and \X, which
refer to Unicode character properties, you must add

--enable-unicode-properties

to the configure command. This implies UTF-8 support, even if you have
not explicitly requested it.

Including Unicode property support adds around 30K of tables to the
PCRE library. Only the general category properties such as Lu and Nd
are supported. Details are given in the pcrepattern documentation.

JUST-IN-TIME COMPILER SUPPORT

Just-in-time compiler support is included in the build by specifying

--enable-jit

This support is available only for certain hardware architectures. If
this option is set for an unsupported architecture, a compile time
error occurs. See the pcrejit documentation for a discussion of JIT
usage. When JIT support is enabled, pcregrep automatically makes use of
it, unless you add

--disable-pcregrep-jit

to the configure command.

CODE VALUE OF NEWLINE

By default, PCRE interprets the linefeed (LF) character as indicating
the end of a line. This is the normal newline character on Unix-like
systems. You can compile PCRE to use carriage return (CR) instead, by
adding

--enable-newline-is-cr

to the configure command. There is also a --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.

Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add

--enable-newline-is-crlf

to the configure command. There is a fourth option, specified by

--enable-newline-is-anycrlf

which causes PCRE to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by

--enable-newline-is-any

causes PCRE to recognize any Unicode newline sequence.

Whatever line ending convention is selected when PCRE is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
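The difference between the first four conventions can be illustrated
with a small stand-alone C sketch. It does not use PCRE; the type and
function names (nl_convention, newline_length) are hypothetical and
exist only for this illustration, and the "any Unicode newline"
convention is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; they are not PCRE identifiers. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } nl_convention;

/* Return the length (0, 1 or 2) of the line ending that starts at p,
   looking no further than end. 0 means "no line ending here". */
size_t newline_length(const char *p, const char *end, nl_convention conv)
{
    assert(p != NULL && end != NULL && p <= end);
    if (p == end)
        return 0;

    int has_crlf = (end - p >= 2) && p[0] == '\r' && p[1] == '\n';

    switch (conv) {
    case NL_CR:
        return p[0] == '\r' ? 1 : 0;
    case NL_LF:
        return p[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return has_crlf ? 2 : 0;
    case NL_ANYCRLF:
        /* Prefer the two-character sequence when both would match. */
        if (has_crlf)
            return 2;
        return (p[0] == '\r' || p[0] == '\n') ? 1 : 0;
    }
    return 0;
}
```

With the subject "a\r\nb", the position of the CR counts as a
two-character line ending under the CRLF and ANYCRLF conventions, as a
one-character ending under CR, and as no ending at all under LF.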

WHAT \R MATCHES

By default, the sequence \R in a pattern matches any Unicode newline
sequence, whatever has been selected as the line ending sequence. If
you specify

--enable-bsr-anycrlf

the default is changed so that \R matches only CR, LF, or CRLF.
Whatever is selected when PCRE is built can be overridden when the
library functions are called.

POSIX MALLOC USAGE

When PCRE is called through the POSIX interface (see the pcreposix
documentation), additional working storage is required for holding the
pointers to capturing substrings, because PCRE requires three integers
per substring, whereas the POSIX interface provides only two. If the
number of expected substrings is small, the wrapper function uses space
on the stack, because this is faster than using malloc() for each call.
The default threshold above which the stack is no longer used is 10; it
can be changed by adding a setting such as

--with-posix-malloc-threshold=20

to the configure command.
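The stack-or-heap decision described above can be sketched in plain C
as follows. This is an illustration only, not the code of the POSIX
wrapper: the names toy_regexec and TOY_MALLOC_THRESHOLD are
hypothetical, and the matching step is replaced by a loop that merely
touches the vector.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical constant mirroring the documented default of 10. */
#define TOY_MALLOC_THRESHOLD 10

/* Pretend to run a match that needs nmatch substring slots.
   Returns 1 if the offsets vector came from the heap, 0 if it lived
   on the stack, and -1 if malloc() failed. */
int toy_regexec(size_t nmatch)
{
    int small_ovector[TOY_MALLOC_THRESHOLD * 3]; /* three ints per substring */
    int *ovector = small_ovector;
    int used_heap = 0;

    if (nmatch > TOY_MALLOC_THRESHOLD) {
        ovector = malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    }

    /* A real wrapper would call the matching function here; the sketch
       only initializes the vector to show that it is usable. */
    for (size_t i = 0; i < nmatch * 3; i++)
        ovector[i] = -1;

    if (used_heap)
        free(ovector);
    return used_heap;
}
```

Raising the threshold trades a larger stack frame on every call for
fewer malloc() calls when many substrings are requested.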

HANDLING VERY LARGE PATTERNS

Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default, two-byte values are used for
these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic
patterns. Nevertheless, some people do want to process truly enormous
patterns, so it is possible to compile PCRE to use three-byte or
four-byte offsets by adding a setting such as

--with-link-size=3

to the configure command. The value given must be 2, 3, or 4. Using
longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
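The "around 64K" figure follows from the number of distinct values that
fit in a two-byte offset. The sketch below only does that arithmetic;
the function name offset_range is hypothetical, and the exact limit
that PCRE enforces for a given link size may be somewhat lower than
the raw range shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct offsets that fit in link_size bytes (2, 3 or 4). */
uint64_t offset_range(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (uint64_t)1 << (8 * link_size);
}
```

A link size of 2 gives 65536 (64K), 3 gives 16777216 (16M), and 4 gives
4294967296 (4G).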

AVOIDING EXCESSIVE STACK USAGE

When matching with the pcre_exec() function, PCRE implements
backtracking by making recursive calls to an internal function called
match(). In environments where the size of the stack is limited, this
can severely limit PCRE's operation. (The Unix environment does not
usually suffer from this problem, but it may sometimes be necessary to
increase the maximum stack size. There is a discussion in the pcrestack
documentation.) An alternative approach to recursion that uses memory
from the heap to remember data, instead of using recursive function
calls, has been implemented to work round the problem of limited stack
size. If you want to build a version of PCRE that works this way, add

--disable-stack-for-recursion

to the configure command. With this configuration, PCRE will use the
pcre_stack_malloc and pcre_stack_free variables to call memory
management functions. By default these point to malloc() and free(),
but you can replace the pointers so that your own functions are used
instead.

Separate functions are provided rather than using pcre_malloc and
pcre_free because the usage is very predictable: the block sizes
requested are always the same, and the blocks are always freed in
reverse order. A calling program might be able to implement optimized
functions that perform better than malloc() and free(). PCRE runs
noticeably more slowly when built in this way. This option affects only
the pcre_exec() function; it is not relevant for pcre_dfa_exec().

LIMITING PCRE RESOURCE USAGE

Internally, PCRE has a function called match(), which it calls
repeatedly (sometimes recursively) when matching a pattern with the
pcre_exec() function. By controlling the maximum number of times this
function may be called during a single matching operation, a limit can
be placed on the resources used by a single call to pcre_exec(). The
limit can be changed at run time, as described in the pcreapi
documentation. The default is 10 million, but this can be changed by
adding a setting such as

--with-match-limit=500000

to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function.

In some environments it is desirable to limit the depth of recursive
calls of match() more strictly than the total number of calls, in order
to restrict the maximum amount of stack (or heap, if
--disable-stack-for-recursion is specified) that is used. A second
limit controls this; it defaults to the value that is set for
--with-match-limit, which imposes no additional constraints. However,
you can set a lower limit by adding, for example,

--with-match-limit-recursion=10000

to the configure command. This value can also be overridden at run
time.
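The distinction between the two limits (total calls versus recursion
depth) can be seen in a toy recursive function. Everything in the
sketch below is hypothetical: toy_match, toy_limits and the TOY_* codes
are not PCRE names, and the "pattern" is just a binary call tree of a
given height.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch only. */
enum { TOY_OK = 0, TOY_ERR_MATCHLIMIT = -1, TOY_ERR_RECURSIONLIMIT = -2 };

typedef struct {
    unsigned long match_limit;      /* maximum total number of calls */
    unsigned long recursion_limit;  /* maximum nesting depth         */
    unsigned long calls;            /* calls made so far             */
} toy_limits;

/* Walk a binary call tree: every call with remaining > 0 recurses
   twice. Both limits are checked on entry to each call. */
int toy_match(toy_limits *lim, unsigned long depth, unsigned remaining)
{
    assert(lim != NULL);
    if (++lim->calls > lim->match_limit)
        return TOY_ERR_MATCHLIMIT;
    if (depth > lim->recursion_limit)
        return TOY_ERR_RECURSIONLIMIT;
    if (remaining == 0)
        return TOY_OK;

    for (int branch = 0; branch < 2; branch++) {
        int rc = toy_match(lim, depth + 1, remaining - 1);
        if (rc != TOY_OK)
            return rc;
    }
    return TOY_OK;
}
```

Starting at depth 1 with remaining set to 3, the tree needs 15 calls
but only 4 levels of nesting, so a recursion limit can stop the walk
long before a generous call limit would.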

CREATING CHARACTER TABLES AT BUILD TIME

PCRE uses fixed tables for processing characters whose code values are
less than 256. By default, PCRE is built with a set of tables that are
distributed in the file pcre_chartables.c.dist. These tables are for
ASCII codes only. If you add

--enable-rebuild-chartables

to the configure command, the distributed tables are no longer used.
Instead, a program called dftables is compiled and run. This outputs
the source for a new set of tables, created in the default locale of
your C runtime system. (This method of replacing the tables does not
work if you are cross compiling, because dftables is run on the local
host. If you need to create alternative tables when cross compiling,
you will have to do so "by hand".)
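To give a feel for what "tables created in the locale of the C runtime"
means, the sketch below fills two 256-entry tables from the <ctype.h>
functions. It is not the dftables program and does not reproduce the
layout of PCRE's tables, which hold more than case mappings; the
function name build_case_tables is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fill a lower-casing table and a case-flipping table for all 256
   byte values, using whatever locale is currently in effect (the "C"
   locale unless the program has called setlocale()). */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
    assert(lower != NULL && flip != NULL);
    for (int c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        flip[c]  = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

Because the tables depend on the locale of the machine that runs the
generator, building them on the host gives the wrong result when the
target uses a different character set, which is the cross-compiling
problem noted above.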

USING EBCDIC CODE

PCRE assumes by default that it will run in an environment where the
character code is ASCII (or Unicode, which is a superset of ASCII).
This is the case for most computer operating systems. PCRE can,
however, be compiled to run in an EBCDIC environment by adding

--enable-ebcdic

to the configure command. This setting implies
--enable-rebuild-chartables. You should only use it if you know that
you are in an EBCDIC environment (for example, an IBM mainframe
operating system). The --enable-ebcdic option is incompatible with
--enable-utf8.

PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT

By default, pcregrep reads all files as plain text. You can build it so
that it recognizes files whose names end in .gz or .bz2, and reads them
with libz or libbz2, respectively, by adding one or both of

--enable-pcregrep-libz
--enable-pcregrep-libbz2

to the configure command. These options naturally require that the
relevant libraries are installed on your system. Configuration will
fail if they are not.

PCREGREP BUFFER SIZE

pcregrep uses an internal buffer to hold a "window" on the file it is
scanning, in order to be able to output "before" and "after" lines when
it finds a match. The size of the buffer is controlled by a parameter
whose default value is 20K. The buffer itself is three times this size,
but because of the way it is used for holding "before" lines, the
longest line that is guaranteed to be processable is the parameter
size. You can change the default parameter value by adding, for
example,

--with-pcregrep-bufsize=50K

to the configure command. The caller of pcregrep can, however, override
this value by specifying a run-time option.

PCRETEST OPTION FOR LIBREADLINE SUPPORT

If you add

--enable-pcretest-libreadline

to the configure command, pcretest is linked with the libreadline
library, and when its input is from a terminal, it reads it using the
readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licensed, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues.

Setting this option causes the -lreadline option to be added to the
pcretest build. In many operating environments with a system-installed
libreadline this is sufficient. However, in some environments (e.g. if
an unmodified distribution version of readline is in use), some extra
configuration may be necessary. The INSTALL file for libreadline says
this:

"Readline uses the termcap functions, but does not link with the
termcap or curses library itself, allowing applications which link
with readline to choose an appropriate library."

If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like

LIBS="-lncurses"

immediately before the configure command.

SEE ALSO

pcreapi(3), pcre_config(3).

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 06 September 2011

------------------------------------------------------------------------------

PCREMATCHING(3) PCREMATCHING(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE MATCHING ALGORITHMS

This document describes the two different algorithms that are available
in PCRE for matching a compiled regular expression against a given
subject string. The "standard" algorithm is the one provided by the
pcre_exec() function. This works in the same way as Perl's matching
function, and provides a Perl-compatible matching operation.

An alternative algorithm is provided by the pcre_dfa_exec() function;
this operates in a different way, and is not Perl-compatible. It has
advantages and disadvantages compared with the standard algorithm, and
these are described below.

When there is only one possible way in which a given subject string can
match a pattern, the two algorithms give the same answer. A difference
arises, however, when there are multiple possibilities. For example, if
the pattern

^<.*>

is matched against the string

<something> <something else> <something further>

there are three possible answers. The standard algorithm finds only one
of them, whereas the alternative algorithm finds all three.

REGULAR EXPRESSIONS AS TREES

The set of strings that are matched by a regular expression can be
represented as a tree structure. An unlimited repetition in the pattern
makes the tree of infinite size, but it is still a tree. Matching the
pattern to a given subject string (from a given starting point) can be
thought of as a search of the tree. There are two ways to search a
tree: depth-first and breadth-first, and these correspond to the two
matching algorithms provided by PCRE.

THE STANDARD MATCHING ALGORITHM

In the terminology of Jeffrey Friedl's book "Mastering Regular
Expressions", the standard algorithm is an "NFA algorithm". It conducts
a depth-first search of the pattern tree. That is, it proceeds along a
single path through the tree, checking that the subject matches what is
required. When there is a mismatch, the algorithm tries any
alternatives at the current point, and if they all fail, it backs up to
the previous branch point in the tree, and tries the next alternative
branch at that level. This often involves backing up (moving to the
left) in the subject string as well. The order in which repetition
branches are tried is controlled by the greedy or ungreedy nature of
the quantifier.

If a leaf node is reached, a matching string has been found, and at
that point the algorithm stops. Thus, if there is more than one
possible match, this algorithm returns the first one that it finds.
Whether this is the shortest, the longest, or some intermediate length
depends on the way the greedy and ungreedy repetition quantifiers are
specified in the pattern.

Because it ends up with a single path through the tree, it is
relatively straightforward for this algorithm to keep track of the
substrings that are matched by portions of the pattern in parentheses.
This provides support for capturing parentheses and back references.
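How the order of trying repetition branches decides which match is
returned first can be shown with a hand-coded sketch for the single
pattern ^<.*> (greedy) and its ungreedy form ^<.*?>. This is not how
PCRE is implemented; the function name first_match_angle is
hypothetical, and the subject is assumed to contain no newlines so
that the dot can match every character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the first match found for ^<.*> (greedy != 0)
   or ^<.*?> (greedy == 0) at the start of subject, or -1 if there is
   no match. The search stops at the first success, as a depth-first
   matcher does. */
long first_match_angle(const char *subject, int greedy)
{
    assert(subject != NULL);
    size_t n = strlen(subject);

    if (n < 2 || subject[0] != '<')
        return -1;

    if (greedy) {
        /* Try the longest repetition first, then back off leftwards. */
        for (size_t end = n; end >= 2; end--)
            if (subject[end - 1] == '>')
                return (long)end;
    } else {
        /* Try the shortest repetition first, then extend rightwards. */
        for (size_t end = 2; end <= n; end++)
            if (subject[end - 1] == '>')
                return (long)end;
    }
    return -1;
}
```

For the subject "<something> <something else> <something further>" the
greedy form stops at the whole 48-character string, while the ungreedy
form stops at the 11-character "<something>"; neither reports the
intermediate possibility.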

THE ALTERNATIVE MATCHING ALGORITHM

This algorithm conducts a breadth-first search of the tree. Starting
from the first matching point in the subject, it scans the subject
string from left to right, once, character by character, and as it does
this, it remembers all the paths through the tree that represent valid
matches. In Friedl's terminology, this is a kind of "DFA algorithm",
though it is not implemented as a traditional finite state machine (it
keeps multiple states active simultaneously).

Although the general principle of this matching algorithm is that it
scans the subject string only once, without backtracking, there is one
exception: when a lookaround assertion is encountered, the characters
following or preceding the current point have to be independently
inspected.

The scan continues until either the end of the subject is reached, or
there are no more unterminated paths. At this point, terminated paths
represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the
longest. The matches are returned in decreasing order of length. There
is an option to stop the algorithm after the first match (which is
necessarily the shortest) is found.

Note that all the matches that are found start at the same point in the
subject. If the pattern

cat(er(pillar)?)?

is matched against the string "the caterpillar catchment", the result
will be the three strings "caterpillar", "cater", and "cat" that start
at the fifth character of the subject. The algorithm does not
automatically move on to find matches that start at later positions.
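The result described above can be reproduced with a sketch that is
hand-coded for the one pattern cat(er(pillar)?)?. It is not a general
matcher and not PCRE's implementation; it only shows what "all matches
at one starting point, longest first" looks like. The function name
all_matches_at is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Collect the lengths of every match of cat(er(pillar)?)? that begins
   at subject[start], in decreasing order of length. Returns the number
   of matches found (0 to 3). */
size_t all_matches_at(const char *subject, size_t start, size_t lengths[3])
{
    assert(subject != NULL && lengths != NULL);
    static const char *const pieces[3] = { "cat", "er", "pillar" };
    size_t found[3];
    size_t count = 0, pos = start;

    if (start > strlen(subject))
        return 0;

    /* Scan left to right once; each completed piece ends one match. */
    for (size_t i = 0; i < 3; i++) {
        size_t n = strlen(pieces[i]);
        if (strncmp(subject + pos, pieces[i], n) != 0)
            break;
        pos += n;
        found[count++] = pos - start;
    }

    /* Report the longest match first. */
    for (size_t i = 0; i < count; i++)
        lengths[i] = found[count - 1 - i];
    return count;
}
```

Called at offset 4 of "the caterpillar catchment" it yields lengths 11,
5 and 3. To pick up the "cat" in "catchment" the caller has to invoke
it again at offset 16, mirroring the point that later start positions
are not tried automatically.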

There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:

1. Because the algorithm finds all possible matches, the greedy or
ungreedy nature of repetition quantifiers is not relevant. Greedy and
ungreedy quantifiers are treated in exactly the same way. However,
possessive quantifiers can make a difference when what follows could
also match what is quantified, for example in a pattern like this:

^a++\w!

This pattern matches "aaab!" but not "aaa!", which would be matched by
a non-possessive quantifier. Similarly, if an atomic group is present,
it is matched as if it were a standalone pattern at the current point,
and the longest match is then "locked in" for the rest of the overall
pattern.

2. When dealing with multiple paths through the tree simultaneously, it
is not straightforward to keep track of captured substrings for the
different matching possibilities, and PCRE's implementation of this
algorithm does not attempt to do this. This means that no captured
substrings are available.

3. Because no substrings are captured, back references within the
pattern are not supported, and cause errors if encountered.

4. For the same reason, conditional expressions that use a
backreference as the condition or test for a specific group recursion
are not supported.

5. Because many paths through the tree may be active, the \K escape
sequence, which resets the start of the match when encountered (but may
be on some paths and not on others), is not supported. It causes an
error if encountered.

6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.

7. The \C escape sequence, which (in the standard algorithm) matches a
single byte, even in UTF-8 mode, is not supported in UTF-8 mode,
because the alternative algorithm moves through the subject string one
character at a time, for all active paths through the tree.

8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
are not supported. (*FAIL) is supported, and behaves like a failing
negative assertion.

ADVANTAGES OF THE ALTERNATIVE ALGORITHM

Using the alternative matching algorithm provides the following
advantages:

1. All possible matches (at a single point in the subject) are
automatically found, and in particular, the longest match is found. To
find more than one match using the standard algorithm, you have to do
kludgy things with callouts.

2. Because the alternative algorithm scans the subject string just
once, and never needs to backtrack, it is possible to pass very long
subject strings to the matching function in several pieces, checking
for partial matching each time. Although it is possible to do
multi-segment matching using the standard algorithm (pcre_exec()), by
retaining partially matched substrings, it is more complicated. The
pcrepartial documentation gives details of partial matching and
discusses multi-segment matching.

DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

The alternative algorithm suffers from a number of disadvantages:

1. It is substantially slower than the standard algorithm. This is
partly because it has to search for all possible matches, but is also
because it is less susceptible to optimization.

2. Capturing parentheses and back references are not supported.

3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 19 November 2011

------------------------------------------------------------------------------

PCREAPI(3) PCREAPI(3)

NAME

PCRE - Perl-compatible regular expressions


PCRE NATIVE API BASIC FUNCTIONS

#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

void pcre_free_study(pcre_extra *extra);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

PCRE NATIVE API AUXILIARY FUNCTIONS

pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);

void pcre_jit_stack_free(pcre_jit_stack *stack);

void pcre_assign_jit_stack(pcre_extra *extra,
     pcre_jit_callback callback, void *data);

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize,
     int *workspace, int wscount);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

int pcre_refcount(pcre *code, int adjust);

int pcre_config(int what, void *where);

char *pcre_version(void);

780

781

782

PCRE NATIVE API INDIRECTED FUNCTIONS

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);

void *(*pcre_stack_malloc)(size_t);

void (*pcre_stack_free)(void *);

int (*pcre_callout)(pcre_callout_block *);

PCRE API OVERVIEW

PCRE has its own native API, which is described in this document. There
are also some wrapper functions that correspond to the POSIX regular
expression API, but they do not give access to all the functionality.
They are described in the pcreposix documentation. Both of these APIs
define a set of C function calls. A C++ wrapper is also distributed
with PCRE. It is documented in the pcrecpp page.

The native API C function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre. It
can normally be accessed by adding -lpcre to the command for linking an
application that uses PCRE. The header file defines the macros
PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include support
for different releases of PCRE.
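
As a rough illustration of how the release macros might be used, the
sketch below compares a (major, minor) pair against a required minimum.
The helper name version_at_least() is hypothetical and is not part of
PCRE; real code would pass PCRE_MAJOR and PCRE_MINOR from pcre.h.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE: returns 1 if release
   major.minor is at least want_major.want_minor, otherwise 0. */
int version_at_least(int major, int minor, int want_major, int want_minor)
{
    assert(major >= 0 && minor >= 0);
    if (major != want_major)
        return major > want_major;      /* major number decides first */
    return minor >= want_minor;         /* same major: compare minor */
}
```

The same comparison can be written directly in the preprocessor, for
instance "#if PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 20)",
to guard code that needs a feature from a particular release.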

In a Windows environment, if you want to statically link an application
program against a non-dll pcre.a file, you must define PCRE_STATIC
before including pcre.h or pcrecpp.h, because otherwise the
pcre_malloc() and pcre_free() exported functions will be declared
__declspec(dllimport), with unwanted results.

The functions pcre_compile(), pcre_compile2(), pcre_study(), and
pcre_exec() are used for compiling and matching regular expressions in
a Perl-compatible manner. A sample program that demonstrates the
simplest way of using them is provided in the file called pcredemo.c in
the PCRE source distribution. A listing of this program is given in the
pcredemo documentation, and the pcresample documentation describes how
to compile and run it.

Just-in-time compiler support is an optional feature of PCRE that can
be built in appropriate hardware environments. It greatly speeds up the
matching performance of many patterns. Simple programs can easily
request that it be used if available, by setting an option that is
ignored when it is not relevant. More complicated programs might need
to make use of the functions pcre_jit_stack_alloc(),
pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control
the JIT code's memory usage. These functions are discussed in the
pcrejit documentation.

A second matching function, pcre_dfa_exec(), which is not
Perl-compatible, is also provided. This uses a different algorithm for
the matching. The alternative algorithm finds all possible matches (at
a given point in the subject), and scans the subject just once (unless
there are lookbehind assertions). However, this algorithm does not
return captured substrings. A description of the two matching
algorithms and their advantages and disadvantages is given in the
pcrematching documentation.

In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:

  pcre_copy_substring()
  pcre_copy_named_substring()
  pcre_get_substring()
  pcre_get_named_substring()
  pcre_get_substring_list()
  pcre_get_stringnumber()
  pcre_get_stringtable_entries()

pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.

The function pcre_maketables() is used to build a set of character
tables in the current locale for passing to pcre_compile(),
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
provided for specialist use. Most commonly, no special tables are
passed, in which case internal tables that are generated when PCRE is
built are used.

The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version that returns only
some of the available information, but is retained for backwards
compatibility. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.

The function pcre_refcount() maintains a reference count in a data
block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.

The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions,
respectively. PCRE calls the memory management functions via these
variables, so a calling program can replace them if it wishes to
intercept the calls. This should be done before calling any PCRE
functions.

The global variables pcre_stack_malloc and pcre_stack_free are also
indirections to memory management functions. These special functions
are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls, when running the pcre_exec()
function. See the pcrebuild documentation for details of how to do
this. It is a non-standard way of building PCRE, for use in
environments that have limited stacks. Because of the greater use of
memory management, it runs more slowly. Separate functions are provided
so that special-purpose external code can be used for this case. When
used, these functions are always called in a stack-like manner (last
obtained, first freed), and always for memory blocks of the same size.
There is a discussion about PCRE's stack usage in the pcrestack
documentation.
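
Because the stack functions are called last-obtained-first-freed and
always with the same block size, special-purpose replacements can be
very simple. The following is a minimal sketch of one possibility, a
small cache of freed blocks. The names demo_stack_malloc(),
demo_stack_free(), and CACHE_SLOTS are hypothetical, the code is not
thread-safe, and it relies on the same-size guarantee described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 16                 /* hypothetical cache capacity */

static void  *cache[CACHE_SLOTS];      /* freed blocks kept for re-use */
static size_t cache_count = 0;
static size_t block_size  = 0;         /* fixed by the first request */

/* Hand out a cached block if one is available, else call malloc(). */
void *demo_stack_malloc(size_t size)
{
    if (block_size == 0)
        block_size = size;
    assert(size == block_size);        /* relies on same-size blocks */
    if (cache_count > 0)
        return cache[--cache_count];   /* most recently freed block */
    return malloc(size);
}

/* Keep the block for re-use; release it only when the cache is full. */
void demo_stack_free(void *block)
{
    if (block == NULL)
        return;
    if (cache_count < CACHE_SLOTS)
        cache[cache_count++] = block;
    else
        free(block);
}
```

A program would install these by assigning them to pcre_stack_malloc
and pcre_stack_free before calling any PCRE functions. They have an
effect only when PCRE was built to use the heap instead of recursion.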

The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the
pcrecallout documentation.

NEWLINES

PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF
(linefeed) character, the two-character sequence CRLF, any of the three
preceding, or any Unicode newline sequence. The Unicode newline
sequences are the three just mentioned, plus the single characters VT
(vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085),
LS (line separator, U+2028), and PS (paragraph separator, U+2029).

Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE is built, a default
can be specified. If none is specified, the default is LF, which is the
Unix standard. When PCRE is run, the default can be overridden, either
when a pattern is compiled, or when it is matched.

At compile time, the newline convention can be specified by the options
argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.

In the PCRE documentation the word "newline" is used to mean "the
character or pair of characters that indicate a line break". The choice
of newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position
advancement for a non-anchored pattern. There is more detail about this
in the section on pcre_exec() options below.

The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.

MULTITHREADING

The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.

The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by several
threads at once.

If the just-in-time optimization feature is being used, it needs
separate memory stack areas for each thread. See the pcrejit
documentation for more details.

SAVING PRECOMPILED PATTERNS FOR LATER USE

The compiled form of a regular expression can be saved and re-used at a
later time, possibly by a different program, and even on a host other
than the one on which it was compiled. Details are given in the
pcreprecompile documentation. However, compiling a regular expression
with one version of PCRE for use with a different version is not
guaranteed to work and may cause crashes.

CHECKING BUILD-TIME OPTIONS

int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE client to
discover which optional features have been compiled into the PCRE
library. The pcrebuild documentation has more details about these
optional features.

The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The following information is
available:

PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is
available; otherwise it is set to zero.

PCRE_CONFIG_UNICODE_PROPERTIES

The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.

PCRE_CONFIG_JIT

The output is an integer that is set to one if support for just-in-time
compiling is available; otherwise it is set to zero.

PCRE_CONFIG_NEWLINE

The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The five values that
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
and -1 for ANY. Though they are derived from ASCII, the same values
are returned in EBCDIC environments. The default should normally
correspond to the standard sequence for your operating system.
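
The values listed for PCRE_CONFIG_NEWLINE can be decoded with a small
lookup. This is only a sketch: newline_config_name() is a hypothetical
helper, not a PCRE function. Note that 3338 is simply the two ASCII
codes packed together, 13 * 256 + 10.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: map the integer reported for
   PCRE_CONFIG_NEWLINE to a readable name. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";      /* 3338 */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return "unknown";
    }
}
```

In a program linked with PCRE, the argument would come from a call
such as pcre_config(PCRE_CONFIG_NEWLINE, &value), with value declared
as an int.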

PCRE_CONFIG_BSR

The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R
matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a
pattern is compiled or matched.

PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.

PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

PCRE_CONFIG_MATCH_LIMIT

The output is a long integer that gives the default limit for the
number of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.

PCRE_CONFIG_MATCH_LIMIT_RECURSION

The output is a long integer that gives the default limit for the depth
of recursion when calling the internal matching function in a
pcre_exec() execution. Further details are given with pcre_exec()
below.

PCRE_CONFIG_STACKRECURSE

The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
on the heap instead of recursive function calls. In this case,
pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.

COMPILING A PATTERN

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned. To
avoid too much repetition, we refer just to pcre_compile() below, but
the information applies equally to pcre_compile2().

The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is
obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr
argument, which is an address (see below).

The options argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The
available options are described below. Some of them (in particular,
those that are compatible with Perl, but some others as well) can also
be set and unset from within the pattern (see the detailed description
in the pcrepattern documentation). For those options that can be
different in different parts of the pattern, the contents of the
options argument specify their settings at the start of compilation and
execution. The PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx,
PCRE_NO_UTF8_CHECK, and PCRE_NO_START_OPT options can be set at the
time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error
message. This is a static string that is part of the library. You must
not try to free it. Normally, the offset from the start of the pattern
to the byte that was being processed when the error was discovered is
placed in the variable pointed to by erroffset, which must not be NULL
(if it is, an immediate error is given). However, for an invalid UTF-8
string, the offset is that of the first byte of the failing character.
Also, some errors are not detected until checks are carried out when
the whole pattern has been scanned; in these cases the offset passed
back is the length of the pattern.

Note that the offset is in bytes, not characters, even in UTF-8 mode.
It may sometimes point into the middle of a UTF-8 character.
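
When an error position has to be shown to a user in characters rather
than bytes, the byte offset can be converted. The following sketch
shows one way to do it, assuming the pattern is valid UTF-8 and that
offset does not exceed the pattern length. The function name
utf8_char_index() is hypothetical and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx. */
static int is_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Hypothetical helper: return the zero-based index of the character
   that contains byte `offset`. Assumes offset <= strlen(s). */
size_t utf8_char_index(const char *s, size_t offset)
{
    size_t i, index = 0;
    assert(s != NULL);
    /* If the offset is inside a character, step back to its first byte. */
    while (offset > 0 && is_continuation(s[offset]))
        offset--;
    /* Count the characters that start before that byte. */
    for (i = 0; i < offset; i++)
        if (!is_continuation(s[i]))
            index++;
    return index;
}
```

For the four-byte pattern "a", U+00E9, "(" (the middle character takes
two bytes), byte offsets 1 and 2 both map to character index 1, and
byte offset 3 maps to character index 2.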

If pcre_compile2() is used instead of pcre_compile(), and the
errorcodeptr argument is not NULL, a non-zero error code number is
returned via this argument in the event of an error. This is in
addition to the textual error message. Error codes and messages are
listed below.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the
default C locale. Otherwise, tableptr must be an address that is the
result of a call to pcre_maketables(). This value is stored with the
compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.

This code fragment shows a typical straightforward call to
pcre_compile():

  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */

The following names for option bits are defined in the pcre.h header
file:

PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
that is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.

PCRE_AUTO_CALLOUT

If this bit is set, pcre_compile() automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.

PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by
setting an option when a compiled pattern is matched.

PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
always understands the concept of case for characters whose values are
less than 128, so caseless matching is always possible. For characters
with higher values, the concept of case is supported if PCRE is
compiled with Unicode property support, but not otherwise. If you want
to use caseless matching for characters 128 and above, you must ensure
that PCRE is compiled with Unicode property support as well as with
UTF-8 support.

PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before a newline at the end of the string (but not
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.

PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches a
character of any value, including one that indicates a newline.
However, it only ever matches one character, even if newlines are coded
as CRLF. Without this option, a dot does not match when the current
position is at a newline. This option is equivalent to Perl's /s
option, and it can be changed within a pattern by a (?s) option
setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option.

PCRE_DUPNAMES

If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
is known that only one instance of the named subpattern can ever be
matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.

PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class.
Whitespace does not include the VT character (code 11). In addition,
characters between an unescaped # outside a character class and the
next newline, inclusive, are also ignored. This is equivalent to Perl's
/x option, and it can be changed within a pattern by a (?x) option
setting.

Which characters are interpreted as newlines is controlled by the
options passed to pcre_compile() or by a special sequence at the start
of the pattern, as described in the section entitled "Newline
conventions" in the pcrepattern documentation. Note that the end of
this type of comment is a literal newline sequence in the pattern;
escape sequences that happen to represent a newline do not count.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( that
introduces a conditional subpattern.
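
The PCRE_EXTENDED rules can be modelled in a few lines of C. The
sketch below is a simplified illustration, not PCRE's own code: the
name strip_extended() is hypothetical, newline is assumed to be LF, and
constructs such as \Q...\E, POSIX class names, and (?x) settings inside
the pattern are not handled. It copies a pattern while dropping
unescaped whitespace and #-comments that lie outside character classes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Whitespace as described above: VT (code 11) is not included. */
static int is_pattern_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Hypothetical helper: `out` needs room for strlen(in) + 1 bytes. */
void strip_extended(const char *in, char *out)
{
    int in_class = 0;
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        char c = *in++;
        if (c == '\\' && *in != '\0') {
            *out++ = c;                 /* escaped character: keep both */
            *out++ = *in++;
        } else if (in_class) {
            if (c == ']')
                in_class = 0;
            *out++ = c;                 /* inside a class: keep as is */
        } else if (c == '[') {
            in_class = 1;
            *out++ = c;
            if (*in == '^')
                *out++ = *in++;
            if (*in == ']')             /* ] directly after [ or [^ */
                *out++ = *in++;         /* is a literal */
        } else if (is_pattern_space(c)) {
            /* unescaped whitespace outside a class is ignored */
        } else if (c == '#') {
            while (*in != '\0' && *in != '\n')
                in++;                   /* skip the comment text */
            if (*in == '\n')
                in++;                   /* and the newline itself */
        } else {
            *out++ = c;
        }
    }
    *out = '\0';
}
```

Under these assumptions, a pattern consisting of "a b  # comment"
followed by a linefeed and "[ #]\ c" is reduced to "ab[ #]\ c": the
space and # inside the class and the escaped space all survive.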

PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. (Perl can, however, be persuaded to give an error for this, by
running it with the -w option.) There are at present no other features
controlled by this option. It can also be set by a (?X) option setting
within a pattern.

PCRE_FIRSTLINE

If this option is set, an unanchored pattern is required to match
before or at the first newline in the subject string, though the
matched text may continue over the newline.

PCRE_JAVASCRIPT_COMPAT

If this option is set, PCRE's behaviour is changed in some ways so that
it is compatible with JavaScript rather than Perl. The changes are as
follows:

(1) A lone closing square bracket in a pattern causes a compile-time
error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.

(2) At run time, a back reference to an unset subpattern group matches
an empty string (by default this causes the current matching
alternative to fail). A pattern such as (\1)(a) succeeds when this
option is set (assuming it can find an "a" in the subject), whereas it
fails by default, for Perl compatibility.

(3) \U matches an upper case "U" character; by default \U causes a
compile time error (Perl uses \U to upper case subsequent characters).

(4) \u matches a lower case "u" character unless it is followed by four
hexadecimal digits, in which case the hexadecimal number defines the
code point to match. By default, \u causes a compile time error (Perl
uses it to upper case the following character).

(5) \x matches a lower case "x" character unless it is followed by two
hexadecimal digits, in which case the hexadecimal number defines the
code point to match. By default, as in Perl, a hexadecimal number is
always expected after \x, but it may have zero, one, or two digits (so,
for example, \xz matches a binary zero character followed by z).

PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
line of characters (even if it actually contains newlines). The "start
of line" metacharacter (^) matches only at the start of the string,
while the "end of line" metacharacter ($) matches only at the end of
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal
newlines in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no
newlines in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.

PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY

These options override the default newline definition that was chosen
when PCRE was built. Setting the first or the second specifies that a
newline is indicated by a single character (CR or LF, respectively).
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
plus the single characters VT (vertical tab, U+000B), FF (formfeed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.

The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only six are
used (default plus the five values above). This means that if you set
more than one newline option, the combination may or may not be
sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is
equivalent to PCRE_NEWLINE_CRLF, but other combinations may yield
unused numbers and cause an error.
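
The three-bit field can be illustrated with stand-in constants. In the
sketch below the DEMO_NEWLINE_xxx names are hypothetical, and their
values are an assumption about how the PCRE_NEWLINE_xxx bits are laid
out; real code should include pcre.h and use the PCRE_NEWLINE_xxx names
rather than copying numbers.

```c
#include <assert.h>

/* Stand-in values (an assumption): a 3-bit number in bits 20 to 22. */
#define DEMO_NEWLINE_CR      0x00100000   /* field value 1 */
#define DEMO_NEWLINE_LF      0x00200000   /* field value 2 */
#define DEMO_NEWLINE_CRLF    0x00300000   /* field value 3 */
#define DEMO_NEWLINE_ANY     0x00400000   /* field value 4 */
#define DEMO_NEWLINE_ANYCRLF 0x00500000   /* field value 5 */
#define DEMO_NEWLINE_MASK    0x00700000   /* all three bits */

/* Return 1 if the newline field holds one of the six values in use
   (0 means "use the default"), or 0 for an unused number. */
int newline_field_in_use(int options)
{
    switch (options & DEMO_NEWLINE_MASK) {
    case 0:
    case DEMO_NEWLINE_CR:
    case DEMO_NEWLINE_LF:
    case DEMO_NEWLINE_CRLF:
    case DEMO_NEWLINE_ANY:
    case DEMO_NEWLINE_ANYCRLF:
        return 1;
    default:
        return 0;
    }
}
```

With this layout, OR-ing the CR and LF values gives exactly the CRLF
value, while OR-ing LF with ANY gives field value 6, which is one of
the two unused numbers.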

The only time that a line break in a pattern is specially recognized
when compiling is when PCRE_EXTENDED is set. CR and LF are whitespace
characters, and so are ignored in this mode. Also, an unescaped #
outside a character class indicates a comment that lasts until after
the next line break sequence. In other circumstances, line break
sequences in patterns are treated as literal data.

The newline option that is set at compile time becomes the default that
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.

PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not
followed by ? behaves as if it were followed by ?: but named
parentheses can still be used for capturing (and they acquire numbers
in the usual way). There is no equivalent of this option in Perl.

PCRE_NO_START_OPTIMIZE

This is an option that acts at matching time; that is, it is really an
option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
time, it is remembered with the compiled pattern and assumed at
matching time. For details see the discussion of PCRE_NO_START_OPTIMIZE
below.

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE_UCP is set, Unicode properties
are used instead to classify characters. More details are given in the
section on generic character types in the pcrepattern page. If you set
PCRE_UCP, matching one of the items it affects takes much longer. The
option is available only if PCRE has been compiled with Unicode
property support.

PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only when PCRE is built to include UTF-8
support. If not, the use of this option provokes an error. Details of
how this option changes the behaviour of PCRE are given in the
pcreunicode page.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. There is a discussion about the validity of
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
bytes is found, pcre_compile() returns an error. If you already know
that your pattern is valid, and you want to skip this check for
performance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it
is set, the effect of passing an invalid UTF-8 string as a pattern is
undefined. It may cause your program to crash. Note that this option
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
UTF-8 validity checking of subject strings.

1382

1383

1384

COMPILATION ERROR CODES

1385

1386

The following table lists the error codes than may be returned by

1387

pcre_compile2(), along with the error messages that may be returned by

1388

both compiling functions. As PCRE has developed, some error codes have

1389

fallen out of use. To avoid confusion, they have not been re-used.

   0  no error
   1  \ at end of pattern
   2  \c at end of pattern
   3  unrecognized character follows \
   4  numbers out of order in {} quantifier
   5  number too big in {} quantifier
   6  missing terminating ] for character class
   7  invalid escape sequence in character class
   8  range out of order in character class
   9  nothing to repeat
  10  [this code is not in use]
  11  internal error: unexpected repeat
  12  unrecognized character after (? or (?-
  13  POSIX named classes are supported only within a class
  14  missing )
  15  reference to non-existent subpattern
  16  erroffset passed as NULL
  17  unknown option bit(s) set
  18  missing ) after comment
  19  [this code is not in use]
  20  regular expression is too large
  21  failed to get memory
  22  unmatched parentheses
  23  internal error: code overflow
  24  unrecognized character after (?<
  25  lookbehind assertion is not fixed length
  26  malformed number or name after (?(
  27  conditional group contains more than two branches
  28  assertion expected after (?(
  29  (?R or (?[+-]digits must be followed by )
  30  unknown POSIX class name
  31  POSIX collating elements are not supported
  32  this version of PCRE is not compiled with PCRE_UTF8 support
  33  [this code is not in use]
  34  character value in \x{...} sequence is too large
  35  invalid condition (?(0)
  36  \C not allowed in lookbehind assertion
  37  PCRE does not support \L, \l, \N{name}, \U, or \u
  38  number after (?C is > 255
  39  closing ) for (?C expected
  40  recursive call could loop indefinitely
  41  unrecognized character after (?P
  42  syntax error in subpattern name (missing terminator)
  43  two named subpatterns have the same name
  44  invalid UTF-8 string
  45  support for \P, \p, and \X has not been compiled
  46  malformed \P or \p sequence
  47  unknown property name after \P or \p
  48  subpattern name is too long (maximum 32 characters)
  49  too many named subpatterns (maximum 10000)
  50  [this code is not in use]
  51  octal value is greater than \377 (not in UTF-8 mode)
  52  internal error: overran compiling workspace
  53  internal error: previously-checked referenced subpattern
        not found
  54  DEFINE group contains more than one branch
  55  repeating a DEFINE group is not allowed
  56  inconsistent NEWLINE options
  57  \g is not followed by a braced, angle-bracketed, or quoted
        name/number or by a plain number
  58  a numbered reference must not be zero
  59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
  60  (*VERB) not recognized
  61  number is too big
  62  subpattern name expected
  63  digit expected after (?+
  64  ] is an invalid data character in JavaScript compatibility mode
  65  different names for subpatterns of the same number are
        not allowed
  66  (*MARK) must have an argument
  67  this version of PCRE is not compiled with PCRE_UCP support
  68  \c must be followed by an ASCII character
  69  \k is not followed by a braced, angle-bracketed, or quoted name

The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.
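
As an illustration of how a calling program might present one of these
errors, here is a minimal sketch of a reporting helper. The function
name format_compile_error() and its output layout are hypothetical (they
are not part of PCRE); it assumes the caller already holds the error
code, message text, and error offset from a failed compile call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Writes a two-line diagnostic
   into buf: the pattern, then a caret under the error offset followed
   by the numeric code and the message text. Returns the number of
   characters written, or -1 if buf is too small. */
int format_compile_error(char *buf, size_t bufsize, const char *pattern,
                         int errorcode, const char *message, int erroffset)
{
    assert(buf != NULL && pattern != NULL && message != NULL);
    if (erroffset < 0) erroffset = 0;

    /* "%*s" with an empty string prints erroffset spaces of padding. */
    int n = snprintf(buf, bufsize, "%s\n%*s^ error %d: %s",
                     pattern, erroffset, "", errorcode, message);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : n;
}
```

For example, with the illustrative values pattern "a(b", code 14,
message "missing )" and offset 3, the caret is placed three columns in,
under the end of the pattern.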

STUDYING A PATTERN

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

If a compiled pattern is going to be used several times, it is worth
spending more time analyzing it in order to speed up the time taken for
matching. The function pcre_study() takes a pointer to a compiled
pattern as its first argument. If studying the pattern produces
additional information that will help speed up matching, pcre_study()
returns a pointer to a pcre_extra block, in which the study_data field
points to the results of the study.

The returned value from pcre_study() can be passed directly to
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also
contains other fields that can be set by the caller before the block is
passed; these are described below in the section on matching a pattern.

If studying the pattern does not produce any useful information,
pcre_study() returns NULL. In that circumstance, if the calling program
wants to pass any of the other fields to pcre_exec() or
pcre_dfa_exec(), it must set up its own pcre_extra block.

The second argument of pcre_study() contains option bits. There is only
one option: PCRE_STUDY_JIT_COMPILE. If this is set, and the
just-in-time compiler is available, the pattern is further compiled
into machine code that executes much faster than the pcre_exec()
matching function. If the just-in-time compiler is not available, this
option is ignored. All other bits in the options argument must be zero.

JIT compilation is a heavyweight optimization. It can take some time
for patterns to be analyzed, and for one-off matches and simple
patterns the benefit of faster execution might be offset by a much
slower study time. Not all patterns can be optimized by the JIT
compiler. For those that cannot be handled, matching automatically
falls back to the pcre_exec() interpreter. For more details, see the
pcrejit documentation.

The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.

When you are finished with a pattern, you can free the memory used for
the study data by calling pcre_free_study(). This function was added to
the API for release 8.20. For earlier versions, the memory could be
freed with pcre_free(), just like the pattern itself. This will still
work in cases where PCRE_STUDY_JIT_COMPILE is not used, but it is
advisable to change to the new function when convenient.

This is a typical way in which pcre_study() is used (except that in a
real application there should be tests for errors):

  int rc;
  int erroroffset;
  int ovector[30];
  const char *error;
  pcre *re;
  pcre_extra *sd;
  re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
  sd = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options */
    &error);        /* set to NULL or points to a message */
  rc = pcre_exec(   /* see below for details of pcre_exec() options */
    re, sd, "subject", 7, 0, 0, ovector, 30);
  ...
  pcre_free_study(sd);
  pcre_free(re);

Studying a pattern does two things: first, a lower bound for the length
of subject string that is needed to match the pattern is computed. This
does not mean that there are any strings of that length that match, but
it does guarantee that no shorter strings match. The value is used by
pcre_exec() and pcre_dfa_exec() to avoid wasting time by trying to
match strings that are shorter than the lower bound. You can find out
the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not
have a single fixed starting character. A bitmap of possible starting
bytes is created. This speeds up finding a position in the subject at
which to start matching.

These two optimizations apply to both pcre_exec() and pcre_dfa_exec().
However, they are not used by pcre_exec() if pcre_study() is called
with the PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is
successful. The optimizations can be disabled by setting the
PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
pcre_dfa_exec(). You might want to do this if your pattern contains
callouts or (*MARK) (which cannot be handled by the JIT compiler), and
you want to make use of these facilities in cases where matching fails.
See the discussion of PCRE_NO_START_OPTIMIZE below.

LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. By default, higher-valued codes
never match escapes such as \w or \d, but they can be tested with \p if
PCRE is built with Unicode character property support. Alternatively,
the PCRE_UCP option can be set at compile time; this causes \w and
friends to use Unicode property support instead of built-in tables. The
use of locales with Unicode is discouraged. If you are handling
characters with codes greater than 127, you should either use UTF-8 and
Unicode, or use locales, but not try to mix the two.

PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII
characters. However, when PCRE is built, it is possible to cause the
internal tables to be rebuilt in the default "C" locale of the local
system, which may cause them to be different.

The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using
Unicode, the need for this locale support is expected to die away.

External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() or pcre_exec() as often as necessary. For
example, to build and use tables that are appropriate for the French
locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:

  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".

When pcre_maketables() runs, the tables are built in memory that is
obtained via pcre_malloc. It is the caller's responsibility to ensure
that the memory containing the tables remains available for as long as
it is needed.

The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single
pattern, compilation, studying and matching all happen in the same
locale, but different patterns can be compiled in different locales.

It is possible to pass a table pointer or NULL (indicating the use of
the internal tables) to pcre_exec(). Although not intended for this
purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.

INFORMATION ABOUT A PATTERN

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented
below).

The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:

  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    sd,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and
are as follows:

PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

PCRE_INFO_DEFAULT_TABLES

Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study()
function. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.

PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. The fourth argument should point to an int
variable. (This option used to be called PCRE_INFO_FIRSTCHAR; the old
name is still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
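
The three kinds of result described above can be told apart with a
small classifier. This is only a sketch: the enum and the function name
classify_first_byte() are hypothetical and not part of PCRE; the input
is assumed to be the int that pcre_fullinfo() stored for
PCRE_INFO_FIRSTBYTE.

```c
#include <assert.h>

/* Hypothetical names for the three documented outcomes. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* -1: matches only at subject start or after a newline */
    FIRST_BYTE_UNKNOWN      /* -2: no first-byte information, or anchored pattern */
};

/* Classifies the int obtained via PCRE_INFO_FIRSTBYTE. */
enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2);    /* the documentation lists no other negative results */
    if (value >= 0) return FIRST_BYTE_FIXED;
    if (value == -1) return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```

For the pattern (cat|cow|coyote) mentioned above, the stored value would
be the byte 'c', which this sketch classifies as FIRST_BYTE_FIXED.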

PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char *
variable.
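
A 256-bit table occupies 32 bytes, one bit per possible byte value. The
sketch below shows one way such a table could be queried. The bit
layout used here (bit c & 7 of byte c / 8) is an assumption, since this
page does not specify it, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sets the bit for byte c in a 32-byte (256-bit) table. Used here only
   to build sample data; the real table is built during pcre_study(). */
void mark_start_byte(unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Returns 1 if byte c may begin a match according to the table, else 0.
   A NULL table means no information, so nothing can be ruled out. */
int byte_can_start_match(const unsigned char *table, unsigned char c)
{
    if (table == NULL) return 1;
    return (table[c / 8] & (1u << (c & 7))) != 0;
}
```

A caller that obtained NULL from PCRE_INFO_FIRSTTABLE should treat every
byte as a possible starting point, which is what the NULL branch models.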

PCRE_INFO_HASCRORLF

Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The fourth argument should point to an int
variable. An explicit match is either a literal CR or LF character, or
\r or \n.

PCRE_INFO_JCHANGED

Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

PCRE_INFO_JIT

Return 1 if the pattern was studied with the PCRE_STUDY_JIT_COMPILE
option, and just-in-time compiling was successful. The fourth argument
should point to an int variable. A return value of 0 means that JIT
support is not available in this version of PCRE, or that the pattern
was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
compiler could not handle this particular pattern. See the pcrejit
documentation for details of what can and cannot be handled.

PCRE_INFO_JITSIZE

If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
option, return the size of the JIT compiled code, otherwise return
zero. The fourth argument should point to a size_t variable.

PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.

PCRE_INFO_MINLENGTH

If the pattern was studied and a minimum length for matching subject
strings was computed, its value is returned. Otherwise the returned
value is -1. The value is a number of characters, not bytes (this may
be relevant in UTF-8 mode). The fourth argument should point to an int
variable. A non-negative value is a lower bound to the length of any
matching string. There may not be any strings of that length that do
actually match, but every string that does match is at least that long.
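
Because the minimum length is counted in characters, a caller comparing
it with a UTF-8 subject has to count characters rather than bytes. The
following sketch illustrates the distinction. Both function names are
hypothetical, the subject is assumed to be valid UTF-8, and the
matching functions already apply this lower bound themselves, so the
code is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters in a valid UTF-8 string of the given byte length by
   counting every byte that is not a continuation byte (10xxxxxx). */
int utf8_char_count(const char *subject, size_t length)
{
    assert(subject != NULL);
    int count = 0;
    for (size_t i = 0; i < length; i++) {
        if (((unsigned char)subject[i] & 0xC0) != 0x80) count++;
    }
    return count;
}

/* Returns 0 when the subject is certainly too short to match, else 1.
   minlength is the value from PCRE_INFO_MINLENGTH; -1 means unknown. */
int may_match_by_length(int minlength, const char *subject, size_t length,
                        int utf8)
{
    if (minlength < 0) return 1;
    int chars = utf8 ? utf8_char_count(subject, length) : (int)length;
    return chars >= minlength;
}
```

The five-byte UTF-8 string "caf\xc3\xa9" holds four characters, so with
a minimum length of 5 it is ruled out in UTF-8 mode even though its byte
length is 5.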

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

PCRE supports the use of named as well as numbered capturing
parentheses. The names are just an additional way of identifying the
parentheses, which still acquire numbers. Several convenience functions
such as pcre_get_named_substring() are provided for extracting captured
substrings by name. It is also possible to extract the data directly,
by first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
the conversion, you need to use the name-to-number map, which is
described by these three values.

The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing
parenthesis, most significant byte first. The rest of the entry is the
corresponding name, zero terminated.

The names are in alphabetical order. Duplicate names may appear if (?|
is used to create multiple groups with the same number, as described in
the section on duplicate subpattern numbers in the pcrepattern page.
Duplicate names for subpatterns with different numbers are permitted
only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
appear in the table in the order in which they were found in the
pattern. In the absence of (?| this is the order of increasing number;
when (?| is used this is not necessarily the case because later
subpatterns may have lower numbers.

As a simple example of the name/number table, consider the following
pattern (assume PCRE_EXTENDED is set, so white space - including
newlines - is ignored):

  (?<date> (?<year>(\d\d)?\d\d) -
  (?<month>\d\d) - (?<day>\d\d) )

There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??

When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.
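
The entry layout described above (two bytes of group number, most
significant first, followed by a zero-terminated name) can be walked as
in the following sketch. The function name group_number_for_name() is
hypothetical; the three inputs are assumed to be the values obtained
via PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and
PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns the capture group number recorded for
   name, or -1 if the name is not in the table. With duplicate names,
   the first entry found is returned. */
int group_number_for_name(const unsigned char *name_table, int name_count,
                          int name_entry_size, const char *name)
{
    assert(name_table != NULL && name != NULL && name_entry_size > 2);
    for (int i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry size. */
        const unsigned char *entry = name_table + (size_t)i * name_entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* high byte first */
    }
    return -1;
}
```

A linear scan keeps the sketch short. Since the names are stored in
alphabetical order, a binary search would also work for large tables.
Applied to the four-entry table shown above with an entry size of 8,
the name "month" yields 4 and "day" yields 5.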

PCRE_INFO_OKPARTIAL

Return 1 if the pattern can be used for partial matching with
pcre_exec(), otherwise 0. The fourth argument should point to an int
variable. From release 8.00, this always returns 1, because the
restrictions that previously applied to partial matching have been
lifted. The pcrepartial documentation gives details of partial
matching.

PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In
other words, they are the options that will be in force when matching
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED.

A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:

  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears

For such patterns, the PCRE_ANCHORED bit is set in the options returned
by pcre_fullinfo().

PCRE_INFO_SIZE

Return the size of the compiled pattern. The fourth argument should
point to a size_t variable. This value does not include the size of the
pcre structure that is returned by pcre_compile(). The value that is
passed as the argument to pcre_malloc() when pcre_compile() is getting
memory in which to place the compiled data is the value returned by
this option plus the size of the pcre structure. Studying a compiled
pattern, with or without JIT, does not alter the value returned by this
option.

PCRE_INFO_STUDYSIZE

Return the size of the data block pointed to by the study_data field in
a pcre_extra block. If pcre_extra is NULL, or there is no study data,
zero is returned. The fourth argument should point to a size_t
variable. The study_data field is set by pcre_study() to record
information that will speed up matching (see the section entitled
"Studying a pattern" above). The format of the study_data block is
private, but its length is made available via this option so that it
can be saved and restored (see the pcreprecompile documentation for
details).

OBSOLETE INFO FUNCTION

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern.
New programs should use pcre_fullinfo() instead. The yield of
pcre_info() is the number of capturing subpatterns, or one of the
following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).

REFERENCE COUNTS

int pcre_refcount(pcre *code, int adjust);

The pcre_refcount() function is used to maintain a reference count in
the data block that contains a compiled pattern. It is provided for the
benefit of applications that operate in an object-oriented manner,
where different parts of the application may be using the same compiled
pattern, but you want to free the block when they are all done.

When a pattern is compiled, the reference count field is initialized to
zero. It is changed only by calling this function, whose action is to
add the adjust value (which may be positive or negative) to it. The
yield of the function is the new value. However, the value of the count
is constrained to lie between 0 and 65535, inclusive. If the new value
is outside these limits, it is forced to the appropriate limit value.
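
The clamping rule can be modelled in a few lines. The function below is
a stand-alone model of the documented arithmetic only; model_refcount()
is a hypothetical name, and the code is not taken from the PCRE
sources.

```c
#include <assert.h>

/* Model of the documented behaviour: add adjust to the current count
   and force the result into the range 0..65535. */
int model_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    long long updated = (long long)current + adjust;  /* avoids int overflow */
    if (updated < 0) updated = 0;
    if (updated > 65535) updated = 65535;
    return (int)updated;
}
```

One consequence worth noting: because a count of zero cannot go
negative, an unbalanced extra release is silently absorbed rather than
reported, so callers that rely on the count should keep their
increments and decrements paired.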

Except when it is zero, the reference count is not correctly preserved
if a pattern is compiled on one host and then transferred to a host
whose byte-order is different. (This seems a highly unlikely scenario.)

MATCHING A PATTERN: THE TRADITIONAL FUNCTION

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

The function pcre_exec() is called to match a subject string against a
compiled pattern, which is passed in the code argument. If the pattern
was studied, the result of the study should be passed in the extra
argument. You can call pcre_exec() with the same code and extra
arguments as many times as you like, in order to match different
subject strings with the same pattern.

This function is the main matching facility of the library, and it
operates in a Perl-like manner. For specialist use there is also an
alternative matching function, which is described below in the section
about the pcre_dfa_exec() function.

In most applications, the pattern will have been compiled (and
optionally studied) in the same process that calls pcre_exec().
However, it is possible to save compiled patterns and study data, and
then use them later in different processes, possibly even on different
hosts. For a discussion about this, see the pcreprecompile
documentation.

Here is an example of a simple call to pcre_exec():

  int rc;
  int ovector[30];
  rc = pcre_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector of integers for substring information */
    30);            /* number of elements (NOT size in bytes) */

Extra data for pcre_exec()

If the extra argument is not NULL, it must point to a pcre_extra data
block. The pcre_study() function returns such a block (when it doesn't
return NULL), but you can also create one for yourself, and pass
additional information in it. The pcre_extra block contains the
following fields (not necessarily in this order):

  unsigned long int flags;
  void *study_data;
  void *executable_jit;
  unsigned long int match_limit;
  unsigned long int match_limit_recursion;
  void *callout_data;
  const unsigned char *tables;
  unsigned char **mark;

The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:

  PCRE_EXTRA_STUDY_DATA
  PCRE_EXTRA_EXECUTABLE_JIT
  PCRE_EXTRA_MATCH_LIMIT
  PCRE_EXTRA_MATCH_LIMIT_RECURSION
  PCRE_EXTRA_CALLOUT_DATA
  PCRE_EXTRA_TABLES
  PCRE_EXTRA_MARK

Other flag bits should be set to zero. The study_data field and
sometimes the executable_jit field are set in the pcre_extra block that
is returned by pcre_study(), together with the appropriate flag bits.
You should not set these yourself, but you may add to the block by
setting the other fields and their corresponding flag bits.

The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their
search trees. The classic example is a pattern that uses nested
unlimited repeats.

Internally, pcre_exec() uses a function called match(), which it calls
repeatedly (sometimes recursively). The limit set by match_limit is
imposed on the number of times this function is called during a match,
which has the effect of limiting the amount of backtracking that can
take place. For patterns that are not anchored, the count restarts from
zero for each position in the subject string.

When pcre_exec() is called with a pattern that was successfully studied
with the PCRE_STUDY_JIT_COMPILE option, the way that the matching is
executed is entirely different. However, there is still the possibility
of runaway matching that goes on for a very long time, and so the
match_limit value is also used in this case (but in a different way) to
limit how long the matching can continue.

The default value for the limit can be set when PCRE is built; the
built-in default is 10 million, which handles all but the most extreme
cases. You can override the default by supplying pcre_exec() with a
pcre_extra block in which match_limit is set, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
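
To make the effect of a call limit concrete, here is a toy backtracking
matcher for the single fixed pattern (a+)+b, a nested unlimited repeat.
It is a model written for this explanation, not PCRE's match()
function: all names and return codes below are hypothetical. It counts
its own recursive calls, gives up when the count passes the limit, and
restarts the count at each starting position, as described above for
non-anchored patterns.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOMATCH      0
#define TOY_MATCH        1
#define TOY_MATCHLIMIT (-8)   /* arbitrary negative code for this sketch */

/* Tries to match (a+)+b starting exactly at s. Each call handles one
   iteration of the outer group, trying the longest run of 'a' first
   and backtracking to shorter runs. */
static int toy_match(const char *s, unsigned long *calls, unsigned long limit)
{
    if (++*calls > limit) return TOY_MATCHLIMIT;

    size_t run = 0;
    while (s[run] == 'a') run++;            /* longest possible a+ here */

    for (size_t k = run; k >= 1; k--) {     /* backtrack over run lengths */
        if (s[k] == 'b') return TOY_MATCH;  /* group done, literal b follows */
        int rc = toy_match(s + k, calls, limit);   /* another group iteration */
        if (rc != TOY_NOMATCH) return rc;   /* propagate match or limit error */
    }
    return TOY_NOMATCH;
}

/* Unanchored search: the call count restarts at every start position. */
int toy_exec(const char *subject, unsigned long limit)
{
    assert(subject != NULL);
    for (const char *start = subject; ; start++) {
        unsigned long calls = 0;
        int rc = toy_match(start, &calls, limit);
        if (rc != TOY_NOMATCH) return rc;
        if (*start == '\0') break;
    }
    return TOY_NOMATCH;
}
```

In this toy, a subject of n letters "a" with no "b" costs 2^n calls at
the first starting position, so ten letters need exactly 1024 calls
there. A limit of 1024 lets the search finish with no match, while a
limit of 1023 stops it early with the limit code.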

The match_limit_recursion field is similar to match_limit, but instead
of limiting the total number of times that match() is called, it limits
the depth of recursion. The recursion depth is a smaller number than
the total number of calls, because not all calls to match() are
recursive. This limit is of use only if it is set smaller than
match_limit.

Limiting the recursion depth limits the amount of machine stack that
can be used, or, when PCRE has been compiled to use memory on the heap
instead of the stack, the amount of heap memory that can be used. This
limit is not relevant, and is ignored, if the pattern was successfully
studied with PCRE_STUDY_JIT_COMPILE.

The default value for match_limit_recursion can be set when PCRE is
built; the built-in default is the same value as the default for
match_limit. You can override the default by supplying pcre_exec() with
a pcre_extra block in which match_limit_recursion is set, and
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

The callout_data field is used in conjunction with the "callout"
feature, and is described in the pcrecallout documentation.

The tables field is used to pass a character tables pointer to
pcre_exec(); this overrides the value that is stored with the compiled
pattern. A non-NULL value is stored with the compiled pattern only if
custom tables were supplied to pcre_compile() via its tableptr
argument. If NULL is passed to pcre_exec() using this mechanism, it
forces PCRE's internal tables to be used. This facility is helpful when
re-using patterns that have been saved after compiling with an external
set of tables, because the external tables might be at a different
address when pcre_exec() is called. See the pcreprecompile
documentation for a discussion of saving compiled patterns for later
use.

If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
set to point to a char * variable. If the pattern contains any
backtracking control verbs such as (*MARK:NAME), and the execution ends
up with a name to pass back, a pointer to the name string (zero
terminated) is placed in the variable pointed to by the mark field. The
names are within the compiled pattern; if you wish to retain such a
name you must copy it before freeing the memory of a compiled pattern.
If there is no name to pass back, the variable pointed to by the mark
field is set to NULL. For details of the backtracking control verbs,
see the section entitled "Backtracking control" in the pcrepattern
documentation.

2049

2050

Option bits for pcre_exec()

2051

2052

The unused bits of the options argument for pcre_exec() must be zero.

2053

The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,

2054

PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,

2055

PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and

2056

PCRE_PARTIAL_HARD.

2057

2058

If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE

2059

option, the only supported options for JIT execution are

2060

PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and

2061

PCRE_NOTEMPTY_ATSTART. Note in particular that partial matching is not

2062

supported. If an unsupported option is used, JIT execution is disabled

2063

and the normal interpretive code in pcre_exec() is run.

PCRE_ANCHORED

The PCRE_ANCHORED option limits pcre_exec() to matching at the first
matching position. If a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it cannot be made
unanchored at matching time.

PCRE_BSR_ANYCRLF

PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape

2076

sequence matches. The choice is either to match only CR, LF, or CRLF,

2077

or to match any Unicode newline sequence. These options override the

2078

choice that was made or defaulted when the pattern was compiled.

PCRE_NEWLINE_CR

PCRE_NEWLINE_LF

PCRE_NEWLINE_CRLF

PCRE_NEWLINE_ANYCRLF

PCRE_NEWLINE_ANY

These options override the newline definition that was chosen or

2087

defaulted when the pattern was compiled. For details, see the descrip-

2088

tion of pcre_compile() above. During matching, the newline choice

2089

affects the behaviour of the dot, circumflex, and dollar metacharac-

2090

ters. It may also alter the way the match position is advanced after a

2091

match failure for an unanchored pattern.

2092

2093

When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is

2094

set, and a match attempt for an unanchored pattern fails when the cur-

2095

rent position is at a CRLF sequence, and the pattern contains no

2096

explicit matches for CR or LF characters, the match position is

2097

advanced by two characters instead of one, in other words, to after the

2098

CRLF.

2099

2100

The above rule is a compromise that makes the most common cases work as

2101

expected. For example, if the pattern is .+A (and the PCRE_DOTALL

2102

option is not set), it does not match the string "\r\nA" because, after

2103

failing at the start, it skips both the CR and the LF before retrying.

2104

However, the pattern [\r\n]A does match that string, because it con-

2105

tains an explicit CR or LF reference, and so advances only by one char-

2106

acter after the first failure.

2107

2108

An explicit match for CR or LF is either a literal appearance of one of
those characters, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches).

Notwithstanding the above, anomalous effects may still occur when CRLF

2114

is a valid newline sequence and explicit \r or \n escapes appear in the

pattern.

PCRE_NOTBOL

This option specifies that the first character of the subject string is
not the beginning of a line, so the circumflex metacharacter should not
match before it. Setting this without PCRE_MULTILINE (at compile time)
causes circumflex never to match. This option affects only the
behaviour of the circumflex metacharacter. It does not affect \A.

PCRE_NOTEOL

This option specifies that the end of the subject string is not the end

2128

of a line, so the dollar metacharacter should not match it nor (except

2129

in multiline mode) a newline immediately before it. Setting this with-

2130

out PCRE_MULTILINE (at compile time) causes dollar never to match. This

2131

option affects only the behaviour of the dollar metacharacter. It does

not affect \Z or \z.

PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is

2137

set. If there are alternatives in the pattern, they are tried. If all

2138

the alternatives match the empty string, the entire match fails. For

2139

example, if the pattern

a?b?

is applied to a string not beginning with "a" or "b", it matches an

2144

empty string at the start of the subject. With PCRE_NOTEMPTY set, this

2145

match is not valid, so PCRE searches further into the string for occur-

2146

rences of "a" or "b".

2147

2148

PCRE_NOTEMPTY_ATSTART

2149

2150

This is like PCRE_NOTEMPTY, except that an empty string match that is

2151

not at the start of the subject is permitted. If the pattern is

2152

anchored, such a match can occur only if the pattern contains \K.

2153

2154

Perl has no direct equivalent of PCRE_NOTEMPTY or

2155

PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern

2156

match of the empty string within its split() function, and when using

2157

the /g modifier. It is possible to emulate Perl's behaviour after

2158

matching a null string by first trying the match again at the same off-

2159

set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that

2160

fails, by advancing the starting offset (see below) and trying an ordi-

2161

nary match again. There is some code that demonstrates how to do this

2162

in the pcredemo sample program. In the most general case, you have to

2163

check to see if the newline convention recognizes CRLF as a newline,

2164

and if so, and the current character is CR followed by LF, advance the

2165

starting offset by two characters instead of one.

2166

2167

PCRE_NO_START_OPTIMIZE

2168

2169

There are a number of optimizations that pcre_exec() uses at the start

2170

of a match, in order to speed up the process. For example, if it is

2171

known that an unanchored match must start with a specific character, it

2172

searches the subject for that character, and fails immediately if it

2173

cannot find it, without actually running the main matching function.

2174

This means that a special item such as (*COMMIT) at the start of a pat-

2175

tern is not considered until after a suitable starting point for the

2176

match has been found. When callouts or (*MARK) items are in use, these

2177

"start-up" optimizations can cause them to be skipped if the pattern is

2178

never actually used. The start-up optimizations are in effect a pre-

2179

scan of the subject that takes place before the pattern is run.

2180

2181

The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,

2182

possibly causing performance to suffer, but ensuring that in cases

2183

where the result is "no match", the callouts do occur, and that items

2184

such as (*COMMIT) and (*MARK) are considered at every possible starting

2185

position in the subject string. If PCRE_NO_START_OPTIMIZE is set at

2186

compile time, it cannot be unset at matching time.

2187

2188

Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching

2189

operation. Consider the pattern

(*COMMIT)ABC

When this is compiled, PCRE records the fact that a match must start

2194

with the character "A". Suppose the subject string is "DEFABC". The

2195

start-up optimization scans along the subject, finds "A" and runs the

2196

first match attempt from there. The (*COMMIT) item means that the pat-

2197

tern must match the current starting position, which in this case, it

2198

does. However, if the same match is run with PCRE_NO_START_OPTIMIZE

2199

set, the initial scan along the subject string does not happen. The

2200

first match attempt is run starting from "D" and when this fails,

2201

(*COMMIT) prevents any further matches being tried, so the overall

2202

result is "no match". If the pattern is studied, more start-up opti-

2203

mizations may be used. For example, a minimum length for the subject

2204

may be recorded. Consider the pattern

(*MARK:A)(X|Y)

The minimum length for a match is one character. If the subject is

2209

"ABC", there will be attempts to match "ABC", "BC", "C", and then

2210

finally an empty string. If the pattern is studied, the final attempt

2211

does not take place, because PCRE knows that the subject is too short,

2212

and so the (*MARK) is never encountered. In this case, studying the

2213

pattern does not affect the overall match result, which is still "no

2214

match", but it does affect the auxiliary information that is returned.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if
PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 character
at the end of the subject, PCRE_ERROR_SHORTUTF8. In both cases,
information about the precise nature of the error may also be returned
(see the descriptions of these errors in the section entitled "Error
return values from pcre_exec()" below). If startoffset contains a value
that does not point to the start of a UTF-8 character (or to the end of
the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip

2234

these checks for performance reasons, you can set the

2235

PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to

2236

do this for the second and subsequent calls to pcre_exec() if you are

2237

making repeated calls to find all the matches in a single subject

2238

string. However, you should be sure that the value of startoffset

2239

points to the start of a UTF-8 character (or the end of the subject).

2240

When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8

2241

string as a subject or an invalid value of startoffset is undefined.

2242

Your program may crash.

PCRE_PARTIAL_HARD

PCRE_PARTIAL_SOFT

These options turn on the partial matching feature. For backwards com-

2248

patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial

2249

match occurs if the end of the subject string is reached successfully,

2250

but there are not enough subject characters to complete the match. If

2251

this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,

2252

matching continues by testing any remaining alternatives. Only if no

2253

complete match can be found is PCRE_ERROR_PARTIAL returned instead of

2254

PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the

2255

caller is prepared to handle a partial match, but only if no complete

2256

match can be found.

2257

2258

If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
case, if a partial match is found, pcre_exec() immediately returns
PCRE_ERROR_PARTIAL, without considering any other alternatives. In
other words, when PCRE_PARTIAL_HARD is set, a partial match is
considered to be more important than an alternative complete match.

In both cases, the portion of the string that was inspected when the

2265

partial match was found is set as the first matching string. There is a

2266

more detailed discussion of partial and multi-segment matching, with

2267

examples, in the pcrepartial documentation.

2268

2269

The string to be matched by pcre_exec()

2270

2271

The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
If the offset is negative or greater than the length of the subject,
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case. In UTF-8 mode, the byte offset
must point to the start of a UTF-8 character (or the end of the
subject). Unlike the pattern string, the subject may contain binary
zero bytes.
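
The two offset rules described for startoffset can be sketched as a
standalone check that a caller might run before invoking the matcher.
This is only an illustration: the function name, the enum, and the utf8
flag are hypothetical and are not part of the PCRE API, which performs
these checks itself unless PCRE_NO_UTF8_CHECK is set.

```c
/* Hypothetical result codes for this sketch; they are not PCRE constants. */
enum offset_check { OFFSET_OK, OFFSET_OUT_OF_RANGE, OFFSET_MID_CHARACTER };

/* Returns OFFSET_OK when startoffset lies within the subject and, in
   UTF-8 mode, does not land on a continuation byte (binary 10xxxxxx).
   The end of the subject is always an acceptable offset. */
enum offset_check check_start_offset(const char *subject, int length,
                                     int startoffset, int utf8)
{
    if (startoffset < 0 || startoffset > length)
        return OFFSET_OUT_OF_RANGE;
    if (utf8 && startoffset < length &&
        ((unsigned char)subject[startoffset] & 0xC0) == 0x80)
        return OFFSET_MID_CHARACTER;
    return OFFSET_OK;
}
```

A check of this kind is most useful when PCRE_NO_UTF8_CHECK is going to
be set, because in that case the library no longer validates the offset.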

A non-zero starting offset is useful when searching for another match

2282

in the same subject by calling pcre_exec() again after a previous suc-

2283

cess. Setting startoffset differs from just passing over a shortened

2284

string and setting PCRE_NOTBOL in the case of a pattern that begins

2285

with any kind of lookbehind. For example, consider the pattern

\Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.
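
The difference between the two calls comes down to what the matcher can
see behind the starting position. The following sketch reimplements the
\B test by hand, assuming ASCII word characters (letters, digits, and
underscore); both function names are hypothetical and this is not how
PCRE is implemented internally.

```c
#include <ctype.h>

/* A word character for this sketch: ASCII letter, digit, or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Sketch of the \B test: true when position pos is NOT a word boundary.
   Anything outside the subject counts as a non-word character, so \B
   fails at offset 0 of a subject that begins with a letter. */
int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

With the full subject and pos set to 4, the character before the
position is visible and the test succeeds; with the shortened subject
and pos set to 0, nothing precedes the position and the test fails.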

Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by
first trying the match again at the same offset, with the
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the
pcredemo sample program. In the most general case, you have to check to
see if the newline convention recognizes CRLF as a newline, and if so,
and the current character is CR followed by LF, advance the starting
offset by two characters instead of one.
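
The "advance by one, or by two at a CRLF" step can be sketched as a
small helper. The function name and the crlf_is_newline flag are
hypothetical: the caller is assumed to have worked out from the newline
convention whether CRLF counts as a newline. The sketch works in bytes
only, so in UTF-8 mode a caller would additionally have to step over
any continuation bytes.

```c
/* Returns the offset at which to retry an ordinary match after the
   anchored, non-empty retry at `offset` has failed. The caller is
   expected to stop instead of calling this when offset == length. */
int advance_start_offset(const char *subject, int length, int offset,
                         int crlf_is_newline)
{
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;   /* step over the whole CRLF pair */
    return offset + 1;       /* ordinary single-byte advance */
}
```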

If a non-zero starting offset is passed when the pattern is anchored,

2312

one attempt to match at the given offset is made. This can only succeed

2313

if the pattern does not require the match to be at the start of the

2314

subject.

2315

2316

How pcre_exec() returns captured substrings

2317

2318

In general, a pattern matches a certain portion of the subject, and in

2319

addition, further substrings from the subject may be picked out by

2320

parts of the pattern. Following the usage in Jeffrey Friedl's book,

2321

this is called "capturing" in what follows, and the phrase "capturing

2322

subpattern" is used for a fragment of a pattern that picks out a sub-

2323

string. PCRE supports several other kinds of parenthesized subpattern

2324

that do not cause substrings to be captured.

2325

2326

Captured substrings are returned to the caller via a vector of integers

2327

whose address is passed in ovector. The number of elements in the vec-

2328

tor is passed in ovecsize, which must be a non-negative number. Note:

2329

this argument is NOT the size of ovector in bytes.

2330

2331

The first two-thirds of the vector is used to pass back captured sub-

2332

strings, each substring using a pair of integers. The remaining third

2333

of the vector is used as workspace by pcre_exec() while matching cap-

2334

turing subpatterns, and is not available for passing back information.

2335

The number passed in ovecsize should always be a multiple of three. If

2336

it is not, it is rounded down.

2337

2338

When a match is successful, information about captured substrings is

2339

returned in pairs of integers, starting at the beginning of ovector,

2340

and continuing up to two-thirds of its length at the most. The first

2341

element of each pair is set to the byte offset of the first character

2342

in a substring, and the second is set to the byte offset of the first

2343

character after the end of a substring. Note: these values are always

2344

byte offsets, even in UTF-8 mode. They are not character counts.

2345

2346

The first pair of integers, ovector[0] and ovector[1], identify the

2347

portion of the subject string matched by the entire pattern. The next

2348

pair is used for the first capturing subpattern, and so on. The value

2349

returned by pcre_exec() is one more than the highest numbered pair that

2350

has been set. For example, if two substrings have been captured, the

2351

returned value is 3. If there are no capturing subpatterns, the return

2352

value from a successful match is 1, indicating that just the first pair

2353

of offsets has been set.

2354

2355

If a capturing subpattern is matched repeatedly, it is the last portion

2356

of the string that it matched that is returned.

2357

2358

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If neither the actual string matched
nor any captured substrings are of interest, pcre_exec() may be called
with ovector passed as NULL and ovecsize as zero. However, if the
pattern contains back references and the ovector is not big enough to
remember the related substrings, PCRE has to get additional memory for
use during matching. Thus it is usually advisable to supply an ovector
of reasonable size.

There are some cases where zero is returned (indicating vector over-

2369

flow) when in fact the vector is exactly the right size for the final

2370

match. For example, consider the pattern

(a)(?:(b)c|bd)

If a vector of 6 elements (allowing for only 1 captured substring) is

2375

given with subject string "abd", pcre_exec() will try to set the second

2376

captured string, thereby recording a vector overflow, before failing to

2377

match "c" and backing up to try the second alternative. The zero

2378

return, however, does correctly indicate that the maximum number of

2379

slots (namely 2) have been filled. In similar cases where there is tem-

2380

porary overflow, but the final number of used slots is actually less

2381

than the maximum, a non-zero value is returned.

2382

2383

The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
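
The sizing arithmetic can be written out as two one-line helpers. Both
names are hypothetical; they only restate the rules given in this
section (one pair per substring, a third of the vector reserved as
workspace, and sizes rounded down to a multiple of three).

```c
/* Smallest ovecsize that can report the whole match plus n captures. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs (whole match included) that a vector of
   ovecsize integers can return. Integer division discards any
   remainder, matching the "rounded down" rule. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a vector of 6 elements yields two usable pairs: the whole
match and a single captured substring.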

It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs
corresponding to unused subpatterns are set to -1.
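
Reading a substring directly from the offset pairs, including the unset
case, might look like the sketch below. The function name and its
return-value convention (-1 for unset, -2 for out of range or buffer
too small) are assumptions made for this illustration; the library's
own pcre_copy_substring() uses different error codes.

```c
#include <string.h>

/* Copies capture number i into buf as a zero-terminated string.
   rc is the positive value returned by the matcher. Returns the
   length, -1 if the group is unset, or -2 if i is out of range or
   buf is too small. */
int capture_to_buffer(const char *subject, const int *ovector, int rc,
                      int i, char *buf, int bufsize)
{
    if (i < 0 || i >= rc)
        return -2;
    int start = ovector[2 * i];      /* byte offset of first character */
    int end = ovector[2 * i + 1];    /* byte offset just past the end */
    if (start < 0)
        return -1;                   /* unset: both offsets are -1 */
    int len = end - start;
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the example above, the offset pairs would be {0,3}, {0,1}, {-1,-1},
and {1,3} with a return value of 4, so group 2 is reported as unset
while groups 1 and 3 give "a" and "bc".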

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1, and the offsets for the second and
third capturing subpatterns (assuming the vector is large enough, of
course) are set to -1.

Note: Elements in the first two-thirds of ovector that do not
correspond to capturing parentheses in the pattern are never changed.
That is, if a pattern contains n capturing parentheses, no more than
ovector[0] to ovector[2n+1] are set by pcre_exec(). The other elements
(in the first two-thirds) retain whatever values they previously had.

Some convenience functions are provided for extracting the captured

2410

substrings as separate strings. These are described below.

2411

2412

Error return values from pcre_exec()

2413

2414

If pcre_exec() fails, it returns a negative number. The following are

2415

defined in the header file:

2416

2417

PCRE_ERROR_NOMATCH (-1)

2418

2419

The subject string did not match the pattern.

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and

2424

ovecsize was not zero.

2425

2426

PCRE_ERROR_BADOPTION (-3)

2427

2428

An unrecognized bit was set in the options argument.

2429

2430

PCRE_ERROR_BADMAGIC (-4)

2431

2432

PCRE stores a 4-byte "magic number" at the start of the compiled code,

2433

to catch the case when it is passed a junk pointer and to detect when a

2434

pattern that was compiled in an environment of one endianness is run in

2435

an environment with the other endianness. This is the error that PCRE

2436

gives when the magic number is not present.

2437

2438

PCRE_ERROR_UNKNOWN_OPCODE (-5)

2439

2440

While running the pattern match, an unknown item was encountered in the

2441

compiled pattern. This error could be caused by a bug in PCRE or by

2442

overwriting of the compiled pattern.

2443

2444

PCRE_ERROR_NOMEMORY (-6)

2445

2446

If a pattern contains back references, but the ovector that is passed

2447

to pcre_exec() is not big enough to remember the referenced substrings,

2448

PCRE gets a block of memory at the start of matching to use for this

2449

purpose. If the call via pcre_malloc() fails, this error is given. The

2450

memory is automatically freed at the end of matching.

2451

2452

This error is also given if pcre_stack_malloc() fails in pcre_exec().

2453

This can happen only when PCRE has been compiled with --disable-stack-

2454

for-recursion.

2455

2456

PCRE_ERROR_NOSUBSTRING (-7)

2457

2458

This error is used by the pcre_copy_substring(), pcre_get_substring(),

2459

and pcre_get_substring_list() functions (see below). It is never

2460

returned by pcre_exec().

2461

2462

PCRE_ERROR_MATCHLIMIT (-8)

2463

2464

The backtracking limit, as specified by the match_limit field in a

2465

pcre_extra structure (or defaulted) was reached. See the description

2466

above.

2467

2468

PCRE_ERROR_CALLOUT (-9)

2469

2470

This error is never generated by pcre_exec() itself. It is provided for

2471

use by callout functions that want to yield a distinctive error code.

2472

See the pcrecallout documentation for details.

2473

2474

PCRE_ERROR_BADUTF8 (-10)

2475

2476

A string that contains an invalid UTF-8 byte sequence was passed as a
subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
the output vector (ovecsize) is at least 2, the byte offset to the
start of the invalid UTF-8 character is placed in the first element,
and a reason code is placed in the second element. The reason codes are
listed in the following section. For backward compatibility, if
PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 character
at the end of the subject (reason codes 1 to 5), PCRE_ERROR_SHORTUTF8
is returned instead of PCRE_ERROR_BADUTF8.

PCRE_ERROR_BADUTF8_OFFSET (-11)

2487

2488

The UTF-8 byte sequence that was passed as a subject was checked and

2489

found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the

2490

value of startoffset did not point to the beginning of a UTF-8 charac-

2491

ter or the end of the subject.

2492

2493

PCRE_ERROR_PARTIAL (-12)

2494

2495

The subject string did not match, but it did match partially. See the

2496

pcrepartial documentation for details of partial matching.

2497

2498

PCRE_ERROR_BADPARTIAL (-13)

2499

2500

This code is no longer in use. It was formerly returned when the

2501

PCRE_PARTIAL option was used with a compiled pattern containing items

2502

that were not supported for partial matching. From release 8.00

2503

onwards, there are no restrictions on partial matching.

2504

2505

PCRE_ERROR_INTERNAL (-14)

2506

2507

An unexpected internal error has occurred. This error could be caused

2508

by a bug in PCRE or by overwriting of the compiled pattern.

2509

2510

PCRE_ERROR_BADCOUNT (-15)

2511

2512

This error is given if the value of the ovecsize argument is negative.

2513

2514

PCRE_ERROR_RECURSIONLIMIT (-21)

2515

2516

The internal recursion limit, as specified by the match_limit_recursion

2517

field in a pcre_extra structure (or defaulted) was reached. See the

2518

description above.

2519

2520

PCRE_ERROR_BADNEWLINE (-23)

2521

2522

An invalid combination of PCRE_NEWLINE_xxx options was given.

2523

2524

PCRE_ERROR_BADOFFSET (-24)

2525

2526

The value of startoffset was negative or greater than the length of the

2527

subject, that is, the value in length.

2528

2529

PCRE_ERROR_SHORTUTF8 (-25)

2530

2531

This error is returned instead of PCRE_ERROR_BADUTF8 when the subject

2532

string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD

2533

option is set. Information about the failure is returned as for

2534

PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but

2535

this special error code for PCRE_PARTIAL_HARD precedes the implementa-

2536

tion of returned information; it is retained for backwards compatibil-

2537

ity.

2538

2539

PCRE_ERROR_RECURSELOOP (-26)

2540

2541

This error is returned when pcre_exec() detects a recursion loop within

2542

the pattern. Specifically, it means that either the whole pattern or a

2543

subpattern has been called recursively for the second time at the same

2544

position in the subject string. Some simple patterns that might do this

2545

are detected and faulted at compile time, but more complicated cases,

2546

in particular mutual recursions between two different subpatterns, can-

2547

not be detected until run time.

2548

2549

PCRE_ERROR_JIT_STACKLIMIT (-27)

2550

2551

This error is returned when a pattern that was successfully studied

2552

using the PCRE_STUDY_JIT_COMPILE option is being matched, but the mem-

2553

ory available for the just-in-time processing stack is not large

2554

enough. See the pcrejit documentation for more details.

2555

2556

Error numbers -16 to -20 and -22 are not used by pcre_exec().

2557

2558

Reason codes for invalid UTF-8 strings

2559

2560

When pcre_exec() returns either PCRE_ERROR_BADUTF8 or
PCRE_ERROR_SHORTUTF8, and the size of the output vector (ovecsize) is
at least 2, the offset of the start of the invalid UTF-8 character is
placed in the first output vector element (ovector[0]) and a reason
code is placed in the second element (ovector[1]). The reason codes are
given names in the pcre.h header file:

PCRE_UTF8_ERR1

PCRE_UTF8_ERR2

PCRE_UTF8_ERR3

PCRE_UTF8_ERR4

PCRE_UTF8_ERR5

The string ends with a truncated UTF-8 character; the code specifies

2574

how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8

2575

characters to be no longer than 4 bytes, the encoding scheme (origi-

2576

nally defined by RFC 2279) allows for up to 6 bytes, and this is

2577

checked first; hence the possibility of 4 or 5 missing bytes.

PCRE_UTF8_ERR6

PCRE_UTF8_ERR7

PCRE_UTF8_ERR8

PCRE_UTF8_ERR9

PCRE_UTF8_ERR10

The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of

2586

the character do not have the binary value 0b10 (that is, either the

2587

most significant bit is 0, or the next bit is 1).

PCRE_UTF8_ERR11

PCRE_UTF8_ERR12

A character that is valid by the RFC 2279 rules is either 5 or 6 bytes

2593

long; these code points are excluded by RFC 3629.

PCRE_UTF8_ERR13

A 4-byte character has a value greater than 0x10ffff; these code points
are excluded by RFC 3629.

PCRE_UTF8_ERR14

A 3-byte character has a value in the range 0xd800 to 0xdfff; this
range of code points is reserved by RFC 3629 for use with UTF-16, and
so is excluded from UTF-8.
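
The two range restrictions that RFC 3629 places on decoded values (the
upper limit and the surrogate gap) can be sketched as a single test on
a code point. The function name is hypothetical and this is not PCRE's
internal validation routine.

```c
/* True when cp is a value that RFC 3629 allows UTF-8 to encode:
   no greater than 0x10ffff and outside the surrogate range
   0xd800 to 0xdfff, which is reserved for UTF-16. */
int is_encodable_code_point(unsigned long cp)
{
    if (cp > 0x10FFFFUL)
        return 0;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;
    return 1;
}
```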

PCRE_UTF8_ERR15

PCRE_UTF8_ERR16

PCRE_UTF8_ERR17

PCRE_UTF8_ERR18

PCRE_UTF8_ERR19

A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
for a value that can be represented by fewer bytes, which is invalid.
For example, the two bytes 0xc0, 0xae give the value 0x2e, whose
correct coding uses just one byte.
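
The arithmetic behind the 0xc0, 0xae example can be worked through in a
few lines. This sketch handles only the 2-byte case, and both function
names are hypothetical. The lead byte contributes its low five bits and
the continuation byte its low six bits.

```c
/* Decodes a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into its value.
   The caller is assumed to have checked the bit patterns already. */
unsigned decode_two_byte(unsigned char b0, unsigned char b1)
{
    return ((unsigned)(b0 & 0x1F) << 6) | (unsigned)(b1 & 0x3F);
}

/* A 2-byte sequence is overlong when its value fits in one byte,
   that is, when the decoded value is below 0x80. */
int two_byte_is_overlong(unsigned char b0, unsigned char b1)
{
    return decode_two_byte(b0, b1) < 0x80;
}
```

For 0xc0, 0xae the lead byte contributes zero and the continuation byte
contributes 0x2e, so the decoded value is 0x2e and the sequence is
overlong.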

PCRE_UTF8_ERR20

The two most significant bits of the first byte of a character have the

2620

binary value 0b10 (that is, the most significant bit is 1 and the sec-

2621

ond is 0). Such a byte can only validly occur as the second or subse-

2622

quent byte of a multi-byte character.

PCRE_UTF8_ERR21

The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
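
Several of the reason codes above depend only on the first byte of a
character and on how many bytes remain in the subject. The sketch below
classifies a lead byte under the original 6-byte scheme and counts
missing bytes for a truncated character. The function names and return
conventions are hypothetical; the code does not use or reproduce the
PCRE_UTF8_ERRn constants.

```c
/* Returns the total sequence length (1 to 6) announced by a lead byte
   under the original RFC 2279 scheme, 0 for a stray continuation byte
   (10xxxxxx), or -1 for 0xfe and 0xff, which never occur in UTF-8. */
int utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return -1;
}

/* Number of bytes missing from the character that starts at offset,
   or 0 when the announced length fits within the subject (or when the
   byte at offset is not a valid lead byte at all). */
int utf8_missing_bytes(const unsigned char *subject, int length, int offset)
{
    int need = utf8_lead_length(subject[offset]);
    int have = length - offset;
    return need > have ? need - have : 0;
}
```

Because the 6-byte scheme is checked first, a lead byte of 0xfc at the
very end of a subject is reported as having 5 bytes missing, which is
why up to 5 missing bytes are possible.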

EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

2631

2632

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings.

A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and
pcre_get_substring(). Unfortunately, the interface to
pcre_get_substring_list() is not adequate for handling strings
containing binary zeros, because the end of the final string is not
independently indicated.

The first three arguments are the same for all three of these
functions: subject is the subject string that has just been
successfully matched, ovector is a pointer to the vector of integer
offsets that was passed to pcre_exec(), and stringcount is the number
of substrings that were captured by the match, including the substring
that matched the entire regular expression. This is the value returned
by pcre_exec() if it is greater than zero. If pcre_exec() returned
zero, indicating that it ran out of space in ovector, the value passed
as stringcount should be the number of elements in the vector divided
by three.
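
The rule for choosing stringcount can be captured in one helper. The
function name and the -1 "nothing to extract" convention are
assumptions made for this sketch and are not part of the PCRE API.

```c
/* rc is the value returned by pcre_exec(); ovecsize is the element
   count of the vector that was passed to it. Returns the stringcount
   to hand to the extraction functions, or -1 when rc is a negative
   error code and there is nothing to extract. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc < 0)
        return -1;
    return rc > 0 ? rc : ovecsize / 3;
}
```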

The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of these error codes:

  PCRE_ERROR_NOMEMORY       (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

  PCRE_ERROR_NOSUBSTRING    (-7)

There is no substring whose number is stringnumber.

2687

2688

The pcre_get_substring_list() function extracts all available sub-

2689

strings and builds a list of pointers to them. All this is done in a

2690

single block of memory that is obtained via pcre_malloc. The address of

2691

the memory block is returned via listptr, which is also the start of

2692

the list of string pointers. The end of the list is marked by a NULL

2693

pointer. The yield of the function is zero if all went well, or the

2694

error code

2695

2696

PCRE_ERROR_NOMEMORY (-6)

2697

2698

if the attempt to get the memory block failed.

2699

2700

When any of these functions encounters a substring that is unset, which

2701

can happen when capturing subpattern number n+1 matches some part of

2702

the subject, but subpattern n has not been used at all, they return an

2703

empty string. This can be distinguished from a genuine zero-length sub-

2704

string by inspecting the appropriate offset in ovector, which is nega-

2705

tive for unset substrings.

2706

2707

The two convenience functions pcre_free_substring() and pcre_free_sub-

2708

string_list() can be used to free the memory returned by a previous

2709

call of pcre_get_substring() or pcre_get_substring_list(), respec-

2710

tively. They do nothing more than call the function pointed to by

2711

pcre_free, which of course could be called directly from a C program.

2712

However, PCRE is used in some situations where it is linked via a spe-

2713

cial interface to another programming language that cannot use

2714

pcre_free directly; it is for these cases that the functions are pro-

vided.

EXTRACTING CAPTURED SUBSTRINGS BY NAME

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

To extract a substring by name, you first have to find the associated
number. For example, for this pattern

(a+)b(?<xxx>\d+)...

the number of the subpattern called "xxx" is 2. If the name is known to

2739

be unique (PCRE_DUPNAMES was not set), you can find the number from the

2740

name by calling pcre_get_stringnumber(). The first argument is the com-

2741

piled pattern, and the second is the name. The yield of the function is

2742

the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no

2743

subpattern of that name.

2744

2745

Given the number, you can extract the substring directly, or use one of

2746

the functions described in the previous section. For convenience, there

2747

are also two functions that do the whole job.

2748

2749

Most of the arguments of pcre_copy_named_substring() and

2750

pcre_get_named_substring() are the same as those for the similarly

2751

named functions that extract by number. As these are described in the

2752

previous section, they are not re-described here. There are just two

2753

differences:

2754

2755

First, instead of a substring number, a substring name is given. Sec-

2756

ond, there is an extra argument, given at the start, which is a pointer

2757

to the compiled pattern. This is needed in order to gain access to the

2758

name-to-number translation table.

2759

2760

These functions call pcre_get_stringnumber(), and if it succeeds, they

2761

then call pcre_copy_substring() or pcre_get_substring(), as appropri-

2762

ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the

2763

behaviour may not be what you want (see the next section).

2764

2765

Warning: If the pattern uses the (?| feature to set up multiple subpat-

2766

terns with the same number, as described in the section on duplicate

2767

subpattern numbers in the pcrepattern page, you cannot use names to

2768

distinguish the different subpatterns, because names are not included

2769

in the compiled code. The matching process uses only numbers. For this

2770

reason, the use of different names for subpatterns of the same number

2771

causes an error at compile time.

2772

2773

2774

DUPLICATE SUBPATTERN NAMES

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

When a pattern is compiled with the PCRE_DUPNAMES option, names for

2780

subpatterns are not required to be unique. (Duplicate names are always

2781

allowed for subpatterns with the same number, created by using the (?|

2782

feature. Indeed, if such subpatterns are named, they are required to

2783

use the same names.)

2784

2785

Normally, patterns with duplicate names are such that in any one match,

2786

only one of the named subpatterns participates. An example is shown in

2787

the pcrepattern documentation.

2788

2789

When duplicates are present, pcre_copy_named_substring() and

2790

pcre_get_named_substring() return the first substring corresponding to

2791

the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING

2792

(-7) is returned; no data is returned. The pcre_get_stringnumber()

2793

function returns one of the numbers that are associated with the name,

2794

but it is not defined which it is.

2795

2796

If you want to get full details of all captured substrings for a given

2797

name, you must use the pcre_get_stringtable_entries() function. The

2798

first argument is the compiled pattern, and the second is the name. The

2799

third and fourth are pointers to variables which are updated by the

2800

function. After it has run, they point to the first and last entries in

2801

the name-to-number table for the given name. The function itself

2802

returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if

2803

there are none. The format of the table is described above in the
section entitled "Information about a pattern". Given all the relevant
entries for the name, you can extract each of their numbers, and

2806

hence the captured data, if any.
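A minimal sketch of walking the entries follows. It assumes the 8-bit
table layout given in the "Information about a pattern" section: each
entry starts with the two-byte group number, most significant byte first,
followed by the zero-terminated name. The function collect_group_numbers()
is hypothetical and not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: collect the group numbers of all
 * name-table entries from first to last inclusive. entry_size is the value
 * returned by pcre_get_stringtable_entries(). At most max_numbers values
 * are stored; the number actually stored is returned. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    int count = 0;

    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2 && max_numbers >= 0);

    for (const char *entry = first; entry <= last; entry += entry_size) {
        const unsigned char *e = (const unsigned char *)entry;

        if (count == max_numbers)
            break;                              /* caller's array is full */
        numbers[count++] = (e[0] << 8) | e[1];  /* big-endian group number */
    }
    return count;
}
```

Each collected number can then be checked against ovector, or passed as
stringnumber to pcre_copy_substring() or pcre_get_substring(), to find the
groups of that name that were actually set.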

2807

2808

2809

FINDING ALL POSSIBLE MATCHES

2810

2811

The traditional matching function uses a similar algorithm to Perl,

2812

which stops when it finds the first match, starting at a given point in

2813

the subject. If you want to find all possible matches, or the longest

2814

possible match, consider using the alternative matching function (see

2815

below) instead. If you cannot use the alternative function, but still

2816

need to find all possible matches, you can kludge it up by making use

2817

of the callout facility, which is described in the pcrecallout documen-

2818

tation.

2819

2820

What you have to do is to insert a callout right at the end of the pat-

2821

tern. When your callout function is called, extract and save the cur-

2822

rent matched substring. Then return 1, which forces pcre_exec() to

2823

backtrack and try other alternatives. Ultimately, when it runs out of

2824

matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.

2825

2826

2827

MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize,
     int *workspace, int wscount);

The function pcre_dfa_exec() is called to match a subject string

2835

against a compiled pattern, using a matching algorithm that scans the

2836

subject string just once, and does not backtrack. This has different

2837

characteristics to the normal algorithm, and is not compatible with

2838

Perl. Some of the features of PCRE patterns are not supported. Never-

2839

theless, there are times when this kind of matching can be useful. For

2840

a discussion of the two matching algorithms, and a list of features

2841

that pcre_dfa_exec() does not support, see the pcrematching documenta-

2842

tion.

2843

2844

The arguments for the pcre_dfa_exec() function are the same as for

2845

pcre_exec(), plus two extras. The ovector argument is used in a differ-

2846

ent way, and this is described below. The other common arguments are

2847

used in the same way as for pcre_exec(), so their description is not

2848

repeated here.

2849

2850

The two additional arguments provide workspace for the function. The

2851

workspace vector should contain at least 20 elements. It is used for

2852

keeping track of multiple paths through the pattern tree. More

2853

workspace will be needed for patterns and subjects where there are a

2854

lot of potential matches.

2855

2856

Here is an example of a simple call to pcre_dfa_exec():

  int rc;
  int ovector[10];
  int wspace[20];
  rc = pcre_dfa_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector of integers for substring information */
    10,             /* number of elements (NOT size in bytes) */
    wspace,         /* working space vector */
    20);            /* number of elements (NOT size in bytes) */

Option bits for pcre_dfa_exec()

2874

2875

The unused bits of the options argument for pcre_dfa_exec() must be

2876

zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-

2877

LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,

2878

PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,

2879

PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-

2880

TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last

2881

four of these are exactly the same as for pcre_exec(), so their

2882

description is not repeated here.

PCRE_PARTIAL_HARD

PCRE_PARTIAL_SOFT

These have the same general effect as they do for pcre_exec(), but the

2888

details are slightly different. When PCRE_PARTIAL_HARD is set for

2889

pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-

2890

ject is reached and there is still at least one matching possibility

2891

that requires additional characters. This happens even if some complete

2892

matches have also been found. When PCRE_PARTIAL_SOFT is set, the return

2893

code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end

2894

of the subject is reached, there have been no complete matches, but

2895

there is still at least one matching possibility. The portion of the

2896

string that was inspected when the longest partial match was found is

2897

set as the first matching string in both cases. There is a more

2898

detailed discussion of partial and multi-segment matching, with exam-

2899

ples, in the pcrepartial documentation.

PCRE_DFA_SHORTEST

Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to

2904

stop as soon as it has found one match. Because of the way the alterna-

2905

tive algorithm works, this is necessarily the shortest possible match

2906

at the first possible matching point in the subject string.

PCRE_DFA_RESTART

When pcre_dfa_exec() returns a partial match, it is possible to call it

2911

again, with additional subject characters, and have it continue with

2912

the same match. The PCRE_DFA_RESTART option requests this action; when

2913

it is set, the workspace and wscount arguments must reference the same

2914

vector as before because data about the match so far is left in them

2915

after a partial match. There is more discussion of this facility in the

2916

pcrepartial documentation.

2917

2918

Successful returns from pcre_dfa_exec()

2919

2920

When pcre_dfa_exec() succeeds, it may have matched more than one sub-

2921

string in the subject. Note, however, that all the matches from one run

2922

of the function start at the same point in the subject. The shorter

2923

matches are all initial substrings of the longer matches. For example,

if the pattern

  <.*>

is matched against the string

  This is <something> <something else> <something further> no more

the three matched strings are

  <something>
  <something> <something else>
  <something> <something else> <something further>

On success, the yield of the function is a number greater than zero,

2939

which is the number of matched substrings. The substrings themselves

2940

are returned in ovector. Each string uses two elements; the first is

2941

the offset to the start, and the second is the offset to the end. In

2942

fact, all the strings have the same start offset. (Space could have

2943

been saved by giving this only once, but it was decided to retain some

2944

compatibility with the way pcre_exec() returns data, even though the

2945

meaning of the strings is different.)

2946

2947

The strings are returned in reverse order of length; that is, the long-

2948

est matching string is given first. If there were too many matches to

2949

fit into ovector, the yield of the function is zero, and the vector is

2950

filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()

2951

can use the entire ovector for returning matched strings.

2952

2953

Error returns from pcre_dfa_exec()

2954

2955

The pcre_dfa_exec() function returns a negative number when it fails.

2956

Many of the errors are the same as for pcre_exec(), and these are

2957

described above. There are in addition the following errors that are

2958

specific to pcre_dfa_exec():

2959

2960

  PCRE_ERROR_DFA_UITEM      (-16)

This return is given if pcre_dfa_exec() encounters an item in the
pattern that it does not support, for instance, the use of \C or a back
reference.

  PCRE_ERROR_DFA_UCOND      (-17)

This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.

  PCRE_ERROR_DFA_UMLIMIT    (-18)

This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit or match_limit_recursion
fields. This is not supported (these fields are meaningless for DFA
matching).

  PCRE_ERROR_DFA_WSSIZE     (-19)

This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.

  PCRE_ERROR_DFA_RECURSE    (-20)

When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.

SEE ALSO

pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
pcrestack(3).

AUTHOR

Philip Hazel

University Computing Service

3002

Cambridge CB2 3QH, England.

REVISION

Last updated: 02 December 2011

3008

3009

------------------------------------------------------------------------------

3010

3011

3012

PCRECALLOUT(3) PCRECALLOUT(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE CALLOUTS

int (*pcre_callout)(pcre_callout_block *);

3022

3023

PCRE provides a feature called "callout", which is a means of temporar-

3024

ily passing control to the caller of PCRE in the middle of pattern

3025

matching. The caller of PCRE provides an external function by putting

3026

its entry point in the global variable pcre_callout. By default, this

3027

variable contains NULL, which disables all calling out.

3028

3029

Within a regular expression, (?C) indicates the points at which the

3030

external function is to be called. Different callout points can be

3031

identified by putting a number less than 256 after the letter C. The

3032

default value is zero. For example, this pattern has two callout

points:

(?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() or

3038

pcre_compile2() is called, PCRE automatically inserts callouts, all

3039

with number 255, before each item in the pattern. For example, if

3040

PCRE_AUTO_CALLOUT is used with the pattern

A(\d{2}|--)

it is processed as if it were

3045

3046

(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

3047

3048

Notice that there is a callout before and after each parenthesis and

3049

alternation bar. Automatic callouts can be used for tracking the

3050

progress of pattern matching. The pcretest command has an option that

3051

sets automatic callouts; when it is used, the output indicates how the

3052

pattern is matched. This is useful information when you are trying to

3053

optimize the performance of a particular pattern.

3054

3055

The use of callouts in a pattern makes it ineligible for optimization

3056

by the just-in-time compiler. Studying such a pattern with the

3057

PCRE_STUDY_JIT_COMPILE option always fails.

MISSING CALLOUTS

You should be aware that, because of optimizations in the way PCRE

3063

matches patterns by default, callouts sometimes do not happen. For

3064

example, if the pattern is

ab(?C4)cd

PCRE knows that any matching string must contain the letter "d". If the

3069

subject string is "abyz", the lack of "d" means that matching doesn't

3070

ever start, and the callout is never reached. However, with "abyd",

3071

though the result is still no match, the callout is obeyed.

3072

3073

If the pattern is studied, PCRE knows the minimum length of a matching

3074

string, and will immediately give a "no match" return without actually

3075

running a match if the subject is not long enough, or, for unanchored

3076

patterns, if it has been scanned far enough.

3077

3078

You can disable these optimizations by passing the PCRE_NO_START_OPTI-

3079

MIZE option to pcre_compile(), pcre_exec(), or pcre_dfa_exec(), or by

3080

starting the pattern with (*NO_START_OPT). This slows down the matching

3081

process, but does ensure that callouts such as the example above are

obeyed.

THE CALLOUT INTERFACE

3086

3087

During matching, when PCRE reaches a callout point, the external func-

3088

tion defined by pcre_callout is called (if it is set). This applies to

3089

both the pcre_exec() and the pcre_dfa_exec() matching functions. The

3090

only argument to the callout function is a pointer to a pcre_callout

3091

block. This structure contains the following fields:

  int         version;
  int         callout_number;
  int        *offset_vector;
  const char *subject;
  int         subject_length;
  int         start_match;
  int         current_position;
  int         capture_top;
  int         capture_last;
  void       *callout_data;
  int         pattern_position;
  int         next_item_length;
  const unsigned char *mark;

3106

3107

The version field is an integer containing the version number of the

3108

block format. The initial version was 0; the current version is 2. The

3109

version number will change again in future if additional fields are

3110

added, but the intention is never to remove any of the existing fields.

3111

3112

The callout_number field contains the number of the callout, as com-

3113

piled into the pattern (that is, the number after ?C for manual call-

3114

outs, and 255 for automatically generated callouts).

3115

3116

The offset_vector field is a pointer to the vector of offsets that was

3117

passed by the caller to pcre_exec() or pcre_dfa_exec(). When

3118

pcre_exec() is used, the contents can be inspected in order to extract

3119

substrings that have been matched so far, in the same way as for

3120

extracting substrings after a match has completed. For pcre_dfa_exec()

3121

this field is not useful.

3122

3123

The subject and subject_length fields contain copies of the values that

3124

were passed to pcre_exec().

3125

3126

The start_match field normally contains the offset within the subject

3127

at which the current match attempt started. However, if the escape

3128

sequence \K has been encountered, this value is changed to reflect the

3129

modified starting point. If the pattern is not anchored, the callout

3130

function may be called several times from the same point in the pattern

3131

for different starting points in the subject.

3132

3133

The current_position field contains the offset within the subject of

3134

the current match pointer.

3135

3136

When the pcre_exec() function is used, the capture_top field contains

3137

one more than the number of the highest numbered captured substring so

3138

far. If no substrings have been captured, the value of capture_top is

3139

one. This is always the case when pcre_dfa_exec() is used, because it

3140

does not support captured substrings.

3141

3142

The capture_last field contains the number of the most recently cap-

3143

tured substring. If no substrings have been captured, its value is -1.

3144

This is always the case when pcre_dfa_exec() is used.

3145

3146

The callout_data field contains a value that is passed to pcre_exec()

3147

or pcre_dfa_exec() specifically so that it can be passed back in
callouts. It is passed in the callout_data field of the pcre_extra data

3149

structure. If no such data was passed, the value of callout_data in a

3150

pcre_callout block is NULL. There is a description of the pcre_extra

3151

structure in the pcreapi documentation.

3152

3153

The pattern_position field is present from version 1 of the pcre_call-

3154

out structure. It contains the offset to the next item to be matched in

3155

the pattern string.

3156

3157

The next_item_length field is present from version 1 of the pcre_call-

3158

out structure. It contains the length of the next item to be matched in

3159

the pattern string. When the callout immediately precedes an alterna-

3160

tion bar, a closing parenthesis, or the end of the pattern, the length

3161

is zero. When the callout precedes an opening parenthesis, the length

3162

is that of the entire subpattern.

3163

3164

The pattern_position and next_item_length fields are intended to help

3165

in distinguishing between different automatic callouts, which all have

3166

the same callout number. However, they are set for all callouts.
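One way to use these two fields is to copy the text of the next pattern
item out of the pattern string, so that a callout function can report
which item it is about to try. The sketch below assumes the caller has
kept the original pattern string; copy_next_item() is a hypothetical
helper, not part of PCRE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: copy the pattern item described
 * by pattern_position and next_item_length into out as a zero-terminated
 * string. Returns the item length, or -1 if the values do not fit the
 * pattern or the output buffer. */
int copy_next_item(const char *pattern, int pattern_position,
                   int next_item_length, char *out, size_t out_size)
{
    size_t pattern_length;

    assert(pattern != NULL && out != NULL && out_size > 0);
    pattern_length = strlen(pattern);

    if (pattern_position < 0 || next_item_length < 0 ||
        (size_t)pattern_position > pattern_length ||
        (size_t)next_item_length > pattern_length - (size_t)pattern_position ||
        (size_t)next_item_length >= out_size)
        return -1;

    memcpy(out, pattern + pattern_position, (size_t)next_item_length);
    out[next_item_length] = '\0';
    return next_item_length;
}
```

Inside a callout function the two integer arguments would come from the
pattern_position and next_item_length fields of the callout block. A
length of zero yields an empty string, which corresponds to a callout
before an alternation bar, a closing parenthesis, or the end of the
pattern.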

3167

3168

The mark field is present from version 2 of the pcre_callout structure.

3169

In callouts from pcre_exec() it contains a pointer to the zero-termi-

3170

nated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)

3171

item in the match, or NULL if no such items have been passed. Instances

3172

of (*PRUNE) or (*THEN) without a name do not obliterate a previous

3173

(*MARK). In callouts from pcre_dfa_exec() this field always contains

NULL.

RETURN VALUES

The external callout function returns an integer to PCRE. If the value

3180

is zero, matching proceeds as normal. If the value is greater than

3181

zero, matching fails at the current point, but the testing of other

3182

matching possibilities goes ahead, just as if a lookahead assertion had

3183

failed. If the value is less than zero, the match is abandoned, and

3184

pcre_exec() or pcre_dfa_exec() returns the negative value.

3185

3186

Negative values should normally be chosen from the set of

3187

PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-

3188

dard "no match" failure. The error number PCRE_ERROR_CALLOUT is

3189

reserved for use by callout functions; it will never be used by PCRE

itself.

AUTHOR

Philip Hazel

University Computing Service

3197

Cambridge CB2 3QH, England.

REVISION

Last updated: 30 November 2011

3203

3204

------------------------------------------------------------------------------

3205

3206

3207

PCRECOMPAT(3) PCRECOMPAT(3)

NAME

PCRE - Perl-compatible regular expressions

3212

3213

3214

DIFFERENCES BETWEEN PCRE AND PERL

3215

3216

This document describes the differences in the ways that PCRE and Perl

3217

handle regular expressions. The differences described here are with

3218

respect to Perl versions 5.10 and above.

3219

3220

1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details

3221

of what it does have are given in the pcreunicode page.

3222

3223

2. PCRE allows repeat quantifiers only on parenthesized assertions, but

3224

they do not mean what you might think. For example, (?!a){3} does not

3225

assert that the next three characters are not "a". It just asserts that

3226

the next character is not "a" three times (in principle: PCRE optimizes

3227

this to run the assertion just once). Perl allows repeat quantifiers on

3228

other assertions such as \b, but these do not seem to have any use.

3229

3230

3. Capturing subpatterns that occur inside negative lookahead asser-

3231

tions are counted, but their entries in the offsets vector are never

3232

set. Perl sets its numerical variables from any such patterns that are

3233

matched before the assertion fails to match something (thereby succeed-

3234

ing), but only if the negative lookahead assertion contains just one

3235

branch.

3236

3237

4. Though binary zero characters are supported in the subject string,

3238

they are not allowed in a pattern string because it is passed as a nor-

3239

mal C string, terminated by zero. The escape sequence \0 can be used in

3240

the pattern to represent a binary zero.

3241

3242

5. The following Perl escape sequences are not supported: \l, \u, \L,

3243

\U, and \N when followed by a character name or Unicode value. (\N on

3244

its own, matching a non-newline character, is supported.) In fact these

3245

are implemented by Perl's general string-handling and are not part of

3246

its pattern matching engine. If any of these are encountered by PCRE,

3247

an error is generated by default. However, if the PCRE_JAVASCRIPT_COM-

3248

PAT option is set, \U and \u are interpreted as JavaScript interprets

3249

them.

3250

3251

6. The Perl escape sequences \p, \P, and \X are supported only if PCRE

3252

is built with Unicode character property support. The properties that

3253

can be tested with \p and \P are limited to the general category prop-

3254

erties such as Lu and Nd, script names such as Greek or Han, and the

3255

derived properties Any and L&. PCRE does support the Cs (surrogate)

3256

property, which Perl does not; the Perl documentation says "Because

3257

Perl hides the need for the user to understand the internal representa-

3258

tion of Unicode characters, there is no need to implement the somewhat

3259

messy concept of surrogates."

3260

3261

7. PCRE implements a simpler version of \X than Perl, which changed to

3262

make \X match what Unicode calls an "extended grapheme cluster". This

3263

is more complicated than an extended Unicode sequence, which is what

3264

PCRE matches.

3265

3266

8. PCRE does support the \Q...\E escape for quoting substrings. Charac-

3267

ters in between are treated as literals. This is slightly different

3268

from Perl in that $ and @ are also handled as literals inside the

3269

quotes. In Perl, they cause variable interpolation (but of course PCRE

3270

does not have variables). Note the following examples:

3271

3272

  Pattern            PCRE matches    Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the
                                       contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz

3278

3279

The \Q...\E sequence is recognized both inside and outside character

3280

classes.

3281

3282

9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})

3283

constructions. However, there is support for recursive patterns. This

3284

is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE

3285

"callout" feature allows an external function to be called during pat-

3286

tern matching. See the pcrecallout documentation for details.

3287

3288

10. Subpatterns that are called as subroutines (whether or not recur-

3289

sively) are always treated as atomic groups in PCRE. This is like

3290

Python, but unlike Perl. Captured values that are set outside a
subroutine call can be referenced from inside in PCRE, but not in Perl.

3292

There is a discussion that explains these differences in more detail in

3293

the section on recursion differences from Perl in the pcrepattern page.

3294

3295

11. If (*THEN) is present in a group that is called as a subroutine,

3296

its action is limited to that group, even if the group does not contain

3297

any | characters.

3298

3299

12. There are some differences that are concerned with the settings of

3300

captured strings when part of a pattern is repeated. For example,

3301

matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2

3302

unset, but in PCRE it is set to "b".

3303

3304

13. PCRE's handling of duplicate subpattern numbers and duplicate sub-

3305

pattern names is not as general as Perl's. This is a consequence of the

3306

fact that PCRE works internally just with numbers, using an external
table to translate between numbers and names. In particular, a pattern
such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have

3309

the same number but different names, is not supported, and causes an

3310

error at compile time. If it were allowed, it would not be possible to

3311

distinguish which parentheses matched, because both names map to cap-

3312

turing subpattern number 1. To avoid this confusing situation, an error

3313

is given at compile time.

3314

3315

14. Perl recognizes comments in some places that PCRE does not, for

3316

example, between the ( and ? at the start of a subpattern. If the /x

3317

modifier is set, Perl allows whitespace between ( and ? but PCRE never

3318

does, even if the PCRE_EXTENDED option is set.

3319

3320

15. PCRE provides some extensions to the Perl regular expression facil-

3321

ities. Perl 5.10 includes new features that are not in earlier ver-

3322

sions of Perl, some of which (such as named parentheses) have been in

3323

PCRE for some time. This list is with respect to Perl 5.10:

3324

3325

(a) Although lookbehind assertions in PCRE must match fixed length

3326

strings, each alternative branch of a lookbehind assertion can match a

3327

different length of string. Perl requires them all to have the same

3328

length.

3329

3330

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $

3331

meta-character matches only at the very end of the string.

3332

3333

(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-

3334

cial meaning is faulted. Otherwise, like Perl, the backslash is quietly

3335

ignored. (Perl can be made to issue a warning.)

3336

3337

(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-

3338

fiers is inverted, that is, by default they are not greedy, but if fol-

3339

lowed by a question mark they are.

3340

3341

(e) PCRE_ANCHORED can be used at matching time to force a pattern to be

3342

tried only at the first matching position in the subject string.

3343

3344

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,

3345

and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-

3346

lents.

3347

3348

(g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.

(h) The callout facility is PCRE-specific.

(i) The partial matching facility is PCRE-specific.

(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness. However, this
does not apply to optimized data created by the just-in-time compiler.

(k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.

(l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 14 November 2011
------------------------------------------------------------------------------

PCREPATTERN(3) PCREPATTERN(3)

NAME

PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION DETAILS


The syntax and semantics of the regular expressions that are supported
by PCRE are described in detail below. There is a quick-reference
syntax summary in the pcresyntax page. PCRE tries to match Perl syntax
and semantics as closely as it can. PCRE also supports some alternative
regular expression syntax (which does not conflict with the Perl
syntax) in order to provide some compatibility with regular expressions
in Python, .NET, and Oniguruma.

Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some
of which have copious examples. Jeffrey Friedl's "Mastering Regular
Expressions", published by O'Reilly, covers regular expressions in
great detail. This description of PCRE's regular expressions is
intended as reference material.

The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
this, PCRE must be built to include UTF-8 support, and you must call
pcre_compile() or pcre_compile2() with the PCRE_UTF8 option. There is
also a special sequence that can be given at the start of a pattern:

  (*UTF8)

Starting a pattern with this sequence is equivalent to setting the
PCRE_UTF8 option. This feature is not Perl-compatible. How setting
UTF-8 mode affects pattern matching is mentioned in several places
below. There is also a summary of UTF-8 features in the pcreunicode
page.

Another special sequence that may appear at the start of a pattern or
in combination with (*UTF8) is:

  (*UCP)

This has the same effect as setting the PCRE_UCP option: it causes
sequences such as \d and \w to use Unicode properties to determine
character types, instead of recognizing only characters with codes less
than 128 via a lookup table.

If a pattern starts with (*NO_START_OPT), it has the same effect as
setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
time. There are also some more of these special sequences that are
concerned with the handling of newlines; they are described below.

The remainder of this document discusses the patterns that are
supported by PCRE when its main matching function, pcre_exec(), is used.
From release 6.0, PCRE offers a second matching function,
pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. Some of the features discussed below are not available
when pcre_dfa_exec() is used. The advantages and disadvantages of the
alternative function, and how it differs from the normal function, are
discussed in the pcrematching page.

NEWLINE CONVENTIONS

PCRE supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF
(linefeed) character, the two-character sequence CRLF, any of the three
preceding, or any Unicode newline sequence. The pcreapi page has further
discussion about newlines, and shows how to set the newline convention
in the options arguments for the compiling and matching functions.

It is also possible to specify a newline convention by starting a
pattern string with one of the following five sequences:

  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences

These override the default and the options given to pcre_compile() or
pcre_compile2(). For example, on a Unix system where LF is the default
newline sequence, the pattern

  (*CR)a.b

changes the convention to CR. That pattern matches "a\nb" because LF is
no longer a newline. Note that these special settings, which are not
Perl-compatible, are recognized only at the very start of a pattern,
and that they must be in upper case. If more than one of them is
present, the last one is used.

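The handling of these leading sequences can be modelled in a few lines.
The Python sketch below is an illustration only and is not PCRE's C
implementation: the name split_newline_prefix and the string labels it
returns are invented for this example, and it ignores other leading
sequences such as (*UTF8) that may be mixed in.

```python
# Hypothetical model of the leading newline-convention sequences; the
# names below are illustrative and are not part of the PCRE API.
NEWLINE_PREFIXES = ("(*CR)", "(*LF)", "(*CRLF)", "(*ANYCRLF)", "(*ANY)")


def split_newline_prefix(pattern, default="LF"):
    """Return (convention, rest_of_pattern) for a pattern string."""
    convention = default
    while True:
        for prefix in NEWLINE_PREFIXES:
            if pattern.startswith(prefix):
                # Later settings replace earlier ones: the last one wins.
                convention = prefix[2:-1]
                pattern = pattern[len(prefix):]
                break
        else:
            # No recognised sequence at the current start: stop scanning.
            return convention, pattern
```

Under these assumptions, split_newline_prefix("(*CR)a.b") yields the
convention "CR" and the remaining pattern "a.b", while a lower case
(*cr) is left in the pattern untouched because the sequences must be in
upper case.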

The newline convention affects the interpretation of the dot
metacharacter when PCRE_DOTALL is not set, and also the behaviour of \N.
However, it does not affect what the \R escape sequence matches. By
default, this is any Unicode newline sequence, for Perl compatibility.
However, this can be changed; see the description of \R in the section
entitled "Newline sequences" below. A change of \R setting can be
combined with a change of newline convention.

CHARACTERS AND METACHARACTERS

A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE_CASELESS option), letters are
matched independently of case. In UTF-8 mode, PCRE always understands
the concept of case for characters whose values are less than 128, so
caseless matching is always possible. For characters with higher
values, the concept of case is supported if PCRE is compiled with
Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.

There are two different sets of metacharacters: those that are
recognized anywhere in the pattern except within square brackets, and
those that are recognized within square brackets. Outside square
brackets, the metacharacters are as follows:

  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

The backslash character has several uses. Firstly, if it is followed by
a character that is not a number or a letter, it takes away any special
meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the
pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a
backslash, you write \\.

In UTF-8 mode, only ASCII numbers and letters have any special meaning
after a backslash. All other characters (in particular, those whose
codepoints are greater than 127) are treated as literals.

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
the pattern (other than in a character class) and characters between a
# outside a character class and the next newline are ignored. An
escaping backslash can be used to include a whitespace or # character
as part of the pattern.

If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This is
different from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE, whereas in Perl, $ and @ cause variable
interpolation. Note the following examples:

  Pattern            PCRE matches   Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated.

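The PCRE column of the table can be reproduced with a small rewriting
step that turns each \Q...\E run into individually escaped characters.
The following Python sketch is a simplified model, not PCRE's own code:
the function name expand_quote_sequences is invented here, and the
sketch does not track character classes, so it covers only the rules
for patterns outside square brackets.

```python
def expand_quote_sequences(pattern):
    """Rewrite \\Q...\\E runs as backslash-escaped literal characters."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False          # end of the literal run
                i += 2
            else:
                ch = pattern[i]
                # Escaping a non-alphanumeric is always safe; letters and
                # digits are copied as they are.
                out.append(ch if ch.isalnum() else "\\" + ch)
                i += 1
        elif two == "\\Q":
            quoting = True               # start of a literal run
            i += 2
        elif two == "\\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)              # keep any other escape intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # An unterminated \Q simply runs to the end of the pattern.
    return "".join(out)
```

With this model, both \Qabc$xyz\E and \Qabc\E\$\Qxyz\E come out as
abc\$xyz, which matches the literal text abc$xyz, in line with the
"PCRE matches" column above.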

Non-printing characters

A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no restriction on
the appearance of non-printing characters, apart from the binary zero
that terminates a pattern, but when a pattern is being prepared by text
editing, it is often easier to use one of the following escape
sequences than the binary character it represents:

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         linefeed (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or back reference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh.. (non-JavaScript mode)
  \uhhhh     character with hex code hhhh (JavaScript mode only)

The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
inverted. Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
is 7B), while \c; becomes hex 7B (; is 3B). If the byte following \c
has a value greater than 127, a compile-time error occurs. This locks
out non-ASCII characters in both byte mode and UTF-8 mode. (When PCRE
is compiled in EBCDIC mode, all byte values are valid. A lower case
letter is converted to upper case, and then the 0xc0 bits are flipped.)

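The arithmetic behind \cx is easy to check by hand. The Python sketch
below models the ASCII (non-EBCDIC) rule just described; the function
name control_escape is an assumption made for this illustration and is
not a PCRE function.

```python
def control_escape(x):
    # Model of \cx in ASCII (non-EBCDIC) mode; x is the single character
    # that follows \c in the pattern.
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code -= 0x20          # convert lower case to upper case
    return code ^ 0x40        # invert bit 6 (hex 40)
```

For the three examples in the text, the model gives hex 1A for z, hex
3B for {, and hex 7B for ; (semicolon).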

By default, after \x, from zero to two hexadecimal digits are read
(letters can be in upper or lower case). Any number of hexadecimal
digits may appear between \x{ and }, but the value of the character
code must be less than 256 in non-UTF-8 mode, and less than 2**31 in
UTF-8 mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note
that this is bigger than the largest Unicode code point, which is
10FFFF.

If characters other than hexadecimal digits appear between \x{ and },
or if there is no terminating }, this form of escape is not recognized.
Instead, the initial \x will be interpreted as a basic hexadecimal
escape, with no following digits, giving a character whose value is
zero.

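The two forms of \x and the fallback between them can be sketched as
follows. This is a hedged Python model of the default (non-JavaScript)
rules only; parse_hex_escape is a name invented for the example, and
treating an empty \x{} as "not recognized" is an assumption that the
text above does not settle.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_hex_escape(text, utf8=False):
    # text is the part of the pattern that follows "\x".
    # Returns (character code, number of characters consumed from text).
    if text.startswith("{"):
        close = text.find("}")
        body = text[1:close] if close != -1 else ""
        if body and all(ch in HEX_DIGITS for ch in body):
            value = int(body, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                raise ValueError("character value in \\x{...} is too large")
            return value, close + 1
        # Not a valid \x{...}: fall through to the basic form, which then
        # finds no digits because the next character is "{".
    count = 0
    while count < 2 and count < len(text) and text[count] in HEX_DIGITS:
        count += 1
    value = int(text[:count], 16) if count else 0
    return value, count
```

In this model \xdc and \x{dc} both produce the value hex DC, whereas
\x{zz} is read as a bare \x with the value zero, leaving {zz} to be
handled as ordinary pattern text.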

If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
is as just described only when it is followed by two hexadecimal
digits. Otherwise, it matches a literal "x" character. In JavaScript
mode, support for code points greater than 255 is provided by \u, which
must be followed by four hexadecimal digits; otherwise it matches a
literal "u" character.

Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x (or by \u in JavaScript mode). There is no
difference in the way they are handled. For example, \xdc is exactly
the same as \x{dc} (or \u00dc in JavaScript mode).

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\07 specifies two binary zeros followed by a BEL character
(code value 7). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.

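The \0 rule can be modelled in a few lines of Python. This is a sketch
of the rule as stated, not PCRE's implementation, and the function name
parse_zero_escape is hypothetical.

```python
def parse_zero_escape(text):
    # text is the part of the pattern that follows "\0".
    # Returns (character code, remaining pattern text).
    count = 0
    while count < 2 and count < len(text) and text[count] in "01234567":
        count += 1
    value = int(text[:count], 8) if count else 0
    return value, text[count:]
```

The model shows why the advice about supplying two digits matters: the
text 113 after \0 is read as octal 11 (a tab) followed by a literal
"3", because no more than two further digits are taken.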

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and uses them to
generate a data character. Any subsequent digits stand for themselves.
In non-UTF-8 mode, the value of a character specified in octal must be
less than \400. In UTF-8 mode, values up to \777 are permitted. For
example:

  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
            writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
            character with octal code 113
  \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

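The decision between a back reference and an octal character can be
written down as a short procedure. The Python sketch below follows the
two paragraphs above for a backslash followed by a digit other than 0.
It is a model only: interpret_digit_escape and its return tuples are
invented for this illustration, and the upper limits on octal values
are not checked.

```python
DECIMAL_DIGITS = "0123456789"
OCTAL_DIGITS = "01234567"


def interpret_digit_escape(text, captures, in_class=False):
    # text follows the backslash and starts with a digit from 1 to 9;
    # captures is the number of capturing left parentheses seen so far.
    # Returns ("backref", number, rest) or ("char", code, rest).
    i = 0
    while i < len(text) and text[i] in DECIMAL_DIGITS:
        i += 1
    number = int(text[:i])
    if not in_class and (number < 10 or number <= captures):
        return ("backref", number, text[i:])
    # Otherwise re-read up to three octal digits as a data character.
    j = 0
    while j < 3 and j < len(text) and text[j] in OCTAL_DIGITS:
        j += 1
    code = int(text[:j], 8) if j else 0
    return ("char", code, text[j:])
```

With no capturing subpatterns, the model reads \40 as the character
with code 32 (a space) and \81 as a binary zero followed by the
literal text 81; with 40 or more capturing subpatterns, \40 becomes
back reference 40.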

All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).

\N is not allowed in a character class. \B, \R, and \X are not special
inside a character class. Like other unrecognized escape sequences,
they are treated as the literal characters "B", "R", and "X" by
default, but cause an error if the PCRE_EXTRA option is set. Outside a
character class, these sequences have different meanings.

Unsupported escape sequences

In Perl, the sequences \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By
default, PCRE does not support these escape sequences. However, if the
PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
\u can be used to define a character by code point, as described in the
previous section.

Absolute and relative back references

The sequence \g followed by an unsigned or a negative number,
optionally enclosed in braces, is an absolute or relative back
reference. A named back reference can be coded as \g{name}. Back
references are discussed later, following the discussion of
parenthesized subpatterns.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.

Generic character types

Another use of backslash is for specifying generic character types:

  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal whitespace character
  \H     any character that is not a horizontal whitespace character
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \v     any vertical whitespace character
  \V     any character that is not a vertical whitespace character
  \w     any "word" character
  \W     any "non-word" character

There is also the single sequence \N, which matches a non-newline
character. This is the same as the "." metacharacter when PCRE_DOTALL
is not set. Perl also uses \N to match characters by name; PCRE does
not support this.

Each pair of lower and upper case escape sequences partitions the
complete set of characters into two disjoint sets. Any given character
matches one, and only one, of each pair. The sequences can appear both
inside and outside character classes. They each match one character of
the appropriate type. If the current matching point is at the end of
the subject string, all of them fail, because there is no character to
match.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT
character. In PCRE, it never does.

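The difference between \s and the POSIX "space" class comes down to one
character code. The following Python sketch records the two sets as
described above; the constant and function names are made up for this
example and describe only the default behaviour, without PCRE_UCP or
locale-specific tables.

```python
# Character codes taken from the paragraph above.
PERL_SPACE = frozenset({9, 10, 12, 13, 32})    # HT, LF, FF, CR, space
POSIX_SPACE = PERL_SPACE | {11}                # the POSIX class adds VT


def matches_backslash_s(ch):
    # Default (non-UCP) behaviour of \s for a single character.
    return ord(ch) in PERL_SPACE
```

Here matches_backslash_s returns False for the VT character (code 11),
although that code is a member of POSIX_SPACE.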

A "word" character is an underscore or any character that is a letter
or digit. By default, the definition of letters and digits is
controlled by PCRE's low-valued character tables, and may vary if
locale-specific matching is taking place (see "Locale support" in the
pcreapi page). For example, in a French locale such as "fr_FR" in
Unix-like systems, or "french" in Windows, some character codes greater
than 128 are used for accented letters, and these are then matched by
\w. The use of locales with Unicode is discouraged.

By default, in UTF-8 mode, characters with values greater than 127
never match \d, \s, or \w, and always match \D, \S, and \W. These
sequences retain their original meanings from before UTF-8 support was
available, mainly for efficiency reasons. However, if PCRE is compiled
with Unicode property support, and the PCRE_UCP option is set, the
behaviour is changed so that Unicode properties are used to determine
character types, as follows:

  \d  any character that \p{Nd} matches (decimal digit)
  \s  any character that \p{Z} matches, plus HT, LF, FF, CR
  \w  any character that \p{L} or \p{N} matches, plus underscore

The upper case escapes match the inverse sets of characters. Note that
\d matches only decimal digits, whereas \w matches any Unicode digit,
as well as any Unicode letter, and underscore. Note also that PCRE_UCP
affects \b and \B, because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE_UCP is set.

The sequences \h, \H, \v, and \V are features that were added to Perl
at release 5.10. In contrast to the other sequences, which match only
ASCII characters by default, these always match certain high-valued
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The
horizontal space characters are:

  U+0009     Horizontal tab
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space

The vertical space characters are:

  U+000A     Linefeed
  U+000B     Vertical tab
  U+000C     Formfeed
  U+000D     Carriage return
  U+0085     Next line
  U+2028     Line separator
  U+2029     Paragraph separator
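
The two lists above translate directly into lookup sets. The Python
sketch below is an illustrative model of \h and \v in UTF-8 mode; the
names HORIZONTAL_SPACE, VERTICAL_SPACE and the two helper functions are
assumptions of this example rather than anything exported by PCRE.

```python
# Code points copied from the two lists above.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))       # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)
VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def is_horizontal_space(ch):
    # Model of \h for a single character.
    return ord(ch) in HORIZONTAL_SPACE


def is_vertical_space(ch):
    # Model of \v for a single character.
    return ord(ch) in VERTICAL_SPACE
```

The sets hold 19 and 7 code points respectively and do not overlap, so
no character is matched by both \h and \v.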

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.

In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph
separator, U+2029). Unicode character property support is not needed
for these characters to be recognized.

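The default behaviour of \R can be modelled without a regular
expression engine. In the Python sketch below, the subject is a string
of code points rather than bytes, and the function name
newline_sequence_length is invented for the illustration; it is a
model of the rule above, not of PCRE's matching code.

```python
SINGLE_NEWLINES = "\n\x0b\x0c\r\x85"      # LF, VT, FF, CR, NEL
UTF8_EXTRA_NEWLINES = "\u2028\u2029"      # LS and PS, UTF-8 mode only


def newline_sequence_length(subject, pos, utf8=False):
    # Number of characters that \R would consume at subject[pos], or 0
    # if there is no newline sequence at that position.
    if subject.startswith("\r\n", pos):
        return 2          # CRLF is tried first and taken as one unit
    if pos >= len(subject):
        return 0
    ch = subject[pos]
    if ch in SINGLE_NEWLINES or (utf8 and ch in UTF8_EXTRA_NEWLINES):
        return 1
    return 0
```

Because the CRLF alternative is tested first and there is no
backtracking into it, a CR that is followed by LF is never matched on
its own, which mirrors the atomic group in the equivalent pattern.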

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the
default when PCRE is built; if this is the case, the other behaviour
can be requested via the PCRE_BSR_UNICODE option. It is also possible
to specify these settings by starting a pattern string with one of the
following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to pcre_compile() or
pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
are not Perl-compatible, are recognized only at the very start of a
pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:

  (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF8) or (*UCP) special sequences.
Inside a character class, \R is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error
if PCRE_EXTRA is set.

Unicode character properties

When PCRE is built with Unicode character property support, three
additional escape sequences that match characters with specific
properties are available. When not in UTF-8 mode, these sequences are
of course limited to testing characters whose codepoints are less than
256, but they do work in this mode. The extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
character (including newline), and some special PCRE properties
(described in the next section). Other Perl properties such as
"InMusicalSymbols" are not currently supported by PCRE. Note that
\P{Any} does not match any characters, so always causes a match
failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi,
Inscriptional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian,
Lydian, Malayalam, Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic,
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,
Samaritan, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri,
Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Tamil, Telugu,
Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, Yi.

Each character has exactly one Unicode general category property,
specified by a two-letter abbreviation. For compatibility with Perl,
negation can be specified by including a circumflex between the opening
brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.

If only one letter is specified with \p or \P, it includes all the
general category properties that start with that letter. In this case,
in the absence of negation, the curly brackets in the escape sequence
are optional; these two examples have the same effect:

  \p{L}
  \pL

The following general category property codes are supported:

  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator

The special property L& is also supported: it matches a character that
has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".

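The way one-letter codes, two-letter codes, negation and L& relate to
each other can be illustrated with Python's standard unicodedata
module. This is a sketch under stated assumptions: matches_property is
a hypothetical helper, it handles general category codes only (no
script names), and Python's Unicode tables are not necessarily the
same release as the tables built into PCRE.

```python
import unicodedata


def matches_property(ch, prop):
    # Model of \p{prop} for general category codes only.
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)     # two-letter code such as "Lu"
    if prop == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:
        # A single letter covers every category that starts with it,
        # for example "L" covers Ll, Lm, Lo, Lt, and Lu.
        result = category.startswith(prop)
    else:
        result = category == prop
    return result != negate
```

In this model the modifier letter U+02B0 satisfies L but not L&, and
the property ^Lu gives the opposite answer to Lu for every character.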

The Cs (Surrogate) property applies only to characters in the range
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity
checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK
in the pcreapi page). Perl does not support the Cs property.

The long synonyms for property names that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".

No character that is in the Unicode table has the Cn (unassigned)
property. Instead, this property is assumed for any code point that is
not in the Unicode table.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.

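The equivalent pattern for \X can be followed step by step in a short
model. The Python sketch below uses the standard unicodedata module to
test for the "mark" property; the helper names are invented for this
example, the subject is treated as a string of code points, and
Python's Unicode tables may differ in release from those in PCRE.

```python
import unicodedata


def is_mark(ch):
    # True for the general categories Mc, Me, and Mn.
    return unicodedata.category(ch).startswith("M")


def extended_sequence_length(subject, pos):
    # Length of the match of (?>\PM\pM*) at subject[pos], or 0 on failure.
    if pos >= len(subject) or is_mark(subject[pos]):
        return 0
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1
    return end - pos
```

For the subject "e" followed by U+0301 (combining acute accent) and
"x", the model takes two characters at position 0 and one character at
position 2.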

Note that recent versions of Perl have changed \X to match what Unicode
calls an "extended grapheme cluster", which has a more complicated
definition.

Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE by default, though you can
make them do so by setting the PCRE_UCP option for pcre_compile() or by
starting the pattern with (*UCP).

PCRE's additional properties

As well as the standard Unicode properties described in the previous
section, PCRE supports four more that make it possible to convert
traditional escape sequences such as \w and \s and POSIX character
classes to use Unicode properties. PCRE uses these non-standard,
non-Perl properties internally when PCRE_UCP is set. They are:

  Xan   Any alphanumeric character
  Xps   Any POSIX space character
  Xsp   Any Perl space character
  Xwd   Any Perl "word" character

Xan matches characters that have either the L (letter) or the N
(number) property. Xps matches the characters tab, linefeed, vertical
tab, formfeed, or carriage return, and any other character that has the
Z (separator) property. Xsp is the same as Xps, except that vertical
tab is excluded. Xwd matches the same characters as Xan, plus
underscore.

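The four definitions build on one another, which a short model makes
visible. The Python sketch below expresses them with the standard
unicodedata module. The function names are hypothetical, each function
expects a single character, and the results depend on the Unicode
release that Python ships, which may differ from PCRE's tables.

```python
import unicodedata


def _category(ch):
    return unicodedata.category(ch)


def is_xan(ch):
    # L (letter) or N (number) property.
    return _category(ch)[0] in "LN"


def is_xps(ch):
    # Tab, linefeed, vertical tab, formfeed, carriage return, or any
    # character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or _category(ch).startswith("Z")


def is_xsp(ch):
    # Same as Xps, except that vertical tab is excluded.
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch):
    # Same as Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```

The only character on which is_xps and is_xsp disagree is the vertical
tab, and the only character on which is_xwd and is_xan disagree is the
underscore.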

Resetting the match start

4027

4028

The escape sequence \K causes any previously matched characters not to

4029

be included in the final matched sequence. For example, the pattern:

foo\Kbar

matches "foobar", but reports that it has matched "bar". This feature

4034

is similar to a lookbehind assertion (described below). However, in

4035

this case, the part of the subject before the real match does not have

4036

to be of fixed length, as lookbehind assertions do. The use of \K does

4037

not interfere with the setting of captured substrings. For example,

when the pattern

(foo)\Kbar

matches "foobar", the first substring is still set to "foo".
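
Python's standard re module is not PCRE and has no \K, but it does
have the fixed-length lookbehind that the paragraph above compares \K
with. The sketch below, offered only as an approximation, shows the
lookbehind form reporting just "bar", and the fixed-length restriction
that \K does not have.

```python
import re

# Lookbehind form of foo\Kbar: "foo" must precede, but is not reported.
m = re.search(r"(?<=foo)bar", "foobar")
reported = (m.group(), m.span())

# A lookbehind must have a fixed length; \K has no such restriction.
try:
    re.compile(r"(?<=fo+)bar")
    variable_length_ok = True
except re.error:
    variable_length_ok = False
```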

Perl documents that the use of \K within assertions is "not well
defined". In PCRE, \K is acted upon when it occurs inside positive
assertions, but is ignored in negative assertions.

Simple assertions

The final use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are:

  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject

Inside a character class, \b has a different meaning; it matches the
backspace character. If any other of these assertions appears in a
character class, by default it matches the corresponding literal
character (for example, \B matches the letter B). However, if the
PCRE_EXTRA option is set, an "invalid escape sequence" error is
generated instead.

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively. In
UTF-8 mode, the meanings of \w and \W can be changed by setting the
PCRE_UCP option. When this is done, it also affects \b and \B. Neither
PCRE nor Perl has a separate "start of word" or "end of word"
metasequence. However, whatever follows \b normally determines which it
is. For example, the fragment \ba matches "a" at the start of a word.
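
The way the neighbouring item turns \b into a "start of word" or "end
of word" test can be sketched with Python's standard re module, whose
\b behaves the same way for the ASCII subject used here. Python is an
assumption of this sketch, not part of PCRE.

```python
import re

subject = "an apple, banana"

# \ba: "a" at the start of a word.
starts = [m.start() for m in re.finditer(r"\ba", subject)]

# a\b: "a" at the end of a word.
ends = [m.start() for m in re.finditer(r"a\b", subject)]
```

Here starts holds the offsets of the initial "a" in "an" and "apple",
and ends holds only the offset of the final "a" in "banana".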

The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three
assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
which affect only the behaviour of the circumflex and dollar
metacharacters. However, if the startoffset argument of pcre_exec() is
non-zero, indicating that matching is to start at a point other than
the beginning of the subject, \A can never match. The difference
between \Z and \z is that \Z matches before a newline at the end of the
string as well as at the very end, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.
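
The /g-style loop described above can be sketched as follows. This is
an analogy in Python, not PCRE code: Python's re has no \G, but
Pattern.match(subject, pos) only succeeds when a match starts exactly
at pos, which is the condition \G expresses together with the
startoffset argument of pcre_exec(). The TOKEN pattern and the tokens()
helper are hypothetical.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject):
    """Collect adjacent tokens, stopping at the first gap."""
    pos, out = 0, []
    while pos < len(subject):
        # match() only succeeds if a token starts exactly at pos,
        # which is the condition \G expresses in PCRE.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        out.append(m.group())
        pos = m.end()  # the next attempt starts where this match ended
    return out, pos
```

Every alternative in TOKEN matches at least one character, so the loop
always advances; a pattern that can match the empty string would need
extra care.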

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex
matches immediately after internal newlines as well as at the start of
the subject string. It does not match after a newline that ends the
string. A dollar matches before any newlines in the string, as well as
at the very end, when PCRE_MULTILINE is set. When newline is specified
as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.

For example, the pattern /^abc$/ matches the subject string "def\nabc"
(where \n represents a newline) in multiline mode, but not otherwise.
Consequently, patterns that are anchored in single line mode because
all branches start with ^ are not anchored in multiline mode, and a
match for circumflex is possible when the startoffset argument of
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
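
The /^abc$/ example can be tried with Python's standard re module,
whose ^ and $ follow the same rules for a subject that uses LF as the
newline. Python is used here only as an assumption for illustration;
its re.MULTILINE flag plays the role of PCRE_MULTILINE, and the pos
argument of Pattern.search() plays the role of startoffset.

```python
import re

subject = "def\nabc"

single = re.search(r"^abc$", subject)               # default mode
multi = re.search(r"^abc$", subject, re.MULTILINE)  # multiline mode

# Dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# Starting past offset 0 stops a non-multiline ^ from matching.
anchored = re.compile(r"^abc")
from_offset = anchored.search("xyzabc", 3)
```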

Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.

FULL STOP (PERIOD, DOT) AND \N

Outside a character class, a dot in the pattern matches any one
character in the subject string except (by default) a character that
signifies the end of a line. In UTF-8 mode, the matched character may
be more than one byte long.

When a line ending is defined as a single character, dot never matches
that character; when the two-character sequence CRLF is used, dot does
not match CR if it is immediately followed by LF, but otherwise it
matches all characters (including isolated CRs and LFs). When any
Unicode line endings are being recognized, dot does not match CR or LF
or any of the other line ending characters.

The behaviour of dot with regard to newlines can be changed. If the
PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
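
A small sketch of the dot rules, using Python's standard re module as
an assumed stand-in. Python always treats LF as the only newline for
this purpose, so it corresponds to PCRE built or called with LF as the
line ending; re.DOTALL corresponds to PCRE_DOTALL.

```python
import re

# By default a dot does not match the newline character.
default = re.findall(r".", "a\nb")

# With DOTALL a dot matches every character, without exception.
dotall = re.findall(r".", "a\nb", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_two = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
crlf_one = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
```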

The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newlines. Dot has no special meaning in a character class.

The escape sequence \N behaves like a dot, except that it is not
affected by the PCRE_DOTALL option. In other words, it matches any
character except one that signifies the end of a line. Perl also uses
\N to match characters by name; PCRE does not support this.

MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches
line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can
usefully be used. Because \C breaks up characters into individual
bytes, matching one byte with \C in UTF-8 mode means that the rest of
the string may start with a malformed UTF-8 character. This has
undefined results, because PCRE assumes that it is dealing with valid
UTF-8 strings (and by default it checks this at the start of processing
unless the PCRE_NO_UTF8_CHECK option is used).

PCRE does not allow \C to appear in lookbehind assertions (described
below) in UTF-8 mode, because this would make it impossible to
calculate the length of the lookbehind.

In general, the \C escape sequence is best avoided in UTF-8 mode.
However, one way of using it that avoids the problem of malformed UTF-8
characters is to use a lookahead to check the length of the next
character, as in this pattern (ignore white space and line breaks):

  (?| (?=[\x00-\x7f])(\C) |
      (?=[\x80-\x{7ff}])(\C)(\C) |
      (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
      (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

A group that starts with (?| resets the capturing parentheses numbers
in each alternative (see "Duplicate Subpattern Numbers" below). The
assertions at the start of each branch check the next UTF-8 character
for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate
number of groups.
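
The four code point ranges tested by the lookaheads correspond to the
four UTF-8 encoded lengths. The sketch below restates that mapping in
Python without using a regular expression at all, since Python's str
patterns work on code points and have no \C. Both helper names are
hypothetical.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for code_point, by the ranges above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

def utf8_bytes(ch):
    """The individual bytes that the \\C groups would capture for ch."""
    return list(ch.encode("utf-8"))
```

For a given character, len(utf8_bytes(ch)) equals
utf8_length(ord(ch)), which is the number of \C groups used by the
matching branch of the pattern.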

SQUARE BRACKETS AND CHARACTER CLASSES

An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special by default. However, if the PCRE_JAVASCRIPT_COMPAT option is
set, a lone closing square bracket causes a compile-time error. If a
closing square bracket is required as a member of the class, it should
be the first data character in the class (after an initial circumflex,
if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may be more than one byte long. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching in UTF-8 mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
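
Two properties of negated classes described in this section, that they
must consume a character and that they are indifferent to newlines,
can be checked with a short sketch. Python's standard re module is
assumed here as a stand-in; it agrees with PCRE on both points.

```python
import re

# A negated class still has to consume one character...
at_end = re.search(r"x[^aeiou]", "x")      # nothing left after "x"
consumed = re.search(r"x[^aeiou]", "xy")

# ...and that character may be a newline, whatever the dot would do.
newline = re.fullmatch(r"[^a]", "\n")
dot = re.fullmatch(r".", "\n")
```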

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.
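
The two readings of the "]" example can be compared directly. The
sketch uses Python's standard re module, which is assumed here because
it parses these two classes the same way as described above.

```python
import re

# [W-] is a class of "W" and "-"; "46]" then follows as literal text.
unescaped = re.compile(r"[W-]46]")

# [W-\]46] is one class: the range W..] plus "4" and "6".
escaped = re.compile(r"[W-\]46]")
```

The range W..] spans the characters W, X, Y, Z, [, \ and ], so the
escaped form matches any one of those or "4" or "6", while a hyphen is
no longer a member of the class.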

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values greater than 127 only when
it is compiled with Unicode property support.
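
The claim about [W-c] can be checked by enumerating the ASCII
characters a caseless version of the class accepts. This is a sketch
using Python's standard re module, assumed here as a stand-in for PCRE;
only code points below 128 are examined.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

matched = {chr(c) for c in range(128) if caseless_range.fullmatch(chr(c))}

# W..c covers WXYZ [ \ ] ^ _ ` abc; caseless matching adds wxyz and ABC.
expected = set("WXYZ[\\]^_`abc") | set("wxyzABC")
```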

The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
\w, and \W may appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. In UTF-8 mode, the PCRE_UCP option affects the
meanings of \d, \s, \w and their upper case partners, just as it does
when they appear outside a character class, as described in the section
entitled "Generic character types" above. The escape sequence \b has a
different meaning inside a character class; it matches the backspace
character. The sequences \B, \N, \R, and \X are not special inside a
character class. Like any other unrecognized escape sequences, they are
treated as the literal characters "B", "N", "R", and "X" by default,
but cause an error if the PCRE_EXTRA option is set.

A circumflex can conveniently be used with the upper case character
types to specify a more restricted set of characters than the matching
lower case type. For example, the class [^\W_] matches any letter or
digit, but not underscore, whereas [\w] includes underscore. A positive
character class should be read as "something OR something OR ..." and a
negative class as "NOT something AND NOT something AND NOT ...".
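
The OR and AND-NOT readings can be tried on the two classes mentioned
above. Python's standard re module is assumed as a stand-in; note that
its \d and \w are Unicode-aware for str patterns, which resembles PCRE
with PCRE_UCP set rather than the PCRE default, but the ASCII subjects
used here behave the same either way.

```python
import re

# Positive class: a digit OR one of the letters A to F.
hex_digit = re.compile(r"[\dABCDEF]")

# Negative class: NOT a non-word character AND NOT an underscore.
letter_or_digit = re.compile(r"[^\W_]")
```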

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are:

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits and space
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

By default, in UTF-8 mode, characters with values greater than 127 do
not match any of the POSIX character classes. However, if the PCRE_UCP
option is passed to pcre_compile(), some of the classes are changed so
that Unicode character properties are used. This is achieved by
replacing the POSIX classes by other sequences, as follows:

  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}

Negated versions, such as [:^alpha:], use \P instead of \p. The other
POSIX classes are unchanged, and match only characters with code points
less than 128.

VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
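
The left-to-right rule, and what "succeeds" means inside a subpattern,
can be seen in a short sketch. Python's standard re module is assumed
here as a stand-in; it tries alternatives in the same order.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern, so the second alternative is used when "c" must follow.
rest_matters = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)
```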

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it,
so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings in
different parts of the pattern. Any changes made in one alternative do
carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
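
An option that applies to only part of a pattern can be sketched in
Python's standard re module, with one difference that is an assumption
of this example: recent versions of Python reject an unscoped (?i) in
the middle of a pattern, so the sketch uses the scoped form (?i:...)
to limit caseless matching to the "b", which is the effect (a(?i)b)c
has in PCRE.

```python
import re

# Only the "b" is matched caselessly; "a" and "c" stay case-sensitive.
scoped = re.compile(r"(a(?i:b))c")

accepted = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]
```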

Note: There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences such as (*CRLF)
to override what the application has set or what has been defaulted.
Details are given in the section entitled "Newline sequences" above.
There are also the (*UTF8) and (*UCP) leading sequences that can be
used to set UTF-8 and Unicode property modes; they are equivalent to
setting the PCRE_UTF8 and the PCRE_UCP options, respectively.

SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches "cataract", "caterpillar", or "cat". Without the parentheses,
it would match "cataract", "erpillar" or an empty string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns. For example, if the string "the red king" is matched
against the pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
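
The numbering of capturing and non-capturing subpatterns in the two
examples above can be reproduced with Python's standard re module,
which is assumed here as a stand-in and counts opening parentheses in
the same way. Its groups() method returns the captured substrings in
number order, much as the ovector does for pcre_exec().

```python
import re

capturing = re.fullmatch(
    r"the ((red|white) (king|queen))", "the red king"
)
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
```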

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

DUPLICATE SUBPATTERN NUMBERS

Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:

  (?|(Sat)ur|(Sun))day

Because the two alternatives are inside a (?| group, both sets of
capturing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group,
parentheses are numbered as usual, but the number is reset at the start
of each branch. The numbers of any capturing parentheses that follow
the subpattern start after the highest number used in any branch. The
following example is taken from the Perl documentation. The numbers
underneath show in which buffer the captured content will be stored.

  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4

A back reference to a numbered subpattern uses the most recent value
that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":

  /(?|(abc)|(def))\1/

In contrast, a subroutine call to a numbered subpattern always refers
to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":

  /(?|(abc)|(def))(?1)/

If a condition test for a subpattern's having matched refers to a
non-unique number, the test is true if any of the subpatterns of that
number have matched.

An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.

NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular
expressions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of
subpatterns. This feature was not added to Perl until release 5.10.
Python had the feature earlier, and PCRE introduced it at release 4.0,
using the Python syntax. PCRE now supports both the Perl and the Python
syntax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.

In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back
references, recursion, and conditions, can be made by name as well as
by number.

Names consist of up to 32 alphanumeric characters and underscores.
Named capturing parentheses are still allocated numbers as well as
names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
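
Since the (?P<name>...) syntax comes from Python, the relationship
between names and numbers can be sketched with Python's standard re
module. This is an analogy rather than the PCRE API: groupindex plays
the part of the name-to-number translation table, and group("name")
the part of the convenience function for extracting a substring by
name. The date pattern is a made-up example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Name-to-number table: named groups keep their ordinary numbers.
name_table = dict(date.groupindex)

m = date.fullmatch("2013-11-14")
by_name = m.group("year")
by_number = m.group(1)
unnamed = m.group(3)   # the unnamed group is simply number 3
```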

By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. (Duplicate names are also always permitted for subpatterns with
the same number, set up as described in the previous section.)
Duplicate names can be useful for patterns where only one instance of
the named parentheses can match. Suppose you want to match the name of
a weekday, either as a 3-letter abbreviation or as the full name, and
in both cases you want to extract the abbreviation. This pattern
(ignoring the line breaks) does the job:

  (?<DN>Mon|Fri|Sun)(?:day)?|
  (?<DN>Tue)(?:sday)?|
  (?<DN>Wed)(?:nesday)?|
  (?<DN>Thu)(?:rsday)?|
  (?<DN>Sat)(?:urday)?

There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)

The convenience function for extracting the data by name returns the
substring for the first (and in this example, the only) subpattern of
that name that matched. This saves searching to find which numbered
subpattern it was.

If you make a back reference to a non-unique named subpattern from
elsewhere in the pattern, the one that corresponds to the first
occurrence of the name is used. In the absence of duplicate numbers
(see the previous section) this is the one with the lowest number. If
you use a named reference in a condition test (see the section about
conditions below), either to check whether a subpattern has matched, or
to check for recursion, all subpatterns with the same name are tested.
If the condition is true for any one of them, the overall condition is
true. This is the same behaviour as testing by number. For further
details of the interfaces for handling named subpatterns, see the
pcreapi documentation.

Warning: You cannot use different names to distinguish between two
subpatterns with the same number because PCRE uses only the numbers
when matching. For this reason, an error is given at compile time if
different names are given to subpatterns with the same number. However,
you can give the same name to subpatterns with the same number, even
when PCRE_DUPNAMES is not set.

REPETITION

Repetition is specified by quantifiers, which can follow any of the
following items:

  a literal data character
  the dot metacharacter
  the \C escape sequence
  the \X escape sequence (in UTF-8 mode with Unicode properties)
  the \R escape sequence
  an escape such as \d or \pL that matches a single character
  a character class
  a back reference (see next section)
  a parenthesized subpattern (including assertions)
  a subroutine call to a subpattern (recursive or otherwise)

The general repetition quantifier specifies a minimum and maximum
number of permitted matches, by giving the two numbers in curly
brackets (braces), separated by a comma. The numbers must be less than
65536, and the first must be less than or equal to the second. For
example:

  z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
special character. If the second number is omitted, but the comma is
present, there is no upper limit; if the second number and the comma
are both omitted, the quantifier specifies an exact number of required
matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \d{8}

matches exactly 8 digits. An opening curly bracket that appears in a
position where a quantifier is not allowed, or one that does not match
the syntax of a quantifier, is taken as a literal character. For
example, {,6} is not a quantifier, but a literal string of four
characters.
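
The three quantifier forms above can be tried out with a short script.
This is a minimal sketch that uses Python's re module as a stand-in
for PCRE (an assumption made so the example runs without the C
library; it is not the PCRE API). For these three patterns the two
engines should agree. The {,6} case is deliberately left out, because
Python, unlike PCRE, treats {,6} as a quantifier. The helper name
matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, subject):
    """Return True if the pattern matches the entire subject string."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions of the preceding item
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
accepted = [s for s in candidates if matches_whole(r"z{2,4}", s)]
print(accepted)

# {3,}: at least three repetitions, with no upper limit
print(matches_whole(r"[aeiou]{3,}", "ae"), matches_whole(r"[aeiou]{3,}", "aeiouaeiou"))

# {8}: exactly eight repetitions
print(matches_whole(r"\d{8}", "2013111"), matches_whole(r"\d{8}", "20131114"))
```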

In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8
characters, each of which is represented by a two-byte sequence.
Similarly, when Unicode property support is available, \X{3} matches
three Unicode extended sequences, each of which may be several bytes
long (and they may be of different lengths).

The quantifier {0} is permitted, causing the expression to behave as
if the previous item and the quantifier were not present. This may be
useful for subpatterns that are referenced as subroutines from
elsewhere in the pattern (but see also the section entitled "Defining
subpatterns for use by reference only" below). Items other than
subpatterns that have a {0} quantifier are omitted from the compiled
pattern.

For convenience, the three most common quantifiers have
single-character abbreviations:

  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper
limit, for example:

  (a?)*

Earlier versions of Perl and PCRE used to give an error at compile
time for such patterns. However, because there are cases where this
can be useful, such patterns are now accepted, but if any repetition
of the subpattern does in fact match no characters, the loop is
forcibly broken.

By default, the quantifiers are "greedy", that is, they match as much
as possible (up to the maximum number of permitted times), without
causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs.
These appear between /* and */ and within the comment, individual *
and / characters may appear. An attempt to match C comments by
applying the pattern

  /\*.*\*/

to the string

  /* first comment */ not comment /* second comment */

fails, because it matches the entire string owing to the greediness of
the .* item.

However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible,
so the pattern

  /\*.*?\*/

does the right thing with the C comments. The meaning of the various
quantifiers is not otherwise changed, just the preferred number of
matches. Do not confuse this use of question mark with its use as a
quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in

  \d??\d

which matches one digit by preference, but can match two if that is
the only way the rest of the pattern matches.
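
The difference between the greedy and minimizing forms can be seen by
running both comment patterns over the subject string quoted above.
This is a sketch using Python's re module rather than PCRE itself (an
assumption for the sake of a runnable example); for these patterns the
two engines should behave alike. The variable names are illustrative
only.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs on to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first */ it can reach.
lazy = re.search(r"/\*.*?\*/", subject).group()

print(greedy == subject)
print(lazy)

# \d??\d prefers a single digit, but takes two when the rest of the
# match (here, the requirement to consume the whole subject) needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(one_digit, two_digits)
```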

If the PCRE_UNGREEDY option is set (an option that is not available in
Perl), the quantifiers are not greedy by default, but individual ones
can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.

When a parenthesized subpattern is quantified with a minimum repeat
count that is greater than 1 or with a limited maximum, more memory is
required for the compiled pattern, in proportion to the size of the
minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
(equivalent to Perl's /s) is set, thus allowing the dot to match
newlines, the pattern is implicitly anchored, because whatever follows
will be tried against every character position in the subject string,
so there is no point in retrying the overall match at any position
after the first. PCRE normally treats such a pattern as though it were
preceded by \A.

In cases where it is known that the subject string contains no
newlines, it is worth setting PCRE_DOTALL in order to obtain this
optimization, or alternatively using ^ to indicate anchoring
explicitly.

However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail
where a later one succeeds. Consider, for example:

  (.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth
character. For this reason, such a pattern is not implicitly anchored.
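
The match point in this example can be checked directly. The sketch
below uses Python's re module as a stand-in for PCRE (an assumption;
Python has no implicit-anchoring optimization to switch off, so this
only shows where the match lands).

```python
import re

subject = "xyz123abc123"

# An unanchored search has to move the start forward until the text
# captured by (.*) is repeated after "abc".
found = re.search(r"(.*)abc\1", subject)
print(found.start(), found.group(), found.group(1))

# A match that is forced to begin at offset 0 fails, which is why the
# pattern must not be treated as anchored.
anchored = re.match(r"(.*)abc\1", subject)
print(anchored)
```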

When a capturing subpattern is repeated, the value captured is the
substring that matched the final iteration. For example, after

  (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured
substring is "tweedledee". However, if there are nested capturing
subpatterns, the corresponding captured values may have been set in
previous iterations. For example, after

  /(a|(b))+/

matches "aba" the value of the second captured substring is "b".
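
Both capture rules can be observed with a few lines of code. This is
a sketch in Python's re module, used here in place of PCRE (an
assumption); Python follows the same rules for these two patterns.

```python
import re

# The outer group is overwritten on every iteration, so only the text
# of the final iteration is left in it.
m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)
print(last_iteration)

# The nested group (b) takes part only in the second iteration; its
# value is kept when the third iteration matches "a" instead.
n = re.match(r"(a|(b))+", "aba")
outer, inner = n.group(1), n.group(2)
print(outer, inner)
```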

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item
to be re-evaluated to see if a different number of repeats allows the
rest of the pattern to match. Sometimes it is useful to prevent this,
either to change the nature of the match, or to cause it to fail
earlier than it otherwise might, when the author of the pattern knows
there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject
line

  123456bar

After matching all 6 digits and then failing to match "foo", the
normal action of the matcher is to try again with only 5 digits
matching the \d+ item, and then with 4, and so on, before ultimately
failing. "Atomic grouping" (a term taken from Jeffrey Friedl's book)
provides the means for specifying that once a subpattern has matched,
it is not to be re-evaluated in this way.

If we use atomic grouping for the previous example, the matcher gives
up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this
example:

  (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the pattern
is prevented from backtracking into it. Backtracking past it to
previous items, however, works as normal.

An alternative description is that a subpattern of this type matches
the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.

Atomic grouping subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So, while both \d+ and
\d+? are prepared to adjust the number of digits they match in order
to make the rest of the pattern match, (?>\d+) can only match an
entire sequence of digits.
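
The effect of "swallowing everything" can be made visible with a
subject where ordinary backtracking would succeed. The sketch below
is written in Python rather than against PCRE, and because Python's
re module only gained (?>...) in version 3.11, it does not use an
atomic group at all. Instead it emulates one with a different
technique: a lookahead that captures the digits, followed by a back
reference that consumes them. This relies on a lookahead not being
re-entered once it has matched. With PCRE one would simply write
(?>\d+). The group name "run" is hypothetical.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ gives back its last digit so that the
# literal 5 can match.
backtracking = re.match(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the whole run of
# digits and (?P=run) consumes exactly that run, so nothing can be
# given back and the literal 5 has nothing left to match.
atomic_like = re.match(r"(?=(?P<run>\d+))(?P=run)5", subject)

# When the digits really are followed by "foo", the emulation matches
# just as \d+foo would.
followed_by_foo = re.match(r"(?=(?P<run>\d+))(?P=run)foo", "123456foo")

print(backtracking is not None, atomic_like is None, followed_by_foo is not None)
```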

Atomic groups in general can of course contain arbitrarily complicated
subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above,
a simpler notation, called a "possessive quantifier", can be used.
This consists of an additional + character following a quantifier.
Using this notation, the previous example can be rewritten as

  \d++foo

Note that a possessive quantifier can be used with an entire group,
for example:

  (abc|xyz){2,3}+

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for
the simpler forms of atomic group. However, there is no difference in
the meaning of a possessive quantifier and the equivalent atomic
group, though there may be a performance difference; possessive
quantifiers should be slightly faster.

The possessive quantifier syntax is an extension to the Perl 5.8
syntax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when
he built Sun's Java package, and PCRE copied it from there. It
ultimately found its way into Perl at release 5.10.

PCRE has an optimization that automatically "possessifies" certain
simple pattern constructs. For example, the sequence A+B is treated as
A++B because there is no point in backtracking into a sequence of A's
when B must follow.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

  (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When
it matches, it runs quickly. However, if it is applied to

  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the internal \D+ repeat and the external
* repeat in a large number of ways, and all have to be tried. (The
example uses [!?] rather than a single character at the end, because
both PCRE and Perl have an optimization that allows for fast failure
when a single character is used. They remember the last single
character that is required for a match, and fail early if it is not
present in the string.) If the pattern is changed so that it uses an
atomic group, like this:

  ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.

BACK REFERENCES

Outside a character class, a backslash followed by a digit greater
than 0 (and possibly further digits) is a back reference to a
capturing subpattern earlier (that is, to its left) in the pattern,
provided there have been that many previous capturing left
parentheses.

However, if the decimal number following the backslash is less than
10, it is always taken as a back reference, and causes an error only
if there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not
be to the left of the reference for numbers less than 10. A "forward
back reference" of this type can make sense when a repetition is
involved and the subpattern to the right has participated in an
earlier iteration.

It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more using this syntax because a
sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for
further details of the handling of digits following a backslash. There
is no such problem when named parentheses are used. A back reference
to any subpattern is possible using named parentheses (see below).

Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence. This escape
must be followed by an unsigned number or a negative number,
optionally enclosed in braces. These examples are all identical:

  (ring), \1
  (ring), \g1
  (ring), \g{1}

An unsigned number specifies an absolute reference without the
ambiguity that is present in the older syntax. It is also useful when
literal digits follow the reference. A negative number is a relative
reference. Consider this example:

  (abc(def)ghi)\g{-1}

The sequence \g{-1} is a reference to the most recently started
capturing subpattern before \g, that is, it is equivalent to \2 in
this example. Similarly, \g{-2} would be equivalent to \1. The use of
relative references can be helpful in long patterns, and also in
patterns that are created by joining together fragments that contain
references within themselves.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything
matching the subpattern itself (see "Subpatterns as subroutines" below
for a way of doing that). So the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

  ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
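
The two examples above can be exercised as follows. This is a sketch
in Python's re module, standing in for PCRE (an assumption). Current
Python versions do not accept (?i) in the middle of a pattern, so the
second pattern uses the scoped form (?i:rah), which limits caseless
matching to the group in the same way. The names title and rah are
illustrative.

```python
import re

title = re.compile(r"(sens|respons)e and \1ibility")
for text in ("sense and sensibility",
             "response and responsibility",
             "sense and responsibility"):
    # The back reference must repeat the text the group actually matched.
    print(text, "->", title.fullmatch(text) is not None)

# The group is matched caselessly, but the back reference is outside
# the (?i:...) scope, so it must repeat the captured text exactly.
rah = re.compile(r"((?i:rah))\s+\1")
for text in ("rah rah", "RAH RAH", "RAH rah"):
    print(text, "->", rah.fullmatch(text) is not None)
```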

There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both
numeric and named references, is also supported. We could rewrite the
above example in any of the following ways:

  (?<p1>(?i)rah)\s+\k<p1>
  (?'p1'(?i)rah)\s+\k{p1}
  (?P<p1>(?i)rah)\s+(?P=p1)
  (?<p1>(?i)rah)\s+\g{p1}

A subpattern that is referenced by name may appear in the pattern
before or after the reference.

There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern

  (a|(bc))\2

always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back
reference to an unset value matches an empty string.
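
The default behaviour for an unset group can be checked with a small
script. This is a sketch using Python's re module in place of PCRE
(an assumption); Python has no counterpart of the
PCRE_JAVASCRIPT_COMPAT option, so only the default case is shown.

```python
import re

unset_ref = re.compile(r"(a|(bc))\2")

# Group 2 never takes part when the first alternative matches "a", so
# the back reference \2 fails and no match is possible.
after_a = unset_ref.match("a")

# When the second alternative is used, group 2 is set and \2 must
# repeat "bc".
after_bc = unset_ref.match("bcbc")

print(after_a, after_bc.group())
```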

Because there may be many capturing parentheses in a pattern, all
digits following a backslash are taken as part of a potential back
reference number. If the pattern continues with a digit character,
some delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the
\g{ syntax or an empty comment (see "Comments" below) can be used.

Recursive back references

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

  (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.

Back references of this type cause the group that they reference to be
treated as an atomic group. Once the whole group has been matched, a
subsequent matching failure cannot cause backtracking into the middle
of the group.

ASSERTIONS

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.

More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current
matching position to be changed.

Assertion subpatterns are not capturing subpatterns. If such an
assertion contains capturing subpatterns within it, these are counted
for the purposes of numbering the capturing subpatterns in the whole
pattern. However, substring capturing is carried out only for positive
assertions, because it does not make sense for negative assertions.

For compatibility with Perl, assertion subpatterns may be repeated;
though it makes no sense to assert the same thing several times, the
side effect of capturing parentheses may occasionally be useful. In
practice, there are only three cases:

(1) If the quantifier is {0}, the assertion is never obeyed during
matching. However, it may contain internal capturing parenthesized
groups that are called from elsewhere via the subroutine mechanism.

(2) If the quantifier is {0,n} where n is greater than zero, it is
treated as if it were {0,1}. At run time, the rest of the pattern
match is tried with and without the assertion, the order depending on
the greediness of the quantifier.

(3) If the minimum repetition is greater than zero, the quantifier is
ignored. The assertion is obeyed just once when encountered during
matching.

Lookahead assertions

Lookahead assertions start with (?= for positive assertions and (?!
for negative assertions. For example,

  \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

  foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

  (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters
are "bar". A lookbehind assertion is needed to achieve the other
effect.
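
The three lookahead patterns above can be compared side by side. This
is a sketch using Python's re module as a stand-in for PCRE (an
assumption); lookahead assertions work the same way in both for these
cases. The sample subjects are invented for illustration.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word = re.search(r"\w+(?=;)", "int count;").group()

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead from where "bar" starts, so it still finds
# the "bar" in "foobar".
misplaced = re.search(r"(?!foo)bar", "foobar")

print(word, foo_positions, misplaced.start())
```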

If you want to force a matching failure at some point in a pattern,
the most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail. The backtracking control verb (*FAIL) or (*F)
is a synonym for (?!).

Lookbehind assertions

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

  (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length. Thus

  (?<=bullock|donkey)

is permitted, but

  (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl, which requires all branches
to match the same length of string. An assertion such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use
two top-level branches:

  (?<=abc|abde)

In some cases, the escape sequence \K (see above) can be used instead
of a lookbehind assertion to get round the fixed-length restriction.

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the
current position, the assertion fails.

In UTF-8 mode, PCRE does not allow the \C escape (which matches a
single byte, even in UTF-8 mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the
lookbehind. The \X and \R escapes, which can match different numbers
of bytes, are also not permitted.

"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.

Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at
the end of subject strings. Consider a simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the
subject and then see if what follows matches the rest of the pattern.
If the pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but
the last character, then all but the last two characters, and so on.
Once again the search for "a" covers the entire string, from right to
left, so we are no better off. However, if the pattern is written as

  ^.*+(?<=abcd)

there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

Using multiple assertions

Several assertions (of any sort) may occur in succession. For example,

  (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and
the last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

  (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second
assertion checks that the preceding three characters are not "999".
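
The two patterns differ only in how far back the first assertion
looks, which a few test subjects make clear. This is a sketch using
Python's re module instead of PCRE (an assumption); both lookbehinds
have a single fixed length, so Python accepts them unchanged. The
names three_back and six_back are hypothetical.

```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
for text in subjects:
    # Both assertions in each pattern are tested at the same point,
    # immediately before "foo".
    print(text,
          three_back.search(text) is not None,
          six_back.search(text) is not None)
```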

Assertions can be nested in any combination. For example,

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

  (?<=\d{3}(?!999)...)foo

is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".

CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a specific
capturing subpattern has already been matched. The two possible forms
of conditional subpattern are:

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs. Each of
the two alternatives may itself contain nested subpatterns of any
form, including conditional subpatterns; the restriction to two
alternatives applies only at the level of the condition. This pattern
fragment is an example where the alternatives are complex:

  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

There are four kinds of condition: references to subpatterns,
references to recursion, a pseudo-condition called DEFINE, and
assertions.

Checking for a used subpattern by number

If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has
previously matched. If there is more than one capturing subpattern
with the same number (see the earlier section about duplicate
subpattern numbers), the condition is true if any of them have
matched. An alternative notation is to precede the digits with a plus
or minus sign. In this case, the subpattern number is relative rather
than absolute. The most recently opened parentheses can be referenced
by (?(-1), the next most recent by (?(-2), and so on. Inside loops it
can also make sense to refer to subsequent groups. The next
parentheses to be opened can be referenced as (?(+1), and so on. (The
value zero in any of these forms is not used; it provokes a
compile-time error.)

Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and
to divide it into three parts for ease of discussion:

  ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether or not
the first set of parentheses matched. If they did, that is, if the
subject started with an opening parenthesis, the condition is true,
and so the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since no-pattern is not present, the subpattern
matches nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.
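
The behaviour of the three-part pattern, an optional opening
parenthesis, a run of non-parentheses, and a closing parenthesis that
is required only when the first group matched, can be confirmed with
a handful of subjects. This is a sketch in Python's re module rather
than PCRE (an assumption); re.VERBOSE plays the role of PCRE_EXTENDED
here, and Python supports the numeric condition (?(1)...) in the same
form. The name optional_parens is hypothetical.

```python
import re

optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

def balanced(text):
    """True if text is a run of non-parentheses, optionally enclosed."""
    return optional_parens.fullmatch(text) is not None

for text in ("abc", "(abc)", "(abc", "abc)"):
    # The closing parenthesis is demanded only if group 1 matched.
    print(repr(text), balanced(text))
```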

If you were embedding this pattern in a larger one, you could use a
relative reference:

  ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

This makes the fragment independent of the parentheses in the larger
pattern.

Checking for a used subpattern by name

Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this
syntax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that
number, which must be greater than zero. Using subpattern names that
consist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

  (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one
of them has matched.
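
A named condition can be tried in the same way as a numbered one.
This is a sketch in Python's re module, used in place of PCRE (an
assumption). Python does not accept the (?<OPEN>...) or
(?(<OPEN>)...) spellings, so the sketch uses the (?P<OPEN>...) group
syntax and the older (?(OPEN)...) condition syntax described above.
The name named_parens is hypothetical.

```python
import re

named_parens = re.compile(
    r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)

with_parens = named_parens.fullmatch("(abc)")
without_parens = named_parens.fullmatch("abc")
unclosed = named_parens.fullmatch("(abc")

# The OPEN group is set only when the subject began with "(".
print(with_parens.group("OPEN"), without_parens.group("OPEN"), unclosed)
```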

Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with
the name R, the condition is true if a recursive call to the whole
pattern or any subpattern has been made. If digits or a name preceded
by ampersand follow the letter R, for example:

  (?(R3)...) or (?(R&name)...)

the condition is true if the most recent recursion is into a
subpattern whose number or name is given. This condition does not
check the entire recursion stack. If the name used in a condition of
this kind is a duplicate, the test is applied to all subpatterns of
the same name, and is true if any one of them is the most recent
recursion.

At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define subroutines that can be
referenced from elsewhere. (The use of subroutines is described
below.) For example, a pattern to match an IPv4 address such as
"192.168.23.245" could be written like this (ignore whitespace and
line breaks):

  (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address,
insisting on a word boundary at each end.
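
The idea of defining the "byte" component once and reusing it can be
approximated in engines that lack DEFINE. The sketch below is in
Python, whose re module supports neither (?(DEFINE)...) nor (?&name),
so it swaps in a different technique: the component is kept in a
string and spliced into the full pattern. It is not how PCRE
processes the pattern above, only an illustration of what the
expanded pattern matches. The names BYTE, IPV4 and is_ipv4 are
hypothetical.

```python
import re

# The same alternatives as the "byte" group in the pattern above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four byte components separated by dots, with a word boundary at
# each end, built by textual substitution instead of subroutine calls.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """True if the whole of text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

for text in ("192.168.23.245", "0.0.0.0", "192.168.23.256", "1.2.3"):
    print(text, is_ipv4(text))
```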

Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
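
The effect of an assertion condition can be sketched in Python, which
does not support (?(?=...)yes|no). The emulation below, an assumption of
this example rather than PCRE syntax, guards each alternative with the
lookahead or its negation, so exactly one alternative is ever eligible.
HAS_LETTER and DATE are illustrative names.

```python
import re

# Emulation only: Python's re has no (?(?=...)yes|no) construct, so the
# condition is written once as a lookahead and once as its negation.
HAS_LETTER = r"[^a-z]*[a-z]"
DATE = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"
)
```

Note that the condition looks at the rest of the subject, not just at
the text the alternative will consume: for the subject "12-34-56 x" a
letter is found, only the first alternative is tried, and there is no
match.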

COMMENTS

There are two ways of including comments in patterns that are processed
by PCRE. In both cases, the start of the comment must not be in a
character class, nor in the middle of any other sequence of related
characters such as (?: or a subpattern name or number. The characters
that make up a comment play no part in the pattern matching.

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. If the
PCRE_EXTENDED option is set, an unescaped # character also introduces a
comment, which in this case continues to immediately after the next
newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by the options
passed to pcre_compile() or by a special sequence at the start of the
pattern, as described in the section entitled "Newline conventions"
above. Note that the end of this type of comment is a literal newline
sequence in the pattern; escape sequences that happen to represent a
newline do not count. For example, consider this pattern when
PCRE_EXTENDED is set, and the default newline convention is in force:

  abc #comment \n still comment

On encountering the # character, pcre_compile() skips along, looking
for a newline in the pattern. The sequence \n is still literal at this
stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so.
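
The distinction between an escape sequence and a real newline can be
observed with Python's re.VERBOSE mode, which applies the same rule to
# comments. This is an analogy using a different regex engine, not a
demonstration of PCRE itself; the variable names are illustrative.

```python
import re

# Raw string: "\n" below is the two characters backslash + n, so the
# comment runs to the end of the pattern and only "abc" is significant.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real 0x0a character, which ends the comment;
# "still comment" then becomes part of the pattern (spaces are ignored).
literal = re.compile("abc #comment \n still comment", re.VERBOSE)
```

The first pattern matches exactly "abc"; the second requires
"abcstillcomment".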

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.

For some time, Perl has provided a facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern using code interpolation
to solve the parentheses problem can be created like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.

A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive subroutine call of the
subpattern of the given number, provided that it occurs inside that
subpattern. (If not, it is a non-recursive subroutine call, which is
described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

  \( ( [^()]++ | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis. Note
the use of a possessive quantifier to avoid backtracking into sequences
of non-parentheses.
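
The three steps just described map directly onto a small recursive
function. The sketch below is a hand-written Python equivalent for
illustration only; it is not how PCRE implements recursion, and the
name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced (...) at pos, else None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                          # opening parenthesis required
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # Plays the role of (?R): match a nested parenthesized substring.
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:
            # Plays the role of [^()]++: consume a run of non-parentheses
            # and never give any of it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    if pos < len(subject):                   # the loop stopped on ")"
        return pos + 1
    return None                              # subject ended: unbalanced
```

For example, match_parens("(ab(cd)ef)") consumes the whole subject,
while match_parens("(a(b)") reports failure. Very deep nesting would
hit Python's own recursion limit.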

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

  ( \( ( [^()]++ | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.

In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other
words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.

It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are
referenced. They are always non-recursive subroutine calls, as
described in the next section.

An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:

  (?<pn> \( ( [^()]++ | (?&pn) )* \) )

If there is more than one subpattern with the same name, the earliest
one is used.

This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the
pattern to strings that do not match. For example, when this pattern is
applied to

  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if a possessive quantifier is
not used, the match runs for a very long time indeed because there are
so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.

At the end of a match, the values of capturing parentheses are those
from the outermost level. If you want to obtain intermediate values, a
callout function can be used (see below and the pcrecallout
documentation). If the pattern above is matched against

  (ab(cd)ef)

the value for the inner capturing parentheses (numbered 2) is "ef",
which is the last value taken on at the top level. If a capturing
subpattern is not matched at the top level, its final captured value is
unset, even if it was (temporarily) set at a deeper level during the
matching process.

If there are more than 15 capturing parentheses in a pattern, PCRE has
to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brackets,
allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.

Differences in recursion processing between PCRE and Perl

Recursion processing in PCRE differs from Perl in two important ways.
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. This can be
illustrated by the following pattern, which purports to match a
palindromic string that contains an odd number of characters (for
example, "a", "aba", "abcba", "abcdcba"):

  ^(.|(.)(?1)\2)$

The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works;
in PCRE it does not if the subject is longer than three characters.
Consider the subject string "abcba":

At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second
alternative is taken and the recursion kicks in. The recursive call to
subpattern 1 successfully matches the next character ("b"). (Note that
the beginning and end of line tests are not part of the recursion.)

Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion
is treated as an atomic group, there are now no backtracking points,
and so the entire match fails. (Perl is able, at this point, to
re-enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:

  ^((.)(?1)\2|.)$

This time, the recursing alternative is tried first, and continues to
recurse until it runs out of characters, at which point the recursion
fails. But this time we do have another alternative to try at the
higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
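
The difference between atomic and re-enterable recursion can be
simulated with Python generators. In the sketch below, which is a model
written for this explanation and not PCRE's implementation, group1
yields every offset at which group 1 could end. Taking only the first
result of a recursive call models the atomic behaviour described here;
iterating over all results models Perl. The function names and the
atomic and recurse_first parameters are hypothetical.

```python
from itertools import islice

def group1(s, i, atomic, recurse_first):
    """Yield each offset at which group 1 can end when started at s[i]."""
    def single():                      # the "." alternative
        if i < len(s):
            yield i + 1

    def recursing():                   # the "(.)(?1)\2" alternative
        if i < len(s):
            outer = s[i]               # what (.) captured at this level
            inner = group1(s, i + 1, atomic, recurse_first)
            if atomic:
                inner = islice(inner, 1)   # atomic: first result only
            for j in inner:
                if j < len(s) and s[j] == outer:   # the \2 back reference
                    yield j + 1

    order = (recursing, single) if recurse_first else (single, recursing)
    for alternative in order:
        yield from alternative()

def matches(s, atomic, recurse_first):
    """Apply ^( ... )$ : group 1 must end exactly at the end of s."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))
```

With atomic=True and recurse_first=False the model accepts "aba" but
rejects "abcba"; switching either to atomic=False or to
recurse_first=True makes "abcba" match. A subject such as "ababa" is
still rejected with atomic=True and recurse_first=True, because the
inner call settles on the shorter palindrome "aba" and cannot be
re-entered.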

To change the pattern so that it matches all palindromic strings, not
just those with an odd number of characters, it is tempting to change
the pattern to this:

  ^((.)(?1)\2|.?)$

Again, this works in Perl, but not in PCRE, and for the same reason.
When a deeper recursion has matched a single character, it cannot be
entered again in order to match an empty string. The solution is to
separate the two cases, and write out the odd and even cases as
alternatives at the higher level:

  ^(?:((.)(?1)\2|)|((.)(?3)\4|.))

If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:

  ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$

If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
Perl. Note the use of the possessive quantifier *+ to avoid
backtracking into sequences of non-word characters. Without this, PCRE
takes a great deal longer (ten times or more) to match typical phrases,
and Perl takes so long that you think it has gone into a loop.

WARNING: The palindrome-matching patterns above work only if the
subject string does not start with a palindrome that is shorter than
the entire string. For example, although "abcba" is correctly matched,
if the subject is "ababa", PCRE finds the palindrome "aba" at the
start, then fails at top level because the end of the string does not
follow. Once again, it cannot jump back into the recursion to try other
alternatives, so the entire match fails.

The second way in which PCRE and Perl differ in their recursion
processing is in the handling of captured values. In Perl, when a
subpattern is called recursively or as a subroutine (see the next
section), it has no access to any values that were captured outside the
recursion, whereas in PCRE these values can be referenced. Consider
this pattern:

  ^(.)(\1|a(?2))

In PCRE, this pattern matches "bab". The first capturing parentheses
match "b", then in the second group, when the back reference \1 fails
to match "a", the second alternative matches "a" and then recurses. In
the recursion, \1 does now match "b" and so the whole match succeeds.
In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.

SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates
like a subroutine in a programming language. The called subpattern may
be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:

  (...(absolute)...)...(?2)...
  (...(relative)...)...(?-1)...
  (...(?+1)...(relative)...

An earlier example pointed out that the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.
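
The contrast between a back reference and a subroutine call can be
tried in Python. The back reference works as written; the subroutine
call has no Python equivalent, so this sketch emulates (?1) by
interpolating the group's source a second time. STEM, backref and
subroutine are illustrative names.

```python
import re

STEM = r"(?:sens|respons)"

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: Python's re has no (?1), so the group's
# source is interpolated again and is free to match a different stem.
subroutine = re.compile(rf"{STEM}e and {STEM}ibility")
```

Only the second pattern accepts "sense and responsibility", because a
subroutine call re-runs the subpattern rather than repeating the text
it captured.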

All subroutine calls, whether recursive or not, are always treated as
atomic groups. That is, once a subroutine has matched some of the
subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure. Any capturing
parentheses that are set during the subroutine call revert to their
previous values afterwards.

Processing options such as case-independence are fixed when a
subpattern is defined, so if it is used as a subroutine, such options
cannot be changed for different calls. For example, consider this
pattern:

  (abc)(?i:(?-1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.

ONIGURUMA SUBROUTINE SYNTAX

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a subroutine,
possibly recursively. Here are two of the examples used above,
rewritten using this syntax:

  (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
  (sens|respons)e and \g'1'ibility

PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:

  (abc)(?i:\g<-1>)

Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
synonymous. The former is a back reference; the latter is a subroutine
call.

CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail
altogether. A complete description of the interface to the callout
function is given in the pcrecallout documentation.

BACKTRACKING CONTROL

Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and
subject to change or removal in a future version of Perl". It goes on
to say: "Their usage in production code should be noted to avoid
problems during upgrades." The same remarks apply to the PCRE features
described in this section.

Since these verbs are specifically related to backtracking, most of
them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().

If any of these verbs are used in an assertion or in a subpattern that
is called as a subroutine (whether or not recursively), their effect is
confined to that subpattern; it does not extend to the surrounding
pattern, with one exception: the name from a (*MARK), (*PRUNE), or
(*THEN) that is encountered in a successful positive assertion is
passed back when a match succeeds (compare capturing parentheses in
assertions). Note that such subpatterns are processed as anchored at
the point where they are tested. Note also that Perl's treatment of
subroutines is different in some cases.

The new verbs make use of what was previously invalid syntax: an
opening parenthesis followed by an asterisk. They are generally of the
form (*VERB) or (*VERB:NAME). Some may take either form, with differing
behaviour, depending on whether or not an argument is present. A name
is any sequence of characters that does not include a closing
parenthesis. If the name is empty, that is, if the closing parenthesis
immediately follows the colon, the effect is as if the colon were not
there. Any number of these verbs may occur in a pattern.

PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
may know the minimum length of matching subject, or that a particular
character must be present. When one of these optimizations suppresses
the running of a match, any included backtracking verbs will not, of
course, be processed. You can suppress the start-of-match optimizations
by setting the PCRE_NO_START_OPTIMIZE option when calling
pcre_compile() or pcre_exec(), or by starting the pattern with
(*NO_START_OPT).

Experiments with Perl suggest that it too has similar optimizations,
sometimes leading to anomalous results.

Verbs that act immediately

The following verbs act as soon as they are encountered. They may not
be followed by a name.

(*ACCEPT)

This verb causes the match to end successfully, skipping the remainder
of the pattern. However, when it is inside a subpattern that is called
as a subroutine, only that subpattern is ended successfully. Matching
then continues at the outer level. If (*ACCEPT) is inside capturing
parentheses, the data so far is captured. For example:

  A((?:A|B(*ACCEPT)|C)D)

This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
captured by the outer parentheses.

(*FAIL) or (*F)

This verb causes a matching failure, forcing backtracking to occur. It
is equivalent to (?!) but easier to read. The Perl documentation notes
that it is probably useful only when combined with (?{}) or (??{}).
Those are, of course, Perl features that are not present in PCRE. The
nearest equivalent is the callout feature, as for example in this
pattern:

  a+(?C)(*FAIL)

A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
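
The figure of 10 can be reproduced by counting, for every starting
position, how many different lengths a+ can take. The Python sketch
below is a simulation written for this explanation, assuming that no
optimization removes any match attempt; count_callouts is a
hypothetical name, not a PCRE function.

```python
def count_callouts(subject):
    """Count how often (?C) would be reached for a+(?C)(*FAIL)."""
    calls = 0
    for start in range(len(subject)):        # unanchored: try every start
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" at a time;
        # each of those `run` lengths reaches (?C) before (*FAIL) rejects it.
        calls += run
    return calls
```

For "aaaa" the four starting positions contribute 4 + 3 + 2 + 1 = 10
callouts.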

Recording which path was taken

There is one verb whose main purpose is to track how a match was
arrived at, though it also has a secondary use in conjunction with
advancing the match starting point (see (*SKIP) below).

(*MARK:NAME) or (*:NAME)

A name is always required with this verb. There may be as many
instances of (*MARK) as you like in a pattern, and their names do not
have to be unique.

When a match succeeds, the name of the last-encountered (*MARK) on the
matching path is passed back to the caller via the pcre_extra data
structure, as described in the section on pcre_extra in the pcreapi
documentation. Here is an example of pcretest output, where the /K
modifier requests the retrieval and outputting of (*MARK) data:

    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
  data> XY
   0: XY
  MK: A
  XZ
   0: XZ
  MK: B

The (*MARK) name is tagged with "MK:" in this output, and in this
example it indicates which of the two alternatives matched. This is a
more efficient way of obtaining this information than putting each
alternative in its own capturing parentheses.
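
For comparison, the capturing-parentheses approach that (*MARK)
improves on looks like this in Python, which has no backtracking verbs.
This is a sketch of the alternative technique, not of (*MARK) itself;
PATTERN and which_alternative are illustrative names.

```python
import re

# No (*MARK) in Python's re: each alternative gets its own named group,
# and Match.lastgroup reports the name of the group that matched.
PATTERN = re.compile(r"(?P<A>XY)|(?P<B>XZ)")

def which_alternative(subject):
    """Return 'A' or 'B' for the alternative that matched, or None."""
    m = PATTERN.search(subject)
    return m.lastgroup if m else None
```

Unlike a mark, this costs one capturing group per alternative and
reports nothing when the match fails.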

If (*MARK) is encountered in a positive assertion, its name is recorded
and passed back if it is the last-encountered. This does not happen for
negative assertions.

After a partial match or a failed match, the name of the last
encountered (*MARK) in the entire match process is returned. For
example:

    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
  data> XP
  No match, mark = B

Note that in this unanchored example the mark is retained from the
match attempt that started at the letter "X". Subsequent match attempts
starting at "P" and then with an empty string do not get as far as the
(*MARK) item, but nevertheless do not reset it.

Verbs that act after backtracking

The following verbs do nothing when they are encountered. Matching
continues with what follows, but if there is no subsequent match,
causing a backtrack to the verb, a failure is forced. That is,
backtracking cannot pass to the left of the verb. However, when one of
these verbs appears inside an atomic group, its effect is confined to
that group, because once the group has been matched, there is never any
backtracking into it. In this situation, backtracking can "jump back"
to the left of the entire atomic group. (Remember also, as stated
above, that this localization also applies in subroutine calls and
assertions.)

These verbs differ in exactly what kind of failure occurs when
backtracking reaches them.

(*COMMIT)

This verb, which may not be followed by a name, causes the whole match
to fail outright if the rest of the pattern does not match. Even if the
pattern is unanchored, no further attempts to find a match by advancing
the starting point take place. Once (*COMMIT) has been passed,
pcre_exec() is committed to finding a match at the current starting
point, or not at all. For example:

  a+(*COMMIT)b

This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish." The name of the
most recently passed (*MARK) in the path is passed back when (*COMMIT)
forces a match failure.

Note that (*COMMIT) at the start of a pattern is not the same as an
anchor, unless PCRE's start-of-match optimizations are turned off, as
shown in this pcretest example:

    re> /(*COMMIT)abc/
  data> xyzabc
   0: abc
  xyzabc\Y
  No match

PCRE knows that any match must start with "a", so the optimization
skips along the subject to "a" before running the first match attempt,
which succeeds. When the optimization is disabled by the \Y escape in
the second subject, the match starts at "x" and so the (*COMMIT) causes
it to fail without trying any other starting points.

(*PRUNE) or (*PRUNE:NAME)

This verb causes the match to fail at the current starting position in
the subject if the rest of the pattern does not match. If the pattern
is unanchored, the normal "bumpalong" advance to the next starting
character then happens. Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of
(*PRUNE), but if there is no match to the right, backtracking cannot
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an
alternative to an atomic group or possessive quantifier, but there are
some uses of (*PRUNE) that cannot be expressed in any other way. The
behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
anchored pattern (*PRUNE) has the same effect as (*COMMIT).

(*SKIP)

This verb, when given without a name, is like (*PRUNE), except that if
the pattern is unanchored, the "bumpalong" advance is not to the next
character, but to the position in the subject where (*SKIP) was
encountered. (*SKIP) signifies that whatever text was matched leading
up to it cannot be part of a successful match. Consider:

  a+(*SKIP)b

If the subject is "aaaac...", after the first match attempt fails
(starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive
quantifier does not have the same effect as this example; although it
would suppress backtracking during the first match attempt, the second
attempt would start at the second character instead of skipping on to
"c".

(*SKIP:NAME)

When (*SKIP) has an associated name, its behaviour is modified. If the
following pattern fails to match, the previous path through the pattern
is searched for the most recent (*MARK) that has the same name. If one
is found, the "bumpalong" advance is to the subject position that
corresponds to that (*MARK) instead of to where (*SKIP) was
encountered. If no (*MARK) with a matching name is found, the (*SKIP)
is ignored.
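
The way the unnamed forms of (*COMMIT), (*PRUNE) and (*SKIP) choose the
next starting point can be sketched for the single pattern shape
a+(*VERB)b used in the examples above. The Python function below is a
hand-written simulation under that assumption, not PCRE code; the name
search and its return convention are hypothetical.

```python
def search(subject, verb):
    """Simulate a+(*VERB)b for VERB in COMMIT, PRUNE or SKIP.

    Returns (span, starts): the matched (start, end) or None, and the
    list of starting offsets at which a match attempt was made.
    """
    starts = []
    start = 0
    while start < len(subject):
        starts.append(start)
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1                          # greedy a+
        if pos == start:
            start += 1                        # a+ failed before the verb
            continue
        if pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1), starts
        # "b" failed after the verb was passed: no backtracking into a+.
        if verb == "COMMIT":
            return None, starts               # no further starting points
        if verb == "SKIP":
            start = pos                       # resume where (*SKIP) was met
        else:
            start += 1                        # PRUNE: ordinary bumpalong
    return None, starts
```

For the subject "aaaac", the SKIP variant tries starting offsets 0 and
4 only, whereas the PRUNE variant tries 0, 1, 2, 3 and 4; the COMMIT
variant gives up on "aacaab" after the first attempt.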

(*THEN) or (*THEN:NAME)

This verb causes a skip to the next innermost alternative if the rest
of the pattern does not match. That is, it cancels pending
backtracking, but only within the current alternative. Its name comes
from the observation that it can be used for a pattern-based
if-then-else block:

  ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...

If the COND1 pattern matches, FOO is tried (and possibly further items
after the end of the group if FOO succeeds); on failure, the matcher
skips to the second alternative and tries COND2, without backtracking
into COND1. The behaviour of (*THEN:NAME) is exactly the same as
(*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts
like (*PRUNE).

Note that a subpattern that does not contain a | character is just a
part of the enclosing alternative; it is not a nested alternation with
only one alternative. The effect of (*THEN) extends beyond such a
subpattern to the enclosing alternative. Consider this pattern, where
A, B, etc. are complex pattern fragments that do not contain any |
characters at this level:

  A (B(*THEN)C) | D

If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the subpattern containing (*THEN) is given an alternative,
it behaves differently:

  A (B(*THEN)C | (*FAIL)) | D

The effect of (*THEN) is now confined to the inner subpattern. After a
failure in C, matching moves to (*FAIL), which causes the whole
subpattern to fail because there are no more alternatives to try. In
this case, matching does now backtrack into A.

Note also that a conditional subpattern is not considered as having two
alternatives, because only one is ever used. In other words, the |
character in a conditional subpattern has a different meaning. Ignoring
white space, consider:

  ^.*? (?(?=a) a | b(*THEN)c )

If the subject is "ba", this pattern does not match. Because .*? is
ungreedy, it initially matches zero characters. The condition (?=a)
then fails, the character "b" is matched, but "c" is not. At this
point, matching does not backtrack to .*? as might perhaps be expected
from the presence of the | character. The conditional subpattern is
part of the single alternative that comprises the whole pattern, and so
the match fails. (If there was a backtrack into .*?, allowing it to
match "b", the match would succeed.)

The verbs just described provide four different "strengths" of control
when subsequent matching fails. (*THEN) is the weakest, carrying on the
match at the next alternative. (*PRUNE) comes next, failing the match
at the current starting position, but allowing an advance to the next
character (for an unanchored pattern). (*SKIP) is similar, except that
the advance may be more than one character. (*COMMIT) is the strongest,
causing the entire match to fail.

If more than one such verb is present in a pattern, the "strongest" one
wins. For example, consider this pattern, where A, B, etc. are complex
pattern fragments:

  (A(*COMMIT)B(*THEN)C|D)

Once A has matched, PCRE is committed to this match, at the current
starting position. If subsequently B matches, but C does not, the
normal (*THEN) action of trying the next alternative (that is, D) does
not happen because (*COMMIT) overrides.

SEE ALSO

pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3).

AUTHOR

Philip Hazel

University Computing Service

Cambridge CB2 3QH, England.

REVISION

Last updated: 29 November 2011

------------------------------------------------------------------------------

PCRESYNTAX(3) PCRESYNTAX(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION SYNTAX SUMMARY

The full syntax and semantics of the regular expressions that are
supported by PCRE are described in the pcrepattern documentation. This
document contains just a quick-reference summary of the syntax.

QUOTING

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal

CHARACTERS

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..

CHARACTER TYPES

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \N         a character that is not a newline
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence

In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in UTF-8 mode. However, this can be changed by setting
the PCRE_UCP option.

GENERAL CATEGORY PROPERTIES FOR \p and \P

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator

PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P

  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
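
The definitions above can be sketched in C for code points below 128
only, where the single character with the Z property is the space
(U+0020) and the L and N properties reduce to the ASCII letters and
digits. The function names are hypothetical, and this is not how PCRE
implements the properties; it only shows that Xps and Xsp differ by the
VT character.

```c
/* Hypothetical helpers, valid only for code points below 128. */

/* Xsp, Perl space: property Z or tab, NL, FF, CR. */
int is_xsp_ascii(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Xps, POSIX space: the Xsp set plus VT. */
int is_xps_ascii(int c)
{
    return is_xsp_ascii(c) || c == '\v';
}

/* Xwd, Perl word: property Xan (letter or number) or underscore. */
int is_xwd_ascii(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}
```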

SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari,
Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai, Yi.

CHARACTER CLASSES

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE_UCP is set.
You can use \Q...\E inside a character class.

QUANTIFIERS

  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy

ANCHORS AND SIMPLE ASSERTIONS

  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject

MATCH POINT RESET

  \K          reset start of match

ALTERNATION

  expr|expr|expr...

CAPTURING

  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative

ATOMIC GROUPS

  (?>...)         atomic, non-capturing group

COMMENT

  (?#....)        comment (not nestable)

OPTION SETTING

  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)

The following are recognized only at the start of a pattern or after
one of the newline-setting options with similar syntax:

  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)

LOOKAHEAD AND LOOKBEHIND ASSERTIONS

  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind

Each top-level branch of a look behind must be of a fixed length.

BACKREFERENCES

  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k<name>        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)

SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \g<name>        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g<n>           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g<+n>          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g<-n>          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)

CONDITIONAL PATTERNS

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition

BACKTRACKING CONTROL

The following act immediately they are reached:

  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)

The following act only when a subsequent match failure causes a
backtrack to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.

  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance start to current matching position
  (*THEN)         local failure, backtrack to next alternation
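
The effect of (*COMMIT), (*PRUNE), and (*SKIP) on the start-of-match
point can be modelled by a small C function. This is a simplified sketch,
not PCRE's implementation: the names are hypothetical, offsets count
single-byte characters, the end of the subject is not checked, and the
rule that (*SKIP) must move forward to have any effect is an assumption
made here so that the model always makes progress. (*THEN) is left out
because it acts inside the pattern rather than on the starting point.

```c
/* Hypothetical model of the verbs that affect the starting point. */
enum bt_verb { VERB_NONE, VERB_PRUNE, VERB_SKIP, VERB_COMMIT };

/* Returns the next start-of-match offset to try after the attempt at
   `start` has failed, or -1 when no further attempt is made.
   `skip_pos` is the subject offset at which (*SKIP) was passed; it is
   ignored for the other verbs. */
int next_start_offset(enum bt_verb verb, int anchored, int start,
                      int skip_pos)
{
    /* An anchored pattern never advances; (*COMMIT) fails outright. */
    if (anchored || verb == VERB_COMMIT)
        return -1;

    /* (*SKIP) jumps to the position where it was passed. */
    if (verb == VERB_SKIP && skip_pos > start)
        return skip_pos;

    /* (*PRUNE) and an ordinary failure move on by one character. */
    return start + 1;
}
```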

NEWLINE CONVENTIONS

These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) or (*UCP) option.

  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence

WHAT \R MATCHES

These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 or UCP mode.

  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence

CALLOUTS

  (?C)      callout
  (?Cn)     callout with data n

SEE ALSO

pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 21 November 2010

------------------------------------------------------------------------------

PCREUNICODE(3) PCREUNICODE(3)

NAME

PCRE - Perl-compatible regular expressions

UTF-8 AND UNICODE PROPERTY SUPPORT

In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters. PCRE does not
support any other formats (in particular, it does not support UTF-16).

If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag occasionally, so should not be
very big.

If PCRE is built with Unicode character property support (which implies
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are
supported. The available properties that can be tested are limited to the
general category properties such as Lu for an upper case letter or Nd
for a decimal number, the Unicode script names such as Arabic or Han,
and the derived properties Any and L&. A full list is given in the
pcrepattern documentation. Only the short names for properties are
supported. For example, \p{L} matches a letter. Its Perl synonym,
\p{Letter}, is not supported. Furthermore, in Perl, many properties may
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
does not support this.

Validity of UTF-8 strings

When you set the PCRE_UTF8 flag, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
functions. From release 7.3 of PCRE, the check is according to the rules
of RFC 3629, which are themselves derived from the Unicode
specification. Earlier releases of PCRE followed the rules of RFC 2279, which
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current
check allows only values in the range U+0 to U+10FFFF, excluding U+D800
to U+DFFF.
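
A check along the lines of RFC 3629 can be sketched in C as follows. This
is an illustration written for this text, not PCRE's own validity code,
and the function name utf8_first_invalid is hypothetical. It rejects
malformed and truncated sequences, overlong encodings, values above
U+10FFFF, and the range U+D800 to U+DFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the PCRE API). Returns the offset of
   the first byte of the first invalid character, or -1 if the whole
   buffer is valid UTF-8 under RFC 3629. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                       /* one-byte character */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {  /* two-byte lead */
            extra = 1; cp = c & 0x1Fu; min = 0x80;
        } else if (c >= 0xE0 && c <= 0xEF) {  /* three-byte lead */
            extra = 2; cp = c & 0x0Fu; min = 0x800;
        } else if (c >= 0xF0 && c <= 0xF4) {  /* four-byte lead */
            extra = 3; cp = c & 0x07u; min = 0x10000;
        } else {
            /* Stray continuation byte, 0xC0/0xC1, or a lead above 0xF4. */
            return (long)i;
        }

        if (len - i <= extra)                 /* truncated sequence */
            return (long)i;

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)    /* not a continuation byte */
                return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3Fu);
        }

        if (cp < min)                         /* overlong encoding */
            return (long)i;
        if (cp > 0x10FFFF)                    /* beyond the Unicode range */
            return (long)i;
        if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogate code point */
            return (long)i;

        i += extra + 1;
    }
    return -1;
}
```

Like the compile-time error described in this section, the sketch reports
the offset of the first byte of the failing character.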

The excluded code points are the "Low Surrogate Area" of Unicode, of
which the Unicode Standard says this: "The Low Surrogate Area does not
contain any character assignments, consequently no character code
charts or namelists are provided for this area. Surrogates are reserved
for use with UTF-16 and then must be used in pairs." The code points
that are encoded by UTF-16 pairs are available as independent code
points in the UTF-8 encoding. (In other words, the whole surrogate
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)

If an invalid UTF-8 string is passed to PCRE, an error return is given.
At compile time, the only additional information is the offset to the
first byte of the failing character. The runtime functions pcre_exec()
and pcre_dfa_exec() also pass back this information, as well as a more
detailed reason code if the caller has provided memory in which to do
this.

In some situations, you may already know that your strings are valid,
and therefore want to skip these checks in order to improve
performance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run
time, PCRE assumes that the pattern or subject it is given
(respectively) contains only valid UTF-8 codes. In this case, it does not
diagnose an invalid UTF-8 string.

If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
what happens depends on why the string is invalid. If the string
conforms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
string of characters in the range 0 to 0x7FFFFFFF by pcre_dfa_exec()
and the interpreted version of pcre_exec(). In other words, apart from
the initial validity test, these functions (when in UTF-8 mode) handle
strings according to the more liberal rules of RFC 2279. However, the
just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.
If you are using JIT optimization, or if the string does not even
conform to RFC 2279, the result is undefined. Your program may crash.

If you want to process strings of values in the full range 0 to
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check, and
avoid the use of JIT optimization.

General comments about UTF-8 mode

1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
two-byte UTF-8 character if the value is greater than 127.

2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
characters for values greater than \177.

3. Repeat quantifiers apply to complete UTF-8 characters, not to
individual bytes, for example: \x{100}{3}.

4. The dot metacharacter matches one UTF-8 character instead of a
single byte.

5. The escape sequence \C can be used to match a single byte in UTF-8
mode, but its use can lead to some strange effects because it breaks up
multibyte characters (see the description of \C in the pcrepattern
documentation). The use of \C is not supported in the alternative matching
function pcre_dfa_exec(), nor is it supported in UTF-8 mode by the JIT
optimization of pcre_exec(). If JIT optimization is requested for a
UTF-8 pattern that contains \C, it will not succeed, and so the
matching will be carried out by the normal interpretive function.

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that
PCRE recognizes as digits, spaces, or word characters remain the same
set as before, all with values less than 256. This remains true even
when PCRE is built to include Unicode property support, because to do
otherwise would slow down PCRE in many common cases. Note in particular
that this applies to \b and \B, because they are defined in terms of \w
and \W. If you really want to test for a wider sense of, say, "digit",
you can use explicit Unicode property tests such as \p{Nd}.
Alternatively, if you set the PCRE_UCP option, the way that the character
escapes work is changed so that Unicode properties are used to
determine which characters match. There are more details in the section on
generic character types in the pcrepattern documentation.

7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set.

8. However, the horizontal and vertical whitespace matching escapes
(\h, \H, \v, and \V) do match all the appropriate Unicode characters,
whether or not PCRE_UCP is set.

9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Furthermore, PCRE supports
case-insensitive matching only when there is a one-to-one mapping
between a letter's cases. There are a small number of many-to-one
mappings in Unicode; these are not supported by PCRE.
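
Points 1 and 2 above follow from the way UTF-8 encodes values: anything
from 128 up to 0x7FF (which covers \xb3 and every octal escape up to
\777) occupies two bytes. The C sketch below shows the encoding step.
The function name utf8_encode is hypothetical and this is not PCRE code;
it follows the RFC 3629 limits, so it refuses surrogates and values above
U+10FFFF.

```c
/* Hypothetical helper (not part of the PCRE API). Encodes the code
   point `cp` as UTF-8 into `out` and returns the number of bytes
   written (1 to 4), or 0 if `cp` is not encodable under RFC 3629. */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are excluded */
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoding, the value 0xB3 becomes the two bytes C2 B3, and the
largest octal escape, \777 (0x1FF), becomes C7 BF.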

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 19 October 2011

------------------------------------------------------------------------------

PCREJIT(3) PCREJIT(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE JUST-IN-TIME COMPILER SUPPORT

Just-in-time compiling is a heavyweight optimization that can greatly
speed up pattern matching. However, it comes at the cost of extra
processing before the match is performed. Therefore, it is of most benefit
when the same pattern is going to be matched many times. This does not
necessarily mean many calls of pcre_exec(); if the pattern is not
anchored, matching attempts may take place many times at various
positions in the subject, even for a single call to pcre_exec(). If the
subject string is very long, it may still pay to use JIT for one-off
matches.

JIT support applies only to the traditional matching function,
pcre_exec(). It does not apply when pcre_dfa_exec() is being used. The
code for this support was written by Zoltan Herczeg.

AVAILABILITY OF JIT SUPPORT

JIT support is an optional feature of PCRE. The "configure" option
--enable-jit (or equivalent CMake option) must be set when PCRE is
built if you want to use JIT. The support is limited to the following
hardware platforms:

  ARM v5, v7, and Thumb2
  Intel x86 32-bit and 64-bit
  MIPS 32-bit
  Power PC 32-bit and 64-bit (experimental)

The Power PC support is designated as experimental because it has not
been fully tested. If --enable-jit is set on an unsupported platform,
compilation fails.

A program that is linked with PCRE 8.20 or later can tell if JIT
support is available by calling pcre_config() with the PCRE_CONFIG_JIT
option. The result is 1 when JIT is available, and 0 otherwise.
However, a simple program does not need to check this in order to use JIT.
The API is implemented in a way that falls back to the ordinary PCRE
code if JIT is not available.

If your program may sometimes be linked with versions of PCRE that are
older than 8.20, but you want to use JIT when it is available, you can
test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
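
One way to express the version test is a small macro that compares a
major and minor number against 8.20. The macro name JIT_CAPABLE_VERSION
is hypothetical. The sketch is written so that it compiles without
pcre.h; in a real program the arguments would be PCRE_MAJOR and
PCRE_MINOR, and the macro could be used in an #if directive.

```c
/* Hypothetical macro: true when version major.minor is 8.20 or later,
   the first release described here as having JIT support. It expands
   to an integer constant expression, so it also works in #if, for
   example: #if JIT_CAPABLE_VERSION(PCRE_MAJOR, PCRE_MINOR) */
#define JIT_CAPABLE_VERSION(major, minor) \
    ((major) > 8 || ((major) == 8 && (minor) >= 20))

/* Run-time form of the same comparison, for illustration only. */
int version_may_have_jit(int major, int minor)
{
    return JIT_CAPABLE_VERSION(major, minor);
}
```

A positive result only says that the headers are new enough; whether JIT
was actually built in is still a question for pcre_config() at run time.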

SIMPLE USE OF JIT

You have to do two things to make use of the JIT support in the
simplest way:

  (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
      each compiled pattern, and pass the resulting pcre_extra block to
      pcre_exec().

  (2) Use pcre_free_study() to free the pcre_extra block when it is
      no longer needed instead of just freeing it yourself. This
      ensures that any JIT data is also freed.

For a program that may be linked with pre-8.20 versions of PCRE, you
can insert

  #ifndef PCRE_STUDY_JIT_COMPILE
  #define PCRE_STUDY_JIT_COMPILE 0
  #endif

so that no option is passed to pcre_study(), and then use something
like this to free the study data:

  #ifdef PCRE_CONFIG_JIT
      pcre_free_study(study_ptr);
  #else
      pcre_free(study_ptr);
  #endif

In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack"
below.

If JIT support is not available, PCRE_STUDY_JIT_COMPILE is ignored, and
no JIT data is set up. Otherwise, the compiled pattern is passed to the
JIT compiler, which turns it into machine code that executes much
faster than the normal interpretive code. When pcre_exec() is passed a
pcre_extra block containing a pointer to JIT code, it obeys that
instead of the normal code. The result is identical, but the code runs
much faster.

There are some pcre_exec() options that are not supported for JIT
execution. There are also some pattern items that JIT cannot handle.
Details are given below. In both cases, execution automatically falls
back to the interpretive code.

If the JIT compiler finds an unsupported item, no JIT data is
generated. You can find out if JIT execution is available after studying a
pattern by calling pcre_fullinfo() with the PCRE_INFO_JIT option. A
result of 1 means that JIT compilation was successful. A result of 0
means that JIT support is not available, or the pattern was not studied
with PCRE_STUDY_JIT_COMPILE, or the JIT compiler was not able to handle
the pattern.

Once a pattern has been studied, with or without JIT, it can be used as
many times as you like for matching different subject strings.

UNSUPPORTED OPTIONS AND PATTERN ITEMS

The only pcre_exec() options that are supported for JIT execution are
PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NOTEMPTY_ATSTART. Note in particular that partial matching is not
supported.

The unsupported pattern items are:

  \C             match a single byte; not supported in UTF-8 mode
  (?Cn)          callouts
  (*COMMIT)      )
  (*MARK)        )
  (*PRUNE)       ) the backtracking control verbs
  (*SKIP)        )
  (*THEN)        )

Support for some of these may be added in future.

RETURN VALUES FROM JIT EXECUTION

When a pattern is matched using JIT execution, the return values are
the same as those given by the interpretive pcre_exec() code, with the
addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
that the memory used for the JIT stack was insufficient. See
"Controlling the JIT stack" below for a discussion of JIT stack usage. For
compatibility with the interpretive pcre_exec() code, no more than
two-thirds of the ovector argument is used for passing back captured
substrings.
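
The two-thirds rule can be made concrete with a little arithmetic in C.
The function names below are hypothetical and not part of the PCRE API.
The sketch assumes the usual pcre_exec() layout, in which captured
substrings are returned as (start, end) offset pairs and pair 0 describes
the whole match.

```c
/* Hypothetical helper: the number of (start, end) offset pairs that
   fit in the usable first two-thirds of an ovector that has
   `ovecsize` int elements. A size that is not a multiple of three
   is effectively rounded down. */
int ovector_max_pairs(int ovecsize)
{
    return ovecsize / 3;
}

/* Hypothetical helper: length of captured substring `n`, given that
   pair n occupies elements 2*n and 2*n+1 of the ovector. */
int ovector_substring_length(const int *ovector, int n)
{
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For the 30-element ovector used in the example code in this document,
this gives 10 pairs: the whole match plus up to nine capturing groups.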

The error code PCRE_ERROR_MATCHLIMIT is returned by the JIT code if
searching a very large pattern tree goes on for too long, as it is in
the same circumstance when JIT is not used, but the details of exactly
what is counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
code is never returned by JIT execution.

SAVING AND RESTORING COMPILED PATTERNS

The code that is generated by the JIT compiler is
architecture-specific, and is also position dependent. For those reasons it cannot be
saved (in a file or database) and restored later like the bytecode and
other data of a compiled pattern. Saving and restoring compiled
patterns is not something many people do. More detail about this facility
is given in the pcreprecompile documentation. It should be possible to
run pcre_study() on a saved and restored pattern, and thereby recreate
the JIT data, but because JIT compilation uses significant resources,
it is probably not worth doing this; you might as well recompile the
original pattern.

CONTROLLING THE JIT STACK

When the compiled JIT code runs, it needs a block of memory to use as a
stack. By default, it uses 32K on the machine stack. However, some
large or complicated patterns need more than this. The error
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
Three functions are provided for managing blocks of memory for use as
JIT stacks. There is further discussion about the use of JIT stacks in
the section entitled "JIT stack FAQ" below.

The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
are a starting size and a maximum size, and it returns a pointer to an
opaque structure of type pcre_jit_stack, or NULL if there is an error.
The pcre_jit_stack_free() function can be used to free a stack that is
no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.)

JIT uses far less memory for recursion than the interpretive code, and
a maximum stack size of 512K to 1M should be more than enough for any
pattern.

The pcre_assign_jit_stack() function specifies which stack JIT code
should use. Its arguments are as follows:

  pcre_extra         *extra
  pcre_jit_callback  callback
  void               *data

The extra argument must be the result of studying a pattern with
PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the
other two options:

  (1) If callback is NULL and data is NULL, an internal 32K block
      on the machine stack is used.

  (2) If callback is NULL and data is not NULL, data must be
      a valid JIT stack, the result of calling pcre_jit_stack_alloc().

  (3) If callback is not NULL, it must point to a function that is
      called with data as an argument at the start of matching, in
      order to set up a JIT stack. If the result is NULL, the internal
      32K stack is used; otherwise the return value must be a valid
      JIT stack, the result of calling pcre_jit_stack_alloc().

You may safely assign the same JIT stack to more than one pattern, as
long as they are all matched sequentially in the same thread. In a
multithread application, each thread must use its own JIT stack.

Strictly speaking, even more is allowed. You can assign the same stack
to any number of patterns as long as they are not used for matching by
multiple threads at the same time. For example, you can assign the same
stack to all compiled patterns, and use a global mutex in the callback
to wait until the stack is available for use. However, this is an
inefficient solution, and not recommended.

This is a suggestion for how a typical multithreaded program might
operate:

  During thread initialization
    thread_local_var = pcre_jit_stack_alloc(...)

  During thread exit
    pcre_jit_stack_free(thread_local_var)

  Use a one-line callback function
    return thread_local_var

All the functions described in this section do nothing if JIT is not
available, and pcre_assign_jit_stack() does nothing unless the extra
argument is non-NULL and points to a pcre_extra block that is the
result of a successful study with PCRE_STUDY_JIT_COMPILE.

JIT STACK FAQ

(1) Why do we need JIT stacks?

6731

6732

PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack

6733

where the local data of the current node is pushed before checking its

6734

child nodes. Allocating real machine stack on some platforms is diffi-

6735

cult. For example, the stack chain needs to be updated every time if we

6736

extend the stack on PowerPC. Although it is possible, its updating

6737

time overhead decreases performance. So we do the recursion in memory.

6738

6739

(2) Why don't we simply allocate blocks of memory with malloc()?

6740

6741

Modern operating systems have a nice feature: they can reserve an

6742

address space instead of allocating memory. We can safely allocate mem-

6743

ory pages inside this address space, so the stack could grow without

6744

moving memory data (this is important because of pointers). Thus we can

6745

allocate 1M address space, and use only a single memory page (usually

6746

4K) if that is enough. However, we can still grow up to 1M anytime if

6747

needed.

6748

6749

(3) Who "owns" a JIT stack?

6750

6751

The owner of the stack is the user program, not the JIT studied pattern

6752

or anything else. The user program must ensure that if a stack is used

6753

by pcre_exec(), (that is, it is assigned to the pattern currently run-

6754

ning), that stack must not be used by any other threads (to avoid over-

6755

writing the same memory area). The best practice for multithreaded pro-

6756

grams is to allocate a stack for each thread, and return this stack

6757

through the JIT callback function.

6758

6759

(4) When should a JIT stack be freed?

6760

6761

You can free a JIT stack at any time, as long as it will not be used by

6762

pcre_exec() again. When you assign the stack to a pattern, only a

6763

pointer is set. There is no reference counting or any other magic. You

6764

can free the patterns and stacks in any order, anytime. Just do not

6765

call pcre_exec() with a pattern pointing to an already freed stack, as

6766

that will cause SEGFAULT. (Also, do not free a stack currently used by

6767

pcre_exec() in another thread). You can also replace the stack for a

6768

pattern at any time. You can even free the previous stack before

6769

assigning a replacement.

6770

6771

(5) Should I allocate/free a stack every time before/after calling

6772

pcre_exec()?

6773

6774

No, because this is too costly in terms of resources. However, you

6775

could implement some clever idea which release the stack if it is not

6776

used in let's say two minutes. The JIT callback can help to achive this

6777

without keeping a list of the currently JIT studied patterns.

(6) OK, the stack is for long term memory allocation. But what happens
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
until the stack is freed?

Especially on embedded systems, it might be a good idea to release
memory sometimes without freeing the stack. There is no API for this at
the moment. A function that returns the memory currently allocated for
a stack, and another that releases memory (shrinking the stack), would
probably be a good idea if someone needs this.

(7) This is too much of a headache. Isn't there any better solution for
JIT stack handling?

No, thanks to Windows. If POSIX threads were used everywhere, we could
throw out this complicated API.

EXAMPLE CODE

This is a single-threaded example that specifies a JIT stack without
using a callback. The variables pattern, subject, and length are
assumed to be supplied by the surrounding program.

  int rc;
  int erroffset;
  int ovector[30];
  const char *error;
  pcre *re;
  pcre_extra *extra;
  pcre_jit_stack *jit_stack;

  re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
  /* Check for errors */
  extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
  jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
  /* Check for error (NULL) */
  pcre_assign_jit_stack(extra, NULL, jit_stack);
  rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
  /* Check results */
  pcre_free(re);
  pcre_free_study(extra);
  pcre_jit_stack_free(jit_stack);

SEE ALSO

pcreapi(3)

AUTHOR

Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 26 November 2011


------------------------------------------------------------------------------


PCREPARTIAL(3) PCREPARTIAL(3)

NAME

PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

In normal use of PCRE, if the subject string that is passed to
pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
There are circumstances where it might be helpful to distinguish this
case from other cases in which there is no match.

Consider, for example, an application where a human is required to type
in data for a field with specific formatting requirements. An example
might be a date in the form ddmmmyy, defined by this pattern:

  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
raise an error as soon as a mistake is made, by beeping and not
reflecting the character that has been typed, for example. This
immediate feedback is likely to be a better user interface than a check
that is delayed until the entire string has been entered. Partial
matching can also be useful when the subject string is very long and is
not all available at once.

PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
for PCRE_PARTIAL_SOFT. The essential difference between the two options
is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two matching
functions. If both options are set, PCRE_PARTIAL_HARD takes precedence.

Setting a partial matching option for pcre_exec() disables the use of
any just-in-time code that was set up by calling pcre_study() with the
PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
optimizations. PCRE remembers the last literal byte in a pattern, and
abandons matching immediately if such a byte is not present in the
subject string. This optimization cannot be used for a subject string
that might match only partially. If the pattern was studied, PCRE knows
the minimum length of a matching string, and does not bother to run the
matching function on shorter strings. This optimization is also
disabled for partial matching.

PARTIAL MATCHING USING pcre_exec()

A partial match occurs during a call to pcre_exec() when the end of the
subject string is reached successfully, but matching cannot continue
because more characters are needed. However, at least one character in
the subject must have been inspected. This character need not form part
of the final matched string; lookbehind assertions and the \K escape
sequence provide ways of inspecting characters before the start of a
matched substring. The requirement for inspecting at least one
character exists because an empty string can always be matched; without
such a restriction there would always be a partial match of an empty
string at the end of the subject.

If there are at least two slots in the offsets vector when pcre_exec()
returns with a partial match, the first slot is set to the offset of
the earliest character that was inspected when the partial match was
found. For convenience, the second offset points to the end of the
subject so that a substring can easily be identified.

For the majority of patterns, the first offset identifies the start of
the partially matched string. However, for patterns that contain
lookbehind assertions, or \K, or begin with \b or \B, earlier
characters have been inspected while carrying out the match. For
example:

  /(?<=abc)123/

This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the offsets after a partial match are 3
and 8, identifying the substring "abc12", because all these characters
are needed if another match is tried with extra characters added to the
subject.
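
As a small illustration of how the two offsets identify a substring,
here is a sketch in plain C that copies the inspected portion out of
the subject. The function name partial_substring is hypothetical, and
the offsets 3 and 8 used with it are worked out by counting characters
in the example above rather than taken from a PCRE run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies subject[ovector[0] .. ovector[1]) into out as a C string.
   Returns the number of bytes copied, or -1 if the offsets are invalid
   or out (of capacity cap) is too small. */
static int partial_substring(const char *subject, const int *ovector,
                             char *out, size_t cap)
{
    int start, end;
    size_t n;

    assert(subject != NULL && ovector != NULL && out != NULL);
    start = ovector[0];
    end = ovector[1];
    if (start < 0 || end < start) return -1;
    n = (size_t)(end - start);
    if (n + 1 > cap) return -1;       /* need room for the terminator */
    memcpy(out, subject + start, n);
    out[n] = '\0';
    return (int)n;
}
```

With the subject "xyzabc12" and offsets 3 and 8, the copied string is
"abc12". An application would call such a helper only after pcre_exec()
has returned PCRE_ERROR_PARTIAL with at least two slots in the offsets
vector.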

What happens when a partial match is identified depends on which of the
two partial matching options is set.

PCRE_PARTIAL_SOFT with pcre_exec()

If PCRE_PARTIAL_SOFT is set when pcre_exec() identifies a partial
match, the partial match is remembered, but matching continues as
normal, and other alternatives in the pattern are tried. If no complete
match can be found, pcre_exec() returns PCRE_ERROR_PARTIAL instead of
PCRE_ERROR_NOMATCH.

This option is "soft" because it prefers a complete match over a
partial match. All the various matching items in a pattern behave as if
the subject string is potentially complete. For example, \z, \Z, and $
match at the end of the subject, as normal, and for \b and \B the end
of the subject is treated as a non-alphanumeric.

If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:

  /123\w+X|dogY/

If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached
during matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set
to 3 and 9, identifying "123dog" as the first partial match that was
found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)

PCRE_PARTIAL_HARD with pcre_exec()

If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without
continuing to search for possible complete matches. This option is
"hard" because it prefers an earlier partial match over a later
complete match. For this reason, the assumption is made that the end of
the supplied subject string may not be the true end of the available
data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
subject, the result is PCRE_ERROR_PARTIAL.

Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8
subject strings for validity. Normally, an invalid UTF-8 sequence
causes the error PCRE_ERROR_BADUTF8. However, in the special case of a
truncated UTF-8 character at the end of the subject,
PCRE_ERROR_SHORTUTF8 is returned when PCRE_PARTIAL_HARD is set.

Comparing hard and soft partial matching

The difference between the two partial matching options can be
illustrated by a pattern such as:

  /dog(sbody)?/

This matches either "dog" or "dogsbody", greedily (that is, it prefers
the longer string if possible). If it is matched against the string
"dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
On the other hand, if the pattern is made ungreedy the result is
different:

  /dog(sbody)??/

In this case the result is always a complete match because pcre_exec()
finds that first, and it never continues after finding a match. It
might be easier to follow this explanation by thinking of the two
patterns like this:

  /dog(sbody)?/    is the same as  /dogsbody|dog/
  /dog(sbody)??/   is the same as  /dog|dogsbody/

The second pattern will never match "dogsbody" when pcre_exec() is
used, because it will always find the shorter match first.

PARTIAL MATCHING USING pcre_dfa_exec()

The pcre_dfa_exec() function moves along the subject string character
by character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of
the pattern, there is the possibility of a partial match, again
provided that at least one character has been inspected.

When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
there have been no complete matches. Otherwise, the complete matches
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
takes precedence over any complete matches. The portion of the string
that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.

Because pcre_dfa_exec() always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its
behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is set.
Consider the string "dog" matched against the ungreedy pattern shown
above:

  /dog(sbody)??/

Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.

PARTIAL MATCHING AND WORD BOUNDARIES

If a pattern ends with one of the sequences \b or \B, which test for
word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
counter-intuitive results. Consider this pattern:

  /\bcat\b/

This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
following character cannot take place, so a partial match is found.
However, pcre_exec() carries on with normal matching, which matches \b
at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not
PCRE_ERROR_PARTIAL. The same thing happens with pcre_dfa_exec(),
because it also finds the complete match.

Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.

FORMERLY RESTRICTED PATTERNS

For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the pcre_exec() function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
used with all patterns. From release 8.00 onwards, the restrictions no
longer apply, and partial matching with pcre_exec() can be requested
for any pattern.

Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
not conform to the restrictions, pcre_exec() returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. A
call to pcre_fullinfo() with PCRE_INFO_OKPARTIAL, which finds out
whether a compiled pattern can be used for partial matching, now always
returns 1.

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

If the escape sequence \P is present in a pcretest data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 25jun04\P
   0: 25jun04
   1: jun
  data> 25dec3\P
  Partial match: 25dec3
  data> 3ju\P
  Partial match: 3ju
  data> 3juj\P
  No match
  data> j\P
  No match

The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the
complete pattern, but the first two are partial matches. Similar output
is obtained when pcre_dfa_exec() is used.

If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()

When a partial match has been found using pcre_dfa_exec(), it is
possible to continue the match by providing additional subject data and
calling pcre_dfa_exec() again with the same compiled regular
expression, this time setting the PCRE_DFA_RESTART option. You must
pass the same working space as before, because this is where details of
the previous partial match are stored. Here is an example using
pcretest, using the \R escape sequence to set the PCRE_DFA_RESTART
option (\D specifies the use of pcre_dfa_exec()):

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\P\D
  Partial match: 23ja
  data> n05\R\D
   0: n05

The first call has "23ja" as the subject, and requests partial
matching; the second call has "n05" as the subject for the continued
(restarted) match. Notice that when the match is complete, only the
last part is shown; PCRE does not retain the previously partially
matched string. It is up to the calling program to do that if it needs
to.

You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments.
This facility can be used to pass very long subject strings to
pcre_dfa_exec().

MULTI-SEGMENT MATCHING WITH pcre_exec()

From release 8.00, pcre_exec() can also be used to do multi-segment
matching. Unlike pcre_dfa_exec(), it is not possible to restart the
previous match with a new segment of data. Instead, new data must be
added to the previous subject string, and the entire match re-run,
starting from the point where the partial match occurred. Earlier data
can be discarded. It is best to use PCRE_PARTIAL_HARD in this
situation, because it does not treat the end of a segment as the end of
the subject when matching \z, \Z, \b, \B, and $. Consider an unanchored
pattern that matches dates:

    re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data> The date is 23ja\P\P
  Partial match: 23ja

At this stage, an application could discard the text preceding "23ja",
add on text from the next segment, and call pcre_exec() again. Unlike
pcre_dfa_exec(), the entire matching string must always be available,
and the complete matching process occurs for each call, so more memory
and more processing time are needed.

Note: If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match will
include characters that precede the partially matched string itself,
because these must be retained when adding on more characters for a
subsequent matching attempt.
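
The buffer handling described in this section (discard the text before
the partial match, then add the next segment) might be sketched as
follows. This is plain C with no PCRE calls; the function name
carry_over and its parameters are hypothetical, and keep_from is
assumed to be the first offset that pcre_exec() returned for the
partial match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* buf holds len bytes of subject data and has room for cap bytes.
   Moves the retained tail buf[keep_from .. len) to the front of buf,
   appends the next segment, and returns the new length. Returns 0 if
   the result would not fit in cap bytes (buf is then unchanged). */
static size_t carry_over(char *buf, size_t len, size_t keep_from,
                         const char *next, size_t next_len, size_t cap)
{
    size_t kept;

    assert(buf != NULL && next != NULL && keep_from <= len);
    kept = len - keep_from;
    if (kept + next_len > cap) return 0;
    memmove(buf, buf + keep_from, kept);  /* source and target may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

For the run shown above, the first offset is 12 (the start of "23ja" in
"The date is 23ja"), so only four bytes are kept before the next
segment is appended. The combined buffer is then passed to pcre_exec()
again from offset zero.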


ISSUES WITH MULTI-SEGMENT MATCHING

Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.

1. If the pattern contains a test for the beginning of a line, you need
to pass the PCRE_NOTBOL option when the subject string for any call
does not start at the beginning of a line. There is also a PCRE_NOTEOL
option, but in practice when doing multi-segment matching you should be
using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.

2. Lookbehind assertions at the start of a pattern are catered for in
the offsets that are returned for a partial match. However, in theory,
a lookbehind assertion later in the pattern could require even earlier
characters to be inspected, and it might not have been reached when a
partial match occurs. This is probably an extremely unlikely case; you
could guard against it to a certain extent by always including extra
characters at the start.

3. Matching a subject string that is split into multiple segments may
not always produce exactly the same result as matching over one single
long string, especially when PCRE_PARTIAL_SOFT is used. The section
"Partial Matching and Word Boundaries" above describes an issue that
arises if the pattern ends with \b or \B. Another kind of difference
may occur when there are multiple matching possibilities, because (for
PCRE_PARTIAL_SOFT) a partial match result is given only when there are
no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer
possible. Consider again this pcretest example:

    re> /dog(sbody)?/
  data> dogsb\P
   0: dog
  data> do\P\D
  Partial match: do
  data> gsb\R\P\D
   0: g
  data> dogsbody\D
   0: dogsbody
   1: dog

The first data line passes the string "dogsb" to pcre_exec(), setting
the PCRE_PARTIAL_SOFT option. Although the string is a partial match
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
shorter string "dog" is a complete match. Similarly, when the subject
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.

Because of these problems, it is best to use PCRE_PARTIAL_HARD when
matching multi-segment data. The example above then behaves
differently:

    re> /dog(sbody)?/
  data> dogsb\P\P
  Partial match: dogsb
  data> do\P\D
  Partial match: do
  data> gsb\R\P\P\D
  Partial match: gsb

4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
this pattern:

  1234|3789

If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
match at one point in the subject are remembered. The problem arises
because the start of the second alternative matches within the first
alternative. There is no problem with anchored patterns or patterns
such as:

  1234|ABCD

where no string can be a partial match for both alternatives. This is
not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:

    re> /1234|3789/
  data> ABC123\P\P
  Partial match: 123
  data> 1237890
   0: 3789

Of course, instead of using PCRE_DFA_RESTART, the same technique of
re-running the entire match can also be used with pcre_dfa_exec().
Another possibility is to work with two buffers. If a partial match at
offset n in the first buffer is followed by "no match" when
PCRE_DFA_RESTART is used on the second buffer, you can then try a new
match starting at offset n+1 in the first buffer.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 26 August 2011


------------------------------------------------------------------------------


PCREPRECOMPILE(3) PCREPRECOMPILE(3)

NAME

PCRE - Perl-compatible regular expressions

SAVING AND RE-USING PRECOMPILED PCRE PATTERNS

If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled
form instead of having to compile them every time the application is
run. If you are not using any private character tables (see the
pcre_maketables() documentation), this is relatively straightforward.
If you are using private tables, it is a little bit more complicated.
However, if you are using the just-in-time optimization feature of
pcre_study(), it is not possible to save and reload the JIT data.

If you save compiled patterns to a file, you can copy them to a
different host and run them there. This works even if the new host has
the opposite endianness to the one on which the patterns were compiled.
There may be a small performance penalty, but it should be
insignificant. However, compiling regular expressions with one version
of PCRE for use with a different version is not guaranteed to work and
may cause crashes, and saving and restoring a compiled pattern loses
any JIT optimization data.

SAVING A COMPILED PATTERN

The value returned by pcre_compile() points to a single block of memory
that holds the compiled pattern and associated data. You can find the
length of this block in bytes by calling pcre_fullinfo() with an
argument of PCRE_INFO_SIZE. You can then save the data in any
appropriate manner. Here is sample code that compiles a pattern and
writes it to a file. It assumes that the variable fd refers to a file
(a FILE * stream) that is open for output:

  int erroroffset, rc;
  size_t size, written;   /* PCRE_INFO_SIZE yields a size_t */
  const char *error;
  pcre *re;

  re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
  if (re == NULL) { ... handle errors ... }
  rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
  if (rc < 0) { ... handle errors ... }
  written = fwrite(re, 1, size, fd);
  if (written != size) { ... handle errors ... }

In this example, the bytes that comprise the compiled pattern are
copied exactly. Note that this is binary data that may contain any of
the 256 possible byte values. On systems that make a distinction
between binary and non-binary data, be sure that the file is opened for
binary output.

If you want to write more than one pattern to a file, you will have to
devise a way of separating them. For binary data, preceding each
pattern with its length is probably the most straightforward approach.
Another possibility is to write out the data in hexadecimal instead of
binary, one pattern to a line.
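
The length-prefix approach might look like the following sketch. It is
plain C with no PCRE calls, and it is only one possible layout: the
function names write_blob and read_blob and the 4-byte big-endian
length header are assumptions made for this illustration, not a format
that PCRE defines.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Writes a 4-byte big-endian length followed by the data bytes.
   Returns 0 on success, -1 on failure. */
static int write_blob(FILE *f, const void *data, size_t size)
{
    unsigned char hdr[4];

    assert(f != NULL && data != NULL);
    if (size > 0xFFFFFFFFu) return -1;    /* does not fit in the header */
    hdr[0] = (unsigned char)((size >> 24) & 0xFF);
    hdr[1] = (unsigned char)((size >> 16) & 0xFF);
    hdr[2] = (unsigned char)((size >> 8) & 0xFF);
    hdr[3] = (unsigned char)(size & 0xFF);
    if (fwrite(hdr, 1, 4, f) != 4) return -1;
    if (fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Reads one record written by write_blob(). On success returns a
   malloc'ed buffer (the caller frees it) and stores its length in
   *size. Returns NULL at end of file or on error. */
static void *read_blob(FILE *f, size_t *size)
{
    unsigned char hdr[4];
    size_t n;
    void *buf;

    assert(f != NULL && size != NULL);
    if (fread(hdr, 1, 4, f) != 4) return NULL;
    n = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16) |
        ((size_t)hdr[2] << 8) | (size_t)hdr[3];
    buf = malloc(n ? n : 1);
    if (buf == NULL) return NULL;
    if (fread(buf, 1, n, f) != n) { free(buf); return NULL; }
    *size = n;
    return buf;
}
```

A program would pass the pointer from pcre_compile() and the size from
PCRE_INFO_SIZE to write_blob(), once per pattern. Writing the length
byte by byte keeps the header readable on a host of the opposite
endianness. The file must be opened in binary mode.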

Saving compiled patterns in a file is only one possible way of storing
them for later use. They could equally well be saved in a database, or
in the memory of some daemon process that passes them via sockets to
the processes that want them.

If the pattern has been studied, it is also possible to save the normal
study data in a similar way to the compiled pattern itself. However, if
the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data that
is created cannot be saved because it is too dependent on the current
environment. When studying generates additional information,
pcre_study() returns a pointer to a pcre_extra data block. Its format
is defined in the section on matching a pattern in the pcreapi
documentation. The study_data field points to the binary study data,
and this is what you must save (not the pcre_extra block itself). The
length of the study data can be obtained by calling pcre_fullinfo()
with an argument of PCRE_INFO_STUDYSIZE. Remember to check that
pcre_study() did return a non-NULL value before trying to save the
study data.

RE-USING A PRECOMPILED PATTERN

Re-using a precompiled pattern is straightforward. Having reloaded it
into main memory, you pass its pointer to pcre_exec() or
pcre_dfa_exec() in the usual way. This should work even on another
host, and even if that host has the opposite endianness to the one
where the pattern was compiled.

However, if you passed a pointer to custom character tables when the
pattern was compiled (the tableptr argument of pcre_compile()), you
must now pass a similar pointer to pcre_exec() or pcre_dfa_exec(),
because the value saved with the compiled pattern will obviously be
nonsense. A field in a pcre_extra block is used to pass this data, as
described in the section on matching a pattern in the pcreapi
documentation.

If you did not provide custom character tables when the pattern was
compiled, the pointer in the compiled pattern is NULL, which causes
pcre_exec() to use PCRE's internal tables. Thus, you do not need to
take any special action at run time in this case.

If you saved study data with the compiled pattern, you need to create
your own pcre_extra data block and set the study_data field to point to
the reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA
bit in the flags field to indicate that study data is present. Then
pass the pcre_extra block to pcre_exec() or pcre_dfa_exec() in the
usual way. If the pattern was studied for just-in-time optimization,
that data cannot be saved, and so is lost by a save/restore cycle.

COMPATIBILITY WITH DIFFERENT PCRE RELEASES

In general, it is safest to recompile all saved patterns when you
update to a new PCRE release, though not all updates actually require
this.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 26 August 2011


------------------------------------------------------------------------------


PCREPERFORM(3) PCREPERFORM(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

Two aspects of performance are discussed below: memory usage and
processing time. The way you express your pattern as a regular
expression can affect both of them.

COMPILED PATTERN MEMORY USAGE

Patterns are compiled by PCRE into a reasonably efficient byte code, so
that most simple patterns do not use much memory. However, there is one
case where the memory usage of a compiled pattern can be unexpectedly
large. If a parenthesized subpattern has a quantifier with a minimum
greater than 1 and/or a limited maximum, the whole subpattern is
repeated in the compiled code. For example, the pattern

  (abc|def){2,4}

is compiled as if it were

  (abc|def)(abc|def)((abc|def)(abc|def)?)?

(Technical aside: It is done this way so that backtrack points within
each of the repetitions can be independently maintained.)
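
To get a rough feel for how quickly this replication grows, the number
of copies can be estimated from the quantifier bounds alone. The sketch
below is plain C and is only an estimate derived from the expansion
shown above: the function name group_copies is hypothetical, it covers
only quantifiers with a limited maximum, and it says nothing about the
actual byte count that PCRE produces.

```c
#include <assert.h>

/* Number of copies of a group in the expansion described above for the
   quantifier {min,max}: min mandatory copies plus (max - min) nested
   optional ones, which is max copies in total. */
static unsigned long group_copies(unsigned long min, unsigned long max)
{
    assert(min <= max && max > 0);
    return min + (max - min);
}
```

For (abc|def){2,4} this gives 4, matching the four copies written out
above. When one bounded repetition is nested inside another, the copies
multiply: a group quantified {1,1000} inside a group quantified {1,3}
would, on this estimate, appear group_copies(1, 1000) *
group_copies(1, 3) = 3000 times.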

For regular expressions whose quantifiers use only small numbers, this
is not usually a problem. However, if the numbers are large, and
particularly if such repetitions are nested, the memory usage can
become an embarrassment. For example, the very simple pattern

  ((ab){1,1000}c){1,3}

uses 51K bytes when compiled. When PCRE is compiled with its default
internal pointer size of two bytes, the size limit on a compiled
pattern is 64K, and this is reached with the above pattern if the outer
repetition is increased from 3 to 4. PCRE can be compiled to use larger
internal pointers and thus handle larger compiled patterns, but it is
better to try to rewrite your pattern to use less memory if you can.

One way of reducing the memory usage for such patterns is to make use
of PCRE's "subroutine" facility. Re-writing the above pattern as

  ((ab)(?2){0,999}c)(?1){0,2}

reduces the memory requirements to 18K, and indeed it remains under 20K
even with the outer repetition increased to 100. However, this pattern
is not exactly equivalent, because the "subroutine" calls are treated
as atomic groups into which there can be no backtracking if there is a
subsequent matching failure. Therefore, PCRE cannot do this kind of
rewriting automatically. Furthermore, there is a noticeable loss of
speed when executing the modified pattern. Nevertheless, if the atomic
grouping is not a problem and the loss of speed is acceptable, this
kind of rewriting will allow you to process patterns that PCRE cannot
otherwise handle.

STACK USAGE AT RUN TIME

When pcre_exec() is used for matching, certain kinds of pattern can
cause it to use large amounts of the process stack. In some
environments the default process stack is quite small, and if it runs
out the result is often SIGSEGV. This issue is probably the most
frequently raised problem with PCRE. Rewriting your pattern can often
help. The pcrestack documentation discusses this issue in detail.

PROCESSING TIME

Certain items in regular expression patterns are processed more
efficiently than others. It is more efficient to use a character class
like [aeiou] than a set of single-character alternatives such as
(a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
contains a lot of useful general discussion about optimizing regular
expressions for efficient performance. This document contains a few
observations about PCRE.

Using Unicode character properties (the \p, \P, and \X escapes) is
slow, because PCRE has to scan a structure that contains data for over
fifteen thousand characters whenever it needs a character's property.
If you can find an alternative pattern that does not use character
properties, it will probably be faster.

By default, the escape sequences \b, \d, \s, and \w, and the POSIX
character classes such as [:alpha:] do not use Unicode properties,
partly for backwards compatibility, and partly for performance reasons.
However, you can set PCRE_UCP if you want Unicode character properties
to be used. This can double the matching time for items such as \d,
when matched with pcre_exec(); the performance loss is less with
pcre_dfa_exec(), and in both cases there is not much difference for \b.

When a pattern begins with .* not in parentheses, or in parentheses
that are not the subject of a backreference, and the PCRE_DOTALL option
is set, the pattern is implicitly anchored by PCRE, since it can match
only at the start of a subject string. However, if PCRE_DOTALL is not
set, PCRE cannot make this optimization, because the . metacharacter
does not then match a newline, and if the subject string contains
newlines, the pattern may match from the character immediately
following one of them instead of from the very start. For example, the
pattern

   .*second

matches the subject "first\nand second" (where \n stands for a newline
character), with the match starting at the seventh character. In order
to do this, PCRE has to retry the match starting after every newline in
the subject.

If you are using such a pattern with subject strings that do not
contain newlines, the best performance is obtained by setting
PCRE_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
explicit anchoring. That saves PCRE from having to scan along the
subject looking for a newline to restart at.

Beware of patterns that contain nested indefinite repeats. These can
take a long time to run when applied to a string that does not match.
Consider the pattern fragment

   ^(a+)*

This can match "aaaa" in 16 different ways, and this number increases
very rapidly as the string gets longer. (The * repeat can match 0, 1,
2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE has in
principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
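
The figure of 16 can be reproduced with a small counting model. The
sketch below is only an illustration and is not how PCRE matches:
count_ways() is a hypothetical helper that counts the ways the * repeat
can split a prefix of n "a" characters into successive a+ groups,
including the case where the * repeat matches zero times.

```c
#include <stddef.h>

/* Number of distinct ways ^(a+)* can match some prefix of a run of
   n 'a' characters (hypothetical helper, for illustration only). */
unsigned long count_ways(size_t n)
{
    unsigned long total = 1;        /* the * repeat stops at this point */
    size_t len;

    for (len = 1; len <= n; len++)  /* next a+ group takes len characters */
        total += count_ways(n - len);
    return total;
}
```

Under this model count_ways(4) is 16, and the count doubles with each
additional character, which is why a failing match against even a short
run of "a" characters can become very slow.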

An optimization catches some of the simpler cases such as

   (a+)*b

where a literal character follows. Before embarking on the standard
matching procedure, PCRE checks that there is a "b" later in the
subject string, and if there is not, it fails the match immediately.
However, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of

   (a+)*\d

with the pattern above. When applied to a whole line of "a" characters,
(a+)*b gives a failure almost instantly, whereas (a+)*\d takes an
appreciable time with strings longer than about 20 characters.

In many cases, the solution to this kind of performance issue is to use
an atomic group or a possessive quantifier.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 16 May 2010

------------------------------------------------------------------------------

PCREPOSIX(3) PCREPOSIX(3)

NAME

PCRE - Perl-compatible regular expressions.

SYNOPSIS OF POSIX API

#include <pcreposix.h>

int regcomp(regex_t *preg, const char *pattern,
     int cflags);

int regexec(regex_t *preg, const char *string,
     size_t nmatch, regmatch_t pmatch[], int eflags);

size_t regerror(int errcode, const regex_t *preg,
     char *errbuf, size_t errbuf_size);

void regfree(regex_t *preg);

DESCRIPTION

This set of functions provides a POSIX-style API to the PCRE regular
expression package. See the pcreapi documentation for a description of
PCRE's native API, which contains much additional functionality.

The functions described here are just wrapper functions that ultimately
call the PCRE native API. Their prototypes are defined in the
pcreposix.h header file, and on Unix systems the library itself is
called pcreposix.a, so can be accessed by adding -lpcreposix to the
command for linking an application that uses them. Because the POSIX
functions call the native ones, it is also necessary to add -lpcre.

I have implemented only those POSIX option bits that can be reasonably
mapped to PCRE native options. In addition, the option REG_EXTENDED is
defined with the value zero. This has no effect, but since programs
that are written to the POSIX interface often use it, this makes it
easier to slot in PCRE as a replacement library. Other POSIX options
are not even defined.

There are also some other options that are not defined by POSIX. These
have been added at the request of users who want to make use of certain
PCRE-specific features via the POSIX calling interface.

When PCRE is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular
expressions themselves are still those of Perl, subject to the setting
of various PCRE options, as described below. "POSIX-like in style"
means that the API approximates to the POSIX definition; it is not
fully POSIX-compatible, and in multi-byte encoding domains it is
probably even less compatible.

The header for these functions is supplied as pcreposix.h to avoid any
potential clash with other POSIX libraries. It can, of course, be
renamed or aliased as regex.h, which is the "correct" name. It provides
two structure types, regex_t for compiled internal forms, and
regmatch_t for returning captured substrings. It also defines some
constants whose names start with "REG_"; these are used for setting
options and identifying error codes.

COMPILING A PATTERN

The function regcomp() is called to compile a pattern into an internal
form. The pattern is a C string terminated by a binary zero, and is
passed in the argument pattern. The preg argument is a pointer to a
regex_t structure that is used as a base for storing information about
the compiled regular expression.

The argument cflags is either zero, or contains one or more of the bits
defined by the following macros:

REG_DOTALL

The PCRE_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of
the POSIX standard.

REG_ICASE

The PCRE_CASELESS option is set when the regular expression is passed
for compilation to the native function.

REG_NEWLINE

The PCRE_MULTILINE option is set when the regular expression is passed
for compilation to the native function. Note that this does not mimic
the defined POSIX behaviour for REG_NEWLINE (see the following
section).

REG_NOSUB

The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is
passed for compilation to the native function. In addition, when a
pattern that is compiled with this flag is passed to regexec() for
matching, the nmatch and pmatch arguments are ignored, and no captured
strings are returned.

REG_UCP

The PCRE_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE to use Unicode
properties when matching \d, \w, etc., instead of just recognizing
ASCII values. Note that REG_UCP is not part of the POSIX standard.

REG_UNGREEDY

The PCRE_UNGREEDY option is set when the regular expression is passed
for compilation to the native function. Note that REG_UNGREEDY is not
part of the POSIX standard.

REG_UTF8

The PCRE_UTF8 option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and
all data strings used for matching it to be treated as UTF-8 strings.
Note that REG_UTF8 is not part of the POSIX standard.

In the absence of these flags, no options are passed to the native
function. This means that the regex is compiled with PCRE default
semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
It does not affect the way newlines are matched by . (they are not) or
by a negative class such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero otherwise. The
preg structure is filled in on success, and one member of the structure
is public: re_nsub contains the number of capturing subpatterns in the
regular expression. Various error codes are defined in the header file.

NOTE: If the yield of regcomp() is non-zero, you must not attempt to
use the contents of the preg structure. If, for example, you pass it to
regexec(), the result is undefined and your program is likely to crash.
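
The rule in the NOTE above can be sketched as follows. This is a
minimal illustration, not code from the PCRE distribution:
pattern_matches() is a hypothetical helper, and the sketch assumes the
system <regex.h> header so that it builds without PCRE. To use the PCRE
wrapper, include <pcreposix.h> instead and link with -lpcreposix
-lpcre.

```c
#include <regex.h>   /* assumption: system header; PCRE ships pcreposix.h */
#include <stddef.h>

/* Hypothetical helper: returns 1 on a match, 0 on no match, and -1 if
   the pattern does not compile or matching fails for another reason. */
int pattern_matches(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    /* A non-zero yield means re must not be passed to regexec(). */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);                 /* release memory tied to re */

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

For example, pattern_matches("ab+c", "xxabbbc") reports a match, while
an unbalanced pattern such as "(" is rejected at the regcomp() step and
never reaches regexec().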

MATCHING NEWLINE CHARACTERS

This area is not simple, because POSIX and Perl take different views of
things. It is not possible to get PCRE to obey POSIX semantics, but
then PCRE was never intended to be a POSIX engine. The following table
lists the different possibilities for matching newline characters in
PCRE:

                          Default   Change with

  . matches newline          no     PCRE_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
  $ matches \n in middle     no     PCRE_MULTILINE
  ^ matches \n in middle     no     PCRE_MULTILINE

This is the equivalent table for POSIX:

                          Default   Change with

  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \n at end        no     REG_NEWLINE
  $ matches \n in middle     no     REG_NEWLINE
  ^ matches \n in middle     no     REG_NEWLINE

PCRE's behaviour is the same as Perl's, except that there is no
equivalent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,
there is no way to stop newline from matching [^a].

The default POSIX newline handling can be obtained by setting
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
behave exactly as for the REG_NEWLINE action.
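
The POSIX side of this comparison can be observed with a native POSIX
regex library. The sketch below assumes the system <regex.h> (not
pcreposix.h) and a hypothetical helper match_start(); it shows the
POSIX rows for "." and "^" only. With the PCRE wrapper the results
would instead follow the PCRE table, so this is not a test of PCRE
itself.

```c
#include <regex.h>   /* assumption: the system's native POSIX regex */
#include <stddef.h>

/* Hypothetical helper: offset at which the first match starts, or -1
   if the pattern fails to compile or does not match. */
int match_start(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    regmatch_t pm[1];
    int start = -1;

    if (regcomp(&re, pattern, REG_EXTENDED | cflags) != 0)
        return -1;
    if (regexec(&re, subject, 1, pm, 0) == 0)
        start = (int)pm[0].rm_so;
    regfree(&re);
    return start;
}
```

On a conforming POSIX library, match_start("a.b", "a\nb", 0) finds a
match at offset 0 because "." matches the newline by default, while
adding REG_NEWLINE makes the same call fail. Conversely, "^b" matches
after the newline only when REG_NEWLINE is set.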

MATCHING A PATTERN

The function regexec() is called to match a compiled pattern preg
against a given string, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in eflags. These
can be:

REG_NOTBOL

The PCRE_NOTBOL option is set when calling the underlying PCRE matching
function.

REG_NOTEMPTY

The PCRE_NOTEMPTY option is set when calling the underlying PCRE
matching function. Note that REG_NOTEMPTY is not part of the POSIX
standard. However, setting this option can give more POSIX-like
behaviour in some situations.

REG_NOTEOL

The PCRE_NOTEOL option is set when calling the underlying PCRE matching
function.

REG_STARTEND

The string is considered to start at string + pmatch[0].rm_so and to
have a terminating NUL located at string + pmatch[0].rm_eo (there need
not actually be a NUL at that location), regardless of the value of
nmatch. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in
software intended to be portable to other systems. Note that a non-zero
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
of the string, not how it is matched.

If the pattern was compiled with the REG_NOSUB flag, no data about any
matched strings is returned. The nmatch and pmatch arguments of
regexec() are ignored.

If the value of nmatch is zero, or if pmatch is NULL, no data about any
matched strings is returned.

Otherwise, the portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument, which points
to an array of nmatch structures of type regmatch_t, containing the
members rm_so and rm_eo. These contain the offset to the first
character of each substring and the offset to the first character after
the end of each substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched; subsequent
elements relate to the capturing subpatterns of the regular expression.
Unused entries in the array have both structure members set to -1.

A successful match yields a zero return; various error codes are
defined in the header file, of which REG_NOMATCH is the "expected"
failure code.
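
The use of rm_so and rm_eo can be sketched as below. This is an
illustration under stated assumptions: capture_group() is a
hypothetical helper, the fixed vector size of 10 is arbitrary, and the
system <regex.h> header stands in for pcreposix.h so that the sketch
builds without PCRE.

```c
#include <regex.h>   /* assumption: system header; PCRE ships pcreposix.h */
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* arbitrary size chosen for this sketch */

/* Hypothetical helper: copies capture number `group` (0 is the whole
   match) into out and returns its length, or -1 if there is no match,
   the group is unset, or out is too small. */
int capture_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, subject, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {            /* -1 marks an unused entry */
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (n < out_size) {
            memcpy(out, subject + pm[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&re);
    return len;
}
```

With the pattern "([a-z]+):([0-9]+)" and the subject "ruby:1234", group
1 yields "ruby" and group 2 yields "1234"; asking for group 3 returns
-1 because that entry is unused.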

ERROR MESSAGES

The regerror() function maps a non-zero errorcode from either regcomp()
or regexec() to a printable message. If preg is not NULL, the error
should have arisen from the use of that structure. A message terminated
by a binary zero is placed in errbuf. The length of the message,
including the zero, is limited to errbuf_size. The yield of the
function is the size of buffer needed to hold the whole message.
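
Because the yield is the size needed for the whole message, a buffer
can be sized with a first call and filled with a second. The sketch
below is one possible way to do this, not the only one:
compile_error_message() is a hypothetical helper, and the system
<regex.h> header is assumed in place of pcreposix.h.

```c
#include <regex.h>   /* assumption: system header; PCRE ships pcreposix.h */
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd message describing why the
   pattern failed to compile, or NULL if it compiled (or if malloc
   failed). The caller frees the result. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    char *msg;
    size_t needed;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    needed = regerror(rc, &re, NULL, 0);   /* size only; nothing written */
    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &re, msg, needed);    /* now fetch the full message */
    return msg;
}
```

A fixed-size buffer also works; in that case the message is truncated
to errbuf_size, and comparing the yield with the buffer size shows
whether truncation happened.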

MEMORY USAGE

Compiling a regular expression causes memory to be allocated and
associated with the preg structure. The function regfree() frees all
such memory, after which preg may no longer be used as a compiled
expression.

AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.

REVISION

Last updated: 16 May 2010

------------------------------------------------------------------------------

PCRECPP(3) PCRECPP(3)

NAME

PCRE - Perl-compatible regular expressions.

SYNOPSIS OF C++ WRAPPER

#include <pcrecpp.h>

DESCRIPTION

The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was
constructed from the notes in the pcrecpp.h file, which should be
consulted for further details.

MATCHING INTERFACE

The "FullMatch" operation checks that supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies matched
sub-strings that match sub-patterns into them.

Example: successful match
   pcrecpp::RE re("h.*o");
   re.FullMatch("hello");

Example: unsuccessful match (requires full match):
   pcrecpp::RE re("e");
   !re.FullMatch("hello");

Example: creating a temporary RE object:
   pcrecpp::RE("h.*o").FullMatch("hello");

You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. You can, as in the different examples
above, store the RE object explicitly in a variable or use a temporary
RE object. The examples below use one mode or the other arbitrarily.
Either could correctly be used for any of these examples.

You must supply extra pointer arguments to extract matched subpieces.

Example: extracts "ruby" into "s" and 1234 into "i"
   int i;
   string s;
   pcrecpp::RE re("(\\w+):(\\d+)");
   re.FullMatch("ruby:1234", &s, &i);

Example: does not try to extract any extra sub-patterns
   re.FullMatch("ruby:1234", &s);

Example: does not try to extract into NULL
   re.FullMatch("ruby:1234", NULL, &i);

Example: integer overflow causes failure
   !re.FullMatch("ruby:1234567891234", NULL, &i);

Example: fails because there aren't enough sub-patterns:
   !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

Example: fails because string cannot be stored in integer
   !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

The provided pointer arguments can be pointers to any scalar numeric
type, or one of:

   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)

The function returns true iff all of the following conditions are
satisfied:

  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, "i"th captured sub-pattern is
     ignored.

CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):

   int number;
   pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);

The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.

NOTE: Do not use no_arg, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as
this can lead to segfaults.

QUOTING METACHARACTERS

You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string,
used as a regular expression, will exactly match the original string.

Example:
   string quoted = RE::QuoteMeta(unquoted);

Note that it's legal to escape a character even if it has no special
meaning in a regular expression -- so this function does that. (This
also makes it identical to the perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".

PARTIAL MATCHES

You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.

Example: simple search for a string:
   pcrecpp::RE("ell").PartialMatch("hello");

Example: find first number in a string:
   int number;
   pcrecpp::RE re("(\\d+)");
   re.PartialMatch("x*100 + 20", &number);
   assert(number == 100);

UTF-8 AND THE MATCHING INTERFACE

By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF8 text. For example, "." will
match one byte normally but with UTF8 set may match up to three bytes
of a multi-byte character.

Example:
   pcrecpp::RE_Options options;
   options.set_utf8();
   pcrecpp::RE re(utf8_pattern, options);
   re.FullMatch(utf8_string);

Example: using the convenience function UTF8():
   pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
   re.FullMatch(utf8_string);

NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 flag.

PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE

PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class.
Currently, the following modifiers are supported:

   modifier              description               Perl corresponding

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespaces          /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)

(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does
not capture, while (ab|cd) does.

For a full account of how each modifier works, please check the PCRE
API reference page.

For each modifier, there are two member functions whose name is made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by

   bool caseless()

which returns true if the modifier is set, and

   RE_Options & set_caseless(bool)

which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the
stack or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting. Alternatively, you can call
match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
limit how much PCRE recurses. match_limit() limits the number of
matches PCRE does; match_limit_recursion() limits the depth of internal
recursion, and therefore the amount of stack that is used.

Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...

RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are off by default. The
optional parameter option_flags is to facilitate transfer of legacy
code from C programs. This lets you do

   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);

If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go
through the pains of declaring a RE_Options object and setting several
options, there is a parallel method that gives you such ability on the
fly. You can concatenate several set_xxxxx() member functions, since
each of them returns a reference to its class object. For example, to
pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:

   RE(" ^ xyz \\s+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);

SCANNING TEXT INCREMENTALLY

The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.

Example: read lines of the form "var = value" from a string.
   string contents = ...;                 // Fill string somehow
   pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

   string var;
   int value;
   pcrecpp::RE re("(\\w+) = (\\d+)\n");
   while (re.Consume(&input, &var, &value)) {
     ...;
   }

Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

   pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)

PARSING HEX/OCTAL/C-RADIX NUMBERS

By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.

Example:
   int a, b, c, d;
   pcrecpp::RE re("(.*) (.*) (.*) (.*)");
   re.FullMatch("100 40 0100 0x40",
                pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

will leave 64 in a, b, c, and d.
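
The arithmetic behind this result can be checked with the standard C
function strtol(), whose base argument follows the same conventions
(base 0 means "honour C-style prefixes"). The sketch below is an
analogy for the conversions only; the three helper names are
hypothetical and nothing here is taken from the wrapper's source.

```c
#include <stdlib.h>

/* Hypothetical helpers mirroring the three conversions by base. */
long parse_octal(const char *s)  { return strtol(s, NULL, 8);  }
long parse_hex(const char *s)    { return strtol(s, NULL, 16); }

/* Base 0: "0" prefix is octal, "0x" prefix is hex, otherwise decimal. */
long parse_cradix(const char *s) { return strtol(s, NULL, 0);  }
```

Here parse_octal("100"), parse_hex("40"), parse_cradix("0100"), and
parse_cradix("0x40") all give 64, while parse_cradix("100") gives 100
because no prefix is present.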

REPLACING PARTS OF STRINGS

You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text. For
example:

   string s = "yabba dabba doo";
   pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

   string s = "yabba dabba doo";
   pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
the string is left unaffected.

AUTHOR

The C++ wrapper was contributed by Google Inc.

REVISION

Last updated: 17 March 2009
Minor typo fixed: 25 July 2011

------------------------------------------------------------------------------

PCRESAMPLE(3) PCRESAMPLE(3)

NAME

PCRE - Perl-compatible regular expressions

PCRE SAMPLE PROGRAM

A simple, complete demonstration program, to get you started with using
PCRE, is supplied in the file pcredemo.c in the PCRE distribution. A
listing of this program is given in the pcredemo documentation. If you
do not have a copy of the PCRE distribution, you can save this listing
to re-create pcredemo.c.

The program compiles the regular expression that is its first argument,
and matches it against the subject string in its second argument. No
PCRE options are set, and default character tables are used. If
matching succeeds, the program outputs the portion of the subject that
matched, together with the contents of any captured substrings.

If the -g option is given on the command line, the program then goes on
to check for further matches of the same regular expression in the same
subject string. The logic is a little bit tricky because of the
possibility of matching an empty string. Comments in the code explain
what is going on.

If PCRE is installed in the standard include and library directories
for your operating system, you should be able to compile the
demonstration program using this command:

   gcc -o pcredemo pcredemo.c -lpcre

If PCRE is installed elsewhere, you may need to add additional options
to the command line. For example, on a Unix-like system that has PCRE
installed in /usr/local, you can compile the demonstration program
using a command like this:

   gcc -o pcredemo -I/usr/local/include pcredemo.c \
       -L/usr/local/lib -lpcre

In a Windows environment, if you want to statically link the program
against a non-dll pcre.a file, you must uncomment the line that defines
PCRE_STATIC before including pcre.h, because otherwise the
pcre_malloc() and pcre_free() exported functions will be declared
__declspec(dllimport), with unwanted results.

Once you have compiled and linked the demonstration program, you can
run simple tests like this:

   ./pcredemo 'cat|dog' 'the cat sat on the mat'
   ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

Note that there is a much more comprehensive test program, called
pcretest, which supports many more facilities for testing regular
expressions and the PCRE library. The pcredemo program is provided as a
simple coding example.

If you try to run pcredemo when PCRE is not installed in the standard
library directory, you may get an error like this on some operating
systems (e.g. Solaris):

   ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or