PCRE(3)
NAME
PCRE - Perl-compatible regular expressions
SYNOPSIS OF PCRE API
#include <pcre.h>
pcre *pcre_compile(const char *pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
pcre_extra *pcre_study(const pcre *code, int options,
const char **errptr);
int pcre_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
int pcre_copy_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
char *buffer, int buffersize);
int pcre_copy_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, char *buffer,
int buffersize);
int pcre_get_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
const char **stringptr);
int pcre_get_stringnumber(const pcre *code,
const char *name);
int pcre_get_substring(const char *subject, int *ovector,
int stringcount, int stringnumber,
const char **stringptr);
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
void pcre_free_substring(const char *stringptr);
void pcre_free_substring_list(const char **stringptr);
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
int pcre_config(int what, void *where);
char *pcre_version(void);
void *(*pcre_malloc)(size_t);
void (*pcre_free)(void *);
void *(*pcre_stack_malloc)(size_t);
void (*pcre_stack_free)(void *);
int (*pcre_callout)(pcre_callout_block *);
PCRE API
PCRE has its own native API, which is described in this document. There
is also a set of wrapper functions that correspond to the POSIX regular
expression API. These are described in the pcreposix documentation.
The native API function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre.a, so
can be accessed by adding -lpcre to the command for linking an applica‐
tion which calls it. The header file defines the macros PCRE_MAJOR and
PCRE_MINOR to contain the major and minor release numbers for the
library. Applications can use these to include support for different
releases.
The functions pcre_compile(), pcre_study(), and pcre_exec() are used
for compiling and matching regular expressions. A sample program that
demonstrates the simplest way of using them is given in the file pcre‐
demo.c. The pcresample documentation describes how to run it.
There are convenience functions for extracting captured substrings from
a matched subject string. They are:
pcre_copy_substring()
pcre_copy_named_substring()
pcre_get_substring()
pcre_get_named_substring()
pcre_get_substring_list()
pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.
The function pcre_maketables() is used (optionally) to build a set of
character tables in the current locale for passing to pcre_compile().
The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version which returns only
some of the available information, but is retained for backwards com‐
patibility. The function pcre_version() returns a pointer to a string
containing the version of PCRE and its date of release.
The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions respec‐
tively. PCRE calls the memory management functions via these variables,
so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.
The global variables pcre_stack_malloc and pcre_stack_free are also
indirections to memory management functions. These special functions
are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls. This is a non-standard way
of building PCRE, for use in environments that have limited stacks.
Because of the greater use of memory management, it runs more slowly.
Separate functions are provided so that special-purpose external code
can be used for this case. When used, these functions are always called
in a stack-like manner (last obtained, first freed), and always for
memory blocks of the same size.
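The stack-like, same-size behaviour described above lends itself to a
simple free-list pool. The following is only a minimal sketch, not part of
PCRE: the names frame_alloc(), frame_free() and heap_calls are
hypothetical, and the code assumes that every request really is for the
same size.

```c
#include <assert.h>
#include <stdlib.h>

static void  *free_list = NULL;      /* released blocks, most recent first */
static size_t request_size = 0;      /* the single size this pool serves */
static unsigned long heap_calls = 0; /* how often malloc() was really called */

/* Hypothetical replacement for the function behind pcre_stack_malloc. */
void *frame_alloc(size_t size)
{
    void *block;
    if (request_size == 0)
        request_size = size;
    assert(size == request_size);    /* the sketch relies on same-size requests */
    if (free_list != NULL) {         /* reuse the most recently freed block */
        block = free_list;
        free_list = *(void **)block;
        return block;
    }
    heap_calls++;
    /* A free block must be able to hold the list link. */
    return malloc(size < sizeof(void *) ? sizeof(void *) : size);
}

/* Hypothetical replacement for the function behind pcre_stack_free. */
void frame_free(void *block)
{
    if (block == NULL)
        return;
    *(void **)block = free_list;     /* push onto the free list */
    free_list = block;
}

/* In a program that includes <pcre.h>, the pool would be installed before
   any other PCRE call with:
       pcre_stack_malloc = frame_alloc;
       pcre_stack_free   = frame_free;                                    */
```

Blocks are kept for reuse rather than returned to the heap, so repeated
matching does not call malloc() again once the pool has grown to the
deepest nesting seen so far.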
The global variable pcre_callout initially contains NULL. It can be set
by the caller to a "callout" function, which PCRE will then call at
specified points during a matching operation. Details are given in the
pcrecallout documentation.
MULTITHREADING
The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.
The compiled form of a regular expression is not altered during match‐
ing, so the same compiled pattern can safely be used by several threads
at once.
CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
The function pcre_config() makes it possible for a PCRE client to dis‐
cover which optional features have been compiled into the PCRE library.
The pcrebuild documentation has more details about these optional fea‐
tures.
The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The following information is
available:
PCRE_CONFIG_UTF8
The output is an integer that is set to one if UTF-8 support is avail‐
able; otherwise it is set to zero.
PCRE_CONFIG_NEWLINE
The output is an integer that is set to the value of the code that is
used for the newline character. It is either linefeed (10) or carriage
return (13), and should normally be the standard character for your
operating system.
PCRE_CONFIG_LINK_SIZE
The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
The output is an integer that gives the default limit for the number of
internal matching function calls in a pcre_exec() execution. Further
details are given with pcre_exec() below.
PCRE_CONFIG_STACKRECURSE
The output is an integer that is set to one if internal recursion is
implemented by recursive function calls that use the stack to remember
their state. This is the usual way that PCRE is compiled. The output is
zero if PCRE was compiled to use blocks of data on the heap instead of
recursive function calls. In this case, pcre_stack_malloc and
pcre_stack_free are called to manage memory blocks on the heap, thus
avoiding the use of the stack.
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options,
const char **errptr, int *erroffset,
const unsigned char *tableptr);
The function pcre_compile() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero,
and is passed in the argument pattern. A pointer to a single block of
memory that is obtained via pcre_malloc is returned. This contains the
compiled code and related data. The pcre type is defined for the
returned block; this is a typedef for a structure whose contents are
not externally defined. It is up to the caller to free the memory when
it is no longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it contains a copy of the tableptr argument,
which is an address (see below).
The options argument contains independent bits that affect the compila‐
tion. It should be zero if no options are required. Some of the
options, in particular, those that are compatible with Perl, can also
be set and unset from within the pattern (see the detailed description
of regular expressions in the pcrepattern documentation). For these
options, the contents of the options argument specifies their initial
settings at the start of compilation and execution. The PCRE_ANCHORED
option can be set at the time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes‐
sage. The offset from the start of the pattern to the character where
the error was discovered is placed in the variable pointed to by
erroffset, which must not be NULL. If it is, an immediate error is
given.
If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the default
C locale. Otherwise, tableptr must be the result of a call to
pcre_maketables(). See the section on locale support below.
This code fragment shows a typical straightforward call to pcre_com‐
pile():
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
The following option bits are defined:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
which is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_CASELESS
If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting.
PCRE_DOLLAR_ENDONLY
If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before the final character if it is a newline (but
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
ignored if PCRE_MULTILINE is set. There is no equivalent to this option
in Perl, and no way to set it within a pattern.
PCRE_DOTALL
If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded. This
option is equivalent to Perl's /s option, and it can be changed within
a pattern by a (?s) option setting. A negative class such as [^a]
always matches a newline character, independent of the setting of this
option.
PCRE_EXTENDED
If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White‐
space does not include the VT character (code 11). In addition, charac‐
ters between an unescaped # outside a character class and the next new‐
line character, inclusive, are also ignored. This is equivalent to
Perl's /x option, and it can be changed within a pattern by a (?x)
option setting.
This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.
PCRE_EXTRA
This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. There are at present no other features controlled by this
option. It can also be set by a (?X) option setting within a pattern.
PCRE_MULTILINE
By default, PCRE treats the subject string as consisting of a single
"line" of characters (even if it actually contains several newlines).
The "start of line" metacharacter (^) matches only at the start of the
string, while the "end of line" metacharacter ($) matches only at the
end of the string, or before a terminating newline (unless PCRE_DOL‐
LAR_ENDONLY is set). This is the same as Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any new‐
line in the subject string, respectively, as well as at the very start
and end. This is equivalent to Perl's /m option, and it can be changed
within a pattern by a (?m) option setting. If there are no "\n" charac‐
ters in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren‐
theses in the pattern. Any opening parenthesis that is not followed by
? behaves as if it were followed by ?: but named parentheses can still
be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.
PCRE_UNGREEDY
This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only if PCRE has been built to include UTF-8
support. If not, the use of this option provokes an error. Details of
how this option changes the behaviour of PCRE are given in the section
on UTF-8 support in the main pcre page.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. If an invalid UTF-8 sequence of bytes is found,
pcre_compile() returns an error. If you already know that your pattern
is valid, and you want to skip this check for performance reasons, you
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
passing an invalid UTF-8 string as a pattern is undefined. It may cause
your program to crash. Note that there is a similar option for sup‐
pressing the checking of subject strings passed to pcre_exec().
STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options,
const char **errptr);
When a pattern is going to be used several times, it is worth spending
more time analyzing it in order to speed up the time taken for match‐
ing. The function pcre_study() takes a pointer to a compiled pattern as
its first argument. If studying the pattern produces additional
information that will help speed up matching, pcre_study() returns a pointer
to a pcre_extra block, in which the study_data field points to the
results of the study.
The value returned by pcre_study() can be passed directly to
pcre_exec(). However, the pcre_extra block also contains other fields
that can be set by the caller before the block is passed; these are
described below. If studying the pattern does not produce any addi‐
tional information, pcre_study() returns NULL. In that circumstance, if
the calling program wants to pass some of the other fields to
pcre_exec(), it must set up its own pcre_extra block.
The second argument contains option bits. At present, no options are
defined for pcre_study(), and this argument should always be zero.
The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it points to a textual error mes‐
sage. You should therefore test the error pointer for NULL after call‐
ing pcre_study(), to be sure that it has run successfully.
This is a typical call to pcre_study():
pcre_extra *pe;
pe = pcre_study(
re, /* result of pcre_compile() */
0, /* no options exist */
&error); /* set to NULL or points to a message */
At present, studying a pattern is useful only for non-anchored patterns
that do not have a single fixed starting character. A bitmap of possi‐
ble starting characters is created.
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables. When
running in UTF-8 mode, this applies only to characters with codes less
than 256. The library contains a default set of tables that is created
in the default C locale when PCRE is compiled. This is used when the
final argument of pcre_compile() is NULL, and is sufficient for many
applications.
An alternative set of tables can, however, be supplied. Such tables are
built by calling the pcre_maketables() function, which has no argu‐
ments, in the relevant locale. The result can then be passed to
pcre_compile() as often as necessary. For example, to build and use
tables that are appropriate for the French locale (where accented char‐
acters with codes greater than 128 are treated as letters), the follow‐
ing code could be used:
setlocale(LC_CTYPE, "fr");
tables = pcre_maketables();
re = pcre_compile(..., tables);
The tables are built in memory that is obtained via pcre_malloc. The
pointer that is passed to pcre_compile is saved with the compiled pat‐
tern, and the same tables are used via this pointer by pcre_study() and
pcre_exec(). Thus, for any single pattern, compilation, studying and
matching all happen in the same locale, but different patterns can be
compiled in different locales. It is the caller's responsibility to
ensure that the memory containing the tables remains available for as
long as it is needed.
INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented below).
The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid
Here is a typical call of pcre_fullinfo(), to obtain the length of the
compiled pattern:
int rc;
size_t length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
pe, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
The possible values for the third argument are defined in pcre.h, and
are as follows:
PCRE_INFO_BACKREFMAX
Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.
PCRE_INFO_CAPTURECOUNT
Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
PCRE_INFO_FIRSTBYTE
Return information about the first byte of any matched string, for a
non-anchored pattern. (This option used to be called
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards
compatibility.)
If there is a fixed first byte, e.g. from a pattern such as
(cat|cow|coyote), it is returned in the integer pointed to by where.
Otherwise, if either
(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),
-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
PCRE_INFO_FIRSTTABLE
If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char * vari‐
able.
PCRE_INFO_LASTLITERAL
Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
PCRE supports the use of named as well as numbered capturing parenthe‐
ses. The names are just an additional way of identifying the parenthe‐
ses, which still acquire a number. A caller that wants to extract data
from a named subpattern must convert the name to a number in order to
access the correct pointers in the output vector (described with
pcre_exec() below). In order to do this, it must first use these three
values to obtain the name-to-number mapping table for the pattern.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe‐
sis, most significant byte first. The rest of the entry is the corre‐
sponding name, zero terminated. The names are in alphabetical order.
For example, consider the following pattern (assume PCRE_EXTENDED is
set, so white space - including newlines - is ignored):
(?P<date> (?P<year>(\d\d)?\d\d) -
(?P<month>\d\d) - (?P<day>\d\d) )
There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hex, and undefined bytes shown as ??:
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
When writing code to extract data from named subpatterns, remember that
the length of each entry may be different for each compiled pattern.
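The table layout described above can be walked with a few lines of code.
This is a minimal sketch under stated assumptions: the function name
lookup_group_number() is hypothetical, and the three arguments are taken
to be the values already obtained with PCRE_INFO_NAMETABLE,
PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE (the table pointer cast to
unsigned char so that the two number bytes are read without sign
problems).

```c
#include <string.h>

/* Returns the number of the capturing parenthesis called `name`,
   or -1 if the table has no such entry. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0-1: group number, most significant byte first.
           Bytes 2..: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

A linear scan is used for clarity. Because the names are stored in
alphabetical order, a binary search would also work. The library function
pcre_get_stringnumber(), listed in the synopsis, takes a compiled pattern
and a name, so a caller may not need to walk the table by hand at all.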
PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings within the pattern itself.
A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:
  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears
For such patterns, the PCRE_ANCHORED bit is set in the options returned
by pcre_fullinfo().
PCRE_INFO_SIZE
Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.
PCRE_INFO_STUDYSIZE
Returns the size of the data block pointed to by the study_data field
in a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
created by pcre_study(). The fourth argument should point to a size_t
variable.
OBSOLETE INFO FUNCTION
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern.
New programs should use pcre_fullinfo() instead. The yield of
pcre_info() is the number of capturing subpatterns, or one of the fol‐
lowing negative numbers:
  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).
If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).
MATCHING A PATTERN
int pcre_exec(const pcre *code, const pcre_extra *extra,
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a
pre-compiled pattern, which is passed in the code argument. If the pat‐
tern has been studied, the result of the study should be passed in the
extra argument.
Here is an example of a simple call to pcre_exec():
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
30); /* number of elements in the vector */
If the extra argument is not NULL, it must point to a pcre_extra data
block. The pcre_study() function returns such a block (when it doesn't
return NULL), but you can also create one for yourself, and pass addi‐
tional information in it. The fields in the block are as follows:
unsigned long int flags;
void *study_data;
unsigned long int match_limit;
void *callout_data;
The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_CALLOUT_DATA
Other flag bits should be set to zero. The study_data field is set in
the pcre_extra block that is returned by pcre_study(), together with
the appropriate flag bit. You should not set this yourself, but you can
add to the block by setting the other fields.
The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their
search trees. The classic example is the use of nested unlimited
repeats. Internally, PCRE uses a function called match() which it calls
repeatedly (sometimes recursively). The limit is imposed on the number
of times this function is called during a match, which has the effect
of limiting the amount of recursion and backtracking that can take
place. For patterns that are not anchored, the count starts from zero
for each position in the subject string.
The default limit for the library can be set when PCRE is built; the
built-in default is 10 million, which handles all but the most extreme
cases. You can reduce the default by supplying pcre_exec() with a
pcre_extra block in which match_limit is set to a smaller value, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
The callout_data field is used in conjunction with the "callout"
feature, which is described in the pcrecallout documentation.
The PCRE_ANCHORED option can be passed in the options argument, whose
unused bits must be zero. This limits pcre_exec() to matching at the
first matching position. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents,
it cannot be made unanchored at matching time.
When PCRE_UTF8 was set at compile time, the validity of the subject as
a UTF-8 string is automatically checked, and the value of startoffset
is also checked to ensure that it points to the start of a UTF-8 char‐
acter. If an invalid UTF-8 sequence of bytes is found, pcre_exec()
returns the error PCRE_ERROR_BADUTF8. If startoffset contains an
invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8 char‐
acter, is undefined. Your program may crash.
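When the automatic checks are switched off, a caller may want a cheap
test of its own for the starting offset. The sketch below is an
illustration only: is_utf8_char_start() is a hypothetical helper, not a
PCRE function. It assumes the subject as a whole is already known to be
valid UTF-8, and relies on the fact that UTF-8 continuation bytes always
have the bit pattern 10xxxxxx.

```c
/* Returns 1 if `offset` lies on a character boundary of the UTF-8 string
   subject[0..length), and 0 otherwise. Only the byte at `offset` is
   examined; the string itself is assumed to be valid UTF-8. */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                       /* outside the subject */
    if (offset == length)
        return 1;                       /* end of subject: nothing is split */
    /* A continuation byte (10xxxxxx) is never the start of a character. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

This does not replace the full validity check; it guards only against a
starting offset that falls in the middle of a multi-byte character.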
There are also three further options that can be set only at matching
time:
PCRE_NOTBOL
The first character of the string is not the beginning of a line, so
the circumflex metacharacter should not match before it. Setting this
without PCRE_MULTILINE (at compile time) causes circumflex never to
match.
PCRE_NOTEOL
The end of the string is not the end of a line, so the dollar metachar‐
acter should not match it nor (except in multiline mode) a newline
immediately before it. Setting this without PCRE_MULTILINE (at compile
time) causes dollar never to match.
PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is
set. If there are alternatives in the pattern, they are tried. If all
the alternatives match the empty string, the entire match fails. For
example, if the pattern
a?b?
is applied to a string not beginning with "a" or "b", it matches the
empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for occur‐
rences of "a" or "b".
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe‐
cial case of a pattern match of the empty string within its split()
function, and when using the /g modifier. It is possible to emulate
Perl's behaviour after matching a null string by first trying the match
again at the same offset with PCRE_NOTEMPTY set, and then if that fails
by advancing the starting offset (see below) and trying an ordinary
match again.
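The control flow of that emulation can be sketched as follows. To keep
the example self-contained, the matcher is hidden behind a hypothetical
function pointer type; in a real program the callback would wrap
pcre_exec(), passing PCRE_NOTEMPTY in the options when notempty is
non-zero. The toy matcher match_a_star(), which behaves like the pattern
a*, exists only so that the loop can be exercised, and none of these
names are part of PCRE.

```c
/* Hypothetical matcher: search subject[start..length] and, on success,
   store the start and end offsets of the match in m[0] and m[1] and
   return 1; return 0 if there is no match. A non-zero `notempty` means
   that an empty match is not acceptable. */
typedef int (*matcher)(const char *subject, int length, int start,
                       int notempty, int m[2]);

/* Counts all matches in the subject, handling empty matches as the
   text describes. */
int count_all_matches(matcher match, const char *subject, int length)
{
    int m[2], count = 0, start = 0, notempty = 0;
    while (start <= length) {
        if (match(subject, length, start, notempty, m)) {
            count++;
            start = m[1];
            /* After an empty match, retry at the same offset, but
               refuse another empty match. */
            notempty = (m[0] == m[1]);
        } else if (notempty) {
            /* The retry failed: advance and try an ordinary match. */
            notempty = 0;
            start++;
        } else {
            break;                      /* ordinary match failed: done */
        }
    }
    return count;
}

/* Toy stand-in for a compiled "a*" pattern, unanchored. */
int match_a_star(const char *s, int length, int start, int notempty,
                 int m[2])
{
    int p, e;
    for (p = start; p <= length; p++) {
        e = p;
        while (e < length && s[e] == 'a')
            e++;
        if (e > p || !notempty) {
            m[0] = p;
            m[1] = e;
            return 1;
        }
    }
    return 0;
}
```

For the subject "baac" the loop reports four matches: the empty string
at offset 0, "aa" at offset 1, and empty strings at offsets 3 and 4.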
The subject string is passed to pcre_exec() as a pointer in subject, a
length in length, and a starting byte offset in startoffset. Unlike the
pattern string, the subject may contain binary zero bytes. When the
starting offset is zero, the search for a match starts at the beginning
of the subject, and this is by far the most common case.
If the pattern was compiled with the PCRE_UTF8 option, the subject must
be a sequence of bytes that is a valid UTF-8 string, and the starting
offset must point to the beginning of a UTF-8 character. If an invalid
UTF-8 string or offset is passed, an error (either PCRE_ERROR_BADUTF8
or PCRE_ERROR_BADUTF8_OFFSET) is returned, unless the option
PCRE_NO_UTF8_CHECK is set, in which case PCRE's behaviour is not
defined.
A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous suc‐
cess. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.
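The reasoning about \B can be checked with a small stand-alone test of
the word-boundary rule. The following is a sketch only: the helper names
are invented for illustration, and "word character" is assumed to mean
an ASCII letter, digit, or underscore. It does not call PCRE.

```c
#include <ctype.h>

/* A word character for the purposes of this sketch: letter, digit or
   underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Return 1 if \B would succeed at byte offset pos of a subject of the
   given length, that is, if pos is not a word boundary. Positions
   before the start and after the end of the subject are treated as
   non-word characters. */
int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

With the whole subject "Mississipi" and pos set to 4, the character
before the position is a letter, so the test succeeds. With the
shortened subject "issipi" and pos set to 0, nothing precedes the
position, so it fails.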
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is tried. This can only
succeed if the pattern does not require the match to be at the start of
the subject.
In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a
substring. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integer
offsets whose address is passed in ovector. The number of elements in
the vector is passed in ovecsize. The first two-thirds of the vector is
used to pass back captured substrings, each substring using a pair of
integers. The remaining third of the vector is used as workspace by
pcre_exec() while matching capturing subpatterns, and is not available
for passing back information. The length passed in ovecsize should
always be a multiple of three. If it is not, it is rounded down.
When a match has been successful, information about captured substrings
is returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of a pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character
after the end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string matched by the
entire pattern. The next pair is used for the first capturing
subpattern, and so on. The value returned by pcre_exec() is the number
of pairs that have been set. If there are no capturing subpatterns, the
return value from a successful match is 1, indicating that just the
first pair of offsets has been set.
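Because each pair holds a start offset and an end offset, the position
and length of substring n follow directly from ovector[2*n] and
ovector[2*n+1]. A minimal sketch, in which capture_span() is a
hypothetical helper and not part of the PCRE API:

```c
/* Return a pointer to the first byte of substring number n and store
   its length in *len. The caller must ensure that n is less than the
   number of pairs that were set and that the pair is not unset. */
const char *capture_span(const char *subject, const int *ovector,
                         int n, int *len)
{
    *len = ovector[2 * n + 1] - ovector[2 * n];
    return subject + ovector[2 * n];
}
```

For instance, if a pattern with two capturing subpatterns matched
"555-1234" within the subject "tel 555-1234" and left the offsets
4, 12, 4, 7, 8, 12 in ovector (values written by hand for this
illustration), then capture_span() for n equal to 2 yields a pointer to
"1234" and a length of 4. The result is not zero-terminated.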
Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described in the following
section.
It is possible for capturing subpattern number n+1 to match some
part of the subject when subpattern n has not been used at all. For
example, if the string "abc" is matched against the pattern (a|(z))(bc),
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both
offset values corresponding to the unused subpattern are set to -1.
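A caller can test for an unused subpattern before using its offsets. In
this sketch capture_is_set() is a hypothetical helper, and rc is assumed
to be a positive value returned by pcre_exec():

```c
/* Return 1 if substring number n was set by the match, 0 otherwise.
   rc is the positive return value of the matching call, that is, the
   number of offset pairs that have been set. */
int capture_is_set(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;               /* no pair was set for this number */
    return ovector[2 * n] >= 0; /* unset pairs hold -1 */
}
```

For the example above, the pairs are (0,3), (0,1), (-1,-1), and (1,3),
so the helper reports subpatterns 1 and 3 as set and subpattern 2 as
unset.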
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that gets returned.
If the vector is too small to hold all the captured substrings, it is
used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. In particular, if the substring
offsets are not of interest, pcre_exec() may be called with ovector
passed as NULL and ovecsize as zero. However, if the pattern contains
back references and the ovector isn't big enough to remember the
related substrings, PCRE has to get additional memory for use during
matching. Thus it is usually advisable to supply an ovector.
Note that pcre_info() can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
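The sizing arithmetic can be written down as two one-line helpers. Both
names are invented for this sketch; neither is part of the PCRE API.

```c
/* Smallest ovector size (in ints) that can return the whole match
   plus capture_count captured substrings. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be returned in a vector of ovecsize
   ints. A size that is not a multiple of three is rounded down, and
   only two-thirds of the vector carries results. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

A pattern with two capturing subpatterns therefore needs at least 9
elements, and a vector of 10 elements is no more useful than one of 9.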
If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:
PCRE_ERROR_NOMATCH (-1)
The subject string did not match the pattern.
PCRE_ERROR_NULL (-2)
Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
An unrecognized bit was set in the options argument.
PCRE_ERROR_BADMAGIC (-4)
PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer. This is the error
it gives when the magic number isn't present.
PCRE_ERROR_UNKNOWN_NODE (-5)
While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is freed at the end of matching.
PCRE_ERROR_NOSUBSTRING (-7)
This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
The recursion and backtracking limit, as specified by the match_limit
field in a pcre_extra structure (or defaulted), was reached. See the
description above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10)
A string that contains an invalid UTF-8 byte sequence was passed as a
subject.
PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.
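For diagnostics it can be convenient to turn these codes into text. The
following sketch uses the numeric values listed above so that it stands
alone; exec_error_text() and its message strings are invented for
illustration, and a real program would use the symbolic names from
pcre.h instead of the literals.

```c
/* Map a negative return value to a short description. The numbers are
   those given in the list above. */
const char *exec_error_text(int rc)
{
    switch (rc) {
    case -1:  return "subject did not match";     /* PCRE_ERROR_NOMATCH */
    case -2:  return "NULL argument";             /* PCRE_ERROR_NULL */
    case -3:  return "unrecognized option bit";   /* PCRE_ERROR_BADOPTION */
    case -4:  return "bad magic number";          /* PCRE_ERROR_BADMAGIC */
    case -5:  return "unknown item in pattern";   /* PCRE_ERROR_UNKNOWN_NODE */
    case -6:  return "out of memory";             /* PCRE_ERROR_NOMEMORY */
    case -7:  return "no such substring";         /* PCRE_ERROR_NOSUBSTRING */
    case -8:  return "match limit reached";       /* PCRE_ERROR_MATCHLIMIT */
    case -9:  return "callout error";             /* PCRE_ERROR_CALLOUT */
    case -10: return "invalid UTF-8 in subject";  /* PCRE_ERROR_BADUTF8 */
    case -11: return "offset inside a UTF-8 character";
                                           /* PCRE_ERROR_BADUTF8_OFFSET */
    default:  return rc >= 0 ? "success" : "unrecognized error";
    }
}
```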
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_copy_substring(const char *subject, int *ovector,
int stringcount, int stringnumber, char *buffer,
int buffersize);
int pcre_get_substring(const char *subject, int *ovector,
int stringcount, int stringnumber,
const char **stringptr);
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings
by number. The next section describes functions for extracting named
substrings. A substring that contains a binary zero is correctly
extracted and has a further zero added on the end, but the result is
not, of course, a C string.
The first three arguments are the same for all three of these
functions: subject is the subject string which has just been
successfully matched, ovector is a pointer to the vector of integer
offsets that was passed to pcre_exec(), and stringcount is the number of
substrings that were captured by the match, including the substring that
matched the entire regular expression. This is the value returned by
pcre_exec() if it is greater than zero. If pcre_exec() returned zero,
indicating that it ran out of space in ovector, the value passed as
stringcount should be the size of the vector divided by three.
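The rule for choosing stringcount can be captured in a small helper. The
name stringcount_for() is invented for this sketch, and returning zero
for a negative result is a choice made here, not library behaviour.

```c
/* Derive the stringcount argument from the value returned by
   pcre_exec() and the size of the vector that was passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was too small: all pairs used */
    return 0;                   /* no match or error: nothing to extract */
}
```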
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr.
The yield of the function is the length of the string, not including
the terminating zero, or one of
PCRE_ERROR_NOMEMORY (-6)
The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
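To make the documented behaviour concrete, here is a stand-alone
function that follows the same contract as described for
pcre_copy_substring(). It is an illustrative re-implementation written
for this page, not the library's source; copy_capture() is an invented
name, and the error values are the numeric codes listed above.

```c
#include <string.h>

/* Copy substring number stringnumber into buffer as a zero-terminated
   string. Returns the length copied, -7 if there is no such substring,
   or -6 if the buffer is too small. */
int copy_capture(const char *subject, const int *ovector,
                 int stringcount, int stringnumber,
                 char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;                      /* PCRE_ERROR_NOSUBSTRING */

    start = ovector[2 * stringnumber];
    /* An unset substring (offsets of -1) yields an empty string. */
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;

    if (buffersize < length + 1)
        return -6;                      /* PCRE_ERROR_NOMEMORY */

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Note that the buffer must have room for the terminating zero as well as
the substring itself, so a three-byte match needs a buffer of at least
four bytes.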
The pcre_get_substring_list() function extracts all available
substrings and builds a list of pointers to them. All this is done in a
single block of memory which is obtained via pcre_malloc. The address
of the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, it returns an
empty string. This can be distinguished from a genuine zero-length
substring by inspecting the appropriate offset in ovector, which is
negative for unset substrings.
The two convenience functions pcre_free_substring() and
pcre_free_substring_list() can be used to free the memory returned by a
previous call of pcre_get_substring() or pcre_get_substring_list(),
respectively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a
special interface to another programming language which cannot use
pcre_free directly; it is for these cases that the functions are
provided.
EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre_copy_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
char *buffer, int buffersize);
int pcre_get_stringnumber(const pcre *code,
const char *name);
int pcre_get_named_substring(const pcre *code,
const char *subject, int *ovector,
int stringcount, const char *stringname,
const char **stringptr);
To extract a substring by name, you first have to find the associated
number. This can be done by calling pcre_get_stringnumber(). The first
argument is the compiled pattern, and the second is the name. For
example, for this pattern
ab(?<xxx>\d+)...
the number of the subpattern called "xxx" is 1. Given the number, you
can then extract the substring directly, or use one of the functions
described in the previous section. For convenience, there are also two
functions that do the whole job.
Most of the arguments of pcre_copy_named_substring() and
pcre_get_named_substring() are the same as those for the functions that
extract by number, and so are not re-described here. There are just two
differences.
First, instead of a substring number, a substring name is given.
Second, there is an extra argument, given at the start, which is a
pointer to the compiled pattern. This is needed in order to gain access
to the name-to-number translation table.
These functions call pcre_get_stringnumber(), and if it succeeds, they
then call pcre_copy_substring() or pcre_get_substring(), as
appropriate.
Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
PCRE(3)