pcre man page on FreeBSD

PCRE(3)								       PCRE(3)

NAME
       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The  PCRE  library is a set of functions that implement regular expres‐
       sion pattern matching using the same syntax and semantics as Perl, with
       just a few differences. Some features that appeared in Python and PCRE
       before they appeared in Perl are also available using the Python
       syntax. There is also some support for one or two .NET and Oniguruma
       syntax items, and there is an option for requesting some minor changes
       that give better JavaScript compatibility.

       The  current implementation of PCRE corresponds approximately with Perl
       5.12, including support for UTF-8 encoded strings and  Unicode  general
       category	 properties.  However,	UTF-8  and  Unicode  support has to be
       explicitly enabled; it is not the default. The  Unicode	tables	corre‐
       spond to Unicode release 5.2.0.

       In  addition to the Perl-compatible matching function, PCRE contains an
       alternative function that matches the same compiled patterns in a  dif‐
       ferent way. In certain circumstances, the alternative function has some
       advantages.  For a discussion of the two matching algorithms,  see  the
       pcrematching page.

       PCRE  is	 written  in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds.  In  particular,
       Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
       included as part of the PCRE distribution. The pcrecpp page has details
       of  this	 interface.  Other  people's contributions can be found in the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are  and  are
       not supported by PCRE are given in separate documents. See the pcrepat‐
       tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
       page.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible	for  a
       client  to  discover  which  features are available. The features them‐
       selves are described in the pcrebuild page. Documentation about	build‐
       ing  PCRE  for various operating systems can be found in the README and
       NON-UNIX-USE files in the source distribution.

       The library contains a number of undocumented  internal	functions  and
       data  tables  that  are	used by more than one of the exported external
       functions, but which are not intended  for  use	by  external  callers.
       Their  names  all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external	 symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.

USER DOCUMENTATION

       The user documentation for PCRE comprises a number  of  different  sec‐
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index	 page.
       In  the	plain  text format, all the sections, except the pcredemo sec‐
       tion, are concatenated, for ease of searching. The sections are as fol‐
       lows:

	 pcre		   this document
	 pcre-config	   show PCRE installation configuration information
	 pcreapi	   details of PCRE's native C API
	 pcrebuild	   options for building PCRE
	 pcrecallout	   details of the callout feature
	 pcrecompat	   discussion of Perl compatibility
	 pcrecpp	   details of the C++ wrapper
	 pcredemo	   a demonstration C program that uses PCRE
	 pcregrep	   description of the pcregrep command
	 pcrematching	   discussion of the two matching algorithms
	 pcrepartial	   details of the partial matching facility
	 pcrepattern	   syntax and semantics of supported
			     regular expressions
	 pcreperform	   discussion of performance issues
	 pcreposix	   the POSIX-compatible C API
	 pcreprecompile	   details of saving and re-using precompiled patterns
	 pcresample	   discussion of the pcredemo program
	 pcrestack	   discussion of stack usage
	 pcresyntax	   quick syntax reference
	 pcretest	   description of the pcretest testing command

       In  addition,  in the "man" and HTML formats, there is a short page for
       each C library function, listing its arguments and results.

LIMITATIONS

       PCRE has some size limitations, but they are unlikely to be relevant in
       practice.

       The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process	regular	 expressions  that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
       the  source  distribution and the pcrebuild documentation for details).
       In these cases the limit is substantially larger.  However,  the	 speed
       of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       The maximum length of a name for a named subpattern is 32 characters,
       and the maximum number of named subpatterns is 10000.

       The  maximum  length of a subject string is the largest positive number
       that an integer variable can hold. However, when using the  traditional
       matching function, PCRE uses recursion to handle subpatterns and indef‐
       inite repetition.  This means that the available stack space may	 limit
       the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT

       From release 3.3, PCRE has  had	some  support  for  character  strings
       encoded	in the UTF-8 format. For release 4.0 this was greatly extended
       to cover most common requirements, and in release 5.0  additional  sup‐
       port for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support in the code, and, in addition,  you  must  call	pcre_compile()
       with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
       sequence (*UTF8). When either of these is the case,  both  the  pattern
       and  any	 subject  strings  that	 are matched against it are treated as
       UTF-8 strings instead of strings of 1-byte characters.

       If you compile PCRE with UTF-8 support, but do not use it at run	 time,
       the  library will be a bit bigger, but the additional run time overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8 support), the escape sequences \p{..}, \P{..}, and	 \X  are  sup‐
       ported.	The available properties that can be tested are limited to the
       general category properties such as Lu for an upper case letter	or  Nd
       for  a  decimal number, the Unicode script names such as Arabic or Han,
       and the derived properties Any and L&. A full  list  is	given  in  the
       pcrepattern documentation. Only the short names for properties are sup‐
       ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let‐
       ter},  is  not  supported.   Furthermore,  in Perl, many properties may
       optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
       does not support this.

   Validity of UTF-8 strings

       When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and
       subjects are (by default) checked for validity on entry to the relevant
       functions. From release 7.3 of PCRE, the check is according to the rules
       of RFC 3629, which are themselves derived from the  Unicode  specifica‐
       tion.  Earlier  releases	 of PCRE followed the rules of RFC 2279, which
       allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current
       check allows only values in the range U+0 to U+10FFFF, excluding U+D800
       to U+DFFF.

       The excluded code points are the surrogate areas of Unicode (high
       surrogates U+D800 to U+DBFF, low surrogates U+DC00 to U+DFFF). Of the
       latter, the Unicode Standard says this: "The Low Surrogate Area does not
       contain any  character  assignments,  consequently  no  character  code
       charts or namelists are provided for this area. Surrogates are reserved
       for use with UTF-16 and then must be used in pairs."  The  code	points
       that  are  encoded  by  UTF-16  pairs are available as independent code
       points in the UTF-8 encoding. (In  other	 words,	 the  whole  surrogate
       thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
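
       The range rule and the role of the surrogates can be sketched in a few
       lines of C. This is an illustration only, not code taken from PCRE; the
       names is_rfc3629_code_point() and utf16_surrogate_pair() are
       hypothetical. The first function applies the range test described
       above. The second shows how UTF-16 represents a code point above U+FFFF
       as a pair of surrogates, which is why those values are reserved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if cp passes the RFC 3629 range rule,
   that is, U+0 to U+10FFFF excluding U+D800 to U+DFFF. */
bool is_rfc3629_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF) return false;
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    return true;
}

/* Hypothetical helper: split a code point in U+10000..U+10FFFF into
   the UTF-16 surrogate pair that represents it. Returns false, and
   writes nothing, when cp is outside that range. */
bool utf16_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF) return false;
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
    return true;
}
```

       For example, utf16_surrogate_pair(0x1F600, &high, &low) yields the pair
       0xD83D, 0xDE00, whereas UTF-8 encodes U+1F600 directly as one four-byte
       character.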

       If  an  invalid	UTF-8  string  is  passed  to  PCRE,  an  error return
       (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
       that your strings are valid, and therefore want to skip these checks in
       order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
       compile	time  or at run time, PCRE assumes that the pattern or subject
       it is given (respectively) contains only valid  UTF-8  codes.  In  this
       case, it does not diagnose an invalid UTF-8 string.

       If  you	pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
       what happens depends on why the string is invalid. If the  string  con‐
       forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
       string of characters in the range 0  to	0x7FFFFFFF.  In	 other	words,
       apart from the initial validity test, PCRE (when in UTF-8 mode) handles
       strings according to the more liberal rules of RFC  2279.  However,  if
       the  string does not even conform to RFC 2279, the result is undefined.
       Your program may crash.

       If you want to process strings  of  values  in  the  full  range	 0  to
       0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
       this situation, you will have to apply your own validity check.
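
       A caller-side validity check of the kind mentioned above might look
       like the following sketch. It is not PCRE's own checking code, and the
       names decode_rfc2279() and utf8_is_valid() are hypothetical. The
       decoder accepts the old 1 to 6 byte form; rejecting overlong encodings
       is a choice made by this sketch. The strict flag adds the RFC 3629
       range rule on top.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character in the old RFC 2279 form
   (1 to 6 bytes, values 0 to 0x7FFFFFFF). Returns the number of bytes
   consumed, or 0 if the input is malformed, truncated, or overlong. */
size_t decode_rfc2279(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest value that legitimately needs n bytes, indexed by n. */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t v;

    if (len == 0) return 0;
    if (s[0] < 0x80)      { n = 1; v = s[0]; }
    else if (s[0] < 0xC0) return 0;          /* stray continuation byte */
    else if (s[0] < 0xE0) { n = 2; v = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { n = 3; v = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { n = 4; v = s[0] & 0x07; }
    else if (s[0] < 0xFC) { n = 5; v = s[0] & 0x03; }
    else if (s[0] < 0xFE) { n = 6; v = s[0] & 0x01; }
    else return 0;                           /* 0xFE and 0xFF never occur */

    if (n > len) return 0;                   /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_value[n]) return 0;          /* overlong encoding */
    *cp = v;
    return n;
}

/* Hypothetical helper: check a whole buffer. With strict != 0 the
   RFC 3629 range rule is also applied; with strict == 0 only the
   RFC 2279 form is required. Returns 1 if valid, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len, int strict)
{
    while (len > 0) {
        uint32_t cp;
        size_t n = decode_rfc2279(s, len, &cp);
        if (n == 0) return 0;
        if (strict && (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            return 0;
        s += n;
        len -= n;
    }
    return 1;
}
```

       A program that sets PCRE_NO_UTF8_CHECK in order to use the full 31-bit
       range could run a non-strict check of this kind before calling the
       matching functions, so that strings which do not conform even to RFC
       2279 never reach the library.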

   General comments about UTF-8 mode

       1.  An  unbraced	 hexadecimal  escape sequence (such as \xb3) matches a
       two-byte UTF-8 character if the value is greater than 127.

       2. Octal numbers up to \777 are recognized, and	match  two-byte	 UTF-8
       characters for values greater than \177.

       3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi‐
       vidual bytes, for example: \x{100}{3}.

       4. The dot metacharacter matches one UTF-8 character instead of a  sin‐
       gle byte.

       5.  The	escape sequence \C can be used to match a single byte in UTF-8
       mode, but its use can lead to some strange effects.  This  facility  is
       not available in the alternative matching function, pcre_dfa_exec().

       6.  The	character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but, by default, the characters that
       PCRE  recognizes	 as digits, spaces, or word characters remain the same
       set as before, all with values less than 256. This  remains  true  even
       when  PCRE  is built to include Unicode property support, because to do
       otherwise would slow down PCRE in many common cases. Note in particular
       that this applies to \b and \B, because they are defined in terms of \w
       and \W. If you really want to test for a wider sense of, say,  "digit",
       you  can	 use  explicit Unicode property tests such as \p{Nd}. Alterna‐
       tively, if you set the PCRE_UCP option,	the  way  that	the  character
       escapes	work  is changed so that Unicode properties are used to deter‐
       mine which characters match. There are more details in the  section  on
       generic character types in the pcrepattern documentation.

       7.  Similarly,  characters that match the POSIX named character classes
       are all low-valued characters, unless the PCRE_UCP option is set.

       8. However, the horizontal and  vertical	 whitespace  matching  escapes
       (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
       whether or not PCRE_UCP is set.

       9. Case-insensitive matching applies only to  characters	 whose	values
       are  less than 128, unless PCRE is built with Unicode property support.
       Even when Unicode property support is available, PCRE  still  uses  its
       own  character  tables when checking the case of low-valued characters,
       so as not to degrade performance.  The Unicode property information  is
       used only for characters with higher values. Furthermore, PCRE supports
       case-insensitive matching only  when  there  is	a  one-to-one  mapping
       between	a letter's cases. There are a small number of many-to-one map‐
       pings in Unicode; these are not supported by PCRE.
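
       Items 1 and 2 above follow from the way UTF-8 encodes values: anything
       from 128 up to 0x7FF occupies two bytes. The sketch below is a plain
       UTF-8 encoder written for illustration, not part of the PCRE API; the
       name utf8_encode() is hypothetical. It shows the byte sequences that
       escapes such as \xb3, \777, and \x{100} stand for in UTF-8 mode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode cp as UTF-8 into out. Returns the number
   of bytes written (1 to 4), or 0 if cp is above U+10FFFF or is a
   surrogate (U+D800 to U+DFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {                         /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* up to 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* up to 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));   /* four bytes */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

       With this encoder, the value 0xb3 becomes the two bytes C2 B3, octal
       777 (0x1FF) becomes C7 BF, and 0x100 becomes C4 80, while values up to
       0x7F remain single bytes.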

AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam  magnet,
       so  I've	 taken	it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.

REVISION

       Last updated: 13 November 2010
       Copyright (c) 1997-2010 University of Cambridge.

								       PCRE(3)