New text.regex package

  Status: implemented,  Mon May 12 14:53:11 2008
  • Created: 2008-05-01 12:25:10+0200
  • Categories: text
  • Author: friebe

Scope of Change

A new package text.regex will be added.


Rationale

Object oriented API for regular expressions.

Functionality

The entry point class is text.regex.Pattern which is a wrapper around the preg_*() functions in PHP.

Testing whether a pattern matches

The most common use-case is to test whether a given pattern matches.

  // Current
if (preg_match('/([w]{3}\.)?example\.(com|net|org)/', $string)) {
...
}

// New
if (Pattern::compile('([w]{3}\.)?example\.(com|net|org)')->matches($string)) {
...
}

The problem with the preg_match() approach is that it will return FALSE if the pattern is malformed (and raise a warning) - this is something that can lead to long debugging / wtf?! sessions. The Pattern class will throw an exception.

Retrieving matched text

To match parts out of a string:

  // Current
preg_match('/(([w]{3})\.)?example\.(com|net|org)/', $string, $matches);
Console::writeLine
($matches);

// New
$match= Pattern::compile('(([w]{3})\.)?example\.(com|net|org)')->match($string);
Console::writeLine
($match->group(0));

The results in both cases is a string-array with the contents [ "www.example.com", "www.", "www", "com" ].

Working with string objects

The text.regex pattern supports the lang.types.String object built-in:

  $string= new String('RFC #0001');
$num= Pattern::compile('RFC #([0-9]+)')->match($string)->group(1); // "0001"

Modifiers

Instead of embedding the modifiers in the pattern string, they need to be passed to the Pattern class' compile() method as bitfield:

  // Current
$ok= preg_match('/[a-z0-9_]+/i', $username);

// New
$ok= Pattern::compile('[a-z0-9_]+', Pattern::CASE_INSENSITIVE)->matches($username);

Further modifiers are:

  Constant name    Modifier
  ================ ========
  CASE_INSENSITIVE i
  MULTILINE        m
  DOTALL           s
  EXTENDED         x
  ANCHORED         A
  DOLLAR_ENDONLY   D
  ANALYSIS         S
  UNGREEDY         U
  UTF8             u

This is more verbose but easier to read.

Security considerations

None.

Speed impact

Slightly slower than procedural approach.

Dependencies

PCRE extension (enabled by default)

Related documents



Comments



<EOF>


Table of contents