A Java Package for Regular Expressions |
Djun Kim, Mathematics, The University of British Columbia |
Introduction The following problem is commonly encountered by programmers and computer users:
For example we might wish to specify all files in a directory which end with the four characters ".htm". Many operating systems allow us to specify this set of filenames via the expression "*.htm". Or, we may be interested in finding all occurences in a text file of "search", "searching", "searched", "searches" but not, for example, "research". Programmers generally use text editors which allow such a search using an expression such as "[^a-z]search.*". Such search specifiers are examples of regular expressions. Problems involving regular expression searches involve finding all strings of a particular, ``easily described''' format, such as dates, filenames, variable names, formatting commands for typesetting programs. However, not everything which we might wish to search for can be specified as a regular expression. For example, we might reasonably want to pick out all examples of mathematical expressions in a file containing the text of an article. Or we might want to find all examples of verbs. Each of these examples attempts to specify string-sets whose structure is too complicated to describe compactly. What is a Regular Expression? Let us define precisely what a regular expression is. First, we need to specify A, the alphabet from which the strings we will be searching through are composed. The closure of A, denote by A*, is the set of all strings (that is, n-tuples) of characters from A. The 0-tuple, or null string, will be denoted by ¥. The reader is warned at this point that one of the potential obstructions to understanding regular expressions arises in confusing an object with the symbols used to describe it. A language over A is a subset of A*. Given strings x of length n and y of length m from some language L we can append y to x to form a string of length n+m, denoted by xy, and called the concatenation of x and y. For example, if x = fox and y = trot then xy = foxtrot. The notation for contatenation is meant to suggest that it is a kind of product. Analogy suggests that we define exponentiation as the iterated concatenation of a string with itself:
Let L and M be given languages over a fixed alphabet. We can construct new languages from L and M by various operations. For example, let L U M denote the union of the sets L and M; thus a string is in L U M if it is in either L or M. We can define the concatenation LM of L and M to be the language
Similarly Lk = LL ··· L where there are k factors of L in the right hand side. By convention, L0 denotes the set containing ¥, the null string. The Kleene closure of a language L, denoted by L*, is the union over all i = 0, 1, ... of the languages Li; Similiarly, L+, the positive closure of L, is the union over all i = 1, 2, ... of the languages Li. Note L+ does not contain L0. We shall use parentheses to group operations on sets: for example,
If r is a regular expression over the alphabet A, the symbol L(r) denotes the language over A defined by r. Languages defined by regular expressions are called regular languages. The regular expressions over A form a language R over (A U M) where M is the following set of metacharacters (assumed to be disjoint from A):
As an aside, it turns out that R is not a regular language over (A U M)! We now define the language R of regular expressions to be the smallest language over A U M satisfying the following four conditions:
The first two rules specify basic elements; the third rule allows us to parenthesize regular expressions, and the fourth provides us with an inductive construction. For convenience of notation, we'll introduce a few conventions. First, we'll establish an operator precedence which will allow us to dispense with unnecessary parentheses: * will bind more tightly than (the left associative operation of) concatentation, which in turn has higher precedence than the left-associative operation of alternation (``|''). For example, the expression (a|b*) | ((c*)(b)) would be written equivalently as a|b* | c*b |
Return to the Living Mathematics Project homepage. |