One of my favorite features in the new Java 7, aside from the try-with-resources statement, is named capturing groups in the regular-expression API.

Although captured groups can be referenced numerically, in the order in which they are declared from left to right, named capturing makes this far more intuitive, as I will demonstrate by constructing a lexer.
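As a quick taste of the feature, here is a named group in action (a standalone sketch; the class name and the date pattern are my own illustration, not part of the lexer we will build):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
  public static void main(String[] args) {
    // (?<name>...) declares a named capturing group (Java 7+)
    Pattern date = Pattern.compile("(?<year>[0-9]{4})-(?<month>[0-9]{2})");
    Matcher m = date.matcher("2012-07");
    if (m.matches()) {
      System.out.println(m.group("year"));   // equivalent to m.group(1)
      System.out.println(m.group("month"));  // equivalent to m.group(2)
    }
  }
}
```

Note that group names must start with a letter and may contain only letters and digits.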

Lexer, a Definition

To describe lexers, we must first describe tokenizers. A tokenizer simply breaks a string up into a set of tokens, which are, of course, more strings. A lexer, in turn, is a tokenizer that attaches context to each token, such as the type of token extracted: e.g. (NUMBER 1234), where the bare token would be just 1234. Lexers are important for parsing languages; however, that is a discussion beyond the scope of this tutorial.

For example, given the input "11 + 22 - 33", we should receive the following tokens from a lexer:

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)

Note that BINARYOP refers to a binary operator: any operator that accepts two arguments. The archetypal example is addition, which accepts two numbers, one to the left and one to the right of the operator.

Setting Up the Program

The input is a sentence to be scanned. For this tutorial, we will scan a simple arithmetic grammar that includes addition, multiplication and subtraction. Consequently, we will lex the following input:

public class Lexer {
  public static void main(String[] args) {
    String input = "11 + 22 - 33";
  }
}

Next, we must define the types of the tokens that we are extracting and the regular expression that they match.

Number
`-?[0-9]+` Matches any integer, optionally preceded by a minus sign; no decimals.
Binary Operator
`[*/+-]` Matches the standard arithmetic operators. (Inside a character class, | is matched literally rather than acting as alternation, so we list only the operators themselves.)
Whitespace
`[ \t\f\r\n]+` Matches a sequence of spaces, tabs, form feeds, carriage returns or newlines. These tokens will be skipped.
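Before wiring these patterns into the program, we can spot-check each one with String.matches(), which anchors the pattern against the entire string (a standalone sketch; the sample lexemes are my own):

```java
public class PatternCheck {
  public static void main(String[] args) {
    // Each pattern against a lexeme it should (or should not) accept
    System.out.println("-42".matches("-?[0-9]+"));     // integer with a sign
    System.out.println("+".matches("[*/+-]"));         // a binary operator
    System.out.println(" \t".matches("[ \t\f\r\n]+")); // space followed by tab
    System.out.println("x".matches("-?[0-9]+"));       // not a number
  }
}
```

With the patterns verified, we can embed them in an enum as shown next.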
public class Lexer {
  public static enum TokenType {
    // Token types cannot have underscores: named-group names may contain only letters and digits
    NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

    public final String pattern;

    private TokenType(String pattern) {
      this.pattern = pattern;
    }
  }

  public static void main(String[] args) {
    String input = "11 + 22 - 33";
  }
}

Enumerations in Java can only have private constructors, because the set of instances is fixed when the class is loaded. Consequently, their data fields are frequently declared final, since each constant's state never changes after construction.
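To make that concrete, here is the pattern field in isolation, trimmed down to a single constant (a standalone sketch):

```java
public class EnumFieldDemo {
  enum TokenType {
    NUMBER("-?[0-9]+");

    public final String pattern; // final: assigned exactly once, in the constructor

    private TokenType(String pattern) { // enum constructors are implicitly private
      this.pattern = pattern;
    }
  }

  public static void main(String[] args) {
    // Each constant carries its regular expression as data
    System.out.println(TokenType.NUMBER.pattern);
  }
}
```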

Finally, we declare a data structure for holding the token data. I will also override the toString() method to print each token's contextual data at the end of this tutorial, in the format mentioned earlier: (<TYPE> <DATA>).

public class Lexer {
  public static enum TokenType {
    // Token types cannot have underscores: named-group names may contain only letters and digits
    NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

    public final String pattern;

    private TokenType(String pattern) {
      this.pattern = pattern;
    }
  }

  public static class Token {
    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
      this.type = type;
      this.data = data;
    }

    @Override
    public String toString() {
      return String.format("(%s %s)", type.name(), data);
    }
  }

  public static void main(String[] args) {
    String input = "11 + 22 - 33";
  }
}

Now that we have our input, token types and data structure for tokens, we may begin lexical analysis of the input string into a set of tokens with its corresponding token type.

Lexical Analysis with Regular Expressions

We begin by framing our lexical analysis method as lex(), a function which returns a list of Token objects. Additionally, we will need to import ArrayList in order to store the Token objects into the list.

import java.util.ArrayList;

public class Lexer {
  public static enum TokenType {
    // Token types cannot have underscores: named-group names may contain only letters and digits
    NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

    public final String pattern;

    private TokenType(String pattern) {
      this.pattern = pattern;
    }
  }

  public static class Token {
    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
      this.type = type;
      this.data = data;
    }

    @Override
    public String toString() {
      return String.format("(%s %s)", type.name(), data);
    }
  }

  public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    // Lexer logic begins here

    return tokens;
  }

  public static void main(String[] args) {
    String input = "11 + 22 - 33";

    // Create tokens and print them
    ArrayList<Token> tokens = lex(input);
    for (Token token : tokens)
      System.out.println(token);
  }
}

Now, we need to encode all of the regular-expression patterns for each of the token types into a single pattern, as the algorithm below shows. This is where we use named capturing groups, written (?<TYPE>PATTERN), so that once the combined pattern matches, we can retrieve the matched text by its group name, the TYPE.

Additionally, we import the Pattern class to compile regular expression patterns.

import java.util.regex.Pattern;

public static ArrayList<Token> lex(String input) {
  // The tokens to return
  ArrayList<Token> tokens = new ArrayList<Token>();

  // Lexer logic begins here
  StringBuffer tokenPatternsBuffer = new StringBuffer();
  for (TokenType tokenType : TokenType.values())
    tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
  Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

  return tokens;
}
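Written out by hand, the pattern that the loop builds is equivalent to the following alternation of named groups (a standalone sketch; the short input "11 +" is my own):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombinedPatternDemo {
  public static void main(String[] args) {
    // One alternation, one named group per token type
    Pattern tokenPatterns = Pattern.compile(
        "(?<NUMBER>-?[0-9]+)|(?<BINARYOP>[*/+-])|(?<WHITESPACE>[ \t\f\r\n]+)");
    Matcher m = tokenPatterns.matcher("11 +");
    while (m.find()) {
      if (m.group("NUMBER") != null)
        System.out.println("NUMBER: " + m.group("NUMBER"));
      else if (m.group("BINARYOP") != null)
        System.out.println("BINARYOP: " + m.group("BINARYOP"));
      // whitespace matches fall through and are ignored
    }
  }
}
```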

Next, we begin tokenizing by creating a Matcher object from the compiled pattern, tokenPatterns, built earlier. Each call to the matcher's find() method returns the next substring matched by one of the token-type patterns. Note that we must also import the Matcher class here.

We then iterate through the list of token types and ask whether that token type was matched. If it was, we add the matched text to our list of tokens with the corresponding token type and continue scanning the input.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public static ArrayList<Token> lex(String input) {
  // The tokens to return
  ArrayList<Token> tokens = new ArrayList<Token>();

  // Lexer logic begins here
  StringBuffer tokenPatternsBuffer = new StringBuffer();
  for (TokenType tokenType : TokenType.values())
    tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
  Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

  // Begin matching tokens
  Matcher matcher = tokenPatterns.matcher(input);
  while (matcher.find()) {
    if (matcher.group(TokenType.NUMBER.name()) != null) {
      tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
      continue;
    } else if (matcher.group(TokenType.BINARYOP.name()) != null) {
      tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
      continue;
    } else if (matcher.group(TokenType.WHITESPACE.name()) != null)
      continue;
  }

  return tokens;
}
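The if-else chain above grows with every new token type. An alternative, shown here as a standalone sketch (same enum, same combined pattern, different loop), asks each type in turn for its group:

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenericLexer {
  enum TokenType {
    NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

    public final String pattern;

    private TokenType(String pattern) { this.pattern = pattern; }
  }

  public static void main(String[] args) {
    // Build the combined pattern exactly as in lex()
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values())
      tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

    ArrayList<String> tokens = new ArrayList<String>();
    Matcher matcher = tokenPatterns.matcher("11 + 22 - 33");
    while (matcher.find()) {
      // Ask each token type in turn whether it produced this match
      for (TokenType tokenType : TokenType.values()) {
        if (matcher.group(tokenType.name()) != null) {
          if (tokenType != TokenType.WHITESPACE)
            tokens.add(String.format("(%s %s)", tokenType.name(), matcher.group(tokenType.name())));
          break;
        }
      }
    }

    for (String token : tokens)
      System.out.println(token);
  }
}
```

With this shape, adding a new token type only requires a new enum constant; the matching loop stays unchanged.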

And the algorithm is complete! The magic of named capturing groups happens as we test each of the token types. Note that instead of referencing groups numerically, e.g. matcher.group(1), we use the actual name, which is far more intuitive and much easier to maintain.

Here is the complete source code:

import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Lexer {
  public static enum TokenType {
    // Token types cannot have underscores: named-group names may contain only letters and digits
    NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

    public final String pattern;

    private TokenType(String pattern) {
      this.pattern = pattern;
    }
  }

  public static class Token {
    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
      this.type = type;
      this.data = data;
    }

    @Override
    public String toString() {
      return String.format("(%s %s)", type.name(), data);
    }
  }

  public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    // Lexer logic begins here
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values())
      tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

    // Begin matching tokens
    Matcher matcher = tokenPatterns.matcher(input);
    while (matcher.find()) {
      if (matcher.group(TokenType.NUMBER.name()) != null) {
        tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
        continue;
      } else if (matcher.group(TokenType.BINARYOP.name()) != null) {
        tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
        continue;
      } else if (matcher.group(TokenType.WHITESPACE.name()) != null)
        continue;
    }

    return tokens;
  }

  public static void main(String[] args) {
    String input = "11 + 22 - 33";

    // Create tokens and print them
    ArrayList<Token> tokens = lex(input);
    for (Token token : tokens)
      System.out.println(token);
  }
}

Running the Algorithm

For completeness, when we run the program, we should receive the following output.

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)

Conclusion

Although lexical analysis is doable without them, named capturing groups in regular expressions certainly make the code more intuitive and easier to maintain. It's also nice that Java is beginning to provide features that act as syntactic sugar.