Writing a Lexer in Java 1.7 using Regex Named Capturing Groups

One of my favorite features in the new Java 1.7, aside from the try-with-resources statement, is named capturing groups in the regular expression API.

Although captured groups can be referenced numerically, in the order in which they are declared from left to right, named capturing groups make the code more intuitive, as I will demonstrate by constructing a lexer.

Lexer, a Definition

To describe lexers, we must first describe tokenizers. A tokenizer simply breaks a string up into a set of tokens, which are, of course, smaller strings. A lexer, in turn, is a tokenizer that attaches context to each token, namely its type, e.g. (NUMBER 1234), whereas a plain token would be just 1234. Lexers are important for parsing languages; however, that is a discussion beyond the scope of this tutorial.

For example, given the input "11 + 22 - 33", we should receive the following tokens from a lexer:

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)

Note that BINARYOP refers to a binary operator, i.e. any operator that accepts two arguments. The archetypal example is addition, which accepts two numbers, one to the left and one to the right of the operator.

Setting Up the Program

The input is the sentence to be scanned. For this tutorial, we will scan a simple arithmetic grammar that includes addition, subtraction and multiplication. We start with a skeleton program that holds the input, "11 + 22 - 33":

public class Lexer {
	public static void main(String[] args) {
		String input = "11 + 22 - 33";
	}
}

Next, we must define the types of the tokens that we are extracting and the regular expression that they match.

Number
-?[0-9]+ Matches any integer, with an optional leading minus sign and no decimals.
Binary Operator
[*/+-] Matches any of the standard arithmetic operators.
Whitespace
[ \t\f\r\n]+ Matches a sequence of spaces, tabs, form feeds, carriage returns or newlines. Will be skipped.
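As a quick sanity check, each pattern can be tried on its own with String.matches. This is an illustrative sketch separate from the tutorial's program; note that I write the operator class as [*/+-], because a | placed inside a character class is matched literally rather than acting as alternation:

```java
// Quick sanity check of the three token patterns (illustrative only,
// not part of the tutorial's program).
public class PatternCheck {
	public static void main(String[] args) {
		// NUMBER: an optional minus sign followed by one or more digits
		System.out.println("-42".matches("-?[0-9]+"));       // prints "true"
		System.out.println("3.14".matches("-?[0-9]+"));      // prints "false" (no decimals)

		// BINARYOP: exactly one arithmetic operator
		System.out.println("+".matches("[*/+-]"));           // prints "true"

		// WHITESPACE: one or more spaces, tabs, form feeds or newlines
		System.out.println(" \t\n".matches("[ \t\f\r\n]+")); // prints "true"
	}
}
```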
public class Lexer {
	public static enum TokenType {
		// Regex group names in Java 7 may contain only letters and digits,
		// so the token type names avoid underscores
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

		public final String pattern;

		private TokenType(String pattern) {
			this.pattern = pattern;
		}
	}

	public static void main(String[] args) {
		String input = "11 + 22 - 33";
	}
}

Enumerations in Java can only have private constructors because only a finite set of instances is ever created at run-time. Consequently, their data fields are frequently declared final.
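As a small standalone illustration of this pairing (a trimmed-down sketch, not part of the final listing), each constant exposes both its name() and its pattern, which is exactly what the lexer relies on later:

```java
// Each enum constant carries its regex pattern as an ordinary public field.
public class EnumDemo {
	public static enum TokenType {
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]");

		public final String pattern;

		private TokenType(String pattern) {
			this.pattern = pattern;
		}
	}

	public static void main(String[] args) {
		System.out.println(TokenType.NUMBER.name());   // prints "NUMBER"
		System.out.println(TokenType.NUMBER.pattern);  // prints "-?[0-9]+"
	}
}
```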

Finally, we declare a data structure for holding the token data. Additionally, I will override the toString() method for printing out the token’s contextual data at the end of this tutorial in the format I have mentioned earlier: (<TYPE> <DATA>).

public class Lexer {
	public static enum TokenType {
		// Regex group names in Java 7 may contain only letters and digits,
		// so the token type names avoid underscores
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

		public final String pattern;

		private TokenType(String pattern) {
			this.pattern = pattern;
		}
	}

	public static class Token {
		public TokenType type;
		public String data;

		public Token(TokenType type, String data) {
			this.type = type;
			this.data = data;
		}

		@Override
		public String toString() {
			return String.format("(%s %s)", type.name(), data);
		}
	}

	public static void main(String[] args) {
		String input = "11 + 22 - 33";
	}
}

Now that we have our input, token types and data structure for tokens, we may begin lexical analysis of the input string into a set of tokens with its corresponding token type.

Lexical Analysis with Regular Expressions

We begin by framing our lexical analysis as a method, lex(), which returns a list of Token objects. Additionally, we will need to import ArrayList in order to store the Token objects in the list.

import java.util.ArrayList;

public class Lexer {
	public static enum TokenType {
		// Regex group names in Java 7 may contain only letters and digits,
		// so the token type names avoid underscores
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

		public final String pattern;

		private TokenType(String pattern) {
			this.pattern = pattern;
		}
	}

	public static class Token {
		public TokenType type;
		public String data;

		public Token(TokenType type, String data) {
			this.type = type;
			this.data = data;
		}

		@Override
		public String toString() {
			return String.format("(%s %s)", type.name(), data);
		}
	}

	public static ArrayList<Token> lex(String input) {
		// The tokens to return
		ArrayList<Token> tokens = new ArrayList<Token>();

		// Lexer logic begins here

		return tokens;
	}

	public static void main(String[] args) {
		String input = "11 + 22 - 33";

		// Create tokens and print them
		ArrayList<Token> tokens = lex(input);
		for (Token token : tokens)
			System.out.println(token);
	}
}

Now, we need to combine the regular expression patterns for all of the token types into a single pattern, as in the algorithm shown below. This is where we use named capturing groups, written (?<TYPE>PATTERN), so that once the pattern matches, we can retrieve the matched token by its group name, the TYPE.
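Before wiring this into the lexer, here is a minimal standalone example of a named capturing group in isolation (the group name NUMBER is just an illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal demonstration of a named capturing group.
public class NamedGroupDemo {
	public static void main(String[] args) {
		Pattern p = Pattern.compile("(?<NUMBER>[0-9]+)");
		Matcher m = p.matcher("abc 123");
		if (m.find()) {
			// Retrieve the captured text by the group's name instead of its index.
			System.out.println(m.group("NUMBER")); // prints "123"
		}
	}
}
```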

Additionally, we import the Pattern class to compile regular expression patterns.

import java.util.regex.Pattern;

public static ArrayList<Token> lex(String input) {
	// The tokens to return
	ArrayList<Token> tokens = new ArrayList<Token>();

	// Lexer logic begins here
	StringBuffer tokenPatternsBuffer = new StringBuffer();
	for (TokenType tokenType : TokenType.values())
		tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
	Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

	return tokens;
}
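For our three token types, the loop above composes a single alternation of named groups. The following standalone sketch (written with the operator class [*/+-], since a | inside a character class matches a literal |) rebuilds the buffer and verifies its shape:

```java
// Rebuilds the alternation composed by lex() and checks its shape. The
// whitespace class contains literal tab/newline characters, because Java
// expands the escapes inside the string literal before the regex engine
// ever sees them.
public class ComposedPatternDemo {
	public static enum TokenType {
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");
		public final String pattern;
		private TokenType(String pattern) { this.pattern = pattern; }
	}

	public static void main(String[] args) {
		StringBuffer buf = new StringBuffer();
		for (TokenType tokenType : TokenType.values())
			buf.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));

		String composed = buf.substring(1); // drop the leading "|"
		String expected = "(?<NUMBER>-?[0-9]+)|(?<BINARYOP>[*/+-])|(?<WHITESPACE>[ \t\f\r\n]+)";
		System.out.println(composed.equals(expected)); // prints "true"
	}
}
```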

Next, we begin tokenizing by creating a Matcher object from the compiled pattern, tokenPatterns, from earlier. Each call to find() locates the next substring of the input that matches one of the token type patterns. Note that we must also import the Matcher class here.

We will iterate through the list of token types and ask whether each one was matched. If a token type matched, we add the matched text to our list of tokens with the corresponding type and continue scanning the input.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public static ArrayList<Token> lex(String input) {
	// The tokens to return
	ArrayList<Token> tokens = new ArrayList<Token>();

	// Lexer logic begins here
	StringBuffer tokenPatternsBuffer = new StringBuffer();
	for (TokenType tokenType : TokenType.values())
		tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
	Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

	// Begin matching tokens
	Matcher matcher = tokenPatterns.matcher(input);
	while (matcher.find()) {
		if (matcher.group(TokenType.NUMBER.name()) != null) {
			tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
			continue;
		} else if (matcher.group(TokenType.BINARYOP.name()) != null) {
			tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
			continue;
		} else if (matcher.group(TokenType.WHITESPACE.name()) != null)
			continue;
	}

	return tokens;
}

And the algorithm is complete! The magic of named capturing groups happens as we test each of the token types. Note that instead of retrieving groups by numerical index, e.g. matcher.group(1), we use the group's actual name, which is far more intuitive and much easier to maintain.
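As an aside, the if/else chain can be replaced by a loop over TokenType.values(), so that adding a new token type requires no change to the matching code. The following is only a sketch of one possible refactoring, under the assumption that every non-whitespace match becomes a token (again written with the operator class [*/+-]):

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A table-driven variant of the matching loop: iterate over the enum's
// values instead of writing one if-branch per token type.
public class GenericLexer {
	public static enum TokenType {
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");
		public final String pattern;
		private TokenType(String pattern) { this.pattern = pattern; }
	}

	public static class Token {
		public TokenType type;
		public String data;
		public Token(TokenType type, String data) {
			this.type = type;
			this.data = data;
		}
		@Override
		public String toString() { return String.format("(%s %s)", type.name(), data); }
	}

	public static ArrayList<Token> lex(String input) {
		ArrayList<Token> tokens = new ArrayList<Token>();
		StringBuffer buf = new StringBuffer();
		for (TokenType tokenType : TokenType.values())
			buf.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
		Matcher matcher = Pattern.compile(buf.substring(1)).matcher(input);

		while (matcher.find()) {
			// Exactly one alternative matched; find which one by its group name.
			for (TokenType tokenType : TokenType.values()) {
				String lexeme = matcher.group(tokenType.name());
				if (lexeme != null) {
					if (tokenType != TokenType.WHITESPACE)
						tokens.add(new Token(tokenType, lexeme));
					break;
				}
			}
		}
		return tokens;
	}

	public static void main(String[] args) {
		for (Token token : lex("11 + 22 - 33"))
			System.out.println(token);
	}
}
```

This variant prints the same five tokens as the hand-written chain, at the cost of one extra group lookup per non-matching type.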

Here is the complete source code:

import java.util.ArrayList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Lexer {
	public static enum TokenType {
		// Regex group names in Java 7 may contain only letters and digits,
		// so the token type names avoid underscores
		NUMBER("-?[0-9]+"), BINARYOP("[*/+-]"), WHITESPACE("[ \t\f\r\n]+");

		public final String pattern;

		private TokenType(String pattern) {
			this.pattern = pattern;
		}
	}

	public static class Token {
		public TokenType type;
		public String data;

		public Token(TokenType type, String data) {
			this.type = type;
			this.data = data;
		}

		@Override
		public String toString() {
			return String.format("(%s %s)", type.name(), data);
		}
	}

	public static ArrayList<Token> lex(String input) {
		// The tokens to return
		ArrayList<Token> tokens = new ArrayList<Token>();

		// Lexer logic begins here
		StringBuffer tokenPatternsBuffer = new StringBuffer();
		for (TokenType tokenType : TokenType.values())
			tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
		Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));

		// Begin matching tokens
		Matcher matcher = tokenPatterns.matcher(input);
		while (matcher.find()) {
			if (matcher.group(TokenType.NUMBER.name()) != null) {
				tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
				continue;
			} else if (matcher.group(TokenType.BINARYOP.name()) != null) {
				tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
				continue;
			} else if (matcher.group(TokenType.WHITESPACE.name()) != null)
				continue;
		}

		return tokens;
	}

	public static void main(String[] args) {
		String input = "11 + 22 - 33";

		// Create tokens and print them
		ArrayList<Token> tokens = lex(input);
		for (Token token : tokens)
			System.out.println(token);
	}
}

Running the Algorithm

For completeness, when we run the program, we should receive the following output.

(NUMBER 11)
(BINARYOP +)
(NUMBER 22)
(BINARYOP -)
(NUMBER 33)

Conclusion

Although lexical analysis is doable without them, named capturing groups in regular expressions certainly make the code more intuitive and easier to maintain. It’s also nice that Java is beginning to provide features that act as syntactic sugar.

  • http://pfmiles.github.com/ Yue

    Seems good, but I am a little disappointed that it shows no great progress over the version I wrote before, using an older Java version: https://gist.github.com/2464374
    The most important code here is the “group recognition” section:

    while (matcher.find()) {
        if (matcher.group(TokenType.NUMBER.name()) != null) {
            tokens.add(new Token(TokenType.NUMBER, matcher.group(TokenType.NUMBER.name())));
            continue;
        } else if (matcher.group(TokenType.BINARYOP.name()) != null) {
            tokens.add(new Token(TokenType.BINARYOP, matcher.group(TokenType.BINARYOP.name())));
            continue;
        } else if (matcher.group(TokenType.WHITESPACE.name()) != null)
            continue;
    }

    It’s a looping and traversal style, just as in the code I mentioned in the gist, except that you do it by calling the groups’ names and I by using the groups’ indexes, but they are almost the same thing…
    I wish there were a more elegant way of finding the matched group in a new version of Java. Something like ‘matcher.findMatchedGroupName’ for convenience would be nice.

  • http://adam.wojs.pl Adam Wójs

    Elegant solution :-)

  • Raghu K

    A really awesome & useful post. Thank you Gio!

  • toto

    Really nice :)
    Code is good and explanations are excellent. Thanks !

  • Guest

    How can I add an error handling part? For example: 3i+24 can match 3, +, 24. How can I produce an error message about the “i”?

  • About Gio

    I am a torrent of ingenuity (or insanity) with a myriad of innovations (sometimes fallacies) and a wealth of inspiration (possibly naiveté). My name is Gio Carlo Cielo Borje and I like puffer fish because they're just cooltalkin', highwalkin' and fastlivin'.

    I'm also twenty and a current student at UC Irvine for Computer Science.