Parsing email addresses in Java (without having the JavaMail API available)

Today in the "can't be that hard to code" category: parsing email addresses.

The boss comes in and tells you that some Java project will now allow users to submit an email address. Your task is to write the sanity checks, rejecting anything that does not comply with the address specification found in RFC 2822. Oh, and by the way, the customer does not have the JavaMail API installed and cannot be persuaded to install it. Meaning: you have to reinvent the wheel.

So far, it sounds fairly trivial. It seems all you have to do is check whether a string contains an "@" symbol. Oh, wait. You must also ensure that there is exactly one "@" symbol (unless quoted) and that it is neither the first nor the last character. Furthermore, there are a couple of characters that may not appear at all, or only in certain combinations. For example, dots are fine as long as the string does not resemble something like ".@.", which would be rather unroutable. The boss said "RFC 2822 compliant". Unfortunately, this also includes the "display name" portion of the address. It looks like there is a bit more to email addresses than meets the eye.
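
To see how far the "just look for the @" idea gets, here is a minimal sketch of such a check. The names NaiveEmailCheck and looksLikeEmail are made up for this illustration and are not part of the parser presented later. The check happily accepts ".@." and wrongly rejects a quoted "@" in the local part.

```java
final class NaiveEmailCheck {

    /** Naive rule: exactly one "@", and it is neither the first nor the last character. */
    static boolean looksLikeEmail(String s) {
        int at = s.indexOf('@');
        return at > 0 && at == s.lastIndexOf('@') && at < s.length() - 1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "devnull@onyxbits.de",        // accepted, and rightly so
            ".@.",                        // accepted, although it is clearly bogus
            "dev..null@onyxbits.de",      // accepted, although ".." is not allowed unquoted
            "\"dev@null\"@onyxbits.de"    // rejected, although the first "@" is quoted
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeEmail(s));
        }
    }
}
```

Every rule added to patch one of these cases makes the if statements longer, which is exactly the maintenance problem described here.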

After pondering for a couple of minutes what is (and is not) allowed in a valid email address, you'll probably end up reading RFC 2822 anyway, which will trigger roughly the following train of thought:

  1. This cannot reasonably be done with if statements in any meaningful way.
  2. The grammar for email addresses is regular, but the corresponding regular expression to parse it would be extremely complex.
  3. An ad hoc solution using if statements or regular expressions will likely crash and burn once it is in the users' hands. It will also be painful to maintain.
  4. The boss is waiting.
  5. It's a common problem. Someone else might have solved it already. Maybe Google knows more...
  6. A lot of people already solved the problem...
  7. ... They used regular expressions and/or if statements.

To illustrate the problem, here are a few examples of what is and what is not allowed by RFC 2822:

/*
* First what is allowed
*/

devnull@onyxbits.de
< devnull @ onyxbits.de >
<devnull@onyxbits.de>
Patrick <devnull@onyxbits.de>
Patrick<devnull@onyxbits.de>
"Patrick Ahlbrecht" <devnull@onyxbits.de>
Patrick <"dev null"@onyxbits.de>

/*
* And now some things that are not allowed
*/

// A local part may not start with a "."
.devnull@onyxbits.de

// A domain may also not start with a "."
devnull@.onyxbits.de

// A domain may not end with a "."
devnull@onyxbits.de.

// Rather obviously not allowed
devnull@onyxbits..de

// Not so obvious: You may not have ".." in the local part, unless quoted.
dev..null@onyxbits.de

// There may not be a space in the display name, unless it is quoted.
Patrick Ahlbrecht <devnull@onyxbits.de>

// And naturally, there is a whole array of forbidden characters
Patrick Ahlbrecht <dev:null@onyxbits.de>
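
The dot rules in the examples above all follow from a single RFC 2822 rule, dot-atom-text = 1*atext *("." 1*atext) (section 3.2.4). As a rough sketch, it can be checked by hand like this. The class DotAtomCheck and its methods are hypothetical names for this illustration only. The code ignores display names, angle brackets, whitespace and quoted strings, so it is no substitute for a real parser.

```java
final class DotAtomCheck {

    // The atext characters other than letters and digits (RFC 2822, section 3.2.4).
    private static final String ATEXT_SPECIALS = "!#$%&'*+-/=?^_`{|}~";

    static boolean isAtext(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || ATEXT_SPECIALS.indexOf(c) >= 0;
    }

    /** dot-atom-text = 1*atext *("." 1*atext) */
    static boolean isDotAtom(String s) {
        boolean needAtext = true; // true at the very start and right after a dot
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isAtext(c)) {
                needAtext = false;
            } else if (c == '.' && !needAtext) {
                needAtext = true;
            } else {
                return false; // leading dot, double dot or forbidden character
            }
        }
        return !needAtext; // rejects the empty string and a trailing dot
    }

    /** Simplified, unquoted address only: dot-atom "@" dot-atom. */
    static boolean isPlainAddrSpec(String s) {
        int at = s.indexOf('@');
        return at >= 0
            && isDotAtom(s.substring(0, at))
            && isDotAtom(s.substring(at + 1));
    }

    public static void main(String[] args) {
        String[] samples = {
            "devnull@onyxbits.de", ".devnull@onyxbits.de", "devnull@.onyxbits.de",
            "devnull@onyxbits.de.", "devnull@onyxbits..de", "dev..null@onyxbits.de",
            "dev:null@onyxbits.de"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isPlainAddrSpec(s));
        }
    }
}
```

Only the first sample passes. Note how much code it already takes to cover one rule, and that quoted local parts and display names are still missing.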

Any parser code based on "if, then, else" statements or regular expressions will clearly be messy and difficult to maintain. It will also almost certainly take days to develop and test. The alternative, using someone else's messy code, should be even less appealing: if it fails after deployment, you'll have even less of a clue where to look for the error. So how can this problem be solved, then?
Say "hello" to compiler construction. The address specification in RFC 2822 is given in a BNF-style syntax. This is the natural input format for compiler compilers, so a parser can be constructed by transcribing the specification with minor modifications.

The code snippet below contains a control file for the JavaCC compiler compiler. It is derived from RFC 2822, but simplified in a couple of ways, which makes the parser somewhat stricter. When using it, please keep the following restrictions in mind:

  • Only spaces and tabs are allowed as whitespace characters.
  • The standard allows comments in an address; the parser does not support them.
  • The "obs-" (obsolete) rules, included in the standard for backwards compatibility, were omitted.
  • The parser does not support domain-literals (used as routing information) in the address.

In other words: the parser code is intended for filtering user input. Do not use it where strict RFC 2822 compliance is a requirement.

To use the code below, you must have a working copy of JavaCC installed. The command to compile is simply javacc grammar.jj (with grammar.jj being the file containing the specification). The resulting Java source code has no further dependencies and is ready to use.


/**
** Regular Java code to be copied into the EmailAddr class is found
** between PARSER_BEGIN and PARSER_END. You may want to modify the package
** statement. You may also want to remove the main() method, which is
** included for testing purposes.
**
** This file can be compiled by JavaCC. The result will be regular Java
** source with no further dependencies.
*/

PARSER_BEGIN (EmailAddr)
// package ... YOUR PACKAGE HERE!
import java.io.*;
import java.util.*;

/**
* A combined parser and data object for RFC2822 compliant email addresses.
* To use this class, just call the parse() method.
*
* This code was taken from http://www.onyxbits.de
*/
public class EmailAddr {

/**
* The part left of the "@" symbol (without the displayName). Note: This
* field contains a verbatim copy of the input value (including quotes,
* if present).
*/
public String localPart;

/**
* Anything to the right of the "@" symbol. Note: This field contains
* whatever is legal in the context of RFC 2822, which includes characters
* and character combinations that domain registries do not allow.
*/
public String domain;

/**
* Anything prefixed to the actual address. Note: If this field is
* non-null, it contains a verbatim copy of the input value (including
* quotes, if present).
*/
public String displayName;

/**
* To be filled by the parse method.
*/
private EmailAddr() {}

/**
* The object doing the parsing.
*/
private static EmailAddr parser;

/**
* Parse a piece of text that is supposed to be an RFC2822 compliant
* email address.
* @param txt the text to parse
* @return an object representing the parsed address.
* @throws ParseException if the submitted string cannot be parsed.
*/
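public static synchronized EmailAddr parse(String txt) throws ParseException {
// JavaCC generates a static parser by default (STATIC=true), so only one
// instance may ever be constructed; later calls reuse it via ReInit().
// Note: getBytes() encodes with the platform default charset.
if (parser==null) {
parser = new EmailAddr(new ByteArrayInputStream(txt.getBytes()));
}
else {
ReInit(new ByteArrayInputStream(txt.getBytes()));
}
return parser.mailbox();
}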

// Overridden from java.lang.Object
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
if (displayName!=null) sb.append(displayName).append(' ');
sb.append("<");
sb.append(localPart);
sb.append("@");
sb.append(domain);
sb.append(">");
return sb.toString();
}

/**
* For testing
* @param args strings to parse
*/
public static void main(String args[]) throws ParseException {
for (int i=0;i<args.length;i++) {
System.err.println(i+": "+EmailAddr.parse(args[i]));
}
}
}

PARSER_END (EmailAddr)

/**
** Token definition. The token names mirror the RFC 2822 rules
** atext, atom, dot-atom and quoted-string.
*/

SKIP: {" " | "\t" }

TOKEN:
{
<#ATEXT: (["a"-"z"] | ["A"-"Z"] | ["0"-"9"] |
"!" | "#" | "$" | "%" | "&" | "'" |
"*" | "+" | "-" | "/" | "=" | "?" |
"^" | "_" | "`" | "{" | "|" | "}" |
"~"
)
>
|
<ATOM: (<ATEXT>)+ >
|
<DOT_ATOM: (<ATEXT>)+ ("." (<ATEXT>)+)*
>
|
<QUOTED_STRING:
"\""
( (~["\"","\\","\n","\r"])
| ("\\"
( ["n","t","b","r","f","\\","'","\""]
| ["0"-"7"] ( ["0"-"7"] )?
| ["0"-"3"] ["0"-"7"] ["0"-"7"]
)
)
)*
"\""
>
}

/**
** Definition of the grammar.
*/

private EmailAddr mailbox(): {EmailAddr ret;}
{
LOOKAHEAD(2)
ret=name_addr() {return ret;}
|
ret=addr_spec() { return ret; }
}

private EmailAddr name_addr() : { EmailAddr ret; Token tmp; }

{
tmp=<ATOM> ret=angle_addr()
{ret.displayName=tmp.toString(); return ret;}
|
tmp=<QUOTED_STRING> ret=angle_addr() { ret.displayName=tmp.toString(); return ret; }
|
ret=angle_addr() { return ret; }
}

private EmailAddr angle_addr() : {EmailAddr ret; } {
"<" ret=addr_spec() ">" { return ret; }
}

private EmailAddr addr_spec() : {EmailAddr ret = new EmailAddr();} {
ret.localPart=local_part() "@" ret.domain=domain() { return ret;}
}

private String local_part() : { Token ret; } {
/* JavaCC quirk: RFC 2822 says <DOT_ATOM> | <QUOTED_STRING> only. However,
* JavaCC gives <ATOM> precedence over <DOT_ATOM> in matching, so <ATOM> must
* be added to match all local parts that do not feature a dot. This
* does not pose a problem, as <ATOM> is a subset of <DOT_ATOM>.
*/
ret=<ATOM> { return ret.toString(); }
|
ret=<DOT_ATOM> { return ret.toString(); }
|
ret=<QUOTED_STRING> { return ret.toString(); }
}

private String domain() : {Token ret;} {
/* Because <ATOM> takes precedence over <DOT_ATOM>, a domain without a dot
* is tokenized as <ATOM> and therefore rejected here.
*/
ret=<DOT_ATOM> { return ret.toString(); }
}