May 18th, 2011
3:12 pm
String.split: some simple regular expression (regex) examples

Posted under Java

String.split is now the preferred means of tokenising strings, as java.util.StringTokenizer is a legacy class whose use is discouraged in new code (it is retained for compatibility rather than formally deprecated). String.split returns a string array containing the split strings. Using String.split requires some understanding of regular expressions, so I have included a couple of simple examples here.

String[] output = input.split("[,\\s]+");

This example takes string input and splits it on any white space (including newlines) or commas. Multiple white space and/or multiple comma characters are treated as a single delimiter. The regex string operates as follows:-

  • []+ is a character class followed by a plus: the brackets match any one of the characters/constructs listed inside them, and the plus character repeats that match one or more times, so a run of delimiter characters is treated as a single match.
  • , matches the comma character.
  • \\s matches any white space. The double backslash is needed as both Java and Regex use it as an escape character, so it is needed twice.
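To see the first pattern in action, here is a minimal sketch. The class name, method name and sample input are my own, chosen purely for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; the name and the sample input are illustrative only.
class CommaOrSpaceSplitDemo {

    // Splits on any run of commas and/or white space, as in the example above.
    static String[] splitOnCommasOrSpace(String input) {
        return input.split("[,\\s]+");
    }

    public static void main(String[] args) {
        // Mixed delimiters: comma+space, tab, newline+space, double comma.
        String input = "apple, banana\tcherry\n date,,egg";
        // Each run of delimiter characters counts as one delimiter.
        System.out.println(Arrays.toString(splitOnCommasOrSpace(input)));
        // prints [apple, banana, cherry, date, egg]
    }
}
```

Note that the double comma between date and egg does not produce an empty string, because the plus folds both commas into one delimiter.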

String[] output = input.split("\\s*,\\s*");

This example takes string input and splits it on commas only. Each comma may be preceded or followed by zero or more white space characters (including newlines), which are also matched and are therefore stripped from the output. Multiple commas will result in empty strings in the output array, as each delimiter must contain exactly one comma. The exception is at the very end of the input, where String.split discards trailing empty strings. The regex string operates as follows:-

  • \\s* matches zero or more white space characters, and is placed both before and after the comma. This means that white space is swallowed up and not returned if present (as it is part of the delimiter), but it is not required to be present, i.e. it is not mandatory for the delimiter.
  • , matches a single comma character. This means that a delimiter must have one and only one comma. Multiple commas are treated as multiple delimiters.
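The behaviour with multiple commas is easiest to see by running it. The sketch below uses a hypothetical class name and made-up inputs; only the regex comes from the example above.

```java
import java.util.Arrays;

// Hypothetical demo class; the name and the sample inputs are illustrative only.
class CommaOnlySplitDemo {

    // Splits on single commas, swallowing any white space around each comma.
    static String[] splitOnCommas(String input) {
        return input.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        // The double comma yields an empty string between "b" and "c".
        System.out.println(Arrays.toString(splitOnCommas("a , b,,c ,d")));
        // prints [a, b, , c, d]

        // Trailing empty strings are dropped by String.split(regex).
        System.out.println(Arrays.toString(splitOnCommas("a,b,,")));
        // prints [a, b]
    }
}
```

If the trailing empty strings matter, the two-argument form split(regex, -1) keeps them.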

 

Some points about string splitting in general:-

  • Any leading or trailing white space on the input string needs careful handling. In the first case above, leading white space will cause an empty first output string to be part of the array. In the second example, leading white space will remain at the start of the first output string, and trailing white space will remain at the end of the last output string. To avoid these problems, it is a good idea to trim the input string before splitting it, e.g.:-

String[] output = input.trim().split("[,\\s]+");
String[] output = input.trim().split("\\s*,\\s*");

  • Any characters which are matched as part of the delimiter are swallowed up and not returned in the output array. This gives an opportunity to clean up the string as I have done in the second example here, by making other characters (in this case white space) an optional part of the delimiter, so that they are removed if present.
  • You can compile the regular expression string into a pattern, which is efficient if the same pattern will be reused a number of times. You then perform the split via the pattern, so that our second example above would then be:-

import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("\\s*,\\s*");
String[] output = pattern.split(input);
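The points above about leading/trailing white space and precompiled patterns can be sketched together as follows. The class, constant and method names are hypothetical, and the sample strings are invented for the demonstration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; all names and sample inputs are illustrative only.
class TrimAndPatternDemo {

    // Compiled once and reused on every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Trims the input first so no stray white space survives at either end.
    static String[] splitTrimmed(String input) {
        return COMMA.split(input.trim());
    }

    public static void main(String[] args) {
        String raw = "  a , b  ";

        // Without trim(): white space stays on the first and last strings.
        System.out.println(Arrays.toString(COMMA.split(raw)));
        // prints [  a, b  ]

        // With trim(): clean tokens.
        System.out.println(Arrays.toString(splitTrimmed(raw)));
        // prints [a, b]

        // First pattern, untrimmed: leading white space gives an empty first string.
        System.out.println(Arrays.toString("  a b".split("[,\\s]+")));
        // prints [, a, b]
    }
}
```

Holding the compiled pattern in a static final field is just one way to reuse it; any scope that outlives the repeated split calls would serve equally well.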

 

The Sun Java tutorial on regular expressions may be found here. There are many other sites with examples as well, as this Google search shows.
