There are at least two ways of tokenizing Strings in Java. Simple examples work like a charm, but it is very easy to run into weird or unintuitive behavior when experimenting with complex regexes or corner cases. So, in order to gain deeper knowledge, I am going to dive into the implementation details and present some general rules of String tokenizing in Java.
How to tokenize a String:
- Pattern.split:
String[] tokens = Pattern.compile(regex).split(input, limit);
- String.split (this method is equivalent to the one presented above):
String[] tokens = input.split(regex, limit);
- Scanner:
Scanner s = new Scanner(text).useDelimiter(regex);
while (s.hasNext()) s.next();
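To compare the three approaches side by side, here is a minimal sketch. The class name TokenizeDemo, the helper names viaPattern, viaString and viaScanner, and the sample input are my own assumptions for illustration; only the Pattern, String and Scanner calls come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class TokenizeDemo {
    // Approach 1: compile the delimiter regex and split with the Pattern.
    static String[] viaPattern(String input, String regex, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    // Approach 2: String.split, which yields the same result as approach 1.
    static String[] viaString(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    // Approach 3: a Scanner that uses the regex as its delimiter.
    static List<String> viaScanner(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input).useDelimiter(regex)) {
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "James,Bond,007"; // hypothetical sample input
        System.out.println(Arrays.toString(viaPattern(input, ",", 0))); // [James, Bond, 007]
        System.out.println(Arrays.toString(viaString(input, ",", 0)));  // [James, Bond, 007]
        System.out.println(viaScanner(input, ","));                     // [James, Bond, 007]
    }
}
```

For such a simple input all three approaches agree. The differences only show up in the corner cases.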
To pass an exam such as the SCJP, a programmer has to predict the exact result of a split invocation, including the empty matches, which cause most of the problems. I will now present some general rules that should help in systematizing the knowledge about String tokenization.
The rules are ordered by priority: earlier rules take precedence over later ones, so follow the list until one of the rules matches the situation you are examining. In the rules below, "Matcher" stands for the two split methods (Pattern.split and String.split), and "Scanner" for the Scanner loop shown above.
1.) If the delimiter regex does not match any part of the input:
- Matcher: returns exactly one element, namely the given String
- Scanner: returns exactly one element, namely the given String
Example:
- Matcher: "James Bond".split("MI6", 0) = ["James Bond"]
- Scanner: new Scanner("James Bond").useDelimiter("MI6").next() = "James Bond"
2.) When the given String is empty:
- Matcher: returns exactly one empty match [""]
- Scanner: the result is empty []
Example:
- Matcher: "".split("MI6", 0) = [""]
- Scanner: new Scanner("").useDelimiter("MI6").hasNext() == false;
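Rules 1 and 2 can be checked with a short sketch like the one below. The class name and the scanAll helper are assumptions of mine that simply collect every Scanner token into a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class NoMatchAndEmptyInputDemo {
    // Hypothetical helper: collects all tokens a Scanner produces for the given delimiter.
    static List<String> scanAll(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input).useDelimiter(regex)) {
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Rule 1: the delimiter never matches, so the whole input is the only token.
        System.out.println(Arrays.toString("James Bond".split("MI6", 0))); // [James Bond]
        System.out.println(scanAll("James Bond", "MI6"));                  // [James Bond]

        // Rule 2: empty input. split returns one empty String, Scanner returns nothing.
        System.out.println("".split("MI6", 0).length); // 1
        System.out.println(scanAll("", "MI6"));        // []
    }
}
```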
3.) When the delimiter regex is empty:
- Matcher (limit == 0): Tokenized characters of the given String preceded by one empty String
- Matcher (limit < 0): Tokenized characters of the given String preceded and followed by one empty String
- Scanner: Tokenized characters of the given String
Example:
- Matcher (limit == 0): "007".split("", 0) = ["", "0", "0", "7"]
- Matcher (limit < 0): "007".split("", -1) = ["", "0", "0", "7", ""]
- Scanner: Scanner s = new Scanner("007").useDelimiter(""); s.next() = "0", s.next() = "0", s.next() = "7", s.hasNext() = false;
Note: the leading empty String is produced by Java 7 and earlier. Since Java 8, a zero-width match at the beginning of the input never produces a leading empty substring, so the results are ["0", "0", "7"] and ["0", "0", "7", ""] respectively.
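Because the split result for an empty delimiter depends on the Java version (Java 8 and later drop the leading empty String that older versions produce), it is worth running this case on your own JDK. The sketch below only prints the split results without claiming an expected output. The class name and the scanAll helper are assumptions of mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class EmptyDelimiterDemo {
    // Hypothetical helper: collects all tokens a Scanner produces for the given delimiter.
    static List<String> scanAll(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input).useDelimiter(regex)) {
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Version dependent: whether a leading empty String appears depends on the JDK.
        System.out.println(Arrays.toString("007".split("", 0)));
        System.out.println(Arrays.toString("007".split("", -1)));

        // The Scanner returns one token per character.
        System.out.println(scanAll("007", "")); // [0, 0, 7]
    }
}
```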
The last thing I want to mention is the "empty matches" or "zero-length" matches, i.e. the empty Strings that appear in the result:
- Matcher (limit == 0): returns leading and inner empty matches; trailing ones are discarded, so IF ALL MATCHES ARE EMPTY the result is empty
- Matcher (limit < 0): returns leading, inner and trailing empty matches
- Scanner: returns only inner empty matches (those between two adjacent delimiters)
Example:
- Matcher (limit == 0): "::".split(":", 0) = []
- Matcher (limit < 0): "::".split(":", -1) = ["", "", ""]
- Matcher (limit == 0): ":1::2::".split(":", 0) = ["", "1", "", "2"]
- Matcher (limit < 0): ":1::2::".split(":", -1) = ["", "1", "", "2", "", ""]
- Scanner: Scanner s = new Scanner("::").useDelimiter(":"); s.next() = "", s.hasNext() = false;
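The empty-match cases are easier to read when every token is printed in quotes. Here is a minimal sketch of that. The class name and the helpers split, scanAll and show are assumptions of mine, not part of the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

class EmptyTokensDemo {
    // Hypothetical helper: String.split wrapped into a List for easier comparison.
    static List<String> split(String input, String regex, int limit) {
        return Arrays.asList(input.split(regex, limit));
    }

    // Hypothetical helper: collects all tokens a Scanner produces for the given delimiter.
    static List<String> scanAll(String input, String regex) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input).useDelimiter(regex)) {
            while (s.hasNext()) {
                tokens.add(s.next());
            }
        }
        return tokens;
    }

    // Hypothetical helper: renders each token in quotes so empty Strings stay visible.
    static String show(List<String> tokens) {
        return tokens.stream()
                .map(t -> "\"" + t + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(show(split("::", ":", 0)));       // []
        System.out.println(show(split("::", ":", -1)));      // ["", "", ""]
        System.out.println(show(split(":1::2::", ":", 0)));  // ["", "1", "", "2"]
        System.out.println(show(split(":1::2::", ":", -1))); // ["", "1", "", "2", "", ""]
        System.out.println(show(scanAll("::", ":")));        // [""]
    }
}
```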
Hope this helps!