Hi
Member @TaoLaoBidaoBanBanhBa posted an interesting API using RegEx and I half-jokingly told him, quoting IT veteran and master Jamie Zawinski, that anyone who has a problem and tries to solve it with RegEx will get TWO problems. Aside from the increasingly unreadable cryptic strings such as
Pattern pattern = Pattern.compile("[[a-z][A-Z]]+|\\d+(\\.\\d+)?|[-^+*/(),π!]|[^\\w\\d.\\-^+*/(),π!]+");
RegEx is slow. Very slow. That’s why Master Zawinski gave a little hint with TWO problems that RegEx should be used with caution, otherwise it will lead to performance problems (the resulting 2nd problem) and it’s hard to believe that RegEx is the real time hog. Why? Because the coding is very short and looks good. But the parsing and compiling of the pattern in the background takes time. A lot of time.
I’ll show you 2 small apps with conventional coding and cool RegEx and the result is… horrible.
- String[] String.split(String regex) versus workaround
import java.util.*;
public class Split {
public static void main(String[] args) throws Exception {
String[] a = null, b = null;
if (args.length < 2) {
System.out.println("Usage: java Split string pattern");
System.exit(0);
}
// sync into cache
for (int i = 0; i < 100; ++i) {
a = split(args[0], args[1]);
}
// now measure the time
long t0 = System.nanoTime();
for (int i = 0; i < 100; ++i) {
a = split(args[0], args[1]);
}
long t1 = System.nanoTime();
double d0 = (double)(t1-t0)/100;
// sync into cache
for (int i = 0; i < 100; ++i) {
b = regex(args[0], args[1]);
}
// now measure the time
long t2 = System.nanoTime();
for (int i = 0; i < 100; ++i) {
b = regex(args[0], args[1]);
}
long t3 = System.nanoTime();
double d1 = (double)(t3-t2)/100;
// compare the results
System.out.printf("t_split: %7.2f nSec. t_regex: %7.2f nSec.\n",d0, d1);
if (a.length != b.length) {
System.out.println("Different length. split:"+a.length+" and regex:"+b.length);
System.exit(0);
}
for (int i = 0; i < a.length; ++i) if (!a[i].equals(b[i])) {
System.out.println("Unequal--->split["+i+"]:"+a[i]+" and regex["+i+"]:"+b[i]);
System.exit(0);
}
System.out.println("split() and regex() deliver the same results");
for (int i = 0; i < a.length; ++i)
System.out.println("split["+i+"]:"+a[i]+" and regex["+i+"]:"+b[i]);
}
// work-around
public static String[] split(String str, String pat) {
List<String> lst = new ArrayList<String>();
for (int a = 0, b = 0, le = pat.length(), mx = str.length(); a < mx; a = b + le) {
b = str.indexOf(pat, a);
if (b < 0) break;
if ((a+le) < b) lst.add(str.substring(a, b));
}
if (lst.size() > 0) return lst.toArray(new String[lst.size()]);
return new String[] {""};
}
// regex
public static String[] regex(String str, String pat) {
return str.split("["+pat+"]+");
}
}
- String replaceAll(String regex, String replacement) versus workaround
public class Compare {
public static void main(String[] args) throws Exception {
String a = null, b = null;
if (args.length < 2) {
System.out.println("Usage: java Split string pattern");
System.exit(0);
}
// sync into cache
for (int i = 0; i < 100; ++i) {
a = compare(args[0], args[1]);
}
// now measure the time
long t0 = System.nanoTime();
for (int i = 0; i < 100; ++i) {
a = compare(args[0], args[1]);
}
long t1 = System.nanoTime();
double d0 = (double)(t1-t0)/100;
// sync into cache
for (int i = 0; i < 100; ++i) {
b = regex(args[0], args[1]);
}
// now measure the time
long t2 = System.nanoTime();
for (int i = 0; i < 100; ++i) {
b = regex(args[0], args[1]);
}
long t3 = System.nanoTime();
double d1 = (double)(t3-t2)/100;
// compare the results
System.out.printf("t_compare: %7.2f nSec. t_regex: %7.2f nSec.\n",d0, d1);
if (a.length() != b.length()) {
System.out.println("Different length. compare:"+a+" and regex:"+b);
System.exit(0);
}
System.out.println("compare() and regex() deliver the same results:"+a.equals(b)+
"\na:"+a+"\nb:"+b);
}
// work-around
public static String compare(String str, String pat) {
StringBuilder sb = new StringBuilder();
for (int a = 0, b = 0, le = pat.length(), mx = str.length(); a < mx; a = b + le) {
b = str.indexOf(pat, a);
if (b < 0) break;
if ((a+le) < b) sb.append(str.substring(a, b));
}
return sb.toString();
}
// regex
public static String regex(String str, String pat) {
return str.replaceAll("["+pat+"]+", "");
}
}
If you compile these apps and run them in a CMD windows, you will be flabbergasted by the outcomes. And that’s what I did:
C:\JoeApp\ODB\Tools>java Split "Hello Joe T. Schwarz " " "
t_split: 11131,00 nSec. t_regex: 36777,00 nSec.
split() and regex() deliver the same results
split[0]:Hello and regex[0]:Hello
split[1]:Joe and regex[1]:Joe
split[2]:T. and regex[2]:T.
split[3]:Schwarz and regex[3]:Schwarz
C:\JoeApp\ODB\Tools>java Split "Hello Joe T. Schwarz " " "
t_split: 8461,00 nSec. t_regex: 36288,00 nSec.
split() and regex() deliver the same results
split[0]:Hello and regex[0]:Hello
split[1]:Joe and regex[1]:Joe
split[2]:T. and regex[2]:T.
split[3]:Schwarz and regex[3]:Schwarz
C:\JoeApp\ODB\Tools>java Compare "Hello Joe T. Schwarz " " "
t_compare: 6445,00 nSec. t_regex: 29240,00 nSec.
compare() and regex() deliver the same results:true
a:HelloJoeT.Schwarz
b:HelloJoeT.Schwarz
C:\JoeApp\ODB\Tools>java Compare "Hello Joe T. Schwarz " " "
t_compare: 6818,00 nSec. t_regex: 33509,00 nSec.
compare() and regex() deliver the same results:true
a:HelloJoeT.Schwarz
b:HelloJoeT.Schwarz
C:\JoeApp\ODB\Tools>