The second (hidden) problem of RegEx

Hi
Member @TaoLaoBidaoBanBanhBa posted an interesting API using RegEx and I half-jokingly told him, quoting IT veteran and master Jamie Zawinski, that anyone who has a problem and tries to solve it with RegEx will get TWO problems. Aside from the increasingly unreadable cryptic strings such as

Pattern pattern = Pattern.compile("[[a-z][A-Z]]+|\\d+(\\.\\d+)?|[-^+*/(),π!]|[^\\w\\d.\\-^+*/(),π!]+");

RegEx is slow. Very slow. That’s why Master Zawinski gave a little hint with TWO problems that RegEx should be used with caution, otherwise it will lead to performance problems (the resulting 2nd problem) and it’s hard to believe that RegEx is the real time hog. Why? Because the coding is very short and looks good. But the parsing and compiling of the pattern in the background takes time. A lot of time.
I’ll show you 2 small apps with conventional coding and cool RegEx and the result is… horrible.

  1. String[] String.split(String regex) versus workaround
import java.util.*;
public class Split {
  public static void main(String[] args) throws Exception {
    String[] a = null, b = null;
    if (args.length < 2) {
      System.out.println("Usage: java Split string pattern");
      System.exit(0);
    }
    // sync into cache
    for (int i = 0; i < 100; ++i) {
      a = split(args[0], args[1]); 
    }
    // now measure the time
    long t0 = System.nanoTime();
    for (int i = 0; i < 100; ++i) {
      a = split(args[0], args[1]); 
    }
    long t1 = System.nanoTime();
    double d0 = (double)(t1-t0)/100;
    // sync into cache
    for (int i = 0; i < 100; ++i) {
      b = regex(args[0], args[1]); 
    }
    // now measure the time
    long t2 = System.nanoTime();
    for (int i = 0; i < 100; ++i) {
      b = regex(args[0], args[1]); 
    }
    long t3 = System.nanoTime();
    double d1 = (double)(t3-t2)/100;
    // compare the results
    System.out.printf("t_split: %7.2f nSec. t_regex: %7.2f nSec.\n",d0, d1);
    if (a.length != b.length) {
      System.out.println("Different length. split:"+a.length+" and regex:"+b.length);
      System.exit(0);
    }
    for (int i = 0; i < a.length; ++i) if (!a[i].equals(b[i])) {
      System.out.println("Unequal--->split["+i+"]:"+a[i]+" and regex["+i+"]:"+b[i]);
      System.exit(0);
    }
    System.out.println("split() and regex() deliver the same results");
    for (int i = 0; i < a.length; ++i)
      System.out.println("split["+i+"]:"+a[i]+" and regex["+i+"]:"+b[i]);
  }
  // work-around
  public static String[] split(String str, String pat) {
    List<String> lst = new ArrayList<String>();
    for (int a = 0, b = 0, le = pat.length(), mx = str.length(); a < mx; a = b + le) {
      b = str.indexOf(pat, a);
      if (b < 0) break;
      if ((a+le) < b) lst.add(str.substring(a, b));
    }
    if (lst.size() > 0)  return lst.toArray(new String[lst.size()]);
    return new String[] {""};
  }
  // regex
  public static String[] regex(String str, String pat) {
    return str.split("["+pat+"]+");
  }
}
  1. String replaceAll(String regex, String replacement) versus workaround
public class Compare {
  public static void main(String[] args) throws Exception {
    String a = null, b = null;
    if (args.length < 2) {
      System.out.println("Usage: java Split string pattern");
      System.exit(0);
    }
    // sync into cache
    for (int i = 0; i < 100; ++i) {
      a = compare(args[0], args[1]); 
    }
    // now measure the time
    long t0 = System.nanoTime();
    for (int i = 0; i < 100; ++i) {
      a = compare(args[0], args[1]); 
    }
    long t1 = System.nanoTime();
    double d0 = (double)(t1-t0)/100;
    // sync into cache
    for (int i = 0; i < 100; ++i) {
      b = regex(args[0], args[1]); 
    }
    // now measure the time
    long t2 = System.nanoTime();
    for (int i = 0; i < 100; ++i) {
      b = regex(args[0], args[1]); 
    }
    long t3 = System.nanoTime();
    double d1 = (double)(t3-t2)/100;
    // compare the results
    System.out.printf("t_compare: %7.2f nSec. t_regex: %7.2f nSec.\n",d0, d1);
    if (a.length() != b.length()) {
      System.out.println("Different length. compare:"+a+" and regex:"+b);
      System.exit(0);
    }
    System.out.println("compare() and regex() deliver the same results:"+a.equals(b)+
                       "\na:"+a+"\nb:"+b);
  }
  // work-around
  public static String compare(String str, String pat) {
    StringBuilder sb = new StringBuilder();
    for (int a = 0, b = 0, le = pat.length(), mx = str.length(); a < mx; a = b + le) {
      b = str.indexOf(pat, a);
      if (b < 0) break;
      if ((a+le) < b) sb.append(str.substring(a, b));
    }
    return sb.toString();
  }
  // regex
  public static String regex(String str, String pat) {
    return str.replaceAll("["+pat+"]+", "");
  }
}

If you compile these apps and run them in a CMD windows, you will be flabbergasted by the outcomes. And that’s what I did:

C:\JoeApp\ODB\Tools>java Split "Hello       Joe T.           Schwarz  " " "
t_split: 11131,00 nSec. t_regex: 36777,00 nSec.
split() and regex() deliver the same results
split[0]:Hello and regex[0]:Hello
split[1]:Joe and regex[1]:Joe
split[2]:T. and regex[2]:T.
split[3]:Schwarz and regex[3]:Schwarz

C:\JoeApp\ODB\Tools>java Split "Hello       Joe T.           Schwarz  " " "
t_split: 8461,00 nSec. t_regex: 36288,00 nSec.
split() and regex() deliver the same results
split[0]:Hello and regex[0]:Hello
split[1]:Joe and regex[1]:Joe
split[2]:T. and regex[2]:T.
split[3]:Schwarz and regex[3]:Schwarz

C:\JoeApp\ODB\Tools>java Compare "Hello       Joe T.           Schwarz  " " "
t_compare: 6445,00 nSec. t_regex: 29240,00 nSec.
compare() and regex() deliver the same results:true
a:HelloJoeT.Schwarz
b:HelloJoeT.Schwarz

C:\JoeApp\ODB\Tools>java Compare "Hello       Joe T.           Schwarz  " " "
t_compare: 6818,00 nSec. t_regex: 33509,00 nSec.
compare() and regex() deliver the same results:true
a:HelloJoeT.Schwarz
b:HelloJoeT.Schwarz

C:\JoeApp\ODB\Tools>
1 Like

Speed is not my concern since 0,5 -1 seconds don’t make any difference in my poject
Of course, it’s a big difference if you need to execute the code 5-10 millions times continuously.

The code doesn’t need to be fast, it just needs to be fast enough to not make the user feel annoyed.

To me, the biggest issue is maintaining (making changes or improve the regex when it needed). And my solution is to put the RexEx somewhere it can be changed dynamically (in database, in a text or json file…), which of course will take more time to load for each use, but again, speed is not my biggest concern.

83% thành viên diễn đàn không hỏi bài tập, còn bạn thì sao?