Lartu / ldpl · Merge request !93

Closed · Created May 13, 2019 by Lartu (Maintainer)

Speed up text character lookup

Overview 1 · Commits 7 · Changes 2

Created by: dvkt

These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.

I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and it looked like it was mainly in two functions: join() and utf8_substr().

The first thing I did was run some benchmarks. It looked like a += b was a heck of a lot faster than join(a, b, a), so I made the compiler special-case any concat-style joins, as sketched below.
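Roughly, the special case looks like this (a minimal sketch, assuming join() has this three-argument shape; it isn't the compiler's exact output):

#include <string>

// Generic case (assumed shape) for JOIN a AND b IN c: build the result
// in a temporary and assign it, copying the accumulated string each time.
void join(const std::string &a, const std::string &b, std::string &result) {
    result = a + b;
}

// Special case for JOIN a AND b IN a: append in place. std::string grows
// its buffer with amortized reallocation, so a loop of appends costs
// O(total length) instead of O(length^2).
void join_in_place(std::string &a, const std::string &b) {
    a += b;
}

In a loop that builds up a big string one piece at a time, the generic form re-copies everything built so far on every iteration, which is where the difference comes from.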

The second one is kinda nutty: every time you call GET CHARACTER AT - FROM - IN, the whole string has to be searched from the beginning and the positions of all the characters re-discovered, because characters in UTF-8 have variable widths. It's kind of a necessary evil, since it's so nice that LDPL supports UTF-8! But it means that if you call get character at 1000 from $filetext in var, then get character at 1001 from $filetext in var, and so on in a loop, each call re-scans the string and walks through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this can really get sluggish.
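For illustration, a naive UTF-8 char-at looks something like this (a sketch of the idea, not LDPL's actual charat(); it assumes well-formed UTF-8 and an in-range index):

#include <cstddef>
#include <string>

// Byte length of a UTF-8 character, read off its lead byte.
static std::size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    return 4;                   // 11110xxx
}

// Character index and byte offset aren't the same thing in UTF-8, so
// finding character `index` means hopping lead byte to lead byte,
// starting from byte 0 every single time.
std::string utf8_charat(const std::string &s, std::size_t index) {
    std::size_t byte = 0;
    for (std::size_t ch = 0; ch < index && byte < s.size(); ++ch)
        byte += utf8_len(s[byte]);
    return s.substr(byte, utf8_len(s[byte]));
}

Every call starts over at byte 0, so N sequential gets over a string of N characters do O(N^2) work in total.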

So, I added a little utf8cache inside charat() that maps character positions to their UTF-8 byte offsets the first time a search is done on a string. This means the get in the loop basically becomes a constant-time lookup, instead of a fresh linear scan on every call. There might still be some issues with how I compute the string id, but it seems to work okay: all the tests pass and all the examples seem to work, as well as my projects.
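The idea, sketched (illustrative names and shapes, not the patch's exact internals):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// The first lookup on a string walks it once and records the byte offset
// where each character starts; later lookups are a plain vector index.
// Keyed on the string contents here just to keep the sketch simple.
static std::unordered_map<std::string, std::vector<std::size_t>> utf8cache;

std::size_t utf8_offset_of(const std::string &s, std::size_t index) {
    auto it = utf8cache.find(s);
    if (it == utf8cache.end()) {
        std::vector<std::size_t> offsets;
        for (std::size_t byte = 0; byte < s.size();) {
            offsets.push_back(byte);
            unsigned char lead = s[byte];
            // The lead byte tells us how many bytes this character spans.
            byte += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        }
        it = utf8cache.emplace(s, std::move(offsets)).first;
    }
    return it->second[index];  // constant time after the first scan
}

Keying on the full contents like this still hashes the whole string on every call, and a string that gets mutated would need its entry invalidated; that's exactly the kind of thing a cheaper per-variable string id has to get right, which is the part I'm least sure about.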

I started putting together a benchmark of the slowness, but I had to bail after 30 minutes:

LDPL 3.0.4 -- 10 runs

$ time ./slow-bin 
*** STARTING TEST ***************************************************
> performing 10 runs
^C

real    30m29.856s
user    30m26.815s
sys 0m1.417s

So I ran it again with just 1 run, and it took about a minute:

LDPL 3.0.4 -- 1 run

$ ./slow-bin
*** STARTING TEST ***************************************************
> performing 1 runs
> looped 146720 times (1 runs of 146720 chars)
> took 67 seconds
*************************************************** TEST COMPLETE ***

LDPL w/ character cache

Now here's the benchmark with this patch, starting out with 10 runs:

$ ./slow-bin 10
*** STARTING TEST ***************************************************
> performing 10 runs
> looped 3408240 times (10 runs of 340824 chars)
> took 1 seconds
*************************************************** TEST COMPLETE ***

So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.

You can run the benchmark by cloning this gist and running slow.ldpl: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e

Any other ideas for how to speed up the lookup? It seems like most people use a big UTF-8 library, which I didn't want to bring in. The cache could be improved too, though.

Source branch: github/fork/dvkt/mini-opts