Lartu / ldpl · Merge request !93

Closed · Created May 13, 2019 by Lartu (Maintainer)

Speed up text character lookup

Overview 1 · Commits 7 · Changes 2

Created by: dvkt

These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.

I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and it looked like it was mainly in two functions: join() and utf8_substr().

The first thing I did was run some benchmarks. It looked like a += b was a heck of a lot faster than join(a, b, a), so I made the compiler special-case any concat-style joins, as sketched below.
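Roughly, the special case looks like this (a minimal sketch, assuming join() has this three-argument shape; it isn't the compiler's exact output):

#include <string>

// Generic case (assumed shape) for JOIN a AND b IN c: build the result
// in a temporary and assign it, copying the accumulated string each time.
void join(const std::string &a, const std::string &b, std::string &result) {
    result = a + b;
}

// Special case for JOIN a AND b IN a: append in place. std::string grows
// its buffer with amortized reallocation, so a loop of appends costs
// O(total length) instead of O(length^2).
void join_in_place(std::string &a, const std::string &b) {
    a += b;
}

In a loop that builds up a big string one piece at a time, the generic form re-copies everything built so far on every iteration, which is where the difference comes from.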

The second one is kinda nutty: every time you call GET CHARACTER AT - FROM - IN, the whole string has to be searched from the beginning and the positions of all the characters re-discovered, because characters in UTF-8 have variable widths. It's kind of a necessary evil, since it's so nice that LDPL supports UTF-8! But it means that if you call get character at 1000 from $filetext in var, then get character at 1001 from $filetext in var, and so on in a loop, each call re-scans the string and walks through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this can really get sluggish.
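For illustration, a naive UTF-8 char-at looks something like this (a sketch of the idea, not LDPL's actual charat(); it assumes well-formed UTF-8 and an in-range index):

#include <cstddef>
#include <string>

// Byte length of a UTF-8 character, read off its lead byte.
static std::size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;  // 0xxxxxxx: ASCII
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    return 4;                   // 11110xxx
}

// Character index and byte offset aren't the same thing in UTF-8, so
// finding character `index` means hopping lead byte to lead byte,
// starting from byte 0 every single time.
std::string utf8_charat(const std::string &s, std::size_t index) {
    std::size_t byte = 0;
    for (std::size_t ch = 0; ch < index && byte < s.size(); ++ch)
        byte += utf8_len(s[byte]);
    return s.substr(byte, utf8_len(s[byte]));
}

Every call starts over at byte 0, so N sequential gets over a string of N characters do O(N^2) work in total.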

So, I added a little utf8cache inside charat() that maps character positions to their UTF-8 byte offsets the first time a search is done on a string. This means the get in the loop basically becomes a constant-time lookup, instead of a fresh linear scan on every call. There might still be some issues with how I compute the string id, but it seems to work okay: all the tests pass and all the examples seem to work, as well as my projects.
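The idea, sketched (illustrative names and shapes, not the patch's exact internals):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// The first lookup on a string walks it once and records the byte offset
// where each character starts; later lookups are a plain vector index.
// Keyed on the string contents here just to keep the sketch simple.
static std::unordered_map<std::string, std::vector<std::size_t>> utf8cache;

std::size_t utf8_offset_of(const std::string &s, std::size_t index) {
    auto it = utf8cache.find(s);
    if (it == utf8cache.end()) {
        std::vector<std::size_t> offsets;
        for (std::size_t byte = 0; byte < s.size();) {
            offsets.push_back(byte);
            unsigned char lead = s[byte];
            // The lead byte tells us how many bytes this character spans.
            byte += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        }
        it = utf8cache.emplace(s, std::move(offsets)).first;
    }
    return it->second[index];  // constant time after the first scan
}

Keying on the full contents like this still hashes the whole string on every call, and a string that gets mutated would need its entry invalidated; that's exactly the kind of thing a cheaper per-variable string id has to get right, which is the part I'm least sure about.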

I started putting together a benchmark of the slowness, but I had to bail after 30 minutes:

LDPL 3.0.4 -- 10 runs

$ time ./slow-bin 
*** STARTING TEST ***************************************************
> performing 10 runs
^C

real    30m29.856s
user    30m26.815s
sys 0m1.417s

So I ran it again with just 1 run, and it took about a minute:

LDPL 3.0.4 -- 1 run

$ ./slow-bin
*** STARTING TEST ***************************************************
> performing 1 runs
> looped 146720 times (1 runs of 146720 chars)
> took 67 seconds
*************************************************** TEST COMPLETE ***

LDPL w/ character cache

Now here's the benchmark with this patch, starting out with 10 runs:

$ ./slow-bin 10
*** STARTING TEST ***************************************************
> performing 10 runs
> looped 3408240 times (10 runs of 340824 chars)
> took 1 seconds
*************************************************** TEST COMPLETE ***

So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.

You can run the benchmark by cloning this gist and running slow.ldpl: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e

Any other ideas for how to speed up the lookup? It seems like most people use a big UTF-8 library, which I didn't want to bring in. The cache could be improved too, though.

Source branch: github/fork/dvkt/mini-opts