[bug#39258] Faster guix search using an sqlite cache
diff mbox series

Message ID cu7pnfaar36.fsf@systemreboot.net
State New
Headers show
Series
  • [bug#39258] Faster guix search using an sqlite cache
Related show

Checks

Context Check Description
cbaines/comparison success View comparision
cbaines/git branch success View Git branch
cbaines/applying patch fail View Laminar job

Commit Message

Arun Isaac Jan. 23, 2020, 7:51 p.m. UTC
Hi,

As discussed on guix-devel at
https://lists.gnu.org/archive/html/guix-devel/2020-01/msg00310.html , I
am working on an sqlite cache to improve guix search performance. I have
attached a highly incomplete WIP patch. The patch attempts to
reimplement the package-cache-file hook in guix/channels.scm using a
sqlite database. To this end, it rewrites most of the
generate-package-cache and cache-lookup functions in gnu/packages.scm. I
am yet to hook this up to guix search.

At the moment, I am having some difficulty populating the sqlite
database. generate-package-cache populates the database correctly when
invoked from a normal guile REPL using geiser, but fails to do so when
run by the guix daemon during guix pull.

I ran guix pull using

$ ./pre-inst-env guix pull --url=$PWD --branch=search -p /tmp/test

where search is the branch I am working on.

Running

$ ls /tmp/test/lib/guix -lh

shows

total 2.1M
-r--r--r-- 2 root root 2.1M ஜன.   1  1970 package-cache.sqlite
-r--r--r-- 2 root root  26K ஜன.   1  1970 package-cache.sqlite-journal

On examining package-cache.sqlite, I find that no records have been
written. And, there is a lingering journal file that shouldn't be
there. For some reason, populating the sqlite database does not work
with guix pull. sqlite probably crashes and leaves the journal file.

If I try to populate the database with each package record being
inserted in its own transaction, at least some of the insertions
work. But the journal file still lingers. My unverified guess is that
everything except the last transaction was successful.

Any ideas what's going on?

Also, inserting each package in its own transaction is ridiculously slow
and so that is out of the question. See https://www.sqlite.org/faq.html#q19

Comments

zimoun Jan. 29, 2020, 11:33 p.m. UTC | #1
Hi Arun,

Thank you for the patch!
Cool! :-)

I have not tested it yet. Sorry.


On Thu, 23 Jan 2020 at 20:53, Arun Isaac <arunisaac@systemreboot.net> wrote:

> At the moment, I am having some difficulty populating the sqlite
> database. generate-package-cache populates the database correctly when
> invoked from a normal guile REPL using geiser, but fails to do so when
> run by the guix daemon during guix pull.

[...]

> On examining package-cache.sqlite, I find that no records have been
> written. And, there is a lingering journal file that shouldn't be
> there. For some reason, populating the sqlite database does not work
> with guix pull. sqlite probably crashes and leaves the journal file.

Hum? weird...
Is it possible that a module is loaded when Guile repl is used and not
with Guix pull?
What about "guix repl"?


> If I try to populate the database with each package record being
> inserted in its own transaction, at least some of the insertions

You mean 'commit' the database after each insertion, right?


> work. But the journal file still lingers. My unverified guess is that
> everything except the last transaction was successful.

And this does not happen with the repl, right?


> Any ideas what's going on?

I have no idea.
Weird.

What about adding 'last-insert-row-id' from 'guix/store/database.scm'?
I mean without really understanding and just grepping
'sqlite-finalize' spots that 'last-insert-row-id' seems often around.
:-)


> Also, inserting each package in its own transaction is ridiculously slow
> and so that is out of the question. See https://www.sqlite.org/faq.html#q19

Agree that it is not an option. :-)


Otherwise, 'list->string' is defined twice. And the first one is not
necessary, I guess.
The docstring of 'cache-lookup' is not coherent anymore. :-)


Cheers,
simon
Arun Isaac Jan. 30, 2020, 1:48 p.m. UTC | #2
> Hum? weird...
> Is it possible that a module is loaded when Guile repl is used and not
> with Guix pull?

It could be. But I don't know how to confirm this theory.

> What about "guix repl"?

I just tried 'guix repl'. It worked correctly, just like guile repl.

>> If I try to populate the database with each package record being
>> inserted in its own transaction, at least some of the insertions
>
> You mean 'commit' the database after each insertion, right?

Yes, that is what I mean.

>> work. But the journal file still lingers. My unverified guess is that
>> everything except the last transaction was successful.
>
> And this does not happen with the repl, right?

No, this does not happen with the repl.

>> Any ideas what's going on?
>
> I have no idea.
> Weird.

Do you know of any way sqlite can create an error log to report what's
going on? That might really help debug this issue.

> What about adding 'last-insert-row-id' from 'guix/store/database.scm'?
> I mean without really understanding and just grepping
> 'sqlite-finalize' spots that 'last-insert-row-id' seems often around.
> :-)

I tried this just now, but still the journal lingers.

> Otherwise, 'list->string' is defined twice. And the first one is not
> necessary, I guess.

Ah, thanks for catching this!

> The docstring of 'cache-lookup' is not coherent anymore. :-)

Yes, I haven't gotten around to fixing up all those yet. I thought I'll
get the code working first.
zimoun Jan. 31, 2020, 12:48 p.m. UTC | #3
Hi,

On Thu, 30 Jan 2020 at 14:49, Arun Isaac <arunisaac@systemreboot.net> wrote:

> >> Any ideas what's going on?
[...]
> Do you know of any way sqlite can create an error log to report what's
> going on? That might really help debug this issue.

Danny told me something like:

  (catch (sqlite-error

I have not tried yet.



> > The docstring of 'cache-lookup' is not coherent anymore. :-)
>
> Yes, I haven't gotten around to fixing up all those yet. I thought I'll
> get the code working first.

Yes, I imagine. Just to notice. :-)



All the best,
simon
Arun Isaac Feb. 2, 2020, 9:16 p.m. UTC | #4
> Danny told me something like:
>
>   (catch (sqlite-error
>
> I have not tried yet.

Thank you, this was useful. I was able to catch and report the error. I
also found the log file for the guix-package-cache profile hook. It says

(repl-version 0 0)
Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'...
(exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file"))

This could be a permission error, or something to do with the existence
or lack thereof of certain directories (such as /var) in the chroot of
the build daemon. I'm still figuring it out.

I'm also in half a mind to get some guile xapian bindings ready so we
can just do that instead of messing with sqlite here. But, let's
see. :-P
zimoun Feb. 4, 2020, 10:19 a.m. UTC | #5
Hi,

On Sun, 2 Feb 2020 at 22:16, Arun Isaac <arunisaac@systemreboot.net> wrote:

> Thank you, this was useful. I was able to catch and report the error. I

Where have you reported the error?


> also found the log file for the guix-package-cache profile hook. It says
>
> (repl-version 0 0)
> Generating package cache for '/gnu/store/b6f9b5qbcn4r932whrr6m15rdimbgrhs-profile'...
> (exception sqlite-error (value sqlite-open) (value 14) (value "Unable to open the database file"))
>
> This could be a permission error, or something to do with the existence
> or lack thereof of certain directories (such as /var) in the chroot of
> the build daemon. I'm still figuring it out.

Hum? And this should explain why it is working with the REPL and not
with the CLI, right?


> I'm also in half a mind to get some guile xapian bindings ready so we
> can just do that instead of messing with sqlite here. But, let's
> see. :-P

Cool!
Let me know if you push something somewhere.


Cheers,
simon
zimoun April 26, 2020, 3:54 a.m. UTC | #6
Hi,

Thank you Arun for the patches and all the work.  Sorryfor the delay.


TLDR:

 1) around 25 seconds added to "guix pull"... but I am more than often
waiting around 10 minutes when pulling.
 2) the speedup is clear: more than 2x.


The question is the tradeoff between: the slowdown of pull vs the
speedup of search. What is acceptable?


Here let benchmark 3 versions of Guix:

 - default is a357849f5b
 - v2 rebased on default and based on Xapian
 - v3 rebased on default too and based on "custom" index

and let compare the time of "guix pull" and then "guix search".
Because v2 uses Xapian, the accuracy is different and so the list of
outputs is different depending on the query; the impact on the
performance seems minimal.  Let discuss elsewhere about accuracy and
BM25 and let focus on performance for now.


* guix pull
-----------

The idea is: measure if computing the new index is expensive or not,
compared to all of what "guix pull" computes.


** Reference
------------

Maybe, I should have misconfigured something or my laptop is really
not powerful at all, but here some numbers.

(Note: /proc/cpuinfo says 4 times Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
and /sys/block/sda/queue/rotational says 0 which is SSD.)

--8<---------------cut here---------------start------------->8---
$ guix describe
Generation 8    Apr 25 2020 09:00:01    (current)
  guix f84b036
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: f84b0363053e5479464f6ce6ded45f80360d90fc
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time guix pull -C ~/.config/guix/default-channels.scm
Updating channel 'guix' from Git repository at
'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git   8cf6d15
downloading from
https://ci.guix.gnu.org/nar/gzip/xgakzpfs3rz57m666hsk1v3d3zcy7wgn-config.scm
...
 config.scm

[...]

building fonts directory...
building directory of Info manuals...
building database for manual pages...
building profile with 1 package...
building /gnu/store/kq1zlj5rxz8wrxc3ha8vck2wv2iakfnb-inferior-script.scm.drv...
building package cache...
building profile with 1 package...
New in this revision:
  2 new packages: cl-osicat, sbcl-osicat


real    13m37.997s
user    1m38.129s
sys     0m0.856s
--8<---------------cut here---------------end--------------->8---


And because "guix search" is used say 10 times more than "guix pull",
an increase of 10% of "guix pull" will ease the experience of the user
if "guix search" is faster, IMHO.

Therefore, because "guix pull" takes around 13 minutes, the extra cost
to index all the packages can be roughly 1min30s (at most).


Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes:

real    2m13.693s
user    1m37.418s
sys     0m0.666s

so in this case 10% means around 7s. But after 1 minute waiting, the
command feels too long to me and personally I am already waiting so I
do not mind much if it would take 2m13s or 3m00s.


Well, it is hard to draw a clear line about what could be accepted as
the time of indexing because the time of pulling is already highly
variable.


What is the average of "guix pull"?

It could be really interresting to probe the users.  They could report:
 - guix describe
 - time guix pull
whatever which channels are up.

Just to have an idea about what should be the acceptable extra time
added by indexing.  For sure it depends on the hardware but it would
provide an idea and help to see if the extra time is worth or not.

WDYT?



** Let's compare the index time
-------------------------------

Let pull for the 3 cases and populate the store by all the necessary
items.  Could be looooonng! (20minutes)  For example, for the version
2 of patches -- living in my branch 'search-v2' using a worktree.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix pull -p /tmp/v2 \
     --url=$PWD --branch=search-v2 \
     -C ~/.config/guix/default-channels.scm
--8<---------------cut here---------------end--------------->8---

and then let spot the index file for each version:

--8<---------------cut here---------------start------------->8---
# ls -l /tmp/default/lib/guix
/gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache

readlink /tmp/v2/lib/guix/package-search.index
/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index

readlink /tmp/v3/lib/guix/package-metadata.cache
/gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache/lib/guix/package-metadata.cache
--8<---------------cut here---------------end--------------->8---

Well, let remove the profiles and garbage collect the index files:

--8<---------------cut here---------------start------------->8---
rm /tmp/default /tmp/v{2,3}*
guix gc -D \
   /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
   /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
   /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache
--8<---------------cut here---------------end--------------->8---


And then re-run "guix pull". We are now comparing apple to apple, I guess.


| time | default   | v2        | v3        |
|------+-----------+-----------+-----------|
| real | 1m11.899s | 1m30.806s | 1m34.341s |
| user | 1m23.845s | 1m24.160s | 1m24.233s |
| sys  | 0m0.570s  | 0m0.563s  | 0m0.529s  |


Therefore less than extra 20s and 25s for v2 and v3.


All the question is an extra 25s compared to which time of "guix pull":
 - more than 13m: adding 25s is acceptable
 - less than 2m: adding 25s is questionable

Usually, my feeling about "guix pull" is... I am waiting!  Therefore,
I will not see this extra 25s because it is masked by all the other
work "guix pull" is doing.


* guix search
-------------

Let compare cold (sudo echo 3 > /proc/sys/vm/drop_caches) and warm
cache.  For example for the query 'inkscape'.


| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m1.842s | 0m0.331s | 0m0.437s |
| user | 0m1.270s | 0m0.179s | 0m0.336s |
| sys  | 0m0.142s | 0m0.047s | 0m0.052s |
|------+----------+----------+----------|
| real | 0m0.898s | 0m0.132s | 0m0.292s |
| user | 0m1.069s | 0m0.168s | 0m0.353s |
| sys  | 0m0.072s | 0m0.008s | 0m0.019s |


Therefore the speedup is at least 3.

| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           5.6 |           4.2 |
| warm  |           6.8 |           3.1 |


Another query:

--8<---------------cut here---------------start------------->8---
time guix search crypto library | recsel -P name | grep libb2
--8<---------------cut here---------------end--------------->8---

| time | default  | v2       | v3       |
|------+----------+----------+----------|
| real | 0m2.216s | 0m1.109s | 0m0.689s |
| user | 0m1.655s | 0m1.309s | 0m0.683s |
| sys  | 0m0.193s | 0m0.073s | 0m0.035s |
|------+----------+----------+----------|
| real | 0m1.197s | 0m0.490s | 0m0.491s |
| user | 0m1.448s | 0m0.819s | 0m0.625s |
| sys  | 0m0.089s | 0m0.034s | 0m0.039s |


| cache | default-vs-v2 | default-vs-v3 |
|-------+---------------+---------------|
| cold  |           2.0 |           3.2 |
| warm  |           2.4 |           2.4 |




Before going further, especially about any other more sophisticated
inverted index (BM25), it appears to me important to fix what is
"cost" on "guix pull" that the users are ready to pay.  Because
somehow the inverted index has to be computed.  And without an
inverted index, it seems difficult to improve the accurary.

One solution should be: let compute the inverted index in the
background with a low priority.  If the index is not done yet when
"guix search" is called, then fallback to the current default
behaviour.


WDYT?


Cheers,
simon
Pierre Neidhardt April 26, 2020, 7:29 a.m. UTC | #7
Hi Simon,

Thanks for taking the time to benchmark this, this is very insightful!

> Usually, my feeling about "guix pull" is... I am waiting!  Therefore,
> I will not see this extra 25s because it is masked by all the other
> work "guix pull" is doing.

I agree and this is a very good point in my opinion.
While I don't expect nor do I need "guix pull" to complete immediately,
this is not true of "guix search".

As Simon suggested, maybe we can wrap a benchmark script together, post
it on the mailing list and ask member to report their results.  Maybe
a few dozen results would give us a better idea of the numbers we are
dealing with.

Cheers!
Ludovic Courtès April 26, 2020, 3:49 p.m. UTC | #8
Hi,

zimoun <zimon.toutoune@gmail.com> skribis:

>  1) around 25 seconds added to "guix pull"... but I am more than often
> waiting around 10 minutes when pulling.
>  2) the speedup is clear: more than 2x.

Nice!

It does seem like Arun’s v3 (or maybe even v2) would work nicely.

> The question is the tradeoff between: the slowdown of pull vs the
> speedup of search. What is acceptable?

That’s only one criterion among others.  I hear the argument that 25s is
“nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
it nice.  :-)

> $ time guix pull -C ~/.config/guix/default-channels.scm

It also depends on what’s in that file, of course.

> Then, if I pull back from 8cf6d15 to '--commit=a357849f5b' then it takes:
>
> real    2m13.693s
> user    1m37.418s
> sys     0m0.666s

For me:

--8<---------------cut here---------------start------------->8---
$ guix describe
Generacio 139   Apr 13 2020 21:50:08    (nuna)
  guix bad368b
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: bad368b0d794689f3a8a11b58f1ea4987938682e
$ time guix pull -p /tmp/test --commit=bad368b0d794689f3a8a11b58f1ea4987938682e
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
Building from this channel:
  guix      https://git.savannah.gnu.org/git/guix.git   bad368b

[...]

real    0m57.916s
user    1m1.017s
sys     0m0.609s
--8<---------------cut here---------------end--------------->8---

(On a 2.6 GHz i7 though.)

> Well, let remove the profiles and garbage collect the index files:
>
> rm /tmp/default /tmp/v{2,3}*
> guix gc -D \
>    /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
>    /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
>    /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache

Could you do, for v2 and v3:

  time guix build /gnu/store/…-package-metadata-cache.drv --check

?

That we’ll give us the exact cost of that part.  It’ll be interesting
especially in the Xapian case, which we expected to be higher.

Thanks for the insightful benchmarks!

Ludo’.
zimoun April 26, 2020, 5:01 p.m. UTC | #9
Hi Ludo,

On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote:

> It does seem like Arun’s v3 (or maybe even v2) would work nicely.

The v3 is more interesting because it does not change the relevance
scoring and does not add other dependency.
However v2 is interesting to easily test BM25 which is another
relevance scoring... work in progress. :-)


> > The question is the tradeoff between: the slowdown of pull vs the
> > speedup of search. What is acceptable?
>
> That’s only one criterion among others.  I hear the argument that 25s is
> “nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
> spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
> it nice.  :-)

And I expect that the middle-term roadmap would even decrease more the
computations of derivations. ;-)



> > $ time guix pull -C ~/.config/guix/default-channels.scm
>
> It also depends on what’s in that file, of course.

Contains only one line: %default-channels

See my wishlist ;-)
https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00393.html



me:  2m13.693s
you: 0m57.916s

As we already discussed elsewhere, it is hard to "test" 'guix pull'.
Does it make sense to measure "guix pull"? As Chris (Marusich) did for
CDN.


> > Well, let remove the profiles and garbage collect the index files:
> >
> > rm /tmp/default /tmp/v{2,3}*
> > guix gc -D \
> >    /gnu/store/g5c08vqsv31nkn2r0hr32dbrkhf3cvd8-guix-package-cache \
> >    /gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index \
> >    /gnu/store/8j78b5c4ddic21gcx7wpbq2akjn7x7mr-guix-package-metadata-cache
>
> Could you do, for v2 and v3:
>
>   time guix build /gnu/store/…-package-metadata-cache.drv --check

Newbie me! :-)

Two points:

   1. It may not be reproducible... I am checking.
   2. The time seems similar (v2=26s and v3=29s) considering the time
to start Guile and so on.

--8<---------------cut here---------------start------------->8---
guix gc --list-live | grep metadata
time /tmp/v3/bin/guix build
/gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv
--check
The following profile hook will be built:
   /gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv
building package cache...
(repl-version 0 1 1)
Generating package metadata cache for
'/gnu/store/95mi525syinh08jmcd3q7a7a8mr1sykb-profile'...
(values (value "/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache/lib/guix/package-metadata.cache"))
guix build: error: derivation
`/gnu/store/jxs0abica8kjz1ppym95df97jk0qa9by-guix-package-metadata-cache.drv'
may not be deterministic: output
`/gnu/store/zhp7wv87vr6iis0fa3ff925i5r04i08q-guix-package-metadata-cache'
differs

real    0m29.788s
user    0m0.535s
sys    0m0.025s
--8<---------------cut here---------------end--------------->8---


> That we’ll give us the exact cost of that part.  It’ll be interesting
> especially in the Xapian case, which we expected to be higher.

--8<---------------cut here---------------start------------->8---
time /tmp/v2/bin/guix build
/gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv
--check
The following profile hook will be built:
   /gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv
running profile hook of type 'package-search-index'...
(repl-version 0 1 1)
Generating package search index for
'/gnu/store/wiinj9nrb45wlf2cgbgkjl9chxz9cb9b-profile'...
(values (value "/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index/lib/guix/package-search.index"))
guix build: error: derivation
`/gnu/store/w0dhl2n3ngi4v2ld8lprkqjl1g1q2m4p-guix-package-search-index.drv'
may not be deterministic: output
`/gnu/store/8xbzhn81hmshagbgazmnr7xfps1cdsa3-guix-package-search-index'
differs

real    0m26.552s
user    0m0.626s
sys    0m0.046s
--8<---------------cut here---------------end--------------->8---

It is not higher. Why should it be?


Considering aside the issue of reproducibility -- which should be one!
-- well, should be possible to download the index file as any other
substitute?


Cheers,
simon
Ludovic Courtès April 26, 2020, 8:22 p.m. UTC | #10
zimoun <zimon.toutoune@gmail.com> skribis:

>> Could you do, for v2 and v3:
>>
>>   time guix build /gnu/store/…-package-metadata-cache.drv --check
>
> Newbie me! :-)
>
> Two points:
>
>    1. It may not be reproducible... I am checking.
>    2. The time seems similar (v2=26s and v3=29s) considering the time
> to start Guile and so on.

Good, so it means that’s not Xapian taking time here.

Thanks again!

Ludo’.
zimoun April 30, 2020, 1:10 p.m. UTC | #11
Hi Ludo,

On Sun, 26 Apr 2020 at 17:49, Ludovic Courtès <ludo@gnu.org> wrote:

> That’s only one criterion among others.  I hear the argument that 25s is
> “nothing” compared to the rest, but it’s really a tradeoff.  Like, if I
> spent a day optimizing ‘guix pull’ and managed to save 25s, I would find
> it nice.  :-)

I am not sure to understand all what "guix pull" does.
Does "guix pull" compile all the scheme files under 'gnu/'? Probably
only recompiles the "new" files?

I do not know if it makes sense, but I just note this difference:

 1. Search without compiling of all files under 'gnu/packages/'
 2. Compile all the files under 'gnu/packages/' then search
 3. Search with only the file gnu/packages/emacs-xyz.scm not compiled
(all the other files are compiled)
 4. Compile the file above and then search

3b and 4b with gnu/packages/cobol.scm which is smaller than emacs-xyz.scm.


Results:

1) 1m43.312s
2) 0m1.301s (but 9m51.801s compiling)

3) 0m6.526s
4) 0m1.389s (1m8.670s compiling)

3b) 0m0.921s
4b) 0m0.924s (0m1.884s compiling)

Therefore, an option to reduce the time when pulling should to relax
the "compilation" for 'gnu/packages/' and 'gnu/services'; something
less optimized since the packages and services "just" need to be
transformed into bytecode to improve IO when reading them. Perhaps I
miss a point...

And maybe, it is similar than what Andy Wingo is proposing in [1].

[1] https://lists.gnu.org/archive/html/guix-devel/2020-04/msg00444.html


Cheers,
simon

--8<---------------cut here---------------start------------->8---
find gnu/packages -name "*.scm" -type f -exec touch {} \;
time ./pre-inst-env guix search gmsh | recsel -C -p name

;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/abduco.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go

[...]

;;; note: source file /home/simon/src/guix/wk/tmp/gnu/packages/zwave.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/zwave.go
name: gmsh

real    1m43.312s
user    2m19.318s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
find gnu/packages -name "*.scm" -type f -exec touch {} \;
time make -j4 && time ./pre-inst-env guix search gmsh | recsel -C -p name

make  all-recursive
make[1]: Entering directory '/home/simon/src/guix/wk/tmp'
Making all in po/guix
make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/guix'
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/guix'
Making all in po/packages
make[2]: Entering directory '/home/simon/src/guix/wk/tmp/po/packages'
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp/po/packages'
make[2]: Entering directory '/home/simon/src/guix/wk/tmp'
Compiling Scheme modules...
[  0%] LOAD     gnu/packages/abduco.scm
;;; note: source file ./gnu/packages/abduco.scm
;;;       newer than compiled /home/simon/src/guix/wk/tmp/gnu/packages/abduco.go

[...]

[100%] GUILEC   gnu/packages/zwave.go
make[2]: Leaving directory '/home/simon/src/guix/wk/tmp'
make[1]: Leaving directory '/home/simon/src/guix/wk/tmp'

real    9m51.801s
user    29m18.938s
sys     0m5.822s
name: gmsh

real    0m1.301s
user    0m1.266s
sys     0m0.101s
--8<---------------cut here---------------end--------------->8---
zimoun June 1, 2020, 10:11 a.m. UTC | #12
Dear,

> > Often our search strings are only literal strings. So, we can save some time
> > by using string-contains instead of invoking the regexp engine. Patch 2 does
> > this. In addition, guile's string-contains uses a naive O(n^2) string search
> > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
> > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
> > mentions this. If implemented, the KMP algorithm could speed up guix search
> > further.
> >
> > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

It could improve.
Well, I will try to do some back-to-envelop computations because I am
not convinced that the mean value of 'n' (length of description,
isn't) is large enough to really see an improvement for the end-user;
the visible bottleneck is I/O.

All the best,
simon

ps;
To be honest, I thought this kind of algorithm was the default. :-)
Leo Famulari June 1, 2020, 10:24 p.m. UTC | #13
On Mon, Jun 01, 2020 at 12:11:52PM +0200, zimoun wrote:
> Dear,
> 
> > > Often our search strings are only literal strings. So, we can save some time
> > > by using string-contains instead of invoking the regexp engine. Patch 2 does
> > > this. In addition, guile's string-contains uses a naive O(n^2) string search
> > > algorithm. We should perhaps use the O(n) Knuth-Morris-Pratt algorithm[1]. In
> > > fact, a comment on line 2006 of libguile/srfi-13.c in the guile source code
> > > mentions this. If implemented, the KMP algorithm could speed up guix search
> > > further.
> > >
> > > [1]: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
> 
> It could improve.
> Well, I will try to do some back-to-envelop computations because I am
> not convinced that the mean value of 'n' (length of description,
> isn't) is large enough to really see an improvement for the end-user;
> the visible bottleneck is I/O.
> 
> All the best,
> simon
> 
> ps;
> To be honest, I thought this kind of algorithm was the default. :-)

I also recommend taking a look at the Boyer Moore string search
implementation in (guix build grafts).

It would be great to generalize it and make it accessible to other parts
of Guix.
Arun Isaac June 1, 2020, 11:48 p.m. UTC | #14
> I also recommend taking a look at the Boyer Moore string search
> implementation in (guix build grafts).

Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take
a look at it. At the very least, I need something similar for
guile-email.

But, the current implementation of guile's string-contains is in C. So,
I assume a KMP or Boyer Moore implementation of string-contains should
also be in C.
Ludovic Courtès June 2, 2020, 8:49 a.m. UTC | #15
Arun Isaac <arunisaac@systemreboot.net> skribis:

>> I also recommend taking a look at the Boyer Moore string search
>> implementation in (guix build grafts).
>
> Nice, I didn't know Guix had an implementation of Boyer Moore. I'll take
> a look at it. At the very least, I need something similar for
> guile-email.
>
> But, the current implementation of guile's string-contains is in C. So,
> I assume a KMP or Boyer Moore implementation of string-contains should
> also be in C.

Not necessarily.  But it’d be great to have it in Guile proper, for sure!

Ludo’.

Patch
diff mbox series

From d1305351a90a84eb75e4769284d5e06927eade3e Mon Sep 17 00:00:00 2001
From: Arun Isaac <arunisaac@systemreboot.net>
Date: Tue, 21 Jan 2020 20:45:43 +0530
Subject: [PATCH] fast search

---
 build-aux/build-self.scm |   5 +
 gnu/packages.scm         | 207 +++++++++++++++++++++++----------------
 2 files changed, 128 insertions(+), 84 deletions(-)

diff --git a/build-aux/build-self.scm b/build-aux/build-self.scm
index fc13032b73..c123ad3b11 100644
--- a/build-aux/build-self.scm
+++ b/build-aux/build-self.scm
@@ -264,6 +264,9 @@  interface (FFI) of Guile.")
   (define fake-git
     (scheme-file "git.scm" #~(define-module (git))))
 
+  (define fake-sqlite3
+    (scheme-file "sqlite3.scm" #~(define-module (sqlite3))))
+
   (with-imported-modules `(((guix config)
                             => ,(make-config.scm))
 
@@ -278,6 +281,8 @@  interface (FFI) of Guile.")
                            ;; (git) to placate it.
                            ((git) => ,fake-git)
 
+                           ((sqlite3) => ,fake-sqlite3)
+
                            ,@(source-module-closure `((guix store)
                                                       (guix self)
                                                       (guix derivations)
diff --git a/gnu/packages.scm b/gnu/packages.scm
index d22c992bb1..4e2c52e62d 100644
--- a/gnu/packages.scm
+++ b/gnu/packages.scm
@@ -43,6 +43,7 @@ 
   #:use-module (srfi srfi-34)
   #:use-module (srfi srfi-35)
   #:use-module (srfi srfi-39)
+  #:use-module (sqlite3)
   #:export (search-patch
             search-patches
             search-auxiliary-file
@@ -204,10 +205,8 @@  PROC is called along these lines:
 PROC can use #:allow-other-keys to ignore the bits it's not interested in.
 When a package cache is available, this procedure does not actually load any
 package module."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
+  (if (and (cache-is-authoritative?)
+           (current-profile))
       (vhash-fold (lambda (name vector result)
                     (match vector
                       (#(name version module symbol outputs
@@ -220,7 +219,7 @@  package module."
                              #:supported? supported?
                              #:deprecated? deprecated?))))
                   init
-                  cache)
+                  (cache-lookup (current-profile)))
       (fold-packages (lambda (package result)
                        (proc (package-name package)
                              (package-version package)
@@ -252,31 +251,7 @@  is guaranteed to never traverse the same package twice."
 
 (define %package-cache-file
   ;; Location of the package cache.
-  "/lib/guix/package.cache")
-
-(define load-package-cache
-  (mlambda (profile)
-    "Attempt to load the package cache.  On success return a vhash keyed by
-package names.  Return #f on failure."
-    (match profile
-      (#f #f)
-      (profile
-       (catch 'system-error
-         (lambda ()
-           (define lst
-             (load-compiled (string-append profile %package-cache-file)))
-           (fold (lambda (item vhash)
-                   (match item
-                     (#(name version module symbol outputs
-                             supported? deprecated?
-                             file line column)
-                      (vhash-cons name item vhash))))
-                 vlist-null
-                 lst))
-         (lambda args
-           (if (= ENOENT (system-error-errno args))
-               #f
-               (apply throw args))))))))
+  "/lib/guix/package-cache.sqlite")
 
 (define find-packages-by-name/direct              ;bypass the cache
   (let ((packages (delay
@@ -297,25 +272,57 @@  decreasing version order."
                     matching)
             matching)))))
 
-(define (cache-lookup cache name)
+(define* (cache-lookup profile #:optional name)
   "Lookup package NAME in CACHE.  Return a list sorted in increasing version
 order."
   (define (package-version<? v1 v2)
     (version>? (vector-ref v2 1) (vector-ref v1 1)))
 
-  (sort (vhash-fold* cons '() name cache)
-        package-version<?))
+  (define (int->boolean n)
+    (case n
+      ((0) #f)
+      ((1) #t)))
+
+  (define (string->list str)
+    (call-with-input-string str read))
+
+  (define select-statement
+    (string-append
+     "SELECT name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn from packages"
+     (if name " WHERE name = :name" "")))
+
+  (define cache-file
+    (string-append profile %package-cache-file))
+
+  (let* ((db (sqlite-open cache-file SQLITE_OPEN_READONLY))
+         (statement (sqlite-prepare db select-statement)))
+    (when name
+      (sqlite-bind-arguments statement #:name name))
+    (let ((result (sqlite-fold (lambda (v result)
+                                 (match v
+                                   (#(name version module symbol outputs supported superseded file line column)
+                                    (cons
+                                     (vector name
+                                             version
+                                             (string->list module)
+                                             (string->symbol symbol)
+                                             (string->list outputs)
+                                             (int->boolean supported)
+                                             (int->boolean superseded)
+                                             (list file line column))
+                                     result))))
+                               '() statement)))
+      (sqlite-finalize statement)
+      (sqlite-close db)
+      (sort result package-version<?))))
 
 (define* (find-packages-by-name name #:optional version)
   "Return the list of packages with the given NAME.  If VERSION is not #f,
 then only return packages whose version is prefixed by VERSION, sorted in
 decreasing version order."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and (cache-is-authoritative?) cache)
-      (match (cache-lookup cache name)
-        (#f #f)
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(_ versions modules symbols _ _ _ _ _ _) ...)
          (fold (lambda (version* module symbol result)
                  (if (or (not version)
@@ -331,12 +338,9 @@  decreasing version order."
 (define* (find-package-locations name #:optional version)
   "Return a list of version/location pairs corresponding to each package
 matching NAME and VERSION."
-  (define cache
-    (load-package-cache (current-profile)))
-
-  (if (and cache (cache-is-authoritative?))
-      (match (cache-lookup cache name)
-        (#f '())
+  (if (and (cache-is-authoritative?)
+           (current-profile))
+      (match (cache-lookup (current-profile) name)
         ((#(name versions modules symbols outputs
                  supported? deprecated?
                  files lines columns) ...)
@@ -372,6 +376,9 @@  VERSION."
 ;; Prevent Guile 3 from inlining this procedure so we can mock it in tests.
 (set! find-best-packages-by-name find-best-packages-by-name)
 
+(define (list->string x)
+  (call-with-output-string (cut write x <>)))
+
 (define (generate-package-cache directory)
   "Generate under DIRECTORY a cache of all the available packages.
 
@@ -381,49 +388,81 @@  reducing the memory footprint."
   (define cache-file
     (string-append directory %package-cache-file))
 
-  (define (expand-cache module symbol variable result+seen)
+  (define schema
+    "CREATE TABLE packages (name text,
+version text,
+module text,
+symbol text,
+outputs text,
+supported int,
+superseded int,
+locationFile text,
+locationLine int,
+locationColumn int);
+CREATE VIRTUAL TABLE packageSearch USING fts5(name, searchText);")
+
+  (define insert-statement
+    "INSERT INTO packages(name, version, module, symbol, outputs, supported, superseded, locationFile, locationLine, locationColumn)
+VALUES(:name, :version, :module, :symbol, :outputs, :supported, :superseded, :locationfile, :locationline, :locationcolumn)")
+
+  (define insert-package-search-statement
+    "INSERT INTO packageSearch(name, searchText) VALUES(:name, :searchtext)")
+
+  (define (boolean->int x)
+    (if x 1 0))
+
+  (define (list->string x)
+    (call-with-output-string (cut write x <>)))
+
+  (define (insert-package db module symbol variable seen)
     (match (false-if-exception (variable-ref variable))
       ((? package? package)
-       (match result+seen
-         ((result . seen)
-          (if (or (vhash-assq package seen)
-                  (hidden-package? package))
-              (cons result seen)
-              (cons (cons `#(,(package-name package)
-                             ,(package-version package)
-                             ,(module-name module)
-                             ,symbol
-                             ,(package-outputs package)
-                             ,(->bool (supported-package? package))
-                             ,(->bool (package-superseded package))
-                             ,@(let ((loc (package-location package)))
-                                 (if loc
-                                     `(,(location-file loc)
-                                       ,(location-line loc)
-                                       ,(location-column loc))
-                                     '(#f #f #f))))
-                          result)
-                    (vhash-consq package #t seen))))))
-      (_
-       result+seen)))
-
-  (define exp
-    (first
-     (fold-module-public-variables* expand-cache
-                                    (cons '() vlist-null)
-                                    (all-modules (%package-module-path)
-                                                 #:warn
-                                                 warn-about-load-error))))
+       (cond
+        ((or (vhash-assq package seen)
+             (hidden-package? package))
+         seen)
+        (else
+         (let ((statement (sqlite-prepare db insert-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:version (package-version package)
+                                  #:module (list->string (module-name module))
+                                  #:symbol (symbol->string symbol)
+                                  #:outputs (list->string (package-outputs package))
+                                  #:supported (boolean->int (supported-package? package))
+                                  #:superseded (boolean->int (package-superseded package))
+                                  #:locationfile (cond
+                                                  ((package-location package) => location-file)
+                                                  (else #f))
+                                  #:locationline (cond
+                                                  ((package-location package) => location-line)
+                                                  (else #f))
+                                  #:locationcolumn (cond
+                                                    ((package-location package) => location-column)
+                                                    (else #f)))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (let ((statement (sqlite-prepare db insert-package-search-statement)))
+           (sqlite-bind-arguments statement
+                                  #:name (package-name package)
+                                  #:searchtext (package-description package))
+           (sqlite-fold cons '() statement)
+           (sqlite-finalize statement))
+         (vhash-consq package #t seen))))
+      (_ seen)))
 
   (mkdir-p (dirname cache-file))
-  (call-with-output-file cache-file
-    (lambda (port)
-      ;; Store the cache as a '.go' file.  This makes loading fast and reduces
-      ;; heap usage since some of the static data is directly mmapped.
-      (put-bytevector port
-                      (compile `'(,@exp)
-                               #:to 'bytecode
-                               #:opts '(#:to-file? #t)))))
+  (let ((db (sqlite-open cache-file)))
+    (sqlite-exec db schema)
+    (sqlite-exec db "BEGIN")
+    (fold-module-public-variables* (cut insert-package db <> <> <> <>)
+                                   vlist-null
+                                   (all-modules (%package-module-path)
+                                                #:warn
+                                                warn-about-load-error))
+    (sqlite-exec db "COMMIT;")
+    (sqlite-close db))
+
   cache-file)
 
 
-- 
2.23.0