You could also prefix it with a varint: in that case it would not require any extra byte compared to a NUL-terminated string as long as the string is shorter than 128 chars.
I benchmarked extensively 4-5 years ago, but I don't have those numbers with me. Tries are quite expensive memory-wise by design, but I found that ART gave the best balance between speed (by exploiting cache locality) and memory. The state of the art might have improved by now.
As far as Typesense goes though, I found that the actual posting lists, document listings, and other faceting/sorting-related indexing data structures are where the bigger overhead is, especially for larger datasets.
Thanks for the feedback.
My issue is that I allocate only a few MB to my indexing thread, so I'm looking for a more efficient implementation to avoid having to produce and then merge too many segments from disk.
I'm currently considering using compressed pointers on some parts of the tree to reduce the memory footprint as much as I can. Let's see how it goes...
With the Covid crisis we were no longer able to pay 3K per month for a basic search engine, and now we are using a solution based on this in production (> 30M searches/month).
Being able to allocate only a few KB to ingest any JSON is a killer feature.
There is a valid reason to reinvent the wheel: in my case I had to do something similar in C99 for a SaaS federated search engine, to get the lowest memory footprint possible.
I've been working on a lightweight PHP library acting as an agnostic query engine to aggregate content from different kinds of datastores (files, APIs, SQL, document-oriented, ...) and to ease the use of common aggregations: terms, facets, ranges, min, max, average, ...
Last year I built a similar project at work to inject values coming from our database into our Gherkin features.
I thought it might help someone else build more robust tests, so I decided to simplify it and open-source it, as I am not allowed to share the one I made at work.
I'm glad to share "Metis", a fulltext search engine written in ECMAScript 6 with facet capabilities. Results are not yet ranked with a TF/IDF algorithm, but sort callbacks can be provided.
This is still a WIP project and the engine API will change in the coming weeks. The current revision includes a simple fulltext search with result counts and facets.
It follows semantic versioning, and 1.0.0 should be released within two months.
If you want to use it, be sure to pin the current tag (0.3), as the overall project structure will change as explained above. For example, by revision 1.0.0 a "Query" object will probably be introduced, which will be notified when the response is returned by a JS worker running on another thread.
A working example is available; it generates 1000 documents before indexing them.
I intend to create an AVX2-based decoder, but I have had absolutely no time for side projects over the past three months.
You might also want to take a look at this streaming encoder if you want to encode large files with a tiny memory footprint: https://github.com/MKCG/php-qoi/blob/main/src/FFI/lib/qoi.c