Taking full advantage of SQLite's extensibility is a centerpiece of my Go driver.
It started because I had to reimplement the OS layer in Go due to tech constraints, but it means you _can_ implement VFSes, UDFs (scalar, aggregate, and window functions), and virtual tables in Go, with reasonable performance and nice APIs.
I also made a point of dogfooding this as much as possible, with a bunch of extensions and a few custom VFSes that use the same APIs available to clients of the library.
https://github.com/ncruces/go-sqlite3/tree/main/ext
https://github.com/ncruces/go-sqlite3/tree/main/vfs#custom-v...
Interesting, I need to take a look at your vfs implementations. Some of my projects could benefit from a transparent zstd vfs layer, but compiling and loading one of the existing C options in a Go project is yucky even if possible.
I've never done it, but for the read-only case, I think all you need to do is combine the existing reader VFS with zstd-seekable-format-go.
https://github.com/ncruces/go-sqlite3/discussions/117#discus...
For read/write, I'm honestly not sure. ZIPVFS is an… erm… architectural mess that only really works because it accesses private SQLite APIs. Which is fine, but history has shown the SQLite team is willing to break those APIs, as they did for the ones they built the SQLite Encryption Extension on.
https://sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki
The zstandard alternative is sqlite_zstd_vfs, which has the same architectural issues. So I'd rather not go there. But it should be doable, as long as you don't need private APIs.
https://github.com/mlin/sqlite_zstd_vfs
The problem with those API breaks is rarely the fact that they got broken; it's the fire drill that happens after they do.
If you get lucky, the two necessary upgrades happen at a time that fits well into your schedule. If you don't get lucky, the SQLite upgrade you need contains a CERT advisory for a zero-day attack, and not only does that not fit into your schedule, it also doesn't fit into the schedule of the person who did the customization.
These are rare events, but over the course of a project, that low probability, compounded across the number of vendors you decide to play that game with, approaches or exceeds 1.00 (>1 meaning 'happened to us twice').
Thanks for the pointers!
I wish there were something like this for WASM. There are WASM builds available, but it's not easy to extend them with your own functions.
You can hypothetically write your entire product in SQL with appropriate bindings to SQLite. For me, SQL with CTEs is a very compelling way to model tricky business logic. Building a SQL-based rules engine is trivial if your host language has a good SQLite library.
This thing starts to grow legs once you realize you can recurse down the rabbit hole by binding something like an Execute_Sql UDF: you can store the actual scripts within the same schema they operate on. Treating your code as data means you can do things like transactional updates of business logic while the system is serving live requests. You also get simple reflection and search over the business logic.
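A minimal sketch of this pattern using Python's sqlite3 module (the `run_script` UDF and the `scripts` table are made-up names for the example, not from any library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scripts (name TEXT PRIMARY KEY, sql TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    INSERT INTO orders (total) VALUES (120.0), (80.0);
    -- The "business logic" lives in the same schema it operates on.
    INSERT INTO scripts VALUES
        ('big_orders', 'SELECT count(*) FROM orders WHERE total > 100');
""")

def run_script(name):
    # Look up the stored SQL by name and execute it on the same
    # connection, returning the first column of the first row.
    (sql,) = conn.execute(
        "SELECT sql FROM scripts WHERE name = ?", (name,)
    ).fetchone()
    (value,) = conn.execute(sql).fetchone()
    return value

conn.create_function("run_script", 1, run_script)

print(conn.execute("SELECT run_script('big_orders')").fetchone()[0])
```

Because the scripts are just rows, updating the logic is an ordinary transactional UPDATE on the `scripts` table.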
Is this the same thing as create_function for python?
https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne...
OP here. Yes, it is, except that the only built-in flag it offers is `deterministic`; if you also want `directonly`, you'll probably need the same OR trick.
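For reference, a minimal sketch of the one flag Python does expose (the function name `double` is made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Python 3.8+ exposes only SQLITE_DETERMINISTIC, as a keyword argument;
# there is no keyword for SQLITE_DIRECTONLY.
conn.create_function("double", 1, lambda x: x * 2, deterministic=True)

print(conn.execute("SELECT double(21)").fetchone()[0])  # 42
```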
I've nothing to say but “that’s cool”. I want to try this in NodeJS!
I suppose using functions defined by the host with SQLite is cheaper than using similarly defined functions in databases that are separated by a network. I wonder what the overhead is.
Also, what’s with the weird UUIDs? Why not use UUIDv7 if you want time-ordered UUIDs?
> Why not use UUIDv7 if you want time-ordered UUIDs?
It is our flavour of NIH. That said, Tou has a finer-resolution timestamp. We also didn't do our homework properly and assumed v7 UUIDs wouldn't be accepted by Postgres because of the different "version" value.
> I suppose using functions defined by the host with SQLite is cheaper than using similarly defined functions in databases that are separated by a network.
“Infeasible” is very fast. SQLite runs in-process, so you can register a function pointer or five, with a trampoline back into the runtime.
Can't do that over the network; you can create functions, but only in the database's procedural language (or languages, in the case of Postgres).
FWIW, the set of procedural languages in postgres is runtime extensible: https://www.postgresql.org/docs/current/sql-createlanguage.h...
> Also, what’s with the weird UUIDs? Why not use UUIDv7 if you want time-ordered UUIDs?
It looks like the rationale for the Tou library is that some systems do not accept unfamiliar UUID variants as UUIDs, so a time-ordered ID that looks like a UUIDv4 is safer for some legacy systems than a (newer, and less likely to be recognized) UUIDv7.
This. We did discover, however, that Postgres will, in fact, swallow a UUIDv7 just fine. After having written that library :-)
I had assumed it would ingest, but not generate, unknown UUID versions that nonetheless fit the broader UUID structure; but not having tried it, I didn't want to bring that up.
It is much cheaper, because there are no round trips and no requirement that extensions be available on the database server. It's a really sweet capability that SQLite can provide precisely because it is hosted by the application.
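As a rough way to measure the remaining overhead, one can time a built-in SQL function against an equivalent host-language UDF over the same data. A sketch in Python's sqlite3 (numbers will vary by machine; the point is the per-call trampoline cost, not the absolute figures):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# A Python UDF that mirrors SQLite's built-in upper().
conn.create_function("py_upper", 1, str.upper, deterministic=True)

conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("hello",)] * 100_000)

# Same query twice: once with the built-in, once with the UDF.
for fn in ("upper", "py_upper"):
    start = time.perf_counter()
    (n,) = conn.execute(f"SELECT count(DISTINCT {fn}(s)) FROM t").fetchone()
    print(fn, n, round(time.perf_counter() - start, 4))
```

Even the slower UDF path is orders of magnitude cheaper than a network round trip per call.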
> Also, what’s with the weird UUIDs? Why not use UUIDv7 if you want time-ordered UUIDs?
I’m not the author but UUIDv7 came out in about 2022. Guessing this is legacy stuff that long predated that. There were lots of solutions to solve this problem before there was a standard.
> Guessing this is legacy stuff that long predated that.
I’m guessing it’s not, since the tou library seems to be 8 months old and mentions avoiding the need for extensions when used with Postgres as an advantage over UUIDv7.
Great, thanks
better-sqlite3 has you covered!
better-sqlite3 is (was?) my favorite sqlite3 lib, it is incredibly fast.
NodeJS has SQLite support these days!
https://nodejs.org/api/sqlite.html
Interestingly, it is NOT async, just like better-sqlite3. I wonder why. I've been looking for any public remarks about it, but found nothing.
These may be useful reading:
https://github.com/TryGhost/node-sqlite3/issues/408#issue-57...
https://github.com/WiseLibs/better-sqlite3/issues/32#issueco...
Copying a quote from the second:
> The sqlite3 C API serializes all operations (even reads) within a single process. You can parallelize reads to the database but only by having multiple processes, in which case one process being blocked doesn't affect the other processes anyways. In other words, because sqlite3 serializes everything, doing things asynchronously won't speed up database access within a process. It would only free up time for your app to do other things (like HTTP requests to other servers). Unfortunately, the overhead imposed on sqlite3 to serialize asynchronous operations is quite high, making it disadvantageous 95% of the time.
The sqlite3 C API very much does not serialize "all operations within a single process."
The way threading and concurrency work in SQLite may not mesh well with NodeJS's concurrency model. I dunno, I'm not a NodeJS/libuv expert.
But at the C API level that statement is just wrong. Normally you cannot share a single connection across threads. If you compile SQLite to allow this, yes, it'll serialize operations using locks. The solution is to create additional database connections, not (necessarily) launch another process. With multiple database connections, you can have concurrency, with or without threads.
https://sqlite.org/threadsafe.html
Again, whether this is viable in NodeJS, I have no idea. But it's a Node issue, not a C API issue.
BTW, we're commenting on a Ruby article, and SQLite in Ruby has seen "recent" advances that increase concurrency by implementing SQLite's busy handler in Ruby, which allows the GVL to be released so other Ruby and SQLite code can run while waiting on a busy connection.
https://fractaledmind.github.io/2023/12/11/sqlite-on-rails-i...
The user-supplied busy handler has been available for a long while; it's just that the Rails connection adapters didn't quite use it right. Indeed, there is elevated interest in SQLite these days.
Tnx for the links!
That's nice! I use Bun, but its lack of UDF support makes me consider switching back to Node.
I'm starting to second-guess my choice of Deno.
The options for SQLite there are either unstable APIs or limited by the particulars of WASM.
They're going to add it once it stabilizes in Node: https://github.com/denoland/deno/issues/24828#issuecomment-2...
Also available with Lua: http://lua.sqlite.org/index.cgi/doc/tip/doc/lsqlite3.wiki#db...
Also part of Python's standard implementation of sqlite3. [0]
[0] https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne...
With any decent host runtime, as it were.
There are also some good libs for SQLite. This is the standard set that I use. https://github.com/nalgeon/sqlean
That is pretty cool. And good to see you still doing talks.
I've recently started working with BigQuery and heard you can do UDFs in JavaScript. Good to know this is a thing in more databases. I haven't needed UDFs up to this point, but now that I know a bit more about them, I just might, as JS happens to be a strong language of mine (though I'm not currently using it professionally).
We actually use a couple Ruby UDFs with BigQuery, deployed as Google Cloud Functions. Works pretty well.
What an interesting find
Yes, I remember using that in PHP; it was very handy before SQLite's JSON support, since it also worked with aggregate functions.
https://www.php.net/manual/en/sqlite3.createfunction.php
Also available through PDO:
https://www.php.net/manual/en/pdo-sqlite.createfunction.php
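The same aggregate trick works in Python's sqlite3 via create_aggregate. A sketch of the kind of hand-rolled JSON aggregate people used before json_group_array existed (the class and function names here are made up for the example):

```python
import json
import sqlite3

class JsonGroup:
    """Collects a column's values into a JSON array."""
    def __init__(self):
        self.items = []

    def step(self, value):
        # Called once per row in the group.
        self.items.append(value)

    def finalize(self):
        # Called once at the end to produce the aggregate result.
        return json.dumps(self.items)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("json_group", 1, JsonGroup)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

print(conn.execute("SELECT json_group(x) FROM t").fetchone()[0])  # [1, 2, 3]
```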
from the comments:
https://docs.python.org/3/library/sqlite3.html#sqlite3.Conne... http://lua.sqlite.org/index.cgi/doc/tip/doc/lsqlite3.wiki#db... https://nodejs.org/api/sqlite.html#databasefunctionname-opti... https://www.php.net/manual/en/sqlite3.createfunction.php