Postgres is great, but using pg_cron as your system to pull and clean data is a recipe for disaster in the long run. I've always regretted using complicated Postgres queries or functions instead of doing it in the server. Sometimes I do it because I think "wow, I didn't know pg could do this. So cool, let's just use it!" Regret. Every time. They're too finicky, hard to test, and don't have the ecosystem (git, IDEs) that makes changes easy. And the next employee won't know about some random but crucial logic that lives inside Postgres instead of in the code.
You're also relying on your Postgres instance's precious CPU/RAM/network to do logic work which can be unpredictable. Let's be honest, most managed cloud Postgres providers charge an arm and a leg for big Postgres instances already. Why use those expensive resources when you can get a cheap EC2 instance or a serverless function to do the same and have the ability to scale efficiently?
My general rule is, use Postgres to store as much of my data as possible, but never use Postgres to do anything other than be a data store.
This rule of mine is why I never fell in love with Supabase even though many developers have. Use Postgres as my permissions layer? As my API server? You have to be kidding. If you're building a quick throwaway prototype, maybe. (The rest of Supabase is really nice though).
PS. To the team at tembo, no one is going to sign up for a free trial without a pricing page.
> PS. To the team at tembo, no one is going to sign up for a free trial without a pricing page.
This, times 1000.
I know setting a price is scary, but it is one of the critical pieces of information for anyone stumbling on your product.
If you are looking for a way to take advantage of traffic but aren't ready for pricing, ask for my email address for your newsletter. That's easier for me to give than my time to "kick the tires" with a free trial.
So you’re the one who actually wants that popup? I guess the world really is coming to an end, I couldn’t imagine a single person who actually thinks to themselves “I wish a popup would happen asking for my email address right now.”
Not a popup, please! Just a form. I want a way to keep on top of tembo that doesn't involve investing my time. Not saying I'd read every newsletter, though.
This 100%. A polite signup form at the bottom of a page (or an RSS feed) is totally fine. At this point I understand what you are about and will consider subscribing to your news.
But those popups that show up after I have read a few paragraphs of intro can go to hell. I just leave the page when I see one at this point.
Some of the time, companies think it works. I have no faith in how much/whether companies track the number of times a user simply closes the tab because they got a shitty popup while they were trying to _read about the product_. (It happens to me all the time: Hit a product page, semi-curious, I'm reading about it, then BAM, a popup asking me to sign up for a mailing list, so I close the tab and move on.)
Companies are far more likely to say "We got X people signing up for a mailing list, it must be working", and don't bother whatsoever to measure (a) how many of those signups are just bots and (b) how many potential customers they just turned away by nagging them. I would bet any money that these metrics are either not measured or simply ignored.
This rationale right here is why we are stuck with so much bloatware. "People enter the tarpits all the time and never leave. There must be something there, I'm going in"
I also prefer to do stuff on the server because doing things in the database does invariably seem to drift towards a confusing, brittle mess. However, I do wonder if that's necessarily the case or just a limitation of our frameworks. There's no principle that would prevent someone from setting up a simple test suite for database code. It would be a lot of work, but maybe there's an architecture nirvana right in front of our eyes if someone just were to properly engineer a good framework around Postgres functions.
I wrote something [0]. It lets you write plpgsql functions, views, triggers and casts in a way that can be re-created without migration scripts, like editing source code. Just a quick “deploy” and the logic is updated.
It also manages migrations for tables, types, and other stuff in a really simple way. Upgrades are fully atomic, and it lets you write unit tests in SQL - which are run after every upgrade, run inside save points so they don’t affect the database, and can run during production deployments.
It’s sort of my own personal (open source) Swiss Army knife of plpgsql development. It’s a complete work in progress, not production ready, probably has bugs, and needs more and better documentation - but I use it daily. It lets me use Postgres as my main development environment.
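For readers who haven't lived in this world, here's a generic flavor of what a SQL-level unit test can look like (not pgpkg's actual conventions; the function and test names are made up): a test is just a function that raises on failure, run inside a savepoint so nothing it does persists.

create function normalize_email(raw text) returns text
language sql immutable as $$ select lower(trim(raw)) $$;

create function test_normalize_email() returns void
language plpgsql as $$
begin
  if normalize_email('  Foo@Example.COM ') <> 'foo@example.com' then
    raise exception 'normalize_email: unexpected result';
  end if;
end;
$$;

-- run the test without leaving anything behind
begin;
savepoint run_tests;
select test_normalize_email();
rollback to savepoint run_tests;
commit;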
> without migration scripts, like editing source code. Just a quick “deploy” and the logic is updated.
This is 100% the key to sanity with managing database stored procedures and functions -- ability to manage them in Git like normal code and deploy them like code.
In contrast, the workflow from traditional imperative database "migration" tools is just super awkward for developing and maintaining any non-trivial number of SQL stored programs (procs, funcs, triggers, views, etc).
Hey - Skeema looks very cool. I haven’t used MySQL for decades but I can see it’s solving much the same problems as pgpkg.
I did decide that declarative tables were too hard when I wrote the first predecessor of pgpkg in bash 10+ years ago - but maybe I should reconsider now that I’m working in Go!
The thing is, it’s a bit of a rabbit hole, there isn’t much tooling around this stuff despite stored functions being so, so much easier for writing database logic than anything else.
I feel the industry has wasted an enormous amount of time on ORMs and other nonsense when stored procedures have been under our noses the whole time.
Thanks! Yeah definitely agree that building out declarative table management for Postgres would be a major effort. A few open source projects I've seen in that area include:
Some of these are based on parsing SQL, and others are based on running the CREATEs in a temporary location and introspecting the result.
The schema export side can be especially tricky for Postgres, since it lacks a built-in equivalent to MySQL's SHOW CREATE TABLE. So most of these declarative pg tools shell out to pg_dump, or require the user to do so. But sqldef actually implements CREATE TABLE dumping in pure Golang if I recall correctly, which is pretty cool.
There's also the question of implementing the table diff logic from scratch, vs shelling out to another tool or using a library. For the latter path, there's a nice blog post from Supabase about how they evaluated the various options: https://supabase.com/blog/supabase-cli#choosing-the-best-dif...
I looked at that supabase blog but the problem in my experience is that a refactor is not just about changing the database tables, but also rearranging the data within those tables - the data in a production database is obviously the most important thing. So I think the schema diff is actually the easy part. And for that, I wouldn’t use text tools like “show create tables”, I’d use the PG catalog tables directly.
The problem I had was that I could never come up with a declarative scheme that would allow reliable data transformation for all databases over the long term.
For example - if I have a table “lookup” with two columns “key” and “name”, over a period of years we might see a transformation like this:
alter table lookup rename column name to description;
[…later…]
alter table lookup add column name text;
update lookup set name = description;
[…later…]
alter table lookup drop column description;
Assuming “name” and “description” are modified between upgrades, then running these updates over time would result in a different transformation than if you just apply the latest definitions to an old database table.
I couldn’t ever come up with a solution to this that’s simpler than a sequence of migration scripts, which is always repeatable. I haven’t had a look at Skeema yet but am curious how you deal with this?
> So I think the schema diff is actually the easy part. And for that, I wouldn’t use text tools like “show create tables”, I’d use the PG catalog tables directly.
re: SHOW CREATE TABLE, I was referring to the export logic, not diff logic. In other words, when a new user adopts a declarative schema management tool, they need some way of dumping their existing database schema to the filesystem as a set of CREATE statements. Most pg tools seem to just leverage pg_dump for this, but in my opinion that's not great since it's an external dependency.
And then ideally there's also a way to re-sync the filesystem in the future as needed, pulling the latest definitions from a given DB server. This is also an export, but it should be smart enough to only overwrite CREATE statements where the corresponding object has actually changed, to avoid stomping on formatting or inline comments.
The ability to do a "pull" operation is very useful in development workflows: engineers can make DDL changes to a dev DB directly while developing a feature, and then pull those changes into the filesystem to turn them into a git commit / pull request. It's also sometimes useful in production workflows, in case someone had to make an "out of band" emergency hotfix directly to the prod DB, outside of the schema management tool.
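To make the export side concrete: function and view definitions can be pulled straight out of the catalogs with the pg_get_*def helpers; tables are the part with no built-in equivalent, as noted. A rough sketch, not tied to any particular tool:

-- CREATE statements for plain functions in the public schema
select p.oid::regprocedure as signature,
       pg_get_functiondef(p.oid) as create_statement
from pg_proc p
join pg_namespace n on n.oid = p.pronamespace
where n.nspname = 'public'
  and p.prokind = 'f';  -- skips aggregates, procedures, window functions

-- and the SELECT bodies of views
select c.relname, pg_get_viewdef(c.oid, true) as view_definition
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where n.nspname = 'public'
  and c.relkind = 'v';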
> alter table lookup rename column name to description
> I haven’t had a look at Skeema yet but am curious how you deal with this?
Skeema just doesn't support renames directly at all. In practice this actually works out fine, since renames are hugely problematic in production databases anyway due to deploy-order concerns: there's no way to deploy an application change at the same exact moment as the RENAME is executed in SQL, so it cannot be performed "online". Best practice with schema changes is for applications to be able to work fine with both the old and new schema, and renames typically break this rule. (Well, unless you do a convoluted multi-step dance with view-swapping, for DBMS that support transactional DDL... this is doable in pg, but not in all other DBs.)
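For the curious, that dance looks roughly like this in Postgres, using the lookup example from upthread (a sketch only; writes through the compatibility view may still need INSTEAD OF triggers depending on how complex the view gets):

begin;
alter table lookup rename to lookup_data;
alter table lookup_data rename column name to description;
-- old application code keeps reading lookup.name through this view
create view lookup as
  select key, description as name from lookup_data;
commit;

-- later, once nothing references the old column name:
-- drop view lookup; alter table lookup_data rename to lookup;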
In Skeema, if you really need to do a rename, you can do it out-of-band (outside Skeema) on all environments (prod/stage/dev/etc), and then use `skeema pull` to update the filesystem definition to match.
Or for new tables that aren't populated in prod yet, happily a rename is entirely equivalent to drop-then-re-add. So this case is trivial, and Skeema can be configured to allow destructive changes only on empty tables.
> when a new user adopts a declarative schema management tool, they need some way of dumping their existing database schema to the filesystem as a set of CREATE statements.
Got it. Yeah - tbh I hadn't thought about that, but declarative tables have never been on my radar (for the reasons mentioned).
> Skeema just doesn't support renames directly at all
Right! that's one way to deal with it! :)
> there's no way to deploy an application change at the same exact moment as the RENAME is executed in SQL
I guess it depends on your ops environment, but I don't think renames are special; any schema update can be incompatible with client code if you're not careful. That said, in the systems I've worked with (telco and saas), the schema upgrade has always run with applications stopped - the migration code is shipped with the server application, and run as part of its startup sequence. Even if there are multiple clients, the only requirement is that they are all stopped before the upgraded processes start. (This is something you can delegate to K8s for example).
I think the interesting thing is that there are such very different approaches to schema upgrades. Skeema and pgpkg obviously have different views about how migrations should be managed, despite the fact that you (presumably) and I have both done many tens of thousands of them over the years.
One important difference, as you mention, is that PG has transactional DDL - which really changes the game in terms of how you can think about upgrades. pgpkg is deliberately PG specific; it will never support any other kind of database, because the assumptions it makes are invalid in most other systems.
Awesome, thanks! I update it whenever I work with it, which has ramped up in the last few weeks (added cast support a few days ago). It uses pg_analyse to parse the SQL and uses a few tricks to get everything updated.
The best thing (IMO) is that there is basically no funny stuff, no filename conventions, no funny delimiters. It’s just regular Postgres SQL, and a couple of very small config files. Works perfectly with git.
I've made some automatic schema dumping scripts for supabase/postgres code to become searchable, git-diffable and PR-friendly. And I eventually got unit testing working well enough that I even wrote migrations in a test-first way; it was just quicker to iterate. But the overall experience felt like I was constantly combating problems solved a looong time ago for other languages and ecosystems. It was weirdly fun, but what killed my interest is that row-level security policies kill the performance of even simple queries, and EXPLAIN doesn't help too well with it.
Had the opposite experience. Row-level policies had minimal overhead (<5%?) but improved security at the app layer tremendously. It was intensely satisfying to have cases where UI developers were complaining that data was missing. Turned out the access tags were wrong, and the data as tagged shouldn't have been visible in the first place. The app had set the wrong tags but, it turns out, would have happily returned the data if the policies hadn't been there.
Apps forget that extra AND on the WHERE clause *all the time*. Just one ad hoc script querying the database can ruin your whole security-oriented day.
Do policies make schema design slower? Yes. Do they make queries slower? Not in my experience, but that may be due to our familiarity with Postgres and its planner. Do they basically eliminate data leaks to the end user? Absolutely.
DB policies to me are like Rust vs C++. Someone may be able to write C++ faster and with less training, but having those extra checks at the outset can save so much time and heartache down the road. It's an investment, not a cost.
to remove a lot of useless comments and flags from the dump (pg_dump output isn't too readable). This one-liner can strip too much, though; comments in functions shouldn't start at column 0.
The same script was run on CI too, to verify that the developer didn't forget to run it in the PR.
Have you tried IntelliJ? I honestly do most of my SQL in psql, out of habit. But I'm in IntelliJ all the time for Java, and from what I've seen their query editor is way impressive.
Does it support multiple queries per sql file? My use case is to use this to autocomplete queries made with Cornucopia [0]
Cornucopia queries look like this (here is a something.sql file)
--! authors
SELECT first_name, last_name, country FROM Authors;
--! insert_author
INSERT INTO Authors(first_name, last_name, country)
VALUES (:first_name, :last_name, :country);
There are multiple queries each separated by ; and on top of each query, there's a comment giving a name to the query (it's more like a header)
I think the only thing that might require specific support in postgres_lsp is using the :parameter_name syntax for prepared statements [1] (in vanilla Postgres it would be something like $1 or $2, but in Cornucopia it is named to aid readability). But if postgres_lsp is forgiving enough to not choke on that, then it seems completely fit for this use case.
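For comparison, the vanilla Postgres form of the same prepared statement with positional parameters (the example values are made up); the readability difference is exactly the point:

prepare insert_author (text, text, text) as
insert into Authors(first_name, last_name, country)
values ($1, $2, $3);

execute insert_author ('Agnes', 'Varda', 'France');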
Thank you for turning me on to Cornucopia, it looks awesome. I've used the very similar aiosql in Python, but I hadn't realized there was a Rust analog.
To tell the truth I've been waiting for postgres_lsp to mature before trying it out, but based on this example [1] I think it does support multiple queries.
Since it uses a parser extracted from Postgres, the nonstandard syntax would probably trip it up, but there's probably a way to fix that.
When I worked on a system that used a lot of postgres triggers and stored procedures we built a little mechanism on top of our existing database migration tool that would check a directory of plpgsql files and generate migrations when they were created or updated. It worked fine.
It wasn’t the most perfect developer workflow, and I was suspicious when I first encountered the way the software used all of the stored procedures, however I came to appreciate that we were able to be a bit freer with changes to the application code because of this semi-isolated layer that took care of some critical stuff right in the database.
Depending on what kind of deployment you have, you could use Tcl, Rust, or even Python if you're able to use untrusted extensions. (Not a comment on this particular testing framework, but on Postgres server-side programming more generally.)
But I hear you, PgSQL can be very annoying and unergonomic, and it's not a language most people you're hiring will know upfront. Pushing things onto the backend isn't unreasonable. When I write tests for PgSQL, I write them in Python and run them from the client side, not on the server.
pl/pgSQL is very good (and ergonomic) for a logical extension to SQL, aka set theory programming with intermediate state. It isn't and was never targeted toward general purpose programming like Python.
That said, 100% agree that unit tests should live outside the DB. Querying for sets inside or outside makes no functional difference, and your DB doesn't need all that extra cruft.
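A tiny illustration of what "set programming with intermediate state" looks like in practice (table and column names are hypothetical):

create function close_stale_orders(cutoff interval) returns integer
language plpgsql as $$
declare
  n integer;
begin
  update orders
     set status = 'expired'
   where status = 'open'
     and created_at < now() - cutoff;
  get diagnostics n = row_count;  -- intermediate state: rows touched
  return n;
end;
$$;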
There's maybe no standard solution which could drive this forward as a practice, the way JUnit has taken hold in the Java community (I'm sure there are examples like that everywhere that may resonate). But nothing is stopping you from running Postgres in a Docker container and wrapping it in a unit test with the technology of your choice. The Postgres scripts (ddl, triggers, functions) could live in the same project. Setup, run tests, teardown. Each unit test can set up the data for a particular test case. That way you also codify the Postgres scripts, for which it then becomes "just" a deployment matter to get them from Git into the target environment. I have written all sorts of test harnesses, and I wish more people would think outside the box of their standard tool sets. Sure, I'm a Java developer, or one could be a SQL coder, but nothing is stopping us from writing a darn script.
(By the way, I'm not suggesting running a large data warehouse in a unit test. But the parents said "unit testing", which suggests there exist units of smaller, isolated functionality in your project, if you're willing to find them.)
You have just described Testcontainers [1], and if you are a Java developer you may want to look into Testcontainers for Java [2].
For example, in one of my projects I use TestNG to instantiate a MariaDB container, run the Flyway migrations on it and then populate the tables with the test data.
> But nothing is stopping you from running Postgres in a Docker container, and wrap it in a unit test with the technology of your choice
I agree. I'm in .NET land, I use EFCore as my ORM, and I use EFCore's migration features. My ORM models are in a separate project (same git repo) from both the web server project and test project, and any hand written SQL gets added to the migration scripts that EFCore generates.
I spin up a docker container for postgres, my test code clears any existing DB, creates a DB in the container, and then runs the EFCore migrations on the DB. I have simple tests that make sure my CTEs are working correctly and that things like expected unique indices are also setup.
This works both locally and in Github Actions.
I just wouldn't call any of this a "unit" test. I put all this squarely in my integration test suite. I figure if IO is happening as a consequential part of the test (i.e. not setup/teardown), it's an integration test. I wonder how much that distinction is tripping people up? A lot of people think of unit tests as small, independent, and quick, so by only thinking about unit testing, they automatically rule out tests that have app code call out to a DB.
Based on the sibling comment, I'm going to have to take a look at Testcontainers. I'm not sure how much it'll simplify at this point, but who knows!
In the MS ecosystem they have the SSDT data tools and unit test projects. It uses the designer functionality of VS to allow you to write your arrange, act, and teardown in pure SQL, and gives some basic condition builders for result sets for your assert, with an escape hatch if that's insufficient.
It's actually so good I regret not having done it sooner. We migrated all our APIs into PostgREST [1] and it reduced our development/testing/deployment time by an insane amount, decreased the number of bugs automatically by removing 1 extra layer and made all our requests faster without having to optimize anything.
I'm genuinely surprised it's not more popular. It would probably benefit like 99% of the small/medium size companies.
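For anyone who hasn't tried it: you mostly just write SQL and PostgREST turns it into HTTP endpoints. Tables in the exposed schema become /tablename routes, and a function like the hypothetical one below is served as POST /rpc/search_products:

-- assuming "products" lives in the schema PostgREST is configured to expose
create function search_products(term text)
returns setof products
language sql stable as $$
  select * from products
  where name ilike '%' || term || '%';
$$;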
I would have agreed w/ you 100% 18 months ago. And presently it is messy to do this in Postgres, but it's also messy to choose lots of tools that create sprawl and data islands.
We're spending a significant part of our R&D effort in 2024 towards making it cleaner/safer to do more in Postgres. We have a lot of work to do to build a great gitops developer experience. We're also prioritizing auto-scale, auto-suspend, and auto-tune projects to help reduce the cost of doing "lumpy" things in Postgres.
Would love to have you as a design partner @aurareturn — want to eventually win you over :) hit me up if you're interested in chatting ry [at] tembo [dot] io
+1 for data in the database, code as... well, code. And also learned this the hard way. Code you can build and run test suites against, extract logs from, place debug breakpoints in, is such a quality of life enhancer.
We're working on a gitops DX for Tembo Cloud, that will allow you to ensure everything you build on Tembo is in source control, and runnable locally w/ our CLI. More info coming soon...
Oracle also lets you run Java from stored procedures which I’m sure lets you tie yourself in knots too.
There could be slim use cases for doing everything in the db but there’s a reason why it’s not widespread.
I hear your points, but can't git, CI tools, IDEs like DataGrip all be used to test and deploy the database code too versus letting some db admin create functions or procedures with no version control? Also, with things like RDS or readonly replicas, couldn't more analytical queries be done directly in the database versus busing data around?
There will be another blog from us at some point about running the data warehouse at scale. We're already working on integrating with s3 storage, and distributed compute is in the roadmap. Both possible today with open source extensions, and our friends in industry are already doing it.
Is there any good data movement/transformation software for on-prem? Specifically integration of other databases and Excel into MSSQL (<100GB of data).
I have looked at SSDT, which looks really clunky... Was going to use a similar approach to OP, especially because it's cheap and resource-efficient, but lack of support infrastructure might come back to bite me.
Checkout Airbyte and singer.io for data movement. They're tailored for moving offsite data local, but also have support for moving data between different database vendors. I'd recommend starting with Airbyte.
For transformations I'm a sucker for dbt-core, which executes transformations as SQL against your data warehouse vs. a separate process. This keeps everything simple and (generally) snappy. It uses the Python Jinja2 template library to compose SQL templates, and the process that parses templates to SQL also implicitly creates an execution DAG so you could have hundreds of transformations running in the exact order they need to.
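If you haven't seen dbt before, a model is just a SELECT in a file, and the Jinja ref() call is what lets dbt infer the dependency DAG (model and column names here are invented):

-- models/customer_order_counts.sql
select
    customer_id,
    count(*) as completed_orders
from {{ ref('stg_orders') }}
where status = 'completed'
group by customer_id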
There are also many commercial offerings (Fivetran, Dataform) that don't tick self-hosted, but could be easier to deploy if getting internal hardware support is an issue.
Just kidding. Fuck SAS. I've used some absolutely dogshit tools in my day, and of all those, SAS is downright the most time-wasting, incredibly inconsistent pile of shit ever.
Never had a worse experience. Useless. Run far away.
Well put. It's never not funny to me how one of the fundamental tenets of software engineering is 'separation of concerns' yet there's no shortage of people willing to say 'but it's so convenient to mix them at the start of a project...'.
This is the definition of vendor lock in; the more you use embedded non-standard functionality of a product the less likely it is that you'll ditch it.
As long as you're aware of what you're doing it's fine. And you're documenting your usage, right?
If you don't write your database layer with vendor-specific features and optimizations in mind, you're definitely leaving tons of performance on the floor. Multi-DB support for a largely identical schema is a fool's errand.
It's not Postgres's fault that MySQL is comparatively bereft of modern features such as the following (a few are sketched after the list):
• transactional DDL
• RETURNING clause
• unfettered use of user-defined functions
• a native UUID type
• the ability to define custom data types and domains
• DDL triggers
• row-level security
• actual arrays and booleans
• MERGE
• range types and constraints
• IP address and CIDR types
• partial indexes
• LATERAL JOIN
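A taste of a few of those, for anyone who has only lived in lowest-common-denominator SQL (the tables are invented):

-- RETURNING: get generated values back without a second round trip
insert into orders (customer_id, total)
values (42, 99.50)
returning id, created_at;

-- partial index: index only the rows the hot query actually touches
create index on orders (customer_id) where status = 'open';

-- range type + exclusion constraint: overlapping bookings become impossible
create extension if not exists btree_gist;
create table bookings (
  room   int not null,
  during tstzrange not null,
  exclude using gist (room with =, during with &&)
);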
MS SQL has temporal table support, the ability to integrate any .Net functionality directly into the database, pivot tables, etc.
It's really hard to explain to lifelong MySQL users what they've been missing in the name of database agnosticism. Using the lowest common denominator to all the engines feels like wearing a straitjacket attached to lead weights.
Not sure the word "vendor" is exactly correct when using standard PostgreSQL functionality though, as PostgreSQL has multiple vendors providing that functionality. So, not really "vendor lock in".
The term is clearly correct for databases with only one vendor though (Oracle, etc).
I love Postgres and relational databases in general, but the lack of documentation generation found in general purpose programming languages for the last 20 years is a glaring omission.
There is simply no good equivalent to Javadoc for SQL databases. (Except you, Microsoft. I see you and your MS SQL Server tools. I just don't work on Windows.) You have ERD tools that happily make a 500-table diagram that's utterly useless to humans, ugly as sin, not at all interactive, etc. Seriously, Javadoc (and Doxygen, etc.) have been on point since the 1990s.
OOP and functional design patterns are common knowledge while 50-year-old relational concepts are still woefully unknown to far too many data architects.
Folks don't write docs. They just don't. They start. They agree it's important. But the docs either get skipped or they stagnate. The only real option is automatic generation: reading the column and table comments, the foreign key relationships, the indexes, the unique and primary keys, the constraints, etc., and rendering it all into a cohesive and consistent user interface. Not the data. The structure.
It's the single biggest tooling failure in that sector in my opinion.
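The frustrating part is that the raw material is sitting right there in the catalogs. A rough sketch of pulling table and column documentation out of Postgres (the missing piece is rendering it into something humans actually want to read):

select c.relname                            as table_name,
       obj_description(c.oid, 'pg_class')   as table_comment,
       a.attname                            as column_name,
       format_type(a.atttypid, a.atttypmod) as data_type,
       col_description(c.oid, a.attnum)     as column_comment
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
join pg_attribute a on a.attrelid = c.oid
where n.nspname = 'public'
  and c.relkind = 'r'
  and a.attnum > 0
  and not a.attisdropped
order by c.relname, a.attnum;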
Agreed, I learned the hard way that complicated constraints, the ones that are really good for keeping data integrity, are really hard to test and might have unintended consequences.
A data store *that enforces valid data at all times*. Far too many people miss that part. They skimp on foreign keys, check constraints, and the most appropriate data types. Time is far better spent getting the schema right at the outset than trying to fix bad data already mixed in haphazardly with the good.
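Getting that right up front is usually just a handful of lines of DDL; a sketch with a hypothetical table (the referenced orders table is assumed to exist):

create table payments (
  id       bigint generated always as identity primary key,
  order_id bigint not null references orders (id),
  amount   numeric(12,2) not null check (amount > 0),
  currency char(3) not null check (currency = upper(currency)),
  paid_at  timestamptz
);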
And then there's audit trails, which are best made centralized in the database with triggers. "This change was made by X at this time."
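A minimal sketch of that kind of trigger-based audit trail, reusing the hypothetical payments table above (real ones usually capture old and new values, the statement, and so on):

create table audit_log (
  id         bigint generated always as identity primary key,
  table_name text not null,
  operation  text not null,
  row_data   jsonb not null,
  changed_by text not null default current_user,
  changed_at timestamptz not null default now()
);

create function audit_row_change() returns trigger
language plpgsql as $$
begin
  insert into audit_log (table_name, operation, row_data)
  values (tg_table_name, tg_op, to_jsonb(coalesce(new, old)));
  return coalesce(new, old);
end;
$$;

create trigger payments_audit
after insert or update or delete on payments
for each row execute function audit_row_change();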
Row-level access policies are yet another. Doesn't matter which app or user is querying, what joins are mixed in, etc., the DB simply omits data the user shouldn't see. Period. No "oh, my script forgot to include part of the WHERE clause for that".
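And the policy itself is declared once instead of being an AND that every query has to remember (a sketch; the ownership column and session setting are made up):

alter table documents enable row level security;

create policy documents_by_owner on documents
  using (owner_id = current_setting('app.current_user_id', true)::bigint);

-- the application sets the user once per connection or transaction:
-- set local app.current_user_id = '42';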
Totally agree about pg_cron (and MySQL scheduled events while we're on the topic). It artificially limits scheduled actions to just those that affect the DB, can't adjust to current DB load, and leaves the jobs hard to find. Better to run cron jobs all accumulated in one place, from the cloud provider a la EventBridge or from a dedicated schedule server. (I'm not a fan of app cluster leader elections either, for complexity and opacity reasons. Put the schedules in one place to be audited and enumerated as needed.)