In my current job I maintain a database of 2B records that is (unfortunately) on spinning rust and on a last-gen Intel desktop CPU system. This database is a read-only data warehousing sort of setup.
If you need to do operations like this on a database with this setup I'd suggest checking out the following config options:
I'd also suggest looking into TEMPORARY TABLEs with ENGINE=MEMORY.
I have not included values in this because the values that make sense for you are likely not the values that make sense for me. Check out MySQL's documentation for what these values affect [0]. The defaults in even MySQL's huge config are very outdated for the horsepower that modern computers bring to the table. It's funny that MySQL's once-massive 4GB-of-RAM config file is now the appropriate setup for my laptop.
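To make that concrete, the kinds of variables I mean are along these lines (an illustrative shortlist, not a tuning recipe; check your current values and read the docs before changing anything):

SHOW GLOBAL VARIABLES WHERE Variable_name IN (
  'innodb_buffer_pool_size',       -- the big one for a read-heavy InnoDB warehouse
  'innodb_buffer_pool_instances',
  'tmp_table_size',                -- pairs with max_heap_table_size
  'max_heap_table_size',           -- also caps ENGINE=MEMORY tables
  'sort_buffer_size',
  'join_buffer_size',
  'read_rnd_buffer_size'
);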
It's a great tool, but don't just apply advice blindly. It's guidance, not "this is the right thing to do", and some of the advice can harm as much as help, because no one size fits all (though I have found it to provide excellent suggestions in general).
Use it, look at the suggestions, and use that as an excellent springboard for reading up on what and why, and see how it could apply to your database.
Two big pitfalls seem to hit almost all of these "low-level optimization of how the database handles..." type posts: the importance of optimizing the overall system, and the use of other technologies.
Write activity impacts database query caches. Many applications do not require realtime-accurate results from the database and generate a substantial number of the same paginated queries. For these cases, it is very important to consider higher-level caches within the system -- CDN, page-level caches, page block-level caches, etc. as caching your 99% traffic pattern will provide DB platform headroom to support your 1% traffic pattern.
Where the results you are paginating are based on any sort of matching (SQL WHERE), most read applications see a sizable benefit in integrating a search platform. Data selection for display is handled in the search layer, and underlying data retrieval for display happens either from the search layer or via the backing database using inexpensive lookups via primary key.
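For example (table and IDs hypothetical), once the search layer has picked the IDs for the requested page, the database side is just a cheap primary-key fetch:

-- ids arrive from the search layer already filtered, ranked and paginated
SELECT id, title, summary
FROM articles
WHERE id IN (9183, 422, 75210)
ORDER BY FIELD(id, 9183, 422, 75210);  -- MySQL idiom to preserve the search ranking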
One of the key considerations not covered in the article is the need for result consistency when paginating, e.g. if the underlying data changes. It is the need for this consistency, not the desire for performance, that I see as the primary reason to include primary key identifiers or timestamp values in your pagination strategy.
> the importance of optimizing the overall system, and the use of other technologies.
If folks knew how to get the most out of their databases, 90% of them wouldn't even need the other systems and the complexity of the system as a whole would be reduced.
Many database platforms support column-level privileges or permissions which could be used to prevent that sort of backfill, by preventing UPDATE on the column that holds the creation timestamp.
I've also dealt with databases without column-level privileges. Where not handled by a column-level privilege system, it may still be possible to block this sort of UPDATE using a TRIGGER designed to fail.
Privilege grant (or drop of the trigger) could be used when the system is in an offline maintenance mode should you require the ability to correct that protected data, reinstituting the control when the maintenance is complete.
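A minimal sketch of both options in MySQL syntax, using a hypothetical audit_log table whose created_at column should stay immutable:

-- Column-level privileges: grant UPDATE only on columns that may legitimately change
GRANT SELECT, INSERT ON mydb.audit_log TO 'app'@'%';
GRANT UPDATE (note) ON mydb.audit_log TO 'app'@'%';  -- created_at deliberately left out

-- Where column privileges aren't available or practical: a trigger designed to fail
DELIMITER //
CREATE TRIGGER audit_log_protect_created_at
BEFORE UPDATE ON audit_log
FOR EACH ROW
BEGIN
  IF NEW.created_at <> OLD.created_at THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'created_at is immutable';
  END IF;
END//
DELIMITER ;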
The criticality of the data and the level of automation in use would probably be the factors I would use to decide whether this overhead was warranted. Hopefully you had backups available.
You know, if I had access to the documentation of a major DB system, I would put something about this right on the documentation page for LIMIT and OFFSET. It's just such an attractive nuisance; who can blame a novice developer/DB user for making the mistake of thinking that LIMIT and OFFSET are a good solution, really?
Well, if you need to support arbitrary sorting and filtering with your pagination, you'll be looking at table scans no matter what you do, so limit and offset don't really add much extra cost.
You can arrange for indexes on the most commonly used columns, so that e.g. initial default page load is fast, while leaving the user hanging for up to a second or two on the more unusual queries.
It turns out that when people have better filtering and sorting tools, they spend a lot less time digging for page 10,001.
LIMIT is great, and a necessity on MySQL. It's OFFSET that causes problems. If I were king of databases I'd make OFFSET > 1000 cause an error directing the user to the doc page explaining why what they're doing is a bad idea and how to increase the limit anyway.
Offset 1000 won't hurt you if it's a simple indexed scan. MySQL will mess you over far worse if you use a subquery in your WHERE clause; you can be looking at 1M+ rows scanned via the power of nested loops.
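A sketch of the pattern I mean (tables hypothetical): older MySQL versions would often execute the IN subquery as a dependent nested loop, re-running it for huge numbers of outer rows, whereas the join form gives the optimizer a fighting chance:

-- the painful pattern: subquery in the WHERE clause
SELECT o.id, o.total
FROM orders o
WHERE o.user_id IN (SELECT u.id FROM users u WHERE u.country = 'DE')
ORDER BY o.id DESC
LIMIT 20 OFFSET 1000;

-- usually much kinder to MySQL: the same thing as a join (users.id is unique)
SELECT o.id, o.total
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.country = 'DE'
ORDER BY o.id DESC
LIMIT 20 OFFSET 1000;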
I've invested several months of my life optimizing arbitrary sort and filter for MySQL results over tables varying from 100k rows to 100M (using time windows to scope things). As long as you can efficiently cut down the underlying data set into the region of 100k using some kind of window - usually recency based - and convince MySQL to filter by this before it sorts or does any other kind of filter - then limit / offset pagination doesn't hurt.
You can then use higher-level pagination on the recency window if it becomes necessary.
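Roughly the shape of it (column names illustrative): get the recency window applied first, then let limit/offset page over the much smaller slice:

SELECT id, title, created_at
FROM events
WHERE created_at >= NOW() - INTERVAL 7 DAY  -- recency window cuts the set to ~100k rows
  AND status = 'published'                  -- any other filters apply within the window
ORDER BY created_at DESC, id DESC
LIMIT 20 OFFSET 200;                        -- offsets stay cheap inside the window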
Some systems do. Elasticsearch is the example that comes to mind. By default, configuration prevents you from offsetting past I believe it's result 10,000. You can of course increase this limit, but there is a sensible default.
Good documentation, useful to the humans reading it in actual, real life? Solving the actual problem they’re having and not just “technically correct” answering the exact question they asked?
I'm asking for a note about an extremely common performance issue to be added to the documentation. Any database documentation worth its salt is full of performance notes already. It hardly seems like there's a slippery slope down to... whatever it is you seem to fear here.
It gets worse when you want to allow sorting the paginated results by something other than the integer primary key, such as a name. Duplicate names with different ids means having to use GROUP BY.
I typed out the example queries below from memory, hopefully I haven't made any logic errors. Note:
a) Requires additional indexes to make it efficient (an example index follows the queries below).
b) Accounts for inserted and deleted rows.
c) You cannot use page numbers. You can only do first page, next and previous pages, and last page.
d) Example is for 20 items per page. The query pulls 21 (+1) items, so app can determine whether there is a next or previous page.
e) In a web app or api for example, your parameters get crazy, such as: /users?page=next&sort=name&sortId=20&sortVal=George
-- first page
SELECT name, id
FROM users
GROUP BY name, id
ORDER BY name, id
LIMIT 21;
-- next page (ex: last item on current
-- page has id 20 and name 'George')
SELECT name, id
FROM users
WHERE (name, id) > ('George', 20)
GROUP BY name, id
ORDER BY name, id
LIMIT 21;
-- prev page (ex: first item on current
-- page has id 21 and name 'Harry')
SELECT name, id
FROM users
WHERE (name, id) < ('Harry', 21)
GROUP BY name, id
-- need DESC, reverse order in app for display
ORDER BY name DESC, id DESC
LIMIT 21;
-- last page
SELECT name, id
FROM users
GROUP BY name, id
-- need DESC, reverse order in app for display
ORDER BY name DESC, id DESC
LIMIT 21;
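For point (a), the supporting index is something along these lines; it lets the WHERE, GROUP BY, ORDER BY and LIMIT above be served by an ordered index scan:

CREATE INDEX idx_users_name_id ON users (name, id);

On InnoDB the primary key is appended to every secondary index anyway, so an index on just (name) usually behaves the same way.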
> c) You cannot use page numbers. You can only do first page, next and previous pages, and last page.
To get to page 9, you start on the first page, then hit "next page" 8 times. This is why you see this kind of UI fairly often. First (<<), previous (<), next (>), last (>>). No numbers.
Technically you can do relative page numbers (i.e. current page + 8) by increasing the LIMIT of the query and skipping over results. It's really messy to handle, and you can never use "real", absolute page numbers, because inserted and deleted records can change how many next or previous pages there will be.
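In SQL terms (reusing the 'George'/id-20 example from the queries above), "jump ahead 8 pages" ends up as a keyset condition plus a small, bounded offset rather than a global OFFSET over the whole table:

SELECT name, id
FROM users
WHERE (name, id) > ('George', 20)
ORDER BY name, id
LIMIT 21 OFFSET 160;  -- skip 8 pages of 20; the offset is capped by how far users can jump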
it's kind of crazy that here in 2017 there are still discussions and debates on what exactly is the best way to paginate a large table in a database.
I'm leaning with @jerf that the LIMIT/OFFSET approach can trap you at the 100M-record scale; however, in all honesty, most of us don't deal with databases of this size. As @alvil stated, his approach works fine for 1M records, which is way more than most personal and application databases will contain.
Maybe the solution is for the database vendors to actually implement and document a definitive solution for this, perhaps re-engineering the LIMIT/OFFSET approach under the hood to take advantage of this kind of technique.
As for the article... I really have a personal problem when authors get lazy and take the easy way out by assuming everyone knows what they are talking about. Take this portion of the article:
========================
Simplified algorithm:
We get PAGE_SIZE number of records from the table. Starting offset value is 0.
Use the max returned value for user_id in the batch as the offset for the next page.
Get the next batch from the records which have user_id value higher than current offset.
========================
The author took the time to write an entire article debating and demonstrating their solution to the whole pagination problem and they couldn't take 5 minutes to show the code behind these steps in their solution?
As I stated, it's just a personal thing.
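For anyone who wants it spelled out, a minimal sketch of those three steps might look like this (my reading of the article, not the author's actual code; assumes an indexed user_id and a page size of 100):

-- first batch, "starting offset value is 0"
SELECT * FROM users
WHERE user_id > 0
ORDER BY user_id
LIMIT 100;

-- remember the max user_id from that batch (say it was 100737)
-- and use it as the "offset" for the next batch
SELECT * FROM users
WHERE user_id > 100737
ORDER BY user_id
LIMIT 100;

-- repeat until a batch returns fewer than 100 rows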
UPDATE: Down the internet rabbit hole I go. Here is a very nice article similarly demonstrating and debating the LIMIT/OFFSET approach in MSSQL, comparing their OFFSET/FETCH syntax with a CTE-based approach.
Exactly. The "correct solution" would be my first attempt, as it uses the (clustered) index scan to count pages/records; it's likely to be the fastest option, whereas the first method should not even be an option. This should be immediately obvious to someone who understands how to develop against virtually any RDBMS.
This is also, more broadly, something I've been thinking about recently. It feels (to me) like industry knowledge is getting lost somewhere, because you see people trying to reinvent the wheel in a lot of technical areas. One of them is data: there are proven approaches to most problems faced by the average organization, especially when it comes to designing and managing a data platform. Yet, you see people rolling their own approaches/designs/methodologies to basic things like ingestion, ETL and data modelling when the traditional approaches would suffice and would take half the time to implement and one third of the effort to maintain. It's great that people try to innovate, but what I see regularly is more like people trying to solve a solved problem without bothering to educate themselves on how it was solved so far.
My day job at a smallish company has five different applications with tables of more than 100M rows. This is not uncommon at all, especially for apps that need to retain audit history tables.
Assuming you can’t toss history for business or compliance reasons, tables with 100M rows become common when an app has been in service for more than a few years. Even in SMBs.
You can't: On slide 43 you can also see that keyset pagination has some limitations: most notably that you cannot directly navigate to arbitrary pages. However, this is not a problem when using infinite scrolling. Showing page number to click on is a poor navigation interface anyway—IMHO.
As someone who often skips a few pages when navigating such interfaces, I say "thanks, but no thanks".
> In my case with 100 000 000 records, the query is never finished. The DBMS just kills it. Why? Probably, because it led to the attempt to load the whole table into RAM.
What? Why? It should just instantiate a cursor and return the first batch size worth of records.
I’m not sure why the DB would load the entire result set into RAM to return results, unless it requires a sort on a non-indexed field.
MySQL would fall back to a disk-based sort once the data set is too large.
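For what it's worth, the "just instantiate a cursor" version looks something like this in PostgreSQL-flavoured SQL (table name illustrative; MySQL only exposes cursors inside stored programs, so there you'd lean on the client's unbuffered/streaming mode instead):

BEGIN;
DECLARE big_scan CURSOR FOR SELECT * FROM users;
FETCH 10000 FROM big_scan;  -- first batch; keep FETCHing until nothing comes back
CLOSE big_scan;
COMMIT;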
Perhaps it's his client that is OOM-ing? Back in the day, I remember the PHP mysql client loading the entire result set into a local buffer unless you explicitly asked for the output to be streamed.
> You need to walk thru the table,
> extract each record,
> transform it inside your application’s code
> and insert to another place.
The writer said his database died in the middle of the query when he selected the 100,000,000 rows without first breaking them into pages.
It looks like he's using MySQL. I've used PostgreSQL for over a decade, and I just don't see this happening, though I've tested only on a table with a few million rows. But anyway I doubt PostgreSQL loads the whole table into memory when you select it. It seems to load just parts at a time, as needed. And for what it's worth, an offset of a few million rows still took just a few seconds.
Actually, what I would first try is to dump the table to a file, transform it if possible with Linux command-line tools like sed and awk, and then load the file into the new table.
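In PostgreSQL terms that route is roughly the following (paths and table names made up):

-- dump to a flat file (server-side; use \copy from psql if you only have client access)
COPY users TO '/tmp/users.csv' WITH (FORMAT csv, HEADER);
-- transform /tmp/users.csv outside the database with sed/awk/etc.
COPY users_new FROM '/tmp/users_transformed.csv' WITH (FORMAT csv, HEADER);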
Can anyone confirm if Postgres would die like MySQL did on too big a dataset --- selecting everything without paginating but fetching just one row at a time in the application?
> I've tested only on a table with a few million rows
I've tested postgres with > 250m rows. It has problems. I've also tested to 1b rows, same issue.
Now throw in some sum, count and group by functions, more issues.
Add a join? Now you've major problems.
Add several joins? Woah there!
I've been able to bring a 5-node CitusDB cluster to its knees, where the only solution was to scale out massively to a double-digit number of servers. But you can't do that on a limited budget.
The caveat here, this was about 2 years ago now. I don't know if there have been any improvements since that time.
I don't have a Postgres database now; I've since migrated over to MemSQL, as the majority of my work is OLAP.
I run a 270m rows database for my IRC bouncer, and run fulltext searches over it all the time, with many joins, expensive ranking, etc. On a single machine.
> Can anyone confirm if Postgres would die like MySQL did on too big a dataset --- selecting everything without paginating but fetching just one row at a time in the application?
Depends on the client library. A lot of client-side SQL interfaces load the entire result set into memory as arrays of hashes or similar structures.
I'd seriously expect anyone working with such large datasets to be working with lower level libraries when interfacing with the database. Stuff like ActiveRecord is fine for simple operations and small datasets, but surely at the scale of hundreds of millions of records you can afford to hire an engineer capable of writing SQL and understanding the underlying database operations? Am I just being naïve?
In my experience, if you're working with moderately large data sets and want to provide interesting UI with rich filtering operations, you'll want a lot more control over execution plans than straightforward declarative SQL gives you. You end up forcing the database's hand by nesting subtable queries; and since ORM tools don't generate queries in this style, you write your own SQL generation library.
The writer is incorrect. MySQL doesn't wait for query execution to finish to start sending rows, unless it has to. For a SELECT with no WHERE, it definitely doesn't have to (ORDER BY requiring a sort step is one case where it does). However, the default behavior of most (all I've ever run in to) MySQL clients is to buffer the entire result set before returning, to minimize the time the server spends executing the statement and holding locks.
When you show a paginated list to users, they usually want some sort order. Often they also want to be able to sort on any of several columns. I don't remember what example the writer had, but sorting is normal.
>> selecting everything without paginating but fetching just one row at a time in the application
This is called an "unbuffered query". The problem usually isn't with the number of rows you're trying to read; it's about what the database must do on its end before it can know which rows to send first. ORDER BY being one example wherein, unless you have perfect indexes including the sorted column(s), the database has to generate the entire resultset before even sending the first row. That means writing potentially millions of rows' worth of information to memory or - more commonly at large sizes - to a temporary table on disk.
Unbuffered queries are fairly rare in the "real world". While in theory it means you're parsing results faster and using less memory on the application side, it introduces difficulties like the fact that you can't run a subsequent query on the same connection until you've finished reading all rows from the original query.
When an application is reading this many rows with a single query, it's usually an indication that the app is poorly written. Of course there are exceptions, though typically reserved for maintenance scripts, reporting, data migrations, etc.
That's great if you need to do some kind of crawl through the DB for background work, but fails to support key elements (filtering, non-primary-key sorting) that are probably hard requirements at the API level.
You can still support arbitrary filters and ordering, as long as you always sort by the ID last. The nice thing is the queries actually get more and more efficient as you approach the last page. (The normal approaches have the same cost across all pages, or even get more expensive toward the latter pages.)
The only thing this approach does not support is jumping to an arbitrary page number.
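Concretely (columns hypothetical): the filter and sort stay whatever the user asked for, with the id appended last in both the ORDER BY and the keyset predicate:

SELECT id, name, score
FROM players
WHERE region = 'EU'                -- arbitrary user-chosen filter
  AND (score, id) < (1250, 98431)  -- keyset: last row of the previous page
ORDER BY score DESC, id DESC       -- user's sort, with id as the final tiebreaker
LIMIT 20;

The row-value comparison works here because both columns sort in the same direction; mixed ASC/DESC ordering needs the expanded OR form instead.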
I wonder if an indexed "PageNumber" column could work? Obviously wouldn't work for arbitrary queries, but if you had a single, commonly run query it could be used to speed up that one...
With this approach you can't have page numbers on the front-end, just first, prev, next and maybe last; you can't jump to a particular page. It's an old technique and I like it, but if it were usable everywhere, everyone would already be using it, and they're not.
For example you have a forum post with 1000 messages starting in 2007, and you are dying to know what happened in the discourse in 2011. So basically you need to jump in the middle by bisection.
1000 is tiny by DB standards, but solutions like this one are often bolted on without consideration.
That's a great example. I understand the use case now.
I note this use case can be solved using keyset pagination instead of offset/limit, because the ordering of messages is stable: messages are never inserted in the middle of the list; they are appended at the end. We can assign a number to each message and use that number to filter and sort.
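A sketch of that (schema hypothetical): give each message a per-thread sequence number when it is written, and a page becomes a simple indexed range, which also lets you jump straight into the middle of a 2011-era thread:

SELECT msg_no, author, body, posted_at
FROM messages
WHERE thread_id = 42
  AND msg_no BETWEEN 401 AND 420  -- "page 21" at 20 messages per page, no OFFSET needed
ORDER BY msg_no;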
You keep it and mark it "deleted", and you show "deleted" in place of the message. This is what most forums do already, independently of the underlying pagination mechanism.
Some APIs expose the key offset mechanism. For example, in most OpenStack APIs, instead of ?page=2, you say ?marker=12345, where 12345 is the ID of the last object on the previous page. The end of the pagination is signalled by an empty response.
Storing the max id returned from the batch for use as an offset for the next batch is something I've done ever since I started programming - is it supposed to be revolutionary advice?
I usually use query similar to this one (in PHP/MySQL), it's much faster than traditional offset query. You can compare it for yourself using EXPLAIN. Sure not for 100M records, but for 1M or so :)
$items = db("SELECT a, b, c
             FROM item
             JOIN (SELECT id
                   FROM item
                   WHERE <your-where-conditions>
                   ORDER BY id DESC
                   LIMIT <offset>, <limit>) AS x
               ON x.id = item.id
             ORDER BY item.id DESC"); // re-apply the order on the outer query
[0] - https://dev.mysql.com/doc/refman/5.7/en/server-system-variab...