I’ve used MySQL in almost all of my database work for the last 20 years. Before the project I’m on now, my biggest database was 20-30GB, with some tables holding tens of millions of rows. No matter what the project, I could always spin up a copy of the database locally and work with it without issue.

That all changed today.

Today, I began working with what is, to me, a very large dataset. The footprint on the server is 4.5TB spanning 499 tables. The bulk of the data lives in a handful of tables; the 10 largest are shown here:

Table      Data (GB)   Index (GB)   Total (GB)
table_1     1186.060        0.000     1186.060
table_2      697.883        8.957      706.840
table_3      204.270       10.424      214.693
table_4      149.770      141.131      290.900
table_5      140.612       81.690      222.303
table_6      129.720        4.805      134.524
table_7       42.407       24.026       66.434
table_8       38.010        0.233       38.242
table_9       34.674        1.625       36.299
table_10      29.678        4.032       33.710

In those 10 tables alone there is nearly 2.9TB of data. This database serves as the public-facing side of a web application that processes 20-30 transactions per second. Some reports are generated by queries that take 5-10 minutes per execution (the bulk of the queries are very fast, but there are some outliers).
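
Numbers like the ones above come straight out of information_schema. Something along these lines produces that breakdown (the host and schema names are placeholders):

    # Per-table data and index footprint, largest first; sizes in GB.
    # Host and schema names are placeholders.
    mysql -h prod-replica.example.com -e "
        SELECT table_name,
               ROUND(data_length  / 1024 / 1024 / 1024, 3) AS data_gb,
               ROUND(index_length / 1024 / 1024 / 1024, 3) AS index_gb,
               ROUND((data_length + index_length) / 1024 / 1024 / 1024, 3) AS total_gb
        FROM information_schema.tables
        WHERE table_schema = 'app_production'
        ORDER BY data_gb DESC
        LIMIT 10;"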

All of this data lives on an AWS EC2 instance, is replicated to another EC2 instance, and is then replicated to two more instances that serve as “hot-swappable” backups.
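
Part of keeping that chain trustworthy is simply checking it regularly. A minimal sketch of a health check, with the hostnames below as placeholders (on MySQL 8.0.22 and later the statement and field names use REPLICA rather than SLAVE):

    # Spot-check replication on each box in the chain.
    # Hostnames are placeholders; use SHOW REPLICA STATUS on newer MySQL.
    for REPLICA in replica-1.example.com backup-1.example.com backup-2.example.com; do
        echo "== $REPLICA =="
        mysql -h "$REPLICA" -e "SHOW SLAVE STATUS\G" |
            grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    done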

I’ve been given the task of making the reporting side of things faster and ensuring the reliability of the setup. The problem: since it’s a production database, I can’t lock tables for any significant amount of time without a lot of notice, and adding indexes to some of these tables takes hours. My first step was to get a copy of the database onto another machine that I can freely modify and measure changes against. We had two options:

  • Another EC2 instance - This would be the fastest to set up, but would cost roughly $850/mo.
  • A local install - This is slower to get going, but has a fixed, one-time cost.

I decided to buy a 12TB spinning disk online, which cost roughly $600. My workstation is connected to a 300Mbps fibre connection and has 64GB of RAM, so I could copy the dataset to my machine over the course of a day or two and then hammer the database as much as I want without incurring any more cost. I also have a 1TB drive sitting unused, which should be enough space to store a gzipped database dump. Step one was getting a full dump locally. I don’t need it to be a perfect snapshot; I just want a reasonable representation of each table in my local setup. As such, I don’t need any replication data, and I don’t need the dump done as a single transaction. I went with the following method:
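
In rough terms: dump table by table with mysqldump, skip the locks and the replication metadata, and gzip on the fly. A sketch of that loop, with the host, schema, and output path as placeholders and credentials assumed to live in ~/.my.cnf:

    # Dump each table individually so a failed transfer only costs one table,
    # gzipping on the fly so everything fits on the spare 1TB drive.
    HOST=prod-replica.example.com
    DB=app_production
    OUT=/mnt/dumps

    for TABLE in $(mysql -h "$HOST" -N -e "SHOW TABLES" "$DB"); do
        echo "Dumping $TABLE ..."
        # --quick streams rows instead of buffering whole tables in memory,
        # --skip-lock-tables avoids holding locks on the production tables,
        # --set-gtid-purged=OFF leaves the replication/GTID metadata out.
        mysqldump -h "$HOST" --quick --skip-lock-tables --set-gtid-purged=OFF \
            "$DB" "$TABLE" | gzip > "$OUT/$TABLE.sql.gz"
    done

Since each table is dumped on its own, the tables won’t be consistent with one another, but for a “reasonable representation” of the data that’s good enough.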

This at least gives me a jumping-off point. Now all I have to do is wait for the data to transfer, and for customs to clear the drive, and then I can start the long process of importing all of that data into a local MySQL instance. My next task will be to evaluate the contents of these tables and see if there’s a way to remove some excess baggage before modifying the actual structure.
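
For what it’s worth, the import side will probably end up as the mirror image of the dump loop; a rough sketch, with the same placeholder paths and schema name, and with durability settings relaxed since this copy is disposable:

    # Load the gzipped dumps into the local instance. Relaxed durability is fine
    # here because the copy is disposable; SET GLOBAL changes revert on restart.
    mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2; SET GLOBAL sync_binlog = 0;"
    mysql -e "CREATE DATABASE IF NOT EXISTS app_production"

    for DUMP in /mnt/dumps/*.sql.gz; do
        echo "Loading $DUMP ..."
        gunzip -c "$DUMP" | mysql app_production
    done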