I originally drafted this article in 2015, when I was managing a production MariaDB Galera cluster for Stockopedia. It’s been gathering dust in my drafts since then. I don’t remember why I didn’t finish it; maybe the subject needed more treatment, maybe I couldn’t find a proper conclusion, or maybe I just lost interest. I don’t know. It’s January 2020, and I’m finally pushing it out so I can call it done. This article reflects what I learned in that period. I have since moved on, both from active duty at Stocko and from managing databases in general. The state of affairs may have improved since; I haven’t kept up.
At Stocko I wear many hats; one involves running a MariaDB Galera cluster as our primary RDBMS. We came to Galera after running MySQL’s async replication for a while. Our rationale was to start with what’s built in, as that is the path of least resistance. In time, we knew we needed a better way to guarantee cluster reliability. After some research, it became clear Galera was the only actively maintained, open-source solution that could do multi-master replication.
Preparing for Galera
Galera is an independent multi-master replication system for MySQL (and compatible forks). It’s simple to set up and provides strong consistency guarantees (emphasis on the ‘C’ in the CAP theorem). Since it works at the clustering layer it is transparent to the end user, and as an independent layer it can keep up with newer versions of MySQL fairly rapidly.
The catch is, Galera only works with InnoDB (and its source-compatible variants; AFAIK, XtraDB). All other engines remain available for use, but will NOT be replicated. Know this before proceeding. At Stocko, we kept all our data in InnoDB.
Anyone who’s worked with Galera will acknowledge it is its own thing. Its runtime characteristics can be unintuitive if you’re coming from an unclustered MySQL instance. I recently had the cluster choke up in production: CPU usage was pinned at 100% on an 8-core machine, the machine would occasionally go into swap, and queries would queue up. We had to act fast to recover, but we also had to tune the cluster so we could get the most out of our hardware. That experience gave me a better understanding of how to tune for performance and stability.
In this post, I aim to share some of what I learned about keeping a production cluster humming smoothly.
There are countless knobs you can twist, but the 80-20 rule applies here too. Most of your performance will come from:
- tuning to the nature of your queries. Basic query optimization goes a long way.
- tuning InnoDB and Galera to your hardware. Most other MySQL options don’t apply.
This is what our cluster was comprised of:
- MariaDB 10, with Galera built in.
- A cluster of 5 (physical, commodity) machines, connected over a 10 Gbit vLAN.
- 8-core CPUs and 32 GB RAM per machine.
- Load balancing via HAProxy.
- State Snapshot Transfers configured with rsync.
A Note on Hardware
Galera suffers from lower write throughput compared to a well-tuned standalone MySQL. This is unavoidable given how Galera works: a writing node must wait for its writeset to be acknowledged by every other node in the cluster before the commit returns. If you plan on handling load efficiently, your cluster needs to be tuned to the hardware’s IO.
Here’s a basic checklist.
Mirror the Specs
Make sure all your servers are of equal specs. Each should be equally capable of replacing another at a moment’s notice.
The only exception here is if you use a Galera Arbitrator; then a low-spec machine will do. The arbitrator is optional and generally only required if you have a limited budget (say, for 2 masters and an arbitrator). I won’t be discussing it here.
Choose an SSD
Your fastest writes are limited by your slowest machine, so get a good set of SSDs; this raises the ceiling on your QPS. Server-grade SSD prices are converging on consumer-grade levels, and many data centers now provide them as standard; AWS offers them, as do other cloud providers. Go for it.
Choose more RAM
Nothing works better than having all of your data fit in RAM. Cloud providers are driving down the cost of memory; buy as much as you can get hold of. If you can, use faster memory as well. Databases lean heavily on caches, and it will show.
Use a dedicated instance
Needless to say, your machine should be dedicated to the database. Do not have other high-IOPS tools (e.g. a backup utility writing to disk) running on machines that serve queries. A noisy machine is difficult to tune effectively.
This applies equally to the network.
Some Basic settings
Let’s start with the basics.
Store Data and Cache on the SSD
datadir = /path/to/ssd/..
tmpdir = /path/to/ssd/..
If your machine has both HDDs and SSDs, ensure your data and tmp dirs point to the SSD. I usually mount the SSD on a separate path so nothing else competes for IO, and let / live on the spinning disk.
Switch on Performance Schema
In theory there’s a slight performance penalty. I didn’t mind: the extra insight can be crucial for debugging sluggish performance and measuring tune-ups.
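Enabling it is a one-line change in my.cnf; the instrumentation tables are sized automatically at startup:

```ini
[mysqld]
# Adds a little overhead, but pays for itself when you need to debug.
performance_schema = ON
```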
Configure your max connections
max_connections = 3000
The defaults are too low. Raise this to what your machine can handle, and then some. Don’t forget to raise your system ulimits as well.
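On Linux the open-file-descriptor limit is usually the one that bites first, since every connection consumes descriptors. A sketch, assuming MariaDB runs as the mysql user under a SysV-style init (under systemd you’d set LimitNOFILE in the unit file instead):

```ini
# /etc/security/limits.conf
mysql soft nofile 65535
mysql hard nofile 65535
```

Check that mysqld’s own open_files_limit variable picked up the new limit after a restart.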
Set an admin port
extra_port = 33306
extra_max_connections = 10
Set an additional port for the server to listen on. If your node is fully choked, you can still connect on this port to get in.
Tuning for WRITES
My work here can be summarized in 2 principles:
- Tune CPU, RAM and IO to the machine’s capacity.
- Dial down node write-safety to get write-throughput in return. Rely on the cluster for integrity.
Tune to the Hardware
# Number of threads used for applying slave writesets in parallel.
wsrep_slave_threads = 8
The default of 1 severely underutilizes multicore machines. The recommended value is double the number of cores, but I tend to be conservative here: check whether your appliers are idling before going beyond your core count. I already had fast IOPS, so I set this to the CPU’s native core count, i.e. 8.
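A useful proxy for whether more applier threads would help is the wsrep_cert_deps_distance status variable, which reports how many writesets could, on average, be applied in parallel. If it sits well above your thread count, extra threads may pay off:

```sql
-- Average potential parallelism of incoming writesets.
SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance';

-- Average number of writesets actually being applied concurrently.
SHOW GLOBAL STATUS LIKE 'wsrep_apply_window';
```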
innodb_io_capacity=80000 # This was set for a Samsung SSD 850 EVO
The simple rule here is to raise the throttle to just about what the hardware can handle; there’s no need to be conservative. Anything less, and you’re paying for latent capacity you’re not using.
innodb_buffer_pool_size = 20G
# Divide the pool above into subpools, each at least 1G in size.
# This hasn't affected my performance either way.
innodb_buffer_pool_instances = 10
Per MariaDB’s advice, set the innodb_buffer_pool_size to 2/3 of your server memory. This is good advice.
I tuned these values against our read-write load. Our SELECT queries peaked at about 15K queries per second, a bursty load with much lower averages. Almost all of those reads were served between the query cache and the InnoDB buffer pool, so the system stayed quiet under load.
There is no one-size-fits-all solution here. Experiment and tune this throughout your application life-cycle.
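To sanity-check the buffer pool sizing as you experiment, compare logical read requests against reads that had to go to disk; a hit rate in the high 99%s means the working set fits:

```sql
-- Innodb_buffer_pool_read_requests: logical read requests.
-- Innodb_buffer_pool_reads: requests that missed the pool and hit disk.
-- Hit rate = 1 - (reads / read_requests).
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```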
innodb_log_buffer_size = 256M
innodb_log_file_size = 1G
A larger log buffer lets you run larger (single) transactions without performing disk IO. I raised this from the default of 8MB.
Dial down individual consistency
innodb_flush_log_at_trx_commit = 2
With a setting of 2, “logs are written after each transaction commit and flushed to disk once per second”. You could lose up to a second of data if the node dies. However, with Galera guaranteeing consistency across the cluster, we can safely set this to 2. What you get in return is tremendous write throughput; 75x per one SO post.
innodb_flush_method = O_DIRECT_NO_FSYNC
This setting does two things: it writes data with O_DIRECT (skipping the copy into the kernel’s buffers on the way to disk) and it skips the fsync syscall. Again, we’re trading an individual node’s write safety for performance, while retaining write safety at the cluster level. In my case I could tolerate this, as our data was largely non-monetary and non-transactional.
It’s easy to move from strict writes to these more lenient settings. If you decide to go the other way, expect to over-provision the cluster before the switch.
innodb_autoinc_lock_mode = 2
Setting this to 2 prevents table-level locks when auto-increment values are generated. Galera recommends it at 2.
# Use a moderately large Gcache size.
wsrep_provider_options = "gcache.size=10G; gcache.page_size=10G; ...;"
GCache is Galera’s on-node cache. It works like a write journal. When one of your nodes fails and rejoins, it tries to catch up to the cluster state by replaying the writesets it missed. If a donor’s gcache still holds everything the node missed, it catches up via an Incremental State Transfer (IST) and comes back online. If not, it needs a full State Snapshot Transfer (SST), which takes the donor node out of service until both nodes are synced. A large gcache buys you uninterrupted recovery when a node was out for a short while. Definitely a plus!
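To size the gcache, estimate your writeset throughput and multiply by the outage window you want to survive via IST. A rough sketch: sample these counters twice, a few minutes apart, and divide the delta by the interval:

```sql
-- Bytes this node has replicated out, and received from the cluster.
-- (delta over time) ~= writeset rate; gcache.size / rate ~= IST window.
SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
SHOW GLOBAL STATUS LIKE 'wsrep_received_bytes';
```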
Use A Thread Pool
thread_handling = pool-of-threads
thread_stack = 192K
thread_cache_size = 4
thread_pool_size = 8
thread_pool_oversubscribe = 3 # Number of threads per core. Stick to the defaults
My most important change here was switching thread_handling to pool-of-threads; I’m not sure why it isn’t the default. MariaDB’s thread pool documentation has lots of great guidance.
Master Slave Setting For Writes
# See https://www.percona.com/blog/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/
wsrep_provider_options="...; gcs.fc_limit=100; gcs.fc_master_slave=YES;"
It seems completely counter-intuitive to have a master-slave switch in a multi-master setup. But there is one, and I don’t recommend setting it until you have to deal with persistent deadlocks caused by Galera’s optimistic locking.
For one thing, it only applies to writes, not reads. Our proxy was set up to distribute writes evenly across the cluster, and there were occasional deadlocks in high-IO situations, i.e. crucial business moments! Deadlocks require manual intervention, which means indeterminate downtime (unless your application can ROLLBACK, and ours couldn’t). And I don’t like babysitting databases. So I switched this on and never had to deal with it again.
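Before flipping this switch, it’s worth confirming that certification conflicts are actually your problem. These counters track transactions rolled back on a node because of cluster-wide write conflicts:

```sql
-- Local transactions aborted because a replicated writeset won the conflict.
SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';
-- Local writesets that failed certification across the cluster.
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
```

If both stay near zero, evenly distributed writes are probably fine as they are.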
The astute reader will realize this makes your SSD the bottleneck, which was true. I don’t see how to scale this out without sub-clustering DBs according to function, and that’s taking on more complexity than I was willing to manage.
Sizing Up The Cluster
I recommend a minimum cluster size of 5 nodes. Galera requires at least 3 nodes to form a quorum, but when a node goes down it needs a DONOR to sync back up, and a DONOR won’t respond to user requests during a state transfer. So when 1 node goes down, it takes 2 out of service to recover.
The maximum number of nodes that can fail simultaneously without downtime is (N-1)/2. So a 5-node cluster can afford 2 simultaneously incapacitated nodes ((5-1)/2 = 2) without downtime. Adjust this to your SLAs.
A note on even-numbered clusters
Make sure you never run an even-numbered cluster. A split brain will instantly seize it up: your monitoring tools won’t detect downtime, but your machines will go silent, and the cluster won’t recover until the partition is manually resolved. Also, I bet this will happen in the middle of the night!
This may seem like something you only deal with up-front, but you could take a node out for hardware replacement and suddenly you’re on even numbers. Watch out for this.
Keep Some Headroom
Something I’ve learnt painfully is to never have 100% of your cluster serving queries. You need headroom for failover. Guide your queries via a load-balancing solution.
We used HAProxy, with 3 of 5 machines set up as primaries and the rest as failovers. In our setup the failovers were also the primary DONORs for SST, so when a query-serving node went down we lost 1/3 (33%) of our query capacity instead of 2/5 (1 node + 1 donor = 40%).
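The shape of the HAProxy backend looked roughly like this (hostnames, ports and the health-check user are illustrative, not our actual config):

```
# haproxy.cfg (sketch)
backend galera
    mode tcp
    balance leastconn
    option mysql-check user haproxy_check
    server db1 10.0.0.1:3306 check
    server db2 10.0.0.2:3306 check
    server db3 10.0.0.3:3306 check
    # Donors: marked backup, so they only take traffic when primaries are down.
    server db4 10.0.0.4:3306 check backup
    server db5 10.0.0.5:3306 check backup
```

The backup keyword is what gives you the headroom: db4 and db5 sit idle (and donate SSTs) until a primary fails its health check.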
Galera is a great multi-master solution for MySQL. It’s simple to set up, and you should be able to sleep at night on most days. It handles moderate to medium-high IO loads well enough, but if you’re going any higher, consider using Galera to keep your subclusters in sync.
Do you use Galera at your startup? Share your experience in the comments below.