Wednesday, February 22, 2017

Vertica Architecture (Part 2)

This is continued from my previous post.

In review, I spoke with some very smart and experienced Vertica consultants regarding the DR architecture, and found the most obvious solutions all had huge drawbacks.

1. Dual-Load: Double your license costs(?), there's also the potential to have the two clusters out of synch, which means you need to put logic in your loads to handle the possibility that a load succeeds in datacenter 1 and fails in datacenter 2.
2. Periodic Incremental Backups:Need identical standby system (aka, half the capacity and performance of your hardware because the standby is typically idle)
3. Replication solutions provided by storage vendors: The recommended design uses local storage, not storage arrays, so this is difficult to implement, in addition to the expense and the potential of replicating media failures.

At first, here's what we did instead:



Initially (aka don't do this), we set up 2 failgroups, 3 nodes in datacenter 1 and 3 in datacenter 2. Failgroups in Vertica are intended for use where you could have known dependencies that are transparent to Vertica...for example, a server rack.  Both failgroups are in the same cluster, and so data that's entered into nodes 1,2 or 3 get replicated automatically by Vertica to the other failgroup's nodes 4, 5 and 6.

We were trying to protect ourselves from the possibility of a complete datacenter failure, or a WAN failure.  The WAN is a 10Gb, low latency dark fiber link with a ring design, so highly available.  Although the network is HA, the occasional "blip" happens, where a very brief outage causes a disconnection.  Clusters don't like disconnections.

We were very proud of this design until we tested it...it completely failed.  It made sense...although logically we had all the data we needed in a single failgroup, if we simulated a network outage we'd see all 6 nodes go down.  This is actually an intentional outcome, and a good thing.  If you've worked with clusters before...you know its much better to have the cluster go down than to have it stay up in a split brain scenario and corrupt all your data.  If the cluster stays up and becomes out of synch, you have to fix whatever the initial issue was, and you compound the problem with the need to restore all your data.

So...intentionally, if you have half your nodes go down, Vertica causes the whole cluster to go down, even if you have all the data you need to stay up in the surviving nodes.  Oracle RAC uses a disk voting mechanism to decide which part of the cluster stays up, but there's no such mechanism in Vertica.

We were back to the 3 original options...all with their drawbacks.  While pouring over the documentation looking for an out-of-the-box solution, I noticed Vertica 8 introduced a new type of node called an Execute node.  Again...very little documentation on this, but I was told this was a more official way to deal with huge ingest problems like they had at Facebook (35TB/hr).  Instead of using Ephimeral nodes (nodes in transition between being up and being down) like they did, you could create execute nodes that only store the catalog...they store no other data, but only exist for the purpose of ingestion.

Upon testing, we also found Execute nodes "count" as a node in the cluster...so instead of having 6 nodes-3 nodes in DC1 and 3 in DC2, we'd add a 7th node in a cloud (we chose Oracle's cloud.)  Its a great use case for a cloud server because it has almost no outgoing data, almost no CPU utilization (only enough to maintain the catalog) and the only IO is for the catalog.  So now, if DC1 went down, we had a quorum of 4 surviving nodes (4,5,6,7)...if DC2 went down, we still have 4 surviving nodes (1,2,3,7).  If all the nodes stayed up, but the WAN between DC1 and DC2 stopped functioning, Vertica would kill one of the failgroups and continue to function...so no risk of a split brain.

We're continuing to test, but at this point, its performed perfectly.  This has effectively doubled our performance and capacity because we have a 6 node cluster instead of two 3 node clusters.  Its all real time, and there's no complex dual load logic to program in our application.

Next, I'll talk about Vertica backups.

Tuesday, February 21, 2017

Vertica Capacity Architecture

Since nearly the first time I logged in to an Oracle database, I remember finding issues with documentation and the occasional error in documentation.  Its understandable...usually this was due to a change or a new feature that was introduced and the documentation just wasn't updated.  The real problem was in me...I was judging Oracle's documentation vs perfection...I should have lowered my expectations and appreciated what it was instead of being upset for what it wasn't.

While evaluating and designing the architecture for an HP Vertica database for a client, I gained a new appreciation for Oracle's documentation.  I expected to find everything I needed to do a perfect Vertica cluster install across 2 data centers in an active/active configuration for DR.  When the documentation failed and I resorted to gGoogle, I mostly found people with the same questions I had and no solutions.

Soooo...I thought I'd make a few notes on what I learned and landed on. I am by no means a Vertica expert, but I've definitely learned a lot about and I've had the opportunity to stand on the shoulders of a few giants recently.

Our requirement is to store 10TB of actual data, we don't know how well it will compress...so we're ignoring compression for capacity planning purposes.  How much physical storage do you need for that much data?  Vertica licensing is based on data capacity, but that's not the amount of capacity used...its the amount of data ingested.  Vertica makes "projections" that (in Oracle terms) I think of as self-created materialized views and aggregates that can later be used for query re-write.  Vertica will learn from your queries what it needs to do in the future to improve performance, it'll create projections and these projections use storage.  Since there's columnar compression in Vertica by default, these projections are stored efficiently...and they aren't counted toward your licensed total.  I've heard stories that companies have had so many (200+) that the performance of importing data was hampered...these physically stored objects are updated as data is loaded.  Since projections will take up storage you have to account for that in the early design, but it completely depends on your dataset and access patterns.  Estimates based on other companies I've spoken with are between 0% (everything is deduped) and 50% (their ETL is done in Vertica, so less deduplication), so lets say 35%.

Also, you're strongly recommended to use local storage (raid 1+0...mirrored and striped), and the storage is replicated in multiple nodes for protection.  They call this concept k-safety. The idea is that you can lose "k" nodes, and the database would still continue to run normally.  We would run K+1 (the default).

In order to do a rebalance (needed when you add or remove a node), the documentation suggests you have 40% capacity free.

Also, Vertica expects you to isolate your "catalog" metadata from your actual data, so you need to set up one mirrored raid group with ~150GB for catalog...and OS, etc.  They give an example architecture using HP hardware with servers that have 24 slots for drives.  2 of them are used for mirroring the OS/Catalog, leaving 22 for your actual data.  Knowing SSD's are the future for storage, the systems we worked on are Cisco UCS C-series with 24 slots filled with 100% SSD's.  From the feedback from Vertica, this will help with rebuild times, but not so much with normal processing performance, since so much of Vertica is done in memory.  There's a huge price increase in $/GB between 400 and 800GB drives.

So...if you have 6 nodes with 22 slots, each populated with 400GB SSD's, you have 52,800GB. Half that for raid 1+0=26,400.  If you have an HA architecture, you'd expect to half that again (3 nodes in datacenter 1, 3 nodes in datacenter 2)...which brings you to 13,200GB.  Since you have to keep at least 40% free for a rebalance operation, that brings you down to 7,920GB.  We have to account for projections...we said the would be 35% of our dataset...which brings us to 5,148GB.  All the data in Vertica is copied to 2 nodes, so half the storage again....2,574GB.

Hmmm...2.5TB of storage is less than our 10TB requirement.  I'll show you how we changed the design to double capacity in my next post.