Your manager’s peers have been bragging a lot lately about their data warehouses, analytics, and charts, and now a steady stream of data-related questions are being sent your way. Your department maintains several databases, and the data they contain has the potential to answer everything management is asking for. But the databases are needed for day-to-day operations, and can’t scale to answer these often highly specific questions such as, “How many asparaguses were consumed by men named Fonzie in Cleveland on Tuesdays in 2013?”. How to unlock the potential of this data?
You’ve probably heard of data warehouses, which are tailor-made for this sort of witchcraft. They make it possible to unlock every bit of value from data, and find answers wickedly fast. In the past, creating and maintaining data warehouses meant large, ongoing investments in hardware, software, and people to run them. This would be a hard sell – isn’t the company already spending enough?! Good news, however! In this day of cloud computing, it’s incredibly simple to create, load, and query data warehouses. They typically charge on a usage basis, meaning you don’t need the initial upfront capital investment to get off the ground. And they are super fast – far more powerful than anything you could run in-house.
This post will focus on Amazon Web Services Redshift (Amazon Web Services = AWS). And as a bonus, I’ll demonstrate the incredible Dreamfactory, which automatically builds a slick REST API interface over the top. From there, you’re a GUI away from giving management everything they could ask for, and wowing them with extras they hadn’t even thought of. They can now stand tall amongst their fellow executives, knowing you have their back.
AWS RedshiftAWS Redshift is built upon PostgreSQL, but has been dramatically enhanced to run at “cloud scale” within AWS. There are a few ingredients to this secret sauce:
Column-oriented storageWhile you don’t need a deep understanding of what’s happening under the hood to use it, Redshift employs a fascinating approach to achieve it’s mind-boggling performance. Let’s say you have data that looks like the following:
ID NAME CREATED DESCRIPTION AMOUNT 1 Harold 2018/01/01 Membership 10.00 2 Susan 2017/11/15 Penalty 5.00 3 Thomas 2016/10/01 Membership 8.00Most SQL databases you’ve probably used in the past are row-based, which means they store their data something like this:
1:Harold,2018/01/01,Membership,10.00; 2:Susan,2017/11/15,Penalty,5.00; 3:Thomas,2016/10/01,Membership,8.00;This is the efficient way to maximize storage, and works well for retrieving data in the “traditional fashion” (rows at a time). But when you want to slice and dice this data, it doesn’t scale very well. If you’ve got large (business-scale) volumes of data, and a variety of ways you want to query it, you can really start to strain your database. Column-based databases, on the other hand, flip this idea on its head, and store the information in a column-based format, with the *data* serving as the *key*. So the above might look something like this:
Harold:1;Susan:2;Thomas:3; 2018/01/01:1;2017/11/15:2;2016/10/01:3; Membership:1,3;Penalty:2; 10.00:1;5.00:2;8.00:3;This drastically improves query performance. For example, when searching for “DESCRIPTION == ‘Membership'”, the query only needs to make one database call (“give me the items with a ‘DESCRIPTION’ of ‘Membership'”), instead of inspecting each row individually (as it would have to do in a traditional, row-based database). Very cool, very fast!
Massive ParallelizationWhen I picture what the AWS cloud must look like, I usually conjure something up from the Matrix (except it’s full of regular computers, rather than, well, humans). Or maybe Star Trek’s “Borg”, a ridiculous planet-cube flying through space, sucking up other civilizations. I guess both of those images are a little disturbing. A safer mental image is this – data centers spanning the globe, loaded with racks and racks of computers, all connected and working together. For most computing tasks, throwing more hardware at the problem doesn’t automatically increase performance. There are bottlenecks that remain in place no matter how many processors are churning away. In our “traditional database” example, this bottleneck is typically disk I/O – the processors are all trying to grab data from the same place. To overcome this, the architecture and storage have to be arranged in a way that can benefit from parallelization. Which is exactly the case with AWS Redshift. Due to the column-based design described above, Redshift is able to take full advantage of adding processors, and it’s almost linearly scalable. This means if you double the number of computers (“nodes”, in Redshift-speak), the performance doubles. And so on. Combine this scalability with the ridiculous number of computers AWS has at it’s disposal (specifically, several Borgs-worth), and it’s like staring out at a starry night. It goes on forever in all directions.
How this works for youIf you’re sold on the power of AWS Redshift, then you’ll be pleased to learn that setup is incredibly simple. AWS documentation is top notch, a crucial thing in this brave new world. When writing this post, I followed their tutorial, and it all went smoothly. Probably took me 15 minutes, and I had the example up and running. If you already have SQL expertise, you won’t have any problem picking up Redshift syntax. There are some differences and nuances, but the standard “things” (joins, where clauses, etc) all work as expected. I typically use Microsoft’s SQL Server Management Studio (SSMS), and was able to connect to Redshift with no problem (after setting it up as a linked server). Your favorite SQL client will presumably work here as well (anything that supports JDBC or ODBC drivers). Once you get your feet wet, there are myriad tools that will load your business data into Redshift. If you’ve got SQL chops in house, I’d start with the AWS documentation, and go from there. If you need a little (or a lot) of help, a whole ecosystem of companies and tools have sprung up around Redshift. A quick Google search will introduce you to them. When you’re up and running, and growing more comfortable demanding more from the system, AWS makes it incredibly simple to add capacity. Thanks to the brilliant Redshift architecture, you just add nodes, and AWS takes care of the rest. Their billing dashboard will show you what it’s costing in real time, with no hidden or creeping costs of data centers, hardware upgrades, things going bump in the night, etc. So much magic happening under the covers, and you get the credit. The joys of cloud computing!
My Humble ExampleWhen writing this, I used the example AWS provides (it consists of a few tables containing some fake Sales data). With everything in place, I can query from SSMS (with a little bit of “linked server” glue syntax):
exec ('-- Find total sales on a given calendar date. SELECT sum(qtysold) FROM sales, date WHERE sales.dateid = date.dateid AND caldate = ''2008-01-05'';') at redshift sum -------------------- 210 (1 row affected)I get a thrill when a chain of systems, architectures, and networks all flow together nicely. Somewhere in a behemoth of a data center, a processor heard my cry, and spun out this result in response. Amazing.