Data Modeling with Apache Cassandra


Motivation

In the previous post, I talked about data modeling with Postgres and how to create an ETL pipeline that migrates data from a directory of CSV files to a Postgres database.

In this post, I will dive into data modeling with Apache Cassandra, a NoSQL database management system.

So why do we need NoSQL anyway?

Relational database management systems have drawbacks when it comes to big data, where you need:

  • high availability
  • horizontal scaling
  • fast read and write throughput
  • flexibility

Apache Cassandra is built for high availability, horizontal scalability, and fault tolerance.

Introduction

A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analysis team is particularly interested in understanding what songs users are listening to. Currently, there is no easy way to query the data to generate the results, since the data reside in a directory of CSV files of user activity on the app.

They’d like a data engineer to create an Apache Cassandra database against which they can run queries on song play data to answer these questions. Our role is to create a database for this analysis. We’ll be able to test the database by running queries given by the analytics team at Sparkify.

To follow along with the upcoming instructions, please see the project’s GitHub repo.

Starting with Apache Cassandra

Before we perform ETL and create an Apache Cassandra database, we need to set it up to run on our machine. This is a three-step process:

Create a Cluster

First, we make a connection to a Cassandra instance on our local machine. To establish the connection and begin executing queries, we create a session.
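A minimal sketch of this setup with the Python cassandra-driver, assuming a local Cassandra node listening on the default port:

```python
from cassandra.cluster import Cluster

# Connect to the Cassandra instance running on the local machine
# (default port 9042)
cluster = Cluster(['127.0.0.1'])

# The session is used to execute all subsequent queries
session = cluster.connect()
```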

Create a Keyspace

Next, we create a keyspace. This is roughly analogous to creating a database in Postgres, except that instead of host and user privileges we specify a replication strategy.
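A sketch of the statement; the keyspace name sparkify and the single-node replication settings are assumptions:

```python
# Create the keyspace -- SimpleStrategy with replication_factor 1
# is enough for a single local node
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
```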

Set Keyspace

Finally, we tell the session which keyspace to run our queries against.
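With the driver this is a single call, again assuming the sparkify keyspace from the previous step:

```python
# All queries on this session now run against the sparkify keyspace
session.set_keyspace('sparkify')
```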

This is very similar to what we did when we connected to the Postgres database in the previous project and got a cursor to it, like this:

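A rough psycopg2 equivalent for comparison; the database name and credentials here are placeholders:

```python
import psycopg2

# Connect to the Postgres database and get a cursor to execute queries
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
```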

Data Modeling

In Apache Cassandra, we model our data around the queries we will perform. Operations like GROUP BY and JOIN are highly discouraged. Because the data is distributed across multiple nodes, answering such queries would mean scanning the entire dataset and shipping it from many nodes to a single machine, which slows the system down and can overwhelm that machine.

Now we will create tables for the following queries:

Part 1: Give me the artist, song title and song’s length in the music app history that was heard during sessionId = 338, and itemInSession = 4

This defines the first table. We set the partition key and clustering column based on the last part of the question: sessionId as the partition key and itemInSession as the clustering column. These two columns together form the primary key. Here is an introduction to primary keys, partition keys, and clustering columns.
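A sketch of the table definition; the column types are assumptions based on the fields in the query:

```python
# sessionId is the partition key and itemInSession the clustering
# column; together they form the primary key
session.execute("""
    CREATE TABLE IF NOT EXISTS discography (
        sessionId INT,
        itemInSession INT,
        artist TEXT,
        song TEXT,
        length FLOAT,
        PRIMARY KEY (sessionId, itemInSession)
    )
""")
```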

Transformation and ETL part

Here we read every line of the CSV file and extract the appropriate fields to INSERT into the table discography.
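A sketch of that loop; the file name event_datafile_new.csv and the column indices are assumptions about the pre-processed CSV layout:

```python
import csv

with open('event_datafile_new.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    insert = """
        INSERT INTO discography (sessionId, itemInSession, artist, song, length)
        VALUES (%s, %s, %s, %s, %s)
    """
    for line in reader:
        # Column indices assume the layout of the pre-processed CSV
        session.execute(insert, (int(line[8]), int(line[3]),
                                 line[0], line[9], float(line[5])))
```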

Let’s see if the ETL works correctly.

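We simply run the query from Part 1 against the new table:

```python
# The table was designed for exactly this query
rows = session.execute("""
    SELECT artist, song, length FROM discography
    WHERE sessionId = 338 AND itemInSession = 4
""")
for row in rows:
    print(row.artist, row.song, row.length)
```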

In the same way, we create tables to optimize for the other two queries.

Part 2: Give me only the following: name of artist, song (sorted by itemInSession) and user (first and last name) for userid = 10, sessionid = 182

This defines the second table. We use the composite (userId, sessionId) as the partition key and itemInSession as the clustering column, following the conditions set in the question; the clustering column also gives us the sorting by itemInSession that the query asks for.
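A sketch of the definition; again, the column types are assumptions:

```python
# Composite partition key (userId, sessionId); the clustering column
# itemInSession keeps rows within a session sorted, as required
session.execute("""
    CREATE TABLE IF NOT EXISTS user_artist_relation (
        userId INT,
        sessionId INT,
        itemInSession INT,
        artist TEXT,
        song TEXT,
        firstName TEXT,
        lastName TEXT,
        PRIMARY KEY ((userId, sessionId), itemInSession)
    )
""")
```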

Transformation and ETL part

Here we read every line of the CSV file and extract the appropriate fields to INSERT into the table user_artist_relation.
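The same loop as before, with the column indices again assumed from the CSV layout:

```python
import csv

with open('event_datafile_new.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    insert = """
        INSERT INTO user_artist_relation
            (userId, sessionId, itemInSession, artist, song, firstName, lastName)
        VALUES (%s, %s, %s, %s, %s, %s, %s)
    """
    for line in reader:
        session.execute(insert, (int(line[10]), int(line[8]), int(line[3]),
                                 line[0], line[9], line[1], line[4]))
```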

Let’s see if the ETL works correctly.

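Running the query from Part 2 (note that the driver lower-cases column names on the returned rows):

```python
rows = session.execute("""
    SELECT artist, song, firstName, lastName FROM user_artist_relation
    WHERE userId = 10 AND sessionId = 182
""")
for row in rows:
    # Rows come back sorted by itemInSession thanks to the clustering column
    print(row.artist, row.song, row.firstname, row.lastname)
```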

Part 3: Give me every user name (first and last) in my music app history who listened to the song ‘All Hands Against His Own’

This defines the third table. We set the primary key as the two columns song and userId: song is the partition key, and userId is a clustering column that keeps one row per user who listened to the song.
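A sketch of the definition, under the same type assumptions as before:

```python
# Partition key song plus clustering column userId gives one row
# per (song, user) pair, so no listener overwrites another
session.execute("""
    CREATE TABLE IF NOT EXISTS user_info (
        song TEXT,
        userId INT,
        firstName TEXT,
        lastName TEXT,
        PRIMARY KEY (song, userId)
    )
""")
```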

Transformation and ETL part

Here we read every line of the CSV file and extract the appropriate fields to INSERT into the table user_info.
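And the corresponding loop, with the usual caveat about the assumed column indices:

```python
import csv

with open('event_datafile_new.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    insert = """
        INSERT INTO user_info (song, userId, firstName, lastName)
        VALUES (%s, %s, %s, %s)
    """
    for line in reader:
        session.execute(insert, (line[9], int(line[10]), line[1], line[4]))
```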

Let’s see if the ETL works correctly.
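Running the query from Part 3:

```python
rows = session.execute("""
    SELECT firstName, lastName FROM user_info
    WHERE song = 'All Hands Against His Own'
""")
for row in rows:
    print(row.firstname, row.lastname)
```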

Conclusion

We pre-processed the entire events_data directory and created three tables in the Apache Cassandra database. As mentioned earlier, we model our tables based on the queries we will perform on them. So, to ensure maximum availability and fast throughput, we create separate tables for different types of queries.

If you want to learn more about the Postgres data modeling that preceded this, you can read about it here. Next, we will discuss Cloud Data Warehousing.
