Running Cassandra and PostgreSQL in Docker
Today I started thee Udacity Data Engineering Nanodegree and wanted to follow along the demo projects in Lesson 1 on my local machine. (However, this is optional as Udacity provides the environment in their workspace).
In this post I describe how to set up the environment using Docker containers for Cassandra and Postgres so that you don’t have to install them locally, which is rather cumbersome, especially for Cassandra as you need to install Java first.
1. Creating Cassandra and Postgres Docker containers with docker-compose
Create docker-compose.yml
(with exactly this name) with the following content:
version: '2'
networks:
app-tier:
driver: bridge
services:
cassandra:
image: 'cassandra:latest'
networks:
- app-tier
expose:
- '6000'
ports:
- '6000:9042'
postgres:
image: 'postgres:latest'
restart: always
environment:
POSTGRES_PASSWORD: example
expose:
- '7000'
ports:
- '7000:5432'
Here I connect container ports 9042 (for Cassandra) and 5432 (for Postgres), which are default used by these services, with the host ports, here specified as 6000 and 7000.
Creating a yml-document ensures that you don’t have to set up containers each time you want to use them.
To run Docker containers, you just need to type docker-compose up
in your terminal.
Optional: if you want to check whether Cassandra runs properly, you need first to attach shell to cassandra:latest
container (I will not go into details here, I do it in Visual Studio Code where I have installed Docker extension - more on it here) and type cqlsh
to run a command line shell for interacting with Cassandra through CQL (the Cassandra Query Language). If no errors thrown, congratulations - you can now use Cassandra docker and run CQL queries. Type exit
to exit the cqlsh terminal.
In a similar way, you can check whether PostgreSQL runs properly - attach shell to postgres:latest
and type psql --username=postgres
to log in as created by default postgres user. Now you can type \l
to see the existing databases, one of which is postgres
which we will be connecting to in Jupyter Notebooks. Type exit
to exit the psql terminal. Type exit
again if you want to exit container shell.
Finally, if you wish to stop the container, just press Ctrl+C
in your terminal. Or check which docker containers are running with docker ps
command in your terminal and then type docker kill <container-ID as listed when running docker ps command>
, replacing the phrase in <> with corresponding ID.
2. Setting up environment for Data Engineering Nanodegree
In the rest of this post, I describe how to create a virtual environment for Data Engineering Nanodegree and run the jupyter notebooks provided in Lesson 1.
2.1. Create new pyenv environment
In your terminal, go to a folder where you’re going to save all data for Data Engineering Nanodegree and create a new virtual envirnment by running:
pyenv virtualenv <python version> <name of virtual environment>
, replacing <> with your own values.
For instance, I created a new virtual environment called “dataengineer” using my current Python version 3.8.2 like so:
pyenv virtualenv 3.8.2 dataengineer
To check which Python version is installed on your computer, try pyenv version
.
Then we activate the new virtual environment with pyenv activate dataengineer
and install the following packages:
pip install jupyter notebook
pip install psycopg2-binary
pip install cassandra-driver
Note that I installed the binary version of psycopg2, as I have an error thrown if I try installing psycopg2 as suggested by Udacity.
2.2. Connect to Cassandra and PostgreSQL database in Jupyter Notebook
After completing steps 1 and 2.1., you can now start using Cassandra within Jupyter Notebook “L1_Demo_2_Creating_a_Table_with_Apache_Cassandra” by specifying the host port from container as following:
from cassandra.cluster import Cluster
try:
cluster = Cluster(['127.0.0.1'], port=6000) #If you have a locally installed Apache Cassandra instance
session = cluster.connect()
except Exception as e:
print(e)
In the same way, you can start PostgreSQL within Jupyter Notebook “L1_Demo_0_creating-a-table-with-postgres” by specifying the host port and password from container as following:
conn = psycopg2.connect("host=127.0.0.1 port=7000 dbname=postgres user=postgres password=example")
Use this connection and cluster for all other jupyter notebooks in Lesson 1.
Comments