Playing With Duck DB

Playing With Duck DB

Took Duck DB out for a spin

2020-10-22

With an ever-increasing number of developers attracted towards the file-based embedded database in their development environment, DuckDB has likewise developed its popularity as an embedded OLAP database.

Its column-oriented storage approach and vectorized execution makes it one of the quickest one in its category. Otherwise called SQLite for analysis, it is similar to SQLite in its design however varies in the type of workloads it handles. DuckDB is focused on OLAP workloads while SQLite is focused on OLTP.

Check out differences between OLAP and OLTP

Here are some of the features of DuckDB:

  1. Embedded single file database
  2. Easy setup (Just do pip install duckdb on python or install.packages(“duckdb”) on R to get started)
  3. Column store database with vectorized execution
  4. Supports fast analytical queries
  5. SQL
  6. Open Source
  7. Fully ACID

DuckDB is completely supported by a portion of the famous programming languages like C, C++, Java, Python and R. Other languages, like GO and Ruby, need the third party driver to work with DuckDB.

In this article, I will exhibit how to utilize DuckDB with R.

DuckDB with R

R is one of the popular languages for statistical analysis, visualization and reporting.

In order to get started with DuckDB in R, the first thing you have to do is to install DuckDB package. Packages can be installed utilizing the packages explorer or by utilizing the install.packages command that R offers.
The package required for DuckDB is duckdb.

R packages installer

You can also use the following command to install duckdb package in R.


install.packages("duckdb")

R will automatically install another package DBI if not already installed as it is required for the communication between database and R

After installing the package, you need to load the installed package using the following commands:R


library(duckdb)
library(DBI)

Connecting to the database

There are two ways you can define the connection object for DuckDB in R, creating an in-memory database or connecting the database to a file where it can read from and write objects.


# In memory connection
con = dbConnect(duckdb::duckdb(), dbdir=":memory:", read_only=FALSE)

# Connection using a file
con_duck <- dbConnect(duckdb::duckdb(), "~/Path/To/Your/file.duckdb")

Let’s write some queries

DuckDB supports DBI methods like dbExecute() and dbGetQuery() to write queries and retrieve resultsets. Queries like CREATE TABLE and UPDATE uses dbExecute() as it does not return any results, while queries like SELECT are executed using dbGetQuery() by storing the results in a variable.

Creating a table

Tables can be created in DuckDB in different ways. We can use the standard CREATE TABLE statement by providing column names, using sub-query or using a CSV file. SQL statements can be run using dbExecute method.


# dbexecute() to create table using CREATE TABLE statement
dbExecute(con, "CREATE TABLE countries (country_code VARCHAR, country_name VARCHAR, zip_code INTEGER);")

There is another method to create a table and load data into the table simultaneously. DuckDB allows us to directly create a table utilizing a CSV file.
The data I am referring to is from Kaggle.

So as to load that data in DuckDB, I have to pass the file path to the CSV file in the read_csv_auto method. The database will automatically identify the column name and data type while creating the table. Here is an example of how to create a table by straightforwardly utilizing a CSV file.


# Create table using CSV file
dbExecute(con, "CREATE TABLE COVID_DATA AS SELECT * FROM read_csv_auto('/Users/suresh/Desktop/COVID_DATA.csv');")

Inserting records

Similar to creating a table, inserting rows in the table can be done using INSERT INTO statement, or by directly loading from a CSV file.


# Inserting rows in the table
dbExecute(con, "INSERT INTO countries VALUES ('FR', 'France', 75116);")

# Loading records from CSV
dbExecute(con, "copy COVID_DATA FROM '/Users/suresh/Desktop/countries.csv';")

Querying the tables

We can use the dbGetQuery() method to run queries that return records and can print them using the print method offered by R. Here are some examples on querying tables in DuckDB.


# Selecting all records from the table
print(dbGetQuery(con, "SELECT * FROM COVID_DATA ;"))

Result for 'SELECT * FROM COVID_DATA ;'


#Calculating total cases and total deaths by country and ordering them in descending order

print(dbGetQuery(con, "select countriesAndTerritories as country, sum(cases) as total_cases, sum(deaths) as total_deaths from COVID_DATA group by countriesAndTerritories order by total_deaths desc;"))

Result for cases and total deaths by country query


# Calculating average daily deaths by country
print(dbGetQuery(con, "SELECT country, round(avg(deaths),2) as average_daily_deaths FROM COVID_DATA group by country order by average_daily_deaths desc;"))

Result for average daily deaths by countries query

Joining tables

Let’s add two new tables “Courses” and “Offers”. This should be possible either by manually creating tables and adding records, or by making a table utilizing CSV file. I will show how to perform join in DuckDB.

Here are the structure and data of two tables.

Courses:

Courses table

Offers:

Offers table

Now let’s join those two tables:


print(dbGetQuery(con, "SELECT C.COURSENO, C.CRSDESC, O.OFFTERM, O.OFFYEAR FROM COURSES C JOIN OFFERS O ON C.COURSENO=O.COURSENO;"))

Join on Courses and Offers table

Let’s add some aggregate functions to the join operation:



print(dbGetQuery(con, "SELECT SRC.COURSENO, SRC.CRSDESC, SRC.OFFTERM, SRC.OFFYEAR, COUNT(*) AS OFFERING_COUNT FROM (SELECT C.COURSENO AS COURSENO, C.CRSDESC AS CRSDESC, O.OFFTERM AS OFFTERM,  O.OFFYEAR AS OFFYEAR FROM COURSES C JOIN OFFERS O ON C.COURSENO=O.COURSENO) SRC GROUP BY COURSENO, CRSDESC, OFFTERM, OFFYEAR ORDER BY OFFERING_COUNT DESC;"))

Join query result on Courses and Offers table with aggregate functions

Some benchmarks

Inserting 50000 records in the table

Inserting 50000 records in the table

Inner Join

Inner Join

Inner Join with aggregation and group by for few records

Inner Join with aggregation and group by for few records

Aggregate function with the group by for large data

Aggregate function with the group by for large data

Complex COUNT query

Complex COUNT query

Summary

These days, a lot of people perform data analysis tasks by using languages like R and Python. Using libraries Pandas and dplyr is pretty common. DuckDB makes it easy to perform data analysis tasks with SQL in a local development environment.

Hope you enjoyed the article! Feel free to comment or share.