As more developers are drawn to file-based embedded databases for their development environments, DuckDB has grown in popularity as an embedded OLAP database.
Its column-oriented storage and vectorized execution engine make it one of the fastest databases in its category. Sometimes called the SQLite of analytics, it is similar to SQLite in design but differs in the type of workloads it handles: DuckDB is focused on OLAP workloads, while SQLite is focused on OLTP.
Check out the differences between OLAP and OLTP.
Here are some of the features of DuckDB:
- Embedded single file database
- Easy setup (just `pip install duckdb` in Python or `install.packages("duckdb")` in R to get started)
- Column store database with vectorized execution
- Supports fast analytical queries
- Open Source
- Fully ACID
DuckDB has full client support for several popular programming languages, including C, C++, Java, Python, and R. Other languages, such as Go and Ruby, need a third-party driver to work with DuckDB.
In this article, I will demonstrate how to use DuckDB with R.
DuckDB with R
R is one of the popular languages for statistical analysis, visualization and reporting.
To get started with DuckDB in R, the first thing you have to do is install the duckdb package.
Packages can be installed using the package explorer or with the `install.packages()` command that R provides; for DuckDB, the required package is `duckdb`, so running `install.packages("duckdb")` from the R console is enough.
R will automatically install the DBI package as well if it is not already installed, since DBI is required for the communication between the database and R.
After installing the packages, you need to load them into your R session with the `library()` function.
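The loading step looks like this (loading DBI explicitly is optional, but it makes the DBI generics such as `dbConnect()` available without a prefix):

```r
# Load the DuckDB driver package
library(duckdb)

# Load DBI explicitly to call dbConnect(), dbExecute(), etc. directly
library(DBI)
```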
Connecting to the database
There are two ways you can define the connection object for DuckDB in R: creating an in-memory database, or connecting to a file that the database can read from and write to.
```r
# In-memory connection
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)

# Connection using a file
con_duck <- dbConnect(duckdb::duckdb(), "~/Path/To/Your/file.duckdb")
```
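When you are finished, it is good practice to close the connection. A minimal sketch, assuming the in-memory connection `con` from above; the duckdb driver's `dbDisconnect()` accepts a `shutdown` argument that also shuts down the embedded database instance:

```r
library(duckdb)

# Open an in-memory connection
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)

# ... run your queries here ...

# Close the connection and shut down the DuckDB instance
dbDisconnect(con, shutdown = TRUE)
```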
Let’s write some queries
DuckDB supports the standard DBI methods for running queries and retrieving result sets. Statements such as CREATE TABLE, which do not return any results, are executed with `dbExecute()`, while queries such as SELECT are executed with `dbGetQuery()`, which returns the result set (typically stored in a variable).
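A small sketch of the difference, using an in-memory connection (the table and column names here are just for illustration):

```r
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# dbExecute() runs a statement and returns the number of rows affected
dbExecute(con, "CREATE TABLE t (id INTEGER, name VARCHAR);")
inserted <- dbExecute(con, "INSERT INTO t VALUES (1, 'a'), (2, 'b');")

# dbGetQuery() runs a query and returns the result set as a data.frame
df <- dbGetQuery(con, "SELECT * FROM t ORDER BY id;")
print(df)

dbDisconnect(con, shutdown = TRUE)
```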
Creating a table
Tables can be created in DuckDB in several ways: with a standard CREATE TABLE statement that specifies the column names and types, from a sub-query, or directly from a CSV file. SQL statements that do not return results can be run using `dbExecute()`:
```r
# Create a table with dbExecute() using a CREATE TABLE statement
dbExecute(con, "CREATE TABLE countries (country_code VARCHAR, country_name VARCHAR, zip_code INTEGER);")
```
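You can confirm the table exists with DBI's `dbListTables()`; a quick sketch using a fresh in-memory connection:

```r
library(duckdb)

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "CREATE TABLE countries (country_code VARCHAR, country_name VARCHAR, zip_code INTEGER);")

# List all tables visible on this connection
tbls <- dbListTables(con)
print(tbls)

dbDisconnect(con, shutdown = TRUE)
```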
There is also a way to create a table and load data into it at the same time: DuckDB allows us to create a table directly from a CSV file.
The data I am referring to is from Kaggle.
To load that data into DuckDB, I have to pass the path to the CSV file to the `read_csv_auto()` function.
The database will automatically identify the column name and data type while creating the table.
Here is an example of creating a table directly from a CSV file.
```r
# Create a table from a CSV file
dbExecute(con, "CREATE TABLE COVID_DATA AS SELECT * FROM read_csv_auto('/Users/suresh/Desktop/COVID_DATA.csv');")
```
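To see which column names and types DuckDB inferred, you can run a DESCRIBE query. The sketch below writes a tiny CSV to a temp file so it is self-contained; the data itself is made up:

```r
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# A tiny, made-up CSV so the example can run anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,cases,deaths",
             "France,10,1",
             "Italy,20,2"), csv_path)

dbExecute(con, sprintf(
  "CREATE TABLE demo AS SELECT * FROM read_csv_auto('%s');", csv_path))

# DESCRIBE shows the inferred column names and types
schema <- dbGetQuery(con, "DESCRIBE demo;")
print(schema)

dbDisconnect(con, shutdown = TRUE)
```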
Similar to creating a table, rows can be inserted either with an INSERT INTO statement or by loading them directly from a CSV file.
```r
# Inserting rows into the table
dbExecute(con, "INSERT INTO countries VALUES ('FR', 'France', 75116);")

# Loading records from a CSV file
dbExecute(con, "COPY COVID_DATA FROM '/Users/suresh/Desktop/countries.csv';")
```
Querying the tables
We can use the `dbGetQuery()` method to run queries that return records, and print the results with the `print()` function.
```r
# Selecting all records from the table
print(dbGetQuery(con, "SELECT * FROM COVID_DATA;"))

# Total cases and total deaths by country, ordered by deaths in descending order
print(dbGetQuery(con, "
  SELECT countriesAndTerritories AS country,
         SUM(cases)  AS total_cases,
         SUM(deaths) AS total_deaths
  FROM COVID_DATA
  GROUP BY countriesAndTerritories
  ORDER BY total_deaths DESC;"))

# Average daily deaths by country
print(dbGetQuery(con, "
  SELECT country,
         ROUND(AVG(deaths), 2) AS average_daily_deaths
  FROM COVID_DATA
  GROUP BY country
  ORDER BY average_daily_deaths DESC;"))
```
Let’s add two new tables, “Courses” and “Offers”. This can be done either by manually creating the tables and adding records, or by creating each table from a CSV file. I will use them to show how to perform joins in DuckDB.
Here are the structure and data of the two tables.
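The original screenshots of the two tables are not reproduced here. Based on the columns used in the join queries below, the tables can be sketched like this (the column types and sample rows are my assumptions, not the original data):

```r
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# COURSES(COURSENO, CRSDESC) -- assumed schema with made-up rows
dbExecute(con, "CREATE TABLE COURSES (COURSENO VARCHAR, CRSDESC VARCHAR);")
dbExecute(con, "INSERT INTO COURSES VALUES ('C1', 'Intro to Databases');")

# OFFERS(COURSENO, OFFTERM, OFFYEAR) -- assumed schema with made-up rows
dbExecute(con, "CREATE TABLE OFFERS (COURSENO VARCHAR, OFFTERM VARCHAR, OFFYEAR INTEGER);")
dbExecute(con, "INSERT INTO OFFERS VALUES ('C1', 'FALL', 2021), ('C1', 'SPRING', 2021);")
```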
Now let’s join those two tables:
```r
print(dbGetQuery(con, "
  SELECT C.COURSENO, C.CRSDESC, O.OFFTERM, O.OFFYEAR
  FROM COURSES C
  JOIN OFFERS O ON C.COURSENO = O.COURSENO;"))
```
Let’s add some aggregate functions to the join operation:
```r
print(dbGetQuery(con, "
  SELECT SRC.COURSENO, SRC.CRSDESC, SRC.OFFTERM, SRC.OFFYEAR,
         COUNT(*) AS OFFERING_COUNT
  FROM (SELECT C.COURSENO AS COURSENO,
               C.CRSDESC  AS CRSDESC,
               O.OFFTERM  AS OFFTERM,
               O.OFFYEAR  AS OFFYEAR
        FROM COURSES C
        JOIN OFFERS O ON C.COURSENO = O.COURSENO) SRC
  GROUP BY COURSENO, CRSDESC, OFFTERM, OFFYEAR
  ORDER BY OFFERING_COUNT DESC;"))
```
I also tried a few heavier operations (the timing results are not reproduced here):
- Inserting 50,000 records into the table
- An inner join with aggregation and GROUP BY over a few records
- An aggregate function with GROUP BY over large data
- A complex COUNT query
These days, many people perform data analysis with languages like R and Python, where libraries such as pandas and dplyr are common. DuckDB makes it easy to perform those data analysis tasks with SQL in a local development environment.
Hope you enjoyed the article! Feel free to comment or share.