
In-situ Database with Apache Drill

Apache Drill 1.7 was released this month. This is a good time to test the in-situ database concept in Apache Drill and investigate its performance.

In-situ Database

For the sake of simplicity, an in-situ database is a database with the following characteristics:

  1. Data files are external to the query engine, which means we have to manage the database data files ourselves.
  2. Queries run directly against the data files, so query speed depends on the file format.
  3. There is no support for secondary indexes, transactions, triggers and other advanced database features.

Sample Data

The sample data consists of CDR-like records generated automatically by a server into CSV files. There are 7167 files spanning 5 days of recording, which suggests that a file is created every minute (5 days is 7200 minutes). All files together take about 24 GB of disk storage.
The sample files are also compressed into gzip format, which takes only 3.4 GB. This gives a compression factor of 14.2% (compressed size relative to original size, roughly a 7:1 ratio).
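As a rough sketch of how the compressed copy and the 14.2% figure could be produced: the function names `gzip_copy` and `compression_pct` and the directory arguments are hypothetical, not part of the original setup. The first function writes a .csv.gz copy of every CSV file into a separate directory, and the second one divides the compressed size by the original size.

```shell
# Compress every CSV in SRC into DST as .csv.gz, leaving the originals intact.
# (SRC and DST are placeholder directory names.)
gzip_copy() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*.csv; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        gzip -c "$f" > "$dst/$(basename "$f").gz"
    done
}

# Compressed size as a percentage of the original size.
compression_pct() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.1f\n", c / o * 100 }'
}

# Sizes in GB taken from the text above.
compression_pct 3.4 24   # prints 14.2
```

Keeping the compressed files in their own directory makes it possible to point a query at either the csv or the csv.gz copy without mixing the two.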

Install Apache Drill

Installing and running Apache Drill is straightforward. These are the steps:

  1. Download Apache Drill: 
    curl -o apache-drill-1.7.0.tar.gz http://apache.mesi.com.ar/drill/drill-1.7.0/apache-drill-1.7.0.tar.gz
  2. Extract the downloaded file: 
    tar -xvzf apache-drill-1.7.0.tar.gz
  3. Change to the Drill directory and start the embedded shell to confirm that Drill is working: 
    cd apache-drill-1.7.0
    bin/drill-embedded
  4. Quit the Drill console with the !q command.

Query Test

Let’s define two queries:

  1. Q1 = SELECT * FROM dfs.`/path/to/directory`
  2. Q2 = SELECT columns[0] FROM dfs.`/path/to/directory` GROUP BY columns[0]
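The two queries can be generated for any data directory with a small helper, which is handy when the same test is repeated for the csv and the csv.gz copies. This is only a sketch: `q1`, `q2` and `run_timed` are hypothetical names, and the `--run=` option passed to `bin/sqlline` is an assumption that should be checked against the Drill shell documentation for the installed version.

```shell
# Print Q1 and Q2 for a given data directory (passed as the first argument).
q1() { printf 'SELECT * FROM dfs.`%s`;\n' "$1"; }
q2() { printf 'SELECT columns[0] FROM dfs.`%s` GROUP BY columns[0];\n' "$1"; }

# Run a SQL script file in embedded mode and report wall-clock time.
# Assumption: this sqlline build accepts --run=<file>; call it from the
# Drill directory so that bin/sqlline resolves.
run_timed() {
    time bin/sqlline -u jdbc:drill:zk=local --run="$1"
}

# Show the generated text of Q2 for the placeholder path used above.
q2 /path/to/directory
```

Redirecting the output of `q1` or `q2` into a file gives a script that can be passed to `run_timed`, or executed from inside the Drill console with the `!run` command.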

Both queries are executed against the csv and the csv.gz files. We hope that the compressed data files result in faster query times compared to the flat files. Below is the result: 

Query   csv              csv.gz
Q1      49.469 seconds   143.4 seconds
Q2      31.643 seconds   129.0 seconds


The results reveal an interesting fact: queries run slower on compressed data files. Even though gzip gives us a 14.2% compression factor, we pay for it with queries that are roughly 3 to 4 times slower (about 2.9 times for Q1 and 4.1 times for Q2).

Notes:

  1. The investigation was done on OS X El Capitan with a 2.2 GHz Intel Core i7 processor, 16 GB of RAM and a 256 GB SSD.
  2. If the files don’t have a .csv extension, use this one-line command to rename them all: 
    for file in *; do [ -f "$file" ] || continue; echo "Renaming $file"; mv -- "$file" "$file.csv"; done
