Apache Drill 1.7 was released this month. This is a good time to test the in-situ database concept in Apache Drill and investigate its performance.
For the sake of simplicity, an in-situ database is one with the following characteristics:
- Data files are external to the query engine, which means we have to manage the database data files ourselves.
- Queries are executed directly against the data files, so the speed depends on the file format.
- There is no support for secondary indexes, transactions, triggers and other advanced database features.
The sample data consists of CDR-like records that a server generates automatically into CSV files. There are 7167 files spanning 5 days of recording, so a new file is likely created every minute. All files together take about 24 GB of disk storage.
When compressed with gzip, the sample files take only 3.4 GB, which means the compressed data is about 14.2% of the original size.
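The figures above can be checked with a little arithmetic. The sketch below is only an illustration: the function names `minutes_in_days` and `compression_percent` are made up for this post, and the inputs are the rounded numbers quoted above, not values measured by the script.

```shell
# Number of minutes in a given number of days.
# 5 days gives 7200, close to the 7167 files in the sample,
# which is consistent with one file per minute.
minutes_in_days() {
  echo $(( $1 * 24 * 60 ))
}

# Compressed size as a percentage of the original size.
# $1 = compressed size, $2 = original size (same unit, e.g. GB).
compression_percent() {
  awk -v c="$1" -v o="$2" 'BEGIN { printf "%.1f\n", c / o * 100 }'
}

minutes_in_days 5
compression_percent 3.4 24
```

With the rounded sizes from this post the second call prints 14.2, matching the compression factor mentioned above.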
Install Apache Drill
Installing and running Apache Drill is straightforward. The steps are:
- Download Apache Drill:
curl -o apache-drill-1.7.0.tar.gz http://apache.mesi.com.ar/drill/drill-1.7.0/apache-drill-1.7.0.tar.gz
- Extract the downloaded file:
tar -xvzf apache-drill-1.7.0.tar.gz
- Change to the Drill directory and issue the following command to confirm that Drill is working
- Quit the Drill console by using
Let’s define 2 queries:
- Q1 =
SELECT * FROM dfs.`/path/to/directory`
- Q2 =
SELECT columns FROM dfs.`/path/to/directory` GROUP BY columns
Both queries are executed against the csv and the csv.gz files. We expected the compressed data files to give faster query times than the flat files. Below are the results:
|Query||csv||csv.gz|
|Q1||49.469 seconds||143.4 seconds|
|Q2||31.643 seconds||129.0 seconds|
The investigation reveals an interesting fact: queries run slower on compressed data files. Although gzip shrinks the data to 14.2% of its original size, we pay for it with queries that are roughly 3 to 4 times slower.
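To make the slowdown concrete, the ratio between the two timings can be computed per query. This is a minimal sketch using the timings from the table; `slowdown` is a hypothetical helper name, not part of Drill.

```shell
# Ratio of the csv.gz query time to the csv query time.
# $1 = seconds on csv.gz, $2 = seconds on csv.
slowdown() {
  awk -v gz="$1" -v csv="$2" 'BEGIN { printf "%.1f\n", gz / csv }'
}

slowdown 143.4 49.469   # Q1
slowdown 129.0 31.643   # Q2
```

For these timings the helper prints 2.9 for Q1 and 4.1 for Q2.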
- The investigation was done on OS X El Capitan with an Intel Core i7 2.2 GHz processor, 16 GB RAM and a 256 GB SSD.
- In case the files don’t have a CSV extension, use this one-line command to rename them all:
for file in *; do echo "Renaming $file"; mv "$file" "$file.csv"; done
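If the directory may already contain some renamed files, or file names with spaces, a slightly more defensive variant can help. The following is a sketch under those assumptions; `add_csv_extension` is a hypothetical function name, and the directory path in the usage comment is a placeholder.

```shell
# Append ".csv" to every regular file in a directory,
# skipping files that already end in .csv or .csv.gz.
# $1 = directory containing the data files.
add_csv_extension() {
  dir="$1"
  for file in "$dir"/*; do
    # Skip subdirectories and the unexpanded pattern of an empty directory.
    [ -f "$file" ] || continue
    case "$file" in
      *.csv|*.csv.gz) continue ;;
    esac
    echo "Renaming $file"
    mv -- "$file" "$file.csv"
  done
}

# Usage (placeholder path):
#   add_csv_extension /path/to/directory
```

Quoting `"$file"` and globbing with `*` instead of parsing `ls` output keeps names containing spaces intact, and the `case` guard makes the function safe to run twice.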