At Compile.com, we deal with a variety of datasets, both big and small. Often, we need to run analysis on top of a 3rd party dataset that we haven’t ingested, to see if it’s worth the effort. This particular dataset comprised over 300 GB of historical data containing multiple identifiers, the codes they were associated with, and some metadata.

Background

To perform the analysis, we had to pull the data for a sample of 7,000 entities, identified by an ID and a particular code, whose references were spread across multiple files. A sample of the data looks like this:

ID1 | ID2 | ID3 | ID4 | CODE1 | CODE2 | CODE3 | CODE4 | CODE5 | CODE6 | CODE7 | CODE8 | VAL1 | VAL2 | DATE
1558459917 | 1558459917 | 1942286620 | 1942286620 | C1 | C2 | C3 | C4 | C5 | C6 | | | 8030 | 799213 | 2018-11-27
1629195177 | 1629013974 | 1598027930 | | C2 | C3 | C5 | C9 | | | | | 999 | 2132 | 2018-11-07
1063664217 | 1063664217 | 1508972266 | 1629081500 | C40 | C12 | C13 | C8 | | | | | 958 | 111 | 2018-10-09

The pseudocode to access the data looks roughly like this (for brevity, it assumes only one ID field and one CODE field):

def get_csv_data(filename, id_list):
    result = []
    for row in read_csv(filename):
        if row['id_field'] in id_list and row['code_field'] == TARGET_CODE:
            result.append(row)
    return result

The dataset had around 25,000 gzipped CSVs ranging from 13 MB to 100 MB. Reading them one by one would have taken days, so parallel processing of the data was the way to go. Even then, a 100 MB file took around 1 minute to process. All the tests below were done on a server with 32 GB of RAM and SSD storage.
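The parallel fan-out itself isn’t the interesting part here, but a minimal sketch of it would look something like this (get_csv_data is the pseudocode function above; all_files, id_list and the worker count of 8 are placeholders, not our actual setup):

from functools import partial
from multiprocessing import Pool

def process_all(all_files, id_list):
    # One task per gzipped CSV; each worker runs the per-file reader.
    worker = partial(get_csv_data, id_list=id_list)
    pool = Pool(processes=8)  # worker count is an arbitrary example
    try:
        per_file_results = pool.map(worker, all_files)
    finally:
        pool.close()
        pool.join()
    # Flatten the per-file row lists into a single result list.
    return [row for rows in per_file_results for row in rows]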

The standard way

Due to the variety of datasets we handle, many of them containing unicode, we normally use unicodecsv instead of Python’s built-in csv module.

import gzip
import unicodecsv as ucsv

def get_csv_data_udictreader(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        rowlist = ucsv.DictReader(r_file, fieldnames=HEADER)
        for row in rowlist:
            if row[CODE_FIELD] == TARGET_CODE and row[IDENTIFIER_FIELD] in idset:
                data.append(row)
    return data

Profiling our code, we saw that unicodecsv DictReader’s next() function was taking most of the time. Python’s built-in csv module would have been faster, but it would have crashed while processing unicode characters.

(I tested csv.DictReader on an input file of similar size that doesn’t have unicode characters, and it took 21 seconds.)
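One way to see where the time goes is the standard library profiler. A sketch of that kind of measurement, with the filename and idset as placeholders:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
get_csv_data_udictreader('sample_100mb.csv.gz', idset)
profiler.disable()

# Show the 10 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)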

Trying pandas

Another option was pandas and its read_csv function. pandas is a great library for data exploration, and read_csv is implemented in C rather than Python, so we expected it to be fast.

import pandas as pd

def get_csv_data_pandas(filename, idset):
    df = pd.read_csv(filename)
    mask = (df[IDENTIFIER_FIELD].isin(idset)) & (df[CODE_FIELD] == TARGET_CODE)
    return df[mask].to_dict('records')

This approach turned out to be even slower than unicodecsv, taking around 5 minutes. Parsing the CSV itself was fast (19 seconds), but the subsequent processing took most of the time.

Reading the raw file alone took only 3.9 seconds; much of the remaining time was spent parsing the contents and processing them further. Our aim was simply to get the rows in a CSV matching certain criteria (CODE_FIELD equal to TARGET_CODE and the ID inside our ID set). Since we knew which fields had to match, in our next attempt we decided to read the raw file and check the criteria ourselves, without full CSV parsing.

Simpler approach

import gzip
import io

def get_csv_data_raw(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        f = io.BufferedReader(r_file)
        for line in f:
            parts = line.split(',')
            if parts[FIELD_NUM] in idset and TARGET_CODE in parts:
                data.append(line)
    return data

10 seconds! That’s great, considering the 1 minute it took to read using unicodecsv. But we could optimize it further. In this data, the number of rows with a matching ID is far greater than the number of rows containing TARGET_CODE. So we don’t have to split every line to check whether TARGET_CODE is present; we can do that cheap check on the raw line first and only split the lines that pass it.

Even simpler approach

import gzip
import io

def get_csv_data_raw2(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        f = io.BufferedReader(r_file)
        for line in f:
            # Most lines don't contain TARGET_CODE at all, so do the cheap
            # substring check on the raw line first and only split the rest.
            # (This relies on TARGET_CODE not appearing inside other fields.)
            if TARGET_CODE in line:
                parts = line.split(',')
                if parts[FIELD_NUM] in idset:
                    data.append(line)
    return data

This took 4.1 seconds for the same 100 MB file, roughly 16 times faster than unicodecsv, and the whole job finished in under 3 hours as opposed to the estimated 45 hours with our generic unicodecsv approach.

Approach | unicodecsv | pandas | fileread #1 | fileread #2
Time taken for 100 MB file | 65 sec | 300 sec | 10 sec | 4.1 sec
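If you want to run this kind of comparison on your own data, a simple timing harness is enough. A sketch, where the file path is a placeholder and the reader functions are the ones defined above:

import time

for reader in (get_csv_data_udictreader, get_csv_data_pandas,
               get_csv_data_raw, get_csv_data_raw2):
    start = time.time()
    rows = reader('sample_100mb.csv.gz', idset)
    # Report the wall-clock time each reader takes on the same file.
    print('{}: {} rows in {:.1f} sec'.format(reader.__name__, len(rows), time.time() - start))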

Takeaway

The key takeaway from this is to understand your data before you optimize your pipeline; these improvements were only possible because of the nature of the data. Another lesson we learned is to pay attention to your control flow: the simple change of moving an if condition up shaved off 6 seconds per file, and these small savings add up when you’re processing a large number of files. Also, measure, measure and measure again! I can’t stress enough the importance of profiling your large processing flows. It helps you identify the bottlenecks, and hopefully you can work around them.

So the next time you’re processing a large dataset, spend some time looking through the data and profile how you’re accessing it.