At Compile.com, we deal with a variety of datasets both big and small. Often, there is a need to run analysis on top of 3rd party datasets that we haven’t ingested to see if it’s worth the effort. This particular dataset contained over 300 GB of historical data that contained multiple identifiers, the codes they were associated with and some metadata.
To perform the analysis, we had to get the data for a sample of 7000 entities with an
ID and a particular code, whose references were spread across multiple files. A sample would look like this.
The pseudocode to access the data looks roughly like this. (this assumes only one
ID field and one
CODE field for brevity.)
def get_csv_data(filename, id_list): for row in read_csv(filename): if row['id_field'] in id_list and row['code_field'] == TARGET_CODE: result.append(row) return result
The dataset had around 25,000 gzipped CSV’s ranging from 13MB to 100MB. Reading them one by one was going to take days to complete so parallel processing of the data was the way to go. Even then, a 100 MB file took around 1 minute to process. All these tests were done on a server with 32GB of RAM and used SSD.
The standard way
Due to the variety of datasets we handle, most of them containing unicode, we normally use the
unicodecsv instead of Python’s CSV module.
import gzip import unicodecsv as csv def get_csv_data_udictreader(filename, idset): data =  with gzip.open(filename, 'rb') as r_file: rowlist = ucsv.DictReader(r_file, fieldnames=HEADER) for row in rowlist: if row[CODEFIELD] == TARGET_CODE and row[IDENTIFIER_FIELD] in idset: data.append(row) return data
Inspecting our code we saw that
unicodecsv DictReader’s next function was taking most of the time. Python’s CSV module would’ve given us a faster response but it would have crashed while processing unicode characters.
(I tested with
csv.DictReader on an input file of similar size which doesn’t have unicode characters and it took 21 seconds)
Another option was to try using pandas and it’s
read_csv function but this took around 5 minutes. Pandas is great library for data exploration and
read_csv was implemented in C rather than python.
import pandas as pd def get_csv_data_pandas(filename, idset): df = pd.read_csv(filename) return df[(df[IDENTIFIER_FIELD].isin(idset)) & (df[CODE_FIELD] == TARGET_CODE)].to_dict('records')
This approach was even slower than the unicodecsv approach and took around 5 minutes. CSV reading itself was fast (19 seconds) but subsequent processing took a lot of time.
Reading the file alone took only 3.9 seconds and much of the remaining time was spent on parsing the contents and doing other processing steps on them. Our aim was to get the rows in a CSV matching certain criteria (with
CODE_FIELD equal to
ID_SET). We knew the fields which were to be matched so in our next attempt, we decided to read the raw file and check for the criteria ourselves without parsing.
import gzip import io def get_csv_data_raw(filename, idset): data =  with gzip.open(filename, 'rb') as r_file: f = io.BufferedReader(r_file) for line in f: parts = line.split(',') if parts[FIELD_NUM] in idset and TARGET_CODE in parts: data.append(row) return data
10 Seconds! Which is great considering the 1 minute time it took to read using
unicodecsv. But we could optimize it further. If you look at the chart below, the number
of rows with matching
ID‘s is far greater than the number of rows with matching
TARGET_CODE. We don’t have to split the line to check whether the
TARGET_CODE is present, we can move that check upwards.
Even simpler approach
import gzip def get_csv_data_raw2(filename, idset): data =  with gzip.open(filename, 'rb') as r_file: f = io.BufferedReader(r_file) for line in f: if TARGET_CODE in line: parts = line.split(',') if parts[FIELD_NUM] in idset: data.append(row) return data
This took 4.1 seconds to complete and the whole process was done in under 3 hours as opposed to the estimated 45 hours with our generic approach of using unicodecsv.
|unicodecsv||pandas||fileread #1||fileread #2|
|Time taken for 100MB file||65 sec||300 sec||10 sec||4.1 sec|
The key takeaway from this is to understand your data before you optimize your pipeline. These improvements have been possible because of the nature of the data. Another lesson we learned from this is to work on your control flow, A simple change of moving an if condition to the top shaved off 6 seconds. These small savings add up when you’re processing a large number of files. Also, measure, measure and measure again! I can’t stress enough, the importance of profiling your large processing flows. It helps in identifying the bottlenecks in your flow and hopefully, you can work around those.
So the next time you’re processing a large dataset, spend some time looking through the data and profile how you’re accessing it.