At Compile.com, we deal with a variety of datasets, both big and small. Often we need to run analysis on top of a third-party dataset that we haven't ingested, to see if it's worth the effort. This particular dataset contained over 300 GB of historical data with multiple identifiers, the codes they were associated with, and some metadata.
To perform the analysis, we had to get the data for a sample of 7,000 entities with an ID and a particular code, whose references were spread across multiple files. A sample would look like this.
The pseudocode to access the data looks roughly like this (this assumes only one ID field and one CODE field, for brevity):
```python
def get_csv_data(filename, id_list):
    result = []
    for row in read_csv(filename):
        if row['id_field'] in id_list and row['code_field'] == TARGET_CODE:
            result.append(row)
    return result
```
The dataset had around 25,000 gzipped CSVs ranging from 13 MB to 100 MB. Reading them one by one was going to take days to complete, so parallel processing of the data was the way to go. Even then, a 100 MB file took around 1 minute to process. All these tests were done on a server with 32 GB of RAM and an SSD.
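As a rough sketch of the parallel step (the function body, file pattern, and constants here are illustrative placeholders, not our actual pipeline), a `multiprocessing.Pool` can fan the files out across worker processes:

```python
import glob
import gzip
from multiprocessing import Pool

TARGET_CODE = 'X123'  # hypothetical code to match

def process_file(filename):
    # Placeholder per-file worker; the real matching logic goes here.
    matches = []
    with gzip.open(filename, 'rt') as f:
        for line in f:
            if TARGET_CODE in line:
                matches.append(line)
    return matches

if __name__ == '__main__':
    files = glob.glob('data/*.csv.gz')
    with Pool(processes=8) as pool:
        results = pool.map(process_file, files)
    # Flatten the per-file result lists into one list of rows.
    all_rows = [row for batch in results for row in batch]
```

Since each file is processed independently, the per-file times below multiply almost directly into the total runtime, which is why shaving seconds off a single file mattered so much.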
The standard way
Due to the variety of datasets we handle, many of which contain unicode, we normally use the unicodecsv library instead of Python's built-in csv module.
```python
import gzip
import unicodecsv as ucsv

def get_csv_data_udictreader(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        rowlist = ucsv.DictReader(r_file, fieldnames=HEADER)
        for row in rowlist:
            if row[CODE_FIELD] == TARGET_CODE and row[IDENTIFIER_FIELD] in idset:
                data.append(row)
    return data
```
Inspecting our code, we saw that unicodecsv DictReader's `next()` function was taking most of the time. Python's built-in csv module would have given us a faster response, but it would have crashed while processing unicode characters. (I tested csv.DictReader on an input file of similar size without unicode characters, and it took 21 seconds.)
Another option was to try pandas and its read_csv function. Pandas is a great library for data exploration, and read_csv is implemented in C rather than Python, so we expected it to be fast.
```python
import pandas as pd

def get_csv_data_pandas(filename, idset):
    df = pd.read_csv(filename)
    return df[(df[IDENTIFIER_FIELD].isin(idset)) & (df[CODE_FIELD] == TARGET_CODE)].to_dict('records')
```
This approach was even slower than the unicodecsv one, taking around 5 minutes. The CSV reading itself was fast (19 seconds), but the subsequent filtering took most of the time.
Reading the file alone took only 3.9 seconds; much of the remaining time was spent parsing the contents and doing other processing steps on them. Our aim was to get the rows in a CSV matching certain criteria (CODE_FIELD equal to TARGET_CODE and the identifier present in ID_SET). We knew which fields were to be matched, so in our next attempt we decided to read the raw file and check for the criteria ourselves, without parsing.
```python
import gzip
import io

def get_csv_data_raw(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        f = io.BufferedReader(r_file)
        for line in f:
            parts = line.split(',')
            if parts[FIELD_NUM] in idset and TARGET_CODE in parts:
                data.append(line)
    return data
```
10 seconds! That is great considering the 1 minute it took with unicodecsv, but we could optimize it further. If you look at the chart below, the number of rows with matching IDs is far greater than the number of rows with matching TARGET_CODE. We don't have to split the line to check whether the TARGET_CODE is present, so we can move that check up.
Even simpler approach
```python
import gzip
import io

def get_csv_data_raw2(filename, idset):
    data = []
    with gzip.open(filename, 'rb') as r_file:
        f = io.BufferedReader(r_file)
        for line in f:
            # Check the cheap, highly selective condition first.
            if TARGET_CODE in line:
                parts = line.split(',')
                if parts[FIELD_NUM] in idset:
                    data.append(line)
    return data
```
This took 4.1 seconds to complete, and the whole process finished in under 3 hours, as opposed to the estimated 45 hours with our generic approach of using unicodecsv.
| | unicodecsv | pandas | fileread #1 | fileread #2 |
|---|---|---|---|---|
| Time taken for 100 MB file | 65 sec | 300 sec | 10 sec | 4.1 sec |
The key takeaway from this is to understand your data before you optimize your pipeline; these improvements were possible because of the nature of the data. Another lesson we learned is to pay attention to your control flow: a simple change of moving an if condition up shaved off 6 seconds, and these small savings add up when you're processing a large number of files. Also, measure, measure, and measure again! I can't stress enough the importance of profiling your large processing flows. It helps in identifying the bottlenecks, and hopefully you can work around them.
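As a minimal sketch of that profiling step (the profiled function here is a stand-in, not our actual pipeline code), Python's built-in cProfile gives a quick per-function breakdown:

```python
import cProfile
import io
import pstats

def hot_function():
    # Stand-in for the CSV-processing code being profiled.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Print the 10 most expensive calls by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(10)
print(stream.getvalue())
```

A breakdown like this is how a call such as DictReader's `next()` shows up as the dominant cost.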
So the next time you’re processing a large dataset, spend some time looking through the data and profile how you’re accessing it.