Compile extracts information from various parts of the internet, including the deep web. This means we look into datasets and databases that are not easily accessible and turn them into consumable forms for other applications. This occasionally involves scraping a few deep web pages and parsing obscure text.
Compile’s philosophy of programming is about being lazy and writing as little code as possible. This helps us minimize technical debt, increase readability, and improve the maintainability of our large codebase. We initially used Kimono Labs for scraping and structuring data, and we loved it. But after their acquisition by Palantir, they had to re-align their mission.
We then moved on to Import.io. Migrating our many scripts and scraping rules turned out to be a herculean task, but we did it for most of them. Over time, though, we realized that we kept bumping into use cases that needed custom work on our end. For example:
- The transform feature, which we use to transform results to Python objects
- Grouping data (_groups)
- Restricting pagination limits
- Proxy
- Ignoring SSL verification (yes, a lot of deep web sites have expired certificates).
- Cheap for the flexibility it gives. (No offense import.io. We love your product.)
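To make the proxy and SSL items concrete, here is a minimal sketch using only Python's standard library. It is not how Import.io or Hodor does it; the function names make_ssl_context and make_opener and the proxy address are made up for illustration.

```python
import ssl
import urllib.request


def make_ssl_context(verify=True):
    """Build a TLS context, optionally skipping certificate verification."""
    context = ssl.create_default_context()
    if not verify:
        # Accept expired or self-signed certificates.
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def make_opener(proxy_url=None, verify=True):
    """Build a urllib opener with an optional proxy and optional verification."""
    handlers = [urllib.request.HTTPSHandler(context=make_ssl_context(verify))]
    if proxy_url:
        # Route both http and https requests through the same (hypothetical) proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


# Example: an opener that goes through a local proxy and tolerates bad certificates.
opener = make_opener(proxy_url="http://127.0.0.1:8080", verify=False)
```

Turning verification off is only reasonable for sites you already know have broken certificates; everything else should keep the default.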
Being a Python shop, we initially fiddled around with lxml and BeautifulSoup (bs4) to get these things done. We loved that we could write performant code with lxml and simple code with bs4. But we kept repeating a lot of code for initializing trees and extracting information. Compile’s codebase is primarily configuration-driven, so we moved all the lxml/bs4 code into a system that is configuration-driven as well.
We looked around and found jamapi.xyz, which shares the same philosophy as us, and we decided to build a similar configuration system in Python.
Enter Hodor 
For GoT/ASOIAF fans, the pun should be obvious; for others: watch the show already!
Simply put, Hodor is a configuration-driven wrapper on top of lxml and cssselect. Hodor does only one thing: it extracts information based on the rules it gets. A rule can be based on either XPath or CSS.
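The idea of rule-driven extraction can be sketched in a few lines. This is not Hodor's actual API or config format: the rule layout (an "xpath" key and a "many" flag), the extract function, and the sample page are assumptions made for illustration, and the standard library's ElementTree stands in for lxml. Only XPath rules are shown, since the standard library has no CSS selector engine.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed page standing in for a fetched document.
PAGE = """<html><body>
<h1 class="title">Hold the door</h1>
<ul><li class="tag">got</li><li class="tag">asoiaf</li></ul>
</body></html>"""

# Hypothetical rule format: field name -> how to find it and whether to expect many.
RULES = {
    "title": {"xpath": ".//h1[@class='title']", "many": False},
    "tags": {"xpath": ".//li[@class='tag']", "many": True},
}


def extract(markup, rules):
    """Parse the markup once, then apply every rule to the same tree."""
    root = ET.fromstring(markup)
    result = {}
    for field, rule in rules.items():
        values = [(node.text or "").strip() for node in root.findall(rule["xpath"])]
        if rule.get("many"):
            result[field] = values
        else:
            # Single-valued field: take the first match, or None if nothing matched.
            result[field] = values[0] if values else None
    return result


data = extract(PAGE, RULES)
print(data)  # {'title': 'Hold the door', 'tags': ['got', 'asoiaf']}
```

The tree setup and the extraction loop live in one place, so adding a field means adding a line to the rules rather than writing more parsing code.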
Hodor also has a few features that we loved across the services we used in the past.
- Inbuilt proxy support
- Fully respects robots.txt
- Pagination
- Grouping data blocks
- Transforming individual data items with custom functions for easier consumption, and many more.
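Grouping and transforms are the two features above that are easiest to show in code. The sketch below is again an illustration under assumptions, not Hodor's real config: the "container" and "fields" keys, the extract_group function, and the sample page are hypothetical, and ElementTree is used in place of lxml.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<div class="item"><span class="name">Door</span><span class="price"> 12.50 </span></div>
<div class="item"><span class="name">Hinge</span><span class="price">3</span></div>
</body></html>"""

# Hypothetical group config: a container rule selects each block,
# and the field rules are evaluated relative to that block.
GROUP = {
    "container": ".//div[@class='item']",
    "fields": {
        "name": {"xpath": "./span[@class='name']"},
        "price": {"xpath": "./span[@class='price']", "transform": float},
    },
}


def extract_group(markup, group):
    """Return one dict per container block, applying per-field transforms."""
    root = ET.fromstring(markup)
    rows = []
    for block in root.findall(group["container"]):
        row = {}
        for field, rule in group["fields"].items():
            node = block.find(rule["xpath"])
            value = node.text.strip() if node is not None and node.text else None
            transform = rule.get("transform")
            if value is not None and transform is not None:
                # Turn the raw string into something easier to consume.
                value = transform(value)
            row[field] = value
        rows.append(row)
    return rows


rows = extract_group(PAGE, GROUP)
print(rows)  # [{'name': 'Door', 'price': 12.5}, {'name': 'Hinge', 'price': 3.0}]
```

Grouping keeps values that belong together in the same record, instead of returning one flat list of names and another of prices that have to be zipped back together afterwards.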
How to use it
PyPI: pip install hodorlive
source: GitHub
Sample code
gist:cyriac/ac6bcdab43bd55df759d37d38b117556
Output
gist:cyriac/550dc1c490ed492918342f7b76717a7d
P.S. Why the name?
Hodor had only one job and he did it well. Exactly what we expect from our Hodor as well!
Spoiler alert