Compile extracts information from various parts of the internet, including the deep web. This means we look into datasets and databases that are not easily accessible and turn them into forms that other applications can consume. This occasionally involves scraping a few deep web pages and parsing obscure text.
Compile’s philosophy of programming is about being lazy and writing as little code as possible. This helps us minimize technical debt, increase readability and improve the maintainability of our large codebase. We initially used Kimono Labs for scraping and structuring data, and we loved it. But after their acquisition by Palantir, they had to re-align their mission.
We then moved on to Import.io. Migrating our scripts and scraping rules turned out to be a herculean task, but we did it for most of them. Over time, though, we realized that we were bumping into use-cases that needed custom work on our end. For example:
- A transform feature, which we use to transform results into Python objects
- Grouping data
- Restricting pagination limits
- Ignoring SSL verification (yes, a lot of deep web sites have expired certificates)
- Staying cheap for the flexibility it gives (no offense, Import.io, we love your product)
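To make two of these use-cases concrete, here is a minimal sketch using only the Python standard library. It is not Hodor's code: the names `unverified_context`, `paginate`, `fetch`, `next_url` and the in-memory `SITE` are all hypothetical, chosen for illustration. It shows one way to skip certificate verification and one way to cap how many pages get followed.

```python
import ssl


def unverified_context():
    """Build an SSL context that skips certificate checks (hypothetical helper)."""
    ctx = ssl.create_default_context()
    # check_hostname has to be switched off before verify_mode can be CERT_NONE.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def paginate(start_url, fetch, next_url, max_pages):
    """Yield pages by following 'next' links, stopping after max_pages pages."""
    url, seen = start_url, 0
    while url is not None and seen < max_pages:
        page = fetch(url)
        yield page
        seen += 1
        url = next_url(page)


# A fake three-page site stands in for real HTTP calls (assumption for the demo).
SITE = {
    "/p1": {"items": [1, 2], "next": "/p2"},
    "/p2": {"items": [3], "next": "/p3"},
    "/p3": {"items": [4], "next": None},
}

ctx = unverified_context()
pages = list(paginate("/p1", SITE.__getitem__, lambda p: p["next"], max_pages=2))
```

With a real site, a context like this could be passed to `urllib.request.urlopen(url, context=ctx)`. Skipping verification is only reasonable for public, read-only pages where an expired certificate is the known problem.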
Being a Python shop, we initially fiddled around with lxml and BeautifulSoup (bs4) to get these things done. We loved that we could write performant code with lxml and simple code with bs4. But we kept repeating a lot of code for initializing trees and extracting information. Compile’s codebase is primarily configuration-driven, so we moved all the lxml/bs4 code into a system that is configuration-driven too.
We looked around and found jamapi.xyz, which shared our philosophy, and we decided to build a similar configuration system in Python.
Simply put, Hodor is a configuration-driven wrapper on top of lxml and cssselect. Hodor does only one thing: it extracts information based on the rules it gets. A rule can be based on either XPath or CSS.
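To show what "configuration-driven" means in practice, here is a minimal sketch of the idea. It is not Hodor's actual API or config format. It uses the standard library's `xml.etree.ElementTree` (which supports only a small XPath subset and needs well-formed markup) as a stand-in for lxml, and the rule keys `xpath`, `attr` and `many`, along with the `extract` function, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule format: field name -> how to find and read its value.
CONFIG = {
    "title": {"xpath": ".//h1"},
    "links": {"xpath": ".//a", "attr": "href", "many": True},
}


def extract(markup, config):
    """Apply every rule in config to the markup and return a plain dict."""
    root = ET.fromstring(markup)
    result = {}
    for field, rule in config.items():
        nodes = root.findall(rule["xpath"])
        if "attr" in rule:
            # Read an attribute from each matched node.
            values = [node.get(rule["attr"]) for node in nodes]
        else:
            # Otherwise take the node's text content, whitespace trimmed.
            values = ["".join(node.itertext()).strip() for node in nodes]
        if rule.get("many"):
            result[field] = values
        else:
            result[field] = values[0] if values else None
    return result


PAGE = (
    "<html><body><h1> Upcoming IPOs </h1>"
    "<a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
data = extract(PAGE, CONFIG)
```

All the tree-building and node-reading code lives in one place, and adding a field means adding a rule rather than writing more parsing code. A CSS rule could be supported the same way by first translating the selector to XPath, which is the job cssselect does for lxml.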
Hodor also has a few features that we loved across the services we used in the past.
- Inbuilt proxy support
- Fully respects robots.txt
- Grouping data blocks
- Transforming each datum with custom functions for easier consumption, and many more
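Grouping and transforms are easiest to see together. The sketch below is again an illustration under stated assumptions rather than Hodor's real config format: the keys `block`, `fields` and `transform`, and the functions `to_date` and `extract_groups`, are hypothetical, and `xml.etree.ElementTree` stands in for lxml. Each repeated block (here a table row) becomes one record, and a custom function turns a raw string into a Python object.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime


def to_date(text) -> date:
    """Custom transform: turn 'YYYY-MM-DD' text into a date object."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Hypothetical grouped config: one record per node matched by "block".
GROUP_CONFIG = {
    "block": ".//tr",
    "fields": {
        "company": {"xpath": "./td[1]"},
        "priced_on": {"xpath": "./td[2]", "transform": to_date},
    },
}


def extract_groups(markup, config):
    """Return one dict per block, applying each field's optional transform."""
    root = ET.fromstring(markup)
    rows = []
    for block in root.findall(config["block"]):
        row = {}
        for field, rule in config["fields"].items():
            node = block.find(rule["xpath"])
            value = None
            if node is not None:
                value = "".join(node.itertext()).strip()
            transform = rule.get("transform")
            if value is not None and transform is not None:
                value = transform(value)
            row[field] = value
        rows.append(row)
    return rows


TABLE = (
    "<table>"
    "<tr><td>Acme</td><td>2016-05-01</td></tr>"
    "<tr><td>Globex</td><td>2016-06-15</td></tr>"
    "</table>"
)
rows = extract_groups(TABLE, GROUP_CONFIG)
```

Because the field rules are evaluated relative to each block, values from the same row stay together instead of coming back as unrelated parallel lists.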
How to use it
pip install hodorlive
P.S. Why the name?
Hodor had only one job and he did it well. Exactly what we expect from our Hodor as well!