After evaluating some open data platforms and continuing to work on my scraper for Tulsa Health Department Restaurant Inspections, I’ve changed my approach a bit.
Problems with ScraperWiki
My scraper kept timing out when I tried to run it on ScraperWiki.com. I sped it up, but if I scrape too fast, THD will block the ScraperWiki.com IP address, and then nobody will be able to run THD scrapers from ScraperWiki.com. (THD already blocked my local IP once when I ran too fast.) The scraper never got through more than 3781 records before timing out, so I need to find another way to run it.
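To make the speed trade-off concrete, here is a minimal sketch of a request throttle that enforces a minimum gap between fetches. It is not the code from my scraper; the Throttle class, the 2-second interval, and the fetch_page name in the usage note are all hypothetical, and the right interval for THD is something you would have to find by experiment.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested
        # without actually waiting.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraper loop you would call throttle.wait() right before each request, for example before a hypothetical fetch_page(url). The first call returns immediately; later calls sleep only as long as needed to keep the gap.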
The ScraperWiki project on Bitbucket has some documentation for setting up your own instance, but there are lots of omissions, especially for those of us who know nothing about Twisted or Orbited. I set up the Django component at OklahomaData.org, but even though the ScraperWiki code is licensed under the AGPL, ScraperWiki doesn’t seem to want to help us copy and localize their business model to Tulsa, Oklahoma.
Alternative data stores for public data
So, I looked around for alternatives to ScraperWiki. Chris from Socrata commented on my last post, and I took his advice to read the Socrata Publisher API docs. I also discovered that Oklahoma already has a Socrata site up at data.ok.gov! (I only noticed because their favicon is the Socrata favicon.) I’m still trying to figure out how to register an app with Socrata’s OAuth 2.0 server-side flow implementation, but we will post our data to data.ok.gov if at all possible. And that got me thinking about …
A loosely-coupled approach to data scraping
If we publish to data.ok.gov, we should be able to publish to other data stores too. And we should be able to run our scrapers anywhere. That is, any Tulsa developer should be able to develop and run scrapers, and store the public data they collect, either locally or in the cloud.
So, now I’m developing my scraper code in Git, running it on localhost, and storing the data in a local CouchDB server. The nice thing about this approach is that I can easily move up into the cloud (GitHub, Heroku, and Cloudant, for example) without being locked into a technology (except maybe Git) or a platform or cloud provider.
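As a rough illustration of the storage step, here is a minimal sketch that saves one record to CouchDB through its HTTP API (a PUT of a JSON document to /database/doc_id). The database name "inspections", the document ID, and the record fields are all made up for the example, and it assumes CouchDB is listening on its default local port with no authentication.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

COUCH_URL = "http://localhost:5984"  # CouchDB's default local address
DB_NAME = "inspections"              # hypothetical database name


def build_put_request(doc_id, doc, base_url=COUCH_URL, db=DB_NAME):
    """Build (but do not send) a PUT request that stores doc under doc_id."""
    url = "%s/%s/%s" % (
        base_url.rstrip("/"),
        quote(db, safe=""),
        quote(doc_id, safe=""),  # escape slashes and spaces in the ID
    )
    body = json.dumps(doc).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="PUT")


def save(doc_id, doc, **kwargs):
    """Send the document to CouchDB and return its decoded JSON reply."""
    with urlopen(build_put_request(doc_id, doc, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Because only the base URL changes, the same function should work against a hosted CouchDB-compatible service by passing a different base_url, though a real deployment would also need credentials. Note that overwriting an existing document requires including its current _rev, which this sketch does not handle.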
We may still need or want to build a “data-pusher” piece that manages where data should be pushed: local, Cloudant, Freebase, data.ok.gov, and so on. But I much prefer this approach for its flexibility and control.
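One way such a “data-pusher” could look is a small registry of named destinations, each just a function that accepts a list of records. This is only a sketch of the idea, not something I have built; the DataPusher class and the destination names are hypothetical, and real destinations would wrap calls to CouchDB, Socrata, and so on.

```python
class DataPusher:
    """Send the same records to any number of named destinations."""

    def __init__(self):
        self._destinations = {}

    def register(self, name, push_fn):
        """Register push_fn(records) as the handler for a destination."""
        self._destinations[name] = push_fn

    def push(self, records, targets=None):
        """Push records to the given targets (default: all registered).

        Returns a dict mapping each destination name to "ok" or to a
        "failed: ..." message, so one broken destination does not stop
        the others. An unregistered target name raises KeyError.
        """
        records = list(records)  # allow generators; reuse across targets
        names = list(self._destinations) if targets is None else list(targets)
        results = {}
        for name in names:
            push_fn = self._destinations[name]
            try:
                push_fn(records)
                results[name] = "ok"
            except Exception as exc:
                results[name] = "failed: %s" % exc
        return results
```

The scraper itself then only produces records and never needs to know where they end up; adding a new data store means registering one more function.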