For the past week I’ve been working hard with my colleague Assis Ngolo to improve the API as much as we can. The initial challenge of “make sure the data from Johns Hopkins is clean” turned out to be a major one. However, we made it through to the other side!
With that, here is everything that changed and a bit of where we are heading next.
We were initially using the Johns Hopkins timeseries dataset, until they announced it was to be discontinued. At the time, there wasn’t a one-for-one replacement for those data files. I took a look at the daily data, and it seemed to be a good replacement. Except, it really wasn’t.
We then spent 20 or 30 hours trying to clean the data from those daily reports. We ran into name format changes, date changes, bad data (going from 10 confirmed to 6), and bad location data (co-ordinates changing or inaccurate). Through this process, we got the following done:
- Built a robust parsing engine to ensure that data is clean. This treats each location as a data point of city, province and co-ordinate.
- Mapped co-ordinates back to known, good figures to ensure they are accurate.
- Added sanity checks for “bad data” - e.g. numbers decreasing where it doesn’t make sense.
- Built a flexible CSV parsing engine, which allows us to easily add new file sources.
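To give a feel for two of the items above, here is a minimal Python sketch. It is not our actual engine: the `HEADER_ALIASES` table, the column names in it and both function names are assumptions made up for illustration. It shows one way to map differing CSV headers onto canonical field names, and one way to flag a cumulative count that goes down (like 10 confirmed dropping to 6).

```python
import csv
import io

# Hypothetical header aliases: real source files may use other names.
HEADER_ALIASES = {
    "province/state": "province",
    "province_state": "province",
    "country/region": "country",
    "country_region": "country",
    "confirmed": "confirmed",
}


def parse_rows(text):
    """Yield dicts with canonical keys, whichever header variant the file used."""
    reader = csv.DictReader(io.StringIO(text))
    for raw in reader:
        row = {}
        for key, value in raw.items():
            if key is None:
                continue  # extra cells beyond the header row
            canonical = HEADER_ALIASES.get(key.strip().lower())
            if canonical:
                row[canonical] = (value or "").strip()
        yield row


def find_decreases(series):
    """Return (index, previous, current) for each drop in a cumulative series."""
    return [
        (i, series[i - 1], series[i])
        for i in range(1, len(series))
        if series[i] < series[i - 1]
    ]
```

With this shape, supporting a new file layout would mostly mean adding entries to the alias table, and any tuple returned by `find_decreases` marks a row to review before import rather than a value to fix automatically.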
But there is only so much you can do with bad data, so we switched to the new timeseries files for global and US cases. Eventually we got the data clean and sanitised. This is also completely reproducible - we regularly test this by deleting all the data we have and then reimporting from day one.
Almost all requests now return a response between 25ms and 35ms, which is lightning quick. The loading time for all data (234,000+ rows), including the new sanitising checks, is around 90 seconds.
We are comfortably serving over 20 requests a second at most times of the day, spiking at around 35.
Thanks to DigitalOcean’s support for COVID-19 projects we managed to get some credit allocated toward the API. This allowed us to move to a managed database and get more bang for buck on the application instance. After some initial teething problems, mainly around the max connections to the managed MySQL instance, we are happily serving requests with no trouble.
We also launched a Slack channel for everyone to get involved: request features, notify us of bugs, share what they’ve built with the API. We’ve got almost 150 people in the channel only a few days after launch and it’s growing fast. This is in addition to our mailing list of almost 300 people. Building a community around this API is awesome - the more everyone gets involved, the better we can make the API and the more value is created for everyone around the world.
Looking forward to the next few weeks, we are planning to do the following:
- Improve the existing API by adding more routes and ways to request data
- Add data sources for more detailed data
- Carry out general bug fixing and maintenance
This all takes time and resource, and a grant we received from Emergent Ventures is what has allowed us to do the development to date. However, it is a relatively small grant so we are looking at other ways to keep the project going strong into the future. So far we have PayPal donations and a Patreon - if you can, please support!
The API has now served over 11 million requests and we are reaching 1 million requests per day. It is incredible. Join our Slack community, sign up to the mailing list or follow me on Twitter for updates.