The past few months have been hectic. As I try desperately to get back into some sane writing schedule, here is a quick post on the most recent updates to the COVID19 API.
The simple approach of deploying a binary to a single server and having it run happily worked for quite some time. Then, gradually, there was a requirement to scale out. I tackled this at the time by making several more copies of that same binary and deploying them onto the same machine. This allowed us to scale out horizontally quickly and easily, increasing the capacity for API requests by at least a factor of 3. It did, however, make new deployments slightly harder (more binaries to be replaced), and it also meant I was effectively running a cheap imitation of Kubernetes without any of the good stuff.
So I finally decided to bite the bullet and move to the great k8s. This turned out to be pretty trivial to get up and running, as the project already had a Dockerfile. With a few hours' work, including setting up a brand new Kubernetes cluster, I easily moved the services to the new infrastructure.
There were some weird issues I needed to work through and debug, specifically around trying to get real IP addresses and using Proxy Protocol on the Load Balancer, but by and large this has been a super easy transition to make.
Two significant pieces of work have been done around the data. The first is the addition of several new data sources. The second is an amendment to the way the current data is handled.
Two new data sources have been added. The first is Our World In Data, which includes country-level static data (population, washing facilities, etc.) as well as case data (including cases per million). I’ve also added additional calculated data as part of work I’ve done with COVID19 Humanitarian, such as daily incidence and case fatality ratio. The second is a small subset of data from the Oxford COVID-19 Government Response Tracker, focused on travel advice for each country.
The second amendment on the data front is the handling of current data. One thing I noticed is that source data can and does change. For a given day and a given country a data entry is made; a few hours later, however, the source realises there was an error and amends the data. We were saving each entry in the database and never looking at it again, and as a result we would miss such corrections. The reason for this was performance: a bulk insert can save 500k rows of data in under 5 seconds. If you want to make sure any amended data is subsequently updated in your system, you need to check and compare each row and, if it has changed, do an update. This is far more costly and, depending on how you do it, could make those 500k rows take 15 minutes each time.
I found a good way to do this is to take the existing data from the database, create a unique key in memory representing each row, compare the incoming data row by row to the keys in memory, and only if a row has changed do the update to the database. The downside is that this requires a fair amount of memory in the application, but the upside is a massive uptick in performance.
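The compare-in-memory step can be sketched roughly as below. This is a simplified illustration, not the API's actual code: `Row`, `key`, and `diff` are hypothetical names, and the real schema has many more columns than a single count.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// Row is a simplified case record keyed by country and date.
type Row struct {
	Country, Date string
	Confirmed     int
}

// key uniquely identifies a row within the dataset.
func key(r Row) string { return r.Country + "|" + r.Date }

// hash fingerprints the row's mutable values so changes are cheap to detect.
func hash(r Row) string {
	sum := sha1.Sum([]byte(fmt.Sprintf("%d", r.Confirmed)))
	return hex.EncodeToString(sum[:])
}

// diff returns only the incoming rows whose values differ from (or are
// missing in) the in-memory snapshot, so only those hit the database.
func diff(existing map[string]string, incoming []Row) []Row {
	var changed []Row
	for _, r := range incoming {
		if existing[key(r)] != hash(r) {
			changed = append(changed, r)
		}
	}
	return changed
}

func main() {
	// Snapshot loaded once from the database.
	existing := map[string]string{}
	old := Row{"ZA", "2020-07-01", 100}
	existing[key(old)] = hash(old)

	incoming := []Row{
		{"ZA", "2020-07-01", 100}, // unchanged: skipped
		{"ZA", "2020-07-02", 150}, // new day: written
	}
	fmt.Println(len(diff(existing, incoming))) // 1
}
```

The memory cost is one key/fingerprint pair per stored row, which is what makes the trade-off worthwhile: most rows never change, so most rows never touch the database on refresh.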
This work also puts the API in a really good position to have new data added. I’ll shortly be adding testing case data, as well as more broadly extending the data saved from the current sources.
While it’s great having an API that is used roughly 35 times a second (over 50 times a second at peak), this obviously costs money to run. Digital Ocean has been a great supporter to date (and hopefully going forward), but the project’s livelihood cannot rest exclusively on the charity of our hosting provider.
I implemented rate limiting on all routes, eventually ending up at 2 requests per second (10 requests every 5 seconds). This seems to be a good spot for most (80%+) of the applications using the system. Interestingly, the top four IP addresses made up over 70% of the daily API usage. This will allow us to bring the hosting costs back down to something sane once more.
Aside: finding a good rate-limiting package that worked as middleware and could support multiple instances was surprisingly difficult. I settled on this great package.
In addition to removing rate limiting, the basic plan at $10 per month gives additional data on our most popular route, /summary. The other two plans give additional routes for new data and a support model.
So far the response has been pretty good and I’m dealing with every sign up manually, speaking to the new customers to see what they want and how the API can be made better. I’m hoping for a steady increase in subscribers in the coming months, and I’ll be adding more value to the API to move that needle.
The past few months have been interesting and I’m really happy with where the COVID19 API project is sitting at the moment. The transition from fully-free to partially-paid has been interesting, and for the majority of people the API is still completely free to use so the “societal good” value is still there. If you’re interested in what the paid plans offer, head over to the subscriptions section on the website.