
Why amateur COVID-19 predictions are worthless — featuring ARIMA
Check out the latest COVID-19 predictions, and why I call them worthless. If you are not a public health expert, this post is for you. And if you happen to be an expert, you’ll find some really interesting results here. All with the ease of BigQuery and SQL.
tl;dr: ARIMA is widely used by domain experts to predict time series in different domains. With BigQuery you can easily find public datasets and ready-to-use ML models (including ARIMA, in alpha today). Amateurs beware, especially with topics this critical: #vizresponsibly.
Let me start by establishing that any sideline predictions about the current crisis are worthless. I’m not a public health expert, and — statistically speaking — neither are you. So instead of trying to predict anything, I’ll jump back to the past, and attempt to predict what has already happened — in the last 6 days.
You should know that there are great examples of the usefulness of ARIMA for predicting time series. For example, it was used successfully to predict the number of beds needed during the 2003 SARS outbreak in Singapore. You can also check this quick primer on how to use ARIMA to predict public bike usage.
My friend Lak Lakshmanan wrote a post about analyzing COVID-19 with BigQuery. It includes code to easily perform ARIMA predictions with SQL (currently in alpha, see Lak’s notes). Let’s check what happens when we augment his code to compare the ARIMA predictions vs the actual numbers for the last 6 days.
ARIMA predictions vs reality, Japan

Are these good predictions? Nah…
- The predicted numbers quickly show an underestimation larger than 20%.
- None of the actual numbers fell within the 0.9 confidence interval.
- Check the query.
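The two checks above (relative error and interval coverage) are easy to script. Here is a minimal sketch in Python; the numbers are invented stand-ins for the actual vs. predicted confirmed cases, not the real Japan data:

```python
# Hypothetical actual vs. ARIMA-predicted confirmed cases over 6 days,
# plus a hypothetical 0.9 confidence interval around each prediction.
actual    = [1000, 1100, 1250, 1400, 1600, 1800]
predicted = [ 980, 1020, 1080, 1120, 1180, 1250]   # underestimates
lower     = [ 900,  930,  960,  990, 1020, 1060]
upper     = [ 990, 1080, 1180, 1300, 1420, 1560]

# Relative error of each prediction (negative = underestimate).
errors = [(p - a) / a for p, a in zip(predicted, actual)]
worst_underestimate = min(errors)

# How many actual values fell inside the 0.9 confidence interval?
covered = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))

print(f"worst underestimate: {worst_underestimate:.0%}")
print(f"days inside the 0.9 interval: {covered} of {len(actual)}")
```

With these made-up numbers the model drifts further below reality each day, and not a single actual value lands inside its interval — the same failure pattern as in the chart above.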
ARIMA predictions vs reality, USA

Are these good predictions? Nah…
- The predicted numbers quickly show an overestimation larger than 30%.
- However, the actual numbers fell within the 0.9 confidence interval. That might be good, but...
- Check how large those confidence intervals are. On day 6 the model gives a confidence interval estimating “somewhere between 80k and 700k”. How can anyone use this?
- The confidence intervals got so large because this model follows Lak’s suggestion: “for exponentially growing timeseries, use the LOG() of numbers before applying ARIMA”. We can review later why this is a good idea.
- Check the query.
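To see why a LOG transform produces such wide intervals, note that a symmetric interval in log space becomes a multiplicative one after back-transforming. A toy sketch with invented numbers:

```python
import math

# Suppose the model predicts log(cases) = 12.0 with a symmetric +/- 2.4
# confidence interval in log space (made-up numbers, not the real model).
log_pred, half_width = 12.0, 2.4

point = math.exp(log_pred)               # back-transformed point prediction
lower = math.exp(log_pred - half_width)  # back-transformed lower bound
upper = math.exp(log_pred + half_width)  # back-transformed upper bound

print(f"point: {point:,.0f}  interval: {lower:,.0f} .. {upper:,.0f}")
print(f"the interval spans a factor of {upper / lower:,.0f}x")
```

A modest-looking band in log space exponentiates into an interval whose upper bound is two orders of magnitude above its lower bound — exactly the “somewhere between 80k and 700k” problem above.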
Why your predictions are dangerous
If you search on Twitter for “ARIMA COVID”, you’ll find plenty of people armed with a spreadsheet that now think they are epidemiology experts. They are not. Their charts are bad, and their analysis is bad. Don’t be like them. Don’t retweet them either.
Why are bad predictions dangerous? Plenty of reasons, for example:
- You might want to give hope to your friends, so you start telling them “don’t worry, my charts show that the crisis will be over before Easter”. If they believe you, and you are wrong, then your friends could end up badly underprepared if the crisis is longer.
- You might want to tell your friends that a certain medicine appears to be really effective against this virus. But you have no idea if it is. If your ideas spread, that medicine might go out of stock, leaving people who need it without it. For no good reason. And it has already happened.
Before creating another chart read Amanda Makulec’s “Ten Considerations Before You Create Another Chart About COVID-19”.
What to do instead
If you really want to help, team up and offer your expertise to groups that can use it. For example, check out covidactnow.org. This team has built models that track the current crisis and possible outcomes. If you are a data scientist, they would love to get your help to keep these models working and updated.
The best part: You’ll be working with an awesome team, whose work has been validated and endorsed by a number of experts in epidemiology, public health, and medicine.
What’s good, awesome, and useful about ARIMA in BigQuery
Now that we’ve established the perils of amateurs spreading their predictions online, let me tell you why I find ARIMA in BigQuery so awesome:
Easy ARIMA and easy access to data
As you can see in my queries above, it was really easy to create an ARIMA model and get predictions out of a time series. And it all ran in less than 30 seconds. This is great. I used this power to show here how wrong these predictions can be.
To make these tasks even faster, we announced a replica of the numbers published by Johns Hopkins University in BigQuery. Having that table publicly available gave me instant access to the data I needed to replicate these predictions.
Domain experts like ARIMA
I found many interesting papers applying ARIMA to epidemiological tasks.
For example, some literature I found dissing ARIMA:
- BMC Bioinformatics, 2014: Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks
These analyses indicate that the Random Forest model has advantages over the ARIMA approach to time series modeling of avian influenza outbreaks in poultry in Egypt. At the same time, it is clear that both retrospective models have deficiencies when trying to fit the time series of outbreaks. For example, the ARIMA model provides some estimates that are actually less than zero, which is impossible given the nature of outbreaks. Furthermore, there are times, like the end of 2008, where the model is consistently biased with respect to the signal. The Random Forest model is also consistently biased at the end of 2008, as well as the middle of 2012. However, it performs an order of magnitude better than ARIMA in terms of mean square error.
- BMC Infectious Diseases, 2017: A framework for evaluating epidemic forecasts
It means the performance of ARIMA is completely behind all other methods. Figure 20 depicts the one-step-ahead predicted curve of the ARIMA method compared to the observed data that shows the ARIMA output has large deviations from the real observed curve and confirms the correctness of the clustering approach.
- Online Journal of Public Health Informatics, 2014: Evaluating a Seasonal ARIMA Model for Event Detection in New York City
An ARIMA model is not an ideal model for prospectively detecting outbreaks in syndromic data, due to frequent monitoring and adjustment of model parameters. Furthermore, by using autoregressive and moving average parameters, the model may have over-fit the data, causing outbreaks to go undetected. ARIMA models have some limitations. Model parameters depend highly on data trends and characteristics, making geographic stratification difficult. Alternative approaches that require less frequent refitting may be easier for health departments to implement and perform better for outbreak detection.
If you search for the opposite, you’ll also find plenty of literature that has put ARIMA models to good use (especially its seasonality-aware version, SARIMA). For example:
- BMC Health Services Research, 2004: Using autoregressive integrated moving average (ARIMA) models to predict and monitor the number of beds occupied during a SARS outbreak in a tertiary hospital in Singapore
The ARIMA model that we developed for modeling the number of beds occupied during the SARS outbreak performed reasonably well, with a MAPE of 5.7% for the training set, and 8.6% for the validation set. In addition, we found that three-day forecasts provided a reasonable prediction of the number of beds required during the outbreak
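MAPE (mean absolute percentage error), the metric the SARS bed-occupancy paper reports, is simple to compute. A quick sketch with invented bed counts, not the paper’s data:

```python
# Mean absolute percentage error: average of |actual - predicted| / actual.
def mape(actual, predicted):
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

beds_actual    = [120, 140, 160, 150, 130]
beds_predicted = [115, 150, 155, 160, 125]   # hypothetical 3-day-ahead forecasts

print(f"MAPE: {mape(beds_actual, beds_predicted):.1%}")
```

A MAPE in the single digits, like the 5.7% and 8.6% the paper reports, means the forecasts were usually within a few percent of the real bed counts — a far cry from the 20–30% errors we saw above.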
- International journal of infectious diseases, 2018: Epidemiology and ARIMA model of positive-rate of influenza viruses among children in Wuhan, China: A nine-year retrospective study
This work demonstrates the epidemiology of different types of influenza viruses among children in Wuhan, China. Our study suggests that the ARIMA model can be used to forecast the positive rate of different types of influenza virus.
You can even find some fresh papers utilizing ARIMA for the current crisis:
- Elsevier Data in Brief, 2020: Application of the ARIMA model on the COVID2019 epidemic dataset
It only uses data until Feb 10, so this paper could use a refresh to check its usefulness. The authors have not replied to my emails or my tweets. (I don’t blame them for not replying; they surely have more important tasks today than answering my questions.)
This article is a preprint and has not been peer-reviewed. It reports new medical research that has yet to be evaluated and so should not be used to guide clinical practice.
- medRxiv, 2020: Forecasting the dynamics of COVID-19 Pandemic in Top 15 countries in April 2020 through ARIMA Model with Machine Learning Approach
This article is a preprint and has not been peer-reviewed. It reports new medical research that has yet to be evaluated and so should not be used to guide clinical practice.
Accelerating experts
One of the top resources domain experts have today is time. They are racing against the virus, and the faster they can get their results, the quicker we’ll all find a solution.
Google and the BigQuery team want to help accelerate their results. And if experts want to prove the usefulness (or not) of ARIMA, they can use BigQuery to make this process really fast.
In the hands of an expert, the results of a model can be extremely valuable, even if to you the numbers look really wrong. Surprising results can drive experts to ask “why” and find interesting insights from that starting point. Likewise, in the hands of a non-expert, a result that looks good can be anything but.
Stupidly good results, stupidly large confidence intervals


At the start of this post we saw how the LOG ARIMA model got results wrong, even over a short 6-day interval. So if we repeat the experiment asking ARIMA to predict 14 days instead of 6, the results should be worse. Let’s see what happens when we train the model with data only until March 16th and predict 14 days thereafter:

Did you see that? The model goes totally wrong (>30%) for most of the predicted timeline, but on day 14 the result is eerily close to the real number (161,428 vs 161,807). This can be dangerous, as some people might think “oh, ARIMA is good at predicting 14 days out; just look at that last day, this chart proves it”. That is wrong.

Even though the predicted number was so close, it still doesn’t make sense. Whenever you run predictions, the predicted number is not that important. What really matters is the confidence interval. As you can see here, because this model used a LOG scale, the predicted confidence interval is “anything between 14,061 and 1,853,268”.
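Putting the day-14 numbers from the chart side by side makes the contrast obvious: a tiny point error next to an unusably wide interval. These are the figures quoted above:

```python
# Day-14 numbers from the 14-day forecast above.
predicted, actual = 161_428, 161_807
ci_low, ci_high = 14_061, 1_853_268

point_error = (predicted - actual) / actual

print(f"point error: {point_error:.2%}")                  # well under 1%
print(f"interval width: {ci_high - ci_low:,} cases")
print(f"upper bound is {ci_high / ci_low:.0f}x the lower bound")
```

An interval whose upper bound is over a hundred times its lower bound tells you the model has essentially no idea, no matter how lucky the point prediction looks.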

Is a confidence interval that large any good? Don’t ask me. I’ll say no, but I’m not a domain expert. For a domain expert these numbers might be gold, and I’m happy to be part of the BigQuery team that’s working hard to support them.
Additional note on the confidence intervals
Note that the analysis I just made on the confidence intervals in a LOG space is probably wrong too. To double down on the message of this post, I don’t know enough to read those intervals correctly. If you want to know more about this, see this note from Lak:
“The forecast bounds are independent of the forecast value. The bound is proportional to sqrt(variance/n). And the only way to make the bound narrower is to use a longer sequence or a sequence with a lower variance. However, for COVID, we have a problem. In the exponential growth phase, the standard deviation is proportional to today’s value and so the longer you let it run, the wider the bounds are going to be! (Technically, the variance is seasonally corrected, but in the case of COVID we have no seasonality (yet) so it doesn’t matter.)”
It’s ok if we don’t understand what this means exactly. What’s important is for us to at least recognize that we don’t.
Want more?
- Check Lak’s post on Analyzing COVID-19 with BigQuery.
- Check Amanda’s post before you create another chart about COVID-19.
- Always include the necessary disclaimers:
I’m Felipe Hoffa, a Developer Advocate for Google Cloud. Follow me on @felipehoffa, find my previous posts on medium.com/@hoffa, and all about BigQuery on reddit.com/r/bigquery.
#vizresponsibly





