Working With Time Series
Recently I’ve been working on a project tracking Covid-19 infections and vaccinations. The bulk of this project involves working with time series data, and I wanted to talk a little bit about what makes a time series and what assumptions need to be met to model the data effectively. Make sure to follow along with my blog as I plan to write about different kinds of time series models in upcoming posts.
What Is a Time Series:
Before we can start talking about how to analyze time series, we should go over the definition of what a time series is. A time series is a sequence of observations of the same variable recorded at consistent intervals of time. For example, the Covid-19 data supplied by Our World in Data includes daily updates of the total number of cases of the virus, the new cases reported that day, the vaccinations administered that day, and so on. Each of these columns is a time series because it records a variable at a consistent daily interval.
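To make the definition concrete, here is a minimal sketch of a daily series in pandas. The case counts are made up for illustration and are not taken from the Our World in Data file; the check at the end simply confirms that every observation is exactly one day after the previous one.

```python
import pandas as pd

# Hypothetical daily counts; the numbers are invented for illustration.
dates = pd.date_range(start="2021-01-01", periods=7, freq="D")
new_cases = pd.Series(
    [120, 135, 150, 160, 158, 170, 182], index=dates, name="new_cases"
)

# A regular daily series has exactly one day between consecutive rows.
gaps = new_cases.index.to_series().diff().dropna()
is_regular = bool((gaps == pd.Timedelta(days=1)).all())

print(new_cases)
print("Evenly spaced:", is_regular)
```

If the real data had missing days, this check would fail, and the gaps would need to be filled or the series resampled before modeling.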
Now that we have defined what a time series is, we should cover when time series analysis is useful. As one might expect, this process is useful for any research application that pertains to time. It is especially good for analyzing how values have changed over time, or predicting how they will continue to change as time progresses. Some common applications of time series analysis are predicting stock prices, forecasting sales, inventory analysis, and budget analysis.
Stationarity:
Next, we will discuss the key assumption of working with time series data: stationarity. A time series must have constant mean, variance and autocorrelation to be considered stationary. Let’s break this down into individual components. For the mean to be constant, the time series has to be relatively flat. If there is a large upward or downward direction to the time series, then the mean will reflect this and will either increase or decrease. This general direction is called trend, and for most models the effect of trend must be minimized. For the variance to be constant, the spread of the series around its mean has to stay roughly the same as time changes, without large swings in either direction. Sometimes changes in variance can be attributed to seasonality. Seasonality refers to the idea that a time series can show cyclical patterns tied to set periods of time. Some examples of seasonality include the decrease in house prices during the winter and the increase of airline traffic on weekends. Seasonality is a problem for most time series models, but a model such as SARIMA can account for the seasonal nature of the data. Finally, for the autocorrelation to be constant, the relationship between a data point and the points a given number of steps before it must be the same throughout the series.
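One informal way to see whether the mean is constant is to compare rolling averages across the series. The sketch below uses two synthetic series (pure noise, and the same noise with an upward trend added); the 30-point window and the helper name `rolling_mean_range` are assumptions made for illustration, not a standard recipe.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
noise = rng.normal(loc=0.0, scale=1.0, size=n)

flat = pd.Series(noise)                           # constant mean and variance
trending = pd.Series(0.1 * np.arange(n) + noise)  # mean rises over time


def rolling_mean_range(series, window=30):
    """Gap between the largest and smallest rolling mean (hypothetical helper)."""
    rolling_mean = series.rolling(window).mean().dropna()
    return rolling_mean.max() - rolling_mean.min()


flat_range = rolling_mean_range(flat)
trend_range = rolling_mean_range(trending)

print("Rolling-mean range, flat series:", flat_range)
print("Rolling-mean range, trending series:", trend_range)
```

A rolling mean that wanders far from where it started, as it does for the trending series, is a sign of trend. The same idea works for variance by swapping `.mean()` for `.std()`.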
There are many ways to check if a series is stationary, but one of the easiest is to use an Augmented Dickey-Fuller test. This test checks for the presence of a unit root in the series, which indicates a lack of stationarity. The null hypothesis of this test is that there is a unit root present and the alternative hypothesis is that there is no unit root. Here is an example of how to run an Augmented Dickey-Fuller test in Python:
# Import pandas, numpy and adfuller test
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Create a random series
test_series = np.random.random(50)

# Run test
results = adfuller(test_series)

# Check the test statistic and p-value for the test
print('Test Statistic:', results[0])
print('P-Value:', results[1])
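The test statistic returned is usually a negative number, and the more negative it is, the stronger the evidence against the null hypothesis. If the p-value is below your chosen significance level (0.05 is a common choice), you can reject the null hypothesis of a unit root and treat the series as stationary.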
There are a few options for how to continue with a non-stationary time series, and which options you choose will vary based on the series in question. To improve stationarity in terms of variance, you can try transforming your data, for example with a log transformation. If the mean is not constant for a time series, you can instead use differencing to improve stationarity. Differencing replaces each value with its change from the previous value, much like taking a derivative of the series, so it removes a steady upward or downward trend. In some instances you may need to difference the time series multiple times to get to a stationary series. Differencing in Python is very easy: simply call .diff() on your pandas time series. Here is an example of a non-stationary time series from the Covid-19 dataset.
Here we can see that neither variance nor autocorrelation is the issue with this time series, but the mean greatly increases over time. When checked with an Augmented Dickey-Fuller test, the p-value was 0.74, so we cannot reject the null hypothesis and the series should be treated as non-stationary. Here is a plot of the differenced time series:
This time series looks much closer to being stationary, but we still have some issues with the mean not being constant. When checked with the Augmented Dickey-Fuller test, the differenced time series has a p-value of 0.54, indicating that the new time series is closer, but not stationary yet. Below is the series differenced twice:
The twice-differenced series is much closer to stationary. Here we see the mean is constant at zero, as there is noise around it but no general upward direction to the series. We do have a small issue with variance, as we can see large swings around January 2021, but it is likely not large enough to affect our stationarity. When checked with an Augmented Dickey-Fuller test, the p-value was below 0.0005, indicating that the series is likely stationary.
Modeling:
Once the usability of a time series is assessed, a model must be selected. There are many different types of models that can be used for time series analysis, and selecting a model should depend on the time series that you are modeling. For this article, I will not go into the specifics of using any models, but I will likely do a more in-depth analysis of several models soon. For now, I will give a simple overview of some of the types of models.
Autoregression (AR) models calculate predictions as a linear function of the previous data points. These models use a value p to indicate the order of autoregression, that is, how many previous points are used. Moving Average (MA) models calculate predictions as a linear function of the errors the model made at earlier points of the time series. ARMA (Autoregressive Moving Average) models combine AR and MA models to account for both previous values and previous errors. Similarly, Autoregressive Integrated Moving Average (ARIMA) models combine AR and MA models, but also incorporate differencing into the model. Seasonal Autoregressive Integrated Moving Average (SARIMA) models add seasonal terms to the ARIMA model.
Vector Autoregressive (VAR) models are used for multivariate time series analysis. In this model, the values of several time series at each time step are treated as a vector, and predictions are calculated from the vectors at previous time steps. Like AR models, VAR models can be combined with MA models, in this case to form VARMA models. Finally, there are models that use exponential smoothing, which predict next steps as an exponentially weighted linear function of earlier data points.
Summary:
Working with time series can sometimes be complicated. There are many different types and combinations of models to be used for time series analysis, and choosing the right one requires a thorough understanding of your data set. In addition, preprocessing can be a bit tricky if you’re not familiar with how to transform a time series into a stationary series. Ultimately the process is worthwhile, as the predictions and analysis generated from a time series model can be very powerful.
Resources:
For more on the Augmented Dickey-Fuller test, check out this article.
For an in-depth look at the statistics behind time series modeling, check out this handbook.
For a quick overview of how to use various models in Python, check out this cheat sheet.