A Complete Introduction To Time Series Analysis (with R): Model Selection for ARMA(p,q)

[Figure: AIC, AICc, and BIC metrics]

In the last section, we learned about Gaussian Time Series, a powerful and flexible assumption for ARMA(p,q) parameter estimation. In this article, we will see how we can select the “best” model among a number of fitted models.

In the previous section, we saw how Gaussian assumptions allow us to write down and maximize the likelihood of an ARMA(p,q) process to obtain estimates of the AR and MA coefficients (the alphas and thetas). That is, for any model, we are looking for parameters such that

$$(\hat{\alpha}, \hat{\theta}, \hat{\sigma}^2) \;=\; \arg\max_{\alpha,\,\theta,\,\sigma^2} \; L\!\left(\alpha, \theta, \sigma^2 \,\middle|\, x_1, \dots, x_n\right)$$

Equivalently, in general, we consider the log-likelihood for ease of optimization, given by

$$\ell(\alpha, \theta, \sigma^2) \;=\; \log L\!\left(\alpha, \theta, \sigma^2 \,\middle|\, x_1, \dots, x_n\right)$$

Now, imagine we have some time series X_{t}, and we fit two models: an ARMA(4,2) and an ARMA(5,3). The question is: can’t we just use the raw likelihood of each model to choose one over the other? Definitely not! The fitted parameters are chosen precisely to maximize the likelihood, so the likelihood itself is a very “optimistic” measure of fit. Why is this?

  • Suppose that we calculate the log-likelihood for M different models.
  • We could choose the model with the largest likelihood.
  • However, the log-likelihood is an optimistic measure of fit because choosing the model with the largest log-likelihood can easily mean choosing the model with the most parameters!
  • Indeed, a model with more parameters may appear to provide a better fit simply because the smaller models are nested within it.

So, how do we solve this issue? The idea is to add regularization by penalizing model complexity, that is, considering

$$f\!\left(\ell(\hat{\alpha}, \hat{\theta}, \hat{\sigma}^2)\right) \;+\; \text{penalty}(\text{number of parameters})$$

where, in general, f(x) := -2x, and then performing model selection by choosing whichever model has the lowest value. The selection criteria we consider for this task are the AIC, AICc, and BIC metrics, presented next.

Consider the log-likelihood of some maximized ARMA(p,q) model. To assess fit, the following metrics are defined:

$$\text{AIC} = -2\,\ell(\hat{\alpha}, \hat{\theta}, \hat{\sigma}^2) + 2(p + q + 1)$$

$$\text{AICc} = -2\,\ell(\hat{\alpha}, \hat{\theta}, \hat{\sigma}^2) + \frac{2(p + q + 1)\,n}{n - p - q - 2}$$

$$\text{BIC} = -2\,\ell(\hat{\alpha}, \hat{\theta}, \hat{\sigma}^2) + (p + q + 1)\log n$$

where n is the number of observations.

You can find a very nice theoretical derivation of these here, but I will not go over the details. However, here are a couple of notes:

  • AIC stands for “Akaike Information Criterion”, and comes from minimizing the Kullback-Leibler divergence, a statistical measure of how far apart two distributions are.
  • BIC follows a similar principle, but stands for “Bayesian Information Criterion”.
  • AICc performs similarly to the AIC in practice but adds a small-sample correction, since the AIC tends to under-penalize (and hence overfit) when the sample size is small.
  • BIC focuses more on assessing the true fit of the model, often sacrificing some predictive power for the sake of a model that is closer to the truth.
  • For n > e², BIC penalizes more than the AIC/AICc and tends to prefer simpler models.
  • BIC tends to perform poorly when the underlying data is not normal.
  • In practice, we calculate all three metrics for every model; if they disagree, we investigate the models in question further and decide whether the objective is predictive power (favoring the AIC/AICc) or a model closer to the truth (favoring the BIC).

As usual, we start by importing some libraries:
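A minimal set of imports for this walkthrough might look like the following sketch; the exact packages are an assumption (the original snippet is not shown), but forecast provides auto.arima() and forecast(), and itsmr provides the iid-noise tests used later.

```r
# Assumed packages for this walkthrough:
# 'forecast' for auto.arima(), checkresiduals() and forecast(),
# 'itsmr' for the battery of iid-noise tests on residuals.
library(forecast)
library(itsmr)
```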

In this example, we are going to work with the Wave Tank data, in which a group of researchers simulated the properties of waves in a tank according to certain disturbances. The data can be downloaded from here. First, we read and inspect the data, and then transform it into an appropriate ts object.
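A plausible version of this step, assuming the measurements sit in a single-column CSV named wavetank.csv (the file name, column, and sampling frequency are placeholders):

```r
# Read the wave tank measurements (file name and column are assumptions).
waves_raw <- read.csv("wavetank.csv", header = TRUE)
head(waves_raw)

# Turn the first column into a ts object; adjust the frequency to the
# actual sampling rate of the experiment.
waves <- ts(waves_raw[[1]])
str(waves)
```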

Let’s visualize the data and its ACF and PACF:
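A sketch of the plotting code, using base R graphics and the waves object assumed above:

```r
# Plot the series together with its sample ACF and PACF.
op <- par(mfrow = c(3, 1))
plot(waves, main = "Wave tank data", ylab = "Wave height")
acf(waves,  main = "Sample ACF")
pacf(waves, main = "Sample PACF")
par(op)
```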

[Figures: wave tank series, sample ACF, and sample PACF]

Because these are waves, it is no surprise that there is seasonality. We also see considerably more attenuation in the autocorrelation. Next, we are going to fit a number of models, which we will compare later. We will also inspect the residuals of each.
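One way to fit the four candidates discussed below is with arima(), followed by the usual residual diagnostics on each fit; this is a sketch, and the object names are my own:

```r
# Fit the four candidate models compared in the sections below.
fit_ar2    <- arima(waves, order = c(2, 0, 0))
fit_ar4    <- arima(waves, order = c(4, 0, 0))
fit_arma22 <- arima(waves, order = c(2, 0, 2))
fit_arma44 <- arima(waves, order = c(4, 0, 4))

# Residual diagnostics (repeat for each fit): standardized residuals,
# residual ACF, and Ljung-Box p-values.
tsdiag(fit_ar2)
checkresiduals(fit_ar2)   # from the 'forecast' package
```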

AR(2)

[Output: AR(2) fit and residual diagnostics]

The model is such a bad fit that the degrees of freedom reported for the test don’t even make sense. Checking the residuals, we see a number of lags whose autocorrelations fall outside the confidence bounds, although the residuals themselves look approximately normal. It is hard to determine whether they are white noise or not.

AR(4)

[Output: AR(4) fit and residual diagnostics]

Same as before.

ARMA(2,2)

[Output: ARMA(2,2) fit and residual diagnostics]

Once again, we still get a lot of points outside the bounds in the residual ACF. It’s hard to say whether the residuals are white noise or not.

ARMA(4,4)

[Output: ARMA(4,4) fit and residual diagnostics]

Okay, this one is much better than the previous ones, but now normality looks somewhat doubtful.

Stationarity tests

Let’s use another function to verify the residuals, like the ones we saw in this article. Let’s start with the ARMA(2,2) we fitted before:
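The battery of tests described below (Ljung-Box, McLeod-Li, turning points, difference signs, rank test, plus ACF/PACF and QQ plots) matches what itsmr::test() reports; assuming that is the function in question, the call would be:

```r
# Run the iid-noise tests and diagnostic plots on the ARMA(2,2) residuals.
test(residuals(fit_arma22))
```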

[Output: iid-noise tests and plots for the ARMA(2,2) residuals]

Most of the tests fail to reject the null hypothesis of iid noise, except for the difference-sign test. The residuals indeed look like white noise, and the normal QQ plot also supports the underlying normality assumption. How about the ARMA(4,4)?

[Output: iid-noise tests and plots for the ARMA(4,4) residuals]

Similar to before, but we obtain much stronger results. You might think this is a reasonable model to go ahead with. However, you might be uncomfortable relying on these diagnostics alone. Why? Because they do not really tell us which model is actually better!

AIC, AICc, BIC

Instead, we should use the evaluation metrics we presented at the beginning of this article to perform model selection. We do it as follows:
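A simple way to tabulate the criteria, assuming the fitted objects from the sketch above, is with the AIC() and BIC() generics:

```r
fits <- list(AR2 = fit_ar2, AR4 = fit_ar4,
             ARMA22 = fit_arma22, ARMA44 = fit_arma44)

# First output: AIC of every model; second output: BIC.
sapply(fits, AIC)
sapply(fits, BIC)
```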

Here, the first output is the AIC for all the models, and the second one is the BIC. We see that based on both the AIC and BIC, we should indeed pick the ARMA(4,4) model.

AutoArima

One awesome tool you can use, however, is the auto.arima function, which automatically fits a number of models, reports their AIC/BIC, and returns the model with the lowest value of the chosen criterion. How convenient! Let’s start by fitting based on the AIC criterion:
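Assuming the forecast package, the call could look like the following; trace = TRUE prints the criterion for every candidate, which is where the “Inf” values mentioned below show up:

```r
# Search over ARIMA orders using the AIC and print every candidate tried.
fit_aic <- auto.arima(waves, ic = "aic", trace = TRUE)
```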

Note that for some of the candidates, the value is reported as “Inf”. This typically means the model could not be fitted properly (or failed auto.arima’s checks), so no finite AIC is reported. The output model is an ARIMA(4,0,1) = ARMA(4,1) model (we will explore ARIMA and SARIMA models later). We can inspect the resulting model:
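For instance, continuing with the hypothetical object name from above:

```r
# Inspect the selected model: coefficients, sigma^2, and information criteria.
summary(fit_aic)
```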

We can repeat this exercise using AICc and BIC:
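A sketch of the same search under the other two criteria:

```r
# Repeat the automatic search using the AICc and the BIC.
fit_aicc <- auto.arima(waves, ic = "aicc", trace = TRUE)
fit_bic  <- auto.arima(waves, ic = "bic",  trace = TRUE)
```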

We see that under all criteria, we obtain the same best model. This is not always the case, but usually, this is what happens. If you end up with different models, choose according to whether you need more prediction power (AIC) or a more “truthful” model (BIC). Let’s check the residuals:
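A minimal residual check with the forecast package, assuming the fit_aic object from above:

```r
# Residual plot, residual ACF, histogram, and a Ljung-Box test.
checkresiduals(fit_aic)
```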

[Output: residual diagnostics for the auto.arima model]

This is not that bad; although we see that we are still missing part of the seasonal component, this is a good first attempt. We will revisit this in the SARIMA models later.

Predictions

Now, let’s go with my favorite part: producing some predictions! Can you remember which algorithm(s) is/are used for this? :)
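A minimal forecasting sketch with the forecast package; the horizon of 50 steps is an arbitrary choice:

```r
# Forecast the next 50 observations and plot them with prediction intervals.
wave_forecast <- forecast(fit_aic, h = 50)
plot(wave_forecast)
```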

[Figures: forecasts with prediction intervals for the wave tank data]
