--- title: "Forecasting Using Multiple Models" author: "Matt Dancho" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Forecasting Using Multiple Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE, warning = FALSE} knitr::opts_chunk$set( # message = FALSE, # warning = FALSE, fig.width = 8, fig.height = 4.5, fig.align = 'center', out.width='95%', dpi = 200 ) # devtools::load_all() # Travis CI fails on load_all() ``` > Extending `broom` to time series forecasting One of the most powerful benefits of `sweep` is that it helps forecasting at scale within the "tidyverse". There are two common situations: 1. Applying a model to groups of time series 2. Applying multiple models to a time series In this vignette we'll review how `sweep` can help the __second situation__: _Applying multiple models to a time series_. # Prerequisites Before we get started, load the following packages. ```{r, message = F} library(tidyr) library(dplyr) library(purrr) library(lubridate) library(ggplot2) library(tidyquant) library(timetk) library(sweep) library(forecast) ``` # Forecasting Gasoline Prices To start, let's get some data from the FRED data base using `tidyquant`. We'll use `tq_get()` to retrieve the Gasoline Prices from 1990 through today (`r today()`). ```{r} gas_prices_monthly_raw <- tq_get( x = "GASREGCOVM", get = "economic.data", from = "1990-01-01", to = "2016-12-31") gas_prices_monthly_raw ``` Upon a brief inspection, the data contains `r is.na(gas_prices_monthly_raw$price) %>% sum()` `NA` values that will need to be dealt with. ```{r} summary(gas_prices_monthly_raw$price) ``` We can use the `fill()` from the `tidyr` package to help deal with these data. We first fill down and then fill up to use the previous and then post days prices to fill in the missing data. ```{r} gas_prices_monthly <- gas_prices_monthly_raw %>% fill(price, .direction = "down") %>% fill(price, .direction = "up") ``` We can now visualize the data. ```{r} gas_prices_monthly %>% ggplot(aes(x = date, y = price)) + geom_line(color = palette_light()[[1]]) + labs(title = "Gasoline Prices, Monthly", x = "", y = "USD") + scale_y_continuous(labels = scales::dollar) + theme_tq() ``` Monthly periodicity might be a bit granular for model fitting. We can easily switch periodicity to quarterly using `tq_transmute()` from the `tidyquant` package along with the periodicity aggregation function `to.period` from the `xts` package. We'll convert the date to `yearqtr` class which is regularized. ```{r} gas_prices_quarterly <- gas_prices_monthly %>% tq_transmute(mutate_fun = to.period, period = "quarters") gas_prices_quarterly ``` Another quick visualization to show the reduction in granularity. ```{r} gas_prices_quarterly %>% ggplot(aes(x = date, y = price)) + geom_line(color = palette_light()[[1]], linewidth = 1) + labs(title = "Gasoline Prices, Quarterly", x = "", y = "USD") + scale_y_continuous(labels = scales::label_dollar()) + scale_x_date(date_breaks = "5 years", date_labels = "%Y") + theme_tq() ``` # Performing Forecasts Using Multiple Models In this section we will use three models to forecast gasoline prices: 1. ARIMA 2. ETS 3. BATS ## Multiple Models Concept Before we jump into modeling, let's take a look at the multiple model process from [R for Data Science, Chapter 25 Many Models](https://r4ds.had.co.nz/many-models.html#from-a-named-list). We first create a data frame from a named list. 
The example below has two columns: "f", the function names as character strings, and "params", a nested list of parameters that will be passed to the respective function in column "f".

```{r}
df <- tibble(
    f = c("runif", "rpois", "rnorm"),
    params = list(
        list(n = 10),
        list(n = 5, lambda = 10),
        list(n = 10, mean = -3, sd = 10)
    )
)
df
```

We can also view the contents of the `df$params` column to understand the underlying structure. Notice that there are three first-level elements, one per function, each containing the name-value pairs of parameters. This format is important.

```{r}
df$params
```

Next we apply the functions to the parameters using a special function, `invoke_map()`. The parameter lists in the "params" column are passed to the function named in the "f" column. The output is returned in a nested list-column named "out".

```{r}
# Note: invoke_map() is deprecated in newer versions of purrr, but it still works
df_out <- df %>%
    mutate(out = invoke_map(f, params))

df_out
```

And here are the contents of "out", which is the result of mapping a list of functions to a list of parameters. Pretty powerful!

```{r}
df_out$out
```

Take a minute to understand the conceptual process of the `invoke_map()` function and specifically the parameter setup. Once you are comfortable, we can move on to the model implementation.

## Multiple Model Implementation

We'll need to take the following steps in an actual forecast model implementation:

1. Coerce the data to time series
2. Build a model list using nested lists
3. Create the model data frame
4. Invoke a function map

This is easier than it sounds. Let's start by coercing the univariate time series with `tk_ts()`.

```{r}
gas_prices_quarterly_ts <- gas_prices_quarterly %>% 
    tk_ts(select = -date, start = c(1990, 3), freq = 4)

gas_prices_quarterly_ts
```

Next, create a nested list using the function names as the first-level keys (this is important, as you'll see in the next step). Pass the model parameters as name-value pairs in the second level.

```{r}
models_list <- list(
    auto.arima = list(
        y = gas_prices_quarterly_ts
    ),
    ets = list(
        y      = gas_prices_quarterly_ts,
        damped = TRUE
    ),
    bats = list(
        y = gas_prices_quarterly_ts
    )
)
```

Now, convert the list to a data frame using `enframe()`, which turns lists into tibbles. Set the arguments `name = "f"` and `value = "params"`. In doing so we get a bonus: the model names are now conveniently located in column "f".

```{r}
models_tbl <- tibble::enframe(models_list, name = "f", value = "params")
models_tbl
```

We are ready to invoke the map. Combine `mutate()` with `invoke_map()` as follows. Bada bing, bada boom, we now have models fitted using the parameters we defined previously. (As noted above, `invoke_map()` is deprecated in recent purrr releases but still works; a modern equivalent using `map2()` and `exec()` is sketched at the end of the sw_tidy section below.)

```{r}
models_tbl_fit <- models_tbl %>%
    mutate(fit = purrr::invoke_map(f, params))

models_tbl_fit
```

# Inspecting the Model Fit

This is a good point to review and understand the model output. We can review the model parameters, accuracy measurements, and the residuals using `sw_tidy()`, `sw_glance()`, and `sw_augment()`.

## sw_tidy

The tidying function returns the model parameters and estimates. We use the combination of `mutate()` and `map()` to iteratively apply `sw_tidy()` as a new column named "tidy". Then we unnest and spread to review the terms by model function (an equivalent using the newer `pivot_wider()` is sketched after this chunk).

```{r}
models_tbl_fit %>%
    mutate(tidy = map(fit, sw_tidy)) %>%
    unnest(tidy) %>%
    select(f, term, estimate) %>%   # keep only the columns needed for the term-by-model table
    spread(key = f, value = estimate)
```
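A quick aside on package versions: two of the functions used above still work but have newer replacements in the tidyverse. `spread()` has been superseded by `pivot_wider()` in tidyr, and `invoke_map()` is deprecated in recent versions of purrr. The sketches below are optional equivalents (assuming tidyr >= 1.0.0 and a recent purrr), not part of the original workflow.

The same term-by-model table, written with `pivot_wider()`:

```{r}
models_tbl_fit %>%
    mutate(tidy = map(fit, sw_tidy)) %>%
    unnest(tidy) %>%
    select(f, term, estimate) %>%
    # one row per term, one column per model
    pivot_wider(names_from = f, values_from = estimate)
```

And the model-fitting step without `invoke_map()`, using `map2()` to pair each function name with its parameter list and `exec()` to call it. The chunk is not evaluated here since it simply refits the same models as `models_tbl_fit`.

```{r, eval = FALSE}
models_tbl %>%
    mutate(fit = map2(f, params, ~ exec(.x, !!!.y)))
```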
## sw_glance

Glance is one of the most powerful tools because it yields the model accuracy metrics, enabling direct comparison of the fit of each model. We use the same process used for tidy, except there's no need to spread to make the comparison. We can see that the ARIMA model has the lowest AIC by far.

```{r}
models_tbl_fit %>%
    mutate(glance = map(fit, sw_glance)) %>%
    select(f, glance) %>%   # drop the list-columns before unnesting
    unnest(glance)
```

## sw_augment

We can augment the models to get the residuals following the same procedure, and we can pipe (`%>%`) the results right into `ggplot()` for plotting. Notice that the ARIMA model has the largest residuals, especially as the index increases, whereas the BATS model has relatively low residuals.

```{r, warning=F, fig.height=8}
models_tbl_fit %>%
    mutate(augment = map(fit, sw_augment, rename_index = "date")) %>%
    unnest(augment) %>%
    ggplot(aes(x = date, y = .resid, group = f)) +
    geom_line(color = palette_light()[[2]]) +
    geom_point(color = palette_light()[[1]]) +
    geom_smooth(method = "loess") +
    facet_wrap(~ f, nrow = 3) +
    labs(title = "Residuals Plot") +
    theme_tq()
```

# Forecasting the model

Creating the forecast for each model is accomplished by mapping the `forecast()` function. The next six quarters are forecasted with the argument `h = 6`.

```{r}
models_tbl_fcast <- models_tbl_fit %>%
    mutate(fcast = map(fit, forecast, h = 6))

models_tbl_fcast
```

# Tidying the forecast

Next, we map `sw_sweep()`, which coerces the forecasts into the "tidy" tibble format. We set `fitted = FALSE` to remove the model fitted values from the output, and `timetk_idx = TRUE` to use dates instead of numeric values for the index.

```{r}
models_tbl_fcast_tidy <- models_tbl_fcast %>%
    mutate(sweep = map(fcast, sw_sweep, fitted = FALSE, timetk_idx = TRUE, rename_index = "date"))

models_tbl_fcast_tidy
```

We can unnest the "sweep" column to get the results of all three models.

```{r}
models_tbl_fcast_tidy %>%
    unnest(sweep)
```

Finally, we can plot the forecasts by unnesting the "sweep" column and piping to `ggplot()`.

```{r, fig.height=8}
models_tbl_fcast_tidy %>%
    unnest(sweep) %>%
    ggplot(aes(x = date, y = price, color = key, group = f)) +
    geom_ribbon(aes(ymin = lo.95, ymax = hi.95), 
                fill = "#D5DBFF", color = NA, linewidth = 0) +
    geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), 
                fill = "#596DD5", color = NA, linewidth = 0, alpha = 0.8) +
    geom_line(linewidth = 1) +
    facet_wrap(~ f, nrow = 3) +
    labs(title = "Gasoline Price Forecasts",
         subtitle = "Forecasting multiple models with sweep: ARIMA, BATS, ETS",
         x = "", y = "Price") +
    scale_y_continuous(labels = scales::label_dollar()) +
    scale_x_date(date_breaks = "5 years", date_labels = "%Y") +
    theme_tq() +
    scale_color_tq()
```

# Recap

The `sweep` package can aid the analysis of multiple forecast models applied to the same time series. In the next vignette we will review time series object coercion with `sweep`.