--- title: "Forecasting Time Series Groups in the tidyverse" author: "Matt Dancho" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Forecasting Time Series Groups in the tidyverse} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE, warning = FALSE} knitr::opts_chunk$set( # message = FALSE, # warning = FALSE, fig.width = 8, fig.height = 4.5, fig.align = 'center', out.width='95%', dpi = 200 ) # devtools::load_all() # Travis CI fails on load_all() ``` > Extending `broom` to time series forecasting One of the most powerful benefits of `sweep` is that it helps forecasting at scale within the "tidyverse". There are two common situations: 1. Applying a model to groups of time series 2. Applying multiple models to a time series In this vignette we'll review how `sweep` can help the __first situation__: _Applying a model to groups of time series_. # Prerequisites Before we get started, load the following packages. ```{r, message = F} library(dplyr) library(ggplot2) library(tidyr) library(purrr) library(lubridate) library(tidyquant) library(timetk) library(sweep) library(forecast) ``` # Bike Sales We'll use the bike sales data set, `bike_sales`, provided with the `sweep` package for this tutorial. The `bike_sales` data set is a _fictional_ daily order history that spans 2011 through 2015. It simulates a sales database that is typical of a business. The customers are the "bike shops" and the products are the "models". ```{r} bike_sales ``` We'll analyse the monthly sales trends for the bicycle manufacturer. Let's transform the data set by aggregating by month. ```{r} bike_sales_monthly <- bike_sales %>% mutate(month = month(order.date, label = TRUE), year = year(order.date)) %>% group_by(year, month) %>% summarise(total.qty = sum(quantity)) bike_sales_monthly ``` We can visualize package with a month plot using the `ggplot2` . ```{r} bike_sales_monthly %>% ggplot(aes(x = month, y = total.qty, group = year)) + geom_area(aes(fill = year), position = "stack") + labs(title = "Quantity Sold: Month Plot", x = "", y = "Sales", subtitle = "March through July tend to be most active") + scale_y_continuous() + theme_tq() ``` Suppose Manufacturing wants a more granular forecast because the bike components are related to the secondary category. In the next section we discuss how `sweep` can help to perform a forecast on each sub-category. # Performing Forecasts on Groups First, we need to get the data organized into groups by month of the year. We'll create a new "order.month" date using `zoo::as.yearmon()` that captures the year and month information from the "order.date" and then passing this to `lubridate::as_date()` to convert to date format. ```{r} monthly_qty_by_cat2 <- bike_sales %>% mutate(order.month = as_date(as.yearmon(order.date))) %>% group_by(category.secondary, order.month) %>% summarise(total.qty = sum(quantity)) monthly_qty_by_cat2 ``` Next, we use the `nest()` function from the `tidyr` package to consolidate each time series by group. The newly created list-column, "data.tbl", contains the "order.month" and "total.qty" columns by group from the previous step. The `nest()` function just bundles the data together which is very useful for iterative functional programming. ```{r} monthly_qty_by_cat2_nest <- monthly_qty_by_cat2 %>% group_by(category.secondary) %>% nest() monthly_qty_by_cat2_nest ``` ## Forecasting Workflow The forecasting workflow involves a few basic steps: 1. Step 1: Coerce to a `ts` object class. 2. Step 2: Apply a model (or set of models) 3. Step 3: Forecast the models (similar to predict) 4. Step 4: Tidy the forecast ## Step 1: Coerce to a `ts` object class In this step we map the `tk_ts()` function into a new column "data.ts". The procedure is performed using the combination of `dplyr::mutate()` and `purrr::map()`, which works really well for the data science workflow where analyses are built progressively. As a result, this combination will be used in many of the subsequent steps in this vignette as we build the analysis. ### mutate and map The `mutate()` function adds a column, and the `map()` function maps the contents of a list-column (`.x`) to a function (`.f`). In our case, `.x = data.tbl` and `.f = tk_ts`. The arguments `select = -order.month`, `start = 2011`, and `freq = 12` are passed to the `...` parameters in map, which are passed through to the function. The `select` statement is used to drop the "order.month" from the final output so we don't get a bunch of warning messages. We specify `start = 2011` and `freq = 12` to return a monthly frequency. ```{r} monthly_qty_by_cat2_ts <- monthly_qty_by_cat2_nest %>% mutate(data.ts = map(.x = data, .f = tk_ts, select = -order.month, start = 2011, freq = 12)) monthly_qty_by_cat2_ts ``` ## Step 2: Modeling a time series Next, we map the Exponential Smoothing ETS (Error, Trend, Seasonal) model function, `ets`, from the `forecast` package. Use the combination of `mutate` to add a column and `map` to interatively apply a function rowwise to a list-column. In this instance, the function to map the `ets` function and the list-column is "data.ts". We rename the resultant column "fit.ets" indicating an ETS model was fit to the time series data. ```{r} monthly_qty_by_cat2_fit <- monthly_qty_by_cat2_ts %>% mutate(fit.ets = map(data.ts, ets)) monthly_qty_by_cat2_fit ``` At this point, we can do some model inspection with the `sweep` tidiers. ### sw_tidy To get the model parameters for each nested list, we can combine `sw_tidy` within the `mutate` and `map` combo. The only real difference is now we `unnest` the generated column (named "tidy"). Last, because it's easier to compare the model parameters side by side, we add one additional call to `spread()` from the `tidyr` package. ```{r} monthly_qty_by_cat2_fit %>% mutate(tidy = map(fit.ets, sw_tidy)) %>% unnest(tidy) %>% spread(key = category.secondary, value = estimate) ``` ### sw_glance We can view the model accuracies also by mapping `sw_glance` within the `mutate` and `map` combo. ```{r} monthly_qty_by_cat2_fit %>% mutate(glance = map(fit.ets, sw_glance)) %>% unnest(glance) ``` ### sw_augment The augmented fitted and residual values can be achieved in much the same manner. This returns nine groups data. Note that we pass `timetk_idx = TRUE` to return the date format times as opposed to the regular (yearmon or numeric) time series. ```{r} augment_fit_ets <- monthly_qty_by_cat2_fit %>% mutate(augment = map(fit.ets, sw_augment, timetk_idx = TRUE, rename_index = "date")) %>% unnest(augment) augment_fit_ets ``` We can plot the residuals for the nine categories like so. Unfortunately we do see some very high residuals (especially with "Fat Bike"). This is often the case with realworld data. ```{r} augment_fit_ets %>% ggplot(aes(x = date, y = .resid, group = category.secondary)) + geom_hline(yintercept = 0, color = "grey40") + geom_line(color = palette_light()[[2]]) + geom_smooth(method = "loess") + labs(title = "Bike Quantity Sold By Secondary Category", subtitle = "ETS Model Residuals", x = "") + theme_tq() + facet_wrap(~ category.secondary, scale = "free_y", ncol = 3) + scale_x_date(date_labels = "%Y") ``` ### sw_tidy_decomp We can create decompositions using the same procedure with `sw_tidy_decomp()` and the `mutate` and `map` combo. ```{r} monthly_qty_by_cat2_fit %>% mutate(decomp = map(fit.ets, sw_tidy_decomp, timetk_idx = TRUE, rename_index = "date")) %>% unnest(decomp) ``` ## Step 3: Forecasting the model We can also forecast the multiple models again using a very similar approach with the `forecast` function. We want a 12 month forecast so we add the argument for the `h = 12` (refer to `?forecast` for all of the parameters you can add, there's quite a few). ```{r} monthly_qty_by_cat2_fcast <- monthly_qty_by_cat2_fit %>% mutate(fcast.ets = map(fit.ets, forecast, h = 12)) monthly_qty_by_cat2_fcast ``` ## Step 4: Tidy the forecast Next, we can apply `sw_sweep` to get the forecast in a nice "tidy" data frame. We use the argument `fitted = FALSE` to remove the fitted values from the forecast (leave off if fitted values are desired). We set `timetk_idx = TRUE` to use dates instead of numeric values for the index. We'll use `unnest()` to drop the left over list-columns and return an unnested data frame. ```{r} monthly_qty_by_cat2_fcast_tidy <- monthly_qty_by_cat2_fcast %>% mutate(sweep = map(fcast.ets, sw_sweep, fitted = FALSE, timetk_idx = TRUE)) %>% unnest(sweep) monthly_qty_by_cat2_fcast_tidy ``` Visualization is just one final step. ```{r, fig.height=7} monthly_qty_by_cat2_fcast_tidy %>% ggplot(aes(x = index, y = total.qty, color = key, group = category.secondary)) + geom_ribbon(aes(ymin = lo.95, ymax = hi.95), fill = "#D5DBFF", color = NA, linewidth = 0) + geom_ribbon(aes(ymin = lo.80, ymax = hi.80, fill = key), fill = "#596DD5", color = NA, linewidth = 0, alpha = 0.8) + geom_line() + labs(title = "Bike Quantity Sold By Secondary Category", subtitle = "ETS Model Forecasts", x = "", y = "Units") + scale_x_date(date_breaks = "1 year", date_labels = "%Y") + scale_color_tq() + scale_fill_tq() + facet_wrap(~ category.secondary, scales = "free_y", ncol = 3) + theme_tq() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) ``` # Recap The `sweep` package has a several tools to analyze grouped time series. In the next vignette we will review how to apply multiple models to a time series.