--- title: "Reduce Forecast Error with Cleaned Anomalies" author: "Business Science" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Reduce Forecast Error with Cleaned Anomalies} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = F, fig.align = "center" ) library(dplyr) library(ggplot2) library(tidyquant) library(anomalize) library(timetk) ``` > Forecasting error can often be reduced 20% to 50% by repairing anomolous data ## Example - Reducing Forecasting Error by 32% We can often get better forecast performance by cleaning anomalous data prior to forecasting. This is the perfect use case for integrating the `clean_anomalies()` function into your ___forecast workflow___. ```r library(tidyverse) library(tidyquant) library(anomalize) library(timetk) ``` ```{r} # NOTE: timetk now has anomaly detection built in, which # will get the new functionality going forward. # Use this script to prevent overwriting legacy anomalize: anomalize <- anomalize::anomalize plot_anomalies <- anomalize::plot_anomalies ``` Here is a short example with the `tidyverse_cran_downloads` dataset that comes with `anomalize`. __We'll see how we can reduce the forecast error by 32% simply by repairing anomalies.__ ```{r} tidyverse_cran_downloads ``` Let's take one package with some extreme events. We can hone in on `lubridate`, which has some outliers that we can fix. ```{r, fig.height=8, fig.width=6} tidyverse_cran_downloads %>% ggplot(aes(date, count, color = package)) + geom_point(alpha = 0.5) + facet_wrap(~ package, ncol = 3, scales = "free_y") + scale_color_viridis_d() + theme_tq() ``` ## Forecasting Lubridate Downloads Let's focus on downloads of the `lubridate` R package. ```{r} lubridate_tbl <- tidyverse_cran_downloads %>% ungroup() %>% filter(package == "lubridate") ``` First, we'll make a function, `forecast_mae()`, that can take the input of both cleaned and uncleaned anomalies and calculate forecast error of future uncleaned anomalies. The modeling function uses the following criteria: - Split the `data` into training and testing data that maintains the correct time-series sequence using the `prop` argument. - Models the daily time series of the training data set from observed (demonstrates no cleaning) or observed and cleaned (demonstrates improvement from cleaning). Specified by the `col_train` argument. - Compares the predictions to the observed values. Specified by the `col_test` argument. ```{r} forecast_mae <- function(data, col_train, col_test, prop = 0.8) { predict_expr <- enquo(col_train) actual_expr <- enquo(col_test) idx_train <- 1:(floor(prop * nrow(data))) train_tbl <- data %>% filter(row_number() %in% idx_train) test_tbl <- data %>% filter(!row_number() %in% idx_train) # Model using training data (training) model_formula <- as.formula(paste0(quo_name(predict_expr), " ~ index.num + year + quarter + month.lbl + day + wday.lbl")) model_glm <- train_tbl %>% tk_augment_timeseries_signature() %>% glm(model_formula, data = .) # Make Prediction suppressWarnings({ # Suppress rank-deficit warning prediction <- predict(model_glm, newdata = test_tbl %>% tk_augment_timeseries_signature()) actual <- test_tbl %>% pull(!! actual_expr) }) # Calculate MAE mae <- mean(abs(prediction - actual)) return(mae) } ``` ## Workflow for Cleaning Anomalies We will use the `anomalize` workflow of decomposing (`time_decompose()`) and identifying anomalies (`anomalize()`). 
We then use __`clean_anomalies()` to add a new column, "observed_cleaned", in which anomalies are repaired by replacing them with the trend + seasonal components from the decompose operation__. We can now experiment to see the improvement in forecasting performance by comparing a forecast made with "observed" versus "observed_cleaned".

```{r}
lubridate_anomalized_tbl <- lubridate_tbl %>%
  time_decompose(count) %>%
  anomalize(remainder) %>%

  # Function to clean & repair anomalous data
  clean_anomalies()

lubridate_anomalized_tbl
```

## Before Cleaning with anomalize

```{r}
lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed, col_test = observed, prop = 0.8)
```

## After Cleaning with anomalize

```{r}
lubridate_anomalized_tbl %>%
  forecast_mae(col_train = observed_cleaned, col_test = observed, prop = 0.8)
```

## 32% Reduction in Forecast Error

This is approximately a 32% reduction in forecast error as measured by Mean Absolute Error (MAE).

```{r}
(2755 - 4054) / 4054
```

# Interested in Learning Anomaly Detection?

Business Science offers two 1-hour courses on Anomaly Detection:

- [Learning Lab 18](https://university.business-science.io/p/learning-labs-pro) - Time Series Anomaly Detection with `anomalize`

- [Learning Lab 17](https://university.business-science.io/p/learning-labs-pro) - Anomaly Detection with `H2O` Machine Learning