Title: | Tidy Anomaly Detection |
---|---|
Description: | The 'anomalize' package enables a "tidy" workflow for detecting anomalies in data. The main functions are time_decompose(), anomalize(), and time_recompose(). When combined, it's quite simple to decompose time series, detect anomalies, and create bands separating the "normal" data from the anomalous data at scale (i.e. for multiple time series). Time series decomposition is used to remove trend and seasonal components via the time_decompose() function and methods include seasonal decomposition of time series by Loess ("stl") and seasonal decomposition by piecewise medians ("twitter"). The anomalize() function implements two methods for anomaly detection of residuals including using an interquartile range ("iqr") and generalized extreme studentized deviation ("gesd"). These methods are based on those used in the 'forecast' package and the Twitter 'AnomalyDetection' package. Refer to the associated functions for specific references for these methods. |
Authors: | Matt Dancho [aut, cre], Davis Vaughan [aut] |
Maintainer: | Matt Dancho <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.0.9000 |
Built: | 2024-10-26 04:27:51 UTC |
Source: | https://github.com/business-science/anomalize |
The anomalize() function is used to detect outliers in a distribution with no trend or seasonality present. It takes the output of time_decompose(), which has been de-trended and seasonally adjusted, and applies anomaly detection methods to identify outliers.
anomalize( data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE )
data | A tibble or tbl_time object. |
target | A column to apply the function to. |
method | The anomaly detection method. One of "iqr" or "gesd". |
alpha | Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations. |
max_anoms | The maximum percent of anomalies permitted to be identified. |
verbose | A boolean. If TRUE, a list with additional information about the detected outliers is returned along with the data. |
The returned data contains three new columns: "remainder_l1" (lower limit for anomalies), "remainder_l2" (upper limit for anomalies), and "anomaly" (Yes/No).
Use time_decompose() to decompose a time series prior to performing anomaly detection with anomalize(). Typically, anomalize() is performed on the "remainder" of the time series decomposition.
For non-time series data (data without trend), the anomalize() function can be used without time series decomposition.
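For example, a minimal sketch (not from the package examples) of applying anomalize() directly to a plain tibble with no trend or seasonality -- the column name value and the injected outliers are illustrative:

library(dplyr)
library(anomalize)

# Plain tibble, no time index or decomposition required
set.seed(123)
non_ts_tbl <- tibble(value = c(rnorm(100), 25, -20))  # two injected outliers

non_ts_tbl %>%
    anomalize(value, method = "iqr", alpha = 0.05)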
The anomalize() function implements two methods for outlier detection, each with benefits.
IQR:
The IQR method uses the interquartile range between the 25th and 75th percentiles to establish a baseline distribution around the median. With the default alpha = 0.05, the limits are established by expanding the 25/75 baseline by an IQR Factor of 3 (3X). The IQR Factor = 0.15 / alpha (hence 3X with alpha = 0.05). To increase the IQR Factor controlling the limits, decrease the alpha, which makes it more difficult to be an outlier. Increase alpha to make it easier to be an outlier.
The IQR method is used in forecast::tsoutliers().
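The relationship between alpha and the IQR Factor can be illustrated with a rough base-R sketch. This mirrors the formula described above, not the internal code of iqr() or anomalize():

set.seed(1)
x <- c(rnorm(100), 15)                    # one injected outlier

alpha      <- 0.05
iqr_factor <- 0.15 / alpha                # 3X with the default alpha = 0.05
q          <- quantile(x, probs = c(0.25, 0.75))
iqr_span   <- diff(q)

limit_lower <- q[[1]] - iqr_factor * iqr_span
limit_upper <- q[[2]] + iqr_factor * iqr_span

x[x < limit_lower | x > limit_upper]      # values flagged as outliers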
GESD:
The GESD method (Generalized Extreme Studentized Deviate test) progressively eliminates outliers using a Student's t-test that compares the test statistic to a critical value. Each time an outlier is removed, the test statistic is updated. Once the test statistic drops below the critical value, all outliers are considered removed. Because this method involves continuous updating via a loop, it is slower than the IQR method. However, it tends to be the best performing method for outlier removal.
The GESD method is used in AnomalyDetection::AnomalyDetectionTs().
Returns a tibble / tbl_time object or list depending on the value of verbose.
Alex T.C. Lau (November/December 2015). GESD - A Robust and Effective Technique for Dealing with Multiple Outliers. ASTM Standardization News. www.astm.org/sn
Anomaly Detection Methods (Powers anomalize)
Time Series Anomaly Detection Functions (anomaly detection workflow):
## Not run: 
library(dplyr) # Needed to pass CRAN check / This is loaded by default

set_time_scale_template(time_scale_template())

tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr")

## End(Not run)
Methods that power anomalize()
iqr(x, alpha = 0.05, max_anoms = 0.2, verbose = FALSE)
gesd(x, alpha = 0.05, max_anoms = 0.2, verbose = FALSE)
x | A vector of numeric data. |
alpha | Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations. |
max_anoms | The maximum percent of anomalies permitted to be identified. |
verbose | A boolean. If TRUE, a list with additional information about the detected outliers is returned instead of just the outlier classification. |
Returns a character vector or list depending on the value of verbose.
The IQR method is used in forecast::tsoutliers().
The GESD method is used in Twitter's AnomalyDetection package and is also available as a function in @raunakms's GESD method.
set.seed(100)
x <- rnorm(100)
idx_outliers <- sample(100, size = 5)
x[idx_outliers] <- x[idx_outliers] + 10

iqr(x, alpha = 0.05, max_anoms = 0.2)
iqr(x, alpha = 0.05, max_anoms = 0.2, verbose = TRUE)

gesd(x, alpha = 0.05, max_anoms = 0.2)
gesd(x, alpha = 0.05, max_anoms = 0.2, verbose = TRUE)
Clean anomalies from anomalized data
clean_anomalies(data)
data | A tibble or tbl_time object that has been processed with time_decompose() and anomalize(). |
The clean_anomalies() function is used to replace outliers with the seasonal and trend component. This is often desirable when forecasting with noisy time series data to improve trend detection.
To clean anomalies, the input data must be detrended with time_decompose() and anomalized with anomalize(). The data can also be recomposed with time_recompose().
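Conceptually, for an STL decomposition the cleaning step replaces anomalous observed values with season + trend. The following is a rough dplyr sketch of that idea, not the package internals:

library(dplyr)
library(anomalize)

anomalized_tbl <- tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr")

# Replace anomalies with the seasonal + trend component
anomalized_tbl %>%
    mutate(observed_cleaned = ifelse(anomaly == "Yes", season + trend, observed))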
Returns a tibble / tbl_time object with a new column "observed_cleaned".
Time Series Anomaly Detection Functions (anomaly detection workflow):
## Not run: 
library(dplyr) # Needed to pass CRAN check / This is loaded by default

set_time_scale_template(time_scale_template())

data(tidyverse_cran_downloads)

tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    clean_anomalies()

## End(Not run)
Methods that power time_decompose()
decompose_twitter(data, target, frequency = "auto", trend = "auto", message = TRUE)
decompose_stl(data, target, frequency = "auto", trend = "auto", message = TRUE)
data | A tibble or tbl_time object. |
target | A column to apply the function to. |
frequency | Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "1 week"), or a numeric number of observations per frequency (e.g. 10). Refer to time_frequency(). |
trend | Controls the trend component. For stl, the trend controls the sensitivity of the loess smoother, which is used to remove the remainder. For twitter, the trend controls the period width of the median spans, which are used to remove the trend and center the remainder. |
message | A boolean. If TRUE, messages related to the automatic frequency and trend selection are output. |
A tbl_time object containing the time series decomposition.
The "twitter" method is used in Twitter's AnomalyDetection
package
library(dplyr)

tidyverse_cran_downloads %>%
    ungroup() %>%
    filter(package == "tidyquant") %>%
    decompose_stl(count)
Visualize the anomalies in one or multiple time series
plot_anomalies( data, time_recomposed = FALSE, ncol = 1, color_no = "#2c3e50", color_yes = "#e31a1c", fill_ribbon = "grey70", alpha_dots = 1, alpha_circles = 1, alpha_ribbon = 1, size_dots = 1.5, size_circles = 4 )
data | A tibble or tbl_time object. |
time_recomposed | A boolean. If TRUE, the bands generated by time_recompose() are plotted around the "normal" data. |
ncol | Number of columns to display. Set to 1 for single column by default. |
color_no | Color for non-anomalous data. |
color_yes | Color for anomalous data. |
fill_ribbon | Fill color for the time_recomposed ribbon. |
alpha_dots | Controls the transparency of the dots. Reduce when too many dots on the screen. |
alpha_circles | Controls the transparency of the circles that identify anomalies. |
alpha_ribbon | Controls the transparency of the time_recomposed ribbon. |
size_dots | Controls the size of the dots. |
size_circles | Controls the size of the circles that identify anomalies. |
Plotting function for visualizing anomalies on one or more time series. Multiple time series must be grouped using dplyr::group_by().
Returns a ggplot object.
## Not run: 
library(dplyr)
library(ggplot2)

#### SINGLE TIME SERIES ####
tidyverse_cran_downloads %>%
    filter(package == "tidyquant") %>%
    ungroup() %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE)

#### MULTIPLE TIME SERIES ####
tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose() %>%
    plot_anomalies(time_recomposed = TRUE, ncol = 3)

## End(Not run)
Visualize the time series decomposition with anomalies shown
plot_anomaly_decomposition( data, ncol = 1, color_no = "#2c3e50", color_yes = "#e31a1c", alpha_dots = 1, alpha_circles = 1, size_dots = 1.5, size_circles = 4, strip.position = "right" )
data | A tibble or tbl_time object. |
ncol | Number of columns to display. Set to 1 for single column by default. |
color_no | Color for non-anomalous data. |
color_yes | Color for anomalous data. |
alpha_dots | Controls the transparency of the dots. Reduce when too many dots on the screen. |
alpha_circles | Controls the transparency of the circles that identify anomalies. |
size_dots | Controls the size of the dots. |
size_circles | Controls the size of the circles that identify anomalies. |
strip.position | Controls the placement of the strip that identifies the time series decomposition components. |
The first step in reviewing the anomaly detection process is to evaluate a single time series to observe how the algorithm is selecting anomalies. The plot_anomaly_decomposition() function is used to gain an understanding of whether the method is detecting anomalies correctly and whether parameters such as the decomposition method, anomalize method, alpha, frequency, and so on should be adjusted.
Returns a ggplot object.
library(dplyr)
library(ggplot2)

tidyverse_cran_downloads %>%
    filter(package == "tidyquant") %>%
    ungroup() %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    plot_anomaly_decomposition()
Automatically create tibbletime objects from tibbles
prep_tbl_time(data, message = FALSE)
data | A tibble with a date or datetime index column. |
message | A boolean. If TRUE, a message is output when the object is converted to a tbl_time object. |
Detects a date or datetime index column and automatically converts the object to a tbl_time object.
Returns a tibbletime object of class tbl_time.
library(dplyr)
library(tibbletime)

data_tbl <- tibble(
    date  = seq.Date(from = as.Date("2018-01-01"), by = "day", length.out = 10),
    value = rnorm(10)
)

prep_tbl_time(data_tbl)
Get and modify time scale template
set_time_scale_template(data)
get_time_scale_template()
time_scale_template()
data | A time scale template tibble with the columns time_scale, frequency, and trend (e.g. the output of time_scale_template()). |
Used to get and set the time scale template, which is used by time_frequency() and time_trend() when period = "auto".
time_frequency(), time_trend()
get_time_scale_template()

set_time_scale_template(time_scale_template())
A dataset containing the daily download counts from 2017-01-01 to 2018-03-01 for the following tidyverse packages:
tidyr
lubridate
dplyr
broom
tidyquant
tidytext
ggplot2
purrr
stringr
forcats
knitr
readr
tibble
tidyverse
tidyverse_cran_downloads
A grouped_tbl_time object with 6,375 rows and 3 variables:
date: Date of the daily observation
count: Number of downloads that day
package: The package corresponding to the daily download number
The package downloads come from CRAN by way of the cranlogs package.
Apply a function to a time series by period
time_apply( data, target, period, .fun, ..., start_date = NULL, side = "end", clean = FALSE, message = TRUE )
data | A tibble or tbl_time object. |
target | A column to apply the function to. |
period | A time-based definition (e.g. "1 week") or a numeric number of observations per frequency (e.g. 10). See time_frequency(). |
.fun | A function to apply (e.g. mean). |
... | Additional parameters passed to the function, .fun (e.g. na.rm = TRUE). |
start_date | Optional argument used to specify the start date for the first group. The default is to start at the closest period boundary below the minimum date in the supplied index. |
side | Whether to return the date at the beginning or the end of the new period. By default, the "end" of the period. Use "start" to change to the start of the period. |
clean | Whether or not to round the collapsed index up / down to the next period boundary. The decision to round up / down is controlled by the side argument. |
message | A boolean. If TRUE, informational messages are output. |
Uses a time-based period to apply functions to the data. This is useful in circumstances where you want to compare the observation values to aggregated values such as mean() or median() during a set time-based period. The returned output extends the length of the data frame so the differences can easily be computed.
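For example, each observation can be compared to its weekly mean by subtracting the aggregated column from the original series. This is a sketch only: the name of the column added by time_apply() is assumed to be time_apply here, so verify it against your output before relying on it.

library(dplyr)
library(anomalize)

tidyverse_cran_downloads %>%
    time_apply(count, period = "1 week", .fun = mean, na.rm = TRUE) %>%
    mutate(count_diff = count - time_apply)   # deviation from the weekly mean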
Returns a tibbletime object of class tbl_time.
library(dplyr)

# Basic Usage
tidyverse_cran_downloads %>%
    time_apply(count, period = "1 week", .fun = mean, na.rm = TRUE)
Decompose a time series in preparation for anomaly detection
time_decompose( data, target, method = c("stl", "twitter"), frequency = "auto", trend = "auto", ..., merge = FALSE, message = TRUE )
data | A tibble or tbl_time object. |
target | A column to apply the function to. |
method | The time series decomposition method. One of "stl" or "twitter". |
frequency | Controls the seasonal adjustment (removal of seasonality). Input can be either "auto", a time-based definition (e.g. "1 week"), or a numeric number of observations per frequency (e.g. 10). Refer to time_frequency(). |
trend | Controls the trend component. For stl, the trend controls the sensitivity of the loess smoother, which is used to remove the remainder. For twitter, the trend controls the period width of the median spans, which are used to remove the trend and center the remainder. |
... | Additional parameters passed to the underlying method functions. |
merge | A boolean. FALSE by default. If TRUE, the decomposition output is appended to the original data. |
message | A boolean. If TRUE, messages related to the automatic frequency and trend selection are output. |
The time_decompose() function generates a time series decomposition on tbl_time objects. The function is "tidy" in the sense that it works on data frames. It is designed to work with time-based data, and as such must have a column that contains date or datetime information. The function also works with grouped data. The function implements several methods of time series decomposition, each with benefits.
STL:
The STL method (method = "stl") implements time series decomposition using the underlying decompose_stl() function. If you are familiar with stats::stl(), the function is a "tidy" version that is designed to work with tbl_time objects. The decomposition separates the "season" and "trend" components from the "observed" values, leaving the "remainder" for anomaly detection.
The user can control two parameters: frequency and trend. The frequency parameter adjusts the "season" component that is removed from the "observed" values. The trend parameter adjusts the trend window (the t.window parameter from stl()) that is used. The user may supply both frequency and trend as time-based durations (e.g. "90 days"), numeric values (e.g. 180), or "auto", which predetermines the frequency and/or trend based on the scale of the time series.
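For example, the frequency and trend can be set explicitly with time-based durations rather than "auto". The values below are illustrative, not recommended settings:

library(dplyr)
library(anomalize)

tidyverse_cran_downloads %>%
    time_decompose(count,
                   method    = "stl",
                   frequency = "1 week",
                   trend     = "3 months")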
Twitter:
The Twitter method (method = "twitter") implements time series decomposition using the methodology from the Twitter AnomalyDetection package. The decomposition separates the "seasonal" component and then removes the median data, which is a different approach than the STL method for removing the trend. This approach works very well for low-growth, high-seasonality data. STL may be a better approach when trend is a large factor.
The user can control two parameters: frequency and trend. The frequency parameter adjusts the "season" component that is removed from the "observed" values. The trend parameter adjusts the period width of the median spans that are used. The user may supply both frequency and trend as time-based durations (e.g. "90 days"), numeric values (e.g. 180), or "auto", which predetermines the frequency and/or median spans based on the scale of the time series.
Returns a tbl_time object.
CLEVELAND, R. B., CLEVELAND, W. S., MCRAE, J. E., AND TERPENNING, I. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1 (1990), pp. 3-73.
Decomposition Methods (Powers time_decompose)
Time Series Anomaly Detection Functions (anomaly detection workflow):
library(dplyr)

# Basic Usage
tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl")

# twitter
tidyverse_cran_downloads %>%
    time_decompose(count,
                   method    = "twitter",
                   frequency = "1 week",
                   trend     = "2 months",
                   merge     = TRUE,
                   message   = FALSE)
Generate a time series frequency from a periodicity
time_frequency(data, period = "auto", message = TRUE)
time_trend(data, period = "auto", message = TRUE)
data | A tibble or tbl_time object. |
period | Either "auto", a time-based definition (e.g. "14 days"), or a numeric number of observations per frequency (e.g. 10). See the details below. |
message | A boolean. If TRUE, the selected frequency or trend is reported in a message. |
A frequency is loosely defined as the number of observations that comprise a cycle in a data set. The trend is loosely defined as the time span that can be aggregated across to visualize the central tendency of the data. It's often easiest to think of frequency and trend in terms of the time-based units that the data is already in. This is what time_frequency() and time_trend() enable: using time-based periods to define the frequency or trend.
Frequency:
As an example, a weekly cycle is often 5 days (for working days) or 7 days (for calendar days). Rather than specify a frequency of 5 or 7, the user can specify period = "1 week", and time_frequency() will detect the scale of the time series and return 5 or 7 based on the actual data.
The period argument has three basic options for returning a frequency. Options include:
"auto": A target frequency is determined using a pre-defined template (see template below).
time-based duration: (e.g. "1 week" or "2 quarters" per cycle)
numeric number of observations: (e.g. 5 for 5 observations per cycle)
The template argument is only used when period = "auto". The template is a tibble of three features: time_scale, frequency, and trend. The algorithm will inspect the scale of the time series and select the best frequency that matches the scale and number of observations per target frequency. A frequency is then chosen to be the best match. The predefined template is stored in the function time_scale_template(). However, the user can create a custom template by changing the values for frequency in the data frame and saving it to anomalize_options$time_scale_template.
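A custom template can also be built from the pre-defined one and registered with set_time_scale_template(). The sketch below modifies the trend for the daily scale; the modified value is illustrative only:

library(dplyr)
library(anomalize)

custom_template <- time_scale_template() %>%
    mutate(trend = ifelse(time_scale == "day", "2 months", trend))

set_time_scale_template(custom_template)
get_time_scale_template()

# Restore the default template
set_time_scale_template(time_scale_template())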
Trend:
As an example, the trend of daily data is often best aggregated by evaluating the moving average over a quarter or a month span. Rather than specify the number of days in a quarter or month, the user can specify "1 quarter" or "1 month", and the time_trend() function will return the correct number of observations per trend cycle. In addition, there is an option, period = "auto", to auto-detect an appropriate trend span depending on the data. The template is used to define the appropriate trend span.
Returns a scalar numeric value indicating the number of observations in the frequency or trend span.
library(dplyr)

data(tidyverse_cran_downloads)

#### FREQUENCY DETECTION ####

# period = "auto"
tidyverse_cran_downloads %>%
    filter(package == "tidyquant") %>%
    ungroup() %>%
    time_frequency(period = "auto")

time_scale_template()

# period = "1 month"
tidyverse_cran_downloads %>%
    filter(package == "tidyquant") %>%
    ungroup() %>%
    time_frequency(period = "1 month")

#### TREND DETECTION ####

tidyverse_cran_downloads %>%
    filter(package == "tidyquant") %>%
    ungroup() %>%
    time_trend(period = "auto")
Recompose bands separating anomalies from "normal" observations
time_recompose(data)
data | A tibble or tbl_time object that has been processed with time_decompose() and anomalize(). |
The time_recompose() function is used to generate bands around the "normal" levels of observed values. The function uses the remainder_l1 and remainder_l2 levels produced during the anomalize() step and the season and trend/median_spans values from the time_decompose() step to reconstruct bands around the normal values.
The following key names are required: observed:remainder from the time_decompose() step and remainder_l1 and remainder_l2 from the anomalize() step.
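Conceptually, for an STL decomposition the band limits are the season and trend added back onto the remainder limits. The following is a rough sketch of that idea, not the package internals:

library(dplyr)
library(anomalize)

tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    mutate(
        recomposed_l1 = season + trend + remainder_l1,   # lower band
        recomposed_l2 = season + trend + remainder_l2    # upper band
    )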
Returns a tbl_time object.
Time Series Anomaly Detection Functions (anomaly detection workflow):
library(dplyr)

data(tidyverse_cran_downloads)

# Basic Usage
tidyverse_cran_downloads %>%
    time_decompose(count, method = "stl") %>%
    anomalize(remainder, method = "iqr") %>%
    time_recompose()