--- title: "Calendar Features" author: "Matt Dancho" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Calendar Features} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE, warning = FALSE} knitr::opts_chunk$set( # message = FALSE, # warning = FALSE, fig.width = 8, fig.height = 4.5, fig.align = 'center', out.width='95%', dpi = 100 ) # devtools::load_all() # Travis CI fails on load_all() ``` This vignette covers making and working with __Calendar Features__, which are derived from a time series index, or the sequence of date/datetime stamps that accompany time series data. # Introduction The __time series index__ consists of a collection of time-based values that define _when_ each observation occurred, is the most important part of a time series object. The index gives the user a lot of information in a simple timestamp. Consider the datetime ___"2016-01-01 00:00:00"___. From this timestamp, we can decompose the date and time information to get the __signature__, which consists of the year, quarter, month, day, day of year, day of month, hour, minute, and second of the occurrence of a single observation. Further, the difference between two or more observations is the __frequency__ from which we can obtain even more information such as the periodicity of the data and whether or not these observations are on a regular interval. This information is critical as it provides the basis for performance over time in finance, decay rates in biology, growth rates in economics, and so on. In this vignette the user will be exposed to: 1. Time Series Index 2. Time Series Signature 3. Time Series Summary # Prerequisites Before we get started, load the following packages. ```{r, message = FALSE, warning=FALSE} library(dplyr) library(timetk) ``` # Data We'll use the Facebook stock prices from the `FANG` data set. These are the historical stock prices (open, high, low, close, volume, and adjusted) for the "FB" stock from 2013 through 2016. ```{r} data("FANG") FB_tbl <- FANG %>% dplyr::filter(symbol == "FB") FB_tbl ``` To simplify the tutorial, we will select only the "date" and "volume" columns. For the `FB_vol_date` data frame, we can see from the "date" column that the observations are _daily_ beginning on the second day of 2013. ```{r} FB_vol_date <- FB_tbl %>% select(date, volume) FB_vol_date ``` # Time Series Index Before we can analyze an index, we need to extract it from the object. The function `tk_index()` extracts the index from any time series object including data frame (or `tbl`), `xts`, `zoo`, etc. The index is always returned in the native date, datetime, yearmon, or yearqtr format. Note that the index must be in one of these time-based classes for extraction to work: * datetimes: Must inherit `POSIXt` * dates: Must inherit `Date` * yearmon: Must inherit `yearmon` from the `zoo` package * yearqtr: Must inherit `yearqtr` from the `zoo` package Extract the index using `tk_index()`. The structure is shown to see the output format, which is a vector of dates. ```{r} # idx_date idx_date <- tk_index(FB_vol_date) str(idx_date) ``` # Time Series Signature The index can be decomposed into a _signature_. The time series signature is a unique set of properties of the time series values that describe the time series. ## Get Functions - Turning an Index into Information The function `tk_get_timeseries_signature()` can be used to convert the index to a tibble containing the following values (columns): * __index__: The index value that was decomposed * __index.num__: The numeric value of the index in seconds. The base is "1970-01-01 00:00:00" (Execute `"1970-01-01 00:00:00" %>% ymd_hms() %>% as.numeric()` to see the value returned is zero). Every time series value after this date can be converted to a numeric value in seconds. * __diff__: The difference in seconds from the previous numeric index value. * __year__: The year component of the index. * __year.iso__: The ISO year number of the year (Monday start). * __half__: The half component of the index. * __quarter__: The quarter component of the index. * __month__: The month component of the index with base 1. * __month.xts__: The month component of the index with base 0, which is what `xts` implements. * __month.lbl__: The month label as an ordered factor begining with January and ending with December. * __day__: The day component of the index. * __hour__: The hour component of the index. * __minute__: The minute component of the index. * __second__: The second component of the index. * __hour12__: The hour component on a 12 hour scale. * __am.pm__: Morning (AM) = 1, Afternoon (PM) = 2. * __wday__: The day of the week with base 1. Sunday = 1 and Saturday = 7. * __wday.xts__: The day of the week with base 0, which is what `xts` implements. Sunday = 0 and Saturday = 6. * __wday.lbl__: The day of the week label as an ordered factor begining with Sunday and ending with Saturday. * __mday__: The day of the month. * __qday__: The day of the quarter. * __yday__: The day of the year. * __mweek__: The week of the month. * __week__: The week number of the year (Sunday start). * __week.iso__: The ISO week number of the year (Monday start). * __week2__: The modulus for bi-weekly frequency. * __week3__: The modulus for tri-weekly frequency. * __week4__: The modulus for quad-weekly frequency. * __mday7__: The integer division of day of the month by seven, which returns the first, second, third, ... instance the day has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1. The second has mday7 = 2. ```{r} # idx_date signature tk_get_timeseries_signature(idx_date) ``` ## Augment Functions (Adding Many Features to a Data Frame) It's usually important to keep the index signature with the values (e.g. volume in our example). We can use an expedited approach with `tk_augment_timeseries_signature()`, which adds the signature to the end of the time series object. ```{r} # Augmenting a data frame FB_vol_date_signature <- FB_vol_date %>% tk_augment_timeseries_signature(.date_var = date) FB_vol_date_signature ``` Modeling is now much easier. As an example, we can use linear regression model using the `lm()` function with the month and year as a predictor of volume. ```{r} # Example Benefit 2: Modeling is easier fit <- lm(volume ~ year + month.lbl, data = FB_vol_date_signature) summary(fit) ``` # Time Series Summary The next index analysis tool is the summary metrics, which can be retrieved using the `tk_get_timeseries_summary()` function. The summary reports the following attributes as a single-row tibble. _General Summary_: The first six columns are general summary information. * __n.obs__: The total number of observations * __start__: The start in the appropriate time class * __end__: The end in the appropriate time class * __units__: A label that describes the unit of the index value that is independent of frequency (i.e. a date class will always be "days" whereas a datetime class will always be "seconds"). Values can be days, hours, mins, secs. * __scale__: A label that describes the the median difference (frequency) between observations. Values can be quarter, month, day, hour, minute, second. * __tzone__: The timezone of the index. ```{r} # idx_date: First six columns, general summary tk_get_timeseries_summary(idx_date)[,1:6] ``` _Differences Summary_: The next group of values are the __differences summary__ (i.e. summary of frequency). All values are in seconds: * __diff.minimum__: The minimum difference between index values. * __diff.q1__: The first quartile of the index differences. * __diff.median__: The median difference between index values (i.e. most common frequency). * __diff.mean__: The average difference between index values. * __diff.q3__: The third quartile of the index differences. * __diff.maximum__: The maximum difference between index values. ```{r} # idx_date: Last six columns, difference summary tk_get_timeseries_summary(idx_date)[,7:12] ``` The __differences__ provide information about the _regularity of the frequency_. Generally speaking if all difference values are equal, the index is regular. However, scales beyond "day" are never theoretically regular since the differences in seconds are not equivalent. However, conceptually monthly, quarterly and yearly data can be thought of as regular if the index contains consecutive months, quarters, or years, respectively. Therefore, the difference attributes are most meaningful for daily and lower time scales because the difference summary always indicates level of regularity. From the second group (differences summary), we immediately recognize that the mean is different than the median and therefore the index is _irregular_ (meaning certain days are missing). Further we can see that the maximum difference is 345,600 seconds, indicating the maximum difference is 4 days (345,600 seconds / 86400 seconds/day). # Learning More

_My Talk on High-Performance Time Series Forecasting_ Time series is changing. __Businesses now need 10,000+ time series forecasts every day.__ __High-Performance Forecasting Systems will save companies MILLIONS of dollars.__ Imagine what will happen to your career if you can provide your organization a "High-Performance Time Series Forecasting System" (HPTSF System). I teach how to build a HPTFS System in my __High-Performance Time Series Forecasting Course__. If interested in learning Scalable High-Performance Forecasting Strategies then [take my course](https://university.business-science.io/p/ds4b-203-r-high-performance-time-series-forecasting). You will learn: - Time Series Machine Learning (cutting-edge) with `Modeltime` - 30+ Models (Prophet, ARIMA, XGBoost, Random Forest, & many more) - NEW - Deep Learning with `GluonTS` (Competition Winners) - Time Series Preprocessing, Noise Reduction, & Anomaly Detection - Feature engineering using lagged variables & external regressors - Hyperparameter Tuning - Time series cross-validation - Ensembling Multiple Machine Learning & Univariate Modeling Techniques (Competition Winner) - Scalable Forecasting - Forecast 1000+ time series in parallel - and more.

Unlock the High-Performance Time Series Forecasting Course