This vignette covers making and working with Calendar Features, which are derived from a time series index, that is, the sequence of date/datetime stamps that accompany time series data.
The time series index, which consists of a collection of time-based values that define when each observation occurred, is the most important part of a time series object.
The index gives the user a lot of information in a simple timestamp. Consider the datetime “2016-01-01 00:00:00”.
From this timestamp, we can decompose the date and time information to get the signature, which consists of the year, quarter, month, day, day of year, day of month, hour, minute, and second of the occurrence of a single observation. Further, the difference between two or more observations is the frequency, from which we can obtain even more information such as the periodicity of the data and whether or not these observations are on a regular interval. This information is critical as it provides the basis for performance over time in finance, decay rates in biology, growth rates in economics, and so on.
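As a rough illustration of the decomposition described above, the following base R sketch pulls the calendar components out of the example timestamp. It does not use timetk; the variable names (`stamp`, `signature`) are made up for this example, and the UTC time zone is an assumption.

```r
# Decompose one timestamp into calendar components using base R only.
stamp <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")  # UTC is an assumption
lt <- as.POSIXlt(stamp)

signature <- list(
  year    = lt$year + 1900L,     # POSIXlt stores years since 1900
  quarter = lt$mon %/% 3L + 1L,  # POSIXlt months run 0-11
  month   = lt$mon + 1L,
  mday    = lt$mday,             # day of month
  yday    = lt$yday + 1L,        # POSIXlt day of year is 0-based
  hour    = lt$hour,
  minute  = lt$min,
  second  = lt$sec
)
str(signature)
```

The timetk functions covered later return these components, and more, for every element of an index at once.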
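In this vignette the user will be exposed to the time series index (tk_index()), the time series signature (tk_get_timeseries_signature() and tk_augment_timeseries_signature()), and the time series summary (tk_get_timeseries_summary()).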
We’ll use the Facebook stock prices from the FANG data set. These are the historical stock prices (open, high, low, close, volume, and adjusted) for the “FB” stock from 2013 through 2016.
## # A tibble: 1,008 × 8
## symbol date open high low close volume adjusted
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FB 2013-01-02 27.4 28.2 27.4 28 69846400 28
## 2 FB 2013-01-03 27.9 28.5 27.6 27.8 63140600 27.8
## 3 FB 2013-01-04 28.0 28.9 27.8 28.8 72715400 28.8
## 4 FB 2013-01-07 28.7 29.8 28.6 29.4 83781800 29.4
## 5 FB 2013-01-08 29.5 29.6 28.9 29.1 45871300 29.1
## 6 FB 2013-01-09 29.7 30.6 29.5 30.6 104787700 30.6
## 7 FB 2013-01-10 30.6 31.5 30.3 31.3 95316400 31.3
## 8 FB 2013-01-11 31.3 32.0 31.1 31.7 89598000 31.7
## 9 FB 2013-01-14 32.1 32.2 30.6 31.0 98892800 31.0
## 10 FB 2013-01-15 30.6 31.7 29.9 30.1 173242600 30.1
## # ℹ 998 more rows
To simplify the tutorial, we will select only the “date” and “volume” columns. For the FB_vol_date data frame, we can see from the “date” column that the observations are daily beginning on the second day of 2013.
## # A tibble: 1,008 × 2
## date volume
## <date> <dbl>
## 1 2013-01-02 69846400
## 2 2013-01-03 63140600
## 3 2013-01-04 72715400
## 4 2013-01-07 83781800
## 5 2013-01-08 45871300
## 6 2013-01-09 104787700
## 7 2013-01-10 95316400
## 8 2013-01-11 89598000
## 9 2013-01-14 98892800
## 10 2013-01-15 173242600
## # ℹ 998 more rows
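The code that produced the two tables above is not shown, so the following is a sketch of one way to build them. It assumes the FANG data set that ships with timetk and uses dplyr for filtering and selecting; the name FB_tbl is an assumption, while FB_vol_date matches the name used in the text.

```r
library(dplyr)
library(timetk)

# FANG is assumed to be the daily stock price data set bundled with timetk.
FB_tbl <- FANG %>%
  filter(symbol == "FB")

# Keep only the index column (date) and the value of interest (volume).
FB_vol_date <- FB_tbl %>%
  select(date, volume)

FB_vol_date
```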
Before we can analyze an index, we need to extract it from the object. The function tk_index() extracts the index from any time series object including data frame (or tbl), xts, zoo, etc. The index is always returned in the native date, datetime, yearmon, or yearqtr format. Note that the index must be in one of these time-based classes for extraction to work:
- POSIXt
- Date
- yearmon from the zoo package
- yearqtr from the zoo package
Extract the index using tk_index(). The structure is shown to see the output format, which is a vector of dates.
## Date[1:1008], format: "2013-01-02" "2013-01-03" "2013-01-04" "2013-01-07" "2013-01-08" ...
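A minimal sketch of the extraction step, using a three-row stand-in for the data rather than the full FB_vol_date table. The toy_tbl name is hypothetical; its values are copied from the first rows printed earlier.

```r
library(timetk)
library(tibble)

# Small stand-in for the date/volume data (values from the first three rows).
toy_tbl <- tibble(
  date   = as.Date(c("2013-01-02", "2013-01-03", "2013-01-04")),
  volume = c(69846400, 63140600, 72715400)
)

# tk_index() returns the date column in its native class.
idx <- tk_index(toy_tbl)
str(idx)
```

Running the same call on the full table would give the 1,008-element Date vector shown above.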
The index can be decomposed into a signature. The time series signature is a unique set of properties of the time series values that describe the time series.
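The function tk_get_timeseries_signature() can be used to convert the index to a tibble of signature values (columns). The full set of columns appears in the output below; a few of them need explanation:
- index.num: the numeric value of the index in seconds. The base is “1970-01-01 00:00:00” (execute "1970-01-01 00:00:00" %>% ymd_hms() %>% as.numeric() to see that the value returned is zero). Every time series value after this date can be converted to a numeric value in seconds.
- month.xts: the month of the index with base 0, which is what xts implements.
- wday.xts: the day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
## # A tibble: 1,008 × 29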
## index index.num diff year year.iso half quarter month month.xts
## <date> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
## 1 2013-01-02 1357084800 NA 2013 2013 1 1 1 0
## 2 2013-01-03 1357171200 86400 2013 2013 1 1 1 0
## 3 2013-01-04 1357257600 86400 2013 2013 1 1 1 0
## 4 2013-01-07 1357516800 259200 2013 2013 1 1 1 0
## 5 2013-01-08 1357603200 86400 2013 2013 1 1 1 0
## 6 2013-01-09 1357689600 86400 2013 2013 1 1 1 0
## 7 2013-01-10 1357776000 86400 2013 2013 1 1 1 0
## 8 2013-01-11 1357862400 86400 2013 2013 1 1 1 0
## 9 2013-01-14 1358121600 259200 2013 2013 1 1 1 0
## 10 2013-01-15 1358208000 86400 2013 2013 1 1 1 0
## # ℹ 998 more rows
## # ℹ 20 more variables: month.lbl <ord>, day <int>, hour <int>, minute <int>,
## # second <int>, hour12 <int>, am.pm <int>, wday <int>, wday.xts <int>,
## # wday.lbl <ord>, mday <int>, qday <int>, yday <int>, mweek <int>,
## # week <int>, week.iso <int>, week2 <int>, week3 <int>, week4 <int>,
## # mday7 <int>
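The call that produced the output above is not shown. A sketch of it on a short index follows; the four dates are taken from the first rows of the table, and the variable names are assumptions. The first line checks the epoch claim with base R instead of lubridate.

```r
library(timetk)

# The numeric base of the index: the Unix epoch converts to zero seconds.
epoch_sec <- as.numeric(as.POSIXct("1970-01-01 00:00:00", tz = "UTC"))

# A short index spanning one weekend (Friday 2013-01-04 to Monday 2013-01-07).
idx <- as.Date(c("2013-01-02", "2013-01-03", "2013-01-04", "2013-01-07"))

# One row per index value, one column per signature feature.
sig <- tk_get_timeseries_signature(idx)
sig[, c("index", "index.num", "diff", "month", "month.xts")]
```

The diff column makes the weekend visible: the gap before 2013-01-07 is 259,200 seconds (three days), compared with 86,400 seconds between consecutive trading days.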
It’s usually important to keep the index signature with the values (e.g. volume in our example). We can use an expedited approach with tk_augment_timeseries_signature(), which adds the signature to the end of the time series object.
# Augmenting a data frame
FB_vol_date_signature <- FB_vol_date %>% tk_augment_timeseries_signature(.date_var = date)
FB_vol_date_signature
## # A tibble: 1,008 × 30
## date volume index.num diff year year.iso half quarter month
## <date> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 2013-01-02 69846400 1357084800 NA 2013 2013 1 1 1
## 2 2013-01-03 63140600 1357171200 86400 2013 2013 1 1 1
## 3 2013-01-04 72715400 1357257600 86400 2013 2013 1 1 1
## 4 2013-01-07 83781800 1357516800 259200 2013 2013 1 1 1
## 5 2013-01-08 45871300 1357603200 86400 2013 2013 1 1 1
## 6 2013-01-09 104787700 1357689600 86400 2013 2013 1 1 1
## 7 2013-01-10 95316400 1357776000 86400 2013 2013 1 1 1
## 8 2013-01-11 89598000 1357862400 86400 2013 2013 1 1 1
## 9 2013-01-14 98892800 1358121600 259200 2013 2013 1 1 1
## 10 2013-01-15 173242600 1358208000 86400 2013 2013 1 1 1
## # ℹ 998 more rows
## # ℹ 21 more variables: month.xts <int>, month.lbl <ord>, day <int>, hour <int>,
## # minute <int>, second <int>, hour12 <int>, am.pm <int>, wday <int>,
## # wday.xts <int>, wday.lbl <ord>, mday <int>, qday <int>, yday <int>,
## # mweek <int>, week <int>, week.iso <int>, week2 <int>, week3 <int>,
## # week4 <int>, mday7 <int>
Modeling is now much easier. As an example, we can fit a linear regression model using the lm() function with the month and year as predictors of volume.
# Example Benefit 2: Modeling is easier
fit <- lm(volume ~ year + month.lbl, data = FB_vol_date_signature)
summary(fit)
##
## Call:
## lm(formula = volume ~ year + month.lbl, data = FB_vol_date_signature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51042223 -13528407 -4588594 8296073 304011277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.494e+10 1.414e+09 17.633 < 2e-16 ***
## year -1.236e+07 7.021e+05 -17.604 < 2e-16 ***
## month.lbl.L -9.589e+06 2.740e+06 -3.499 0.000488 ***
## month.lbl.Q 7.348e+06 2.725e+06 2.697 0.007122 **
## month.lbl.C -9.773e+06 2.711e+06 -3.605 0.000328 ***
## month.lbl^4 -2.885e+06 2.720e+06 -1.060 0.289176
## month.lbl^5 -2.994e+06 2.749e+06 -1.089 0.276428
## month.lbl^6 3.169e+06 2.753e+06 1.151 0.249851
## month.lbl^7 6.000e+05 2.721e+06 0.221 0.825514
## month.lbl^8 8.281e+03 2.702e+06 0.003 0.997555
## month.lbl^9 9.504e+06 2.704e+06 3.515 0.000459 ***
## month.lbl^10 -5.911e+06 2.701e+06 -2.188 0.028888 *
## month.lbl^11 -4.738e+06 2.696e+06 -1.757 0.079181 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24910000 on 995 degrees of freedom
## Multiple R-squared: 0.2714, Adjusted R-squared: 0.2626
## F-statistic: 30.89 on 12 and 995 DF, p-value: < 2.2e-16
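The coefficient names in the summary (month.lbl.L, month.lbl.Q, month.lbl.C, month.lbl^4, and so on) follow from month.lbl being an ordered factor, for which R uses polynomial contrasts by default. The base R sketch below shows where those names come from; month_lbl and month_unord are illustrative stand-ins, not columns created by timetk.

```r
# An ordered factor of month labels, similar in spirit to month.lbl.
month_lbl <- factor(month.abb, levels = month.abb, ordered = TRUE)

# Ordered factors get polynomial contrasts: .L, .Q, .C, ^4, ..., ^11.
poly_cols <- colnames(model.matrix(~ month_lbl))
poly_cols

# Dropping the ordering switches to one dummy column per month after the first.
month_unord <- factor(month_lbl, ordered = FALSE)
dummy_cols <- colnames(model.matrix(~ month_unord))
dummy_cols
```

If per-month effects are easier to interpret for a given analysis, converting the label to an unordered factor before fitting is one option.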
The next index analysis tool is the summary metrics, which can be retrieved using the tk_get_timeseries_summary() function. The summary reports the following attributes as a single-row tibble.
General Summary:
The first six columns are general summary information.
## # A tibble: 1 × 6
## n.obs start end units scale tzone
## <int> <date> <date> <chr> <chr> <chr>
## 1 1008 2013-01-02 2016-12-30 days day UTC
Differences Summary:
The next group of values is the differences summary (i.e. a summary of the frequency). All values are in seconds:
## # A tibble: 1 × 6
## diff.minimum diff.q1 diff.median diff.mean diff.q3 diff.maximum
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 86400 86400 86400 125096. 86400 345600
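The two single-row tibbles above are slices of one summary. A sketch of how they might be produced, assuming the FANG data set bundled with timetk; the smry name and the column slicing by position are assumptions.

```r
library(dplyr)
library(timetk)

# Extract the FB index straight from the filtered table.
idx <- FANG %>%
  filter(symbol == "FB") %>%
  tk_index()

smry <- tk_get_timeseries_summary(idx)

smry[1:6]   # general summary: n.obs, start, end, units, scale, tzone
smry[7:12]  # differences summary, in seconds
```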
The differences provide information about the regularity of the frequency. Generally speaking, if all difference values are equal, the index is regular. Scales beyond “day”, however, are never strictly regular, since months, quarters, and years do not all contain the same number of seconds. Conceptually, monthly, quarterly, and yearly data can still be thought of as regular if the index contains consecutive months, quarters, or years, respectively. The difference attributes are therefore most meaningful for daily and lower time scales, where the difference summary directly indicates the level of regularity.
From the second group (differences summary), we immediately recognize that the mean differs from the median and therefore the index is irregular (meaning certain days are missing). Further, we can see that the maximum difference is 345,600 seconds, which is 4 days (345,600 seconds / 86,400 seconds per day).
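The regularity check can be reproduced by hand. The following base R sketch uses a made-up six-day index (not the FB data) with a weekend gap and one 4-day gap, and computes the same kind of difference statistics in seconds.

```r
# Hypothetical irregular daily index: a weekend gap and one 4-day gap.
idx <- as.Date(c("2013-01-02", "2013-01-03", "2013-01-04",
                 "2013-01-07", "2013-01-08", "2013-01-12"))

# as.numeric() on a Date gives days since 1970-01-01; convert gaps to seconds.
diff_sec <- diff(as.numeric(idx)) * 86400

summary(diff_sec)

# A regular index has a single distinct difference value.
is_regular <- length(unique(diff_sec)) == 1

# Largest gap expressed in days.
max_gap_days <- max(diff_sec) / 86400
```

Here the median gap is one day while the mean is two days, so the mean and median disagree in the same way as in the FB summary.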