The R package timekit provides a collection of tools for working with time series; its most distinctive feature is forecasting time series through data mining. The first family of functions, tk_index(), tk_get_timeseries_signature(), tk_augment_timeseries_signature(), and tk_get_timeseries_summary(), decomposes and summarizes a time series. The second family, tk_make_future_timeseries(), extends an existing time series into the future. The third family, tk_tbl(), tk_xts(), tk_zoo() (and tk_zooreg()), and tk_ts(), coerces between time series classes.
library(tidyquant)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Loading required package: PerformanceAnalytics
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
## Loading required package: tidyverse
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## as.difftime(): lubridate, base
## date(): lubridate, base
## filter(): dplyr, stats
## first(): dplyr, xts
## intersect(): lubridate, base
## lag(): dplyr, stats
## last(): dplyr, xts
## setdiff(): lubridate, base
## union(): lubridate, base
##
## Attaching package: 'tidyquant'
## The following object is masked from 'package:tibble':
##
## as_tibble
library(timekit)
# The FANG dataset is provided by tidyquant. Filter to get just the FB stock
# prices, and select the "date" and "volume" columns.
FB_tbl <- FANG %>%
filter(symbol == "FB") %>%
select(date, volume)
FB_tbl
## # A tibble: 1,008 x 2
## date volume
## <date> <dbl>
## 1 2013-01-02 69846400
## 2 2013-01-03 63140600
## 3 2013-01-04 72715400
## 4 2013-01-07 83781800
## 5 2013-01-08 45871300
## 6 2013-01-09 104787700
## 7 2013-01-10 95316400
## 8 2013-01-11 89598000
## 9 2013-01-14 98892800
## 10 2013-01-15 173242600
## # ... with 998 more rows
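As a first taste of the function families named above, tk_index() pulls the date column out of the tibble as an index, and tk_get_timeseries_summary() summarizes its scale and regularity (a minimal sketch; output not shown):
# Sketch: extract and summarize the index of FB_tbl
FB_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary()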
Split the data into two sets: one for training (everything before 2016) and one for comparing the actual output to our predictions (everything from 2016 onward).
# Everything before 2016 will be used for training (2013-2015 data)
train <- FB_tbl %>%
filter(date < ymd("2016-01-01"))
# Everything in 2016 will be used for comparing the output
actual_future <- FB_tbl %>%
filter(date >= ymd("2016-01-01"))
Next, add the time series signature to the training set using tk_augment_timeseries_signature(). This function appends the time series signature as additional columns to the data frame.
train <- tk_augment_timeseries_signature(train)
train
## # A tibble: 756 x 29
## date volume index.num diff year half quarter month
## <date> <dbl> <int> <int> <int> <int> <int> <int>
## 1 2013-01-02 69846400 1357084800 NA 2013 1 1 1
## 2 2013-01-03 63140600 1357171200 86400 2013 1 1 1
## 3 2013-01-04 72715400 1357257600 86400 2013 1 1 1
## 4 2013-01-07 83781800 1357516800 259200 2013 1 1 1
## 5 2013-01-08 45871300 1357603200 86400 2013 1 1 1
## 6 2013-01-09 104787700 1357689600 86400 2013 1 1 1
## 7 2013-01-10 95316400 1357776000 86400 2013 1 1 1
## 8 2013-01-11 89598000 1357862400 86400 2013 1 1 1
## 9 2013-01-14 98892800 1358121600 259200 2013 1 1 1
## 10 2013-01-15 173242600 1358208000 86400 2013 1 1 1
## # ... with 746 more rows, and 21 more variables: month.xts <int>,
## # month.lbl <ord>, day <int>, hour <int>, minute <int>, second <int>,
## # hour12 <int>, am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>,
## # mday <int>, qday <int>, yday <int>, mweek <int>, week <int>,
## # week.iso <int>, week2 <int>, week3 <int>, week4 <int>, mday7 <dbl>
Next, model the data using a regression. We are going to use the lm() function to model volume as a function of the time series signature.
fit_lm <- lm(volume ~ ., data = train[,-1])
summary(fit_lm)
##
## Call:
## lm(formula = volume ~ ., data = train[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -58109572 -14375506 -3850847 10269515 287819472
##
## Coefficients: (15 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.383e+12 7.152e+11 1.934 0.05350 .
## index.num 2.172e+01 1.151e+01 1.888 0.05946 .
## diff -4.780e+01 2.984e+01 -1.602 0.10959
## year -7.017e+08 3.630e+08 -1.933 0.05363 .
## half 1.663e+07 1.526e+07 1.090 0.27616
## quarter 8.069e+06 7.710e+06 1.047 0.29564
## month -6.293e+07 2.423e+07 -2.597 0.00959 **
## month.xts NA NA NA NA
## month.lbl.L NA NA NA NA
## month.lbl.Q 2.680e+06 3.732e+06 0.718 0.47289
## month.lbl.C 1.601e+06 7.815e+06 0.205 0.83773
## month.lbl^4 -2.693e+06 3.483e+06 -0.773 0.43964
## month.lbl^5 -7.038e+06 8.808e+06 -0.799 0.42451
## month.lbl^6 7.749e+06 3.571e+06 2.170 0.03034 *
## month.lbl^7 4.827e+06 5.647e+06 0.855 0.39293
## month.lbl^8 3.726e+06 3.437e+06 1.084 0.27869
## month.lbl^9 NA NA NA NA
## month.lbl^10 -1.166e+06 4.039e+06 -0.289 0.77301
## month.lbl^11 NA NA NA NA
## day NA NA NA NA
## hour NA NA NA NA
## minute NA NA NA NA
## second NA NA NA NA
## hour12 NA NA NA NA
## am.pm NA NA NA NA
## wday -1.775e+06 1.506e+06 -1.178 0.23901
## wday.xts NA NA NA NA
## wday.lbl.L NA NA NA NA
## wday.lbl.Q 2.504e+06 3.565e+06 0.702 0.48260
## wday.lbl.C -6.041e+06 2.567e+06 -2.353 0.01888 *
## wday.lbl^4 -1.411e+06 2.208e+06 -0.639 0.52307
## mday NA NA NA NA
## qday NA NA NA NA
## yday NA NA NA NA
## mweek -7.211e+06 4.228e+06 -1.706 0.08851 .
## week -2.016e+05 3.911e+06 -0.052 0.95890
## week.iso 3.640e+05 2.456e+05 1.482 0.13875
## week2 -2.292e+06 2.186e+06 -1.049 0.29473
## week3 8.181e+05 1.231e+06 0.664 0.50668
## week4 1.943e+06 9.870e+05 1.968 0.04941 *
## mday7 -2.985e+06 3.708e+06 -0.805 0.42101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26930000 on 729 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2666, Adjusted R-squared: 0.2414
## F-statistic: 10.6 on 25 and 729 DF, p-value: < 2.2e-16
Now we need to build the future data to model. We already have the index in the actual_future data; in practice, however, we don't normally have the future index, so let's build it from the existing index in three steps:
1. Extract the index from the training set with tk_index().
2. Make a future index with tk_make_future_timeseries(), factoring in holidays and inspecting weekdays (to skip weekends).
3. Create a time series signature from the future index using tk_get_timeseries_signature(), which decomposes the index into components such as year, month, and day, plus features like half (half = 1 for the first half of the year, half = 2 for the second).
# US trading holidays in 2016
holidays <- c("2016-01-01", "2016-01-18", "2016-02-15", "2016-03-25", "2016-05-30",
              "2016-07-04", "2016-09-05") %>% ymd()
# Build new data for prediction in 3 steps
# (inspect_weekdays = TRUE skips the two weekend days)
new_data <- train %>%
tk_index() %>%
tk_make_future_timeseries(n_future = 363, skip_values = holidays, inspect_weekdays = TRUE) %>%
tk_get_timeseries_signature()
## pad applied on the interval: day
new_data
## # A tibble: 252 x 28
## index index.num diff year half quarter month month.xts
## <date> <int> <int> <int> <int> <int> <int> <int>
## 1 2016-01-04 1451865600 NA 2016 1 1 1 0
## 2 2016-01-05 1451952000 86400 2016 1 1 1 0
## 3 2016-01-06 1452038400 86400 2016 1 1 1 0
## 4 2016-01-07 1452124800 86400 2016 1 1 1 0
## 5 2016-01-08 1452211200 86400 2016 1 1 1 0
## 6 2016-01-11 1452470400 259200 2016 1 1 1 0
## 7 2016-01-12 1452556800 86400 2016 1 1 1 0
## 8 2016-01-13 1452643200 86400 2016 1 1 1 0
## 9 2016-01-14 1452729600 86400 2016 1 1 1 0
## 10 2016-01-15 1452816000 86400 2016 1 1 1 0
## # ... with 242 more rows, and 20 more variables: month.lbl <ord>,
## # day <int>, hour <int>, minute <int>, second <int>, hour12 <int>,
## # am.pm <int>, wday <int>, wday.xts <int>, wday.lbl <ord>, mday <int>,
## # qday <int>, yday <int>, mweek <int>, week <int>, week.iso <int>,
## # week2 <int>, week3 <int>, week4 <int>, mday7 <dbl>
Now use the predict() function to run a regression prediction on the new data.
# Prediction using a linear model, fit_lm, on future index time series signature
pred_lm <- predict(fit_lm, newdata = new_data)
## Warning in predict.lm(fit_lm, newdata = new_data): prediction from a rank-
## deficient fit may be misleading
Let's compare the prediction to the actual daily FB volume in 2016. Using the add_column() function, we can add the predictions to the actual data, actual_future, and then plot the prediction using ggplot().
# Add predicted values to actuals data
actual_future <- actual_future %>% add_column(yhat = pred_lm)
# Plot using ggplot
actual_future %>%
ggplot(aes(x = date)) +
geom_line(aes(y = volume), data = train, color = palette_light()[[1]]) +
geom_line(aes(y = volume), color = palette_light()[[1]]) +
geom_line(aes(y = yhat), color = palette_light()[[2]]) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Forecasting FB Daily Volume: New Methods Using Data Mining",
subtitle = "Linear Regression Model Applied to Time Series Signature",
x = "",
y = "Volume",
caption = "Data from Yahoo! Finance: 'FB' Daily Volume from 2013 to 2016.") +
theme_tq(base_size = 12)
## Warning: Removed 1 rows containing missing values (geom_path).
# Collect data and convert to yearmon regularized monthly data
beer_sales <- tq_get("S4248SM144NCEN", get = "economic.data", from = "2000-01-01")
beer_sales <- beer_sales %>%
mutate(date = as.yearmon(date))
# Build training set and actual_future set for comparison
train <- beer_sales %>%
filter(date < 2014)
actual_future <- beer_sales %>%
filter(date >= 2014)
# Add time series signature to training set
train <- train %>%
tk_augment_timeseries_signature()
# Fit the linear regression model to predict the price variable
fit_lm <- lm(price ~ ., data = train[,-1])
# Create the new data following the 3 steps
# (n_future = 40 is the number of rows in the test set: nrow(actual_future))
new_data <- train %>%
tk_index() %>%
tk_make_future_timeseries(n_future = 40) %>%
tk_get_timeseries_signature()
# Make predictions using predict
pred_lm <- predict(object = fit_lm, newdata = new_data)
## Warning in predict.lm(object = fit_lm, newdata = new_data): prediction from
## a rank-deficient fit may be misleading
# Add predictions to actual_future
actual_future <- actual_future %>%
add_column(yhat = pred_lm)
# Plot using ggplot
actual_future %>%
ggplot(aes(x = date)) +
geom_line(aes(y = price), data = train, color = palette_light()[[1]]) +
geom_line(aes(y = price), color = palette_light()[[1]]) +
geom_line(aes(y = yhat), color = palette_light()[[2]]) +
scale_x_yearmon() +
scale_y_continuous(labels = scales::comma) +
labs(title = "Forecasting Alcohol Sales Using Data Mining",
subtitle = "Linear Regression Model Applied to Time Series Signature",
x = "",
y = "USD (Millions)",
caption = "Data from FRED Code 'S4248SM144NCEN' 2000 to present.") +
theme_tq(base_size = 12)
## Warning: Removed 1 rows containing missing values (geom_path).
Many R functions accept only time series classes such as xts, zoo, and ts, and will not work on a plain data frame. We'll briefly examine the coercion functions that enable simple conversion back and forth, starting with the FB_tbl data.
FB_tbl
## # A tibble: 1,008 x 2
## date volume
## <date> <dbl>
## 1 2013-01-02 69846400
## 2 2013-01-03 63140600
## 3 2013-01-04 72715400
## 4 2013-01-07 83781800
## 5 2013-01-08 45871300
## 6 2013-01-09 104787700
## 7 2013-01-10 95316400
## 8 2013-01-11 89598000
## 9 2013-01-14 98892800
## 10 2013-01-15 173242600
## # ... with 998 more rows
Note that the argument silent = TRUE suppresses the warning that the "date" column is being dropped.
FB_tbl %>%
tk_xts(silent = TRUE) %>% # Coerce to xts
tk_zoo() %>% # Coerce to zoo
tk_ts(start = 2013, freq = 252) %>% # Coerce to ts
tk_xts() %>% # Coerce back to xts
tk_tbl()
## # A tibble: 1,008 x 2
## index volume
## <date> <dbl>
## 1 2013-01-02 69846400
## 2 2013-01-03 63140600
## 3 2013-01-04 72715400
## 4 2013-01-07 83781800
## 5 2013-01-08 45871300
## 6 2013-01-09 104787700
## 7 2013-01-10 95316400
## 8 2013-01-11 89598000
## 9 2013-01-14 98892800
## 10 2013-01-15 173242600
## # ... with 998 more rows
One caveat concerns going from ts back to tbl. The default, timekit_idx = FALSE, returns the regularized index, with values like 2013.000. If the original time-based index (dates like 2013-01-02) is needed, just set timekit_idx = TRUE.
FB_tbl %>%
tk_ts(start = 2013, freq = 252, silent = TRUE) %>%
tk_tbl(timekit_idx = TRUE)
## # A tibble: 1,008 x 2
## index volume
## <date> <dbl>
## 1 2013-01-02 69846400
## 2 2013-01-03 63140600
## 3 2013-01-04 72715400
## 4 2013-01-07 83781800
## 5 2013-01-08 45871300
## 6 2013-01-09 104787700
## 7 2013-01-10 95316400
## 8 2013-01-11 89598000
## 9 2013-01-14 98892800
## 10 2013-01-15 173242600
## # ... with 998 more rows
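For comparison, a minimal sketch of the default behavior (timekit_idx = FALSE), which returns the regularized numeric index with values like 2013.000 instead of dates:
# Sketch: default coercion keeps the regularized index
FB_tbl %>%
    tk_ts(start = 2013, freq = 252, silent = TRUE) %>%
    tk_tbl()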
library(tidyverse)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(tidyquant)
library(broom)
library(timekit)
library(modelr)
##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
options(na.action = na.warn)
retail <- read_csv("~/Dropbox/OnlineRetail.csv",
col_types = cols(
InvoiceNo = col_character(),
StockCode = col_character(),
Description = col_character(),
Quantity = col_integer(),
InvoiceDate = col_datetime("%m/%d/%Y %H:%M"),
UnitPrice = col_double(),
CustomerID = col_integer(),
Country = col_character()
)) %>%
mutate(day = parse_date(format(InvoiceDate, "%Y-%m-%d")),
day_of_week = wday(day, label = TRUE),
time = parse_time(format(InvoiceDate, "%H:%M")),
month = format(InvoiceDate, "%m"),
income = Quantity * UnitPrice,
income_return = ifelse(Quantity > 0, "income", "return"))
The columns of the dataset:
- InvoiceNo: invoice number, nominal; a 6-digit integer uniquely assigned to each transaction. If this code starts with the letter "c", it indicates a cancellation.
- StockCode: product code, nominal; a 5-digit integer uniquely assigned to each distinct product.
- Description: product name, nominal.
- Quantity: the quantity of each product per transaction, numeric.
- InvoiceDate: invoice date and time, numeric; the date and time each transaction was generated.
- UnitPrice: unit price, numeric; price per unit of product in pounds sterling.
- CustomerID: customer number, nominal; a 5-digit integer uniquely assigned to each customer.
- Country: country name, nominal; the name of the country where each customer resides.

Newly generated variables:
- day: transaction date
- day_of_week: day of the week
- time: transaction time
- month: transaction month
- income: revenue (Quantity * UnitPrice)
- income_return: whether the transaction is an "income" or a "return"
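A quick way to verify the parsed column types and the newly generated variables is dplyr's glimpse() (a minimal sketch; output omitted):
# Sketch: inspect the structure of the augmented retail data
glimpse(retail)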
# Group by day and compute total and mean income and quantity
income <- retail %>%
group_by(day) %>%
summarise(sum_income = sum(income),
mean_income = mean(income),
sum_quantity = sum(Quantity),
mean_quantity = mean(Quantity))
# Distinct products and unit prices per day
temp <- distinct(select(retail, day, StockCode, UnitPrice)) %>%
mutate(temp = paste(day, StockCode, sep = "_")) %>%
select(temp, UnitPrice)
# Mean unit price per day
mean_unit_price <- retail %>%
filter(income_return == "income") %>%
group_by(day, StockCode) %>%
summarise(n = n()) %>%
mutate(temp = paste(day, StockCode, sep = "_")) %>%
left_join(temp, by = "temp") %>%
group_by(day, StockCode) %>%
summarise(mean = mean(UnitPrice)) %>%
group_by(day) %>%
summarise(mean_unit_price = mean(mean))
# Group by day and compute total and mean income and quantity for purchases (income > 0)
purchases <- retail %>%
filter(income > 0) %>%
group_by(day) %>%
summarise(sum_income_purch = sum(income),
mean_income_purch = mean(income),
sum_quantity_purch = sum(Quantity),
mean_quantity_purch = mean(Quantity))
# Group by day and compute total and mean income and quantity for returns (income < 0)
returns <- retail %>%
filter(income < 0) %>%
group_by(day) %>%
summarise(sum_income_return = sum(income),
mean_income_return = mean(income),
sum_quantity_return = sum(Quantity),
mean_quantity_return = mean(Quantity))
# Per-day purchase behavior of each customer
customer_purch <- retail %>%
group_by(day, CustomerID) %>%
summarise(n = n(),
sum_it = sum(Quantity),
sum_in = sum(income)) %>%
group_by(day) %>%
summarise(mean_in_cust = mean(sum_in),
mean_quant_cust = mean(sum_it),
mean_items_cust = mean(n))
# Flag one-time vs. repeat customers, based on the number of purchase days
rep_customer <- retail %>%
group_by(day, CustomerID) %>%
summarise(sum = sum(Quantity)) %>%
group_by(CustomerID) %>%
summarise(n = n()) %>%
mutate(repeat_customer = ifelse(n > 1, "repeat_cust", "one_time_cust"))
# Number of one-time vs. repeat customers per day
rep_customer_day <- left_join(retail, rep_customer, by = "CustomerID") %>%
distinct(day, CustomerID, repeat_customer) %>%
group_by(day, repeat_customer) %>%
summarise(n = n()) %>%
spread(key = repeat_customer, value = n)
# Income and returns per day; spread() reshapes from long to wide format
income_return <- retail %>%
group_by(day, income_return) %>%
summarise(sum = sum(Quantity)) %>%
spread(key = income_return, value = sum)
# Proportion of UK vs. non-UK sales
country_purch <- retail %>%
mutate(Country2 = ifelse(Country == "United Kingdom", "uk", "other_country")) %>%
group_by(day, Country2) %>%
summarise(sum = sum(Quantity)) %>%
spread(key = Country2, value = sum) %>%
mutate(prop_other_country = other_country / sum(other_country + uk),
prop_uk = uk / sum(other_country + uk))
# Number of distinct items sold per day
n_items <- retail %>%
group_by(day, StockCode) %>%
summarise(n = n()) %>%
group_by(day) %>%
summarise(n_items = n())
# Best-selling products
most_sold <- retail %>%
group_by(day, StockCode, Description) %>%
summarise(sum = sum(Quantity)) %>%
group_by(StockCode, Description) %>%
summarise(n = n()) %>%
arrange(-n)
# Daily sales of the ten best-selling products
most_sold_day <- retail %>%
filter(StockCode %in% most_sold$StockCode[1:10]) %>%
group_by(day, StockCode) %>%
summarise(sum = sum(Quantity)) %>%
spread(key = StockCode, value = sum)
# Left-join all the per-day tables, then add season and the lagged difference in total income
retail_p_day <- distinct(select(retail, day, day_of_week, month)) %>%
left_join(income, by = "day") %>%
left_join(mean_unit_price, by = "day") %>%
left_join(purchases, by = "day") %>%
left_join(returns, by = "day") %>%
left_join(customer_purch, by = "day") %>%
left_join(rep_customer_day, by = "day") %>%
left_join(income_return, by = "day") %>%
left_join(country_purch, by = "day") %>%
left_join(n_items, by = "day") %>%
left_join(most_sold_day, by = "day") %>%
mutate(diff_sum_income = sum_income - lag(sum_income),
season = ifelse(month %in% c("03", "04", "05"), "spring",
ifelse(month %in% c("06", "07", "08"), "summer",
ifelse(month %in% c("09", "10", "11"), "fall", "winter"))))
I am splitting this dataset into a training set (all data points on or before Nov. 1st, 2011) and a test set (all data points after Nov. 1st, 2011).
retail_p_day <- retail_p_day %>%
mutate(model = ifelse(day <= "2011-11-01", "train", "test"))
# Prefix column names that start with a digit with "P_", so they are syntactically valid R names
num_cols <- grep("^[0-9]+", colnames(retail_p_day))
colnames(retail_p_day)[num_cols] <- paste0("P_", colnames(retail_p_day)[num_cols])
As we can see in the plot below, the net income varies considerably from day to day.
retail_p_day %>%
ggplot(aes(x = day, y = sum_income, color = model)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
A central function of timekit is tk_augment_timeseries_signature(), which adds a number of features based on the properties of our time series index (a small worked sketch follows the list):
- index: The index value that was decomposed.
- index.num: The numeric value of the index in seconds. The base is "1970-01-01 00:00:00".
- diff: The difference in seconds from the previous numeric index value.
- year: The year component of the index.
- half: The half component of the index (1 = first half of the year, 2 = second half).
- quarter: The quarter component of the index.
- month: The month component of the index with base 1.
- month.xts: The month component of the index with base 0, which is what xts implements.
- month.lbl: The month label as an ordered factor beginning with January and ending with December.
- day: The day component of the index.
- hour: The hour component of the index.
- minute: The minute component of the index.
- second: The second component of the index.
- hour12: The hour component on a 12-hour scale.
- am.pm: Morning (AM) = 1, afternoon (PM) = 2.
- wday: The day of the week with base 1. Sunday = 1 and Saturday = 7.
- wday.xts: The day of the week with base 0, which is what xts implements. Sunday = 0 and Saturday = 6.
- wday.lbl: The day of the week label as an ordered factor beginning with Sunday and ending with Saturday.
- mday: The day of the month.
- qday: The day of the quarter.
- yday: The day of the year.
- mweek: The week of the month.
- week: The week number of the year (Sunday start).
- week.iso: The ISO week number of the year (Monday start).
- week2: The modulus for bi-weekly frequency.
- week3: The modulus for tri-weekly frequency.
- week4: The modulus for quad-weekly frequency.
- mday7: The integer division of the day of the month by seven, which returns the first, second, third, ... instance the weekday has appeared in the month. Values begin at 1. For example, the first Saturday in the month has mday7 = 1; the second has mday7 = 2.
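To make these features concrete, here is a minimal sketch that computes the signature for a toy three-date index (idx_demo is a made-up example vector) and picks out a few of the columns listed above:
# Sketch: signature columns for three hand-picked dates
idx_demo <- ymd(c("2011-01-01", "2011-01-02", "2011-01-08"))
idx_demo %>%
    tk_get_timeseries_signature() %>%
    select(index, diff, wday.lbl, mweek, week, mday7)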
retail_p_day_aug <- retail_p_day %>%
rename(date = day) %>%
select(model, date, sum_income) %>%
tk_augment_timeseries_signature() %>%
select(-contains("month"))
# Keep only complete cases (rows without any missing values)
retail_p_day_aug <- retail_p_day_aug[complete.cases(retail_p_day_aug), ]
Let's look at the variance of all numeric features and remove those with a variance of 0.
library(matrixStats)
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
##
## count
# colVars() computes the variance of each column of a matrix
(var <- data.frame(colnames = colnames(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]),
colvars = colVars(as.matrix(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)]))) %>%
filter(colvars == 0))
## colnames colvars
## 1 hour 0
## 2 minute 0
## 3 second 0
## 4 hour12 0
## 5 am.pm 0
# one_of() selects the variables named in a character vector
retail_p_day_aug <- select(retail_p_day_aug, -one_of(as.character(var$colnames)))
Next, we want to remove highly correlated features. By plotting them, we can get an idea about which cutoff to set.
library(ggcorrplot)
# cor() computes the correlations between columns
# cor_pmat() computes the p-values of the correlation matrix
cor <- cor(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])
p.cor <- cor_pmat(retail_p_day_aug[, sapply(retail_p_day_aug, is.numeric)])
ggcorrplot(cor, type = "upper", outline.col = "white", hc.order = TRUE, p.mat = p.cor,
colors = c(palette_light()[1], "white", palette_light()[2]))
I am going to choose a cutoff of 0.9 for removing features:
# findCorrelation() searches the correlation matrix and returns an integer vector of the columns to remove
cor_cut <- findCorrelation(cor, cutoff=0.9)
retail_p_day_aug <- select(retail_p_day_aug, -one_of(colnames(cor)[cor_cut]))
Now, I can split the data into training and test sets:
train <- filter(retail_p_day_aug, model == "train") %>%
select(-model)
test <- filter(retail_p_day_aug, model == "test")
To model the time series of the response variable sum_income, I am using a generalized linear model. We could try all kinds of different models and modeling parameters, but for this test I am keeping it simple.
fit_lm <- glm(sum_income ~ ., data = train)
# tidy() returns the model summary as a data frame
# gather() reshapes from wide to long format
tidy(fit_lm) %>%
gather(x, y, estimate:p.value) %>%
ggplot(aes(x = term, y = y, color = x, fill = x)) +
facet_wrap(~ x, scales = "free", ncol = 4) +
geom_bar(stat = "identity", alpha = 0.8) +
scale_color_manual(values = palette_light()) +
scale_fill_manual(values = palette_light()) +
theme_tq() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# augment() adds columns such as fitted values and residuals to the original data
augment(fit_lm) %>%
ggplot(aes(x = date, y = .resid)) +
geom_hline(yintercept = 0, color = "red") +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_smooth() +
theme_tq()
## `geom_smooth()` using method = 'loess'
With this model, we can now add predictions and residuals for the test data...
pred_test <- test %>%
add_predictions(fit_lm, "pred_lm") %>%
add_residuals(fit_lm, "resid_lm")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
...and visualize the residuals:
pred_test %>%
ggplot(aes(x = date, y = resid_lm)) +
geom_hline(yintercept = 0, color = "red") +
geom_point(alpha = 0.5, color = palette_light()[[1]]) +
geom_smooth() +
theme_tq()
## `geom_smooth()` using method = 'loess'
We can also compare the predicted with the actual sum income in the test set.
pred_test %>%
gather(x, y, sum_income, pred_lm) %>%
ggplot(aes(x = date, y = y, color = x)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
Once we are satisfied with our model's performance on the test set, we can use it to forecast future events. To create future time points for modeling, we need to extract the time index (the date column day in our data frame).
# Extract index
idx <- retail_p_day %>%
tk_index()
From this index we can generate the future time series. Here we need to beware of a couple of things. Most importantly, we need to account for the irregularity of our data: we never have data for Saturdays, and there are a few random missing values in between, as can be seen in the diff column of retail_p_day_aug (a 1-day difference == 86400 seconds).
retail_p_day_aug %>%
ggplot(aes(x = date, y = diff)) +
geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
What dates are these? Let's filter for dates that are not Sundays and have a gap of more than one day since the last recorded day (as Saturdays are always off-days).
retail_p_day_aug %>%
select(date, wday.lbl, diff) %>%
filter(wday.lbl != "Sunday" & diff > 86400) %>%
mutate(days_missing = diff / 86400 -1)
## # A tibble: 5 x 4
## date wday.lbl diff days_missing
## <date> <ord> <int> <dbl>
## 1 2011-01-04 Tuesday 1036800 11
## 2 2011-04-26 Tuesday 432000 4
## 3 2011-05-03 Tuesday 172800 1
## 4 2011-05-31 Tuesday 172800 1
## 5 2011-08-30 Tuesday 172800 1
retail_p_day_aug %>%
select(date, wday.lbl, diff) %>%
filter(wday.lbl == "Sunday" & diff > 172800) %>%
mutate(days_missing = diff / 86400 -1)
## # A tibble: 1 x 4
## date wday.lbl diff days_missing
## <date> <ord> <int> <dbl>
## 1 2011-05-01 Sunday 259200 2
Let's create a list of all missing days:
off_days <- c("2010-12-24", "2010-12-25", "2010-12-26", "2010-12-27", "2010-12-28", "2010-12-29", "2010-12-30", "2010-01-01", "2010-01-02", "2010-01-03",
"2011-04-22", "2011-04-23", "2011-04-24", "2011-04-25", "2011-05-02", "2011-05-30", "2011-08-29", "2011-04-29", "2011-04-30") %>%
ymd()
We can account for the missing Saturdays with inspect_weekdays = TRUE. Ideally, we would also use skip_values and insert_values to specifically account for days with irregular missing data in our future time series, e.g. by accounting for holidays.
idx_future <- idx %>%
tk_make_future_timeseries(n_future = 300, inspect_weekdays = TRUE, inspect_months = FALSE, skip_values = off_days)
## pad applied on the interval: day
## The following `skip_values` were not in the future date sequence: 2010-12-24, 2010-12-25, 2010-12-26, 2010-12-27, 2010-12-28, 2010-12-29, 2010-12-30, 2010-01-01, 2010-01-02, 2010-01-03, 2011-04-22, 2011-04-23, 2011-04-24, 2011-04-25, 2011-05-02, 2011-05-30, 2011-08-29, 2011-04-29, 2011-04-30
idx_future %>%
tk_get_timeseries_signature() %>%
ggplot(aes(x = index, y = diff)) +
geom_point(alpha = 0.5, aes(color = as.factor(diff))) +
geom_line(alpha = 0.5) +
scale_color_manual(values = palette_light()) +
theme_tq()
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
Then, we can build the data frame for forecasting by using tk_get_timeseries_signature() and renaming the index column to date, so that it matches the features in the model. With this data frame, we can now predict future values and add them to the data frame.
data_future <- idx_future %>%
tk_get_timeseries_signature() %>%
rename(date = index)
pred_future <- predict(fit_lm, newdata = data_future)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred_future <- data_future %>%
select(date) %>%
add_column(sum_income = pred_future)
retail_p_day %>%
select(day, sum_income) %>%
rename(date = day) %>%
rbind(pred_future) %>%
ggplot(aes(x = date, y = sum_income)) +
scale_x_date() +
geom_vline(xintercept = as.numeric(max(retail_p_day$day)), color = "red", size = 1) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
theme_tq()
## Warning: Removed 1 rows containing missing values (geom_point).
When we evaluate the forecast, we want to account for the uncertainty of its accuracy, e.g. by using the standard deviation of the test residuals. (Below, 1.96 and 1.28 are the approximate standard-normal quantiles for 95% and 80% intervals.)
test_residuals <- pred_test$resid_lm
test_resid_sd <- sd(test_residuals, na.rm = TRUE)
pred_future <- pred_future %>%
mutate(
lo.95 = sum_income - 1.96 * test_resid_sd,
lo.80 = sum_income - 1.28 * test_resid_sd,
hi.80 = sum_income + 1.28 * test_resid_sd,
hi.95 = sum_income + 1.96 * test_resid_sd
)
We can then plot this to show the forecast with confidence intervals:
retail_p_day %>%
select(day, sum_income) %>%
rename(date = day) %>%
ggplot(aes(x = date, y = sum_income)) +
geom_point(alpha = 0.5) +
geom_line(alpha = 0.5) +
geom_ribbon(aes(ymin = lo.95, ymax = hi.95), data = pred_future,
fill = "#D5DBFF", color = NA, size = 0) +
geom_ribbon(aes(ymin = lo.80, ymax = hi.80), data = pred_future,
fill = "#596DD5", color = NA, size = 0, alpha = 0.8) +
geom_point(aes(x = date, y = sum_income), data = pred_future,
alpha = 0.5, color = palette_light()[[2]]) +
geom_smooth(aes(x = date, y = sum_income), data = pred_future,
method = 'loess', color = "white") +
theme_tq()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Our model predicts that income will follow a curve very similar to last year's, with a drop after Christmas and an increase towards the later months of the year. In and of itself, this sounds reasonable. However, because we only have data from one year, we do not know whether the decline in January/February and the increase towards Christmas are annually recurring trends, or whether the increase we see at the end of 2011 is independent of seasonality and will continue to rise in the future.