A situation where an exponential smoothing model could be appropriate is when trying to observe the behavior of a security that is experiencing a high amount of volatility. The exponential smoothing model would allow the observer to more easily see whether there is a trend without the variation caused by the volatility. The data required would be the previous prices of the security, which would serve as our response, and the time interval for each corresponding price. Since the security in question would be experiencing large price changes due to high volatility, I would expect α to be closer to 0 so that more weight is placed on the previous smoothed estimate than on the latest observation.
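For intuition, here is a minimal sketch of the single exponential smoothing recursion on a handful of made-up prices (the prices and the α value are hypothetical, chosen only to illustrate the weighting).
#Single exponential smoothing: S_t = alpha * x_t + (1 - alpha) * S_(t-1)
price <- c(100, 102, 98, 105, 99, 101)  # hypothetical prices, not real data
alpha <- 0.2                            # small alpha leans on the previous estimate
S <- numeric(length(price))
S[1] <- price[1]
for (t in 2:length(price)) {
  S[t] <- alpha * price[t] + (1 - alpha) * S[t - 1]
}
round(S, 2)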
Importing library and data.
library(stats)
temp <- read.table("temps.txt", header = T)
head(temp)
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93 82
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91 86
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90 88
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81 87
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 85 95 87 92 105 82 90 85
## 2 87 90 84 94 93 85 93 87
## 3 91 89 83 95 99 76 87 79
## 4 90 91 85 92 98 77 84 85
## 5 88 80 88 90 100 83 86 84
## 6 82 87 89 90 98 83 87 84
From the details section of the HoltWinters function documentation, the arguments alpha, beta, and gamma are NULL by default. When they are NULL, the function will try to find optimal values of α, β, and γ.
#Single exponential smoothing using the HoltWinters function.
temp.vec <- as.vector(unlist(temp[2:21]))
ts.temp <- ts(temp.vec, start = 1996, frequency = 123)
model1 <- HoltWinters(ts.temp, beta = F, gamma = F)
model1
## Holt-Winters exponential smoothing without trend and without seasonal component.
##
## Call:
## HoltWinters(x = ts.temp, beta = F, gamma = F)
##
## Smoothing parameters:
## alpha: 0.8388021
## beta : FALSE
## gamma: FALSE
##
## Coefficients:
## [,1]
## a 63.30952
plot(model1)
plot(model1$fitted)
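As a quick comparison (a sketch, not part of the original output), the smoothing parameter could also be fixed by hand instead of letting HoltWinters optimize it, and the in-sample errors compared; the value 0.2 here is an arbitrary choice.
model.fixed <- HoltWinters(ts.temp, alpha = 0.2, beta = FALSE, gamma = FALSE)  # fixed alpha instead of optimized
c(fixed = model.fixed$SSE, optimized = model1$SSE)  # one-step-ahead SSE for each fit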
Our function found 0.8388021 to be the value of α, implying that our data appears to have relatively little randomness, so our model is more willing to trust the observations made from the data. There appear to be trend and seasonality patterns in our data, so let's account for that by allowing alpha, beta, and gamma to be NULL so that the function can find optimal values for them. Additionally, let's set the seasonal argument to “multiplicative”.
temp.vec2 <- as.vector(unlist(temp[2:21]))
ts.temp2 <- ts(temp.vec2, start = 1996, frequency = 123)
model2 <- HoltWinters(ts.temp2, seasonal = "multiplicative")
model2
## Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
##
## Call:
## HoltWinters(x = ts.temp2, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha: 0.615003
## beta : 0
## gamma: 0.5495256
##
## Coefficients:
## [,1]
## a 73.679517064
## b -0.004362918
## s1 1.239022317
## s2 1.234344062
## s3 1.159509551
## s4 1.175247483
## s5 1.171344196
## s6 1.151038408
## s7 1.139383104
## s8 1.130484528
## s9 1.110487514
## s10 1.076242879
## s11 1.041044609
## s12 1.058139281
## s13 1.032496529
## s14 1.036257448
## s15 1.019348815
## s16 1.026754142
## s17 1.071170378
## s18 1.054819556
## s19 1.084397734
## s20 1.064605879
## s21 1.109827336
## s22 1.112670130
## s23 1.103970506
## s24 1.102771209
## s25 1.091264692
## s26 1.084518342
## s27 1.077914660
## s28 1.077696145
## s29 1.053788854
## s30 1.079454300
## s31 1.053481186
## s32 1.054023885
## s33 1.078221405
## s34 1.070145761
## s35 1.054891375
## s36 1.044587771
## s37 1.023285461
## s38 1.025836722
## s39 1.031075732
## s40 1.031419152
## s41 1.021827552
## s42 0.998177248
## s43 0.996049257
## s44 0.981570825
## s45 0.976510542
## s46 0.967977608
## s47 0.985788411
## s48 1.004748195
## s49 1.050965934
## s50 1.072515008
## s51 1.086532279
## s52 1.098357400
## s53 1.097158461
## s54 1.054827180
## s55 1.022866587
## s56 0.987259326
## s57 1.016923524
## s58 1.016604903
## s59 1.004320951
## s60 1.019102781
## s61 0.983848662
## s62 1.055888360
## s63 1.056122844
## s64 1.043478958
## s65 1.039475693
## s66 0.991019224
## s67 1.001437488
## s68 1.002221759
## s69 1.003949213
## s70 0.999566344
## s71 1.018636837
## s72 1.026490773
## s73 1.042507768
## s74 1.022500795
## s75 1.002503740
## s76 1.004560984
## s77 1.025536556
## s78 1.015357769
## s79 0.992176558
## s80 0.979377825
## s81 0.998058079
## s82 1.002553395
## s83 0.955429116
## s84 0.970970220
## s85 0.975543504
## s86 0.931515830
## s87 0.926764603
## s88 0.958565273
## s89 0.963250387
## s90 0.951644060
## s91 0.937362688
## s92 0.954257999
## s93 0.892485444
## s94 0.879537700
## s95 0.879946892
## s96 0.890633648
## s97 0.917134959
## s98 0.925991769
## s99 0.884247686
## s100 0.846648167
## s101 0.833696369
## s102 0.800001437
## s103 0.807934782
## s104 0.819343668
## s105 0.828571029
## s106 0.795608740
## s107 0.796609993
## s108 0.815503509
## s109 0.830111282
## s110 0.829086181
## s111 0.818367239
## s112 0.863958784
## s113 0.912057203
## s114 0.898308248
## s115 0.878723779
## s116 0.848971946
## s117 0.813891909
## s118 0.846821392
## s119 0.819121827
## s120 0.851036184
## s121 0.820416491
## s122 0.851581233
## s123 0.874038407
plot(model2)
plot(model2$fitted)
It can be seen that, even with the HoltWinters function accounting for trend and seasonality in our time series data, there is essentially no trend present (the final trend slope b is about -0.004). Because of this, we can conclude that the unofficial end of summer has not gotten later over the past 20 years.
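As a quick sanity check on the no-trend conclusion (a sketch, assuming model2 from above), the final trend slope can be inspected directly and the fitted trend component plotted.
model2$coefficients["b"]  # final trend slope, roughly -0.004 per period
plot(model2$fitted[, "trend"], ylab = "fitted trend component")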
A situation from current events where a linear regression model could be useful would be forecasting losses for retail stores due to the current protests in major cities. Companies could use predictors such as the number of stores affected by the protests or the length of time that stores are closed to forecast potential losses in revenue.
References:
https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients
https://stats.stackexchange.com/questions/232548/r-how-are-the-significance-codes-determined-when-summarizing-a-logistic-regres
library(stats)
crime <- read.table("uscrime.txt", header = T)
head(crime)
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob
## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682
Using the lm function, we can create a regression model from our uscrime.txt data set. We can then take the sample data provided in the question statement and make a prediction of the crime rate for that sample city.
set.seed(3)
model3 <- lm(Crime ~ ., data = crime)
sample <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)
test <- predict(model3, sample)
test
## 1
## 155.4349
crime$Crime
## [1] 791 1635 578 1969 1234 682 963 1555 856 705 1674 849 511 664 798
## [16] 946 539 929 750 1225 742 439 1216 968 523 1993 342 1216 1043 696
## [31] 373 754 1072 923 653 1272 831 566 826 1151 880 542 823 1030 455
## [46] 508 849
So, using the model created from the uscrime.txt file, we predicted a response value of 155.4349 for crime in the sample city. Just by comparing it to the other crime values in our original data set, it appears significantly off, even though all of the sample's predictor values seem fairly reasonable in comparison to those of the other observations in the data set. It's possible that some of the predictors are not correlated with the response, and that could be affecting our prediction. We can check the p-values from our model, see which predictors have the most influence on our response, and use only those.
summary(model3)
##
## Call:
## lm(formula = Crime ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *
## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
From the summary of our model using the whole uscrime.txt data set, it can be observed that many of the predictors have large p-values, which means the null hypothesis (that their coefficient is zero) cannot be rejected. In other words, there are predictors in this data set that do not have a significant impact on crime in the city. Let's create a new model using only the predictors that significantly affect the response (all of the predictors marked by the signif. codes), as shown in the sketch below and in the model that follows.
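As an aside, the flagged predictors could also be pulled out programmatically from the coefficient table; this is just a sketch using the p-value column with 0.1 as the cutoff.
pvals <- summary(model3)$coefficients[, "Pr(>|t|)"]        # p-value for each coefficient
names(pvals)[pvals < 0.1 & names(pvals) != "(Intercept)"]  # predictors below the 0.1 cutoff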
model4 <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime)
set.seed(3)
summary(model4) # All p-values are less than 0.05, which means the null hypothesis for these predictors CAN be rejected, i.e. these predictors do influence the response.
##
## Call:
## lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470.68 -78.41 -19.68 133.12 556.23
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5040.50 899.84 -5.602 1.72e-06 ***
## M 105.02 33.30 3.154 0.00305 **
## Ed 196.47 44.75 4.390 8.07e-05 ***
## Po1 115.02 13.75 8.363 2.56e-10 ***
## U2 89.37 40.91 2.185 0.03483 *
## Ineq 67.65 13.94 4.855 1.88e-05 ***
## Prob -3801.84 1528.10 -2.488 0.01711 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
## F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
sample2<- data.frame(M = 14.0, Ed = 10.0,Po1 = 12.0,U2 = 3.6, Ineq = 20.1, Prob = 0.04)
test2 <- predict(model4,sample2)
test2
## 1
## 1304.245
Using our model with only the predictors that correlate with our response, we get a much more reasonable prediction for the crime rate, with a response value of 1304.245. However, the R-squared value of 0.7659 for this model was measured on the training set itself, so we should cross-validate to see how much of that fit is due to overfitting.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(3)
training <- trainControl(method = "cv", number = 5)
model5 <- train(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, method = "lm", data = crime, trControl = training)
model5
## Linear Regression
##
## 47 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 39, 37, 38, 36, 38
## Resampling results:
##
## RMSE Rsquared MAE
## 218.9649 0.6955635 166.1713
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
test3 <- predict(model5,sample2)
test3
## 1
## 1304.245
Using k-fold cross-validation with 5 folds, we can see that we get the same prediction for the crime in our sample city (caret refits the final model on the full data set, so the coefficients are unchanged), but with a smaller R-squared value of 0.6955 compared to 0.7659 when we evaluated the model on its own training data. Cross-validation gives an estimate of model quality that does not reward the random effects captured during fitting and instead reflects mostly the real effects. The result is a lower, but more realistic, measure of the model's accuracy.
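For completeness, the cross-validated R-squared can also be reproduced with a short manual 5-fold loop (a sketch using the same predictor set; the exact number will differ slightly from caret's depending on the fold split and how R-squared is aggregated).
set.seed(3)
folds <- sample(rep(1:5, length.out = nrow(crime)))  # assign each row to one of 5 folds
preds <- numeric(nrow(crime))
for (k in 1:5) {
  fit <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime[folds != k, ])
  preds[folds == k] <- predict(fit, newdata = crime[folds == k, ])
}
1 - sum((crime$Crime - preds)^2) / sum((crime$Crime - mean(crime$Crime))^2)  # cross-validated R-squared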