Question 7.1

A situation where an exponential smoothing model could be appropriate is observing the behavior of a security that is experiencing a high amount of volatility. Exponential smoothing would allow the observer to more easily see whether there is an underlying trend without the variation caused by the volatility. The data required would be the previous prices of the security, which would be our response, and the time intervals for each of the corresponding prices. Since the security in question would be experiencing large changes in price due to high volatility, I would expect α to be closer to 0 so that more weight is placed on the previous smoothed estimate rather than on the latest, noisy observation.
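
For reference, the basic single exponential smoothing update is the standard formula below (the notation S_t for the smoothed estimate and x_t for the observed price is mine, not from the data):

S_t = α·x_t + (1 − α)·S_{t−1}

The closer α is to 0, the less weight the new observation x_t receives and the more the estimate carries over from S_{t−1}, which is what smooths out the volatility.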

Question 7.2

Importing library and data.

library(stats)
temp <- read.table("temps.txt", header = T)
head(temp)
##     DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul    98    86    91    84    89    84    90    73    82    91    93    95
## 2 2-Jul    97    90    88    82    91    87    90    81    81    89    93    85
## 3 3-Jul    97    93    91    87    93    87    87    87    86    86    93    82
## 4 4-Jul    90    91    91    88    95    84    89    86    88    86    91    86
## 5 5-Jul    89    84    91    90    96    86    93    80    90    89    90    88
## 6 6-Jul    93    84    89    91    96    87    93    84    90    82    81    87
##   X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1    85    95    87    92   105    82    90    85
## 2    87    90    84    94    93    85    93    87
## 3    91    89    83    95    99    76    87    79
## 4    90    91    85    92    98    77    84    85
## 5    88    80    88    90   100    83    86    84
## 6    82    87    89    90    98    83    87    84

From the Details section of the HoltWinters function's documentation, the arguments alpha, beta, and gamma are set to NULL by default, and the function will try to find optimal values of α, β, and γ when they are left as NULL.

# Single exponential smoothing using the HoltWinters function (trend and seasonal components turned off).
temp.vec <- as.vector(unlist(temp[2:21]))   # columns 2-21 hold the 20 years of daily highs
ts.temp <- ts(temp.vec, start = 1996, frequency = 123)   # 123 days per summer season
model1 <- HoltWinters(ts.temp, beta = F, gamma = F)
model1
## Holt-Winters exponential smoothing without trend and without seasonal component.
## 
## Call:
## HoltWinters(x = ts.temp, beta = F, gamma = F)
## 
## Smoothing parameters:
##  alpha: 0.8388021
##  beta : FALSE
##  gamma: FALSE
## 
## Coefficients:
##       [,1]
## a 63.30952
plot(model1)

plot(model1$fitted)

Our function found 0.8388021 to be the value of α, implying that our data has relatively little randomness and the model is willing to put most of its weight on the observations themselves. There appear to be trend and seasonality patterns in our data, so let's account for them by leaving alpha, beta, and gamma as NULL so that the function can find optimal values for all three. Additionally, let's set the seasonal argument to “multiplicative”.
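
As a rough reference for what these three parameters control, the multiplicative Holt-Winters updates take approximately the form below (the notation is generic and mine: x_t is the observation, a_t the level, b_t the trend, s_t the seasonal factor, and p = 123 the period):

a_t = α·(x_t / s_{t−p}) + (1 − α)·(a_{t−1} + b_{t−1})
b_t = β·(a_t − a_{t−1}) + (1 − β)·b_{t−1}
s_t = γ·(x_t / a_t) + (1 − γ)·s_{t−p}

So α smooths the level, β smooths the trend, and γ smooths the seasonal factors.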

temp.vec2 <- as.vector(unlist(temp[2:21]))
ts.temp2 <- ts(temp.vec2, start = 1996, frequency = 123)
model2 <- HoltWinters(ts.temp2, seasonal = "multiplicative")
model2
## Holt-Winters exponential smoothing with trend and multiplicative seasonal component.
## 
## Call:
## HoltWinters(x = ts.temp2, seasonal = "multiplicative")
## 
## Smoothing parameters:
##  alpha: 0.615003
##  beta : 0
##  gamma: 0.5495256
## 
## Coefficients:
##              [,1]
## a    73.679517064
## b    -0.004362918
## s1    1.239022317
## s2    1.234344062
## s3    1.159509551
## s4    1.175247483
## s5    1.171344196
## s6    1.151038408
## s7    1.139383104
## s8    1.130484528
## s9    1.110487514
## s10   1.076242879
## s11   1.041044609
## s12   1.058139281
## s13   1.032496529
## s14   1.036257448
## s15   1.019348815
## s16   1.026754142
## s17   1.071170378
## s18   1.054819556
## s19   1.084397734
## s20   1.064605879
## s21   1.109827336
## s22   1.112670130
## s23   1.103970506
## s24   1.102771209
## s25   1.091264692
## s26   1.084518342
## s27   1.077914660
## s28   1.077696145
## s29   1.053788854
## s30   1.079454300
## s31   1.053481186
## s32   1.054023885
## s33   1.078221405
## s34   1.070145761
## s35   1.054891375
## s36   1.044587771
## s37   1.023285461
## s38   1.025836722
## s39   1.031075732
## s40   1.031419152
## s41   1.021827552
## s42   0.998177248
## s43   0.996049257
## s44   0.981570825
## s45   0.976510542
## s46   0.967977608
## s47   0.985788411
## s48   1.004748195
## s49   1.050965934
## s50   1.072515008
## s51   1.086532279
## s52   1.098357400
## s53   1.097158461
## s54   1.054827180
## s55   1.022866587
## s56   0.987259326
## s57   1.016923524
## s58   1.016604903
## s59   1.004320951
## s60   1.019102781
## s61   0.983848662
## s62   1.055888360
## s63   1.056122844
## s64   1.043478958
## s65   1.039475693
## s66   0.991019224
## s67   1.001437488
## s68   1.002221759
## s69   1.003949213
## s70   0.999566344
## s71   1.018636837
## s72   1.026490773
## s73   1.042507768
## s74   1.022500795
## s75   1.002503740
## s76   1.004560984
## s77   1.025536556
## s78   1.015357769
## s79   0.992176558
## s80   0.979377825
## s81   0.998058079
## s82   1.002553395
## s83   0.955429116
## s84   0.970970220
## s85   0.975543504
## s86   0.931515830
## s87   0.926764603
## s88   0.958565273
## s89   0.963250387
## s90   0.951644060
## s91   0.937362688
## s92   0.954257999
## s93   0.892485444
## s94   0.879537700
## s95   0.879946892
## s96   0.890633648
## s97   0.917134959
## s98   0.925991769
## s99   0.884247686
## s100  0.846648167
## s101  0.833696369
## s102  0.800001437
## s103  0.807934782
## s104  0.819343668
## s105  0.828571029
## s106  0.795608740
## s107  0.796609993
## s108  0.815503509
## s109  0.830111282
## s110  0.829086181
## s111  0.818367239
## s112  0.863958784
## s113  0.912057203
## s114  0.898308248
## s115  0.878723779
## s116  0.848971946
## s117  0.813891909
## s118  0.846821392
## s119  0.819121827
## s120  0.851036184
## s121  0.820416491
## s122  0.851581233
## s123  0.874038407
plot(model2)

plot(model2$fitted)

With the HoltWinters function now accounting for trend and seasonality, the fitted trend component of our time series is essentially zero (b ≈ -0.004), so no trend is present in the data. Because of this, we can conclude that the unofficial end of summer has not gotten later over the past 20 years.
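
As a quick numerical check on that conclusion, the trend slope can be read directly off the fitted object (a small sketch reusing model2 from above):

# The fitted trend slope b from model2; about -0.004 per day per the output above,
# i.e. essentially flat, supporting the conclusion that summers are not ending later.
model2$coefficients["b"]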

Question 8.1

A situation from current events where a linear regression model could be useful is forecasting losses for retail stores due to the current protests in major cities. Companies could use predictors such as the number of stores affected by the protests or the length of time that stores are closed to predict potential losses in revenue.
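
A minimal sketch of what such a model might look like, assuming a hypothetical data frame store_impact with columns revenue_loss, stores_affected, and days_closed (all of these names are illustrative, not from any real data set):

# Hypothetical example: regress revenue loss on the number of affected stores
# and how long they were closed. store_impact and its columns are made-up names.
loss.model <- lm(revenue_loss ~ stores_affected + days_closed, data = store_impact)
summary(loss.model)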

Question 8.2

https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-regression-analysis-results-p-values-and-coefficients
https://stats.stackexchange.com/questions/232548/r-how-are-the-significance-codes-determined-when-summarizing-a-logistic-regres

library(stats)
crime <- read.table("uscrime.txt", header = T)
head(crime)
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682

Using the lm function, we can create a regression model from our uscrime.txt data set. We can then take the sample data provided in the question statement and predict the crime value for that sample.

set.seed(3)
model3 <- lm(Crime ~ ., data = crime)
sample <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)
test <- predict(model3, sample)
test
##        1 
## 155.4349
crime$Crime
##  [1]  791 1635  578 1969 1234  682  963 1555  856  705 1674  849  511  664  798
## [16]  946  539  929  750 1225  742  439 1216  968  523 1993  342 1216 1043  696
## [31]  373  754 1072  923  653 1272  831  566  826 1151  880  542  823 1030  455
## [46]  508  849

So, using the model created from the uscrime.txt file, we predicted a response value of 155.4349 for crime in the sample city. Simply comparing it to the other crime values in our original data set, which range from 342 to 1993, the prediction appears significantly off, even though all of the sample's predictor values seem fairly reasonable compared to the rest of the data set. It's possible that some of the predictors are not correlated with the response, and that could be affecting our prediction. We can check the p-values from our model to see which predictors have the most influence on the response and use only those.

summary(model3)
## 
## Call:
## lm(formula = Crime ~ ., data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

From the summary of our model using the whole uscrime.txt data set, it can be observed that many of the predictors have large p-values, which means the null hypothesis cannot be rejected for them. In other words, there are predictors in this data set that do not have a significant impact on crime in a city. Let's create a new model using only the predictors that significantly affect the response (the predictors marked by the significance codes).

model4 <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime)
set.seed(3)
summary(model4) # all p-values are below .05, so the null hypothesis CAN be rejected for these predictors, meaning they do influence the response
## 
## Call:
## lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470.68  -78.41  -19.68  133.12  556.23 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5040.50     899.84  -5.602 1.72e-06 ***
## M             105.02      33.30   3.154  0.00305 ** 
## Ed            196.47      44.75   4.390 8.07e-05 ***
## Po1           115.02      13.75   8.363 2.56e-10 ***
## U2             89.37      40.91   2.185  0.03483 *  
## Ineq           67.65      13.94   4.855 1.88e-05 ***
## Prob        -3801.84    1528.10  -2.488  0.01711 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.7 on 40 degrees of freedom
## Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
## F-statistic: 21.81 on 6 and 40 DF,  p-value: 3.418e-11
sample2 <- data.frame(M = 14.0, Ed = 10.0, Po1 = 12.0, U2 = 3.6, Ineq = 20.1, Prob = 0.04)
test2 <- predict(model4, sample2)
test2
##        1 
## 1304.245

Using our model with only the predictors that are related to our response, we get a much more reasonable prediction for the crime rate, with a response value of 1304.245. However, the R-squared of 0.7659 for this model was measured on the same training data used to fit it, so it is likely inflated by overfitting; we should cross-validate to get a more realistic estimate of the model's quality.

http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/#k-fold-cross-validation

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(3)
training <- trainControl(method = "cv", number = 5)

model5 <- train(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, method = "lm", data = crime, trControl = training)

model5
## Linear Regression 
## 
## 47 samples
##  6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 39, 37, 38, 36, 38 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   218.9649  0.6955635  166.1713
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
test3 <- predict(model5, sample2)
test3
##        1 
## 1304.245

Using k-fold cross-validation with 5 folds, we get the same prediction for crime in our sample city, but with a smaller R-squared of 0.6955 compared to the 0.7659 measured on the training data alone. Part of what the original R-squared captured was random effects specific to the training data rather than real effects, so cross-validation gives a more honest, if lower, estimate of how well the model would perform on new data.
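
To make that comparison explicit, the two R-squared values can be pulled straight from the fitted objects (a short sketch reusing model4 and model5 from above):

# In-sample R-squared from the plain lm fit (optimistic, measured on the training data)
summary(model4)$r.squared
# Average cross-validated R-squared from the caret fit (a more realistic estimate)
model5$results$Rsquared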