Question 7.1
Describe a situation or problem from your job, everyday life, current events, etc., for which exponential smoothing would be appropriate. What data would you need? Would you expect the value of α (the first smoothing parameter) to be closer to 0 or 1, and why?
The titin protein is involved in protein degradation and stretch-based signaling in the heart, so a fully functional stretch-based signaling complex is associated with a healthy cardiovascular system. To monitor cardiovascular health using titin, the data needed would be daily measurements of titin activity along with diastolic and systolic blood pressure. Because the day-to-day random variability in these measurements is expected to be low, an alpha value closer to 1 would be chosen, so that the model weights recent observations heavily rather than smoothing them away.
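A minimal sketch of how such a model might be fit in R, assuming a hypothetical data frame titin with one activity measurement per day (the object and column names are illustrative only):
# Hypothetical daily titin-activity series; names are made up for illustration
titin_ts <- ts(titin$activity, frequency = 1)
# Simple exponential smoothing (no trend or seasonal component);
# with low random variability the fitted alpha should come out close to 1
titin_es <- HoltWinters(titin_ts, beta = FALSE, gamma = FALSE)
titin_es$alpha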
Question 7.2
Using the 20 years of daily high temperature data for Atlanta (July through October) from Question 6.2 (file temps.txt), build and use an exponential smoothing model to help make a judgment of whether the unofficial end of summer has gotten later over the 20 years. (Part of the point of this assignment is for you to think about how you might use exponential smoothing to answer this question. Feel free to combine it with other models if you’d like to. There’s certainly more than one reasonable approach.)
Note: in R, you can use either HoltWinters (simpler to use) or the smooth package’s es function (harder to use, but more general). If you use es, the Holt-Winters model uses model=”AAM” in the function call (the first and second constants are used “A”dditively, and the third (seasonality) is used “M”ultiplicatively; the documentation doesn’t make that clear).
temps <- read.table("\\Users\\sethm\\OneDrive\\Desktop\\week_3_data-summer\\data 7.2\\temps.txt" , stringsAsFactors = FALSE, header = TRUE)
head(temps)
## DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul 98 86 91 84 89 84 90 73 82 91 93 95
## 2 2-Jul 97 90 88 82 91 87 90 81 81 89 93 85
## 3 3-Jul 97 93 91 87 93 87 87 87 86 86 93 82
## 4 4-Jul 90 91 91 88 95 84 89 86 88 86 91 86
## 5 5-Jul 89 84 91 90 96 86 93 80 90 89 90 88
## 6 6-Jul 93 84 89 91 96 87 93 84 90 82 81 87
## X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1 85 95 87 92 105 82 90 85
## 2 87 90 84 94 93 85 93 87
## 3 91 89 83 95 99 76 87 79
## 4 90 91 85 92 98 77 84 85
## 5 88 80 88 90 100 83 86 84
## 6 82 87 89 90 98 83 87 84
# Unroll the 20 years of daily high temperatures into a single long vector
Avg_daily <- as.vector(unlist(temps[,2:21]))
# Time series transformation
Avg_daily.ts<-ts(data = Avg_daily, frequency=123, start=1996)
summary(Avg_daily)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 79.00 85.00 83.34 90.00 105.00
set.seed(1)
# Graph time series data
ts.plot(Avg_daily.ts)
# Perform Exponential Smoothing
Avg_daily.ts_forecast <- HoltWinters(Avg_daily.ts, alpha=NULL, beta=NULL, gamma=NULL, seasonal = "multiplicative")
summary(Avg_daily.ts_forecast)
## Length Class Mode
## fitted 9348 mts numeric
## x 2460 ts numeric
## alpha 1 -none- numeric
## beta 1 -none- numeric
## gamma 1 -none- numeric
## coefficients 125 -none- numeric
## seasonal 1 -none- character
## SSE 1 -none- numeric
## call 6 -none- call
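The fitted smoothing parameters can also be pulled directly from the HoltWinters object (output not shown here):
# Smoothing parameters chosen by minimizing the one-step-ahead squared errors
Avg_daily.ts_forecast$alpha
Avg_daily.ts_forecast$beta
Avg_daily.ts_forecast$gamma
Avg_daily.ts_forecast$SSE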
# Plot the observed data and the Holt-Winters fitted values
plot(Avg_daily.ts_forecast)
head(Avg_daily.ts_forecast$fitted)
## xhat level trend season
## [1,] 87.23653 82.87739 -0.004362918 1.052653
## [2,] 90.42182 82.15059 -0.004362918 1.100742
## [3,] 92.99734 81.91055 -0.004362918 1.135413
## [4,] 90.94030 81.90763 -0.004362918 1.110338
## [5,] 83.99917 81.93634 -0.004362918 1.025231
## [6,] 84.04496 81.93247 -0.004362918 1.025838
which(temps$DAY=="1-Sep")
## [1] 63
# Let's pull out the seasonality factors into a separate object and add the row/column names for ease of use
temps_hw_sf <- matrix(Avg_daily.ts_forecast$fitted[,4], nrow=123)
head(temps_hw_sf)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1.052653 1.049468 1.120607 1.103336 1.118390 1.108172 1.140906 1.140574
## [2,] 1.100742 1.099653 1.108025 1.098323 1.110184 1.116213 1.126827 1.154074
## [3,] 1.135413 1.135420 1.139096 1.142831 1.143201 1.138495 1.129678 1.156092
## [4,] 1.110338 1.110492 1.117079 1.125774 1.134539 1.126117 1.130758 1.137722
## [5,] 1.025231 1.025233 1.044684 1.067291 1.084725 1.097239 1.115055 1.103877
## [6,] 1.025838 1.025722 1.028169 1.042340 1.053954 1.067494 1.080203 1.094312
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 1.125438 1.122063 1.161415 1.198102 1.198910 1.243012 1.243781 1.238435
## [2,] 1.142187 1.131889 1.144549 1.134661 1.153433 1.165431 1.172935 1.190735
## [3,] 1.165657 1.147982 1.149459 1.135756 1.153310 1.155197 1.157286 1.169773
## [4,] 1.150639 1.146992 1.142497 1.150162 1.151169 1.157751 1.163844 1.159343
## [5,] 1.120818 1.133733 1.132167 1.142714 1.139244 1.112909 1.132435 1.132045
## [6,] 1.102680 1.092178 1.075766 1.088547 1.082185 1.103092 1.115071 1.118575
## [,17] [,18] [,19]
## [1,] 1.300204 1.290647 1.254521
## [2,] 1.191956 1.219190 1.228826
## [3,] 1.189915 1.172309 1.169045
## [4,] 1.166605 1.167993 1.158956
## [5,] 1.145230 1.168161 1.170449
## [6,] 1.121598 1.134962 1.145475
# Add row/column names for readability, then compute the average seasonal factor for each year
colnames(temps_hw_sf) <- colnames(temps[,3:21])
rownames(temps_hw_sf) <- temps[,1]
meanSeasonalperyr <- vector()
for (i in 1:ncol(temps_hw_sf)){
  meanSeasonalperyr[i] = mean(temps_hw_sf[,i])
}
meanSeasonalperyr
## [1] 1.0000000 0.9981882 0.9980547 0.9975095 0.9971271 0.9961954 0.9955108
## [8] 0.9957393 0.9956360 0.9949667 0.9948162 0.9946495 0.9943300 0.9941211
## [15] 0.9940949 0.9933907 0.9932476 0.9935215 0.9928824
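The same per-year averages can be computed more compactly with colMeans, which returns the equivalent values as a named vector:
# Column-wise means of the seasonal factors, one value per year
colMeans(temps_hw_sf)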
head(temps_hw_sf)
## X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004
## 1-Jul 1.052653 1.049468 1.120607 1.103336 1.118390 1.108172 1.140906 1.140574
## 2-Jul 1.100742 1.099653 1.108025 1.098323 1.110184 1.116213 1.126827 1.154074
## 3-Jul 1.135413 1.135420 1.139096 1.142831 1.143201 1.138495 1.129678 1.156092
## 4-Jul 1.110338 1.110492 1.117079 1.125774 1.134539 1.126117 1.130758 1.137722
## 5-Jul 1.025231 1.025233 1.044684 1.067291 1.084725 1.097239 1.115055 1.103877
## 6-Jul 1.025838 1.025722 1.028169 1.042340 1.053954 1.067494 1.080203 1.094312
## X2005 X2006 X2007 X2008 X2009 X2010 X2011 X2012
## 1-Jul 1.125438 1.122063 1.161415 1.198102 1.198910 1.243012 1.243781 1.238435
## 2-Jul 1.142187 1.131889 1.144549 1.134661 1.153433 1.165431 1.172935 1.190735
## 3-Jul 1.165657 1.147982 1.149459 1.135756 1.153310 1.155197 1.157286 1.169773
## 4-Jul 1.150639 1.146992 1.142497 1.150162 1.151169 1.157751 1.163844 1.159343
## 5-Jul 1.120818 1.133733 1.132167 1.142714 1.139244 1.112909 1.132435 1.132045
## 6-Jul 1.102680 1.092178 1.075766 1.088547 1.082185 1.103092 1.115071 1.118575
## X2013 X2014 X2015
## 1-Jul 1.300204 1.290647 1.254521
## 2-Jul 1.191956 1.219190 1.228826
## 3-Jul 1.189915 1.172309 1.169045
## 4-Jul 1.166605 1.167993 1.158956
## 5-Jul 1.145230 1.168161 1.170449
## 6-Jul 1.121598 1.134962 1.145475
which(temps$DAY=="1-Oct")
## [1] 93
#average seasonal factor for the first fitted year (1997; the 1996 data is used to initialize the seasonal component)
meanSF_firstYear <- mean(temps_hw_sf[,1])
meanSF_firstYear
## [1] 1
# CUSUM change detection for a decrease: accumulate (mean - observation - C) and
# flag the first row at which the cumulative sum reaches the threshold T.
# Returns the row index of the detected change, or NA if no change is detected.
cusum_decrease = function(data, mean, T, C){
  cusum = 0
  for (rowCounter in 1:nrow(data)){
    current = data[rowCounter, ]
    cusum = max(0, cusum + (mean - current - C))
    if (cusum >= T) {
      return(rowCounter)
    }
  }
  return(NA)
}
# Set C at 0.5 standard deviations and T at 3 standard deviations of the first year's seasonal factors.
C_var = sd(temps_hw_sf[,1])*0.5
T_var = sd(temps_hw_sf[,1])*3
# Now run CUSUM on each year's seasonal factors to find the day on which the cumulative decrease breached the threshold we set.
# The baseline is the mean seasonal factor of the first fitted year, which is 1 (computed above).
result_vector = vector()
for (col in 1:ncol(temps_hw_sf)){
  result_vector[col] = cusum_decrease(data = as.matrix(temps_hw_sf[,col]), mean = 1, T = T_var, C = C_var)
}
CUSUM_RESULTS = data.frame(Year = colnames(temps_hw_sf),Day = temps[result_vector,1])
CUSUM_RESULTS
## Year Day
## 1 X1997 30-Sep
## 2 X1998 1-Oct
## 3 X1999 1-Oct
## 4 X2000 1-Oct
## 5 X2001 2-Oct
## 6 X2002 2-Oct
## 7 X2003 3-Oct
## 8 X2004 3-Oct
## 9 X2005 4-Oct
## 10 X2006 4-Oct
## 11 X2007 5-Oct
## 12 X2008 5-Oct
## 13 X2009 5-Oct
## 14 X2010 5-Oct
## 15 X2011 3-Oct
## 16 X2012 3-Oct
## 17 X2013 3-Oct
## 18 X2014 4-Oct
## 19 X2015 4-Oct
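One way to make the visual impression from this table more concrete is to convert the detected dates to a day-of-season index and fit a simple linear trend against the year; the sketch below reuses the CUSUM_RESULTS table (output not shown):
# Convert each detected end-of-summer date to a day index (1 = 1-Jul)
end_day <- match(CUSUM_RESULTS$Day, temps$DAY)
year_num <- 1997:2015
# A positive, significant slope would indicate the end of summer is getting later
summary(lm(end_day ~ year_num))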
Based on the CUSUM analysis of the smoothed seasonal factors, the detected end of summer drifts from September 30 - October 1 in the late 1990s to October 5 by 2007-2010, before settling around October 3-4 in the most recent years. So the unofficial end of summer does appear to have gotten somewhat later over 1997-2015, although the total shift is only a few days.
Question 8.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.
Linear regression analysis could be used to predict the annual budget forecast for the Campus Safety, Health and Environmental Management Association. The predictors would include waste disposal per institution, number of students per institution, number of medical students, research budget per institution, and active laboratory volume per institution. These variables are collected from the Post Secondary Education Survey and the National Science Foundation.
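A minimal sketch of such a model, assuming a hypothetical data frame budget_data whose column names correspond to the predictors above (all names are illustrative):
# Hypothetical data frame and column names, for illustration only
budget_fit <- lm(Budget ~ Waste + Students + MedStudents + ResearchBudget + LabVolume, data = budget_data)
summary(budget_fit)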
Question 8.2 Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data: M = 14.0 So = 0 Ed = 10.0 Po1 = 12.0 Po2 = 15.5 LF = 0.640 M.F = 94.0 Pop = 150 NW = 1.1 U1 = 0.120 U2 = 3.6 Wealth = 3200 Ineq = 20.1 Prob = 0.04 Time = 39.0
Show your model (factors used and their coefficients), the software output, and the quality of fit.
Note that because there are only 47 data points and 15 predictors, you’ll probably notice some overfitting. We’ll see ways of dealing with this sort of problem later in the course.
crime <- read.table("\\Users\\sethm\\OneDrive\\Desktop\\week_3_data-summer\\data 8.2\\uscrime.txt" , stringsAsFactors = FALSE, header = TRUE)
head(crime)
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob
## 1 15.1 1 9.1 5.8 5.6 0.510 95.0 33 30.1 0.108 4.1 3940 26.1 0.084602
## 2 14.3 0 11.3 10.3 9.5 0.583 101.2 13 10.2 0.096 3.6 5570 19.4 0.029599
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25.0 0.083401
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8.0 0.102 3.9 6730 16.7 0.015801
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3.0 0.091 2.0 5780 17.4 0.041399
## 6 12.1 0 11.0 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6 0.034201
## Time Crime
## 1 26.2011 791
## 2 25.2999 1635
## 3 24.3006 578
## 4 29.9012 1969
## 5 21.2998 1234
## 6 20.9995 682
# Regression model using all 15 predictors on the unscaled data
datalm<-lm(Crime~M+So+Ed+Po1+Po2+LF+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob+Time,crime)
# Check the P-value for the regression model
summary(datalm)
##
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop +
## NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.74 -98.09 -6.69 112.99 512.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
## M 8.783e+01 4.171e+01 2.106 0.043443 *
## So -3.803e+00 1.488e+02 -0.026 0.979765
## Ed 1.883e+02 6.209e+01 3.033 0.004861 **
## Po1 1.928e+02 1.061e+02 1.817 0.078892 .
## Po2 -1.094e+02 1.175e+02 -0.931 0.358830
## LF -6.638e+02 1.470e+03 -0.452 0.654654
## M.F 1.741e+01 2.035e+01 0.855 0.398995
## Pop -7.330e-01 1.290e+00 -0.568 0.573845
## NW 4.204e+00 6.481e+00 0.649 0.521279
## U1 -5.827e+03 4.210e+03 -1.384 0.176238
## U2 1.678e+02 8.234e+01 2.038 0.050161 .
## Wealth 9.617e-02 1.037e-01 0.928 0.360754
## Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
## Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
## Time -3.479e+00 7.165e+00 -0.486 0.630708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
## F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
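To further assess the quality of fit, the standard residual diagnostics could also be examined (plots not shown):
# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(datalm)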
#Create the test datapoint manually using dataframe
test <-data.frame(M = 14.0,So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.040,Time = 39.0)
test
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq Prob Time
## 1 14 0 10 12 15.5 0.64 94 150 1.1 0.12 3.6 3200 20.1 0.04 39
# Build a reduced model, dropping predictors whose p-values in the full model exceed 0.05 (keeping M, Ed, Ineq, and Prob)
datalm0.05 <-lm(Crime~M+Ed+Ineq+Prob,crime)
summary(datalm0.05)
##
## Call:
## lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532.97 -254.03 -55.72 137.80 960.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1339.35 1247.01 -1.074 0.28893
## M 35.97 53.39 0.674 0.50417
## Ed 148.61 71.92 2.066 0.04499 *
## Ineq 26.87 22.77 1.180 0.24458
## Prob -7331.92 2560.27 -2.864 0.00651 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.5 on 42 degrees of freedom
## Multiple R-squared: 0.2629, Adjusted R-squared: 0.1927
## F-statistic: 3.745 on 4 and 42 DF, p-value: 0.01077
# Predict the crime rate for the test point using the full model
# (use a new name rather than overwriting the crime data frame)
crime_pred <- predict(datalm, test)
crime_pred
## 1
## 155.4349
The full model is statistically significant overall (p-value of about 3.5e-07) and has an adjusted R-squared of about 0.71, but with only 47 data points and 15 predictors it is probably overfit; the predicted crime rate of about 155 is far below the observed crime rates in the data set, which supports that concern. Model selection should also consider information criteria such as AIC and BIC (SBC), along with validation on separate training and test sets, and the small sample size should be kept in mind when deciding how many predictors to include.
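As a follow-up along those lines, the full and reduced models could be compared with AIC, and the reduced model could be used for an alternative prediction on the same test point (output not shown):
# Compare the two fitted models by AIC (lower is better)
AIC(datalm)
AIC(datalm0.05)
# Alternative prediction from the reduced model
predict(datalm0.05, test)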