Question 7.1

Describe a situation or problem from your job, everyday life, current events, etc., for which exponential smoothing would be appropriate. What data would you need? Would you expect the value of α (the first smoothing parameter) to be closer to 0 or 1, and why?

The titin protein is involved in protein degradation and stretch-based signaling in the heart, so a fully functional stretch-based signaling complex is associated with a healthy cardiovascular system. Exponential smoothing would be appropriate for tracking cardiovascular health through titin: levels of titin activity would be measured daily, along with diastolic and systolic blood pressure. Because the day-to-day randomness in these measurements is expected to be low, the observations can be trusted, so a value of alpha close to 1 would be chosen.
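
A minimal sketch of how such a model could be fit in R, assuming a hypothetical vector of daily titin-activity measurements (the values and the titin_activity name are illustrative, not real data):

# Hypothetical daily titin-activity measurements (illustrative values only)
titin_activity <- ts(c(4.1, 4.0, 4.2, 4.1, 4.3, 4.2, 4.1, 4.2, 4.4, 4.3))

# Simple exponential smoothing: no trend or seasonal component;
# alpha fixed near 1 because day-to-day randomness is expected to be low
titin_fit <- HoltWinters(titin_activity, alpha = 0.9, beta = FALSE, gamma = FALSE)
titin_fit$fitted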

Question 7.2

Using the 20 years of daily high temperature data for Atlanta (July through October) from Question 6.2 (file temps.txt), build and use an exponential smoothing model to help make a judgment of whether the unofficial end of summer has gotten later over the 20 years. (Part of the point of this assignment is for you to think about how you might use exponential smoothing to answer this question. Feel free to combine it with other models if you’d like to. There’s certainly more than one reasonable approach.)
Note: in R, you can use either HoltWinters (simpler to use) or the smooth package's es function (harder to use, but more general). If you use es, the Holt-Winters model uses model="AAM" in the function call (the first and second constants are used "A"dditively, and the third (seasonality) is used "M"ultiplicatively; the documentation doesn't make that clear).

temps <- read.table("\\Users\\sethm\\OneDrive\\Desktop\\week_3_data-summer\\data 7.2\\temps.txt" , stringsAsFactors = FALSE, header = TRUE)
head(temps)
##     DAY X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 1-Jul    98    86    91    84    89    84    90    73    82    91    93    95
## 2 2-Jul    97    90    88    82    91    87    90    81    81    89    93    85
## 3 3-Jul    97    93    91    87    93    87    87    87    86    86    93    82
## 4 4-Jul    90    91    91    88    95    84    89    86    88    86    91    86
## 5 5-Jul    89    84    91    90    96    86    93    80    90    89    90    88
## 6 6-Jul    93    84    89    91    96    87    93    84    90    82    81    87
##   X2008 X2009 X2010 X2011 X2012 X2013 X2014 X2015
## 1    85    95    87    92   105    82    90    85
## 2    87    90    84    94    93    85    93    87
## 3    91    89    83    95    99    76    87    79
## 4    90    91    85    92    98    77    84    85
## 5    88    80    88    90   100    83    86    84
## 6    82    87    89    90    98    83    87    84
# Flatten the daily high temperatures for all 20 years into a single vector
 
Avg_daily <- as.vector(unlist(temps[,2:21]))
 
# Time series transformation
 
Avg_daily.ts<-ts(data = Avg_daily, frequency=123, start=1996)
summary(Avg_daily)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   50.00   79.00   85.00   83.34   90.00  105.00
set.seed(1)
 
# Graph time series data
 
ts.plot(Avg_daily.ts)

# Perform Exponential Smoothing
 
Avg_daily.ts_forecast <- HoltWinters(Avg_daily.ts, alpha=NULL, beta=NULL, gamma=NULL, seasonal = "multiplicative") 
summary(Avg_daily.ts_forecast)
##              Length Class  Mode     
## fitted       9348   mts    numeric  
## x            2460   ts     numeric  
## alpha           1   -none- numeric  
## beta            1   -none- numeric  
## gamma           1   -none- numeric  
## coefficients  125   -none- numeric  
## seasonal        1   -none- character
## SSE             1   -none- numeric  
## call            6   -none- call
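
# Because alpha, beta, and gamma were left as NULL, HoltWinters selects them by
# minimizing the in-sample squared prediction error. The chosen values and the
# model SSE can be inspected directly (output not reproduced here):
Avg_daily.ts_forecast$alpha
Avg_daily.ts_forecast$beta
Avg_daily.ts_forecast$gamma
Avg_daily.ts_forecast$SSE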
# Graph the exponentially smoothed data 
 
plot(Avg_daily.ts_forecast)

head(Avg_daily.ts_forecast$fitted)
##          xhat    level        trend   season
## [1,] 87.23653 82.87739 -0.004362918 1.052653
## [2,] 90.42182 82.15059 -0.004362918 1.100742
## [3,] 92.99734 81.91055 -0.004362918 1.135413
## [4,] 90.94030 81.90763 -0.004362918 1.110338
## [5,] 83.99917 81.93634 -0.004362918 1.025231
## [6,] 84.04496 81.93247 -0.004362918 1.025838
which(temps$DAY=="1-Sep")
## [1] 63
# Let's pull out the seasonality factors into a separate object and add the row/column names for ease of use
temps_hw_sf <- matrix(Avg_daily.ts_forecast$fitted[,4], nrow=123)
head(temps_hw_sf)
##          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
## [1,] 1.052653 1.049468 1.120607 1.103336 1.118390 1.108172 1.140906 1.140574
## [2,] 1.100742 1.099653 1.108025 1.098323 1.110184 1.116213 1.126827 1.154074
## [3,] 1.135413 1.135420 1.139096 1.142831 1.143201 1.138495 1.129678 1.156092
## [4,] 1.110338 1.110492 1.117079 1.125774 1.134539 1.126117 1.130758 1.137722
## [5,] 1.025231 1.025233 1.044684 1.067291 1.084725 1.097239 1.115055 1.103877
## [6,] 1.025838 1.025722 1.028169 1.042340 1.053954 1.067494 1.080203 1.094312
##          [,9]    [,10]    [,11]    [,12]    [,13]    [,14]    [,15]    [,16]
## [1,] 1.125438 1.122063 1.161415 1.198102 1.198910 1.243012 1.243781 1.238435
## [2,] 1.142187 1.131889 1.144549 1.134661 1.153433 1.165431 1.172935 1.190735
## [3,] 1.165657 1.147982 1.149459 1.135756 1.153310 1.155197 1.157286 1.169773
## [4,] 1.150639 1.146992 1.142497 1.150162 1.151169 1.157751 1.163844 1.159343
## [5,] 1.120818 1.133733 1.132167 1.142714 1.139244 1.112909 1.132435 1.132045
## [6,] 1.102680 1.092178 1.075766 1.088547 1.082185 1.103092 1.115071 1.118575
##         [,17]    [,18]    [,19]
## [1,] 1.300204 1.290647 1.254521
## [2,] 1.191956 1.219190 1.228826
## [3,] 1.189915 1.172309 1.169045
## [4,] 1.166605 1.167993 1.158956
## [5,] 1.145230 1.168161 1.170449
## [6,] 1.121598 1.134962 1.145475
# Label the seasonal-factor matrix, then compute the average seasonal factor for each year
colnames(temps_hw_sf) <- colnames(temps[,3:21])
rownames(temps_hw_sf) <- temps[,1]

meanSeasonalperyr <- vector()
for (i in 1:ncol(temps_hw_sf)){
  meanSeasonalperyr[i] = mean(temps_hw_sf[,i])
}
meanSeasonalperyr
##  [1] 1.0000000 0.9981882 0.9980547 0.9975095 0.9971271 0.9961954 0.9955108
##  [8] 0.9957393 0.9956360 0.9949667 0.9948162 0.9946495 0.9943300 0.9941211
## [15] 0.9940949 0.9933907 0.9932476 0.9935215 0.9928824
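# The same per-year averages could be obtained in a single call (same result as the loop):
# colMeans(temps_hw_sf)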
head(temps_hw_sf)
##          X1997    X1998    X1999    X2000    X2001    X2002    X2003    X2004
## 1-Jul 1.052653 1.049468 1.120607 1.103336 1.118390 1.108172 1.140906 1.140574
## 2-Jul 1.100742 1.099653 1.108025 1.098323 1.110184 1.116213 1.126827 1.154074
## 3-Jul 1.135413 1.135420 1.139096 1.142831 1.143201 1.138495 1.129678 1.156092
## 4-Jul 1.110338 1.110492 1.117079 1.125774 1.134539 1.126117 1.130758 1.137722
## 5-Jul 1.025231 1.025233 1.044684 1.067291 1.084725 1.097239 1.115055 1.103877
## 6-Jul 1.025838 1.025722 1.028169 1.042340 1.053954 1.067494 1.080203 1.094312
##          X2005    X2006    X2007    X2008    X2009    X2010    X2011    X2012
## 1-Jul 1.125438 1.122063 1.161415 1.198102 1.198910 1.243012 1.243781 1.238435
## 2-Jul 1.142187 1.131889 1.144549 1.134661 1.153433 1.165431 1.172935 1.190735
## 3-Jul 1.165657 1.147982 1.149459 1.135756 1.153310 1.155197 1.157286 1.169773
## 4-Jul 1.150639 1.146992 1.142497 1.150162 1.151169 1.157751 1.163844 1.159343
## 5-Jul 1.120818 1.133733 1.132167 1.142714 1.139244 1.112909 1.132435 1.132045
## 6-Jul 1.102680 1.092178 1.075766 1.088547 1.082185 1.103092 1.115071 1.118575
##          X2013    X2014    X2015
## 1-Jul 1.300204 1.290647 1.254521
## 2-Jul 1.191956 1.219190 1.228826
## 3-Jul 1.189915 1.172309 1.169045
## 4-Jul 1.166605 1.167993 1.158956
## 5-Jul 1.145230 1.168161 1.170449
## 6-Jul 1.121598 1.134962 1.145475
which(temps$DAY=="1-Oct")
## [1] 93
# Average seasonal factor for the first fitted year (1997); the 1996 season is used by HoltWinters for initialization
meanSF_firstYear <- mean(temps_hw_sf[,1])
meanSF_firstYear
## [1] 1
cusum_decrease = function(data, mean, T, C){
  cusum = 0
  # Evaluate every row; previously the last row was skipped and never tested
  for (rowCounter in 1:nrow(data)){
    current = data[rowCounter,]
    cusum = max(0, cusum + (mean - current - C))
    if (cusum >= T) {
      return(rowCounter)  # first row (day) on which the decrease is detected
    }
  }
  return(NA)  # no change detected within the season
}
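
The function implements the one-sided CUSUM statistic for detecting a decrease: starting from S_0 = 0, it updates S_t = max(0, S_(t-1) + (mean - x_t - C)) and flags a change on the first day that S_t >= T. Applied to the seasonal factors, which sit above 1 in midsummer and decline as autumn approaches, the flagged day serves as a proxy for the unofficial end of summer.
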
# Set C to 0.5 standard deviations and T to 3 standard deviations of the first year's seasonal factors.
C_var = sd(temps_hw_sf[,1])*0.5
T_var = sd(temps_hw_sf[,1])*3

# Now run CUSUM on each year's seasonal factors to find the day on which the cumulative decrease breached the threshold T.
# The mean seasonal factor of the first fitted year (which equals 1) is used as the baseline mean.
result_vector = vector()
for (col in 1:ncol(temps_hw_sf)){
  result_vector[col] = cusum_decrease(data = as.matrix(temps_hw_sf[,col]), mean = 1,T = T_var, C = C_var)
}

CUSUM_RESULTS = data.frame(Year = colnames(temps_hw_sf),Day = temps[result_vector,1])
CUSUM_RESULTS
##     Year    Day
## 1  X1997 30-Sep
## 2  X1998  1-Oct
## 3  X1999  1-Oct
## 4  X2000  1-Oct
## 5  X2001  2-Oct
## 6  X2002  2-Oct
## 7  X2003  3-Oct
## 8  X2004  3-Oct
## 9  X2005  4-Oct
## 10 X2006  4-Oct
## 11 X2007  5-Oct
## 12 X2008  5-Oct
## 13 X2009  5-Oct
## 14 X2010  5-Oct
## 15 X2011  3-Oct
## 16 X2012  3-Oct
## 17 X2013  3-Oct
## 18 X2014  4-Oct
## 19 X2015  4-Oct

The CUSUM analysis of the Holt-Winters seasonal factors shows the detected end-of-summer date drifting from 30-Sep in 1997 to early October by the late 2000s, with a slight pullback after 2010. Overall, the unofficial end of summer does appear to have gotten somewhat later over the years 1997-2015.
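
To make that judgment more concrete, the detected day index could be regressed on year and the sign of the slope examined; a minimal sketch using the result_vector object created above (output not shown):

# Regress the detected end-of-summer day (row index within the season) on year
drift_fit <- lm(result_vector ~ seq_along(result_vector))
summary(drift_fit)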

Question 8.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.

Linear regression analysis is used to predict the annual budget forecast for the Campus Safety, Health and Environmental Management Association. The predictors used in the analysis are waste disposal per institution, number of students per institution, number of medical students, research budget per institution, and active laboratory volume per institution. These variables are collected from the Post Secondary Education Survey and the National Science Foundation.
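
A minimal sketch of the corresponding model call, assuming a hypothetical data frame budget_data with one row per institution and illustrative column names (none of these objects are part of the data used below):

# Hypothetical annual budget model for the association
budget_fit <- lm(budget ~ waste_disposal + n_students + n_medical_students +
                   research_budget + lab_volume, data = budget_data)
summary(budget_fit)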

Question 8.2

Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data: M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0

Show your model (factors used and their coefficients), the software output, and the quality of fit.

Note that because there are only 47 data points and 15 predictors, you’ll probably notice some overfitting. We’ll see ways of dealing with this sort of problem later in the course.

crime <- read.table("\\Users\\sethm\\OneDrive\\Desktop\\week_3_data-summer\\data 8.2\\uscrime.txt" , stringsAsFactors = FALSE, header = TRUE)
head(crime)
##      M So   Ed  Po1  Po2    LF   M.F Pop   NW    U1  U2 Wealth Ineq     Prob
## 1 15.1  1  9.1  5.8  5.6 0.510  95.0  33 30.1 0.108 4.1   3940 26.1 0.084602
## 2 14.3  0 11.3 10.3  9.5 0.583 101.2  13 10.2 0.096 3.6   5570 19.4 0.029599
## 3 14.2  1  8.9  4.5  4.4 0.533  96.9  18 21.9 0.094 3.3   3180 25.0 0.083401
## 4 13.6  0 12.1 14.9 14.1 0.577  99.4 157  8.0 0.102 3.9   6730 16.7 0.015801
## 5 14.1  0 12.1 10.9 10.1 0.591  98.5  18  3.0 0.091 2.0   5780 17.4 0.041399
## 6 12.1  0 11.0 11.8 11.5 0.547  96.4  25  4.4 0.084 2.9   6890 12.6 0.034201
##      Time Crime
## 1 26.2011   791
## 2 25.2999  1635
## 3 24.3006   578
## 4 29.9012  1969
## 5 21.2998  1234
## 6 20.9995   682
# Regression model on the unscaled data

datalm<-lm(Crime~M+So+Ed+Po1+Po2+LF+M.F+Pop+NW+U1+U2+Wealth+Ineq+Prob+Time,crime)



# Examine the coefficient p-values and overall fit of the regression model

summary(datalm)
## 
## Call:
## lm(formula = Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + 
##     NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -395.74  -98.09   -6.69  112.99  512.67 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
## M            8.783e+01  4.171e+01   2.106 0.043443 *  
## So          -3.803e+00  1.488e+02  -0.026 0.979765    
## Ed           1.883e+02  6.209e+01   3.033 0.004861 ** 
## Po1          1.928e+02  1.061e+02   1.817 0.078892 .  
## Po2         -1.094e+02  1.175e+02  -0.931 0.358830    
## LF          -6.638e+02  1.470e+03  -0.452 0.654654    
## M.F          1.741e+01  2.035e+01   0.855 0.398995    
## Pop         -7.330e-01  1.290e+00  -0.568 0.573845    
## NW           4.204e+00  6.481e+00   0.649 0.521279    
## U1          -5.827e+03  4.210e+03  -1.384 0.176238    
## U2           1.678e+02  8.234e+01   2.038 0.050161 .  
## Wealth       9.617e-02  1.037e-01   0.928 0.360754    
## Ineq         7.067e+01  2.272e+01   3.111 0.003983 ** 
## Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *  
## Time        -3.479e+00  7.165e+00  -0.486 0.630708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.1 on 31 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
## F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07
# Create the test data point manually as a data frame

test <-data.frame(M = 14.0,So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.040,Time = 39.0)
test
##    M So Ed Po1  Po2   LF M.F Pop  NW   U1  U2 Wealth Ineq Prob Time
## 1 14  0 10  12 15.5 0.64  94 150 1.1 0.12 3.6   3200 20.1 0.04   39
# Remove predictors whose p-values exceed a target alpha (say p > 0.05) and refit

datalm0.05 <-lm(Crime~M+Ed+Ineq+Prob,crime)

summary(datalm0.05)
## 
## Call:
## lm(formula = Crime ~ M + Ed + Ineq + Prob, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -532.97 -254.03  -55.72  137.80  960.21 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1339.35    1247.01  -1.074  0.28893   
## M              35.97      53.39   0.674  0.50417   
## Ed            148.61      71.92   2.066  0.04499 * 
## Ineq           26.87      22.77   1.180  0.24458   
## Prob        -7331.92    2560.27  -2.864  0.00651 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 347.5 on 42 degrees of freedom
## Multiple R-squared:  0.2629, Adjusted R-squared:  0.1927 
## F-statistic: 3.745 on 4 and 42 DF,  p-value: 0.01077
# Predict the crime rate at the test point using the full model
# (stored in a new object so the crime data frame is not overwritten)
pred_full <- predict(datalm, test)
pred_full
##        1 
## 155.4349
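For comparison, the reduced four-predictor model could be used to predict the same test point (output not shown here):

# Prediction from the reduced model for the same data point
predict(datalm0.05, test)
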
The full model is statistically significant overall (p-value well below 0.05) with an adjusted R-squared of about 0.71, yet its prediction of roughly 155 for the test point falls well below the crime rates observed in the data set, a sign that the model is probably overfit. Other information criteria such as AIC and BIC, together with training/validation splits or cross-validation, should be used when selecting a model, and the statistical power and sample size (47 observations for 15 predictors) should also be considered.
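
A minimal sketch of comparing the two fitted models with information criteria, as suggested above (lower values indicate a better trade-off between fit and complexity; output not shown):

# Compare the full and reduced models by AIC and BIC
AIC(datalm, datalm0.05)
BIC(datalm, datalm0.05)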