The data corresponding to this assignment is given in the text file named TPASTE.txt. The data consist of weekly time series (1958 through 1963, 276 weeks) with the variables in the following order: Time index (1, …, 276), Market share of CREST toothpaste (weekly), Market share of COLGATE toothpaste (weekly), Unit price of CREST toothpaste (weekly), Unit price of COLGATE toothpaste (weekly).

We are interested in modeling the variable “Market share of CREST” as the dependent variable, using the following abbreviations: CRMS = Market share of CREST toothpaste (weekly), COMS = Market share of COLGATE toothpaste (weekly), CRP = Unit price of CREST toothpaste (weekly), COP = Unit price of COLGATE toothpaste (weekly). For statistical inference purposes you can use an 𝜶 (significance) level of 0.05. For each case, please clearly state your hypotheses, rejection criteria, and conclusion where needed.

data<-read.table("/Users/nishanaugain/Desktop/UNH/SEM2/TERM3/DS809/A1/TPASTE.txt")
attach(data)
head(data)
##   V1    V2    V3   V4   V5
## 1  1 0.108 0.424 2.07 2.05
## 2  2 0.166 0.482 2.17 2.10
## 3  3 0.126 0.428 2.17 2.04
## 4  4 0.115 0.397 2.07 1.96
## 5  5 0.119 0.352 2.12 2.06
## 6  6 0.176 0.342 2.05 2.01
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- rename(data,'TimeIndex' = 'V1','CRMS' = 'V2','COMS' = 'V3','CRP' = 'V4','COP' = 'V5')
head(data)
##   TimeIndex  CRMS  COMS  CRP  COP
## 1         1 0.108 0.424 2.07 2.05
## 2         2 0.166 0.482 2.17 2.10
## 3         3 0.126 0.428 2.17 2.04
## 4         4 0.115 0.397 2.07 1.96
## 5         5 0.119 0.352 2.12 2.06
## 6         6 0.176 0.342 2.05 2.01
  1. Obtain the correlation estimates between CRMS, COMS, CRP, and COP (are they significant?). Also obtain pairwise scatter plots of all four variables. Do you suspect an issue of multicollinearity? Why or why not?

solution

x <- cbind(CRMS = data$CRMS, COMS = data$COMS, CRP = data$CRP, COP = data$COP)
cor(x)
##            CRMS       COMS        CRP        COP
## CRMS  1.0000000 -0.7659797 -0.8241672 -0.6534415
## COMS -0.7659797  1.0000000  0.6308614  0.5020123
## CRP  -0.8241672  0.6308614  1.0000000  0.6910539
## COP  -0.6534415  0.5020123  0.6910539  1.0000000
cor.test(data$CRMS,data$COMS)
## 
##  Pearson's product-moment correlation
## 
## data:  data$CRMS and data$COMS
## t = -19.723, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8107278 -0.7123330
## sample estimates:
##        cor 
## -0.7659797
cor.test(data$CRMS,data$CRP)
## 
##  Pearson's product-moment correlation
## 
## data:  data$CRMS and data$CRP
## t = -24.088, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8586793 -0.7822143
## sample estimates:
##        cor 
## -0.8241672
cor.test(data$CRMS,data$COP)
## 
##  Pearson's product-moment correlation
## 
## data:  data$CRMS and data$COP
## t = -14.289, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7162509 -0.5801301
## sample estimates:
##        cor 
## -0.6534415
cor.test(data$COMS,data$CRP)
## 
##  Pearson's product-moment correlation
## 
## data:  data$COMS and data$CRP
## t = 13.459, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5540617 0.6970133
## sample estimates:
##       cor 
## 0.6308614
cor.test(data$COMS,data$COP)
## 
##  Pearson's product-moment correlation
## 
## data:  data$COMS and data$COP
## t = 9.6082, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4081341 0.5853844
## sample estimates:
##       cor 
## 0.5020123
cor.test(data$CRP,data$COP)
## 
##  Pearson's product-moment correlation
## 
## data:  data$CRP and data$COP
## t = 15.826, df = 274, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6238892 0.7480852
## sample estimates:
##       cor 
## 0.6910539

For each pair, the hypotheses are H0: ρ = 0 (no linear correlation) versus H1: ρ ≠ 0, and we reject H0 when the p-value is below 𝛼 = 0.05. The p-values for all six pairs are very small (p-value < 2.2e-16 < 0.05), hence we reject the null hypothesis in every case: all pairwise correlations are significant at the 𝛼 = 5% level.
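As a quick supplementary sketch (not part of the required answer), the six pairwise tests can be run in one pass with combn(); the estimates and p-values match the individual cor.test() calls above.

vars <- c("CRMS", "COMS", "CRP", "COP")
apply(combn(vars, 2), 2, function(pair) {
  ct <- cor.test(data[[pair[1]]], data[[pair[2]]])  # Pearson test for this pair
  c(pair = paste(pair, collapse = "-"),
    r = round(unname(ct$estimate), 3),
    p = format.pval(ct$p.value))
})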

pairs(~ CRMS + COMS + CRP + COP, data = data, lower.panel = NULL)

#pairs(~ CRMS + COMS + CRP + COP,data=data,lower.panel = NULL,
#      col = c("red", "cornflowerblue", "purple","green"),   # Change color by group
#      pch = c(8, 18, 1,11), # Change points by group
#      main = "Scatter Plots")

We do suspect multicollinearity: the candidate predictors COMS, CRP, and COP are moderately to strongly correlated with one another (pairwise correlations between 0.50 and 0.69), and the scatter plots show clear linear relationships among them, so the predictors carry overlapping information.

  2. Obtain the time series plot and the autocorrelation plot of CRMS. Do you think the series is stationary? Why or why not?

solution

plot.ts(data$CRMS, col="gray")
grid()

acf(data$CRMS, lag.max = 30)

Box.test(data$CRMS, lag=30)
## 
##  Box-Pierce test
## 
## data:  data$CRMS
## X-squared = 4656.7, df = 30, p-value < 2.2e-16

The p-value is very small (p-value < 2.2e-16), hence we reject the null hypothesis of no autocorrelation: CRMS is autocorrelated. The ACF also decays very slowly, which suggests non-stationarity. If CRMS were stationary, the autocorrelations at lags beyond 0 would die out quickly and mostly fall within the dashed blue significance bounds; since essentially none of them do, we conclude the series is not stationary.
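A formal unit-root check can back up the visual impression (a sketch; assumes the tseries package is installed). The augmented Dickey-Fuller test has H0: the series contains a unit root (non-stationary) against H1: the series is stationary.

library(tseries)
adf.test(data$CRMS)  # H0: unit root; a large p-value is consistent with non-stationarity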

  3. Estimate a regression between CRMS (dependent variable) and COMS, CRP, and COP (independent variables). Write down the estimated regression model.

solution

mlrfit <- lm(CRMS ~ COMS + CRP + COP, data = data)
summary(mlrfit)
## 
## Call:
## lm(formula = CRMS ~ COMS + CRP + COP, data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.164142 -0.037277  0.000109  0.041295  0.156192 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.68403    0.07084  23.772  < 2e-16 ***
## COMS        -0.72982    0.06698 -10.896  < 2e-16 ***
## CRP         -0.48467    0.04241 -11.427  < 2e-16 ***
## COP         -0.12892    0.04560  -2.827  0.00505 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05904 on 272 degrees of freedom
## Multiple R-squared:  0.7861, Adjusted R-squared:  0.7837 
## F-statistic: 333.2 on 3 and 272 DF,  p-value: < 2.2e-16

Regression Equation:

\[CRMS_t = 1.684 - 0.730\,COMS_t - 0.485\,CRP_t - 0.129\,COP_t + \varepsilon_t\]

  4. Discuss the significance of the individual partial regression coefficients of the model from part 3. Comment on what these coefficients represent in the context of the problem.

solution

Significance of coefficients

  1. Hypothesis test for COMS: H0: 𝛽_COMS = 0 vs. H1: 𝛽_COMS ≠ 0; reject H0 if the p-value is below 𝛼 = 0.05.

Since the p-value for COMS is very small (p-value < 2e-16, less than 𝛼), we reject the null hypothesis. There is evidence that COMS is a significant predictor of CRMS given the effects of CRP and COP.

  2. Hypothesis test for CRP: H0: 𝛽_CRP = 0 vs. H1: 𝛽_CRP ≠ 0; reject H0 if the p-value is below 𝛼 = 0.05.

Since the p-value for CRP is very small (p-value < 2e-16, less than 𝛼), we reject the null hypothesis. There is evidence that CRP is a significant predictor of CRMS given the effects of COMS and COP.

  3. Hypothesis test for COP: H0: 𝛽_COP = 0 vs. H1: 𝛽_COP ≠ 0; reject H0 if the p-value is below 𝛼 = 0.05.

Since the p-value for COP is small (p-value = 0.00505, less than 𝛼), we reject the null hypothesis. There is evidence that COP is a significant predictor of CRMS given the effects of COMS and CRP.

Coefficient Interpretations:

-> 1.684 (intercept): expected value of CRMS when all independent variables (COMS, CRP, and COP) are zero. This is an extrapolation with no practical meaning here, since prices of zero are never observed.

-> -0.730 (coefficient of COMS): expected decrease in CRMS for every unit increase in COMS, holding CRP and COP fixed (controlling for them).

-> -0.485 (coefficient of CRP): expected decrease in CRMS for every unit increase in CRP, holding COMS and COP fixed.

-> -0.129 (coefficient of COP): expected decrease in CRMS for every unit increase in COP, holding COMS and CRP fixed.
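The point estimates can be supplemented with 95% confidence intervals for the coefficients (a quick additional check; intervals that exclude zero agree with the t-tests above):

confint(mlrfit, level = 0.95)  # 95% CIs for the intercept and the three slopes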

  5. Save the residuals from your model and investigate whether the residuals are autocorrelated.
sres <- rstandard(mlrfit)  # standardized residuals from the fitted model


acf(sres, lag.max = 30)

acf(sres, plot=FALSE)
## 
## Autocorrelations of series 'sres', by lag
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## 1.000 0.502 0.456 0.412 0.456 0.416 0.378 0.398 0.366 0.375 0.387 0.353 0.330 
##    13    14    15    16    17    18    19    20    21    22    23    24 
## 0.349 0.278 0.313 0.336 0.361 0.320 0.292 0.313 0.297 0.271 0.248 0.222
Box.test(sres, lag=20)
## 
##  Box-Pierce test
## 
## data:  sres
## X-squared = 771, df = 20, p-value < 2.2e-16
#Plots for checking residuals visually 
#plot(mlrfit)

The p-value is very small (p-value < 2.2e-16), hence we reject the null hypothesis of no autocorrelation. This implies that the residuals are autocorrelated.
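A common complementary check for first-order autocorrelation is the Durbin-Watson test (a sketch; assumes the lmtest package is installed), with H0: no first-order autocorrelation in the errors:

library(lmtest)
dwtest(mlrfit)  # DW statistic near 2 suggests no first-order autocorrelation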

  6. Investigate whether the residuals are normally distributed.
shapiro.test(sres)
## 
##  Shapiro-Wilk normality test
## 
## data:  sres
## W = 0.99547, p-value = 0.6016
hist(sres,breaks=15, col="steelblue")

Visually, the histogram looks fairly symmetric. For the Shapiro-Wilk test, H0: the residuals are normally distributed; since the p-value (0.6016) is not less than 𝛼 (0.05), we fail to reject the null hypothesis. This implies that the residuals can be treated as normally distributed.
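A normal Q-Q plot of the standardized residuals gives a complementary visual check; points falling close to the reference line are consistent with normality:

qqnorm(sres)               # sample quantiles vs. theoretical normal quantiles
qqline(sres, col = "red")  # reference line through the quartiles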

  7. Investigate presence of heteroskedasticity (non-constant variance) in the residuals.
#install.packages("skedastic")
library(skedastic)
white_lm(mlrfit, interactions = TRUE)
## # A tibble: 1 x 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      20.9  0.0132         9 White's Test greater

For White's test, H0: the error variance is constant (homoskedasticity). Since the p-value (0.0132) is less than 𝛼 (0.05), we reject the null hypothesis: the residuals are heteroskedastic.
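As a cross-check (a sketch; assumes the lmtest package is installed), the Breusch-Pagan test, with H0: constant error variance, can be run on the same fitted model:

library(lmtest)
bptest(mlrfit)  # small p-value indicates heteroskedasticity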

  8. Investigate if multi-collinearity is an issue in your model by computing the respective VIF measures. If multi-collinearity is found to be an issue, re-estimate a model that does not have this issue.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(mlrfit)
##     COMS      CRP      COP 
## 1.684458 2.411638 1.941004

All VIF estimates are well below the common cutoff of 10 (indeed below 5), so multicollinearity is not an issue in this model and no re-estimation is needed.
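As a sketch of where these numbers come from, VIF_j = 1 / (1 - R²_j), where R²_j is from regressing predictor j on the remaining predictors; for example, the VIF for COMS can be reproduced manually:

aux <- lm(COMS ~ CRP + COP, data = data)  # auxiliary regression for COMS
1 / (1 - summary(aux)$r.squared)          # matches vif(mlrfit)["COMS"]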

  9. What is your best estimate (forecast) of CRMS when COMS = 0.30, CRP = 1.7, COP = 2.1? What is the 99% prediction interval for your forecast?
predict(mlrfit, data.frame(COMS = 0.30, CRP = 1.7, COP = 2.1), interval = "prediction", level = 0.99)
##         fit       lwr       upr
## 1 0.3704291 0.2119287 0.5289296
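The point forecast is about 0.370, with a 99% prediction interval of roughly (0.212, 0.529). As a quick arithmetic check (a sketch), the forecast can be reproduced by hand from the fitted coefficients:

sum(coef(mlrfit) * c(1, 0.30, 1.7, 2.1))  # 1.684 - 0.730*0.30 - 0.485*1.7 - 0.129*2.1 ≈ 0.370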
  10. Obtain a time-series overlay plot of actual values of CRMS versus your predicted values and the respective 95% prediction intervals.
#Plotting prediction intervals
predint <- predict(mlrfit, interval = "prediction")  # default level = 0.95
## Warning in predict.lm(mlrfit, interval = "prediction"): predictions on current data refer to _future_ responses
confints <- predict(mlrfit, interval = "confidence")  # renamed so it does not mask stats::confint()
predlower <- predint[, 2]
predupper <- predint[, 3]
conflower <- confints[, 2]
confupper <- confints[, 3]

plot(data$CRMS, col="lightgray")
lines(predict(mlrfit), col="red")
lines(predlower,col="orange")
lines(predupper,col="orange")

#lines(conflower,col="blue")
#lines(confupper,col="blue")
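Optionally (a sketch), a legend makes the overlay easier to read:

legend("topleft",
       legend = c("Actual CRMS", "Fitted values", "95% prediction bounds"),
       col = c("lightgray", "red", "orange"),
       pch = c(1, NA, NA), lty = c(NA, 1, 1), bty = "n")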