title: “Report 1” |
author: “Danny Chen” |
date: “April 14, 2016” |
output: html_document |
load("Stat_224.RData")
attach(P054)
plot(Sunday,Daily,main="Newspapers Sunday and Daily Circulation in Thousands",xlab="Daily Circulation",ylab="Sunday Circulations")
From looking at the plot, there appears to be a positive linear relationship between a newspaper’s daily circulation and it’s Sunday circulation. This seems like a plausible and likely relationship. If a newspaper’s daily edition is more popular, then likely its Sunday edition is more popular as well.
plot(Sunday,Daily,main="Newspapers Sunday and Daily Circulation in Thousands",xlab="Daily Circulation",ylab="Sunday Circulations")
fit <- lm(Sunday~Daily,data=P054)
abline(fit,col="blue")
Formula for 95% confidence interval of :
\(\hat{\beta_1}\) \(\pm\) t(.975,n-2) * s.e.(\(\hat{\beta_1}\))
where
\(\hat{\beta_1}\) = \(\Sigma\)(yi - \(\bar{y}\))*(xi- \(\bar{x}\)) / \(\Sigma\)(xi-\(\bar{x}\))^2
s.e.(\(\hat{\beta_1}\)) = \(\hat{\sigma}\) / \(\sqrt{\Sigma(xi-\bar{x})}\)
Formula for 95% confidence interval of :
\(\hat{\beta_0}\) \(\pm\) t(.975,n-2) * s.e.(\(\hat{\beta_0}\))
where
\(\hat{\beta_0}\) = \(\bar{y}\) - \(\hat{\beta_1}\) * \(\bar{x}\)
s.e.(\(\hat{\beta_0}\)) = \(\hat{\sigma}\)\(\sqrt{1/n + \bar{x}^{2}/\Sigma(xi-\bar{x})^{2}}\)
qt(.975,32)
## [1] 2.036933
summary(fit)
##
## Call:
## lm(formula = Sunday ~ Daily, data = P054)
##
## Residuals:
## Min 1Q Median 3Q Max
## -255.19 -55.57 -20.89 62.73 278.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.83563 35.80401 0.386 0.702
## Daily 1.33971 0.07075 18.935 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 109.4 on 32 degrees of freedom
## Multiple R-squared: 0.9181, Adjusted R-squared: 0.9155
## F-statistic: 358.5 on 1 and 32 DF, p-value: < 2.2e-16
\(\hat{\beta_0}\) = 13.83563
s.e.(\(\hat{\beta_0}\)) = 35.80401
\(\hat{\beta_1}\) = 1.33971
s.e.(\(\hat{\beta_1}\)) = 0.07075
n = 34
t(.975,32) = 2.0369
95% confidence interval for \(\hat{\beta_1}\) = 1.33971 \(\pm\) 0.1441
95% confidence interval for \(\hat{\beta_0}\) = 13.83563 \(\pm\) 72.9292
The following are my hypothesises:
H\(_{0}\): \(\hat{\beta_1}\) = 0; H\(_{a}\): \(\hat{\beta_1}\) \(\neq\) 0
t\(_{1}\) = \(\hat{\beta_1}\) - 0 / s.e(\(\hat{\beta_1}\)) = 1.33971 / 0.07075 = 18.93583
critical t value = t(n-2,\(\alpha\)/2) = t(32,.025) = 2.036933
|t\(_{1}\)| > critical t value means that must be reject the null hypothesis that there is no linear correlation between Sunday circulation and Daily circulation. My conclusion is thus the inverse, that there is actually a correlation and that it is postive because the slope coefficient is positive.
The R, correlation coefficient formula =
Cor(Y,X) = r = \(\Sigma\)(yi-\(\bar{y}\))(xi-\(\bar{x}\)) / \(\sqrt{\)(yi-\(\bar{y}\))\(^{2}\)\(\Sigma(xi-\){x}\()\)^{2}\(}\)$
r = 0.9581753 r\(^{2}\) = 0.9181
.9181 of the Sunday circulation can be accounted for by the daily circulation.
(\(\hat{\mu_{0}}\)) \(\pm\) t(n-2,\(\alpha\)/2) * s.e.(\(\hat{\mu_{0}}\))
where
(\(\mu_{0}\)) = \(\hat{\beta_0}\) + \(\hat{\beta_1}\) * (x\(_{0}\))
s.e.((\(\hat{\mu_{0}}\))) = \(\hat{\sigma}\)\(\sqrt{1/n + (xo - \bar{x})^{2}/\Sigma(xi-\bar{x})^{2}}\)
\(\hat{\sigma}\) = \(\sqrt{\Sigma\epsilon_{i}^{2} / n-p-1}\)
anova(fit)
## Analysis of Variance Table
##
## Response: Sunday
## Df Sum Sq Mean Sq F value Pr(>F)
## Daily 1 4292653 4292653 358.53 < 2.2e-16 ***
## Residuals 32 383136 11973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
b0=13.83563
b1=1.33971
mu = 13.83563 + 1.33971 *500000
sigma = sqrt(383136/32)
se.mu = sigma * sqrt(1/34 + ((500000-mean(Daily))^2 / (4292653)))
interval = se.mu * qt(.975,32)
mu
## [1] 669868.8
interval
## [1] 53741.66
daily = data.frame(Daily=500000)
predict(fit,daily,interval="predict")
## fit lwr upr
## 1 669871.2 597872.4 741870
Interval = 669868.8 \(\pm\) 53741.66
= (\(\hat{y_{0}}\)) \(\pm\) t(n-2,\(\alpha\)/2) * s.e.(\(\hat{y_{0}}\))
where
(\(\hat{y_{0}}\)) = \(\hat{\beta_0}\) + \(\hat{\beta_1}\) * (x\(_{0}\))
s.e.(\(\hat{y_{0}}\)) = \(\hat{\sigma}\)\(\sqrt{1 + 1/n + (xo - \bar{x})^{2}/\Sigma(xi-\bar{x})^{2}}\)
anova(fit)
## Analysis of Variance Table
##
## Response: Sunday
## Df Sum Sq Mean Sq F value Pr(>F)
## Daily 1 4292653 4292653 358.53 < 2.2e-16 ***
## Residuals 32 383136 11973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
y0 = 13.83563 + 1.33971 *500000
sigma = sqrt(383136/32)
se.y0 = sigma * sqrt(1 + 1/34 + ((500000-mean(Daily))^2 / (4292653)))
interval = se.y0 * qt(.975,32)
y0
## [1] 669868.8
interval
## [1] 53742.12
interval = 669868.8 \(\pm\) 53742.12
The particular newspaper’s interval of Sunday circulation is larger than the average newspaper’s interval because the former has smaller variance.
y0 = 13.83563 + 1.33971 *2000000
se.y0 = sigma * sqrt(1 + 1/34 + ((2000000-mean(Daily))^2 / (4292653)))
interval = se.y0 * qt(.975,32)
y0
## [1] 2679434
interval
## [1] 215105.8
Interval for particular Sunday circulation when daily circulation is 2000000 = 2679434 \(\pm\) 215105.8
This interval is different from g) because it has a much larger mean and confidence interval. I suspect that the model would not be reliable for such large predictor values because it is far too large from the original scale of sample data.
attach(P088)
sales <- lm(Sales~Age+HS+Income+Black+Female+Price,data=P088)
summary(sales)
##
## Call:
## lm(formula = Sales ~ Age + HS + Income + Black + Female + Price,
## data = P088)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.398 -12.388 -5.367 6.270 133.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.34485 245.60719 0.421 0.67597
## Age 4.52045 3.21977 1.404 0.16735
## HS -0.06159 0.81468 -0.076 0.94008
## Income 0.01895 0.01022 1.855 0.07036 .
## Black 0.35754 0.48722 0.734 0.46695
## Female -1.05286 5.56101 -0.189 0.85071
## Price -3.25492 1.03141 -3.156 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared: 0.3208, Adjusted R-squared: 0.2282
## F-statistic: 3.464 on 6 and 44 DF, p-value: 0.006857
I expect Age and Sales to have no linear relationship because although age’s predictor coefficient is 4.52 I cannot disclude the possibility of no relationship because p value .167 is large at even at the 85% confidence level.
I expect HS and Sales to have no relationship because HS’s predictor coefficient is -.06159 which is close to 0 and its p value is .94 which is enormous supporting the null hypothesis.
I expect Income and Sales to have a small positve relationship because the predictor coefficient is .01895 but its p value .07036 is significant at the 90% confidence level.
I expect Black and Sales to probably have no linear correlation because the predictor coefficient .35754 is close to 0 and its p value is .46695 which is large.
I expect Female and Sales to have no linear relationship because the p value is .85 which is large to support null hypothesis that coefficient is 0.
I expect Price and Sales to have a negative relationship because it has a negative coefficient value well below 0 and its p value,.00289, is very low.
cor(P088)
## Age HS Income Black Female
## Age 1.00000000 -0.09891626 0.25658098 -0.04033021 0.55303189
## HS -0.09891626 1.00000000 0.53400534 -0.50171191 -0.41737794
## Income 0.25658098 0.53400534 1.00000000 0.01728756 -0.06882666
## Black -0.04033021 -0.50171191 0.01728756 1.00000000 0.45089974
## Female 0.55303189 -0.41737794 -0.06882666 0.45089974 1.00000000
## Price 0.24775673 0.05697473 0.21455717 -0.14777619 0.02247351
## Sales 0.22655492 0.06669476 0.32606789 0.18959037 0.14622124
## Price Sales
## Age 0.24775673 0.22655492
## HS 0.05697473 0.06669476
## Income 0.21455717 0.32606789
## Black -0.14777619 0.18959037
## Female 0.02247351 0.14622124
## Price 1.00000000 -0.30062263
## Sales -0.30062263 1.00000000
plot(P088)
After comparing the matrix to the plot, I noticed that there although there were no statistical disagreements, that there were important discrepancies and features that the plot captured that were not represented by the correlation coefficients. Usually it was one of two cases: either the plot was very obviously non linear as the points were scattered everywhere or the points were very apparently represented in a horizontal/vertical line which should have indicated no linear relationship.
It definitely is not enough to just check one, especially not just correlation coefficient because it again can miss important features of the data.
Specifically: For Black and age, coefficient was about -.04, I expected to see a very shallow downward line, but in the plot I saw a very vertical line of data that the coefficient missed. Similarly in female and high school, coefficient was about -.41 but again the data looked very vertical in the plot. In Black and Price, the coefficient was about -.147 but the plot showed a very horizontal looking line.
The other two examples, price versus high school and price versus income, the coefficients were about .0569 and .21455 respectively, but on the plot, the data were very scattered to the point in which I doubted there was any linear relationship.
age <-lm(Sales~Age,data=P088)
hs <-lm(Sales~HS,data=P088)
income <-lm(Sales~Income,data=P088)
black <-lm(Sales~Black,data=P088)
female <-lm(Sales~Female,data=P088)
price <- lm(Sales~Price,data=P088)
coef(age)
## (Intercept) Age
## 15.219199 3.870946
coef(hs)
## (Intercept) HS
## 107.3330520 0.2673262
coef(income)
## (Intercept) Income
## 55.36245411 0.01758339
coef(black)
## (Intercept) Black
## 116.7377984 0.4807148
coef(female)
## (Intercept) Female
## -93.426013 4.219098
coef(price)
## (Intercept) Price
## 210.453051 -2.335207
Yes, there are some differences between what I expected and the regression coeficients of the predictors, but I think this different is attributed to me taking into account p value in my expectations.
For age, the predictor coefficient is 3.87 which suggests a positive relationship but I expected no linear relationship due to a large p value.
For hs, the predictor coefficient, .267 is small and close to zero, it might suggest no relationship which is what I expected based on a large p value.
For income, the predictor coefficient was .017 which is very close to zero which possibly suggests no relationship but I expected a very small postitive relationship due to a very small p.
For black, I expected no linear relationship but the correlation coefficent shows positive, again, same reasoning as previous variables.
For female, there appears to be a postive relationship with sales, but once again, goes against my expectations due to a large p value.
Lastly, for price, price coefficient is -2.355 which actually agrees with my expectations of a negative relationship with sales due to a small p value.
If you compare the the regression coefficients of the single linear regression of sales with each pair-wise correlation coefficient of the multiple linear regression, you will find that the values are different. This is because regression coefficents and correlation coefficents aren’t the same: regression coefficient of the predictor determines the expected slope between the two variables and can take values between infinity and negative infinite while correlation coefficient measures the strength of linear correlation and takes values between 1 and -1.
Linearity assumption: I test the linearity of the model by plotting the predicted Sales modeled on each predictor variable individually against the residuals of predicted and observed sales of the respective model. Under the linearity assumption, each predictor variable should have a linear relationship with Sales, and we can prove each by looking at the plot and expecting to see random variance around line with slope = 0 and intercept = 0.
residage = resid(age)
residincome = resid(income)
residblack = resid(black)
residprice = resid(price)
plot(fitted.values(age),residage, main="Predicted Sales Accounted by Age Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(0,0,col="red")
plot(fitted.values(income),residincome, main="Predicted Sales Accounted by Income Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")
plot(fitted.values(black),residblack, main="Predicted Sales Accounted by Black Population\n Against Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")
plot(fitted.values(price),residprice,main="Predicted Sales Accounted by Price Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")
From the third plot with sales and black population, it does not look there is random variance because all the points seem to be skewed towards the left side, or the lower predicted values. Thus, the linearity assumption doesn’t seem to hold well for the black predictor variable.
Normality Assumption of Errors: I test the assumptions that the error term, \(\epsilon\) of the reduced model is normal in distribution. In order to do this, I take a QQ plot which is a plot of the standarized residuals of predicted sales observed quantile value (y axis) and it’s predicted quantile value assuming errors are normally distributed. If the values match, that is, we see a straight line with slope = 1, intercept = 0, then we know the errors are likely normally distributed.
rm <- lm(Sales~Income+Black+Price,data = P088)
rstandardrm <- rstandard(rm)
qqnorm(rstandardrm,ylab="Sample Quartiles with Standarized Residuals")
abline(a=0,b=1,col="red")
From the normal QQ plot, we can see that the data does not fit a normal distribution well because the quantile-quantile plots don’t match with a straight line with slope = 1, intercept = 0. Thus, this contradicts the normality of errors assumption.
Non-homoscedasticity Assumption of Errors: To test if the errors have constant variance, a plot of residuals against predicted values of the reduced model is created. We should not see a change in variance across the line.
residrm = resid(rm)
plot(fitted.values(rm),residrm, main="Predicted Sales Against Residuals \n of Reduced Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sales")
abline(0,0,col="red")
From the Residual plot against predicted values, it does appear that the variance remains relatively uniform throughout the horizontal line with intercept = 0. Thus, the non-homoscedasticity assumption holds.
Conclusion: The reduced model seems to have failed the linearity assumption, specifically the linear relationship between cigarettes sales and the black population does not hold well. It also fails the normality of errors assumption. Thus, further adjustments to the model should be made regarding the black predictor variable and the error term.