title: “Report 1”

author: “Danny Chen”

date: “April 14, 2016”

output: html_document

Newspapers

Below is a scatter plot of Sunday circulation versus Daily circulation

load("Stat_224.RData")
attach(P054)
plot(Sunday,Daily,main="Newspapers Sunday and Daily Circulation in Thousands",xlab="Daily Circulation",ylab="Sunday Circulations")

From looking at the plot, there appears to be a positive linear relationship between a newspaper’s daily circulation and it’s Sunday circulation. This seems like a plausible and likely relationship. If a newspaper’s daily edition is more popular, then likely its Sunday edition is more popular as well.

Below, a regression line was fit to the scatter plot.

plot(Sunday,Daily,main="Newspapers Sunday and Daily Circulation in Thousands",xlab="Daily Circulation",ylab="Sunday Circulations")
fit <- lm(Sunday~Daily,data=P054)
abline(fit,col="blue")

95% Confidence intervals for $\hat{\beta_0}$ and $\hat{\beta_1}$

Formula for 95% confidence interval of :

$\hat{\beta_1}$ $\pm$ t(.975,n-2) * s.e.($\hat{\beta_1}$)

where

$\hat{\beta_1}$ = $\Sigma$(yi - $\bar{y}$)*(xi- $\bar{x}$) / $\Sigma$(xi-$\bar{x}$)^2

s.e.($\hat{\beta_1}$) = $\hat{\sigma}$ / $\sqrt{\Sigma(xi-\bar{x})}$

Formula for 95% confidence interval of :

$\hat{\beta_0}$ $\pm$ t(.975,n-2) * s.e.($\hat{\beta_0}$)

where

$\hat{\beta_0}$ = $\bar{y}$ - $\hat{\beta_1}$ * $\bar{x}$

s.e.($\hat{\beta_0}$) = $\hat{\sigma}$$\sqrt{1/n + \bar{x}^{2}/\Sigma(xi-\bar{x})^{2}}$

qt(.975,32)

## [1] 2.036933

summary(fit)

## 
## Call:
## lm(formula = Sunday ~ Daily, data = P054)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -255.19  -55.57  -20.89   62.73  278.17 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.83563   35.80401   0.386    0.702    
## Daily        1.33971    0.07075  18.935   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 109.4 on 32 degrees of freedom
## Multiple R-squared:  0.9181, Adjusted R-squared:  0.9155 
## F-statistic: 358.5 on 1 and 32 DF,  p-value: < 2.2e-16

$\hat{\beta_0}$ = 13.83563

s.e.($\hat{\beta_0}$) = 35.80401

$\hat{\beta_1}$ = 1.33971

s.e.($\hat{\beta_1}$) = 0.07075

n = 34

t(.975,32) = 2.0369

95% confidence interval for $\hat{\beta_1}$ = 1.33971 $\pm$ 0.1441

95% confidence interval for $\hat{\beta_0}$ = 13.83563 $\pm$ 72.9292

To test for a significant relationship between Sunday circulation(Y) and Daily Circulation(X), we will use a two sided t-test at the 95% signficance level.

The following are my hypothesises:

H$_{0}$: $\hat{\beta_1}$ = 0; H$_{a}$: $\hat{\beta_1}$ $\neq$ 0

t$_{1}$ = $\hat{\beta_1}$ - 0 / s.e($\hat{\beta_1}$) = 1.33971 / 0.07075 = 18.93583

critical t value = t(n-2,$\alpha$/2) = t(32,.025) = 2.036933

|t$_{1}$| > critical t value means that must be reject the null hypothesis that there is no linear correlation between Sunday circulation and Daily circulation. My conclusion is thus the inverse, that there is actually a correlation and that it is postive because the slope coefficient is positive.

To find the proportion of variability in Sunday circulation accounted for by daily circulation, we need to find the R$^{2}$ value.

The R, correlation coefficient formula =

Cor(Y,X) = r = $\Sigma$(yi-$\bar{y}$)(xi-$\bar{x}$) / $\sqrt{$(yi-$\bar{y}$)$^{2}$$\Sigma(xi-${x}$)$^{2}$}$$

r = 0.9581753 r$^{2}$ = 0.9181

.9181 of the Sunday circulation can be accounted for by the daily circulation.

95% confidence interval for average Sunday circulation ($\mu_{0}$) at daily circulation (x$_{0}$) = 500000 =

($\hat{\mu_{0}}$) $\pm$ t(n-2,$\alpha$/2) * s.e.($\hat{\mu_{0}}$)

where

($\mu_{0}$) = $\hat{\beta_0}$ + $\hat{\beta_1}$ * (x$_{0}$)

s.e.(($\hat{\mu_{0}}$)) = $\hat{\sigma}$$\sqrt{1/n + (xo - \bar{x})^{2}/\Sigma(xi-\bar{x})^{2}}$

$\hat{\sigma}$ = $\sqrt{\Sigma\epsilon_{i}^{2} / n-p-1}$

anova(fit)

## Analysis of Variance Table
## 
## Response: Sunday
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Daily      1 4292653 4292653  358.53 < 2.2e-16 ***
## Residuals 32  383136   11973                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

b0=13.83563
b1=1.33971
mu = 13.83563 + 1.33971 *500000

sigma = sqrt(383136/32)
se.mu = sigma * sqrt(1/34 + ((500000-mean(Daily))^2 / (4292653)))
interval = se.mu * qt(.975,32)
mu

## [1] 669868.8

interval

## [1] 53741.66

daily = data.frame(Daily=500000)
predict(fit,daily,interval="predict")

##        fit      lwr    upr
## 1 669871.2 597872.4 741870

Interval = 669868.8 $\pm$ 53741.66

95% confidence interval for particular Sunday circulation ($\hat{y_{0}}$) at daily circulation (x$_{0}$) = 500000

= ($\hat{y_{0}}$) $\pm$ t(n-2,$\alpha$/2) * s.e.($\hat{y_{0}}$)

where

($\hat{y_{0}}$) = $\hat{\beta_0}$ + $\hat{\beta_1}$ * (x$_{0}$)

s.e.($\hat{y_{0}}$) = $\hat{\sigma}$$\sqrt{1 + 1/n + (xo - \bar{x})^{2}/\Sigma(xi-\bar{x})^{2}}$

anova(fit)

## Analysis of Variance Table
## 
## Response: Sunday
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Daily      1 4292653 4292653  358.53 < 2.2e-16 ***
## Residuals 32  383136   11973                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

y0 = 13.83563 + 1.33971 *500000

sigma = sqrt(383136/32)
se.y0 = sigma * sqrt(1 + 1/34 + ((500000-mean(Daily))^2 / (4292653)))
interval = se.y0 * qt(.975,32)
y0

## [1] 669868.8

interval

## [1] 53742.12

interval = 669868.8 $\pm$ 53742.12

The particular newspaper’s interval of Sunday circulation is larger than the average newspaper’s interval because the former has smaller variance.

y0 = 13.83563 + 1.33971 *2000000
se.y0 = sigma * sqrt(1 + 1/34 + ((2000000-mean(Daily))^2 / (4292653)))
interval = se.y0 * qt(.975,32)
y0

## [1] 2679434

interval

## [1] 215105.8

Interval for particular Sunday circulation when daily circulation is 2000000 = 2679434 $\pm$ 215105.8

This interval is different from g) because it has a much larger mean and confidence interval. I suspect that the model would not be reliable for such large predictor values because it is far too large from the original scale of sample data.

Cigarettes

Determining Relationship of Sales with each explanatory variable.

attach(P088)
sales <- lm(Sales~Age+HS+Income+Black+Female+Price,data=P088)
summary(sales)

## 
## Call:
## lm(formula = Sales ~ Age + HS + Income + Black + Female + Price, 
##     data = P088)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.398 -12.388  -5.367   6.270 133.213 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 103.34485  245.60719   0.421  0.67597   
## Age           4.52045    3.21977   1.404  0.16735   
## HS           -0.06159    0.81468  -0.076  0.94008   
## Income        0.01895    0.01022   1.855  0.07036 . 
## Black         0.35754    0.48722   0.734  0.46695   
## Female       -1.05286    5.56101  -0.189  0.85071   
## Price        -3.25492    1.03141  -3.156  0.00289 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.17 on 44 degrees of freedom
## Multiple R-squared:  0.3208, Adjusted R-squared:  0.2282 
## F-statistic: 3.464 on 6 and 44 DF,  p-value: 0.006857

I expect Age and Sales to have no linear relationship because although age’s predictor coefficient is 4.52 I cannot disclude the possibility of no relationship because p value .167 is large at even at the 85% confidence level.

I expect HS and Sales to have no relationship because HS’s predictor coefficient is -.06159 which is close to 0 and its p value is .94 which is enormous supporting the null hypothesis.

I expect Income and Sales to have a small positve relationship because the predictor coefficient is .01895 but its p value .07036 is significant at the 90% confidence level.

I expect Black and Sales to probably have no linear correlation because the predictor coefficient .35754 is close to 0 and its p value is .46695 which is large.

I expect Female and Sales to have no linear relationship because the p value is .85 which is large to support null hypothesis that coefficient is 0.

I expect Price and Sales to have a negative relationship because it has a negative coefficient value well below 0 and its p value,.00289, is very low.

cor(P088)

##                Age          HS      Income       Black      Female
## Age     1.00000000 -0.09891626  0.25658098 -0.04033021  0.55303189
## HS     -0.09891626  1.00000000  0.53400534 -0.50171191 -0.41737794
## Income  0.25658098  0.53400534  1.00000000  0.01728756 -0.06882666
## Black  -0.04033021 -0.50171191  0.01728756  1.00000000  0.45089974
## Female  0.55303189 -0.41737794 -0.06882666  0.45089974  1.00000000
## Price   0.24775673  0.05697473  0.21455717 -0.14777619  0.02247351
## Sales   0.22655492  0.06669476  0.32606789  0.18959037  0.14622124
##              Price       Sales
## Age     0.24775673  0.22655492
## HS      0.05697473  0.06669476
## Income  0.21455717  0.32606789
## Black  -0.14777619  0.18959037
## Female  0.02247351  0.14622124
## Price   1.00000000 -0.30062263
## Sales  -0.30062263  1.00000000

plot(P088)

After comparing the matrix to the plot, I noticed that there although there were no statistical disagreements, that there were important discrepancies and features that the plot captured that were not represented by the correlation coefficients. Usually it was one of two cases: either the plot was very obviously non linear as the points were scattered everywhere or the points were very apparently represented in a horizontal/vertical line which should have indicated no linear relationship.

It definitely is not enough to just check one, especially not just correlation coefficient because it again can miss important features of the data.

Specifically: For Black and age, coefficient was about -.04, I expected to see a very shallow downward line, but in the plot I saw a very vertical line of data that the coefficient missed. Similarly in female and high school, coefficient was about -.41 but again the data looked very vertical in the plot. In Black and Price, the coefficient was about -.147 but the plot showed a very horizontal looking line.

The other two examples, price versus high school and price versus income, the coefficients were about .0569 and .21455 respectively, but on the plot, the data were very scattered to the point in which I doubted there was any linear relationship.

age <-lm(Sales~Age,data=P088)
hs <-lm(Sales~HS,data=P088)
income <-lm(Sales~Income,data=P088)
black <-lm(Sales~Black,data=P088)
female <-lm(Sales~Female,data=P088)
price <- lm(Sales~Price,data=P088)

coef(age)

## (Intercept)         Age 
##   15.219199    3.870946

coef(hs)

## (Intercept)          HS 
## 107.3330520   0.2673262

coef(income)

## (Intercept)      Income 
## 55.36245411  0.01758339

coef(black)

## (Intercept)       Black 
## 116.7377984   0.4807148

coef(female)

## (Intercept)      Female 
##  -93.426013    4.219098

coef(price)

## (Intercept)       Price 
##  210.453051   -2.335207

Yes, there are some differences between what I expected and the regression coeficients of the predictors, but I think this different is attributed to me taking into account p value in my expectations.

For age, the predictor coefficient is 3.87 which suggests a positive relationship but I expected no linear relationship due to a large p value.

For hs, the predictor coefficient, .267 is small and close to zero, it might suggest no relationship which is what I expected based on a large p value.

For income, the predictor coefficient was .017 which is very close to zero which possibly suggests no relationship but I expected a very small postitive relationship due to a very small p.

For black, I expected no linear relationship but the correlation coefficent shows positive, again, same reasoning as previous variables.

For female, there appears to be a postive relationship with sales, but once again, goes against my expectations due to a large p value.

Lastly, for price, price coefficient is -2.355 which actually agrees with my expectations of a negative relationship with sales due to a small p value.

If you compare the the regression coefficients of the single linear regression of sales with each pair-wise correlation coefficient of the multiple linear regression, you will find that the values are different. This is because regression coefficents and correlation coefficents aren’t the same: regression coefficient of the predictor determines the expected slope between the two variables and can take values between infinity and negative infinite while correlation coefficient measures the strength of linear correlation and takes values between 1 and -1.

Model assumptions of Sales = $\beta_{0}$ + $\beta_{1}$Income + $\beta_{2}$Black + $\beta_{3}$Price + $\epsilon$

Linearity assumption: I test the linearity of the model by plotting the predicted Sales modeled on each predictor variable individually against the residuals of predicted and observed sales of the respective model. Under the linearity assumption, each predictor variable should have a linear relationship with Sales, and we can prove each by looking at the plot and expecting to see random variance around line with slope = 0 and intercept = 0.

residage = resid(age)
residincome = resid(income)
residblack = resid(black)
residprice = resid(price)


plot(fitted.values(age),residage, main="Predicted Sales Accounted by Age Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(0,0,col="red")

plot(fitted.values(income),residincome, main="Predicted Sales Accounted by Income Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")

plot(fitted.values(black),residblack, main="Predicted Sales Accounted by Black Population\n Against Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")

plot(fitted.values(price),residprice,main="Predicted Sales Accounted by Price Against \n Residuals of Predicted Linear Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sale")
abline(1,0,col="red")

From the third plot with sales and black population, it does not look there is random variance because all the points seem to be skewed towards the left side, or the lower predicted values. Thus, the linearity assumption doesn’t seem to hold well for the black predictor variable.

Normality Assumption of Errors: I test the assumptions that the error term, $\epsilon$ of the reduced model is normal in distribution. In order to do this, I take a QQ plot which is a plot of the standarized residuals of predicted sales observed quantile value (y axis) and it’s predicted quantile value assuming errors are normally distributed. If the values match, that is, we see a straight line with slope = 1, intercept = 0, then we know the errors are likely normally distributed.

rm <- lm(Sales~Income+Black+Price,data = P088)

rstandardrm <- rstandard(rm)
qqnorm(rstandardrm,ylab="Sample Quartiles with Standarized Residuals")
abline(a=0,b=1,col="red")

From the normal QQ plot, we can see that the data does not fit a normal distribution well because the quantile-quantile plots don’t match with a straight line with slope = 1, intercept = 0. Thus, this contradicts the normality of errors assumption.

Non-homoscedasticity Assumption of Errors: To test if the errors have constant variance, a plot of residuals against predicted values of the reduced model is created. We should not see a change in variance across the line.

residrm = resid(rm)
plot(fitted.values(rm),residrm, main="Predicted Sales Against Residuals \n of Reduced Model",xlab="Predicted Sale (in thousands)",ylab="Residuals of Predicted to Sample Sales")
abline(0,0,col="red")

From the Residual plot against predicted values, it does appear that the variance remains relatively uniform throughout the horizontal line with intercept = 0. Thus, the non-homoscedasticity assumption holds.

Conclusion: The reduced model seems to have failed the linearity assumption, specifically the linear relationship between cigarettes sales and the black population does not hold well. It also fails the normality of errors assumption. Thus, further adjustments to the model should be made regarding the black predictor variable and the error term.