Problem 1
Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Answer 1
Each p-value in Table 3.4 corresponds to the null hypothesis that the associated coefficient is zero: \(H_0: \beta_1 = 0\) for TV, \(H_0: \beta_2 = 0\) for radio, and \(H_0: \beta_3 = 0\) for newspaper (and \(H_0: \beta_0 = 0\) for the intercept), i.e. that the advertising medium has no effect on sales once the other media are held fixed. The p-values for the intercept, TV, and radio are well below 0.05, so we reject those null hypotheses and conclude that TV and radio advertising are significant in predicting sales. The p-value for newspaper is greater than 0.05 (the alpha level of the test), so we fail to reject \(H_0: \beta_3 = 0\) and conclude that, with TV and radio in the model, newspaper advertising is not significantly associated with sales.
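Table 3.4 can be reproduced directly; a minimal sketch, assuming Advertising.csv from the ISLR book website has been downloaded to the working directory (this data set is not part of the ISLR R package used elsewhere in this document):
Advertising <- read.csv("Advertising.csv")
adv.fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(adv.fit)$coefficients  # the last column holds the p-values discussed above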
Problem 2
Carefully explain the differences between the KNN classifier and KNN regression methods.
Answer 2
The KNN classifier is used when the response Y is qualitative: it assigns a test observation to the class that is most common among its K nearest neighbours in the training data. KNN regression is used when Y is quantitative: it predicts the response for a test observation as the average of the responses of its K nearest training neighbours. So the classifier returns a class label (via a majority vote among neighbours), while the regression method returns a numeric prediction (via an average).
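A minimal illustration of the two methods on simulated data; a sketch assuming the class and FNN packages are installed (neither is used elsewhere in this document):
library(class)  # for knn() -- KNN classification
library(FNN)    # for knn.reg() -- KNN regression
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)                        # two predictors, 100 training points
cl <- factor(ifelse(x[, 1] + x[, 2] > 0, "Yes", "No"))   # qualitative response
ynum <- x[, 1] + x[, 2] + rnorm(100)                     # quantitative response
xtest <- matrix(rnorm(20), ncol = 2)                     # 10 test points
class::knn(train = x, test = xtest, cl = cl, k = 5)          # predicted class labels
FNN::knn.reg(train = x, test = xtest, y = ynum, k = 5)$pred  # predicted numeric values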
Problem 3
Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0 = 50\), \(\hat{\beta}_1 = 20\), \(\hat{\beta}_2 = 0.07\), \(\hat{\beta}_3 = 35\), \(\hat{\beta}_4 = 0.01\), \(\hat{\beta}_5 = -10\). (a) Which answer is correct, and why? i. For a fixed value of IQ and GPA, males earn more on average than females. ii. For a fixed value of IQ and GPA, females earn more on average than males. iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough. (b) Predict the salary of a female with IQ of 110 and a GPA of 4.0. (c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
Answer 3
\(Y = 50 + 20\,\mathrm{GPA} + 0.07\,\mathrm{IQ} + 35\,\mathrm{Gender} + 0.01\,(\mathrm{GPA} \times \mathrm{IQ}) - 10\,(\mathrm{GPA} \times \mathrm{Gender})\)
For males (Gender = 0): \(Y = 50 + 20\,\mathrm{GPA} + 0.07\,\mathrm{IQ} + 0.01\,(\mathrm{GPA} \times \mathrm{IQ})\)
For females (Gender = 1): \(Y = 50 + 20\,\mathrm{GPA} + 0.07\,\mathrm{IQ} + 0.01\,(\mathrm{GPA} \times \mathrm{IQ}) + 35 - 10\,\mathrm{GPA}\)
(a) For fixed IQ and GPA, the female fit exceeds the male fit by \(35 - 10\,\mathrm{GPA}\), which becomes negative once GPA > 3.5. So answer iii is correct: males earn more on average provided the GPA is high enough.
(b) \(Y = 50 + 20(4) + 0.07(110) + 35 + 0.01(4 \times 110) - 10(4 \times 1) = 137.1\), i.e. a predicted starting salary of about $137,100 for a female with an IQ of 110 and a GPA of 4.0.
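A quick check of the arithmetic in R:
50 + 20*4 + 0.07*110 + 35 + 0.01*(4*110) - 10*(4*1)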
(c) False. The size of a coefficient by itself says nothing about the evidence for an interaction effect, because it depends on the scale of the variables (GPA x IQ takes large values, so even a small coefficient can translate into a sizeable effect on salary). The evidence for the interaction is judged from the standard error, t-statistic, and p-value of that coefficient, not from its magnitude.
Problem 4
I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\).
(a) Suppose that the true relationship between X and Y is linear, i.e. \(Y = \beta_0 + \beta_1 X + \epsilon\). Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
(b) Answer (a) using test rather than training RSS.
(c) Suppose that the true relationship between X and Y is not linear, but we don't know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.
(d) Answer (c) using test rather than training RSS.
Answer 4
(a) The cubic regression will have the lower training RSS: the cubic model contains the linear model as a special case, so its extra flexibility can only reduce (or at worst match) the training RSS, even though the true relationship is linear.
(b) The linear regression is expected to have the lower test RSS. Because the truth is linear, the extra cubic terms mostly fit noise, adding variance without reducing bias, so the cubic model tends to overfit.
(c) The cubic regression will again have the lower training RSS, for the same reason as in (a): greater flexibility never increases training RSS.
(d) There is not enough information to tell. The answer depends on how far the true relationship is from linear: if it is only slightly non-linear the linear model may still have the lower test RSS, while if it is strongly non-linear the cubic model will likely do better.
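A small simulation sketch of parts (a) and (b); the seed, coefficients, and 100/100 train-test split below are illustrative choices, not part of the exercise:
set.seed(2)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)            # the true relationship is linear
xtr <- x[1:100];   ytr <- y[1:100]     # training half
xte <- x[101:200]; yte <- y[101:200]   # test half
fit.lin <- lm(ytr ~ xtr)
fit.cub <- lm(ytr ~ poly(xtr, 3))
# training RSS: the cubic fit is never larger than the linear fit
sum(resid(fit.lin)^2); sum(resid(fit.cub)^2)
# test RSS: the linear fit usually wins when the truth is linear
sum((yte - predict(fit.lin, data.frame(xtr = xte)))^2)
sum((yte - predict(fit.cub, data.frame(xtr = xte)))^2)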
Problem 8
This question involves the use of simple linear regression on the Auto data set.
For example:
Is there a relationship between the predictor and the response?
How strong is the relationship between the predictor and the response?
Is the relationship between the predictor and the response positive or negative?
What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
Answer 8
library(MASS)
library(ISLR)
y<-lm(mpg~horsepower,data=Auto)
summary(y)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
#A)
#i) The p-value is less than 0.05, so we reject the null hypothesis and
#conclude that there is a statistically significant relationship between horsepower and mpg.
#ii) The R-squared is 0.6059, i.e. horsepower explains about 61% of the variance in mpg,
#which gives a sense of the strength of the relationship.
#iii) The relationship is negative, as indicated by the sign of the horsepower coefficient.
predict(y,data.frame(horsepower=98),interval="confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(y,data.frame(horsepower=98),interval="prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
#iv) The predicted mpg at horsepower = 98 is 24.46708. The 95% confidence interval is (23.97308, 24.96108) and the 95% prediction interval is (14.8094, 34.12476); the prediction interval is wider because it also accounts for the irreducible error of an individual observation.
#B
plot(Auto$horsepower,Auto$mpg,col="red")
abline(y)
#C
par(mfrow=c(2,2))
plot(y)
#The residuals-vs-fitted plot is not a random scatter; a clear U shape is visible, which suggests a non-linear relationship between horsepower and mpg.
Problem 9
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
Comment on the output. For instance:
Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
Answer 9
#a
plot(Auto)
#b
Autowithoutnames<-Auto
Autowithoutnames$name=NULL
cor(Autowithoutnames)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
#c
y1<-lm(mpg~ .-name,data=Auto)
summary(y1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# i) The p-value of the F-statistic is less than 0.05, so we reject the null hypothesis and conclude that at least one predictor is significant in predicting mpg.
# ii) displacement, weight, year and origin have a statistically significant relationship with the response, based on their low p-values.
# iii) The coefficient for year is significant and positive, which suggests that, holding all other variables constant, mpg increases on average by about 0.75 per year.
#d
par(mfrow=c(2,2))
plot(y1)
# The residuals-vs-fitted plot shows a U shape, which suggests non-linearity in the relationship.
plot(predict(y1),rstudent(y1))
# The studentized-residuals-vs-fitted plot shows some observations with studentized residuals greater than 3, indicating possible outliers.
plot(hatvalues(y1))
which.max(hatvalues(y1))
## 14
## 14
# which.max gives the index of the observation with the highest leverage statistic (here observation 14).
#e)
y2<-lm(mpg~.:.,Autowithoutnames)
summary(y2)
##
## Call:
## lm(formula = mpg ~ .:., data = Autowithoutnames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
#The interactions displacement:year, acceleration:year, and acceleration:origin are statistically significant at the 5% level.
#Adjusted R-squared increased from about 0.82 to 0.88 with the addition of the interaction terms.
anova(y1,y2)
## Analysis of Variance Table
##
## Model 1: mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin + name) - name
## Model 2: mpg ~ (cylinders + displacement + horsepower + weight + acceleration +
## year + origin):(cylinders + displacement + horsepower + weight +
## acceleration + year + origin)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 384 4252.2
## 2 363 2635.6 21 1616.6 10.603 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The ANOVA tests the null hypothesis that the interaction terms add no explanatory power; the very small p-value lets us reject it, so model 2 (with interactions) fits significantly better than model 1.
#f)
y3<-lm(mpg~weight+I((weight)^2),Auto)
summary(y3)
##
## Call:
## lm(formula = mpg ~ weight + I((weight)^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6246 -2.7134 -0.3485 1.8267 16.0866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.226e+01 2.993e+00 20.800 < 2e-16 ***
## weight -1.850e-02 1.972e-03 -9.379 < 2e-16 ***
## I((weight)^2) 1.697e-06 3.059e-07 5.545 5.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
## F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16
plot(y3)
# The diagnostic plots show some non-normality in the error terms, and a funnel shape is visible in the residual plots, suggesting non-constant variance of the residuals.
Problem 10
Fit a multiple regression model to predict Sales using Price, Urban, and US.
Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!
Write out the model in equation form, being careful to handle the qualitative variables properly.
For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
How well do the models in (a) and (e) fit the data?
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
Is there evidence of outliers or high leverage observations in the model from (e)?
Answer 10
#a
sales<-lm(Sales~Price+Urban+US,data=Carseats)
summary(sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
#b)
#Price is a continuous variable; its coefficient is the average change in Sales for a one-unit increase in Price, holding the other variables constant (here Sales decrease as Price increases).
#Urban and US are categorical variables with Yes coded as 1 and No as the baseline 0. Each coefficient is the average difference in Sales between the Yes level and the No level, holding the other predictors constant.
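# A quick way to see the dummy coding R uses for these factors (a sketch, assuming
# Carseats from the ISLR package is attached as above):
contrasts(Carseats$Urban)
contrasts(Carseats$US)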
#c)
#Sales = 13.043469 - 0.054459*Price - 0.021916*(1 if Urban = Yes, 0 otherwise) + 1.200573*(1 if US = Yes, 0 otherwise)
#d)
#Null hypothesis can be rejected for Price and USYes as p-value is less than 0.05
#e
sales1<-lm(Sales~Price+US,data=Carseats)
summary(sales1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
#f
anova(sales,sales1)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
#On removing the Urban variable there is a slight increase in adjusted R-squared and a slight decrease in the residual standard error. However, the ANOVA test of whether the difference is statistically significant fails to reject the null hypothesis, so we conclude that the two models do not differ significantly; both explain only about 24% of the variance in Sales.
#g
confint(sales1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
#95% confidence interval for sales1 model coefficient
#h
plot(predict(sales1),rstudent(sales1))
#We use the plot of studentized residuals (each residual divided by its estimated standard error) versus fitted values to detect outliers. Since all studentized residuals lie within -3 to 3, there is no strong evidence of outliers. An outlier is an observation whose response is unusually far from the value predicted by the fitted trend.
lev<-hat(model.matrix(sales1))
plot(lev)
4/nrow(Carseats)
## [1] 0.01
plot(Carseats$Sales,Carseats$Price)
points(Carseats[lev>0.01,]$Sales,Carseats[lev>0.01,]$Price,col='red')
Here the threshold (p + 1)/n = (3 + 1)/400 = 0.01 is used; points whose leverage exceeds 0.01 are coloured red in the plot to flag them as high-leverage observations.
Problem 11
In this problem we will investigate the t-statistic for the null hypothesis \(H_0 : \beta = 0\) in simple linear regression without an intercept. To begin, we generate a predictor x and a response y as follows.
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
Perform a simple linear regression of y onto x, without an intercept. Report the coefficient estimate \(\hat{\beta}\), the standard error of this coefficient estimate, and the t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on these results. (You can perform regression without an intercept using the command lm(y~x+0).)
Now perform a simple linear regression of x onto y without an intercept, and report the coefficient estimate, its standard error, and the corresponding t-statistic and p-value associated with the null hypothesis \(H_0 : \beta = 0\). Comment on these results.
What is the relationship between the results obtained in (a) and (b)?
For the regression of Y onto X without an intercept, the t-statistic for \(H_0 : \beta = 0\) takes the form \(\hat{\beta}/SE(\hat{\beta})\), where \(\hat{\beta}\) is given by (3.38), and where \(SE(\hat{\beta}) = \sqrt{\dfrac{\sum_{i=1}^{n}(y_i - x_i\hat{\beta})^2}{(n-1)\sum_{i'=1}^{n} x_{i'}^2}}\).
(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the t-statistic can be written as \(\dfrac{\sqrt{n-1}\,\sum_{i=1}^{n} x_i y_i}{\sqrt{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i'=1}^{n} y_{i'}^2\right) - \left(\sum_{i'=1}^{n} x_{i'} y_{i'}\right)^2}}\).
Using the results from (d), argue that the t-statistic for the regression of y onto x is the same as the t-statistic for the regression of x onto y.
In R, show that when regression is performed with an intercept, the t-statistic for \(H_0 : \beta_1 = 0\) is the same for the regression of y onto x as it is for the regression of x onto y.
Answer 11
set.seed(1)
x=rnorm(100)
y=2*x+rnorm(100)
slr<-lm(y~x+0)
summary(slr)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 1.9939 with a standard error of 0.1065. The t-statistic, obtained by dividing the estimate by its standard error, is 18.73, and the associated p-value is less than 0.05, so we reject the null hypothesis that the coefficient of x is zero.
revslr<-lm(x~y+0)
summary(revslr)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
The coefficient estimate is 0.3911 with a standard error of 0.02089. The t-statistic is again 18.73, and the associated p-value is less than 0.05, so we reject the null hypothesis that the coefficient of y is zero.
Both regressions give exactly the same t-statistic and p-value. The coefficient estimates, however, are not reciprocals of each other (0.3911 is not 1/1.9939, which is about 0.5015), so we cannot simply invert y = mx to recover the fitted line x = (1/m)y; the two least squares problems minimize different sums of squared errors.
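This can be checked directly from the fitted objects slr and revslr defined above:
coef(slr)     # about 1.994
coef(revslr)  # about 0.391, not 1/1.994
1/coef(slr)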
n=length(x)
t=sqrt(n - 1)*(x %*% y)/sqrt(sum(x^2) * sum(y^2) - (x %*% y)^2)
as.numeric(t)
## [1] 18.72593
The formula gives a t-statistic of 18.7259, which matches the t-statistic obtained earlier from the regression output (estimate divided by its standard error).
Since the formula is symmetric in x and y (swapping them leaves it unchanged), the t-statistic must be the same whether we regress y on x or x on y.
revslr1<-lm(x~y)
summary(revslr1)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
slr1<-lm(y~x)
summary(slr1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
With an intercept included, we get the same t-statistic (18.556) for the slope in both regressions.
Problem 12
This problem involves simple linear regression without an intercept.
Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
Answer 12
From (3.38), the estimate is \(\sum_i x_i y_i / \sum_i x_i^2\) for the regression of Y onto X and \(\sum_i x_i y_i / \sum_i y_i^2\) for the regression of X onto Y, so the two estimates are equal exactly when \(\sum_i x_i^2 = \sum_i y_i^2\).
x=rnorm(100)
y=rbinom(100,2,0.3)
eg<-lm(y~x+0)
summary(eg)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14889 0.01761 0.91274 1.03282 2.19132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.08372 0.09067 0.923 0.358
##
## Residual standard error: 0.9334 on 99 degrees of freedom
## Multiple R-squared: 0.008539, Adjusted R-squared: -0.001476
## F-statistic: 0.8526 on 1 and 99 DF, p-value: 0.3581
eg1<-lm(x~y+0)
summary(eg1)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.88892 -0.56513 -0.02129 0.61989 2.54718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.1020 0.1105 0.923 0.358
##
## Residual standard error: 1.03 on 99 degrees of freedom
## Multiple R-squared: 0.008539, Adjusted R-squared: -0.001476
## F-statistic: 0.8526 on 1 and 99 DF, p-value: 0.3581
Here the two coefficients differ: for y ~ x the estimate is 0.0837, while for x ~ y it is 0.1020, because \(\sum x_i^2 \neq \sum y_i^2\) for these data.
x=1:100
y=100:1
eg3<-lm(y~x+0)
summary(eg3)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
eg4<-lm(x~y+0)
summary(eg4)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
Here the two coefficients are identical: both y ~ x and x ~ y give an estimate of 0.5075, since x = 1:100 and y = 100:1 have the same sum of squares.
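A quick check of the condition from part (a) for this example (the sums of squares agree because y is just a permutation of x):
sum(x^2)  # 338350
sum(y^2)  # 338350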
Problem 13
In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.
Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.
Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution i.e. a normal distribution with mean zero and variance 0.25.
Using x and eps, generate a vector y according to the model \(Y = -1 + 0.5X + \epsilon\). What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
Now fit a polynomial regression model that predicts y using x and \(x^2\). Is there evidence that the quadratic term improves the model fit? Explain your answer.
Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
Answer 13
set.seed(1)
x=rnorm(100)
eps<-rnorm(100,mean=0,sd=sqrt(0.25))
y<- -1+0.5*x+eps
length(y)
## [1] 100
The length of the vector y is 100. The true coefficients are \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
plot(x,y)
As expected from the generating equation y = -1 + 0.5x + eps, the plot shows a linear relationship between x and y with some noise around it.
sim<-lm(y~x)
summary(sim)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
The coefficient estimates from the fitted model sim (-1.019 and 0.499) are close to the true values -1 and 0.5. The adjusted R-squared is 0.4619, i.e. the model explains about 46% of the variation in y.
plot(x,y)
abline(sim,col='red')
abline(-1,0.5,col="green")
legend("topleft",c("Least square","Population"),col=c("red","green"),lty=c(1,1))
polyn<-lm(y~x+I(x^2))
summary(polyn)
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98252 -0.31270 -0.06441 0.29014 1.13500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.97164 0.05883 -16.517 < 2e-16 ***
## x 0.50858 0.05399 9.420 2.4e-15 ***
## I(x^2) -0.05946 0.04238 -1.403 0.164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared: 0.4779, Adjusted R-squared: 0.4672
## F-statistic: 44.4 on 2 and 97 DF, p-value: 2.038e-14
anova(polyn,sim)
## Analysis of Variance Table
##
## Model 1: y ~ x + I(x^2)
## Model 2: y ~ x
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 22.257
## 2 98 22.709 -1 -0.45163 1.9682 0.1638
The addition of the \(x^2\) term does not improve the model. The ANOVA comparison of the two models fails to reject the null hypothesis that the quadratic term adds no explanatory power, and the p-value for the \(x^2\) coefficient (0.164) is greater than 0.05, so the term is not statistically significant.
set.seed(1)
x=rnorm(100)
eps<-rnorm(100,mean=0,sd=sqrt(0.1))
y=-1+0.5*x+eps
simlow<-lm(y~x)
summary(simlow)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59351 -0.19409 -0.04411 0.17057 0.74193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01192 0.03067 -32.99 <2e-16 ***
## x 0.49966 0.03407 14.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3044 on 98 degrees of freedom
## Multiple R-squared: 0.687, Adjusted R-squared: 0.6838
## F-statistic: 215.1 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
abline(simlow,col='red')
abline(-1,0.5,col="blue")
legend("topleft",c("Least square line","True Population line - less Variance"),col=c("red","blue"),lty=c(1,1))
Here we reduced the noise by decreasing the variance of the error term while keeping the model the same. The R-squared increases (from about 0.47 to 0.69) and the points lie much closer to the fitted line.
set.seed(1)
x=rnorm(100)
eps<-rnorm(100,mean=0,sd=sqrt(4))
y=-1+0.5*x+eps
simhigh<-lm(y~x)
summary(simhigh)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.754 -1.228 -0.279 1.079 4.692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0754 0.1940 -5.544 2.5e-07 ***
## x 0.4979 0.2155 2.311 0.0229 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.926 on 98 degrees of freedom
## Multiple R-squared: 0.05167, Adjusted R-squared: 0.042
## F-statistic: 5.34 on 1 and 98 DF, p-value: 0.02294
plot(x,y)
abline(simhigh,col='orange')
abline(-1,0.5,col="blue")
legend("topleft",c("Least square line","True Population line - high Variance"),col=c("orange","blue"),lty=c(1,1))
Here we increased the noise by increasing the variance of the error term while keeping the model the same. The R-squared drops sharply (to about 0.05) and the points scatter widely around the fitted line.
For the less noisy data set the 95% confidence intervals are roughly (-1.07, -0.95) for \(\beta_0\) and (0.43, 0.57) for \(\beta_1\).
For the original data set they are roughly (-1.12, -0.92) for \(\beta_0\) and (0.39, 0.61) for \(\beta_1\).
For the noisier data set they are roughly (-1.46, -0.69) for \(\beta_0\) and (0.07, 0.93) for \(\beta_1\).
As the noise increases, the confidence intervals become wider; in every case they contain the true values \(\beta_0 = -1\) and \(\beta_1 = 0.5\).
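The intervals above come from confint() applied to the three models fitted earlier (sim, simlow, and simhigh):
confint(sim)      # original noise level
confint(simlow)   # less noise
confint(simhigh)  # more noise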
Problem 14
This problem focuses on the collinearity problem.
set .seed (1)
x1=runif (100)
x2 =0.5* x1+rnorm (100) /10
y=2+2* x1 +0.3* x2+rnorm (100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\)? How do these relate to the true \(\beta_0\), \(\beta_1\), and \(\beta_2\)? Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)? How about the null hypothesis \(H_0 : \beta_2 = 0\)?
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0 : \beta_1 = 0\)?
Do the results obtained in (c)-(e) contradict each other? Explain your answer.
Now suppose we obtain one additional observation, which was unfortunately mismeasured.
x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
Answer 14
set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+0.3*x2+rnorm(100)
The form of the model is \(y = 2 + 2x_1 + 0.3x_2 + \epsilon\). The regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\).
cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)
The correlation between x1 and x2 is 0.835, and the scatterplot shows a strong linear relationship between them.
coll<-lm(y~x1+x2)
summary(coll)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) are 2.13, 1.44, and 1.01, which are noticeably different from the true coefficients 2, 2, and 0.3 (especially for x2). At the 5% level we can reject \(H_0 : \beta_1 = 0\) (p = 0.0487, just under 0.05), but we cannot reject \(H_0 : \beta_2 = 0\) (p = 0.375).
collx1<-lm(y~x1)
summary(collx1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Based on the p-value (< 0.05) we can reject the null hypothesis \(H_0 : \beta_1 = 0\) for x1.
collx2<-lm(y~x2)
summary(collx2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Based on the p-value (< 0.05) we can reject the null hypothesis \(H_0 : \beta_1 = 0\); the coefficient of x2 is significant when x2 is used on its own.
The results in (c)-(e) do not really contradict each other. In (c) we could not reject \(H_0 : \beta_2 = 0\), while in (e) x2 on its own is clearly significant. This happens because of the collinearity between x1 and x2: when both are included, the effect of x2 is largely masked by x1, the standard errors of the coefficients are inflated, and the power of the t-tests drops, so we are more likely to fail to reject a null hypothesis that is in fact false (a Type II error).
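The collinearity can be quantified with variance inflation factors; a small sketch, assuming the car package is installed (it is not used elsewhere in this document):
library(car)
vif(coll)  # values noticeably above 1 reflect the correlation between x1 and x2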
x1<-c(x1,0.1)
x2<-c(x2,0.8)
y=c(y,6)
model1<-lm(y~x1+x2)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
par(mfrow=c(2,2))
plot(model1)
The last point (index 101) stands out in the residuals-vs-leverage (Cook's distance) plot, which shows that it is a high-leverage point.
x1<-c(x1,0.1)
x2<-c(x2,0.8)
y=c(y,6)
model2<-lm(y~x1)
summary(model2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8848 -0.6542 -0.0769 0.6137 3.4510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3921 0.2454 9.747 3.55e-16 ***
## x1 1.5691 0.4255 3.687 0.000369 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.16 on 100 degrees of freedom
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.1109
## F-statistic: 13.6 on 1 and 100 DF, p-value: 0.0003686
par(mfrow=c(2,2))
plot(model2)
The last point (index 101) stands out both in the residuals-vs-fitted plot and in the residuals-vs-leverage (Cook's distance) plot, which shows that it is both an outlier and a high-leverage point.
x1<-c(x1,0.1)
x2<-c(x2,0.8)
y=c(y,6)
model3<-lm(y~x2)
summary(model3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.67781 -0.66511 -0.00773 0.79746 2.27887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2781 0.1850 12.313 < 2e-16 ***
## x2 3.4471 0.5561 6.199 1.25e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.073 on 101 degrees of freedom
## Multiple R-squared: 0.2756, Adjusted R-squared: 0.2684
## F-statistic: 38.43 on 1 and 101 DF, p-value: 1.249e-08
par(mfrow=c(2,2))
plot(model3)
The last point (index 101) stands out in the residuals-vs-leverage (Cook's distance) plot, which shows that it is a high-leverage point.
Problem 15
This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0 : \beta_j = 0\)?
How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.
Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form
\(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\)
Answer 15
boston.zn<-lm(crim~zn,data=Boston)
summary(boston.zn)
##
## Call:
## lm(formula = crim ~ zn, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.429 -4.222 -2.620 1.250 84.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.45369 0.41722 10.675 < 2e-16 ***
## zn -0.07393 0.01609 -4.594 5.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared: 0.04019, Adjusted R-squared: 0.03828
## F-statistic: 21.1 on 1 and 504 DF, p-value: 5.506e-06
boston.indus<-lm(crim~indus,data=Boston)
summary(boston.indus)
##
## Call:
## lm(formula = crim ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.972 -2.698 -0.736 0.712 81.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.06374 0.66723 -3.093 0.00209 **
## indus 0.50978 0.05102 9.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared: 0.1653, Adjusted R-squared: 0.1637
## F-statistic: 99.82 on 1 and 504 DF, p-value: < 2.2e-16
boston.chas<-lm(crim~chas,data=Boston)
summary(boston.chas)
##
## Call:
## lm(formula = crim ~ chas, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.738 -3.661 -3.435 0.018 85.232
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7444 0.3961 9.453 <2e-16 ***
## chas -1.8928 1.5061 -1.257 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared: 0.003124, Adjusted R-squared: 0.001146
## F-statistic: 1.579 on 1 and 504 DF, p-value: 0.2094
boston.nox<-lm(crim~nox,data=Boston)
summary(boston.nox)
##
## Call:
## lm(formula = crim ~ nox, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.371 -2.738 -0.974 0.559 81.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.720 1.699 -8.073 5.08e-15 ***
## nox 31.249 2.999 10.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared: 0.1772, Adjusted R-squared: 0.1756
## F-statistic: 108.6 on 1 and 504 DF, p-value: < 2.2e-16
boston.rm<-lm(crim~rm,data=Boston)
summary(boston.rm)
##
## Call:
## lm(formula = crim ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.604 -3.952 -2.654 0.989 87.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.482 3.365 6.088 2.27e-09 ***
## rm -2.684 0.532 -5.045 6.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared: 0.04807, Adjusted R-squared: 0.04618
## F-statistic: 25.45 on 1 and 504 DF, p-value: 6.347e-07
boston.age<-lm(crim~age,data=Boston)
summary(boston.age)
##
## Call:
## lm(formula = crim ~ age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.789 -4.257 -1.230 1.527 82.849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.77791 0.94398 -4.002 7.22e-05 ***
## age 0.10779 0.01274 8.463 2.85e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared: 0.1244, Adjusted R-squared: 0.1227
## F-statistic: 71.62 on 1 and 504 DF, p-value: 2.855e-16
boston.dis<-lm(crim~dis,data=Boston)
summary(boston.dis)
##
## Call:
## lm(formula = crim ~ dis, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.708 -4.134 -1.527 1.516 81.674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4993 0.7304 13.006 <2e-16 ***
## dis -1.5509 0.1683 -9.213 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared: 0.1441, Adjusted R-squared: 0.1425
## F-statistic: 84.89 on 1 and 504 DF, p-value: < 2.2e-16
boston.rad<-lm(crim~rad,data=Boston)
summary(boston.rad)
##
## Call:
## lm(formula = crim ~ rad, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.164 -1.381 -0.141 0.660 76.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.28716 0.44348 -5.157 3.61e-07 ***
## rad 0.61791 0.03433 17.998 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared: 0.3913, Adjusted R-squared: 0.39
## F-statistic: 323.9 on 1 and 504 DF, p-value: < 2.2e-16
boston.tax<-lm(crim~tax,data=Boston)
summary(boston.tax)
##
## Call:
## lm(formula = crim ~ tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.513 -2.738 -0.194 1.065 77.696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.528369 0.815809 -10.45 <2e-16 ***
## tax 0.029742 0.001847 16.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared: 0.3396, Adjusted R-squared: 0.3383
## F-statistic: 259.2 on 1 and 504 DF, p-value: < 2.2e-16
boston.ptratio<-lm(crim~ptratio,data=Boston)
summary(boston.ptratio)
##
## Call:
## lm(formula = crim ~ ptratio, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.654 -3.985 -1.912 1.825 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.6469 3.1473 -5.607 3.40e-08 ***
## ptratio 1.1520 0.1694 6.801 2.94e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared: 0.08407, Adjusted R-squared: 0.08225
## F-statistic: 46.26 on 1 and 504 DF, p-value: 2.943e-11
boston.black<-lm(crim~black,data=Boston)
summary(boston.black)
##
## Call:
## lm(formula = crim ~ black, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.756 -2.299 -2.095 -1.296 86.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.553529 1.425903 11.609 <2e-16 ***
## black -0.036280 0.003873 -9.367 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared: 0.1483, Adjusted R-squared: 0.1466
## F-statistic: 87.74 on 1 and 504 DF, p-value: < 2.2e-16
boston.lstat<-lm(crim~lstat,data=Boston)
summary(boston.lstat)
##
## Call:
## lm(formula = crim ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.925 -2.822 -0.664 1.079 82.862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.33054 0.69376 -4.801 2.09e-06 ***
## lstat 0.54880 0.04776 11.491 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared: 0.2076, Adjusted R-squared: 0.206
## F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16
boston.medv<-lm(crim~medv,data=Boston)
summary(boston.medv)
##
## Call:
## lm(formula = crim ~ medv, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.071 -4.022 -2.343 1.298 80.957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.79654 0.93419 12.63 <2e-16 ***
## medv -0.36316 0.03839 -9.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared: 0.1508, Adjusted R-squared: 0.1491
## F-statistic: 89.49 on 1 and 504 DF, p-value: < 2.2e-16
The models above show that chas is the only predictor that is not significant for the per capita crime rate: based on the p-value of its t-statistic (0.209) we cannot reject the null hypothesis. For every other predictor the p-value is very small, so we reject the null hypothesis and conclude that there is a statistically significant relationship between that predictor and the response.
boston.all<-lm(crim~.,Boston)
summary(boston.all)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
From the summary we can reject the null hypothesis \(H_0 : \beta_j = 0\) for **zn, dis, rad, black, and medv**, as their p-values are less than 0.05.
simple<-vector("numeric",0)
simple<-c(simple,boston.zn$coefficients[2])
simple<-c(simple,boston.indus$coefficients[2])
simple<-c(simple,boston.chas$coefficients[2])
simple<-c(simple,boston.nox$coefficients[2])
simple<-c(simple,boston.rm$coefficients[2])
simple<-c(simple,boston.age$coefficients[2])
simple<-c(simple,boston.dis$coefficients[2])
simple<-c(simple,boston.rad$coefficients[2])
simple<-c(simple,boston.tax$coefficients[2])
simple<-c(simple,boston.ptratio$coefficients[2])
simple<-c(simple,boston.black$coefficients[2])
simple<-c(simple,boston.lstat$coefficients[2])
simple<-c(simple,boston.medv$coefficients[2])
multi<-vector("numeric",0)
multi<-c(multi,boston.all$coefficients)
multi<-multi[-1]
plot(simple,multi,col='blue')
The plot shows that a predictor's coefficient in its simple (univariate) regression can differ substantially from its coefficient in the multiple regression, because the multiple regression coefficients are estimated while holding the other predictors fixed.
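Labelling the points makes the largest discrepancy easy to spot (nox: roughly 31 in the simple fit versus about -10 in the multiple fit); a small sketch reusing the simple and multi vectors built above:
plot(simple, multi, col = 'blue')
text(simple, multi, labels = names(simple), pos = 4, cex = 0.7)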
boston.zn1<-lm(crim~poly(zn,3),data=Boston)
summary(boston.zn1)
##
## Call:
## lm(formula = crim ~ poly(zn, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.821 -4.614 -1.294 0.473 84.130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3722 9.709 < 2e-16 ***
## poly(zn, 3)1 -38.7498 8.3722 -4.628 4.7e-06 ***
## poly(zn, 3)2 23.9398 8.3722 2.859 0.00442 **
## poly(zn, 3)3 -10.0719 8.3722 -1.203 0.22954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.372 on 502 degrees of freedom
## Multiple R-squared: 0.05824, Adjusted R-squared: 0.05261
## F-statistic: 10.35 on 3 and 502 DF, p-value: 1.281e-06
par(mfrow=c(2,2))
plot(boston.zn1)
boston.indus1<-lm(crim~poly(indus,3),data=Boston)
summary(boston.indus1)
##
## Call:
## lm(formula = crim ~ poly(indus, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.278 -2.514 0.054 0.764 79.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.330 10.950 < 2e-16 ***
## poly(indus, 3)1 78.591 7.423 10.587 < 2e-16 ***
## poly(indus, 3)2 -24.395 7.423 -3.286 0.00109 **
## poly(indus, 3)3 -54.130 7.423 -7.292 1.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.423 on 502 degrees of freedom
## Multiple R-squared: 0.2597, Adjusted R-squared: 0.2552
## F-statistic: 58.69 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.indus1)
boston.nox1<-lm(crim~poly(nox,3),data=Boston)
summary(boston.nox1)
##
## Call:
## lm(formula = crim ~ poly(nox, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.110 -2.068 -0.255 0.739 78.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3216 11.237 < 2e-16 ***
## poly(nox, 3)1 81.3720 7.2336 11.249 < 2e-16 ***
## poly(nox, 3)2 -28.8286 7.2336 -3.985 7.74e-05 ***
## poly(nox, 3)3 -60.3619 7.2336 -8.345 6.96e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.234 on 502 degrees of freedom
## Multiple R-squared: 0.297, Adjusted R-squared: 0.2928
## F-statistic: 70.69 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.nox1)
boston.rm1<-lm(crim~poly(rm,3),data=Boston)
summary(boston.rm1)
##
## Call:
## lm(formula = crim ~ poly(rm, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.485 -3.468 -2.221 -0.015 87.219
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3703 9.758 < 2e-16 ***
## poly(rm, 3)1 -42.3794 8.3297 -5.088 5.13e-07 ***
## poly(rm, 3)2 26.5768 8.3297 3.191 0.00151 **
## poly(rm, 3)3 -5.5103 8.3297 -0.662 0.50858
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.33 on 502 degrees of freedom
## Multiple R-squared: 0.06779, Adjusted R-squared: 0.06222
## F-statistic: 12.17 on 3 and 502 DF, p-value: 1.067e-07
par(mfrow=c(2,2))
plot(boston.rm1)
boston.age1<-lm(crim~poly(age,3),data=Boston)
summary(boston.age1)
##
## Call:
## lm(formula = crim ~ poly(age, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.762 -2.673 -0.516 0.019 82.842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3485 10.368 < 2e-16 ***
## poly(age, 3)1 68.1820 7.8397 8.697 < 2e-16 ***
## poly(age, 3)2 37.4845 7.8397 4.781 2.29e-06 ***
## poly(age, 3)3 21.3532 7.8397 2.724 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.84 on 502 degrees of freedom
## Multiple R-squared: 0.1742, Adjusted R-squared: 0.1693
## F-statistic: 35.31 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.age1)
boston.dis1<-lm(crim~poly(dis,3),data=Boston)
summary(boston.dis1)
##
## Call:
## lm(formula = crim ~ poly(dis, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.757 -2.588 0.031 1.267 76.378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3259 11.087 < 2e-16 ***
## poly(dis, 3)1 -73.3886 7.3315 -10.010 < 2e-16 ***
## poly(dis, 3)2 56.3730 7.3315 7.689 7.87e-14 ***
## poly(dis, 3)3 -42.6219 7.3315 -5.814 1.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.331 on 502 degrees of freedom
## Multiple R-squared: 0.2778, Adjusted R-squared: 0.2735
## F-statistic: 64.37 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.dis1)
boston.rad1<-lm(crim~poly(rad,3),data=Boston)
summary(boston.rad1)
##
## Call:
## lm(formula = crim ~ poly(rad, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.381 -0.412 -0.269 0.179 76.217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.2971 12.164 < 2e-16 ***
## poly(rad, 3)1 120.9074 6.6824 18.093 < 2e-16 ***
## poly(rad, 3)2 17.4923 6.6824 2.618 0.00912 **
## poly(rad, 3)3 4.6985 6.6824 0.703 0.48231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.682 on 502 degrees of freedom
## Multiple R-squared: 0.4, Adjusted R-squared: 0.3965
## F-statistic: 111.6 on 3 and 502 DF, p-value: < 2.2e-16
plot(boston.rad1)
boston.tax1<-lm(crim~poly(tax,3),data=Boston)
summary(boston.tax1)
##
## Call:
## lm(formula = crim ~ poly(tax, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.273 -1.389 0.046 0.536 76.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3047 11.860 < 2e-16 ***
## poly(tax, 3)1 112.6458 6.8537 16.436 < 2e-16 ***
## poly(tax, 3)2 32.0873 6.8537 4.682 3.67e-06 ***
## poly(tax, 3)3 -7.9968 6.8537 -1.167 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 502 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.3651
## F-statistic: 97.8 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.tax1)
boston.ptratio1<-lm(crim~poly(ptratio,3),data=Boston)
summary(boston.ptratio1)
##
## Call:
## lm(formula = crim ~ poly(ptratio, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.833 -4.146 -1.655 1.408 82.697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.361 10.008 < 2e-16 ***
## poly(ptratio, 3)1 56.045 8.122 6.901 1.57e-11 ***
## poly(ptratio, 3)2 24.775 8.122 3.050 0.00241 **
## poly(ptratio, 3)3 -22.280 8.122 -2.743 0.00630 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.122 on 502 degrees of freedom
## Multiple R-squared: 0.1138, Adjusted R-squared: 0.1085
## F-statistic: 21.48 on 3 and 502 DF, p-value: 4.171e-13
plot(boston.ptratio1)
boston.black1<-lm(crim~poly(black,3),data=Boston)
summary(boston.black1)
##
## Call:
## lm(formula = crim ~ poly(black, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.096 -2.343 -2.128 -1.439 86.790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3536 10.218 <2e-16 ***
## poly(black, 3)1 -74.4312 7.9546 -9.357 <2e-16 ***
## poly(black, 3)2 5.9264 7.9546 0.745 0.457
## poly(black, 3)3 -4.8346 7.9546 -0.608 0.544
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.955 on 502 degrees of freedom
## Multiple R-squared: 0.1498, Adjusted R-squared: 0.1448
## F-statistic: 29.49 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.black1)
boston.lstat1<-lm(crim~poly(lstat,3),data=Boston)
summary(boston.lstat1)
##
## Call:
## lm(formula = crim ~ poly(lstat, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3392 10.654 <2e-16 ***
## poly(lstat, 3)1 88.0697 7.6294 11.543 <2e-16 ***
## poly(lstat, 3)2 15.8882 7.6294 2.082 0.0378 *
## poly(lstat, 3)3 -11.5740 7.6294 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.lstat1)
boston.medv1<-lm(crim~poly(medv,3),data=Boston)
summary(boston.medv1)
##
## Call:
## lm(formula = crim ~ poly(medv, 3), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.427 -1.976 -0.437 0.439 73.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.614 0.292 12.374 < 2e-16 ***
## poly(medv, 3)1 -75.058 6.569 -11.426 < 2e-16 ***
## poly(medv, 3)2 88.086 6.569 13.409 < 2e-16 ***
## poly(medv, 3)3 -48.033 6.569 -7.312 1.05e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.569 on 502 degrees of freedom
## Multiple R-squared: 0.4202, Adjusted R-squared: 0.4167
## F-statistic: 121.3 on 3 and 502 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(boston.medv1)
From the summaries, the cubic term is significant for **indus, nox, age, dis, ptratio, and medv**, indicating a non-linear relationship between those predictors and crim. For **black**, neither the quadratic nor the cubic coefficient is significant, so there is no evidence of a non-linear association; for **zn, rm, rad, tax, and lstat** only the quadratic term is significant.
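The non-linearity can also be checked with an F-test comparing the simple linear fit to the cubic fit for a given predictor, reusing the models already fitted above (medv as an example):
anova(boston.medv, boston.medv1)  # a small p-value indicates the quadratic/cubic terms add explanatory power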