This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!). I welcome comments and critiques that help me improve.
Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.
x = rnorm(100, mean = 0, sd = 1)
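One small addition worth making before the rnorm() call (not part of the original code): fixing the random-number seed makes the simulation reproducible, so the output shown below can be regenerated exactly. A minimal sketch, with an arbitrary seed value:
set.seed(1)                        # arbitrary seed; any fixed value makes the draws reproducible
x = rnorm(100, mean = 0, sd = 1)   # same call as above, now reproducible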
Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.
eps = rnorm(100, mean = 0, sd = 0.25)   # note: rnorm() takes the sd, so this gives variance 0.25^2 = 0.0625
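Since rnorm() is parameterised by the standard deviation rather than the variance, a variance of 0.25 corresponds to sd = sqrt(0.25) = 0.5. A sketch of that alternative (the name eps_alt is mine, not the original author's):
eps_alt = rnorm(100, mean = 0, sd = sqrt(0.25))   # sd = 0.5, so Var(eps_alt) is about 0.25
var(eps_alt)                                      # sample variance should be close to 0.25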
Using x and eps, generate a vector y according to the model Y = −1 + 0.5X + eps. What is the length of the vector y? What are the values of β0 and β1 in this linear model?
y = -1 + 0.5*x + eps
length(y)
## [1] 100
The vector y has length 100, and in this linear model β0 = −1 and β1 = 0.5.
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(y~x)
The scatterplot shows a positive, approximately linear relationship between x and y, with scatter around the line introduced by the noise term eps.
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β^0 and β^1 compare to β0 and β1?
lm.fit1 = lm(y~x)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51995 -0.17474 -0.00238 0.13815 0.72515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99412 0.02357 -42.17 <2e-16 ***
## x 0.49094 0.02221 22.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2325 on 98 degrees of freedom
## Multiple R-squared: 0.8329, Adjusted R-squared: 0.8312
## F-statistic: 488.6 on 1 and 98 DF, p-value: < 2.2e-16
β^0 = −0.994 and β^1 = 0.491 are practically equal to the true values β0 = −1 and β1 = 0.5. As expected, both estimates are statistically significant, and R-squared = 0.83 shows that the model fits the data quite well.
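To make that comparison explicit (a small sketch, not part of the original write-up), the fitted coefficients can be printed next to the true values:
# Fitted coefficients from lm.fit1 alongside the true values used to generate the data
cbind(estimate = coef(lm.fit1), truth = c(-1, 0.5))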
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(y~x); abline(lm.fit1, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm.fit1.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit1 is not overwritten
summary(lm.fit1.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51960 -0.16988 -0.00291 0.14401 0.72280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90865 0.02335 -38.914 <2e-16 ***
## poly(x, 2)1 5.13999 0.23350 22.013 <2e-16 ***
## poly(x, 2)2 -0.10178 0.23350 -0.436 0.664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2335 on 97 degrees of freedom
## Multiple R-squared: 0.8333, Adjusted R-squared: 0.8298
## F-statistic: 242.4 on 2 and 97 DF, p-value: < 2.2e-16
The regression coefficient of the quadratic term is not statistically significant; hence, there is no evidence that the quadratic term improves the model.
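An equivalent check (a small addition, not in the original solution) is to compare the two nested fits directly with an F-test via anova(); the p-value for the quadratic term should match the one in the summary above.
# F-test of the linear fit against the quadratic fit; a large p-value means the quadratic term adds nothing
anova(lm.fit1, lm.fit1.quad)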
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model from part (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
x = rnorm(100, mean = 0, sd = 1)
eps = rnorm(100, mean = 0, sd = 0.10)   # smaller sd than the original 0.25: less noise
y = -1 + 0.5*x + eps
lm.fit2 = lm(y~x)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.278553 -0.078358 -0.009462 0.067681 0.273390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99122 0.01151 -86.15 <2e-16 ***
## x 0.50731 0.01140 44.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1127 on 98 degrees of freedom
## Multiple R-squared: 0.9529, Adjusted R-squared: 0.9524
## F-statistic: 1982 on 1 and 98 DF, p-value: < 2.2e-16
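With the smaller error standard deviation of 0.10, the estimates (−0.991 and 0.507) are again close to the true values and R-squared rises to about 0.95, reflecting a much tighter fit. As a small sketch (not in the original), the R-squared can be pulled directly out of the summary object:
# R-squared of the low-noise linear fit (about 0.95 in the summary above)
summary(lm.fit2)$r.squared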
plot(y~x); abline(lm.fit2, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
lm.fit2.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit2 is kept
summary(lm.fit2.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.260392 -0.071462 -0.003471 0.079753 0.292848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.09400 0.01101 -99.379 <2e-16 ***
## poly(x, 2)1 5.01761 0.11008 45.580 <2e-16 ***
## poly(x, 2)2 0.26386 0.11008 2.397 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1101 on 97 degrees of freedom
## Multiple R-squared: 0.9555, Adjusted R-squared: 0.9546
## F-statistic: 1042 on 2 and 97 DF, p-value: < 2.2e-16
Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model from part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
x = rnorm(100, mean = 0, sd = 1)
eps = rnorm(100, mean = 0, sd = 0.50)   # larger sd than the original 0.25: more noise
y = -1 + 0.5*x + eps
lm.fit3 = lm(y~x)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11488 -0.39238 -0.05344 0.47435 1.33973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96126 0.05525 -17.397 < 2e-16 ***
## x 0.51624 0.05999 8.606 1.27e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5508 on 98 degrees of freedom
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4246
## F-statistic: 74.05 on 1 and 98 DF, p-value: 1.274e-13
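With the larger error standard deviation of 0.50, the estimates (−0.961 and 0.516) are still close to the true values but far less precise, and R-squared drops to about 0.43. As a small sketch (again not part of the original), the residual standard error can be read off the summary object and compared with the true noise level:
# Residual standard error of the high-noise fit (about 0.55, close to the true error sd of 0.5)
summary(lm.fit3)$sigma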
plot(y~x); abline(lm.fit3, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
lm.fit3.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit3 is kept
summary(lm.fit3.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11020 -0.44360 -0.03023 0.38954 1.24626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99850 0.05518 -18.095 < 2e-16 ***
## poly(x, 2)1 4.74017 0.55182 8.590 1.47e-13 ***
## poly(x, 2)2 0.44460 0.55182 0.806 0.422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5518 on 97 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.4225
## F-statistic: 37.22 on 2 and 97 DF, p-value: 1.01e-12
What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
# Caution: this refits all three models on the current (noisiest) x and y, so the three fits and their intervals come out identical
lm.fit1 = lm(y~x); lm.fit2 = lm(y~x); lm.fit3 = lm(y~x)
confint(lm.fit1); confint(lm.fit2); confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
The three sets of confidence intervals above are identical because all three models were refit on the same (most recent) data set rather than on their own data. Had each model been kept with the data set it was originally fit to, all three intervals would be centered near the true values β0 = −1 and β1 = 0.5, with the narrowest intervals for the low-noise data (sd = 0.10) and the widest for the high-noise data (sd = 0.50).
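A sketch of how the comparison could be set up so that the three intervals actually differ (the names x1, y1, etc. and the seed are my own choices, not the original author's; output is not shown because it depends on the random draws):
set.seed(1)                                                 # arbitrary seed, for reproducibility
x1 = rnorm(100); y1 = -1 + 0.5*x1 + rnorm(100, sd = 0.25)   # original noise level
x2 = rnorm(100); y2 = -1 + 0.5*x2 + rnorm(100, sd = 0.10)   # less noise
x3 = rnorm(100); y3 = -1 + 0.5*x3 + rnorm(100, sd = 0.50)   # more noise
confint(lm(y1 ~ x1)); confint(lm(y2 ~ x2)); confint(lm(y3 ~ x3))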