This work is part of my effort to become a well-versed data analyst. At this point, and for the immediate future, I will undoubtedly be a novice at using R and at solving the problem sets from this book, so my solutions will at times reflect my limited abilities. With more practice, the quality and depth of my work will improve (that is the whole point!). I welcome comments and critiques that help me improve.
Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N(0, 1) distribution. This represents a feature, X.
x = rnorm(100, mean = 0, sd = 1)
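One small addition worth making before the rnorm() call (not part of the original code): fixing the random-number seed makes the simulation reproducible, so the output shown below can be regenerated exactly. A minimal sketch, with an arbitrary seed value:
set.seed(1)                        # arbitrary seed; any fixed value makes the draws reproducible
x = rnorm(100, mean = 0, sd = 1)   # same call as above, now reproducible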
Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0, 0.25) distribution, i.e. a normal distribution with mean zero and variance 0.25.
eps = rnorm(100, mean = 0, sd = 0.25)   # note: rnorm() takes the sd, so this gives variance 0.25^2 = 0.0625
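Since rnorm() is parameterised by the standard deviation rather than the variance, a variance of 0.25 corresponds to sd = sqrt(0.25) = 0.5. A sketch of that alternative (the name eps_alt is mine, not the original author's):
eps_alt = rnorm(100, mean = 0, sd = sqrt(0.25))   # sd = 0.5, so Var(eps_alt) is about 0.25
var(eps_alt)                                      # sample variance should be close to 0.25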
Using x and eps, generate a vector y according to the model Y = −1 + 0.5X + eps. What is the length of the vector y? What are the values of β0 and β1 in this linear model?
y = -1 + 0.5*x + eps
length(y)
## [1] 100
The vector y has length 100, and in this linear model β0 = −1 and β1 = 0.5.
Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(y~x)
The scatterplot shows a positive, approximately linear relationship between x and y, with scatter around the line introduced by the noise term eps.
Fit a least squares linear model to predict y using x. Comment on the model obtained. How do β^0 and β^1 compare to β0 and β1?
lm.fit1 = lm(y~x)
summary(lm.fit1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51995 -0.17474 -0.00238 0.13815 0.72515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99412 0.02357 -42.17 <2e-16 ***
## x 0.49094 0.02221 22.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2325 on 98 degrees of freedom
## Multiple R-squared: 0.8329, Adjusted R-squared: 0.8312
## F-statistic: 488.6 on 1 and 98 DF, p-value: < 2.2e-16
β^0 = −0.994 and β^1 = 0.491 are practically equal to the true values β0 = −1 and β1 = 0.5. As expected, both estimates are statistically significant, and R-squared = 0.83 shows that the model fits the data quite well.
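To make that comparison explicit (a small sketch, not part of the original write-up), the fitted coefficients can be printed next to the true values:
# Fitted coefficients from lm.fit1 alongside the true values used to generate the data
cbind(estimate = coef(lm.fit1), truth = c(-1, 0.5))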
Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(y~x); abline(lm.fit1, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
Now fit a polynomial regression model that predicts y using x and x^2. Is there evidence that the quadratic term improves the model fit? Explain your answer.
lm.fit1.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit1 is not overwritten
summary(lm.fit1.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51960 -0.16988 -0.00291 0.14401 0.72280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90865 0.02335 -38.914 <2e-16 ***
## poly(x, 2)1 5.13999 0.23350 22.013 <2e-16 ***
## poly(x, 2)2 -0.10178 0.23350 -0.436 0.664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2335 on 97 degrees of freedom
## Multiple R-squared: 0.8333, Adjusted R-squared: 0.8298
## F-statistic: 242.4 on 2 and 97 DF, p-value: < 2.2e-16
The regression coefficient of the quadratic term is not statistically significant; hence, there is no evidence that the quadratic term improves the model.
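An equivalent check (a small addition, not in the original solution) is to compare the two nested fits directly with an F-test via anova(); the p-value for the quadratic term should match the one in the summary above.
# F-test of the linear fit against the quadratic fit; a large p-value means the quadratic term adds nothing
anova(lm.fit1, lm.fit1.quad)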
Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model from part (c) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
x = rnorm(100, mean = 0, sd = 1)
eps = rnorm(100, mean = 0, sd = 0.10)   # smaller sd than the original 0.25: less noise
y = -1 + 0.5*x + eps
lm.fit2 = lm(y~x)
summary(lm.fit2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.278553 -0.078358 -0.009462 0.067681 0.273390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99122 0.01151 -86.15 <2e-16 ***
## x 0.50731 0.01140 44.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1127 on 98 degrees of freedom
## Multiple R-squared: 0.9529, Adjusted R-squared: 0.9524
## F-statistic: 1982 on 1 and 98 DF, p-value: < 2.2e-16
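With the smaller error standard deviation of 0.10, the estimates (−0.991 and 0.507) are again close to the true values and R-squared rises to about 0.95, reflecting a much tighter fit. As a small sketch (not in the original), the R-squared can be pulled directly out of the summary object:
# R-squared of the low-noise linear fit (about 0.95 in the summary above)
summary(lm.fit2)$r.squared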
plot(y~x); abline(lm.fit2, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
lm.fit2.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit2 is kept
summary(lm.fit2.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.260392 -0.071462 -0.003471 0.079753 0.292848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.09400 0.01101 -99.379 <2e-16 ***
## poly(x, 2)1 5.01761 0.11008 45.580 <2e-16 ***
## poly(x, 2)2 0.26386 0.11008 2.397 0.0184 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1101 on 97 degrees of freedom
## Multiple R-squared: 0.9555, Adjusted R-squared: 0.9546
## F-statistic: 1042 on 2 and 97 DF, p-value: < 2.2e-16
Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model from part (c) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term in (b). Describe your results.
x = rnorm(100, mean = 0, sd = 1)
eps = rnorm(100, mean = 0, sd = 0.50)   # larger sd than the original 0.25: more noise
y = -1 + 0.5*x + eps
lm.fit3 = lm(y~x)
summary(lm.fit3)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11488 -0.39238 -0.05344 0.47435 1.33973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.96126 0.05525 -17.397 < 2e-16 ***
## x 0.51624 0.05999 8.606 1.27e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5508 on 98 degrees of freedom
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4246
## F-statistic: 74.05 on 1 and 98 DF, p-value: 1.274e-13
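With the larger error standard deviation of 0.50, the estimates (−0.961 and 0.516) are still close to the true values but far less precise, and R-squared drops to about 0.43. As a small sketch (again not part of the original), the residual standard error can be read off the summary object and compared with the true noise level:
# Residual standard error of the high-noise fit (about 0.55, close to the true error sd of 0.5)
summary(lm.fit3)$sigma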
plot(y~x); abline(lm.fit3, col="red"); abline(-1, 0.5, col="blue")   # fitted line (red) and population line (blue)
legend("bottomright", c("Least squares line", "Population regression line"), lwd=1, col=c("red", "blue"), bty="n")
lm.fit3.quad = lm(y ~ poly(x, 2))   # quadratic fit under a new name, so the linear fit lm.fit3 is kept
summary(lm.fit3.quad)
##
## Call:
## lm(formula = y ~ poly(x, 2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11020 -0.44360 -0.03023 0.38954 1.24626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.99850 0.05518 -18.095 < 2e-16 ***
## poly(x, 2)1 4.74017 0.55182 8.590 1.47e-13 ***
## poly(x, 2)2 0.44460 0.55182 0.806 0.422
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5518 on 97 degrees of freedom
## Multiple R-squared: 0.4342, Adjusted R-squared: 0.4225
## F-statistic: 37.22 on 2 and 97 DF, p-value: 1.01e-12
What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
# Caution: this refits all three models on the current (noisiest) x and y, so the three fits and their intervals come out identical
lm.fit1 = lm(y~x); lm.fit2 = lm(y~x); lm.fit3 = lm(y~x)
confint(lm.fit1); confint(lm.fit2); confint(lm.fit3)
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
## 2.5 % 97.5 %
## (Intercept) -1.0709058 -0.8516113
## x 0.3971961 0.6352924
The three sets of confidence intervals above are identical because all three models were refit on the same (most recent) data set rather than on their own data. Had each model been kept with the data set it was originally fit to, all three intervals would be centered near the true values β0 = −1 and β1 = 0.5, with the narrowest intervals for the low-noise data (sd = 0.10) and the widest for the high-noise data (sd = 0.50).
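A sketch of how the comparison could be set up so that the three intervals actually differ (the names x1, y1, etc. and the seed are my own choices, not the original author's; output is not shown because it depends on the random draws):
set.seed(1)                                                 # arbitrary seed, for reproducibility
x1 = rnorm(100); y1 = -1 + 0.5*x1 + rnorm(100, sd = 0.25)   # original noise level
x2 = rnorm(100); y2 = -1 + 0.5*x2 + rnorm(100, sd = 0.10)   # less noise
x3 = rnorm(100); y3 = -1 + 0.5*x3 + rnorm(100, sd = 0.50)   # more noise
confint(lm(y1 ~ x1)); confint(lm(y2 ~ x2)); confint(lm(y3 ~ x3))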