Loading the required libraries:
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(datasets)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Q8. This question involves the use of simple linear regression on the Auto data set.
(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:
i. Is there a relationship between the predictor and the response?
ii. How strong is the relationship between the predictor and the response?
iii. Is the relationship between the predictor and the response positive or negative?
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95% confidence and prediction intervals?
(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.
(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
data(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
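Since corrplot was loaded above, a quick correlation plot of the numeric columns gives useful context for the mpg and horsepower relationship examined next; a minimal sketch (the factor column name is dropped first):
num_auto <- Auto[, sapply(Auto, is.numeric)]   # keep only the numeric variables
corrplot(cor(num_auto), method = "circle")     # mpg is strongly negatively correlated with horsepower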
(a).
model <- lm(mpg ~ horsepower, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response? Answer: Yes. The p-value for the horsepower coefficient is < 2e-16, far smaller than any conventional significance level (e.g. 0.05), so we reject the null hypothesis that the coefficient is zero and conclude there is strong evidence of a relationship between horsepower and mpg.
How strong is the relationship between the predictor and the response? Answer: The R-squared is 0.6059, so horsepower alone explains about 61% of the variance in mpg; the residual standard error of 4.906 is roughly 21% of the mean mpg (23.45). The relationship is moderately strong.
Is the relationship between the predictor and the response positive or negative? Answer: Negative. The coefficient of horsepower is -0.1578, so each additional unit of horsepower is associated with a decrease of about 0.16 mpg, on average.
# (iv)
new_data <- data.frame(horsepower = 98)
predict_data <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
predict_data
## fit lwr upr
## 1 24.46708 23.97308 24.96108
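The fitted value at horsepower = 98 is 24.47 mpg, with a 95% confidence interval of (23.97, 24.96). The question also asks for the 95% prediction interval; as a quick sketch, it comes from the same predict() call with interval = "prediction" and is considerably wider, because it reflects the variability of an individual car rather than just the uncertainty in the mean response.
predict(model, newdata = new_data, interval = "prediction", level = 0.95)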
(b).
par(mar = c(5, 4, 2, 2))
plot(Auto$horsepower, Auto$mpg, xlab = "Horsepower", ylab = "MPG", main = "Simple Linear Regression", pch = 19)
abline(model, col = "red")
(c).
par(mar = c(5, 4, 2, 2))
plot(model)
The residuals-vs-fitted plot shows a pronounced U-shape, so the relationship between mpg and horsepower is non-linear and is not fully captured by a straight line. The scale-location plot suggests the residual variance is not constant, and a few observations with very large horsepower have noticeably higher leverage than the rest.
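To view all four diagnostic plots at once rather than one at a time, the plotting region can be split first; a small sketch:
par(mfrow = c(2, 2))   # 2 x 2 grid for the four diagnostic plots
plot(model)
par(mfrow = c(1, 1))   # reset the plotting layout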
Q10. This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
(f) How well do the models in (a) and (e) fit the data?
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
data(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
(a).
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b).
- Intercept (13.04): the expected Sales (in thousands of units) for a non-urban, non-US store at a price of zero; it mainly anchors the fit rather than having a practical interpretation.
- Coefficient for Price (-0.054): holding Urban and US fixed, each additional dollar of price is associated with a decrease of about 0.054 in Sales (roughly 54 fewer units sold).
- Coefficient for UrbanYes (-0.022): the average difference in Sales between urban and non-urban stores, holding Price and US constant; it is tiny and not statistically significant.
- Coefficient for USYes (1.20): the average difference in Sales between US and non-US stores, holding Price and Urban constant; US stores sell about 1.2 thousand more units on average.
(c). Model in equation form: Sales = β0 + β1·Price + β2·UrbanYes + β3·USYes + ε, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise). With the fitted coefficients, Sales ≈ 13.04 − 0.054·Price − 0.022·UrbanYes + 1.20·USYes.
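To confirm how R encodes the qualitative variables, contrasts() shows the 0/1 dummy coding; with the default treatment contrasts, "No" is the baseline level for both factors:
contrasts(Carseats$Urban)   # UrbanYes = 1 for urban stores, 0 otherwise
contrasts(Carseats$US)      # USYes = 1 for US stores, 0 otherwise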
(d). To determine for which predictors we can reject the null hypothesis H0: βj = 0, we examine the p-values (the "Pr(>|t|)" column) for each coefficient in the summary output above.
Using a 0.05 significance level, we can reject H0: βj = 0 for Price (p < 2e-16) and USYes (p = 4.86e-06), but not for UrbanYes (p = 0.936): price and US status are associated with Sales, while urban location is not.
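The same conclusion can be read off programmatically from the coefficient table; a minimal sketch:
coefs <- summary(model)$coefficients
coefs[, "Pr(>|t|)"] < 0.05   # TRUE for (Intercept), Price and USYes; FALSE for UrbanYes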
(e).
smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
Dropping Urban, whose coefficient was indistinguishable from zero, leaves the estimates for Price and USYes essentially unchanged, and both remain highly significant.
(f).
summary(model)$r.squared
## [1] 0.2392754
summary(smaller_model)$r.squared
Both models explain only about 24% of the variance in Sales. Removing UrbanYes leaves the R-squared essentially unchanged while slightly improving the adjusted R-squared, so the smaller model fits the data about as well as the full model with one fewer predictor; neither model fits the data especially well.
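A formal comparison of the two nested models can be made with an F-test; a sketch using anova(), which here is equivalent to the t-test on UrbanYes:
anova(smaller_model, model)   # tests whether adding Urban improves the fit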
(g).
conf_intervals <- confint(smaller_model)
conf_intervals
The 95% confidence intervals for both coefficients exclude zero (roughly -0.065 to -0.044 for Price and 0.69 to 1.71 for USYes), consistent with their very small p-values.
(h).
plot(smaller_model)
The standardized residuals all lie within roughly ±3, so there is no strong evidence of outliers. The residuals-vs-leverage plot does show a handful of observations with leverage well above the average leverage (p + 1)/n = 3/400, so there are some high-leverage points, but none of them has a large Cook's distance, so they do not appear to unduly influence the fit.
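Beyond the diagnostic plots, the outlier and leverage checks can be made numerically; a sketch in base R:
# studentized residuals beyond about |3| would suggest outliers
sum(abs(rstudent(smaller_model)) > 3)
# leverage values well above the average (p + 1)/n flag high-leverage points
lev <- hatvalues(smaller_model)
mean(lev)                  # average leverage, (p + 1)/n = 3/400
sum(lev > 2 * mean(lev))   # count of clearly high-leverage observations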
Q14. This problem focuses on the collinearity problem.
(a) Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are β^0, β^1, and β^2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0: β1 = 0? How about the null hypothesis H0: β2 = 0?
(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0: β1 = 0?
(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.
(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured:
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.
set.seed(1)
x1 <-runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The linear model is y = β0 + β1·x1 + β2·x2 + ε, where β0 is the intercept, β1 and β2 are the coefficients of x1 and x2, and ε is the error term. From the code that generated the data, the true regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3.
By construction x2 equals 0.5·x1 plus a small amount of noise, so x1 and x2 are strongly correlated (the sample correlation is roughly 0.8) and the scatterplot shows a tight linear relationship between them.
correlation <- cor(x1, x2)
correlation
plot(x1, x2, main = "Scatterplot of x1 and x2", xlab = "x1", ylab = "x2")
model <- lm(y ~ x1 + x2)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
Interpretation: The fitted coefficients are β^0 = 2.13, β^1 = 1.44, and β^2 = 1.01. The intercept is close to its true value of 2, but β^1 underestimates the true β1 = 2 and β^2 greatly overestimates the true β2 = 0.3, and both have large standard errors because x1 and x2 are highly collinear. At the 5% level we can only just reject H0: β1 = 0 (p = 0.0487), and we cannot reject H0: β2 = 0 (p = 0.3754).
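The degree of collinearity can be quantified with the variance inflation factor; car::vif() reports it directly, but a minimal base-R sketch for this two-predictor model is:
# VIF for x1 = 1 / (1 - R^2 from regressing x1 on the other predictor);
# with only two predictors, x2 has the same VIF
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x1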
model_x1 <- lm(y ~ x1)
summary(model_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
With only x1 as the predictor, β^1 = 1.98 (close to the true value of 2) with a standard error of 0.40 and a p-value of 2.66e-06, so we can clearly reject H0: β1 = 0: on its own, x1 is a highly significant predictor of y.
model_x2 <- lm(y ~ x2)
summary(model_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
With only x2 as the predictor, the estimated coefficient of x2 is 2.90 with a p-value of 1.37e-05, so we can again reject the null hypothesis that the coefficient is zero: considered alone, x2 is also a highly significant predictor of y, largely because it acts as a proxy for x1.
No, the results in (c)–(e) do not contradict each other; they are a consequence of collinearity. Because x1 and x2 are highly correlated, the joint model in (c) cannot separate their individual effects, which inflates the standard errors and makes each coefficient look insignificant given the other. When each predictor is used on its own, as in (d) and (e), it absorbs the shared signal and appears highly significant. This is why multicollinearity has to be kept in mind when interpreting the coefficients of a multiple regression.
Now suppose we obtain one additional observation, which was unfortunately mismeasured:
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
model_updated <- lm(y ~ x1 + x2)
summary(model_updated)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
model_x1_updated <- lm(y ~ x1)
summary(model_x1_updated)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
model_x2_updated <- lm(y ~ x2)
summary(model_x2_updated)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
The new observation changes all three fits noticeably. In the joint model, x2 is now significant and x1 is not, the reverse of the original fit; in the x1-only model the coefficient shrinks and the largest residual jumps to about 3.6; the x2-only model is slightly strengthened. In the joint model and the x2-only model the new point (x1 = 0.1, x2 = 0.8) lies far from the pattern x2 ≈ 0.5·x1 followed by the rest of the data, so it is a high-leverage point, and because it pulls the fit toward itself it is not a dramatic outlier there. In the x1-only model, x1 = 0.1 is well within the range of the data, so the point has low leverage, but its response y = 6 is far above what the model predicts, so it is an outlier.
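As a sketch, the leverage and studentized residual of the added observation (the 101st) in each refit make the outlier / high-leverage classification explicit:
i <- length(y)                    # index of the mismeasured observation (101)
hatvalues(model_updated)[i]       # leverage in the joint model
rstudent(model_updated)[i]        # studentized residual in the joint model
hatvalues(model_x1_updated)[i]    # leverage in the x1-only model
rstudent(model_x1_updated)[i]     # studentized residual in the x1-only model
hatvalues(model_x2_updated)[i]    # leverage in the x2-only model
rstudent(model_x2_updated)[i]     # studentized residual in the x2-only model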