Quiz on Correlation and Regression


Suppose that you want to build a regression model that predicts the price of cars using a data set named cars.

Q1. Per the scatter plot and the computed correlation coefficient, describe relationships between the two variables - `price` and `weight`.

There is a positive correlation between price and weight. Lighter cars tend to be cheaper than heavier cars, most likely because of the cost of materials.

Create scatterplots

# Load the package
library(openintro)
library(ggplot2)
str(cars)
## 'data.frame':    54 obs. of  6 variables:
##  $ type      : Factor w/ 3 levels "large","midsize",..: 3 2 2 2 2 1 1 2 1 2 ...
##  $ price     : num  15.9 33.9 37.7 30 15.7 20.8 23.7 26.3 34.7 40.1 ...
##  $ mpgCity   : int  25 18 19 22 22 19 16 19 16 16 ...
##  $ driveTrain: Factor w/ 3 levels "4WD","front",..: 2 2 2 3 2 2 3 2 2 2 ...
##  $ passengers: int  5 5 6 4 6 6 6 5 6 5 ...
##  $ weight    : int  2705 3560 3405 3640 2880 3470 4105 3495 3620 3935 ...

# Relationship between weight and price
# (the cars dataset is from the openintro R package)
ggplot(data = cars, aes(x = weight, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)



# Compute correlation coefficient
cor(cars$price, cars$weight, use = "pairwise.complete.obs")
## [1] 0.758112

Interpretation

There is a strong (the coefficient’s absolute value > 0.6) positive (its sign) association between weight and price.
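The reading of sign and strength above can be sketched in a few lines. This is only an illustration: it uses the first ten rows of `price` and `weight` copied from the `str(cars)` output, so its correlation will differ from the full-data value of 0.758, and the names `r_manual`, `r_builtin`, `direction`, and `strength` are made up for the example.

```r
# First ten rows of price and weight, copied from the str(cars) output.
# This is a subset, so the result differs from the full-data correlation.
price  <- c(15.9, 33.9, 37.7, 30, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1)
weight <- c(2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935)

# Pearson correlation computed from its definition
r_manual <- sum((weight - mean(weight)) * (price - mean(price))) /
  sqrt(sum((weight - mean(weight))^2) * sum((price - mean(price))^2))

# Built-in equivalent
r_builtin <- cor(price, weight)

# The sign gives the direction; the absolute value gives the strength
# (0.6 is the cutoff used in the interpretation above)
direction <- if (r_builtin > 0) "positive" else "negative"
strength  <- if (abs(r_builtin) > 0.6) "strong" else "weak or moderate"
```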

Run a regression model for price with one explanatory variable, weight, and answer Q2 through Q5.

Q2. Is the coefficient of weight statistically significant at 5%? Interpret the coefficient.

Yes, the coefficient of weight is significant at the 5% level. The coefficient is positive: for each additional pound of weight, the predicted price increases by the value of the coefficient.

Q3. What price does the model predict for a car that weighs 4000 pounds?

Hint: Check the units of the variables in the openintro manual.

The price would be about $20,000.
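The steps behind Q2 and Q3 (read the p-value of the slope, then plug a weight into the fitted line) might look like the sketch below. It fits the model on the first ten rows from the `str(cars)` output only, so its coefficients, p-value, and prediction are not the quiz answers, and a ten-row subset may not reach significance even when the full data do. `cars_subset` and `fit` are illustrative names, and the comment about units assumes price is recorded in thousands of dollars.

```r
# Toy subset: first ten rows shown in the str(cars) output
cars_subset <- data.frame(
  price  = c(15.9, 33.9, 37.7, 30, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  weight = c(2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935)
)

# Simple linear regression of price on weight
fit <- lm(price ~ weight, data = cars_subset)

# Slope: change in predicted price per additional pound
slope <- coef(fit)["weight"]

# p-value of the slope, compared with the 5% level
p_value     <- summary(fit)$coefficients["weight", "Pr(>|t|)"]
significant <- p_value < 0.05

# Predicted price for a 4000-pound car, two equivalent ways
pred_predict <- predict(fit, newdata = data.frame(weight = 4000))
pred_manual  <- coef(fit)["(Intercept)"] + slope * 4000

# Assumption: price is in thousands of dollars, so multiply by 1000
pred_dollars <- unname(pred_predict) * 1000
```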

Q4. What is the reported residual standard error? What does it mean?

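The RSE is 7.575 on 52 degrees of freedom. It is the typical distance between the observed prices and the best-fit line, measured in the units of price.

Q5. What is the reported adjusted R squared? What does it mean?

The adjusted R squared is 0.566. About 57% of the variation in price is explained by weight, after adjusting for the number of explanatory variables in the model.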
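Both quantities can be recomputed by hand as a sanity check. The sketch below first derives the adjusted R squared for a one-predictor model from numbers already reported here (the correlation of 0.758112 and the 54 observations), which lands close to the 0.566 quoted above. It then shows the RSE definition on the first ten rows from the `str(cars)` output; that part is a toy illustration and does not reproduce 7.575. All variable names are illustrative.

```r
# Adjusted R squared for price ~ weight from reported values
r <- 0.758112   # correlation between price and weight (reported above)
n <- 54         # number of observations in cars
p <- 1          # one explanatory variable

r_squared     <- r^2   # in simple regression, R squared equals r squared
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# RSE definition, shown on a toy subset (first ten rows of str(cars))
price  <- c(15.9, 33.9, 37.7, 30, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1)
weight <- c(2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935)
fit <- lm(price ~ weight)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rse_manual  <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
rse_summary <- summary(fit)$sigma
```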

mod_1 <- lm(passengers ~ weight, data = cars)
summary(mod_1)
## 
## Call:
## lm(formula = passengers ~ weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4978 -0.4208  0.1407  0.3773  0.9899 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1619504  0.3587028   8.815 6.72e-12 ***
## weight      0.0006417  0.0001155   5.558 9.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5528 on 52 degrees of freedom
## Multiple R-squared:  0.3726, Adjusted R-squared:  0.3606 
## F-statistic: 30.89 on 1 and 52 DF,  p-value: 9.531e-07

Q6. Which of the two models better fits the data? Discuss your answer by comparing the residual standard error and the adjusted R squared between the two models.

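The model with both weight and passengers fits the data better than the model with weight alone: its residual standard error is smaller (7.3 versus 7.575) and its adjusted R squared is higher (0.5975 versus 0.566).

Build regression model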

# Create a linear model 1
mod_1 <- lm(price ~ weight + passengers, data = cars)

# View summary of model 1
summary(mod_1)
## 
## Call:
## lm(formula = price ~ weight + passengers, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.647  -3.688  -1.134   2.677  33.704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.348709   7.480301  -0.982   0.3305    
## weight       0.015891   0.001925   8.256  5.8e-11 ***
## passengers  -4.094465   1.831085  -2.236   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.3 on 51 degrees of freedom
## Multiple R-squared:  0.6127, Adjusted R-squared:  0.5975 
## F-statistic: 40.34 on 2 and 51 DF,  p-value: 3.127e-11

# Create a linear model 2
mod_2 <- lm(price ~ weight + passengers, data = cars)

# View summary of model 2
summary(mod_2)
## 
## Call:
## lm(formula = price ~ weight + passengers, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.647  -3.688  -1.134   2.677  33.704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.348709   7.480301  -0.982   0.3305    
## weight       0.015891   0.001925   8.256  5.8e-11 ***
## passengers  -4.094465   1.831085  -2.236   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.3 on 51 degrees of freedom
## Multiple R-squared:  0.6127, Adjusted R-squared:  0.5975 
## F-statistic: 40.34 on 2 and 51 DF,  p-value: 3.127e-11
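
The comparison asked for in Q6 can be laid out side by side. The sketch below is a hedged illustration on the first ten rows from the `str(cars)` output, so its numbers are not those of the full-data summaries; `cars_subset`, `fit_weight`, `fit_both`, and `comparison` are illustrative names.

```r
# Toy subset: first ten rows shown in the str(cars) output
cars_subset <- data.frame(
  price      = c(15.9, 33.9, 37.7, 30, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  weight     = c(2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935),
  passengers = c(5, 5, 6, 4, 6, 6, 6, 5, 6, 5)
)

# One explanatory variable versus two
fit_weight <- lm(price ~ weight, data = cars_subset)
fit_both   <- lm(price ~ weight + passengers, data = cars_subset)

# Collect the two fit measures for each model
comparison <- data.frame(
  model = c("price ~ weight", "price ~ weight + passengers"),
  rse = c(summary(fit_weight)$sigma, summary(fit_both)$sigma),
  adj_r_squared = c(summary(fit_weight)$adj.r.squared,
                    summary(fit_both)$adj.r.squared)
)

# Lower RSE and higher adjusted R squared both indicate a better fit
comparison
```

For models with the same response and the same observations, adjusted R squared equals 1 minus the squared RSE divided by the sample variance of the response, so the two measures always rank the models the same way.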