Quiz on Correlation and Regression


Suppose that you want to build a regression model that predicts the price of cars using a data set named cars.

Q1. Per the scatter plot and the computed correlation coefficient, describe relationships between the two variables - `price` and `weight`.

-There is a positive association between price and weight: heavier cars tend to be more expensive. The cars also differ by type, and larger cars are usually more expensive than smaller ones.

Create scatterplots

# Load the package
library(openintro)
library(ggplot2)
str(cars)
## 'data.frame':    54 obs. of  6 variables:
##  $ type      : Factor w/ 3 levels "large","midsize",..: 3 2 2 2 2 1 1 2 1 2 ...
##  $ price     : num  15.9 33.9 37.7 30 15.7 20.8 23.7 26.3 34.7 40.1 ...
##  $ mpgCity   : int  25 18 19 22 22 19 16 19 16 16 ...
##  $ driveTrain: Factor w/ 3 levels "4WD","front",..: 2 2 2 3 2 2 3 2 2 2 ...
##  $ passengers: int  5 5 6 4 6 6 6 5 6 5 ...
##  $ weight    : int  2705 3560 3405 3640 2880 3470 4105 3495 3620 3935 ...

# Relationship between weight and price
# (the cars dataset comes from the openintro R package)
ggplot(data = cars, aes(x = weight, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)


# Compute correlation coefficient
cor(cars$price, cars$weight, use = "pairwise.complete.obs")
## [1] 0.758112

Interpretation

There is a strong positive association between weight and price: the coefficient's absolute value (0.758) is greater than 0.6, and its sign is positive.
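The reading above follows a simple rule (sign gives the direction, an absolute value above 0.6 counts as strong). A minimal sketch of that rule in R is shown here; `describe_correlation` and its `strong_cutoff` argument are hypothetical names introduced for illustration, and the 0.6 cutoff is just the one quoted in this quiz, not a universal standard.

```r
# Hypothetical helper (not part of the quiz): label a correlation coefficient
# using the 0.6 cutoff for "strong" quoted in the interpretation above.
describe_correlation <- function(r, strong_cutoff = 0.6) {
  stopifnot(is.numeric(r), length(r) == 1, abs(r) <= 1)
  if (r == 0) {
    return("no linear association")
  }
  direction <- if (r > 0) "positive" else "negative"
  strength <- if (abs(r) > strong_cutoff) "strong" else "weaker"
  paste(strength, direction, "association")
}

# Coefficient reported by cor(cars$price, cars$weight) above
describe_correlation(0.758112)
```

The call returns "strong positive association" for the coefficient reported above.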

Run a regression model for price with one explanatory variable, weight, and answer Q2 through Q5.

Q2. Is the coefficient of weight statistically significant at 5%? Interpret the coefficient.

-Yes. The coefficient of weight is positive and statistically significant at the 5% level: each additional pound of weight is associated with a higher predicted price.

Q3. What price does the model predict for a car that weighs 4000 pounds?

-The model predicts a price of approximately $32,171 for a car that weighs 4000 pounds.

Q4. What is the reported residual standard error? What does it mean?

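-The residual standard error is 7.575 on 52 degrees of freedom. It measures the typical size of a residual: actual prices deviate from the fitted regression line by about 7.575 on average, in the units of price (roughly $7,575 if price is recorded in thousands of dollars, as the Q3 answer implies). The regression line itself is the line that minimizes the sum of squared distances between the data points and the line.

Q5. What is the reported adjusted R squared? What does it mean?

-The reported adjusted R squared is 56.6%: about 56.6% of the variability in price is explained by weight, after adjusting for the number of predictors in the model.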
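As a rough check on the adjusted R squared figure, it can be recomputed from the ordinary R squared, the sample size, and the number of predictors. In a regression with one predictor, R squared equals the squared correlation, so the sketch below starts from the coefficient reported by `cor()` (0.758112) and the 54 observations shown by `str(cars)`. The helper name `adjusted_r2` is hypothetical.

```r
# Hypothetical helper: adjusted R squared from R squared (r2),
# sample size (n), and number of predictors (p).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

r <- 0.758112  # correlation between price and weight, from cor() above
n <- 54        # number of observations, from str(cars)

r2_simple  <- r^2                              # R squared of price ~ weight
adj_simple <- adjusted_r2(r2_simple, n, p = 1) # penalized for one predictor

round(c(r2 = r2_simple, adj_r2 = adj_simple), 3)
```

The adjusted value comes out near 0.57, in line with the figure quoted in the Q5 answer.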

mod <- lm(passengers ~ weight, data = cars)
summary(mod)
## 
## Call:
## lm(formula = passengers ~ weight, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4978 -0.4208  0.1407  0.3773  0.9899 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.1619504  0.3587028   8.815 6.72e-12 ***
## weight      0.0006417  0.0001155   5.558 9.53e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5528 on 52 degrees of freedom
## Multiple R-squared:  0.3726, Adjusted R-squared:  0.3606 
## F-statistic: 30.89 on 1 and 52 DF,  p-value: 9.531e-07
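Note that the summary above is for `passengers` regressed on `weight`, whereas Q2 through Q5 concern `price` regressed on `weight`. The block below is a minimal sketch of how the quantities asked for in those questions can be read off a fitted model. To stay self-contained it uses only the ten rows previewed in the `str(cars)` output, so its numbers will not reproduce the full-data figures quoted in the answers; `cars_head` and `mod_price` are hypothetical names, and with the full data one would pass `cars` instead.

```r
# First ten rows of price and weight, copied from the str(cars) preview.
# cars_head is a hypothetical name; with the full data, use cars instead.
cars_head <- data.frame(
  price  = c(15.9, 33.9, 37.7, 30, 15.7, 20.8, 23.7, 26.3, 34.7, 40.1),
  weight = c(2705, 3560, 3405, 3640, 2880, 3470, 4105, 3495, 3620, 3935)
)

# Simple regression of price on weight
mod_price <- lm(price ~ weight, data = cars_head)
s <- summary(mod_price)

# Q2: p-value of the weight coefficient (compare with 0.05)
p_weight <- coef(s)["weight", "Pr(>|t|)"]

# Q3: predicted price for a car that weighs 4000 pounds
pred_4000 <- predict(mod_price, newdata = data.frame(weight = 4000))

# Q4: residual standard error = sqrt(RSS / residual degrees of freedom);
# this matches s$sigma
rse <- sqrt(sum(residuals(mod_price)^2) / df.residual(mod_price))

# Q5: adjusted R squared
adj_r2 <- s$adj.r.squared

print(c(p_weight = p_weight, pred_4000 = unname(pred_4000),
        rse = rse, adj_r2 = adj_r2))
```

Computing the residual standard error by hand alongside `s$sigma` makes its meaning explicit: it is the square root of the residual sum of squares divided by the residual degrees of freedom.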

Q6. Which of the two models better fits the data? Discuss your answer by comparing the residual standard error and the adjusted R squared between the two models.

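-As specified, mod_1 and mod_2 use the same formula (price ~ weight + passengers), so their summaries are identical: a residual standard error of 7.3 on 51 degrees of freedom and an adjusted R squared of 0.5975. Compared with the one-variable model of price on weight (residual standard error 7.575, adjusted R squared 56.6%), the model with weight and passengers has a smaller residual standard error and a larger adjusted R squared, so it fits the data better.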

# Create a linear model 1
mod_1 <- lm(price ~ weight + passengers, data = cars)

# View summary of model 1
summary(mod_1)
## 
## Call:
## lm(formula = price ~ weight + passengers, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.647  -3.688  -1.134   2.677  33.704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.348709   7.480301  -0.982   0.3305    
## weight       0.015891   0.001925   8.256  5.8e-11 ***
## passengers  -4.094465   1.831085  -2.236   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.3 on 51 degrees of freedom
## Multiple R-squared:  0.6127, Adjusted R-squared:  0.5975 
## F-statistic: 40.34 on 2 and 51 DF,  p-value: 3.127e-11

# Create a linear model 2
mod_2 <- lm(price ~ weight + passengers, data = cars)

# View summary of model 2
summary(mod_2)
## 
## Call:
## lm(formula = price ~ weight + passengers, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.647  -3.688  -1.134   2.677  33.704 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.348709   7.480301  -0.982   0.3305    
## weight       0.015891   0.001925   8.256  5.8e-11 ***
## passengers  -4.094465   1.831085  -2.236   0.0297 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.3 on 51 degrees of freedom
## Multiple R-squared:  0.6127, Adjusted R-squared:  0.5975 
## F-statistic: 40.34 on 2 and 51 DF,  p-value: 3.127e-11
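The comparison asked for in Q6 can be laid out as a small table. The sketch below is not part of the original quiz code. It uses only figures reported in this document: 7.575 and 56.6% for the one-variable model (from the Q4 and Q5 answers), and 7.3 and 0.5975 for the model with weight and passengers (from the summary output). The names `fits`, `best_by_rse`, and `best_by_adj_r2` are hypothetical.

```r
# Fit statistics as reported in this document (not recomputed here)
fits <- data.frame(
  model  = c("price ~ weight", "price ~ weight + passengers"),
  rse    = c(7.575, 7.3),     # residual standard error: smaller is better
  adj_r2 = c(0.566, 0.5975),  # adjusted R squared: larger is better
  stringsAsFactors = FALSE
)

# Pick the preferred model under each criterion
best_by_rse    <- fits$model[which.min(fits$rse)]
best_by_adj_r2 <- fits$model[which.max(fits$adj_r2)]

print(fits)
print(c(best_by_rse = best_by_rse, best_by_adj_r2 = best_by_adj_r2))
```

Both criteria point to the same model here. When they disagree, adjusted R squared is usually the safer guide across models with different numbers of predictors, because it penalizes additional terms.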