Suppose that you want to build a regression model that predicts the price of cars using a data set named cars. Start by examining the relationship between price and weight. Make sure to interpret the direction and the magnitude of the relationship. In addition, keep in mind that correlation (or regression) coefficients show only association, not causation.
The two variables have a positive relationship, and the relationship is strong because the correlation coefficient (0.758) is greater than 0.6.
Create scatterplots
## 'data.frame': 54 obs. of 6 variables:
## $ type : Factor w/ 3 levels "large","midsize",..: 3 2 2 2 2 1 1 2 1 2 ...
## $ price : num 15.9 33.9 37.7 30 15.7 20.8 23.7 26.3 34.7 40.1 ...
## $ mpgCity : int 25 18 19 22 22 19 16 19 16 16 ...
## $ driveTrain: Factor w/ 3 levels "4WD","front",..: 2 2 2 3 2 2 3 2 2 2 ...
## $ passengers: int 5 5 6 4 6 6 6 5 6 5 ...
## $ weight : int 2705 3560 3405 3640 2880 3470 4105 3495 3620 3935 ...
## [1] 0.758112
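The output above (the structure of the data frame and the correlation coefficient) can be reproduced with a few lines of R. This is only a sketch: it assumes the cars data frame ships with the openintro package, as the hint in Q3 suggests, and the axis labels assume the units documented in the openintro manual (weight in pounds, price in thousands of US dollars).

```r
# Sketch (assumes the cars data set comes from the openintro package)
library(openintro)

str(cars)                                   # structure of the 54-car data set
plot(price ~ weight, data = cars,
     xlab = "Weight (lbs)", ylab = "Price (thousand USD)",
     main = "Price vs. weight")             # scatterplot of the two variables
cor(cars$price, cars$weight)                # correlation coefficient (about 0.76)
```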
Interpretation
Run a regression model for price with one explanatory variable, weight, and answer Q2 through Q5.
The coefficient of weight is statistically significant at the 5% level. In fact, it is shown to be significant even at the 0.1% significance level, which means it is meaningful.

## Q3. What price does the model predict for a car that weighs 4,000 pounds? Hint: Check the units of the variables in the openintro manual.

A car that weighs 4,000 pounds would cost about 50,000 USD.
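As a rough sketch of how such a prediction could be computed, assuming the model is fit as the assignment describes (price as the response, weight as the explanatory variable; the object name fit is hypothetical):

```r
# Hypothetical sketch: regress price on weight, then predict for a 4,000-lb car.
# In openintro::cars, price is recorded in thousands of US dollars.
fit <- lm(price ~ weight, data = cars)
predict(fit, newdata = data.frame(weight = 4000))
```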
The reported residual standard error is 433. This means that the model's estimates of weight miss the actual weight by about 433 pounds, on average.
The adjusted R-squared is 0.566, which means that 56.6% of the variability in weight can be explained by price.
Run a second regression model for price with two explanatory variables: weight and passengers, and answer Q6.
In the second model the residual standard error drops to 347, and the adjusted R-squared rises to 72%. This means that the second model describes the data set better thanks to the addition of the passengers variable; it leaves less unexplained error.
Build regression model
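The summaries below could be produced by something like the following sketch, which simply mirrors the Call lines in the output; the object names m1 and m2 are hypothetical, and the closing anova() comparison is an extra check rather than part of the original output.

```r
# Sketch: fit the two models whose summaries appear below
m1 <- lm(weight ~ price, data = cars)               # one explanatory variable
m2 <- lm(weight ~ price + passengers, data = cars)  # two explanatory variables

summary(m1)
summary(m2)

# Optional: formal F-test of whether adding passengers improves the fit
anova(m1, m2)
```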
##
## Call:
## lm(formula = weight ~ price, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1328.29 -228.09 10.92 258.19 924.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2171.113 118.956 18.251 < 2e-16 ***
## price 43.331 5.169 8.383 3.17e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 433 on 52 degrees of freedom
## Multiple R-squared: 0.5747, Adjusted R-squared: 0.5666
## F-statistic: 70.28 on 1 and 52 DF, p-value: 3.173e-11
##
## Call:
## lm(formula = weight ~ price + passengers, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -976.81 -201.56 6.13 151.33 799.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 294.25 356.98 0.824 0.414
## price 35.99 4.36 8.256 5.80e-11 ***
## passengers 395.91 72.56 5.456 1.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 347.4 on 51 degrees of freedom
## Multiple R-squared: 0.7315, Adjusted R-squared: 0.7209
## F-statistic: 69.46 on 2 and 51 DF, p-value: 2.748e-15
Interpretation