Make sure you have set RStudio’s Working Directory to your STATS 10 Lab Folder:
Session -> Set Working Directory -> Choose Directory
IMPORTANT: Run the chunk below FIRST! We will need this data to complete the lab!
# Glossary:
# mpg: Miles/(US) gallon
# cyl: Number of cylinders
# disp: Displacement (cu.in.)
# hp: Gross horsepower
# drat: Rear axle ratio
# wt: Weight (1000 lbs)
# qsec: 1/4 mile time
# vs: Engine (0 = V-shaped, 1 = straight)
# am: Transmission (0 = automatic, 1 = manual)
# gear: Number of forward gears
# carb: Number of carburetors
# Source: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/mtcars
mtcars_df <- mtcars
head(mtcars_df)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# y ~ x means that we plot the variable y on the y-axis and the variable x on the x-axis
# We use these arguments to plot the data and label the scatter plot:
# main: Title the plot (character)
# xlab: Label the horizontal or x-axis (character)
# ylab: Label the vertical or y-axis (character)
# col: The colors of the points (character); see https://r-charts.com/colors/
# pch: The symbols of the points (numerical); see https://r-charts.com/base-r/pch-symbols/
# cex: The sizes of the points (numerical)
plot(
  mtcars_df$mpg ~ mtcars_df$wt,
  main = "Miles per Gallon versus Weight",
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per Gallon",
  col = "red",
  pch = 3,
  cex = 1.5
)
# cor(): Return the correlation coefficient between two vectors
cor(mtcars_df$mpg, mtcars_df$wt)
## [1] -0.8676594
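To see what cor() is computing, the sketch below rebuilds the Pearson correlation coefficient from its definition: the sample covariance divided by the product of the two sample standard deviations. The names x, y, n, sample_cov, and r_manual are illustrative and not part of the lab template.

```r
# Self-contained: mtcars ships with base R (datasets package)
mtcars_df <- mtcars
x <- mtcars_df$wt
y <- mtcars_df$mpg

# Pearson correlation from its definition:
# sample covariance divided by the product of the sample standard deviations
n <- length(x)
sample_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r_manual <- sample_cov / (sd(x) * sd(y))

r_manual

# Should agree with cor() up to floating-point error
all.equal(r_manual, cor(x, y))
```

Because the formula is symmetric in x and y, swapping the two vectors should give the same value.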
# lm: Fit a linear model.
# y ~ x: Specify that y and x are your DEPENDENT and INDEPENDENT variables
# data: The data frame or table from which we are pulling the values
# summary(): Print a summary table for the model
# You can read the goodness of fit (R-squared) from the summary table
linear_model <- lm(mpg ~ wt, data = mtcars_df)
summary(linear_model)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars_df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
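Rather than reading values off the printed table, you can pull them out of the model object. The sketch below uses coef() and the r.squared element of the summary, and checks that in a simple (one-predictor) linear regression the R-squared equals the squared correlation coefficient. The names coefs and r_squared are illustrative.

```r
# Self-contained: refit the model from the example above
mtcars_df <- mtcars
linear_model <- lm(mpg ~ wt, data = mtcars_df)

# coef() returns a named vector with entries "(Intercept)" and "wt"
coefs <- coef(linear_model)
coefs["(Intercept)"]
coefs["wt"]

# The R-squared is stored in the summary object
r_squared <- summary(linear_model)$r.squared
r_squared

# With one predictor, R-squared equals the squared correlation coefficient
all.equal(r_squared, cor(mtcars_df$mpg, mtcars_df$wt)^2)
```

The correlation coefficient carries a sign and measures direction and strength, while R-squared is never negative and is read as the proportion of variation in y explained by the model.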
# We use abline(model_name) to add regression lines to scatterplots
# Below, we have the same scatterplot as above.
# We use abline(linear_model) to add the regression line to it.
# lwd: Line width
# lty: Line type (lty = 2 makes dashed lines); see https://r-charts.com/base-r/line-types/
plot(
  mtcars_df$mpg ~ mtcars_df$wt,
  main = "Miles per Gallon versus Weight",
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per Gallon",
  col = "red",
  pch = 3,
  cex = 1.5
)
abline(linear_model, lwd = 1.5, lty = 2)
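The regression line can also be used to estimate y for a given x. The sketch below uses predict() with a few hypothetical car weights (the values in new_cars are made up for illustration) and confirms that the result matches intercept + slope * wt computed by hand.

```r
# Self-contained: refit the model from the example above
mtcars_df <- mtcars
linear_model <- lm(mpg ~ wt, data = mtcars_df)

# Hypothetical car weights in 1000 lbs, chosen only for illustration
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# predict() evaluates the fitted line at the new weights
predicted_mpg <- predict(linear_model, newdata = new_cars)
predicted_mpg

# The same values by hand: intercept + slope * wt
manual_mpg <- coef(linear_model)[["(Intercept)"]] +
  coef(linear_model)[["wt"]] * new_cars$wt
all.equal(unname(predicted_mpg), manual_mpg)
```

For a 3,000 lb car the fitted line gives roughly 21.25 miles per gallon. Predictions are only trustworthy for weights inside the range observed in the data.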
Explain the difference between the correlation coefficient and the goodness of fit value.
Answer: Delete this text and type in your answer.
Choose a pair of variables from the mtcars dataset other than the pair mpg and wt (you may use at most one of mpg or wt). Create a scatterplot and calculate and print the correlation coefficient. Interpret the correlation coefficient for your two chosen variables.
Scatterplot
# Fill in the dots based on the two variables you choose!
# Scroll up to the "Scatterplot" example to see how.
plot(
  mtcars_df$wt ~ mtcars_df$disp,
  main = "wt vs. disp",
  xlab = "disp",
  ylab = "wt",
  col = "red",
  pch = 3,
  cex = 1.5
)
Correlation
# Fill in the dots based on the two variables you choose!
# Scroll up to the "Correlation" example to see how.
cor(mtcars_df$wt, mtcars_df$disp)
## [1] 0.8879799
Answer: Delete this text and type in your answer.
Using the same two variables from Question 2, fit a linear regression model and plot its regression line on the scatterplot. What are the intercept and slope coefficients, and what do they mean in the context of the data?
Linear Regression
# Fill in the dots based on the variables you choose!
# Scroll up to the "Simple Linear Regression Models" example to see how.
mtcars_model <- lm(wt ~ disp, data = mtcars_df)
summary(mtcars_model)
##
## Call:
## lm(formula = wt ~ disp, data = mtcars_df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.89044 -0.29775 -0.00684  0.33428  0.66525
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5998146  0.1729964   9.248 2.74e-10 ***
## disp        0.0070103  0.0006629  10.576 1.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4574 on 30 degrees of freedom
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
## F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
Regression Line
# Exactly the same scatterplot code as in Question 2
# We only add abline(mtcars_model) at the end to add the line
plot(
  mtcars_df$wt ~ mtcars_df$disp,
  main = "wt vs. disp",
  xlab = "disp",
  ylab = "wt",
  col = "red",
  pch = 3,
  cex = 1.5
)
abline(mtcars_model, lwd = 1.5, lty = 2)
Answer: Delete this text and type in your answer.
Explain whether the linear regression line is a good model for the chosen variables from Question 2.
Answer: I would prefer a clustering strategy
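One common way to judge whether a straight line is an adequate model is to plot the residuals against the fitted values. If the points scatter randomly around zero with no curve or funnel shape, a line is a reasonable description. The sketch below does this for the wt ~ disp model. The names model_residuals and model_fitted are illustrative.

```r
# Self-contained: refit the Question 3 model
mtcars_df <- mtcars
mtcars_model <- lm(wt ~ disp, data = mtcars_df)

# resid() and fitted() extract residuals and fitted values from the model
model_residuals <- resid(mtcars_model)
model_fitted <- fitted(mtcars_model)

# Residuals versus fitted values, with a dashed reference line at zero
plot(
  model_residuals ~ model_fitted,
  main = "Residuals versus Fitted Values",
  xlab = "Fitted wt (1000 lbs)",
  ylab = "Residual (1000 lbs)",
  col = "red",
  pch = 3,
  cex = 1.5
)
abline(h = 0, lwd = 1.5, lty = 2)
```

A clear pattern in this plot would suggest that a straight line misses some structure in the data, even when R-squared is high.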
Using the same two variables from Question 2, find the correlation coefficient without using the cor() function. Check the result against your answer in Question 2.
# Hints:
# 1. What will taking the square root of the R-squared yield?
# From which table can you get the R-squared?
# 2. Remember that the correlation coefficient has a sign.
# From which plot can you see the direction of the relationship?
# sqrt() gives the magnitude |r|; the slope of mtcars_model is positive, so r is positive
sqrt(summary(mtcars_model)$r.squared)
## [1] 0.8879799
## code here (OPTIONAL)
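Answer: In simple linear regression \(R^2 = r_{xy}^2\), so \(\sqrt{R^2} = |r_{xy}|\). The sign of \(r_{xy}\) matches the sign of the slope, which is positive here, so \(r_{xy} = +\sqrt{R^2} \approx 0.888\), in agreement with Question 2.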
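The sign matters when the relationship is negative. As a sketch, the code below applies the same idea to the mpg ~ wt model from the examples, where the slope is negative: it takes the square root of R-squared for the magnitude and the sign of the slope for the direction. The names r_magnitude, slope_sign, and r_signed are illustrative.

```r
# Self-contained: refit the mpg ~ wt model from the examples
mtcars_df <- mtcars
linear_model <- lm(mpg ~ wt, data = mtcars_df)

# Magnitude of the correlation: square root of R-squared
r_magnitude <- sqrt(summary(linear_model)$r.squared)

# Direction of the relationship: sign of the slope (-1, 0, or 1)
slope_sign <- sign(coef(linear_model)[["wt"]])

# Signed correlation coefficient, recovered without calling cor()
r_signed <- slope_sign * r_magnitude
r_signed

# Check against cor()
all.equal(r_signed, cor(mtcars_df$mpg, mtcars_df$wt))
```

Taking the square root alone would return a positive number here, which would disagree with the negative trend visible in the scatterplot.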