The textbook’s chapter on linear models (“Line Up, Please”) introduces linear predictive modeling using the workhorse tool known as multiple regression. The term “multiple regression” has an odd history, dating back to an early scientific observation of a phenomenon called “regression to the mean”. These days, multiple regression is just an interesting name for using a simple linear modeling technique to measure the connection between one or more predictor variables and an outcome variable.
In this exercise, we are going to use an open dataset to explore antelope population.
This is the first exercise of the semester where there is no sample R code to help you along. Because you have had so much practice with R by now, you can create and/or find all of the code you need to accomplish these steps.
# Add your library below.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
Write a definition of a model, based on how the author uses it in this chapter.
[ A model is a simplified representation of real-world relationships, expressed mathematically. It describes how an outcome variable (such as the number of fawns) is influenced by one or more predictor variables (such as adult population, precipitation, or winter severity). A regression model estimates the expected value of the outcome variable from the values of these predictors. ]
You can find the data on Cengage’s website. This URL will enable you to download the dataset into Excel:
The more general website can be found at:
If you view this in a spreadsheet, you will find four columns of a small dataset:
# Saved the Excel file into the data folder in this project's working directory.
You have the option of saving the file to your computer and reading it into R, or you can read the data directly from the web into a dataframe.
# Write your code below.
# Step 3 - Read in the data
library(readxl)
myDF <- as.data.frame(read_excel("D:/Downloads/week8_Lab/mlr01 (1).xls"))
# Rename columns
colnames(myDF) <- c("nFawn", "adultPop", "precip", "winter")
# Confirm
head(myDF)
##   nFawn adultPop precip winter
## 1   2.9      9.2   13.2      2
## 2   2.4      8.7   11.5      3
## 3   2.0      7.2   10.8      4
## 4   2.3      8.5   12.3      2
## 5   3.2      9.6   12.6      3
## 6   1.9      6.8   10.6      5
str(myDF)
## 'data.frame': 8 obs. of 4 variables:
## $ nFawn : num 2.9 2.4 2 2.3 3.2 ...
## $ adultPop: num 9.2 8.7 7.2 8.5 9.6 ...
## $ precip : num 13.2 11.5 10.8 12.3 12.6 ...
## $ winter : num 2 3 4 2 3 5 1 3
You should inspect the data using str() to make sure
that 1) all the cases have been read in (n=8 years of observations) and
2) that there are four variables.
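The two checks above can also be written as an explicit test. This is only a sketch: `checkShape` and `demoDF` are hypothetical names, and `demoDF` holds just the six rows printed by head() in this report, not the full file, so it is expected to fail the n=8 check.

```r
# Six rows copied from the head() output in this report (the full file has 8 rows).
demoDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6),
  winter   = c(2, 3, 4, 2, 3, 5)
)

# Hypothetical helper: TRUE only if the data frame has the expected shape.
checkShape <- function(df, nRows, nCols) {
  nrow(df) == nRows && ncol(df) == nCols
}

shapeFull    <- checkShape(demoDF, nRows = 8, nCols = 4)  # FALSE: two rows are missing
shapeExcerpt <- checkShape(demoDF, nRows = 6, nCols = 4)  # TRUE
print(c(full = shapeFull, excerpt = shapeExcerpt))
```

With the real data frame, `stopifnot(checkShape(myDF, 8, 4))` would stop the script if rows or columns were lost while reading the file.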
# Write your code below.
str(myDF)
## 'data.frame': 8 obs. of 4 variables:
## $ nFawn : num 2.9 2.4 2 2.3 3.2 ...
## $ adultPop: num 9.2 8.7 7.2 8.5 9.6 ...
## $ precip : num 13.2 11.5 10.8 12.3 12.6 ...
## $ winter : num 2 3 4 2 3 5 1 3
head(myDF)
##   nFawn adultPop precip winter
## 1   2.9      9.2   13.2      2
## 2   2.4      8.7   11.5      3
## 3   2.0      7.2   10.8      4
## 4   2.3      8.5   12.3      2
## 5   3.2      9.6   12.6      3
## 6   1.9      6.8   10.6      5
Create bivariate plots of the number of baby fawns versus adult
antelope population, precipitation that year, and severity of the
winter.
Your code should produce three separate plots. Make
sure the y-axis and x-axis are labeled. Keeping in mind that the number
of fawns is the outcome (or dependent) variable, which axis should it go
on in your plots? You can also create scatter plots where size and
color reflect the two variables you didn’t use (remember the
visualization homework/lab). If you create these plots, you can earn
1 extra point.
Question: which variable is the most highly correlated with Fawn Count?
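One way to approach this question numerically is with cor(). The sketch below is an assumption-laden illustration: `demoDF` is a hypothetical data frame holding only the six rows printed by head() in this report, so its correlations will differ from those of the full eight-row data set. Running the same lines on `myDF` would give the values that actually answer the question.

```r
# Six rows copied from the head() output in this report (not the full data).
demoDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6),
  winter   = c(2, 3, 4, 2, 3, 5)
)

# Pearson correlation of each predictor with the outcome, nFawn.
fawnCor <- cor(demoDF)["nFawn", c("adultPop", "precip", "winter")]
print(round(fawnCor, 3))

# The predictor with the largest absolute correlation.
strongest <- names(which.max(abs(fawnCor)))
print(strongest)
```

The absolute value is used because a strong negative correlation is just as informative as a strong positive one.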
# Write your code below.
# Here, Fawn Count (nFawn) is the dependent variable and should always go on the y-axis.
ggplot(myDF, aes(x = adultPop, y = nFawn)) +
  geom_point(color = "steelblue", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Adult Antelope Population", y = "Number of Fawns",
       title = "Fawn Count vs Adult Population")
## `geom_smooth()` using formula = 'y ~ x'
# Write your code below.
ggplot(myDF, aes(x = precip, y = nFawn)) +
  geom_point(color = "forestgreen", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Annual Precipitation", y = "Number of Fawns",
       title = "Fawn Count vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'
# Write your code below.
ggplot(myDF, aes(x = winter, y = nFawn)) +
  geom_point(color = "darkorange", size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Winter Severity Index", y = "Number of Fawns",
       title = "Fawn Count vs Winter Severity")
## `geom_smooth()` using formula = 'y ~ x'
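The optional plot described in this step, where point size and color carry the two variables not on the axes, might look like the sketch below. It assumes ggplot2 is installed. `demoDF` and `bubblePlot` are hypothetical names, and `demoDF` contains only the six rows shown by head() in this report. Swapping in `myDF` would plot the full data.

```r
library(ggplot2)

# Six rows copied from the head() output in this report (not the full data).
demoDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6),
  winter   = c(2, 3, 4, 2, 3, 5)
)

# Outcome on the y-axis; precipitation mapped to size, winter severity to color.
bubblePlot <- ggplot(demoDF, aes(x = adultPop, y = nFawn,
                                 size = precip, color = winter)) +
  geom_point() +
  labs(x = "Adult Antelope Population", y = "Number of Fawns",
       size = "Precipitation", color = "Winter Severity",
       title = "Fawn Count vs Adult Population")

print(bubblePlot)
```

The same pattern works for the other two predictors by rotating which variable sits on the x-axis and which two are mapped to size and color.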
Create three regression models of increasing complexity using
lm(), then analyze the results. Based on the knowledge
you’ve accumulated from Step 5, develop models.
# Write your code below.
# Predict Fawn Count using Precipitation
model1 <- lm(nFawn ~ precip, data=myDF)
summary(model1)
##
## Call:
## lm(formula = nFawn ~ precip, data = myDF)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.33747 -0.08040 -0.00889  0.03023  0.43399
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.63251    0.87591  -3.005  0.02384 *
## precip       0.42845    0.07244   5.915  0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2356 on 6 degrees of freedom
## Multiple R-squared: 0.8536, Adjusted R-squared: 0.8292
## F-statistic: 34.99 on 1 and 6 DF, p-value: 0.001039
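To see what the coefficients in this output mean, the fitted line can be written out by hand. The sketch below copies the intercept and slope from the summary above. `predictFawn` is a hypothetical helper name, and the precipitation value of 12 is chosen only because it falls inside the range of the printed data.

```r
# Coefficients copied from the summary(model1) output in this report.
b0 <- -2.63251  # intercept
b1 <-  0.42845  # slope for precip

# Hypothetical helper: predicted fawn count for a given precipitation value.
predictFawn <- function(precip) {
  b0 + b1 * precip
}

pred12 <- predictFawn(12)
print(pred12)  # 2.50889

# Raising precipitation by one unit raises the prediction by the slope.
slopeCheck <- predictFawn(13) - predictFawn(12)
print(round(slopeCheck, 5))  # 0.42845
```

With the fitted model object available, `predict(model1, newdata = data.frame(precip = 12))` should return the same prediction up to rounding of the printed coefficients.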
# Write your code below.
# Predict Fawn Count using Precipitation and Winter Severity
model2 <- lm(nFawn ~ precip + winter, data=myDF)
summary(model2)
##
## Call:
## lm(formula = nFawn ~ precip + winter, data = myDF)
##
## Residuals:
##         1         2         3         4         5         6         7         8
## -0.165458  0.188313  0.006417 -0.193358  0.289080 -0.193312 -0.010695  0.079013
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -5.7791     2.2139  -2.610  0.04765 *
## precip        0.6357     0.1511   4.207  0.00843 **
## winter        0.2269     0.1490   1.522  0.18842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2133 on 5 degrees of freedom
## Multiple R-squared: 0.9, Adjusted R-squared: 0.86
## F-statistic: 22.49 on 2 and 5 DF, p-value: 0.003164
# Write your code below.
# Predict Fawn Count using Precipitation, Winter Severity, and Adult Population
model3 <- lm(nFawn ~ precip + winter + adultPop, data=myDF)
summary(model3)
##
## Call:
## lm(formula = nFawn ~ precip + winter + adultPop, data = myDF)
##
## Residuals:
##        1        2        3        4        5        6        7        8
## -0.11533 -0.02661  0.09882 -0.11723  0.02734 -0.04854  0.11715  0.06441
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.92201    1.25562  -4.716   0.0092 **
## precip       0.40150    0.10990   3.653   0.0217 *
## winter       0.26295    0.08514   3.089   0.0366 *
## adultPop     0.33822    0.09947   3.400   0.0273 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1209 on 4 degrees of freedom
## Multiple R-squared: 0.9743, Adjusted R-squared: 0.955
## F-statistic: 50.52 on 3 and 4 DF, p-value: 0.001229
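As a sanity check on how to read this output, the fitted value and residual for the first observation can be rebuilt from the printed coefficients. This is a sketch using only numbers shown in this report: the coefficients from the summary above and row 1 of the head() output. The variable names are illustrative.

```r
# Coefficients copied from the summary(model3) output in this report.
b <- c(intercept = -5.92201, precip = 0.40150,
       winter = 0.26295, adultPop = 0.33822)

# Row 1 of the data, copied from the head() output.
row1 <- c(nFawn = 2.9, adultPop = 9.2, precip = 13.2, winter = 2)

# Fitted value: intercept plus each coefficient times its predictor.
fitted1 <- b[["intercept"]] +
  b[["precip"]]   * row1[["precip"]] +
  b[["winter"]]   * row1[["winter"]] +
  b[["adultPop"]] * row1[["adultPop"]]

# Residual: observed minus fitted.
resid1 <- row1[["nFawn"]] - fitted1

print(round(fitted1, 4))  # 3.0153
print(round(resid1, 4))   # -0.1153
```

The result agrees with the first residual in the output (-0.11533) to the precision of the printed coefficients.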
Which regression model works best? Which of the predictors are statistically significant in each model? If you wanted to create the most parsimonious model (i.e., the one that did the best job with the fewest predictors), what would it contain? You MUST answer these questions.
[ Best model: Model 3 fits best. Its adjusted R-squared is 0.955, compared with 0.86 for Model 2 and 0.8292 for Model 1, and its residual standard error is the lowest (0.1209 versus 0.2133 and 0.2356). Significant predictors: In Model 1, precipitation is significant (p = 0.00104). In Model 2, precipitation is significant (p = 0.00843) but winter severity is not (p = 0.188). In Model 3, precipitation (p = 0.0217), winter severity (p = 0.0366), and adult population (p = 0.0273) are all significant at the 0.05 level. Parsimonious model: Of the three models fitted here, Model 1 (precipitation alone) is the most parsimonious. A single predictor already explains about 85% of the variance in fawn counts, and adding winter severity in Model 2 contributes a non-significant term for only a small gain in adjusted R-squared. If predictive accuracy matters more than simplicity, Model 3 is the better choice, since every predictor in it is significant. ]
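Because adjusted R-squared penalizes each extra predictor, it is a common yardstick for comparing models like these. The sketch below recomputes it from the Multiple R-squared values printed in the three summaries, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) with n = 8 observations. `adjR2` is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p.
adjR2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 8                                                   # years of observations
r2 <- c(model1 = 0.8536, model2 = 0.9, model3 = 0.9743)  # from the summaries
p  <- c(model1 = 1, model2 = 2, model3 = 3)              # predictors per model

adjusted <- adjR2(r2, n, p)
print(round(adjusted, 4))
```

The results match the Adjusted R-squared lines in the output (0.8292, 0.86, 0.955). With the model objects available, `summary(model1)$adj.r.squared` returns the same quantity directly.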
Please only include print outs of data sets using “head” function. I will take points off if you include more than two pages of dataset print outs.