Instructions

The textbook’s chapter on linear models (“Line Up, Please”) introduces linear predictive modeling using the workhorse tool known as multiple regression. The term “multiple regression” has an odd history, dating back to an early scientific observation of a phenomenon called “regression to the mean”. These days, multiple regression is just an interesting name for using a simple linear modeling technique to measure the connection between one or more predictor variables and an outcome variable.

In this exercise, we are going to use an open dataset to explore an antelope population.

This is the first exercise of the semester where there is no sample R code to help you along. Because you have had so much practice with R by now, you can create and/or find all of the code you need to accomplish these steps.


# Add your library below.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3

Step 1 - Define “Model”

Write a definition of a model, based on how the author uses it in this chapter.

[A model is a simplified representation of a real-world relationship, expressed mathematically. It describes how an outcome variable (like the number of fawns) is influenced by one or more predictor variables (such as adult population, precipitation, or winter severity). In regression, the model estimates the expected value of the dependent variable from these predictors.]


Step 2 - Review the data

You can find the data on Cengage’s website. This URL will enable you to download the dataset into Excel:

The more general website can be found at:

If you view this in a spreadsheet, you will find four columns of a small dataset:

# Saved the Excel file into the data folder in this project's working directory.

Step 3 - Read in the data

You have the option of saving the file to your computer and reading it into R, or you can read the data directly from the web into a dataframe.

# Write your code below.
# Step 3 - Read in the data
library(readxl)

myDF <- as.data.frame(read_excel("D:/Downloads/week8_Lab/mlr01 (1).xls"))

# Rename columns
colnames(myDF) <- c("nFawn", "adultPop", "precip", "winter")

# Confirm
head(myDF)
##   nFawn adultPop precip winter
## 1   2.9      9.2   13.2      2
## 2   2.4      8.7   11.5      3
## 3   2.0      7.2   10.8      4
## 4   2.3      8.5   12.3      2
## 5   3.2      9.6   12.6      3
## 6   1.9      6.8   10.6      5
str(myDF)
## 'data.frame':    8 obs. of  4 variables:
##  $ nFawn   : num  2.9 2.4 2 2.3 3.2 ...
##  $ adultPop: num  9.2 8.7 7.2 8.5 9.6 ...
##  $ precip  : num  13.2 11.5 10.8 12.3 12.6 ...
##  $ winter  : num  2 3 4 2 3 5 1 3

Step 4 - Inspect the data

You should inspect the data using str() to make sure that 1) all the cases have been read in (n=8 years of observations) and 2) that there are four variables.

# Write your code below.
str(myDF)
## 'data.frame':    8 obs. of  4 variables:
##  $ nFawn   : num  2.9 2.4 2 2.3 3.2 ...
##  $ adultPop: num  9.2 8.7 7.2 8.5 9.6 ...
##  $ precip  : num  13.2 11.5 10.8 12.3 12.6 ...
##  $ winter  : num  2 3 4 2 3 5 1 3
head(myDF)
##   nFawn adultPop precip winter
## 1   2.9      9.2   13.2      2
## 2   2.4      8.7   11.5      3
## 3   2.0      7.2   10.8      4
## 4   2.3      8.5   12.3      2
## 5   3.2      9.6   12.6      3
## 6   1.9      6.8   10.6      5
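The two checks requested in Step 4 can also be made programmatically rather than by reading the str() output. The following is a minimal sketch, not part of the assignment: `check_antelope()` is a hypothetical helper name, and the data frame is rebuilt inline so the block runs on its own. Rows 1 to 6 and the winter column come from the output above; the first three values in rows 7 and 8 are not printed by head() and are assumptions that should be verified against the downloaded file.

```r
# Rebuild the data so the example is self-contained.
# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

# Hypothetical helper: stops with an error if the data frame does not have
# the expected shape, contains non-numeric columns, or has missing values.
check_antelope <- function(df, n_rows = 8, n_cols = 4) {
  stopifnot(
    is.data.frame(df),
    nrow(df) == n_rows,                       # 1) all cases were read in
    ncol(df) == n_cols,                       # 2) four variables are present
    all(vapply(df, is.numeric, logical(1))),  # every column is numeric
    !anyNA(df)                                # no missing values
  )
  invisible(TRUE)
}

check_antelope(myDF)
```

If the file were read with a header row missing or a column dropped, the call would fail immediately instead of letting the problem surface later in the regression steps.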

Step 5 - Create bivariate plots

Create bivariate plots of the number of baby fawns versus adult antelope population, precipitation that year, and severity of the winter.
Your code should produce three separate plots. Make sure the y-axis and x-axis are labeled. Keeping in mind that the number of fawns is the outcome (or dependent) variable, which axis should it go on in your plots? You can also create scatter plots where size and color reflect the two variables you didn’t use (remember the visualization homework/lab). If you create these plots, you can earn 1 extra point.

Question: which variable is the most highly correlated with Fawn Count?
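The question above can be checked numerically as well as visually. This is a sketch using Pearson correlations from base R's cor(); it is one reasonable way to answer, not necessarily the one the textbook intends. The data frame is rebuilt inline so the block runs on its own, and the first three values in rows 7 and 8 are assumptions (head() prints only six rows).

```r
# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

predictors <- c("adultPop", "precip", "winter")

# Pearson correlation of each predictor with the outcome
cors <- sapply(myDF[predictors], function(x) cor(x, myDF$nFawn))

# Rank by absolute value, since a strong negative correlation also counts
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors_sorted, 3))

strongest <- names(cors_sorted)[1]
```

On the eight rows as reconstructed here, adult population comes out slightly ahead of precipitation (roughly 0.94 versus 0.92), and winter severity is negatively correlated with fawn count (roughly -0.74).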

Step 5.1 - Fawn Count by Adult Population

# Write your code below.
# Here, Fawn Count (nFawn) is the dependent variable and should always go on the y-axis.
ggplot(myDF, aes(x=adultPop, y=nFawn)) +
  geom_point(color="steelblue", size=3) +
  geom_smooth(method="lm", se=FALSE, color="red") +
  labs(x="Adult Antelope Population", y="Number of Fawns", 
       title="Fawn Count vs Adult Population")
## `geom_smooth()` using formula = 'y ~ x'

Step 5.2 - Fawn Count by Annual Precipitation

# Write your code below.
ggplot(myDF, aes(x=precip, y=nFawn)) +
  geom_point(color="forestgreen", size=3) +
  geom_smooth(method="lm", se=FALSE, color="red") +
  labs(x="Annual Precipitation", y="Number of Fawns",
       title="Fawn Count vs Precipitation")
## `geom_smooth()` using formula = 'y ~ x'

Step 5.3 - Fawn Count by Winter Severity Index

# Write your code below.
ggplot(myDF, aes(x=winter, y=nFawn)) +
  geom_point(color="darkorange", size=3) +
  geom_smooth(method="lm", se=FALSE, color="red") +
  labs(x="Winter Severity Index", y="Number of Fawns",
       title="Fawn Count vs Winter Severity")
## `geom_smooth()` using formula = 'y ~ x'
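For the extra-credit variant mentioned in Step 5, one possible sketch maps the two unused variables to point size and color. The choice of which variable goes to which aesthetic is arbitrary here, and the data frame is rebuilt inline with rows 7 and 8 assumed (head() prints only six rows).

```r
library(ggplot2)

# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

# Fawn count (outcome) on the y-axis, adult population on the x-axis;
# precipitation drives point size and winter severity drives point color.
p <- ggplot(myDF, aes(x = adultPop, y = nFawn, size = precip, color = winter)) +
  geom_point() +
  labs(x = "Adult Antelope Population", y = "Number of Fawns",
       size = "Annual Precipitation", color = "Winter Severity Index",
       title = "Fawn Count vs Adult Population")

print(p)
```

The same pattern applies to the other two plots by swapping which variable is on the x-axis and which two are mapped to size and color.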


Step 6 - Create regression models

Create three regression models of increasing complexity using lm(), then analyze the results. Based on the knowledge you’ve accumulated from Step 5, develop models.

Step 6.1 - Predict Fawn Count using one input variable

# Write your code below.
# Predict Fawn Count using Precipitation
model1 <- lm(nFawn ~ precip, data=myDF)
summary(model1)
## 
## Call:
## lm(formula = nFawn ~ precip, data = myDF)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33747 -0.08040 -0.00889  0.03023  0.43399 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2.63251    0.87591  -3.005  0.02384 * 
## precip       0.42845    0.07244   5.915  0.00104 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2356 on 6 degrees of freedom
## Multiple R-squared:  0.8536, Adjusted R-squared:  0.8292 
## F-statistic: 34.99 on 1 and 6 DF,  p-value: 0.001039
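To make the coefficients above concrete, a fitted model can be used to compute an expected fawn count for a given precipitation value. The sketch below is illustrative only: the value 12 is a hypothetical input chosen from within the observed range, and the data frame is rebuilt inline with rows 7 and 8 assumed (head() prints only six rows).

```r
# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

model1 <- lm(nFawn ~ precip, data = myDF)

# Hypothetical new observation: precipitation of 12
new_year <- data.frame(precip = 12)

# predict() applies the fitted equation to the new data
pred <- predict(model1, newdata = new_year)

# The same number by hand: intercept + slope * precip
manual <- coef(model1)[["(Intercept)"]] + coef(model1)[["precip"]] * 12

print(c(predicted = unname(pred), by_hand = manual))
```

Both values agree because a simple linear model is just the line defined by its intercept and slope; with the coefficients reported above, the result is about 2.51.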

Step 6.2 - Predict Fawn Count using two input variables

# Write your code below.
# Predict Fawn Count using Precipitation and Winter Severity
model2 <- lm(nFawn ~ precip + winter, data=myDF)
summary(model2)
## 
## Call:
## lm(formula = nFawn ~ precip + winter, data = myDF)
## 
## Residuals:
##         1         2         3         4         5         6         7         8 
## -0.165458  0.188313  0.006417 -0.193358  0.289080 -0.193312 -0.010695  0.079013 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5.7791     2.2139  -2.610  0.04765 * 
## precip        0.6357     0.1511   4.207  0.00843 **
## winter        0.2269     0.1490   1.522  0.18842   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2133 on 5 degrees of freedom
## Multiple R-squared:    0.9,  Adjusted R-squared:   0.86 
## F-statistic: 22.49 on 2 and 5 DF,  p-value: 0.003164

Step 6.3 - Predict Fawn Count using three input variables

# Write your code below.
# Predict Fawn Count using Precipitation, Winter Severity, and Adult Population
model3 <- lm(nFawn ~ precip + winter + adultPop, data=myDF)
summary(model3)
## 
## Call:
## lm(formula = nFawn ~ precip + winter + adultPop, data = myDF)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.11533 -0.02661  0.09882 -0.11723  0.02734 -0.04854  0.11715  0.06441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -5.92201    1.25562  -4.716   0.0092 **
## precip       0.40150    0.10990   3.653   0.0217 * 
## winter       0.26295    0.08514   3.089   0.0366 * 
## adultPop     0.33822    0.09947   3.400   0.0273 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1209 on 4 degrees of freedom
## Multiple R-squared:  0.9743, Adjusted R-squared:  0.955 
## F-statistic: 50.52 on 3 and 4 DF,  p-value: 0.001229
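Rather than reading significance stars off each summary by eye, the p-values can be pulled from the coefficient table. This is a sketch under stated assumptions: `significant_terms()` is a hypothetical helper, the 0.05 threshold is a conventional choice rather than something the assignment specifies, and the data frame is rebuilt inline with rows 7 and 8 assumed (head() prints only six rows).

```r
# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

model1 <- lm(nFawn ~ precip, data = myDF)
model2 <- lm(nFawn ~ precip + winter, data = myDF)
model3 <- lm(nFawn ~ precip + winter + adultPop, data = myDF)

# Hypothetical helper: names of the predictors (intercept excluded)
# whose t-test p-value falls below alpha.
significant_terms <- function(model, alpha = 0.05) {
  p <- coef(summary(model))[, "Pr(>|t|)"]
  p <- p[names(p) != "(Intercept)"]
  names(p)[p < alpha]
}

for (name in c("model1", "model2", "model3")) {
  cat(name, ":", paste(significant_terms(get(name)), collapse = ", "), "\n")
}
```

Applied to the three summaries above, this lists precip for Model 1, precip alone for Model 2, and all three predictors for Model 3.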

Step 6.4 - Analysis

Which regression model works best? Which of the predictors are statistically significant in each model? If you wanted to create the most parsimonious model (i.e., the one that did the best job with the fewest predictors), what would it contain? You MUST answer these questions.

[Best model: Model 3 fits best. Its adjusted R-squared is 0.955, compared with 0.86 for Model 2 and 0.8292 for Model 1, and its residual standard error (0.1209) is the smallest of the three.
Significant predictors: In Model 1, precipitation is significant (p = 0.00104). In Model 2, precipitation is significant (p = 0.00843) but winter severity is not (p = 0.188). In Model 3, precipitation (p = 0.0217), winter severity (p = 0.0366), and adult population (p = 0.0273) are all significant at the 0.05 level.
Parsimonious model: Of the models fitted here, Model 1 (precipitation only) is the most parsimonious. A single predictor explains about 85% of the variance in fawn counts, and adding winter severity in Model 2 does not contribute a significant term. Model 3 is the better choice if fit matters more than simplicity, although with only 8 observations all of these estimates should be treated with caution.]
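The comparison asked for in Step 6.4 can be laid out in one table. The sketch below collects the number of predictors, adjusted R-squared, and residual standard error for each model, then runs anova() on the nested sequence; this is one common way to compare nested linear models, not the only one. The data frame is rebuilt inline with rows 7 and 8 assumed (head() prints only six rows).

```r
# ASSUMPTION: rows 7-8 of nFawn, adultPop, and precip are not shown by head();
# check them against your own copy of the file.
myDF <- data.frame(
  nFawn    = c(2.9, 2.4, 2.0, 2.3, 3.2, 1.9, 3.4, 2.1),
  adultPop = c(9.2, 8.7, 7.2, 8.5, 9.6, 6.8, 9.7, 7.9),
  precip   = c(13.2, 11.5, 10.8, 12.3, 12.6, 10.6, 14.1, 11.2),
  winter   = c(2, 3, 4, 2, 3, 5, 1, 3)
)

models <- list(
  model1 = lm(nFawn ~ precip, data = myDF),
  model2 = lm(nFawn ~ precip + winter, data = myDF),
  model3 = lm(nFawn ~ precip + winter + adultPop, data = myDF)
)

# One row per model: size and fit statistics side by side
comparison <- data.frame(
  model      = names(models),
  predictors = sapply(models, function(m) length(coef(m)) - 1),
  adj_r2     = sapply(models, function(m) summary(m)$adj.r.squared),
  rse        = sapply(models, function(m) summary(m)$sigma),
  row.names  = NULL
)
print(comparison)

# Sequential F-tests: does each added predictor reduce residual variance?
print(anova(models$model1, models$model2, models$model3))
```

Adjusted R-squared is used instead of plain R-squared because it penalizes extra predictors, which matters when comparing models of different sizes on only eight observations.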


Step 7 - Upload the compiled file

Please only include printouts of datasets using the “head” function. I will take points off if you include more than two pages of dataset printouts.