Instructions

The textbook’s chapter on linear models (“Line Up, Please”) introduces linear predictive modeling using the workhorse tool known as multiple regression. The term “multiple regression” has an odd history, dating back to an early scientific observation of a phenomenon called “regression to the mean.” These days, multiple regression is just an interesting name for a simple linear modeling technique that measures the connection between one or more predictor variables and an outcome variable.

In this exercise, we are going to use an open dataset to explore antelope population.

This is the first exercise of the semester where there is no sample R code to help you along. Because you have had so much practice with R by now, you can create and/or find all of the code you need to accomplish these steps.

# Add your library below.
library(tidyverse)
library(cowplot)
library(readxl)

Step 1 - Define “Model”

Write a definition of a model, based on how the author uses it in this chapter.

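[A model, as the author uses the term, is a set of numerical coefficients estimated from data. The coefficients describe how the predictor variables relate to the outcome variable and can be used to predict that outcome.]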
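To make the idea of "a set of numerical coefficients" concrete, here is a minimal sketch on made-up toy data (the vectors `x` and `y` and the name `toy_fit` are illustrative assumptions, not part of the lab data). Because `y` is built as exactly 2 + 3x, `lm()` should recover an intercept of about 2 and a slope of about 3.

```r
# Toy data, invented for illustration only: y is exactly 2 + 3 * x.
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

# Fit a linear model; the "model" is the pair of estimated coefficients.
toy_fit <- lm(y ~ x)
toy_coefs <- coef(toy_fit)
print(toy_coefs)

# The coefficients can then be used to predict the outcome for a new x.
toy_pred <- predict(toy_fit, newdata = data.frame(x = 6))
print(toy_pred)
```

The prediction for x = 6 is just the intercept plus the slope times 6, which is the same arithmetic the later steps apply to the antelope models.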

Step 2 - Review the data

You can find the data on Cengage’s website. This URL will enable you to download the dataset into Excel:

http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr01.xls

The more general website can be found at:

http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html

If you view this in a spreadsheet, you will find four columns of a small dataset:

The first column shows the number of fawns in a given spring (fawns are baby antelope).
The second column shows the population of adult antelope.
The third shows the annual precipitation that year.
And finally the last column shows how bad the winter was during that year.

# No code necessary; Just review the data.

Step 3 - Read in the data

You have the option of saving the file to your computer and reading it into R, or you can read the data directly from the web into a dataframe.

# Write your code below.
Ant <- read_xls("C:/Users/sharl/Desktop/USF/Spring 2021/LIS 4761 Data-Text Mining/HW/Antelope Population Data Set.xls")
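For the second option (reading directly from the web), one possible approach is sketched below. As far as I know, `readxl` reads from a local file path rather than a URL, so the sketch downloads to a temporary file first. The helper name `read_xls_from_url` is hypothetical, and the call is left commented out because it needs network access and the Cengage URL may no longer be live.

```r
# Hypothetical helper: download an Excel file to a temp file, then read it.
read_xls_from_url <- function(url) {
  tmp <- tempfile(fileext = ".xls")
  on.exit(unlink(tmp), add = TRUE)          # remove the temp file afterwards
  # mode = "wb" keeps the binary file intact on Windows.
  utils::download.file(url, destfile = tmp, mode = "wb")
  readxl::read_excel(tmp)                   # returns a tibble
}

# Example call (requires the readxl package and a working connection):
# Ant <- read_xls_from_url("http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/excel/mlr01.xls")
```

Reading from the URL avoids a machine-specific path in the script, at the cost of depending on the site staying available.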

Step 4 - Inspect the data

You should inspect the data using str() to make sure that 1) all the cases have been read in (n=8 years of observations) and 2) that there are four variables.

# Write your code below.
str(Ant)

## tibble [8 x 4] (S3: tbl_df/tbl/data.frame)
##  $ X1: num [1:8] 2.9 2.4 2 2.3 3.2 ...
##  $ X2: num [1:8] 9.2 8.7 7.2 8.5 9.6 ...
##  $ X3: num [1:8] 13.2 11.5 10.8 12.3 12.6 ...
##  $ X4: num [1:8] 2 3 4 2 3 5 1 3

cnames <- c("Number of Fawn", "Pop of Adult Antelope",
            "Annual Precipitation", "Severity of Winter")

colnames(Ant) <- cnames
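The two checks from this step (8 cases, 4 variables) can also be written as explicit assertions, so the script stops if the file was read incorrectly. The sketch below uses a hypothetical stand-in data frame `toy` filled with placeholder zeros (not the real measurements) and a hypothetical helper `check_shape()`. The shorter column names are only a suggestion: names without spaces avoid the need for backticks in formulas and plots.

```r
# Placeholder data frame with the same shape as the lab data (values are
# zeros, NOT the real measurements).
toy <- data.frame(X1 = numeric(8), X2 = numeric(8),
                  X3 = numeric(8), X4 = numeric(8))

# Hypothetical helper: stop with an error unless the shape and types match.
check_shape <- function(df, n_rows = 8, n_cols = 4) {
  stopifnot(nrow(df) == n_rows, ncol(df) == n_cols)
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  invisible(TRUE)
}
check_shape(toy)

# Optional: syntactic names, so a formula can read fawn ~ winter.
names(toy) <- c("fawn", "adult", "precip", "winter")
str(toy)
```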

Step 5 - Create bivariate plots

Create bivariate plots of the number of baby fawns versus adult antelope population, precipitation that year, and severity of the winter.
Your code should produce three separate plots. Make sure the y-axis and x-axis are labeled. Keeping in mind that the number of fawns is the outcome (or dependent) variable, which axis should it go on in your plots? You can also create scatter plots where size and color reflect the two variables you didn’t use (remember the visualization homework/lab). If you create these plots, you can earn 1 extra point.
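As a small illustration of the labeling requirement, the sketch below puts the outcome (fawn count) on the y-axis and sets both axis labels explicitly in base R. It uses only the first five rows that `str()` printed in Step 4, so it is an incomplete picture of the data; the label wording is an assumption.

```r
# First five values of X1 (fawn) and X4 (winter severity) as printed by str().
fawn   <- c(2.9, 2.4, 2.0, 2.3, 3.2)
winter <- c(2, 3, 4, 2, 3)

# Outcome on the y-axis, predictor on the x-axis, both labeled explicitly.
plot(winter, fawn,
     xlab = "Severity of winter (index)",
     ylab = "Number of fawn",
     main = "Fawn vs Severity of Winter (first five rows only)")
```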

Step 5.1 - Fawn Count by Adult Population

# Write your code below.
plot(Ant$`Pop of Adult Antelope`, Ant$`Number of Fawn`)

ggplot(Ant, aes(`Pop of Adult Antelope`, `Number of Fawn`)) + 
  geom_point(aes(color = `Severity of Winter`, size = `Annual Precipitation`)) +
  ggtitle("Fawn vs Adult Antelope")

Step 5.2 - Fawn Count by Annual Precipitation

# Write your code below.
plot(Ant$`Annual Precipitation`, Ant$`Number of Fawn`)

ggplot(Ant, aes(`Annual Precipitation`, `Number of Fawn`)) + 
  geom_point(aes(color = `Severity of Winter`, 
                 size = `Pop of Adult Antelope`)) +
  ggtitle("Fawn vs Annual Precipitation")

Step 5.3 - Fawn Count by Winter Severity Index

# Write your code below.
plot(Ant$`Severity of Winter`, Ant$`Number of Fawn`)

g <- ggplot(Ant, aes(`Severity of Winter`, `Number of Fawn`)) + 
  geom_point(aes(color = `Annual Precipitation`, size = `Pop of Adult Antelope`)) +
  ggtitle("Fawn vs Severity of Winter")
g

g + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Step 6 - Create regression models

Create three regression models of increasing complexity using lm(), then analyze the results.

Model one: Fit the model to predict the number of fawns from the severity of the winter.
Model two: Fit the model to predict the number of fawns from two variables (one should be the severity of the winter).
Model three: Fit the model to predict the number of fawns from the three other variables.

Step 6.1 - Predict Fawn Count by Winter Severity Index

# Write your code below.
plot(Ant$`Severity of Winter`, Ant$`Number of Fawn`)
model1 <- lm(formula = `Number of Fawn` ~ `Severity of Winter`, data = Ant)
summary(model1)

## 
## Call:
## lm(formula = `Number of Fawn` ~ `Severity of Winter`, data = Ant)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52069 -0.20431 -0.00172  0.13017  0.71724 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.4966     0.3904   8.957 0.000108 ***
## `Severity of Winter`  -0.3379     0.1258  -2.686 0.036263 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.415 on 6 degrees of freedom
## Multiple R-squared:  0.5459, Adjusted R-squared:  0.4702 
## F-statistic: 7.213 on 1 and 6 DF,  p-value: 0.03626

abline(model1)
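To read the Model 1 output as a prediction rule, the sketch below plugs the printed coefficients (intercept 3.4966, slope -0.3379) into the straight-line equation by hand. The function name `predict_model1` is hypothetical, and the coefficients are the rounded values from the summary, so results may differ slightly from `predict(model1, ...)`.

```r
# Coefficients copied from the summary(model1) output above (rounded).
b0 <- 3.4966    # intercept
b1 <- -0.3379   # change in fawn count per one-unit rise in winter severity

# Hypothetical helper: predicted fawn count for a given severity index.
predict_model1 <- function(winter) b0 + b1 * winter

# Predictions across the severity values that occur in the data (1 to 5).
severity <- 1:5
model1_preds <- data.frame(severity, predicted_fawn = predict_model1(severity))
print(model1_preds)
```

The negative slope means each one-step increase in the severity index lowers the predicted fawn count by about 0.34, in the dataset's own units.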

Step 6.2 - Predict Fawn Count by Winter Severity Index + your choice of variable

# Write your code below.
model2 <- lm(formula = `Number of Fawn` ~ `Severity of Winter` +
               `Pop of Adult Antelope`, data = Ant)
summary(model2)

## 
## Call:
## lm(formula = `Number of Fawn` ~ `Severity of Winter` + `Pop of Adult Antelope`, 
##     data = Ant)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
##  0.01231 -0.27531  0.10301 -0.19154  0.01535  0.15880  0.29992 -0.12256 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -2.46009    1.53443  -1.603   0.1698  
## `Severity of Winter`     0.07058    0.12461   0.566   0.5956  
## `Pop of Adult Antelope`  0.56594    0.14439   3.920   0.0112 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2252 on 5 degrees of freedom
## Multiple R-squared:  0.8885, Adjusted R-squared:  0.8439 
## F-statistic: 19.92 on 2 and 5 DF,  p-value: 0.004152

Step 6.3 - Predict Fawn Count by the three other variables

# Write your code below.
model3 <- lm(formula = `Number of Fawn` ~ ., data = Ant)
summary(model3)

## 
## Call:
## lm(formula = `Number of Fawn` ~ ., data = Ant)
## 
## Residuals:
##        1        2        3        4        5        6        7        8 
## -0.11533 -0.02661  0.09882 -0.11723  0.02734 -0.04854  0.11715  0.06441 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             -5.92201    1.25562  -4.716   0.0092 **
## `Pop of Adult Antelope`  0.33822    0.09947   3.400   0.0273 * 
## `Annual Precipitation`   0.40150    0.10990   3.653   0.0217 * 
## `Severity of Winter`     0.26295    0.08514   3.089   0.0366 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1209 on 4 degrees of freedom
## Multiple R-squared:  0.9743, Adjusted R-squared:  0.955 
## F-statistic: 50.52 on 3 and 4 DF,  p-value: 0.001229
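As a sanity check on the Model 3 output, the sketch below recomputes the fitted value and residual for the first observation by hand. The inputs are the first values of each column as printed by `str()` in Step 4 (fawn 2.9, adult population 9.2, precipitation 13.2, winter severity 2), and the coefficients are the rounded estimates from the summary. The short names in the vectors are illustrative.

```r
# Rounded coefficient estimates from summary(model3).
coefs <- c(intercept = -5.92201, adult = 0.33822,
           precip = 0.40150, winter = 0.26295)

# First observation, taken from the str() output in Step 4.
row1 <- c(adult = 9.2, precip = 13.2, winter = 2)
observed_fawn <- 2.9

# Prediction = intercept + sum of (coefficient * predictor value).
predicted_fawn <- unname(coefs["intercept"] + sum(coefs[names(row1)] * row1))
residual1 <- observed_fawn - predicted_fawn

print(round(predicted_fawn, 4))
print(round(residual1, 4))
```

The hand-computed residual comes out near -0.1153, which agrees with the first residual in the output above (-0.11533) up to rounding of the coefficients.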

Step 6.4 - Analysis

Which regression model works best? Which of the predictors are statistically significant in each model? If you wanted to create the most parsimonious model (i.e., the one that did the best job with the fewest predictors), what would it contain? You MUST answer these questions.

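[Model 3 works best: its adjusted R-squared is 0.955 (about 96%) and its overall p-value is 0.001229, well below 0.05. In Model 1, severity of winter is significant (p = 0.036). In Model 2, only the adult antelope population is significant (p = 0.011); severity of winter is not (p = 0.596). In Model 3, all three predictors are significant at the 0.05 level: adult population (p = 0.027), annual precipitation (p = 0.022), and severity of winter (p = 0.037). Annual precipitation appears only in Model 3, where it has the smallest p-value of the three predictors, and the ggplot charts also suggest that fawn counts rise with precipitation. Because every predictor in Model 3 is significant and adjusted R-squared rises from 0.844 in Model 2 to 0.955, the most parsimonious model that still does the best job keeps all three predictors. A model with annual precipitation alone was not fitted here, so it cannot be compared from these results.]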
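One way to line up the comparison this step asks for is to collect adjusted R-squared (and, as an extra yardstick, AIC) for each model in a single table, then use `anova()` to test whether each added predictor improves the fit. The sketch below runs on synthetic data generated with `set.seed()`, not on the antelope data. The names `toy`, `models`, and `comparison` are illustrative, and `x3` is deliberately pure noise.

```r
# Synthetic data for illustration only: y depends on x1 and x2, not on x3.
set.seed(1)
n <- 30
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5)

# Three nested models of increasing complexity.
models <- list(
  one   = lm(y ~ x1, data = toy),
  two   = lm(y ~ x1 + x2, data = toy),
  three = lm(y ~ x1 + x2 + x3, data = toy)
)

# One row per model: number of predictors, adjusted R-squared, and AIC.
comparison <- data.frame(
  model      = names(models),
  predictors = vapply(models, function(m) length(coef(m)) - 1, numeric(1)),
  adj_r2     = vapply(models, function(m) summary(m)$adj.r.squared, numeric(1)),
  aic        = vapply(models, AIC, numeric(1))
)
print(comparison)

# F-tests between successive nested models.
print(anova(models$one, models$two, models$three))
```

With the lab data, the same pattern would apply to `model1`, `model2`, and `model3`, provided the models are nested, which they are when Model 2 adds a predictor to Model 1 and Model 3 adds the last one.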

Step 7 - Upload the compiled file

Please only include printouts of datasets using the head() function. I will take points off if you include more than two pages of dataset printouts.

Week 8: Lab - Linear Modeling (Making Predictions)

[Sharlee Crews]

[03/07/2021]