The textbook’s chapter on linear models (“Line Up, Please”) introduces linear predictive modeling using the workhorse tool known as multiple regression. The term “multiple regression” has an odd history, dating back to an early scientific observation of a phenomenon called “regression to the mean”. These days, multiple regression is just an interesting name for a simple linear modeling technique that measures the connection between one or more predictor variables and an outcome variable.
In this exercise, we are going to use an open dataset to explore antelope population.
This is the first exercise of the semester where there is no sample R code to help you along. Because you have had so much practice with R by now, you can create and/or find all of the code you need to accomplish these steps.
# Add your library below.
library(tidyverse)
library(cowplot)
library(readxl)
Write a definition of a model, based on how the author uses it in this chapter.
[A predictive model analyzes data and predicts an outcome by estimating a set of numerical coefficients that relate the predictor variables to that outcome.]
You can find the data on Cengage’s website. This URL will enable you to download the dataset as an Excel file:
The more general website can be found at:
If you view this in a spreadsheet, you will find four columns of a small dataset:
# No code necessary; Just review the data.
You have the option of saving the file to your computer and reading it into R, or you can read the data directly from the web into a dataframe.
# Write your code below.
Ant <- read_xls("C:/Users/sharl/Desktop/USF/Spring 2021/LIS 4761 Data-Text Mining/HW/Antelope Population Data Set.xls")
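The call above reads from a local path that exists on one machine only. The following is a minimal sketch of the web alternative. Because `read_excel()` takes a file path, the sketch first downloads the workbook to a temporary file. The helper name `read_excel_from_url` and the example URL are hypothetical and stand in for the dataset link, which is not reproduced here.

```r
# Hypothetical helper (not part of the assignment): download an Excel
# workbook to a temporary file, then read it with readxl.
read_excel_from_url <- function(url, ext = ".xls") {
  tmp <- tempfile(fileext = ext)
  on.exit(unlink(tmp), add = TRUE)                  # remove the temp file afterwards
  download.file(url, destfile = tmp, mode = "wb")   # "wb" keeps the binary file intact on Windows
  readxl::read_excel(tmp)
}

# Usage, with a placeholder URL that must be replaced by the real dataset link:
# Ant <- read_excel_from_url("https://example.com/antelope.xls")
```

Reading from the web keeps the script portable, at the cost of needing a network connection each time it runs.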
You should inspect the data using str() to make sure that 1) all the cases have been read in (n=8 years of observations) and 2) that there are four variables.
# Write your code below.
str(Ant)
## tibble [8 x 4] (S3: tbl_df/tbl/data.frame)
## $ X1: num [1:8] 2.9 2.4 2 2.3 3.2 ...
## $ X2: num [1:8] 9.2 8.7 7.2 8.5 9.6 ...
## $ X3: num [1:8] 13.2 11.5 10.8 12.3 12.6 ...
## $ X4: num [1:8] 2 3 4 2 3 5 1 3
cnames<- c("Number of Fawn", "Pop of Adult Antelope",
"Annual Precipitation", "Severity of Winter")
colnames(Ant) <- cnames
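The two checks requested above (8 observations, 4 variables) can also be made explicit, so that the script stops if the import went wrong. This is only a sketch: `toy` is a stand-in data frame with placeholder values, not the antelope data, and `check_shape` is a hypothetical helper name.

```r
# Stand-in data frame with the expected shape; the zeros are placeholders.
toy <- data.frame(X1 = numeric(8), X2 = numeric(8),
                  X3 = numeric(8), X4 = numeric(8))

# Hypothetical helper: stop with an error if the shape is wrong,
# then report whether every column is numeric.
check_shape <- function(df, n_rows = 8, n_cols = 4) {
  stopifnot(nrow(df) == n_rows, ncol(df) == n_cols)
  all(vapply(df, is.numeric, logical(1)))
}

shape_ok <- check_shape(toy)
print(shape_ok)
```

With the real data, the same call would be `check_shape(Ant)`.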
Create bivariate plots of the number of baby fawns versus adult antelope population, precipitation that year, and severity of the winter.
Your code should produce three separate plots. Make sure the y-axis and x-axis are labeled. Keeping in mind that the number of fawns is the outcome (or dependent) variable, which axis should it go on in your plots? You can also create scatter plots where size and color reflect the two variables you didn’t use (remember the visualization homework/lab). If you create these plots, you can earn 1 extra point.
# Write your code below.
plot(Ant$`Pop of Adult Antelope`, Ant$`Number of Fawn`)
ggplot(Ant, aes(`Pop of Adult Antelope`, `Number of Fawn`)) +
geom_point(aes(color = `Severity of Winter`, size = `Annual Precipitation`)) +
ggtitle("Fawn vs Adult Antelope")
# Write your code below.
plot(Ant$`Annual Precipitation`, Ant$`Number of Fawn`)
ggplot(Ant, aes(`Annual Precipitation`, `Number of Fawn`)) +
geom_point(aes(color = `Severity of Winter`,
size = `Pop of Adult Antelope`)) +
ggtitle("Fawn vs Annual Precipitation")
# Write your code below.
plot(Ant$`Severity of Winter`, Ant$`Number of Fawn`)
g <- ggplot(Ant, aes(`Severity of Winter`, `Number of Fawn`)) +
geom_point(aes(color = `Annual Precipitation`, size = `Pop of Adult Antelope`)) +
ggtitle("Fawn vs Severity of Winter")
g
g + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
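The base `plot()` calls above leave the axis labels at their defaults, which are the raw expressions passed in (for example, Ant$`Severity of Winter`). Below is a small sketch of setting readable labels with `xlab` and `ylab`. The data frame `toy` uses made-up values purely for illustration; only the column names match the ones used above.

```r
# Placeholder data: same column names as above, made-up values.
toy <- data.frame(
  "Number of Fawn"     = c(2.0, 2.5, 3.0, 3.5),
  "Severity of Winter" = c(4, 3, 2, 1),
  check.names = FALSE   # keep the spaces in the column names
)

x_label <- "Severity of Winter"   # predictor goes on the x-axis
y_label <- "Number of Fawn"       # outcome goes on the y-axis

plot(toy[[x_label]], toy[[y_label]],
     xlab = x_label, ylab = y_label,
     main = "Fawn vs Severity of Winter")
```

The ggplot versions already pick up the column names as axis labels, so this mainly matters for the base graphics calls.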
Create three regression models of increasing complexity using lm(), then analyze the results.
# Write your code below.
plot(Ant$`Severity of Winter`, Ant$`Number of Fawn`)
model1 <- lm(formula = `Number of Fawn` ~ `Severity of Winter`, data = Ant)
summary(model1)
##
## Call:
## lm(formula = `Number of Fawn` ~ `Severity of Winter`, data = Ant)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52069 -0.20431 -0.00172 0.13017 0.71724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4966 0.3904 8.957 0.000108 ***
## `Severity of Winter` -0.3379 0.1258 -2.686 0.036263 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.415 on 6 degrees of freedom
## Multiple R-squared: 0.5459, Adjusted R-squared: 0.4702
## F-statistic: 7.213 on 1 and 6 DF, p-value: 0.03626
abline(model1)
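The Model 1 coefficients reported above define the fitted line that `abline(model1)` draws: predicted fawns = 3.4966 - 0.3379 × severity of winter. As a sketch, the predictions can be reproduced by hand from those two numbers. The severity values 1 through 5 are chosen here for illustration.

```r
# Coefficients copied from the summary(model1) output above.
intercept <- 3.4966
slope     <- -0.3379

# Illustrative severity-of-winter values.
severity <- 1:5

# Linear prediction: intercept plus slope times the predictor.
predicted_fawns <- intercept + slope * severity
print(round(predicted_fawns, 4))
```

With the fitted object available, `predict(model1, newdata = ...)` gives the same kind of result without copying coefficients by hand.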
# Write your code below.
model2 <- lm(formula = `Number of Fawn` ~`Severity of Winter` +
`Pop of Adult Antelope`, Ant)
summary(model2)
##
## Call:
## lm(formula = `Number of Fawn` ~ `Severity of Winter` + `Pop of Adult Antelope`,
## data = Ant)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 0.01231 -0.27531 0.10301 -0.19154 0.01535 0.15880 0.29992 -0.12256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.46009 1.53443 -1.603 0.1698
## `Severity of Winter` 0.07058 0.12461 0.566 0.5956
## `Pop of Adult Antelope` 0.56594 0.14439 3.920 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2252 on 5 degrees of freedom
## Multiple R-squared: 0.8885, Adjusted R-squared: 0.8439
## F-statistic: 19.92 on 2 and 5 DF, p-value: 0.004152
# Write your code below.
model3 <- lm(formula = `Number of Fawn` ~ ., data = Ant)
summary(model3)
##
## Call:
## lm(formula = `Number of Fawn` ~ ., data = Ant)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## -0.11533 -0.02661 0.09882 -0.11723 0.02734 -0.04854 0.11715 0.06441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.92201 1.25562 -4.716 0.0092 **
## `Pop of Adult Antelope` 0.33822 0.09947 3.400 0.0273 *
## `Annual Precipitation` 0.40150 0.10990 3.653 0.0217 *
## `Severity of Winter` 0.26295 0.08514 3.089 0.0366 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1209 on 4 degrees of freedom
## Multiple R-squared: 0.9743, Adjusted R-squared: 0.955
## F-statistic: 50.52 on 3 and 4 DF, p-value: 0.001229
Which regression model works best? Which of the predictors are statistically significant in each model? If you wanted to create the most parsimonious model (i.e., the one that did the best job with the fewest predictors), what would it contain? You MUST answer these questions.
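[Model 3 works best, with an adjusted R-squared of 0.955 (about 96%) and an overall p-value of 0.001, which is well below 0.05. In Model 1, severity of winter is statistically significant (p = 0.036). In Model 2, only adult antelope population is significant (p = 0.011); severity of winter is not (p = 0.596). In Model 3, all three predictors are significant at the 0.05 level. You can also see the relationship with annual precipitation visually in the ggplot charts. I would use annual precipitation to create the most parsimonious model, although a precipitation-only model was not fit above, so that choice is untested here.]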
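One way to compare nested models like the three above is to line up their adjusted R-squared values and run `anova()` on them. The sketch below uses simulated data (not the antelope data) so that it runs on its own. The variable names `x1`, `x2`, `x3`, and `y` are placeholders, and `x3` is built to have no real effect.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated placeholder data: y depends on x1 and x2 but not on x3.
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# Three nested models of increasing complexity.
m1 <- lm(y ~ x1, data = toy)
m2 <- lm(y ~ x1 + x2, data = toy)
m3 <- lm(y ~ x1 + x2 + x3, data = toy)

# Adjusted R-squared penalizes predictors that add little.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(adj_r2)

# F-tests for whether each added predictor improves the fit.
print(anova(m1, m2, m3))
```

Plain R-squared never decreases when a predictor is added, which is why adjusted R-squared and the F-tests are the more useful guides when looking for a parsimonious model.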
Please only include printouts of datasets produced with the “head” function. I will take points off if you include more than two pages of dataset printouts.