library(openintro)
library(datarium)
# The openintro and datarium packages have datasets that we will use

library(dplyr)
library(ggplot2)
# These are packages for manipulating and plotting data

Working Together

1. Why do some mammals sleep more than others?

data("mammals", package = "openintro")
View(mammals)

Goal:

Predict TotalSleep from some combination of LifeSpan, Gestation, Pedation, Exposure, Danger, Body Weight and Brain Weight (it would be cheating to use “NonDreaming” and “Dreaming” sleep because these are amounts of two types of sleep).

You should start by reading a description of the data: Mammals Description

Then let’s start building our model by predicting TotalSleep from Predation.

colnames(mammals)
m <- lm(TotalSleep ~ Predation, data=mammals)
summary(m)

How would you interpret this model summary? Does this make intuitive sense? Roughly how much less sleep would a mammal with a predation level of 5 be expected to get than an animial with a predation level of 1?

We might also be interested in using mammals’s brain weights. There’s a bit of a problem with this, however, which is that the animals that are big (and big-brained) are orders of magnitude bigger than the smaller animals and the differences between smaller animals are irrelevant by comparison. We can see this in a histogram of brain weights and in a plot of sleep versus brain weight.

mammals %>% ggplot(aes(BrainWt))+geom_histogram()

mammals %>% ggplot(aes(BrainWt, TotalSleep))+geom_point()+
  geom_smooth(method="lm", color="red")

We can address this issue by creating a new variable which is the logarithm of brain weight. If you haven’t yet learned about logarithms, don’t fret! We are just creating a new variable, logBrainWt, such that:

\[10^{logBrainWt} = BrainWt\]

so that if logBrainWt is 2, then BrainWt must be \(10^2\) or 100 grams (the information page says that these brain weights are in kilograms but that’s ridiculous).

mammals <- mammals %>% mutate(logBrainWt = log(BrainWt, base=10))

mammals %>% ggplot(aes(logBrainWt))+geom_histogram()


mammals %>% ggplot(aes(logBrainWt, TotalSleep))+geom_point()+
  geom_smooth(method="lm", color="red")

mammals %>% ggplot(aes(logBrainWt, TotalSleep, label=Species))+geom_text()+
  geom_smooth(method="lm", color="red")

Now, we can add logBrainWt to our prediciton model:

m <- lm(TotalSleep ~ logBrainWt+Predation, data=mammals)
summary(m)

If we want, we can add the expectations (or predictions) of our model to the data set.

mammals$PredSleep <- predict(m, mammals)

and graph the predictions agains the actual amounsts of sleep:

mammals %>% ggplot(aes(PredSleep, TotalSleep, label=Species))+geom_text()+
  geom_smooth(method="lm")

Which mammals sleep more than we’d expect based on our model? Which mammals sleep less than expected?

Try adding other variables to your sleep model. Are any other variables important?

Working in Breakout Rooms

2. Stress

data("stress", package = "datarium")

Researchers want to evaluate the effect of a new “treatment” on the stress scores of subjects after adjusting for “age” and “exercise”.

m <- lm(score ~ treatment, data=stress)
summary(m)

The coefficient of “treatmentno” shows the difference in stress scores between “no” treatment and “yes” treatment. In other words, folks with no treatment had stress scores 4.8 points higher than folks with treatment.

The “residuals” of this model are the actual stress levels less the predicted stress levels. Let’s plot these residuals against the ages of our subjects.

plot(stress$age, residuals(m))

What does this residual plot show us? Does it seems like subject age is related to stress level?

We can build a new version of this model with an age adjustment.

m_age_adjusted <- lm(score ~ treatment+age, data=stress)
summary(m_age_adjusted)

Can you build a model that adjusts for age and exercise level? How would you interpret this model?

3. Surviving the Titanic

data("titanic.raw", package = "datarium")

titanic.raw <- titanic.raw %>% mutate(SurvivedTF = Survived=="Yes")

This data provides information on the passengers onboard the Titanic and, in paricular, who lived and who died.

The first thing, we need to do is tranform the “Yes”/“No” Survived column into TRUE/FALSE or 1/0 values. The code to do this is shown above. We’ll need to try to predict this “SurvivedTF” value we created rather than “Survived”.

First, let’s look at a summary of the data:

summary(titanic.raw)

This shows how many passengers fall into each category. When you build regression models with categorical data, the model will choose one value from each category you use in your model as the baseline and show the affects of other values relative to that value. For instance, if we run the code:

m <- lm(SurvivedTF ~ Class, data=titanic.raw)
summary(m)

R chooses 1st class as the default/baseline Class. The regression summary shows that 2nd class passengers were 21% less likely to survive than 1st class passengers and that 3rd class passengers were 37% less likely to survive than 1st class passengers. Crew members were the least likely to survive.

*Which other factors were important in determining who survived the sinking of the Titanic? Try building regression models to find out!

4. MLB Stats

This data contains Major League Baseball Player Hitting Statistics for 2010.

data("mlbbat10", package = "openintro")

MLB Bat Description

Try to predict Runs, “R”, from other batting statistics. Does this model make sense?

5. SAT Scores and College GPA

This dataset contains SAT and GPA data for 1000 students at an unnamed college.

data("satGPA", package = "openintro")

SAT and College GPA Description

Try to predict four year college GPA “FYGPA”. Note that SAT verbal and SAT math add up SATSum so it doesn’t make sense to have all three of these variables in your model simultaneously (and if you do, R will return a coefficient of NA for one of them).