library(openintro)
library(datarium)
# The openintro and datarium packages have datasets that we will use

library(dplyr)
library(ggplot2)
# These are packages for manipulating and plotting data

Working Together

1. Why do some mammals sleep more than others?

data("mammals", package = "openintro")
View(mammals)

Goal:

Predict total_sleep from some combination of life_span, gestation, predation, exposure, danger, body_wt and brain_wt (it would be cheating to use non_dreaming and dreaming because these are the amounts of the two types of sleep that make up total_sleep). Run colnames(mammals) to confirm the exact column names in your version of the package.

You should start by reading a description of the data: Mammals Description

Let’s look at a scatter plot of the data.

mammals %>% ggplot(aes(predation, total_sleep))+geom_point()

Now, let’s add a best-fit line.

mammals %>% 
  ggplot(aes(predation,total_sleep)) +   
  geom_point()+
  geom_smooth(method="lm")

Looking at this best-fit line, what can you say about the correlation between predation and total_sleep?

Let’s build a model to predict total_sleep from predation.

colnames(mammals)
m <- lm(total_sleep ~ predation, data=mammals)
summary(m)

How would you interpret this model summary? Does this make intuitive sense? Roughly how much less sleep would a mammal with a predation level of 5 be expected to get than an animal with a predation level of 1?
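One way to approach the last question is sketched below. It uses a small made-up data frame (`toy`, with hypothetical values, not the mammals data) so that it runs on its own; the same two steps should work with the model `m` fitted above. In a straight-line model, the expected difference between predation levels 5 and 1 is simply the slope multiplied by the 4-unit change.

```r
# Toy data (hypothetical values, NOT the mammals data set)
toy <- data.frame(
  predation   = c(1, 1, 2, 3, 3, 4, 5, 5),
  total_sleep = c(14, 12, 13, 10, 11, 8, 6, 7)
)

m_toy <- lm(total_sleep ~ predation, data = toy)

# Predicted sleep at predation levels 1 and 5
pred <- predict(m_toy, newdata = data.frame(predation = c(1, 5)))

# The difference between the two predictions ...
diff_pred <- unname(pred[2] - pred[1])

# ... equals the slope times the 4-unit change in predation
diff_slope <- unname(4 * coef(m_toy)["predation"])

print(c(diff_pred = diff_pred, diff_slope = diff_slope))
```

Both numbers agree, so reading the predation coefficient off the model summary and multiplying by 4 answers the question directly.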

We might also be interested in using mammals’ brain weights. There’s a bit of a problem with this, however: the big (and big-brained) animals are orders of magnitude bigger than the small ones, so the differences between the smaller animals become invisible by comparison. We can see this in a histogram of brain weights and in a plot of sleep versus brain weight.

mammals %>% ggplot(aes(brain_wt))+geom_histogram()

mammals %>% ggplot(aes(brain_wt, total_sleep))+geom_point()+
  geom_smooth(method="lm", color="red")

We can address this issue by creating a new variable which is the logarithm of brain weight.

We’ll use log base 10 (the base doesn’t really matter for our purposes) so that if logbrain_wt is 2, then brain_wt must be \(10^2\), or 100 grams. (The information page says that these brain weights are in kilograms, but that is clearly a mistake.)

mammals <- mammals %>% 
  mutate(logbrain_wt = log(brain_wt, base=10))

mammals %>% ggplot(aes(logbrain_wt))+geom_histogram()


mammals %>% ggplot(aes(logbrain_wt, total_sleep))+geom_point()+
  geom_smooth(method="lm", color="red")

mammals %>% ggplot(aes(logbrain_wt, total_sleep, label=species))+geom_text()+
  geom_smooth(method="lm", color="red")
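The remark that the base of the logarithm doesn’t really matter can be checked with a short sketch. The data frame `toy` below is made up (hypothetical values, not the mammals data). Changing the base only rescales the predictor by a constant, so the fit quality should be identical and only the slope changes.

```r
# log10 of 100 is 2, because 10^2 = 100
print(log10(100))

# Toy data (hypothetical values, NOT the mammals data set)
toy <- data.frame(
  brain_wt    = c(0.3, 1, 6, 25, 120, 400, 1300, 5700),
  total_sleep = c(18, 15, 14, 12, 10, 8, 8, 4)
)

m10 <- lm(total_sleep ~ log10(brain_wt), data = toy)  # base 10
me  <- lm(total_sleep ~ log(brain_wt),   data = toy)  # natural log

# R-squared is the same for both bases
r2_10 <- summary(m10)$r.squared
r2_e  <- summary(me)$r.squared
print(c(base10 = r2_10, natural = r2_e))

# The two slopes differ only by the constant factor log(10)
slope_ratio <- unname(coef(m10)[2] / coef(me)[2])
print(c(slope_ratio = slope_ratio, log_of_10 = log(10)))
```

Base 10 is still the convenient choice here because the transformed values can be read as powers of ten.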

Now, let’s make a prediction model based on log of brain weight.

m <- lm(total_sleep ~ logbrain_wt, data=mammals)
summary(m)

How would you interpret the results of this model?

More Complex Models

We can ask R to come up with more complex models for us!

Imagine that we want to predict total sleep time from predation and log of brain weight. We can think of this as having a three-dimensional graph with sleep time, predation and log brain weight on the three axes, and trying to come up with a best-fit plane instead of a best-fit line. This is hard to picture but, in fact, it’s not hard at all to come up with an equation for the best-fit plane.

m <- lm(total_sleep ~ logbrain_wt+predation, data=mammals)
summary(m)

How would you interpret this model?
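To make the “equation for the best-fit plane” concrete, here is a minimal sketch on a made-up data frame (`toy`, hypothetical values, not the mammals data). It writes the plane out by hand from the fitted coefficients and checks the result against `predict()`. The query point (log brain weight 2, predation 3) is chosen arbitrarily for illustration.

```r
# Toy data (hypothetical values, NOT the mammals data set)
toy <- data.frame(
  logbrain_wt = c(-0.5, 0, 0.8, 1.4, 2.1, 2.6, 3.1, 3.8),
  predation   = c(3, 1, 4, 2, 5, 1, 3, 5),
  total_sleep = c(15, 17, 11, 13, 7, 12, 8, 3)
)

m_toy <- lm(total_sleep ~ logbrain_wt + predation, data = toy)
b <- coef(m_toy)

# Plane: sleep = intercept + b1 * logbrain_wt + b2 * predation
by_hand <- unname(b["(Intercept)"] + b["logbrain_wt"] * 2 + b["predation"] * 3)

# The same prediction from predict()
by_predict <- unname(
  predict(m_toy, newdata = data.frame(logbrain_wt = 2, predation = 3))
)

print(c(by_hand = by_hand, by_predict = by_predict))
```

Each coefficient is the expected change in sleep for a one-unit change in that variable while the other variable is held fixed.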

If we want, we can add the expectations (or predictions) of our model to the data set.

mammals$PredSleep <- predict(m, mammals)

and graph the predictions against the actual amounts of sleep:

mammals %>% 
  ggplot(aes(PredSleep, total_sleep,label=species)) +
  geom_text()+
  geom_smooth(method="lm")

Which mammals sleep more than we’d expect based on our model? Which mammals sleep less than expected?
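Besides reading the answer off the plot, one could sort the animals by the gap between actual and predicted sleep. The sketch below does this on a made-up data frame (`toy`, with hypothetical species labels and values); the column name `resid` is an illustrative choice, not something from the original data.

```r
# Toy data (hypothetical species and values, NOT the mammals data set)
toy <- data.frame(
  species     = c("A", "B", "C", "D", "E", "F"),
  logbrain_wt = c(0.2, 0.9, 1.5, 2.0, 2.7, 3.4),
  predation   = c(2, 4, 1, 5, 3, 4),
  total_sleep = c(16, 10, 15, 6, 11, 4)
)

m_toy <- lm(total_sleep ~ logbrain_wt + predation, data = toy)
toy$PredSleep <- predict(m_toy, toy)

# Positive: sleeps more than the model expects; negative: sleeps less
toy$resid <- toy$total_sleep - toy$PredSleep

# Largest positive gaps first
ranked <- toy[order(toy$resid, decreasing = TRUE),
              c("species", "total_sleep", "PredSleep", "resid")]
print(ranked)
```

The top rows are the animals that sleep more than expected and the bottom rows are the ones that sleep less.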

Even More Complex Models!

In statistics, we’re not limited to fitting two-dimensional models in three-dimensional space. We can have as many dimensions as we want (although this comes with its share of problems!).

Try adding other variables to your sleep model. Are any other variables important?
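One of the problems with piling on variables can be sketched with simulated data (everything below is made up; `noise` is unrelated to `y` by construction). R-squared never goes down when a variable is added, even a useless one, so it helps to look at adjusted R-squared and at the coefficient table as well.

```r
set.seed(1)
n <- 40

# Simulated data: y depends on x only
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)
toy$noise <- rnorm(n)  # pure noise, unrelated to y by construction

small <- lm(y ~ x, data = toy)
big   <- lm(y ~ x + noise, data = toy)

# Plain R-squared cannot decrease when a variable is added
r2 <- c(small = summary(small)$r.squared, big = summary(big)$r.squared)
print(r2)

# Adjusted R-squared penalizes the extra variable
adj_r2 <- c(small = summary(small)$adj.r.squared,
            big   = summary(big)$adj.r.squared)
print(adj_r2)

# The coefficient table shows the estimate and p-value for each variable
print(summary(big)$coefficients)
```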

Working on Your Own

2. Surviving the Titanic

data("titanic.raw", package = "datarium")

titanic.raw <- titanic.raw %>% mutate(SurvivedTF = Survived=="Yes")

This data set provides information on the passengers on board the Titanic and, in particular, who lived and who died.

The first thing we need to do is transform the “Yes”/“No” Survived column into TRUE/FALSE (or 1/0) values. The code to do this is shown above. We’ll need to predict the “SurvivedTF” value we created rather than “Survived”.

First, let’s look at a summary of the data:

summary(titanic.raw)

This shows how many passengers fall into each category. When you build regression models with categorical data, R chooses one value of each categorical variable in your model as the baseline and shows the effects of the other values relative to that baseline. For instance, if we run the code:

m <- lm(SurvivedTF ~ Class, data=titanic.raw)
summary(m)

R chooses 1st class as the default/baseline Class. Because SurvivedTF is 0 or 1, each coefficient is a difference in survival rate: the regression summary shows that the survival rate of 2nd class passengers was about 21 percentage points lower than that of 1st class passengers, and that of 3rd class passengers about 37 percentage points lower. Crew members were the least likely to survive.
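The baseline idea can be checked on a small made-up data frame (`toy`, hypothetical passengers, not the Titanic data). With a single categorical predictor, the intercept is the baseline group’s survival rate and each other coefficient is that group’s rate minus the baseline rate. The sketch also uses `relevel()` to pick a different baseline; the column name `Class3` is an illustrative choice.

```r
# Toy data (hypothetical passengers, NOT the Titanic data set)
toy <- data.frame(
  Class = factor(rep(c("1st", "2nd", "3rd"), each = 4)),
  SurvivedTF = c(TRUE, TRUE, TRUE, FALSE,     # 1st: 3 of 4 survive
                 TRUE, TRUE, FALSE, FALSE,    # 2nd: 2 of 4 survive
                 TRUE, FALSE, FALSE, FALSE)   # 3rd: 1 of 4 survives
)

m_toy <- lm(SurvivedTF ~ Class, data = toy)

# Survival rate in each class
rates <- tapply(toy$SurvivedTF, toy$Class, mean)
print(rates)

# Intercept = 1st class rate; other coefficients = rate minus 1st class rate
print(coef(m_toy))
print(rates - rates["1st"])

# Choosing a different baseline changes what the coefficients are relative to
toy$Class3 <- relevel(toy$Class, ref = "3rd")
m_ref3 <- lm(SurvivedTF ~ Class3, data = toy)
print(coef(m_ref3))
```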

Which other factors were important in determining who survived the sinking of the Titanic? Try building regression models to find out!

4. MLB Stats

This data contains Major League Baseball Player Hitting Statistics for 2010.

data("mlbbat10", package = "openintro")

MLB Bat Description

Try to predict runs, “R”, from other batting statistics (run colnames(mlbbat10) to check the exact column names in your version of the package). Does your model make sense?

5. SAT Scores and College GPA

This dataset contains SAT and GPA data for 1000 students at an unnamed college.

data("satgpa", package = "openintro")

SAT and College GPA Description

Try to predict first-year college GPA, “FYGPA”. Note that SAT verbal and SAT math add up to SATSum, so it doesn’t make sense to have all three of these variables in your model simultaneously (and if you do, R will return a coefficient of NA for one of them).
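The NA coefficient is easy to reproduce on a made-up data frame. In the sketch below, `toy` and its column names (`sat_v`, `sat_m`, `sat_sum`, `gpa`) are hypothetical stand-ins, not the satgpa data. Because `sat_sum` is exactly `sat_v + sat_m`, it carries no information beyond the other two, and `lm()` cannot estimate a separate coefficient for it.

```r
# Toy data (hypothetical values and column names, NOT the satgpa data set)
toy <- data.frame(
  sat_v = c(50, 55, 60, 45, 65, 58, 52, 48),
  sat_m = c(52, 60, 58, 50, 70, 55, 61, 47),
  gpa   = c(2.6, 3.0, 3.1, 2.3, 3.7, 2.9, 3.0, 2.4)
)
toy$sat_sum <- toy$sat_v + toy$sat_m  # exact sum of the other two columns

# All three predictors: the redundant one gets an NA coefficient
m_all <- lm(gpa ~ sat_v + sat_m + sat_sum, data = toy)
print(coef(m_all))

# Dropping the redundant predictor gives a fully estimated model
m_ok <- lm(gpa ~ sat_v + sat_m, data = toy)
print(coef(m_ok))
```

Which variable ends up with the NA depends on the order in the formula; here it is the last one listed.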