In this lab you will fit some basic regression models for your ship.

Load the data set for your ship:

load("~/data/RMS_Titanic.Rda")

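Let’s make nicer labels. For Survival we create a new labeled variable, Survival_nominal, instead of overwriting the original, because the regressions later need Survival to stay a number (not a factor). Gender and Crew are relabeled in place.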

RMS_Titanic$Survival_nominal <- factor(RMS_Titanic$Survival, levels = c("0", "1"), labels = c("Died", "Survived"))
RMS_Titanic$Gender <- factor(RMS_Titanic$Gender, levels = c("0", "1"), labels = c("Male", "Female"))
RMS_Titanic$Crew <- factor(RMS_Titanic$Crew, levels = c("0", "1"), labels = c("Passenger", "Crew"))
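
As a small illustration of why the numeric column is kept, the sketch below relabels a made-up 0/1 vector the same way. The vector `coded` and the name `labelled` are hypothetical and are not part of the ship data.

```r
# Toy 0/1 vector standing in for a coded column (hypothetical data).
coded <- c(0, 1, 1, 0)

# factor() matches the values (as text) against `levels` and attaches `labels`.
labelled <- factor(coded, levels = c("0", "1"), labels = c("Died", "Survived"))

# The factor stores level positions 1 and 2, not the original 0 and 1,
# so averaging it does not give a proportion of survivors.
mean(coded)                 # proportion of 1s in the numeric column
mean(as.integer(labelled))  # mean of the level positions instead
```

This is why the lab leaves Survival numeric and puts the labels in a separate variable.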

For comparison purposes, let’s run a frequency table of Survival_nominal and crosstabs of Survival_nominal by Gender and by Crew. Then run a third crosstab that uses both Gender and Crew as column variables, make a copy of it, and switch the order of the column variables in the copy.

frequency(RMS_Titanic$Survival_nominal)
##  Values   Freq Percent
##  Died     1496 67.8   
##  Survived 712  32.2   
##  Total    2208 100
pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = "Gender",
          title = "Survival by Gender", format = "col_percent"))
           Male   Female
Died       79.3   26.7
Survived   20.7   73.3
Total N    1722   486
pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = "Crew",
          title = "Survival of Crew and Passengers",
          format = "col_percent"))
           Passenger   Crew
Died       62          76.2
Survived   38          23.8
Total N    1317        891
# This cross tab will control for two variables.
pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = c("Crew", "Gender"), format = "col_percent",
          title = "Survival by Gender and Group"))
           Crew Female   Crew Male   Passenger Female   Passenger Male
Died       13            77.9        27.4               80.8
Survived   87            22.1        72.6               19.2
Total N    23            868         463                854
pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = c("Gender", "Crew"),
          title = "Survival by Gender and Group", format = "col_percent"))
           Female Crew   Female Passenger   Male Crew   Male Passenger
Died       13            27.4               77.9        80.8
Survived   87            72.6               22.1        19.2
Total N    23            463                868         854

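Now we will run three regressions: first one without any independent variables, then one with Gender, then one with Gender and Crew. We use the lm() function, which stands for linear model. The dependent variable is the numeric Survival (0 = Died, 1 = Survived), not Survival_nominal.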
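
Before running the models on the ship data, it can help to see what lm() does with a 0/1 outcome on a tiny example. The sketch below uses a made-up data frame (`toy`, `y` and `group` are hypothetical names, not ship variables) and lets you compare the fitted coefficients with simple averages.

```r
# Hypothetical toy data: a 0/1 outcome and a two-level factor predictor.
toy <- data.frame(
  y     = c(0, 0, 1, 0, 1, 1, 1, 0),
  group = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)

fit0 <- lm(y ~ 1, data = toy)      # no independent variable
fit1 <- lm(y ~ group, data = toy)  # one dummy variable; "A" is the reference level

coef(fit0)                         # compare with the overall mean
mean(toy$y)

coef(fit1)                         # compare with the two group means
tapply(toy$y, toy$group, mean)
```

Keep this comparison in mind when you look at the ship output and the crosstabs above.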

lm(Survival ~ 1, RMS_Titanic)
## 
## Call:
## lm(formula = Survival ~ 1, data = RMS_Titanic)
## 
## Coefficients:
## (Intercept)  
##      0.3225
lm(Survival ~ Gender, RMS_Titanic)
## 
## Call:
## lm(formula = Survival ~ Gender, data = RMS_Titanic)
## 
## Coefficients:
##  (Intercept)  GenderFemale  
##       0.2067        0.5258
lm(Survival ~ Gender + Crew, RMS_Titanic)
## 
## Call:
## lm(formula = Survival ~ Gender + Crew, data = RMS_Titanic)
## 
## Coefficients:
##  (Intercept)  GenderFemale      CrewCrew  
##      0.18924       0.54163       0.03472

Write out the three equations below:

No independent variable:

Gender only:

Gender and Crew:

Calculate the predicted value for the equation with no independent variable:

How does the equation relate to the frequency distribution results?

Calculate the predicted values for males and females for the Gender only model.

Did being female help, hurt, or make no difference to the predicted probability of surviving?

How do these results compare to the cross tab for gender?

For the equation with Gender and Crew as independent variables, calculate the predicted values for:

Female, Crew:

Female, Passenger:

Male, Crew:

Male, Passenger:
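
If you would like to check your hand calculations, one possible approach is to do the arithmetic in R. The sketch below copies the three coefficients from the Gender and Crew output and plugs in 0/1 indicators for each group. The names `b0`, `b_female`, `b_crew` and `groups` are made up for this illustration.

```r
# Coefficients copied from the lm(Survival ~ Gender + Crew) output.
b0       <- 0.18924  # (Intercept)
b_female <- 0.54163  # GenderFemale
b_crew   <- 0.03472  # CrewCrew

# One row per group; female and crew are 0/1 indicators
# (female = 1 means Female, crew = 1 means Crew).
groups <- expand.grid(female = c(1, 0), crew = c(1, 0))

# Predicted value = intercept + each coefficient times its indicator.
groups$predicted <- b0 + b_female * groups$female + b_crew * groups$crew
groups
```

Each row of `groups` corresponds to one of the four combinations listed above.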

Which group has the best predicted probability of surviving?

Did Gender have an effect that was positive, negative or close to 0?

Did the Gender coefficient stay the same as in the equation for Gender alone? If not, how did it change?

Did Crew have an effect that was positive, negative or close to 0?

How do the predicted values from the equation with two variables compare to the crosstabs?