In this lab you will fit some basic regression models for your ship.

Load the data set for your ship:

load("~/data/RMS_Titanic.Rda")

Let’s make nicer labels. We store the labeled version of Survival in a new variable, Survival_nominal, because the original Survival has to stay a number (not a factor) for the regressions later.

RMS_Titanic$Survival_nominal <- factor(RMS_Titanic$Survival, levels = c("0", "1"), labels = c("Died", "Survived"))
RMS_Titanic$Gender <- factor(RMS_Titanic$Gender, levels = c("0", "1"), labels = c("Male", "Female"))
RMS_Titanic$Crew <- factor(RMS_Titanic$Crew, levels = c("0", "1"), labels = c("Passenger", "Crew"))
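To see why the numeric copy is kept, here is a minimal sketch using a made-up 0/1 vector (the values are hypothetical, not taken from the ship data). It shows that the mean of a 0/1 variable is the proportion of 1s, and that the labeled factor version is no longer numeric.

```r
# Hypothetical toy outcome: 1 = survived, 0 = died (not the Titanic data).
survival <- c(0, 1, 1, 0, 0)

# Same recoding pattern as in the lab: labels for display, stored separately.
survival_nominal <- factor(survival, levels = c("0", "1"),
                           labels = c("Died", "Survived"))

mean(survival)                # proportion of 1s in the toy vector
is.numeric(survival)          # the original stays numeric
is.numeric(survival_nominal)  # the factor copy does not
table(survival_nominal)       # counts by label
```

The numeric version is the one that goes on the left side of the regression formula; the factor version is for readable tables.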

For comparison purposes, let’s run a frequency of Survival_nominal and crosstabs of Survival_nominal with Gender and with Crew. The third crosstab uses both Gender and Crew as column variables; make a copy of it and switch the order of the column variables.

frequency(RMS_Titanic$Survival_nominal)

##  Values   Freq Percent
##  Died     1496 67.8   
##  Survived 712  32.2   
##  Total    2208 100

pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = "Gender",
          title = "Survival by Gender", format = "col_percent"))

	Male	Female
Died	79.3	26.7
Survived	20.7	73.3
Total N	1722	486

pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = "Crew",
          title = "Survival of Crew and Passengers",
          format = "col_percent"))

	Passenger	Crew
Died	62	76.2
Survived	38	23.8
Total N	1317	891

# This cross tab will control for two variables.
pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = c("Crew", "Gender"), format = "col_percent",
          title = "Survival by Gender and Group"))

	Crew Female	Crew Male	Passenger Female	Passenger Male
Died	13	77.9	27.4	80.8
Survived	87	22.1	72.6	19.2
Total N	23	868	463	854

pretty_tab(crosstab(RMS_Titanic, row.vars = "Survival_nominal", col.vars = c("Gender", "Crew"),
          title = "Survival by Gender and Group", format = "col_percent"))

	Female Crew	Female Passenger	Male Crew	Male Passenger
Died	13	27.4	77.9	80.8
Survived	87	72.6	22.1	19.2
Total N	23	463	868	854

Now we will run three regressions: first one without any independent variables, then one with Gender, then one with Gender and Crew. We use the lm() function; lm stands for linear model.

lm(Survival ~ 1, RMS_Titanic)

## 
## Call:
## lm(formula = Survival ~ 1, data = RMS_Titanic)
## 
## Coefficients:
## (Intercept)  
##      0.3225

lm(Survival ~ Gender, RMS_Titanic)

## 
## Call:
## lm(formula = Survival ~ Gender, data = RMS_Titanic)
## 
## Coefficients:
##  (Intercept)  GenderFemale  
##       0.2067        0.5258

lm(Survival ~ Gender + Crew, RMS_Titanic)

## 
## Call:
## lm(formula = Survival ~ Gender + Crew, data = RMS_Titanic)
## 
## Coefficients:
##  (Intercept)  GenderFemale      CrewCrew  
##      0.18924       0.54163       0.03472
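If the .Rda file is not at hand, the three models can be approximated from the crosstab alone. The sketch below rebuilds a person-level data frame from the "Total N" row of the Gender-by-Crew crosstab. The survivor counts are an assumption: they are the rounded column percentages multiplied by N and rounded to whole people. The names `cells` and `rebuilt` are made up for this illustration.

```r
# Cell sizes from the crosstab's "Total N" row; survivor counts are
# reconstructed from the rounded percentages (an assumption, not raw data).
cells <- data.frame(
  Gender   = c("Male", "Male", "Female", "Female"),
  Crew     = c("Passenger", "Crew", "Passenger", "Crew"),
  n        = c(854, 868, 463, 23),
  survived = c(164, 192, 336, 20)
)

# Expand to one row per person: 1s for survivors, 0s for everyone else.
rebuilt <- do.call(rbind, lapply(seq_len(nrow(cells)), function(i) {
  data.frame(
    Gender   = cells$Gender[i],
    Crew     = cells$Crew[i],
    Survival = rep(c(1, 0),
                   times = c(cells$survived[i], cells$n[i] - cells$survived[i]))
  )
}))

# Same reference categories as the lab: Male and Passenger come first.
rebuilt$Gender <- factor(rebuilt$Gender, levels = c("Male", "Female"))
rebuilt$Crew   <- factor(rebuilt$Crew, levels = c("Passenger", "Crew"))

fit0 <- lm(Survival ~ 1, data = rebuilt)
fit1 <- lm(Survival ~ Gender, data = rebuilt)
fit2 <- lm(Survival ~ Gender + Crew, data = rebuilt)

# Compare these with the coefficients printed in the lab output.
round(coef(fit0), 4)
round(coef(fit1), 4)
round(coef(fit2), 5)
```

Storing each fit in an object (rather than only printing it) also makes it possible to call coef() or predict() on it later.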

Write out the three equations below:

No independent variable:

Gender only:

Gender and Crew:

Calculate the predicted value for the equation with no independent variable:

How does the equation relate to the frequency distribution results?

Calculate the predicted values for males and females for the Gender only model.

Did being female help or hurt or make no difference?

How do these results compare to the cross tab for gender?

For the equation with Gender and Crew as independent variables, calculate the predicted values for:

Female, Crew
Female, Passenger
Male, Crew
Male, Passenger
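One way to check hand calculations for these groups is to plug 0/1 indicators into the fitted equation. This is a minimal sketch: the coefficients are copied from the lm(Survival ~ Gender + Crew) output in this lab, while the helper `predict_survival` and the `groups` table are hypothetical names introduced only for illustration.

```r
# Coefficients copied from the lm(Survival ~ Gender + Crew) output.
b0       <- 0.18924  # (Intercept): the reference group is Male, Passenger
b_female <- 0.54163  # GenderFemale: added when the person is female
b_crew   <- 0.03472  # CrewCrew: added when the person is crew

# Hypothetical helper: female and crew are 0/1 indicators.
predict_survival <- function(female, crew) {
  b0 + b_female * female + b_crew * crew
}

# All four combinations of the two indicators.
groups <- expand.grid(female = c(0, 1), crew = c(0, 1))
groups$predicted <- predict_survival(groups$female, groups$crew)
groups
```

Each row of `groups` corresponds to one of the four groups; a 0 in both columns is the reference group, whose predicted value is just the intercept.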

Which group has the best predicted probability of surviving?

Did Gender have an effect that was positive, negative or close to 0?

Did it stay the same as the equation for Gender alone? If not, how did it change?

Did Crew have an effect that was positive, negative or close to 0?

How do the predicted values from the equation with two variables compare to the crosstabs?