First let’s create two new variables, the number of male passengers and the number of crew members. Remember that No. of passengers is the total passengers, No. of women passengers is the number of women passengers, and Ship size is the total number of people including passengers and crew. Set up the code below to create the new variables.

#type your code here
ships$No_of_male_passengers <- ships$`No. of passengers`- ships$`No. of women passengers`

ships$No_of_crew <- ships$`Ship size` - ships$`No. of passengers`
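A quick way to check derived columns like these is to confirm that the parts add back up to the total. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the ships data) with the same column names as the worksheet.

```r
# Hypothetical toy data; column names mirror the worksheet, values are invented
toy <- data.frame(
  `No. of passengers`       = c(100, 250),
  `No. of women passengers` = c(40, 90),
  `Ship size`               = c(130, 300),
  check.names = FALSE  # keep the spaces and dots in the column names
)

# Same arithmetic as in the worksheet
toy$No_of_male_passengers <- toy$`No. of passengers` - toy$`No. of women passengers`
toy$No_of_crew            <- toy$`Ship size` - toy$`No. of passengers`

# Sanity check: men + women + crew should equal ship size on every row
parts_ok <- all(
  toy$No_of_male_passengers + toy$`No. of women passengers` + toy$No_of_crew ==
    toy$`Ship size`
)
parts_ok
```

The same `all(...)` check could be run on `ships` once both new columns exist.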

Let’s also create new dichotomous variables, one for the Titanic and one for your ship. These will be TRUE if the observation is that ship, FALSE otherwise.

ships$Titanic <- ifelse(ships$`Name of Ship` == 'RMS Titanic', TRUE, FALSE)
ships$Titanic
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# Now do yours

ships$Mv_Princess_Victoria <- ifelse(ships$`Name of Ship` == 'MV Princess Victoria', TRUE, FALSE)
ships$Mv_Princess_Victoria 
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Make sure to check your results in the displayed values.
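Besides reading the printed vector, one possible check is to count the TRUE values and find their position. This is only a sketch on made-up ship names (`ship_names` is hypothetical); it also shows that the comparison alone gives the same TRUE/FALSE result as the `ifelse()` version.

```r
# Hypothetical ship names, only for illustration
ship_names <- c("Ship A", "RMS Titanic", "Ship B")

# The == comparison already returns a logical (TRUE/FALSE) vector
is_titanic <- ship_names == "RMS Titanic"

n_matches <- sum(is_titanic)    # TRUE counts as 1, so this counts matching rows
position  <- which(is_titanic)  # index of the matching row(s)

# Same values as the ifelse() form used in the worksheet
same_as_ifelse <- identical(is_titanic,
                            ifelse(ship_names == "RMS Titanic", TRUE, FALSE))
```

If `sum()` returns 0 on the real data, the name string probably does not match the spelling in `Name of Ship` exactly.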

Now get the summary data for the number of male passengers, number of women passengers and number of crew members.

How do the three variables compare to each other in terms of central tendency (mean and median) and variation (range and interquartile range)?

(Feel free to use additional R functions of your choice to get this information)

In terms of central tendency, No. of male passengers has the greatest mean and median. No. of crew has the greatest maximum, which could be because that value is an outlier. We can’t rely on the range alone, because the maximum is sometimes a very large number that is an outlier. It is more relevant to look at the interquartile range, because it covers the middle half of the values and is less affected by outliers.

summary(ships$No_of_male_passengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    53.0   148.2   351.0   373.9   537.0   854.0
summary(ships$`No. of women passengers`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   64.25  238.50  248.20  385.20  578.00
summary(ships$No_of_crew)       
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    29.0    65.0   114.0   219.1   269.0   891.0
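The range and interquartile range can be worked out directly from the printed summaries. The sketch below copies the rounded values from the `summary()` output above into a small matrix (the object names `five_num` and `spread` are made up for this example) and subtracts.

```r
# Values copied from the summary() output above (rounded as printed)
five_num <- rbind(
  male  = c(min = 53, q1 = 148.2, q3 = 537.0, max = 854),
  women = c(min = 7,  q1 = 64.25, q3 = 385.2, max = 578),
  crew  = c(min = 29, q1 = 65.0,  q3 = 269.0, max = 891)
)

# Range = max - min; interquartile range = 3rd quartile - 1st quartile
spread <- cbind(
  range = five_num[, "max"] - five_num[, "min"],
  iqr   = five_num[, "q3"]  - five_num[, "q1"]
)
spread
```

On these printed values, crew has the widest range (862) but the narrowest interquartile range (204). On the full data, base R's `IQR()` and `range()` could be applied to each column instead.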

All three of these “number” variables are interval variables. What makes them interval variables?

All three number variables are interval variables because an interval variable is a measurement where the difference between two values is meaningful.

Do ships with more male passengers have more female passengers? To explore this, one thing we can do is make a scatterplot using ggplot and geom_point.

plot1 <- ggplot(ships, aes(x=`No. of women passengers`, y=No_of_male_passengers)) 
  plot1 + 
  geom_point() +
  ggtitle("Relationship between the number of male and female passengers")

In looking at your results would you say that generally speaking as the number of female passengers increases the number of male passengers increases?

Yes

If you increase the number of female passengers by 100 how do you think that would change the number of male passengers you would expect? (Just an estimation by eyeball)

I think the number of male passengers would also increase.

Now make the same plot but switch the x and y axes.

plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=`No. of women passengers`)) 
  plot1 + 
  geom_point() +
  ggtitle("Relationship between the number of male and female passengers")

In looking at your results would you say that generally speaking as the number of male passengers increases the number of female passengers increases?

No

If you increase the number of male passengers by 100 how do you think that would change the number of female passengers you would expect? (Just an estimation by eyeball)

I don’t think increasing the number of male passengers by 100 would change the number of female passengers.
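One way to double-check an eyeball answer after swapping the axes is to compute the correlation in both orders and compare. The sketch below uses made-up numbers (`men` and `women` are hypothetical vectors, not the ships data); `cor()` is the base R correlation function.

```r
# Hypothetical passenger counts, only for illustration
men   <- c(50, 120, 300, 450, 800)
women <- c(10, 60, 200, 380, 500)

r_xy <- cor(women, men)  # women on x, men on y
r_yx <- cor(men, women)  # axes swapped

# The correlation does not depend on which variable is called x
same_direction <- isTRUE(all.equal(r_xy, r_yx))
```

With the real data, the two columns from `ships` could be passed to `cor()` in the same way and the result compared with the scatterplots.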

Now do the same thing but with number of crew members as the y variable. You can use either number of male passengers or number of female passengers as the x variable.

plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=No_of_crew)) 
  plot1 + 
  geom_point() +
  ggtitle("Relationship between the number of male passengers and crew members")

In looking at your results would you say that generally speaking as the number of male or female passengers increases the number of crew members increases?

No

If you increase the number of male or female passengers by 100 how do you think that would change the number of crew you would expect? (Just an estimation by eyeball).

No

Based on the change due to change of 100, what would the impact of change of 1 male or female passenger be?

I don’t think it would change.

Now rerun two or three of the graphs but add this geom code to them to get a “linear fit” line.

geom_smooth(method="lm", se=FALSE)

plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=No_of_crew)) 
  plot1 + 
  geom_point() +
  ggtitle("Relationship between the number of male passengers and crew members") +
  geom_smooth(method="lm", se=FALSE)

Does the line change your thinking at all? Based on the line what are the impacts of changing the x variable by 100? What would the impact of changing by 1 be?

Yes. I can see the line has a positive slope, which means the number of crew does increase. The line also makes the graph clearer to read.

Now let’s use R to estimate the slope of the line. Here is one example; add more corresponding to the graphs you made above.

results1 <- glm(No_of_male_passengers ~ `No. of women passengers`, family = gaussian, data = ships)
coefficients(results1)
##               (Intercept) `No. of women passengers` 
##               175.5697768                 0.7989579

How do the coefficients for your x variables relate to your estimates of change based on increase of 100 and increase of 1?

The coefficient is positive (about 0.8), which means there is an impact: more women passengers goes with more male passengers.
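The coefficients printed above can be turned into expected changes with a little arithmetic. In the sketch below the intercept and slope are copied from the `coefficients(results1)` output; the helper `predict_men()` is a made-up name for this example.

```r
# Coefficients copied from the output of coefficients(results1) above
intercept <- 175.5697768
slope     <- 0.7989579

# Hypothetical helper: expected number of male passengers for a given
# number of women passengers, following the fitted straight line
predict_men <- function(women) intercept + slope * women

change_100 <- predict_men(200) - predict_men(100)  # effect of +100 women
change_1   <- predict_men(101) - predict_men(100)  # effect of +1 woman
c(change_100 = change_100, change_1 = change_1)
```

Because the line is straight, the change for 100 is just 100 times the slope (about 80 male passengers), and the change for 1 is the slope itself (about 0.8).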

How does this glm differ from the glm() we did previously?

In the earlier glm() we used + and in this glm() we use ~. Also, this glm() uses family = gaussian instead of binomial(link = logit).
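One more way to see what `family = gaussian` does is to compare it with an ordinary linear model. This is a minimal sketch on made-up numbers (`x` and `y` are hypothetical); it assumes the default identity link for the gaussian family.

```r
# Hypothetical data, only for illustration
x <- c(10, 20, 30, 40, 50, 60)
y <- c(12, 25, 29, 47, 52, 58)

fit_glm <- glm(y ~ x, family = gaussian)  # gaussian GLM, identity link by default
fit_lm  <- lm(y ~ x)                      # ordinary least squares

# The two fits give the same intercept and slope (up to numerical tolerance)
same_coefs <- isTRUE(all.equal(coef(fit_glm), coef(fit_lm)))
same_coefs
```

A binomial family with a logit link, by contrast, is meant for an outcome that is TRUE/FALSE (or 0/1), and its coefficients are on the log-odds scale rather than in units of y.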