First let’s create two new variables, the number of male passengers and the number of crew members. Remember that No. of passengers is the total passengers, No. of women passengers is the number of women passengers, and Ship size is the total number of people including passengers and crew. Set up the code below to create the new variables.
#type your code here
ships$No_of_male_passengers <- ships$`No. of passengers`- ships$`No. of women passengers`
ships$No_of_crew <- ships$`Ship size` - ships$`No. of passengers`
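One way to check the two new variables is to confirm that the pieces add back up to the totals. The sketch below is only an illustration: `toy_ships` is a made-up data frame with invented numbers (not the real `ships` data) that reuses the same column names.

```r
# Toy data with invented values; column names mirror the ships data.
toy_ships <- data.frame(
  `No. of passengers`       = c(100, 250, 40),
  `No. of women passengers` = c(30, 120, 10),
  `Ship size`               = c(130, 300, 55),
  check.names = FALSE  # keep the spaces and dots in the column names
)

# Same calculations as above, applied to the toy data.
toy_ships$No_of_male_passengers <- toy_ships$`No. of passengers` - toy_ships$`No. of women passengers`
toy_ships$No_of_crew <- toy_ships$`Ship size` - toy_ships$`No. of passengers`

# Men plus women should give back the passenger total.
all(toy_ships$No_of_male_passengers + toy_ships$`No. of women passengers` == toy_ships$`No. of passengers`)

# Neither new variable should be negative.
all(toy_ships$No_of_male_passengers >= 0 & toy_ships$No_of_crew >= 0)
```

If either check returns FALSE on the real data, a column name or a subtraction is probably wrong.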
Let’s also create new dichotomous variables, one for the Titanic and one for your ship. Each will be TRUE if the observation is that ship and FALSE otherwise.
ships$Titanic <- ifelse(ships$`Name of Ship` == 'RMS Titanic', TRUE, FALSE)
ships$Titanic
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# Now do yours
ships$Mv_Princess_Victoria <- ifelse(ships$`Name of Ship` == 'MV Princess Victoria', TRUE, FALSE)
ships$Mv_Princess_Victoria
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Make sure to check your results in the displayed values.
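Besides reading the printed TRUE/FALSE values, a few base R functions can confirm that exactly one row was flagged. This is a small sketch on a made-up vector of names (`toy_names` is hypothetical; only the two ship names used in this lab are real).

```r
# Toy ship names; "Ship A" and "Ship B" are placeholders.
toy_names <- c("Ship A", "RMS Titanic", "Ship B", "MV Princess Victoria")

# The comparison already returns TRUE/FALSE, so ifelse() is optional here.
is_titanic <- toy_names == "RMS Titanic"

sum(is_titanic)    # number of TRUE values; exactly one is expected
which(is_titanic)  # row position of the TRUE value
table(is_titanic)  # counts of FALSE and TRUE
```

On the real data the same calls would be applied to `ships$Titanic` and to the variable for your own ship.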
Now get the summary data for the number of male passengers, number of women passengers and number of crew members.
(Feel free to use additional R functions of your choice to get this information)
Comparing the three variables in terms of central tendency, No. of male passengers has the greatest mean and median. No. of crew has the greatest maximum, which could be because that value is an outlier. We should not rely on the range, because the maximum is sometimes a very large outlier. The interquartile range is more relevant because it covers only the middle 50% of the values and is not affected by extreme values.
summary(ships$No_of_male_passengers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 53.0 148.2 351.0 373.9 537.0 854.0
summary(ships$`No. of women passengers`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 64.25 238.50 248.20 385.20 578.00
summary(ships$No_of_crew)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.0 65.0 114.0 219.1 269.0 891.0
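As an example of additional functions, `IQR()` and `range()` can be compared directly to see how an outlier affects each measure of spread. The sketch below uses a made-up data frame (`toy`, with invented values and a deliberately extreme crew value), not the real `ships` data.

```r
# Toy data with invented values; the last crew value is a deliberate outlier.
toy <- data.frame(
  male  = c(50, 150, 350, 540, 850),
  women = c(10, 60, 240, 390, 580),
  crew  = c(30, 65, 115, 270, 5000)
)

sapply(toy, median)                       # middle value of each variable
sapply(toy, IQR)                          # spread of the middle 50% of values
sapply(toy, function(x) diff(range(x)))   # max minus min, pulled up by outliers
```

In this toy example the crew range is dominated by the single extreme value, while the crew IQR stays close to the spread of the other rows.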
All three of the count variables are interval variables, because an interval variable is a measurement where the difference between two values is meaningful. (Counts also have a true zero, so they can be treated as ratio variables as well.)
Do ships with more male passengers have more female passengers? To explore this, one thing we can do is make a scatterplot using ggplot and geom_point.
plot1 <- ggplot(ships, aes(x=`No. of women passengers`, y=No_of_male_passengers))
plot1 +
geom_point() +
ggtitle("Relationship between the number of male and female passengers")
I think the number of male passengers would also increase.
Now make the same plot but switch the x and y axes.
plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=`No. of women passengers`))
plot1 +
geom_point() +
ggtitle("Relationship between the number of male and female passengers")
I don’t think that increasing the number of male passengers by 100 would change the number of female passengers.
Now do the same thing but with number of crew members as the y variable. You can use either number of male passengers or number of female passengers as the x variable.
plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=No_of_crew))
plot1 +
geom_point() +
ggtitle("Relationship between the number of male passengers and crew members")
I don’t think it would change.
Now rerun two or three of the graphs but add this geom code to them to get a “linear fit” line.
geom_smooth(method = "lm", se = FALSE)
plot1 <- ggplot(ships, aes( x=No_of_male_passengers, y=No_of_crew))
plot1 +
geom_point() +
ggtitle("Relationship between the number of male passengers and crew members") +
geom_smooth(method = "lm", se = FALSE)
Yes. The line has a positive slope, which means the number of crew members tends to increase as the number of male passengers increases. The line also makes the pattern in the graph easier to see.
Now let’s use R to estimate the slope of the line. Here is one example; add more corresponding to the graphs you made above.
results1 <- glm(No_of_male_passengers ~ `No. of women passengers`, family = gaussian, data = ships)
coefficients(results1)
## (Intercept) `No. of women passengers`
## 175.5697768 0.7989579
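The slope is positive (about 0.80), which means there is a positive relationship: ships with more women passengers tend to have more male passengers.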
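To make the slope concrete, it can help to compare the predictions for two ships that differ by 100 women passengers. The sketch below uses a made-up data frame (`toy`, with invented values and the hypothetical column names `women` and `men`), not the real `ships` data, so the numbers it prints are not the lab results.

```r
# Toy data with invented values (not the real ships data).
toy <- data.frame(
  women = c(10, 60, 240, 390, 580),
  men   = c(50, 150, 350, 540, 850)
)

# Same kind of model as above: a gaussian glm with one predictor.
fit <- glm(men ~ women, family = gaussian, data = toy)
slope <- coefficients(fit)["women"]

# Predicted men for two hypothetical ships that differ by 100 women passengers.
pred <- predict(fit, newdata = data.frame(women = c(200, 300)))

diff(pred)    # difference between the two predictions
100 * slope   # the same number: 100 times the slope
```

Because the model is a straight line, the predicted change for 100 more women passengers is always 100 times the slope, wherever on the line the comparison starts.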
In the earlier glm() call we used +; in this glm() call we use only ~. This call also does not use family = binomial(link = logit); it uses family = gaussian instead.
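The difference between the two families can be seen side by side on a small example. This is only a sketch on a made-up data frame (`toy`, with the hypothetical columns `x`, `y_count` and `y_flag`); it assumes the earlier model used a 0/1 outcome, which is the usual reason for choosing `binomial(link = "logit")`.

```r
# Toy data with invented values (not the real ships data).
toy <- data.frame(
  x       = c(1, 2, 3, 4, 5, 6, 7, 8),
  y_count = c(3, 5, 4, 8, 9, 11, 10, 14),  # numeric outcome
  y_flag  = c(0, 0, 1, 0, 1, 0, 1, 1)      # 0/1 outcome
)

# Gaussian family: an ordinary straight-line fit, the same as lm().
fit_gaussian <- glm(y_count ~ x, family = gaussian, data = toy)
fit_lm <- lm(y_count ~ x, data = toy)

# Binomial family with a logit link: a logistic regression for a 0/1 outcome.
fit_binomial <- glm(y_flag ~ x, family = binomial(link = "logit"), data = toy)

coefficients(fit_gaussian)  # slope is in units of y_count per unit of x
coefficients(fit_lm)        # matches the gaussian glm
coefficients(fit_binomial)  # slope is in log-odds per unit of x
```

The formula always uses ~ to separate the outcome from the predictors; + only appears when there is more than one predictor on the right-hand side.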