Describe why you picked that particular variable at the beginning of your report. I chose Ship Id because I thought it would be easy to work with and because the variable had to be numerical, not categorical.
There are other ways we can display a variable that are different because they do not focus on individual observations. For example, we could use a histogram. In that case we need a new plot object that only has an x variable.
Previously we had made a histogram that included lines showing the 25th, 50th and 75th percentiles, also known as the first, second and third quartiles.
Let’s do this in a slightly different way by making objects for each of our vertical lines.
You have to do the other two; follow the same naming pattern.
quartile1v <-
geom_vline(xintercept=quantile(ships$`Ship Id`, probs=c(.25),
na.rm=TRUE), color="red", linetype="dashed", size=2)
quartile2v <-
geom_vline(xintercept=quantile(ships$`Ship Id`, probs=c(.50),
na.rm=TRUE), color="blue", linetype="dashed", size=2)
quartile3v <-
geom_vline(xintercept=quantile(ships$`Ship Id`, probs=c(.75),
na.rm=TRUE), color="pink", linetype="dashed", size=2)
Now let’s make our histogram again using these objects.
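# Only an x aesthetic is needed for a histogram
plot1 <- ggplot(ships, aes(x = `Ship Id`))
# Ship Id runs from 1 to 18, so a binwidth of 1 gives one bar per value
plot1 + geom_histogram(binwidth = 1) + quartile1v + quartile2v + quartile3v +
  ggtitle("Figure 1: Distribution of Ship Id")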
For comparison’s sake, let’s also get the summary of the distribution of Ship Id and list the actual values (since there are only 18, that is not too many).
ships$`Ship Id`
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
summary(ships$`Ship Id`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    5.25    9.50    9.50   13.75   18.00
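As a quick check of how the vertical lines relate to the summary, the quartiles can be recomputed directly from the values listed above. This is only a minimal sketch: it uses the vector 1:18 (the Ship Id values printed above) as a stand-in for ships$`Ship Id` so that it runs without the data file, and the name ship_id is made up for illustration.

```r
# Stand-in for ships$`Ship Id`, using the 18 values printed above
ship_id <- 1:18

# The same quantile() call that supplies xintercept for the three vertical lines
q <- quantile(ship_id, probs = c(.25, .50, .75), na.rm = TRUE)
print(q)

# summary() reports the same quartiles as its 1st Qu., Median and 3rd Qu. entries
s <- summary(ship_id)
print(s)
```

The three values in q should match the 1st Qu., Median and 3rd Qu. columns of the summary, which is why the dashed lines land where the summary says the quartiles are.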
Make sure you can see how the histogram relates to the summary and raw data.
Let’s create horizontal lines for the 3 quartiles. (You have to do the other 2.)
quartile1h <-
geom_hline(yintercept=quantile(ships$`Ship Id`, probs=c(.25),
na.rm=TRUE), color="yellow", linetype="dashed", size=1)
quartile2h <-
geom_hline(yintercept=quantile(ships$`Ship Id`, probs=c(.50),
na.rm=TRUE), color="orange", linetype="dashed", size=1)
quartile3h <-
geom_hline(yintercept=quantile(ships$`Ship Id`, probs=c(.75),
na.rm=TRUE), color="brown", linetype="dashed", size=1)
Now make the box plot but include the 3 horizontal lines.
plot2<- ggplot(ships, aes(x = factor(0), y= `Ship Id`))
plot2+ geom_boxplot()+ quartile1h+ quartile2h+
quartile3h+
ggtitle("Ship Id")
The lines show us Q1, the median, and Q3.
The box plot with lines relates to the histogram with lines because both mark the same three quartile values, just in a different display.
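One way to see why the same three values make sense in both displays is that the quartiles cut the data into four groups of roughly equal size, however the data are drawn. The sketch below assumes the vector 1:18 (the Ship Id values listed earlier) as a stand-in for ships$`Ship Id`; the names ship_id and groups are hypothetical.

```r
# Stand-in for ships$`Ship Id`
ship_id <- 1:18

# The three quartile values used for the dashed lines
q <- quantile(ship_id, probs = c(.25, .50, .75), na.rm = TRUE)

# Assign each observation to the interval between neighbouring lines
groups <- cut(ship_id, breaks = c(-Inf, q, Inf),
              labels = c("below Q1", "Q1 to median", "median to Q3", "above Q3"))

# Count how many observations fall in each of the four intervals
counts <- table(groups)
print(counts)
```

With 18 values the four groups hold 5, 4, 4 and 5 observations, so each one contains about a quarter of the data.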
Now let’s make boxplots with lines and titles for different x variables:
Survived, Quick, Cause, `Women and children first`
plot4<-ggplot(ships, aes(x=factor(Survived) , y=`Ship Id` ))
plot4+ geom_boxplot()+ quartile1h+ quartile2h+
quartile3h+
ggtitle("Survived")
plot5<-ggplot(ships, aes(x=factor(Quick) , y=`Ship Id` ))
plot5+ geom_boxplot()+ quartile1h+ quartile2h+
quartile3h+
ggtitle("Quick")
plot6<-ggplot(ships, aes(x=factor(Cause) , y=`Ship Id` ))
plot6+ geom_boxplot()+ quartile1h+ quartile2h+
quartile3h+
ggtitle("Cause")
plot7<-ggplot(ships, aes(x=factor(`Women and children first`) , y=`Ship Id` ))
plot7+ geom_boxplot()+ quartile1h+ quartile2h+
quartile3h+
ggtitle("Women and children first")
# Display plot3 using geom_boxplot().
Each nominal variable relates to the plotted numerical variable (Ship Id here) in that the center, the range, and the extreme values of the boxplots differ from one category to the next.
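When reading the grouped boxplots, it helps to remember that the dashed lines are the quartiles of all observations together, while each box is drawn from the observations in its own category only. The sketch below compares the two in numbers. It assumes a made-up data frame called toy, whose column value stands in for the y variable and whose column group stands in for a nominal variable such as Survived; the grouping is invented and is not the real ships data.

```r
# Hypothetical toy data: NOT the real ships data
toy <- data.frame(
  value = 1:18,
  group = rep(c("no", "yes"), times = c(10, 8))
)

# Overall quartiles: what the dashed horizontal lines show
overall <- quantile(toy$value, probs = c(.25, .50, .75), na.rm = TRUE)

# Per-category quartiles: what the edges and middle line of each box show
by_group <- tapply(toy$value, toy$group, quantile,
                   probs = c(.25, .50, .75), na.rm = TRUE)

print(overall)
print(by_group)
```

If a category's quartiles sit well above or below the overall ones, its box will sit away from the dashed lines, which is the kind of difference between categories the plots are meant to reveal.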