Day 3 homework

Author

RB

# Clear the workspace, then load the Titanic training data
rm(list = ls())
train <- read.csv("~/Boston College Experience - Data Analysis/train.csv")
# Drop rows with any missing values
x <- na.omit(train)
df_clean <- x

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
?stargazer
stargazer(x,                                     
          type   = "text",                            
          title  = "Summary Statistics Table",        
          digits = 1
          )

Summary Statistics Table
========================================
Statistic    N  Mean  St. Dev. Min  Max 
----------------------------------------
PassengerId 714 448.6  259.1    1   891 
Survived    714  0.4    0.5     0    1  
Pclass      714  2.2    0.8     1    3  
Age         714 29.7    14.5   0.4 80.0 
SibSp       714  0.5    0.9     0    5  
Parch       714  0.4    0.9     0    6  
Fare        714 34.7    52.9   0.0 512.3
----------------------------------------

Key Observations

  1. One key observation is that the maximum fare was $512.30, while the mean fare was only $34.70. Because the mean is so far below the maximum, most passengers must have paid far less than the top fare, and only a few could have bought tickets priced near $512.30. The standard deviation (52.9) is larger than the mean, which is also consistent with a few very expensive tickets stretching the upper end of the distribution.

  2. Another observation is that the standard deviation of age is 14.5 years around a mean of 29.7, which indicates a wide spread of ages on board the Titanic. Passengers ranged from infants (minimum 0.4 years) to an 80-year-old.
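
The first observation compares the mean with the maximum; comparing the mean with the median is another common way to check whether a few large values dominate. The sketch below is only an illustration: `skew_summary()` is a hypothetical helper that is not part of the homework code, and `toy_fares` holds made-up numbers rather than Titanic fares.

```r
# Summarise how far the mean sits from the median for a numeric vector.
# skew_summary() is a hypothetical helper, not part of the homework code.
skew_summary <- function(x) {
  x <- x[!is.na(x)]                        # ignore missing values
  c(mean             = mean(x),
    median           = median(x),
    max              = max(x),
    share_above_mean = mean(x > mean(x)))  # fraction of values above the mean
}

# Toy fares, made up purely to illustrate the pattern (not Titanic values).
toy_fares <- c(7, 8, 8, 10, 13, 26, 30, 500)
res <- skew_summary(toy_fares)
print(res)
```

On the homework data the same helper could be called as `skew_summary(df_clean$Fare)`. A median well below the mean, together with a small share of values above the mean, would support the reading that only a few passengers bought very expensive tickets.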

library(ggplot2)

# Box plot of age by passenger class
p <- ggplot(df_clean,
            aes(x = factor(Pclass),
                y = Age)) +
  geom_boxplot()

p + labs(x = "Passenger Class")

The 1st-class box, which spans the 1st to 3rd quartiles with a line at the median, sits higher on the y-axis than the boxes for 2nd and 3rd class. This suggests that 1st-class passengers tended to be older, while 3rd-class passengers tended to be younger.
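
A box plot reading like this can be cross-checked numerically by computing the median age within each class. The following is a minimal sketch using base R's `aggregate()`; the data frame `toy` reuses the homework's column names (`Pclass`, `Age`), but its values are made up for illustration only.

```r
# Median of a numeric variable within each group, using base R.
# `toy` is a made-up data frame; only the column names match the homework data.
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Age    = c(30, 40, 50, 20, 30, 40, 10, 20, 30)
)

# One row per class, with the median age in the Age column
by_class <- aggregate(Age ~ Pclass, data = toy, FUN = median)
print(by_class)
```

Replacing `toy` with `df_clean` would give the per-class medians that correspond to the middle lines of the boxes in the plot.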

# Histogram of age in 5-year bins
p <- ggplot(df_clean,
            aes(x = Age)) +
  geom_histogram(binwidth = 5)
p + labs(y = "Frequency")

The age distribution is slightly right-skewed: most passengers were between 20 and 40 years old, and the tail of the histogram extends further toward older ages than toward younger ones. Fewer passengers were under 20 or over 60.
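
Skew judged by eye from a histogram can also be checked with a number. The sketch below computes the moment-based sample skewness (third central moment divided by the second central moment raised to the power 1.5); a positive value indicates a longer right tail. `sample_skewness()` is a hypothetical helper written for this illustration, and `toy_ages` is made-up data, not the Titanic ages.

```r
# Moment-based sample skewness: m3 / m2^1.5.
# sample_skewness() is a hypothetical helper, not part of the homework code.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]              # ignore missing values
  m <- mean(x)
  m2 <- mean((x - m)^2)          # second central moment
  m3 <- mean((x - m)^3)          # third central moment
  m3 / m2^1.5
}

# Made-up ages with a longer tail on the older side (not Titanic values).
toy_ages <- c(2, 18, 22, 24, 27, 30, 33, 38, 45, 71)
skew <- sample_skewness(toy_ages)
print(skew)
```

Calling `sample_skewness(df_clean$Age)` on the homework data would show whether the sign matches the right skew seen in the histogram.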