remove(list = ls())
train <- read.csv("~/Boston College Experience - Data Analysis/train.csv")
x <- na.omit(train)
df_clean <- x
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
?stargazer
stargazer(x, type = "text", title = "Summary Statistics Table", digits = 1)
Summary Statistics Table
=============================================
Statistic      N   Mean  St. Dev.  Min    Max
---------------------------------------------
PassengerId  714  448.6     259.1    1    891
Survived     714    0.4       0.5    0      1
Pclass       714    2.2       0.8    1      3
Age          714   29.7      14.5  0.4   80.0
SibSp        714    0.5       0.9    0      5
Parch        714    0.4       0.9    0      6
Fare         714   34.7      52.9  0.0  512.3
---------------------------------------------
Key Observations
A key observation I noticed from the data is that the maximum fare was $512.30, while the mean fare was only $34.70. The mean sits much closer to the minimum fare ($0.00) than to the maximum, which means that most passengers paid far less for their ticket than the most expensive fare. I can infer that very few passengers bought tickets near $512.30, because many fares that high would have pulled the mean much higher.
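To put a number on how unusual the maximum fare is, here is a minimal sketch that uses only the rounded values from the summary table. The variable names (`fare_mean`, `fare_sd`, `fare_max`, `fare_max_z`) are my own and are not part of the original analysis.

```r
# Values copied from the summary table (rounded to one decimal place)
fare_mean <- 34.7
fare_sd   <- 52.9
fare_max  <- 512.3

# Distance of the maximum fare from the mean, measured in standard deviations
fare_max_z <- (fare_max - fare_mean) / fare_sd
round(fare_max_z, 1)
```

The maximum fare comes out at roughly 9 standard deviations above the mean, which is consistent with only a handful of passengers paying anything close to it.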
Another observation I noticed from the data is that the standard deviation of age is 14.5 years around a mean of 29.7, with ages ranging from 0.4 to 80. This dispersion of data points indicates that passengers of a wide variety of ages were on board the Titanic.
In the box plot of age by passenger class, the box for 1st class, which spans the 1st to 3rd quartiles and marks the median, sits higher on the y-axis than the boxes for 2nd and 3rd class. This suggests that 1st class passengers tended to be older, while 3rd class passengers tended to be younger.
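The quartiles that each box represents can also be computed directly. This is a rough sketch, assuming a data frame with `Age` and `Pclass` columns like `df_clean`. The helper name `age_quartiles_by_class` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: returns the 1st quartile, median, and 3rd quartile
# of Age for each passenger class (the three values a box plot draws as the box).
age_quartiles_by_class <- function(df) {
  # split() groups the ages by class; quantile() is applied to each group
  by_class <- split(df$Age, df$Pclass)
  q <- sapply(by_class, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  # sapply() puts classes in columns; transpose so each row is one class
  t(q)
}

# Example call on the cleaned data (assumes df_clean has been created):
# age_quartiles_by_class(df_clean)
```

If the "50%" column is largest for class 1 and smallest for class 3, that matches what the box plot shows.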
library(ggplot2)  # needed for ggplot() if it is not already loaded
p <- ggplot(df_clean, aes(x = Age)) +
  geom_histogram(binwidth = 5)
p + labs(y = "Frequency")
The age distribution is slightly skewed to the right: most passengers were between 20 and 40 years old, smaller numbers were under 20 or over 60, and the tail extends further toward older ages (up to 80).
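One quick way to back up the reading of the histogram is to compare the mean and the median. This is only a rough rule of thumb (a mean above the median usually goes with a right skew), and the helper `skew_hint` below is a hypothetical name of my own, not something from the original analysis.

```r
# Hypothetical helper: rough skew check based on mean vs. median.
# A mean above the median hints at a right skew; below it hints at a left skew.
skew_hint <- function(x) {
  x <- x[!is.na(x)]   # drop missing values before summarising
  m <- mean(x)
  med <- median(x)
  if (m > med) "right" else if (m < med) "left" else "none"
}

# Example call on the cleaned data (assumes df_clean has been created):
# skew_hint(df_clean$Age)
```

This does not replace looking at the histogram, but it gives a one-line check that agrees or disagrees with the visual impression.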