homework3

Author

VT

Set Up

remove(list=ls())

Cleaning

train <- read.csv("~/Downloads/train.csv")                     # importing data
x <- na.omit(train)                                            # cleaning data (dropping rows that contain NA values)
df_clean <- x                                                  # setting in a new object
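
Before relying on na.omit(), it can help to check which columns actually contain NA values and how many rows the call removes. The sketch below uses a small made-up data frame (toy, with hypothetical values) in place of train.csv; the same calls should work on train and df_clean. Note that read.csv() with default settings keeps empty text fields as "" rather than NA, so na.omit() does not drop rows on their account.

```r
# Toy data frame standing in for train.csv (values are made up for illustration)
toy <- data.frame(
  Age   = c(22, NA, 35, NA),
  Cabin = c("", "C85", "", "E46"),     # empty strings are not NA
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(toy))   # number of NA values in each column
toy_clean     <- na.omit(toy)          # drops every row with at least one NA
rows_dropped  <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

With the real data, colSums(is.na(train)) and nrow(train) - nrow(df_clean) would give the corresponding figures.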

library(stargazer)                                             # loading package

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
?stargazer

stargazer(df_clean,                                             # dataframe
          type   = "text",                                      # output format
          title  = "Summary Statistics Table For Titanic",      # title
          digits = 1                                            # control decimal places
          )

Summary Statistics Table For Titanic
========================================
Statistic    N  Mean  St. Dev. Min  Max 
----------------------------------------
PassengerId 714 448.6  259.1    1   891 
Survived    714  0.4    0.5     0    1  
Pclass      714  2.2    0.8     1    3  
Age         714 29.7    14.5   0.4 80.0 
SibSp       714  0.5    0.9     0    5  
Parch       714  0.4    0.9     0    6  
Fare        714 34.7    52.9   0.0 512.3
----------------------------------------
Key Observations
  • Setting aside PassengerId, which is only a row identifier, Age and Fare have by far the largest standard deviations (14.5 and 52.9); every other variable has a standard deviation below 1.

  • Age and Fare also have higher means and maximum values than the other variables, again excluding PassengerId.

Creating a Box plot (class distribution by survival status)

library(ggplot2)                                                               # loading package

ggplot(df_clean, aes(x = factor(Survived), y = Pclass)) +                      # creating a box plot 
  geom_boxplot() +
  labs(x = "Survived",                                                         # axis labels
       y = "Class",
       title = "Class Distribution by Survival Status")                       # title

Take Away
  • Class appears strongly associated with Survival Status: the box plot shows that the majority of passengers who did not survive were not in first class.
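
Because Pclass takes only three values, a cross-tabulation can back up what the box plot suggests with actual shares. The following is a minimal sketch on made-up data (toy is hypothetical); with the real data, df_clean$Survived and df_clean$Pclass would take the place of the toy columns.

```r
# Toy data (made up for illustration); with the real data use df_clean instead
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 0),
  Pclass   = c(3, 3, 2, 1, 1, 1, 2, 3)
)

counts     <- table(Survived = toy$Survived, Pclass = toy$Pclass)  # passengers per cell
row_shares <- prop.table(counts, margin = 1)                       # class shares within each survival group

print(counts)
print(round(row_shares, 2))
```

Each row of row_shares sums to 1, so the first-class share among non-survivors can be read directly from the row for Survived = 0.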

Creating a Histogram (Age distribution by survival status)

ggplot(data = df_clean, aes(x = Age)) +                                     # creating a histogram
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  facet_wrap(~ Survived) +
  labs(title = "Histogram of Age by Survival Status",                       # title
       x = "Age",                                                           # axis labels
       y = "Frequency") +
  theme_minimal()                                                           # theme

Take Away
  • Age did not seem to play a big role in Survival Status. Most passengers in both groups were around 20-40 years old, and the two histograms have broadly similar shapes.
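
A per-group summary is one way to check the visual impression from the histograms. This sketch computes the median age for each survival group on made-up data (toy is hypothetical); replacing toy with df_clean would apply the same call to the real data.

```r
# Toy data (made up for illustration); with the real data use df_clean instead
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1),
  Age      = c(20, 30, 40, 10, 30, 50)
)

# Median age within each survival group
age_by_group <- aggregate(Age ~ Survived, data = toy, FUN = median)

print(age_by_group)
```

Similar medians in the two groups would be consistent with the takeaway above, although a median alone does not capture differences in spread or in the youngest and oldest ages.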