homework3

Author

Set Up

remove(list=ls())

Cleaning

train <- read.csv("~/Downloads/train.csv")                     # importing data
x <- na.omit(train)                                            # cleaning data (dropping rows that contain NA values)
df_clean <- x                                                  # setting in a new object
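
As a rough check on what the cleaning step removes, the sketch below counts missing values per column before calling na.omit(). The data frame `toy` and its values are made up for illustration and are not taken from train.csv. It also shows that empty strings in a character column are not NA, so na.omit() leaves them in place.

```r
# Toy data standing in for train.csv (hypothetical values, for illustration only)
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 35, NA),
  Cabin    = c("", "C85", "", "E46"),   # empty strings are not NA
  stringsAsFactors = FALSE
)

na_counts    <- colSums(is.na(toy))          # NA count per column
toy_clean    <- na.omit(toy)                 # drops every row with at least one NA
rows_dropped <- nrow(toy) - nrow(toy_clean)  # how many rows were lost

print(na_counts)
print(rows_dropped)
```

Running the same two checks on train before cleaning would show which columns account for the rows that na.omit() drops.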

library(stargazer)                                             # loading package


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

?stargazer

stargazer(df_clean,                                             # dataframe
          type   = "text",                                      # output format
          title  = "Summary Statistics Table For Titanic",      # title
          digits = 1                                            # control decimal places
          )


Summary Statistics Table For Titanic
========================================
Statistic    N  Mean  St. Dev. Min  Max 
----------------------------------------
PassengerId 714 448.6  259.1    1   891 
Survived    714  0.4    0.5     0    1  
Pclass      714  2.2    0.8     1    3  
Age         714 29.7    14.5   0.4 80.0 
SibSp       714  0.5    0.9     0    5  
Parch       714  0.4    0.9     0    6  
Fare        714 34.7    52.9   0.0 512.3
----------------------------------------

Key Observations

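Setting aside the identifier PassengerId, the Age and Fare variables have the largest standard deviations (14.5 and 52.9) of the variables in the table.
Age and Fare also have higher means and maximum values than the remaining variables, which are either binary (Survived) or small counts and categories (Pclass, SibSp, Parch). For Fare, the standard deviation (52.9) exceeds the mean (34.7) and the maximum is 512.3, which suggests a right-skewed distribution.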
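
Because raw standard deviations depend on each variable's scale, one possible way to compare spread is the coefficient of variation (standard deviation divided by mean). The sketch below uses the rounded values from the summary table, so the results are approximate; the names `stats` and `cv` are illustrative.

```r
# Means and standard deviations copied from the summary table (rounded to 1 decimal)
stats <- data.frame(
  variable = c("Age", "Fare", "SibSp", "Parch"),
  mean     = c(29.7, 34.7, 0.5, 0.4),
  sd       = c(14.5, 52.9, 0.9, 0.9)
)

# Coefficient of variation: spread relative to the mean
stats$cv <- round(stats$sd / stats$mean, 2)

print(stats)
# cv: Age 0.49, Fare 1.52, SibSp 1.80, Parch 2.25
```

On this relative measure, Fare is far more dispersed than Age, and the small count variables SibSp and Parch are more dispersed still, even though their absolute standard deviations are small.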

Creating a Box plot (class distribution by survival status)

library(ggplot2)                                                               # loading package

ggplot(df_clean, aes(x = factor(Survived), y = Pclass)) +                      # creating a box plot (Survived on the x-axis)
  geom_boxplot() +
  labs(x = "Survived",                                                         # axis labels
       y = "Class",
       title = "Class Distribution by Survival Status")                       # title

Take Away

Class played a large part in survival status: the box plot shows that the majority of those who did not survive were not in first class.
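
Since Pclass takes only three values, a box plot shows little beyond its quartiles. A table of row-wise proportions is one way to check the takeaway directly. The sketch below uses a small made-up data frame `toy`, not the Titanic data.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 0),
  Pclass   = c(3, 3, 2, 1, 1, 1, 2, 3)
)

# Counts of passengers by survival status (rows) and class (columns)
counts <- table(Survived = toy$Survived, Pclass = toy$Pclass)

# Share of each class within each survival group (each row sums to 1)
shares <- prop.table(counts, margin = 1)

print(round(shares, 2))
```

Replacing `toy` with df_clean would give the share of first-, second-, and third-class passengers among non-survivors and survivors.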

Creating a Histogram (Age distribution by survival status)

ggplot(data = df_clean, aes(x = Age)) +                                     # creating a histogram
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  facet_wrap(~ Survived) +
  labs(title = "Histogram of Age by Survival Status",                       # title
       x = "Age",                                                           # axis labels
       y = "Frequency") +
  theme_minimal()                                                           # theme

Take Away

Age did not seem to play a big role in survival status. Most passengers in both groups were roughly 20 to 40 years old, and the shape of the age distribution did not differ greatly between those who survived and those who did not.
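
A numeric summary can back up a visual comparison of two histograms. As a minimal sketch, the code below computes the median age per survival group with aggregate(). The data frame `toy` is made up for illustration, so its numbers say nothing about the Titanic data.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1),
  Age      = c(20, 30, 40, 4, 28, 36)
)

# Median age within each survival group
age_by_status <- aggregate(Age ~ Survived, data = toy, FUN = median)

print(age_by_status)
```

Applied to df_clean, similar medians in the two groups would be consistent with the takeaway above, while a clear gap would suggest looking more closely at specific age ranges.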