homework3
remove(list = ls()) # clearing the workspace
Set Up
Cleaning
train <- read.csv("~/Downloads/train.csv") # importing data
x <- na.omit(train) # cleaning data (omitting blank values)
df_clean <- x # setting in a new object
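Because na.omit() drops every row that contains at least one NA, it can help to check which columns contribute the missing values before cleaning. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and only stand in for `train`.

```r
# Hypothetical toy data standing in for train (values are made up)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1),
  Age      = c(22, NA, 26, NA, 35),
  Fare     = c(7.25, 71.28, 7.92, 8.05, 53.10)
)

na_per_column <- colSums(is.na(toy))          # count of NA values in each column
toy_clean     <- na.omit(toy)                 # drop rows with any NA
rows_dropped  <- nrow(toy) - nrow(toy_clean)  # how many rows the cleaning removed

print(na_per_column)
print(rows_dropped)
```

One caveat worth keeping in mind: with the default settings of read.csv(), blank cells in text columns are read as empty strings rather than NA, so na.omit() does not remove them.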
library(stargazer) # loading package
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
?stargazer
stargazer(df_clean, # dataframe
          type = "text", # output format
          title = "Summary Statistics Table For Titanic", # title
          digits = 1 # control decimal places
)
Summary Statistics Table For Titanic
==========================================
Statistic      N  Mean St. Dev.  Min   Max
------------------------------------------
PassengerId  714 448.6    259.1    1   891
Survived     714   0.4      0.5    0     1
Pclass       714   2.2      0.8    1     3
Age          714  29.7     14.5  0.4  80.0
SibSp        714   0.5      0.9    0     5
Parch        714   0.4      0.9    0     6
Fare         714  34.7     52.9  0.0 512.3
------------------------------------------
Key Observations
Apart from PassengerId, which is only a row identifier, the Age and Fare variables have the largest standard deviations (14.5 and 52.9). Age and Fare also have higher means and maximum values than any of the other non-identifier variables.
Creating a Box plot (class distribution by survival status)
library(ggplot2) # loading package
ggplot(df_clean, aes(x = factor(Survived), y = Pclass)) + # creating a box plot
  geom_boxplot() +
  labs(x = "Survived", # axis labels
       y = "Class",
       title = "Class Distribution by Survival Status") # title
Take Away
Class played a large part in survival status: the box plot shows that, of those who did not survive, the majority were not in first class.
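The claim about non-survivors can be cross-checked numerically with a table of class shares within each survival group. This is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical, and with the real data `df_clean` would take its place.

```r
# Hypothetical toy data standing in for df_clean (values are made up)
toy <- data.frame(
  Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0)
)

# Cross-tabulate class against survival status
counts <- table(Class = toy$Pclass, Survived = toy$Survived)

# margin = 2: each column sums to 1, i.e. the class mix within each survival group
col_props <- prop.table(counts, margin = 2)

# Share of non-survivors who were not in first class
share_not_first <- 1 - col_props["1", "0"]

print(round(col_props, 2))
print(share_not_first)
```

Using margin = 1 instead would give the survival rate within each class, which answers a slightly different question than the one in the takeaway.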
Creating a Histogram (Age distribution by survival status)
ggplot(data = df_clean, aes(x = Age)) + # creating a histogram
  geom_histogram(binwidth = 2, fill = "lightgreen", color = "black") +
  facet_wrap(~ Survived) +
  labs(title = "Histogram of Age by Survival Status", # title
       x = "Age", # axis labels
       y = "Frequency") +
  theme_minimal() # theme
Take Away
Age did not seem to play a big role in survival status. Most of the people in both categories were around 20-40 years old, and the shape of the histograms did not vary greatly by survival status.
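A visual comparison of two histograms can be backed up with a per-group summary of Age. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical, and the choice of the median as the summary is an assumption made for illustration.

```r
# Hypothetical toy data standing in for df_clean (values are made up)
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 1),
  Age      = c(21, 28, 35, 40, 4, 24, 30, 38)
)

# Median age within each survival group
age_by_status <- aggregate(Age ~ Survived, data = toy, FUN = median)

print(age_by_status)
```

If the group medians on the real data turn out to be close, that would support the reading that age differed little between the two groups.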