1 Set Up

1.1 Clear Environment

rm(list = ls())

1.2 Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(stargazer)
## 
## Please cite as: 
## 
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(visdat)
library(tidyr)
library(dplyr)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

2 Import Data

dftest <- read.csv("test.csv")

2.1 Data Cleaning

vis_miss(dftest)

df_clean <- na.omit(dftest)

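vis_miss() plots where values are missing in dftest, and na.omit() then drops every row that contains at least one NA. The resulting data frame, df_clean, therefore has no missing values in any column, including Age.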
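
To check what na.omit() actually removed, the missing values can be counted per column before and after cleaning. The sketch below is only an illustration: it uses a small made-up data frame (toy, with invented values) in place of test.csv so that it runs on its own, and the same calls should carry over to dftest and df_clean.

```r
# Toy stand-in for dftest; the values are invented for illustration.
toy <- data.frame(
  Pclass = c(1, 3, 2, 3),
  Age    = c(34, NA, 62, 27),
  Fare   = c(82.3, 7.8, NA, 8.7)
)

# Number of missing values in each column before cleaning.
na_before <- colSums(is.na(toy))

# na.omit() drops every row that has an NA in any column.
toy_clean <- na.omit(toy)

# After cleaning, no column should contain an NA.
na_after <- colSums(is.na(toy_clean))

print(na_before)
print(na_after)

# Number of rows removed by na.omit().
print(nrow(toy) - nrow(toy_clean))
```

Comparing the row counts before and after shows how much data the cleaning step costs, which is worth knowing before interpreting the summary statistics.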

3 Summary Statistics

3.1 Basic Summarization of Data

stargazer(df_clean, 
          type = "text", 
          summary.stat = c("Mean", "sd", "Min", "Max"),
          covariate.labels = c("Passenger ID",
                               "Passenger Class/Wealth",
                               "Age",
                               "# of Siblings or Spouses",
                               "# of Parents or Children",
                               "Passenger Fare"),
          title = "Titanic Statistics")
## 
## Titanic Statistics
## =========================================================
## Statistic                  Mean    St. Dev.  Min    Max  
## ---------------------------------------------------------
## Passenger ID             1,100.233 122.910   892   1,307 
## Passenger Class/Wealth     2.142    0.846     1      3   
## Age                       30.181    14.105  0.170 76.000 
## # of Siblings or Spouses   0.483    0.875     0      8   
## # of Parents or Children   0.399    0.812     0      6   
## Passenger Fare            40.982    61.229  0.000 512.329
## ---------------------------------------------------------

3.1.1 Observations and Takeaway

The data suggest that passengers aboard the Titanic typically traveled with few family members. The means for siblings or spouses (0.483) and for parents or children (0.399) are each below 1, so the average passenger had fewer than one relative of either kind aboard.
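
A mean below 1 does not by itself say how many passengers traveled alone, so a direct count is a useful cross-check. The following is a minimal sketch on invented data; the column names SibSp and Parch are assumptions about how the two family-size variables are named and may need adjusting for df_clean.

```r
# Toy stand-in for df_clean; SibSp and Parch are assumed column names.
toy <- data.frame(
  SibSp = c(0, 0, 1, 0, 4, 0),
  Parch = c(0, 0, 0, 0, 2, 0)
)

# Total relatives aboard for each passenger.
relatives <- toy$SibSp + toy$Parch

# The mean is pulled upward by the one large family in the toy data.
mean_relatives <- mean(relatives)

# Proportion of passengers traveling with no relatives at all.
share_alone <- mean(relatives == 0)

# The median is a central measure that is robust to a few large families.
median_relatives <- median(relatives)

print(c(mean = mean_relatives,
        median = median_relatives,
        share_alone = share_alone))
```

Reporting the share of passengers with zero relatives alongside the mean makes a claim about "most passengers" something that can be checked directly.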

Passenger fare had the second-largest standard deviation (61.229), behind only Passenger ID, which is an identifier rather than a measurement. It also had the largest range (512.329). The range is the less reliable measure of spread here because it depends only on the two most extreme values and is therefore highly sensitive to outliers. The spread shows that fares varied widely, so passengers paid very different amounts for their journey aboard the historic vessel.
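
One way to see how strongly each measure of spread reacts to an extreme value is to compute it with and without that value. The sketch below uses invented fares, not the Titanic data, and adds the interquartile range as a robust point of comparison; that addition is a suggestion and not part of the analysis above.

```r
# Invented fares; the appended value plays the role of an extreme outlier.
fares <- c(7, 8, 10, 13, 26, 30)
fares_outlier <- c(fares, 512)

# Range (max minus min) and standard deviation without the outlier.
range_base <- diff(range(fares))
sd_base    <- sd(fares)

# The same measures once the outlier is included.
range_out <- diff(range(fares_outlier))
sd_out    <- sd(fares_outlier)

# Interquartile range, which ignores the extremes of the distribution.
iqr_base <- IQR(fares)
iqr_out  <- IQR(fares_outlier)

print(rbind(
  without_outlier = c(range = range_base, sd = sd_base, iqr = iqr_base),
  with_outlier    = c(range = range_out,  sd = sd_out,  iqr = iqr_out)
))
```

Running the same comparison on df_clean$Fare, with and without its maximum, would show how much of the reported spread is driven by a single ticket.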

3.2 Boxplot Comparing Class and Fare Data

boxplot(Fare ~ Pclass, data = df_clean, ylab = "Ticket Fare", xlab = "Passenger Class")

3.2.1 Observations and Takeaway

The higher the class, the higher the ticket fare. The boxplot above shows that the 25th, 50th, and 75th percentiles of ticket fare were lowest in third class, higher in second class, and highest in first class. This indicates that typical fares rose with class, most likely because passengers in higher classes paid more for better accommodations.
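
The quartiles read off the boxplot can also be computed numerically. The following sketch does this on a small invented data frame that reuses the Pclass and Fare column names from the plot; the values are made up, so only the approach, not the numbers, should carry over to df_clean.

```r
# Toy stand-in for df_clean with the Pclass and Fare columns used above.
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Fare   = c(60, 80, 150, 13, 21, 26, 7, 8, 10)
)

# Split fares by class, then take the 25th, 50th, and 75th percentiles
# within each class. sapply() returns one column per class, so t()
# transposes the result to one row per class.
quartile_table <- t(sapply(
  split(toy$Fare, toy$Pclass),
  quantile,
  probs = c(0.25, 0.50, 0.75)
))

print(quartile_table)
```

A table like this makes the ordering of the classes explicit and avoids estimating percentiles by eye from the plot.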

3.3 Age Histogram

hist(df_clean$Age, main = "Histogram of Age", xlab = "Age")

3.3.1 Observations and Takeaways

The modal bin of the histogram, 20-30, is roughly consistent with the other measure of central tendency, the mean of 30.2, which falls just above that bin's upper edge. This shows that more passengers were aged 20-30 than fell in any other age bin.
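
The modal bin can be extracted from the histogram object instead of being read off the plot. This is a minimal sketch on invented ages; the explicit 10-year breaks are an assumption made so the bins are predictable, whereas the hist() call above relies on the default breaks.

```r
# Invented ages standing in for df_clean$Age.
ages <- c(2, 18, 21, 22, 24, 27, 29, 31, 36, 45, 58, 71)

# Compute the histogram bins without drawing the plot.
h <- hist(ages, breaks = seq(0, 80, by = 10), plot = FALSE)

# Index of the bin with the highest count (the modal bin).
modal <- which.max(h$counts)

# Lower and upper edges of the modal bin.
modal_bin <- c(h$breaks[modal], h$breaks[modal + 1])

print(modal_bin)
print(mean(ages))
```

Printing the bin edges next to the mean shows directly whether the two measures of central tendency agree.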

ggplot(data = df_clean,
       mapping = aes(x = Pclass)) +
  geom_bar()