rm(list = ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(stargazer)
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(visdat)
library(tidyr)
library(dplyr)
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
dftest <- read.csv("test.csv")
vis_miss(dftest)
df_clean <- na.omit(dftest)
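The functions above visualize where values are missing and then drop every row that contains a missing value in any column, which leaves a cleaned data frame with no missing age values.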
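As a rough check on what `na.omit()` removes, the missing values can be counted per column before cleaning. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the Titanic file) so that it runs on its own.

```r
# Hypothetical toy data standing in for dftest; all values are made up.
toy <- data.frame(
  Age  = c(22, NA, 35, NA),
  Fare = c(7.25, 8.05, NA, 26.00)
)

# Count the missing values in each column.
missing_by_col <- colSums(is.na(toy))
print(missing_by_col)

# na.omit() drops every row with an NA in any column, not only in Age.
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Running `colSums(is.na(dftest))` on the real data in the same way would show which columns account for the dropped rows.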
stargazer(df_clean,
          type = "text",
          summary.stat = c("mean", "sd", "min", "max"),
          covariate.labels = c("Passenger ID",
                               "Passenger Class/Wealth",
                               "Age",
                               "# of Siblings or Spouses",
                               "# of Parents or Children",
                               "Passenger Fare"),
          title = "Titanic Statistics")
##
## Titanic Statistics
## =========================================================
## Statistic                Mean      St. Dev. Min   Max
## ---------------------------------------------------------
## Passenger ID             1,100.233 122.910  892   1,307
## Passenger Class/Wealth   2.142     0.846    1     3
## Age                      30.181    14.105   0.170 76.000
## # of Siblings or Spouses 0.483     0.875    0     8
## # of Parents or Children 0.399     0.812    0     6
## Passenger Fare           40.982    61.229   0.000 512.329
## ---------------------------------------------------------
The data illustrate that passengers aboard the Titanic had very few family members traveling with them. The means (0.483 siblings or spouses and 0.399 parents or children) show that the average passenger had fewer than one family member of each kind aboard.
Passenger fare had the second-largest spread, with a standard deviation of 61.2, and the largest range, at 512.329. The range is a weaker measure of spread than the standard deviation because it depends only on the two most extreme values, which makes it very sensitive to outliers. The spread shows that passenger fares varied heavily and that passengers had very different experiences when it came to financing their journey aboard the historic vessel.
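The sensitivity of the range to a single extreme value can be illustrated with a small sketch. The fares below are made up for illustration (they are not taken from `df_clean`), and `range_width` is a hypothetical helper name.

```r
# Hypothetical fares (made-up values): 60 ordinary tickets, then one extreme one.
fares <- rep(c(7, 8, 10, 13, 26, 30), times = 10)
fares_outlier <- c(fares, 512)

# Range = max - min, so it depends only on the two most extreme values.
range_width <- function(x) diff(range(x))

# How many times larger each measure becomes once the outlier is added.
range_ratio <- range_width(fares_outlier) / range_width(fares)
sd_ratio <- sd(fares_outlier) / sd(fares)

print(c(range_ratio = range_ratio, sd_ratio = sd_ratio))
```

In this toy case both measures grow, but the range grows by a much larger factor than the standard deviation, because the standard deviation averages over every observation.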
boxplot(df_clean$Fare ~ df_clean$Pclass, ylab = "Ticket Fare", xlab = "Passenger Class")
The higher the class, the higher the ticket fare. The boxplot above shows that the 25th, 50th, and 75th percentiles of ticket fare were lowest in third class, then second class, then first class. This demonstrates that the central tendency of ticket prices was higher for the higher classes, most likely because passengers in higher classes paid more to get better accommodations.
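The percentiles that a boxplot draws can also be computed directly. The sketch below uses a made-up stand-in for `df_clean` (the `toy` data frame and its values are hypothetical) and computes the three quartiles of fare within each class.

```r
# Hypothetical stand-in for df_clean; classes and fares are made up.
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Fare   = c(60, 80, 150, 13, 21, 26, 7, 8, 9)
)

# Split Fare by class, compute the quartiles of each group, and
# transpose so that each row is a class and each column a percentile.
fare_quartiles <- t(sapply(split(toy$Fare, toy$Pclass),
                           quantile, probs = c(0.25, 0.50, 0.75)))
print(fare_quartiles)
```

Applied to the real data, the same call with `df_clean$Fare` and `df_clean$Pclass` would give the numbers behind the boxes in the plot.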
hist(df_clean$Age, main = "Histogram of Age", xlab = "Age")
The mode of age was consistent with the other measure of central tendency, the mean: the mean was 30.2 and the modal histogram bin was 20-30. This shows that the most common passenger ages were in the 20-30 range.
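The modal bin can be read off programmatically rather than by eye. This is a minimal sketch on made-up ages (the `ages` vector is hypothetical, and the 10-year bin width is an assumption rather than the default breaks used in the plot above).

```r
# Hypothetical ages (made-up values) standing in for df_clean$Age.
ages <- c(2, 19, 21, 22, 24, 27, 29, 31, 36, 45, 58)

# hist(plot = FALSE) returns the bin breaks and counts without drawing.
h <- hist(ages, breaks = seq(0, 60, by = 10), plot = FALSE)

# The modal bin is the one with the highest count.
i <- which.max(h$counts)
modal_bin <- c(lower = h$breaks[i], upper = h$breaks[i + 1])
print(modal_bin)

# Compare with the mean, the other measure of central tendency.
print(mean(ages))
```

Note that the modal bin depends on the chosen breaks, so a different bin width can move it.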
ggplot(data = df_clean,
       mapping = aes(x = Pclass)) +
  geom_bar()