setwd('~/Downloads')
TrainInfo <- read.csv('train.csv')
library(ggplot2)
library(tidyverse)
#Bar chart of Embarked Location
ggplot(TrainInfo, aes(x = Embarked)) + geom_bar()
#Histogram of Fare
ggplot(TrainInfo, aes(x = Fare)) + geom_histogram(bins = 30)
#Histogram of Age
ggplot(TrainInfo, aes(x = Age)) + geom_histogram(bins = 30)
## Warning: Removed 177 rows containing non-finite values (`stat_bin()`).
I plotted the Embarked, Fare, and Age variables to get an overall picture of how each one is distributed.
Question to answer: what can I conclude from the plots of these variables?
The Embarked plot simply shows the count of passengers by port of embarkation. The majority, over 600 passengers, embarked at Southampton. Queenstown had fewer than 100 passengers and Cherbourg had fewer than 200.
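The counts read off the plot can be checked directly with `table()`. This is a minimal sketch, assuming the `Embarked` column uses the codes `S`, `C`, and `Q` (Southampton, Cherbourg, Queenstown) and that blanks mean the port is unknown. The `toy` data frame is invented for illustration and stands in for `TrainInfo`.

```r
# Hypothetical stand-in for TrainInfo; the values are invented for illustration.
toy <- data.frame(Embarked = c("S", "S", "C", "Q", "S", "C", "S", ""))

# Treat blank strings as missing before counting (an assumption about the data).
embarked <- toy$Embarked
embarked[embarked == ""] <- NA

# Count passengers per port, keeping a separate NA bucket if any are missing.
port_counts <- table(embarked, useNA = "ifany")
print(port_counts)

# Share of passengers per port, ignoring missing values.
port_share <- prop.table(table(embarked))
print(round(port_share, 2))
```

Swapping `toy$Embarked` for `TrainInfo$Embarked` should give the exact counts behind the plot.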
The Fare histogram is heavily right-skewed and, judging by eye, appears to contain at least one outlier.
The Age histogram appears to be slightly right-skewed.
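Skew and outliers can also be checked numerically rather than by eye. The sketch below compares the mean with the median (a mean well above the median suggests right skew) and flags values more than 1.5 × IQR above the third quartile, a common rule of thumb. The `fare` vector is invented for illustration; in practice it would be `TrainInfo$Fare`.

```r
# Hypothetical fares, invented for illustration; replace with TrainInfo$Fare.
fare <- c(7.25, 7.9, 8.05, 8.05, 13, 13, 26, 26.55, 31.3, 52, 71.3, 512.3)

# A mean well above the median is a quick sign of right skew.
fare_mean <- mean(fare, na.rm = TRUE)
fare_median <- median(fare, na.rm = TRUE)

# Upper fence: third quartile plus 1.5 times the interquartile range.
q <- quantile(fare, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])

# Values above the fence are candidate outliers.
high_outliers <- fare[!is.na(fare) & fare > upper_fence]

print(c(mean = fare_mean, median = fare_median))
print(high_outliers)
```

The same lines applied to `TrainInfo$Age` would show whether the milder skew seen in the Age histogram holds up numerically.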
#Boxplot of Age by Sex
ggplot(subset(TrainInfo, !is.na(Age)), aes(x = Sex, y = Age)) +
  geom_boxplot()
Question to answer: What can you conclude from the boxplots for males and females?
I used the boxplots to compare passenger age by sex. The boxplot for males shows a few outliers, and the median age also appears to be higher for males than for females.
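The medians and outliers read off the boxplots can be confirmed with a short calculation. This is a rough sketch: the `toy` data frame is invented for illustration and stands in for `TrainInfo`, and `boxplot.stats()` applies base R's boxplot rule, which may differ slightly from the quartiles ggplot2 draws.

```r
# Hypothetical stand-in for TrainInfo; the values are invented for illustration.
toy <- data.frame(
  Sex = c("male", "male", "male", "male", "male",
          "female", "female", "female", "female"),
  Age = c(22, 28, 30, 35, 80, 19, 26, 27, NA)
)

# Drop rows with missing Age, mirroring the subset used for the boxplot.
known_age <- subset(toy, !is.na(Age))

# Median age for each sex.
median_by_sex <- tapply(known_age$Age, known_age$Sex, median)
print(median_by_sex)

# Points that a base-R boxplot would draw as outliers, per sex.
outliers_by_sex <- tapply(known_age$Age, known_age$Sex,
                          function(a) boxplot.stats(a)$out)
print(outliers_by_sex)
```

Replacing `toy` with `TrainInfo` would show how far apart the two medians actually are and which ages are flagged as outliers.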