Project 1

Introduction

My data set lists the people on the Titanic by name, along with their class, gender, age, port of embarkation, fare, and whether they survived. I plan to explore how many people survived by gender, age group, and class to analyze which groups had the most and the fewest survivors. The source for my data set is Encyclopedia Titanica.

library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.3.3
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(readr)
Titanic <- read_csv("Titanic.csv")
Rows: 891 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): last, first, gender, embarked, survived
dbl (3): age, class, fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Titanic <- na.omit(Titanic)
# Omit rows that contain NAs
Titanic$age <- cut(Titanic$age, breaks = c(0, 18, 50, Inf), labels = c("Child", "Adult", "Senior"))
# Define age categories: (0, 18] = Child, (18, 50] = Adult, over 50 = Senior
survived_passengers <- filter(Titanic, survived == "yes")
# Keep only the passengers who survived
survival_counts <- survived_passengers |>
  group_by(class, gender, age) |>
  summarize(amount_survived = n())
`summarise()` has grouped output by 'class', 'gender'. You can override using
the `.groups` argument.
# Count the survivors for each unique combination of passenger class, gender, and age group
ggplot(survival_counts, aes(x = class, y = amount_survived, fill = age)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_grid(. ~ gender) +
  scale_fill_manual(values = c("red", "orange", "yellow"),
                    name = "Age Group",
                    labels = c("Child", "Adult", "Senior")) +
  labs(title = "Number of Passengers Survived by Passenger Class, Age Group, and Gender",
       x = "Passenger Class",
       y = "Number of Passengers Survived",
       caption = "Source: Encyclopedia Titanica") +
  theme_light() +
  theme(legend.position = "top")

# Plot the survivor counts as grouped bars, faceted by gender

To get rid of the NAs, I used na.omit on my data set Titanic. I also defined three age categories: child, adult, and senior. Because cut() closes each interval on the right by default, child covers ages above 0 up to and including 18, adult covers ages above 18 up to and including 50, and senior covers ages above 50. I then filtered to the passengers who survived; survived is a categorical variable (yes or no), and counting the matching rows turns it into a quantitative variable. Lastly, I counted the survivors for each combination of class, gender, and age group so that I could graph the totals.
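As a quick check of how the age boundaries fall, here is a minimal sketch that applies the same breaks and labels to a few hand-picked ages. The vector `example_ages` is made up for illustration and is not taken from the data set.

```r
# Hypothetical ages chosen to sit on or just past each boundary
example_ages <- c(18, 18.5, 50, 51)

# Same breaks and labels as in the analysis above;
# cut() uses right-closed intervals by default: (0, 18], (18, 50], (50, Inf]
example_groups <- cut(example_ages,
                      breaks = c(0, 18, 50, Inf),
                      labels = c("Child", "Adult", "Senior"))

# Show each age next to the group it was assigned
data.frame(age = example_ages, group = example_groups)
```

An 18-year-old is counted as a child and a 50-year-old as an adult. With these breaks an age of exactly 0 would fall outside every interval and become NA, unless include.lowest = TRUE is set.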

This visualization shows the number of people who survived the Titanic in each class, by gender and age group. At first I was confused about why more adult men survived in third class than in second class, but after double-checking the data I realized that there were more men in third class. There were also fewer children on the Titanic, so even though they had the highest survival rate, at first glance it looks like they had the lowest.

I wish I could have included the percent survived on the y axis. That would show which groups were most and least likely to survive. As it stands, the graph could be a bit misleading, because it shows only the total number who survived and does not account for the percentage of each group that survived. For example, third class had more people and a lower chance of survival, but from the graph it looks like it had a higher survival rate.
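One possible way to put percent survived on the y axis is to keep all passengers in each group and take the share whose survived value is "yes", instead of filtering to survivors first. The sketch below is only an illustration: `toy` is a made-up stand-in that mirrors the column names of the cleaned data, and its rows, including the "female"/"male" codes, are assumptions rather than real passengers.

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the cleaned Titanic data (NOT real passengers)
toy <- data.frame(
  class    = c(1, 1, 1, 3, 3, 3, 3, 3),
  gender   = c("female", "female", "male", "male", "male", "male", "female", "female"),
  age      = factor(c("Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Child", "Child"),
                    levels = c("Child", "Adult", "Senior")),
  survived = c("yes", "yes", "no", "yes", "no", "no", "yes", "no")
)

# Group size and percent survived for each class/gender/age combination
survival_rates <- toy |>
  group_by(class, gender, age) |>
  summarize(n_passengers = n(),
            pct_survived = 100 * mean(survived == "yes"),
            .groups = "drop")

# Same layout as the count plot, with the rate on the y axis
p_rate <- ggplot(survival_rates,
                 aes(x = factor(class), y = pct_survived, fill = age)) +
  geom_col(position = "dodge") +
  facet_grid(. ~ gender) +
  labs(x = "Passenger Class", y = "Percent Survived", fill = "Age Group") +
  theme_light()
```

Printing `p_rate` draws the chart. Keeping `n_passengers` next to the rate makes it easy to see when a percentage rests on only a handful of people.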

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggfortify)
library(htmltools)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(readr)
Titanic <- read_csv("Titanic.csv")
Rows: 891 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): last, first, gender, embarked, survived
dbl (3): age, class, fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
p2 <- ggplot(Titanic, aes(x = age, y = fare)) +
      labs(title = "Age Versus Fare on the Titanic",
           x = "Age",
           y = "Fare") +
      theme_minimal()
p2 + geom_point()
Warning: Removed 177 rows containing missing values or values outside the scale range
(`geom_point()`).

p3 <- p2 + xlim(0,80)+ ylim(0,500)
p3 + geom_point()
Warning: Removed 180 rows containing missing values or values outside the scale range
(`geom_point()`).

p4 <- p3 + geom_point() + geom_smooth(color = "red")
p4
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 180 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 180 rows containing missing values or values outside the scale range
(`geom_point()`).

p5 <- p3 + geom_point() + geom_smooth(method='lm',formula=y~x)
p5
Warning: Removed 180 rows containing non-finite outside the scale range
(`stat_smooth()`).
Removed 180 rows containing missing values or values outside the scale range
(`geom_point()`).

model <- lm(fare ~ age, data = Titanic)
summary(model)

Call:
lm(formula = fare ~ age, data = Titanic)

Residuals:
   Min     1Q Median     3Q    Max 
-42.42 -24.49 -17.60   2.33 475.78 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.3009     4.4922   5.410 8.64e-08 ***
age           0.3500     0.1359   2.575   0.0102 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52.71 on 712 degrees of freedom
  (177 observations deleted due to missingness)
Multiple R-squared:  0.009229,  Adjusted R-squared:  0.007837 
F-statistic: 6.632 on 1 and 712 DF,  p-value: 0.01022

fare = 24.3 + 0.3500 × age

p-value = 0.0102

R^2 = 0.009229

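We can conclude that there is only a very weak positive relationship between age and fare in the Titanic dataset. The slope is statistically significant at the 0.05 level (p-value = 0.0102), but age explains less than 1% of the variation in fare (R^2 = 0.009229), so age is not a useful predictor of fare.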
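The strength of the linear relationship can be read off the regression summary. In a simple linear regression with one predictor, the correlation coefficient has the sign of the slope and a magnitude equal to the square root of R^2. The short sketch below does only that arithmetic, using values copied from the summary output; the variable names are illustrative.

```r
# Values copied from the summary(model) output above
slope     <- 0.3500
r_squared <- 0.009229

# For a one-predictor linear model: r = sign(slope) * sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)
round(r, 3)
```

This gives a correlation of about 0.096, which is positive but very close to zero. Running cor(Titanic$age, Titanic$fare, use = "complete.obs") on the raw data should give a matching value.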