rm(list=ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 592286 31.7 1357534 72.6 NA 669422 35.8
## Vcells 1115978 8.6 8388608 64.0 16384 1851681 14.2
directory <- "/Users/ruthiemaurer/Desktop/DATA 712"
setwd(directory)
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(haven)
library(tidyverse)
## Warning: package 'purrr' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.4 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(123123)
DATA <- read_xlsx("titanic_data.xlsx", col_names = TRUE)
head(DATA)
## # A tibble: 6 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran,… male NA 0 0 330877 8.46 <NA>
## # ℹ 1 more variable: Embarked <chr>
names(DATA)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
fare_sex <- DATA %>%
group_by(Sex) %>%
summarise(Average_Fare = mean(Fare, na.rm = TRUE))
print(fare_sex)
## # A tibble: 2 × 2
## Sex Average_Fare
## <chr> <dbl>
## 1 female 44.5
## 2 male 25.5
The analysis of the Titanic dataset reveals large disparities in ticket prices and survival rates by gender and passenger class. On average, women paid higher fares than men in every passenger class. The average fare for female passengers was $44.48, whereas for male passengers it was considerably lower at $25.52. The difference was most pronounced in first class, where women paid an average fare of $106.13, compared to $67.23 for men. Third-class passengers had the lowest fares, with men paying an average of $12.66 and women $16.12. The substantial gap in fares between first-class and third-class passengers underscores the socioeconomic divide aboard the Titanic.
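To check the claim that women paid more in every class, the class-by-class gap can be computed directly. The sketch below is only an illustration, not the original analysis: it rebuilds a small table from the rounded group means that the Sex-by-Pclass fare summary prints (so 106 stands in for the first-class female mean), and the names fare_means and fare_gap are hypothetical.

```r
library(dplyr)

# Rounded group means copied from the printed Sex x Pclass fare summary.
# They are illustrative stand-ins, not the unrounded values.
fare_means <- data.frame(
  Sex = rep(c("female", "male"), each = 3),
  Pclass = rep(1:3, times = 2),
  Average_Fare = c(106, 22.0, 16.1, 67.2, 19.7, 12.7)
)

# For each class, place the two means side by side,
# then take their difference and their ratio.
fare_gap <- fare_means %>%
  group_by(Pclass) %>%
  summarise(
    Female = Average_Fare[Sex == "female"],
    Male = Average_Fare[Sex == "male"],
    Gap = Female - Male,
    Ratio = Female / Male
  )

print(fare_gap)
```

With these rounded inputs, Gap is positive in all three classes and largest in first class, which is what "higher fares in every class" amounts to.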
# Compare average Fare by Sex and Pclass
fare_sex_pclass <- DATA %>%
group_by(Sex, Pclass) %>%
summarise(Average_Fare = mean(Fare, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Compare average Survival rate by Sex and Pclass
survival_sex_pclass <- DATA %>%
group_by(Sex, Pclass) %>%
summarise(Average_Survival = mean(Survived, na.rm = TRUE))
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
# Compare average Survival by Sex
survival_sex <- DATA %>%
group_by(Sex) %>%
summarise(Average_Survival = mean(Survived, na.rm = TRUE))
print(fare_sex_pclass)
## # A tibble: 6 × 3
## # Groups: Sex [2]
## Sex Pclass Average_Fare
## <chr> <dbl> <dbl>
## 1 female 1 106.
## 2 female 2 22.0
## 3 female 3 16.1
## 4 male 1 67.2
## 5 male 2 19.7
## 6 male 3 12.7
print(survival_sex_pclass)
## # A tibble: 6 × 3
## # Groups: Sex [2]
## Sex Pclass Average_Survival
## <chr> <dbl> <dbl>
## 1 female 1 0.968
## 2 female 2 0.921
## 3 female 3 0.5
## 4 male 1 0.369
## 5 male 2 0.157
## 6 male 3 0.135
print(survival_sex)
## # A tibble: 2 × 2
## Sex Average_Survival
## <chr> <dbl>
## 1 female 0.742
## 2 male 0.189
survival_pclass <- DATA %>%
group_by(Pclass) %>%
summarise(Average_Survival = mean(Survived, na.rm = TRUE))
print(survival_pclass)
## # A tibble: 3 × 2
## Pclass Average_Survival
## <dbl> <dbl>
## 1 1 0.630
## 2 2 0.473
## 3 3 0.242
Survival rates also differed substantially between genders and passenger classes. Overall, 74.2% of female passengers survived, compared with only 18.9% of male passengers. The “women and children first” evacuation policy may explain much of that gap. By passenger class, first-class passengers had the highest survival rate at 63.0%, followed by second class at 47.3% and third class at 24.2%. Among all groups, third-class male passengers fared the worst, with a survival rate of just 13.5%, which suggests that class and gender together played a major role in determining who lived and who did not.
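These rates rely on Survived being coded 0/1, so the mean of the column within a group is the share of survivors in that group. A minimal sketch with a made-up six-passenger data frame (toy and toy_rates are hypothetical and not taken from the Titanic file) shows the equivalence and adds the group size, which helps judge how much weight a rate deserves.

```r
library(dplyr)

# Made-up data for illustration only: 1 = survived, 0 = did not survive.
toy <- data.frame(
  Sex = c("female", "female", "female", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1)
)

# The mean of a 0/1 column equals survivors divided by passengers.
toy_rates <- toy %>%
  group_by(Sex) %>%
  summarise(
    Passengers = n(),
    Survivors = sum(Survived),
    Survival_Rate = mean(Survived)
  )

print(toy_rates)
```

Reporting Passengers next to Survival_Rate makes it clear when a rate comes from a small group.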
These statistics highlight the inequalities aboard the Titanic. First-class passengers and women had a much better chance of survival, possibly because of their proximity to lifeboats and better access to evacuation resources and procedures, while those in third class faced significant barriers to safety. This analysis underscores how deeply social and economic factors influenced the fate of the Titanic’s passengers.
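The mpg dataset ("Fuel economy data from 1999 to 2008 for 38 popular models of cars") ships with the ggplot2 package. It is a subset of the fuel economy data that the EPA publishes on fueleconomy.gov, and it includes only models that had a new release every year between 1999 and 2008, which was used as a proxy for a car's popularity. The variables are described on the dataset's help page (?mpg in R).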
Using this dataset, I want to see which car manufacturer has the best city and highway miles per gallon (MPG). I also want to compare the fuel types of the vehicles to see which fuel type performs best.
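Before summarising, it can help to confirm the basic shape of the data. The sketch below is a quick check under the assumption that ggplot2 is installed; it works on a local copy named cars (a hypothetical name) and keeps the original column names.

```r
library(ggplot2)  # provides the mpg dataset

# Local copy with the original column names (model, year, fl, ...).
cars <- as.data.frame(ggplot2::mpg)

# Number of rows and columns.
dim(cars)

# Number of distinct models, model years, and fuel-type codes present.
n_models <- length(unique(cars$model))
model_years <- sort(unique(cars$year))
fuel_codes <- sort(unique(cars$fl))

print(n_models)
print(model_years)
print(fuel_codes)
```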
data("mpg")
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
names(mpg)
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
colnames(mpg) <- c("Manufacturer", "Model", "Displacement", "Year", "Cylinder", "Transmission", "Drive Train", "City MPG", "Highway MPG", "Fuel Type", "Type of Car")
colnames(mpg)
## [1] "Manufacturer" "Model" "Displacement" "Year" "Cylinder"
## [6] "Transmission" "Drive Train" "City MPG" "Highway MPG" "Fuel Type"
## [11] "Type of Car"
print(mpg)
## # A tibble: 234 × 11
## Manufacturer Model Displacement Year Cylinder Transmission `Drive Train`
## <chr> <chr> <dbl> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f
## 2 audi a4 1.8 1999 4 manual(m5) f
## 3 audi a4 2 2008 4 manual(m6) f
## 4 audi a4 2 2008 4 auto(av) f
## 5 audi a4 2.8 1999 6 auto(l5) f
## 6 audi a4 2.8 1999 6 manual(m5) f
## 7 audi a4 3.1 2008 6 auto(av) f
## 8 audi a4 quatt… 1.8 1999 4 manual(m5) 4
## 9 audi a4 quatt… 1.8 1999 4 auto(l5) 4
## 10 audi a4 quatt… 2 2008 4 manual(m6) 4
## # ℹ 224 more rows
## # ℹ 4 more variables: `City MPG` <int>, `Highway MPG` <int>, `Fuel Type` <chr>,
## # `Type of Car` <chr>
# Compute average City MPG and Highway MPG by Manufacturer
mpg_by_manufacturer <- mpg %>%
group_by(Manufacturer) %>%
summarise(Average_City_MPG = mean(`City MPG`, na.rm = TRUE),
Average_Highway_MPG = mean(`Highway MPG`, na.rm = TRUE))
mpg_by_fuel <- mpg %>%
group_by(`Fuel Type`) %>%
summarise(Average_City_MPG = mean(`City MPG`, na.rm = TRUE),
Average_Highway_MPG = mean(`Highway MPG`, na.rm = TRUE))
print(mpg_by_manufacturer)
## # A tibble: 15 × 3
## Manufacturer Average_City_MPG Average_Highway_MPG
## <chr> <dbl> <dbl>
## 1 audi 17.6 26.4
## 2 chevrolet 15 21.9
## 3 dodge 13.1 17.9
## 4 ford 14 19.4
## 5 honda 24.4 32.6
## 6 hyundai 18.6 26.9
## 7 jeep 13.5 17.6
## 8 land rover 11.5 16.5
## 9 lincoln 11.3 17
## 10 mercury 13.2 18
## 11 nissan 18.1 24.6
## 12 pontiac 17 26.4
## 13 subaru 19.3 25.6
## 14 toyota 18.5 24.9
## 15 volkswagen 20.9 29.2
print(mpg_by_fuel)
## # A tibble: 5 × 3
## `Fuel Type` Average_City_MPG Average_Highway_MPG
## <chr> <dbl> <dbl>
## 1 c 24 36
## 2 d 25.6 33.6
## 3 e 9.75 13.2
## 4 p 17.4 25.2
## 5 r 16.7 23.0
# Find the manufacturer with the best City MPG
best_city_mpg <- mpg_by_manufacturer %>%
filter(Average_City_MPG == max(mpg_by_manufacturer$Average_City_MPG))
# Find the manufacturer with the best Highway MPG
best_highway_mpg <- mpg_by_manufacturer %>%
filter(Average_Highway_MPG == max(mpg_by_manufacturer$Average_Highway_MPG))
print(best_city_mpg)
## # A tibble: 1 × 3
## Manufacturer Average_City_MPG Average_Highway_MPG
## <chr> <dbl> <dbl>
## 1 honda 24.4 32.6
print(best_highway_mpg)
## # A tibble: 1 × 3
## Manufacturer Average_City_MPG Average_Highway_MPG
## <chr> <dbl> <dbl>
## 1 honda 24.4 32.6
After analyzing the data, we can conclude that Honda had the best average city and highway MPG of any manufacturer in the dataset. Honda’s city MPG averaged 24.44 miles per gallon, making it the most fuel-efficient option for urban driving. Its highway MPG averaged 32.56 miles per gallon, making it the best choice for long-distance travel as well.
Volkswagen and Subaru also performed well. Volkswagen achieved 20.92 MPG in the city and 29.22 MPG on the highway, while Subaru recorded 19.28 MPG in the city and 25.57 MPG on the highway.
In contrast, the least fuel-efficient manufacturers were Land Rover, with 11.5 MPG city and 16.5 MPG highway, and Lincoln, with 11.33 MPG city and 17 MPG highway, likely due to the larger SUVs in their lineup. I further illustrated the findings in the bar charts below.
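The ranking described above can also be read off a single sorted table instead of filtering for the maximum. The following is a minimal sketch, assuming ggplot2 and dplyr are installed; it uses the dataset's original column names (manufacturer, cty, hwy) rather than the renamed ones, and ranked and Vehicles are hypothetical names.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg dataset

# Average city and highway MPG per manufacturer, sorted from best to worst
# city MPG. Vehicles counts the rows behind each average.
ranked <- ggplot2::mpg %>%
  group_by(manufacturer) %>%
  summarise(
    Vehicles = n(),
    Average_City_MPG = mean(cty),
    Average_Highway_MPG = mean(hwy)
  ) %>%
  arrange(desc(Average_City_MPG))

head(ranked, 3)  # most efficient manufacturers in the city
tail(ranked, 2)  # least efficient manufacturers in the city
```

Sorting once keeps the top and bottom of the list in the same object, and the Vehicles column shows how many rows each average is based on.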
# Visualization: City vs. Highway MPG by Manufacturer with custom colors
# Reshape to long format first so that position = "dodge" draws the bars side by side
mpg_by_manufacturer %>%
  pivot_longer(c(Average_City_MPG, Average_Highway_MPG), names_to = "MPG_Type", values_to = "MPG") %>%
  mutate(MPG_Type = if_else(MPG_Type == "Average_City_MPG", "City MPG", "Highway MPG")) %>%
  ggplot(aes(x = Manufacturer, y = MPG, fill = MPG_Type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("City MPG" = "blue", "Highway MPG" = "lightblue")) +
  labs(title = "Average City & Highway MPG by Manufacturer", y = "Miles Per Gallon", fill = "MPG Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Visualization: City vs. Highway MPG by Fuel Type with custom colors
# Reshape to long format first so that position = "dodge" draws the bars side by side
mpg_by_fuel %>%
  pivot_longer(c(Average_City_MPG, Average_Highway_MPG), names_to = "MPG_Type", values_to = "MPG") %>%
  mutate(MPG_Type = if_else(MPG_Type == "Average_City_MPG", "City MPG", "Highway MPG")) %>%
  ggplot(aes(x = `Fuel Type`, y = MPG, fill = MPG_Type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("City MPG" = "darkred", "Highway MPG" = "firebrick")) +
  labs(title = "Average City & Highway MPG by Fuel Type", y = "Miles Per Gallon", fill = "MPG Type") +
  theme_minimal()
Turning to fuel type, diesel vehicles (d) had the best city average at 25.6 MPG and reached 33.6 MPG on the highway. Compressed natural gas (c) followed closely in the city at 24 MPG and had the highest highway average at 36 MPG, making it a good choice for long-distance driving. Gasoline vehicles were less efficient: premium (p) averaged 17.36 MPG in the city and 25.23 MPG on the highway, and regular (r) averaged 16.7 MPG in the city and 23.0 MPG on the highway. Ethanol (E85) vehicles (e) had the lowest averages, at 9.75 MPG in the city and 13.25 MPG on the highway.
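Averages by fuel type are easier to interpret alongside the number of vehicles behind each one, since a fuel type represented by only a few rows can look unusually good or bad. The sketch below assumes ggplot2 and dplyr are installed and uses the original column names (fl, cty, hwy); fuel_summary and Vehicles are hypothetical names.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg dataset

# Per fuel-type code: how many rows it covers and its average MPG,
# sorted so the smallest groups appear first.
fuel_summary <- ggplot2::mpg %>%
  group_by(fl) %>%
  summarise(
    Vehicles = n(),
    Average_City_MPG = mean(cty),
    Average_Highway_MPG = mean(hwy)
  ) %>%
  arrange(Vehicles)

print(fuel_summary)
```

Fuel types with a small Vehicles count deserve a more cautious reading than those backed by many rows.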
With these results, a buyer can make a more informed decision about which car to purchase based on their needs and the car’s fuel economy.