This is an R Markdown document with Sean Amato’s work for Project 2. In this project I examine three data sets and try to answer a different question about each.
First, I imported all the data, plus a map image (used in section 2), from my GitHub repository.
# Load the packages used below: tidyverse (dplyr, tidyr, stringr, ggplot2), DT (datatable), and ggimage (ggbackground).
library(tidyverse)
library(DT)
library(ggimage)
baseball_df <- read.csv("https://raw.githubusercontent.com/samato0624/DATA607/main/Project2_Baseball.csv") # Analyze home runs over time aggregated by team.
weather_df <- read.csv("https://raw.githubusercontent.com/samato0624/DATA607/main/Project2_Weather.csv") # Analyze weather patterns.
ems_df <- read.csv("https://raw.githubusercontent.com/samato0624/DATA607/main/Project2_Weewoo.csv") # Compare EMS admission cases by severity.
map_url <- "https://github.com/samato0624/DATA607/blob/main/Pacific_Islands.png?raw=true"
1. Baseball
Step 1.1 First let’s clean up the home run data set.
# Fill blanks in the Team column with the value from the row above.
# Start at row 2: row 1 has no previous row to copy from.
for (i in seq_len(nrow(baseball_df))[-1]) {
  if (baseball_df$Team[i] == "") {
    baseball_df$Team[i] <- baseball_df$Team[i - 1]
  }
}
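The same forward fill can also be written without an explicit loop. The sketch below is one possible alternative, not the approach used above. It assumes the dplyr and tidyr packages are available and runs on a made-up `toy` table whose team codes (`AAA`, `BBB`) and values are purely illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: blank Team cells belong to the team named above them.
toy <- tibble(
  Team = c("AAA", "", "", "BBB", ""),
  HR_2018 = c(10, 5, 3, 7, 2)
)

toy_filled <- toy %>%
  mutate(Team = na_if(Team, "")) %>%  # Turn blanks into NA so fill() can see them.
  fill(Team, .direction = "down")     # Carry the last non-missing team downward.

toy_filled$Team
```

A leading blank (a first row with no team) would stay `NA` here rather than raise an error, which makes that case easy to spot.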
# Clean up column names.
colnames(baseball_df) <- c("Team", "Position", "HR_2018", "HR_2019", "HR_2021", "HR_2022", "HR_2023")
# Drop the Position column, then sum the home runs per team, naming each total column by its year.
baseball_df2 <- baseball_df %>%
  select(-Position) %>%
  group_by(Team) %>%
  summarise(`2018` = sum(HR_2018, na.rm = TRUE),
            `2019` = sum(HR_2019, na.rm = TRUE),
            `2021` = sum(HR_2021, na.rm = TRUE),
            `2022` = sum(HR_2022, na.rm = TRUE),
            `2023` = sum(HR_2023, na.rm = TRUE))
# Replace team acronyms with full team names.
# This relies on summarise() returning one row per team, sorted so the acronyms line up with the names below.
teams <- c("Baltimore Orioles", "Boston Red Sox", "New York Yankees", "Tampa Bay Rays", "Toronto Blue Jays")
stopifnot(nrow(baseball_df2) == length(teams))
baseball_df2$Team <- teams
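Assigning names by row position works only while the row order matches the order of `teams`. A named lookup vector removes that dependency. The sketch below uses base R only. The acronyms (`BAL`, `BOS`, `NYY`) are assumed and should be checked against the actual values in the `Team` column, and the totals are placeholders rather than real data.

```r
# Hypothetical summary table; acronyms and totals are placeholders, not the real data.
toy_totals <- data.frame(
  Team = c("BOS", "BAL", "NYY"),
  `2018` = c(2, 1, 3),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Named vector: names are the (assumed) acronyms, values are the full team names.
team_lookup <- c(
  BAL = "Baltimore Orioles",
  BOS = "Boston Red Sox",
  NYY = "New York Yankees"
)

# Fail early if an acronym has no entry in the lookup.
stopifnot(all(toy_totals$Team %in% names(team_lookup)))

# Index by name, so the row order no longer matters.
toy_totals$Team <- unname(team_lookup[toy_totals$Team])
toy_totals$Team
```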
Step 1.2 Create the data table.
datatable(
  data = baseball_df2,
  options = list(scrollX = TRUE,
                 autoWidth = FALSE,
                 pageLength = 5),
  caption = "Total home runs for 5 MLB teams by year"
)
Conclusions: I don’t see any notable trends among the MLB teams. Note that the data set has no home run totals for 2020 because of COVID-19.
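One way to check for trends more directly than by reading the table is to reshape the totals to long format and compute the change from one season to the next. This is a minimal sketch, assuming dplyr and tidyr are available. `toy_totals` and its numbers are invented for illustration and are not the real home run totals.

```r
library(dplyr)
library(tidyr)

# Hypothetical totals in the same wide shape as the table above.
toy_totals <- tibble(
  Team = c("Team A", "Team B"),
  `2018` = c(10, 20),
  `2019` = c(15, 18),
  `2021` = c(12, 25)
)

toy_long <- toy_totals %>%
  pivot_longer(cols = -Team, names_to = "Year", values_to = "HR") %>%
  mutate(Year = as.integer(Year)) %>%   # Year columns arrive as text.
  group_by(Team) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(change = HR - lag(HR)) %>%     # NA for each team's first season.
  ungroup()

toy_long
```

The long table could then be passed to `ggplot(toy_long, aes(Year, HR, color = Team)) + geom_line()` to draw one line per team. Because 2020 is missing, the change into 2021 spans two years.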
2. Weather
Step 2.1 First let’s clean up the weather dataset.
# Keep the country, location_name, latitude, longitude, timezone, last_updated, wind_mph, and wind_degree columns.
weather_df <- weather_df[c("country", "location_name", "latitude", "longitude", "timezone", "last_updated", "wind_mph", "wind_degree")]
weather_df2 <- weather_df %>%
  filter(str_detect(timezone, "Pacific")) %>% # Keep locations whose timezone name contains "Pacific".
  filter(str_detect(last_updated, "^9")) %>% # Keep rows whose last_updated value starts with "9".
  filter(longitude > 0) %>%
  filter(latitude > -30) %>%
  mutate(wind_radian = wind_degree * pi / 180) %>% # Convert degrees to radians.
  mutate(latitude_end = wind_mph * sin(wind_radian) / 2 + latitude) %>% # End points of the wind vectors, scaled to half the wind speed.
  mutate(longitude_end = wind_mph * cos(wind_radian) / 2 + longitude)
weather_df2 <- weather_df2[c("country", "latitude", "longitude", "wind_mph", "latitude_end", "longitude_end")] # Drop columns that are no longer needed.
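The end points above treat `wind_degree` as a mathematical angle, measured counter-clockwise from east. Weather feeds commonly report wind direction as a compass bearing instead: degrees clockwise from north, naming the direction the wind blows from. Whether this data set follows that convention is an assumption that should be checked against its documentation. If it does, the components could be computed as in the sketch below. `wind_end()` and its `scale` argument are hypothetical helpers, not part of the original analysis.

```r
# Hypothetical helper: end point of a wind vector under the meteorological
# convention (bearing clockwise from north, giving where the wind comes FROM).
wind_end <- function(lon, lat, speed, degree, scale = 0.5) {
  rad <- degree * pi / 180
  u <- -speed * sin(rad)  # Eastward component of the air movement.
  v <- -speed * cos(rad)  # Northward component of the air movement.
  data.frame(
    longitude_end = lon + scale * u,
    latitude_end = lat + scale * v
  )
}

# A wind from due east (90 degrees) should point west, toward smaller longitude.
res <- wind_end(lon = 160, lat = -9, speed = 10, degree = 90)
res
```

With `scale = 0.5` this keeps the same half-speed arrow length as the code above, so only the arrow directions would change, not their lengths.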
Step 2.2 Plot the data as vectors on a map of the islands in the Pacific.
g <- ggplot(weather_df2, aes(x = longitude, y = latitude)) +
  geom_segment(aes(xend = longitude_end, yend = latitude_end, color = wind_mph),
               arrow = arrow(length = unit(0.5, "cm")),
               linewidth = 0.85) + # Draw the wind vectors (linewidth requires ggplot2 >= 3.4.0).
  geom_text(aes(label = country), vjust = -2.5) +
  scale_color_gradient(low = "blue", high = "red", name = "wind_mph") +
  theme(legend.position = "none")
ggbackground(g, map_url) # Place my map as a background image in my plot.
Conclusions: Judging by the plot alone, the Solomon Islands and Micronesia appear to be less windy than the other countries shown.
3. Emergency Services
Step 3.1 First let’s clean up the emergency services data set.
ems_df2 <- ems_df %>%
  select(34:43) %>% # Keep the ten visit-count columns, then total each one.
  # Columns ending in 1 hold the not-admitted totals.
  summarise(`Non Urgent1` = sum(EMS_VISITS_NON_URGENT_NOT_adm, na.rm = TRUE),
            `Non Urgent admitted` = sum(EMS_VISITS_NON_URGENT_adm, na.rm = TRUE),
            `Urgent1` = sum(EMS_VISITS_URGENT_NOT_adm, na.rm = TRUE),
            `Urgent admitted` = sum(EMS_VISITS_URGENT_adm, na.rm = TRUE),
            `Moderate1` = sum(EMS_VISITS_MODERATE_NOT_adm, na.rm = TRUE),
            `Moderate admitted` = sum(EMS_VISITS_MODERATE_adm, na.rm = TRUE),
            `Severe1` = sum(EMS_VISITS_SEVERE_NOT_adm, na.rm = TRUE),
            `Severe admitted` = sum(EMS_VISITS_SEVERE_adm, na.rm = TRUE),
            `Critical1` = sum(EMS_VISITS_CRITICAL_NOT_adm, na.rm = TRUE),
            `Critical admitted` = sum(EMS_VISITS_CRITICAL_adm, na.rm = TRUE)) %>%
  # Admission rate per severity level: admitted / (admitted + not admitted).
  mutate(`Non Urgent` = `Non Urgent admitted`/(`Non Urgent admitted`+`Non Urgent1`)) %>%
  mutate(`Urgent` = `Urgent admitted`/(`Urgent admitted`+`Urgent1`)) %>%
  mutate(`Moderate` = `Moderate admitted`/(`Moderate admitted`+`Moderate1`)) %>%
  mutate(`Severe` = `Severe admitted`/(`Severe admitted`+`Severe1`)) %>%
  mutate(`Critical` = `Critical admitted`/(`Critical admitted`+`Critical1`)) %>%
  select(c(11:15)) # Keep only the five admission-rate columns.
ems_df3 <- pivot_longer(
  ems_df2,
  cols = c("Non Urgent", "Urgent", "Moderate", "Severe", "Critical"), # Pivot the five rate columns.
  names_to = "Severity Status",
  values_to = "Admission Rate"
)
print(ems_df3)
## # A tibble: 5 × 2
## `Severity Status` `Admission Rate`
## <chr> <dbl>
## 1 Non Urgent 0.00670
## 2 Urgent 0.00943
## 3 Moderate 0.0365
## 4 Severe 0.126
## 5 Critical 0.478
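The rates above are built with one `summarise()` entry and one `mutate()` call per severity level. A more compact route is to reshape first and let the column names carry the severity and admission status. The sketch below is an alternative under stated assumptions, not the method used above. It assumes dplyr and tidyr are available and that the columns follow the `EMS_VISITS_<SEVERITY>_adm` / `EMS_VISITS_<SEVERITY>_NOT_adm` naming seen in the code. `toy_ems` and its counts are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts for two severity levels across two records.
toy_ems <- tibble(
  EMS_VISITS_URGENT_adm = c(1, 2),
  EMS_VISITS_URGENT_NOT_adm = c(9, 8),
  EMS_VISITS_CRITICAL_adm = c(4, 6),
  EMS_VISITS_CRITICAL_NOT_adm = c(6, 4)
)

toy_rates <- toy_ems %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%  # Total each column.
  pivot_longer(
    everything(),
    names_to = c("severity", "status"),
    names_pattern = "EMS_VISITS_(.*?)_(NOT_adm|adm)$",  # Split severity from status.
    values_to = "visits"
  ) %>%
  pivot_wider(names_from = status, values_from = visits) %>%
  mutate(admission_rate = adm / (adm + NOT_adm))

toy_rates
```

Each severity level ends up on its own row with `adm`, `NOT_adm`, and `admission_rate` columns, so adding a level to the data needs no extra code.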
Step 3.2 Now let’s plot the hospital admission rate by severity.
# Order the bars by increasing severity rather than alphabetically.
custom_order <- c("Non Urgent", "Urgent", "Moderate", "Severe", "Critical")
ems_df3$`Severity Status` <- factor(ems_df3$`Severity Status`, levels = custom_order)
ggplot(ems_df3, aes(x = `Severity Status`, y = `Admission Rate`)) +
  geom_col(fill = "darkgreen") # geom_col() is equivalent to geom_bar(stat = "identity").
Conclusions: The more severe the case, the higher the admission rate, rising from about 0.7% for non-urgent visits to about 48% for critical visits.