Welcome to Jacob Silver’s Project 2! All data and code can be found in the following GitHub repository: https://github.com/LongSockSilver/project2
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Dataset 1: Top 100 NBA Players (The Ringer)
The Ringer is a sports media company whose NBA player rankings are a popular entertainment product. While the rankings are based on the subjective judgments of close watchers of the NBA, data is a major part of their analysis.
However, because the rankings are subjective and lean on the “eye test” and other non-quantitative assessments, the importance of particular data points is not entirely clear. Cleaning and analyzing the rankings may offer insight into which factors basketball experts treat as important, and into whether having “top players” is connected to team success.
df <- read.csv('./data/nba_ringer_top100_UNTIDY.csv')
Let’s look at the data:
head(df)
## X1 X2 X3
## 1 Nikola Jokic Team: DEN Position: BIG
## 2 Field goals made: 9.3 Field goal percentage: 63.2
## 3 Free throw attempts: 6.1 Free throw percentage: 82.6
## 4 Assists: 10.0 Turnovers: 3.6
## 5 Giannis Antetokounmpo Team: MIL Position: BIG
## 6 Field goals made: 11.1 Field goal percentage: 53.8
## X4 X5
## 1 Age: 28 Points: 24.5
## 2 3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2 Defensive rebounds: 9.4
## 4 Steals: 1.3 Blocks: 0.7
## 5 Age: 28 Points: 31.3
## 6 3-pointers made: 0.8 3-pointer percentage: 27.0
I’d like to get some preliminary column names in, even if they’re somewhat meaningless and will be short-lived.
colnames(df) <- c('player_name', 'values1', 'values2',
'values3', 'values4')
head(df)
## player_name values1 values2
## 1 Nikola Jokic Team: DEN Position: BIG
## 2 Field goals made: 9.3 Field goal percentage: 63.2
## 3 Free throw attempts: 6.1 Free throw percentage: 82.6
## 4 Assists: 10.0 Turnovers: 3.6
## 5 Giannis Antetokounmpo Team: MIL Position: BIG
## 6 Field goals made: 11.1 Field goal percentage: 53.8
## values3 values4
## 1 Age: 28 Points: 24.5
## 2 3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2 Defensive rebounds: 9.4
## 4 Steals: 1.3 Blocks: 0.7
## 5 Age: 28 Points: 31.3
## 6 3-pointers made: 0.8 3-pointer percentage: 27.0
I see that the data I care about is broken up arbitrarily across a number of columns. However, the player names, which I’d like to use as my row identifiers, all appear in column 1. I can make sure all relevant data is connected to those names by “filling” them down the table until we reach a new player.
#make sure blank strings are interpreted as NA
df$player_name[df$player_name == ''] <- NA
#fill
df <- df %>%
fill(player_name)
head(df)
## player_name values1 values2
## 1 Nikola Jokic Team: DEN Position: BIG
## 2 Nikola Jokic Field goals made: 9.3 Field goal percentage: 63.2
## 3 Nikola Jokic Free throw attempts: 6.1 Free throw percentage: 82.6
## 4 Nikola Jokic Assists: 10.0 Turnovers: 3.6
## 5 Giannis Antetokounmpo Team: MIL Position: BIG
## 6 Giannis Antetokounmpo Field goals made: 11.1 Field goal percentage: 53.8
## values3 values4
## 1 Age: 28 Points: 24.5
## 2 3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2 Defensive rebounds: 9.4
## 4 Steals: 1.3 Blocks: 0.7
## 5 Age: 28 Points: 31.3
## 6 3-pointers made: 0.8 3-pointer percentage: 27.0
Now I can pivot the data to be longer, consolidating all the metrics into one column. After that, I can drop the resulting “name” column, since it only holds the arbitrary positional labels (values1 through values4).
df <- df %>%
pivot_longer(
cols = seq(2,5)
)
#drop 'name', a set of arbitrary positional labels
df <- df %>%
subset(select = -c(name))
head(df)
## # A tibble: 6 × 2
## player_name value
## <chr> <chr>
## 1 Nikola Jokic Team: DEN
## 2 Nikola Jokic Position: BIG
## 3 Nikola Jokic Age: 28
## 4 Nikola Jokic Points: 24.5
## 5 Nikola Jokic Field goals made: 9.3
## 6 Nikola Jokic Field goal percentage: 63.2
All the values contain a colon followed by a space, so we should be able to split each one into two columns at that separator: the left piece becomes the variable name and the right piece becomes the value.
df <- df %>%
separate(col = value,
into = c('variable', 'value'),
sep = ': ')
head(df)
## # A tibble: 6 × 3
## player_name variable value
## <chr> <chr> <chr>
## 1 Nikola Jokic Team DEN
## 2 Nikola Jokic Position BIG
## 3 Nikola Jokic Age 28
## 4 Nikola Jokic Points 24.5
## 5 Nikola Jokic Field goals made 9.3
## 6 Nikola Jokic Field goal percentage 63.2
Looking at the data, I realize I should actually re-pivot it to be wider. While cleaning data often means pivoting to a longer format, I feel that would only be appropriate if the values were of a consistent type. Here, though, the values differ in kind: team name is a string, for example, while the other variables are numbers, and some of those numbers are percentages while others are per-game totals.
#pivot wider
df <- df %>%
pivot_wider(names_from = variable,
values_from = value)
#add player rank (the table is already ordered from 1 to 100, so a simple sequence works)
df$player_rank <- seq(1,100)
head(df)
## # A tibble: 6 × 18
## player_name Team Posit…¹ Age Points Field…² Field…³ 3-poi…⁴ 3-poi…⁵ Free …⁶
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Nikola Jok… DEN BIG 28 24.5 9.3 63.2 0.8 40.2 6.1
## 2 Giannis An… MIL BIG 28 31.3 11.1 53.8 0.8 27.0 12.7
## 3 Luka Doncic DAL GUARD 23 33.2 11.2 50.5 2.7 34.7 11.1
## 4 Stephen Cu… GSW GUARD 34 29.4 9.8 49.5 4.9 42.7 5.4
## 5 Kevin Dura… PHX FORWARD 34 29.7 10.5 55.9 1.8 37.6 7.3
## 6 Joel Embiid PHI BIG 28 33.1 11.0 53.1 1.1 34.0 11.8
## # … with 8 more variables: `Free throw percentage` <chr>,
## # `Offensive rebounds` <chr>, `Defensive rebounds` <chr>, Assists <chr>,
## # Turnovers <chr>, Steals <chr>, Blocks <chr>, player_rank <int>, and
## # abbreviated variable names ¹Position, ²`Field goals made`,
## # ³`Field goal percentage`, ⁴`3-pointers made`, ⁵`3-pointer percentage`,
## # ⁶`Free throw attempts`
Now I want to change the column names to reflect some industry-standard abbreviations.
colnames(df) <- c('player_name', 'TEAM', 'POSITION', 'AGE', 'PTS',
'FGM','FG%','3PM', '3P%', 'FTA', 'FT%', 'ORB',
'DRB', 'AST', 'TOV', 'STL', 'BLK', 'player_rank')
head(df)
## # A tibble: 6 × 18
## player_n…¹ TEAM POSIT…² AGE PTS FGM `FG%` `3PM` `3P%` FTA `FT%` ORB
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Nikola Jo… DEN BIG 28 24.5 9.3 63.2 0.8 40.2 6.1 82.6 2.2
## 2 Giannis A… MIL BIG 28 31.3 11.1 53.8 0.8 27.0 12.7 64.6 2.3
## 3 Luka Donc… DAL GUARD 23 33.2 11.2 50.5 2.7 34.7 11.1 73.4 0.9
## 4 Stephen C… GSW GUARD 34 29.4 9.8 49.5 4.9 42.7 5.4 92.2 0.6
## 5 Kevin Dur… PHX FORWARD 34 29.7 10.5 55.9 1.8 37.6 7.3 93.4 0.4
## 6 Joel Embi… PHI BIG 28 33.1 11.0 53.1 1.1 34.0 11.8 85.8 1.8
## # … with 6 more variables: DRB <chr>, AST <chr>, TOV <chr>, STL <chr>,
## # BLK <chr>, player_rank <int>, and abbreviated variable names ¹player_name,
## # ²POSITION
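(As a side note, a name-based rename would be more robust than the positional assignment above, since it does not depend on column order. Below is a brief sketch using dplyr::rename() as a hypothetical replacement for, not an addition to, that step; only a few of the sixteen renames are shown.)
#hypothetical name-based alternative to the positional rename above (sketch, not run here)
df %>%
  rename(TEAM = Team,
         POSITION = Position,
         PTS = Points,
         `FG%` = `Field goal percentage`)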
It seems the numerical columns were interpreted as strings, so I’ll go through and fix their formatting. Percentages also came through as numbers between 0 and 100 (instead of between 0 and 1), making future calculations potentially problematic; I’ll divide those by 100 to ensure proper math down the road.
#convert the stat columns (AGE through BLK) to numeric
df[c(4:17)] <- sapply(df[c(4:17)], as.numeric)
#turn any percentage column into a proper percentage by dividing by 100
for (val in c('FG%', '3P%', 'FT%')) {
df[[val]] <- df[[val]] / 100
}
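For reference, the same conversion can be written as a single tidyverse pipeline. This is only a sketch of an equivalent alternative to the code above (it would replace that code, not follow it), and it assumes the column names assigned earlier, with AGE through BLK being the numeric stats.
#equivalent tidyverse formulation (sketch): convert the stat columns to numeric,
#then rescale the percentage columns to proportions
df <- df %>%
  mutate(across(AGE:BLK, as.numeric)) %>%
  mutate(across(c(`FG%`, `3P%`, `FT%`), ~ .x / 100))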
Ah! Clean data. Now I can do the analyses I’m interested in.
df
## # A tibble: 100 × 18
## player_…¹ TEAM POSIT…² AGE PTS FGM `FG%` `3PM` `3P%` FTA `FT%` ORB
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nikola J… DEN BIG 28 24.5 9.3 0.632 0.8 0.402 6.1 0.826 2.2
## 2 Giannis … MIL BIG 28 31.3 11.1 0.538 0.8 0.27 12.7 0.646 2.3
## 3 Luka Don… DAL GUARD 23 33.2 11.2 0.505 2.7 0.347 11.1 0.734 0.9
## 4 Stephen … GSW GUARD 34 29.4 9.8 0.495 4.9 0.427 5.4 0.922 0.6
## 5 Kevin Du… PHX FORWARD 34 29.7 10.5 0.559 1.8 0.376 7.3 0.934 0.4
## 6 Joel Emb… PHI BIG 28 33.1 11 0.531 1.1 0.34 11.8 0.858 1.8
## 7 Jayson T… BOS FORWARD 24 30.4 9.9 0.461 3.3 0.355 8.5 0.864 1.1
## 8 LeBron J… LAL FORWARD 38 29.6 11.3 0.503 2.2 0.31 6.3 0.76 1.2
## 9 Kawhi Le… LAC FORWARD 31 22.7 8.3 0.506 1.8 0.396 5 0.874 1.2
## 10 Shai Gil… OKC GUARD 24 31 10.2 0.507 0.9 0.338 10.6 0.908 0.9
## # … with 90 more rows, 6 more variables: DRB <dbl>, AST <dbl>, TOV <dbl>,
## # STL <dbl>, BLK <dbl>, player_rank <int>, and abbreviated variable names
## # ¹player_name, ²POSITION
First let’s see which teams are best represented in the data.
team_player_count_df <- data.frame(df %>%
dplyr::count(TEAM, sort = TRUE))
team_player_count_df
## TEAM n
## 1 BOS 7
## 2 GSW 5
## 3 LAC 5
## 4 MIN 5
## 5 TOR 5
## 6 BKN 4
## 7 CHI 4
## 8 CLE 4
## 9 DEN 4
## 10 MEM 4
## 11 MIL 4
## 12 PHI 4
## 13 PHX 4
## 14 ATL 3
## 15 DAL 3
## 16 MIA 3
## 17 NOP 3
## 18 NYK 3
## 19 ORL 3
## 20 POR 3
## 21 SAC 3
## 22 WAS 3
## 23 DET 2
## 24 IND 2
## 25 LAL 2
## 26 OKC 2
## 27 SAS 2
## 28 UTA 2
## 29 CHA 1
## 30 HOU 1
Anecdotally, I can tell a few things already. Boston (BOS) and Golden State (GSW) were in last year’s NBA championship series, so I’m not too surprised to see them top the list with the most players in the top 100. But I can use more data to get a more robust sense of team success. Here, I’ll pull in current team records, which include each team’s win %, and correlate that win % against each team’s count of players in the top 100.
team_df <- read.csv('./data/team_data.csv')
#combine into one dataframe that contains player count and team win %
team_df <- merge(team_player_count_df,
team_df[c('TEAM', 'WIN.')],
by = 'TEAM')
team_df
## TEAM n WIN.
## 1 ATL 3 0.508
## 2 BKN 4 0.556
## 3 BOS 7 0.703
## 4 CHA 1 0.308
## 5 CHI 4 0.453
## 6 CLE 4 0.600
## 7 DAL 3 0.516
## 8 DEN 4 0.703
## 9 DET 2 0.238
## 10 GSW 5 0.531
## 11 HOU 1 0.210
## 12 IND 2 0.438
## 13 LAC 5 0.500
## 14 LAL 2 0.469
## 15 MEM 4 0.613
## 16 MIA 3 0.516
## 17 MIL 4 0.726
## 18 MIN 5 0.508
## 19 NOP 3 0.484
## 20 NYK 3 0.585
## 21 OKC 2 0.460
## 22 ORL 3 0.422
## 23 PHI 4 0.645
## 24 PHX 4 0.547
## 25 POR 3 0.460
## 26 SAC 3 0.597
## 27 SAS 2 0.254
## 28 TOR 5 0.484
## 29 UTA 2 0.484
## 30 WAS 3 0.484
Plotting and correlation:
team_df %>%
ggplot(aes(x = n,
y = WIN.)) +
geom_point() +
geom_text(label = team_df$TEAM) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
print(cor(team_df$n,
team_df$WIN.))
## [1] 0.6818405
This scatterplot presents some interesting findings, both expected and unexpected. For example, the Boston Celtics have two more Ringer Top 100 players than any other team, along with one of the highest win percentages. Meanwhile, the bottom four teams in the league (Houston, Charlotte, San Antonio, and Detroit) have only one or two players in the ranking. However, a couple of teams with an entire starting lineup’s worth of top players, namely the Toronto Raptors and Minnesota Timberwolves, have only around a .500 win percentage and are not thought to be contenders.
Perhaps a better metric would be a score based on each player’s specific placement on the list. After all, the current analysis “flattens” the number 1 player and the number 100 player into the same count. This would also produce a continuous rather than ordinal metric, allowing for cleaner correlation.
#assign points to each player, starting from 100 and working down to 1.
df$ringer_rank_POINTS <- seq(100,1)
#build table of teams and the sum total of points
df_points <- df %>%
group_by(TEAM) %>%
summarise(ringer_rank_POINTS = sum(ringer_rank_POINTS))
#attach win % to teams
df_points <- merge(df_points,
team_df[c('TEAM', 'WIN.')],
by = 'TEAM')
Now we can create a new scatterplot:
df_points %>%
ggplot(aes(x = ringer_rank_POINTS,
y = WIN.)) +
geom_point() +
geom_text(label = df_points$TEAM) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
print(cor(df_points$ringer_rank_POINTS,
df_points$WIN.))
## [1] 0.8236894
Now we have a much tighter fit and a very strong positive correlation (about 0.82). To a basketball fan, the results also make intuitive sense: the cluster of teams in the top-right corner of the chart are thought by many analysts to be the current title contenders.
Now, it would be interesting to see which variables seem to have been most important to the evaluators, or at least which contribute, under the surface, to what experts see as the most exceptional players. I can create a table of correlations between player ranking points (I prefer this metric over the raw ranking because it yields positive rather than negative correlations) and all other numeric variables in the data.
data_cor <- cor(df[c(4:17)], #all numeric columns besides rankings and points
df[19]) #ranking points
data_cor
## ringer_rank_POINTS
## AGE 0.13029854
## PTS 0.73334033
## FGM 0.71321837
## FG% 0.10813987
## 3PM 0.16541387
## 3P% 0.12766480
## FTA 0.68394614
## FT% 0.31075707
## ORB -0.04173480
## DRB 0.30060240
## AST 0.49779566
## TOV 0.60558322
## STL 0.43658679
## BLK 0.04410323
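To make the ordering easier to see at a glance, the correlation vector can also be sorted from strongest to weakest (a quick sketch):
#sort the correlations in descending order (sketch)
data_cor[order(data_cor[, 1], decreasing = TRUE), , drop = FALSE]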
Perhaps unsurprisingly, points scored had the strongest correlation with player ranking, followed by field goals made. At the end of the day, as important as defensive prowess and play-making are, superstar players are generally expected to put the ball in the hoop.
Free-throw attempts also showed a strong correlation, in keeping with the idea that star players “get to the line” by playing aggressively, and by frustrating defenders who can only stop them by fouling.
Interestingly, turnovers showed a strong correlation with player ranking despite being a “negative” statistic. This may be because star players handle the ball so much that they also bear a lot of responsibility for potential mistakes.
## Dataset 2: Rail Casualties
Thanks to Joe Garcia for this data and analysis idea!
Let’s take a look at our data, which consists of rail casualties by state and year:
df1 <- read.csv('./data/rail_casualties.csv')
head(df1)
## X Fatal X.1 X.2 X.3 Nonfatal X.4 X.5 X.6
## 1 2019 2020 2021 2022 2019 2020 2021 2022
## 2 Alabama 15 13 21 20 101 79 82 75
## 3 Alaska . 1 . 1 53 30 24 34
## 4 Arizona 16 18 27 15 74 52 42 50
## 5 Arkansas 7 5 13 12 71 52 45 43
## 6 California 182 174 177 219 587 416 416 413
It appears that the CSV was read in with two header rows, bumping the year labels down into the first data row. There are a few different ways to approach this; I’m thinking I’d like to “attach” year and casualty type into a single label for now, then separate them out when the time comes.
colnames(df1) <- c('state', '2019_Fatal', '2020_Fatal', '2021_Fatal',
'2022_Fatal', '2019_Nonfatal', '2020_Nonfatal',
'2021_Nonfatal', '2022_Nonfatal')
#drop first row (just years)
df1 <- df1[-1,]
#reset index
rownames(df1) <- 1:nrow(df1)
head(df1)
## state 2019_Fatal 2020_Fatal 2021_Fatal 2022_Fatal 2019_Nonfatal
## 1 Alabama 15 13 21 20 101
## 2 Alaska . 1 . 1 53
## 3 Arizona 16 18 27 15 74
## 4 Arkansas 7 5 13 12 71
## 5 California 182 174 177 219 587
## 6 Colorado 18 13 13 13 112
## 2020_Nonfatal 2021_Nonfatal 2022_Nonfatal
## 1 79 82 75
## 2 30 24 34
## 3 52 42 50
## 4 52 45 43
## 5 416 416 413
## 6 90 105 81
Throughout the data, periods sometimes appear instead of numbers. Unfortunately, I could not determine from the FRA website whether these represent missing values or zeros. I will treat them as zeros for the time being, because they only occur where zeros would make intuitive sense, such as in small states where casualty numbers are very low. But I recognize this is not ideal, and in the future I would much rather have clear documentation on the meaning of these characters.
#convert periods to 0s
df1[df1=='.'] <- '0'
#get rid of "Tot" row--would rather calculate my own aggregates
df1 <- df1[df1$state != 'Tot',]
#convert the count columns (2 through 9) to numeric, stripping any thousands separators
for (i in seq(2,9)) {
df1[,i] <- as.numeric(gsub(',', '', df1[,i]))
}
head(df1)
## state 2019_Fatal 2020_Fatal 2021_Fatal 2022_Fatal 2019_Nonfatal
## 1 Alabama 15 13 21 20 101
## 2 Alaska 0 1 0 1 53
## 3 Arizona 16 18 27 15 74
## 4 Arkansas 7 5 13 12 71
## 5 California 182 174 177 219 587
## 6 Colorado 18 13 13 13 112
## 2020_Nonfatal 2021_Nonfatal 2022_Nonfatal
## 1 79 82 75
## 2 30 24 34
## 3 52 42 50
## 4 52 45 43
## 5 416 416 413
## 6 90 105 81
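(For reference, if documentation later confirms that the periods mean “missing” rather than zero, one alternative, sketched here with a hypothetical df1_na object, would be to read them in as NA from the start via read.csv’s na.strings argument and then use na.rm = TRUE in later aggregations; the same header cleanup as above would still be needed afterward.)
#hypothetical alternative (sketch): treat '.' as missing on import rather than as zero
df1_na <- read.csv('./data/rail_casualties.csv', na.strings = '.')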
Now I want to pivot the data longer, so there are only three columns: state, year, and casualty type (fatal or nonfatal). I can do this first by using the pivot_longer function, and then by splitting the single year/casualty type column by the underscore I set up earlier.
#pivot into long-format dataframe
df1 <- df1 %>%
pivot_longer(cols = seq(2,9))
head(df1)
## # A tibble: 6 × 3
## state name value
## <chr> <chr> <dbl>
## 1 Alabama 2019_Fatal 15
## 2 Alabama 2020_Fatal 13
## 3 Alabama 2021_Fatal 21
## 4 Alabama 2022_Fatal 20
## 5 Alabama 2019_Nonfatal 101
## 6 Alabama 2020_Nonfatal 79
#split the joint columns of year_casualtytype into two distinct columns
df1 <- df1 %>%
separate(col = name,
into = c('year', 'casualty_type'),
sep = '_')
df1
## # A tibble: 400 × 4
## state year casualty_type value
## <chr> <chr> <chr> <dbl>
## 1 Alabama 2019 Fatal 15
## 2 Alabama 2020 Fatal 13
## 3 Alabama 2021 Fatal 21
## 4 Alabama 2022 Fatal 20
## 5 Alabama 2019 Nonfatal 101
## 6 Alabama 2020 Nonfatal 79
## 7 Alabama 2021 Nonfatal 82
## 8 Alabama 2022 Nonfatal 75
## 9 Alaska 2019 Fatal 0
## 10 Alaska 2020 Fatal 1
## # … with 390 more rows
Now that the data is in a good format, we can run some analyses. In his discussion post, Joe expressed interest in understanding the ratio of fatal to non-fatal casualties by state, something raw counts obscure because overall volumes differ so much from state to state. I can create a new table showing each state’s fatality share across all years using grouping.
#condense all years so we have two numbers (fatal and nonfatal casualties) for each state
df1_total <- df1 %>%
group_by(state, casualty_type) %>%
summarise(value = sum(value))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
#create a separate table totalling the FULL casualties for each state for all four years
df1_full_total <- df1 %>%
group_by(state) %>%
summarise(full_state_total = sum(value))
#add a new column containing the total state value
df1_total <- merge(df1_total,
df1_full_total,
by = 'state')
#calculate the rate of fatal and nonfatal casualties by state
df1_total$rate <- df1_total$value / df1_total$full_state_total
#to simplify, let's look only at the rate of fatal casualties
df1_fatal_share <- df1_total[df1_total$casualty_type == 'Fatal',]
#change name of columns to be more clear
colnames(df1_fatal_share)[3:5] <- c('fatalities', 'total_casualties', 'fatality_rate')
#drop casualty_type column, as we are only looking at fatal casualties
df1_fatal_share <- df1_fatal_share %>%
subset(select = -c(2))
head(df1_fatal_share)
## state fatalities total_casualties fatality_rate
## 1 Alabama 69 406 0.16995074
## 3 Alaska 2 143 0.01398601
## 5 Arizona 76 294 0.25850340
## 7 Arkansas 37 248 0.14919355
## 9 California 752 2584 0.29102167
## 11 Colorado 57 445 0.12808989
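For reference, the same table could be built in a single grouped summarise, without the intermediate merge. Here is a sketch; the hypothetical df1_fatal_share_alt should match df1_fatal_share above.
#equivalent one-pass version (sketch): fatal count, total, and rate per state
df1_fatal_share_alt <- df1 %>%
  group_by(state) %>%
  summarise(fatalities = sum(value[casualty_type == 'Fatal']),
            total_casualties = sum(value),
            fatality_rate = fatalities / total_casualties)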
Finally, I want to display the data sorted in descending order by fatality rate.
df1_fatal_share <- df1_fatal_share[order(df1_fatal_share$fatality_rate,
decreasing = TRUE),]
df1_fatal_share
## state fatalities total_casualties fatality_rate
## 55 Nevada 39 123 0.317073171
## 9 California 752 2584 0.291021672
## 5 Arizona 76 294 0.258503401
## 71 Oklahoma 51 256 0.199218750
## 33 Kentucky 43 217 0.198156682
## 61 New Mexico 36 193 0.186528497
## 79 South Carolina 53 293 0.180887372
## 93 Washington 94 535 0.175700935
## 65 North Carolina 68 394 0.172588832
## 1 Alabama 69 406 0.169950739
## 83 Tennessee 69 406 0.169950739
## 85 Texas 302 1814 0.166482911
## 73 Oregon 58 353 0.164305949
## 19 Florida 225 1390 0.161870504
## 69 Ohio 84 524 0.160305344
## 95 West Virginia 25 157 0.159235669
## 87 Utah 24 151 0.158940397
## 7 Arkansas 37 248 0.149193548
## 21 Georgia 98 669 0.146487294
## 51 Montana 43 296 0.145270270
## 47 Mississippi 43 298 0.144295302
## 97 Wisconsin 37 262 0.141221374
## 49 Missouri 68 482 0.141078838
## 31 Kansas 52 381 0.136482940
## 27 Indiana 93 695 0.133812950
## 43 Michigan 42 325 0.129230769
## 11 Colorado 57 445 0.128089888
## 91 Virginia 49 407 0.120393120
## 37 Maine 8 68 0.117647059
## 29 Iowa 39 354 0.110169492
## 57 New Hampshire 3 28 0.107142857
## 35 Louisiana 52 526 0.098859316
## 25 Illinois 182 1843 0.098752035
## 39 Maryland 35 358 0.097765363
## 23 Idaho 10 105 0.095238095
## 53 Nebraska 38 404 0.094059406
## 67 North Dakota 14 149 0.093959732
## 45 Minnesota 29 361 0.080332410
## 81 South Dakota 5 65 0.076923077
## 75 Pennsylvania 91 1572 0.057888041
## 41 Massachusetts 34 614 0.055374593
## 77 Rhode Island 5 96 0.052083333
## 89 Vermont 3 76 0.039473684
## 13 Connecticut 17 500 0.034000000
## 59 New Jersey 56 2025 0.027654321
## 63 New York 105 4457 0.023558447
## 15 Delaware 3 200 0.015000000
## 3 Alaska 2 143 0.013986014
## 99 Wyoming 1 114 0.008771930
## 17 Dist Of Columbia 2 315 0.006349206
We can see that the states with the highest share of fatal casualties are Nevada, California, and Arizona (all, fascinatingly, in the Southwest). Washington DC, Wyoming, and Alaska, meanwhile, saw the lowest fatality rates.
We can also look at correlations between total casualties and fatality rate, to see if high rates of accident-related casualties overall might predict the likelihood of fatalities.
df1_fatal_share %>%
ggplot(aes(x = total_casualties,
y = fatality_rate)) +
geom_point() +
geom_text(label = df1_fatal_share$state) +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
print(cor(df1_fatal_share$total_casualties,
df1_fatal_share$fatality_rate))
## [1] -0.06721035
The near-zero correlation suggests that the states with the most overall casualties are not necessarily the ones with the highest fatality rates, and vice versa.
## Dataset 3: Test Scores
Thanks to Waheeb Algabri for the data and analysis ideas! This is a sample of people and their age, gender, and test scores across three subjects.
df2 <- read.csv('./data/subject_scores.csv')
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on './data/subject_scores.csv'
df2
## Name Age Gender Math1 Science2 English
## 1 Suzan 27 F 90 75 86
## 2 John 29 M 95 80 91
## 3 Alex 31 M 84 70 65
First I want to change the column names to remove unnecessary numbers.
colnames(df2)[4:5] <- c('Math', 'Science')
df2
## Name Age Gender Math Science English
## 1 Suzan 27 F 90 75 86
## 2 John 29 M 95 80 91
## 3 Alex 31 M 84 70 65
Next, I’ll pivot the data to be longer, since “Subject” can be its own column, while scores can be consolidated.
df2 <- df2 %>%
pivot_longer(cols = c('Math', 'Science', 'English'),
names_to = 'Subject',
values_to = 'Score')
df2
## # A tibble: 9 × 5
## Name Age Gender Subject Score
## <chr> <int> <chr> <chr> <int>
## 1 Suzan 27 F Math 90
## 2 Suzan 27 F Science 75
## 3 Suzan 27 F English 86
## 4 John 29 M Math 95
## 5 John 29 M Science 80
## 6 John 29 M English 91
## 7 Alex 31 M Math 84
## 8 Alex 31 M Science 70
## 9 Alex 31 M English 65
Now that the data is clean (and resembles the version Waheeb produced in his own discussion post), I can look at some summary statistics.
df2_avg <- df2 %>%
group_by(Subject) %>%
summarise(mean = mean(Score),
median = median(Score),
min = min(Score),
max = max(Score))
df2_avg
## # A tibble: 3 × 5
## Subject mean median min max
## <chr> <dbl> <int> <int> <int>
## 1 English 80.7 86 65 91
## 2 Math 89.7 90 84 95
## 3 Science 75 75 70 80
We can see that the highest scores came on the Math test, in terms of mean, median and maximum. We can also see that the lowest score on the English test, 65, may be an outlier that is dragging down the mean score for that subject.
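A quick check of that intuition (a sketch, using a hypothetical english_scores helper): recomputing the English mean without the 65 raises it substantially.
#English mean with and without the lowest score (sketch)
english_scores <- df2$Score[df2$Subject == 'English']
mean(english_scores) #about 80.7
mean(english_scores[english_scores != 65]) #88.5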
Let’s now look at some correlations:
cor(df2[df2$Subject == 'English',]$Score,
df2[df2$Subject == 'Math',]$Score)
## [1] 0.9585454
cor(df2[df2$Subject == 'English',]$Score,
df2[df2$Subject == 'Science',]$Score)
## [1] 0.9422929
cor(df2[df2$Subject == 'Math',]$Score,
df2[df2$Subject == 'Science',]$Score)
## [1] 0.9986254
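The same pairwise correlations can also be computed all at once from a wide layout; a brief sketch:
#full correlation matrix across subjects (sketch)
df2 %>%
  pivot_wider(names_from = Subject, values_from = Score) %>%
  select(Math, Science, English) %>%
  cor()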
All of the correlations are extremely high, probably a function of the small number of observations. However, the relationship between Math and Science appears to be the strongest of the three. In a real-world situation, this might not be terribly surprising given the overlap between these two ‘STEM’ subjects.