Welcome to Jacob Silver’s Project 2! All data and code can be found in the following GitHub repository: https://github.com/LongSockSilver/project2

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

## Dataset 1: Top 100 NBA Players (The Ringer)

The Ringer is a sports media company whose NBA player rankings are a popular entertainment product. While the rankings are based on the subjective judgments of close watchers of the NBA, data is a massive part of their analysis.

However, due to the subjectivity of the rankings and the role of the “eye test” and other non-quantitative assessments, the importance of particular data points is not entirely clear. Cleaning and analyzing the rankings may offer insight into which factors basketball experts treat as most important, and whether having “top players” is connected to team success.

df <- read.csv('./data/nba_ringer_top100_UNTIDY.csv')

Let’s look at the data:

head(df)
##                      X1                       X2                          X3
## 1          Nikola Jokic                Team: DEN               Position: BIG
## 2                          Field goals made: 9.3 Field goal percentage: 63.2
## 3                       Free throw attempts: 6.1 Free throw percentage: 82.6
## 4                                  Assists: 10.0              Turnovers: 3.6
## 5 Giannis Antetokounmpo                Team: MIL               Position: BIG
## 6                         Field goals made: 11.1 Field goal percentage: 53.8
##                        X4                         X5
## 1                 Age: 28               Points: 24.5
## 2    3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2    Defensive rebounds: 9.4
## 4             Steals: 1.3                Blocks: 0.7
## 5                 Age: 28               Points: 31.3
## 6    3-pointers made: 0.8 3-pointer percentage: 27.0

I’d like to get some preliminary column names in, even if they’re somewhat meaningless and will be short-lived.

colnames(df) <- c('player_name', 'values1', 'values2',
                  'values3', 'values4')

head(df)
##             player_name                  values1                     values2
## 1          Nikola Jokic                Team: DEN               Position: BIG
## 2                          Field goals made: 9.3 Field goal percentage: 63.2
## 3                       Free throw attempts: 6.1 Free throw percentage: 82.6
## 4                                  Assists: 10.0              Turnovers: 3.6
## 5 Giannis Antetokounmpo                Team: MIL               Position: BIG
## 6                         Field goals made: 11.1 Field goal percentage: 53.8
##                   values3                    values4
## 1                 Age: 28               Points: 24.5
## 2    3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2    Defensive rebounds: 9.4
## 4             Steals: 1.3                Blocks: 0.7
## 5                 Age: 28               Points: 31.3
## 6    3-pointers made: 0.8 3-pointer percentage: 27.0

I see that the data I care about are broken up arbitrarily across a number of columns. However, the player names, which I’d like to use as my row identifiers, all appear in column 1. I can make sure all relevant data is connected to those names by “filling” them down the table until we reach a new player.

#make sure blank strings are interpreted as null
df$player_name[df$player_name == ''] <- NA

#fill
df <- df %>%
  fill(player_name)

head(df)
##             player_name                  values1                     values2
## 1          Nikola Jokic                Team: DEN               Position: BIG
## 2          Nikola Jokic    Field goals made: 9.3 Field goal percentage: 63.2
## 3          Nikola Jokic Free throw attempts: 6.1 Free throw percentage: 82.6
## 4          Nikola Jokic            Assists: 10.0              Turnovers: 3.6
## 5 Giannis Antetokounmpo                Team: MIL               Position: BIG
## 6 Giannis Antetokounmpo   Field goals made: 11.1 Field goal percentage: 53.8
##                   values3                    values4
## 1                 Age: 28               Points: 24.5
## 2    3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2    Defensive rebounds: 9.4
## 4             Steals: 1.3                Blocks: 0.7
## 5                 Age: 28               Points: 31.3
## 6    3-pointers made: 0.8 3-pointer percentage: 27.0

Now, I can pivot the data to be longer, consolidating all the metrics into one column. After that I can ditch the original “value” column names, which were arbitrary.

df <- df %>%
  pivot_longer(
    cols = seq(2,5)
  )

#drop 'name', a set of arbitrary positional labels
df <- df %>%
  subset(select = -c(name))

head(df)
## # A tibble: 6 × 2
##   player_name  value                      
##   <chr>        <chr>                      
## 1 Nikola Jokic Team: DEN                  
## 2 Nikola Jokic Position: BIG              
## 3 Nikola Jokic Age: 28                    
## 4 Nikola Jokic Points: 24.5               
## 5 Nikola Jokic Field goals made: 9.3      
## 6 Nikola Jokic Field goal percentage: 63.2

All the values contain a colon followed by a space, so we should be able to split out two columns using that separator: the left side becomes the variable name and the right side becomes the value.

df <- df %>%
  separate(col = value,
           into = c('variable', 'value'),
           sep = ': ')

head(df)
## # A tibble: 6 × 3
##   player_name  variable              value
##   <chr>        <chr>                 <chr>
## 1 Nikola Jokic Team                  DEN  
## 2 Nikola Jokic Position              BIG  
## 3 Nikola Jokic Age                   28   
## 4 Nikola Jokic Points                24.5 
## 5 Nikola Jokic Field goals made      9.3  
## 6 Nikola Jokic Field goal percentage 63.2

Looking at the data, I realize I actually want to re-pivot it to be wider. While cleaning data often means pivoting to a longer format, I feel that would only be appropriate here if the values were consistent. However, we actually have different types of values: team name is a string, for example, while the other variables are numbers, and some of those numbers are percentages while others are per-game totals.

#pivot wider
df <- df %>%
  pivot_wider(names_from = variable,
              values_from = value)

#make sure player rank is included
df$player_rank <- seq(1,100)

head(df)
## # A tibble: 6 × 18
##   player_name Team  Posit…¹ Age   Points Field…² Field…³ 3-poi…⁴ 3-poi…⁵ Free …⁶
##   <chr>       <chr> <chr>   <chr> <chr>  <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 Nikola Jok… DEN   BIG     28    24.5   9.3     63.2    0.8     40.2    6.1    
## 2 Giannis An… MIL   BIG     28    31.3   11.1    53.8    0.8     27.0    12.7   
## 3 Luka Doncic DAL   GUARD   23    33.2   11.2    50.5    2.7     34.7    11.1   
## 4 Stephen Cu… GSW   GUARD   34    29.4   9.8     49.5    4.9     42.7    5.4    
## 5 Kevin Dura… PHX   FORWARD 34    29.7   10.5    55.9    1.8     37.6    7.3    
## 6 Joel Embiid PHI   BIG     28    33.1   11.0    53.1    1.1     34.0    11.8   
## # … with 8 more variables: `Free throw percentage` <chr>,
## #   `Offensive rebounds` <chr>, `Defensive rebounds` <chr>, Assists <chr>,
## #   Turnovers <chr>, Steals <chr>, Blocks <chr>, player_rank <int>, and
## #   abbreviated variable names ¹​Position, ²​`Field goals made`,
## #   ³​`Field goal percentage`, ⁴​`3-pointers made`, ⁵​`3-pointer percentage`,
## #   ⁶​`Free throw attempts`

Now I want to change the column names to reflect some industry standard abbreviations.

colnames(df) <- c('player_name', 'TEAM', 'POSITION', 'AGE', 'PTS',
                  'FGM','FG%','3PM', '3P%', 'FTA', 'FT%', 'ORB', 
                  'DRB', 'AST', 'TOV', 'STL', 'BLK', 'player_rank')

head(df)
## # A tibble: 6 × 18
##   player_n…¹ TEAM  POSIT…² AGE   PTS   FGM   `FG%` `3PM` `3P%` FTA   `FT%` ORB  
##   <chr>      <chr> <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Nikola Jo… DEN   BIG     28    24.5  9.3   63.2  0.8   40.2  6.1   82.6  2.2  
## 2 Giannis A… MIL   BIG     28    31.3  11.1  53.8  0.8   27.0  12.7  64.6  2.3  
## 3 Luka Donc… DAL   GUARD   23    33.2  11.2  50.5  2.7   34.7  11.1  73.4  0.9  
## 4 Stephen C… GSW   GUARD   34    29.4  9.8   49.5  4.9   42.7  5.4   92.2  0.6  
## 5 Kevin Dur… PHX   FORWARD 34    29.7  10.5  55.9  1.8   37.6  7.3   93.4  0.4  
## 6 Joel Embi… PHI   BIG     28    33.1  11.0  53.1  1.1   34.0  11.8  85.8  1.8  
## # … with 6 more variables: DRB <chr>, AST <chr>, TOV <chr>, STL <chr>,
## #   BLK <chr>, player_rank <int>, and abbreviated variable names ¹​player_name,
## #   ²​POSITION

It seems the numerical columns were interpreted as strings, so I’ll go through and fix their formatting. The percentages also came through as numbers between 1 and 100 (instead of between 0 and 1), making future calculations potentially problematic; I’ll divide those by 100 to ensure proper math down the road.

#convert age and all the other stat columns from character to numeric
df[c(4:17)] <- sapply(df[c(4:17)], as.numeric)

#turn any percentage column into a proper percentage by dividing by 100

for (val in c('FG%', '3P%', 'FT%')) {
    df[[val]] <- df[[val]] / 100 
}
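
For what it’s worth, both conversions could also be written as a single tidyverse pipeline with mutate() and across(); a sketch equivalent to the two base-R steps above:

#tidyverse alternative to the two conversion steps above (a sketch)
df <- df %>%
  mutate(across(4:17, as.numeric),                    #coerce stat columns to numeric
         across(c(`FG%`, `3P%`, `FT%`), ~ .x / 100))  #rescale percentages to proportions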

Ah, clean data! Now I can do the analyses I’m interested in.

df
## # A tibble: 100 × 18
##    player_…¹ TEAM  POSIT…²   AGE   PTS   FGM `FG%` `3PM` `3P%`   FTA `FT%`   ORB
##    <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Nikola J… DEN   BIG        28  24.5   9.3 0.632   0.8 0.402   6.1 0.826   2.2
##  2 Giannis … MIL   BIG        28  31.3  11.1 0.538   0.8 0.27   12.7 0.646   2.3
##  3 Luka Don… DAL   GUARD      23  33.2  11.2 0.505   2.7 0.347  11.1 0.734   0.9
##  4 Stephen … GSW   GUARD      34  29.4   9.8 0.495   4.9 0.427   5.4 0.922   0.6
##  5 Kevin Du… PHX   FORWARD    34  29.7  10.5 0.559   1.8 0.376   7.3 0.934   0.4
##  6 Joel Emb… PHI   BIG        28  33.1  11   0.531   1.1 0.34   11.8 0.858   1.8
##  7 Jayson T… BOS   FORWARD    24  30.4   9.9 0.461   3.3 0.355   8.5 0.864   1.1
##  8 LeBron J… LAL   FORWARD    38  29.6  11.3 0.503   2.2 0.31    6.3 0.76    1.2
##  9 Kawhi Le… LAC   FORWARD    31  22.7   8.3 0.506   1.8 0.396   5   0.874   1.2
## 10 Shai Gil… OKC   GUARD      24  31    10.2 0.507   0.9 0.338  10.6 0.908   0.9
## # … with 90 more rows, 6 more variables: DRB <dbl>, AST <dbl>, TOV <dbl>,
## #   STL <dbl>, BLK <dbl>, player_rank <int>, and abbreviated variable names
## #   ¹​player_name, ²​POSITION

First let’s see which teams are best represented in the data.

team_player_count_df <- data.frame(df %>%
  dplyr::count(TEAM, sort = TRUE))

team_player_count_df
##    TEAM n
## 1   BOS 7
## 2   GSW 5
## 3   LAC 5
## 4   MIN 5
## 5   TOR 5
## 6   BKN 4
## 7   CHI 4
## 8   CLE 4
## 9   DEN 4
## 10  MEM 4
## 11  MIL 4
## 12  PHI 4
## 13  PHX 4
## 14  ATL 3
## 15  DAL 3
## 16  MIA 3
## 17  NOP 3
## 18  NYK 3
## 19  ORL 3
## 20  POR 3
## 21  SAC 3
## 22  WAS 3
## 23  DET 2
## 24  IND 2
## 25  LAL 2
## 26  OKC 2
## 27  SAS 2
## 28  UTA 2
## 29  CHA 1
## 30  HOU 1

Anecdotally, I can tell a few things already. Boston (BOS) and Golden State (GSW) were in last year’s NBA championship series, so I’m not too surprised to see them near the top of the list of players in the top 100. But I can use more data to get a more robust sense of team success. Here, I’ll pull in each team’s current record (win %) and correlate that win % against each team’s count of players in the top 100.

team_df <- read.csv('./data/team_data.csv')

#combine into one dataframe that contains player count and team win %

team_df <- merge(team_player_count_df, 
                 team_df[c('TEAM', 'WIN.')], 
                 by = 'TEAM')

team_df
##    TEAM n  WIN.
## 1   ATL 3 0.508
## 2   BKN 4 0.556
## 3   BOS 7 0.703
## 4   CHA 1 0.308
## 5   CHI 4 0.453
## 6   CLE 4 0.600
## 7   DAL 3 0.516
## 8   DEN 4 0.703
## 9   DET 2 0.238
## 10  GSW 5 0.531
## 11  HOU 1 0.210
## 12  IND 2 0.438
## 13  LAC 5 0.500
## 14  LAL 2 0.469
## 15  MEM 4 0.613
## 16  MIA 3 0.516
## 17  MIL 4 0.726
## 18  MIN 5 0.508
## 19  NOP 3 0.484
## 20  NYK 3 0.585
## 21  OKC 2 0.460
## 22  ORL 3 0.422
## 23  PHI 4 0.645
## 24  PHX 4 0.547
## 25  POR 3 0.460
## 26  SAC 3 0.597
## 27  SAS 2 0.254
## 28  TOR 5 0.484
## 29  UTA 2 0.484
## 30  WAS 3 0.484
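
A quick note on WIN.: it arrives precomputed in team_data.csv. If the file instead held raw win and loss counts (hypothetical W and L columns here), the same figure could be derived directly:

#win % from raw wins and losses (W and L are assumed, hypothetical column names)
team_df$WIN. <- team_df$W / (team_df$W + team_df$L)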

Plotting and correlation:

team_df %>%
  ggplot(aes(x = n,
             y = WIN.)) + 
  geom_point() +
  geom_text(label = team_df$TEAM) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

print(cor(team_df$n,
          team_df$WIN.))
## [1] 0.6818405

This scatterplot presents some interesting findings, including both expected and unexpected results. For example, the Boston Celtics have two more Ringer Top 100 players than any other team, and one of the highest win percentages. Meanwhile, the bottom four teams in the league (Houston, Charlotte, San Antonio and Detroit) have only one or two players each in the ranking. However, a couple of teams with an entire starting lineup’s worth of top players, namely the Toronto Raptors and Minnesota Timberwolves, have only about a .500 win percentage and are not thought to be contenders.

Perhaps a better metric would be a score based on each player’s specific placement on the list. After all, the current analysis ‘flattens’ the number 1 player and the number 100 player into the same count. This would also produce a continuous rather than an ordinal metric, allowing for a cleaner correlation.

#assign points to each player, starting from 100 and working down to 1.
df$ringer_rank_POINTS <- seq(100,1)

#build table of teams and the sum total of points
df_points <- df %>%
  group_by(TEAM) %>%
  summarise(ringer_rank_POINTS = sum(ringer_rank_POINTS))

#attach win % to teams
df_points <- merge(df_points, 
      team_df[c('TEAM', 'WIN.')],
      by = 'TEAM')
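
(Since player_rank already runs from 1 to 100, the same points could be derived directly from it rather than relying on row order; a one-line sketch:)

#order-independent equivalent of the points assignment above
df$ringer_rank_POINTS <- 101 - df$player_rank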

Now we can create a new scatterplot:

df_points %>%
  ggplot(aes(x = ringer_rank_POINTS,
             y = WIN.)) + 
  geom_point() +
  geom_text(label = df_points$TEAM) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

print(cor(df_points$ringer_rank_POINTS,
          df_points$WIN.))
## [1] 0.8236894

Now we have a much tighter fit with a very strong positive correlation. To a basketball fan, the results also make intuitive sense: the cluster of teams in the top right corner of the chart are thought by many analysts to be the current title contenders.

Now, it would be interesting to see which variables seem to have been most important to the evaluators, or at least which contribute, under the surface, to what experts see as the most exceptional players. I can create a table of correlations between player ranking points (I prefer this metric over the raw ranking so that better players show positive rather than negative correlations) and all other numeric variables in the data.

data_cor <- cor(df[c(4:17)],  #all numeric columns besides rankings and points
                df[19])  #ranking points

data_cor
##     ringer_rank_POINTS
## AGE         0.13029854
## PTS         0.73334033
## FGM         0.71321837
## FG%         0.10813987
## 3PM         0.16541387
## 3P%         0.12766480
## FTA         0.68394614
## FT%         0.31075707
## ORB        -0.04173480
## DRB         0.30060240
## AST         0.49779566
## TOV         0.60558322
## STL         0.43658679
## BLK         0.04410323
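
To make the ranking of these relationships easier to scan, the correlation matrix can be sorted; a small sketch:

#sort the correlations from strongest to weakest (a sketch)
data_cor[order(data_cor[, 1], decreasing = TRUE), , drop = FALSE]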

Perhaps unsurprisingly, points scored had the strongest correlation with player ranking, followed by field goals made. At the end of the day, as important as defensive prowess and play-making are, superstar players are generally expected to put the ball in the hoop.

Free-throw attempts also showed a strong correlation, in keeping with the idea that star players “get to the line” by playing aggressively, and by frustrating defenders who can only stop them by fouling.

Interestingly, turnovers showed a strong correlation with player ranking despite being a “negative” statistic. This may be because star players handle the ball so much that they also bear a lot of responsibility for potential mistakes.

## Dataset 2: Rail Casualties

Thanks to Joe Garcia for this data and analysis idea!

Let’s take a look at our data, which consists of rail casualties by state and year:

df1 <- read.csv('./data/rail_casualties.csv')

head(df1)
##            X Fatal  X.1  X.2  X.3 Nonfatal  X.4  X.5  X.6
## 1             2019 2020 2021 2022     2019 2020 2021 2022
## 2    Alabama    15   13   21   20      101   79   82   75
## 3     Alaska     .    1    .    1       53   30   24   34
## 4    Arizona    16   18   27   15       74   52   42   50
## 5   Arkansas     7    5   13   12       71   52   45   43
## 6 California   182  174  177  219      587  416  416  413

It appears that the CSV read in with two header rows, bumping the years down into the first data row. There are a few different ways to approach this; I’m thinking I’d like to “attach” year and casualty type together in the column names for now, then separate them out when the time comes.

colnames(df1) <- c('state', '2019_Fatal', '2020_Fatal', '2021_Fatal',
                   '2022_Fatal', '2019_Nonfatal', '2020_Nonfatal',
                   '2021_Nonfatal', '2022_Nonfatal')

#drop first row (just years)
df1 <- df1[-1,]

#reset index
rownames(df1) <- 1:nrow(df1)  

head(df1)
##        state 2019_Fatal 2020_Fatal 2021_Fatal 2022_Fatal 2019_Nonfatal
## 1    Alabama         15         13         21         20           101
## 2     Alaska          .          1          .          1            53
## 3    Arizona         16         18         27         15            74
## 4   Arkansas          7          5         13         12            71
## 5 California        182        174        177        219           587
## 6   Colorado         18         13         13         13           112
##   2020_Nonfatal 2021_Nonfatal 2022_Nonfatal
## 1            79            82            75
## 2            30            24            34
## 3            52            42            50
## 4            52            45            43
## 5           416           416           413
## 6            90           105            81

Throughout the data, periods sometimes come up instead of numbers. Unfortunately I could not track down on the FRA website whether these constitute null values or 0s. I will treat them as 0s for the time being, because they only occur where 0s would make intuitive sense, such as small states where casualty numbers are very low. But I recognize this is not ideal, and in the future I would much rather have clear documentation on the meaning of these characters.
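
As a quick sanity check before converting, I can count how many cells contain the period placeholder; a minimal sketch:

#count the cells holding a period instead of a number
sum(df1 == '.')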

#convert periods to 0s

df1[df1=='.'] <- '0'

#get rid of "Tot" row--would rather calculate my own aggregates
df1 <- df1[df1$state != 'Tot',]

#convert all columns to numeric
for (i in seq(2,9)) {
  df1[,i] <- as.numeric(gsub(',', '', df1[,i]))
}

head(df1)
##        state 2019_Fatal 2020_Fatal 2021_Fatal 2022_Fatal 2019_Nonfatal
## 1    Alabama         15         13         21         20           101
## 2     Alaska          0          1          0          1            53
## 3    Arizona         16         18         27         15            74
## 4   Arkansas          7          5         13         12            71
## 5 California        182        174        177        219           587
## 6   Colorado         18         13         13         13           112
##   2020_Nonfatal 2021_Nonfatal 2022_Nonfatal
## 1            79            82            75
## 2            30            24            34
## 3            52            42            50
## 4            52            45            43
## 5           416           416           413
## 6            90           105            81
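
(The same cleanup could also be written as a single mutate(across(...)) call instead of a loop; a sketch:)

#tidyverse alternative to the conversion loop above (a sketch)
df1 <- df1 %>%
  mutate(across(2:9, ~ as.numeric(gsub(',', '', .x))))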

Now I want to pivot the data longer, so there are only three columns: state, year, and casualty type (fatal or nonfatal). I can do this first by using the pivot_longer function, and then by splitting the single year/casualty type column by the underscore I set up earlier.

#pivot into long-format dataframe
df1 <- df1 %>%
  pivot_longer(cols = seq(2,9))

head(df1)
## # A tibble: 6 × 3
##   state   name          value
##   <chr>   <chr>         <dbl>
## 1 Alabama 2019_Fatal       15
## 2 Alabama 2020_Fatal       13
## 3 Alabama 2021_Fatal       21
## 4 Alabama 2022_Fatal       20
## 5 Alabama 2019_Nonfatal   101
## 6 Alabama 2020_Nonfatal    79
#split the joint columns of year_casualtytype into two distinct columns
df1 <- df1 %>%
  separate(col = name,
           into = c('year', 'casualty_type'),
           sep = '_')

df1
## # A tibble: 400 × 4
##    state   year  casualty_type value
##    <chr>   <chr> <chr>         <dbl>
##  1 Alabama 2019  Fatal            15
##  2 Alabama 2020  Fatal            13
##  3 Alabama 2021  Fatal            21
##  4 Alabama 2022  Fatal            20
##  5 Alabama 2019  Nonfatal        101
##  6 Alabama 2020  Nonfatal         79
##  7 Alabama 2021  Nonfatal         82
##  8 Alabama 2022  Nonfatal         75
##  9 Alaska  2019  Fatal             0
## 10 Alaska  2020  Fatal             1
## # … with 390 more rows
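
(As an aside, pivot_longer can perform the pivot and the split in a single step by supplying names_sep; a sketch, where df1_wide stands in for the wide table before pivoting:)

#one-step alternative to the pivot + separate above (df1_wide is hypothetical)
df1_long <- df1_wide %>%
  pivot_longer(cols = seq(2, 9),
               names_to = c('year', 'casualty_type'),
               names_sep = '_')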

Now that the data is in a good format, we can run some analyses. In his discussion post, Joe expressed interest in understanding the ratio of fatal vs. non-fatal casualties by state, which is obscured by the differing volumes across states. I can create a new table showing each state’s fatality ratio across all years using grouping and summarising.

#condense all years so we have two numbers (fatal and nonfatal casualties) for each state
df1_total <- df1 %>%
  group_by(state, casualty_type) %>%
  summarise(value = sum(value))
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
#create a separate table totalling the FULL casualties for each state for all four years
df1_full_total <- df1 %>%
  group_by(state) %>%
  summarise(full_state_total = sum(value))

#add a new column containing the total state value
df1_total <- merge(df1_total,
                   df1_full_total,
                   by = 'state')

#calculate the rate of fatal and nonfatal casualties by state
df1_total$rate <- df1_total$value / df1_total$full_state_total

#to simplify, let's look only at the rate of fatal casualties
df1_fatal_share <- df1_total[df1_total$casualty_type == 'Fatal',]

#change name of columns to be more clear
colnames(df1_fatal_share)[3:5] <- c('fatalities', 'total_casualties', 'fatality_rate')

#drop casualty_type column as we are only looking at fatal accidents
df1_fatal_share <- df1_fatal_share %>%
  subset(select = -c(2))

head(df1_fatal_share) 
##         state fatalities total_casualties fatality_rate
## 1     Alabama         69              406    0.16995074
## 3      Alaska          2              143    0.01398601
## 5     Arizona         76              294    0.25850340
## 7    Arkansas         37              248    0.14919355
## 9  California        752             2584    0.29102167
## 11   Colorado         57              445    0.12808989
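
The same table could also be built with a single grouped summarise, skipping the intermediate merge; a sketch that should produce equivalent numbers:

#one-pass alternative to the merge-based approach above (a sketch)
df1_fatal_share_alt <- df1 %>%
  group_by(state) %>%
  summarise(fatalities = sum(value[casualty_type == 'Fatal']),
            total_casualties = sum(value),
            fatality_rate = fatalities / total_casualties)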

Finally, I want to display the data sorted in descending order by fatality rate.

df1_fatal_share <- df1_fatal_share[order(df1_fatal_share$fatality_rate,
                      decreasing = TRUE),]

df1_fatal_share
##               state fatalities total_casualties fatality_rate
## 55           Nevada         39              123   0.317073171
## 9        California        752             2584   0.291021672
## 5           Arizona         76              294   0.258503401
## 71         Oklahoma         51              256   0.199218750
## 33         Kentucky         43              217   0.198156682
## 61       New Mexico         36              193   0.186528497
## 79   South Carolina         53              293   0.180887372
## 93       Washington         94              535   0.175700935
## 65   North Carolina         68              394   0.172588832
## 1           Alabama         69              406   0.169950739
## 83        Tennessee         69              406   0.169950739
## 85            Texas        302             1814   0.166482911
## 73           Oregon         58              353   0.164305949
## 19          Florida        225             1390   0.161870504
## 69             Ohio         84              524   0.160305344
## 95    West Virginia         25              157   0.159235669
## 87             Utah         24              151   0.158940397
## 7          Arkansas         37              248   0.149193548
## 21          Georgia         98              669   0.146487294
## 51          Montana         43              296   0.145270270
## 47      Mississippi         43              298   0.144295302
## 97        Wisconsin         37              262   0.141221374
## 49         Missouri         68              482   0.141078838
## 31           Kansas         52              381   0.136482940
## 27          Indiana         93              695   0.133812950
## 43         Michigan         42              325   0.129230769
## 11         Colorado         57              445   0.128089888
## 91         Virginia         49              407   0.120393120
## 37            Maine          8               68   0.117647059
## 29             Iowa         39              354   0.110169492
## 57    New Hampshire          3               28   0.107142857
## 35        Louisiana         52              526   0.098859316
## 25         Illinois        182             1843   0.098752035
## 39         Maryland         35              358   0.097765363
## 23            Idaho         10              105   0.095238095
## 53         Nebraska         38              404   0.094059406
## 67     North Dakota         14              149   0.093959732
## 45        Minnesota         29              361   0.080332410
## 81     South Dakota          5               65   0.076923077
## 75     Pennsylvania         91             1572   0.057888041
## 41    Massachusetts         34              614   0.055374593
## 77     Rhode Island          5               96   0.052083333
## 89          Vermont          3               76   0.039473684
## 13      Connecticut         17              500   0.034000000
## 59       New Jersey         56             2025   0.027654321
## 63         New York        105             4457   0.023558447
## 15         Delaware          3              200   0.015000000
## 3            Alaska          2              143   0.013986014
## 99          Wyoming          1              114   0.008771930
## 17 Dist Of Columbia          2              315   0.006349206

We can see that the states with the highest share of fatal casualties are Nevada, California, and Arizona, which, fascinatingly, are all in the Southwest. Washington DC, Wyoming, and Alaska, meanwhile, saw the lowest fatality rates.

We can also look at the correlation between total casualties and fatality rate, to see whether states with more casualties overall also tend to have a higher share of fatal outcomes.

df1_fatal_share %>%
  ggplot(aes(x = total_casualties,
             y = fatality_rate)) + 
  geom_point() +
  geom_text(label = df1_fatal_share$state) +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

print(cor(df1_fatal_share$total_casualties,
          df1_fatal_share$fatality_rate))
## [1] -0.06721035

The near-zero correlation suggests that the states with the most overall casualties are not necessarily the ones with the highest fatality rates, and vice versa.

## Dataset 3: Test Scores

Thanks to Waheeb Algabri for the data and analysis ideas! This is a sample of people and their age, gender, and test scores across three subjects.

df2 <- read.csv('./data/subject_scores.csv')
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on './data/subject_scores.csv'
df2
##    Name Age Gender Math1 Science2 English
## 1 Suzan  27      F    90       75      86
## 2  John  29      M    95       80      91
## 3  Alex  31      M    84       70      65

First I want to change the column names to remove unnecessary numbers.

colnames(df2)[4:5] <- c('Math', 'Science')

df2
##    Name Age Gender Math Science English
## 1 Suzan  27      F   90      75      86
## 2  John  29      M   95      80      91
## 3  Alex  31      M   84      70      65

Next, I’ll pivot the data to be longer, since “Subject” can be its own column, while scores can be consolidated.

df2 <- df2 %>%
  pivot_longer(cols = c('Math', 'Science', 'English'),
               names_to = 'Subject',
               values_to = 'Score')

df2
## # A tibble: 9 × 5
##   Name    Age Gender Subject Score
##   <chr> <int> <chr>  <chr>   <int>
## 1 Suzan    27 F      Math       90
## 2 Suzan    27 F      Science    75
## 3 Suzan    27 F      English    86
## 4 John     29 M      Math       95
## 5 John     29 M      Science    80
## 6 John     29 M      English    91
## 7 Alex     31 M      Math       84
## 8 Alex     31 M      Science    70
## 9 Alex     31 M      English    65

Now that the data is clean (and resembles the version Waheeb produced in his own discussion post), I can look at some summary statistics.

df2_avg <- df2 %>%
  group_by(Subject) %>%
  summarise(mean = mean(Score),
            median = median(Score),
            min = min(Score),
            max = max(Score))

df2_avg
## # A tibble: 3 × 5
##   Subject  mean median   min   max
##   <chr>   <dbl>  <int> <int> <int>
## 1 English  80.7     86    65    91
## 2 Math     89.7     90    84    95
## 3 Science  75       75    70    80

We can see that the highest scores came on the Math test, in terms of mean, median and maximum. We can also see that the lowest score on the English test, 65, may be an outlier that is dragging down the mean score for that subject.
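
A quick check of that claim, using the scores shown above:

#English mean with and without the lowest score
english <- df2$Score[df2$Subject == 'English']
mean(english)                           #roughly 80.7 with the 65 included
mean(english[english != min(english)])  #88.5 once the 65 is dropped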

Let’s now look at some correlations:

cor(df2[df2$Subject == 'English',]$Score,
    df2[df2$Subject == 'Math',]$Score)
## [1] 0.9585454
cor(df2[df2$Subject == 'English',]$Score,
    df2[df2$Subject == 'Science',]$Score)
## [1] 0.9422929
cor(df2[df2$Subject == 'Math',]$Score,
    df2[df2$Subject == 'Science',]$Score)
## [1] 0.9986254
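
(A full correlation matrix across all three subjects can also be produced in one shot by pivoting back to a wide format; a small sketch:)

#pairwise correlations among the three subjects at once (a sketch)
df2 %>%
  pivot_wider(names_from = Subject, values_from = Score) %>%
  select(Math, Science, English) %>%
  cor()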

All of the correlations are extremely high, probably a function of the small number of observations. However, the relationship between Math and Science appears to be the strongest of the three associations. In a real-world situation, this might not be terribly surprising given the overlap between these two ‘STEM’ subjects.