Dataset Exploration

The flights dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.

cat('number of rows:',nrow(flights), #336776 
   '\nnumber of columns:',ncol(flights)) #19
## number of rows: 336776 
## number of columns: 19
glimpse(flights) #shows rows, columns, and snummary of all columns
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
summary(flights) #column names
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 

Description of columns

Column Name Variable Type Variable Description
year categorical ordinal year of flight departure/arrival
month categorical ordinal month of flight departure/arrival
day categorical ordinal day of flight departure/arrival
dep_time numerical discrete time of flight departure
sched_dep_time numerical discrete scheduled flight departure
dep_delay numerical discrete departure delay in minutes
arr_time numerical discrete time of flight arrival
sched_arr_time numerical discrete scheduled arrival time
arr_delay numerical discrete arrival delay in minutes
carrier categorical nominal abbreviated flight carrier name
flight categorical nominal flight number
tailnum categorical nominal tail number, a license plate of a plane
origin categorical nominal origin city of flight
dest categorical nominal destination city of flight
air_time numerical discrete total flight time in minutes
distance numerical discrete total flight distance
hour numerical discrete scheduled hour departure of flight
minute numerical discrete scheduled minute departure of flight
time_hour datetime variable/numerical continuous departure of flight with date and time

Question 1 (dataset filtering)

In the code above, we frequently used not_cancelled, rather than flights as our data. How did this simplify our code? Think especially about the functions we used within summarise().

Answer

The object ‘not_cancelled’ is an object that contains a dataset which has a list of flights that have not been cancelled and have taken off.

One might think about using the group_by() function to filter out the delayed flights. This could work but, there are multiple variables that need to be sub-grouped within the ‘non_cancelled’ dataset. Because of this it would be easier to create an object that stores all the data for non-cancelled flights (in this example, this is the non_cancelled object).

Also, the mean() function was used within the summarize() function to calculate the mean of delayed flights per day. The mean for the delayed flights was computed based off the groups per day (these groups were created in the group_by() function). If the flight delay variables (dep_delay() and arr_delay() ) were used in the group_by() function along with the year, month, and day then R would get confused since it would try to create 2 groups at the same time (one group for each day and another group for delay times). Our objective is to find the average delayed flights per day so it makes sense to only include the year, month, and day variables in the group_by() function and no other variables.

# Here is the object 'not_cancelled' being created
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay)) # filtering delayed flights because we want to see non-cancelled flights

# Showing average delays. avg_delay2 shows average for delayed flights only
not_cancelled %>% 
  group_by(.,year, month, day) %>% 
  summarize(.,
            avg_delay1 = mean(arr_delay),
            avg_delay2 = mean(arr_delay[arr_delay > 0]) # the ave positive delay
  )
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day avg_delay1 avg_delay2
##    <int> <int> <int>      <dbl>      <dbl>
##  1  2013     1     1     12.7         32.5
##  2  2013     1     2     12.7         32.0
##  3  2013     1     3      5.73        27.7
##  4  2013     1     4     -1.93        28.3
##  5  2013     1     5     -1.53        22.6
##  6  2013     1     6      4.24        24.4
##  7  2013     1     7     -4.95        27.8
##  8  2013     1     8     -3.23        20.8
##  9  2013     1     9     -0.264       25.6
## 10  2013     1    10     -5.90        27.3
## # ℹ 355 more rows

Question 2 (scatterplot of flight delays and flight cancellation proportion)

You might suspect that there is a relationship between the average delay (on a given day) and the proportion of flights that are cancelled on that day. For example, if there is bad weather, many flights might start off delayed, but then end up cancelled. Let’s test this intuition. First, find the average delay and proportion of flights cancelled each day. Second, plot them against one another and comment on the relationship. Did our intuition hold?

Answer

Flights that have longer delays have a higher chance of getting cancelled. This can be seen in the scatter plot below which shows that as a flights average delay time increases, then the proportion/percentage of the flight being cancelled increases. I reached this answer by finding the average delay and proportion of flights cancelled. I then used geom_point() to plot the data points and to see a relationship.

NOTE: I used the ‘dep_delay’ variable to calculate the average flights delay because if a flight is cancelled it would not have a arrival delay so using the variable ‘arr_delay’ to calculate the flight delay average would not be useful.

#avg delays and proportion of flights cancelled

flights %>%
  group_by(.,year, month, day) %>% 
  mutate(cancelled = ifelse((is.na(dep_delay) & is.na(arr_delay)),1,0)) %>%
  summarize(.,
            avg_delay1 = mean(dep_delay,na.rm=TRUE),
            flights_proportion = 100*mean(cancelled)) %>%
  filter(flights_proportion < 10) %>%
  ggplot(mapping = aes(x = avg_delay1, y = flights_proportion)) + 
  geom_point(alpha=1/10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

In the scatter plot above I used the filter() function to remove some extreme outliers of data that I saw. I didn’t want those points to mess up my observations which is why I added the filter function. This is what the graph looked like without the filter function.

flights %>%
  group_by(.,year, month, day) %>% 
  mutate(cancelled = ifelse((is.na(dep_delay) & is.na(arr_delay)),1,0)) %>%
  summarize(.,
            avg_delay1 = mean(dep_delay,na.rm=TRUE),
            flights_proportion = 100*mean(cancelled)) %>%
  ggplot(mapping = aes(x = avg_delay1, y = flights_proportion)) + 
  geom_point(alpha=1/10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

Question 3 (scatterplot of flight departure hour and on time flights)

No one likes to be delayed when flying. To try and avoid this, you might wonder what hour of the day is least likely to have a departure delay. What hour is it? Also, compute the percentage of flights that leave on time or early in each hour (i.e., the flights you want to find!). What hour of the day are you most likely to find these flights?

Answer

Flights that have the least delays would occur around 6-7 A.M. and I reached this conclusion from the data points that I plotted in the scatter plot below. I used this code in R to reach this conclusion and to make the graph.

I used the ‘not_cancelled’ object because it only has the dataset for non-cancelled flights. This way, I do not need to worry about any blank (NA) values that appear when I am filtering the data further.

I then used the mutate() function to add a column that converted the occurrence of early flights and late flights on a 1-0 scale (this would be helpful when finding the proportion of early flights). I used the ifelse() function to find flights that arrived early on time. I did this by setting the condition (dep_delay <= 0) to pull out the negative dep_delay values (the negative values represent an early departure).

The summarize() function was used to calculate the proportion of early flight occurrences for each hour group that I made in the group_by() function.

I then created another variable called ‘ontime_proportion’ to get the percentage of the early flight occurrences and displayed this in the table below as the y-axis.

NOTE: since the question only asked for departure delays, I only used the dep_delay variable and not the arr_delay variable

# code for scatterplot of early flight proportion for each hour

not_cancelled %>%
  group_by(.,hour) %>%
  mutate(flight_early = ifelse((dep_delay <= 0),1,0)) %>%
  summarize(.,
            ontime_proportion = 100*mean(flight_early)) %>%
  ggplot(mapping = aes(x = hour, y = ontime_proportion)) + 
  geom_point()

Question 4 (airline carriers and their departure delay proportion)

Which carriers are most likely to have a departure delay of at least 30 minutes? Hint: using the ifelse() function may be helpful

Answer

ExpressJet (EV) is the airline that will most likely have a departure delay for 30 minutes or more with a proportion of roughly 26%. Mesa Airlines (YV) comes in second with a proportion of roughly 23%. I reached this conclusion from the scatterplot in figure 9. I used the code below to reach this conclusion. I also displayed this data in a boxplot to display the data in another format visualization.

I used the ‘not_cancelled’ object because it only has the dataset for non-cancelled flights. This way, I do not need to worry about any blank (NA) values that appear when I am filtering the data further.

I then used the mutate() function to add a column that converted the flights that occurred 30 minutes or late on a 1-0 scale (this would be helpful when finding the proportion of 30+ minute late flights). I used the ifelse() function to find flights that arrived 30 or 30+ minutes late. I did this by setting the condition (dep_delay >= 0) to pull out the 30 or 30+ departure delay values.

The summarize() function was used to calculate the proportion of 30 or 30+ departure delay occurrences for each carrier group that I made in the group_by() function. I then created another variable called ‘f_30minlate_proportion’ to get the percentage of the 30 or 30+ flight delay occurrences and displayed this in the table below as the y-axis. NOTE: since the question only asked for departure delays, I only used the dep_delay variable and not the arr_delay variable

# code for scatterplot of 30 min late flight proportion sorted by each carrier

not_cancelled %>%
  group_by(.,carrier) %>%
  mutate(flight_30minlate = ifelse((dep_delay >= 30),1,0)) %>%
  summarize(.,
            f_30minlate_proportion = 100*mean(flight_30minlate)) %>%
  ggplot(mapping = aes(x = carrier, y = f_30minlate_proportion)) + 
  geom_point()

# A bar chart of 30 min late flight proportion, sorted by each carrier 
not_cancelled %>%
  group_by(.,carrier) %>%
  mutate(flight_30minlate = ifelse((dep_delay >= 30),1,0)) %>%
  summarize(.,
            f_30minlate_proportion = 100*mean(flight_30minlate)) %>%
  ggplot(mapping = aes(x = carrier, y = f_30minlate_proportion)) + 
  geom_bar(stat='identity')

Question 5 (city destination and the average flight delay in that location)

What destination has the smallest average arrival delay?

Answer

Lexington (LEX) has the smallest average arrival delay with an average of -22. I got this answer from the code I wrote which I talk about further below.

For the code I wrote, I grouped everything by destination since the question wanted the lowest average arrival delay for each destination. I then got the mean for all arrival delays for each destination through the summarize() function. I stored that value in the variable ‘arrivaldelay_average’.

NOTE: since the question only asked for arrival delays, I only used the ‘arr_delay’ variable and not the ‘dep_delay’ variable

# The code I had first. I noticed the graph was too big so I added 
# the filter function to get the value of the smallest average
not_cancelled %>%
  group_by(.,dest) %>%
  summarize(.,
            arrivaldelay_average = mean(arr_delay)) %>%
  ggplot(mapping = aes(x = dest, y = arrivaldelay_average)) + 
  geom_point()

# The code that displays the smallest average of arrivals
# and the destination that this value came from

not_cancelled %>%
  group_by(.,dest) %>%
  summarize(.,
            arrivaldelay_average = mean(arr_delay)) %>%
  filter(arrivaldelay_average < -20) %>%
  ggplot(mapping = aes(x = dest, y = arrivaldelay_average)) + 
  geom_point()

Question 6 (Bonus)

BONUS: Load the Lahman() library, which contains data on baseball players and their batting averages. First, convert it to a tibble (the tidyverse data structure we’ll cover in a future lecture) by calling: batting <- as_tibble(Lahman::Batting). Remember that with a built-in R data like Batting, you can write ?Batting in the R Console to display the help file, which will explain what the variables mean.

Dataset Exploration

The Batting dataset comes from the Lahman Baseball Database and is used for baseball statistics in R, providing comprehensive batting data and other tables from Major League Baseball.

Batting <- as_tibble(Lahman::Batting)

cat('number of rows:',nrow(Batting), #115450  
   '\nnumber of columns:',ncol(Batting)) #22
## number of rows: 115450 
## number of columns: 22
glimpse(Batting) #shows rows, columns, and snummary of all columns
## Rows: 115,450
## Columns: 22
## $ playerID <chr> "aardsda01", "aardsda01", "aardsda01", "aardsda01", "aardsda0…
## $ yearID   <int> 2004, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2015, 1954, 1…
## $ stint    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ teamID   <fct> SFN, CHN, CHA, BOS, SEA, SEA, NYA, NYN, ATL, ML1, ML1, ML1, M…
## $ lgID     <fct> NL, NL, AL, AL, AL, AL, AL, NL, NL, NL, NL, NL, NL, NL, NL, N…
## $ G        <int> 11, 45, 25, 47, 73, 53, 1, 43, 33, 122, 153, 153, 151, 153, 1…
## $ AB       <int> 0, 2, 0, 1, 0, 0, 0, 0, 1, 468, 602, 609, 615, 601, 629, 590,…
## $ R        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 58, 105, 106, 118, 109, 116, 102, …
## $ H        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 131, 189, 200, 198, 196, 223, 172,…
## $ X2B      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 37, 34, 27, 34, 46, 20, 39, 28…
## $ X3B      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 9, 14, 6, 4, 7, 11, 10, 6, 4, 2…
## $ HR       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 27, 26, 44, 30, 39, 40, 34, 45…
## $ RBI      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 69, 106, 92, 132, 95, 123, 126, 12…
## $ SB       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 2, 1, 4, 8, 16, 21, 15, 31, …
## $ CS       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 4, 1, 1, 0, 7, 9, 7, 5, 4, 4…
## $ BB       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 28, 49, 37, 57, 59, 51, 60, 56, 66…
## $ SO       <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 39, 61, 54, 58, 49, 54, 63, 64, 73…
## $ IBB      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 5, 6, 15, 16, 17, 13, 20, 14, …
## $ HBP      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 0, 1, 4, 2, 2, 3, 0, 0, 1…
## $ SH       <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 7, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ SF       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 7, 3, 3, 9, 12, 9, 6, 5, 2, …
## $ GIDP     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 20, 21, 13, 21, 19, 8, 16, 14,…
summary(Batting) #column names
##    playerID             yearID         stint           teamID      lgID      
##  Length:115450      Min.   :1871   Min.   :1.000   CHN    : 5297   AA: 1893  
##  Class :character   1st Qu.:1940   1st Qu.:1.000   PHI    : 5173   AL:53424  
##  Mode  :character   Median :1980   Median :1.000   PIT    : 5166   FL:  472  
##                     Mean   :1970   Mean   :1.082   SLN    : 5051   NA:  737  
##                     3rd Qu.:2005   3rd Qu.:1.000   CIN    : 4978   NL:58441  
##                     Max.   :2024   Max.   :5.000   CLE    : 4882   PL:  149  
##                                                    (Other):84903   UA:  334  
##        G                AB              R                H         
##  Min.   :  1.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.:  3.0   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median : 34.00   Median : 43.0   Median :  3.00   Median :  8.00  
##  Mean   : 50.28   Mean   :136.9   Mean   : 18.17   Mean   : 35.67  
##  3rd Qu.: 77.00   3rd Qu.:219.0   3rd Qu.: 26.00   3rd Qu.: 54.00  
##  Max.   :165.00   Max.   :716.0   Max.   :198.00   Max.   :262.00  
##                                                                    
##       X2B             X3B               HR             RBI        
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:  0.00  
##  Median : 1.00   Median : 0.000   Median : 0.00   Median :  3.00  
##  Mean   : 6.12   Mean   : 1.199   Mean   : 2.88   Mean   : 16.54  
##  3rd Qu.: 9.00   3rd Qu.: 1.000   3rd Qu.: 2.00   3rd Qu.: 23.00  
##  Max.   :67.00   Max.   :36.000   Max.   :73.00   Max.   :191.00  
##                                                   NA's   :756     
##        SB                CS               BB               SO        
##  Min.   :  0.000   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.000   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:  1.00  
##  Median :  0.000   Median : 0.000   Median :  2.00   Median :  8.00  
##  Mean   :  2.851   Mean   : 1.128   Mean   : 12.63   Mean   : 20.82  
##  3rd Qu.:  2.000   3rd Qu.: 1.000   3rd Qu.: 17.00   3rd Qu.: 29.00  
##  Max.   :138.000   Max.   :42.000   Max.   :232.00   Max.   :223.00  
##  NA's   :2368      NA's   :23542                     NA's   :2100    
##       IBB              HBP               SH               SF       
##  Min.   :  0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0     
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0     
##  Median :  0.00   Median : 0.000   Median : 0.000   Median : 0     
##  Mean   :  0.99   Mean   : 1.069   Mean   : 2.082   Mean   : 1     
##  3rd Qu.:  1.00   3rd Qu.: 1.000   3rd Qu.: 2.000   3rd Qu.: 1     
##  Max.   :120.00   Max.   :51.000   Max.   :67.000   Max.   :19     
##  NA's   :36651    NA's   :2816     NA's   :6068     NA's   :36104  
##       GIDP       
##  Min.   : 0.000  
##  1st Qu.: 0.000  
##  Median : 0.000  
##  Mean   : 2.828  
##  3rd Qu.: 4.000  
##  Max.   :36.000  
##  NA's   :25442

Question 6A (players and their batting averages)

Then find the players with the best or worst batting averages (batting average is simply the number of hits a player has, divided by the number of at bats they have). Why would this lead you astray?

Answer

When graphing this data, it would be impossible to see which players have the best and worst batting averages for 2 main reasons:

  1. There are many players who have perfect batting averages (they only hit one or two times) and this value is skewing our data.
  2. There would be many players in the x-axis so it would be hard to see the name of all players. This can easily be fixed by adding a filter function to only show the graph of the best and worst players.

I used the group_by() function to separate each player through their ID. This way I can calculate the batting average for each player. I then use the summarize function to define a variable ‘batting_average’ which gets the number of hits a player has (H) and divides that by the number of bates they have (AB).

Since the variable ‘player ID’ is a categorical variable and ‘batting_average’ is a continuous variable, I am using a scatter plot to display my data.

# players with best and worst averages

Batting %>%
  group_by(., playerID) %>%
  summarize(.,
            batting_average = sum(H)/sum(AB)) %>%
  ggplot(mapping = aes(x = playerID, y = batting_average)) + 
  geom_point()
## Warning: Removed 2765 rows containing missing values (`geom_point()`).

Question 6B (filtering players with larger bats and no perfect averages)

In the ‘Batting’ dataset, there were players who only batted once or twice, and got a hit each time (or never hit at all), so their perfect averages are skewing our data. We want to filter players with larger at-bat sample sizes, so we can see true averages that give us more information. Now I can see some points in my scatterplot, although it is still filled, I can see more points clearly.

Now condition on players who had at least 500 at bats. How would you answer change?

Answer

In our dataset, there were players who only batted once or twice, and got a hit each time (or never hit at all), so their perfect averages are skewing our data. We want to filter players with larger at-bat sample sizes, so we can see true averages that give us more information. Now I can see some points in my scatterplot, although it is still filled, I can see more points clearly.

# players with at least 500 at bats

Batting %>%
  group_by(.,playerID) %>%
  filter(AB >= 500) %>%
  summarize(.,
            batting_average = H/AB) %>%
  ggplot(mapping = aes(x = playerID, y = batting_average)) + 
  geom_point()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.