The flights dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.
cat('number of rows:',nrow(flights), #336776
'\nnumber of columns:',ncol(flights)) #19
## number of rows: 336776
## number of columns: 19
glimpse(flights) # shows rows, columns, and a summary of each column
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
summary(flights) # summary statistics for each column
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00.00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
## Median :29.00 Median :2013-07-03 10:00:00.00
## Mean :26.23 Mean :2013-07-03 05:22:54.64
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
## Max. :59.00 Max. :2013-12-31 23:00:00.00
##
| Column Name | Variable Type | Variable Description |
|---|---|---|
| year | categorical ordinal | year of departure |
| month | categorical ordinal | month of departure |
| day | categorical ordinal | day of the month of departure |
| dep_time | numerical discrete | actual departure time (HHMM format) |
| sched_dep_time | numerical discrete | scheduled departure time (HHMM format) |
| dep_delay | numerical discrete | departure delay in minutes |
| arr_time | numerical discrete | actual arrival time (HHMM format) |
| sched_arr_time | numerical discrete | scheduled arrival time (HHMM format) |
| arr_delay | numerical discrete | arrival delay in minutes |
| carrier | categorical nominal | two-letter carrier abbreviation |
| flight | categorical nominal | flight number |
| tailnum | categorical nominal | plane tail number (like a license plate for the aircraft) |
| origin | categorical nominal | origin airport code (EWR, JFK, or LGA) |
| dest | categorical nominal | destination airport code |
| air_time | numerical discrete | time spent in the air, in minutes |
| distance | numerical discrete | flight distance in miles |
| hour | numerical discrete | scheduled departure hour |
| minute | numerical discrete | scheduled departure minute |
| time_hour | datetime variable/numerical continuous | scheduled date and hour of departure as a date-time |
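As a quick sanity check on the data dictionary above, the hour and minute columns should just be the scheduled departure time (sched_dep_time, stored as HHMM) split into its components. A minimal sketch of that check, assuming flights and dplyr are loaded as in the code above:
# sketch: hour and minute should recombine into sched_dep_time (HHMM)
flights %>%
  mutate(recombined = hour * 100 + minute) %>%
  summarize(all_match = all(recombined == sched_dep_time))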
In the code above, we frequently used not_cancelled, rather than
flights as our data. How did this simplify our code? Think especially
about the functions we used within summarise().
The object ‘not_cancelled’ contains only the flights that were not cancelled, i.e., flights that actually took off and landed.
One might think about using the group_by() function to separate out the cancelled flights instead, but that would require sub-grouping on several variables at once. It is simpler to create a single object that stores all of the data for non-cancelled flights (here, the ‘not_cancelled’ object) and reuse it; this keeps each summarise() call short because we no longer need to handle missing delay values inside it.
The mean() function was then used within the summarize() function to calculate the average delay per day. The mean was computed within the groups created by group_by() (one group for each year, month, and day combination). If the delay variables (dep_delay and arr_delay) were added to group_by() along with year, month, and day, R would create a separate group for every distinct delay value within each day, which is not what we want. Since our objective is the average delay per day, it makes sense to group only by the year, month, and day variables and nothing else.
# Here is the object 'not_cancelled' being created
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay)) # keep flights with non-missing departure and arrival delays, i.e., flights that were not cancelled
# Showing average delays. avg_delay2 shows average for delayed flights only
not_cancelled %>%
group_by(.,year, month, day) %>%
summarize(.,
avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive (late) arrival delay
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day avg_delay1 avg_delay2
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 12.7 32.5
## 2 2013 1 2 12.7 32.0
## 3 2013 1 3 5.73 27.7
## 4 2013 1 4 -1.93 28.3
## 5 2013 1 5 -1.53 22.6
## 6 2013 1 6 4.24 24.4
## 7 2013 1 7 -4.95 27.8
## 8 2013 1 8 -3.23 20.8
## 9 2013 1 9 -0.264 25.6
## 10 2013 1 10 -5.90 27.3
## # ℹ 355 more rows
You might suspect that there is a relationship between the average delay (on a given day) and the proportion of flights that are cancelled on that day. For example, if there is bad weather, many flights might start off delayed, but then end up cancelled. Let’s test this intuition. First, find the average delay and proportion of flights cancelled each day. Second, plot them against one another and comment on the relationship. Did our intuition hold?
Answer: Yes, our intuition holds: days with a higher average delay also tend to have a higher proportion of cancelled flights. This can be seen in the scatter plot below, which shows that as a day's average departure delay increases, the proportion of flights cancelled that day also increases. I reached this answer by first finding the average delay and the proportion of cancelled flights for each day, and then using geom_point() to plot the two against each other and look for a relationship.
NOTE: I used the ‘dep_delay’ variable to calculate the average delay because a cancelled flight never has an arrival delay, so using ‘arr_delay’ here would not be useful.
#avg delays and proportion of flights cancelled
flights %>%
group_by(.,year, month, day) %>%
mutate(cancelled = ifelse((is.na(dep_delay) & is.na(arr_delay)),1,0)) %>%
summarize(.,
avg_delay1 = mean(dep_delay,na.rm=TRUE),
flights_proportion = 100*mean(cancelled)) %>%
filter(flights_proportion < 10) %>%
ggplot(mapping = aes(x = avg_delay1, y = flights_proportion)) +
geom_point(alpha=1/10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
In the scatter plot above I used the filter() function to remove a few extreme outlier days so that they would not dominate the plot and distort my observations. For comparison, this is what the graph looks like without the filter step.
flights %>%
group_by(.,year, month, day) %>%
mutate(cancelled = ifelse((is.na(dep_delay) & is.na(arr_delay)),1,0)) %>%
summarize(.,
avg_delay1 = mean(dep_delay,na.rm=TRUE),
flights_proportion = 100*mean(cancelled)) %>%
ggplot(mapping = aes(x = avg_delay1, y = flights_proportion)) +
geom_point(alpha=1/10)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
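To put a number on the relationship (a sketch that reuses the same daily grouping and cancellation definition as the code above; the daily_stats name is just illustrative), the correlation between the two daily measures can be computed directly:
# sketch: correlation between a day's average departure delay and its cancellation proportion
daily_stats <- flights %>%
  group_by(year, month, day) %>%
  summarize(
    avg_delay1 = mean(dep_delay, na.rm = TRUE),
    flights_proportion = 100 * mean(is.na(dep_delay) & is.na(arr_delay)),
    .groups = "drop"
  )
cor(daily_stats$avg_delay1, daily_stats$flights_proportion)
A clearly positive value here would back up the pattern seen in the scatter plots.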
No one likes to be delayed when flying. To try and avoid this, you might wonder what hour of the day is least likely to have a departure delay. What hour is it? Also, compute the percentage of flights that leave on time or early in each hour (i.e., the flights you want to find!). What hour of the day are you most likely to find these flights?
Answer: Flights departing around 6-7 A.M. are the least likely to have a departure delay; those early-morning hours also have the highest percentage of flights leaving on time or early. I reached this conclusion from the data points plotted in the scatter plot produced by the code below.
I used the ‘not_cancelled’ object because it contains only non-cancelled flights, so I do not need to worry about blank (NA) values when filtering the data further.
I then used the mutate() function to add a column that records, on a 1-0 scale, whether each flight left on time or early (this is helpful for finding the proportion of such flights). I used the ifelse() function with the condition (dep_delay <= 0) to pick out these flights, since negative dep_delay values represent an early departure.
The summarize() function was used to calculate the proportion of on-time or early departures within each hour group created by the group_by() function.
I stored that percentage in the variable ‘ontime_proportion’ and plotted it on the y-axis of the scatter plot.
NOTE: since the question only asked for departure delays, I only used the dep_delay variable and not the arr_delay variable.
# code for scatterplot of early flight proportion for each hour
not_cancelled %>%
group_by(.,hour) %>%
mutate(flight_early = ifelse((dep_delay <= 0),1,0)) %>%
summarize(.,
ontime_proportion = 100*mean(flight_early)) %>%
ggplot(mapping = aes(x = hour, y = ontime_proportion)) +
geom_point()
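To read off the best hour directly instead of eyeballing the plot (a sketch using the same on-time definition as above), the grouped proportions can simply be sorted:
# sketch: hours ranked by on-time/early departure proportion, highest first
not_cancelled %>%
  group_by(hour) %>%
  summarize(ontime_proportion = 100 * mean(dep_delay <= 0)) %>%
  arrange(desc(ontime_proportion))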
Which carriers are most likely to have a departure delay of at least 30 minutes? Hint: using the ifelse() function may be helpful
Answer: ExpressJet (EV) is the carrier most likely to have a departure delay of at least 30 minutes, with a proportion of roughly 26%. Mesa Airlines (YV) comes in second at roughly 23%. I reached this conclusion from the scatter plot produced by the code below, and I also displayed the same data as a bar chart for a second visualization.
I used the ‘not_cancelled’ object because it contains only non-cancelled flights, so I do not need to worry about blank (NA) values when filtering the data further.
I then used the mutate() function to add a column that records, on a 1-0 scale, whether each flight departed 30 or more minutes late (this is helpful for finding the proportion of such flights). I used the ifelse() function with the condition (dep_delay >= 30) to pick out those departures.
The summarize() function was used to calculate the proportion of 30+ minute departure delays within each carrier group created by the group_by() function. I stored that percentage in the variable ‘f_30minlate_proportion’ and plotted it on the y-axis.
NOTE: since the question only asked for departure delays, I only used the dep_delay variable and not the arr_delay variable.
# code for scatterplot of 30 min late flight proportion sorted by each carrier
not_cancelled %>%
group_by(.,carrier) %>%
mutate(flight_30minlate = ifelse((dep_delay >= 30),1,0)) %>%
summarize(.,
f_30minlate_proportion = 100*mean(flight_30minlate)) %>%
ggplot(mapping = aes(x = carrier, y = f_30minlate_proportion)) +
geom_point()
# A bar chart of 30 min late flight proportion, sorted by each carrier
not_cancelled %>%
group_by(.,carrier) %>%
mutate(flight_30minlate = ifelse((dep_delay >= 30),1,0)) %>%
summarize(.,
f_30minlate_proportion = 100*mean(flight_30minlate)) %>%
ggplot(mapping = aes(x = carrier, y = f_30minlate_proportion)) +
geom_bar(stat='identity')
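The same proportions can also be ranked numerically to confirm which carriers top the list (a sketch using the same 30-minute cutoff as above):
# sketch: carriers ranked by proportion of departures delayed 30+ minutes, highest first
not_cancelled %>%
  group_by(carrier) %>%
  summarize(f_30minlate_proportion = 100 * mean(dep_delay >= 30)) %>%
  arrange(desc(f_30minlate_proportion))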
What destination has the smallest average arrival delay?
Answer: Lexington (LEX) has the smallest average arrival delay, at about -22 minutes (i.e., flights to LEX arrive early on average). I got this answer from the code below.
I grouped everything by destination, since the question asks for the average arrival delay of each destination. I then computed the mean arrival delay for each destination with the summarize() function and stored that value in the variable ‘arrivaldelay_average’.
NOTE: since the question only asked for arrival delays, I only used the ‘arr_delay’ variable and not the ‘dep_delay’ variable.
# The code I tried first. The plot was too crowded to read, so below I added
# the filter() function to isolate the smallest average
not_cancelled %>%
group_by(.,dest) %>%
summarize(.,
arrivaldelay_average = mean(arr_delay)) %>%
ggplot(mapping = aes(x = dest, y = arrivaldelay_average)) +
geom_point()
# The code that displays the smallest average of arrivals
# and the destination that this value came from
not_cancelled %>%
group_by(.,dest) %>%
summarize(.,
arrivaldelay_average = mean(arr_delay)) %>%
filter(arrivaldelay_average < -20) %>%
ggplot(mapping = aes(x = dest, y = arrivaldelay_average)) +
geom_point()
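Rather than reading the minimum off the plot, the destination with the smallest average arrival delay can also be pulled out directly (a minimal sketch; slice_min() assumes dplyr 1.0 or later):
# sketch: destination with the smallest average arrival delay
not_cancelled %>%
  group_by(dest) %>%
  summarize(arrivaldelay_average = mean(arr_delay)) %>%
  slice_min(arrivaldelay_average, n = 1)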
BONUS: Load the Lahman library, which contains data on baseball players and their batting averages. First, convert it to a tibble (the tidyverse data structure we’ll cover in a future lecture) by calling: batting <- as_tibble(Lahman::Batting). Remember that with a built-in R dataset like Batting, you can write ?Batting in the R Console to display the help file, which will explain what the variables mean.
The Batting table comes from the Lahman Baseball Database, which provides comprehensive season-by-season batting statistics (along with many other tables) for Major League Baseball and is widely used for baseball analysis in R.
Batting <- as_tibble(Lahman::Batting)
cat('number of rows:',nrow(Batting), #115450
'\nnumber of columns:',ncol(Batting)) #22
## number of rows: 115450
## number of columns: 22
glimpse(Batting) # shows rows, columns, and a summary of each column
## Rows: 115,450
## Columns: 22
## $ playerID <chr> "aardsda01", "aardsda01", "aardsda01", "aardsda01", "aardsda0…
## $ yearID <int> 2004, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2015, 1954, 1…
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ teamID <fct> SFN, CHN, CHA, BOS, SEA, SEA, NYA, NYN, ATL, ML1, ML1, ML1, M…
## $ lgID <fct> NL, NL, AL, AL, AL, AL, AL, NL, NL, NL, NL, NL, NL, NL, NL, N…
## $ G <int> 11, 45, 25, 47, 73, 53, 1, 43, 33, 122, 153, 153, 151, 153, 1…
## $ AB <int> 0, 2, 0, 1, 0, 0, 0, 0, 1, 468, 602, 609, 615, 601, 629, 590,…
## $ R <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 58, 105, 106, 118, 109, 116, 102, …
## $ H <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 131, 189, 200, 198, 196, 223, 172,…
## $ X2B <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 27, 37, 34, 27, 34, 46, 20, 39, 28…
## $ X3B <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 9, 14, 6, 4, 7, 11, 10, 6, 4, 2…
## $ HR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 27, 26, 44, 30, 39, 40, 34, 45…
## $ RBI <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 69, 106, 92, 132, 95, 123, 126, 12…
## $ SB <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 2, 1, 4, 8, 16, 21, 15, 31, …
## $ CS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 4, 1, 1, 0, 7, 9, 7, 5, 4, 4…
## $ BB <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 28, 49, 37, 57, 59, 51, 60, 56, 66…
## $ SO <int> 0, 0, 0, 1, 0, 0, 0, 0, 1, 39, 61, 54, 58, 49, 54, 63, 64, 73…
## $ IBB <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 5, 6, 15, 16, 17, 13, 20, 14, …
## $ HBP <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 0, 1, 4, 2, 2, 3, 0, 0, 1…
## $ SH <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 6, 7, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 7, 3, 3, 9, 12, 9, 6, 5, 2, …
## $ GIDP <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 20, 21, 13, 21, 19, 8, 16, 14,…
summary(Batting) # summary statistics for each column
## playerID yearID stint teamID lgID
## Length:115450 Min. :1871 Min. :1.000 CHN : 5297 AA: 1893
## Class :character 1st Qu.:1940 1st Qu.:1.000 PHI : 5173 AL:53424
## Mode :character Median :1980 Median :1.000 PIT : 5166 FL: 472
## Mean :1970 Mean :1.082 SLN : 5051 NA: 737
## 3rd Qu.:2005 3rd Qu.:1.000 CIN : 4978 NL:58441
## Max. :2024 Max. :5.000 CLE : 4882 PL: 149
## (Other):84903 UA: 334
## G AB R H
## Min. : 1.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.: 3.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 34.00 Median : 43.0 Median : 3.00 Median : 8.00
## Mean : 50.28 Mean :136.9 Mean : 18.17 Mean : 35.67
## 3rd Qu.: 77.00 3rd Qu.:219.0 3rd Qu.: 26.00 3rd Qu.: 54.00
## Max. :165.00 Max. :716.0 Max. :198.00 Max. :262.00
##
## X2B X3B HR RBI
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 1.00 Median : 0.000 Median : 0.00 Median : 3.00
## Mean : 6.12 Mean : 1.199 Mean : 2.88 Mean : 16.54
## 3rd Qu.: 9.00 3rd Qu.: 1.000 3rd Qu.: 2.00 3rd Qu.: 23.00
## Max. :67.00 Max. :36.000 Max. :73.00 Max. :191.00
## NA's :756
## SB CS BB SO
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 1.00
## Median : 0.000 Median : 0.000 Median : 2.00 Median : 8.00
## Mean : 2.851 Mean : 1.128 Mean : 12.63 Mean : 20.82
## 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 17.00 3rd Qu.: 29.00
## Max. :138.000 Max. :42.000 Max. :232.00 Max. :223.00
## NA's :2368 NA's :23542 NA's :2100
## IBB HBP SH SF
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0
## Median : 0.00 Median : 0.000 Median : 0.000 Median : 0
## Mean : 0.99 Mean : 1.069 Mean : 2.082 Mean : 1
## 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 2.000 3rd Qu.: 1
## Max. :120.00 Max. :51.000 Max. :67.000 Max. :19
## NA's :36651 NA's :2816 NA's :6068 NA's :36104
## GIDP
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 2.828
## 3rd Qu.: 4.000
## Max. :36.000
## NA's :25442
Then find the players with the best or worst batting averages (batting average is simply the number of hits a player has, divided by the number of at bats they have). Why would this lead you astray?
Answer: When graphing this data, it is very hard to see which players have the best and worst batting averages, for two main reasons: there are far too many players to distinguish along the x-axis, and players with very few (or zero) at bats produce extreme or undefined averages that dominate the plot (the warning below about removed rows comes from players with zero at bats, whose average is 0/0).
I used the group_by() function to separate the rows by playerID so that I could calculate a batting average for each player. I then used the summarize() function to define a variable ‘batting_average’ as the number of hits a player has (H) divided by the number of at bats they have (AB).
Since ‘playerID’ is a categorical variable and ‘batting_average’ is a continuous variable, I used a scatter plot to display the data.
# players with best and worst averages
Batting %>%
group_by(., playerID) %>%
summarize(.,
batting_average = sum(H)/sum(AB)) %>%
ggplot(mapping = aes(x = playerID, y = batting_average)) +
geom_point()
## Warning: Removed 2765 rows containing missing values (`geom_point()`).
In the ‘Batting’ dataset there are players who only batted once or twice and got a hit every time (or never got a hit at all), so their perfect 1.000 (or 0.000) averages skew the results. We want to restrict attention to players with a larger at-bat sample size so that the averages we see are more informative.
Now condition on players who had at least 500 at bats. How would your answer change?
Answer: After filtering to players with at least 500 at bats, the perfect averages from one- or two-at-bat players disappear and the plot shows a much more realistic range of batting averages. The scatter plot is still crowded, but individual points are now easier to distinguish.
# players with at least 500 at bats
Batting %>%
group_by(.,playerID) %>%
filter(AB >= 500) %>%
summarize(.,
batting_average = H/AB) %>%
ggplot(mapping = aes(x = playerID, y = batting_average)) +
geom_point()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'playerID'. You can override using the
## `.groups` argument.
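The warning above appears because summarize() returns one row per qualifying season (H/AB for every 500+ at-bat season a player had) rather than one row per player. A sketch that avoids the warning, under the assumption that "at least 500 at bats" is meant as a career total, is to aggregate first and then filter:
# sketch: career batting averages for players with at least 500 career at bats
Batting %>%
  group_by(playerID) %>%
  summarize(
    career_AB = sum(AB, na.rm = TRUE),
    batting_average = sum(H, na.rm = TRUE) / career_AB
  ) %>%
  filter(career_AB >= 500) %>%
  arrange(desc(batting_average))
Sorting (or plotting) these filtered averages shows the best and worst hitters without the extreme 0 and 1.000 values that come from tiny samples.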