This RMarkdown contains:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(rmarkdown)
lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/Lahmans Databse .csv")
head(lahman_data)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 2000 AL ANA ANA W 3 162 81 82 80 N N N
## 2 2000 AL BAL BAL E 4 162 81 74 88 N N N
## 3 2000 AL BOS BOS E 2 162 81 85 77 N N N
## 4 2000 AL CHA CHW C 1 162 81 95 67 Y N N
## 5 2000 AL CLE CLE C 2 162 81 90 72 N N N
## 6 2000 AL DET DET C 3 162 81 79 83 N N N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 864 5628 1574 309 34 236 608 1024 93 52 47 43 869 805 5.00 5 3 46
## 2 N 794 5549 1508 310 22 184 558 900 126 65 49 54 913 855 5.37 14 6 33
## 3 N 792 5630 1503 316 32 167 611 1019 43 30 42 48 745 683 4.23 7 12 46
## 4 N 978 5646 1615 325 33 216 591 960 119 42 53 61 839 751 4.66 5 7 43
## 5 N 950 5683 1639 310 30 221 685 1057 113 34 51 52 816 775 4.84 6 5 34
## 6 N 823 5644 1553 307 41 177 562 982 83 38 43 49 827 755 4.71 6 6 44
## IPouts HA HRA BBA SOA E DP FP name
## 1 4344 1534 228 662 846 134 182 0.978 Anaheim Angels
## 2 4300 1547 202 665 1017 116 151 0.981 Baltimore Orioles
## 3 4358 1433 173 498 1121 109 120 0.982 Boston Red Sox
## 4 4351 1509 195 614 1037 133 190 0.978 Chicago White Sox
## 5 4327 1511 173 666 1213 72 147 0.988 Cleveland Indians
## 6 4330 1583 177 496 978 105 171 0.983 Detroit Tigers
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Edison International Field 2066982 102 103 ANA ANA
## 2 Oriole Park at Camden Yards 3297031 95 96 BAL BAL
## 3 Fenway Park II 2585895 104 103 BOS BOS
## 4 Comiskey Park II 1947799 102 102 CHW CHA
## 5 Jacobs Field 3456278 101 100 CLE CLE
## 6 Comerica Park 2438617 95 95 DET DET
## teamIDretro
## 1 ANA
## 2 BAL
## 3 BOS
## 4 CHA
## 5 CLE
## 6 DET
Question-1: Which division has scored most runs in 2000-2022? We will group_by year and div ID find out the mean of runs for each year.
lahman_data |>
group_by(yearID,divID) |>
summarize(Average_Runs = mean(R),.groups="keep") |>
arrange(desc(Average_Runs))
## # A tibble: 66 × 3
## # Groups: yearID, divID [66]
## yearID divID Average_Runs
## <int> <chr> <dbl>
## 1 2000 W 867.
## 2 2000 C 848.
## 3 2001 W 831
## 4 2007 E 810.
## 5 2006 E 807.
## 6 2019 W 800.
## 7 2003 E 799.
## 8 2019 E 798.
## 9 2002 W 785.
## 10 2000 E 784.
## # ℹ 56 more rows
The highest run scoring was seen in early 21st century in west and central division teams.
Question - 2: On similar lines lest find out the runs scored by teams from 2000-2022. Lets create a bins of c(600,650,700,750,800,850) and then label them with your own categories.
k <- lahman_data |>
group_by(franchID) |>
summarize(Average_Runs = mean(R)) |>
arrange(desc(Average_Runs))
k$Strength <- cut(k$Average_Runs,breaks = c(650,700,750,800,850), labels= c('Weak','Average','Strong','Very Strong'))
k
## # A tibble: 30 × 3
## franchID Average_Runs Strength
## <chr> <dbl> <fct>
## 1 BOS 806. Very Strong
## 2 NYY 806. Very Strong
## 3 COL 773. Strong
## 4 TEX 771. Strong
## 5 TOR 752 Strong
## 6 STL 744. Average
## 7 CLE 742. Average
## 8 ANA 732. Average
## 9 CHW 729. Average
## 10 MIN 728. Average
## # ℹ 20 more rows
Visualization - 1: Lets detect anomalies in average runs scored by all the teams from 2000 - 2022. Before creating anomalies lets look at the distributing of Average Runs scored by teams.
k |>
ggplot(aes(x = Average_Runs)) +
geom_histogram(color='white',fill='navyblue') +
labs(title = " Distribution of Average Runs Scored",x="Average Runs Scored",y="Frequency")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If we keep threshold for detecting anomalies as - Avg_Runs < 700 and > 800
k |>
filter(Average_Runs > 800 | Average_Runs <700) |>
ggplot(aes(x = Average_Runs)) +
geom_histogram(color='white',fill='navyblue') +
labs(title = " Distribution of Average Runs Scored",x="Average Runs Scored",y="Frequency")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above graph is the distribution of anomalies in Average Runs Scored.
On similar bases lets create the win% for all the teams in their particular years and find out the anomalies. Because Runs Scored only shows us the one side of the game. By calculating winning% we can actually find out the performance of the teams.
Question - 3: Group the data by year wise and calculate the winning percentage and find out the teams in each year from 2000-2022 who has winning % > 65%
teams <- lahman_data |>
mutate(winning_percentage = W/(W+L))|>
select(yearID,franchID,winning_percentage) |>
arrange(desc(winning_percentage))
teams_65 <- teams |>
filter(winning_percentage>0.65)
teams_65
## yearID franchID winning_percentage
## 1 2020 LAD 0.7166667
## 2 2001 SEA 0.7160494
## 3 2022 LAD 0.6851852
## 4 2020 TBD 0.6666667
## 5 2019 HOU 0.6604938
## 6 2021 SFG 0.6604938
## 7 2019 LAD 0.6543210
## 8 2021 LAD 0.6543210
## 9 2022 HOU 0.6543210
Los Angeles Dodgers (LAD) is the team which has winning percentage greater than 65% 4 times in 2019,2020,2021,2022…. A consistent performer and one of the potential winners for this season 2024. |
Visualization - 2: Lets find the anomalies in the win% distribution. Keeping threshold at less than 45%.
threshold <- 0.45
teams|>
ggplot(aes(x=winning_percentage))+
geom_histogram(fill='lightblue',color='black')+
labs(title = " Distribution of winning percentage",x="Winning Percentage",y="Frequency")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Lets print the anomalies in winning distribution table.
teams |>
filter(winning_percentage<threshold) |>
ggplot(aes(x=winning_percentage))+
geom_histogram(fill='lightblue',color='black')+
labs(title = " Distribution of winning percentage",x="Winning Percentage",y="Frequency")+
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Remember in the starting we found out the average runs scored by teams from 2000-2022 grouping by teamID. Lets use that data frame.
Grouped_K <-
k |>
group_by(Strength)|>
summarise(count=n())|>
mutate(probability = count/sum(count))
Grouped_K
## # A tibble: 4 × 3
## Strength count probability
## <fct> <int> <dbl>
## 1 Weak 9 0.3
## 2 Average 16 0.533
## 3 Strong 3 0.1
## 4 Very Strong 2 0.0667
Now we can interpret from the above output that, the very strong group is the smallest group and has lowest probability of getting selected if we randomly select a row from the k data set. Now let’s assign a lowest probability tag to the very strong group.
Lowest_Probability = min(Grouped_K$probability)
Highest_Probability = max(Grouped_K$probability)
Grouped_K <-
Grouped_K |>
mutate(Special_Tag = case_when(probability == Lowest_Probability~ "Smallest Group",probability == Highest_Probability~ "Highest Group", TRUE ~ "Other Groups"))
Grouped_K
## # A tibble: 4 × 4
## Strength count probability Special_Tag
## <fct> <int> <dbl> <chr>
## 1 Weak 9 0.3 Other Groups
## 2 Average 16 0.533 Highest Group
## 3 Strong 3 0.1 Other Groups
## 4 Very Strong 2 0.0667 Smallest Group
Now we have assigned the tags for highest and lowest probabilities. Let’s merge this data to the original data set (k).
k
## # A tibble: 30 × 3
## franchID Average_Runs Strength
## <chr> <dbl> <fct>
## 1 BOS 806. Very Strong
## 2 NYY 806. Very Strong
## 3 COL 773. Strong
## 4 TEX 771. Strong
## 5 TOR 752 Strong
## 6 STL 744. Average
## 7 CLE 742. Average
## 8 ANA 732. Average
## 9 CHW 729. Average
## 10 MIN 728. Average
## # ℹ 20 more rows
k_tag <- left_join(k, Grouped_K, by = "Strength") |>
select(franchID, Average_Runs, Strength, probability, Special_Tag)
k_tag
## # A tibble: 30 × 5
## franchID Average_Runs Strength probability Special_Tag
## <chr> <dbl> <fct> <dbl> <chr>
## 1 BOS 806. Very Strong 0.0667 Smallest Group
## 2 NYY 806. Very Strong 0.0667 Smallest Group
## 3 COL 773. Strong 0.1 Other Groups
## 4 TEX 771. Strong 0.1 Other Groups
## 5 TOR 752 Strong 0.1 Other Groups
## 6 STL 744. Average 0.533 Highest Group
## 7 CLE 742. Average 0.533 Highest Group
## 8 ANA 732. Average 0.533 Highest Group
## 9 CHW 729. Average 0.533 Highest Group
## 10 MIN 728. Average 0.533 Highest Group
## # ℹ 20 more rows
Finally, we have merged the special tag into the original data frame
k and named it has k_tag. Now let’s try to interpret why the occurrence
of some groups is rare compared to others.
If we see the means of smallest groups then we can see they have the
highest mean. BOS and NYY has approx 805 average runs scored from
2000-2022. It very rare to have highest means for all the groups. Hence
the occurrence is very rare for smallest groups.
In terms of probability:
Probability of randomly selecting a row from smallest group - that is very strong team is - Prob(very strong) = 0.067. On the other hand, Prob(Average) = 0.53.
We can interpret that, every time we randomly select a row there is 53% chance that the record is from Average teams whose average_runs scored is between - (700-750]. Similarly, only 7% of chance of randomly selecting a very strong team from the data set.
Testable Hypothesis: As the number of runs scored by the teams increases, the corresponding average is expected to rise, placing it in lower probability categories.
k_tag |>
ggplot(aes(x=Special_Tag,y=Average_Runs,color=Special_Tag))+
geom_boxplot()+
theme_classic()
Well above visualization can show us the difference between the three groups (Highest, Other and Smallest groups).
One issue we might get is we have 20 years of data. We can’t directly add the W and L of all those years for each team and find out the winning percentage (W/W+L). Before performing this action we need to find out the average wins acquired by the team per season. For this we can group by franchID and summarise the mean(W).
Teams_Avg <- lahman_data |>
group_by(franchID)|>
summarise(Mean_Wins = mean(W), Mean_Loss = mean(L))
Teams_Avg
## # A tibble: 30 × 3
## franchID Mean_Wins Mean_Loss
## <chr> <dbl> <dbl>
## 1 ANA 82.6 74.7
## 2 ARI 75.8 81.6
## 3 ATL 85.0 72.2
## 4 BAL 71.2 86.1
## 5 BOS 85.4 72.0
## 6 CHC 77.6 79.6
## 7 CHW 79.4 78.0
## 8 CIN 74.3 83.0
## 9 CLE 81.2 76.0
## 10 COL 72.6 84.7
## # ℹ 20 more rows
The above result shows us the average wins and losses a team is acquiring in a season. Now lets calculate the win% and create three Bins and try to group them into Best Performers and Under Performers.
Teams_Avg<-
Teams_Avg |>
mutate(Win_Percentage= round((Mean_Wins)/(Mean_Wins+Mean_Loss),2))|>
arrange(desc(Win_Percentage))
Teams_Avg
## # A tibble: 30 × 4
## franchID Mean_Wins Mean_Loss Win_Percentage
## <chr> <dbl> <dbl> <dbl>
## 1 NYY 91.4 65.8 0.58
## 2 LAD 88.6 68.7 0.56
## 3 STL 88.0 69.2 0.56
## 4 ATL 85.0 72.2 0.54
## 5 BOS 85.4 72.0 0.54
## 6 ANA 82.6 74.7 0.53
## 7 SFG 82.7 74.5 0.53
## 8 CLE 81.2 76.0 0.52
## 9 OAK 82.5 74.8 0.52
## 10 HOU 79.5 77.9 0.51
## # ℹ 20 more rows
Teams_Avg$Performance = cut(Teams_Avg$Win_Percentage,breaks = c(0.0,0.45,0.5,Inf),labels = c("Under Performers","Avg Performer","Best Performers"))
Teams_Avg
## # A tibble: 30 × 5
## franchID Mean_Wins Mean_Loss Win_Percentage Performance
## <chr> <dbl> <dbl> <dbl> <fct>
## 1 NYY 91.4 65.8 0.58 Best Performers
## 2 LAD 88.6 68.7 0.56 Best Performers
## 3 STL 88.0 69.2 0.56 Best Performers
## 4 ATL 85.0 72.2 0.54 Best Performers
## 5 BOS 85.4 72.0 0.54 Best Performers
## 6 ANA 82.6 74.7 0.53 Best Performers
## 7 SFG 82.7 74.5 0.53 Best Performers
## 8 CLE 81.2 76.0 0.52 Best Performers
## 9 OAK 82.5 74.8 0.52 Best Performers
## 10 HOU 79.5 77.9 0.51 Best Performers
## # ℹ 20 more rows
Now let’s add the probabilities for each group.
Teams_Avg_Tag <-
Teams_Avg |>
group_by(Performance)|>
summarise(Count = n())|>
mutate(Probability = round(Count/sum(Count),2))
Teams_Avg_Tag
## # A tibble: 3 × 3
## Performance Count Probability
## <fct> <int> <dbl>
## 1 Under Performers 3 0.1
## 2 Avg Performer 16 0.53
## 3 Best Performers 11 0.37
From the above result we interpret that, its high likely (53%) to select an average performers from the data set with an random selection, followed by best performers (37%) and under performers(10%). Add these probabilities to original data.
High_Prob = max(Teams_Avg_Tag$Probability)
Low_Prob = min(Teams_Avg_Tag$Probability)
Teams_Avg_Tag <- Teams_Avg_Tag|>
mutate(Occurence = case_when(Probability == Low_Prob~ "Small Group",Probability == High_Prob~ "High Group", TRUE ~ "Other Groups"))
Teams_Avg_Tag
## # A tibble: 3 × 4
## Performance Count Probability Occurence
## <fct> <int> <dbl> <chr>
## 1 Under Performers 3 0.1 Small Group
## 2 Avg Performer 16 0.53 High Group
## 3 Best Performers 11 0.37 Other Groups
Teams_tag <- left_join(Teams_Avg,Teams_Avg_Tag, by='Performance')|>
select(franchID,Win_Percentage,Probability,Performance,Occurence)
Teams_tag
## # A tibble: 30 × 5
## franchID Win_Percentage Probability Performance Occurence
## <chr> <dbl> <dbl> <fct> <chr>
## 1 NYY 0.58 0.37 Best Performers Other Groups
## 2 LAD 0.56 0.37 Best Performers Other Groups
## 3 STL 0.56 0.37 Best Performers Other Groups
## 4 ATL 0.54 0.37 Best Performers Other Groups
## 5 BOS 0.54 0.37 Best Performers Other Groups
## 6 ANA 0.53 0.37 Best Performers Other Groups
## 7 SFG 0.53 0.37 Best Performers Other Groups
## 8 CLE 0.52 0.37 Best Performers Other Groups
## 9 OAK 0.52 0.37 Best Performers Other Groups
## 10 HOU 0.51 0.37 Best Performers Other Groups
## # ℹ 20 more rows
Finally, we have merged the special tag into the original data frame
and named it has Teams_tag. Now let’s try to interpret why the
occurrence of some groups is rare compared to others.
In terms of probability:
Probability of randomly selecting a row from smallest group - that is Under performance team is - Prob(Under Performers) = 0.1. On the other hand, Prob(Avg Performers) = 0.53 and Prob(Best Performers) = 0.37.
We can interpret that, every time we randomly select a row there is 53% chance that the record is from Average teams whose win% is between 45%-50%. Similarly, only 10% of chance of randomly selecting a weak team from the data set.
Testable Hypothesis: As the number of runs scored by the teams increases and runs allowed decreases, the corresponding winning percentage is expected to rise, placing it in lower probability categories.
Teams_tag |>
ggplot(aes(x=Win_Percentage,color=Occurence))+
geom_histogram(binwidth = 0.01,fill='white')+
theme_classic()
Coming to the final question, we were asked to perform some analysis on two categorical variables. I am using division and World Series Winner.
Lets find out from 2000-2022 which division has produced world series winners. Remember we have three divisions - East, West, Central.
Div_WS <-
lahman_data|>
group_by(divID)|>
summarise(WS_Wins = sum(WSWin=='Y'))
Div_WS
## # A tibble: 3 × 2
## divID WS_Wins
## <chr> <int>
## 1 C 5
## 2 E 9
## 3 W 8
Now lets add the probability of finding a winner from each division.
Div_WS <-
Div_WS|>
mutate(Probability = round(WS_Wins/sum(WS_Wins),2))
Div_WS
## # A tibble: 3 × 3
## divID WS_Wins Probability
## <chr> <int> <dbl>
## 1 C 5 0.23
## 2 E 9 0.41
## 3 W 8 0.36
By looking at the above results we can interpret two thing - if we want pick a world series winner randomly from the three divisions there is an high chance (41%) we might get a winner from East Division, followed by west and central. We can further interpret or predict that the next season 2023 winner might be an East division team. East and West division team have higher chances of winning the 2023 season based on the past data.
2023 world series winner is Texas Rangers they are from west division.
Div_WS |>
ggplot(aes(x=divID,y=Probability,fill=divID))+
geom_bar(stat='identity')+
ggtitle("Probability of World Series winners from Each Division")+
geom_text(aes(label = Probability), size = 6, hjust = 0.5, vjust = 3)+
theme_classic()
Noe coming to missing values - In an year there can be only one winner and that winner can be only from any one of these divisions. It is quite obvious to have null values in all three groups depending on the winner for that season. (Null as in - WSWIN == N instead of Y)
Similarly, we can find out the world series winners from which league teams (American or national league teams) and find out the probabilities.