This RMarkdown contains:

  1. Finding out the anomalies in Average runs scored and winning percentage by defining our own threshold. (visualizing them using graphs)
  2. Probability concepts on Average runs scored and winning percentage group by franchID and finding out the under, avg and best performing teams from 2000-2022. (visualizing them using graphs)
  3. Defining testable hypothesis from above questions.
  4. Finally, we will walk through analyzing two categorical variable and why there are some null values.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(rmarkdown)
lahman_data = read.csv("/Users/anuragreddy/Desktop/Statistics with R/Lahmans Databse .csv")
head(lahman_data)
##   yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1   2000   AL    ANA      ANA     W    3 162    81 82 80      N     N     N
## 2   2000   AL    BAL      BAL     E    4 162    81 74 88      N     N     N
## 3   2000   AL    BOS      BOS     E    2 162    81 85 77      N     N     N
## 4   2000   AL    CHA      CHW     C    1 162    81 95 67      Y     N     N
## 5   2000   AL    CLE      CLE     C    2 162    81 90 72      N     N     N
## 6   2000   AL    DET      DET     C    3 162    81 79 83      N     N     N
##   WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     N 864 5628 1574 309  34 236 608 1024  93 52  47 43 869 805 5.00  5   3 46
## 2     N 794 5549 1508 310  22 184 558  900 126 65  49 54 913 855 5.37 14   6 33
## 3     N 792 5630 1503 316  32 167 611 1019  43 30  42 48 745 683 4.23  7  12 46
## 4     N 978 5646 1615 325  33 216 591  960 119 42  53 61 839 751 4.66  5   7 43
## 5     N 950 5683 1639 310  30 221 685 1057 113 34  51 52 816 775 4.84  6   5 34
## 6     N 823 5644 1553 307  41 177 562  982  83 38  43 49 827 755 4.71  6   6 44
##   IPouts   HA HRA BBA  SOA   E  DP    FP              name
## 1   4344 1534 228 662  846 134 182 0.978    Anaheim Angels
## 2   4300 1547 202 665 1017 116 151 0.981 Baltimore Orioles
## 3   4358 1433 173 498 1121 109 120 0.982    Boston Red Sox
## 4   4351 1509 195 614 1037 133 190 0.978 Chicago White Sox
## 5   4327 1511 173 666 1213  72 147 0.988 Cleveland Indians
## 6   4330 1583 177 496  978 105 171 0.983    Detroit Tigers
##                          park attendance BPF PPF teamIDBR teamIDlahman45
## 1  Edison International Field    2066982 102 103      ANA            ANA
## 2 Oriole Park at Camden Yards    3297031  95  96      BAL            BAL
## 3              Fenway Park II    2585895 104 103      BOS            BOS
## 4            Comiskey Park II    1947799 102 102      CHW            CHA
## 5                Jacobs Field    3456278 101 100      CLE            CLE
## 6               Comerica Park    2438617  95  95      DET            DET
##   teamIDretro
## 1         ANA
## 2         BAL
## 3         BOS
## 4         CHA
## 5         CLE
## 6         DET

Question-1: Which division has scored most runs in 2000-2022? We will group_by year and div ID find out the mean of runs for each year.

lahman_data |>
  group_by(yearID,divID) |>
  summarize(Average_Runs = mean(R),.groups="keep") |>
  arrange(desc(Average_Runs))
## # A tibble: 66 × 3
## # Groups:   yearID, divID [66]
##    yearID divID Average_Runs
##     <int> <chr>        <dbl>
##  1   2000 W             867.
##  2   2000 C             848.
##  3   2001 W             831 
##  4   2007 E             810.
##  5   2006 E             807.
##  6   2019 W             800.
##  7   2003 E             799.
##  8   2019 E             798.
##  9   2002 W             785.
## 10   2000 E             784.
## # ℹ 56 more rows

The highest run scoring was seen in early 21st century in west and central division teams.

Question - 2: On similar lines lest find out the runs scored by teams from 2000-2022. Lets create a bins of c(600,650,700,750,800,850) and then label them with your own categories.

k <- lahman_data |>
  group_by(franchID) |>
  summarize(Average_Runs = mean(R)) |>
  arrange(desc(Average_Runs)) 

k$Strength <- cut(k$Average_Runs,breaks = c(650,700,750,800,850), labels= c('Weak','Average','Strong','Very Strong'))
k
## # A tibble: 30 × 3
##    franchID Average_Runs Strength   
##    <chr>           <dbl> <fct>      
##  1 BOS              806. Very Strong
##  2 NYY              806. Very Strong
##  3 COL              773. Strong     
##  4 TEX              771. Strong     
##  5 TOR              752  Strong     
##  6 STL              744. Average    
##  7 CLE              742. Average    
##  8 ANA              732. Average    
##  9 CHW              729. Average    
## 10 MIN              728. Average    
## # ℹ 20 more rows

Visualization - 1: Lets detect anomalies in average runs scored by all the teams from 2000 - 2022. Before creating anomalies lets look at the distributing of Average Runs scored by teams.

k |>
  ggplot(aes(x = Average_Runs)) +
  geom_histogram(color='white',fill='navyblue') + 
  labs(title = " Distribution of Average Runs Scored",x="Average Runs Scored",y="Frequency")+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If we keep threshold for detecting anomalies as - Avg_Runs < 700 and > 800

k |>
  filter(Average_Runs > 800 | Average_Runs <700) |>
  ggplot(aes(x = Average_Runs)) +
  geom_histogram(color='white',fill='navyblue') + 
  labs(title = " Distribution of Average Runs Scored",x="Average Runs Scored",y="Frequency")+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The above graph is the distribution of anomalies in Average Runs Scored.

On similar bases lets create the win% for all the teams in their particular years and find out the anomalies. Because Runs Scored only shows us the one side of the game. By calculating winning% we can actually find out the performance of the teams.

Question - 3: Group the data by year wise and calculate the winning percentage and find out the teams in each year from 2000-2022 who has winning % > 65%

teams <- lahman_data |>
  mutate(winning_percentage = W/(W+L))|>
  select(yearID,franchID,winning_percentage) |>
  arrange(desc(winning_percentage)) 
  
teams_65 <- teams |>
  filter(winning_percentage>0.65) 
teams_65
##   yearID franchID winning_percentage
## 1   2020      LAD          0.7166667
## 2   2001      SEA          0.7160494
## 3   2022      LAD          0.6851852
## 4   2020      TBD          0.6666667
## 5   2019      HOU          0.6604938
## 6   2021      SFG          0.6604938
## 7   2019      LAD          0.6543210
## 8   2021      LAD          0.6543210
## 9   2022      HOU          0.6543210

Los Angeles Dodgers (LAD) is the team which has winning percentage greater than 65% 4 times in 2019,2020,2021,2022…. A consistent performer and one of the potential winners for this season 2024.

Visualization - 2: Lets find the anomalies in the win% distribution. Keeping threshold at less than 45%.

threshold <- 0.45
teams|>
  ggplot(aes(x=winning_percentage))+
  geom_histogram(fill='lightblue',color='black')+
  labs(title = " Distribution of winning percentage",x="Winning Percentage",y="Frequency")+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Lets print the anomalies in winning distribution table.

  teams |>
  filter(winning_percentage<threshold) |>
  ggplot(aes(x=winning_percentage))+
  geom_histogram(fill='lightblue',color='black')+
  labs(title = " Distribution of winning percentage",x="Winning Percentage",y="Frequency")+
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Till now we have used group_by function and summarized data and found out the anomalies in the data. Now lets try to question some probability concepts on the data.

Remember in the starting we found out the average runs scored by teams from 2000-2022 grouping by teamID. Lets use that data frame.

Grouped_K <-
  k |>
  group_by(Strength)|>
  summarise(count=n())|>
  mutate(probability = count/sum(count))
Grouped_K
## # A tibble: 4 × 3
##   Strength    count probability
##   <fct>       <int>       <dbl>
## 1 Weak            9      0.3   
## 2 Average        16      0.533 
## 3 Strong          3      0.1   
## 4 Very Strong     2      0.0667

Now we can interpret from the above output that, the very strong group is the smallest group and has lowest probability of getting selected if we randomly select a row from the k data set. Now let’s assign a lowest probability tag to the very strong group.

Lowest_Probability = min(Grouped_K$probability)
Highest_Probability = max(Grouped_K$probability)
Grouped_K <-
  Grouped_K |>
  mutate(Special_Tag = case_when(probability == Lowest_Probability~ "Smallest Group",probability == Highest_Probability~ "Highest Group", TRUE ~ "Other Groups"))

Grouped_K
## # A tibble: 4 × 4
##   Strength    count probability Special_Tag   
##   <fct>       <int>       <dbl> <chr>         
## 1 Weak            9      0.3    Other Groups  
## 2 Average        16      0.533  Highest Group 
## 3 Strong          3      0.1    Other Groups  
## 4 Very Strong     2      0.0667 Smallest Group

Now we have assigned the tags for highest and lowest probabilities. Let’s merge this data to the original data set (k).

k
## # A tibble: 30 × 3
##    franchID Average_Runs Strength   
##    <chr>           <dbl> <fct>      
##  1 BOS              806. Very Strong
##  2 NYY              806. Very Strong
##  3 COL              773. Strong     
##  4 TEX              771. Strong     
##  5 TOR              752  Strong     
##  6 STL              744. Average    
##  7 CLE              742. Average    
##  8 ANA              732. Average    
##  9 CHW              729. Average    
## 10 MIN              728. Average    
## # ℹ 20 more rows
k_tag <- left_join(k, Grouped_K, by = "Strength") |>
  select(franchID, Average_Runs, Strength, probability, Special_Tag)
k_tag
## # A tibble: 30 × 5
##    franchID Average_Runs Strength    probability Special_Tag   
##    <chr>           <dbl> <fct>             <dbl> <chr>         
##  1 BOS              806. Very Strong      0.0667 Smallest Group
##  2 NYY              806. Very Strong      0.0667 Smallest Group
##  3 COL              773. Strong           0.1    Other Groups  
##  4 TEX              771. Strong           0.1    Other Groups  
##  5 TOR              752  Strong           0.1    Other Groups  
##  6 STL              744. Average          0.533  Highest Group 
##  7 CLE              742. Average          0.533  Highest Group 
##  8 ANA              732. Average          0.533  Highest Group 
##  9 CHW              729. Average          0.533  Highest Group 
## 10 MIN              728. Average          0.533  Highest Group 
## # ℹ 20 more rows

Finally, we have merged the special tag into the original data frame k and named it has k_tag. Now let’s try to interpret why the occurrence of some groups is rare compared to others.

If we see the means of smallest groups then we can see they have the highest mean. BOS and NYY has approx 805 average runs scored from 2000-2022. It very rare to have highest means for all the groups. Hence the occurrence is very rare for smallest groups.

In terms of probability:

Probability of randomly selecting a row from smallest group - that is very strong team is - Prob(very strong) = 0.067. On the other hand, Prob(Average) = 0.53.

We can interpret that, every time we randomly select a row there is 53% chance that the record is from Average teams whose average_runs scored is between - (700-750]. Similarly, only 7% of chance of randomly selecting a very strong team from the data set.

Testable Hypothesis: As the number of runs scored by the teams increases, the corresponding average is expected to rise, placing it in lower probability categories.

k_tag |>
  ggplot(aes(x=Special_Tag,y=Average_Runs,color=Special_Tag))+
  geom_boxplot()+
  theme_classic()

Well above visualization can show us the difference between the three groups (Highest, Other and Smallest groups).

To identify the performance of the teams average runs scored is not the best metric. Lets use winning percentage as the metric and find out the probability of randomly selecting the high performer teams.

One issue we might get is we have 20 years of data. We can’t directly add the W and L of all those years for each team and find out the winning percentage (W/W+L). Before performing this action we need to find out the average wins acquired by the team per season. For this we can group by franchID and summarise the mean(W).

Teams_Avg <- lahman_data |>
  group_by(franchID)|>
  summarise(Mean_Wins = mean(W), Mean_Loss = mean(L))
Teams_Avg
## # A tibble: 30 × 3
##    franchID Mean_Wins Mean_Loss
##    <chr>        <dbl>     <dbl>
##  1 ANA           82.6      74.7
##  2 ARI           75.8      81.6
##  3 ATL           85.0      72.2
##  4 BAL           71.2      86.1
##  5 BOS           85.4      72.0
##  6 CHC           77.6      79.6
##  7 CHW           79.4      78.0
##  8 CIN           74.3      83.0
##  9 CLE           81.2      76.0
## 10 COL           72.6      84.7
## # ℹ 20 more rows

The above result shows us the average wins and losses a team is acquiring in a season. Now lets calculate the win% and create three Bins and try to group them into Best Performers and Under Performers.

Teams_Avg<-
  Teams_Avg |>
  mutate(Win_Percentage= round((Mean_Wins)/(Mean_Wins+Mean_Loss),2))|>
  arrange(desc(Win_Percentage))
Teams_Avg
## # A tibble: 30 × 4
##    franchID Mean_Wins Mean_Loss Win_Percentage
##    <chr>        <dbl>     <dbl>          <dbl>
##  1 NYY           91.4      65.8           0.58
##  2 LAD           88.6      68.7           0.56
##  3 STL           88.0      69.2           0.56
##  4 ATL           85.0      72.2           0.54
##  5 BOS           85.4      72.0           0.54
##  6 ANA           82.6      74.7           0.53
##  7 SFG           82.7      74.5           0.53
##  8 CLE           81.2      76.0           0.52
##  9 OAK           82.5      74.8           0.52
## 10 HOU           79.5      77.9           0.51
## # ℹ 20 more rows
Teams_Avg$Performance = cut(Teams_Avg$Win_Percentage,breaks = c(0.0,0.45,0.5,Inf),labels = c("Under Performers","Avg Performer","Best Performers"))
Teams_Avg
## # A tibble: 30 × 5
##    franchID Mean_Wins Mean_Loss Win_Percentage Performance    
##    <chr>        <dbl>     <dbl>          <dbl> <fct>          
##  1 NYY           91.4      65.8           0.58 Best Performers
##  2 LAD           88.6      68.7           0.56 Best Performers
##  3 STL           88.0      69.2           0.56 Best Performers
##  4 ATL           85.0      72.2           0.54 Best Performers
##  5 BOS           85.4      72.0           0.54 Best Performers
##  6 ANA           82.6      74.7           0.53 Best Performers
##  7 SFG           82.7      74.5           0.53 Best Performers
##  8 CLE           81.2      76.0           0.52 Best Performers
##  9 OAK           82.5      74.8           0.52 Best Performers
## 10 HOU           79.5      77.9           0.51 Best Performers
## # ℹ 20 more rows

Now let’s add the probabilities for each group.

Teams_Avg_Tag <-
  Teams_Avg |>
  group_by(Performance)|>
  summarise(Count = n())|>
  mutate(Probability = round(Count/sum(Count),2))
Teams_Avg_Tag  
## # A tibble: 3 × 3
##   Performance      Count Probability
##   <fct>            <int>       <dbl>
## 1 Under Performers     3        0.1 
## 2 Avg Performer       16        0.53
## 3 Best Performers     11        0.37

From the above result we interpret that, its high likely (53%) to select an average performers from the data set with an random selection, followed by best performers (37%) and under performers(10%). Add these probabilities to original data.

High_Prob = max(Teams_Avg_Tag$Probability)
Low_Prob = min(Teams_Avg_Tag$Probability)
Teams_Avg_Tag <- Teams_Avg_Tag|>
  mutate(Occurence = case_when(Probability == Low_Prob~ "Small Group",Probability == High_Prob~ "High Group", TRUE ~ "Other Groups"))

Teams_Avg_Tag
## # A tibble: 3 × 4
##   Performance      Count Probability Occurence   
##   <fct>            <int>       <dbl> <chr>       
## 1 Under Performers     3        0.1  Small Group 
## 2 Avg Performer       16        0.53 High Group  
## 3 Best Performers     11        0.37 Other Groups
Teams_tag <- left_join(Teams_Avg,Teams_Avg_Tag, by='Performance')|>
  select(franchID,Win_Percentage,Probability,Performance,Occurence)
Teams_tag
## # A tibble: 30 × 5
##    franchID Win_Percentage Probability Performance     Occurence   
##    <chr>             <dbl>       <dbl> <fct>           <chr>       
##  1 NYY                0.58        0.37 Best Performers Other Groups
##  2 LAD                0.56        0.37 Best Performers Other Groups
##  3 STL                0.56        0.37 Best Performers Other Groups
##  4 ATL                0.54        0.37 Best Performers Other Groups
##  5 BOS                0.54        0.37 Best Performers Other Groups
##  6 ANA                0.53        0.37 Best Performers Other Groups
##  7 SFG                0.53        0.37 Best Performers Other Groups
##  8 CLE                0.52        0.37 Best Performers Other Groups
##  9 OAK                0.52        0.37 Best Performers Other Groups
## 10 HOU                0.51        0.37 Best Performers Other Groups
## # ℹ 20 more rows

Finally, we have merged the special tag into the original data frame and named it has Teams_tag. Now let’s try to interpret why the occurrence of some groups is rare compared to others.

In terms of probability:

Probability of randomly selecting a row from smallest group - that is Under performance team is - Prob(Under Performers) = 0.1. On the other hand, Prob(Avg Performers) = 0.53 and Prob(Best Performers) = 0.37.

We can interpret that, every time we randomly select a row there is 53% chance that the record is from Average teams whose win% is between 45%-50%. Similarly, only 10% of chance of randomly selecting a weak team from the data set.

Testable Hypothesis: As the number of runs scored by the teams increases and runs allowed decreases, the corresponding winning percentage is expected to rise, placing it in lower probability categories.

Teams_tag |>
  ggplot(aes(x=Win_Percentage,color=Occurence))+
  geom_histogram(binwidth = 0.01,fill='white')+
  theme_classic()

Coming to the final question, we were asked to perform some analysis on two categorical variables. I am using division and World Series Winner.

Lets find out from 2000-2022 which division has produced world series winners. Remember we have three divisions - East, West, Central.

Div_WS <-
  lahman_data|>
  group_by(divID)|>
  summarise(WS_Wins = sum(WSWin=='Y'))
Div_WS
## # A tibble: 3 × 2
##   divID WS_Wins
##   <chr>   <int>
## 1 C           5
## 2 E           9
## 3 W           8

Now lets add the probability of finding a winner from each division.

Div_WS <-
  Div_WS|>
  mutate(Probability = round(WS_Wins/sum(WS_Wins),2))
Div_WS
## # A tibble: 3 × 3
##   divID WS_Wins Probability
##   <chr>   <int>       <dbl>
## 1 C           5        0.23
## 2 E           9        0.41
## 3 W           8        0.36

By looking at the above results we can interpret two thing - if we want pick a world series winner randomly from the three divisions there is an high chance (41%) we might get a winner from East Division, followed by west and central. We can further interpret or predict that the next season 2023 winner might be an East division team. East and West division team have higher chances of winning the 2023 season based on the past data.

2023 world series winner is Texas Rangers they are from west division.

Div_WS |>
  ggplot(aes(x=divID,y=Probability,fill=divID))+
  geom_bar(stat='identity')+
  ggtitle("Probability of World Series winners from Each Division")+
  geom_text(aes(label = Probability), size = 6, hjust = 0.5, vjust = 3)+
  theme_classic()

Noe coming to missing values - In an year there can be only one winner and that winner can be only from any one of these divisions. It is quite obvious to have null values in all three groups depending on the winner for that season. (Null as in - WSWIN == N instead of Y)

Similarly, we can find out the world series winners from which league teams (American or national league teams) and find out the probabilities.