week_4_Datadive

Read the CSV file

my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)

##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##                                                                   
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.584          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##                                        NA's   :13861           
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##                                                                       
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##                                                                          
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##                                                                           
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##                                                                             
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##                                                                           
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##                                                                   
##      Bowler        Player_Out        Fielders      Striker_match_SK
##  Min.   :  1.0   Min.   :  1.0    Min.   :  1.0    Min.   :12694   
##  1st Qu.: 77.0   1st Qu.: 41.0    1st Qu.: 47.0    1st Qu.:16173   
##  Median :174.0   Median :107.0    Median :111.0    Median :19672   
##  Mean   :194.1   Mean   :148.6    Mean   :155.4    Mean   :19675   
##  3rd Qu.:310.0   3rd Qu.:236.0    3rd Qu.:237.5    3rd Qu.:23127   
##  Max.   :497.0   Max.   :497.0    Max.   :497.0    Max.   :26685   
##                  NA's   :143013   NA's   :145100                   
##    StrikerSK     NonStriker_match_SK NONStriker_SK   Fielder_match_SK
##  Min.   :  0.0   Min.   :12694       Min.   :  0.0   Min.   :   -1   
##  1st Qu.: 39.0   1st Qu.:16173       1st Qu.: 39.0   1st Qu.:   -1   
##  Median : 95.0   Median :19672       Median : 95.0   Median :   -1   
##  Mean   :135.5   Mean   :19675       Mean   :134.6   Mean   :  690   
##  3rd Qu.:207.0   3rd Qu.:23127       3rd Qu.:207.0   3rd Qu.:   -1   
##  Max.   :496.0   Max.   :26685       Max.   :496.0   Max.   :26680   
##                                                                      
##    Fielder_SK      Bowler_match_SK   BOWLER_SK     PlayerOut_match_SK
##  Min.   : -1.000   Min.   :12697   Min.   :  0.0   Min.   :   -1.0   
##  1st Qu.: -1.000   1st Qu.:16175   1st Qu.: 76.0   1st Qu.:   -1.0   
##  Median : -1.000   Median :19674   Median :173.0   Median :   -1.0   
##  Mean   :  4.527   Mean   :19677   Mean   :193.1   Mean   :  970.3   
##  3rd Qu.: -1.000   3rd Qu.:23131   3rd Qu.:309.0   3rd Qu.:   -1.0   
##  Max.   :496.000   Max.   :26685   Max.   :496.0   Max.   :26685.0   
##                                                                      
##  BattingTeam_SK   BowlingTeam_SK    Keeper_Catch      Player_out_sk    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.000000   Min.   : -1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median : 4.000   Median : 4.000   Median :0.000000   Median :  0.000  
##  Mean   : 4.346   Mean   : 4.333   Mean   :0.000432   Mean   :  1.101  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :12.000   Max.   :12.000   Max.   :1.000000   Max.   :496.000  
##                                                                        
##   MatchDateSK      
##  Min.   :20080418  
##  1st Qu.:20100411  
##  Median :20120520  
##  Mean   :20125288  
##  3rd Qu.:20150420  
##  Max.   :20170521  
##

A collection of 5-10 random samples of data (with replacement) from at least 6 columns of data. Each subsample should be roughly 50% as long as the full data.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Set the number of subsamples you want
num_subsamples <- 5  # You can change this to 10 if needed

# Create an empty list to store the subsamples
subsamples_list <- list()

# Define the size of each subsample (roughly 50%)
sample_size <- floor(0.5 * nrow(my_data))

# Define the columns I want to include in the subsamples
selected_columns <- c("MatcH_id", "Over_id", "Ball_id", "Innings_No", "Team_Batting", "Team_Bowling")

# Create random subsamples and store them in the list
for (i in 1:num_subsamples) {
  # Sample rows with replacement
  subsample <- my_data %>%
    select(all_of(selected_columns)) %>%
    sample_n(size = sample_size, replace = TRUE)
  
  # Assign a unique name to each subsample
  subsample_name <- paste0("df_", i)
  
  # Store the subsample in the list
  subsamples_list[[subsample_name]] <- subsample
}

# Now I have a list of random subsamples, each stored as a separate data frame (df_1, df_2, etc.).
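
Because the rows are drawn at random, each run of the loop above produces different subsamples. A minimal sketch of a reproducible variant follows. It assumes base R's `sample()` in place of `sample_n()`, an arbitrary seed of 42, and a small made-up `toy` data frame standing in for `my_data`; none of these appear in the original analysis.

```r
# Fix the random number generator so the same rows are drawn on every run.
# The seed value is arbitrary (an assumption for illustration).
set.seed(42)

# Made-up stand-in for my_data: 20 overs of 6 balls each.
toy <- data.frame(
  Over_id = rep(1:20, each = 6),
  Ball_id = rep(1:6, times = 20)
)

num_subsamples <- 5
sample_size <- floor(0.5 * nrow(toy))

# Draw row indices with replacement, then subset the data frame.
subsamples <- lapply(seq_len(num_subsamples), function(i) {
  toy[sample(nrow(toy), size = sample_size, replace = TRUE), ]
})
names(subsamples) <- paste0("df_", seq_len(num_subsamples))
```

With the real data, replacing `toy` by `my_data[, selected_columns]` should give a list with the same shape as `subsamples_list`.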

To scrutinize the subsamples and understand their differences, similarities, and potential anomalies, you can perform various analyses and comparisons. Here are some steps and considerations:

1) Calculate summary statistics for each subsample, including measures like mean, median, standard deviation, and quartiles for numeric variables. For categorical variables, compute frequency tables.
2) Create visualizations to compare the distributions of numeric and categorical variables across subsamples.

# Load necessary libraries
library(dplyr)

# Define a function to calculate summary statistics
calculate_summary_statistics <- function(subsample) {
  # Identify numeric and categorical columns
  numeric_cols <- sapply(subsample, is.numeric)
  categorical_cols <- !numeric_cols
  
  # Calculate summary statistics for numeric columns
  # (drop = FALSE keeps a data frame even if only one column matches)
  numeric_summary <- summary(subsample[, numeric_cols, drop = FALSE])
  
  # Create frequency tables for categorical columns
  categorical_summary <- lapply(subsample[, categorical_cols, drop = FALSE], table)
  
  # Combine the results into a list
  summary_results <- list(numeric_summary = numeric_summary, categorical_summary = categorical_summary)
  
  return(summary_results)
}

# Create a list to store summary statistics for each subsample
summary_stats_list <- list()

# Calculate summary statistics for each subsample
for (i in 1:num_subsamples) {
  subsample_name <- paste0("df_", i)
  summary_stats <- calculate_summary_statistics(subsamples_list[[subsample_name]])
  summary_stats_list[[subsample_name]] <- summary_stats
}

# Now you have a list of summary statistics for each subsample.
# Display summary_stats_list
print(summary_stats_list)

## $df_1
## $df_1$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No  
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.00  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.00  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.00  
##  Mean   : 637059   Mean   :10.13   Mean   :3.609   Mean   :1.48  
##  3rd Qu.: 829744   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.00  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.00  
## 
## $df_1$categorical_summary
## $df_1$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7772                        2666 
##                          11                          12 
##                        3750                         802 
##                          13                           2 
##                         978                        8112 
##                           3                           4 
##                        7847                        7986 
##                           5                           6 
##                        6841                        7765 
##                           7                           8 
##                        8470                        4538 
##                           9            Delhi Daredevils 
##                         733                         815 
##               Gujarat Lions             Kings XI Punjab 
##                         841                         842 
##       Kolkata Knight Riders              Mumbai Indians 
##                         894                        1036 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         941                         748 
##         Sunrisers Hyderabad 
##                         848 
## 
## $df_1$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7796                        2773 
##                          11                          12 
##                        3750                         747 
##                          13                           2 
##                         894                        8240 
##                           3                           4 
##                        7679                        7931 
##                           5                           6 
##                        7047                        7795 
##                           7                           8 
##                        8351                        4461 
##                           9            Delhi Daredevils 
##                         796                         826 
##               Gujarat Lions             Kings XI Punjab 
##                         808                         807 
##       Kolkata Knight Riders              Mumbai Indians 
##                         978                        1003 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         972                         725 
##         Sunrisers Hyderabad 
##                         846 
## 
## 
## 
## $df_2
## $df_2$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 637101   Mean   :10.14   Mean   :3.618   Mean   :1.479  
##  3rd Qu.: 829744   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_2$categorical_summary
## $df_2$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7663                        2696 
##                          11                          12 
##                        3737                         775 
##                          13                           2 
##                         915                        8102 
##                           3                           4 
##                        7983                        8008 
##                           5                           6 
##                        6886                        7792 
##                           7                           8 
##                        8298                        4541 
##                           9            Delhi Daredevils 
##                         824                         837 
##               Gujarat Lions             Kings XI Punjab 
##                         856                         875 
##       Kolkata Knight Riders              Mumbai Indians 
##                         861                        1033 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         942                         802 
##         Sunrisers Hyderabad 
##                         799 
## 
## $df_2$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7612                        2774 
##                          11                          12 
##                        3643                         825 
##                          13                           2 
##                         960                        8184 
##                           3                           4 
##                        7625                        7905 
##                           5                           6 
##                        7117                        7801 
##                           7                           8 
##                        8452                        4521 
##                           9            Delhi Daredevils 
##                         801                         871 
##               Gujarat Lions             Kings XI Punjab 
##                         811                         760 
##       Kolkata Knight Riders              Mumbai Indians 
##                         942                        1039 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         967                         784 
##         Sunrisers Hyderabad 
##                         831 
## 
## 
## 
## $df_3
## $df_3$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419155   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548384   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 637957   Mean   :10.16   Mean   :3.613   Mean   :1.484  
##  3rd Qu.: 829746   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_3$categorical_summary
## $df_3$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7658                        2733 
##                          11                          12 
##                        3659                         794 
##                          13                           2 
##                         981                        7958 
##                           3                           4 
##                        7874                        7939 
##                           5                           6 
##                        6921                        7851 
##                           7                           8 
##                        8548                        4480 
##                           9            Delhi Daredevils 
##                         799                         788 
##               Gujarat Lions             Kings XI Punjab 
##                         850                         797 
##       Kolkata Knight Riders              Mumbai Indians 
##                         896                        1060 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         941                         804 
##         Sunrisers Hyderabad 
##                         894 
## 
## $df_3$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7837                        2814 
##                          11                          12 
##                        3682                         777 
##                          13                           2 
##                         965                        8168 
##                           3                           4 
##                        7618                        7742 
##                           5                           6 
##                        7147                        7762 
##                           7                           8 
##                        8383                        4461 
##                           9            Delhi Daredevils 
##                         839                         815 
##               Gujarat Lions             Kings XI Punjab 
##                         843                         836 
##       Kolkata Knight Riders              Mumbai Indians 
##                         921                        1051 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                        1010                         766 
##         Sunrisers Hyderabad 
##                         788 
## 
## 
## 
## $df_4
## $df_4$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636625   Mean   :10.16   Mean   :3.632   Mean   :1.483  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_4$categorical_summary
## $df_4$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7607                        2651 
##                          11                          12 
##                        3638                         871 
##                          13                           2 
##                        1018                        8080 
##                           3                           4 
##                        7876                        8097 
##                           5                           6 
##                        6946                        7717 
##                           7                           8 
##                        8636                        4495 
##                           9            Delhi Daredevils 
##                         735                         829 
##               Gujarat Lions             Kings XI Punjab 
##                         833                         794 
##       Kolkata Knight Riders              Mumbai Indians 
##                         899                        1022 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         878                         771 
##         Sunrisers Hyderabad 
##                         832 
## 
## $df_4$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7739                        2706 
##                          11                          12 
##                        3728                         785 
##                          13                           2 
##                         917                        8204 
##                           3                           4 
##                        7872                        7863 
##                           5                           6 
##                        7024                        7863 
##                           7                           8 
##                        8360                        4484 
##                           9            Delhi Daredevils 
##                         822                         778 
##               Gujarat Lions             Kings XI Punjab 
##                         821                         819 
##       Kolkata Knight Riders              Mumbai Indians 
##                         935                        1025 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         923                         751 
##         Sunrisers Hyderabad 
##                         806 
## 
## 
## 
## $df_5
## $df_5$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548383   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636858   Mean   :10.19   Mean   :3.608   Mean   :1.481  
##  3rd Qu.: 829744   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_5$categorical_summary
## $df_5$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7571                        2721 
##                          11                          12 
##                        3740                         772 
##                          13                           2 
##                         950                        8122 
##                           3                           4 
##                        7916                        8009 
##                           5                           6 
##                        6950                        7661 
##                           7                           8 
##                        8427                        4544 
##                           9            Delhi Daredevils 
##                         815                         848 
##               Gujarat Lions             Kings XI Punjab 
##                         867                         760 
##       Kolkata Knight Riders              Mumbai Indians 
##                         895                        1069 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         969                         788 
##         Sunrisers Hyderabad 
##                         831 
## 
## $df_5$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7888                        2789 
##                          11                          12 
##                        3466                         852 
##                          13                           2 
##                         923                        8241 
##                           3                           4 
##                        7731                        7783 
##                           5                           6 
##                        7102                        7751 
##                           7                           8 
##                        8388                        4490 
##                           9            Delhi Daredevils 
##                         794                         834 
##               Gujarat Lions             Kings XI Punjab 
##                         861                         865 
##       Kolkata Knight Riders              Mumbai Indians 
##                         939                        1046 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         948                         727 
##         Sunrisers Hyderabad 
##                         807
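
The printed summaries are long, which makes it hard to compare subsamples by eye. As a sketch under stated assumptions (the `toy` data frame, the seed, and the names `mean_table` and `spread` are all made up for illustration), the means of the numeric columns can be collected into one small table with one column per subsample:

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Made-up stand-in for the selected columns of my_data.
toy <- data.frame(
  Over_id = rep(1:20, each = 6),
  Ball_id = rep(1:6, times = 20),
  Team_Batting = rep(c("Team_A", "Team_B"), times = 60)
)

# Three subsamples drawn with replacement, each half the size of toy.
subsamples <- lapply(1:3, function(i) {
  toy[sample(nrow(toy), size = 60, replace = TRUE), ]
})
names(subsamples) <- paste0("df_", 1:3)

# One row per numeric column, one column per subsample.
mean_table <- sapply(subsamples, function(d) {
  colMeans(d[sapply(d, is.numeric)])
})

# Largest minus smallest mean for each column: a rough measure of how much
# the subsamples disagree.
spread <- apply(mean_table, 1, function(r) max(r) - min(r))

print(round(mean_table, 3))
print(round(spread, 3))
```

A small `spread` relative to the size of the means is what "consistent" subsamples would look like in this table.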

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# Create a histogram of Ball_id for each subsample, titled with the subsample name
# so the panels can be told apart
histograms <- lapply(names(subsamples_list), function(name) {
  ggplot(subsamples_list[[name]], aes(x = Ball_id)) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black") +
    labs(title = paste("Ball_id:", name),
         x = "Ball_id",
         y = "Frequency")
})

# Display histograms side by side
grid.arrange(grobs = histograms, ncol = num_subsamples)

library(ggplot2)
library(gridExtra)
# Create a bar chart of Team_Batting for each subsample, titled with the subsample name
bar_charts <- lapply(names(subsamples_list), function(name) {
  ggplot(subsamples_list[[name]], aes(x = Team_Batting, fill = Team_Batting)) +
    geom_bar() +
    labs(title = paste("Team_Batting:", name),
         x = "Team_Batting",
         y = "Frequency") +
    theme(legend.position = "none")
})

# Display bar charts side by side
grid.arrange(grobs = bar_charts, ncol = num_subsamples)

# Combine all subsamples into one data frame
combined_data <- do.call(rbind, subsamples_list)

# Create a bar chart for the combined data
ggplot(combined_data, aes(x = Team_Batting, fill = Team_Batting)) +
  geom_bar() +
  labs(title = "Bar Chart of Team_Batting (All Subsamples Combined)",
       x = "Team_Batting",
       y = "Frequency") +
  theme(legend.position = "none")

What would you have called an anomaly in one sub-sample that you wouldn’t in another?

I see no anomalies in my subsamples, as they are consistent with one another. In general, what counts as an anomaly depends on the type of data. As an example, consider the columns “MatcH_id,” “Over_id,” “Ball_id,” “Innings_No,” and “Team_Batting,” and how anomalies in each of these columns might be identified within two sub-samples:

Sub-Sample 1 (First Half of Matches): This sub-sample consists of data from the first half of matches.

Sub-Sample 2 (Second Half of Matches): This sub-sample consists of data from the second half of matches.
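
To make the idea concrete, here is a hedged sketch of how the same value can be flagged in one sub-sample and not in the other. It assumes that "first half" means matches at or below the median MatcH_id, uses the common 1.5 * IQR rule as the anomaly definition (a swapped-in rule, not one used elsewhere in this report), and runs on a made-up `toy` data frame; `flag_iqr` is a hypothetical helper.

```r
# Toy stand-in for the ball-by-ball table; all values are made up.
toy <- data.frame(
  MatcH_id = rep(1:6, each = 4),
  Runs_Scored = c(0, 1, 1, 2, 1, 0, 2, 1, 1, 1, 0, 4,
                  4, 6, 4, 6, 6, 4, 6, 4, 4, 6, 4, 6)
)

# Assumption: "first half" means matches at or below the median MatcH_id.
cutoff <- median(toy$MatcH_id)
first_half <- toy[toy$MatcH_id <= cutoff, ]
second_half <- toy[toy$MatcH_id > cutoff, ]

# Flag values outside the 1.5 * IQR fences of the vector they belong to.
flag_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

first_flags <- first_half$Runs_Scored[flag_iqr(first_half$Runs_Scored)]
second_flags <- second_half$Runs_Scored[flag_iqr(second_half$Runs_Scored)]

print(first_flags)
print(second_flags)
```

In this toy data a score of 4 is flagged in the low-scoring first sub-sample, while the same score is unremarkable in the second, because each sub-sample sets its own fences.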

The anomalies in each of these columns might be defined as :

MatcH_id:

Sub-Sample 1 (First Half of Matches): Anomalies could be MatcH_id values that are smaller than the median MatcH_id of the first half, as these indicate matches that occurred relatively early in the dataset.

Sub-Sample 2 (Second Half of Matches): Anomalies might be MatcH_id values that are greater than the median MatcH_id of the second half, representing matches that occurred relatively late in the dataset.

Over_id:

Sub-Sample 1 (First Half of Matches): Anomalies could be Over_id values that are unusually low compared to the average Over_id in the first half, indicating early overs in matches.

Sub-Sample 2 (Second Half of Matches): Anomalies might be Over_id values that are significantly higher than the average Over_id in the second half, indicating late overs in matches.

Ball_id:

Sub-Sample 1 (First Half of Matches): Anomalies could include Ball_id values that are small, indicating early balls faced in matches within the first half.

Sub-Sample 2 (Second Half of Matches): Anomalies might involve Ball_id values that are relatively high, representing late balls faced in matches within the second half.

Innings_No:

Sub-Sample 1 (First Half of Matches): Anomalies might be Innings_No values that are predominantly 1 (first innings), as most matches start with the first innings.

Sub-Sample 2 (Second Half of Matches): Anomalies could involve Innings_No values that are predominantly 2 (second innings), indicating matches in the second half where teams batted second more often.

Team_Batting:

Sub-Sample 1 (First Half of Matches): Anomalies might be specific teams that frequently batted early in matches in the first half.

Sub-Sample 2 (Second Half of Matches): Anomalies could include specific teams that frequently batted late in matches in the second half.

Are there aspects of the data that are consistent among all sub-samples?

In the visualizations above, all of the subsamples look consistent across the columns “MatcH_id,” “Over_id,” “Ball_id,” “Innings_No,” and “Team_Batting.” I’ll explore how these aspects might exhibit consistency across different sub-samples of a sports dataset.

Example: Cricket Match Data

Imagine I have a dataset that contains information about cricket matches, including details like match ID (“MatcH_id”), over ID (“Over_id”), ball ID (“Ball_id”), innings number (“Innings_No”), and the name of the batting team (“Team_Batting”). Each row in the dataset represents a specific ball bowled during a cricket match.

Here’s how these aspects could exhibit consistency among different sub-samples:

MatcH_id: Consistency in match IDs across sub-samples would mean that the same cricket matches are represented in each sub-sample. This suggests that the sub-samples are drawn from the same set of matches.
Over_id: Consistency in over IDs across sub-samples indicates that specific overs are common across different parts of the dataset. For example, if the 10th over consistently appears in all sub-samples, it means that the 10th over is played consistently in different matches.
Ball_id: Consistency in ball IDs implies that the same ball numbers within an over appear with similar frequency across sub-samples. Ball_id alone does not record events such as boundaries or wickets, so this indicates that deliveries are sampled evenly rather than that specific key moments are captured.
Innings_No: If the innings number is consistent in all sub-samples, it means that the dataset predominantly includes matches with the same type of innings (e.g., first innings). Consistency in innings number could also suggest that limited-overs matches dominate the dataset.
Team_Batting: Consistency in the names of the batting teams across sub-samples indicates that the same set of teams participates in various matches. For example, if “Team_A” and “Team_B” consistently appear in all sub-samples, it suggests these teams are common participants.

Overall, consistency in these aspects across sub-samples provides insights into the nature of the cricket matches represented in the dataset. It suggests that certain matches, overs, balls, innings types, and teams are consistently featured, making these aspects stable and reliable for analysis.
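
One way to put a number on this kind of consistency for a categorical column is to compare category proportions across sub-samples. The following is a sketch only: the team labels reuse the placeholder names “Team_A” and “Team_B” from the text, and the `toy` data, the seed, and the names `prop_table` and `max_gap` are assumptions made for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Made-up batting-team column with three placeholder teams.
teams <- c("Team_A", "Team_B", "Team_C")
toy <- data.frame(Team_Batting = sample(teams, 300, replace = TRUE))

# Four subsamples drawn with replacement, each half the size of toy.
subsamples <- lapply(1:4, function(i) {
  toy[sample(nrow(toy), size = 150, replace = TRUE), , drop = FALSE]
})
names(subsamples) <- paste0("df_", 1:4)

# Share of rows per team in each subsample. Fixing the factor levels keeps
# the rows aligned even if a team happens to be missing from a subsample.
prop_table <- sapply(subsamples, function(d) {
  prop.table(table(factor(d$Team_Batting, levels = teams)))
})

# Largest difference in any team's share between any two subsamples.
max_gap <- max(apply(prop_table, 1, function(p) max(p) - min(p)))

print(round(prop_table, 3))
print(round(max_gap, 3))
```

A small `max_gap` supports the claim that the team mix is stable across sub-samples; a large one would point to a sub-sample worth inspecting.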

Analyzing such consistency helps ensure that any findings or patterns observed in the sub-samples are likely to hold across the entire dataset, making the conclusions more robust and applicable.

Consider how this investigation affects how you might draw conclusions about the data in the future.

The investigation into the columns “MatcH_id,” “Over_id,” “Ball_id,” “Innings_No,” and “Team_Batting” affects how I would draw conclusions about the data in the future.

Example Scenario:

Imagine I’m analyzing data from a cricket database, and I’m particularly interested in these columns:

“MatcH_id” is a unique identifier for each cricket match.
“Over_id” represents the specific over within a match.
“Ball_id” represents the ball number within an over.
“Innings_No” indicates whether it’s the first or second innings of a match.
“Team_Batting” represents the team currently batting.

Investigation:

Suppose that, as part of this investigation, I observed the following anomalies:

Missing Match IDs: In one sub-sample, I noticed missing “MatcH_id” values. This suggests data quality issues or incomplete records in that sub-sample.

Out-of-Sequence Over and Ball IDs: In another sub-sample, I found that “Over_id” and “Ball_id” values were not in sequential order for a few records. This could indicate data entry errors or inconsistencies in recording overs and balls.

Inconsistent Innings: In yet another sub-sample, I found a match where “Innings_No” was greater than 2. Upon further investigation, I realized that this particular match had an unusual format, which is a valuable discovery.

Mismatched Team and Innings: In some sub-samples, there were instances where “Team_Batting” did not align with the “Innings_No.” For example, “Innings_No” was 1, but “Team_Batting” indicated the second innings team. This inconsistency may require data cleansing.

Impact on Drawing Conclusions:

These anomalies affect how I draw conclusions from the data:

Data Quality Assessment: Anomalies highlight potential data quality issues. In the future, I will need to conduct thorough data quality assessments, including identifying and handling missing values, validating unique identifiers, and checking for data integrity.

Understanding Data Variability: Recognizing anomalies across sub-samples enhances my understanding of the data’s variability. I will be more cautious when interpreting results, considering potential inconsistencies, and accounting for variations in data quality.

Domain Knowledge: Discovering an unusual match format (more than two innings) provides valuable domain knowledge. In future analyses, I will account for such variations and adapt my analytical approach accordingly.

Data Cleansing: Anomalies related to mismatched teams and innings highlight the importance of data cleansing and validation. I will implement data preprocessing steps to ensure that team and inning information is consistent and accurate.
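
The checks described above can be sketched as a few base R one-liners. This is an illustrative sketch on a made-up `toy` data frame, not the checks actually run for this report. The numeric-label test is motivated by the frequency tables earlier in this document, which list both numeric codes and team names under Team_Batting.

```r
# Made-up rows that contain one example of each problem.
toy <- data.frame(
  MatcH_id = c(1, 1, 2, NA, 3),
  Innings_No = c(1, 2, 1, 2, 3),
  Team_Batting = c("1", "Mumbai Indians", "2", "Gujarat Lions", "1"),
  stringsAsFactors = FALSE
)

# Missing values per column.
missing_counts <- colSums(is.na(toy))

# Rows whose innings number is beyond the usual two.
extra_innings <- toy[!is.na(toy$Innings_No) & toy$Innings_No > 2, ]

# Team labels that consist only of digits, as opposed to team names.
numeric_like <- grepl("^[0-9]+$", toy$Team_Batting)
label_kinds <- table(ifelse(numeric_like, "numeric code", "team name"))

print(missing_counts)
print(extra_innings)
print(label_kinds)
```

On the real data, the same expressions applied to `my_data` would show whether these issues are present before any conclusions are drawn.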

In summary, the investigation into anomalies not only improves data quality but also influences how I approach data analysis and interpretation in the future. It emphasizes the need for robust data preprocessing and domain-specific knowledge to draw meaningful conclusions from the dataset.

Sai Dheeraj

2023-09-18
