week_4_Datadive

Read the CSV file

my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)

##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##                                                                   
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.584          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##                                        NA's   :13861           
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##                                                                       
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##                                                                          
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##                                                                           
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##                                                                             
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##                                                                           
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##                                                                   
##      Bowler        Player_Out        Fielders      Striker_match_SK
##  Min.   :  1.0   Min.   :  1.0    Min.   :  1.0    Min.   :12694   
##  1st Qu.: 77.0   1st Qu.: 41.0    1st Qu.: 47.0    1st Qu.:16173   
##  Median :174.0   Median :107.0    Median :111.0    Median :19672   
##  Mean   :194.1   Mean   :148.6    Mean   :155.4    Mean   :19675   
##  3rd Qu.:310.0   3rd Qu.:236.0    3rd Qu.:237.5    3rd Qu.:23127   
##  Max.   :497.0   Max.   :497.0    Max.   :497.0    Max.   :26685   
##                  NA's   :143013   NA's   :145100                   
##    StrikerSK     NonStriker_match_SK NONStriker_SK   Fielder_match_SK
##  Min.   :  0.0   Min.   :12694       Min.   :  0.0   Min.   :   -1   
##  1st Qu.: 39.0   1st Qu.:16173       1st Qu.: 39.0   1st Qu.:   -1   
##  Median : 95.0   Median :19672       Median : 95.0   Median :   -1   
##  Mean   :135.5   Mean   :19675       Mean   :134.6   Mean   :  690   
##  3rd Qu.:207.0   3rd Qu.:23127       3rd Qu.:207.0   3rd Qu.:   -1   
##  Max.   :496.0   Max.   :26685       Max.   :496.0   Max.   :26680   
##                                                                      
##    Fielder_SK      Bowler_match_SK   BOWLER_SK     PlayerOut_match_SK
##  Min.   : -1.000   Min.   :12697   Min.   :  0.0   Min.   :   -1.0   
##  1st Qu.: -1.000   1st Qu.:16175   1st Qu.: 76.0   1st Qu.:   -1.0   
##  Median : -1.000   Median :19674   Median :173.0   Median :   -1.0   
##  Mean   :  4.527   Mean   :19677   Mean   :193.1   Mean   :  970.3   
##  3rd Qu.: -1.000   3rd Qu.:23131   3rd Qu.:309.0   3rd Qu.:   -1.0   
##  Max.   :496.000   Max.   :26685   Max.   :496.0   Max.   :26685.0   
##                                                                      
##  BattingTeam_SK   BowlingTeam_SK    Keeper_Catch      Player_out_sk    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.000000   Min.   : -1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median : 4.000   Median : 4.000   Median :0.000000   Median :  0.000  
##  Mean   : 4.346   Mean   : 4.333   Mean   :0.000432   Mean   :  1.101  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :12.000   Max.   :12.000   Max.   :1.000000   Max.   :496.000  
##                                                                        
##   MatchDateSK      
##  Min.   :20080418  
##  1st Qu.:20100411  
##  Median :20120520  
##  Mean   :20125288  
##  3rd Qu.:20150420  
##  Max.   :20170521  
##
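The summary above reports NA's in Striker_Batting_Position, Player_Out and Fielders, and a minimum of -1 in several of the *_SK columns. A minimal sketch of counting both per column is shown here on a small made-up data frame (`toy`) standing in for `my_data`. Treating -1 as a placeholder code is an assumption; the summary alone does not confirm what it means.

```r
# Toy stand-in for my_data (values are made up for illustration)
toy <- data.frame(
  Over_id    = c(1, 1, 2, 2, 3),
  Player_Out = c(NA, NA, 41, NA, NA),
  Fielder_SK = c(-1, -1, 46, -1, -1)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Number of -1 entries in each column (assumed placeholder code)
sentinel_counts <- colSums(toy == -1, na.rm = TRUE)

print(na_counts)
print(sentinel_counts)
```

Running the same two `colSums()` calls on `my_data` would show which columns need special handling before their means or quartiles are interpreted.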

A collection of 5-10 random samples of data (with replacement) from at least 6 columns of data. Each subsample should be roughly 50% as long as the full data.

# Load necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Set the number of subsamples you want
num_subsamples <- 5  # You can change this to 10 if needed

# Create an empty list to store the subsamples
subsamples_list <- list()

# Define the size of each subsample (roughly 50%)
sample_size <- floor(0.5 * nrow(my_data))

# Define the columns I want to include in the subsamples
selected_columns <- c("MatcH_id", "Over_id", "Ball_id", "Innings_No", "Team_Batting", "Team_Bowling")

# Create random subsamples and store them in the list
for (i in 1:num_subsamples) {
  # Sample rows with replacement
  subsample <- my_data %>%
    select(all_of(selected_columns)) %>%
    sample_n(size = sample_size, replace = TRUE)
  
  # Assign a unique name to each subsample
  subsample_name <- paste0("df_", i)
  
  # Store the subsample in the list
  subsamples_list[[subsample_name]] <- subsample
}

# Now I have a list of random subsamples, each stored as a separate data frame (df_1, df_2, etc.).
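The sampling loop above draws different rows every time it runs, so the printed summaries cannot be reproduced exactly. A minimal sketch of the same idea with a fixed seed, written in base R on a made-up data frame (`toy`) rather than `my_data`; the seed value 42 is arbitrary and is not part of the original analysis.

```r
# Toy stand-in for my_data (made-up values)
toy <- data.frame(
  Over_id      = rep(1:20, each = 6),
  Ball_id      = rep(1:6, times = 20),
  Team_Batting = rep(c("A", "B"), length.out = 120)
)

set.seed(42)  # arbitrary seed so the draws can be reproduced

num_subsamples <- 5
sample_size <- floor(0.5 * nrow(toy))

# Draw row indices with replacement, once per subsample
subsamples <- lapply(seq_len(num_subsamples), function(i) {
  rows <- sample(nrow(toy), size = sample_size, replace = TRUE)
  toy[rows, ]
})
names(subsamples) <- paste0("df_", seq_len(num_subsamples))

# Each subsample has half as many rows as the source data
print(sapply(subsamples, nrow))
```

Calling `set.seed()` once before the loop in the report would have the same effect there.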

To scrutinize the subsamples and understand their differences, similarities, and potential anomalies, you can perform various analyses and comparisons. Here are some steps and considerations:

1) Calculating summary statistics for each subsample, including measures like mean, median, standard deviation, and quartiles for numeric variables. For categorical variables, compute frequency tables.
2) Creating visualizations to compare the distributions of numeric and categorical variables across subsamples.

# Load necessary libraries
library(dplyr)

# Define a function to calculate summary statistics
calculate_summary_statistics <- function(subsample) {
  # Identify numeric and categorical columns
  numeric_cols <- sapply(subsample, is.numeric)
  categorical_cols <- !numeric_cols
  
  # Calculate summary statistics for numeric columns
  # (drop = FALSE keeps a data frame even if only one column matches)
  numeric_summary <- summary(subsample[, numeric_cols, drop = FALSE])
  
  # Create frequency tables for categorical columns
  categorical_summary <- lapply(subsample[, categorical_cols, drop = FALSE], table)
  
  # Combine the results into a list
  summary_results <- list(numeric_summary = numeric_summary, categorical_summary = categorical_summary)
  
  return(summary_results)
}

# Create a list to store summary statistics for each subsample
summary_stats_list <- list()

# Calculate summary statistics for each subsample
for (i in 1:num_subsamples) {
  subsample_name <- paste0("df_", i)
  summary_stats <- calculate_summary_statistics(subsamples_list[[subsample_name]])
  summary_stats_list[[subsample_name]] <- summary_stats
}

# Now you have a list of summary statistics for each subsample.
# Display summary_stats_list
print(summary_stats_list)

## $df_1
## $df_1$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419155   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548383   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 637194   Mean   :10.19   Mean   :3.611   Mean   :1.482  
##  3rd Qu.: 829744   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_1$categorical_summary
## $df_1$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7781                        2642 
##                          11                          12 
##                        3692                         773 
##                          13                           2 
##                         971                        8271 
##                           3                           4 
##                        7909                        7948 
##                           5                           6 
##                        6908                        7694 
##                           7                           8 
##                        8413                        4483 
##                           9            Delhi Daredevils 
##                         779                         807 
##               Gujarat Lions             Kings XI Punjab 
##                         838                         824 
##       Kolkata Knight Riders              Mumbai Indians 
##                         887                        1009 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         989                         763 
##         Sunrisers Hyderabad 
##                         844 
## 
## $df_1$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7813                        2761 
##                          11                          12 
##                        3684                         776 
##                          13                           2 
##                         976                        8309 
##                           3                           4 
##                        7669                        7882 
##                           5                           6 
##                        7084                        7637 
##                           7                           8 
##                        8395                        4462 
##                           9            Delhi Daredevils 
##                         816                         842 
##               Gujarat Lions             Kings XI Punjab 
##                         864                         844 
##       Kolkata Knight Riders              Mumbai Indians 
##                         953                         976 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         965                         732 
##         Sunrisers Hyderabad 
##                         785 
## 
## 
## 
## $df_2
## $df_2$numeric_summary
##     MatcH_id          Over_id        Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.0   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419155   1st Qu.: 5.0   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548383   Median :10.0   Median :4.000   Median :1.000  
##  Mean   : 636778   Mean   :10.1   Mean   :3.607   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.0   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.0   Max.   :9.000   Max.   :4.000  
## 
## $df_2$categorical_summary
## $df_2$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7732                        2697 
##                          11                          12 
##                        3706                         791 
##                          13                           2 
##                         913                        7999 
##                           3                           4 
##                        7845                        7999 
##                           5                           6 
##                        6945                        7731 
##                           7                           8 
##                        8623                        4467 
##                           9            Delhi Daredevils 
##                         847                         810 
##               Gujarat Lions             Kings XI Punjab 
##                         871                         755 
##       Kolkata Knight Riders              Mumbai Indians 
##                         918                        1058 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         890                         789 
##         Sunrisers Hyderabad 
##                         839 
## 
## $df_2$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7793                        2776 
##                          11                          12 
##                        3694                         812 
##                          13                           2 
##                         972                        8235 
##                           3                           4 
##                        7778                        7751 
##                           5                           6 
##                        7074                        7715 
##                           7                           8 
##                        8318                        4553 
##                           9            Delhi Daredevils 
##                         824                         784 
##               Gujarat Lions             Kings XI Punjab 
##                         844                         839 
##       Kolkata Knight Riders              Mumbai Indians 
##                         916                        1032 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         983                         754 
##         Sunrisers Hyderabad 
##                         778 
## 
## 
## 
## $df_3
## $df_3$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419152   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548380   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 635512   Mean   :10.12   Mean   :3.608   Mean   :1.482  
##  3rd Qu.: 829740   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_3$categorical_summary
## $df_3$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7745                        2721 
##                          11                          12 
##                        3731                         811 
##                          13                           2 
##                         930                        8120 
##                           3                           4 
##                        7850                        8049 
##                           5                           6 
##                        6921                        7572 
##                           7                           8 
##                        8559                        4483 
##                           9            Delhi Daredevils 
##                         874                         761 
##               Gujarat Lions             Kings XI Punjab 
##                         866                         770 
##       Kolkata Knight Riders              Mumbai Indians 
##                         904                         968 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         950                         783 
##         Sunrisers Hyderabad 
##                         857 
## 
## $df_3$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7828                        2764 
##                          11                          12 
##                        3709                         873 
##                          13                           2 
##                         939                        8149 
##                           3                           4 
##                        7761                        7815 
##                           5                           6 
##                        7173                        7782 
##                           7                           8 
##                        8226                        4539 
##                           9            Delhi Daredevils 
##                         808                         813 
##               Gujarat Lions             Kings XI Punjab 
##                         835                         835 
##       Kolkata Knight Riders              Mumbai Indians 
##                         871                        1049 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         916                         746 
##         Sunrisers Hyderabad 
##                         794 
## 
## 
## 
## $df_4
## $df_4$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419152   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548380   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 635002   Mean   :10.14   Mean   :3.621   Mean   :1.485  
##  3rd Qu.: 829740   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_4$categorical_summary
## $df_4$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7788                        2724 
##                          11                          12 
##                        3684                         762 
##                          13                           2 
##                         868                        8079 
##                           3                           4 
##                        7766                        7980 
##                           5                           6 
##                        6922                        7821 
##                           7                           8 
##                        8541                        4567 
##                           9            Delhi Daredevils 
##                         805                         801 
##               Gujarat Lions             Kings XI Punjab 
##                         853                         783 
##       Kolkata Knight Riders              Mumbai Indians 
##                         854                        1050 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         912                         801 
##         Sunrisers Hyderabad 
##                         864 
## 
## $df_4$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7672                        2668 
##                          11                          12 
##                        3609                         788 
##                          13                           2 
##                         924                        8187 
##                           3                           4 
##                        7784                        7846 
##                           5                           6 
##                        7317                        7797 
##                           7                           8 
##                        8329                        4603 
##                           9            Delhi Daredevils 
##                         783                         825 
##               Gujarat Lions             Kings XI Punjab 
##                         764                         805 
##       Kolkata Knight Riders              Mumbai Indians 
##                         933                        1088 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         943                         747 
##         Sunrisers Hyderabad 
##                         813 
## 
## 
## 
## $df_5
## $df_5$numeric_summary
##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419153   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636304   Mean   :10.14   Mean   :3.611   Mean   :1.484  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
## 
## $df_5$categorical_summary
## $df_5$categorical_summary$Team_Batting
## 
##                           1                          10 
##                        7717                        2724 
##                          11                          12 
##                        3737                         791 
##                          13                           2 
##                         953                        8096 
##                           3                           4 
##                        7897                        7890 
##                           5                           6 
##                        6947                        7796 
##                           7                           8 
##                        8323                        4536 
##                           9            Delhi Daredevils 
##                         844                         872 
##               Gujarat Lions             Kings XI Punjab 
##                         880                         802 
##       Kolkata Knight Riders              Mumbai Indians 
##                         852                        1022 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         934                         773 
##         Sunrisers Hyderabad 
##                         839 
## 
## $df_5$categorical_summary$Team_Bowling
## 
##                           1                          10 
##                        7754                        2689 
##                          11                          12 
##                        3533                         842 
##                          13                           2 
##                         982                        8173 
##                           3                           4 
##                        7807                        7886 
##                           5                           6 
##                        7124                        7804 
##                           7                           8 
##                        8375                        4508 
##                           9            Delhi Daredevils 
##                         774                         805 
##               Gujarat Lions             Kings XI Punjab 
##                         829                         770 
##       Kolkata Knight Riders              Mumbai Indians 
##                         924                        1062 
##     Rising Pune Supergiants Royal Challengers Bangalore 
##                         949                         788 
##         Sunrisers Hyderabad 
##                         847
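The printed list is long, and the numeric summaries differ only in a few digits (for example, the Over_id mean ranges from 10.1 to 10.19 across the five subsamples). A compact way to see this is to put one statistic per subsample into a single table. The sketch below does that for column means on made-up data; `toy_subsamples` is a hypothetical stand-in for `subsamples_list`.

```r
set.seed(1)  # arbitrary seed for the toy draws

# Toy stand-in for subsamples_list (made-up values)
toy <- data.frame(Over_id = rep(1:20, each = 6), Ball_id = rep(1:6, times = 20))
toy_subsamples <- lapply(1:5, function(i) toy[sample(nrow(toy), 60, replace = TRUE), ])
names(toy_subsamples) <- paste0("df_", 1:5)

# One row per subsample, one column per numeric variable
mean_table <- t(sapply(toy_subsamples, function(s) {
  colMeans(s[, sapply(s, is.numeric), drop = FALSE])
}))

print(round(mean_table, 3))

# Spread of each mean across the subsamples (max minus min)
mean_spread <- apply(mean_table, 2, function(x) diff(range(x)))
print(round(mean_spread, 3))
```

A small spread relative to the size of the mean is one simple indication that the subsamples agree on that column.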

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# Create a histogram of Ball_id for each subsample, titled with the subsample name
histograms <- lapply(names(subsamples_list), function(name) {
  ggplot(subsamples_list[[name]], aes(x = Ball_id)) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black") +
    labs(title = paste("Ball_id:", name),
         x = "Ball_id",
         y = "Frequency")
})

# Display histograms side by side
grid.arrange(grobs = histograms, ncol = num_subsamples)

library(ggplot2)
library(gridExtra)
# Create a bar chart of Team_Batting for each subsample, titled with the subsample name
bar_charts <- lapply(names(subsamples_list), function(name) {
  ggplot(subsamples_list[[name]], aes(x = Team_Batting, fill = Team_Batting)) +
    geom_bar() +
    labs(title = paste("Team_Batting:", name),
         x = "Team_Batting",
         y = "Frequency") +
    theme(legend.position = "none")
})

# Display bar charts side by side
grid.arrange(grobs = bar_charts, ncol = num_subsamples)

# Combine all subsamples into one data frame
combined_data <- do.call(rbind, subsamples_list)

# Create a bar chart for the combined data
ggplot(combined_data, aes(x = Team_Batting, fill = Team_Batting)) +
  geom_bar() +
  labs(title = "Team_Batting across all subsamples",
       x = "Team_Batting",
       y = "Frequency") +
  theme(legend.position = "none")
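Stacking the subsamples with `rbind` keeps every row but has no column saying which subsample a row came from, so the combined chart cannot be split back by subsample. A minimal sketch of adding such a column before stacking, on made-up data; the column name `subsample` and the list `toy_subsamples` are assumptions for illustration.

```r
# Toy stand-in for subsamples_list (made-up values)
toy_subsamples <- list(
  df_1 = data.frame(Team_Batting = c("A", "A", "B")),
  df_2 = data.frame(Team_Batting = c("A", "B", "B"))
)

# Add a column recording which subsample each row came from, then stack
labelled <- Map(function(s, name) {
  s$subsample <- name
  s
}, toy_subsamples, names(toy_subsamples))
combined <- do.call(rbind, labelled)

# Counts per team within each subsample
counts <- table(combined$subsample, combined$Team_Batting)
print(counts)
```

With the extra column in place, a single ggplot2 call with `facet_wrap(~ subsample)` could draw one panel per subsample instead of arranging separate plots.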

What would you have called an anomaly in one sub-sample that you wouldn’t in another?

I see no anomalies in my subsamples: all five are consistent with one another. In general, what counts as an anomaly depends on the sub-sample it is judged against. As an example, consider the columns “MatcH_id,” “Over_id,” “Ball_id,” “Innings_No,” and “Team_Batting” in two hypothetical sub-samples:

Sub-Sample 1 (First Half of Matches): This sub-sample consists of data from the first half of matches.

Sub-Sample 2 (Second Half of Matches): This sub-sample consists of data from the second half of matches.

Anomalies in each of these columns might be defined as follows:

MatcH_id:

Sub-Sample 1 (First Half of Matches): Anomalies could be MatcH_id values smaller than the median MatcH_id in the first half of matches, as these indicate matches that occurred relatively early in the dataset.
Sub-Sample 2 (Second Half of Matches): Anomalies might be MatcH_id values greater than the median MatcH_id in the second half of matches, representing matches that occurred relatively late in the dataset.

Over_id:

Sub-Sample 1 (First Half of Matches): Anomalies could be Over_id values that are unusually low compared to the average Over_id in the first half, indicating early overs in matches.
Sub-Sample 2 (Second Half of Matches): Anomalies might be Over_id values that are significantly higher than the average Over_id in the second half, indicating late overs in matches.

Ball_id:

Sub-Sample 1 (First Half of Matches): Anomalies could include small Ball_id values, indicating early balls faced in matches within the first half.
Sub-Sample 2 (Second Half of Matches): Anomalies might involve relatively high Ball_id values, representing late balls faced in matches within the second half.

Innings_No:

Sub-Sample 1 (First Half of Matches): Anomalies might be Innings_No values that are predominantly 1 (first innings), as most matches start with the first innings.
Sub-Sample 2 (Second Half of Matches): Anomalies could involve Innings_No values that are predominantly 2 (second innings), indicating matches in the second half where teams batted second more often.

Team_Batting:

Sub-Sample 1 (First Half of Matches): Anomalies might be specific teams that frequently batted early in matches in the first half.
Sub-Sample 2 (Second Half of Matches): Anomalies could include specific teams that frequently batted late in matches in the second half.
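One pattern in the frequency tables printed earlier may be worth checking as a possible anomaly: in every subsample, Team_Batting and Team_Bowling hold both numeric labels (“1” to “13”) and team names such as “Mumbai Indians”. The sketch below separates the two kinds of label on a made-up vector. It assumes the digits-only labels are team IDs, and it does not attempt to map IDs to names because that mapping is not shown in this report.

```r
# Toy stand-in for a Team_Batting column (made-up rows, labels as seen in the tables)
team_batting <- c("1", "7", "Mumbai Indians", "10", "Gujarat Lions", "7")

# TRUE where the label consists only of digits
is_code <- grepl("^[0-9]+$", team_batting)

# Split the frequency table by label type
code_counts <- table(team_batting[is_code])
name_counts <- table(team_batting[!is_code])

print(code_counts)
print(name_counts)

# Share of rows that use a numeric label rather than a name
print(mean(is_code))
```

If the two kinds of label refer to the same teams, counts per team are split across two labels until they are reconciled, which would affect any per-team comparison.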

Are there aspects of the data that are consistent among all sub-samples?

Based on the summaries and visualizations above, all five subsamples are consistent in the columns “MatcH_id,” “Over_id,” “Ball_id,” “Innings_No,” and “Team_Batting.” Below I explore what consistency in each of these columns means across sub-samples of a sports dataset.

Example: Cricket Match Data

The dataset contains information about cricket matches, including the match ID (“MatcH_id”), over ID (“Over_id”), ball ID (“Ball_id”), innings number (“Innings_No”), and the batting team (“Team_Batting”). Each row in the dataset represents a specific ball bowled during a cricket match.

Here’s how these aspects exhibit consistency among the sub-samples:

MatcH_id: Consistency in match IDs across sub-samples means that the same range of cricket matches is represented in each sub-sample. Here the minimum (335987) and maximum (1082650) are identical in all five, and the quartiles differ by only a few IDs, which is what I would expect when every sub-sample is drawn from the same set of matches.
Over_id: Consistency in over IDs indicates that overs are represented in the same proportions in each sub-sample. Every sub-sample has a minimum of 1, quartiles of 5, 10, and 15, and a maximum of 20.
Ball_id: Consistency in ball IDs implies that deliveries within an over are sampled evenly. Every sub-sample has a median Ball_id of 4 and a mean between 3.607 and 3.621.
Innings_No: Consistency in innings number means the split between innings is stable. In every sub-sample the median is 1 and the mean is between 1.482 and 1.485. The maximum Over_id of 20 also suggests that limited-overs matches dominate the dataset.
Team_Batting: Consistency in the batting teams indicates that the same set of teams appears in every sub-sample. For example, “Mumbai Indians” and “Kings XI Punjab” appear in all five, with similar counts each time.

Overall, consistency in these aspects across sub-samples suggests that matches, overs, balls, innings, and teams are represented in stable proportions, making these columns reliable for analysis.

Analyzing such consistency helps ensure that any findings or patterns observed in the sub-samples are likely to hold across the entire dataset, making the conclusions more robust and applicable.
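Consistency in a categorical column such as Team_Batting can also be put into numbers by comparing the share of rows per category across subsamples. A minimal sketch on made-up data; `teams`, `toy_subsamples`, and the seed are hypothetical and only illustrate the calculation.

```r
set.seed(7)  # arbitrary seed for the toy draws

# Toy stand-in for Team_Batting in the full data (made-up values)
teams <- rep(c("A", "B", "C"), times = c(500, 300, 200))
toy_subsamples <- lapply(1:5, function(i) sample(teams, 500, replace = TRUE))

# Proportion of rows per team in each subsample (rows = teams, columns = subsamples)
levels_all <- sort(unique(teams))
prop_matrix <- sapply(toy_subsamples, function(s) {
  as.numeric(prop.table(table(factor(s, levels = levels_all))))
})
rownames(prop_matrix) <- levels_all

# Largest gap between any two subsamples, per team
max_gap <- apply(prop_matrix, 1, function(x) diff(range(x)))
print(round(max_gap, 3))
```

Using `factor()` with a fixed set of levels keeps the rows aligned even if a rare category happens to be missing from one subsample.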

Sai Dheeraj

2023-09-18

Consider how this investigation affects how you might draw conclusions about the data in the future?