Week 5

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)

##     MatcH_id          Over_id         Ball_id        Innings_No   
##  Min.   : 335987   Min.   : 1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 419154   1st Qu.: 5.00   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 548382   Median :10.00   Median :4.000   Median :1.000  
##  Mean   : 636208   Mean   :10.14   Mean   :3.617   Mean   :1.482  
##  3rd Qu.: 829742   3rd Qu.:15.00   3rd Qu.:5.000   3rd Qu.:2.000  
##  Max.   :1082650   Max.   :20.00   Max.   :9.000   Max.   :4.000  
##                                                                   
##  Team_Batting       Team_Bowling       Striker_Batting_Position
##  Length:150451      Length:150451      Min.   : 1.000          
##  Class :character   Class :character   1st Qu.: 2.000          
##  Mode  :character   Mode  :character   Median : 3.000          
##                                        Mean   : 3.584          
##                                        3rd Qu.: 5.000          
##                                        Max.   :11.000          
##                                        NA's   :13861           
##   Extra_Type         Runs_Scored      Extra_runs          Wides       
##  Length:150451      Min.   :0.000   Min.   :0.00000   Min.   :0.0000  
##  Class :character   1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Mode  :character   Median :1.000   Median :0.00000   Median :0.0000  
##                     Mean   :1.222   Mean   :0.06899   Mean   :0.0375  
##                     3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##                     Max.   :6.000   Max.   :5.00000   Max.   :5.0000  
##                                                                       
##     Legbyes             Byes             Noballs           Penalty       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.0e+00  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.0e+00  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.0e+00  
##  Mean   :0.02223   Mean   :0.004885   Mean   :0.00434   Mean   :3.3e-05  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.0e+00  
##  Max.   :5.00000   Max.   :4.000000   Max.   :5.00000   Max.   :5.0e+00  
##                                                                          
##  Bowler_Extras       Out_type             Caught            Bowled        
##  Min.   :0.00000   Length:150451      Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   Class :character   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Mode  :character   Median :0.00000   Median :0.000000  
##  Mean   :0.04184                      Mean   :0.02907   Mean   :0.009186  
##  3rd Qu.:0.00000                      3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :5.00000                      Max.   :1.00000   Max.   :1.000000  
##                                                                           
##     Run_out              LBW            Retired_hurt         Stumped        
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00e+00   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.000000  
##  Median :0.000000   Median :0.000000   Median :0.00e+00   Median :0.000000  
##  Mean   :0.005018   Mean   :0.003024   Mean   :5.98e-05   Mean   :0.001615  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.000000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :1.00e+00   Max.   :1.000000  
##                                                                             
##  caught_and_bowled    hit_wicket       ObstructingFeild  Bowler_Wicket    
##  Min.   :0.000000   Min.   :0.00e+00   Min.   :0.0e+00   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00e+00   1st Qu.:0.0e+00   1st Qu.:0.00000  
##  Median :0.000000   Median :0.00e+00   Median :0.0e+00   Median :0.00000  
##  Mean   :0.001402   Mean   :5.98e-05   Mean   :6.6e-06   Mean   :0.04435  
##  3rd Qu.:0.000000   3rd Qu.:0.00e+00   3rd Qu.:0.0e+00   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00e+00   Max.   :1.0e+00   Max.   :1.00000  
##                                                                           
##   Match_Date            Season        Striker       Non_Striker   
##  Length:150451      Min.   :2008   Min.   :  1.0   Min.   :  1.0  
##  Class :character   1st Qu.:2010   1st Qu.: 40.0   1st Qu.: 40.0  
##  Mode  :character   Median :2012   Median : 96.0   Median : 96.0  
##                     Mean   :2012   Mean   :136.5   Mean   :135.6  
##                     3rd Qu.:2015   3rd Qu.:208.0   3rd Qu.:208.0  
##                     Max.   :2017   Max.   :497.0   Max.   :497.0  
##                                                                   
##      Bowler        Player_Out        Fielders      Striker_match_SK
##  Min.   :  1.0   Min.   :  1.0    Min.   :  1.0    Min.   :12694   
##  1st Qu.: 77.0   1st Qu.: 41.0    1st Qu.: 47.0    1st Qu.:16173   
##  Median :174.0   Median :107.0    Median :111.0    Median :19672   
##  Mean   :194.1   Mean   :148.6    Mean   :155.4    Mean   :19675   
##  3rd Qu.:310.0   3rd Qu.:236.0    3rd Qu.:237.5    3rd Qu.:23127   
##  Max.   :497.0   Max.   :497.0    Max.   :497.0    Max.   :26685   
##                  NA's   :143013   NA's   :145100                   
##    StrikerSK     NonStriker_match_SK NONStriker_SK   Fielder_match_SK
##  Min.   :  0.0   Min.   :12694       Min.   :  0.0   Min.   :   -1   
##  1st Qu.: 39.0   1st Qu.:16173       1st Qu.: 39.0   1st Qu.:   -1   
##  Median : 95.0   Median :19672       Median : 95.0   Median :   -1   
##  Mean   :135.5   Mean   :19675       Mean   :134.6   Mean   :  690   
##  3rd Qu.:207.0   3rd Qu.:23127       3rd Qu.:207.0   3rd Qu.:   -1   
##  Max.   :496.0   Max.   :26685       Max.   :496.0   Max.   :26680   
##                                                                      
##    Fielder_SK      Bowler_match_SK   BOWLER_SK     PlayerOut_match_SK
##  Min.   : -1.000   Min.   :12697   Min.   :  0.0   Min.   :   -1.0   
##  1st Qu.: -1.000   1st Qu.:16175   1st Qu.: 76.0   1st Qu.:   -1.0   
##  Median : -1.000   Median :19674   Median :173.0   Median :   -1.0   
##  Mean   :  4.527   Mean   :19677   Mean   :193.1   Mean   :  970.3   
##  3rd Qu.: -1.000   3rd Qu.:23131   3rd Qu.:309.0   3rd Qu.:   -1.0   
##  Max.   :496.000   Max.   :26685   Max.   :496.0   Max.   :26685.0   
##                                                                      
##  BattingTeam_SK   BowlingTeam_SK    Keeper_Catch      Player_out_sk    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.000000   Min.   : -1.000  
##  1st Qu.: 2.000   1st Qu.: 2.000   1st Qu.:0.000000   1st Qu.:  0.000  
##  Median : 4.000   Median : 4.000   Median :0.000000   Median :  0.000  
##  Mean   : 4.346   Mean   : 4.333   Mean   :0.000432   Mean   :  1.101  
##  3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.:0.000000   3rd Qu.:  0.000  
##  Max.   :12.000   Max.   :12.000   Max.   :1.000000   Max.   :496.000  
##                                                                        
##   MatchDateSK      
##  Min.   :20080418  
##  1st Qu.:20100411  
##  Median :20120520  
##  Mean   :20125288  
##  3rd Qu.:20150420  
##  Max.   :20170521  
##

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation. E.g., this could be a column name, or just some value inside a cell of your data. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

In my dataset, “Ball_By_Ball.csv”, several columns and values are unclear until I read the documentation. Here are three of them:

Column: “Extra_Type”
- Unclear Value: “NA”
- Interpretation: The “Extra_Type” column seems to record the kind of extra event (if any) on a ball, but “NA” does not say whether no extra occurred or the information is simply missing.
- Why They Chose This Encoding: “NA” could be used to indicate that there was no extra event during that specific ball. A single placeholder might make such rows easier to filter or process later.
- Consequences of Not Reading the Documentation: Without knowing that “NA” represents a lack of extra events, I could treat those rows as missing data and draw incorrect conclusions about how often extras occur.

Column: “Striker_Batting_Position”
- Unclear Value: “NA” (13,861 rows in the summary above)
- Interpretation: The “Striker_Batting_Position” column likely gives the striker’s position in the batting order, but “NA” carries no such information.
- Why They Chose This Encoding: “NA” could signify that no batting position was recorded for the striker in certain cases, for example because the position was not relevant or not available.
- Consequences of Not Reading the Documentation: Misinterpreting “NA” as a batting position could lead to errors in analyzing a player’s performance.

Column: “Runs_Scored”
- Unclear Value: “0”
- Interpretation: The “Runs_Scored” column appears to give the number of runs scored off a ball, but “0” does not show whether a run was attempted.
- Why They Chose This Encoding: “0” most plausibly records that no runs were scored on that particular ball. On its own it cannot distinguish a ball where a run was attempted but not completed from a ball where no run was attempted.
- Consequences of Not Reading the Documentation: Reading “0” as “no run was attempted” could skew analyses of batting performance.
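
The three ambiguities listed above can also be checked directly in R before turning to the documentation. The following is a minimal sketch, not the analysis used in this report: it runs on a small made-up data frame (`toy`, a hypothetical name) that reuses three column names from `summary(my_data)`, and every value in it, including the extra labels, is invented for illustration. On the real data, `my_data` would take the place of `toy`.

```r
# Hypothetical toy rows that reuse column names from Ball_By_Ball.csv;
# all values are invented for illustration only.
toy <- data.frame(
  Extra_Type = c("No Extras", "wides", NA, "No Extras", "legbyes"),
  Striker_Batting_Position = c(1, 2, NA, 3, NA),
  Runs_Scored = c(0, 0, 4, 1, 0),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Tabulate Extra_Type and keep NA as its own category,
# so that missing labels are not silently dropped from the counts.
extra_table <- table(toy$Extra_Type, useNA = "ifany")
print(extra_table)

# Share of balls on which the Runs_Scored value is 0.
zero_share <- mean(toy$Runs_Scored == 0)
print(zero_share)
```

Counting `NA` separately from the labelled categories is the point of `useNA = "ifany"`: by default `table()` leaves missing values out, which would hide exactly the values in question.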

Reading the documentation is essential because it provides the context and rationale behind these encoding choices. Without it, analysts might misinterpret these values and reach inaccurate conclusions. Understanding the encoding rationale also helps users make informed decisions about how to handle such values during data processing and analysis.

At least one element of your data that is unclear even after reading the documentation. You may need to do some digging, but is there anything about the data that your documentation does not explain?

I’ve selected “Extra_Type” because it appears to contain some unclear or missing values (e.g., “No Extras”). I have built a visualization to highlight the issue and explain why it might be unclear.

# Load the required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

# Create a custom variable for Extra_Type
my_data <- my_data %>%
  mutate(
    Extra_Type_Custom = ifelse(Extra_Type == "No Extras", "No Extras", "Other")
  )

# Create a bar chart to highlight the "No Extras" issue
ggplot(my_data, aes(x = Extra_Type_Custom, fill = Extra_Type_Custom)) +
  geom_bar() +
  labs(title = "Distribution of Extra Types", x = "Extra Type", y = "Count") +
  scale_fill_manual(values = c("red", "blue"), guide = "none") +
  theme_minimal()

Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear. You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?

Here’s a bar chart visualization of the “Extra_Type” column, with special attention to the unclear or missing values. The chart compares the number of balls labelled “No Extras” with the number of balls carrying any other “Extra_Type” value. This visual representation highlights the issue with the “Extra_Type” column: one of its two bars rests on a label whose meaning is not explained.

The unclear or missing values in the “Extra_Type” column could lead to misunderstandings or inconsistencies in data analysis.

Further questions/risks include:

1) What do the “No Extras” values represent, and why are they present in this column?

Understanding the meaning of “No Extras” is crucial. It may indicate that no extra events occurred during certain ball-by-ball records. However, it’s essential to consult the dataset documentation or data providers to confirm this interpretation.

2) Are there any data collection or entry errors that led to these unclear values?

Investigating the source of unclear values, such as data collection or entry errors, can help ensure data quality. Verification with the data source or data collection process may be necessary.

3) How should these values be handled in data analysis and modeling?

Addressing unclear or missing values is a critical step in data preprocessing. Depending on the nature of the “No Extras” values and their impact on the analysis, you may choose to treat them differently. Options include excluding these rows, imputing missing values, or categorizing them appropriately based on your analysis goals.
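
One way to probe the first two questions is to cross-check “Extra_Type” against the “Extra_runs” column from the summary. The sketch below assumes, without confirmation from the documentation, that a ball labelled “No Extras” should carry zero extra runs. It uses a small made-up data frame (`toy`, hypothetical) with invented values; on the real data, `my_data` would replace it.

```r
# Hypothetical toy rows; all values are invented for illustration only.
toy <- data.frame(
  Extra_Type = c("No Extras", "wides", "No Extras", NA, "No Extras"),
  Extra_runs = c(0, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)

# Rows labelled "No Extras" that still carry extra runs contradict the
# assumed meaning of the label and may point to an entry error.
# %in% returns FALSE for NA, so rows with a missing label are not flagged here.
suspicious <- toy[toy$Extra_Type %in% "No Extras" & toy$Extra_runs > 0, ]
print(suspicious)

# Rows with a missing label are collected separately for review.
unlabelled <- toy[is.na(toy$Extra_Type), ]
print(nrow(unlabelled))
```

If the real data returned no suspicious rows, that would support reading “No Extras” as “no extra on this ball”, although it would not replace confirmation from the data provider.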

Regarding risks, the unclear or missing values could lead to incorrect conclusions during data analysis. To reduce negative consequences:

1) Carefully handle and preprocess this column before using it in the analysis, which may involve imputing missing values or recategorizing unclear values based on domain knowledge or further data exploration.

2) Reach out to domain experts or individuals who have a deep understanding of the dataset. They can provide insights into the meaning of “No Extras” and help clarify any ambiguities.

3) If possible, impute missing values with appropriate methods. For “No Extras,” we might choose to categorize it as a separate class or use other imputation techniques depending on the context.
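
The option of keeping “No Extras” as a separate class could look roughly like the sketch below. It is only an illustration on a made-up data frame (`toy`, hypothetical, with invented values); the new column name `Extra_Status` and the class labels “Extra” and “Unknown” are likewise assumptions, not names from the dataset.

```r
# Hypothetical toy rows; all values are invented for illustration only.
toy <- data.frame(
  Extra_Type = c("No Extras", "wides", NA, "No Extras"),
  stringsAsFactors = FALSE
)

# Keep "No Extras" as its own class, give missing labels an explicit
# "Unknown" class, and group every other label as "Extra".
toy$Extra_Status <- ifelse(
  is.na(toy$Extra_Type), "Unknown",
  ifelse(toy$Extra_Type == "No Extras", "No Extras", "Extra")
)

# Fix the level order so tables and plots list the classes consistently.
toy$Extra_Status <- factor(
  toy$Extra_Status,
  levels = c("No Extras", "Extra", "Unknown")
)

status_counts <- table(toy$Extra_Status)
print(status_counts)
```

Making the “Unknown” class explicit avoids `NA` results from the comparison `Extra_Type == "No Extras"`, which would otherwise leave some rows without a class.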

# Load the required libraries
library(dplyr)
library(ggplot2)

# Create a custom column to categorize Extra_Type
my_data$Extra_Type_Custom <- ifelse(my_data$Extra_Type == "No Extras", "Unclear/Missing", "Clear")

# Create a bar chart to highlight clear and unclear/missing values
ggplot(data = my_data, aes(x = Extra_Type_Custom, fill = Extra_Type_Custom)) +
  geom_bar() +
  scale_fill_manual(values = c("Clear" = "blue", "Unclear/Missing" = "red")) +
  labs(title = "Distribution of Extra Types", x = "Extra Type Status", y = "Count") +
  theme_minimal()

# Visual representation of each category of “Extra_Type”

# Load the required libraries
library(ggplot2)
library(dplyr)

# Create a pie chart to visualize the distribution of "Extra_Type"
pie_chart <- ggplot(data = my_data, aes(x = "", fill = Extra_Type)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Distribution of 'Extra_Type' Categories") +
  theme_void()

# Display the pie chart
print(pie_chart)


Sai Dheeraj Kanaparthi

2023-09-19
