This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
my_data <- read.csv('C:/Users/dell/Downloads/Ball_By_Ball.csv')
summary(my_data)
## MatcH_id Over_id Ball_id Innings_No
## Min. : 335987 Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 419154 1st Qu.: 5.00 1st Qu.:2.000 1st Qu.:1.000
## Median : 548382 Median :10.00 Median :4.000 Median :1.000
## Mean : 636208 Mean :10.14 Mean :3.617 Mean :1.482
## 3rd Qu.: 829742 3rd Qu.:15.00 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :1082650 Max. :20.00 Max. :9.000 Max. :4.000
##
## Team_Batting Team_Bowling Striker_Batting_Position
## Length:150451 Length:150451 Min. : 1.000
## Class :character Class :character 1st Qu.: 2.000
## Mode :character Mode :character Median : 3.000
## Mean : 3.584
## 3rd Qu.: 5.000
## Max. :11.000
## NA's :13861
## Extra_Type Runs_Scored Extra_runs Wides
## Length:150451 Min. :0.000 Min. :0.00000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :0.00000 Median :0.0000
## Mean :1.222 Mean :0.06899 Mean :0.0375
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :6.000 Max. :5.00000 Max. :5.0000
##
## Legbyes Byes Noballs Penalty
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.0e+00
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.0e+00
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.0e+00
## Mean :0.02223 Mean :0.004885 Mean :0.00434 Mean :3.3e-05
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.0e+00
## Max. :5.00000 Max. :4.000000 Max. :5.00000 Max. :5.0e+00
##
## Bowler_Extras Out_type Caught Bowled
## Min. :0.00000 Length:150451 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 Class :character 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Mode :character Median :0.00000 Median :0.000000
## Mean :0.04184 Mean :0.02907 Mean :0.009186
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :5.00000 Max. :1.00000 Max. :1.000000
##
## Run_out LBW Retired_hurt Stumped
## Min. :0.000000 Min. :0.000000 Min. :0.00e+00 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00e+00 Median :0.000000
## Mean :0.005018 Mean :0.003024 Mean :5.98e-05 Mean :0.001615
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00e+00 Max. :1.000000
##
## caught_and_bowled hit_wicket ObstructingFeild Bowler_Wicket
## Min. :0.000000 Min. :0.00e+00 Min. :0.0e+00 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00e+00 1st Qu.:0.0e+00 1st Qu.:0.00000
## Median :0.000000 Median :0.00e+00 Median :0.0e+00 Median :0.00000
## Mean :0.001402 Mean :5.98e-05 Mean :6.6e-06 Mean :0.04435
## 3rd Qu.:0.000000 3rd Qu.:0.00e+00 3rd Qu.:0.0e+00 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00e+00 Max. :1.0e+00 Max. :1.00000
##
## Match_Date Season Striker Non_Striker
## Length:150451 Min. :2008 Min. : 1.0 Min. : 1.0
## Class :character 1st Qu.:2010 1st Qu.: 40.0 1st Qu.: 40.0
## Mode :character Median :2012 Median : 96.0 Median : 96.0
## Mean :2012 Mean :136.5 Mean :135.6
## 3rd Qu.:2015 3rd Qu.:208.0 3rd Qu.:208.0
## Max. :2017 Max. :497.0 Max. :497.0
##
## Bowler Player_Out Fielders Striker_match_SK
## Min. : 1.0 Min. : 1.0 Min. : 1.0 Min. :12694
## 1st Qu.: 77.0 1st Qu.: 41.0 1st Qu.: 47.0 1st Qu.:16173
## Median :174.0 Median :107.0 Median :111.0 Median :19672
## Mean :194.1 Mean :148.6 Mean :155.4 Mean :19675
## 3rd Qu.:310.0 3rd Qu.:236.0 3rd Qu.:237.5 3rd Qu.:23127
## Max. :497.0 Max. :497.0 Max. :497.0 Max. :26685
## NA's :143013 NA's :145100
## StrikerSK NonStriker_match_SK NONStriker_SK Fielder_match_SK
## Min. : 0.0 Min. :12694 Min. : 0.0 Min. : -1
## 1st Qu.: 39.0 1st Qu.:16173 1st Qu.: 39.0 1st Qu.: -1
## Median : 95.0 Median :19672 Median : 95.0 Median : -1
## Mean :135.5 Mean :19675 Mean :134.6 Mean : 690
## 3rd Qu.:207.0 3rd Qu.:23127 3rd Qu.:207.0 3rd Qu.: -1
## Max. :496.0 Max. :26685 Max. :496.0 Max. :26680
##
## Fielder_SK Bowler_match_SK BOWLER_SK PlayerOut_match_SK
## Min. : -1.000 Min. :12697 Min. : 0.0 Min. : -1.0
## 1st Qu.: -1.000 1st Qu.:16175 1st Qu.: 76.0 1st Qu.: -1.0
## Median : -1.000 Median :19674 Median :173.0 Median : -1.0
## Mean : 4.527 Mean :19677 Mean :193.1 Mean : 970.3
## 3rd Qu.: -1.000 3rd Qu.:23131 3rd Qu.:309.0 3rd Qu.: -1.0
## Max. :496.000 Max. :26685 Max. :496.0 Max. :26685.0
##
## BattingTeam_SK BowlingTeam_SK Keeper_Catch Player_out_sk
## Min. : 0.000 Min. : 0.000 Min. :0.000000 Min. : -1.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:0.000000 1st Qu.: 0.000
## Median : 4.000 Median : 4.000 Median :0.000000 Median : 0.000
## Mean : 4.346 Mean : 4.333 Mean :0.000432 Mean : 1.101
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.:0.000000 3rd Qu.: 0.000
## Max. :12.000 Max. :12.000 Max. :1.000000 Max. :496.000
##
## MatchDateSK
## Min. :20080418
## 1st Qu.:20100411
## Median :20120520
## Mean :20125288
## 3rd Qu.:20150420
## Max. :20170521
##
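The summary output above reports NA counts for some columns (for example Striker_Batting_Position) and a minimum of -1 for several of the `_SK` columns. A minimal base-R sketch for tallying both per column might look like the following. The data frame `toy` and its values are invented stand-ins for `my_data`, and reading -1 as a "not applicable" placeholder is an assumption, not something the summary itself confirms.

```r
# Toy stand-in for my_data; the values are invented for illustration only.
toy <- data.frame(
  Striker_Batting_Position = c(1, 2, NA, 3),
  Fielder_SK = c(-1, -1, 12, -1),
  Runs_Scored = c(0, 4, 0, 1)
)

# Count NA values and -1 values in every column.
# Treating -1 as a placeholder is an assumption to check against the documentation.
flag_counts <- data.frame(
  column = names(toy),
  n_na = sapply(toy, function(x) sum(is.na(x))),
  n_minus_one = sapply(toy, function(x) sum(x == -1, na.rm = TRUE)),
  row.names = NULL
)

print(flag_counts)
```

Running the same two `sapply()` calls on `my_data` instead of `toy` would give one row per column of the real dataset.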
A list of at least 3 columns (or values) in your data which are unclear until you read the documentation. E.g., this could be a column name, or just some value inside a cell of your data. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
In my dataset, “Ball_By_Ball.csv,” there are several columns and values that may be unclear until I read the documentation. Here are three columns:
Reading the documentation is essential because it provides the context and rationale for the encoding choices. Without it, an analyst might misinterpret these values and draw inaccurate conclusions. Understanding the encoding rationale also helps in deciding how to handle such values during data processing and analysis.
At least one element of your data that is unclear even after reading the documentation. You may need to do some digging, but is there anything about the data that your documentation does not explain?
I’ve selected “Extra_Type” because some of its values, such as “No Extras”, are unclear: they could be a genuine category or a placeholder for missing data. I have built a visualization to highlight the issue and explain why it might be unclear.
# Load the required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Create a custom variable for Extra_Type
my_data <- my_data %>%
  mutate(
    Extra_Type_Custom = ifelse(Extra_Type == "No Extras", "No Extras", "Other")
  )
# Create a bar chart to highlight the "No Extras" issue
ggplot(my_data, aes(x = Extra_Type_Custom, fill = Extra_Type_Custom)) +
  geom_bar() +
  labs(title = "Distribution of Extra Types", x = "Extra Type", y = "Count") +
  scale_fill_manual(values = c("red", "blue"), guide = "none") +
  theme_minimal()
Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear. You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?
Here’s a bar chart visualization of the “Extra_Type” column, with special attention to the unclear values. The chart compares the number of rows recorded as “No Extras” with the number of rows carrying any other “Extra_Type” value. It highlights the issue with this column: the chart alone cannot tell us whether “No Extras” means that no extra occurred on that ball or that the value was simply not recorded.
The unclear or missing values in the “Extra_Type” column could lead to misunderstandings or inconsistencies in data analysis.
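To put numbers next to the bar chart, a frequency table of the column can help. The sketch below uses only base R. The vector `extra_type` is an invented stand-in for `my_data$Extra_Type`, and the labels "wides" and "legbyes" are hypothetical, since the actual category labels in the file are not shown here.

```r
# Toy stand-in for my_data$Extra_Type; labels other than "No Extras" are hypothetical.
extra_type <- c("No Extras", "No Extras", "wides", "No Extras", "legbyes", "No Extras")

# Absolute counts per category, largest first.
counts <- sort(table(extra_type), decreasing = TRUE)

# Share of each category in percent, rounded to one decimal place.
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

Applied to the real column, the same two calls would show how strongly “No Extras” dominates the other categories.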
Further questions and risks include:
1) What do the “No Extras” values represent, and why are they present in this column?
Understanding the meaning of “No Extras” is crucial. It may indicate that no extra events occurred during certain ball-by-ball records. However, it’s essential to consult the dataset documentation or data providers to confirm this interpretation.
2) Are there any data collection or entry errors that led to these unclear values?
Investigating the source of unclear values, such as data collection or entry errors, can help ensure data quality. Verification with the data source or data collection process may be necessary.
3) How should these values be handled in data analysis and modeling?
Addressing unclear or missing values is a critical step in data preprocessing. Depending on the nature of the “No Extras” values and their impact on the analysis, you may choose to treat them differently. Options include excluding these rows, imputing missing values, or categorizing them appropriately based on your analysis goals.
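One way to probe the second question is a consistency check between “Extra_Type” and the “Extra_runs” column from the summary. The sketch below is a minimal illustration under assumptions: `toy` is an invented stand-in for `my_data`, the non-“No Extras” labels are hypothetical, and the rule that “No Extras” rows should have zero extra runs is a working hypothesis rather than documented behavior.

```r
# Toy stand-in for my_data; values and non-"No Extras" labels are invented.
toy <- data.frame(
  Extra_Type = c("No Extras", "wides", "No Extras", "legbyes", "No Extras"),
  Extra_runs = c(0, 1, 0, 2, 1)
)

# Rows labelled "No Extras" that nevertheless carry extra runs.
labelled_none_but_runs <- toy[toy$Extra_Type == "No Extras" & toy$Extra_runs > 0, ]

# Rows with some other label but zero extra runs.
labelled_extra_but_zero <- toy[toy$Extra_Type != "No Extras" & toy$Extra_runs == 0, ]

cat("'No Extras' rows with Extra_runs > 0:", nrow(labelled_none_but_runs), "\n")
cat("Other rows with Extra_runs == 0:", nrow(labelled_extra_but_zero), "\n")
```

If both counts are zero on the real data, that would support reading “No Extras” as "no extra occurred"; non-zero counts would point to rows worth inspecting by hand.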
Regarding risks, the unclear or missing values could lead to incorrect conclusions during data analysis. To reduce negative consequences, we can:
1) Carefully handle and preprocess this column before using it in the analysis, which may involve imputing missing values or recategorizing unclear values based on domain knowledge or further data exploration.
2) Reach out to domain experts or individuals who have a deep understanding of the dataset. They can provide insights into the meaning of “No Extras” and help clarify any ambiguities.
3) If possible, impute missing values with appropriate methods. For “No Extras,” we might choose to categorize it as a separate class or use other imputation techniques depending on the context.
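The handling options above can be sketched in base R as follows. This is an illustration under assumptions, not a recommendation: `toy` is an invented stand-in for `my_data`, the labels "wides" and "legbyes" are hypothetical, and the new column names `Has_Extra` and `Extra_Type_NA` are made up for this example.

```r
# Toy stand-in for my_data; values and non-"No Extras" labels are invented.
toy <- data.frame(
  Extra_Type = c("No Extras", "wides", "No Extras", "legbyes"),
  Runs_Scored = c(1, 0, 4, 0),
  stringsAsFactors = FALSE
)

# Option 1: exclude the rows, keeping only balls where an extra was recorded.
extras_only <- toy[toy$Extra_Type != "No Extras", ]

# Option 2: keep every row and add an explicit logical flag (hypothetical name).
toy$Has_Extra <- toy$Extra_Type != "No Extras"

# Option 3: recode "No Extras" as NA so that it is treated as missing
# (only appropriate if the label really means "not recorded").
toy$Extra_Type_NA <- ifelse(toy$Extra_Type == "No Extras", NA_character_, toy$Extra_Type)

print(extras_only)
print(toy)
```

Which option fits depends on what “No Extras” turns out to mean: option 2 loses no information, while option 3 changes how functions such as `table()` and `summary()` count the rows.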
# Load the required libraries
library(dplyr)
library(ggplot2)
# Create a custom column to categorize Extra_Type
my_data$Extra_Type_Custom <- ifelse(my_data$Extra_Type == "No Extras", "Unclear/Missing", "Clear")
# Create a bar chart to highlight clear and unclear/missing values
ggplot(data = my_data, aes(x = Extra_Type_Custom, fill = Extra_Type_Custom)) +
  geom_bar() +
  scale_fill_manual(values = c("Clear" = "blue", "Unclear/Missing" = "red")) +
  labs(title = "Distribution of Extra Types", x = "Extra Type Status", y = "Count") +
  theme_minimal()
# visual representation of each category of “extra_type”
# Load the required libraries
library(ggplot2)
library(dplyr)
# Create a pie chart to visualize the distribution of "Extra_Type"
pie_chart <- ggplot(data = my_data, aes(x = "", fill = Extra_Type)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Distribution of 'Extra_Type' Categories") +
  theme_void()
# Display the pie chart
print(pie_chart)