library(readxl)
## Warning: package 'readxl' was built under R version 4.5.1
Odi_Match_Info <- read_excel("C:/Users/nishi/Downloads/Odi_Match_Info.xlsx")
View(Odi_Match_Info)
Assigning the data set to a variable
My_Data1 = Odi_Match_Info
Q1. Print the Structure of the Dataset
str(My_Data1)
## tibble [4,571 × 35] (S3: tbl_df/tbl/data.frame)
## $ ODI Match No : num [1:4571] 1 2 3 4 5 6 8 9 10 11 ...
## $ Match ID : num [1:4571] 64148 64944 64945 64946 64149 ...
## $ Match Name : chr [1:4571] "Australia Vs England Only Odi" "England Vs Australia 1St Odi" "England Vs Australia 2Nd Odi" "England Vs Australia 3Rd Odi" ...
## $ Match Type : chr [1:4571] "Single Match" "Multi Match" "Multi Match" "Multi Match" ...
## $ Series ID : num [1:4571] 60783 60784 60784 60784 60785 ...
## $ Series Name : chr [1:4571] "England [Marylebone Cricket Club] tour of Australia - 1970 (1970/71)" "Australia tour of England and Scotland - 1972 (1972)" "Australia tour of England and Scotland - 1972 (1972)" "Australia tour of England and Scotland - 1972 (1972)" ...
## $ WorldCup : chr [1:4571] "No" "No" "No" "No" ...
## $ Match Date : POSIXct[1:4571], format: "1971-01-05" "1972-08-24" ...
## $ Match Format : chr [1:4571] "ODI" "ODI" "ODI" "ODI" ...
## $ Team1 ID : num [1:4571] 1 2 1 2 5 5 4 1 5 2 ...
## $ Team1 Name : chr [1:4571] "England" "Australia" "England" "Australia" ...
## $ Team1 Captain : num [1:4571] 1081 1243 858 1243 1251 ...
## $ Team1 Captain Name : chr [1:4571] "Ray Illingworth" "Ian Chappell" "Brian Close" "Ian Chappell" ...
## $ Team1 Runs Scored : num [1:4571] 190 222 236 179 187 158 181 189 194 265 ...
## $ Team1 Wickets Fell : num [1:4571] 10 8 9 9 10 10 10 9 9 5 ...
## $ Team1 Extras Rec : num [1:4571] 10 10 15 10 24 9 12 8 16 12 ...
## $ Team2 ID : num [1:4571] 2 1 2 1 7 1 1 4 2 5 ...
## $ Team2 Name : chr [1:4571] "Australia" "England" "Australia" "England" ...
## $ Team2 Captain : num [1:4571] 1150 858 1243 858 1117 ...
## $ Team2 Captain Name : chr [1:4571] "Bill Lawry" "Brian Close" "Ian Chappell" "Brian Close" ...
## $ Team2 Runs Scored : num [1:4571] 191 226 240 180 165 159 182 190 195 234 ...
## $ Team2 Wickets Fell : num [1:4571] 5 4 5 8 10 3 9 2 3 6 ...
## $ Team2 Extras Rec : num [1:4571] 6 7 36 9 12 3 7 12 9 9 ...
## $ Match Venue (Stadium): chr [1:4571] "Melbourne Cricket Ground" "Old Trafford" "Lord's" "Edgbaston" ...
## $ Match Venue (City) : chr [1:4571] "Melbourne" "Manchester" "London" "Birmingham" ...
## $ Match Venue (Country): chr [1:4571] "Australia" "England" "England" "England" ...
## $ Home Country
## : chr [1:4571] "Australia" "England" "England" "England" ...
## $ Umpire 1 : chr [1:4571] "LP Rowan" "CS Elliott" "AE Fagg" "ASM Oakman" ...
## $ Umpire 2 : chr [1:4571] "TF Brooks" "AEG Rhodes" "TW Spencer" "DJ Constant" ...
## $ Match Referee : chr [1:4571] "Unknown" "Unknown" "Unknown" "Unknown" ...
## $ Toss Winner : chr [1:4571] "Australia" "Australia" "Australia" "England" ...
## $ Toss Winner Choice : chr [1:4571] "bowl" "bat" "bowl" "bowl" ...
## $ Match Winner : chr [1:4571] "AUSTRALIA" "ENGLAND" "AUSTRALIA" "ENGLAND" ...
## $ MOM Player : chr [1:4571] "1203.0" "1285.0" "1364.0" "1396.0" ...
## $ Man of the Match : chr [1:4571] "John Edrich" "Dennis Amiss" "Greg Chappell" "Barry Wood" ...
Q2. List the Variables in the dataset
names(My_Data1)
## [1] "ODI Match No" "Match ID" "Match Name"
## [4] "Match Type" "Series ID" "Series Name"
## [7] "WorldCup" "Match Date" "Match Format"
## [10] "Team1 ID" "Team1 Name" "Team1 Captain"
## [13] "Team1 Captain Name" "Team1 Runs Scored" "Team1 Wickets Fell"
## [16] "Team1 Extras Rec" "Team2 ID" "Team2 Name"
## [19] "Team2 Captain" "Team2 Captain Name" "Team2 Runs Scored"
## [22] "Team2 Wickets Fell" "Team2 Extras Rec" "Match Venue (Stadium)"
## [25] "Match Venue (City)" "Match Venue (Country)" "Home Country\n"
## [28] "Umpire 1" "Umpire 2" "Match Referee"
## [31] "Toss Winner" "Toss Winner Choice" "Match Winner"
## [34] "MOM Player" "Man of the Match"
Q3. Print top 15 rows
head(My_Data1, 15)
## # A tibble: 15 × 35
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 64148 Australia V… Single Match 60783 England [Mar…
## 2 2 64944 England Vs … Multi Match 60784 Australia to…
## 3 3 64945 England Vs … Multi Match 60784 Australia to…
## 4 4 64946 England Vs … Multi Match 60784 Australia to…
## 5 5 64149 New Zealand… Single Match 60785 Pakistan tou…
## 6 6 64947 England Vs … Multi Match 60786 New Zealand …
## 7 8 64949 England Vs … Multi Match 60787 West Indies …
## 8 9 64950 England Vs … Multi Match 60787 West Indies …
## 9 10 64150 New Zealand… Multi Match 60788 Australia to…
## 10 11 64151 New Zealand… Multi Match 60788 Australia to…
## 11 12 64951 England Vs … Multi Match 60789 India tour o…
## 12 13 64952 England Vs … Multi Match 60789 India tour o…
## 13 14 64953 England Vs … Multi Match 60790 Pakistan tou…
## 14 15 64954 England Vs … Multi Match 60790 Pakistan tou…
## 15 16 64152 Australia V… Single Match 60791 England tour…
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## # `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## # `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## # `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## # `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>,
## # `Team2 Runs Scored` <dbl>, `Team2 Wickets Fell` <dbl>, …
Q4. Write a user defined function using any of the variables from the data set
The code below counts how many ODI matches India won between 1971 and 2024, the period covered by the dataset.
count_wins <- function(team_name, data) {
# `Match Winner` stores team names in upper case (e.g. "INDIA")
win_count <- sum(data$`Match Winner` == team_name, na.rm = TRUE)
return(paste(team_name, "won", win_count, "matches"))
}
count_wins("INDIA", My_Data1)
## [1] "INDIA won 559 matches"
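In the str() output, `Match Winner` is stored in upper case ("AUSTRALIA") while the team name columns use title case ("Australia"), so calling count_wins() with "India" would find no matches. A minimal sketch of a case-insensitive variant follows; toy_matches and count_wins_ci are illustrative names with made-up values, not part of the original analysis.

```r
# Toy data standing in for My_Data1 (values are made up for illustration)
toy_matches <- data.frame(
  `Match Winner` = c("INDIA", "AUSTRALIA", "INDIA", NA, "ENGLAND"),
  check.names = FALSE
)

# Hypothetical helper: compare team names after upper-casing both sides
count_wins_ci <- function(team_name, data) {
  sum(toupper(data$`Match Winner`) == toupper(team_name), na.rm = TRUE)
}

count_wins_ci("India", toy_matches)
```

Returning the bare count rather than a pasted sentence keeps the result usable in later calculations.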
Q5. Use data manipulation techniques and filter rows based on any logical criteria that exist in your dataset.
load necessary packages
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.5.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
The filtered data below contains the 1,010 matches played by India, with India's runs in each match stored in the new India_Score column.
INDIA <- My_Data1 %>%
  filter(`Team1 Name` == "India" | `Team2 Name` == "India") %>%
  mutate(India_Score = ifelse(`Team1 Name` == "India", `Team1 Runs Scored`, `Team2 Runs Scored`))
INDIA
## # A tibble: 1,010 × 36
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 12 64951 England Vs … Multi Match 60789 India tour o…
## 2 13 64952 England Vs … Multi Match 60789 India tour o…
## 3 19 65035 England Vs … Multi Match 60793 Prudential W…
## 4 24 65040 East Africa… Multi Match 60793 Prudential W…
## 5 28 65044 India Vs Ne… Multi Match 60793 Prudential W…
## 6 35 64156 New Zealand… Multi Match 60795 India tour o…
## 7 36 64157 New Zealand… Multi Match 60795 India tour o…
## 8 54 64165 Pakistan Vs… Multi Match 60804 India tour o…
## 9 55 64166 Pakistan Vs… Multi Match 60804 India tour o…
## 10 56 64167 Pakistan Vs… Multi Match 60804 India tour o…
## # ℹ 1,000 more rows
## # ℹ 30 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## # `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## # `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## # `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## # `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
Q6. Identify the dependent and independent variables and use reshaping techniques and create a new data frame by joining those variables from your dataset
names(My_Data1)[names(My_Data1) == "Home Country\n"] <- "Home Country"
My_Selected <- cbind(
My_Data1$`Match ID`,My_Data1$`Toss Winner`,
My_Data1$`Toss Winner Choice`,
My_Data1$WorldCup,
My_Data1$`Match Venue (Country)`,
My_Data1$`Home Country`,
My_Data1$`Match Winner`
)
My_Selected <- as.data.frame(My_Selected)
colnames(My_Selected) <- c(
"Match_ID",
"Toss_Winner",
"Toss_Choice",
"WorldCup",
"Venue_Country",
"Home_Country",
"Match_Winner"
)
new_df=My_Data1%>%select(`Match ID`,`Toss Winner`,`Toss Winner Choice`,WorldCup,`Match Venue (Country)`,`Match Winner`)
new_df
## # A tibble: 4,571 × 6
## `Match ID` `Toss Winner` `Toss Winner Choice` WorldCup Match Venue (Country…¹
## <dbl> <chr> <chr> <chr> <chr>
## 1 64148 Australia bowl No Australia
## 2 64944 Australia bat No England
## 3 64945 Australia bowl No England
## 4 64946 England bowl No England
## 5 64149 Pakistan bowl No New Zealand
## 6 64947 New Zealand bat No England
## 7 64949 West Indies bat No England
## 8 64950 England bat No England
## 9 64150 New Zealand bat No New Zealand
## 10 64151 Australia bat No New Zealand
## # ℹ 4,561 more rows
## # ℹ abbreviated name: ¹`Match Venue (Country)`
## # ℹ 1 more variable: `Match Winner` <chr>
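My_Selected is a new data frame that combines the selected independent variables (Match ID, Toss Winner, Toss Winner Choice, WorldCup, Match Venue (Country), Home Country) with the dependent variable (Match Winner) using cbind(). The new_df tibble printed above is built with select() and holds the same variables except Home Country. Note that cbind() on these vectors produces a character matrix, so Match_ID is stored as text in My_Selected.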
This reshaped data will be used for predictive analysis to examine how factors like toss outcome, venue, and home advantage may influence match results.
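Because cbind() on plain vectors builds a matrix, mixing numeric and character columns coerces everything to text. The sketch below contrasts that with data.frame(), which keeps each column's type. The vectors are made-up stand-ins for the real columns, and via_cbind and via_df are illustrative names.

```r
# Toy vectors standing in for the real columns (values are made up)
match_id    <- c(64148, 64944, 64945)
toss_winner <- c("Australia", "Australia", "England")

# cbind() on vectors builds a character matrix, so the numeric IDs lose their type
via_cbind <- as.data.frame(cbind(Match_ID = match_id, Toss_Winner = toss_winner))
class(via_cbind$Match_ID)

# data.frame() keeps each column's own type
via_df <- data.frame(Match_ID = match_id, Toss_Winner = toss_winner)
class(via_df$Match_ID)
```

Keeping Match_ID numeric matters if the data frame is later sorted or joined on that column.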
Q7. Removing Missing Values in the Dataset
data_clean <- na.omit(My_Data1)
data_clean
## # A tibble: 4,571 × 35
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 64148 Australia V… Single Match 60783 England [Mar…
## 2 2 64944 England Vs … Multi Match 60784 Australia to…
## 3 3 64945 England Vs … Multi Match 60784 Australia to…
## 4 4 64946 England Vs … Multi Match 60784 Australia to…
## 5 5 64149 New Zealand… Single Match 60785 Pakistan tou…
## 6 6 64947 England Vs … Multi Match 60786 New Zealand …
## 7 8 64949 England Vs … Multi Match 60787 West Indies …
## 8 9 64950 England Vs … Multi Match 60787 West Indies …
## 9 10 64150 New Zealand… Multi Match 60788 Australia to…
## 10 11 64151 New Zealand… Multi Match 60788 Australia to…
## # ℹ 4,561 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## # `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## # `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## # `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## # `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
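The cleaned tibble still has 4,571 rows, the same as the original, so na.omit() removed nothing. One way to confirm this, sketched here on made-up data, is to count NA values per column. The sketch also shows that placeholder strings such as the "Unknown" values seen in Match Referee are not treated as missing. The toy data frame and its column names are assumptions for illustration.

```r
# Toy data standing in for My_Data1 (values are made up)
toy <- data.frame(
  referee = c("Unknown", "Unknown", "Referee A", NA),
  runs    = c(190, NA, 236, 179)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Placeholder strings are not NA, so na.omit() keeps those rows
unknown_referees <- sum(toy$referee == "Unknown", na.rm = TRUE)
unknown_referees

# Rows that survive na.omit()
nrow(na.omit(toy))
```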
Q8. Identify duplicate data in the dataset
data_duplicates <- data_clean %>% duplicated()
head(data_duplicates,50)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE
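None of the first 50 rows is flagged as a duplicate. This is expected when Match ID acts as a unique key: each match has a distinct ID, so no two rows should share the same Match ID. Only the first 50 flags are shown, so this output does not check the remaining rows.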
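Printing the first 50 flags does not cover the whole dataset. A minimal sketch, on made-up data, of summarising duplicates in one number and of testing whether a key column is unique; the toy data frame and variable names are illustrative.

```r
# Toy data standing in for data_clean (values are made up; row 4 repeats row 2)
toy <- data.frame(
  match_id = c(64148, 64944, 64945, 64944),
  winner   = c("AUSTRALIA", "ENGLAND", "AUSTRALIA", "ENGLAND")
)

# Count fully duplicated rows instead of inspecting individual flags
n_duplicate_rows <- sum(duplicated(toy))
n_duplicate_rows

# anyDuplicated() returns 0 if the key is unique, else the index of the first repeat
first_repeated_key <- anyDuplicated(toy$match_id)
first_repeated_key
```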
Q9. Reorder multiple rows in descending order
data_desc <- My_Data1 %>%arrange(desc(`Team1 Runs Scored`), desc(`Team2 Runs Scored`))
head(data_desc,50)
## # A tibble: 50 × 35
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 4413 1281444 Netherlands… Multi Match 1227837 England in N…
## 2 4011 1119539 England Vs … Multi Match 1119526 AUS in ENG …
## 3 3773 913657 England Vs … Multi Match 913603 PAK tour of …
## 4 2390 247827 Netherlands… Multi Match 247645 Sri Lanka to…
## 5 3583 722341 South Afric… Multi Match 722325 West Indies …
## 6 3700 903601 India Vs So… Multi Match 903579 South Africa…
## 7 2349 238200 South Afric… Multi Match 238138 Australia to…
## 8 4661 1384395 South Afric… Multi Match 1367856 ICC Cricket …
## 9 4099 1158069 West Indies… Multi Match 1158058 England tour…
## 10 3223 536932 India Vs We… Multi Match 536926 West Indies …
## # ℹ 40 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## # `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## # `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## # `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## # `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
Q10. Rename some of the column names
data_renamed <- My_Data1 %>% rename(T1_Name = `Team1 Name`,T1_ID = `Team1 ID`)
head(data_renamed,50)
## # A tibble: 50 × 35
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 1 64148 Australia V… Single Match 60783 England [Mar…
## 2 2 64944 England Vs … Multi Match 60784 Australia to…
## 3 3 64945 England Vs … Multi Match 60784 Australia to…
## 4 4 64946 England Vs … Multi Match 60784 Australia to…
## 5 5 64149 New Zealand… Single Match 60785 Pakistan tou…
## 6 6 64947 England Vs … Multi Match 60786 New Zealand …
## 7 8 64949 England Vs … Multi Match 60787 West Indies …
## 8 9 64950 England Vs … Multi Match 60787 West Indies …
## 9 10 64150 New Zealand… Multi Match 60788 Australia to…
## 10 11 64151 New Zealand… Multi Match 60788 Australia to…
## # ℹ 40 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, T1_ID <dbl>, T1_Name <chr>, `Team1 Captain` <dbl>,
## # `Team1 Captain Name` <chr>, `Team1 Runs Scored` <dbl>,
## # `Team1 Wickets Fell` <dbl>, `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>,
## # `Team2 Name` <chr>, `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>,
## # `Team2 Runs Scored` <dbl>, `Team2 Wickets Fell` <dbl>, …
Q11. Add new variables in your data frame by using a mathematical function
INDIA$runrate=INDIA$India_Score/50
INDIA
## # A tibble: 1,010 × 37
## `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 12 64951 England Vs … Multi Match 60789 India tour o…
## 2 13 64952 England Vs … Multi Match 60789 India tour o…
## 3 19 65035 England Vs … Multi Match 60793 Prudential W…
## 4 24 65040 East Africa… Multi Match 60793 Prudential W…
## 5 28 65044 India Vs Ne… Multi Match 60793 Prudential W…
## 6 35 64156 New Zealand… Multi Match 60795 India tour o…
## 7 36 64157 New Zealand… Multi Match 60795 India tour o…
## 8 54 64165 Pakistan Vs… Multi Match 60804 India tour o…
## 9 55 64166 Pakistan Vs… Multi Match 60804 India tour o…
## 10 56 64167 Pakistan Vs… Multi Match 60804 India tour o…
## # ℹ 1,000 more rows
## # ℹ 31 more variables: WorldCup <chr>, `Match Date` <dttm>,
## # `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## # `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## # `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## # `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## # `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
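Run rate is the average number of runs scored per over. Dividing by 50 assumes India batted a full 50 overs in every match, so the value is only an approximation for innings of a different length.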
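The dataset shown here has no overs column, so the calculation divides every score by a fixed 50. As a hedged sketch, the helper below (run_rate is a hypothetical name) makes the innings length an explicit parameter, which would let older 60-over matches or shortened innings be handled if that information were available.

```r
# Hypothetical helper: runs per over, with the innings length as a parameter
run_rate <- function(runs, overs = 50) {
  runs / overs
}

run_rate(230)        # a score of 230 over a 50-over innings
run_rate(230, 60)    # the same score over a 60-over innings
```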
Q12. Create a training set using a random number generator engine.
set.seed(123)
Training=sample(1:nrow(My_Data1), size=0.6*nrow(My_Data1))
Trainingdata=as.data.frame(Training)
head(Trainingdata,50)
## Training
## 1 2463
## 2 2511
## 3 2227
## 4 526
## 5 4291
## 6 2986
## 7 1842
## 8 1142
## 9 3371
## 10 3446
## 11 1627
## 12 2757
## 13 953
## 14 4444
## 15 1017
## 16 2013
## 17 2888
## 18 2567
## 19 1450
## 20 1790
## 21 4307
## 22 2980
## 23 1614
## 24 555
## 25 4469
## 26 1167
## 27 2592
## 28 2538
## 29 1799
## 30 905
## 31 1047
## 32 3004
## 33 4405
## 34 3207
## 35 3995
## 36 166
## 37 217
## 38 1314
## 39 2629
## 40 588
## 41 1599
## 42 4237
## 43 3937
## 44 4089
## 45 2907
## 46 4249
## 47 294
## 48 277
## 49 41
## 50 316
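Trainingdata holds the sampled row numbers rather than the matches themselves. A minimal sketch of using such indices to split a data frame into training and test rows follows; the toy data frame and the names train_idx, train_set and test_set are assumptions for illustration.

```r
set.seed(123)

# Toy data standing in for My_Data1 (values are made up)
toy <- data.frame(id = 1:10, runs = seq(150, 240, by = 10))

# Sample 60% of the row indices, then use them to subset rows
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.6 * nrow(toy)))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

nrow(train_set)
nrow(test_set)
```

Negative indexing gives the complementary rows, so no match appears in both sets.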
Q13. Summary Statistics
summary(My_Data1)
## ODI Match No Match ID Match Name Match Type
## Min. : 1 Min. : 64148 Length:4571 Length:4571
## 1st Qu.:1178 1st Qu.: 65334 Class :character Class :character
## Median :2361 Median : 238196 Mode :character Mode :character
## Mean :2367 Mean : 433421
## 3rd Qu.:3556 3rd Qu.: 727922
## Max. :4745 Max. :1421073
## Series ID Series Name WorldCup
## Min. : 60783 Length:4571 Length:4571
## 1st Qu.: 61000 Class :character Class :character
## Median : 223346 Mode :character Mode :character
## Mean : 422073
## 3rd Qu.: 727913
## Max. :1420525
## Match Date Match Format Team1 ID
## Min. :1971-01-05 00:00:00 Length:4571 Min. : 1.00
## 1st Qu.:1997-02-17 12:00:00 Class :character 1st Qu.: 3.00
## Median :2006-04-12 00:00:00 Mode :character Median : 6.00
## Mean :2005-03-26 13:36:52 Mean : 11.72
## 3rd Qu.:2014-11-28 00:00:00 3rd Qu.: 9.00
## Max. :2024-03-18 00:00:00 Max. :4083.00
## Team1 Name Team1 Captain Team1 Captain Name Team1 Runs Scored
## Length:4571 Min. : 858 Length:4571 Min. : 35.0
## Class :character 1st Qu.: 1900 Class :character 1st Qu.:197.0
## Mode :character Median : 2270 Mode :character Median :237.0
## Mean : 18880 Mean :236.5
## 3rd Qu.: 45788 3rd Qu.:278.0
## Max. :108314 Max. :498.0
## Team1 Wickets Fell Team1 Extras Rec Team2 ID Team2 Name
## Min. : 1.000 Min. : 1.00 Min. : 1.00 Length:4571
## 1st Qu.: 6.000 1st Qu.:10.00 1st Qu.: 3.00 Class :character
## Median : 8.000 Median :14.00 Median : 6.00 Mode :character
## Mean : 7.847 Mean :15.17 Mean : 15.35
## 3rd Qu.:10.000 3rd Qu.:19.00 3rd Qu.: 9.00
## Max. :10.000 Max. :59.00 Max. :4083.00
## Team2 Captain Team2 Captain Name Team2 Runs Scored Team2 Wickets Fell
## Min. : 858 Length:4571 Min. : 36.0 Min. : 0.000
## 1st Qu.: 1873 Class :character 1st Qu.:165.0 1st Qu.: 4.000
## Median : 2281 Mode :character Median :204.0 Median : 7.000
## Mean : 19145 Mean :202.8 Mean : 6.749
## 3rd Qu.: 45788 3rd Qu.:240.5 3rd Qu.:10.000
## Max. :108314 Max. :438.0 Max. :10.000
## Team2 Extras Rec Match Venue (Stadium) Match Venue (City)
## Min. : 0.00 Length:4571 Length:4571
## 1st Qu.: 8.50 Class :character Class :character
## Median :13.00 Mode :character Mode :character
## Mean :13.51
## 3rd Qu.:18.00
## Max. :44.00
## Match Venue (Country) Home Country Umpire 1 Umpire 2
## Length:4571 Length:4571 Length:4571 Length:4571
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Match Referee Toss Winner Toss Winner Choice Match Winner
## Length:4571 Length:4571 Length:4571 Length:4571
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## MOM Player Man of the Match
## Length:4571 Length:4571
## Class :character Class :character
## Mode :character Mode :character
##
##
##
The summary statistics of the dataset reveal several insights into ODI cricket history. With 4,571 matches, the dataset offers a reasonable foundation for trend analysis and comparisons. It spans over five decades, from the first recorded match on January 5, 1971, to the most recent on March 18, 2024, and so captures the evolution and global expansion of ODI cricket. On average, Team 1 scores more runs than Team 2 (a mean of 236.5 versus 202.8) but also loses more wickets (7.85 versus 6.75). This pattern is consistent with Team 1 being the side that bats first: a side batting second stops as soon as it passes the target, which limits both its runs and its wickets lost, so the gap should not be read as a strategic advantage for Team 1 without checking match results. Mean extras are 15.17 for Team 1 and 13.51 for Team 2, with maxima of 59 and 44, which shows that extras can contribute materially to a total. The dataset includes detailed information on match venues across various countries, cities, and stadiums, reflecting the international nature of ODI cricket. It also contains contextual data (toss winner, toss decision, match winner, and player awards) that enables deeper analysis of game conditions, pre-match decisions, and individual performance trends. Additionally, fields related to match officials allow for exploring patterns in umpiring and refereeing. Overall, this dataset offers a comprehensive view of team performance, historical developments, and the dynamics of international ODI cricket.
Q14.Use any of the numerical variables from the dataset and perform the following statistical functions.
Q14.1 Mean -
mean(INDIA$India_Score)
## [1] 230.3059
This is the average score of the Indian cricket team in the dataset, which covers 1971 to 2024.
Q14.2 Median -
median(INDIA$India_Score)
## [1] 230
This is the central value of India's scores and is not affected by outliers. As there is little difference between the mean and the median, the distribution of runs scored appears roughly symmetrical.
Q14.3 Mode -
get_mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
get_mode(INDIA$India_Score)
## [1] 245
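The most commonly scored total is 245, which is higher than both the mean and the median. A single most frequent total says little about the overall shape of the distribution, so this is only weak evidence that the team often posts high scores.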
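get_mode() relies on which.max(), which returns only the first value with the highest count, so a tie between two totals would go unnoticed. A minimal sketch of a variant that returns every tied value; get_modes is a hypothetical helper and the scores are made up.

```r
# Hypothetical helper: return every value that shares the highest frequency
get_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

scores <- c(245, 230, 245, 230, 180)   # made-up scores with a two-way tie
get_modes(scores)
```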
Q14.4 Range
range(INDIA$India_Score)
## [1] 51 418
Taken together, the statistics suggest that scores are roughly balanced around a central value of about 230, across a wide range from 51 to 418. The mode of 245 sits slightly above that centre, but a single most frequent value is not by itself evidence of a cluster of high scores. Low scores exist but do not pull the mean far below the median.
load necessary packages
library(ggplot2)
Q15. Plot a scatter plot for any 2 variables in your dataset
ggplot(My_Data1, aes(x = `ODI Match No`, y = `Match Date`)) +
geom_point(color = "blue", size = 0.2) +
labs(title = "Timeline of ODI Matches",
x = "ODI Match Number",
y = "Match Date") +
theme_minimal()
The points form a steadily rising curve from the 1970s to 2024, reflecting the growing number of ODI matches over time. Because the date is on the y-axis, a steeper slope means more time passed per match: in the early years the curve is steeper, indicating fewer matches, as ODI cricket was still new with limited team participation. As the format gained popularity, especially with the introduction of World Cups and tri-nation series, match frequency increased and ODIs became a regular feature of bilateral tours, which flattens the curve. The more even spacing in recent years suggests regular, professional scheduling. Short vertical jumps, where the date advances while few matches are added, such as around match numbers ~4300–4400, likely correspond to disruptions like the COVID-19 pandemic. There are no outliers in the graph.
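The claim about match frequency can be checked directly by counting matches per calendar year rather than reading it off the slope. A minimal sketch on made-up dates follows; match_dates stands in for the `Match Date` column and is an assumption for illustration.

```r
# Toy dates standing in for My_Data1$`Match Date` (values are made up)
match_dates <- as.Date(c("1971-01-05", "1972-08-24", "1972-08-26",
                         "2019-03-01", "2019-06-10", "2019-07-14"))

# Count matches in each calendar year
matches_per_year <- table(format(match_dates, "%Y"))
matches_per_year
```

A year with an unusually low count would point to a disruption more reliably than a visual reading of the scatter plot.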
Q16. Plot a bar plot for any 2 variables in your dataset
library(dplyr)
library(ggplot2)
# Step 1: Summarize wins correctly
wincounts <- My_Data1 %>%
filter(!is.na(`Match Winner`) & `Match Winner` != "No Result") %>%
group_by(`Match Winner`) %>%
summarise(Wins = n()) %>%
arrange(desc(Wins))
# Step 2: Plot
ggplot(wincounts, aes(x = reorder(`Match Winner`, -Wins), y = Wins)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "ODI Match Wins by Team",
x = "Team",
y = "Number of Wins") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart illustrates the number of ODI match wins by each team, revealing clear patterns in performance and participation across countries. Australia stands out as the most successful team, followed closely by India and Pakistan, highlighting their dominance in the ODI format over the decades. Traditional cricketing nations such as Sri Lanka, West Indies, South Africa, and England also show strong win records, reflecting their long-standing involvement in international cricket. Emerging teams like Afghanistan and Ireland appear in the middle range, indicating their growing presence in ODIs in recent years. Teams lower on the chart, including the U.S.A., Namibia, and Hong Kong, represent associate nations with fewer opportunities and matches. Additionally, entries like “MATCH DRAW” and “ICC WORLD XI” suggest the need for minor data cleaning, as these are either rare outcomes or special-purpose teams. Overall, the chart effectively showcases both historical success and the evolving landscape of international ODI cricket.
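The data cleaning mentioned above could be done by dropping labels that are not national sides before counting. A minimal sketch on a made-up vector; winners and not_national_sides are illustrative names, and the exact spelling of such labels in the real column is an assumption.

```r
# Toy winners column standing in for My_Data1$`Match Winner` (values are made up)
winners <- c("AUSTRALIA", "INDIA", "MATCH DRAW", "INDIA", "ICC WORLD XI", "No Result")

# Labels to exclude before counting wins (spellings assumed for illustration)
not_national_sides <- c("MATCH DRAW", "ICC WORLD XI", "No Result")

# Keep only genuine team names, then count and sort wins
win_counts <- sort(table(winners[!winners %in% not_national_sides]), decreasing = TRUE)
win_counts
```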
Q17. Find the correlation between any 2 variables by applying a least squares linear regression model.
model <-lm(`ODI Match No` ~ `Series ID`,data = My_Data1)
summary(model)
##
## Call:
## lm(formula = `ODI Match No` ~ `Series ID`, data = My_Data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1377.14 -356.55 7.71 542.36 1039.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.212e+03 1.205e+01 100.6 <2e-16 ***
## `Series ID` 2.737e-03 1.951e-05 140.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 594.9 on 4569 degrees of freedom
## Multiple R-squared: 0.8116, Adjusted R-squared: 0.8116
## F-statistic: 1.969e+04 on 1 and 4569 DF, p-value: < 2.2e-16
cor(My_Data1$`ODI Match No`, My_Data1$`Series ID`, use = "complete.obs")
## [1] 0.9009111
The results show a strong positive correlation between ODI Match No and Series ID. As the Series ID increases, the ODI Match Number also tends to go up, because both identifiers appear to be assigned roughly in chronological order and each new series adds new matches.
The correlation coefficient is 0.9009, which is close to 1 and indicates a strong linear association. The regression model shows that about 81% of the variance in ODI Match No is explained by Series ID (R-squared = 0.8116, which is the square of the correlation). Since both variables are identifiers that grow over time, this reflects their shared time trend rather than a causal relationship.
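For a regression with a single predictor, R-squared is the square of the correlation coefficient, which is why 0.9009 and 0.8116 agree (0.9009² ≈ 0.8116). A minimal sketch checking this identity on made-up data; x, y and fit are illustrative names.

```r
# Toy data (made up): y rises roughly linearly with x
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit <- lm(y ~ x)

r         <- cor(x, y)
r_squared <- summary(fit)$r.squared

# For a one-predictor least squares fit, R-squared equals the squared correlation
all.equal(r^2, r_squared)
```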