Assignment1(R)

library(readxl)

## Warning: package 'readxl' was built under R version 4.5.1

Odi_Match_Info <- read_excel("C:/Users/nishi/Downloads/Odi_Match_Info.xlsx")
View(Odi_Match_Info)

Assigning the data set to a variable

My_Data1 = Odi_Match_Info

Q1. Print the structure of the dataset

str(My_Data1)

## tibble [4,571 × 35] (S3: tbl_df/tbl/data.frame)
##  $ ODI Match No         : num [1:4571] 1 2 3 4 5 6 8 9 10 11 ...
##  $ Match ID             : num [1:4571] 64148 64944 64945 64946 64149 ...
##  $ Match Name           : chr [1:4571] "Australia Vs England Only Odi" "England Vs Australia 1St Odi" "England Vs Australia 2Nd Odi" "England Vs Australia 3Rd Odi" ...
##  $ Match Type           : chr [1:4571] "Single Match" "Multi Match" "Multi Match" "Multi Match" ...
##  $ Series ID            : num [1:4571] 60783 60784 60784 60784 60785 ...
##  $ Series Name          : chr [1:4571] "England [Marylebone Cricket Club] tour of Australia  - 1970 (1970/71)" "Australia tour of England and Scotland  - 1972 (1972)" "Australia tour of England and Scotland  - 1972 (1972)" "Australia tour of England and Scotland  - 1972 (1972)" ...
##  $ WorldCup             : chr [1:4571] "No" "No" "No" "No" ...
##  $ Match Date           : POSIXct[1:4571], format: "1971-01-05" "1972-08-24" ...
##  $ Match Format         : chr [1:4571] "ODI" "ODI" "ODI" "ODI" ...
##  $ Team1 ID             : num [1:4571] 1 2 1 2 5 5 4 1 5 2 ...
##  $ Team1 Name           : chr [1:4571] "England" "Australia" "England" "Australia" ...
##  $ Team1 Captain        : num [1:4571] 1081 1243 858 1243 1251 ...
##  $ Team1 Captain Name   : chr [1:4571] "Ray Illingworth" "Ian Chappell" "Brian Close" "Ian Chappell" ...
##  $ Team1 Runs Scored    : num [1:4571] 190 222 236 179 187 158 181 189 194 265 ...
##  $ Team1 Wickets Fell   : num [1:4571] 10 8 9 9 10 10 10 9 9 5 ...
##  $ Team1 Extras Rec     : num [1:4571] 10 10 15 10 24 9 12 8 16 12 ...
##  $ Team2 ID             : num [1:4571] 2 1 2 1 7 1 1 4 2 5 ...
##  $ Team2 Name           : chr [1:4571] "Australia" "England" "Australia" "England" ...
##  $ Team2 Captain        : num [1:4571] 1150 858 1243 858 1117 ...
##  $ Team2 Captain Name   : chr [1:4571] "Bill Lawry" "Brian Close" "Ian Chappell" "Brian Close" ...
##  $ Team2 Runs Scored    : num [1:4571] 191 226 240 180 165 159 182 190 195 234 ...
##  $ Team2 Wickets Fell   : num [1:4571] 5 4 5 8 10 3 9 2 3 6 ...
##  $ Team2 Extras Rec     : num [1:4571] 6 7 36 9 12 3 7 12 9 9 ...
##  $ Match Venue (Stadium): chr [1:4571] "Melbourne Cricket Ground" "Old Trafford" "Lord's" "Edgbaston" ...
##  $ Match Venue (City)   : chr [1:4571] "Melbourne" "Manchester" "London" "Birmingham" ...
##  $ Match Venue (Country): chr [1:4571] "Australia" "England" "England" "England" ...
##  $ Home Country
##        : chr [1:4571] "Australia" "England" "England" "England" ...
##  $ Umpire 1             : chr [1:4571] "LP Rowan" "CS Elliott" "AE Fagg" "ASM Oakman" ...
##  $ Umpire 2             : chr [1:4571] "TF Brooks" "AEG Rhodes" "TW Spencer" "DJ Constant" ...
##  $ Match Referee        : chr [1:4571] "Unknown" "Unknown" "Unknown" "Unknown" ...
##  $ Toss Winner          : chr [1:4571] "Australia" "Australia" "Australia" "England" ...
##  $ Toss Winner Choice   : chr [1:4571] "bowl" "bat" "bowl" "bowl" ...
##  $ Match Winner         : chr [1:4571] "AUSTRALIA" "ENGLAND" "AUSTRALIA" "ENGLAND" ...
##  $ MOM Player           : chr [1:4571] "1203.0" "1285.0" "1364.0" "1396.0" ...
##  $ Man of the Match     : chr [1:4571] "John Edrich" "Dennis Amiss" "Greg Chappell" "Barry Wood" ...

Q2. List the variables in the dataset

names(My_Data1)

##  [1] "ODI Match No"          "Match ID"              "Match Name"           
##  [4] "Match Type"            "Series ID"             "Series Name"          
##  [7] "WorldCup"              "Match Date"            "Match Format"         
## [10] "Team1 ID"              "Team1 Name"            "Team1 Captain"        
## [13] "Team1 Captain Name"    "Team1 Runs Scored"     "Team1 Wickets Fell"   
## [16] "Team1 Extras Rec"      "Team2 ID"              "Team2 Name"           
## [19] "Team2 Captain"         "Team2 Captain Name"    "Team2 Runs Scored"    
## [22] "Team2 Wickets Fell"    "Team2 Extras Rec"      "Match Venue (Stadium)"
## [25] "Match Venue (City)"    "Match Venue (Country)" "Home Country\n"       
## [28] "Umpire 1"              "Umpire 2"              "Match Referee"        
## [31] "Toss Winner"           "Toss Winner Choice"    "Match Winner"         
## [34] "MOM Player"            "Man of the Match"

Q3. Print the top 15 rows

head(My_Data1, 15)

## # A tibble: 15 × 35
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1              1      64148 Australia V… Single Match       60783 England [Mar…
##  2              2      64944 England Vs … Multi Match        60784 Australia to…
##  3              3      64945 England Vs … Multi Match        60784 Australia to…
##  4              4      64946 England Vs … Multi Match        60784 Australia to…
##  5              5      64149 New Zealand… Single Match       60785 Pakistan tou…
##  6              6      64947 England Vs … Multi Match        60786 New Zealand …
##  7              8      64949 England Vs … Multi Match        60787 West Indies …
##  8              9      64950 England Vs … Multi Match        60787 West Indies …
##  9             10      64150 New Zealand… Multi Match        60788 Australia to…
## 10             11      64151 New Zealand… Multi Match        60788 Australia to…
## 11             12      64951 England Vs … Multi Match        60789 India tour o…
## 12             13      64952 England Vs … Multi Match        60789 India tour o…
## 13             14      64953 England Vs … Multi Match        60790 Pakistan tou…
## 14             15      64954 England Vs … Multi Match        60790 Pakistan tou…
## 15             16      64152 Australia V… Single Match       60791 England tour…
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## #   `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## #   `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## #   `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## #   `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>,
## #   `Team2 Runs Scored` <dbl>, `Team2 Wickets Fell` <dbl>, …

Q4. Write a user defined function using any of the variables from the data set

The code below counts how many ODI matches India won in the period covered by the dataset (1971 to 2024).

# Count the matches won by a team; `data` must contain a `Match Winner` column
count_wins <- function(team_name, data) {
  win_count <- sum(data$`Match Winner` == team_name, na.rm = TRUE)
  paste(team_name, "won", win_count, "matches")
}
count_wins("INDIA", Odi_Match_Info)

## [1] "INDIA won 559 matches"
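
The Match Winner column stores team names in upper case ("AUSTRALIA"), while the Team1 Name and Team2 Name columns use title case ("India"). The following is a minimal sketch of a case-insensitive variant that returns the count as a number, so it can be reused in later calculations. The function name count_wins_ci and the toy data frame are illustrative assumptions and are not part of the assignment data.

```r
# Toy data standing in for Odi_Match_Info (values invented for illustration)
toy_matches <- data.frame(
  `Match Winner` = c("INDIA", "AUSTRALIA", "INDIA", NA, "ENGLAND"),
  check.names = FALSE
)

# Hypothetical helper: compare names in upper case and return a numeric count
count_wins_ci <- function(team_name, data) {
  sum(toupper(data$`Match Winner`) == toupper(team_name), na.rm = TRUE)
}

india_wins <- count_wins_ci("India", toy_matches)
india_wins
```

Returning a number instead of a pasted sentence keeps the result usable for arithmetic, such as a win percentage.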

Q5. Use data manipulation techniques and filter rows based on any logical criteria that exist in your dataset.

Load the necessary packages

library(tidyverse)

## Warning: package 'tidyr' was built under R version 4.5.1

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

The data below contains every match played by India (1,010 matches), with India's runs in each match stored in the new India_Score column.

INDIA <- My_Data1 %>%
  filter(`Team1 Name` == "India" | `Team2 Name` == "India") %>%
  mutate(India_Score = ifelse(`Team1 Name` == "India",
                              `Team1 Runs Scored`,
                              `Team2 Runs Scored`))
INDIA

## # A tibble: 1,010 × 36
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1             12      64951 England Vs … Multi Match        60789 India tour o…
##  2             13      64952 England Vs … Multi Match        60789 India tour o…
##  3             19      65035 England Vs … Multi Match        60793 Prudential W…
##  4             24      65040 East Africa… Multi Match        60793 Prudential W…
##  5             28      65044 India Vs Ne… Multi Match        60793 Prudential W…
##  6             35      64156 New Zealand… Multi Match        60795 India tour o…
##  7             36      64157 New Zealand… Multi Match        60795 India tour o…
##  8             54      64165 Pakistan Vs… Multi Match        60804 India tour o…
##  9             55      64166 Pakistan Vs… Multi Match        60804 India tour o…
## 10             56      64167 Pakistan Vs… Multi Match        60804 India tour o…
## # ℹ 1,000 more rows
## # ℹ 30 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## #   `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## #   `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## #   `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## #   `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
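
To check the logic of the filter and the derived India_Score column, the same steps can be run on a tiny hand-made table where the answer is obvious. This is a base R sketch with invented values and simplified column names; the analysis above uses dplyr on the real data.

```r
# Invented three-match table; only the first two matches involve India
toy <- data.frame(
  team1 = c("India", "England", "Australia"),
  team2 = c("England", "India", "Pakistan"),
  runs1 = c(250, 210, 300),
  runs2 = c(240, 211, 280)
)

# Keep the matches in which India appears on either side
india_rows <- toy[toy$team1 == "India" | toy$team2 == "India", ]

# Pick India's score from whichever side India played on
india_rows$India_Score <- ifelse(india_rows$team1 == "India",
                                 india_rows$runs1,
                                 india_rows$runs2)
india_rows$India_Score
```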

Q6. Identify the dependent and independent variables and use reshaping techniques and create a new data frame by joining those variables from your dataset

names(My_Data1)[names(My_Data1) == "Home Country\n"] <- "Home Country"

My_Selected <- cbind(
  My_Data1$`Match ID`,
  My_Data1$`Toss Winner`,
  My_Data1$`Toss Winner Choice`,
  My_Data1$WorldCup,
  My_Data1$`Match Venue (Country)`,
  My_Data1$`Home Country`,
  My_Data1$`Match Winner`
)

My_Selected <- as.data.frame(My_Selected)
colnames(My_Selected) <- c(
  "Match_ID",
  "Toss_Winner",
  "Toss_Choice",
  "WorldCup",
  "Venue_Country",
  "Home_Country",
  "Match_Winner"
)

new_df <- My_Data1 %>%
  select(`Match ID`, `Toss Winner`, `Toss Winner Choice`, WorldCup,
         `Match Venue (Country)`, `Match Winner`)
new_df

## # A tibble: 4,571 × 6
##    `Match ID` `Toss Winner` `Toss Winner Choice` WorldCup Match Venue (Country…¹
##         <dbl> <chr>         <chr>                <chr>    <chr>                 
##  1      64148 Australia     bowl                 No       Australia             
##  2      64944 Australia     bat                  No       England               
##  3      64945 Australia     bowl                 No       England               
##  4      64946 England       bowl                 No       England               
##  5      64149 Pakistan      bowl                 No       New Zealand           
##  6      64947 New Zealand   bat                  No       England               
##  7      64949 West Indies   bat                  No       England               
##  8      64950 England       bat                  No       England               
##  9      64150 New Zealand   bat                  No       New Zealand           
## 10      64151 Australia     bat                  No       New Zealand           
## # ℹ 4,561 more rows
## # ℹ abbreviated name: ¹`Match Venue (Country)`
## # ℹ 1 more variable: `Match Winner` <chr>

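My_Selected combines the selected independent variables (Match ID, toss winner, toss winner choice, World Cup flag, match venue country and home country) with the dependent variable (match winner) using cbind(). Because cbind() on vectors of mixed types first builds a character matrix, every column of My_Selected, including Match_ID, is stored as text. new_df holds the same variables except Home Country, selected with select(), and keeps the original column types.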

This reshaped data will be used for predictive analysis to examine how factors like toss outcome, venue, and home advantage may influence match results.
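
The effect of cbind() on column types can be seen on two short vectors. This is a small sketch with values taken from the first two rows of the dataset; the object names via_cbind and via_df are illustrative.

```r
ids <- c(64148, 64944)
winners <- c("AUSTRALIA", "ENGLAND")

# cbind() on a numeric and a character vector builds a character matrix first
via_cbind <- as.data.frame(cbind(Match_ID = ids, Match_Winner = winners),
                           stringsAsFactors = FALSE)

# data.frame() keeps each column's own type
via_df <- data.frame(Match_ID = ids, Match_Winner = winners,
                     stringsAsFactors = FALSE)

class(via_cbind$Match_ID)  # "character"
class(via_df$Match_ID)     # "numeric"
```

For modelling, data.frame() or select() is usually the safer choice, since numeric columns stay numeric.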

Q7. Remove missing values from the dataset

data_clean <- na.omit(My_Data1)
data_clean

## # A tibble: 4,571 × 35
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1              1      64148 Australia V… Single Match       60783 England [Mar…
##  2              2      64944 England Vs … Multi Match        60784 Australia to…
##  3              3      64945 England Vs … Multi Match        60784 Australia to…
##  4              4      64946 England Vs … Multi Match        60784 Australia to…
##  5              5      64149 New Zealand… Single Match       60785 Pakistan tou…
##  6              6      64947 England Vs … Multi Match        60786 New Zealand …
##  7              8      64949 England Vs … Multi Match        60787 West Indies …
##  8              9      64950 England Vs … Multi Match        60787 West Indies …
##  9             10      64150 New Zealand… Multi Match        60788 Australia to…
## 10             11      64151 New Zealand… Multi Match        60788 Australia to…
## # ℹ 4,561 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## #   `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## #   `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## #   `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## #   `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …
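
data_clean still has 4,571 rows, so na.omit() did not drop any row here. Note that placeholder text such as "Unknown" in Match Referee is an ordinary string, not NA, so na.omit() leaves it in place. A minimal sketch of how both kinds of gap could be counted, using an invented toy table:

```r
# Invented table: one true NA and two "Unknown" placeholders
toy <- data.frame(
  runs = c(190, NA, 236),
  referee = c("Unknown", "Unknown", "Some Referee"),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Placeholder strings have to be counted separately
unknown_count <- sum(toy$referee == "Unknown")
unknown_count

# na.omit() removes only the row with the true NA
rows_after_omit <- nrow(na.omit(toy))
rows_after_omit
```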

Q8. Identify duplicate rows in the dataset

data_duplicates <- data_clean %>% duplicated()
head(data_duplicates,50)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

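All of the first 50 values are FALSE, so none of these rows repeats an earlier row. This is expected: in many sports datasets the match ID is designed to be a unique key, so each match has a distinct Match ID and no two rows are identical. Only the first 50 values are displayed, so this output alone does not rule out duplicates further down.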
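
A check over the whole table, rather than the first 50 values, can be sketched as follows. The toy data frame and its column names are invented for illustration.

```r
# Invented table in which the third row repeats the second
toy <- data.frame(
  match_id = c(1, 2, 2, 3),
  winner = c("A", "B", "B", "C")
)

# Total number of rows that repeat an earlier row
n_duplicate_rows <- sum(duplicated(toy))
n_duplicate_rows

# anyDuplicated() returns 0 when a key is unique,
# otherwise the position of the first repeated value
first_dup_position <- anyDuplicated(toy$match_id)
first_dup_position

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]
```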

Q9. Reorder rows in descending order by multiple columns

data_desc <- My_Data1 %>%
  arrange(desc(`Team1 Runs Scored`), desc(`Team2 Runs Scored`))
head(data_desc,50)

## # A tibble: 50 × 35
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1           4413    1281444 Netherlands… Multi Match      1227837 England in N…
##  2           4011    1119539 England Vs … Multi Match      1119526 AUS in ENG  …
##  3           3773     913657 England Vs … Multi Match       913603 PAK tour of …
##  4           2390     247827 Netherlands… Multi Match       247645 Sri Lanka to…
##  5           3583     722341 South Afric… Multi Match       722325 West Indies …
##  6           3700     903601 India Vs So… Multi Match       903579 South Africa…
##  7           2349     238200 South Afric… Multi Match       238138 Australia to…
##  8           4661    1384395 South Afric… Multi Match      1367856 ICC Cricket …
##  9           4099    1158069 West Indies… Multi Match      1158058 England tour…
## 10           3223     536932 India Vs We… Multi Match       536926 West Indies …
## # ℹ 40 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## #   `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## #   `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## #   `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## #   `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …

Q10. Rename some of the column names

data_renamed <- My_Data1 %>%
  rename(T1_Name = `Team1 Name`, T1_ID = `Team1 ID`)
head(data_renamed,50)

## # A tibble: 50 × 35
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1              1      64148 Australia V… Single Match       60783 England [Mar…
##  2              2      64944 England Vs … Multi Match        60784 Australia to…
##  3              3      64945 England Vs … Multi Match        60784 Australia to…
##  4              4      64946 England Vs … Multi Match        60784 Australia to…
##  5              5      64149 New Zealand… Single Match       60785 Pakistan tou…
##  6              6      64947 England Vs … Multi Match        60786 New Zealand …
##  7              8      64949 England Vs … Multi Match        60787 West Indies …
##  8              9      64950 England Vs … Multi Match        60787 West Indies …
##  9             10      64150 New Zealand… Multi Match        60788 Australia to…
## 10             11      64151 New Zealand… Multi Match        60788 Australia to…
## # ℹ 40 more rows
## # ℹ 29 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, T1_ID <dbl>, T1_Name <chr>, `Team1 Captain` <dbl>,
## #   `Team1 Captain Name` <chr>, `Team1 Runs Scored` <dbl>,
## #   `Team1 Wickets Fell` <dbl>, `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>,
## #   `Team2 Name` <chr>, `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>,
## #   `Team2 Runs Scored` <dbl>, `Team2 Wickets Fell` <dbl>, …

Q11. Add new variables in your data frame by using a mathematical function

INDIA$runrate <- INDIA$India_Score / 50
INDIA

## # A tibble: 1,010 × 37
##    `ODI Match No` `Match ID` `Match Name` `Match Type` `Series ID` `Series Name`
##             <dbl>      <dbl> <chr>        <chr>              <dbl> <chr>        
##  1             12      64951 England Vs … Multi Match        60789 India tour o…
##  2             13      64952 England Vs … Multi Match        60789 India tour o…
##  3             19      65035 England Vs … Multi Match        60793 Prudential W…
##  4             24      65040 East Africa… Multi Match        60793 Prudential W…
##  5             28      65044 India Vs Ne… Multi Match        60793 Prudential W…
##  6             35      64156 New Zealand… Multi Match        60795 India tour o…
##  7             36      64157 New Zealand… Multi Match        60795 India tour o…
##  8             54      64165 Pakistan Vs… Multi Match        60804 India tour o…
##  9             55      64166 Pakistan Vs… Multi Match        60804 India tour o…
## 10             56      64167 Pakistan Vs… Multi Match        60804 India tour o…
## # ℹ 1,000 more rows
## # ℹ 31 more variables: WorldCup <chr>, `Match Date` <dttm>,
## #   `Match Format` <chr>, `Team1 ID` <dbl>, `Team1 Name` <chr>,
## #   `Team1 Captain` <dbl>, `Team1 Captain Name` <chr>,
## #   `Team1 Runs Scored` <dbl>, `Team1 Wickets Fell` <dbl>,
## #   `Team1 Extras Rec` <dbl>, `Team2 ID` <dbl>, `Team2 Name` <chr>,
## #   `Team2 Captain` <dbl>, `Team2 Captain Name` <chr>, …

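Run rate is the average number of runs scored per over. Dividing by 50 assumes India batted a full 50 overs in every match, so this value understates the true run rate whenever the innings ended early (for example, when the team was bowled out or reached its target) and does not account for matches played with a different over limit.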
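
The difference between the fixed divisor and a per-match divisor can be illustrated on invented numbers. The dataset has no overs column, so the overs variable below is purely hypothetical.

```r
# Invented innings: the same true scoring rate over different innings lengths
toy <- data.frame(
  runs = c(250, 180, 300),
  overs = c(50, 36, 60)  # hypothetical column, not present in the dataset
)

# Fixed divisor, as used above
toy$runrate_fixed <- toy$runs / 50

# Divisor based on the overs actually batted
toy$runrate_actual <- toy$runs / toy$overs

toy
```

All three innings were scored at 5 runs per over, but the fixed divisor reports 3.6 for the shortened innings and 6 for the longer one.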

Q12. Create a training set using a random number generator engine.

set.seed(123)
# Training holds 60% of the row indices, drawn at random without replacement
Training <- sample(1:nrow(My_Data1), size = 0.6 * nrow(My_Data1))
Trainingdata <- as.data.frame(Training)
head(Trainingdata,50)

##    Training
## 1      2463
## 2      2511
## 3      2227
## 4       526
## 5      4291
## 6      2986
## 7      1842
## 8      1142
## 9      3371
## 10     3446
## 11     1627
## 12     2757
## 13      953
## 14     4444
## 15     1017
## 16     2013
## 17     2888
## 18     2567
## 19     1450
## 20     1790
## 21     4307
## 22     2980
## 23     1614
## 24      555
## 25     4469
## 26     1167
## 27     2592
## 28     2538
## 29     1799
## 30      905
## 31     1047
## 32     3004
## 33     4405
## 34     3207
## 35     3995
## 36      166
## 37      217
## 38     1314
## 39     2629
## 40      588
## 41     1599
## 42     4237
## 43     3937
## 44     4089
## 45     2907
## 46     4249
## 47      294
## 48      277
## 49       41
## 50      316
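
Trainingdata above contains the sampled row numbers, not the matches themselves. A minimal sketch of how such indices can be used to split a table into training and test rows, shown on an invented ten-row data frame (the names train_idx, train_set and test_set are illustrative):

```r
set.seed(123)

# Invented table standing in for My_Data1
toy <- data.frame(id = 1:10, value = (1:10)^2)

# Sample 60% of the row indices without replacement
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.6 * nrow(toy)))

# Rows at the sampled indices form the training set; the rest form the test set
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

nrow(train_set)
nrow(test_set)
```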

Q13. Summary Statistics

summary(My_Data1)

##   ODI Match No     Match ID        Match Name         Match Type       
##  Min.   :   1   Min.   :  64148   Length:4571        Length:4571       
##  1st Qu.:1178   1st Qu.:  65334   Class :character   Class :character  
##  Median :2361   Median : 238196   Mode  :character   Mode  :character  
##  Mean   :2367   Mean   : 433421                                        
##  3rd Qu.:3556   3rd Qu.: 727922                                        
##  Max.   :4745   Max.   :1421073                                        
##    Series ID       Series Name          WorldCup        
##  Min.   :  60783   Length:4571        Length:4571       
##  1st Qu.:  61000   Class :character   Class :character  
##  Median : 223346   Mode  :character   Mode  :character  
##  Mean   : 422073                                        
##  3rd Qu.: 727913                                        
##  Max.   :1420525                                        
##    Match Date                  Match Format          Team1 ID      
##  Min.   :1971-01-05 00:00:00   Length:4571        Min.   :   1.00  
##  1st Qu.:1997-02-17 12:00:00   Class :character   1st Qu.:   3.00  
##  Median :2006-04-12 00:00:00   Mode  :character   Median :   6.00  
##  Mean   :2005-03-26 13:36:52                      Mean   :  11.72  
##  3rd Qu.:2014-11-28 00:00:00                      3rd Qu.:   9.00  
##  Max.   :2024-03-18 00:00:00                      Max.   :4083.00  
##   Team1 Name        Team1 Captain    Team1 Captain Name Team1 Runs Scored
##  Length:4571        Min.   :   858   Length:4571        Min.   : 35.0    
##  Class :character   1st Qu.:  1900   Class :character   1st Qu.:197.0    
##  Mode  :character   Median :  2270   Mode  :character   Median :237.0    
##                     Mean   : 18880                      Mean   :236.5    
##                     3rd Qu.: 45788                      3rd Qu.:278.0    
##                     Max.   :108314                      Max.   :498.0    
##  Team1 Wickets Fell Team1 Extras Rec    Team2 ID        Team2 Name       
##  Min.   : 1.000     Min.   : 1.00    Min.   :   1.00   Length:4571       
##  1st Qu.: 6.000     1st Qu.:10.00    1st Qu.:   3.00   Class :character  
##  Median : 8.000     Median :14.00    Median :   6.00   Mode  :character  
##  Mean   : 7.847     Mean   :15.17    Mean   :  15.35                     
##  3rd Qu.:10.000     3rd Qu.:19.00    3rd Qu.:   9.00                     
##  Max.   :10.000     Max.   :59.00    Max.   :4083.00                     
##  Team2 Captain    Team2 Captain Name Team2 Runs Scored Team2 Wickets Fell
##  Min.   :   858   Length:4571        Min.   : 36.0     Min.   : 0.000    
##  1st Qu.:  1873   Class :character   1st Qu.:165.0     1st Qu.: 4.000    
##  Median :  2281   Mode  :character   Median :204.0     Median : 7.000    
##  Mean   : 19145                      Mean   :202.8     Mean   : 6.749    
##  3rd Qu.: 45788                      3rd Qu.:240.5     3rd Qu.:10.000    
##  Max.   :108314                      Max.   :438.0     Max.   :10.000    
##  Team2 Extras Rec Match Venue (Stadium) Match Venue (City)
##  Min.   : 0.00    Length:4571           Length:4571       
##  1st Qu.: 8.50    Class :character      Class :character  
##  Median :13.00    Mode  :character      Mode  :character  
##  Mean   :13.51                                            
##  3rd Qu.:18.00                                            
##  Max.   :44.00                                            
##  Match Venue (Country) Home Country         Umpire 1           Umpire 2        
##  Length:4571           Length:4571        Length:4571        Length:4571       
##  Class :character      Class :character   Class :character   Class :character  
##  Mode  :character      Mode  :character   Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##  Match Referee      Toss Winner        Toss Winner Choice Match Winner      
##  Length:4571        Length:4571        Length:4571        Length:4571       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   MOM Player        Man of the Match  
##  Length:4571        Length:4571       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

The summary statistics of the dataset reveal several insights into ODI cricket history. With 4,571 matches, the dataset offers a solid foundation for trend analysis and comparisons. It spans more than five decades, from the first recorded match on January 5, 1971, to the most recent on March 18, 2024, and so captures the evolution and global expansion of ODI cricket. On average, Team 1 scores more runs than Team 2 (a mean of 236.5 compared to 202.8) but also loses more wickets (7.85 vs. 6.75). This pattern is consistent with Team 1 usually batting first: a chasing side stops batting as soon as it passes the target, which limits both its total and the wickets it loses, so the gap alone does not show that Team 1 has a strategic advantage. The average number of extras is 15.2 for Team 1 and 13.5 for Team 2, with maxima of 59 and 44, which shows that extras can play a real part in match outcomes.

The dataset includes detailed information on match venues across various countries, cities, and stadiums, reflecting the international nature of ODI cricket. It also contains contextual data, such as toss winner, toss decision, match winner, and player awards, which enables deeper analysis of game conditions, pre-match decisions, and individual performance trends. Additionally, fields related to match officials allow for exploring patterns in umpiring and refereeing. Overall, this dataset offers a comprehensive view of team performance, historical developments, and the dynamics of international ODI cricket.

Q14. Use any of the numerical variables from the dataset and perform the following statistical functions.

Q14.1 Mean -

mean(INDIA$India_Score)

## [1] 230.3059

This is the average score of the Indian cricket team across its ODI matches in the dataset, which covers 1971 to 2024.

Q14.2 Median -

median(INDIA$India_Score)

## [1] 230

This is the central value of India's scores and is not influenced by outliers. As there is little difference between the mean and the median, the distribution of runs scored appears roughly symmetric.

Q14.3 Mode -

get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
get_mode(INDIA$India_Score)

## [1] 245

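The most frequent score is 245, which is higher than both the mean and the median. Because innings totals take many different values, the mode is likely based on only a few matches, so on its own it is weak evidence that the team often scores highly.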

Q14.4 Range

range(INDIA$India_Score)

## [1]  51 418

Scores range from 51 to 418. Taken together, the statistics above suggest that scores are generally balanced around a central value of about 230, while the mode of 245 hints that above-average totals occur slightly more often. Lower scores still exist but do not drastically pull down the average.
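
How much weight the mode deserves depends on how often it occurs. The sketch below, on an invented score vector, reports the mode together with its frequency and the width of the range. Unlike get_mode() above, which returns the first value to reach the highest count in data order, this table() version returns the smallest such value when there is a tie.

```r
# Invented scores for illustration
scores <- c(245, 230, 245, 199, 230, 245, 310)

# Frequency of each distinct score
freq <- table(scores)

# Most frequent score and the number of times it occurs
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_count <- as.integer(max(freq))

# Width of the range (maximum minus minimum)
range_width <- diff(range(scores))

mode_value
mode_count
range_width
```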

Load the necessary packages

library(ggplot2)

Q15. Plot a scatter plot for any 2 variables in your dataset

ggplot(My_Data1, aes(x = `ODI Match No`, y = `Match Date`)) +
  geom_point(color = "blue", size = 0.2) +
  labs(title = "Timeline of ODI Matches",
       x = "ODI Match Number",
       y = "Match Date") +
  theme_minimal()

The points form a steadily rising curve from 1971 to 2024, because ODI match numbers are assigned in chronological order. In the early years the curve is steeper, which indicates fewer matches per year: ODI cricket was still new and few teams took part. As the format gained popularity, especially with the introduction of World Cups and tri-nation series, match frequency increased and ODIs became a regular feature of bilateral tours, so the curve flattens. The smoother slope in recent years suggests a more regular schedule. Slight irregularities around match numbers 4300 to 4400 likely correspond to disruptions such as the COVID-19 pandemic; on this plot a pause in play appears as a short vertical jump, because the date advances while the match number does not. There are no outliers in the graph.
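
Match frequency can also be read directly by counting matches per calendar year instead of judging the slope of the curve. A minimal base R sketch on a handful of illustrative dates:

```r
# Illustrative dates standing in for the Match Date column
dates <- as.Date(c("1971-01-05", "1972-08-24", "1972-08-26",
                   "2019-07-09", "2019-07-14", "2021-03-28"))

# Extract the year and count the matches in each year
matches_per_year <- table(format(dates, "%Y"))
matches_per_year
```

On the real data, the same count could be fed to a bar plot to show quiet and busy years.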

Q16. Plot a bar plot for any 2 variables in your dataset

library(dplyr)
library(ggplot2)

# Step 1: Summarize wins correctly
wincounts <- My_Data1 %>%
  filter(!is.na(`Match Winner`) & `Match Winner` != "No Result") %>%
  group_by(`Match Winner`) %>%
  summarise(Wins = n()) %>%
  arrange(desc(Wins))

# Step 2: Plot
ggplot(wincounts, aes(x = reorder(`Match Winner`, -Wins), y = Wins)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "ODI Match Wins by Team",
       x = "Team",
       y = "Number of Wins") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart illustrates the number of ODI match wins by each team, revealing clear patterns in performance and participation across countries. Australia stands out as the most successful team, followed closely by India and Pakistan, highlighting their dominance in the ODI format over the decades. Traditional cricketing nations such as Sri Lanka, West Indies, South Africa, and England also show strong win records, reflecting their long-standing involvement in international cricket. Emerging teams like Afghanistan and Ireland appear in the middle range, indicating their growing presence in ODIs in recent years. Teams lower on the chart, including the U.S.A., Namibia, and Hong Kong, represent associate nations with fewer opportunities and matches. Additionally, entries like “MATCH DRAW” and “ICC WORLD XI” suggest the need for minor data cleaning, as these are either rare outcomes or special-purpose teams. Overall, the chart effectively showcases both historical success and the evolving landscape of international ODI cricket.
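
The minor cleaning mentioned above could look like the following sketch. The win counts are invented, and the choice of which labels to exclude is an assumption that should be checked against the actual values in Match Winner.

```r
# Invented win table for illustration
wins <- data.frame(
  winner = c("AUSTRALIA", "INDIA", "MATCH DRAW", "ICC WORLD XI"),
  n = c(10, 9, 1, 1),
  stringsAsFactors = FALSE
)

# Labels that are outcomes or special-purpose sides, not national teams
# (assumed list; adjust after inspecting unique(My_Data1$`Match Winner`))
not_national_teams <- c("MATCH DRAW", "ICC WORLD XI")

# Keep only the rows whose winner is not in the exclusion list
wins_clean <- wins[!wins$winner %in% not_national_teams, ]
wins_clean
```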

Q17. Find the correlation between any 2 variables by applying a least-squares linear regression model.

model <- lm(`ODI Match No` ~ `Series ID`, data = My_Data1)
summary(model)

## 
## Call:
## lm(formula = `ODI Match No` ~ `Series ID`, data = My_Data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1377.14  -356.55     7.71   542.36  1039.30 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.212e+03  1.205e+01   100.6   <2e-16 ***
## `Series ID` 2.737e-03  1.951e-05   140.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 594.9 on 4569 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8116 
## F-statistic: 1.969e+04 on 1 and 4569 DF,  p-value: < 2.2e-16

cor(My_Data1$`ODI Match No`, My_Data1$`Series ID`, use = "complete.obs")

## [1] 0.9009111

The regression output and the correlation coefficient both show a strong positive linear relationship between ODI Match No and Series ID. As the Series ID increases, the ODI Match Number also tends to increase, because each new series adds a set of new matches. Both variables are identifiers that grow roughly in chronological order, so the relationship reflects the passage of time rather than one variable influencing the other.

The correlation is 0.9009, which is close to 1 and indicates a strong association. The R-squared of 0.8116 means that about 81% of the variation in match number is explained by the series ID. This indicates that the two values increase together consistently over time.
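
The two reported numbers are linked: in a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation, and 0.9009111 squared is approximately 0.8116. The sketch below confirms the identity on simulated data; the variables x and y are invented and unrelated to the cricket dataset.

```r
set.seed(1)

# Simulated linear relationship with noise
x <- 1:100
y <- 3 + 2 * x + rnorm(100, sd = 20)

# Least-squares fit and Pearson correlation
fit <- lm(y ~ x)
r <- cor(x, y)
r_squared <- summary(fit)$r.squared

# The two quantities agree up to floating-point error
all.equal(r^2, r_squared)
```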

group 3

2025-07-05