---
title: "Assessment 1"
author: 'QIU HONG, XIAN. Student ID:a1609963'
date: "Due 21st July 2024"
output:
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

1. Reading & Cleaning

Question 1.1

For our analysis, the subjects are not the cricketers themselves, but each batting innings they participated in. In order to make the data tidy:

a. Each subject needs its own row. Rearrange the data into a long format so that there is a row for each batter in each innings. Your new tibble should have 270 rows. [2 points]

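```{r}
# Read the file and answer Question 1.1(a) here:
library(readr)
library(tidyr)  # gather() lives in tidyr

ashes <- read_csv("Downloads/ashes.csv")

# Reshape to long format: one row per batter per innings.
# Column names containing spaces and commas must be wrapped in backticks.
batter_long <- gather(ashes, key = "Innings", value = "Test Results",
                      `Test 1, Innings 1`:`Test 5, Innings 2`)
```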
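`gather()` still works, but tidyr now recommends `pivot_longer()` for this kind of reshape. The following is a minimal sketch of the same idea, assuming a hypothetical two-batter, two-innings tibble (`toy_ashes` and its cell text are made up for illustration and are not the assignment data):

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per batter, one column per innings
toy_ashes <- tibble(
  batter = c("Ali", "Cook"),
  team   = c("England", "England"),
  `Test 1, Innings 1` = c("Ali, first innings text", "Cook, first innings text"),
  `Test 1, Innings 2` = c("Ali, second innings text", "Cook, second innings text")
)

# Stack every innings column into an Innings / Test Results pair
toy_long <- pivot_longer(
  toy_ashes,
  cols      = starts_with("Test"),
  names_to  = "Innings",
  values_to = "Test Results"
)

nrow(toy_long)  # 2 batters x 2 innings
```

On the real data the same call would give 27 batters x 10 innings = 270 rows, although the rows come out grouped by batter rather than by innings as they do with `gather()`.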

batter_long

A tibble: 270 × 5

# batter team role Innings `Test Results`

1 Ali England allrounder Test 1, Innings 1 Batting at number 6, scored 38 runs from 102 balls including 2 fours and 1 sixes.

2 Anderson English bowl Test 1, Innings 1 Batting at number 11, scored 5 runs from 9 balls including 1 fours and 0 sixes.

3 Bairstow England wicketkeeper Test 1, Innings 1 Batting at number 7, scored 9 runs from 24 balls including 1 fours and 0 sixes.

4 Ball England bowl Test 1, Innings 1 Batting at number 10, scored 14 runs from 11 balls including 3 fours and 0 sixes.

5 Bancroft Australia bat Test 1, Innings 1 Batting at number 1, scored 5 runs from 19 balls including 0 fours and 0 sixes.

6 Bird Australia bowl Test 1, Innings 1 Batting at number NA, scored NA including NA fours and NA sixes.

7 Broad England bowler Test 1, Innings 1 Batting at number 9, scored 20 runs from 32 balls including 3 fours and 0 sixes.

8 Cook England bat Test 1, Innings 1 Batting at number 1, scored 2 runs from 10 balls including 0 fours and 0 sixes.

9 Crane England bowl Test 1, Innings 1 Batting at number NA, scored NA including NA fours and NA sixes.

10 Cummins Australia bowl Test 1, Innings 1 Batting at number 9, scored 42 runs from 120 balls including 5 fours and 1 sixes.

ℹ 260 more rows

ℹ Use `print(n = ...)` to see more rows


b. Each cell should represent only one measurement. Use str_match() to create new columns for each of the following for each player innings:

the player’s batting number,
their score, and
the number of balls they faced. [2 points]

```{r}
# Answer Question 1.1(b) here:
library(stringr)

# Extract the batting number, score and balls faced from a description such as
# "Batting at number 9, scored 42 runs from 120 balls including 5 fours and 1 sixes"
extract_info <- function(sentence) {
  # Result columns: [,1] full match, [,2] batting number, [,3] runs, [,4] balls.
  # Sentences with no match (players who did not bat) give a row of NA,
  # so no separate NA handling is needed.
  str_match(sentence, "number (\\w+), scored (\\w+) runs from (\\w+) balls")
}

# Apply the function to every row of the `Test Results` column
extracted_info <- extract_info(batter_long$`Test Results`)
head(extracted_info)
#>      [,1]                                      [,2] [,3] [,4]
#> [1,] "number 6, scored 38 runs from 102 balls" "6"  "38" "102"
#> [2,] "number 11, scored 5 runs from 9 balls"   "11" "5"  "9"
#> [3,] "number 7, scored 9 runs from 24 balls"   "7"  "9"  "24"
#> [4,] "number 10, scored 14 runs from 11 balls" "10" "14" "11"
#> [5,] "number 1, scored 5 runs from 19 balls"   "1"  "5"  "19"
#> [6,] NA                                        NA   NA   NA

# Keep the three captured values (columns 2, 3 and 4 of the matrix)
scores <- extracted_info[, 2:4]

# Add them to the long tibble as new columns
batter_long$bat_number <- scores[, 1]
batter_long$runs       <- scores[, 2]
batter_long$balls      <- scores[, 3]

# View scores and balls as 27 batters (rows) by 10 innings (columns)
matrix_score <- matrix(scores[, 2], nrow = 27, ncol = 10)
matrix_score
matrix_ball <- matrix(scores[, 3], nrow = 27, ncol = 10)
matrix_ball

# Check the pattern on a single sentence
sentence <- "Batting at number 9, scored 42 runs from 120 balls including 5 fours and 1 sixes"
extracted <- str_match(sentence, "number (\\w+), scored (\\w+) runs from (\\w+) balls")

# Extract just the numbers
number      <- extracted[1, 2]
scored_runs <- extracted[1, 3]
balls       <- extracted[1, 4]

# Print the extracted numbers
print(number)
#> [1] "9"
print(scored_runs)
#> [1] "42"
print(balls)
#> [1] "120"
```
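`str_match()` returns a character matrix, so the captured values still need converting before they can be used as numbers. A minimal sketch of that step, assuming two example sentences copied from the output above and a digits-only pattern (`\\d+` is my choice here; the answer above uses `\\w+`):

```r
library(stringr)

# Two sentences from the printed tibble; the second is a player who did not bat
results <- c(
  "Batting at number 6, scored 38 runs from 102 balls including 2 fours and 1 sixes.",
  "Batting at number NA, scored NA including NA fours and NA sixes."
)

# Column 1 is the full match, columns 2 to 4 are the capture groups
m <- str_match(results, "number (\\d+), scored (\\d+) runs from (\\d+) balls")

# Convert each capture group to an integer column; unmatched rows stay NA
parsed <- data.frame(
  bat_number = as.integer(m[, 2]),
  runs       = as.integer(m[, 3]),
  balls      = as.integer(m[, 4])
)
parsed
```

Because the second sentence does not contain "runs from", the whole row fails to match and every column is `NA` for that innings.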

Question 1.2

Recode the data to make it ‘tame’, that is,

ensure all categorical variables with a small number of levels are coded as factors,
ensure all categorical variables with a large number of levels are coded as characters, and
ensure all quantitative variables are coded as integers or numeric, as appropriate. [3 points]

```{r}
# Answer Question 1.2 here:
library(dplyr)

# Build the data frame from the long tibble and the extracted scores.
# Columns are named with `=`; using `<-` inside data.frame() produces
# mangled column names.
df <- data.frame(
  batter     = as.character(batter_long$batter),  # many levels: character
  team       = as.factor(batter_long$team),       # few levels: factor
  role       = as.factor(batter_long$role),       # few levels: factor
  bat_number = as.character(scores[, 1]),
  runs       = as.numeric(scores[, 2]),           # ensure scores are numeric
  balls      = as.numeric(scores[, 3]),
  stringsAsFactors = FALSE
)

# Function to recode variables:
# character columns with at most `threshold` distinct values become factors,
# whole-number numeric columns become integers, everything else is unchanged.
recode_variables <- function(df, threshold = 5) {
  df %>%
    mutate(across(everything(), ~ {
      if (is.character(.x)) {
        if (n_distinct(.x, na.rm = TRUE) <= threshold) as.factor(.x) else .x
      } else if (is.numeric(.x)) {
        # na.rm = TRUE so that missing scores do not break the test
        if (all(.x == as.integer(.x), na.rm = TRUE)) as.integer(.x) else as.numeric(.x)
      } else {
        .x
      }
    }))
}

# Apply the function to the data frame
df_recode <- recode_variables(df, threshold = 3)

# Print the resulting data frame and structure
print(df_recode)
str(df_recode)

# Inspect the factor levels; the typos found here are fixed in Question 1.3
print(levels(df$team))
print(levels(df$role))
```

Question 1.3

Clean the data; recode the factors using fct_recode() such that there are no typographical errors in the team names and player roles. [2 points]

```{r}
# Answer Question 1.3 here:
library(dplyr)
library(forcats)
library(tidyr)

# fct_recode() takes new_level = "old_level"
df$team <- fct_recode(df$team, England = "English")
df$role <- fct_recode(df$role,
                      batsman    = "bat",
                      batsman    = "batting",
                      bowler     = "bowl",
                      allrounder = "all rounder",
                      allrounder = "all-rounder")

# Remove rows with NA values in 'runs' or 'balls'
clean_df <- df %>% drop_na(runs, balls)
```
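The argument order of `fct_recode()` is easy to reverse: the new level name goes on the left and the existing (misspelt) level on the right. A small sketch of that, assuming a hypothetical `role` factor that mixes the kinds of labels seen in the data:

```r
library(forcats)

# Hypothetical role labels with inconsistent spellings
role <- factor(c("bat", "batsman", "bowl", "bowler", "all rounder", "allrounder"))

# new_level = "old_level": several old levels can map to one new level
role_clean <- fct_recode(role,
  batsman    = "bat",
  bowler     = "bowl",
  allrounder = "all rounder"
)

levels(role_clean)
```

Levels that are already spelt correctly are left alone, so only the typos need to be listed.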

2. Univariate Analysis

Question 2.1

Produce a histogram of all scores during the series. [1 point]

```{r}
# Answer Question 2.1 here:
library(ggplot2)

ggplot(clean_df, aes(x = runs)) +
  geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) +
  ggtitle("Histogram of All Scores During the Series") +
  xlab("Runs") +
  ylab("Frequency") +
  theme_minimal()
```

Question 2.2

Describe the distribution of scores, considering shape, location, spread and outliers. [4 points]

Check if the distribution is symmetric, left-skewed, or right-skewed. Look for patterns such as unimodal (one peak) or multimodal (multiple peaks).

[Answer Question 2.2 with plain words.]
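One quick numerical check on shape is to compare the mean with the median: a mean well above the median points to a right-skewed distribution. A minimal sketch, assuming a hypothetical vector of scores (`toy_runs` is illustrative only, not the series data):

```r
# Hypothetical scores: mostly small, with one very large innings
toy_runs <- c(0, 2, 5, 9, 14, 20, 38, 42, 141)

# Location and spread in one named vector
summary_stats <- c(
  mean   = mean(toy_runs),
  median = median(toy_runs),
  iqr    = IQR(toy_runs),
  sd     = sd(toy_runs)
)
summary_stats

# The large innings pulls the mean above the median
right_skewed <- mean(toy_runs) > median(toy_runs)
right_skewed
```

The median and IQR are the safer measures of location and spread to quote when the distribution is skewed like this.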

Question 2.3

Produce a bar chart of the teams participating in the series, with different colours for each team. Noting that each player is represented by 10 rows in the data frame, how many players were used by each team in the series? [3 points]

```{r}

# Answer Question 2.3 here:

```
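The chunk above is still empty. One possible approach is sketched below, assuming a hypothetical long-format data frame in which every player has exactly 10 rows (`toy_long` is made up; on the real data the full 270-row table would be used, before any rows are dropped):

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: 3 players with 10 innings rows each
toy_long <- data.frame(
  batter = rep(c("Ali", "Cook", "Smith"), each = 10),
  team   = rep(c("England", "England", "Australia"), each = 10)
)

# Bar chart of rows per team, one colour per team
p <- ggplot(toy_long, aes(x = team, fill = team)) +
  geom_bar() +
  xlab("Team") +
  ylab("Number of rows")

# Each player contributes 10 rows, so players = rows / 10
players_per_team <- toy_long %>%
  count(team, name = "rows") %>%
  mutate(players = rows / 10)
players_per_team
```

Dividing by 10 only works while every player still has all 10 rows; after `drop_na()` it is safer to count distinct batter names with `n_distinct()`.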

3. Scores for each team

Question 3.1

Using ggplot, produce histograms of scores during the series, faceted by team. [1 point]

```{r}
# Answer Question 3.1 here:
# Histogram of scores, one panel per team
ggplot(clean_df, aes(x = runs)) +
  geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) +
  facet_wrap(~ team) +
  ggtitle("Histogram of Scores During the Series, by Team") +
  xlab("Runs") +
  ylab("Frequency") +
  theme_minimal()

# Histogram of balls faced, one panel per team
ggplot(clean_df, aes(x = balls)) +
  geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) +
  facet_wrap(~ team) +
  ggtitle("Histogram of Balls Faced During the Series, by Team") +
  xlab("Balls") +
  ylab("Frequency") +
  theme_minimal()
```

Question 3.2

Produce side-by-side boxplots of scores by each team during the series. [1 point]

```{r}
# Answer Question 3.2 here:
# Create side-by-side boxplots
ggplot(clean_df, aes(x = team, y = runs, fill = team)) +
  geom_boxplot() +
  ggtitle("Boxplot of Runs by Team") +
  xlab("Team") +
  ylab("Runs") +
  theme_minimal()
```

Question 3.3

Compare the distributions of scores by each team during the series, considering shape, location, spread and outliers, and referencing the relevant plots. Which team looks to have had a higher average score? [5 points]

[Answer Question 3.3 with plain words.]
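Numerical summaries per team can back up what the histograms and boxplots show. A minimal sketch of a grouped summary, assuming a hypothetical six-row data frame (`toy_scores` is illustrative, not the series data):

```r
library(dplyr)

# Hypothetical scores for two teams
toy_scores <- data.frame(
  team = c("Australia", "Australia", "Australia", "England", "England", "England"),
  runs = c(5, 42, 141, 2, 20, 38)
)

# Location (mean, median), spread (IQR) and the largest score for each team
team_summary <- toy_scores %>%
  group_by(team) %>%
  summarise(
    mean_runs   = mean(runs),
    median_runs = median(runs),
    iqr_runs    = IQR(runs),
    max_runs    = max(runs),
    .groups = "drop"
  )
team_summary
```

Running the same `summarise()` on `clean_df` would give the figures needed to say which team had the higher average score.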

4. Scoring rates

Question 4.1

Produce a scatterplot of scores against number of balls. [1 point]

```{r}
# Answer Question 4.1 here:
ggplot(clean_df, aes(x = balls, y = runs, color = team)) +
  geom_point() +
  facet_wrap(~ team) +
  ggtitle("Scatter Plot of Runs vs. Balls by Team") +
  theme_minimal()
```

Question 4.2

Describe the relationship between score and number of balls. Are players who face more balls likely to score more runs? [4 points]

```{r}
model <- lm(runs ~ balls + team, data = clean_df)
summary(model)
#> Call:
#> lm(formula = runs ~ balls + team, data = clean_df)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -53.249  -5.949   0.311   5.691  63.245
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.95080    1.91907  -1.017    0.311
#> balls        0.50723    0.01319  38.456   <2e-16 ***
#> teamEngland  0.08924    2.06653   0.043    0.966
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 12.94 on 166 degrees of freedom
#> Multiple R-squared:  0.903,  Adjusted R-squared:  0.9018
#> F-statistic: 772.5 on 2 and 166 DF,  p-value: < 2.2e-16
```

[Answer Question 4.2 with plain words.]
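In the model output above, the `balls` coefficient of about 0.51 means roughly one extra run for every two extra balls faced, and the R-squared of 0.903 says that balls faced (with team) accounts for about 90% of the variation in scores. The sketch below shows how those two numbers can be pulled out of a fitted model, assuming a hypothetical five-row data set (`toy_innings` is made up so the example runs on its own):

```r
# Hypothetical innings with roughly one run per two balls
toy_innings <- data.frame(
  balls = c(10, 20, 40, 80, 160),
  runs  = c(4, 11, 19, 42, 79)
)

fit <- lm(runs ~ balls, data = toy_innings)

# Slope: expected extra runs per extra ball faced
slope <- unname(coef(fit)["balls"])

# R-squared: share of the variation in runs explained by the model
r_squared <- summary(fit)$r.squared

c(slope = slope, r_squared = r_squared)
```

The same two lines applied to `model` would return the values quoted from the summary.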

Question 4.3

Compute a new variable, scoring_rate, defined as the number of runs divided by the number of balls. Produce a scatterplot of scoring_rate against number of balls. [2 points]

```{r}
# Answer Question 4.3 here:

# Remove rows with NA in runs or balls, then compute scoring_rate
clean_df <- clean_df %>%
  filter(!is.na(runs) & !is.na(balls)) %>%
  mutate(scoring_rate = round(runs / balls, 2))

# Produce scatterplot of scoring_rate against number of balls
ggplot(clean_df, aes(x = balls, y = scoring_rate)) +
  geom_point(color = "blue") +
  ggtitle("Scatterplot of Scoring Rate vs Number of Balls") +
  xlab("Number of Balls") +
  ylab("Scoring Rate (Runs per Ball)") +
  theme_minimal()
```

Question 4.4

Is there a relationship between scoring rate and number of balls? Are players who face more balls likely to score runs more quickly? [2 points]

[Answer Question 4.4 with plain words.]
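A correlation coefficient gives a one-number summary of the relationship to go with the scatterplot. The sketch below uses five innings copied from rows printed earlier in this document (Cook, Bairstow, Ali, Cummins, and Smith's 141 from 326); five rows are far too few to draw a conclusion, so this only shows the calculation:

```r
# Five innings taken from output shown earlier in this document
toy <- data.frame(
  balls = c(10, 24, 102, 120, 326),
  runs  = c(2, 9, 38, 42, 141)
)

# Runs per ball for each innings
toy$scoring_rate <- toy$runs / toy$balls

# Pearson correlation between balls faced and scoring rate
rate_cor <- cor(toy$balls, toy$scoring_rate)
rate_cor
```

On the full data, `cor(clean_df$balls, clean_df$scoring_rate)` would be the equivalent call. A value near zero would suggest that facing more balls does not go with scoring faster.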

5. Teams’ roles

Question 5.1

Produce a bar chart of the number of players on each team participating in the series, with segments coloured by the players’ roles. [1 point]

```{r}
# Answer Question 5.1 here:

# Group by team and role, and count the rows in each group.
# Note: n() counts rows of clean_df (innings batted), not distinct players;
# n_distinct(batter) would count each player once.
team_role_counts <- clean_df %>%
  group_by(team, role) %>%
  summarise(count = n(), .groups = "drop")

# Plot the bar chart
ggplot(team_role_counts, aes(x = team, y = count, fill = role)) +
  geom_bar(stat = "identity", position = "stack") +
  ggtitle("Number of Players on Each Team by Role") +
  xlab("Team") +
  ylab("Number of Players") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
```

Question 5.2

Produce a contingency table of the proportion of players from each team who play in each particular role. [2 points]

```{r}
# Answer Question 5.2 here:
library(dplyr)
library(tidyr)

# Group by team and role, and count the rows in each group
team_role_counts <- clean_df %>%
  group_by(team, role) %>%
  summarise(count = n(), .groups = "drop")

# Spread the data to wide format
team_role_wide <- team_role_counts %>%
  spread(key = role, value = count, fill = 0)

# Calculate the proportions.
# Note: sum(.) adds up each role column, so each value is the share of that
# role's rows belonging to each team (every column sums to 1). It is not the
# share of a team's rows in each role.
team_role_proportions <- team_role_wide %>%
  mutate(across(-team, ~ . / sum(.), .names = "prop_{.col}"))

# Select only the proportion columns and team
team_role_proportions <- team_role_proportions %>%
  select(team, starts_with("prop_"))

# Rename columns for better readability
colnames(team_role_proportions) <- gsub("^prop_", "", colnames(team_role_proportions))

# Print the contingency table
print(team_role_proportions)
#> # A tibble: 2 × 5
#>   team      allrounder batsman bowler wicketkeeper
#> 1 Australia        0.2   0.471  0.408          0.4
#> 2 England          0.8   0.529  0.592          0.6
```
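The question asks for the proportion of each team's players in each role, which means dividing by row (team) totals so that every team's row sums to 1. A compact base R sketch of that calculation, assuming a hypothetical nine-row data frame (`toy_roles` is illustrative only):

```r
# Hypothetical team/role rows
toy_roles <- data.frame(
  team = c("Australia", "Australia", "Australia", "Australia",
           "England", "England", "England", "England", "England"),
  role = c("batsman", "batsman", "bowler", "allrounder",
           "batsman", "bowler", "bowler", "allrounder", "allrounder")
)

# Counts of each team/role combination
counts <- table(toy_roles$team, toy_roles$role)

# margin = 1 divides each row by its row total, so each team's row sums to 1
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)
```

Using `margin = 2` instead would divide by column totals, which is the within-role calculation.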

Question 5.3

Using these two figures, state which team is made up of a larger proportion of batters, and which team contains a larger proportion of all-rounders. [2 points]

```{r}
# Calculate proportions of batters and all-rounders for each team
team_proportions <- team_role_proportions %>%
  mutate(
    proportion_batsman    = batsman,
    proportion_allrounder = allrounder
  ) %>%
  select(team, proportion_batsman, proportion_allrounder)

# Find team with larger proportion of batters
team_with_more_batters <- team_proportions %>%
  slice(which.max(proportion_batsman))

# Find team with larger proportion of all-rounders
team_with_more_allrounders <- team_proportions %>%
  slice(which.max(proportion_allrounder))

# Print the results
print("Team with larger proportion of batters:")
print(team_with_more_batters)

print("Team with larger proportion of all-rounders:")
print(team_with_more_allrounders)
```

Example code to calculate proportions within each team

```{r}
library(dplyr)

# Assuming clean_df has columns 'team' and 'role'

# Count the rows in each team and role
team_role_counts <- clean_df %>%
  group_by(team, role) %>%
  summarise(count = n(), .groups = "drop")

# Calculate the total for each team
team_counts <- team_role_counts %>%
  group_by(team) %>%
  summarise(total_players = sum(count))

# Calculate proportions within each team
team_role_proportions <- team_role_counts %>%
  left_join(team_counts, by = "team") %>%
  mutate(proportion = count / total_players)

# Filter for batsmen and all-rounders only
team_batsmen_allrounders <- team_role_proportions %>%
  filter(role %in% c("batsman", "allrounder"))

# Print the proportions
print("Proportions of batsmen and all-rounders within each team:")
print(team_batsmen_allrounders)
#> # A tibble: 4 × 5
#>   team      role       count total_players proportion
#> 1 Australia allrounder     4            70     0.0571
#> 2 Australia batsman       40            70     0.571
#> 3 England   allrounder    16            99     0.162
#> 4 England   batsman       45            99     0.455
```

```{r}
# Create the contingency table: each team/role count as a share of all rows
contingency_table <- clean_df %>%
  group_by(team, role) %>%
  summarise(count = n(), .groups = "drop") %>%
  mutate(total = sum(count)) %>%
  mutate(proportion = count / total)

# Create a proportion table for each team out of the total sum of both teams
total_players <- nrow(clean_df)
contingency_table <- contingency_table %>%
  mutate(proportion_of_total = count / total_players)

# Print the contingency table with proportions
print(contingency_table)

# Reshape the contingency table to wide format
contingency_table_wide <- contingency_table %>%
  select(team, role, proportion_of_total) %>%
  spread(key = role, value = proportion_of_total)

# Print the wide format contingency table
print(contingency_table_wide)
#> # A tibble: 2 × 5
#>   team      allrounder batsman bowler wicketkeeper
#> 1 Australia     0.0237   0.237  0.118       0.0355
#> 2 England       0.0947   0.266  0.172       0.0533
```

[Answer Question 5.3 with plain words.]

```{r}
# Create the contingency table
contingency_table <- clean_df %>%
  group_by(team, role) %>%
  summarise(count = n(), .groups = "drop")

# Calculate the total for each team
team_totals <- contingency_table %>%
  group_by(team) %>%
  summarise(total = sum(count))

# Merge the team totals back to the contingency table
contingency_table <- contingency_table %>%
  left_join(team_totals, by = "team") %>%
  mutate(proportion = round(count / total, 2)) %>%
  select(team, role, count, proportion)

# Print the contingency table with proportions
print(contingency_table)

# Reshape the contingency table to wide format
contingency_table_wide <- contingency_table %>%
  select(team, role, proportion) %>%
  spread(key = role, value = proportion)

# Print the wide format contingency table
print(contingency_table_wide)
#> # A tibble: 2 × 5
#>   team      allrounder batsman bowler wicketkeeper
#> 1 Australia       0.06    0.57   0.29         0.09
#> 2 England         0.16    0.45   0.29         0.09

# The same table before rounding:
#>   team      allrounder batsman bowler wicketkeeper
#> 1 Australia     0.0571   0.571  0.286       0.0857
#> 2 England       0.162    0.455  0.293       0.0909
```

6. Summary of Insights

Cricket Australia are interested in any insights you can bring with respect to the differences between the two teams, as well as any insights related to scoring. In plain English, write a summary of your key findings from Questions 2-5. Your response should be between 200-250 words. [3 points]

[Answer Question 6 with plain words.]

```{r}
# Histogram of all scores
ggplot(clean_df, aes(x = runs)) +
  geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) +
  ggtitle("Histogram of All Scores During the Series") +
  xlab("Runs") +
  ylab("Frequency") +
  theme_minimal()

# Central tendency
mean_runs   <- mean(clean_df$runs, na.rm = TRUE)
median_runs <- median(clean_df$runs, na.rm = TRUE)

# Spread
range_runs <- range(clean_df$runs, na.rm = TRUE)
iqr_runs   <- IQR(clean_df$runs, na.rm = TRUE)
sd_runs    <- sd(clean_df$runs, na.rm = TRUE)

# Summary statistics
summary_stats <- data.frame(
  Mean = mean_runs, Median = median_runs,
  Min = range_runs[1], Max = range_runs[2],
  IQR = iqr_runs, SD = sd_runs
)
print(summary_stats)

# Identify outliers using the 1.5*IQR rule
q1 <- quantile(clean_df$runs, 0.25, na.rm = TRUE)
q3 <- quantile(clean_df$runs, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
outliers <- clean_df %>% filter(runs < lower_bound | runs > upper_bound)

print(outliers)
# Output as recorded, with the column names printed at the time:
#>    batter....ashes.batter team....ashes.team role....ashes.role bat_number runs balls      team         role
#> 1                   Smith          Australia                bat          4  141   326 Australia      batsman
#> 2                  SMarsh          Australia                bat          6  126   231 Australia      batsman
#> 3                Bairstow            England       wicketkeeper          6  119   215   England wicketkeeper
#> 4                   Malan            England                bat          5  140   227   England      batsman
#> 5                  MMarsh          Australia        all rounder          6  181   236 Australia   allrounder
#> 6                   Smith          Australia                bat          4  239   399 Australia      batsman
#> 7                    Cook            England                bat          1  244   409   England      batsman
#> 8                  Warner          Australia                bat          2  103   151 Australia      batsman
#> 9                   Smith          Australia                bat          4  102   275 Australia      batsman
#> 10                Khawaja          Australia            batsman          3  171   381 Australia      batsman
#> 11                 MMarsh          Australia        all rounder          6  101   141 Australia   allrounder
#> 12                 SMarsh          Australia                bat          5  156   291 Australia      batsman
```
