title: "Assessment 1" author: 'QIU HONG, XIAN. Student ID:a1609963' date: "Due 21st July 2024" output: html_document: default ---
{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)
For our analysis, the subjects are not the cricketers themselves, but each batting innings they participated in. In order to make the data tidy:
a. Each subject needs its own row. Rearrange the data into a long format so that there is a row for each batter in each innings. Your new tibble should have 270 rows. [2 points]
``{r} # Read the file and answer Question 1.1(a) here: library(readr) ashes <- read_csv("Downloads/ashes.csv") batter_long <- gather(ashes, key= Innings,value="Test Results",Test
1, Innings 1:Test 5, Innings 2`)
```
Test Resultsprint(n = ...) to see more rowsasher
b. Each cell should represent only one measurement. Use str_match() to create new columns for each of the following for each player innings:
```{r}
# Function to extract numbers and specific text extractinfo <- function(sentence) { # Extract the numbers and specific text,"Batting at number 9, scored 42 runs from 120 balls including 5 fours and 1 sixes" extracted <- strmatch(sentence, "number (\w+), scored (\w+) runs from (\w+) balls")
if (is.na(extracted[1])) {
extracted[1] <- "NA"
}
if (is.na(extracted[2])) {
extracted[2] <- "NA"
}
if (is.na(extracted[3])) {
extracted[3] <- "NA"
}
return (extracted) }
extractedinfo <- lapply(Innings, extractinfo)
Test Resultsmatrixdata <- extractedinfo$Test Results
# Extract columns 2, 3, and 4 from the matrix scores <- matrix_data[, 2:4]
print(scores) matrixscore <- matrix(scores[,2], nrow = 27, ncol = 10) matrixscore matrixball <- matrix(scores[,3], nrow = 27, ncol = 10) matrixball
# Load the stringr package
```
Recode the data to make it ‘tame’, that is,
```{r}
df <- data.frame( # batters = c( "Ali", "Anderson","Bairstow","Ball" ,"Bancroft", "Bird", "Broad", "Cook", "Crane", "Cummins", "Curran", "Handscomb", "Hazlewood", "Khawaja", "Lyon", "Malan", #"MMarsh", "Overton", "Paine", "Root", "SMarsh", "Smith", "Starc", "Stoneman",
# team = c("Australia", "England"), batter <- as.character(ashes$batter), team <- as.factor(ashes$team), role <- as.factor(ashes$role), bat_number = as.character(scores[,1]), runs = as.numeric(scores[,2]), # Ensure scores are numeric balls = as.numeric(scores[,3]),
#role = c("Batsman", "Bowler", "wicketkeeper", "All-Rounder"),
#scores = c(20, 35, 15, 40, 25), # Numeric example
stringsAsFactors = FALSE
)
```
recode_variables <- function(df, threshold = 5) { df <- df %>% mutate(across(everything(), ~ { if (is.character(.)) { if (nlevels(as.factor(.)) <= threshold) { as.factor(.) } else { as.character(.) } } else if (is.numeric(.)) { if (all(. == as.integer(.))) { as.integer(.) } else { as.numeric(.) } } else { . } })) return(df) }
dfrecode <- recodevariables(df, threshold = 3)
print(dfrecode) str(dfrecode) batterlong$team <- fctrecode(batterlong$team, England = "English") batterlong$role <- fctrecode(batterlong$role, bat = "batsman") batterlong$role <- fctrecode(batterlong$role, batting = "bat") print(levels(batterlong$role))df
recode_variables <- function(df, threshold = 3) { df <- df %>% mutate(across(everything(), ~ { if (is.character(.)) { if (nlevels(as.factor(.)) <= threshold) { as.factor(.) } else { as.character(.) } } else if (is.numeric(.)) { if (all(. == as.integer(.))) { as.integer(.) } else { as.numeric(.) } } else { . } })) return(df) }
Clean the data; recode the factors using fct_recode() such that there are no typographical errors in the team names and player roles. [2 points]
```{r}
df$team <- fctrecode(df$team, England = "English") df$role <- fctrecode(df$role, batsman = "bat") df$role <- fctrecode(df$role, batsman = "batting") df$role <- fctrecode(df$role, bowler = "bowl") df$role <- fctrecode(df$role, allrounder = "all rounder") df$role <- fctrecode(df$role, allrounder = "all-rounder")
cleandf <- df %>% dropna(runs, balls)
```
Produce a histogram of all scores during the series. [1 point]
{r} # Answer Question 2.1 here: ggplot(clean_df, aes(x = runs)) + geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) + ggtitle("Histogram of All Scores During the Series") + xlab("Runs") + ylab("Frequency") + theme_minimal()
Describe the distribution of scores, considering shape, location spread and outliers. [4 points] Check if the distribution is symmetric, left-skewed, or right-skewed. Look for patterns such as unimodal (one peak) or multimodal (multiple peaks). [Answer Question 2.2 with plain words.]
Produce a bar chart of the teams participating in the series, with different colours for each team. Noting that each player is represented by 10 rows in the data frame, how many players were used by each team in the series? [3 points]
```{r}
```
Using ggplot, produce histograms of scores during the series, faceted by team. [1 point]
{r} # Answer Question 3.1 here: ggplot(clean_df, aes(x = runs)) + geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) + ggtitle("Histogram of All Scores During the Series") + xlab("Runs") + ylab("Frequency") + theme_minimal() ggplot(clean_df, aes(x = balls)) + geom_histogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) + ggtitle("Histogram of All Scores During the Series") + xlab("Runs") + ylab("Frequency") + theme_minimal()
Produce side-by-side boxplots of scores by each team during the series. [1 point]
{r} # Answer Question 3.2 here: # Create side-by-side boxplot ggplot(clean_df, aes(x = team, y = runs, fill = team)) + geom_boxplot() + ggtitle("Boxplot of Runs by Team") + xlab("Team") + ylab("Runs") + theme_minimal()
Compare the distributions of scores by each team during the series, considering shape, location, spread and outliers, and referencing the relevant plots. Which team looks to have had a higher average score? [5 points]
[Answer Question 3.3 with plain words.]
Produce a scatterplot of scores against number of balls. [1 point] ggplot(cleandf, aes(x = balls, y = runs, color = team)) + geompoint() + facetwrap(~ team) + ggtitle("Scatter Plot of Runs vs. Balls by Team") + thememinimal()
{r} # Answer Question 4.1 here: ggplot(clean_df, aes(x = balls, y = runs, color = team)) + geom_point() + facet_wrap(~ team) + ggtitle("Scatter Plot of Runs vs. Balls by Team") + theme_minimal()
Describe the relationship between score and number of balls. Are players who face more balls likely to score more runs? [4 points] model <- lm(runs ~ balls + team, data = clean_df) summary(model) [Answer Question 4.2 with plain words.]
Call: lm(formula = runs ~ balls + team, data = clean_df)
Residuals: Min 1Q Median 3Q Max -53.249 -5.949 0.311 5.691 63.245
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.95080 1.91907 -1.017 0.311
balls 0.50723 0.01319 38.456 <2e-16 teamEngland 0.08924
2.06653 0.043 0.966
--- Signif. codes: 0 ‘’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
’ 1
Residual standard error: 12.94 on 166 degrees of freedom Multiple R-squared: 0.903, Adjusted R-squared: 0.9018 F-statistic: 772.5 on 2 and 166 DF, p-value: < 2.2e-16
Compute a new variable, scoringrate, defined as the number of runs divided by the number of balls. Produce a scatterplot of scoringrate against number of balls. [2 points]
```{r}
cleandf <- cleandf %>% mutate(scoring_rate = round(runs / balls, 2))
cleandf <- cleandf %>% filter(!is.na(runs) & !is.na(balls))
ggplot(cleandf, aes(x = balls, y = scoringrate)) + geompoint(color = "blue") + ggtitle("Scatterplot of Scoring Rate vs Number of Balls") + xlab("Number of Balls") + ylab("Scoring Rate (Runs per Ball)") + thememinimal()
```
Is there a relationship between scoring rate and number of balls? Are players who face more balls likely to score runs more quickly? [2 points]
[Answer Question 4.4 with plain words.]
Produce a bar chart of the number of players on each team participating in the series, with segments coloured by the players’ roles. [1 point]
```{r}
teamrolecounts <- cleandf %>% groupby(team, role) %>% summarise(count = n()) %>% ungroup()
ggplot(teamrolecounts, aes(x = team, y = count, fill = role)) + geombar(stat = "identity", position = "stack") + ggtitle("Number of Players on Each Team by Role") + xlab("Team") + ylab("Number of Players") + thememinimal() + scalefillbrewer(palette = "Set3") ```
Produce a contingency table of the proportion of players from each team who play in each particular role. [2 points]
```{r}
teamrolecounts <- cleandf %>% groupby(team, role) %>% summarise(count = n()) %>% ungroup()
teamrolewide <- teamrolecounts %>% spread(key = role, value = count, fill = 0)
teamroleproportions <- teamrolewide %>% mutate(across(-team, ~ . / sum(.), .names = "prop_{.col}"))
teamroleproportions <- teamroleproportions %>% select(team, startswith("prop"))
colnames(teamroleproportions) <- gsub("prop", "", colnames(teamrole_proportions))
print(teamroleproportions) ```
print(teamroleproportions)
team allrounder batsman bowler wicketkeeper
Using these two figures, state which team is made up of a larger proportion of batters, and which team contains a larger proportion of all-rounders. [2 points] ```
teamproportions <- teamroleproportions %>% mutate( proportionbatsman = batsman, proportionallrounder = allrounder ) %>% select(team, proportionbatsman, proportion_allrounder)
teamwithmorebatters <- teamproportions %>% slice(which.max(proportion_batsman))
teamwithmoreallrounders <- teamproportions %>% slice(which.max(proportion_allrounder))
print("Team with larger proportion of batters:") print(teamwithmore_batters)
print("Team with larger proportion of all-rounders:") print(teamwithmore_allrounders) ```
library(dplyr)
teamrolecounts <- cleandf %>% groupby(team, role) %>% summarise(count = n()) %>% ungroup()
teamcounts <- teamrolecounts %>% groupby(team) %>% summarise(total_players = sum(count))
teamroleproportions <- teamrolecounts %>% leftjoin(teamcounts, by = "team") %>% mutate(proportion = count / total_players)
teambatsmenallrounders <- teamroleproportions %>% filter(role %in% c("batsman", "allrounder"))
print("Proportions of batsmen and all-rounders within each team:") print(teambatsmenallrounders) print(teambatsmenallrounders)
team role count total_players proportion
contingencytable <- cleandf %>% group_by(team, role) %>% summarise(count = n()) %>% ungroup() %>% mutate(total = sum(count)) %>% mutate(proportion = count / total)
totalplayers <- nrow(cleandf) contingencytable <- contingencytable %>% mutate(proportionoftotal = count / total_players)
print(contingency_table)
contingencytablewide <- contingencytable %>% select(team, role, proportionoftotal) %>% spread(key = role, value = proportionof_total)
print(contingencytablewide)
team allrounder batsman bowler wicketkeeper
contingencytable <- cleandf %>% group_by(team, role) %>% summarise(count = n()) %>% ungroup()
teamtotals <- contingencytable %>% group_by(team) %>% summarise(total = sum(count))
contingencytable <- contingencytable %>% leftjoin(teamtotals, by = "team") %>% mutate(proportion = round(count / total,2)) %>% select(team, role, count, proportion)
print(contingency_table)
contingencytablewide <- contingency_table %>% select(team, role, proportion) %>% spread(key = role, value = proportion)
print(contingencytablewide)
team allrounder batsman bowler wicketkeeper
team allrounder batsman bowler wicketkeeper
team allrounder batsman bowler wicketkeeper
Cricket Australia are interested in any insights you can bring with respect to the differences between the two teams, as well as any insights related to scoring. In plain English, write a summary of your key findings from Questions 2-5. Your response should be between 200-250 words. [3 points]
[Answer Question 6 with plain words.]
ggplot(cleandf, aes(x = runs)) + geomhistogram(binwidth = 10, fill = "blue", col = "black", alpha = 0.7) + ggtitle("Histogram of All Scores During the Series") + xlab("Runs") + ylab("Frequency") + theme_minimal()
meanruns <- mean(cleandf$runs, na.rm = TRUE) medianruns <- median(cleandf$runs, na.rm = TRUE)
rangeruns <- range(cleandf$runs, na.rm = TRUE) iqrruns <- IQR(cleandf$runs, na.rm = TRUE) sdruns <- sd(cleandf$runs, na.rm = TRUE)
summarystats <- data.frame( Mean = meanruns, Median = medianruns, Min = rangeruns[1], Max = rangeruns[2], IQR = iqrruns, SD = sdruns ) print(summarystats)
q1 <- quantile(cleandf$runs, 0.25, na.rm = TRUE) q3 <- quantile(cleandf$runs, 0.75, na.rm = TRUE) iqr <- q3 - q1 lowerbound <- q1 - 1.5 * iqr upperbound <- q3 + 1.5 * iqr outliers <- cleandf %>% filter(runs < lowerbound | runs > upper_bound)
print(outliers)
print(outliers) batter....ashes.batter team....ashes.team role....ashes.role bat_number runs balls team role 1 Smith Australia bat 4 141 326 Australia batsman 2 SMarsh Australia bat 6 126 231 Australia batsman 3 Bairstow England wicketkeeper 6 119 215 England wicketkeeper 4 Malan England bat 5 140 227 England batsman 5 MMarsh Australia all rounder 6 181 236 Australia allrounder 6 Smith Australia bat 4 239 399 Australia batsman 7 Cook England bat 1 244 409 England batsman 8 Warner Australia bat 2 103 151 Australia batsman 9 Smith Australia bat 4 102 275 Australia batsman 10 Khawaja Australia batsman 3 171 381 Australia batsman 11 MMarsh Australia all rounder 6 101 141 Australia allrounder 12 SMarsh Australia bat 5 156 291 Australia batsman