In this document, we will continue the data processing tasks after
successfully scraping the ICC Test rankings data using the
html_table() function. The data scraping process allowed us
to extract tabular data from a webpage, and now we will perform further
data cleaning and modifications. Specifically, we will:
Before diving into the code, let’s outline the specific problem we’re addressing:
Problem:
The “TEAMS” column in our dataset contains numeric prefixes (e.g.,
“1India,” “2Australia”), which we want to remove. Additionally, the
“Series.Form” column contains a sequence of characters representing wins
and losses (e.g., “WWLWW”), and we need to calculate the total wins and
losses for each team.
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read the CSV file into a data frame
Icc_test_2024 <- read.csv("scraped_and_filtered_data.csv")
head(Icc_test_2024)
## X TEAMS M W L T D N.R PT PCT Series.Form Next
## 1 1 1India 9 6 2 0 1 0 74 68.51 LWWWW vs BAN, BAN, NZ
## 2 2 2Australia 12 8 3 0 1 0 90 62.50 WWLWW vs IND, IND, IND
## 3 3 3New Zealand 6 3 3 0 0 0 36 50.00 WWWLL vs SL, SL, IND
## 4 4 4Sri Lanka 4 2 2 0 0 0 24 50.00 LLWW vs ENG, ENG, ENG
## 5 5 5Pakistan 5 2 3 0 0 0 22 36.66 WWLLL vs BAN, BAN, ENG
## 6 6 6England 13 6 6 0 1 0 57 36.54 LLWWW vs SL, SL, SL
# 1. Clean the "TEAMS" column by removing the numbers
Icc_test_2024 <- Icc_test_2024 %>%
mutate(TEAMS = gsub("^[0-9]+", "", TEAMS))
gsub(“2+”, ““, TEAMS): This function removes any digits at the beginning of the”TEAMS” column. The ^ symbol represents the start of the string, and [0-9]+ matches one or more digits. The matched digits are replaced with an empty string ““.
mutate(): This function is part of the dplyr package and is used to add or modify columns in a data frame. Here, it modifies the “TEAMS” column by removing the numeric prefixes.
# 2. Create "Total_wins" and "Total_lose" columns
Icc_test_2024 <- Icc_test_2024 %>%
mutate(
Total_wins = str_count(Series.Form, "W"),
Total_lose = str_count(Series.Form, "L")
)
# View the modified data frame
print(Icc_test_2024)
## X TEAMS M W L T D N.R PT PCT Series.Form Next
## 1 1 India 9 6 2 0 1 0 74 68.51 LWWWW vs BAN, BAN, NZ
## 2 2 Australia 12 8 3 0 1 0 90 62.50 WWLWW vs IND, IND, IND
## 3 3 New Zealand 6 3 3 0 0 0 36 50.00 WWWLL vs SL, SL, IND
## 4 4 Sri Lanka 4 2 2 0 0 0 24 50.00 LLWW vs ENG, ENG, ENG
## 5 5 Pakistan 5 2 3 0 0 0 22 36.66 WWLLL vs BAN, BAN, ENG
## 6 6 England 13 6 6 0 1 0 57 36.54 LLWWW vs SL, SL, SL
## 7 7 South Africa 4 1 3 0 0 0 12 25.00 WLLL vs WI, SL, SL
## 8 8 Bangladesh 4 1 3 0 0 0 12 25.00 WLLL vs PAK, PAK, IND
## 9 9 West Indies 7 1 5 0 1 0 16 19.04 LWLLL vs SA, BAN, BAN
## Total_wins Total_lose
## 1 4 1
## 2 4 1
## 3 3 2
## 4 2 2
## 5 2 3
## 6 3 2
## 7 1 3
## 8 1 3
## 9 1 4
This document demonstrated how to clean a dataset by removing unwanted prefixes from a column and how to create new columns by counting specific characters within another column. These steps are crucial in preparing data for further analysis, ensuring that it is both accurate and easy to interpret.