Introduction

In this document, we will continue the data processing tasks after successfully scraping the ICC Test rankings data using the html_table() function. The data scraping process allowed us to extract tabular data from a webpage, and now we will perform further data cleaning and modifications. Specifically, we will:

  1. Clean the “TEAMS” column by removing the numbers preceding each team name.
  2. Create two new columns to calculate the total number of wins and losses based on the “Series.Form” column.

The Problem

Before diving into the code, let’s outline the specific problem we’re addressing:

Problem:
The “TEAMS” column in our dataset contains numeric prefixes (e.g., “1India,” “2Australia”), which we want to remove. Additionally, the “Series.Form” column contains a sequence of characters representing wins and losses (e.g., “WWLWW”), and we need to calculate the total wins and losses for each team.

Questions to Address:

  • How can we clean a column in R by removing unwanted prefixes or characters?
  • How can we create new columns that count specific characters in an existing column?
  • What errors might occur if column names are not correctly identified, and how can they be resolved?

Code and Explanation

library(stringr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Explanation:

  • library(stringr) and library(dplyr): These commands load the stringr and dplyr libraries into the R environment. dplyr is widely used for data manipulation tasks, while stringr is specifically for string operations.
# Read the CSV file into a data frame
Icc_test_2024 <- read.csv("scraped_and_filtered_data.csv")
head(Icc_test_2024)
##   X        TEAMS  M W L T D N.R PT   PCT Series.Form             Next
## 1 1       1India  9 6 2 0 1   0 74 68.51       LWWWW  vs BAN, BAN, NZ
## 2 2   2Australia 12 8 3 0 1   0 90 62.50       WWLWW vs IND, IND, IND
## 3 3 3New Zealand  6 3 3 0 0   0 36 50.00       WWWLL   vs SL, SL, IND
## 4 4   4Sri Lanka  4 2 2 0 0   0 24 50.00        LLWW vs ENG, ENG, ENG
## 5 5    5Pakistan  5 2 3 0 0   0 22 36.66       WWLLL vs BAN, BAN, ENG
## 6 6     6England 13 6 6 0 1   0 57 36.54       LLWWW    vs SL, SL, SL
  • read.csv(“scraped_and_filtered_data.csv”): This function reads the CSV file into a data frame named Icc_test_2024. Ensure that the file path is correct, or you might encounter a “file not found” error.
# 1. Clean the "TEAMS" column by removing the numbers
Icc_test_2024 <- Icc_test_2024 %>%
  mutate(TEAMS = gsub("^[0-9]+", "", TEAMS))

Explanation:

mutate(TEAMS = gsub(“1+”, ““, TEAMS)):

  • gsub(“2+”, ““, TEAMS): This function removes any digits at the beginning of the”TEAMS” column. The ^ symbol represents the start of the string, and [0-9]+ matches one or more digits. The matched digits are replaced with an empty string ““.

  • mutate(): This function is part of the dplyr package and is used to add or modify columns in a data frame. Here, it modifies the “TEAMS” column by removing the numeric prefixes.

# 2. Create "Total_wins" and "Total_lose" columns
Icc_test_2024 <- Icc_test_2024 %>%
  mutate(
    Total_wins = str_count(Series.Form, "W"),
    Total_lose = str_count(Series.Form, "L")
  )

Explanation:

mutate(Total_wins = str_count(Series.Form, “W”)):

  • str_count(Series.Form, “W”): This function counts the number of occurrences of the character “W” in the “Series.Form” column. The result is stored in a new column named “Total_wins.”

mutate(Total_lose = str_count(Series.Form, “L”)):

  • str_count(Series.Form, “L”): Similarly, this counts the number of “L” characters in the “Series.Form” column and stores the result in a new column named “Total_lose.”
# View the modified data frame
print(Icc_test_2024)
##   X        TEAMS  M W L T D N.R PT   PCT Series.Form             Next
## 1 1        India  9 6 2 0 1   0 74 68.51       LWWWW  vs BAN, BAN, NZ
## 2 2    Australia 12 8 3 0 1   0 90 62.50       WWLWW vs IND, IND, IND
## 3 3  New Zealand  6 3 3 0 0   0 36 50.00       WWWLL   vs SL, SL, IND
## 4 4    Sri Lanka  4 2 2 0 0   0 24 50.00        LLWW vs ENG, ENG, ENG
## 5 5     Pakistan  5 2 3 0 0   0 22 36.66       WWLLL vs BAN, BAN, ENG
## 6 6      England 13 6 6 0 1   0 57 36.54       LLWWW    vs SL, SL, SL
## 7 7 South Africa  4 1 3 0 0   0 12 25.00        WLLL    vs WI, SL, SL
## 8 8   Bangladesh  4 1 3 0 0   0 12 25.00        WLLL vs PAK, PAK, IND
## 9 9  West Indies  7 1 5 0 1   0 16 19.04       LWLLL  vs SA, BAN, BAN
##   Total_wins Total_lose
## 1          4          1
## 2          4          1
## 3          3          2
## 4          2          2
## 5          2          3
## 6          3          2
## 7          1          3
## 8          1          3
## 9          1          4
  • print(Icc_test_2024): This command prints the modified data frame to the console, allowing you to verify that the changes have been applied correctly.

Conclusion

This document demonstrated how to clean a dataset by removing unwanted prefixes from a column and how to create new columns by counting specific characters within another column. These steps are crucial in preparing data for further analysis, ensuring that it is both accurate and easy to interpret.


  1. 0-9↩︎

  2. 0-9↩︎