Name: Hasan Md Khalid
Student ID: M24W0391
# Chunk 1: Data Cleansing
messy_data <- data.frame(
  Name = c("Ali", "Sara", "John", "Nina", NA),
  Age = c("25", "thirty", "29", NA, "22"),
  Score = c("85", "90", "Ninety", "70", NA)
)
print("Original Messy Data:")
## [1] "Original Messy Data:"
print(messy_data)
##   Name    Age  Score
## 1  Ali     25     85
## 2 Sara thirty     90
## 3 John     29 Ninety
## 4 Nina   <NA>     70
## 5 <NA>     22   <NA>
# Clean Data
messy_data$Age <- as.numeric(messy_data$Age)
## Warning: NAs introduced by coercion
messy_data$Score <- as.numeric(messy_data$Score)
## Warning: NAs introduced by coercion
clean_data <- na.omit(messy_data)
print("Cleaned Data:")
## [1] "Cleaned Data:"
print(clean_data)
##   Name Age Score
## 1  Ali  25    85
In this activity, I started with a messy dataset that contained missing values and non-numeric text entries in the “Age” and “Score” columns. Irregularities like these are common in real-world datasets and can cause errors or misleading results in analysis and visualization.
I applied as.numeric() to convert these columns to numeric type. Entries that cannot be parsed as numbers, such as “thirty” and “Ninety”, were replaced with NA (with a coercion warning). Then I used na.omit() to remove every row containing at least one NA, including the row with a missing Name.
The cleaned dataset is suitable for statistical operations such as computing a mean or plotting a histogram, although only one of the five original rows (Ali) survives. This shows why datasets should be inspected and cleaned before drawing any conclusions.
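The coercion step above can be made easier to audit. The following is a minimal sketch, assuming the same messy_data columns, that records which original entries failed to parse before any rows are dropped. The helper name flag_unparsable is hypothetical and not part of the original chunk, and complete.cases() is used here in place of na.omit().

```r
# Rebuild the messy data from the chunk above
messy_data <- data.frame(
  Name  = c("Ali", "Sara", "John", "Nina", NA),
  Age   = c("25", "thirty", "29", NA, "22"),
  Score = c("85", "90", "Ninety", "70", NA),
  stringsAsFactors = FALSE  # keep text as character, also on R < 4.0
)

# Hypothetical helper: TRUE where a value was present but is not a number
flag_unparsable <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  !is.na(x) & is.na(parsed)
}

bad_age   <- flag_unparsable(messy_data$Age)
bad_score <- flag_unparsable(messy_data$Score)

# Show the offending entries so they can be corrected by hand if desired
print(messy_data$Age[bad_age])
print(messy_data$Score[bad_score])

# Same cleaning as before, with the expected coercion warnings suppressed
messy_data$Age   <- suppressWarnings(as.numeric(messy_data$Age))
messy_data$Score <- suppressWarnings(as.numeric(messy_data$Score))
clean_data <- messy_data[complete.cases(messy_data), ]
print(clean_data)
```

Listing the unparsable entries first makes it possible to decide whether a value such as “thirty” should be corrected to 30 instead of discarding the whole row.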
# Chunk 2: Data Extraction using dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
DF <- data.frame(
  Year = rep(1955:2020, each = 100),
  Age = rep(0:99, 66),
  Death = sample(20:60, 6600, replace = TRUE)
)
SelectedData <- DF %>% filter((Year - Age) == 1955)
head(SelectedData)
##   Year Age Death
## 1 1955   0    49
## 2 1956   1    58
## 3 1957   2    45
## 4 1958   3    30
## 5 1959   4    32
## 6 1960   5    39
# Modified version to exclude Death
SelectedData_Simple <- DF %>% select(Year, Age) %>% filter((Year - Age) == 1955)
head(SelectedData_Simple)
##   Year Age
## 1 1955   0
## 2 1956   1
## 3 1957   2
## 4 1958   3
## 5 1959   4
## 6 1960   5
This task demonstrates how the dplyr package simplifies filtering data on a logical condition. I generated a dataset with 6,600 rows using data.frame(), covering every combination of Year (1955 to 2020) and Age (0 to 99) with a randomly sampled Death count between 20 and 60.
The expression (Year - Age) == 1955 keeps the rows that describe the cohort born in 1955, whatever the age or year of the record. Filtering on a derived quantity like this extracts a meaningful subset of the data for demographic analysis.
I also created a simplified version that uses select(Year, Age) to keep only those two columns and drop the “Death” variable. Trimming columns this way streamlines a dataset for visualization or reporting when not all attributes are needed.
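The cohort filter can be made more explicit by naming the derived quantity. This is a sketch, assuming the same DF layout as in the chunk above. The BirthYear column, the Cohort1955 name and the set.seed() value are additions for illustration and do not appear in the original code.

```r
library(dplyr)

set.seed(1)  # assumption: a fixed seed so the random Death column is reproducible

DF <- data.frame(
  Year  = rep(1955:2020, each = 100),
  Age   = rep(0:99, times = 66),
  Death = sample(20:60, 6600, replace = TRUE)
)

# Derive the birth year once, then filter on the named column
Cohort1955 <- DF %>%
  mutate(BirthYear = Year - Age) %>%
  filter(BirthYear == 1955)

# Sanity checks on the extracted cohort
nrow(Cohort1955)        # one row per calendar year, 1955 through 2020
range(Cohort1955$Age)   # youngest and oldest age observed for the cohort
```

Because the data cover 1955 through 2020, the cohort contributes 66 rows with ages 0 through 65. Checking the row count in this way is a quick test that the filter condition does what was intended.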
# Chunk 3: Web Scraping from BBC using rvest
library(rvest)
url <- 'https://www.bbc.com/news'
webpage <- read_html(url)
titles <- webpage %>% html_nodes(".sc-9d830f2a-3") %>% html_text()
print("Top Article Titles from BBC:")
## [1] "Top Article Titles from BBC:"
print(titles[1:10])
## [1] "First troops arrive in LA after Trump sends National Guard to curb immigration protests"
## [2] "Watch: Federal agents use teargas, flash grenades to disperse LA protesters"
## [3] "Colombia presidential hopeful shot in head at rally"
## [4] "Italy citizenship referendum: 'I was born here - but feel rejected'"
## [5] "Doctors trialling 'poo pills' to flush out dangerous superbugs"
## [6] "Watch: Federal agents use teargas, flash grenades to disperse LA protesters"
## [7] "Bowen: Israel is accused of the gravest war crimes - how governments respond could haunt them for years"
## [8] "How India's 'biggest art deal' buried MF Husain masterpieces in a bank vault"
## [9] "Gaza health workers say four killed by Israeli gunfire near aid centre"
## [10] "Colombia presidential hopeful shot in head at rally"
Using the rvest package, I performed basic web scraping on the BBC News homepage. The goal was to extract the top news headlines with a CSS selector (.sc-9d830f2a-3).
This process shows how R can retrieve live data from the internet and turn unstructured page text into a character vector. Web scraping does require knowledge of the page's HTML structure, and selectors need regular updating because website layouts change often. The output above also contains repeated headlines (for example, entries 2 and 6 are identical, as are entries 3 and 10), so the results should be deduplicated before analysis.
In this case, the headlines could be used to track trending topics, generate word clouds, or conduct sentiment analysis for media studies.
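The extraction and clean-up steps can be tried offline before pointing them at a live site. The sketch below uses a made-up HTML fragment and a hypothetical class name (headline) as stand-ins, since the real BBC markup is not reproduced here. It assumes rvest 1.0.0 or later, where html_elements() and html_text2() are available.

```r
library(rvest)

# Hypothetical fragment standing in for a news homepage
page <- read_html('<html><body>
  <h2 class="headline">Story A</h2>
  <h2 class="headline">Story B</h2>
  <h2 class="headline">Story A</h2>
</body></html>')

# Select the headline nodes and extract their text
titles <- page %>%
  html_elements("h2.headline") %>%
  html_text2()

# Drop repeated headlines, then keep at most ten.
# Unlike titles[1:10], head() does not pad with NA when fewer are found.
top_titles <- head(unique(titles), 10)
print(top_titles)
```

Testing against a small fixed fragment keeps the logic reproducible even when the live page changes its layout or class names.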
# Chunk 4: Web Scraping from GeeksforGeeks
url2 <- "https://www.geeksforgeeks.org/web-scraping-using-r-language/"
webpage2 <- read_html(url2)
headers <- webpage2 %>% html_nodes("h1, h2, h3") %>% html_text()
print("Headings from GeeksforGeeks Web Scraping Article:")
## [1] "Headings from GeeksforGeeks Web Scraping Article:"
print(headers)
## [1] "Web Scraping using R Language"
## [2] "Implementation of Web Scraping using R"
## [3] "1. Import rvest libraries"
## [4] "2. Read the Webpage"
## [5] "3. Scrape Data From the Webpage"
## [6] "Complete Code Block"
## [7] "Applications of Web scraping"
## [8] "Similar Reads"
## [9] "Thank You!"
## [10] "What kind of Experience do you want to share?"
For the final chunk, I selected a freely accessible technical article on GeeksforGeeks about web scraping in R. I extracted the HTML headings (h1, h2, h3) using html_nodes() and converted them to text using html_text().
This technique is useful for summarizing a document's outline or creating searchable metadata. It can help researchers or developers auto-index technical content or extract outlines from multiple pages. Note that the selector matches every heading on the page, so the result also includes site elements such as “Similar Reads” and “Thank You!” that are not part of the article.
This also shows that web scraping can be adapted to different types of websites: not only news platforms but also blogs, tutorials, and academic portals.
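A heading scrape becomes more useful as an outline when each heading keeps its level. The sketch below works on a made-up HTML fragment, not the GeeksforGeeks page itself. It assumes the article body sits inside an article element, which would need to be checked in the browser for any real site, and it assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical page standing in for a tutorial article
page <- read_html('<html><body>
  <article>
    <h1>Web Scraping Tutorial</h1>
    <h2>1. Load the library</h2>
    <h2>2. Read the page</h2>
  </article>
  <footer><h3>Similar Reads</h3></footer>
</body></html>')

# Restrict the selector to headings inside <article> to skip site chrome
nodes <- page %>%
  html_elements("article h1, article h2, article h3")

# Pair each heading's tag name with its text to form an outline
outline <- data.frame(
  level = html_name(nodes),   # tag name such as "h1" or "h2"
  text  = html_text2(nodes)
)
print(outline)
```

Scoping the selector to the content container is one way to leave out footer and widget headings. The right container differs from site to site.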