DA1 Homework Assignment 07

Data Cleansing Activity

Name: Hasan Md Khalid
Student ID: M24W0391

# Chunk 1: Data Cleansing
messy_data <- data.frame(
  Name = c("Ali", "Sara", "John", "Nina", NA),
  Age = c("25", "thirty", "29", NA, "22"),
  Score = c("85", "90", "Ninety", "70", NA)
)

print("Original Messy Data:")

## [1] "Original Messy Data:"

print(messy_data)

##   Name    Age  Score
## 1  Ali     25     85
## 2 Sara thirty     90
## 3 John     29 Ninety
## 4 Nina   <NA>     70
## 5 <NA>     22   <NA>

# Clean Data
messy_data$Age <- as.numeric(messy_data$Age)

## Warning: NAs introduced by coercion

messy_data$Score <- as.numeric(messy_data$Score)

## Warning: NAs introduced by coercion

clean_data <- na.omit(messy_data)

print("Cleaned Data:")

## [1] "Cleaned Data:"

print(clean_data)

##   Name Age Score
## 1  Ali  25    85

Observation:

In this activity, I started with a messy dataset that contained missing values and non-numeric text entries for “Age” and “Score”. These types of irregularities are common in real-world datasets and can cause serious issues in data analysis and visualization.

I applied as.numeric() to convert both columns to numeric type. Entries that cannot be parsed as numbers, such as “thirty” and “Ninety”, were replaced with NA, and R issued the warning “NAs introduced by coercion” for each column. I then used na.omit() to remove every row containing at least one NA.

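The cleaned dataset can be used directly for statistical operations such as computing a mean or plotting a histogram. The cost is data loss: only one of the five original rows (Ali) survives, because every other row has at least one missing or unparsable value. This shows why a dataset should be inspected and cleaned before any conclusions are drawn from it.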
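
Because as.numeric() turns unparsable text into NA with only a generic warning, it can help to list the offending entries before converting. The following is a minimal sketch of that check, not part of the original assignment; the helper name is_unparsable is hypothetical, and the data frame is rebuilt so the block runs on its own.

```r
# Same toy data as in Chunk 1, rebuilt so this sketch is self-contained.
messy_data <- data.frame(
  Name  = c("Ali", "Sara", "John", "Nina", NA),
  Age   = c("25", "thirty", "29", NA, "22"),
  Score = c("85", "90", "Ninety", "70", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is present but is not a number.
is_unparsable <- function(x) {
  !is.na(x) & is.na(suppressWarnings(as.numeric(x)))
}

bad_age   <- is_unparsable(messy_data$Age)
bad_score <- is_unparsable(messy_data$Score)

# Show the entries that would otherwise be turned into NA silently.
print(messy_data$Age[bad_age])
print(messy_data$Score[bad_score])

# Convert both columns, then count how many rows na.omit() keeps.
converted <- transform(
  messy_data,
  Age   = suppressWarnings(as.numeric(Age)),
  Score = suppressWarnings(as.numeric(Score))
)
rows_kept <- nrow(na.omit(converted))
cat("Rows kept:", rows_kept, "of", nrow(messy_data), "\n")
```

Listing the bad entries first makes it possible to decide case by case whether to correct a value (for example, replacing “thirty” with 30) or to drop the row.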

Data Extraction with dplyr

# Chunk 2: Data Extraction using dplyr
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

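# Simulated data: 66 years x 100 ages = 6600 rows.
# No seed is set, so the Death values differ on every run.
DF <- data.frame(
  Year = rep(1955:2020, each = 100),
  Age = rep(0:99, 66),
  Death = sample(20:60, 6600, replace = TRUE)
)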

SelectedData <- DF %>% filter((Year - Age) == 1955)
head(SelectedData)

##   Year Age Death
## 1 1955   0    49
## 2 1956   1    58
## 3 1957   2    45
## 4 1958   3    30
## 5 1959   4    32
## 6 1960   5    39

# Modified version to exclude Death
SelectedData_Simple <- DF %>%
  select(Year, Age) %>%
  filter((Year - Age) == 1955)
head(SelectedData_Simple)

##   Year Age
## 1 1955   0
## 2 1956   1
## 3 1957   2
## 4 1958   3
## 5 1959   4
## 6 1960   5

Observation:

This task demonstrates how the dplyr package simplifies filtering data on a condition. I generated a dataset of 6600 rows (66 years × 100 ages) with data.frame(), combining every Year from 1955 to 2020 with every Age from 0 to 99 and assigning each row a random Death count between 20 and 60.

The expression (Year - Age) == 1955 keeps the rows whose implied birth year is 1955, treating Year - Age as the year of birth. The result follows a single birth cohort over time, from age 0 in 1955 onward, which is a common way to extract a subset for demographic analysis.

I also created a simplified version that applies select(Year, Age) before filtering, which drops the “Death” variable. Selecting first works here because the filter condition uses only Year and Age. Trimming columns in this way keeps a dataset compact when a visualization or report does not need every attribute.
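
The cohort filter can be checked with a short sketch. It assumes a fixed seed (42, chosen arbitrarily and not used in the original chunk) so that the simulated Death values are reproducible, and it confirms that the filter returns exactly one row per calendar year.

```r
library(dplyr)

set.seed(42)  # assumed seed, added only to make the simulation reproducible

DF <- data.frame(
  Year  = rep(1955:2020, each = 100),
  Age   = rep(0:99, times = 66),
  Death = sample(20:60, 6600, replace = TRUE)
)

# Keep the rows whose implied birth year (Year - Age) is 1955.
cohort_1955 <- DF %>% filter(Year - Age == 1955)

# The cohort has one row per year from 1955 to 2020, so 66 rows in total,
# with Age running from 0 (in 1955) to 65 (in 2020).
print(nrow(cohort_1955))
print(range(cohort_1955$Age))
```

Checking the row count and the age range is a quick way to confirm that a filter condition selects the intended subset before the result is used further.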

Web Scraping with rvest: BBC News

# Chunk 3: Web Scraping from BBC using rvest
library(rvest)

url <- 'https://www.bbc.com/news'
webpage <- read_html(url)
titles <- webpage %>% html_nodes(".sc-9d830f2a-3") %>% html_text()

print("Top Article Titles from BBC:")

## [1] "Top Article Titles from BBC:"

print(titles[1:10])

##  [1] "First troops arrive in LA after Trump sends National Guard to curb immigration protests"                
##  [2] "Watch: Federal agents use teargas, flash grenades to disperse LA protesters"                            
##  [3] "Colombia presidential hopeful shot in head at rally"                                                    
##  [4] "Italy citizenship referendum: 'I was born here - but feel rejected'"                                    
##  [5] "Doctors trialling 'poo pills' to flush out dangerous superbugs"                                         
##  [6] "Watch: Federal agents use teargas, flash grenades to disperse LA protesters"                            
##  [7] "Bowen: Israel is accused of the gravest war crimes - how governments respond could haunt them for years"
##  [8] "How India's 'biggest art deal' buried MF Husain masterpieces in a bank vault"                           
##  [9] "Gaza health workers say four killed by Israeli gunfire near aid centre"                                 
## [10] "Colombia presidential hopeful shot in head at rally"

Observation:

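Using the rvest package, I performed basic web scraping on the BBC News homepage. The goal was to extract the top news headlines using a CSS selector (.sc-9d830f2a-3). The first ten results contain repeats (items 2 and 6 are identical, as are items 3 and 10), presumably because the same story appears in more than one place on the page.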

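This process shows how R can retrieve live data from the web and turn unstructured page text into a structured character vector. Web scraping is powerful, but it requires knowledge of the page's HTML structure, and scripts need regular maintenance because website layouts change. The selector used here appears to be an auto-generated class name, so it may stop matching whenever the site is rebuilt.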

In this case, the headlines could be used to track trending topics, generate word clouds, or conduct sentiment analysis for media studies.
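
The printed headlines above include repeated entries, so a deduplication step is useful before any counting or text analysis. The sketch below illustrates the idea on a small inline page built with minimal_html() instead of the live BBC site; the class name "headline" and the story titles are made up for illustration, and html_elements() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical stand-in for a fetched page; the markup is invented.
page <- minimal_html('
  <h2 class="headline">Story A</h2>
  <h2 class="headline">Story B</h2>
  <h2 class="headline">Story A</h2>
')

# Extract the text of every matching element, trimming surrounding whitespace.
titles <- html_text(html_elements(page, ".headline"), trim = TRUE)

# unique() keeps the first occurrence of each headline, in page order.
unique_titles <- unique(titles)
print(unique_titles)
```

Working from an inline page also makes the extraction logic testable without a network connection, which helps when the live selector changes.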

Web Scraping with rvest: GeeksforGeeks

# Chunk 4: Web Scraping from GeeksforGeeks
url2 <- "https://www.geeksforgeeks.org/web-scraping-using-r-language/"
webpage2 <- read_html(url2)
headers <- webpage2 %>% html_nodes("h1, h2, h3") %>% html_text()

print("Headings from GeeksforGeeks Web Scraping Article:")

## [1] "Headings from GeeksforGeeks Web Scraping Article:"

print(headers)

##  [1] "Web Scraping using R Language"                
##  [2] "Implementation of Web Scraping using R"       
##  [3] "1. Import rvest libraries"                    
##  [4] "2. Read the Webpage"                          
##  [5] "3. Scrape Data From the Webpage"              
##  [6] "Complete Code Block"                          
##  [7] "Applications of Web scraping"                 
##  [8] "Similar Reads"                                
##  [9] "Thank You!"                                   
## [10] "What kind of Experience do you want to share?"

Observation:

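For the final chunk, I selected a freely accessible technical article on GeeksforGeeks about web scraping in R. I extracted the HTML headings (h1, h2, h3) using html_nodes() and converted them into text using html_text(). The last three results (“Similar Reads”, “Thank You!”, and “What kind of Experience do you want to share?”) appear to come from the surrounding page layout rather than from the article itself.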

This technique is useful for summarizing document outlines or creating searchable metadata. It can assist researchers or developers in auto-indexing technical content or extracting summaries for multiple pages.

This also shows that web scraping can be adapted to different types of websites: not only news platforms but also blogs, tutorials, and academic portals.
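
The printed headings above mix the article outline with unrelated page elements such as “Similar Reads”. One way to keep only the outline is to scope the selector to the article container and record each heading's level. The sketch below shows this on an invented inline page; the article and footer structure is an assumption and may not match the real GeeksforGeeks markup.

```r
library(rvest)

# Hypothetical page: the headings inside <article> form the outline,
# while the heading in <footer> is page furniture.
page <- minimal_html('
  <article>
    <h1>Main title</h1>
    <h2>First step</h2>
    <h3>Detail</h3>
  </article>
  <footer><h2>Similar Reads</h2></footer>
')

# Restrict the search to headings that sit inside the article element.
nodes <- html_elements(page, "article h1, article h2, article h3")

# html_name() returns the tag name, which doubles as the heading level.
outline <- data.frame(
  level = html_name(nodes),
  text  = html_text(nodes, trim = TRUE),
  stringsAsFactors = FALSE
)
print(outline)
```

Keeping the level next to the text preserves the document hierarchy, which is what an auto-generated index or table of contents needs.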
