Matcha Group
Team Members

1. Nailatul Wafiroh · 52250003 · Data Engineer (R) · Web Scraping R
   Working on Web Scraping & Data Programming Process. Performing scraping on links 1 & 2 (Static/Pagination) using R.

2. Nadia Apriani · 52250006 · Data Analyst (R) · Case Study R
   Working on Mini Project Case Study E-Commerce. Focused on processing, merging, and cleaning the e-commerce dataset using R.

3. Dhea Putri Khasanah · 52250009 · Data Analyst (Python) · Case Study Python
   Working on Mini Project Case Study E-Commerce. Focused on processing, merging, and cleaning the e-commerce dataset using Python.

4. Wulan Gustika A. T. · 52250010 · Data Engineer (Python) · Web Scraping Python
   Working on Web Scraping & Data Programming Process. Performing scraping on links 3 & 4 (AJAX/iFrame) using Python.

1 Mini Project: Case Study E-Commerce

1.1 Section A

1.1.1 Logical Approach and Methodology

This section focuses on automating the collection and integration of data from multiple sources. To meet the assignment criteria, five different file formats—CSV, Excel, JSON, TXT, and XML—are processed.

The workflow is organized into three main stages:

  • Data Acquisition: Using appropriate R libraries to read each file format, ensuring the data is ready for further processing.
  • Structural Validation: A looping mechanism combined with if-else logic is used to inspect the rows and columns of each file. This step verifies whether the column structure matches the reference standard before any merging occurs.
  • Data Integration: All validated datasets are consolidated into a single master dataset named df_utama. This stage includes standardizing data types and removing duplicate records to ensure the final output is clean and consistent.
# ============================================================
# SECTION A – DATA COLLECTION USING PROGRAMMING
# ============================================================

# Loading essential libraries for data manipulation and reporting
library(readr)
library(readxl)
library(jsonlite)
library(XML)
library(dplyr)
library(knitr)
library(DT)

# Data Acquisition (Reading multiple file formats)
df_csv  <- read_csv("ecommerce.csv", show_col_types = FALSE)
df_xlsx <- read_excel("ecommerce.xlsx", sheet = "ecommerce_raw")
df_json <- fromJSON("ecommerce.json")
if (!is.data.frame(df_json)) { df_json <- as.data.frame(df_json) }
df_txt  <- read_delim("ecommerce.txt", delim = "|", show_col_types = FALSE)
xml_data <- xmlParse("ecommerce.xml")
df_xml   <- xmlToDataFrame(nodes = getNodeSet(xml_data, "//Record"), stringsAsFactors = FALSE)

# ============================================================
# Automation, Validation, and Integration
# ============================================================

file_list <- list(CSV = df_csv, XLSX = df_xlsx, JSON = df_json, TXT = df_txt, XML = df_xml)

# TASK REQUIREMENT: Display rows, columns, and column names per file
for (file_name in names(file_list)) {
  df_temp <- file_list[[file_name]]
  cat(sprintf("%s: %d rows, %d columns\n", file_name, nrow(df_temp), ncol(df_temp)))
  cat("  Columns:", paste(names(df_temp), collapse = ", "), "\n")
}

# DATA SUMMARY (Interactive Table) 
ringkasan_df <- data.frame(
  File         = names(file_list),
  Rows         = sapply(file_list, nrow),
  Columns      = sapply(file_list, ncol),
  Column_Names = sapply(file_list, function(df) paste(names(df), collapse = ", ")),
  row.names    = NULL
)
datatable(ringkasan_df,
          caption  = "Data Summary per File (Rows, Columns & Column Names)",
          options  = list(pageLength = 5, scrollX = TRUE, dom = 'ftp'),
          rownames = FALSE)
# COLUMN VALIDATION using FOR LOOP + IF/ELSE
reference_cols <- names(df_csv)
files_ready    <- list()
validasi_df    <- data.frame(File = character(), Rows = integer(),
                             Columns = integer(), Status = character(),
                             stringsAsFactors = FALSE)

for (file_name in names(file_list)) {
  df_temp   <- file_list[[file_name]]
  ref_norm  <- tolower(trimws(reference_cols))
  curr_norm <- tolower(trimws(names(df_temp)))

  # IF/ELSE: check whether column structure matches the reference
  if (identical(sort(ref_norm), sort(curr_norm))) {
    status <- "Ready to merge ✓"
    files_ready[[file_name]] <- df_temp
  } else {
    status <- "Need adjustment ✗"
  }
  validasi_df <- rbind(validasi_df, data.frame(
    File    = file_name,
    Rows    = nrow(df_temp),
    Columns = ncol(df_temp),
    Status  = status,
    stringsAsFactors = FALSE
  ))
}

datatable(validasi_df,
          caption  = "Column Structure Validation per File",
          options  = list(pageLength = 5, dom = 'ftp'),
          rownames = FALSE)
# Data Merging into df_utama 
if (length(files_ready) > 0) {
  standardize_cols <- function(df, ref_cols) {
    df <- df[, ref_cols[ref_cols %in% names(df)], drop = FALSE]
    df[] <- lapply(df, as.character)
    return(df)
  }

  files_std <- lapply(files_ready, standardize_cols, ref_cols = reference_cols)
  for (fname in names(files_std)) { files_std[[fname]]$source_file <- fname }

  # Final consolidated dataset: bind all validated files, then drop duplicates
  df_utama <- bind_rows(files_std) %>%
    distinct(order_id, source_file, .keep_all = TRUE)

  # Interactive preview of merged dataset
  datatable(head(df_utama, 10),
            caption  = "Preview of df_utama — Top 10 Rows (All Columns)",
            options  = list(pageLength = 5, scrollX = TRUE),
            rownames = FALSE)

} else {
  cat("\n[ERROR] No datasets matched requirements. Merging cancelled.\n")
}

1.2 Section B

1.2.1 Logical Approach and Data Handling Methodology

Following the data integration in Section A, an evaluation of df_utama is conducted to ensure analytical quality and accuracy. This process focuses on three key areas:

  • Dataset Overview: Validating final dimensions and inspecting data types for each column to ensure consistent variable structures.

  • Data Integrity Check: Detecting data irregularities, including identifying missing values (system NA and empty strings) and duplicate rows to prevent double-counting.

  • Quality Issue Identification: Critically analyzing variable format anomalies and technical risks arising from merging data from diverse source formats.

1.2.2 R Code Implementation and Analysis Output

library(DT)
# ============================================================
# SECTION B – DATA HANDLING
# ============================================================

# Structural Overview & Data Types 
# Map each column to its primitive R class to identify storage types.
tipe_df <- data.frame(
  Column    = names(df_utama),
  Data_Type = sapply(df_utama, class),
  row.names = NULL
)

datatable(
  tipe_df,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong(paste0(
      "Task 1 — Data Type per Column in df_utama",
      "  |  Total Rows: ", nrow(df_utama),
      "  ·  Total Columns: ", ncol(df_utama)
    ))
  ),
  options  = list(pageLength = 10, scrollX = TRUE),
  rownames = FALSE
)
# Missing Values & Duplicate Rows
# Detect both explicit NAs and empty strings/whitespace as missing values.
missing_df <- data.frame(
  Column        = names(df_utama),
  Missing_Count = sapply(df_utama, function(x) sum(is.na(x) | trimws(as.character(x)) == "")),
  row.names     = NULL
)

# Calculate relative impact of missingness per variable.
missing_df$Percentage <- paste0(
  round(missing_df$Missing_Count / nrow(df_utama) * 100, 2), "%"
)

total_missing <- sum(missing_df$Missing_Count)
dup_count      <- sum(duplicated(df_utama))

datatable(
  missing_df,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong(paste0(
      "Task 2 — Missing Values per Column",
      "  |  Total Missing: ", total_missing,
      "  ·  Duplicate Rows: ", dup_count
    ))
  ),
  options  = list(pageLength = 10, scrollX = TRUE),
  rownames = FALSE
)
# Display a preview of duplicated records if any exist for verification.
if (dup_count > 0) {
  datatable(
    head(df_utama[duplicated(df_utama), ], 5),
    caption  = htmltools::tags$caption(
      style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
      htmltools::strong(paste0(
        "Task 2 — Sample Duplicate Rows  |  Total: ", dup_count, " duplicate rows found"
      ))
    ),
    options  = list(pageLength = 5, scrollX = TRUE),
    rownames = FALSE
  )
}
# Data Quality Insights 
# Consolidate metadata findings into a human-readable quality report.
most_affected_col <- missing_df$Column[which.max(missing_df$Missing_Count)]
most_affected_n   <- max(missing_df$Missing_Count)
char_cols          <- sum(sapply(df_utama, is.character))
source_summary    <- table(df_utama$source_file)

quality_issues <- data.frame(
  No    = 1:4,
  Issue = c(
    "Missing Values",
    "Duplicate Rows",
    "Type Casting Issues",
    "Source Fragmentation"
  ),
  Description = c(
    paste0(
      "Identified ", total_missing, " NA/null entries. ",
      "Highest missingness in: '", most_affected_col,
      "' (", most_affected_n, " missing entries)."
    ),
    paste0(
      "Detected ", dup_count, " redundant rows. ",
      "May cause statistical bias or inflated metrics."
    ),
    paste0(
      char_cols, " of ", ncol(df_utama), " columns are 'character'. ",
      "Numeric fields like 'price/quantity' require conversion for calculation."
    ),
    paste0(
      "Merged from ", length(source_summary), " distinct files. ",
      "Risk of heterogeneous encoding and delimiter mismatches."
    )
  ),
  row.names = NULL
)

datatable(
  quality_issues,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 3 — Data Quality Issues Summary")
  ),
  options  = list(
    pageLength = 10,
    scrollX    = TRUE,
    dom        = 'ftp',
    columnDefs = list(
      list(width = '8%',  targets = 0),
      list(width = '25%', targets = 1),
      list(width = '67%', targets = 2)
    )
  ),
  rownames = FALSE
) %>%
  formatStyle(
    columns    = 'No',
    textAlign  = 'center',
    fontWeight = 'bold'
  ) %>%
  formatStyle(
    columns    = 'Issue',
    fontWeight = 'bold',
    textAlign  = 'center'
  )

1.3 Section C

1.3.1 Data Cleaning Logical Framework

This stage focuses on executing the cleaning of df_utama using programmatic logic to ensure the dataset is analysis-ready. The process covers three key areas:

  • Price Normalization: Removing currency formats (Rp) and enforcing non-negative constraints to achieve mathematical consistency.

  • Missing Value Handling: Filling empty payment_method entries with “Unknown” and customer_rating with the median. The median is chosen for its robustness against extreme outliers.

  • Categorical Standardization Loop: Utilizing a single loop mechanism to perform whitespace trimming and unify text formatting across platform, order_status, category, and payment_method simultaneously.

1.3.2 Technical Implementation of Data Cleaning

# ============================================================
# SECTION C – DATA CLEANING
# ============================================================
library(DT)
library(knitr)

df_clean <- df_utama
# Currency Cleaning (4 Numeric Columns) 

clean_currency_vec <- function(vec) {
  result <- vector("numeric", length(vec))
  for (i in seq_along(vec)) {
    val <- as.character(vec[i])
    
    # Handle NA/empty strings
    if (is.na(val) || trimws(val) == "" || toupper(trimws(val)) == "NA") {
      result[i] <- NA_real_
      next
    }
    
    # Convert IDR format (Rp 10.000,00) to standard numeric
    if (grepl("Rp", trimws(val), ignore.case = TRUE)) {
      val <- gsub("Rp\\s*", "", val, ignore.case = TRUE)
      val <- gsub("\\.", "", val)
      val <- gsub(",", ".", val)
    }
    
    num <- suppressWarnings(as.numeric(trimws(val)))
    
    # Set negative values to 0; keep others as numeric
    if (!is.na(num) && num < 0) {
      result[i] <- 0
    } else {
      result[i] <- ifelse(is.na(num), NA_real_, num)
    }
  }
  return(result)
}

numeric_cols <- c("unit_price", "gross_sales", "net_sales", "discount_value")

for (col in numeric_cols) {
  df_clean[[col]] <- clean_currency_vec(df_clean[[col]])
}

num_summary <- data.frame(
  Column   = numeric_cols,
  Min      = sapply(numeric_cols, function(c) format(round(min(df_clean[[c]], na.rm = TRUE)), big.mark = ",")),
  Q1       = sapply(numeric_cols, function(c) format(round(quantile(df_clean[[c]], 0.25, na.rm = TRUE)), big.mark = ",")),
  Median   = sapply(numeric_cols, function(c) format(round(median(df_clean[[c]], na.rm = TRUE)), big.mark = ",")),
  Mean     = sapply(numeric_cols, function(c) format(round(mean(df_clean[[c]], na.rm = TRUE)), big.mark = ",")),
  Max      = sapply(numeric_cols, function(c) format(round(max(df_clean[[c]], na.rm = TRUE)), big.mark = ",")),
  NA_Count = sapply(numeric_cols, function(c) sum(is.na(df_clean[[c]]))),
  row.names = NULL
)

datatable(
  num_summary,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 2 — Summary After Currency Cleaning")
  ),
  options  = list(pageLength = 4, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Missing Value Handling

# Use median for rating imputation to mitigate outlier impact
rating_median <- median(suppressWarnings(as.numeric(df_clean$customer_rating)), na.rm = TRUE)

for (i in 1:nrow(df_clean)) {
  # Fill missing categorical with 'Unknown'
  if (is.na(df_clean$payment_method[i]) || trimws(as.character(df_clean$payment_method[i])) == "") {
    df_clean$payment_method[i] <- "Unknown"
  }
  # Fill missing numerical rating with median
  if (is.na(df_clean$customer_rating[i]) || trimws(as.character(df_clean$customer_rating[i])) == "") {
    df_clean$customer_rating[i] <- rating_median
  }
}

df_clean$customer_rating <- suppressWarnings(as.numeric(df_clean$customer_rating))

missing_tbl <- data.frame(
  Column = c("payment_method", "customer_rating"),
  Missing_After = c(
    sum(is.na(df_clean$payment_method) | trimws(df_clean$payment_method) == ""),
    sum(is.na(df_clean$customer_rating))
  ),
  Method = c(
    "Filled with 'Unknown'",
    paste0("Filled with Median (", rating_median, ")")
  )
)

datatable(
  missing_tbl,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 3 — Missing Value Handling Result")
  ),
  options  = list(pageLength = 5, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Categorical Standardization via Loop 

for (i in 1:nrow(df_clean)) {

  # Standardization: platform
  val_platform <- trimws(tolower(as.character(df_clean$platform[i])))
  if (val_platform == "shopee") {
    df_clean$platform[i] <- "Shopee"
  } else if (val_platform %in% c("tokped", "tokopedia")) {
    df_clean$platform[i] <- "Tokopedia"
  } else if (val_platform == "blibli") {
    df_clean$platform[i] <- "Blibli"
  } else if (val_platform == "lazada") {
    df_clean$platform[i] <- "Lazada"
  } else if (val_platform == "tiktok shop") {
    df_clean$platform[i] <- "TikTok Shop"
  }

  # Standardization: order_status
  val_status <- trimws(tolower(as.character(df_clean$order_status[i])))
  if (val_status == "delivered" || val_status == "completed") {
    df_clean$order_status[i] <- "Completed"
  } else if (val_status %in% c("cancelled", "cancel", "batal")) {
    df_clean$order_status[i] <- "Cancelled"
  } else if (val_status == "on delivery") {
    df_clean$order_status[i] <- "On Delivery"
  } else if (val_status == "shipped") {
    df_clean$order_status[i] <- "Shipped"
  } else if (val_status %in% c("retur", "returned")) {
    df_clean$order_status[i] <- "Returned"
  }

  # Standardization: category
  val_category <- trimws(tolower(gsub("_", " ", as.character(df_clean$category[i]))))
  if (val_category == "beauty") {
    df_clean$category[i] <- "Beauty"
  } else if (val_category == "electronics") {
    df_clean$category[i] <- "Electronics"
  } else if (val_category == "fashion") {
    df_clean$category[i] <- "Fashion"
  } else if (val_category == "home living") {
    df_clean$category[i] <- "Home Living"
  } else if (val_category == "sports") {
    df_clean$category[i] <- "Sports"
  }

  # Clean whitespace for payment_method
  df_clean$payment_method[i] <- trimws(as.character(df_clean$payment_method[i]))
}

# Platform Table
platform_tbl <- as.data.frame(table(df_clean$platform))
names(platform_tbl) <- c("Platform", "Count")
datatable(
  platform_tbl,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 1 — Platform Distribution After Cleaning")
  ),
  options  = list(pageLength = 10, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Order Status Table
status_tbl <- as.data.frame(table(df_clean$order_status))
names(status_tbl) <- c("Order_Status", "Count")
datatable(
  status_tbl,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 4 — Order Status Distribution After Cleaning")
  ),
  options  = list(pageLength = 10, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Category Table
category_tbl <- as.data.frame(table(df_clean$category))
names(category_tbl) <- c("Category", "Count")
datatable(
  category_tbl,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 4 — Category Distribution After Cleaning")
  ),
  options  = list(pageLength = 10, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Summary of batch categorical cleaning performed in the loop
loop_result <- data.frame(
  Column  = c("platform", "order_status", "category", "payment_method"),
  Action  = c(
    "Standardized names (IF/Else)",
    "Mapped synonyms (IF/Else)",
    "Cleaned underscores (gsub)",
    "Trimmed whitespace (trimws)"
  ),
  Status  = c("Done", "Done", "Done", "Done")
)
datatable(
  loop_result,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong("Task 5 — Loop-based Cleaning Process Summary")
  ),
  options  = list(pageLength = 10, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)
# Final Audit: Missing Values After Cleaning

missing_after <- data.frame(
  Column        = names(df_clean),
  Missing_Count = sapply(df_clean, function(x) sum(is.na(x) | trimws(as.character(x)) == "")),
  Percentage    = paste0(
    round(sapply(df_clean, function(x) sum(is.na(x) | trimws(as.character(x)) == "")) /
          nrow(df_clean) * 100, 2), "%"),
  row.names = NULL
)

datatable(
  missing_after,
  caption  = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 13px; padding: 8px 0 4px 0;",
    htmltools::strong(paste0(
      "Cleaning Summary — Missing Values Post-Cleaning",
      "  |  Total Rows: ", nrow(df_clean)
    ))
  ),
  options  = list(pageLength = 10, dom = 'ftp', scrollX = TRUE),
  rownames = FALSE
)

1.4 Section D

1.4.1 Business Logic Analysis and Column Definition

In this section, new variables are constructed based on predefined business rules. The primary focus of this task is to implement branching logic using if, else if, and else structures within a looping mechanism to classify transaction data.

Three new columns are added to the dataset:

  • is_high_value: Used to identify transactions with net sales exceeding 1,000,000.

  • order_priority: Classifies order priority into three levels (High, Medium, Low) using a nested IF method based on the sales amount.

  • valid_transaction: Evaluates transaction validity based on order status, where any “Cancelled” order is flagged as an invalid transaction.

1.4.2 Procedural Implementation Using R

# ============================================================
# SECTION D – CONDITIONAL LOGIC
# ============================================================

# Ensure net_sales is numeric (already cleaned in Section C)
df_clean$net_sales <- suppressWarnings(as.numeric(df_clean$net_sales))

# ============================================================
# TASK 1: is_high_value
# ============================================================

# Initialize the column with NA before processing
df_clean$is_high_value <- NA_character_

for (i in 1:nrow(df_clean)) {
  # Logic: Binary classification based on a 1M threshold.
  # Check for NA to prevent logical errors during comparison.
  if (!is.na(df_clean$net_sales[i]) && df_clean$net_sales[i] > 1000000) {
    df_clean$is_high_value[i] <- "Yes"
  } else {
    df_clean$is_high_value[i] <- "No"
  }
}

high_value_tbl <- as.data.frame(table(df_clean$is_high_value))
names(high_value_tbl) <- c("is_high_value", "Count")
kable(high_value_tbl, format = "markdown", align = "l",
      caption = "is_high_value Distribution")

# ============================================================
# TASK 2: order_priority (REQUIRED Nested IF)
# ============================================================

df_clean$order_priority <- NA_character_

for (i in 1:nrow(df_clean)) {
  val <- df_clean$net_sales[i]
  # Logic: Multi-tier classification using Nested IF to segment sales performance.
  if (!is.na(val)) {
    if (val >= 1000000) {
      df_clean$order_priority[i] <- "High"
    } else if (val >= 500000) {
      # This part acts as the nested condition for values < 1,000,000
      df_clean$order_priority[i] <- "Medium"
    } else {
      df_clean$order_priority[i] <- "Low"
    }
  }
}

priority_tbl <- as.data.frame(table(df_clean$order_priority))
names(priority_tbl) <- c("order_priority", "Count")
kable(priority_tbl, format = "markdown", align = "l",
      caption = "order_priority Distribution (Nested IF)")

# ============================================================
# TASK 3: valid_transaction
# ============================================================

df_clean$valid_transaction <- NA_character_

for (i in 1:nrow(df_clean)) {
  # Logic: Categorical validation based on transaction status strings.
  # Flags "Cancelled" orders as "Invalid", otherwise "Valid".
  if (!is.na(df_clean$order_status[i]) && df_clean$order_status[i] == "Cancelled") {
    df_clean$valid_transaction[i] <- "Invalid"
  } else {
    df_clean$valid_transaction[i] <- "Valid"
  }
}

valid_tbl <- as.data.frame(table(df_clean$valid_transaction))
names(valid_tbl) <- c("valid_transaction", "Count")
kable(valid_tbl, format = "markdown", align = "l",
      caption = "valid_transaction Distribution")

# ============================================================
# FINAL PREVIEW – 3 New Columns (Interactive Table)
# ============================================================

summary_d <- data.frame(
  Column      = c("is_high_value", "order_priority", "valid_transaction"),
  Logic       = c("net_sales > 1,000,000 → Yes / No",
                  ">= 1M → High | >= 500K → Medium | < 500K → Low",
                  "order_status = Cancelled → Invalid / Valid"),
  Type        = c("Binary IF", "Nested IF (3 levels)", "Binary IF"),
  Unique_Values = c(
    paste(sort(unique(df_clean$is_high_value)),   collapse = " / "),
    paste(sort(unique(df_clean$order_priority)),  collapse = " / "),
    paste(sort(unique(df_clean$valid_transaction)), collapse = " / ")
  )
)
kable(summary_d, format = "markdown", align = "l",
      caption = "Section D – New Columns & Logic Applied")

datatable(
  head(df_clean[, c("order_id", "net_sales", "order_status",
                     "is_high_value", "order_priority", "valid_transaction")], 10),
  caption  = "Sample: 3 New Columns Section D",
  options  = list(pageLength = 10, scrollX = TRUE, dom = 'ftp'),
  rownames = FALSE
)
is_high_value Distribution
  is_high_value   Count
  No              6055
  Yes             3775

order_priority Distribution (Nested IF)
  order_priority   Count
  High             3775
  Low              3995
  Medium           2060

valid_transaction Distribution
  valid_transaction   Count
  Invalid             755
  Valid               9075

Section D – New Columns & Logic Applied
  Column              Logic                                            Type                   Unique_Values
  is_high_value       net_sales > 1,000,000 → Yes / No                 Binary IF              No / Yes
  order_priority      >= 1M → High | >= 500K → Medium | < 500K → Low   Nested IF (3 levels)   High / Low / Medium
  valid_transaction   order_status = Cancelled → Invalid / Valid       Binary IF              Invalid / Valid

1.5 Section E

1.5.1 Analytical Thinking — Business Insights from the Cleaned Dataset

The following three questions are answered using live computation from df_clean, filtering only valid transactions (valid_transaction == "Valid") to ensure accuracy.

# ============================================================
# SECTION E – ANALYTICAL THINKING  |  Setup & Helpers
# ============================================================

library(ggplot2)
library(dplyr)
library(plotly)
library(scales)
library(htmltools)

# Filter valid transactions and remove duplicate Order IDs for analysis integrity
df_valid <- df_clean %>%
  filter(valid_transaction == "Valid") %>%
  distinct(order_id, .keep_all = TRUE)

# Light theme base
light_theme <- theme(
  plot.background    = element_rect(fill = "#ffffff", color = NA),
  panel.background   = element_rect(fill = "#ffffff", color = NA),
  panel.grid.major   = element_line(color = "#e5e7eb", linewidth = 0.4),
  panel.grid.minor   = element_blank(),
  axis.text          = element_text(color = "#6b7280", size = 10),
  axis.title         = element_text(color = "#374151", size = 11),
  plot.title         = element_text(color = "#111827", size = 16, face = "bold", margin = margin(b = 4)),
  plot.subtitle      = element_text(color = "#6b7280", size = 11, margin = margin(b = 16)),
  legend.background  = element_rect(fill = "#ffffff", color = NA),
  legend.text        = element_text(color = "#374151"),
  legend.title       = element_text(color = "#374151"),
  plot.margin        = margin(16, 20, 16, 20)
)

# Helper: HTML metric card
metric_card <- function(label, value, subtitle = "", border_color = "#3b82f6") {
  htmltools::div(
    style = paste0(
      "background:#f9fafb;",
      "border:1px solid #e5e7eb;border-top:3px solid ", border_color, ";",
      "border-radius:8px;padding:18px 20px;flex:1;min-width:0;"
    ),
    htmltools::p(label,
      style = "color:#6b7280;font-size:11px;font-weight:600;letter-spacing:.08em;margin:0 0 6px 0;font-family:monospace;"
    ),
    htmltools::h2(value,
      style = "color:#111827;font-size:28px;font-weight:800;margin:0 0 4px 0;font-family:sans-serif;"
    ),
    htmltools::p(subtitle,
      style = "color:#9ca3af;font-size:12px;margin:0;font-family:sans-serif;"
    )
  )
}

# Helper: section header HTML
section_header <- function(tag_label, title, subtitle) {
  htmltools::div(
    style = "background:#f9fafb;padding:24px 24px 8px 24px;border-bottom:1px solid #e5e7eb;",
    htmltools::span(tag_label,
      style = "background:#e5e7eb;color:#374151;font-size:11px;font-family:monospace;padding:4px 10px;border-radius:4px;"
    ),
    htmltools::h1(title,
      style = "color:#111827;font-size:28px;font-weight:800;margin:12px 0 4px 0;font-family:sans-serif;"
    ),
    htmltools::p(subtitle,
      style = "color:#6b7280;font-size:13px;margin:0;font-family:sans-serif;"
    )
  )
}
# ─────────────────────────────────────────────────────────────────────────────
# Q1: PLATFORM DOMINANCE
# ─────────────────────────────────────────────────────────────────────────────

platform_df <- df_valid %>%
  count(platform, name = "Transactions") %>%
  arrange(desc(Transactions)) %>%
  mutate(
    Share_Pct   = Transactions / sum(Transactions) * 100,
    Share_Label = paste0(round(Share_Pct, 1), "%"),
    platform    = factor(platform, levels = rev(platform))
  )

top_platform <- as.character(platform_df$platform[1])
top_n        <- platform_df$Transactions[1]
top_pct      <- platform_df$Share_Label[1]
total_tx     <- sum(platform_df$Transactions)
avg_per_plt  <- round(total_tx / nrow(platform_df))

platform_colors <- c("#6366f1","#3b82f6","#f59e0b","#10b981","#ef4444")

q1_cards <- htmltools::div(
  style = "display:flex;gap:12px;padding:16px 24px;background:#ffffff;flex-wrap:nowrap;",
  metric_card("DOMINANT PLATFORM",  top_platform,                     paste0("#1 of ", nrow(platform_df), " platforms"), "#6366f1"),
  metric_card("TOP TRANSACTIONS",    format(top_n, big.mark=","),      paste0(top_pct, " of all transactions"),            "#10b981"),
  metric_card("TOTAL TRANSACTIONS", format(total_tx, big.mark=","),   "across all platforms",                             "#3b82f6"),
  metric_card("AVG / PLATFORM",     format(avg_per_plt, big.mark=","),"transactions per platform",                        "#f59e0b")
)

p_q1 <- ggplot(platform_df, aes(x = Transactions, y = platform, fill = platform)) +
  geom_col(width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = paste0(Transactions, "  ·  ", Share_Label)),
            hjust = -0.05, color = "#111827", size = 3.8, fontface = "bold") +
  geom_vline(xintercept = avg_per_plt, color = "#6b7280",
             linetype = "dashed", linewidth = 0.5, alpha = 0.6) +
  annotate("text", x = avg_per_plt + (max(platform_df$Transactions) * 0.01),
           y = nrow(platform_df) + 0.4, label = "average",
           color = "#6b7280", size = 3.2, hjust = 0) +
  scale_fill_manual(values = setNames(platform_colors, levels(platform_df$platform))) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.18)), labels = comma) +
  labs(title = paste0(top_platform, " leads with ", top_pct, " of all transactions"),
       x = "Transaction Count", y = NULL) +
  light_theme

q1_plotly <- ggplotly(p_q1, tooltip = c("x","y")) %>%
  layout(
    paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff",
    font   = list(color = "#374151"),
    margin = list(l = 80, r = 40, t = 40, b = 40),
    xaxis  = list(gridcolor = "#e5e7eb"),
    yaxis  = list(gridcolor = "#e5e7eb")
  )

htmltools::browsable(
  htmltools::div(
    style = "background:#ffffff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;margin-bottom:32px;",
    section_header("Q1 · ANALYTICAL THINKING", "Platform Dominance",
                   "Transaction distribution by sales platform"),
    q1_cards,
    htmltools::div(style = "padding:8px 24px 24px;background:#ffffff;",
      htmltools::tagList(q1_plotly)
    )
  )
)
Rendered output — Q1 · Analytical Thinking: Platform Dominance
(Transaction distribution by sales platform)

  Dominant platform  : Shopee (#1 of 5 platforms)
  Top transactions   : 375 (20.7% of all transactions)
  Total transactions : 1,815 across all platforms
  Avg / platform     : 363 transactions per platform

# ─────────────────────────────────────────────────────────────────────────────
# Q2: PRODUCT CATEGORY DISTRIBUTION
# ─────────────────────────────────────────────────────────────────────────────

category_df <- df_valid %>%
  count(category, name = "Transactions") %>%
  arrange(desc(Transactions)) %>%
  mutate(
    Share_Pct   = Transactions / sum(Transactions) * 100,
    Share_Label = paste0(round(Share_Pct, 1), "%"),
    category    = factor(category, levels = rev(category))
  )

top_cat     <- as.character(category_df$category[which.max(category_df$Transactions)])
top_cat_n   <- max(category_df$Transactions)
top_cat_pct <- category_df$Share_Label[which.max(category_df$Transactions)]
n_cats      <- nrow(category_df)
med_count   <- median(category_df$Transactions)

cat_colors <- c("#6366f1","#10b981","#f59e0b","#3b82f6","#ef4444")

q2_cards <- htmltools::div(
  style = "display:flex;gap:12px;padding:16px 24px;background:#ffffff;flex-wrap:nowrap;",
  metric_card("TOP CATEGORY",      top_cat,                        paste0(top_cat_pct, " of total"), "#6366f1"),
  metric_card("TOTAL CATEGORIES", as.character(n_cats),            "product categories",              "#10b981"),
  metric_card("HIGHEST COUNT",    format(top_cat_n, big.mark=","), top_cat,                           "#3b82f6"),
  metric_card("MEDIAN COUNT",     format(med_count, big.mark=","), "per category",                    "#f59e0b")
)

# Donut chart for categorical share comparison
donut_df <- category_df %>% arrange(desc(Transactions))

p2_donut_ly <- plot_ly(
  donut_df,
  labels        = ~as.character(category),
  values        = ~Transactions,
  type          = "pie",
  hole          = 0.55,
  marker        = list(colors = cat_colors,
                       line   = list(color = "#ffffff", width = 2)),
  textinfo      = "percent",
  textfont      = list(color = "#ffffff", size = 12),
  hovertemplate = "%{label}<br>%{value} transactions<br>%{percent}<extra></extra>"
) %>%
  layout(
    paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff",
    showlegend    = FALSE,
    annotations   = list(list(
      text      = paste0(top_cat_pct, "<br>", top_cat),
      x = 0.5, y = 0.5, xref = "paper", yref = "paper",
      showarrow = FALSE,
      font      = list(color = "#111827", size = 13, family = "sans-serif")
    )),
    margin = list(l = 10, r = 10, t = 10, b = 10)
  )

# Bar chart (horizontal)
p_cat_bar <- ggplot(category_df, aes(x = Transactions, y = category, fill = category)) +
  geom_col(width = 0.65, show.legend = FALSE) +
  geom_text(aes(label = Transactions), hjust = -0.2,
            color = "#111827", size = 3.8, fontface = "bold") +
  scale_fill_manual(values = setNames(cat_colors, levels(category_df$category))) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15)), labels = comma) +
  labs(x = "Transaction Count", y = NULL,
       title = paste0(top_cat, " leads with ", top_cat_n,
                      " transactions (", top_cat_pct, ")")) +
  light_theme

p2_bar_ly <- ggplotly(p_cat_bar, tooltip = c("x", "y")) %>%
  layout(paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff",
         font = list(color = "#374151"), showlegend = FALSE,
         margin = list(l = 10, r = 10, t = 40, b = 40),
         xaxis = list(gridcolor = "#e5e7eb"),
         yaxis = list(gridcolor = "#e5e7eb"))

htmltools::browsable(
  htmltools::div(
    style = "background:#ffffff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;margin-bottom:32px;",
    section_header("Q2 · ANALYTICAL THINKING", "Product Category Distribution",
                   "Transaction frequency by category — donut & ranked bar"),
    q2_cards,
    htmltools::div(
      style = "display:flex;gap:0;padding:8px 24px 24px;background:#ffffff;align-items:center;",
      htmltools::div(style = "flex:0 0 38%;",
        htmltools::tagList(p2_donut_ly)
      ),
      htmltools::div(style = "flex:0 0 62%;",
        htmltools::tagList(p2_bar_ly)
      )
    )
  )
)
Rendered output — Q2 · Analytical Thinking: Product Category Distribution
(Transaction frequency by category — donut & ranked bar)

  Top category     : Fashion (21.3% of total)
  Total categories : 5 product categories
  Highest count    : 386 (Fashion)
  Median count     : 375 per category

# ─────────────────────────────────────────────────────────────────────────────
# Q3: TRANSACTION STATUS DISTRIBUTION
# ─────────────────────────────────────────────────────────────────────────────

status_df <- df_clean %>%
  count(order_status, name = "Count") %>%
  arrange(desc(Count)) %>%
  mutate(
    Share_Pct    = Count / sum(Count) * 100,
    Share_Label  = paste0(round(Share_Pct, 1), "%"),
    order_status = factor(order_status, levels = rev(order_status))
  )

top_status     <- as.character(status_df$order_status[1])
top_status_n   <- status_df$Count[1]
top_status_pct <- status_df$Share_Label[1]
total_all      <- sum(status_df$Count)
n_valid        <- sum(df_clean$valid_transaction == "Valid",   na.rm = TRUE)
n_invalid      <- sum(df_clean$valid_transaction == "Invalid", na.rm = TRUE)
pct_valid      <- round(n_valid   / total_all * 100, 1)
pct_invalid    <- round(n_invalid / total_all * 100, 1)
avg_status     <- round(mean(status_df$Count))

q3_cards <- htmltools::div(
  style = "display:flex;gap:12px;padding:16px 24px;background:#ffffff;flex-wrap:nowrap;",
  metric_card("MOST COMMON STATUS",  top_status,                      paste0(top_status_pct, " of total"),     "#10b981"),
  metric_card("VALID",                format(n_valid,   big.mark=","), paste0(pct_valid,   "% of all records"), "#10b981"),
  metric_card("INVALID (CANCELLED)", format(n_invalid, big.mark=","), paste0(pct_invalid, "% of all records"), "#ef4444"),
  metric_card("TOTAL RECORDS",        format(total_all, big.mark=","), "all statuses",                          "#3b82f6")
)

status_bar_colors <- c(
  "Completed"    = "#10b981",
  "Cancelled"    = "#ef4444",
  "Returned"     = "#6366f1",
  "Shipped"      = "#3b82f6",
  "On Delivery" = "#f59e0b"
)

status_df_plot <- status_df %>%
  arrange(desc(Count)) %>%
  mutate(order_status = factor(order_status, levels = rev(as.character(order_status))))

p_status_bar <- ggplot(status_df_plot,
                       aes(x = order_status, y = Count, fill = order_status)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(aes(label = paste0(format(Count, big.mark=","), "\n", Share_Label)),
            vjust = -0.4, color = "#111827", size = 3.2, fontface = "bold") +
  geom_hline(yintercept = avg_status, color = "#6b7280",
             linetype = "dashed", linewidth = 0.5, alpha = 0.6) +
  annotate("text", x = nrow(status_df_plot) + 0.4, y = avg_status,
           label = paste0("avg ", format(avg_status, big.mark=",")),
           color = "#6b7280", size = 3.2, hjust = 1) +
  scale_fill_manual(values = status_bar_colors) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(title = paste0(top_status, " is the most common status · ",
                      pct_valid, "% of records are valid"),
       x = NULL, y = "Transaction Count") +
  light_theme

# Compare Valid vs Invalid data volumes
vi_df <- data.frame(
  label = c("Valid", "Invalid"),
  value = c(n_valid, n_invalid)
)

p3_donut_ly <- plot_ly(
  vi_df,
  labels        = ~label,
  values        = ~value,
  type          = "pie",
  hole          = 0.55,
  marker        = list(colors = c("#10b981", "#ef4444"),
                       line   = list(color = "#ffffff", width = 2)),
  textinfo      = "label+percent",
  textfont      = list(color = "#ffffff", size = 12),
  hovertemplate = "%{label}<br>%{value:,} records<br>%{percent}<extra></extra>"
) %>%
  layout(
    title         = list(text = "Valid vs Invalid Split",
                         font = list(color = "#111827", size = 13), x = 0.5),
    paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff",
    showlegend    = FALSE,
    annotations   = list(list(
      text      = paste0(round(pct_valid), "%<br>valid"),
      x = 0.5, y = 0.5, xref = "paper", yref = "paper",
      showarrow = FALSE,
      font      = list(color = "#10b981", size = 15, family = "sans-serif")
    )),
    margin = list(l = 10, r = 10, t = 40, b = 10)
  )

p3_bar_ly <- ggplotly(p_status_bar, tooltip = c("x", "y")) %>%
  layout(paper_bgcolor = "#ffffff", plot_bgcolor = "#ffffff",
         font = list(color = "#374151"), showlegend = FALSE,
         margin = list(l = 40, r = 10, t = 40, b = 40),
         xaxis = list(gridcolor = "#e5e7eb"),
         yaxis = list(gridcolor = "#e5e7eb"))

htmltools::browsable(
  htmltools::div(
    style = "background:#ffffff;border:1px solid #e5e7eb;border-radius:12px;overflow:hidden;margin-bottom:32px;",
    section_header("Q3 · ANALYTICAL THINKING", "Transaction Status Distribution",
                   "Valid vs invalid breakdown — frequency & proportion per status"),
    q3_cards,
    htmltools::div(
      style = "display:flex;gap:0;padding:8px 24px 24px;background:#ffffff;align-items:center;",
      htmltools::div(style = "flex:0 0 62%;",
        htmltools::tagList(p3_bar_ly)
      ),
      htmltools::div(style = "flex:0 0 38%;",
        htmltools::tagList(p3_donut_ly)
      )
    )
  )
)
Rendered output — Q3 · Analytical Thinking: Transaction Status Distribution
(Valid vs invalid breakdown — frequency & proportion per status)

  Most common status  : Completed (78.8% of total)
  Valid               : 9,075 (92.3% of all records)
  Invalid (Cancelled) : 755 (7.7% of all records)
  Total records       : 9,830 (all statuses)

# Final Dataset Snapshot

snapshot <- data.frame(
  Metric = c(
    "Total Rows in df_clean", "Total Columns",
    "Valid Transactions", "Invalid (Cancelled)",
    "High-Value Orders (net_sales > 1M)",
    "Most Dominant Platform", "Top Product Category", "Top Order Status"
  ),
  Value = c(
    nrow(df_clean), ncol(df_clean),
    sum(df_clean$valid_transaction == "Valid",   na.rm = TRUE),
    sum(df_clean$valid_transaction == "Invalid", na.rm = TRUE),
    sum(df_clean$is_high_value == "Yes",         na.rm = TRUE),
    top_platform, top_cat, top_status
  )
)

DT::datatable(snapshot,
              caption  = "Final Dataset Snapshot — Section E Summary",
              options  = list(pageLength = 10, dom = 'ftp'),
              rownames = FALSE)

2 Web Scraping & Data Programming Process

2.1 Section A — Data Collection Using Programming

2.1.1 Logical Approach and Methodology

This section collects data from two websites with different HTML structures using the rvest package in R.

The workflow is organized into two main stages:

  • Website 1 — Static HTML (Countries of the World): All data is available in a single HTTP response, so one read_html() call is enough. A for loop iterates through each country element one by one. If any element is missing, an IF check fills it in with a default value to keep the loop from breaking.

  • Website 2 — Pagination & Form (Hockey Teams): Since data is spread across multiple pages, the total page count is detected first from page 1. The loop then runs sequentially from the first to the last page. tryCatch() is wrapped around each request so that if one page fails, scraping continues on to the next without stopping entirely.

2.1.2 Website 1: Countries of the World

URL: https://www.scrapethissite.com/pages/simple/
Type: Static HTML | Method: rvest

library(rvest)   # read_html(), html_nodes(), html_text()
library(readr)   # write_csv()

url_countries <- "https://www.scrapethissite.com/pages/simple/"
page_countries <- read_html(url_countries)

# Select all country blocks using their shared CSS class
country_nodes <- page_countries %>% html_nodes("div.country")

# Initialize empty vectors — filled inside the loop below
country_name       <- c()
country_capital    <- c()
country_population <- c()
country_area       <- c()

# Loop over every country node and pull out each field individually
for (node in country_nodes) {
  name <- node %>% html_node("h3.country-name") %>% html_text(trim = TRUE)

  cap_node <- node %>% html_node("span.country-capital")
  capital  <- if (!is.na(cap_node)) cap_node %>% html_text(trim = TRUE) else "Unknown"

  pop_node   <- node %>% html_node("span.country-population")
  population <- ifelse(!is.na(pop_node), html_text(pop_node, trim = TRUE), NA)

  area_node <- node %>% html_node("span.country-area")
  area      <- ifelse(!is.na(area_node), html_text(area_node, trim = TRUE), NA)

  country_name       <- c(country_name,       name)
  country_capital    <- c(country_capital,    capital)
  country_population <- c(country_population, population)
  country_area       <- c(country_area,       area)
}

df_countries <- data.frame(
  country_name = country_name,
  capital      = country_capital,
  population   = suppressWarnings(as.numeric(country_population)),
  area_sq_km   = suppressWarnings(as.numeric(country_area)),
  stringsAsFactors = FALSE
)

write_csv(df_countries, "countries_of_the_world.csv")
Total country nodes detected     : 250 
Total records successfully parsed: 250 rows
Dataset exported to              : countries_of_the_world.csv
Sample — First 10 Records: Countries of the World

  country_name           capital            population   area_sq_km
  Andorra                Andorra la Vella   84000        468
  United Arab Emirates   Abu Dhabi          4975593      82880
  Afghanistan            Kabul              29121286     647500
  Antigua and Barbuda    St. John’s         86754        443
  Anguilla               The Valley         13254        102
  Albania                Tirana             2986952      28748
  Armenia                Yerevan            2968000      29800
  Angola                 Luanda             13068161     1246700
  Antarctica             None               0            14000000
  Argentina              Buenos Aires       41343201     2766890

2.1.3 Website 2: Hockey Teams

URL: https://www.scrapethissite.com/pages/forms/
Type: Pagination + Form/Query | Method: rvest + page loop + GET parameters

library(stringr)   # str_extract()
library(dplyr)     # bind_rows()

base_url_hockey <- "https://www.scrapethissite.com/pages/forms/"

# Fetch page 1 to detect the total number of pages
page1 <- read_html(paste0(base_url_hockey, "?per_page=25"))

page_numbers <- page1 %>%
  html_nodes("ul.pagination li a") %>%
  html_attr("href") %>%
  str_extract("page_num=\\d+") %>%
  str_extract("\\d+") %>%
  as.integer() %>%
  na.omit()

total_pages <- ifelse(length(page_numbers) > 0, max(page_numbers), 1)

all_teams <- list()

# Outer loop — build each page URL dynamically
for (page_num in 1:total_pages) {
  url_page  <- paste0(base_url_hockey, "?page_num=", page_num, "&per_page=25")

  # Wrap in tryCatch so a single failed page does not abort the whole run
  page_html <- tryCatch(read_html(url_page), error = function(e) NULL)
  if (is.null(page_html)) next

  team_rows <- page_html %>% html_nodes("tr.team")
  if (length(team_rows) == 0) break

  for (row in team_rows) {
    get_text <- function(node, sel) {
      n <- node %>% html_node(sel)
      if (!is.na(n)) html_text(n, trim = TRUE) else NA
    }

    all_teams[[length(all_teams) + 1]] <- list(
      team_name     = get_text(row, "td.name"),
      year          = get_text(row, "td.year"),
      wins          = get_text(row, "td.wins"),
      losses        = get_text(row, "td.losses"),
      ot_losses     = get_text(row, "td.ot-losses"),
      win_pct       = get_text(row, "td.pct"),
      goals_for     = get_text(row, "td.gf"),
      goals_against = get_text(row, "td.ga")
    )
  }

  Sys.sleep(0.3)
}

df_hockey <- bind_rows(lapply(all_teams, as.data.frame, stringsAsFactors = FALSE))

# Validate form/query feature using a keyword search
query_page <- read_html(paste0(base_url_hockey, "?q=boston&per_page=25"))
query_rows <- query_page %>% html_nodes("tr.team")

write_csv(df_hockey, "hockey_teams.csv")
Total pages detected             : 24 
Total records successfully parsed: 582 rows
Boston query results             : 21 teams found
Dataset exported to              : hockey_teams.csv
Sample — First 10 Records: Hockey Teams

  team_name               year   wins   losses   ot_losses   win_pct   goals_for   goals_against
  Boston Bruins           1990   44     24                   0.55      299         264
  Buffalo Sabres          1990   31     30                   0.388     292         278
  Calgary Flames          1990   46     26                   0.575     344         263
  Chicago Blackhawks      1990   49     23                   0.613     284         211
  Detroit Red Wings       1990   34     38                   0.425     273         298
  Edmonton Oilers         1990   37     37                   0.463     272         272
  Hartford Whalers        1990   31     38                   0.388     238         276
  Los Angeles Kings       1990   46     24                   0.575     340         254
  Minnesota North Stars   1990   27     39                   0.338     256         266
  Montreal Canadiens      1990   39     30                   0.487     273         249

2.2 Section B — Data Handling

2.2.1 Logical Approach and Methodology

Once the data is collected, each dataset is evaluated to understand its condition before moving into the cleaning stage.

  • Structural Overview: Row count, column count, and data types are checked early to catch potential type mismatch issues.

  • Integrity Check: Missing values and duplicate rows are identified and counted across both datasets.

  • Quality Issues: At least 3 data problems per website are documented as the basis for the cleaning strategy in the next section.

2.2.2 Website 1: Countries of the World

col_names_c   <- names(df_countries)
col_types_c   <- sapply(names(df_countries), function(c) class(df_countries[[c]]))
col_missing_c <- sapply(names(df_countries), function(c) sum(is.na(df_countries[[c]])))
n_dupes_c     <- sum(duplicated(df_countries))
Dataset Structure — Countries of the World: 250 rows · 4 columns · 0 duplicate rows

  Column         Type        Missing
  country_name   character   0
  capital        character   0
  population     numeric     0
  area_sq_km     numeric     0

Identified Data Quality Issues

  01  The population and area_sq_km values were read as character strings from the raw HTML and required explicit conversion to numeric (applied via as.numeric() during collection).
  02  Some country name entries contain excess whitespace or non-standard characters due to HTML encoding inconsistencies.
  03  There is no unique ID column — duplicate detection must rely on the composite key of country_name + capital (a minimal check is sketched below).
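
A short sketch of that composite-key check, reusing the df_countries columns defined above:

# No unique ID exists, so flag rows that repeat the same
# country_name + capital pair
dupe_mask <- duplicated(df_countries[, c("country_name", "capital")])
cat("Composite-key duplicates:", sum(dupe_mask), "\n")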

2.2.3 Website 2: Hockey Teams

col_names_h   <- names(df_hockey)
col_types_h   <- sapply(names(df_hockey), function(c) class(df_hockey[[c]]))
col_missing_h <- sapply(names(df_hockey), function(c) sum(is.na(df_hockey[[c]])))
n_dupes_h     <- sum(duplicated(df_hockey))
Dataset Structure — Hockey Teams: 582 rows · 8 columns · 0 duplicate rows

  Column          Type        Missing
  team_name       character   0
  year            character   0
  wins            character   0
  losses          character   0
  ot_losses       character   0
  win_pct         character   0
  goals_for       character   0
  goals_against   character   0

Identified Data Quality Issues

  01  All numeric columns (wins, losses, win_pct, etc.) are stored as strings and must be converted before any analysis.
  02  The ot_losses column holds empty strings for older seasons in which overtime losses were not recorded; these become NA once the column is converted to numeric (224 such values, imputed in Section C).
  03  Team names occasionally have double spaces or inconsistent capitalization across pagination pages (see the sketch below).
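
Plain trimming does not repair those internal double spaces; a minimal sketch of a stricter variant of the str_trim() + str_to_title() step used in Section C below:

# str_trim() strips only leading/trailing spaces; str_squish() additionally
# collapses internal runs of whitespace ("Boston  Bruins" -> "Boston Bruins")
df_hockey$team_name <- str_squish(str_to_title(df_hockey$team_name))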

2.3 Section C — Data Cleaning

2.3.1 Logical Approach and Methodology

Both datasets are cleaned programmatically using a combination of loops and IF-ELSE logic.

  • Text Standardization: str_trim() and str_to_title() are applied through a loop to clean up whitespace and fix capitalization across text columns.

  • Type Conversion: Numeric columns are cast using as.numeric() inside a single loop over column names, so there is no need to repeat the same code for each column manually.

  • Missing Value Handling: IF-ELSE determines the imputation method per column type — median for numeric columns, “Unknown” for character columns. For ot_losses in the hockey dataset specifically, missing values are filled with 0, following standard sports statistics convention.

  • Duplicate Removal: distinct() is run at the end of the process to make sure no duplicate rows remain in the final dataset.

2.3.2 Website 1: Countries of the World

# Text standardization — trim whitespace and apply proper case
text_cols_c <- c("country_name", "capital")

for (col in text_cols_c) {
  df_countries[[col]] <- str_trim(df_countries[[col]])
  df_countries[[col]] <- str_to_title(df_countries[[col]])
}

# Convert columns to numeric type
num_cols_c <- c("population", "area_sq_km")

for (col in num_cols_c) {
  if (col %in% names(df_countries)) {
    df_countries[[col]] <- suppressWarnings(as.numeric(df_countries[[col]]))
  }
}

# Impute missing values using IF-ELSE per column
cleaning_log_c <- data.frame(
  column  = character(),
  action  = character(),
  details = character(),
  stringsAsFactors = FALSE
)

for (col in names(df_countries)) {
  n_na <- sum(is.na(df_countries[[col]]))

  if (n_na > 0) {
    if (is.numeric(df_countries[[col]])) {
      # Numeric: use median to preserve distribution shape
      med_val <- median(df_countries[[col]], na.rm = TRUE)
      df_countries[[col]][is.na(df_countries[[col]])] <- med_val
      cleaning_log_c <- rbind(cleaning_log_c, data.frame(
        column  = col,
        action  = "Median Imputation",
        details = sprintf("%d NA values filled with median = %.2f", n_na, med_val)
      ))
    } else {
      # Character: fill with "Unknown" as a safe placeholder
      df_countries[[col]][is.na(df_countries[[col]])] <- "Unknown"
      cleaning_log_c <- rbind(cleaning_log_c, data.frame(
        column  = col,
        action  = "Default Fill",
        details = sprintf('%d NA values filled with "Unknown"', n_na)
      ))
    }
  } else {
    cleaning_log_c <- rbind(cleaning_log_c, data.frame(
      column  = col,
      action  = "No Action",
      details = "No missing values"
    ))
  }
}

# Remove duplicate rows
before_c <- nrow(df_countries)
df_countries <- df_countries %>% distinct()
after_c  <- nrow(df_countries)
Data Cleaning Report — Countries of the World: 250 rows before · 250 after · 0 removed

  Step 1 — Text standardization: trim + proper case on country_name, capital
  Step 2 — Type conversion: population, area_sq_km → numeric
  Step 3 — Missing value imputation using IF-ELSE
  Step 4 — Remove duplicates using distinct()

  Column         Action Taken   Details
  country_name   No Action      No missing values
  capital        No Action      No missing values
  population     No Action      No missing values
  area_sq_km     No Action      No missing values

Countries of the World — Post-Cleaning Preview

  country_name           capital            population   area_sq_km
  Andorra                Andorra La Vella   84000        468
  United Arab Emirates   Abu Dhabi          4975593      82880
  Afghanistan            Kabul              29121286     647500
  Antigua And Barbuda    St. John’s         86754        443
  Anguilla               The Valley         13254        102
  Albania                Tirana             2986952      28748
  Armenia                Yerevan            2968000      29800
  Angola                 Luanda             13068161     1246700

2.3.3 Website 2: Hockey Teams

# Text standardization
text_cols_h <- c("team_name")

for (col in text_cols_h) {
  df_hockey[[col]] <- str_trim(df_hockey[[col]])
  df_hockey[[col]] <- str_to_title(df_hockey[[col]])
}

# Convert all stat columns to numeric via loop
num_cols_h <- c("year", "wins", "losses", "ot_losses", "win_pct", "goals_for", "goals_against")

for (col in num_cols_h) {
  if (col %in% names(df_hockey)) {
    df_hockey[[col]] <- str_trim(df_hockey[[col]])
    df_hockey[[col]] <- suppressWarnings(as.numeric(df_hockey[[col]]))
  }
}

# Impute missing values using IF-ELSE
cleaning_log_h <- data.frame(
  column  = character(),
  action  = character(),
  details = character(),
  stringsAsFactors = FALSE
)

for (col in names(df_hockey)) {
  n_na <- sum(is.na(df_hockey[[col]]))

  if (n_na > 0) {
    if (is.numeric(df_hockey[[col]])) {
      # NAs in ot_losses filled with 0 — hockey stats convention for older seasons
      df_hockey[[col]][is.na(df_hockey[[col]])] <- 0
      cleaning_log_h <- rbind(cleaning_log_h, data.frame(
        column  = col,
        action  = "Zero Imputation",
        details = sprintf("%d NA values filled with 0 (hockey stats convention)", n_na)
      ))
    } else {
      df_hockey[[col]][is.na(df_hockey[[col]])] <- "Unknown"
      cleaning_log_h <- rbind(cleaning_log_h, data.frame(
        column  = col,
        action  = "Default Fill",
        details = sprintf('%d NA values filled with "Unknown"', n_na)
      ))
    }
  } else {
    cleaning_log_h <- rbind(cleaning_log_h, data.frame(
      column  = col,
      action  = "No Action",
      details = "No missing values"
    ))
  }
}

# Validity check — wins must not be negative
invalid_wins <- sum(df_hockey$wins < 0, na.rm = TRUE)
if (invalid_wins > 0) {
  df_hockey$wins[df_hockey$wins < 0] <- 0
  wins_note <- sprintf("%d negative wins values corrected to 0", invalid_wins)
} else {
  wins_note <- "All wins values are valid (>= 0)"
}

# Check year range validity
invalid_year <- sum(df_hockey$year < 1900 | df_hockey$year > 2030, na.rm = TRUE)
if (invalid_year > 0) {
  year_note <- sprintf("%d rows have year values outside valid range [1900, 2030]", invalid_year)
} else {
  year_note <- "All year values are within valid range [1900, 2030]"
}

# Remove duplicate rows
before_h <- nrow(df_hockey)
df_hockey <- df_hockey %>% distinct()
after_h  <- nrow(df_hockey)
Data Cleaning Report — Hockey Teams: 582 rows before · 582 after · 0 removed

  Step 1 — Text standardization: trim + proper case on team_name
  Step 2 — Convert 7 numeric columns via loop
  Step 3 — Missing value imputation using IF-ELSE
  Step 4 — Validity check: wins < 0 and year range
  Step 5 — Remove duplicates using distinct()

  Column          Action Taken      Details
  team_name       No Action         No missing values
  year            No Action         No missing values
  wins            No Action         No missing values
  losses          No Action         No missing values
  ot_losses       Zero Imputation   224 NA values filled with 0 (hockey stats convention)
  win_pct         No Action         No missing values
  goals_for       No Action         No missing values
  goals_against   No Action         No missing values

Validity Check Results
  Wins: All wins values are valid (>= 0)
  Year: All year values are within valid range [1900, 2030]

Hockey Teams — Post-Cleaning Preview

  team_name            year   wins   losses   ot_losses   win_pct   goals_for   goals_against
  Boston Bruins        1990   44     24       0           0.550     299         264
  Buffalo Sabres       1990   31     30       0           0.388     292         278
  Calgary Flames       1990   46     26       0           0.575     344         263
  Chicago Blackhawks   1990   49     23       0           0.613     284         211
  Detroit Red Wings    1990   34     38       0           0.425     273         298
  Edmonton Oilers      1990   37     37       0           0.463     272         272
  Hartford Whalers     1990   31     38       0           0.388     238         276
  Los Angeles Kings    1990   46     24       0           0.575     340         254

2.4 Section D — Conditional Logic

2.4.1 Logical Approach and Methodology

A new column data_status is added to each dataset using nested IF-ELSE logic, applied row by row through a for loop.

The rules are straightforward:

  • If any critical field is empty, missing, or invalid → labeled “Incomplete”
  • If all required fields pass validation → labeled “Complete”

2.4.2 Website 1: Countries of the World

df_countries$data_status <- NA_character_

for (i in seq_len(nrow(df_countries))) {
  row <- df_countries[i, ]

  has_name <- !is.na(row$country_name) && nchar(row$country_name) > 0 &&
              row$country_name != "Unknown"
  has_cap  <- !is.na(row$capital) && nchar(row$capital) > 0 &&
              row$capital != "Unknown"
  has_pop  <- !is.na(row$population) && row$population >= 0
  has_area <- !is.na(row$area_sq_km) && row$area_sq_km >= 0

  if (!has_name || !has_cap) {
    # Critical identity fields missing or defaulted
    df_countries$data_status[i] <- "Incomplete"
  } else if (!has_pop || !has_area) {
    # Numeric fields missing or negative
    df_countries$data_status[i] <- "Incomplete"
  } else {
    df_countries$data_status[i] <- "Complete"
  }
}

status_table_c <- as.data.frame(table(df_countries$data_status))
names(status_table_c) <- c("status", "count")
Conditional Logic — Data Status: Countries of the World
Total: 250 | Complete: 250 | Incomplete: 0 | Completeness rate: 100%

Condition Logic
→ If country_name or capital is empty/Unknown → Incomplete
→ If population or area_sq_km is negative/missing → Incomplete
→ Otherwise → Complete
Countries of the World — with data_status Column
country_name capital population area_sq_km data_status
Andorra Andorra La Vella 84000 468 Complete
United Arab Emirates Abu Dhabi 4975593 82880 Complete
Afghanistan Kabul 29121286 647500 Complete
Antigua And Barbuda St. John’s 86754 443 Complete
Anguilla The Valley 13254 102 Complete
Albania Tirana 2986952 28748 Complete
Armenia Yerevan 2968000 29800 Complete
Angola Luanda 13068161 1246700 Complete

2.4.3 Website 2: Hockey Teams

df_hockey$data_status <- NA_character_

for (i in seq_len(nrow(df_hockey))) {
  row <- df_hockey[i, ]

  has_name   <- !is.na(row$team_name) && nchar(row$team_name) > 0 &&
                row$team_name != "Unknown"
  has_year   <- !is.na(row$year) && row$year > 0
  has_wins   <- !is.na(row$wins)
  has_losses <- !is.na(row$losses)

  if (!has_name || !has_year) {
    # Critical identity fields missing or invalid
    df_hockey$data_status[i] <- "Incomplete"
  } else if (!has_wins || !has_losses) {
    # Core performance fields missing
    df_hockey$data_status[i] <- "Incomplete"
  } else {
    df_hockey$data_status[i] <- "Complete"
  }
}

status_table_h <- as.data.frame(table(df_hockey$data_status))
names(status_table_h) <- c("status", "count")
Conditional Logic — Data Status: Hockey Teams
Total: 582 | Complete: 582 | Incomplete: 0 | Completeness rate: 100%

Condition Logic
→ If team_name is empty/Unknown or year is invalid → Incomplete
→ If wins or losses is missing → Incomplete
→ Otherwise → Complete
Hockey Teams — with data_status Column
team_name year wins losses ot_losses win_pct goals_for goals_against data_status
Boston Bruins 1990 44 24 0 0.550 299 264 Complete
Buffalo Sabres 1990 31 30 0 0.388 292 278 Complete
Calgary Flames 1990 46 26 0 0.575 344 263 Complete
Chicago Blackhawks 1990 49 23 0 0.613 284 211 Complete
Detroit Red Wings 1990 34 38 0 0.425 273 298 Complete
Edmonton Oilers 1990 37 37 0 0.463 272 272 Complete
Hartford Whalers 1990 31 38 0 0.388 238 276 Complete
Los Angeles Kings 1990 46 24 0 0.575 340 254 Complete

2.5 Section E — Analytical Thinking

2.5.1 Logical Approach and Methodology

This section draws insights from the collected and cleaned data, while also reflecting on the scraping process itself.

  • Easiest website: Countries of the World — one request covers everything, no pagination needed, and the HTML structure is consistent throughout.

  • Most challenging website: Hockey Teams — data spans multiple pages, the page count has to be detected dynamically, and error handling with tryCatch() is necessary to keep the process stable.

  • Approach differences: A technical comparison of Static, Pagination, AJAX, and iFrame scraping methods is discussed further in the Short Report section.

  • Insights & recommendations: At least 3 insights and 2 recommendations are drawn from the collected and cleaned data.

2.5.2 Website 1: Countries of the World

# Insight 1: Countries with the largest population
top5_pop <- df_countries %>%
  arrange(desc(population)) %>%
  select(country_name, capital, population) %>%
  head(5)

# Insight 2: Countries with the largest area
top5_area <- df_countries %>%
  arrange(desc(area_sq_km)) %>%
  select(country_name, capital, area_sq_km) %>%
  head(5)

# Insight 3: Descriptive statistics for population
pop_mean <- round(mean(df_countries$population, na.rm = TRUE))
pop_med  <- median(df_countries$population, na.rm = TRUE)
pop_max  <- max(df_countries$population, na.rm = TRUE)
pop_min  <- min(df_countries$population, na.rm = TRUE)

# Population bracket distribution — passed dynamically to chart3
pop <- df_countries$population[!is.na(df_countries$population)]
bracket_lt1M    <- sum(pop < 1e6)
bracket_1_10M   <- sum(pop >= 1e6   & pop < 10e6)
bracket_10_50M  <- sum(pop >= 10e6  & pop < 50e6)
bracket_50_200M <- sum(pop >= 50e6  & pop < 200e6)
bracket_gt200M  <- sum(pop >= 200e6)
Analytical Thinking: Countries of the World

Insight 3 — Descriptive Statistics of Population
Mean: 27,445,676 | Median: 4,288,139 | Max: 1,330,044,000 | Min: 0
Insight 1 — Top 5 Most Populous
# Country Capital Population
1 China Beijing 1,330,044,000
2 India New Delhi 1,173,108,018
3 United States Washington 310,232,863
4 Indonesia Jakarta 242,968,342
5 Brazil Brasília 201,103,330
Insight 2 — Top 5 Largest by Area
# Country Capital Area
1 Russia Moscow 17,100,000 km²
2 Antarctica None 14,000,000 km²
3 Canada Ottawa 9,984,670 km²
4 United States Washington 9,629,091 km²
5 China Beijing 9,596,960 km²
Recommendations
01 Population data should be cross-validated against official sources such as the World Bank or UN, as this page is static and not updated in real-time.
02 The dataset could be enriched by adding a region or continent column to enable comparative analysis across geographic areas.
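One possible implementation of Recommendation 02, assuming the countrycode package is available; unmatched names such as Antarctica would return NA and need manual handling:

# Illustrative only; assumes the countrycode package is installed.
# Names countrycode cannot match (e.g. Antarctica) come back as NA.
library(dplyr)
library(countrycode)

df_countries <- df_countries %>%
  mutate(continent = countrycode(country_name,
                                 origin      = "country.name",
                                 destination = "continent"))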
[Interactive charts for Countries of the World: Top 5 Most Populous Countries; Top 5 Largest Countries by Area; Population Distribution by Bracket; Data Completeness (250 Complete / 0 Incomplete)]

2.5.3 Website 2: Hockey Teams

# Insight 1: Teams with the most wins in a single season
top5_wins <- df_hockey %>%
  arrange(desc(wins)) %>%
  select(team_name, year, wins, losses) %>%
  head(5)

# Insight 2: Average wins per year (top 5 years)
avg_wins_year <- df_hockey %>%
  group_by(year) %>%
  summarise(avg_wins = round(mean(wins, na.rm = TRUE), 1), .groups = "drop") %>%
  arrange(desc(avg_wins)) %>%
  head(5)

# Insight 3: Teams that appear in the most seasons
most_seasons <- df_hockey %>%
  group_by(team_name) %>%
  summarise(total_seasons = n(), .groups = "drop") %>%
  arrange(desc(total_seasons)) %>%
  head(5)
Analytical Thinking: Hockey Teams
Insight 1 — Top 5 Wins
# Team Year W L
1 Detroit Red Wings 1995 62 13
2 Detroit Red Wings 2005 58 16
3 Pittsburgh Penguins 1992 56 21
4 Detroit Red Wings 2007 54 21
5 Washington Capitals 2009 54 15
Insight 2 — Avg Wins/Year
# Year Avg W
1 2005 41
2 2006 41
3 2007 41
4 2008 41
5 2009 41
Insight 3 — Most Seasons
# Team Seasons
1 Boston Bruins 21
2 Buffalo Sabres 21
3 Calgary Flames 21
4 Chicago Blackhawks 21
5 Detroit Red Wings 21
Recommendations
01 Add a derived column win_rate = wins / (wins + losses) to make season-by-season performance comparisons fairer and independent of total games played (a minimal sketch follows this list).
02 Conduct a per-team performance trend analysis over time using a line chart to identify patterns of improvement or decline across seasons.
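A minimal sketch of Recommendation 01, using the formula stated above (overtime losses are deliberately excluded from the denominator):

# Sketch of Recommendation 01: a win rate independent of games played.
# Per the formula above, ot_losses is not part of the denominator.
library(dplyr)

df_hockey <- df_hockey %>%
  mutate(win_rate = round(wins / (wins + losses), 3))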
[Interactive charts for Hockey Teams: Top 5 Teams by Wins (Single Season); Average Wins per Year (Top 5 Years); Most Seasons Played (Top 5 Teams); Data Completeness (582 Complete / 0 Incomplete)]

2.6 Short Report

2.6.1 Scraping Method

Website 1 — Countries of the World (Static HTML)
This page loads all its content in one HTTP response with no JavaScript involved, which made it the simplest site to work with. The scraper calls read_html() once to get the full DOM, then html_nodes("div.country") to grab every country block. From there, a for loop pulls country_name, capital, population, and area_sq_km out of each node using CSS selectors. If a node is missing, an IF check swaps in a default value so the loop does not stop unexpectedly.
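Condensed into a sketch, the pattern looks roughly like this; the URL is a placeholder for the assignment link, and the h3.country-name selector is an assumption based on the structure described above:

# Condensed sketch of the static-scrape pattern described above.
library(rvest)

page  <- read_html("https://example.com/countries")   # placeholder URL
nodes <- html_nodes(page, "div.country")

country_name <- character(length(nodes))
for (i in seq_along(nodes)) {
  txt <- html_text(html_node(nodes[[i]], "h3.country-name"))  # assumed selector
  # IF guard: a missing node yields NA, so swap in a default instead of stopping
  country_name[i] <- if (is.na(txt)) "Unknown" else trimws(txt)
}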
Website 2 — Hockey Teams (Pagination & Form)
This one was more involved since the data is split across multiple pages and the site also has a search form. The scraper starts by fetching page 1 to read the total page count from ul.pagination, then loops through each page by constructing URLs in the format ?page_num=N&per_page=25. Every request is wrapped in tryCatch() so a single failed connection does not kill the whole run. The search form was also tested separately using ?q=boston to check that keyword filtering actually works.
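A condensed sketch of that flow, with a placeholder base URL standing in for the assignment link:

# Condensed sketch of the pagination pattern described above.
library(rvest)

base_url <- "https://example.com/hockey"   # placeholder for the assignment link

# Detect the last page number from the pagination nav on page 1
first <- read_html(paste0(base_url, "?page_num=1&per_page=25"))
hrefs <- html_attr(html_nodes(first, "ul.pagination a"), "href")
hrefs <- grep("page_num=", hrefs, value = TRUE)
last_page <- max(as.numeric(sub(".*page_num=(\\d+).*", "\\1", hrefs)))

# tryCatch() keeps one failed request from killing the whole run
pages <- vector("list", last_page)
for (p in seq_len(last_page)) {
  pages[[p]] <- tryCatch(
    read_html(paste0(base_url, "?page_num=", p, "&per_page=25")),
    error = function(e) NULL
  )
}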

2.6.2 Use of IF & Loop

A. IF / IF-ELSE Usage
During Scraping: Before extracting a value from a node, the code checks whether that node exists at all. If it comes back as NA, a default is used instead ("Unknown" for text fields) so the loop keeps going rather than throwing an error.
During Cleaning: An IF-ELSE block inside the column loop decides how to handle missing values: numeric columns get the median (Countries) or zero (Hockey), while character columns get "Unknown". Extra IF checks also catch problems such as negative wins or impossible year values and fix them before analysis.
Data Status (Section D): A nested IF-ELSE checks each row against a list of required fields. If anything critical is missing or invalid, the row is labelled "Incomplete". Rows that pass every check are labelled "Complete".
B. Loop Usage
HTML Element Loop: A FOR loop goes through every div.country node (Countries) and every tr.team row (Hockey), pulling field values one record at a time and appending them to vectors before building the final data frame.
Pagination Loop: For Hockey Teams, the outer loop runs from page 1 to the last detected page number, building and fetching each URL as it goes. This means the scraper will still work even if the site adds more pages later.
Column Cleaning Loop: Text cleanup and numeric casting are handled through a loop over column names rather than repeating the same lines for each column. This keeps the cleaning code short and easy to update.
Row-Level Status Loop: In Section D, a FOR loop goes through the dataset row by row, runs the IF-ELSE check on each record, and writes the result to the data_status column.

2.6.3 Challenges

Challenge 1 — Everything Comes In as a String: All values scraped from HTML arrive as raw text. Columns like population, area_sq_km, and wins had to be converted with as.numeric(). Some values also carried hidden whitespace, so str_trim() was needed first; otherwise the conversion would quietly turn those values into NA (see the sketch after this list).
Challenge 2 — Not Every Row Has the Same Fields: Some country entries on the static page were missing nodes entirely, such as span.country-capital. Without a guard like if (!is.na(...)), extracting text from a missing node would break the loop. Adding a fallback to "Unknown" fixed this.
Challenge 3 — No Clear Page Count: The Hockey Teams site does not state how many pages there are. To work around this, the scraper reads the pagination nav on page 1, pulls all page_num values from the link hrefs, and takes the maximum as the stopping point for the loop.
Challenge 4 — ot_losses Has a Lot of NAs: Overtime losses were not tracked consistently in older seasons, so the column has a high NA rate for historical records. Using the column mean would have inflated the numbers, so zero was used instead; in hockey stats, a missing OT-loss entry typically means none were recorded, not that the data is truly absent.
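A tiny reproduction of Challenge 1; the strings below are hypothetical stand-ins for scraped values, and "\u00a0" is a non-breaking space, which base R's numeric parser rejects:

# Hypothetical scraped strings; "\u00a0" is a non-breaking space
library(stringr)

raw_vals <- c("299", "\u00a0264", " 0.550 ")

as.numeric(raw_vals)            # the NBSP entry coerces to NA
as.numeric(str_trim(raw_vals))  # str_trim() strips Unicode whitespace first, so all three parse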

2.6.4 Insights

Easiest vs. Most Challenging Website
Easiest to Scrape: Countries of the World. Everything is in the first response; one read_html() call is enough, with no pagination or JavaScript needed. The DOM structure stays consistent across entries, so the same selectors work on every row.
Most Challenging: Hockey Teams. Data is spread across multiple pages, and the total page count is not stated directly, so it has to be derived from the nav links. tryCatch() was also necessary, since an unstable connection could otherwise break the entire scrape halfway through.
Scraping Approach Comparison
Static HTML: All content is in the initial response. Tools like rvest can parse it right away; one request, one parse, done. This is the simplest case.
Pagination: Content is split across pages, each needing its own GET request. A loop handles this, and the page count has to be detected dynamically rather than hard-coded.
AJAX / JavaScript: The page loads content after rendering, so the raw HTML is incomplete. Options are to find and call the underlying API directly, or use something like Selenium to let the JavaScript run before scraping.
iFrames: The actual content sits inside a nested <iframe> pointing to a separate URL. The right move is to pull that src URL and request it directly rather than scraping through the parent page (both dynamic cases are sketched after this comparison).
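Minimal sketches of both dynamic cases, with placeholder URLs (the real endpoints depend on the target site):

# Both URLs below are placeholders, not the assignment links.
library(rvest)
library(jsonlite)

# AJAX: bypass the rendered page and call the underlying JSON endpoint directly
api_data <- fromJSON("https://example.com/api/items?page=1")

# iFrame: pull the frame's src and scrape that document, not the parent page
parent    <- read_html("https://example.com/page-with-frame")
frame_src <- url_absolute(html_attr(html_node(parent, "iframe"), "src"),
                          "https://example.com/")   # resolve a relative src
frame_doc <- read_html(frame_src)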
Key Insights
Insight 1 — Population is heavily skewed: The gap between mean and median population is large, which means a handful of very populous countries pull the average far above the typical country. Most countries are comparatively small.
Insight 2 — Big land does not mean big population: Russia and Canada are among the largest countries by area, yet neither appears near the top for population. Geography, climate, and economic conditions have a much bigger influence on where people actually live.
Insight 3 — One good season does not make a dominant team: The team with the most wins in a single season is not always the strongest franchise overall. Teams that appear consistently across many seasons show that sustaining performance over time is harder than having one standout year.