Homework (BTVN)

I. LESSON 1: TAKE CARE OF RAW DATA

1.1. Missions
1.2. My comments

II. LESSON 2: DATA CLEANING

2.1. Missions
2.2. My comments

III. LESSON 3: DEVIL GAP

3.1. What is missing data?
3.2. My comments

I. LESSON 1: TAKE CARE OF RAW DATA

1.1. Missions

Install and load the packages pacman, rio, and tidyverse

if (!require("pacman")) install.packages("pacman")
pacman::p_load(rio, tidyverse)

Use rio::import() to import data

df_raw <- rio::import("raw_health_survey.csv")

Use summary() and glimpse() to get an overview of the data

summary(df_raw)

##        ID              Age            Gender            Heart_Rate    
##  Min.   :  1.00   Min.   :  0.00   Length:250         Min.   : 50.00  
##  1st Qu.: 63.25   1st Qu.: 37.25   Class :character   1st Qu.: 68.50  
##  Median :125.50   Median : 46.00   Mode  :character   Median : 76.00  
##  Mean   :125.50   Mean   : 56.60                      Mean   : 75.86  
##  3rd Qu.:187.75   3rd Qu.: 53.00                      3rd Qu.: 82.50  
##  Max.   :250.00   Max.   :999.00                      Max.   :106.00  
##                                                       NA's   :15      
##      Weight          Income         
##  Min.   : 17.70   Length:250        
##  1st Qu.: 54.70   Class :character  
##  Median : 65.35   Mode  :character  
##  Mean   : 66.04                     
##  3rd Qu.: 76.58                     
##  Max.   :103.20                     
##  NA's   :20

glimpse(df_raw)

## Rows: 250
## Columns: 6
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Age        <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 999, 42, 42, 14…
## $ Gender     <chr> "nam", "nu", "1", "0", "nam", "Female", "F", "0", "M", "nu"…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 81, 75, 68, 78,…
## $ Weight     <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income     <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…

1.2. My comments

a. Question: From the summary() output, which outliers are biologically implausible? In the Gender variable, what disaster did the person who entered the data create?

b. Answer:

From the summary() output, several biologically implausible values stand out:
- Age: the maximum is 999, most likely a placeholder code or typing error, and the minimum is 0. The extreme values also pull the mean (56.60) well above the median (46.00).
- Weight: the minimum is 17.7, which is implausibly low for an adult respondent.
The Gender variable has a serious data-entry problem:
- It mixes inconsistent codes such as "nam", "nu", "Male", "Female", "M", "F", "1", and "0".
- This reflects a lack of standardization: Vietnamese and English labels, abbreviations, and numeric codes all appear in the same column.
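The checks described above can be sketched in a few lines of base R. This is only an illustration: `toy_raw` is a hypothetical stand-in for `df_raw`, and the 1 to 100 plausibility window for Age is an assumption, not a rule from the lesson.

```r
# Toy stand-in for df_raw; the values are illustrative, not the survey data
toy_raw <- data.frame(
  Age    = c(51, 32, 0, 999, 44),
  Weight = c(103.2, NA, 17.7, 65.0, 74.3),
  Gender = c("nam", "nu", "1", "Female", "M"),
  stringsAsFactors = FALSE
)

# Frequency of every distinct Gender code (NA would be listed too)
gender_counts <- table(toy_raw$Gender, useNA = "ifany")
print(gender_counts)

# Rows whose Age falls outside the assumed plausibility window (1 to 100)
implausible_age <- toy_raw[!is.na(toy_raw$Age) &
                             (toy_raw$Age < 1 | toy_raw$Age > 100), ]
print(implausible_age)
```

A frequency table like this exposes mixed coding at a glance, which summary() cannot do for a character column because it reports only the length and class.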

II. LESSON 2: DATA CLEANING

2.1. Missions

Use dplyr syntax (mutate, filter, ifelse, case_when) to clean the data. Replace implausible age values with NA. Standardize the Gender variable into a single format: "Male" and "Female". Save everything into a new dataset called df_clean.

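library(dplyr)

df_clean <- df_raw %>%
  mutate(
    # Map every spelling, abbreviation, and numeric code to a single label.
    # Assumes "1" means Male and "0" means Female.
    Gender = case_when(
      Gender %in% c("nam", "Nam", "NAM", "M", "m", "Male", "male", "1") ~ "Male",
      Gender %in% c("nu", "Nu", "NU", "nữ", "Nữ", "F", "f", "Female", "female", "0") ~ "Female",
      TRUE ~ NA_character_  # any unrecognized code becomes NA
    )
  ) %>%
  # Note: this drops whole rows with Age > 100 rather than setting Age to NA,
  # and it keeps rows where Age is 0 or missing.
  filter(Age <= 100 | is.na(Age))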
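The mission asks for implausible ages to be replaced with NA, while filter() removes the whole row. A minimal sketch of the replace-with-NA variant follows. `toy_raw` is hypothetical data, and the 1 to 100 window is an assumed plausibility range.

```r
library(dplyr)

# Toy stand-in for df_raw; values are illustrative only
toy_raw <- data.frame(
  ID  = 1:5,
  Age = c(51L, 0L, 999L, 44L, NA)
)

# Keep every row and blank out only the implausible Age values.
# The 1 to 100 window is an assumption, not a rule taken from the lesson.
toy_clean <- toy_raw %>%
  mutate(Age = if_else(Age >= 1 & Age <= 100, Age, NA_integer_))

print(toy_clean)
```

With this approach the other measurements in an affected row (Heart_Rate, Weight, Income) stay available for analysis, at the cost of one more missing value in Age.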

2.2. My comments

a. Question: Compare the dataset before and after cleaning. What does the comparison tell you about the role of data coding?

b. Answer:

summary(df_clean)

##        ID             Age           Gender            Heart_Rate    
##  Min.   :  1.0   Min.   : 0.00   Length:247         Min.   : 50.00  
##  1st Qu.: 63.5   1st Qu.:37.00   Class :character   1st Qu.: 68.00  
##  Median :126.0   Median :45.00   Mode  :character   Median : 76.00  
##  Mean   :125.8   Mean   :45.16                      Mean   : 75.82  
##  3rd Qu.:187.5   3rd Qu.:53.00                      3rd Qu.: 83.00  
##  Max.   :250.0   Max.   :77.00                      Max.   :106.00  
##                                                     NA's   :14      
##      Weight          Income         
##  Min.   : 17.70   Length:247        
##  1st Qu.: 54.70   Class :character  
##  Median : 65.35   Mode  :character  
##  Mean   : 66.03                     
##  3rd Qu.: 76.53                     
##  Max.   :103.20                     
##  NA's   :19

glimpse(df_clean)

## Rows: 247
## Columns: 6
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ Age        <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 42, 42, 14, 61,…
## $ Gender     <chr> "Male", "Female", "Male", "Female", "Male", "Female", "Fema…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 75, 68, 78, 56,…
## $ Weight     <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income     <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…

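After cleaning, the dataset is more consistent and reliable.
- The three rows with Age above 100 were dropped (250 rows before, 247 after), so the maximum Age fell from 999 to 77 and the mean Age from 56.60 to 45.16. The Gender variable now holds only the labels "Male" and "Female", with any unrecognized code set to NA, which makes the data easier to analyze correctly.
- Some problems remain: the minimum Age is still 0 and the minimum Weight is still 17.70, because the filter only removes ages above 100.
- This shows that data coding is very important: consistent codes improve data quality, reduce errors, and help produce more accurate results.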
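One way to confirm that a recoding step behaved as intended is to cross-tabulate the raw codes against the cleaned labels and list any raw code that fell through to NA. The sketch below uses hypothetical data. The `Gender_raw` column and the "unknown" code are invented for illustration, and the mapping is a shortened version of the one used in the cleaning step.

```r
library(dplyr)

# Toy stand-in for df_raw$Gender; "unknown" is a hypothetical unmapped code
toy <- data.frame(
  Gender_raw = c("nam", "nu", "1", "0", "F", "M", "unknown"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    Gender = case_when(
      Gender_raw %in% c("nam", "M", "Male", "1")  ~ "Male",
      Gender_raw %in% c("nu", "F", "Female", "0") ~ "Female",
      TRUE ~ NA_character_
    )
  )

# Cross-tabulate raw codes against cleaned labels
audit <- table(toy$Gender_raw, toy$Gender, useNA = "ifany")
print(audit)

# Raw codes that were present but ended up as NA after recoding
unmapped <- toy %>%
  filter(is.na(Gender) & !is.na(Gender_raw)) %>%
  distinct(Gender_raw) %>%
  pull(Gender_raw)
print(unmapped)
```

Keeping the raw column next to the cleaned one makes the audit possible. Overwriting Gender in place, as the cleaning step does, is more compact but hides which codes were lost.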

III. LESSON 3: DEVIL GAP

3.1. What is missing data?

Missing data refers to the absence of values for one or more variables in a dataset.
It is commonly classified into three types:
- MCAR (Missing Completely at Random): the missingness is entirely random and unrelated to any observed or unobserved variable.
- MAR (Missing at Random): the missingness is related to other observed variables, but not to the missing value itself.
- MNAR (Missing Not at Random): the missingness is related to the missing value itself or to unobserved factors.
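The three definitions above can be made concrete with a small simulation. This is a sketch on invented data: the variables `age` and `income`, the cut-offs, and the missingness probabilities are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
n <- 1000

# Hypothetical complete data, generated independently of each other
age    <- sample(20:70, n, replace = TRUE)
income <- round(rnorm(n, mean = 50, sd = 15))

# MCAR: every value has the same 20% chance of being missing
miss_mcar <- runif(n) < 0.20

# MAR: the chance of missingness depends only on the observed variable age
miss_mar <- runif(n) < ifelse(age >= 50, 0.40, 0.05)

# MNAR: the chance of missingness depends on the income value itself
miss_mnar <- runif(n) < ifelse(income > 60, 0.60, 0.05)

income_mcar <- ifelse(miss_mcar, NA, income)
income_mar  <- ifelse(miss_mar,  NA, income)
income_mnar <- ifelse(miss_mnar, NA, income)

# Mean of the observed values under each mechanism vs. the complete data
means <- c(
  complete = mean(income),
  mcar     = mean(income_mcar, na.rm = TRUE),
  mar      = mean(income_mar,  na.rm = TRUE),
  mnar     = mean(income_mnar, na.rm = TRUE)
)
print(round(means, 1))
```

In this setup only the MNAR version is expected to shift the observed mean noticeably, because high incomes are removed preferentially. Income was generated independently of age here, so the MAR version should stay close to the complete-data mean. That would not hold if income and age were correlated.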
In research, missing data reduce the effective sample size, lower statistical power, and can introduce bias, especially when the missingness is not completely random (MAR or MNAR). As a result, they may affect the validity and reliability of the study findings.
Common ways to handle missing data include deleting incomplete cases when the proportion is small and the data are plausibly MCAR, replacing missing values with a summary measure such as the mean or median in simple situations (at the cost of understating variability), or using more advanced methods such as multiple imputation for more accurate analysis.
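The two simplest options, deleting incomplete cases and imputing the median, can be sketched in base R as follows. The data frame `toy` is hypothetical and only borrows the column names Heart_Rate and Weight from the survey.

```r
# Toy data with gaps; values are illustrative only
toy <- data.frame(
  Heart_Rate = c(73, 80, NA, 71, 83),
  Weight     = c(103.2, NA, 74.3, NA, 75.8)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Option 1: listwise deletion keeps only rows with no missing value
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)

# Option 2: median imputation fills the gaps in Weight with the observed median
toy_imputed <- toy
weight_median <- median(toy$Weight, na.rm = TRUE)
toy_imputed$Weight[is.na(toy_imputed$Weight)] <- weight_median
print(toy_imputed)
```

Listwise deletion here shrinks five rows to two, which shows how quickly sample size erodes when several columns have gaps. Multiple imputation needs a dedicated package and is not shown.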

3.2. My comments

a. Question: Based on the material from Lecture 2, apply the definitions and assign exactly one of the three mechanisms to each of the three variables with missing values above. Explain your reasoning.

b. Answer:

The mechanism cannot be proven from the observed data alone, so the assignments below are the most plausible reading for each variable:
- Heart_Rate → MCAR (Missing Completely at Random): Missing heart rate values are most likely due to random measurement or recording problems, such as device failure or accidental omission during data entry. In this case, the missingness is unrelated to the participant’s actual heart rate or other variables.
- Weight → MAR (Missing at Random): Missing weight values may be related to other observed characteristics, such as age, gender, or survey conditions. For example, some groups may have been less likely to have their weight measured. Here, the missingness is not assumed to depend directly on the true weight itself, but on other observed variables.
- Income → MNAR (Missing Not at Random): Missing income is most plausibly MNAR because income is a sensitive variable, and participants may refuse to report it because of their actual income level. In this case, the missingness may depend on the missing value itself, which is the defining feature of MNAR.
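The assignments above are plausibility arguments, but part of them can be probed with data. A minimal sketch, on hypothetical values, compares the rate of missing Weight across the levels of an observed variable such as Gender:

```r
# Toy data; values are invented to illustrate the check
toy <- data.frame(
  Gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  Weight = c(70, 82, 65, 77, NA, NA, 58, NA)
)

# Indicator for a missing Weight value
toy$weight_missing <- is.na(toy$Weight)

# Share of missing Weight within each Gender group
miss_by_gender <- tapply(toy$weight_missing, toy$Gender, mean)
print(miss_by_gender)
```

A clear difference between groups is evidence against MCAR and is consistent with MAR. It cannot separate MAR from MNAR, because that distinction depends on the values that were never observed.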
