Homework
I. LESSON 1: TAKE CARE OF RAW DATA
Load pacman, rio, and tidyverse:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rio, tidyverse)
Use rio::import() to import the data:
df_raw <- rio::import("raw_health_survey.csv")
Use summary() and glimpse() to get an overview:
summary(df_raw)
##        ID              Age              Gender          Heart_Rate
##  Min.   :  1.00   Min.   :  0.00   Length:250         Min.   : 50.00
##  1st Qu.: 63.25   1st Qu.: 37.25   Class :character   1st Qu.: 68.50
##  Median :125.50   Median : 46.00   Mode  :character   Median : 76.00
##  Mean   :125.50   Mean   : 56.60                      Mean   : 75.86
##  3rd Qu.:187.75   3rd Qu.: 53.00                      3rd Qu.: 82.50
##  Max.   :250.00   Max.   :999.00                      Max.   :106.00
##                                                       NA's   :15
##      Weight            Income
##  Min.   : 17.70   Length:250
##  1st Qu.: 54.70   Class :character
##  Median : 65.35   Mode  :character
##  Mean   : 66.04
##  3rd Qu.: 76.58
##  Max.   :103.20
##  NA's   :20
glimpse(df_raw)
## Rows: 250
## Columns: 6
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Age <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 999, 42, 42, 14…
## $ Gender <chr> "nam", "nu", "1", "0", "nam", "Female", "F", "0", "M", "nu"…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 81, 75, 68, 78,…
## $ Weight <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…
a. Question: From the summary() output, which outliers are biologically implausible? In the Gender variable, what mess did the person who entered the data make?
b. Answer:
From the summary output, several biologically implausible or inconsistent values can be identified:
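Age: The maximum value is 999, which is impossible for a human and is probably a placeholder code, and the minimum value is 0.
Weight: The minimum value is 17.7, which is very low for a sample whose median age is 46.
For the Gender variable, there is a serious data entry issue, visible in the glimpse() output rather than in summary():
The variable mixes formats such as “nam”, “nu”, “Male”, “Female”, “M”, “F”, “1”, “0”.
This reflects a lack of standardization: Vietnamese and English labels (“nam” means male; “nu” is the unaccented form of “nữ”, female), abbreviations, and numeric codes are all used for the same two categories.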
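Because summary() reports only the length and class of a character column, a frequency table is a quicker way to see every distinct Gender code. The sketch below uses a small made-up data frame, toy_raw (a hypothetical stand-in, not the survey file), in place of df_raw:

```r
library(dplyr)

# Hypothetical toy data standing in for df_raw; values are illustrative only.
toy_raw <- data.frame(
  ID     = 1:8,
  Gender = c("nam", "nu", "1", "0", "nam", "Female", "F", "M")
)

# One row per distinct code, most frequent first.
gender_counts <- toy_raw %>% count(Gender, sort = TRUE)
print(gender_counts)
```

Running the same count() call on df_raw would list each spelling that the cleaning step has to handle.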
Use dplyr syntax (mutate, filter, ifelse, case_when) to clean the data. Replace implausible age values with NA. Standardize the Gender variable into a single format: "Male" and "Female". Save everything into a new dataset called df_clean.
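library(dplyr)
df_clean <- df_raw %>%
  mutate(
    # Map every observed spelling or code to one of two labels.
    # "1" is treated as male and "0" as female.
    Gender = case_when(
      Gender %in% c("nam", "Nam", "NAM", "M", "m", "Male", "male", "1") ~ "Male",
      Gender %in% c("nu", "Nu", "NU", "nữ", "Nữ", "F", "f", "Female", "female", "0") ~ "Female",
      TRUE ~ NA_character_  # anything unrecognized becomes NA
    )
  ) %>%
  # Drop rows with Age above 100; keep rows where Age is missing.
  filter(Age <= 100 | is.na(Age))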
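The task statement asks for implausible ages to be replaced with NA, whereas the pipeline above drops those rows. A minimal sketch of the NA-replacement variant follows, run on a hypothetical toy data frame rather than the survey file. The plausibility window (above 0, at most 100) is an assumption chosen for illustration, not a threshold given in the exercise.

```r
library(dplyr)

# Hypothetical toy data; not taken from raw_health_survey.csv.
toy_raw <- data.frame(
  ID     = 1:6,
  Age    = c(51L, 0L, 999L, 44L, NA, 15L),
  Gender = c("nam", "nu", "1", "0", "Female", "unknown")
)

toy_clean <- toy_raw %>%
  mutate(
    # Assumed window: ages above 0 and at most 100 are kept, others become NA.
    # Rows are not removed, so other variables in those rows stay usable.
    Age = if_else(Age > 0 & Age <= 100, Age, NA_integer_),
    Gender = case_when(
      Gender %in% c("nam", "M", "Male", "1")  ~ "Male",
      Gender %in% c("nu", "F", "Female", "0") ~ "Female",
      TRUE ~ NA_character_
    )
  )

print(toy_clean)
```

With this variant the row count stays the same and the implausible values show up as additional NAs in Age, so the heart rate and weight recorded in those rows are not lost.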
a. Question: Compare the dataset before and after cleaning. What do you think about the role of data coding?
b. Answer:
summary(df_clean)
##       ID              Age             Gender          Heart_Rate
##  Min.   :  1.0   Min.   : 0.00   Length:247         Min.   : 50.00
##  1st Qu.: 63.5   1st Qu.:37.00   Class :character   1st Qu.: 68.00
##  Median :126.0   Median :45.00   Mode  :character   Median : 76.00
##  Mean   :125.8   Mean   :45.16                      Mean   : 75.82
##  3rd Qu.:187.5   3rd Qu.:53.00                      3rd Qu.: 83.00
##  Max.   :250.0   Max.   :77.00                      Max.   :106.00
##                                                     NA's   :14
##      Weight            Income
##  Min.   : 17.70   Length:247
##  1st Qu.: 54.70   Class :character
##  Median : 65.35   Mode  :character
##  Mean   : 66.03
##  3rd Qu.: 76.53
##  Max.   :103.20
##  NA's   :19
glimpse(df_clean)
## Rows: 247
## Columns: 6
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ Age <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 42, 42, 14, 61,…
## $ Gender <chr> "Male", "Female", "Male", "Female", "Male", "Female", "Fema…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 75, 68, 78, 56,…
## $ Weight <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…
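After cleaning, the dataset is more consistent and reliable.
Rows with an implausible Age above 100 were removed (250 rows became 247, and the maximum Age fell from 999 to 77), and the Gender variable was standardized into only “Male” and “Female,” which makes the data easier to analyze correctly. The minimum Age of 0 and the minimum Weight of 17.7 are unchanged, because the filter only applies an upper bound to Age.
This shows that data coding is very important: a single consistent coding scheme improves data quality, reduces errors, and helps produce more accurate results. For example, the mean Age drops from 56.60 to 45.16 once the rows with Age above 100 are excluded.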
a. Question: Based on the literature from Lecture 2, apply the definitions and assign exactly one of the three missing-data mechanisms to each of the three variables above that have missing values. Explain your reasoning.
b. Answer:
Based on the dataset, the three missing-data mechanisms can be most reasonably applied as follows:
Heart_Rate → MCAR (Missing Completely at Random): Missing heart rate values are most likely due to random measurement or recording problems, such as device failure or accidental omission during data entry. In this case, the missingness is unrelated to the participant’s actual heart rate or other variables.
Weight → MAR (Missing at Random): Missing weight values may be related to other observed characteristics, such as age, gender, or survey conditions. For example, some groups may have been less likely to have their weight measured. Here, the missingness is not assumed to depend directly on the true weight itself, but on other observed variables.
Income → MNAR (Missing Not at Random): Missing income is most plausibly MNAR because income is a sensitive variable, and participants may refuse to report it because of their actual income level. In this case, the missingness may depend on the missing value itself, which is the defining feature of MNAR.
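These classifications are arguments from context rather than results computed from the data. As a rough exploratory sketch, the share of missing values per column, together with a comparison of an observed variable between rows with and without a missing Weight, can hint at a MAR-type pattern. It cannot separate MNAR from the other mechanisms, because that depends on the unobserved values themselves. The data frame toy below is hypothetical and its values are made up for illustration.

```r
# Hypothetical toy data; values are made up for illustration.
toy <- data.frame(
  Age        = c(25, 30, 35, 40, 60, 65, 70, 75),
  Heart_Rate = c(70, NA, 82, 76, 68, 90, NA, 74),
  Weight     = c(60.5, 72.0, 55.3, 80.1, NA, NA, NA, 66.2)
)

# Proportion of missing values in each column.
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Mean Age among rows where Weight is observed (FALSE) vs. missing (TRUE).
# A clear gap suggests missingness in Weight is related to Age.
age_by_weight_missing <- tapply(toy$Age, is.na(toy$Weight), mean)
print(age_by_weight_missing)
```

The same two calls could be applied to df_clean, replacing Age with any other observed variable of interest.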