HOMEWORK (BTVN)

TABLE OF CONTENTS

I. LESSON 1: TAKE CARE OF RAW DATA

II. LESSON 2: DATA CLEANING

III. LESSON 3: DEVIL GAP

I. LESSON 1: TAKE CARE OF RAW DATA

1.1. Missions

  1. Install and load the packages pacman, rio, and tidyverse
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rio, tidyverse)


  2. Use rio::import() to import the data
df_raw <- rio::import("raw_health_survey.csv")


  3. Use summary() and glimpse() to get an overview of the data
summary(df_raw)
##        ID              Age            Gender            Heart_Rate    
##  Min.   :  1.00   Min.   :  0.00   Length:250         Min.   : 50.00  
##  1st Qu.: 63.25   1st Qu.: 37.25   Class :character   1st Qu.: 68.50  
##  Median :125.50   Median : 46.00   Mode  :character   Median : 76.00  
##  Mean   :125.50   Mean   : 56.60                      Mean   : 75.86  
##  3rd Qu.:187.75   3rd Qu.: 53.00                      3rd Qu.: 82.50  
##  Max.   :250.00   Max.   :999.00                      Max.   :106.00  
##                                                       NA's   :15      
##      Weight          Income         
##  Min.   : 17.70   Length:250        
##  1st Qu.: 54.70   Class :character  
##  Median : 65.35   Mode  :character  
##  Mean   : 66.04                     
##  3rd Qu.: 76.58                     
##  Max.   :103.20                     
##  NA's   :20
glimpse(df_raw)
## Rows: 250
## Columns: 6
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Age        <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 999, 42, 42, 14…
## $ Gender     <chr> "nam", "nu", "1", "0", "nam", "Female", "F", "0", "M", "nu"…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 81, 75, 68, 78,…
## $ Weight     <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income     <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…
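
summary() reports only length, class, and mode for character columns, so inconsistent category codes are easier to see in a frequency table, and implausible numeric values are easier to see with an explicit range check. The sketch below is a minimal illustration on a made-up data frame, `df_toy`, which only imitates a few values visible in the glimpse() output above. The plausible age range of 0 to 120 is an assumption and is not given in the assignment.

```r
# Toy data for illustration only (not the real survey file)
df_toy <- data.frame(
  Age    = c(51, 32, 999, 15, 46),
  Gender = c("nam", "nu", "1", "Female", "M"),
  stringsAsFactors = FALSE
)

# Frequency table of the raw Gender codes, including NA if present
gender_counts <- table(df_toy$Gender, useNA = "ifany")
print(gender_counts)

# Rows whose Age falls outside the assumed plausible range of 0 to 120
out_of_range <- !is.na(df_toy$Age) & (df_toy$Age < 0 | df_toy$Age > 120)
implausible_age <- df_toy[out_of_range, ]
print(implausible_age)
```

On the real data, the same two calls could be run on df_raw$Gender and df_raw$Age.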


1.2. My comments

a. Question: From the output of summary(), which outliers can you find that are biologically implausible? In the Gender variable, what serious mistake did the person who entered the data make?


b. Answer:

II. LESSON 2: DATA CLEANING

2.1. Missions

Use dplyr syntax (mutate, filter, ifelse, case_when) to clean the data. Replace implausible age values with NA. Standardize the Gender variable into a single format: "Male" and "Female". Save everything into a new dataset called df_clean.

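library(dplyr)

df_clean <- df_raw %>%
  mutate(
    # Map every observed spelling or code onto one label.
    # Assumes "1" means male and "0" means female.
    Gender = case_when(
      Gender %in% c("nam", "Nam", "NAM", "M", "m", "Male", "male", "1") ~ "Male",
      Gender %in% c("nu", "Nu", "NU", "nữ", "Nữ", "F", "f", "Female", "female", "0") ~ "Female",
      TRUE ~ NA_character_  # anything unrecognized becomes NA
    )
  ) %>%
  # Keep rows with Age <= 100 or missing Age.
  # Note: this drops rows with implausible ages instead of setting Age to NA.
  filter(Age <= 100 | is.na(Age))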
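
The task asks for implausible ages to be replaced with NA, while filter() removes the whole row. As a minimal sketch of the replace-with-NA variant, the example below uses mutate() with ifelse() on a made-up data frame, `df_toy` (hypothetical, not the survey file), and reuses the cutoff of 100 from the code above.

```r
library(dplyr)

# Toy data for illustration only (not the real survey file)
df_toy <- data.frame(
  ID  = 1:5,
  Age = c(51L, 32L, 999L, 15L, NA)
)

# Keep every row; overwrite ages above the cutoff of 100 with NA
df_toy_clean <- df_toy %>%
  mutate(Age = ifelse(Age > 100, NA_integer_, Age))

nrow(df_toy_clean)            # 5: no rows are dropped
sum(is.na(df_toy_clean$Age))  # 2: the original NA plus the former 999
```

With this variant the other measurements of the affected respondents (Heart_Rate, Weight, Income) stay in the dataset, and only the invalid age is treated as missing.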

2.2. My comments

a. Question: Compare the dataset before and after cleaning. What do you think about the role of data coding?


b. Answer:

summary(df_clean)
##        ID             Age           Gender            Heart_Rate    
##  Min.   :  1.0   Min.   : 0.00   Length:247         Min.   : 50.00  
##  1st Qu.: 63.5   1st Qu.:37.00   Class :character   1st Qu.: 68.00  
##  Median :126.0   Median :45.00   Mode  :character   Median : 76.00  
##  Mean   :125.8   Mean   :45.16                      Mean   : 75.82  
##  3rd Qu.:187.5   3rd Qu.:53.00                      3rd Qu.: 83.00  
##  Max.   :250.0   Max.   :77.00                      Max.   :106.00  
##                                                     NA's   :14      
##      Weight          Income         
##  Min.   : 17.70   Length:247        
##  1st Qu.: 54.70   Class :character  
##  Median : 65.35   Mode  :character  
##  Mean   : 66.03                     
##  3rd Qu.: 76.53                     
##  Max.   :103.20                     
##  NA's   :19
glimpse(df_clean)
## Rows: 247
## Columns: 6
## $ ID         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, …
## $ Age        <int> 51, 32, 47, 44, 37, 15, 36, 33, 46, 39, 40, 42, 42, 14, 61,…
## $ Gender     <chr> "Male", "Female", "Male", "Female", "Male", "Female", "Fema…
## $ Heart_Rate <int> 73, 80, 61, 71, 83, 76, 78, 62, 80, 70, 58, 75, 68, 78, 56,…
## $ Weight     <dbl> 103.2, NA, 74.3, NA, 75.8, NA, NA, NA, 57.3, NA, NA, NA, NA…
## $ Income     <chr> "High", "Medium", "High", NA, "High", "Medium", "Medium", "…
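
The two summary() outputs can be compared by eye (for example, 250 rows before versus 247 after, and a mean Age of 56.60 versus 45.16), but a small side-by-side table makes the differences explicit. The sketch below shows one way to build it in base R. The data frames `before` and `after` are made-up stand-ins for df_raw and df_clean, and the chosen columns are only an example.

```r
# Toy "before" and "after" data, for illustration only
before <- data.frame(
  Age    = c(51, 999, 46, 15),
  Weight = c(103.2, NA, 57.3, NA)
)
after <- before[before$Age <= 100, ]  # mimics dropping implausible ages

# Side-by-side row counts, missing weights, and mean age
comparison <- data.frame(
  dataset   = c("before", "after"),
  rows      = c(nrow(before), nrow(after)),
  na_weight = c(sum(is.na(before$Weight)), sum(is.na(after$Weight))),
  mean_age  = c(mean(before$Age), mean(after$Age))
)
print(comparison)
```

A single miscoded value such as 999 is enough to shift the mean noticeably, which is why the comparison includes it.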

III. LESSON 3: DEVIL GAP

3.1. What is missing data?
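
In R, a missing value is represented by NA. summary() reports the NA count for numeric columns such as Heart_Rate and Weight, but not for character columns such as Income, so it helps to count missing values directly. The sketch below is a minimal base R example on a made-up data frame, `df_toy` (hypothetical values, not the survey file).

```r
# Toy data for illustration only (not the real survey file)
df_toy <- data.frame(
  Heart_Rate = c(73, NA, 61, 71),
  Weight     = c(103.2, NA, 74.3, NA),
  Income     = c("High", "Medium", NA, "High"),
  stringsAsFactors = FALSE
)

# Number and share of missing values in each column
na_count <- colSums(is.na(df_toy))
na_share <- colMeans(is.na(df_toy))
print(data.frame(na_count, na_share))
```

The same two lines applied to df_clean would give the missing-value count and proportion for every variable at once.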

3.2. My comments

a. Question: Based on the literature from Lecture 2, apply the definitions and classify the 3 variables above into the 3 missing-data mechanisms. Explain your reasoning.


b. Answer: