WA_4_NAs_Imputation_Outliers_key

DSC_406.001

Author

Write your name here

Published

October 27, 2024

0. Introduction

First rename this file as DSC406_001_FA24_WA4_unityID

Introduction

This synthetic dataset is designed to study vision quality on a scale from 1 to 10. It includes variables that describe individuals’ demographic information, vision scores, and their ability to perceive distance. The dataset is intended to explore patterns and relationships between various features and vision quality.

Our goal here is to examine the relationship between two variables: ages and vision.

Variable Name | Class/Type | Description | Example
Person | | Name of the individual | “Andrés”
Ages | | Age of the individual | 25
Gender | | Gender of the individual (M/F) | “M”
Civil state | | Marital status (e.g., Single, Married) | “S”
Height | | Height of the individual in centimeters | 168
Profession | | Occupation of the individual | “Student”
Vision | | Vision score on a scale of 1 to 10 | 8
Distance | | Distance measured (in unspecified units) | 4.5
PercDist | | Percentage distance perceived by individual | 50

1. Prepare

  1. Load the libraries that we will use today.
# Install and load necessary libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Check if the package is installed. If not, install it.
if (!require(janitor)) {
    install.packages("janitor", dependencies = TRUE)
}
Loading required package: janitor
Warning: package 'janitor' was built under R version 4.3.3

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(janitor)

# Check if the package is installed. If not, install it.
if (!require(naniar)) {
    install.packages("naniar", dependencies = TRUE)
}
Loading required package: naniar
Warning: package 'naniar' was built under R version 4.3.3
library(naniar)

# Check if the VIM package is installed. If not, install it.
if (!require(VIM)) {
    install.packages("VIM", dependencies = TRUE)
}
Loading required package: VIM
Warning: package 'VIM' was built under R version 4.3.3
Loading required package: colorspace
Warning: package 'colorspace' was built under R version 4.3.3
Loading required package: grid
VIM is ready to use.

Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

Attaching package: 'VIM'

The following object is masked from 'package:datasets':

    sleep
library(VIM)

Read in the data

# Read the CSV from the published URL and save it to the df3 object
df3 <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKH71cmJ5WDE11pDnJNYn4AIrpZ1yrgg8oR7GRyipqiBvj4w7YnuAMsgXjTIfo0A3JIMtnZicWDrTQ/pub?gid=129124664&single=true&output=csv")
Rows: 29 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Person, Gender, Civil state, Profession
dbl (6): Ages, Height, Vision, Distance, PercDist, Salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#inspect the data with a function of your choice
df3
# A tibble: 29 × 10
   Person   Ages Gender `Civil state` Height Profession Vision Distance PercDist
   <chr>   <dbl> <chr>  <chr>          <dbl> <chr>       <dbl>    <dbl>    <dbl>
 1 Andrés     25 M      S                265 Student         2      0.1     15  
 2 Anja       29 F      S                168 Professio…     10      4.5     45  
 3 Armando    31 M      S                169 Professio…      9      4.5     50  
 4 Carlos     25 M      M                185 Professio…      8      6       75  
 5 Cristi…    23 F      <NA>             170 Student        10      3       30  
 6 Delfa      39 F      M                100 Professio…     NA      5       75  
 7 Eduardo    28 M      S                166 Professio…      8      4.5     56.2
 8 Enrique    37 <NA>   <NA>              NA Professio…     NA      6       NA  
 9 Fanny      NA F      M                 NA Student        NA     NA       33.3
10 Franci…    46 M      M                168 Professio…      8      4.5     56.2
# ℹ 19 more rows
# ℹ 1 more variable: Salary <dbl>

2. Understanding the Structure and Cleaning the Data

1. 👉 Your Turn

TASK: Try to understand the structure of the data. Use a function to inspect it.

# ADD YOUR CODE BELOW with comments
#inspect the data using str() function

str(df3)
spc_tbl_ [29 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Person     : chr [1:29] "Andrés" "Anja" "Armando" "Carlos" ...
 $ Ages       : num [1:29] 25 29 31 25 23 39 28 37 NA 46 ...
 $ Gender     : chr [1:29] "M" "F" "M" "M" ...
 $ Civil state: chr [1:29] "S" "S" "S" "M" ...
 $ Height     : num [1:29] 265 168 169 185 170 100 166 NA NA 168 ...
 $ Profession : chr [1:29] "Student" "Professional" "Professional" "Professional" ...
 $ Vision     : num [1:29] 2 10 9 8 10 NA 8 NA NA 8 ...
 $ Distance   : num [1:29] 0.1 4.5 4.5 6 3 5 4.5 6 NA 4.5 ...
 $ PercDist   : num [1:29] 15 45 50 75 30 ...
 $ Salary     : num [1:29] NA NA 33000 NA NA NA NA 60000 NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   Person = col_character(),
  ..   Ages = col_double(),
  ..   Gender = col_character(),
  ..   `Civil state` = col_character(),
  ..   Height = col_double(),
  ..   Profession = col_character(),
  ..   Vision = col_double(),
  ..   Distance = col_double(),
  ..   PercDist = col_double(),
  ..   Salary = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

2. 👉 Your Turn

TASK: Answer the guiding questions and fill in the table.

  1. What do you see, notice and wonder about?
  • {Add a sentence or two here}
  2. Fill in the Class/Type below:
Variable Name | Class/Type | Description | Example
Person | | Name of the individual | “Andrés”
Ages | | Age of the individual | 25
Gender | | Gender of the individual (M/F) | “M”
Civil state | | Marital status (e.g., Single, Married) | “S”
Height | | Height of the individual in centimeters | 168
Profession | | Occupation of the individual | “Student”
Vision | | Vision score on a scale of 1 to 10 | 8
Distance | | Distance measured (in unspecified units) | 4.5
PercDist | | Percentage distance perceived by individual | 50
  3. What is the structure of the data? (Rows/columns)
  • {Add a sentence or two here}
  4. What are the variable types?
  • {Add a sentence or two here}
  5. Are there any wrangling steps needed?
  • {Add a sentence or two here}
  6. Are there any things you will need to complete while wrangling the data?
  • {Add a sentence or two here}

Answer: The structure of the dataset reveals the following characteristics:

Rows and Columns: The dataset contains 29 rows and 10 columns.

  • There are four character variables (Person, Gender, Civil state, Profession), which are represented as chr (string).
  • There are six numerical variables (Ages, Height, Vision, Distance, PercDist, Salary), which are represented as dbl (double).
Variable Name | Class/Type | Description | Example
Person | character | Name of the individual | “Andrés”
Ages | numeric | Age of the individual | 25
Gender | character | Gender of the individual (M/F) | “M”
Civil state | character | Marital status (e.g., Single, Married) | “S”
Height | numeric | Height of the individual in centimeters | 168
Profession | character | Occupation of the individual | “Student”
Vision | numeric | Vision score on a scale of 1 to 10 | 8
Distance | numeric | Distance measured (in unspecified units) | 4.5
PercDist | numeric | Percentage distance perceived by individual | 50

Tibble Summary:

The dataset is stored in a tibble format, which is an improved version of a data frame in R that provides better printing and subsetting capabilities. Some rows contain missing values (NA), particularly for Ages, Height, Civil state, and other columns.

Attributes:

The column specification (spec) indicates that the dataset’s variables were automatically classified, with character and numeric types assigned as appropriate.

Things that may need to be addressed:

  • The data set includes missing or incomplete data points, which could impact subsequent analyses.
  • The gender variable could be converted to a factor.
  • The column names are not in a consistent format: they mix capital letters and, in the case of Civil state, contain a space.
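
As a minimal sketch of the factor conversion mentioned above, the base R example below uses a small made-up data frame (the `toy` object and its values are hypothetical, not the assignment data) to show how a character column becomes a factor with explicit levels.

```r
# Toy data (hypothetical values, not the assignment dataset)
toy <- data.frame(
  person = c("A", "B", "C", "D"),
  gender = c("M", "F", NA, "F"),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor with explicit levels
toy$gender <- factor(toy$gender, levels = c("F", "M"))

# Inspect the categories and the counts per category (NA kept visible)
levels(toy$gender)
table(toy$gender, useNA = "ifany")
```

A factor tells R that the column holds a fixed set of categories, which is what plotting and modelling functions generally expect for a variable like gender.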

Correct Column Names without renaming.

Our column names are not in a desirable format. To deal with this, we can use the clean_names() function from {janitor}.

This package has several more handy functions for cleaning data that are worth checking out. Remember you can always use ?janitor to read about the functions in the “help” pane.

# Clean names with janitor package and save it back to df3 object
df3 <- df3 %>%
  clean_names()

# Check the dataset structure
df3
# A tibble: 29 × 10
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Armando     31 M      S              169 Professio…      9      4.5      50  
 4 Carlos      25 M      M              185 Professio…      8      6        75  
 5 Cristina    23 F      <NA>           170 Student        10      3        30  
 6 Delfa       39 F      M              100 Professio…     NA      5        75  
 7 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 8 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 9 Fanny       NA F      M               NA Student        NA     NA        33.3
10 Francis…    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 19 more rows
# ℹ 1 more variable: salary <dbl>

3. 👉 Your Turn

TASK: If you haven’t done this already, use the summary() function to get a feeling for this dataset.

# ADD YOUR CODE BELOW with comments
summary(df3)
    person               ages          gender          civil_state       
 Length:29          Min.   :22.00   Length:29          Length:29         
 Class :character   1st Qu.:25.00   Class :character   Class :character  
 Mode  :character   Median :27.00   Mode  :character   Mode  :character  
                    Mean   :31.83                                        
                    3rd Qu.:37.50                                        
                    Max.   :55.00                                        
                    NA's   :5                                            
     height     profession            vision          distance     
 Min.   : 95   Length:29          Min.   : 2.000   Min.   : 0.100  
 1st Qu.:164   Class :character   1st Qu.: 7.000   1st Qu.: 1.500  
 Median :168   Mode  :character   Median : 8.500   Median : 3.750  
 Mean   :167                      Mean   : 8.045   Mean   : 4.317  
 3rd Qu.:175                      3rd Qu.:10.000   3rd Qu.: 4.500  
 Max.   :265                      Max.   :10.000   Max.   :25.000  
 NA's   :6                        NA's   :7        NA's   :5       
   perc_dist          salary      
 Min.   : 15.00   Min.   : 33000  
 1st Qu.: 20.24   1st Qu.: 46500  
 Median : 40.18   Median : 60000  
 Mean   : 45.45   Mean   : 71000  
 3rd Qu.: 57.19   3rd Qu.: 90000  
 Max.   :150.00   Max.   :120000  
 NA's   :1        NA's   :26      

Plot to visualize

We can also use the plot() function to create a quick plot with base R.

4. 👉 Your Turn

TASK:

  • add the plot() function
  • inside the parentheses, add y = dataframe$vision
  • follow it with a comma
  • add x = dataframe$ages
  • run the code

# ADD YOUR CODE BELOW with comments
# Plot Ages vs Vision using base R
plot(y = df3$vision, x = df3$ages)

I am also providing a graph in ggplot below:

# Plot Ages vs Vision using ggplot
ggplot(data = df3) +
  aes(x = ages, y = vision) +
  geom_point(size = 2) +
  scale_x_continuous(
    name = "Person's age",
    limits = c(20, 60),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_y_continuous(
    name = "Person's vision",
    limits = c(0, NA),
    breaks = seq(0, 10, 2),
    expand = expansion(mult = c(0, 0.05))
  ) +
  theme_bw()
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).

TASK What are you noticing here?

  • {Add a sentence or two here}

Answer: most people in their 20s can see quite well, while older people tend to have worse vision.

Missing Data

Understanding Missing data

Note that missing values (NA) are treated specially in R. For example, to filter for missing values you cannot write value == NA; you must write is.na(value) instead:

# Filter for rows where vision has NA
df3 %>% 
  filter(is.na(vision))
# A tibble: 7 × 10
  person   ages gender civil_state height profession   vision distance perc_dist
  <chr>   <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Delfa      39 F      M              100 Professional     NA        5      75  
2 Enrique    37 <NA>   <NA>            NA Professional     NA        6      NA  
3 Fanny      NA F      M               NA Student          NA       NA      33.3
4 Henrry     NA M      S               NA Professional     NA       NA      60  
5 Lorena     NA F      M               NA Student          NA       NA      42.9
6 Pablo      NA M      S               NA Student          NA       NA      33.3
7 Silvio     NA M      S               NA Professional     NA       NA      56.2
# ℹ 1 more variable: salary <dbl>
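
To see why value == NA does not work, here is a small base R sketch on a made-up vector (the name `value` and its contents are only illustrative). Comparing anything with NA yields NA rather than TRUE or FALSE, whereas is.na() returns a usable logical vector.

```r
# Hypothetical vector with one missing value
value <- c(3, NA, 7)

# Comparing with NA gives NA for every element, so it cannot be used to filter
eq_result <- value == NA

# is.na() returns TRUE exactly where the value is missing
isna_result <- is.na(value)

# Count the missing values
n_missing <- sum(is.na(value))

eq_result
isna_result
n_missing
```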

5. 👉 Your Turn

TASK: Filter missing values for ages

# ADD YOUR CODE BELOW with comments
# Filter for rows where ages has NA
df3 %>% 
  filter(is.na(ages))
# A tibble: 5 × 10
  person  ages gender civil_state height profession   vision distance perc_dist
  <chr>  <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Fanny     NA F      M               NA Student          NA       NA      33.3
2 Henrry    NA M      S               NA Professional     NA       NA      60  
3 Lorena    NA F      M               NA Student          NA       NA      42.9
4 Pablo     NA M      S               NA Student          NA       NA      33.3
5 Silvio    NA M      S               NA Professional     NA       NA      56.2
# ℹ 1 more variable: salary <dbl>

The complete.cases() function examines a data frame and returns a logical vector with one element per row: TRUE for rows with no missing values (complete cases) and FALSE for rows that contain at least one NA.

You can save the result to an object. In this instance, the object will be named rowswithNA.

❗The exclamation mark (!) is the logical NOT operator in R. So, !complete.cases means NOT complete cases, that is, the rows with at least one missing value.

Here we want to work with rows, so inside the square brackets put rowswithNA (the object you created in the previous step) before the comma and leave the column position after the comma empty.

# Identifying rows with missing values using complete.cases
rowswithNA <- !complete.cases(df3)
df3[rowswithNA, ]
# A tibble: 27 × 10
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Carlos      25 M      M              185 Professio…      8      6        75  
 4 Cristina    23 F      <NA>           170 Student        10      3        30  
 5 Delfa       39 F      M              100 Professio…     NA      5        75  
 6 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 7 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 8 Fanny       NA F      M               NA Student        NA     NA        33.3
 9 Francis…    46 M      M              168 Professio…      8      4.5      56.2
10 Franklin    55 M      M              165 Professio…      7      4.5      64.3
# ℹ 17 more rows
# ℹ 1 more variable: salary <dbl>
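
The same idea can be checked on a tiny made-up data frame, where it is easy to see by eye which rows are complete. This is only an illustrative sketch; `toy` and its values are hypothetical.

```r
# Hypothetical data: only the first row has no missing values
toy <- data.frame(
  ages   = c(25, NA, 31),
  vision = c(8, 9, NA),
  height = c(170, 165, 180)
)

# TRUE for rows without any NA
complete_flag <- complete.cases(toy)

# Negate with ! to flag rows that contain at least one NA
rows_with_na <- !complete_flag

# Rows before the comma, nothing after it (keep all columns)
toy[rows_with_na, ]
toy[complete_flag, ]
```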

In the tidyverse, you can combine the group_by() and summarise() functions to count NAs by group. There is no dedicated tidyverse function for counting NAs, so the counting itself is done with sum() and is.na().

# Group by profession and summarise missing vs not-missing counts for vision
df3 %>% 
  group_by(profession) %>% 
  summarise(
    n_rows = n(),
    n_NA = sum(is.na(vision)),
    n_notNA = sum(!is.na(vision))
  )
# A tibble: 2 × 4
  profession   n_rows  n_NA n_notNA
  <chr>         <int> <int>   <int>
1 Professional     18     4      14
2 Student          11     3       8

However, the {naniar} package has many functions that can help you with NAs.

# Same summary using naniar functions
df3 %>%
  group_by(profession) %>% 
  summarise(
    n_rows = n(),
    n_NA = n_miss(vision),
    n_notNA = n_complete(vision)
  )
# A tibble: 2 × 4
  profession   n_rows  n_NA n_notNA
  <chr>         <int> <int>   <int>
1 Professional     18     4      14
2 Student          11     3       8

Why use naniar? The naniar package has several advantages over base R or the tidyverse alone when it comes to handling missing data. It provides specialized functions and visualizations that make it easier to understand and explore missing data patterns.

6. 👉 Your Turn

Pick one way to look at the NAs in the variable ages.

# ADD YOUR CODE BELOW with comments
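
One possible approach, sketched here in base R on a made-up vector rather than on df3 itself (the `ages` vector below is hypothetical), is to count, locate, and express as a percentage the missing values. In the assignment you would apply the same calls to df3$ages.

```r
# Hypothetical ages vector with two missing values
ages <- c(25, 29, NA, 31, NA)

# How many values are missing?
n_missing <- sum(is.na(ages))

# What percentage of values is missing?
pct_missing <- mean(is.na(ages)) * 100

# Which positions (rows) are missing?
which_missing <- which(is.na(ages))

n_missing
pct_missing
which_missing
```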

A. Visualizing Missing Data Patterns

Using naniar, we can visualize missing values more effectively. We’ll create a bar chart showing the number of missing values for each variable.

# Bar chart showing missing values by variable
gg_miss_var(df3)

# Summary of missing values
miss_var_summary(df3)
# A tibble: 10 × 3
   variable    n_miss pct_miss
   <chr>        <int>    <num>
 1 salary          26    89.7 
 2 vision           7    24.1 
 3 height           6    20.7 
 4 ages             5    17.2 
 5 distance         5    17.2 
 6 civil_state      2     6.90
 7 gender           1     3.45
 8 perc_dist        1     3.45
 9 person           0     0   
10 profession       0     0   
df3 %>%
  group_by(profession) %>%
  summarise(n_miss_vision = n_miss(vision),
            pct_miss_vision = pct_miss(vision))
# A tibble: 2 × 3
  profession   n_miss_vision pct_miss_vision
  <chr>                <int>           <dbl>
1 Professional             4            22.2
2 Student                  3            27.3

We can also add shadow columns to track missing values in the dataset. These columns show whether a value was originally missing (NA) or not.

# Add shadow columns to mark missing data
df3_shadow <- bind_shadow(df3)
head(df3_shadow)
# A tibble: 6 × 20
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75
# ℹ 11 more variables: salary <dbl>, person_NA <fct>, ages_NA <fct>,
#   gender_NA <fct>, civil_state_NA <fct>, height_NA <fct>,
#   profession_NA <fct>, vision_NA <fct>, distance_NA <fct>,
#   perc_dist_NA <fct>, salary_NA <fct>
# Visualizing missing data patterns
aggr_plot <- aggr(df3, col=c('blue','red'), sortVars=TRUE, cex.axis=.7, ylab=c("Missing Data Bar Plot","Heatmap"))


 Variables sorted by number of missings: 
    Variable      Count
      salary 0.89655172
      vision 0.24137931
      height 0.20689655
        ages 0.17241379
    distance 0.17241379
 civil_state 0.06896552
      gender 0.03448276
   perc_dist 0.03448276
      person 0.00000000
  profession 0.00000000

5. Handling Missing Values

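A. Omit missing data in your data set. You can do this with the na.omit() function and then save the result to the omittingNA object. Note that only 2 of the 29 rows have no missing values, so this approach discards almost all of the data here.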

# Omit rows with missing values
omittingNA <- na.omit(df3)
head(omittingNA)
# A tibble: 2 × 10
  person   ages gender civil_state height profession   vision distance perc_dist
  <chr>   <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Armando    31 M      S              169 Professional      9      4.5      50  
2 Iván       29 M      S              169 Professional      8      3        37.5
# ℹ 1 more variable: salary <dbl>

B. If a variable is mostly filled with missing values, one possibility is to exclude it. Here, the variable called salary is missing in 26 of 29 rows (89.7%), so we can delete it.

df3$salary <- NULL
head(df3)
# A tibble: 6 × 9
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75

Mean and Median

Use mean for normally distributed data without outliers.

Hint: the argument na.rm = TRUE is used inside mean function to calculate the mean excluding NA values (missing values)

# Calculate the mean of 'ages', ignoring missing values
mean_ages <- mean(df3$ages, na.rm = TRUE)

mean_ages
[1] 31.83333
# Impute missing values using mean
df_mean_imputed <- df3 %>%
  mutate(ages = ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages))

Quick base R plot to check for normality

# Histogram for Ages distribution
hist(df3$ages, col = "orange", border = "purple")

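Use the median for skewed data or data with outliers. Is our data skewed? How can we tell? One clue is in the summary() output above: the mean age (31.83) is noticeably larger than the median (27), which suggests a right-skewed distribution.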

# Impute missing values using median
df_median_imputed <- df3 %>%
  mutate(ages = ifelse(is.na(ages), median(ages, na.rm = TRUE), ages))

#inspect data
df_median_imputed
# A tibble: 29 × 9
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Armando     31 M      S              169 Professio…      9      4.5      50  
 4 Carlos      25 M      M              185 Professio…      8      6        75  
 5 Cristina    23 F      <NA>           170 Student        10      3        30  
 6 Delfa       39 F      M              100 Professio…     NA      5        75  
 7 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 8 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 9 Fanny       27 F      M               NA Student        NA     NA        33.3
10 Francis…    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 19 more rows
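
To make the difference between the two strategies concrete, here is a small base R sketch on a made-up vector with one extreme value (the vector `x` is hypothetical). The mean is pulled toward the extreme value, so the mean-imputed value ends up far from the typical observation, while the median-imputed value does not.

```r
# Hypothetical ages with one extreme value and one missing value
x <- c(22, 24, 25, 27, 90, NA)

# Summary statistics computed without the NA
mean_x   <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)

# Replace the NA with the mean or with the median
x_mean_imputed   <- ifelse(is.na(x), mean_x, x)
x_median_imputed <- ifelse(is.na(x), median_x, x)

mean_x
median_x
x_mean_imputed
x_median_imputed
```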

Let’s compare the distribution of ages in each version of the data: the original, the mean-imputed, and the median-imputed.

# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"

# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)

# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = ages, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Ages: Original, Mean Imputed, and Median Imputed", y = "Ages") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7. 👉 Your Turn

Compare the mean-imputed, median-imputed, and original (unchanged) versions of height.

7a. Calculate the mean

# Calculate the mean of 'height', ignoring missing values
MeanHeight <- mean(df3$height, na.rm = TRUE)

#inspect answer
MeanHeight
[1] 166.9565

7b. Impute the Mean

# Impute missing values using mean
df_mean_imputed <- df3 %>%
  mutate(height = ifelse(is.na(height), mean(height, na.rm = TRUE), height))

7c. Impute the Median

# Impute missing values in 'height' using the median of 'height'
df_median_imputed <- df3 %>%
  mutate(height = ifelse(is.na(height), median(height, na.rm = TRUE), height))

#inspect data
df_median_imputed
# A tibble: 29 × 10
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Armando     31 M      S              169 Professio…      9      4.5      50  
 4 Carlos      25 M      M              185 Professio…      8      6        75  
 5 Cristina    23 F      <NA>           170 Student        10      3        30  
 6 Delfa       39 F      M              100 Professio…     NA      5        75  
 7 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 8 Enrique     37 <NA>   <NA>           168 Professio…     NA      6        NA  
 9 Fanny       NA F      M              168 Student        NA     NA        33.3
10 Francis…    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 19 more rows
# ℹ 1 more variable: source <chr>

7d. Create the plot for the variable height

# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"

# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)

# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = height, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Height: Original, Mean Imputed, and Median Imputed", y = "Height") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

6. Outlier Detection

Dr. Neves Faria explains that outliers are single data points that lie far from the rest of the data distribution.

In data analytics, outliers are values within a dataset that vary greatly from the others: they are either much larger or much smaller. Outliers may indicate variability in a measurement, experimental errors, or a novelty.

Outliers can cause anomalies in the results obtained during data analysis. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.

Natural outliers are those which occur “naturally” in the dataset, as opposed to being the result of an error. These naturally occurring outliers are known as novelties. For example, the average height of a giraffe is about 16 feet tall. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.

The Three Main Reasons Why Outliers Occur:

  • Data Entry Errors: Data entry errors can happen when there is an issue with the data collection process. Mistakes in data entry could be as simple as a misplaced decimal point or a typo in a value.
  • Instrument Errors: Instrument errors occur when there is a problem with the equipment that collects data.
  • Natural Variation: Natural variation is a normal occurrence in any system.

Important Considerations

  • For symmetrical distribution (not necessarily normal) and N > 30, express results as mean ± standard deviation.
  • For asymmetrical distribution, N < 30, or evidence for outliers, use median ± IQR, which are more robust.
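
The two ways of summarising a variable can be compared side by side. The base R sketch below uses made-up height values with one extreme observation (the vector `x` is hypothetical) to show how the mean and standard deviation react to an outlier while the median and IQR stay close to the bulk of the data.

```r
# Hypothetical heights in cm; 265 is an extreme value
x <- c(160, 165, 168, 170, 172, 175, 265)

# Summary for symmetrical data: mean and standard deviation
mean_x <- mean(x)
sd_x   <- sd(x)

# Robust summary: median and interquartile range
median_x <- median(x)
iqr_x    <- IQR(x)

c(mean = mean_x, sd = sd_x)
c(median = median_x, IQR = iqr_x)
```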

There are several types of visualizations that you can use depending on the dataset.

By going through the three steps of exploratory data analysis, you’ll have a much better understanding of your data, which will make it easier to choose your model, your attributes, and refine it overall.

Visualizing and Handling Outliers

We can detect outliers using boxplots and handle them appropriately, such as by removing or capping them.

# Boxplot to detect outliers in 'height'
ggplot(df3, aes(x = profession, y = height)) + 
  geom_boxplot() +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_boxplot()`).
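
A boxplot flags points that lie beyond the whiskers, which by the usual convention extend 1.5 × IQR from the quartiles. As a rough sketch of that rule in base R, using a made-up height vector (hypothetical values, and quantile() may differ slightly from the hinges a boxplot computes):

```r
# Hypothetical heights in cm, including one missing value
height <- c(95, 160, 165, 168, 170, 172, 175, 265, NA)

# Quartiles and interquartile range, ignoring the NA
q1  <- unname(quantile(height, 0.25, na.rm = TRUE))
q3  <- unname(quantile(height, 0.75, na.rm = TRUE))
iqr <- q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag values outside the fences; missing values are not flagged
is_outlier <- !is.na(height) & (height < lower | height > upper)

height[is_outlier]
```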

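# Filter for potential outliers in height
# Note: vision is scored from 1 to 10, so vision < 60 is TRUE for every
# non-missing score; its only effect is to drop rows where vision is NA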
df3 %>% 
  filter(height < 150 & vision < 60) %>% 
  print()
# A tibble: 2 × 10
  person  ages gender civil_state height profession   vision distance perc_dist
  <chr>  <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Marta     47 F      S               95 Professional      7       25      21.4
2 Nora      50 F      M              145 Professional      6        6     100  
# ℹ 1 more variable: source <chr>
# Remove outliers in height that are below a certain threshold (e.g., height < 150)
df_no_outliers <- df3 %>% 
  filter(height >= 150)

# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1      15  
2 Anja        29 F      S              168 Profession…     10      4.5      45  
3 Armando     31 M      S              169 Profession…      9      4.5      50  
4 Carlos      25 M      M              185 Profession…      8      6        75  
5 Cristina    23 F      <NA>           170 Student         10      3        30  
6 Eduardo     28 M      S              166 Profession…      8      4.5      56.2
# ℹ 1 more variable: source <chr>

Methods for Handling Outliers

  1. Removing Outliers: Simply removing the outliers from the dataset if they are errors or significantly affect the analysis.
  2. Capping/Flooring Outliers: Replace extreme values with a specified threshold (e.g., capping at the 95th percentile).
  3. Transforming Data: Apply transformations like log or square root to reduce the impact of outliers.
  4. Using Robust Statistical Methods: Use methods that are less sensitive to outliers, such as median instead of mean.

Methods for Handling Outliers

  1. Removing Outliers
# Remove outliers in 'height' below a certain threshold (here, height < 145)
df_no_outliers <- df3 %>%
  filter(height >= 145)
   
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1      15  
2 Anja        29 F      S              168 Profession…     10      4.5      45  
3 Armando     31 M      S              169 Profession…      9      4.5      50  
4 Carlos      25 M      M              185 Profession…      8      6        75  
5 Cristina    23 F      <NA>           170 Student         10      3        30  
6 Eduardo     28 M      S              166 Profession…      8      4.5      56.2
# ℹ 1 more variable: source <chr>
# Histogram for height after removing outliers
ggplot(df_no_outliers, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Height After Removing Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
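A threshold filter like the one above has two side effects that are easy to miss: rows with a missing height are dropped, and unusually large values are kept. The sketch below illustrates both on a hypothetical toy data frame. It uses base R's `subset()`, which treats an `NA` condition as "drop the row" in the same way `dplyr::filter()` does; the upper bound of 250 cm is an assumed cutoff chosen only for illustration.

```r
# Hypothetical toy data; the column name mirrors the worksheet
toy <- data.frame(
  person = c("A", "B", "C", "D", "E"),
  height = c(100, 168, NA, 185, 265),
  stringsAsFactors = FALSE
)

# Lower threshold only: drops the low value AND the NA row, keeps 265
kept_strict <- subset(toy, height >= 145)

# Keep NA rows explicitly and add an (assumed) upper bound of 250 cm
kept_lenient <- subset(toy, is.na(height) | (height >= 145 & height <= 250))

kept_strict$person
kept_lenient$person
```

Whether to keep the `NA` rows depends on what happens next: if missing heights will be imputed later, dropping them here loses those individuals.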

  2. Capping/Flooring Outliers
# Cap height values at the 95th percentile
height_95th <- quantile(df3$height, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(height = ifelse(height > height_95th, height_95th, height))

# Inspect the modified dataset
head(df_capped)
# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S             189. Student          2      0.1        15
2 Anja        29 F      S             168  Profession…     10      4.5        45
3 Armando     31 M      S             169  Profession…      9      4.5        50
4 Carlos      25 M      M             185  Profession…      8      6          75
5 Cristina    23 F      <NA>          170  Student         10      3          30
6 Delfa       39 F      M             100  Profession…     NA      5          75
# ℹ 1 more variable: source <chr>
# Histogram for height after capping outliers
ggplot(df_capped, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Height After Capping Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
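The code above caps only the upper tail. A possible way to floor the lower tail as well is sketched below with base R's `pmax()` and `pmin()`. The vector is hypothetical, and the 5th and 95th percentiles are assumed cutoffs rather than values prescribed by the worksheet.

```r
# Hypothetical toy heights in cm (not the course dataset)
heights <- c(100, 166, 168, 169, 170, 185, 265, NA)

# Assumed cutoffs: 5th and 95th percentiles
lower <- quantile(heights, 0.05, na.rm = TRUE, names = FALSE)
upper <- quantile(heights, 0.95, na.rm = TRUE, names = FALSE)

# pmax() raises values below `lower`; pmin() lowers values above `upper`.
# Missing values stay NA because na.rm defaults to FALSE.
heights_capped <- pmin(pmax(heights, lower), upper)

# Compare original and capped values side by side
data.frame(original = heights, capped = heights_capped)
```

Unlike removal, this keeps every row, so the sample size is unchanged; the cost is that the extreme values are replaced by the cutoffs themselves.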

  3. Transforming Data
# Apply a log transformation to reduce the impact of outliers
df_transformed <- df3 %>%
  mutate(height_log = log(height))

# Inspect the transformed dataset
head(df_transformed)
# A tibble: 6 × 11
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75
# ℹ 2 more variables: source <chr>, height_log <dbl>
# Histogram for height after log transformation
ggplot(df_transformed, aes(x = height_log)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Log-Transformed Height", x = "Log(Height)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
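A log transformation compresses large values more than small ones, so it tends to pull in high outliers but can push low outliers further out. The sketch below checks this on a hypothetical vector by measuring how far each extreme sits from the median in IQR units, before and after the transformation. The helper `iqr_distance()` is a made-up name for illustration. Note also that `log()` requires positive values.

```r
# Hypothetical toy heights in cm (not the course dataset)
heights <- c(100, 166, 168, 169, 170, 185, 265)

# Hypothetical helper: distance of a value from the median, in IQR units
iqr_distance <- function(x, value) abs(value - median(x)) / IQR(x)

heights_log <- log(heights)

# High extreme (265) before and after the log transformation
high_raw <- iqr_distance(heights, max(heights))
high_log <- iqr_distance(heights_log, max(heights_log))

# Low extreme (100) before and after the log transformation
low_raw <- iqr_distance(heights, min(heights))
low_log <- iqr_distance(heights_log, min(heights_log))

round(c(high_raw = high_raw, high_log = high_log,
        low_raw = low_raw, low_log = low_log), 2)
```

In this toy example the high value moves closer to the bulk of the data while the low value moves further away, which suggests checking both tails before relying on a log transformation.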

  4. Using Robust Statistical Methods
# Calculate median and IQR for height
median_height <- median(df3$height, na.rm = TRUE)
iqr_height <- IQR(df3$height, na.rm = TRUE)
   
# Print median and IQR
median_height
[1] 168
iqr_height
[1] 11

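The median and IQR describe the center and spread of height without being pulled toward extreme values. Computing them does not change the data, so the histogram below shows the raw height distribution that these statistics summarize.

# Histogram of raw height, for comparison with the median and IQR above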
ggplot(df3, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Height Using Median and IQR", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
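To see why the median and IQR are called robust, one option is to compare them with the mean and standard deviation on the same data with and without a single extreme value. The sketch below does this on hypothetical heights; the numbers are invented for illustration.

```r
# Hypothetical heights in cm without and with one extreme value
clean <- c(160, 162, 165, 166, 168, 169, 170, 172, 175, 178)
with_outlier <- c(clean, 265)

# How far does each measure of center move when the outlier is added?
mean_shift <- mean(with_outlier) - mean(clean)
median_shift <- median(with_outlier) - median(clean)

# By what factor does each measure of spread grow?
sd_ratio <- sd(with_outlier) / sd(clean)
iqr_ratio <- IQR(with_outlier) / IQR(clean)

round(c(mean_shift = mean_shift, median_shift = median_shift), 2)
round(c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio), 2)
```

In this example the mean and standard deviation react much more strongly to the added value than the median and IQR do.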

8. 👉 Your Turn

TASK:

  - Choose a method for handling outliers and apply it to the vision variable.
  - Produce a visualization.
  - Add a few sentences explaining what is happening.

# ADD YOUR CODE BELOW with comments
# Example: Capping vision values at the 95th percentile
vision_95th <- quantile(df3$vision, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(vision = ifelse(vision > vision_95th, vision_95th, vision))

# Inspect the modified dataset
head(df_capped)
# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75
# ℹ 1 more variable: source <chr>
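The task also asks for a visualization. One possible sketch is given below using base R's `hist()` on a hypothetical vector of vision scores rather than `df3`. Because vision is bounded at 10, unusual values are more likely to sit at the low end, so this example also floors at the 5th percentile; both cutoffs are assumptions made for illustration.

```r
# Hypothetical vision scores on the 1-10 scale (not the course dataset)
vision <- c(2, 8, 9, 10, 10, 8, 7, NA, 9, 10)

# Assumed cutoffs: 5th and 95th percentiles
vision_05 <- quantile(vision, 0.05, na.rm = TRUE, names = FALSE)
vision_95 <- quantile(vision, 0.95, na.rm = TRUE, names = FALSE)

# Floor low scores and cap high scores; NA stays NA
vision_capped <- pmin(pmax(vision, vision_05), vision_95)

# Count how many scores were actually changed
n_changed <- sum(vision_capped != vision, na.rm = TRUE)
n_changed

# Histogram of the adjusted scores; hist() ignores missing values
hist(vision_capped,
     breaks = seq(0.5, 10.5, by = 1),
     main = "Vision After Capping/Flooring",
     xlab = "Vision score",
     ylab = "Frequency",
     col = "lightgreen")
```

A short interpretation might note how many scores changed, whether the upper cap had any effect given the scale's maximum, and how the shape of the histogram differs from the original.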

References:

Priscila Neves Faria. 2024. EDA For Big Data: Missing Values, Box Plot Histogram and Outliers.

Paul Schmidt. 2023. “Bad Data & Outliers.” November 14, 2023. https://schmidtpaul.github.io/dsfair_quarto//ch/rbasics/baddata.html.