WA_4_NAs_Imputation_Outliers_student

DSC_406.001

Author

August Pallesen

Published

November 6, 2024

0. Introduction

First rename this file as DSC406_001_FA24_WA4_unityID

Introduction

This synthetic data set is designed to study vision quality on a scale from 1 to 10. It includes variables that describe individuals’ demographic information, vision scores, and their ability to perceive distance. The data set is intended to explore patterns and relationships between various features and vision quality.

Our goal here is to look at the relationship between two variables: ages and vision.

Variable Name	Description	Example
Person	Name of the individual	“Andrés”
Ages	Age of the individual	25
Gender	Gender of the individual (M/F)	“M”
Civil state	Marital status (e.g., Single, Married)	“S”
Height	Height of the individual in centimeters	168
Profession	Occupation of the individual	“Student”
Vision	Vision score on a scale of 1 to 10	8
Distance	Distance measured (in unspecified units)	4.5
PercDist	Percentage distance perceived by individual	50

1. Prepare

Load the libraries that we will use today.

# Install and load necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Check if the package is installed. If not, install it.
if (!require(janitor)) {
    install.packages("janitor", dependencies = TRUE)
}

Loading required package: janitor

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(janitor)

# Check if the package is installed. If not, install it.
if (!require(naniar)) {
    install.packages("naniar", dependencies = TRUE)
}

Loading required package: naniar

library(naniar)

# Check if the VIM package is installed. If not, install it.
if (!require(VIM)) {
    install.packages("VIM", dependencies = TRUE)
}

Loading required package: VIM
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.

Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues

Attaching package: 'VIM'

The following object is masked from 'package:datasets':

    sleep

library(VIM)

Read in the data

# Read the CSV from the URL and save it to the df3 object
df3 <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKH71cmJ5WDE11pDnJNYn4AIrpZ1yrgg8oR7GRyipqiBvj4w7YnuAMsgXjTIfo0A3JIMtnZicWDrTQ/pub?gid=129124664&single=true&output=csv")

Rows: 29 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Person, Gender, Civil state, Profession
dbl (6): Ages, Height, Vision, Distance, PercDist, Salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#inspect the data with a function of your choice
#add your code below

2. Understanding the Structure and Cleaning the Data

1. 👉 Your Turn ⤵

TASK: Try to understand the structure of the data. Use a function to understand the structure.

# ADD YOUR CODE BELOW with comments
str(df3)

spc_tbl_ [29 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Person     : chr [1:29] "Andrés" "Anja" "Armando" "Carlos" ...
 $ Ages       : num [1:29] 25 29 31 25 23 39 28 37 NA 46 ...
 $ Gender     : chr [1:29] "M" "F" "M" "M" ...
 $ Civil state: chr [1:29] "S" "S" "S" "M" ...
 $ Height     : num [1:29] 265 168 169 185 170 100 166 NA NA 168 ...
 $ Profession : chr [1:29] "Student" "Professional" "Professional" "Professional" ...
 $ Vision     : num [1:29] 2 10 9 8 10 NA 8 NA NA 8 ...
 $ Distance   : num [1:29] 0.1 4.5 4.5 6 3 5 4.5 6 NA 4.5 ...
 $ PercDist   : num [1:29] 15 45 50 75 30 ...
 $ Salary     : num [1:29] NA NA 33000 NA NA NA NA 60000 NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   Person = col_character(),
  ..   Ages = col_double(),
  ..   Gender = col_character(),
  ..   `Civil state` = col_character(),
  ..   Height = col_double(),
  ..   Profession = col_character(),
  ..   Vision = col_double(),
  ..   Distance = col_double(),
  ..   PercDist = col_double(),
  ..   Salary = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

2. 👉 Your Turn ⤵

TASK: Answer the guiding questions and fill in the table.

What do you see, notice and wonder about?

It seems the data contains health and personal information about 29 people.

Fill in the Class/Type below:

Variable Name	Class/Type	Description	Example
Person	Nominal/String	Name of the individual	“Andrés”
Ages	Ratio/Int	Age of the individual	25
Gender	Binary/String	Gender of the individual (M/F)	“M”
Civil state	Nominal/String	Marital status (e.g., Single, Married)	“S”
Height	Ratio/Int	Height of the individual in centimeters	168
Profession	Nominal/String	Occupation of the individual	“Student”
Vision	Interval/Int	Vision score on a scale of 1 to 10	8
Distance	Ratio/Double	Distance measured (in unspecified units)	4.5
PercDist	Ratio/Double	Percentage distance perceived by individual	50
Salary	Ratio/Double	Salary	33000

What is the structure of the data? (Rows/columns)

Our data structure has 29 observations/entries each detailed through 10 features.

What are the variable types?

There are different variable types, e.g. strings (character) for names and categories such as gender or profession, and doubles for numerical data such as salary or distance.

Are there any wrangling steps needed?

Yes, some wrangling is needed to address missing values, particularly in columns like height. Mean or median imputation can be applied to fill in these gaps for consistent analysis.

Are there any things you will need to complete while wrangling the data?

We’ll need to create imputed versions of the data and add a source column to each dataset for comparison. This will make it easier to analyze the differences between the original and imputed datasets.

Correct Column Names without renaming.

Our column names are not in a desirable format. To deal with this, we can use the clean_names() function of {janitor}.

This package has several more handy functions for cleaning data that are worth checking out. Remember you can always use ?janitor to read about the functions in the “help” pane.

# Clean names with janitor package and save it back to df3 object
df3 <- df3 %>%
  clean_names()

# Check the dataset structure
df3

# A tibble: 29 × 10
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Armando     31 M      S              169 Professio…      9      4.5      50  
 4 Carlos      25 M      M              185 Professio…      8      6        75  
 5 Cristina    23 F      <NA>           170 Student        10      3        30  
 6 Delfa       39 F      M              100 Professio…     NA      5        75  
 7 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 8 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 9 Fanny       NA F      M               NA Student        NA     NA        33.3
10 Francis…    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 19 more rows
# ℹ 1 more variable: salary <dbl>

3. 👉 Your Turn ⤵

TASK: If you haven’t done this already use the summary() function to get a feeling for this dataset.

# ADD YOUR CODE BELOW with comments
summary(df3)

    person               ages          gender          civil_state       
 Length:29          Min.   :22.00   Length:29          Length:29         
 Class :character   1st Qu.:25.00   Class :character   Class :character  
 Mode  :character   Median :27.00   Mode  :character   Mode  :character  
                    Mean   :31.83                                        
                    3rd Qu.:37.50                                        
                    Max.   :55.00                                        
                    NA's   :5                                            
     height     profession            vision          distance     
 Min.   : 95   Length:29          Min.   : 2.000   Min.   : 0.100  
 1st Qu.:164   Class :character   1st Qu.: 7.000   1st Qu.: 1.500  
 Median :168   Mode  :character   Median : 8.500   Median : 3.750  
 Mean   :167                      Mean   : 8.045   Mean   : 4.317  
 3rd Qu.:175                      3rd Qu.:10.000   3rd Qu.: 4.500  
 Max.   :265                      Max.   :10.000   Max.   :25.000  
 NA's   :6                        NA's   :7        NA's   :5       
   perc_dist          salary      
 Min.   : 15.00   Min.   : 33000  
 1st Qu.: 20.24   1st Qu.: 46500  
 Median : 40.18   Median : 60000  
 Mean   : 45.45   Mean   : 71000  
 3rd Qu.: 57.19   3rd Qu.: 90000  
 Max.   :150.00   Max.   :120000  
 NA's   :1        NA's   :26

Plot to visualize

We can also use the plot() function to create a quick plot with base R.

4. 👉 Your Turn ⤵

TASK: Add the plot() function; inside the parentheses, add y = dataframe$vision, followed by a comma, then x = dataframe$ages; then run the code.

# ADD YOUR CODE BELOW with comments
# Plot Ages vs Vision using base R
plot(y = df3$vision, x = df3$ages)

I am also providing a graph in ggplot below:

# Plot Ages vs Vision using ggplot
ggplot(data = df3) +
  aes(x = ages, y = vision) +
  geom_point(size = 2) +
  scale_x_continuous(
    name = "Person's age",
    limits = c(20, 60),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_y_continuous(
    name = "Person's vision",
    limits = c(0, NA),
    breaks = seq(0, 10, 2),
    expand = expansion(mult = c(0, 0.05))
  ) +
  theme_bw()

Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).

TASK: What are you noticing here?

Vision scores tend to be higher among younger people, which is what we would have expected, although at least one young person (age 25) has a very low score of 2.

Missing Data

Understanding Missing data

Note that missing values (NA) are treated specially in R. As an example: if you want to filter for missing values, you cannot write value == NA, because any comparison with NA returns NA; you must instead write is.na(value):

# Filter for rows where vision has NA
df3 %>% 
  filter(is.na(vision))

# A tibble: 7 × 10
  person   ages gender civil_state height profession   vision distance perc_dist
  <chr>   <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Delfa      39 F      M              100 Professional     NA        5      75  
2 Enrique    37 <NA>   <NA>            NA Professional     NA        6      NA  
3 Fanny      NA F      M               NA Student          NA       NA      33.3
4 Henrry     NA M      S               NA Professional     NA       NA      60  
5 Lorena     NA F      M               NA Student          NA       NA      42.9
6 Pablo      NA M      S               NA Student          NA       NA      33.3
7 Silvio     NA M      S               NA Professional     NA       NA      56.2
# ℹ 1 more variable: salary <dbl>
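To see why value == NA cannot be used as a filter, here is a minimal base R sketch. The vector `value` is a hypothetical toy example and is not taken from df3.

```r
# Hypothetical toy vector (not from df3)
value <- c(8, NA, 10)

# Comparing with == propagates missingness: every element of the result is NA
eq_result <- value == NA

# is.na() returns a proper TRUE/FALSE for each element
na_flags <- is.na(value)

print(eq_result)
print(na_flags)
```

Because filter() keeps only rows where the condition is TRUE, a condition that evaluates to NA everywhere would keep no rows at all.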

5. 👉 Your Turn ⤵

TASK: Filter missing values for ages

# ADD YOUR CODE BELOW with comments
# Filter for rows where ages has NA
df3 %>% 
  filter(is.na(ages))

# A tibble: 5 × 10
  person  ages gender civil_state height profession   vision distance perc_dist
  <chr>  <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Fanny     NA F      M               NA Student          NA       NA      33.3
2 Henrry    NA M      S               NA Professional     NA       NA      60  
3 Lorena    NA F      M               NA Student          NA       NA      42.9
4 Pablo     NA M      S               NA Student          NA       NA      33.3
5 Silvio    NA M      S               NA Professional     NA       NA      56.2
# ℹ 1 more variable: salary <dbl>

The complete.cases() function examines a data frame and returns a logical vector with one entry per row: TRUE for rows with no missing values (complete cases) and FALSE for rows that contain at least one NA.

You can store the result under a name. In this instance, the name will be “rowswithNA”.

❗The exclamation mark (!) is the logical NOT operator in R. So, !complete.cases means NOT complete cases, i.e. the rows of the data set that contain at least one missing value.

Here we want to work with rows, so inside the brackets put rowswithNA (the name you created in the previous step) before the comma and leave the column position after the comma empty.

# Identifying rows with missing values using complete.cases
rowswithNA <- !complete.cases(df3)
df3[rowswithNA, ]

# A tibble: 27 × 10
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Carlos      25 M      M              185 Professio…      8      6        75  
 4 Cristina    23 F      <NA>           170 Student        10      3        30  
 5 Delfa       39 F      M              100 Professio…     NA      5        75  
 6 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 7 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 8 Fanny       NA F      M               NA Student        NA     NA        33.3
 9 Francis…    46 M      M              168 Professio…      8      4.5      56.2
10 Franklin    55 M      M              165 Professio…      7      4.5      64.3
# ℹ 17 more rows
# ℹ 1 more variable: salary <dbl>
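The logic of complete.cases() and its negation can be checked on a small example. This is a sketch on a hypothetical toy data frame (`toy` is not part of the assignment data).

```r
# Hypothetical toy data frame with one complete row and two incomplete rows
toy <- data.frame(
  ages   = c(25, NA, 31),
  vision = c(2, 10, NA)
)

# TRUE marks rows without any NA
complete_rows <- complete.cases(toy)

# Negating the vector marks rows with at least one NA
rows_with_na <- !complete_rows

# Row subsetting: logical vector before the comma, nothing after it
toy[rows_with_na, ]
```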

In the tidyverse you will want to use the group_by() function. The core tidyverse has no dedicated function for counting NAs per group, so you combine group_by() and summarise() with is.na().

# Group by profession and summarise missing vs not-missing counts for vision
df3 %>% 
  group_by(profession) %>% 
  summarise(
    n_rows = n(),
    n_NA = sum(is.na(vision)),
    n_notNA = sum(!is.na(vision))
  )

# A tibble: 2 × 4
  profession   n_rows  n_NA n_notNA
  <chr>         <int> <int>   <int>
1 Professional     18     4      14
2 Student          11     3       8

However, the {naniar} package has many functions that can help you with NAs.

# Same summary using naniar functions
df3 %>%
  group_by(profession) %>% 
  summarise(
    n_rows = n(),
    n_NA = n_miss(vision),
    n_notNA = n_complete(vision)
  )

# A tibble: 2 × 4
  profession   n_rows  n_NA n_notNA
  <chr>         <int> <int>   <int>
1 Professional     18     4      14
2 Student          11     3       8

Why use naniar? The naniar package has several advantages over base R or the tidyverse alone when it comes to handling missing data. It provides specialized functions and visualizations that make it easier to understand and explore missing data patterns.

6. 👉 Your Turn ⤵

Pick one way to look at the NAs in the variable ages.

# ADD YOUR CODE BELOW with comments
summary(df3$ages)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  22.00   25.00   27.00   31.83   37.50   55.00       5

A. Visualizing Missing Data Patterns

Using {naniar}, we can visualize missing values more effectively. We’ll create a bar chart showing the number of missing values for each variable.

# Bar chart showing missing values by variable
gg_miss_var(df3)

# Summary of missing values
miss_var_summary(df3)

# A tibble: 10 × 3
   variable    n_miss pct_miss
   <chr>        <int>    <num>
 1 salary          26    89.7 
 2 vision           7    24.1 
 3 height           6    20.7 
 4 ages             5    17.2 
 5 distance         5    17.2 
 6 civil_state      2     6.90
 7 gender           1     3.45
 8 perc_dist        1     3.45
 9 person           0     0   
10 profession       0     0

df3 %>%
  group_by(profession) %>%
  summarise(n_miss_vision = n_miss(vision),
            pct_miss_vision = pct_miss(vision))

# A tibble: 2 × 3
  profession   n_miss_vision pct_miss_vision
  <chr>                <int>           <dbl>
1 Professional             4            22.2
2 Student                  3            27.3
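For comparison, the per-variable counts and percentages that miss_var_summary() reports can be approximated in base R. The sketch below uses a hypothetical toy data frame, so the numbers do not match df3.

```r
# Hypothetical toy data frame (not df3)
toy <- data.frame(
  ages   = c(25, NA, 31, NA),
  vision = c(2, 10, NA, 8),
  person = c("A", "B", "C", "D")
)

# is.na() on a data frame gives a logical matrix; column sums count the NAs
n_miss <- colSums(is.na(toy))

# Column means of the same matrix give the share of NAs, here as a percentage
pct_miss <- 100 * colMeans(is.na(toy))

print(n_miss)
print(pct_miss)
```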

We can also add shadow columns to track missing values in the dataset. These columns show whether a value was originally missing (NA) or not.

# Add shadow columns to mark missing data
df3_shadow <- bind_shadow(df3)
head(df3_shadow)

# A tibble: 6 × 20
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75
# ℹ 11 more variables: salary <dbl>, person_NA <fct>, ages_NA <fct>,
#   gender_NA <fct>, civil_state_NA <fct>, height_NA <fct>,
#   profession_NA <fct>, vision_NA <fct>, distance_NA <fct>,
#   perc_dist_NA <fct>, salary_NA <fct>

# Visualizing missing data patterns
aggr_plot <- aggr(df3, col=c('blue','red'), sortVars=TRUE, cex.axis=.7, ylab=c("Missing Data Bar Plot","Heatmap"))


 Variables sorted by number of missings: 
    Variable      Count
      salary 0.89655172
      vision 0.24137931
      height 0.20689655
        ages 0.17241379
    distance 0.17241379
 civil_state 0.06896552
      gender 0.03448276
   perc_dist 0.03448276
      person 0.00000000
  profession 0.00000000

5. Handling Missing Values

A. Omit missing data in your data set. You can do this with the na.omit() function and then save the result to the omittingNA object.

# Omit rows with missing values
omittingNA <- na.omit(df3)

#Inspect using head function
head(omittingNA)

# A tibble: 2 × 10
  person   ages gender civil_state height profession   vision distance perc_dist
  <chr>   <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Armando    31 M      S              169 Professional      9      4.5      50  
2 Iván       29 M      S              169 Professional      8      3        37.5
# ℹ 1 more variable: salary <dbl>
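Only 2 of the 29 rows survive na.omit() because a row is dropped if any column is missing. One possible alternative, shown here as a sketch on a hypothetical toy data frame, is to require completeness only in the columns of interest (for this assignment, ages and vision).

```r
# Hypothetical toy data frame (not df3); salary is mostly missing
toy <- data.frame(
  ages   = c(25, 29, NA, 46),
  vision = c(2, 10, 8, 8),
  salary = c(NA, NA, NA, 60000)
)

# Listwise deletion across all columns keeps very little
all_cols <- na.omit(toy)

# Requiring completeness only in ages and vision keeps more rows
keep <- complete.cases(toy[, c("ages", "vision")])
two_cols <- toy[keep, ]

nrow(all_cols)
nrow(two_cols)
```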

B. If the variable is mostly filled with missing values, one possibility would be its exclusion. Here, the variable called salary has many missing values, so we can delete it.

# Remove the salary variable by setting it to NULL
df3$salary <- NULL

#Inspect the dataset
head(df3)

# A tibble: 6 × 9
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75

Mean and Median

Use mean for normally distributed data without outliers.

Hint: the argument na.rm = TRUE is used inside mean function to calculate the mean excluding NA values (missing values)

# Calculate the mean of 'ages', excluding missing values
mean_ages <- mean(df3$ages, na.rm = TRUE)

mean_ages

[1] 31.83333

# Impute missing values using mean
df_mean_imputed <- df3 %>%
  mutate(ages = ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages))
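A side effect of mean imputation worth keeping in mind: the mean stays the same, but the spread shrinks because every filled-in value sits exactly at the center. The sketch below illustrates this with a hypothetical vector, not the actual ages column.

```r
# Hypothetical ages with two missing values (not from df3)
ages <- c(22, 25, 27, NA, 38, NA, 55)

# Same pattern as above: replace NA with the mean of the observed values
ages_imputed <- ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages)

# The mean is unchanged by construction
mean(ages, na.rm = TRUE)
mean(ages_imputed)

# The standard deviation is smaller after imputation
sd(ages, na.rm = TRUE)
sd(ages_imputed)
```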

Quick base R plot to check for normality

# Histogram for Ages distribution
hist(df3$ages, col = "orange", border = "purple")

Use median for skewed data or data with outliers. Is our data skewed? How can we tell?

# Impute missing values using median
df_median_imputed <- df3 %>%
  mutate(ages = ifelse(is.na(ages), median(ages, na.rm = TRUE), ages))

#inspect data
df_median_imputed

# A tibble: 29 × 9
   person    ages gender civil_state height profession vision distance perc_dist
   <chr>    <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
 1 Andrés      25 M      S              265 Student         2      0.1      15  
 2 Anja        29 F      S              168 Professio…     10      4.5      45  
 3 Armando     31 M      S              169 Professio…      9      4.5      50  
 4 Carlos      25 M      M              185 Professio…      8      6        75  
 5 Cristina    23 F      <NA>           170 Student        10      3        30  
 6 Delfa       39 F      M              100 Professio…     NA      5        75  
 7 Eduardo     28 M      S              166 Professio…      8      4.5      56.2
 8 Enrique     37 <NA>   <NA>            NA Professio…     NA      6        NA  
 9 Fanny       27 F      M               NA Student        NA     NA        33.3
10 Francis…    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 19 more rows
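One rough way to approach the skewness question is to compare the mean with the median and to compute a moment-based skewness coefficient. In the summary() output for ages, the mean (31.83) sits above the median (27.00), which hints at a right skew. The sketch below applies the same check to a hypothetical vector; the values are illustrative and are not the real ages column.

```r
# Hypothetical, right-skewed ages (not the actual df3 values)
ages <- c(22, 23, 25, 25, 27, 28, 29, 31, 37, 39, 46, 55)

m   <- mean(ages)
med <- median(ages)

# Moment-based skewness: third central moment divided by variance^(3/2)
# (population version, dividing by n); positive values indicate a right skew
skew <- mean((ages - m)^3) / (mean((ages - m)^2))^1.5

c(mean = m, median = med, skewness = skew)
```

A mean well above the median together with a positive coefficient would favor median imputation over mean imputation.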

Let's compare the sources before and after imputation.

# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"

# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)

# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = ages, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Ages: Original, Mean Imputed, and Median Imputed", y = "Ages") +
  theme_minimal() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

7. 👉 Your Turn ⤵

Compare mean imputation, median imputation, and the unchanged data for height.

7a. Calculate the mean

# ADD YOUR CODE BELOW with comments
# Calculate the mean of 'height', excluding missing values
mean_height <- mean(df3$height, na.rm = TRUE)

mean_height

[1] 166.9565

7b. Impute the Mean

# ADD YOUR CODE BELOW with comments
# Impute missing values using mean
df_mean_imputed1 <- df3 %>%
  mutate(height = ifelse(is.na(height), mean(height, na.rm = TRUE), height))

7c. Calculate the Median

# ADD YOUR CODE BELOW with comments

# Calculate the median of 'height', excluding missing values
median_height1 <- median(df3$height, na.rm = TRUE)

median_height1

[1] 168

# Impute missing values using median
df_median_imputed1 <- df3 %>%
  mutate(height = ifelse(is.na(height), median(height, na.rm = TRUE), height))

7d. Create the plot for the variable height

# ADD YOUR CODE BELOW with comments
# Histogram of the original height values
hist(df3$height, col = "red", border = "blue")

# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed1$source <- "Mean Imputed"
df_median_imputed1$source <- "Median Imputed"

# Combine the height-imputed data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed1, df_median_imputed1)

# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = height, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of height: Original, Mean Imputed, and Median Imputed", y = "height") +
  theme_minimal() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

6. Outlier Detection

Dr. Neves Faria explains that outliers are single data points that lie far from the rest of the data distribution.

In data analytics, outliers are values within a dataset that vary greatly from the others—they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a measurement, experimental errors, or a novelty.

Outliers can cause anomalies in the results obtained during data analysis. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.

Natural outliers are those which occur “naturally” in the dataset, as opposed to being the result of an error. These naturally occurring outliers are known as novelties. For example, the average height of a giraffe is about 16 feet tall. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.

The Three Main Reasons Why Outliers Occur:

Data Entry Errors: Data entry errors can happen when there is an issue with the data collection process. Mistakes in data entry could be as simple as a misplaced decimal point or a typo in a value.
Instrument Errors: Instrument errors occur when there is a problem with the equipment that collects data.
Natural Variation: Natural variation is a normal occurrence in any system.

Important Considerations

For symmetrical distribution (not necessarily normal) and N > 30, express results as mean ± standard deviation.
For asymmetrical distribution, N < 30, or evidence for outliers, use median ± IQR, which are more robust.
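To illustrate why the median and IQR are called more robust, the sketch below adds a single extreme value to a hypothetical height vector and compares how much each summary moves. The numbers are illustrative only.

```r
# Hypothetical heights in cm, without and with one extreme value
heights     <- c(164, 166, 168, 168, 169, 170, 175)
heights_out <- c(heights, 265)

# Mean and standard deviation react strongly to the extreme value
c(mean = mean(heights), sd = sd(heights))
c(mean = mean(heights_out), sd = sd(heights_out))

# Median and IQR barely move
c(median = median(heights), iqr = IQR(heights))
c(median = median(heights_out), iqr = IQR(heights_out))

# Size of the shift in each measure of center
mean_shift   <- abs(mean(heights_out) - mean(heights))
median_shift <- abs(median(heights_out) - median(heights))
```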

There are several types of visualizations that you can use depending on the dataset.

By going through the three steps of exploratory data analysis, you’ll have a much better understanding of your data, which will make it easier to choose your model, your attributes, and refine it overall.

Visualizing and Handling Outliers

We can detect outliers using boxplots and handle them appropriately, such as by removing or capping them.

# Boxplot to detect outliers in 'height'
ggplot(df3, aes(x = profession, y = height)) + 
  geom_boxplot() +
  theme_minimal()

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_boxplot()`).
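The points a boxplot draws individually are those beyond 1.5 × IQR from the quartiles, which is the default whisker rule of geom_boxplot(). The same rule can be applied by hand. The helper `flag_iqr_outliers()` below is a hypothetical function written for this illustration, and the vector is a toy example rather than the full height column.

```r
# Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
flag_iqr_outliers <- function(x, k = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Missing values are never flagged as outliers
  !is.na(x) & (x < lower | x > upper)
}

# Toy heights in cm, including one missing value
heights <- c(265, 168, 169, 185, 170, 100, 166, NA, 168)

flags <- flag_iqr_outliers(heights)
heights[flags]
```

Unlike a fixed cutoff such as height < 150, this rule adapts to the data, but it still only flags candidates; whether a flagged value is an error or a natural outlier remains a judgment call.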

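# Filter for potential low-height outliers
# Note: vision < 60 is true for every non-missing score on the 1 to 10 scale,
# so this condition only drops rows where vision is NA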
df3 %>% 
  filter(height < 150 & vision < 60) %>% 
  print()

# A tibble: 2 × 10
  person  ages gender civil_state height profession   vision distance perc_dist
  <chr>  <dbl> <chr>  <chr>        <dbl> <chr>         <dbl>    <dbl>     <dbl>
1 Marta     47 F      S               95 Professional      7       25      21.4
2 Nora      50 F      M              145 Professional      6        6     100  
# ℹ 1 more variable: source <chr>

# Remove outliers in height that are below a certain threshold (e.g., height < 150)
df_no_outliers <- df3 %>% 
  filter(height >= 150)

# Inspect the cleaned dataset
head(df_no_outliers)

# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1      15  
2 Anja        29 F      S              168 Profession…     10      4.5      45  
3 Armando     31 M      S              169 Profession…      9      4.5      50  
4 Carlos      25 M      M              185 Profession…      8      6        75  
5 Cristina    23 F      <NA>           170 Student         10      3        30  
6 Eduardo     28 M      S              166 Profession…      8      4.5      56.2
# ℹ 1 more variable: source <chr>

Methods for Handling Outliers

Removing Outliers: Simply removing the outliers from the dataset if they are errors or significantly affect the analysis.
Capping/Flooring Outliers: Replace extreme values with a specified threshold (e.g., capping at the 95th percentile).
Transforming Data: Apply transformations like log or square root to reduce the impact of outliers.
Using Robust Statistical Methods: Use methods that are less sensitive to outliers, such as median instead of mean.

Removing Outliers

# Remove outliers in 'height' below a threshold (here, height < 145); rows with missing height are dropped too
df_no_outliers <- df3 %>%
  filter(height >= 145)
   
# Inspect the cleaned dataset
head(df_no_outliers)

# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1      15  
2 Anja        29 F      S              168 Profession…     10      4.5      45  
3 Armando     31 M      S              169 Profession…      9      4.5      50  
4 Carlos      25 M      M              185 Profession…      8      6        75  
5 Cristina    23 F      <NA>           170 Student         10      3        30  
6 Eduardo     28 M      S              166 Profession…      8      4.5      56.2
# ℹ 1 more variable: source <chr>

# Histogram for height after removing outliers
ggplot(df_no_outliers, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Height After Removing Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()

Capping/Flooring Outliers

# Capping height values at the 95th percentile
height_95th <- quantile(df3$height, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(height = ifelse(height > height_95th, height_95th, height))

# Inspect the modified dataset
head(df_capped)

# A tibble: 6 × 10
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S             189. Student          2      0.1        15
2 Anja        29 F      S             168  Profession…     10      4.5        45
3 Armando     31 M      S             169  Profession…      9      4.5        50
4 Carlos      25 M      M             185  Profession…      8      6          75
5 Cristina    23 F      <NA>          170  Student         10      3          30
6 Delfa       39 F      M             100  Profession…     NA      5          75
# ℹ 1 more variable: source <chr>

# Histogram for height after capping outliers
ggplot(df_capped, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Height After Capping Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
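The code above caps only the upper tail. If flooring is wanted as well, one option is to clamp values on both sides with pmax() and pmin(). This is a sketch on a hypothetical vector; the 5th and 95th percentiles are an assumed choice of thresholds, not one prescribed by the assignment.

```r
# Hypothetical heights in cm with low and high extremes and one NA
heights <- c(95, 145, 164, 166, 168, 168, 169, 170, 175, 185, 265, NA)

# Assumed thresholds: 5th and 95th percentiles of the observed values
lower <- quantile(heights, 0.05, na.rm = TRUE, names = FALSE)
upper <- quantile(heights, 0.95, na.rm = TRUE, names = FALSE)

# pmax() raises values below the floor; pmin() lowers values above the cap
# Missing values stay NA
heights_capped <- pmin(pmax(heights, lower), upper)

range(heights_capped, na.rm = TRUE)
```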

Transforming Data

# Apply log transformation to reduce the impact of outliers
df_transformed <- df3 %>%
  mutate(height_log = log(height))

# Inspect the transformed dataset
head(df_transformed)

# A tibble: 6 × 11
  person    ages gender civil_state height profession  vision distance perc_dist
  <chr>    <dbl> <chr>  <chr>        <dbl> <chr>        <dbl>    <dbl>     <dbl>
1 Andrés      25 M      S              265 Student          2      0.1        15
2 Anja        29 F      S              168 Profession…     10      4.5        45
3 Armando     31 M      S              169 Profession…      9      4.5        50
4 Carlos      25 M      M              185 Profession…      8      6          75
5 Cristina    23 F      <NA>           170 Student         10      3          30
6 Delfa       39 F      M              100 Profession…     NA      5          75
# ℹ 2 more variables: source <chr>, height_log <dbl>

# Histogram for height after log transformation
ggplot(df_transformed, aes(x = height_log)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Log-Transformed Height", x = "Log(Height)", y = "Frequency") +
  theme_minimal()

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

Using Robust Statistical Methods

# Calculate median and IQR for height
median_height <- median(df3$height, na.rm = TRUE)
iqr_height <- IQR(df3$height, na.rm = TRUE)
   
# Print median and IQR
median_height

[1] 168

iqr_height

[1] 11

After calculating the median and IQR, we can visualize the distribution of height. The histogram below shows the original, unmodified values; the median (168) and IQR (11) summarize its center and spread without being pulled by the extreme heights.

# Histogram for height using median and IQR
ggplot(df3, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Height Using Median and IQR", x = "Height (cm)", y = "Frequency") +
  theme_minimal()

Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).

8. 👉 Your Turn ⤵

TASK: Choose a method for handling outliers and apply it to the vision variable, produce a visualization, and add a few sentences describing what is happening.

# Remove outliers in 'vision' below a certain threshold (e.g., vision < 3)
df_no_outliers <- df3 %>%
  filter(vision > 2)
   
# Inspect the cleaned dataset
head(df_no_outliers)

# A tibble: 6 × 10
  person     ages gender civil_state height profession vision distance perc_dist
  <chr>     <dbl> <chr>  <chr>        <dbl> <chr>       <dbl>    <dbl>     <dbl>
1 Anja         29 F      S              168 Professio…     10      4.5      45  
2 Armando      31 M      S              169 Professio…      9      4.5      50  
3 Carlos       25 M      M              185 Professio…      8      6        75  
4 Cristina     23 F      <NA>           170 Student        10      3        30  
5 Eduardo      28 M      S              166 Professio…      8      4.5      56.2
6 Francisco    46 M      M              168 Professio…      8      4.5      56.2
# ℹ 1 more variable: source <chr>

# ADD YOUR CODE BELOW with comments

# Histogram for vision after removing outliers
ggplot(df_no_outliers, aes(x = vision)) +
  geom_histogram(binwidth = 1, fill = "gold", color = "purple") +
  labs(title = "Histogram of Vision After Removing Outliers", x = "Vision (score, 1 to 10)", y = "Frequency") +
  theme_minimal()

Next week we will take care of factoring data, decide on handling missing data and outliers. Then we will do regressions.

References: Priscila Neves Faria. 2024. EDA For Big Data: Missing Values, Box Plot Histogram and Outliers.

Paul Schmidt. 2023. “Bad Data & Outliers.” November 14, 2023. https://schmidtpaul.github.io/dsfair_quarto//ch/rbasics/baddata.html.