First rename this file as DSC406_001_FA24_WA4_unityID
Introduction
This synthetic dataset is designed to study vision quality on a scale from 1 to 10. It includes variables that describe individuals’ demographic information, vision scores, and their ability to perceive distance. The dataset is intended to explore patterns and relationships between various features and vision quality.
Our goal here is to examine the relationship between two variables: ages and vision.
| Variable Name | Class/Type | Description | Example |
|---|---|---|---|
| Person | | Name of the individual | “Andrés” |
| Ages | | Age of the individual | 25 |
| Gender | | Gender of the individual (M/F) | “M” |
| Civil state | | Marital status (e.g., Single, Married) | “S” |
| Height | | Height of the individual in centimeters | 168 |
| Profession | | Occupation of the individual | “Student” |
| Vision | | Vision score on a scale of 1 to 10 | 8 |
| Distance | | Distance measured (in unspecified units) | 4.5 |
| PercDist | | Percentage distance perceived by individual | 50 |
1. Prepare
Load the libraries that we will use today.
# Install and load necessary libraries
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Check if the package is installed. If not, install it.
if (!require(janitor)) {
  install.packages("janitor", dependencies = TRUE)
}
Loading required package: janitor
Warning: package 'janitor' was built under R version 4.3.3
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(janitor)
# Check if the package is installed. If not, install it.
if (!require(naniar)) {
  install.packages("naniar", dependencies = TRUE)
}
Loading required package: naniar
Warning: package 'naniar' was built under R version 4.3.3
library(naniar)
# Check if the VIM package is installed. If not, install it.
if (!require(VIM)) {
  install.packages("VIM", dependencies = TRUE)
}
Loading required package: VIM
Warning: package 'VIM' was built under R version 4.3.3
Loading required package: colorspace
Warning: package 'colorspace' was built under R version 4.3.3
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
library(VIM)
Read in the data
# Read the published CSV from its URL and save it to the df3 object
df3 <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKH71cmJ5WDE11pDnJNYn4AIrpZ1yrgg8oR7GRyipqiBvj4w7YnuAMsgXjTIfo0A3JIMtnZicWDrTQ/pub?gid=129124664&single=true&output=csv")
Rows: 29 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Person, Gender, Civil state, Profession
dbl (6): Ages, Height, Vision, Distance, PercDist, Salary
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the data with a function of your choice
df3
# A tibble: 29 × 10
Person Ages Gender `Civil state` Height Profession Vision Distance PercDist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Armando 31 M S 169 Professio… 9 4.5 50
4 Carlos 25 M M 185 Professio… 8 6 75
5 Cristi… 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Professio… NA 5 75
7 Eduardo 28 M S 166 Professio… 8 4.5 56.2
8 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
9 Fanny NA F M NA Student NA NA 33.3
10 Franci… 46 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
# ℹ 1 more variable: Salary <dbl>
2. Understanding the Structure and Cleaning the Data
1. 👉 Your Turn⤵
TASK: Try to understand the structure of the data. Use a function to understand the structure.
# ADD YOUR CODE BELOW with comments
# Inspect the data using the str() function
str(df3)
spc_tbl_ [29 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Person : chr [1:29] "Andrés" "Anja" "Armando" "Carlos" ...
$ Ages : num [1:29] 25 29 31 25 23 39 28 37 NA 46 ...
$ Gender : chr [1:29] "M" "F" "M" "M" ...
$ Civil state: chr [1:29] "S" "S" "S" "M" ...
$ Height : num [1:29] 265 168 169 185 170 100 166 NA NA 168 ...
$ Profession : chr [1:29] "Student" "Professional" "Professional" "Professional" ...
$ Vision : num [1:29] 2 10 9 8 10 NA 8 NA NA 8 ...
$ Distance : num [1:29] 0.1 4.5 4.5 6 3 5 4.5 6 NA 4.5 ...
$ PercDist : num [1:29] 15 45 50 75 30 ...
$ Salary : num [1:29] NA NA 33000 NA NA NA NA 60000 NA NA ...
- attr(*, "spec")=
.. cols(
.. Person = col_character(),
.. Ages = col_double(),
.. Gender = col_character(),
.. `Civil state` = col_character(),
.. Height = col_double(),
.. Profession = col_character(),
.. Vision = col_double(),
.. Distance = col_double(),
.. PercDist = col_double(),
.. Salary = col_double()
.. )
- attr(*, "problems")=<externalptr>
2. 👉 Your Turn⤵
TASK: Answer the guiding questions and fill in the table.
What do you see, notice and wonder about?
{Add a sentence or two here}
Fill in the Class/Type below:
| Variable Name | Class/Type | Description | Example |
|---|---|---|---|
| Person | | Name of the individual | “Andrés” |
| Ages | | Age of the individual | 25 |
| Gender | | Gender of the individual (M/F) | “M” |
| Civil state | | Marital status (e.g., Single, Married) | “S” |
| Height | | Height of the individual in centimeters | 168 |
| Profession | | Occupation of the individual | “Student” |
| Vision | | Vision score on a scale of 1 to 10 | 8 |
| Distance | | Distance measured (in unspecified units) | 4.5 |
| PercDist | | Percentage distance perceived by individual | 50 |
What is the structure of the data? (Rows/columns)
{Add a sentence or two here}
What are the variable types?
{Add a sentence or two here}
Are there any wrangling steps needed?
{Add a sentence or two here}
Are there any things you will need to complete while wrangling the data?
{Add a sentence or two here}
Answer: The structure of the dataset reveals the following characteristics:
Rows and Columns: The dataset contains 29 rows and 10 columns.
There are four character variables (Person, Gender, Civil state, Profession), which are represented as chr (character strings).
There are six numeric variables (Ages, Height, Vision, Distance, PercDist, Salary), which are represented as dbl (double).
| Variable Name | Class/Type | Description | Example |
|---|---|---|---|
| Person | character | Name of the individual | “Andrés” |
| Ages | numeric | Age of the individual | 25 |
| Gender | character | Gender of the individual (M/F) | “M” |
| Civil state | character | Marital status (e.g., Single, Married) | “S” |
| Height | numeric | Height of the individual in centimeters | 168 |
| Profession | character | Occupation of the individual | “Student” |
| Vision | numeric | Vision score on a scale of 1 to 10 | 8 |
| Distance | numeric | Distance measured (in unspecified units) | 4.5 |
| PercDist | numeric | Percentage distance perceived by individual | 50 |
Tibble Summary:
The dataset is stored as a tibble, a modern variant of the R data frame with clearer printing and stricter subsetting. Some rows contain missing values (NA), for example in Ages, Height, and Civil state.
Attributes:
The column specification (spec) indicates that the dataset’s variables were automatically classified, with character and numeric types assigned as appropriate.
Things that may need to be addressed:
The data set includes missing data points, which could affect subsequent analyses.
The gender variable could be converted to a factor.
The column names are not in a consistent format (for example, Civil state contains a space).
Correct Column Names without renaming.
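Our column names are not in a desirable format. To deal with this, we can use the clean_names() function from {janitor}, which converts them to lowercase snake_case (for example, Civil state becomes civil_state and PercDist becomes perc_dist).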
This package has several more handy functions for cleaning data that are worth checking out. Remember you can always use ?janitor to read about the functions in the “help” pane.
# Clean names with the janitor package and save the result back to df3
df3 <- df3 %>%
  clean_names()
# Check the dataset structure
df3
# A tibble: 29 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Armando 31 M S 169 Professio… 9 4.5 50
4 Carlos 25 M M 185 Professio… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Professio… NA 5 75
7 Eduardo 28 M S 166 Professio… 8 4.5 56.2
8 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
9 Fanny NA F M NA Student NA NA 33.3
10 Francis… 46 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
# ℹ 1 more variable: salary <dbl>
3. 👉 Your Turn⤵
TASK: If you haven’t done this already, use the summary() function to get a feeling for this dataset.
# ADD YOUR CODE BELOW with comments
summary(df3)
person ages gender civil_state
Length:29 Min. :22.00 Length:29 Length:29
Class :character 1st Qu.:25.00 Class :character Class :character
Mode :character Median :27.00 Mode :character Mode :character
Mean :31.83
3rd Qu.:37.50
Max. :55.00
NA's :5
height profession vision distance
Min. : 95 Length:29 Min. : 2.000 Min. : 0.100
1st Qu.:164 Class :character 1st Qu.: 7.000 1st Qu.: 1.500
Median :168 Mode :character Median : 8.500 Median : 3.750
Mean :167 Mean : 8.045 Mean : 4.317
3rd Qu.:175 3rd Qu.:10.000 3rd Qu.: 4.500
Max. :265 Max. :10.000 Max. :25.000
NA's :6 NA's :7 NA's :5
perc_dist salary
Min. : 15.00 Min. : 33000
1st Qu.: 20.24 1st Qu.: 46500
Median : 40.18 Median : 60000
Mean : 45.45 Mean : 71000
3rd Qu.: 57.19 3rd Qu.: 90000
Max. :150.00 Max. :120000
NA's :1 NA's :26
Plot to visualize
We can also use the plot() function to create a quick plot with base R.
4. 👉 Your Turn⤵
TASK:
- add the plot() function
- inside, add y = dataframe$vision
- followed by a comma
- add x = dataframe$ages
- run
# ADD YOUR CODE BELOW with comments
# Plot ages vs. vision using base R
plot(y = df3$vision, x = df3$ages)
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).
TASK What are you noticing here?
{Add a sentence or two here}
Answer - Most people in their 20s can see quite well, while older people tend to have worse vision.
Missing Data
Understanding Missing data
Note that missing values (NA) are treated specially in R. For example, if you want to filter for missing values, you cannot write value == NA; you must write is.na(value) instead:
# Filter for rows where vision is NA
df3 %>%
  filter(is.na(vision))
# A tibble: 7 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Delfa 39 F M 100 Professional NA 5 75
2 Enrique 37 <NA> <NA> NA Professional NA 6 NA
3 Fanny NA F M NA Student NA NA 33.3
4 Henrry NA M S NA Professional NA NA 60
5 Lorena NA F M NA Student NA NA 42.9
6 Pablo NA M S NA Student NA NA 33.3
7 Silvio NA M S NA Professional NA NA 56.2
# ℹ 1 more variable: salary <dbl>
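To see why the == comparison does not work, here is a minimal sketch using a small made-up vector (the name scores is hypothetical and not part of the dataset):

```r
# Hypothetical vision scores with one missing value
scores <- c(8, NA, 10)

# Comparing with == propagates the missing value instead of detecting it
eq_result <- scores == NA
print(eq_result)   # every element is NA

# is.na() returns a proper TRUE/FALSE for each element
na_flags <- is.na(scores)
print(na_flags)

# Counting and subsetting both work with the logical vector
n_missing <- sum(na_flags)
observed <- scores[!na_flags]
```

Because the result of == NA is itself NA, filter() would keep no rows at all, which is why is.na() is needed.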
5. 👉 Your Turn⤵
TASK: Filter missing values for ages
# ADD YOUR CODE BELOW with comments
# Filter for rows where ages is NA
df3 %>%
  filter(is.na(ages))
# A tibble: 5 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Fanny NA F M NA Student NA NA 33.3
2 Henrry NA M S NA Professional NA NA 60
3 Lorena NA F M NA Student NA NA 42.9
4 Pablo NA M S NA Student NA NA 33.3
5 Silvio NA M S NA Professional NA NA 56.2
# ℹ 1 more variable: salary <dbl>
The complete.cases() function examines a data frame and returns a logical vector with one element per row: TRUE for rows with no missing values and FALSE for rows that contain at least one NA.
You can store the negated result in an object. In this instance, the object will be named rowswithNA.
❗The exclamation mark (!) is the logical NOT operator in R. So, !complete.cases means NOT complete cases, that is, the rows with missing values.
Here we want to select rows, so inside the brackets put rowswithNA (the object you created in the previous step) before the comma and leave the column position after the comma empty.
# Identify rows with missing values using complete.cases()
rowswithNA <- !complete.cases(df3)
df3[rowswithNA, ]
# A tibble: 27 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Carlos 25 M M 185 Professio… 8 6 75
4 Cristina 23 F <NA> 170 Student 10 3 30
5 Delfa 39 F M 100 Professio… NA 5 75
6 Eduardo 28 M S 166 Professio… 8 4.5 56.2
7 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
8 Fanny NA F M NA Student NA NA 33.3
9 Francis… 46 M M 168 Professio… 8 4.5 56.2
10 Franklin 55 M M 165 Professio… 7 4.5 64.3
# ℹ 17 more rows
# ℹ 1 more variable: salary <dbl>
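The row-selection logic is easier to follow on a tiny example. The sketch below uses a made-up three-row data frame (toy is a hypothetical name) in which only the first row is complete:

```r
# Hypothetical three-row data frame; only the first row is complete
toy <- data.frame(
  ages   = c(25, NA, 31),
  vision = c(8, 10, NA)
)

# TRUE marks rows with no missing values
complete_rows <- complete.cases(toy)

# Negate to flag rows that contain at least one NA
rows_with_na <- !complete_rows

# Row subsetting: logical vector before the comma, nothing after it
toy_incomplete <- toy[rows_with_na, ]
print(toy_incomplete)
```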
In the tidyverse, you can combine group_by() and summarise() with is.na() to count missing values per group. Helpers such as tidyr's drop_na() and replace_na() remove or replace NAs, but the core tidyverse has no function dedicated to summarising them.
# Group by profession and summarise missing vs. non-missing counts for vision
df3 %>%
  group_by(profession) %>%
  summarise(
    n_rows = n(),
    n_NA = sum(is.na(vision)),
    n_notNA = sum(!is.na(vision))
  )
Why use naniar? The naniar package has several advantages over base R or the tidyverse alone when it comes to handling missing data. It provides specialized functions and visualizations that make it easier to understand and explore missing data patterns.
6. 👉 Your Turn⤵
Pick one way to look at the NAs in the variable ages.
# ADD YOUR CODE BELOW with comments
A. Visualizing Missing Data Patterns
Using naniar, we can visualize missing values more effectively. We’ll create a bar chart showing the number of missing values for each variable.
# Bar chart showing missing values by variable
gg_miss_var(df3)
# A tibble: 2 × 3
profession n_miss_vision pct_miss_vision
<chr> <int> <dbl>
1 Professional 4 22.2
2 Student 3 27.3
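A per-profession summary like the table above can be computed by counting is.na() within each group. The sketch below is one possible way to do it, not necessarily the code that produced the table: it uses a small made-up data frame (toy) rather than df3, assumes dplyr is installed, and reuses the column names n_miss_vision and pct_miss_vision only to mirror the output.

```r
library(dplyr)

# Hypothetical data: 3 professionals (1 missing), 2 students (1 missing)
toy <- data.frame(
  profession = c("Professional", "Professional", "Professional", "Student", "Student"),
  vision     = c(8, NA, 9, NA, 10)
)

miss_by_group <- toy %>%
  group_by(profession) %>%
  summarise(
    n_miss_vision   = sum(is.na(vision)),        # count of missing scores
    pct_miss_vision = 100 * mean(is.na(vision))  # share of missing scores, in percent
  )

print(miss_by_group)
```

Applied to the counts in the table, the same arithmetic gives 4 of 18 professionals (22.2%) and 3 of 11 students (27.3%).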
We can also add shadow columns to track missing values in the dataset. These columns show whether a value was originally missing (NA) or not.
# Add shadow columns to mark missing data
df3_shadow <- bind_shadow(df3)
head(df3_shadow)
# A tibble: 6 × 20
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 11 more variables: salary <dbl>, person_NA <fct>, ages_NA <fct>,
# gender_NA <fct>, civil_state_NA <fct>, height_NA <fct>,
# profession_NA <fct>, vision_NA <fct>, distance_NA <fct>,
# perc_dist_NA <fct>, salary_NA <fct>
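The idea behind a shadow column can be sketched by hand in base R. The example below builds a vision_NA indicator on a made-up data frame (toy) and uses it to compare ages between rows with and without a vision score. The labels "!NA" and "NA" follow the convention naniar uses for its shadow factors; the data and the comparison are illustrative assumptions.

```r
# Hypothetical data with two missing vision scores
toy <- data.frame(
  ages   = c(25, 29, 39, 37, 46),
  vision = c(2, 10, NA, NA, 8)
)

# Hand-built shadow column: records whether vision was missing
toy$vision_NA <- factor(
  ifelse(is.na(toy$vision), "NA", "!NA"),
  levels = c("!NA", "NA")
)

# Compare the mean age of rows with and without a vision score
mean_age_by_missing <- tapply(toy$ages, toy$vision_NA, mean)
print(mean_age_by_missing)
```

A clear difference between the two groups would suggest that missingness in vision is related to age rather than random.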
# Visualize missing data patterns
aggr_plot <- aggr(df3, col = c('blue', 'red'), sortVars = TRUE, cex.axis = .7, ylab = c("Missing Data Bar Plot", "Heatmap"))
Variables sorted by number of missings:
Variable Count
salary 0.89655172
vision 0.24137931
height 0.20689655
ages 0.17241379
distance 0.17241379
civil_state 0.06896552
gender 0.03448276
perc_dist 0.03448276
person 0.00000000
profession 0.00000000
5. Handling Missing Values
A. Omit missing data in your data set. You can do this with the na.omit() function and then save the result to the omittingNA object.
# Omit rows with missing values
omittingNA <- na.omit(df3)
head(omittingNA)
# A tibble: 2 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Armando 31 M S 169 Professional 9 4.5 50
2 Iván 29 M S 169 Professional 8 3 37.5
# ℹ 1 more variable: salary <dbl>
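Only two rows survive because na.omit() drops a row if any column is missing, and salary is missing for most people. When the analysis needs only a few variables, one alternative is to require completeness in just those columns. A minimal sketch on a made-up data frame (toy is hypothetical):

```r
# Hypothetical data: salary is missing for most rows
toy <- data.frame(
  ages   = c(25, 29, NA, 46),
  vision = c(2, 10, 8, 8),
  salary = c(NA, 33000, NA, NA)
)

# na.omit() keeps only rows that are complete in every column
all_complete <- na.omit(toy)

# Require completeness only in the columns under study
keep <- complete.cases(toy[, c("ages", "vision")])
ages_vision_complete <- toy[keep, ]

print(nrow(all_complete))
print(nrow(ages_vision_complete))
```

The second approach keeps more rows for an ages versus vision comparison, at the cost of leaving NAs in the other columns.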
B. If a variable consists mostly of missing values, one option is to exclude it. Here, the variable salary is missing in about 90% of rows, so we can delete it.
# Remove the salary column, then inspect the data
df3$salary <- NULL
head(df3)
# A tibble: 6 × 9
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
Mean and Median
Use the mean for normally distributed data without outliers; use the median when the data are skewed or contain outliers.
Hint: the argument na.rm = TRUE is used inside the mean() function to calculate the mean excluding NA values (missing values).
# Compute the mean of 'ages', ignoring missing values
mean_ages <- mean(df3$ages, na.rm = TRUE)
mean_ages
# A tibble: 29 × 9
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Armando 31 M S 169 Professio… 9 4.5 50
4 Carlos 25 M M 185 Professio… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Professio… NA 5 75
7 Eduardo 28 M S 166 Professio… 8 4.5 56.2
8 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
9 Fanny 27 F M NA Student NA NA 33.3
10 Francis… 46 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
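The comparison that follows uses two data frames, df_mean_imputed and df_median_imputed, whose construction is not shown. One plausible way to build them, sketched here in base R on a made-up data frame (toy is hypothetical), is to replace each NA in ages with the mean or the median of the observed values:

```r
# Hypothetical ages with two missing values
toy <- data.frame(ages = c(22, 25, 27, NA, 55, NA))

mean_ages   <- mean(toy$ages, na.rm = TRUE)
median_ages <- median(toy$ages, na.rm = TRUE)

# Copy the data, then fill each NA with the chosen statistic
df_mean_imputed <- toy
df_mean_imputed$ages[is.na(df_mean_imputed$ages)] <- mean_ages

df_median_imputed <- toy
df_median_imputed$ages[is.na(df_median_imputed$ages)] <- median_ages

print(df_mean_imputed$ages)
print(df_median_imputed$ages)
```

Because the one large value pulls the mean upward, the two fills differ, which is the kind of difference a side-by-side boxplot can reveal.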
Let's compare the distribution of ages before and after imputation.
# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"
# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)
# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = ages, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Ages: Original, Mean Imputed, and Median Imputed", y = "Ages") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
7. 👉 Your Turn⤵
Compare the mean-imputed, median-imputed, and original (unchanged) values for height.
7a. Calculate the mean
# Compute the mean of 'height', ignoring missing values
MeanHeight <- mean(df3$height, na.rm = TRUE)
# Inspect the answer
MeanHeight
# A tibble: 29 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 265 M S 265 Student 2 0.1 15
2 Anja 168 F S 168 Professio… 10 4.5 45
3 Armando 169 M S 169 Professio… 9 4.5 50
4 Carlos 185 M M 185 Professio… 8 6 75
5 Cristina 170 F <NA> 170 Student 10 3 30
6 Delfa 100 F M 100 Professio… NA 5 75
7 Eduardo 166 M S 166 Professio… 8 4.5 56.2
8 Enrique 27 <NA> <NA> NA Professio… NA 6 NA
9 Fanny 27 F M NA Student NA NA 33.3
10 Francis… 168 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
# ℹ 1 more variable: source <chr>
7d. Create the plot for the variable height
# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"
# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)
# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = height, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Height: Original, Mean Imputed, and Median Imputed", y = "Height") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
6. Outlier Detection
Dr. Neves Faria explains that outliers are single data points that lie far from the rest of the data distribution.
In data analytics, outliers are values within a dataset that vary greatly from the others—they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a measurement, experimental errors, or a novelty.
Outliers can cause anomalies in the results obtained during data analysis. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.
Natural outliers are those which occur “naturally” in the dataset, as opposed to being the result of an error. These naturally occurring outliers are known as novelties. For example, the average height of a giraffe is about 16 feet tall. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.
The Three Main Reasons Why Outliers Occur:
Data Entry Errors: Data entry errors can happen when there is an issue with the data collection process. Mistakes in data entry could be as simple as a misplaced decimal point or a typo in a value.
Instrument Errors: Instrument errors occur when there is a problem with the equipment that collects data.
Natural Variation: Natural variation is a normal occurrence in any system.
Important Considerations
For symmetrical distribution (not necessarily normal) and N > 30, express results as mean ± standard deviation.
For asymmetrical distribution, N < 30, or evidence for outliers, use median ± IQR, which are more robust.
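To illustrate why the median and IQR are called robust, the sketch below summarises a small made-up set of heights twice, once as recorded and once with a single extreme value added. The numbers and the helper name summarise_spread are illustrative, not taken from the dataset.

```r
# Hypothetical heights in cm
heights <- c(164, 166, 168, 169, 170, 175)
heights_outlier <- c(heights, 265)   # add one extreme value

# Return mean, sd, median, and IQR in one named vector
summarise_spread <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), iqr = IQR(x))
}

clean   <- summarise_spread(heights)
shifted <- summarise_spread(heights_outlier)

print(round(clean, 1))
print(round(shifted, 1))
```

The mean and standard deviation move substantially when the extreme value is added, while the median barely changes.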
There are several types of visualizations that you can use depending on the dataset.
By going through the three steps of exploratory data analysis, you’ll have a much better understanding of your data, which will make it easier to choose your model, your attributes, and refine it overall.
Visualizing and Handling Outliers
We can detect outliers using boxplots and handle them appropriately, such as by removing or capping them.
# Boxplot to detect outliers in 'height'
ggplot(df3, aes(x = profession, y = height)) +
  geom_boxplot() +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_boxplot()`).
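By default, a boxplot draws a point as an outlier when it falls more than 1.5 times the interquartile range beyond the edges of the box. The same rule can be applied numerically with base R's boxplot.stats(), sketched here on made-up heights (the values are illustrative assumptions):

```r
# Hypothetical heights in cm, including an NA and two suspicious values
heights <- c(95, 164, 166, 168, 168, 169, 170, 175, 185, 265, NA)

# boxplot.stats() ignores NA and returns the points beyond the whiskers in $out
stats <- boxplot.stats(heights)
flagged <- stats$out
print(flagged)

# Five numbers used to draw the whiskers, box edges, and median line
print(stats$stats)
```

Listing the flagged values makes it possible to inspect each one before deciding whether it is an error or a genuine observation.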
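# Filter for potential outliers in height
# (vision < 60 is TRUE for every non-missing score on the 1-10 scale, so it only drops rows with missing vision)
df3 %>%
  filter(height < 150 & vision < 60) %>%
  print()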
# A tibble: 2 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Marta 47 F S 95 Professional 7 25 21.4
2 Nora 50 F M 145 Professional 6 6 100
# ℹ 1 more variable: source <chr>
# Remove outliers in height that are below a certain threshold (e.g., height < 150)
df_no_outliers <- df3 %>%
  filter(height >= 150)
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Eduardo 28 M S 166 Profession… 8 4.5 56.2
# ℹ 1 more variable: source <chr>
Methods for Handling Outliers
Removing Outliers: Simply removing the outliers from the dataset if they are errors or significantly affect the analysis.
Capping/Flooring Outliers: Replace extreme values with a specified threshold (e.g., capping at the 95th percentile).
Transforming Data: Apply transformations like log or square root to reduce the impact of outliers.
Using Robust Statistical Methods: Use methods that are less sensitive to outliers, such as median instead of mean.
Removing Outliers
# Remove outliers in 'height' below a certain threshold (here, height < 145)
df_no_outliers <- df3 %>%
  filter(height >= 145)
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Eduardo 28 M S 166 Profession… 8 4.5 56.2
# ℹ 1 more variable: source <chr>
# Histogram for height after removing outliers
ggplot(df_no_outliers, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Height After Removing Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Capping/Flooring Outliers
# Cap height values at the 95th percentile
height_95th <- quantile(df3$height, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(height = ifelse(height > height_95th, height_95th, height))
# Inspect the modified dataset
head(df_capped)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 189. Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 1 more variable: source <chr>
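The code above caps only the upper tail. Capping and flooring together (sometimes called winsorizing) can be sketched with pmin() and pmax(); the 5th and 95th percentile cut-offs and the made-up heights below are illustrative assumptions.

```r
# Hypothetical heights in cm with one low and one high extreme
heights <- c(95, 164, 166, 168, 169, 170, 175, 185, 265, NA)

# Lower and upper cut-offs from the observed values
lower <- quantile(heights, 0.05, na.rm = TRUE)
upper <- quantile(heights, 0.95, na.rm = TRUE)

# pmax() raises values below the floor; pmin() lowers values above the cap
# NA stays NA in both calls
heights_capped <- pmin(pmax(heights, lower), upper)

print(range(heights, na.rm = TRUE))
print(range(heights_capped, na.rm = TRUE))
```

Values between the two cut-offs are left unchanged, so only the extremes are pulled in.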
# Histogram for height after capping outliers
ggplot(df_capped, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Height After Capping Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
Transforming Data
# Apply log transformation to reduce the impact of outliers
df_transformed <- df3 %>%
  mutate(height_log = log(height))
# Inspect the transformed dataset
head(df_transformed)
# A tibble: 6 × 11
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 2 more variables: source <chr>, height_log <dbl>
# Histogram for height after log transformation
ggplot(df_transformed, aes(x = height_log)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Log-Transformed Height", x = "Log(Height)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
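To see how much a log transformation pulls in an extreme value, the sketch below measures how far the largest value sits from the median, in units of the IQR, before and after taking logs. The heights and the helper gap_in_iqr are made up for illustration. Note that log() requires positive values.

```r
# Hypothetical heights in cm with one extreme value
heights <- c(164, 166, 168, 169, 170, 175, 265)

# The natural log compresses large values more than small ones
heights_log <- log(heights)

# Distance of the maximum from the median, in units of the IQR
gap_in_iqr <- function(x) (max(x) - median(x)) / IQR(x)

print(gap_in_iqr(heights))
print(gap_in_iqr(heights_log))
```

In this example the extreme value moves closer to the rest of the data but still stands apart: a log transformation softens outliers rather than removing them.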
Using Robust Statistical Methods
# Calculate median and IQR for height
median_height <- median(df3$height, na.rm = TRUE)
iqr_height <- IQR(df3$height, na.rm = TRUE)
# Print median and IQR
median_height
[1] 168
iqr_height
[1] 11
After calculating the median and IQR, we can visualize the distribution of height to understand the effect of using these robust statistical methods.
# Histogram for height using median and IQR
ggplot(df3, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Height Using Median and IQR", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
8. 👉 Your Turn⤵
TASK:
- Choose a method for handling outliers and apply it to the vision variable.
- Produce a visualization.
- Add a few sentences about what is happening.
# ADD YOUR CODE BELOW with comments
# Example: cap vision values at the 95th percentile
vision_95th <- quantile(df3$vision, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(vision = ifelse(vision > vision_95th, vision_95th, vision))
# Inspect the modified dataset
head(df_capped)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 1 more variable: source <chr>
References:
Priscila Neves Faria. 2024. EDA For Big Data: Missing Values, Box Plot Histogram and Outliers.
Paul Schmidt. 2023. “Bad Data & Outliers.” November 14, 2023. https://schmidtpaul.github.io/dsfair_quarto//ch/rbasics/baddata.html.