First rename this file as DSC406_001_FA24_WA4_unityID
Introduction
This synthetic data set is designed to study vision quality on a scale from 1 to 10. It includes variables that describe individuals’ demographic information, vision scores, and their ability to perceive distance. The data set is intended to explore patterns and relationships between various features and vision quality.
Our goal here is to look at the relationship between two variables: ages and vision.
| Variable Name | Class/Type | Description | Example |
| --- | --- | --- | --- |
| Person | | Name of the individual | “Andrés” |
| Ages | | Age of the individual | 25 |
| Gender | | Gender of the individual (M/F) | “M” |
| Civil state | | Marital status (e.g., Single, Married) | “S” |
| Height | | Height of the individual in centimeters | 168 |
| Profession | | Occupation of the individual | “Student” |
| Vision | | Vision score on a scale of 1 to 10 | 8 |
| Distance | | Distance measured (in unspecified units) | 4.5 |
| PercDist | | Percentage distance perceived by individual | 50 |
1. Prepare
Load the libraries that we will use today.
# Install and load necessary libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Check if the package is installed. If not, install it.
if (!require(janitor)) {
  install.packages("janitor", dependencies = TRUE)
}
Loading required package: janitor
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(janitor)
# Check if the package is installed. If not, install it.
if (!require(naniar)) {
  install.packages("naniar", dependencies = TRUE)
}
Loading required package: naniar
library(naniar)
# Check if the package is installed. If not, install it.
if (!require(VIM)) {
  install.packages("VIM", dependencies = TRUE)
}
Loading required package: VIM
Loading required package: colorspace
Loading required package: grid
VIM is ready to use.
Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
Attaching package: 'VIM'
The following object is masked from 'package:datasets':
sleep
library(VIM)
Read in the data
# Read the CSV from the URL and save it to the df3 object
df3 <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKH71cmJ5WDE11pDnJNYn4AIrpZ1yrgg8oR7GRyipqiBvj4w7YnuAMsgXjTIfo0A3JIMtnZicWDrTQ/pub?gid=129124664&single=true&output=csv")
Rows: 29 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Person, Gender, Civil state, Profession
dbl (6): Ages, Height, Vision, Distance, PercDist, Salary
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Inspect the data with a function of your choice
# Add your code below
2. Understanding the Structure and Cleaning the Data
1. 👉 Your Turn⤵
TASK: Try to understand the structure of the data. Use a function to understand the structure.
# ADD YOUR CODE BELOW with comments
# Show the structure of the data: dimensions, column names, and column types
str(df3)
spc_tbl_ [29 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Person : chr [1:29] "Andrés" "Anja" "Armando" "Carlos" ...
$ Ages : num [1:29] 25 29 31 25 23 39 28 37 NA 46 ...
$ Gender : chr [1:29] "M" "F" "M" "M" ...
$ Civil state: chr [1:29] "S" "S" "S" "M" ...
$ Height : num [1:29] 265 168 169 185 170 100 166 NA NA 168 ...
$ Profession : chr [1:29] "Student" "Professional" "Professional" "Professional" ...
$ Vision : num [1:29] 2 10 9 8 10 NA 8 NA NA 8 ...
$ Distance : num [1:29] 0.1 4.5 4.5 6 3 5 4.5 6 NA 4.5 ...
$ PercDist : num [1:29] 15 45 50 75 30 ...
$ Salary : num [1:29] NA NA 33000 NA NA NA NA 60000 NA NA ...
- attr(*, "spec")=
.. cols(
.. Person = col_character(),
.. Ages = col_double(),
.. Gender = col_character(),
.. `Civil state` = col_character(),
.. Height = col_double(),
.. Profession = col_character(),
.. Vision = col_double(),
.. Distance = col_double(),
.. PercDist = col_double(),
.. Salary = col_double()
.. )
- attr(*, "problems")=<externalptr>
2. 👉 Your Turn⤵
TASK: Answer the guiding questions and fill in the table.
What do you see, notice and wonder about?
It seems the data contains health and personal information about 29 people.
Fill in the Class/Type below:
| Variable Name | Class/Type | Description | Example |
| --- | --- | --- | --- |
| Person | Nominal/String | Name of the individual | “Andrés” |
| Ages | Ratio/Double | Age of the individual | 25 |
| Gender | Binary/String | Gender of the individual (M/F) | “M” |
| Civil state | Nominal/String | Marital status (e.g., Single, Married) | “S” |
| Height | Ratio/Double | Height of the individual in centimeters | 168 |
| Profession | Nominal/String | Occupation of the individual | “Student” |
| Vision | Interval/Double | Vision score on a scale of 1 to 10 | 8 |
| Distance | Ratio/Double | Distance measured (in unspecified units) | 4.5 |
| PercDist | Ratio/Double | Percentage distance perceived by individual | 50 |
| Salary | Ratio/Double | Salary of the individual | 33000 |
What is the structure of the data? (Rows/columns)
Our data structure has 29 observations/entries each detailed through 10 features.
What are the variable types?
There are different variable types: character strings for names and categories (e.g., person, gender, profession), and doubles for numerical data such as salary or distance.
Are there any wrangling steps needed?
Yes, some wrangling is needed to address missing values, particularly in columns like height. Mean or median imputation can be applied to fill in these gaps for consistent analysis.
Are there any things you will need to complete while wrangling the data?
We’ll need to create imputed versions of the data and add a source column to each dataset for comparison. This will make it easier to analyze the differences between the original and imputed datasets.
Correct Column Names without renaming.
Our column names are not in a desirable format. To deal with this, we can use the clean_names() function from {janitor}.
This package has several more handy functions for cleaning data that are worth checking out. Remember you can always use ?janitor to read about the functions in the “help” pane.
# Clean names with the janitor package and save the result back to the df3 object
df3 <- df3 %>%
  clean_names()
# Check the dataset structure
df3
# A tibble: 29 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Armando 31 M S 169 Professio… 9 4.5 50
4 Carlos 25 M M 185 Professio… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Professio… NA 5 75
7 Eduardo 28 M S 166 Professio… 8 4.5 56.2
8 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
9 Fanny NA F M NA Student NA NA 33.3
10 Francis… 46 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
# ℹ 1 more variable: salary <dbl>
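To see what the name cleaning does in isolation, here is a minimal sketch using make_clean_names() from {janitor}, which works on a plain character vector. The vector raw_names is a hypothetical stand-in that copies three of the original column names; it is not part of the assignment.

```r
library(janitor)

# Hypothetical vector holding three of the original column names
raw_names <- c("Person", "Civil state", "PercDist")

# make_clean_names() converts names to lower-case snake_case
clean <- make_clean_names(raw_names)
clean
```

The cleaned names should match the column headers in the tibble printed above (person, civil_state, perc_dist).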
3. 👉 Your Turn⤵
TASK: If you haven’t done this already use the summary() function to get a feeling for this dataset.
# ADD YOUR CODE BELOW with comments
# Summary statistics and NA counts for every column
summary(df3)
person ages gender civil_state
Length:29 Min. :22.00 Length:29 Length:29
Class :character 1st Qu.:25.00 Class :character Class :character
Mode :character Median :27.00 Mode :character Mode :character
Mean :31.83
3rd Qu.:37.50
Max. :55.00
NA's :5
height profession vision distance
Min. : 95 Length:29 Min. : 2.000 Min. : 0.100
1st Qu.:164 Class :character 1st Qu.: 7.000 1st Qu.: 1.500
Median :168 Mode :character Median : 8.500 Median : 3.750
Mean :167 Mean : 8.045 Mean : 4.317
3rd Qu.:175 3rd Qu.:10.000 3rd Qu.: 4.500
Max. :265 Max. :10.000 Max. :25.000
NA's :6 NA's :7 NA's :5
perc_dist salary
Min. : 15.00 Min. : 33000
1st Qu.: 20.24 1st Qu.: 46500
Median : 40.18 Median : 60000
Mean : 45.45 Mean : 71000
3rd Qu.: 57.19 3rd Qu.: 90000
Max. :150.00 Max. :120000
NA's :1 NA's :26
Plot to visualize
We can also use the plot() function to create a quick plot with base R.
4. 👉 Your Turn⤵
TASK:
- Add the plot() function.
- Inside the parentheses, add y = dataframe$vision, followed by a comma.
- Add x = dataframe$ages.
- Run the code.
# ADD YOUR CODE BELOW with comments
# Plot Ages vs Vision using base R
plot(y = df3$vision, x = df3$ages)
Warning: Removed 7 rows containing missing values or values outside the scale range
(`geom_point()`).
TASK What are you noticing here?
Vision is clearly superior among younger people, which is what we would have expected.
Missing Data
Understanding Missing data
Note that missing values (NA) are treated specially in R. For example, if you want to filter for missing values, you cannot write value == NA; you must instead write is.na(value):
# Filter for rows where vision is NA
df3 %>%
  filter(is.na(vision))
# A tibble: 7 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Delfa 39 F M 100 Professional NA 5 75
2 Enrique 37 <NA> <NA> NA Professional NA 6 NA
3 Fanny NA F M NA Student NA NA 33.3
4 Henrry NA M S NA Professional NA NA 60
5 Lorena NA F M NA Student NA NA 42.9
6 Pablo NA M S NA Student NA NA 33.3
7 Silvio NA M S NA Professional NA NA 56.2
# ℹ 1 more variable: salary <dbl>
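The reason value == NA does not work can be seen on a small vector. This is a minimal sketch with a hypothetical toy vector (scores), not data from df3: any comparison with NA yields NA, whereas is.na() returns a usable logical vector.

```r
# Toy vector with one missing value (hypothetical data, not from df3)
scores <- c(8, NA, 10)

# Comparing with == propagates the missingness: every element of the result is NA
scores == NA

# is.na() returns TRUE exactly where a value is missing
is.na(scores)

# Count and locate the missing entries
n_missing <- sum(is.na(scores))
which_missing <- which(is.na(scores))
```

Because filter() keeps only rows where the condition is TRUE, a condition that evaluates to NA silently drops the row, which is why is.na() is needed.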
5. 👉 Your Turn⤵
TASK: Filter missing values for ages
# ADD YOUR CODE BELOW with comments
# Filter for rows where ages is NA
df3 %>%
  filter(is.na(ages))
# A tibble: 5 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Fanny NA F M NA Student NA NA 33.3
2 Henrry NA M S NA Professional NA NA 60
3 Lorena NA F M NA Student NA NA 42.9
4 Pablo NA M S NA Student NA NA 33.3
5 Silvio NA M S NA Professional NA NA 56.2
# ℹ 1 more variable: salary <dbl>
The complete.cases() function examines a data frame and returns a logical vector that is TRUE for complete rows (no missing values) and FALSE for rows that contain at least one missing value.
You can store the result under a name. In this instance, the name will be “rowswithNA”.
❗The exclamation mark (!) is the logical NOT operator in R. So, !complete.cases(df3) is TRUE for the rows that are NOT complete cases, i.e., the rows with at least one missing value.
Here we want to work with rows, so inside the square brackets put rowswithNA (the name you created in the previous step) before the comma and leave the column position after the comma empty.
# Identify rows with missing values using complete.cases()
rowswithNA <- !complete.cases(df3)
df3[rowswithNA, ]
# A tibble: 27 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Carlos 25 M M 185 Professio… 8 6 75
4 Cristina 23 F <NA> 170 Student 10 3 30
5 Delfa 39 F M 100 Professio… NA 5 75
6 Eduardo 28 M S 166 Professio… 8 4.5 56.2
7 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
8 Fanny NA F M NA Student NA NA 33.3
9 Francis… 46 M M 168 Professio… 8 4.5 56.2
10 Franklin 55 M M 165 Professio… 7 4.5 64.3
# ℹ 17 more rows
# ℹ 1 more variable: salary <dbl>
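A minimal sketch of the same idea on a three-row toy data frame makes the logic easier to follow. The data frame toy and its values are hypothetical and only serve as an illustration.

```r
# Hypothetical toy data: row 1 is complete, rows 2 and 3 each contain an NA
toy <- data.frame(
  person = c("A", "B", "C"),
  ages   = c(25, NA, 31),
  vision = c(8, 10, NA)
)

# TRUE for rows without any missing value
complete_rows <- complete.cases(toy)

# Negate to flag rows with at least one missing value
rows_with_na <- !complete_rows

# Keep the flagged rows and all columns (nothing after the comma)
toy[rows_with_na, ]

# Number of incomplete rows
sum(rows_with_na)
```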
In the tidyverse, you can use the group_by() function together with summarise() and is.na() to count missing values per group. The tidyverse offers helpers such as tidyr::drop_na() and tidyr::replace_na(), but no dedicated function for summarising missing values.
# Group by profession and summarise missing vs. not-missing counts for vision
df3 %>%
  group_by(profession) %>%
  summarise(
    n_rows = n(),
    n_NA = sum(is.na(vision)),
    n_notNA = sum(!is.na(vision))
  )
Why use naniar? The naniar package has several advantages over base R or the tidyverse alone when it comes to handling missing data. It provides specialized functions and visualizations that make it easier to understand and explore missing data patterns.
6. 👉 Your Turn⤵
Pick one way to look at the NAs in the variable ages.
# ADD YOUR CODE BELOW with comments
# summary() reports the number of NAs alongside the descriptive statistics
summary(df3$ages)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
22.00 25.00 27.00 31.83 37.50 55.00 5
A. Visualizing Missing Data Patterns
Using {naniar}, we can visualize missing values more effectively. We’ll create a bar chart showing the number of missing values for each variable.
# Bar chart showing missing values by variable
gg_miss_var(df3)
# A tibble: 2 × 3
profession n_miss_vision pct_miss_vision
<chr> <int> <dbl>
1 Professional 4 22.2
2 Student 3 27.3
We can also add shadow columns to track missing values in the dataset. These columns show whether a value was originally missing (NA) or not.
# Add shadow columns to mark missing data
df3_shadow <- bind_shadow(df3)
head(df3_shadow)
# A tibble: 6 × 20
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 11 more variables: salary <dbl>, person_NA <fct>, ages_NA <fct>,
# gender_NA <fct>, civil_state_NA <fct>, height_NA <fct>,
# profession_NA <fct>, vision_NA <fct>, distance_NA <fct>,
# perc_dist_NA <fct>, salary_NA <fct>
# Visualizing missing data patterns
aggr_plot <- aggr(df3, col = c('blue', 'red'), sortVars = TRUE, cex.axis = .7, ylab = c("Missing Data Bar Plot", "Heatmap"))
Variables sorted by number of missings:
Variable Count
salary 0.89655172
vision 0.24137931
height 0.20689655
ages 0.17241379
distance 0.17241379
civil_state 0.06896552
gender 0.03448276
perc_dist 0.03448276
person 0.00000000
profession 0.00000000
5. Handling Missing Values
A. Omit missing data in your data set. You can do this with the na.omit() function and then save the result to an object named omittingNA.
# Omit rows with missing values
omittingNA <- na.omit(df3)
# Inspect using the head() function
head(omittingNA)
# A tibble: 2 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Armando 31 M S 169 Professional 9 4.5 50
2 Iván 29 M S 169 Professional 8 3 37.5
# ℹ 1 more variable: salary <dbl>
B. If the variable is mostly filled with missing values, one possibility would be its exclusion. Here, the variable called salary has many missing values, so we can delete it.
# Delete the salary variable by setting it to NULL
df3$salary <- NULL
# Inspect the dataset
head(df3)
# A tibble: 6 × 9
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
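The decision to drop a mostly empty variable can also be made from the proportion of missing values per column. The following is a minimal sketch on hypothetical toy data; the 0.5 cutoff is an assumption chosen for illustration, not a rule from the assignment.

```r
# Hypothetical toy data: 'salary' is mostly missing, 'ages' is mostly present
toy <- data.frame(
  ages   = c(25, NA, 31, 40),
  salary = c(NA, NA, 33000, NA)
)

# Proportion of missing values in each column
prop_missing <- colMeans(is.na(toy))
prop_missing

# Illustrative cutoff (an assumption): drop columns with more than 50% missing
threshold <- 0.5
toy_reduced <- toy[, prop_missing <= threshold, drop = FALSE]
names(toy_reduced)
```

In df3, the VIM output above reports roughly 90% missing values for salary, so it would be dropped under any cutoff of this kind.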
Mean and Median
Use the mean for normally distributed data without outliers; use the median when the data are skewed or contain outliers.
Hint: the argument na.rm = TRUE is used inside mean function to calculate the mean excluding NA values (missing values)
# Calculate the mean of 'ages', excluding missing values
mean_ages <- mean(df3$ages, na.rm = TRUE)
mean_ages
# A tibble: 29 × 9
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Professio… 10 4.5 45
3 Armando 31 M S 169 Professio… 9 4.5 50
4 Carlos 25 M M 185 Professio… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Professio… NA 5 75
7 Eduardo 28 M S 166 Professio… 8 4.5 56.2
8 Enrique 37 <NA> <NA> NA Professio… NA 6 NA
9 Fanny 27 F M NA Student NA NA 33.3
10 Francis… 46 M M 168 Professio… 8 4.5 56.2
# ℹ 19 more rows
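The code that creates the imputed data frames is not shown in this section. In the table above, Fanny's age is 27, which matches the median of ages reported by summary() (27.00) rather than the mean (31.83), so the table appears to be the median-imputed version. A minimal sketch of how both versions could be built, assuming the same mutate() and ifelse() pattern used later for height and a hypothetical toy tibble:

```r
library(dplyr)

# Hypothetical toy data with one missing age
toy <- tibble(
  person = c("A", "B", "C", "D"),
  ages   = c(22, 26, NA, 60)
)

# Replace missing ages with the mean of the observed ages
df_mean_imputed <- toy %>%
  mutate(ages = ifelse(is.na(ages), mean(ages, na.rm = TRUE), ages))

# Replace missing ages with the median of the observed ages
df_median_imputed <- toy %>%
  mutate(ages = ifelse(is.na(ages), median(ages, na.rm = TRUE), ages))

df_mean_imputed
df_median_imputed
```

In the toy data the single large value (60) pulls the mean up to 36, while the median stays at 26, which is why the two imputed versions differ.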
Let's compare the sources before and after the change.
# Add a column indicating the dataset source for comparison
df3$source <- "Original"
df_mean_imputed$source <- "Mean Imputed"
df_median_imputed$source <- "Median Imputed"
# Combine the data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed, df_median_imputed)
# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = ages, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Ages: Original, Mean Imputed, and Median Imputed", y = "Ages") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
7. 👉 Your Turn⤵
Compare the mean-imputed, median-imputed, and original (unchanged) versions of height.
7a. Calculate the mean
# ADD YOUR CODE BELOW with comments
# Calculate the mean of 'height', excluding missing values
mean_height <- mean(df3$height, na.rm = TRUE)
mean_height
[1] 166.9565
7b. Impute the Mean
# ADD YOUR CODE BELOW with comments
# Impute missing values in 'height' using the mean
df_mean_imputed1 <- df3 %>%
  mutate(height = ifelse(is.na(height), mean(height, na.rm = TRUE), height))
7c. Calculate the Median
# ADD YOUR CODE BELOW with comments
# Calculate the median of 'height', excluding missing values
median_height <- median(df3$height, na.rm = TRUE)
median_height
# Impute missing values in 'height' using the median
df_median_imputed1 <- df3 %>%
  mutate(height = ifelse(is.na(height), median_height, height))
# ADD YOUR CODE BELOW with comments
# Histogram of the original 'height' values
hist(df3$height, col = "red", border = "blue")
# Add a column indicating the dataset source for comparison
# (the source column belongs on the data frames, not on the scalar mean or median)
df3$source <- "Original"
df_mean_imputed1$source <- "Mean Imputed"
df_median_imputed1$source <- "Median Imputed"
# Combine the height-imputed data frames into one for comparison
combined_df <- bind_rows(df3, df_mean_imputed1, df_median_imputed1)
# Boxplot to compare the three datasets side by side
ggplot(combined_df, aes(x = source, y = height, fill = source)) +
  geom_boxplot(alpha = 0.4, na.rm = TRUE) +
  labs(title = "Comparison of Height: Original, Mean Imputed, and Median Imputed", y = "Height") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
6. Outlier Detection
Dr. Neves Faria explains that outliers are single data points that lie far from the rest of the data distribution.
In data analytics, outliers are values within a dataset that vary greatly from the others—they’re either much larger, or significantly smaller. Outliers may indicate variabilities in a measurement, experimental errors, or a novelty.
Outliers can cause anomalies in the results obtained during data analysis. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.
Natural outliers are those which occur “naturally” in the dataset, as opposed to being the result of an error. These naturally occurring outliers are known as novelties. For example, the average height of a giraffe is about 16 feet. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.
The Three Main Reasons Why Outliers Occur:
Data Entry Errors: Data entry errors can happen when there is an issue with the data collection process. Mistakes in data entry could be as simple as a misplaced decimal point or a typo in a value.
Instrument Errors: Instrument errors occur when there is a problem with the equipment that collects data.
Natural Variation: Natural variation is a normal occurrence in any system.
Important Considerations
For symmetrical distribution (not necessarily normal) and N > 30, express results as mean ± standard deviation.
For asymmetrical distribution, N < 30, or evidence for outliers, use median ± IQR, which are more robust.
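To illustrate why the median and IQR are described as more robust, here is a minimal sketch that compares both pairs of summaries with and without a single extreme value. The height values are hypothetical toy data.

```r
# Hypothetical heights in cm, first without and then with one extreme value
heights         <- c(164, 166, 168, 170, 175)
heights_outlier <- c(heights, 265)

# Mean and standard deviation react strongly to the extreme value
mean(heights)
mean(heights_outlier)
sd(heights)
sd(heights_outlier)

# Median and IQR change far less
median(heights)
median(heights_outlier)
IQR(heights)
IQR(heights_outlier)

# Size of the shift caused by the single extreme value
mean_shift   <- mean(heights_outlier) - mean(heights)
median_shift <- median(heights_outlier) - median(heights)
```

Here the mean moves by about 16 cm while the median moves by 1 cm, which is the behaviour the guideline above relies on.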
There are several types of visualizations that you can use depending on the dataset.
By going through the three steps of exploratory data analysis, you’ll have a much better understanding of your data, which will make it easier to choose your model, your attributes, and refine it overall.
Visualizing and Handling Outliers
We can detect outliers using boxplots and handle them appropriately, such as by removing or capping them.
# Boxplot to detect outliers in 'height'
ggplot(df3, aes(x = profession, y = height)) +
  geom_boxplot() +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_boxplot()`).
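# Filter for potential outliers in height
# Note: vision < 60 is TRUE for every non-missing score (the scale is 1 to 10),
# so this condition only drops rows where vision is NA
df3 %>%
  filter(height < 150 & vision < 60) %>%
  print()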
# A tibble: 2 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Marta 47 F S 95 Professional 7 25 21.4
2 Nora 50 F M 145 Professional 6 6 100
# ℹ 1 more variable: source <chr>
# Remove outliers in height that are below a certain threshold (e.g., height < 150)
# filter() also drops rows where height is NA
df_no_outliers <- df3 %>%
  filter(height >= 150)
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Eduardo 28 M S 166 Profession… 8 4.5 56.2
# ℹ 1 more variable: source <chr>
Methods for Handling Outliers
Removing Outliers: Simply removing the outliers from the dataset if they are errors or significantly affect the analysis.
Capping/Flooring Outliers: Replace extreme values with a specified threshold (e.g., capping at the 95th percentile).
Transforming Data: Apply transformations like log or square root to reduce the impact of outliers.
Using Robust Statistical Methods: Use methods that are less sensitive to outliers, such as median instead of mean.
Removing Outliers
# Remove outliers in 'height' below a certain threshold (here, height < 145)
df_no_outliers <- df3 %>%
  filter(height >= 145)
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Eduardo 28 M S 166 Profession… 8 4.5 56.2
# ℹ 1 more variable: source <chr>
# Histogram for height after removing outliers
ggplot(df_no_outliers, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Height After Removing Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Capping/Flooring Outliers
# Cap height values at the 95th percentile
height_95th <- quantile(df3$height, 0.95, na.rm = TRUE)
df_capped <- df3 %>%
  mutate(height = ifelse(height > height_95th, height_95th, height))
# Inspect the modified dataset
head(df_capped)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 189. Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 1 more variable: source <chr>
# Histogram for height after capping outliers
ggplot(df_capped, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Height After Capping Outliers", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
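The example above caps only the upper tail. A minimal sketch of capping and flooring together, assuming the 5th and 95th percentiles as cutoffs (an illustrative choice) and a hypothetical toy vector of heights:

```r
# Hypothetical heights in cm with one low value, one high value, and one NA
heights <- c(95, 164, 168, 170, 175, 265, NA)

# Lower and upper cutoffs (assumed: 5th and 95th percentiles)
lower <- unname(quantile(heights, 0.05, na.rm = TRUE))
upper <- unname(quantile(heights, 0.95, na.rm = TRUE))

# pmax() raises values below the floor, pmin() lowers values above the cap;
# missing values stay missing
capped <- pmin(pmax(heights, lower), upper)
capped
```

Values inside the cutoffs are left unchanged, so only the two extremes move.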
Transforming Data
# Apply log transformation to reduce the impact of outliers
df_transformed <- df3 %>%
  mutate(height_log = log(height))
# Inspect the transformed dataset
head(df_transformed)
# A tibble: 6 × 11
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Andrés 25 M S 265 Student 2 0.1 15
2 Anja 29 F S 168 Profession… 10 4.5 45
3 Armando 31 M S 169 Profession… 9 4.5 50
4 Carlos 25 M M 185 Profession… 8 6 75
5 Cristina 23 F <NA> 170 Student 10 3 30
6 Delfa 39 F M 100 Profession… NA 5 75
# ℹ 2 more variables: source <chr>, height_log <dbl>
# Histogram for height after log transformation
ggplot(df_transformed, aes(x = height_log)) +
  geom_histogram(binwidth = 0.2, fill = "lightcoral", color = "black") +
  labs(title = "Histogram of Log-Transformed Height", x = "Log(Height)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
Using Robust Statistical Methods
# Calculate median and IQR for height
median_height <- median(df3$height, na.rm = TRUE)
iqr_height <- IQR(df3$height, na.rm = TRUE)
# Print median and IQR
median_height
[1] 168
iqr_height
[1] 11
After calculating the median and IQR, we can visualize the distribution of height to understand the effect of using these robust statistical methods.
# Histogram for height using median and IQR
ggplot(df3, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(title = "Histogram of Height Using Median and IQR", x = "Height (cm)", y = "Frequency") +
  theme_minimal()
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_bin()`).
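One common way to turn the quartiles and IQR into an outlier rule is the 1.5 × IQR rule used for boxplot whiskers. The sketch below applies it to a hypothetical toy vector chosen so that its quartiles (164 and 175) match those reported for height by summary(); it is an illustration, not the method prescribed by the assignment.

```r
# Hypothetical heights in cm, including one NA
heights <- c(95, 145, 164, 166, 168, 170, 175, 185, 265, NA)

# First and third quartiles, ignoring missing values
q1 <- unname(quantile(heights, 0.25, na.rm = TRUE))
q3 <- unname(quantile(heights, 0.75, na.rm = TRUE))
iqr <- q3 - q1

# Fences: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Flag non-missing values outside the fences
is_outlier <- !is.na(heights) & (heights < lower_fence | heights > upper_fence)
outliers <- heights[is_outlier]
outliers
```

With quartiles of 164 and 175 the fences fall at 147.5 and 191.5, so in this toy vector the values 95, 145, and 265 are flagged.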
8. 👉 Your Turn⤵
TASK:
- Choose a method for handling outliers and apply it to the vision variable.
- Produce a visualization.
- Add a few sentences about what is happening.
# Remove outliers in 'vision' below a certain threshold (e.g., vision < 3)
df_no_outliers <- df3 %>%
  filter(vision > 2)
# Inspect the cleaned dataset
head(df_no_outliers)
# A tibble: 6 × 10
person ages gender civil_state height profession vision distance perc_dist
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Anja 29 F S 168 Professio… 10 4.5 45
2 Armando 31 M S 169 Professio… 9 4.5 50
3 Carlos 25 M M 185 Professio… 8 6 75
4 Cristina 23 F <NA> 170 Student 10 3 30
5 Eduardo 28 M S 166 Professio… 8 4.5 56.2
6 Francisco 46 M M 168 Professio… 8 4.5 56.2
# ℹ 1 more variable: source <chr>
# ADD YOUR CODE BELOW with comments
# Histogram for vision after removing outliers
# (binwidth = 1 gives one bar per score on the 1 to 10 scale)
ggplot(df_no_outliers, aes(x = vision)) +
  geom_histogram(binwidth = 1, fill = "gold", color = "purple") +
  labs(title = "Histogram of Vision After Removing Outliers", x = "Vision score (1 to 10)", y = "Frequency") +
  theme_minimal()
Next week we will take care of factoring data, decide on handling missing data and outliers. Then we will do regressions.
References:
Priscila Neves Faria. 2024. EDA For Big Data: Missing Values, Box Plot Histogram and Outliers.
Paul Schmidt. 2023. “Bad Data & Outliers.” November 14, 2023. https://schmidtpaul.github.io/dsfair_quarto//ch/rbasics/baddata.html.