——————————————————————————————–

Facilitator: CDAM Experts

——————————————————————————————–

Session 2: Data Import, Cleaning, and Exploratory Data Analysis (EDA)

Learning Objectives:

By the end of this session, you will be able to:

1. Importing Data into R

R supports multiple file formats for importing data. The most common are:

Importing from CSV

# Base R
#df <- read.csv("data.csv", header = TRUE)

# Using tidyverse (faster and more user-friendly)
library(readr)
#df <- read_csv("data.csv")

Importing from Excel

library(readxl)
#df <- read_excel("data.xlsx", sheet = "Sheet1")

Importing from any file except a few specific Excel files

# Using "rio" package  (faster and more user-friendly)
library(rio)
#df <- import("data.csv")

Getting Started

set a working directory Default location where R looks for files and saves outputs

setwd("~/2025_R_TRAINING") # It tells R where to look for files and where to save files

Install and load the packages

library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(dplyr)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Load dataset (Import dataset)

gss = read.csv("GSSsubset.csv")

2. Data Cleaning

Real-world datasets often contain missing values , inconsistent formatting , and outliers .

Inspect Data Structure

str(gss)          # View structure
## 'data.frame':    994 obs. of  9 variables:
##  $ id     : int  1 2 4 14 16 19 21 27 28 30 ...
##  $ sex    : chr  "MALE" "FEMALE" "FEMALE" "FEMALE" ...
##  $ degree : chr  "BACHELOR" "BACHELOR" "BACHELOR" "HIGH SCHOOL" ...
##  $ income : num  60968 60968 10161 17551 17551 ...
##  $ marital: chr  "DIVORCED" "MARRIED" "MARRIED" "MARRIED" ...
##  $ age    : int  53 26 56 40 56 51 30 35 57 54 ...
##  $ height : int  72 60 68 65 66 68 62 70 71 71 ...
##  $ weight : int  190 97 160 156 210 170 115 180 225 165 ...
##  $ hrswrk : int  60 40 20 37 6 50 38 40 40 40 ...
summary(gss)      # Summary statistics
##        id             sex               degree              income        
##  Min.   :   1.0   Length:994         Length:994         Min.   :   369.5  
##  1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8  
##  Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5  
##  Mean   :1271.2                                         Mean   : 36887.2  
##  3rd Qu.:1915.8                                         3rd Qu.: 49882.5  
##  Max.   :2538.0                                         Max.   :158657.0  
##    marital               age            height          weight     
##  Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0  
##  Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0  
##  Mode  :character   Median :44.00   Median :67.00   Median :175.0  
##                     Mean   :44.49   Mean   :67.41   Mean   :181.3  
##                     3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0  
##                     Max.   :79.00   Max.   :79.00   Max.   :410.0  
##      hrswrk     
##  Min.   : 1.00  
##  1st Qu.:38.00  
##  Median :40.00  
##  Mean   :42.64  
##  3rd Qu.:50.00  
##  Max.   :89.00
head(gss)         # First few rows
##   id    sex         degree   income  marital age height weight hrswrk
## 1  1   MALE       BACHELOR 60967.50 DIVORCED  53     72    190     60
## 2  2 FEMALE       BACHELOR 60967.50  MARRIED  26     60     97     40
## 3  4 FEMALE       BACHELOR 10161.25  MARRIED  56     68    160     20
## 4 14 FEMALE    HIGH SCHOOL 17551.25  MARRIED  40     65    156     37
## 5 16   MALE    HIGH SCHOOL 17551.25  MARRIED  56     66    210      6
## 6 19   MALE LT HIGH SCHOOL 15703.75  MARRIED  51     68    170     50
tail(gss)         # Last few rows
##       id    sex      degree   income       marital age height weight hrswrk
## 989 2530   MALE HIGH SCHOOL 33255.00       MARRIED  62     71    215     48
## 990 2531   MALE HIGH SCHOOL  1478.00 NEVER MARRIED  40     71    230     48
## 991 2535   MALE HIGH SCHOOL 33255.00      DIVORCED  56     72    195     46
## 992 2536   MALE HIGH SCHOOL  8313.75 NEVER MARRIED  24     68    145     40
## 993 2537   MALE HIGH SCHOOL 27712.50 NEVER MARRIED  27     68    180     40
## 994 2538 FEMALE HIGH SCHOOL 15703.75       WIDOWED  71     63    140     48
dim(gss)          # Dimensions
## [1] 994   9
colnames(gss)     # Column names
## [1] "id"      "sex"     "degree"  "income"  "marital" "age"     "height" 
## [8] "weight"  "hrswrk"

Handling Missing Values

##Check for Missing Values

sum(is.na(gss))            # Total NA values
## [1] 0
colSums(is.na(gss))        # Missing per column
##      id     sex  degree  income marital     age  height  weight  hrswrk 
##       0       0       0       0       0       0       0       0       0

Remove Rows with Missing Values

gss_clean <- na.omit(gss)   # Remove all rows with any NA

Impute Missing Values

library(dplyr)

gss <- gss %>%
  mutate(
    Age = ifelse(is.na(age), mean(age, na.rm = TRUE), age),
    Salary = ifelse(is.na(income), median(income, na.rm = TRUE), income))

Detecting and Handling Outliers

Visual Inspection (Boxplot)

boxplot(gss$age, main = "Age Distribution", ylab = "Age")

Z-Score Method

z_scores <- scale(gss$income)
outliers <- which(abs(z_scores) > 3, arr.ind = TRUE)
gss$outlier_flag <- ifelse(abs(z_scores) > 3, TRUE, FALSE)

IQR Method

Q1 <- quantile(gss$income, 0.25)
Q3 <- quantile(gss$income, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

gss$outlier_iqr <- ifelse(gss$income < lower_bound | gss$income > upper_bound, TRUE, FALSE)

3. Summary Statistics

Use these to understand central tendency and spread:

mean(gss$age, na.rm = TRUE)
## [1] 44.48893
median(gss$age, na.rm = TRUE)
## [1] 44
sd(gss$age, na.rm = TRUE)
## [1] 13.12686
var(gss$age, na.rm = TRUE)
## [1] 172.3146
quantile(gss$age, probs = c(0.25, 0.5, 0.75))
## 25% 50% 75% 
##  33  44  55
range(gss$age, na.rm = TRUE)
## [1] 19 79

For categorical variables:

table(gss$sex)             # Frequency table
## 
## FEMALE   MALE 
##    487    507
prop.table(table(gss$sex)) # Proportions
## 
##    FEMALE      MALE 
## 0.4899396 0.5100604

To summarize the entire dataset:

summary(gss)
##        id             sex               degree              income        
##  Min.   :   1.0   Length:994         Length:994         Min.   :   369.5  
##  1st Qu.: 648.2   Class :character   Class :character   1st Qu.: 15703.8  
##  Median :1254.5   Mode  :character   Mode  :character   Median : 27712.5  
##  Mean   :1271.2                                         Mean   : 36887.2  
##  3rd Qu.:1915.8                                         3rd Qu.: 49882.5  
##  Max.   :2538.0                                         Max.   :158657.0  
##    marital               age            height          weight     
##  Length:994         Min.   :19.00   Min.   :57.00   Min.   : 90.0  
##  Class :character   1st Qu.:33.00   1st Qu.:64.00   1st Qu.:150.0  
##  Mode  :character   Median :44.00   Median :67.00   Median :175.0  
##                     Mean   :44.49   Mean   :67.41   Mean   :181.3  
##                     3rd Qu.:55.00   3rd Qu.:70.00   3rd Qu.:205.0  
##                     Max.   :79.00   Max.   :79.00   Max.   :410.0  
##      hrswrk           Age            Salary         outlier_flag.V1
##  Min.   : 1.00   Min.   :19.00   Min.   :   369.5   Mode :logical  
##  1st Qu.:38.00   1st Qu.:33.00   1st Qu.: 15703.8   FALSE:948      
##  Median :40.00   Median :44.00   Median : 27712.5   TRUE :46       
##  Mean   :42.64   Mean   :44.49   Mean   : 36887.2                  
##  3rd Qu.:50.00   3rd Qu.:55.00   3rd Qu.: 49882.5                  
##  Max.   :89.00   Max.   :79.00   Max.   :158657.0                  
##  outlier_iqr    
##  Mode :logical  
##  FALSE:933      
##  TRUE :61       
##                 
##                 
## 
describe(gss)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##              vars   n     mean       sd  median  trimmed      mad   min    max
## id              1 994  1271.18   734.07  1254.5  1270.81   951.83   1.0   2538
## sex*            2 994     1.51     0.50     2.0     1.51     0.00   1.0      2
## degree*         3 994     2.70     1.13     3.0     2.65     1.48   1.0      5
## income          4 994 36887.22 34702.26 27712.5 30548.51 19173.72 369.5 158657
## marital*        5 994     2.28     0.92     2.0     2.21     1.48   1.0      5
## age             6 994    44.49    13.13    44.0    44.30    16.31  19.0     79
## height          7 994    67.41     4.04    67.0    67.41     4.45  57.0     79
## weight          8 994   181.32    41.64   175.0   178.15    37.06  90.0    410
## hrswrk          9 994    42.64    14.39    40.0    42.43    11.86   1.0     89
## Age            10 994    44.49    13.13    44.0    44.30    16.31  19.0     79
## Salary         11 994 36887.22 34702.26 27712.5 30548.51 19173.72 369.5 158657
## outlier_flag   12 994      NaN       NA      NA      NaN       NA   Inf   -Inf
## outlier_iqr    13 994      NaN       NA      NA      NaN       NA   Inf   -Inf
##                 range  skew kurtosis      se
## id             2537.0  0.02    -1.20   23.28
## sex*              1.0 -0.04    -2.00    0.02
## degree*           4.0  0.05    -0.43    0.04
## income       158287.5  2.07     4.56 1100.69
## marital*          4.0  0.84     1.03    0.03
## age              60.0  0.11    -0.86    0.42
## height           22.0  0.00    -0.64    0.13
## weight          320.0  0.89     1.59    1.32
## hrswrk           88.0  0.32     1.20    0.46
## Age              60.0  0.11    -0.86    0.42
## Salary       158287.5  2.07     4.56 1100.69
## outlier_flag     -Inf    NA       NA      NA
## outlier_iqr      -Inf    NA       NA      NA

4. Introduction to Data Visualization for EDA

Histogram – Distribution of Numerical Variables

hist(gss$age, col = "blue", main = "Age Distribution", xlab = "Age")

Boxplot – Detect Outliers

boxplot(gss$income ~ gss$sex,
        main = "Income by Sex",
        ylab = "Income", xlab = "Sex")

Scatter Plot – Relationship Between Two Numerical Variable

plot(gss$age, gss$income,
     main = "Age vs income",
     xlab = "Age", ylab = "Income",
     pch = 16, col = "red")

Bar Chart – Categorical Variables

barplot(table(gss$marital),
        main = "Number of Employees by Marital",
        xlab = "Marital", ylab = "Count",
        col = "tomato")

Correlation Matrix (for numeric columns only)

Correlation is a statistical measure that describes the strength and direction of a linear relationship between two quantitative variables.

Key Concepts:

Range: Correlation values range from –1 to +1.

+1: Perfect positive linear relationship (as one variable increases, the other increases).

–1: Perfect negative linear relationship (as one increases, the other decreases).

0: No linear relationship

cor_matrix <- cor(select(gss, where(is.numeric)), use = "complete.obs")
print(cor_matrix)
##                   id     income          age        height     weight
## id      1.0000000000 -0.1375496  0.006290912 -0.0003195252 0.10036872
## income -0.1375495601  1.0000000  0.211181494  0.2705246369 0.16934018
## age     0.0062909115  0.2111815  1.000000000  0.0358444335 0.06989901
## height -0.0003195252  0.2705246  0.035844434  1.0000000000 0.48243338
## weight  0.1003687178  0.1693402  0.069899013  0.4824333755 1.00000000
## hrswrk  0.0400281251  0.3430680 -0.013437836  0.2210808260 0.20172883
## Age     0.0062909115  0.2111815  1.000000000  0.0358444335 0.06989901
## Salary -0.1375495601  1.0000000  0.211181494  0.2705246369 0.16934018
##             hrswrk          Age     Salary
## id      0.04002813  0.006290912 -0.1375496
## income  0.34306802  0.211181494  1.0000000
## age    -0.01343784  1.000000000  0.2111815
## height  0.22108083  0.035844434  0.2705246
## weight  0.20172883  0.069899013  0.1693402
## hrswrk  1.00000000 -0.013437836  0.3430680
## Age    -0.01343784  1.000000000  0.2111815
## Salary  0.34306802  0.211181494  1.0000000

Visualize correlation matrix:

library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "circle", type = "upper")

corrplot(cor_matrix,
         method = "circle",      # Shape of the correlation indicators
         type = "upper",         # Show only the upper triangle
         tl.col = "black",       # Text label color
         tl.srt = 45,            # Text label rotation
         addCoef.col = "black", # Add correlation coefficients
         number.cex = 0.7,       # Size of the coefficients
         diag = FALSE)          # Hide the diagonal

5. Hands-On Practice

Task 1: Import and Inspect a Dataset

# Import dataset
#df <- read_csv("employee_data.csv")

# Inspect
#str(df)
#summary(df)
#head(df)

Task 2: Clean the Data

# Handle missing values
#df <- df %>%
#  mutate(
 #   Salary = ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary),
  #  Gender = ifelse(is.na(Gender), "Unknown", Gender))

# Flag outliers
#df$outlier <- ifelse(scale(df$Salary) > 3, TRUE, FALSE)

Task 3: Perform EDA

# Summary stats
#mean(df$Salary, na.rm = TRUE)
#median(df$Salary, na.rm = TRUE)
#sd(df$Salary, na.rm = TRUE)

# Visualization
#hist(df$Salary, col = "lightgreen", main = "Salary Distribution")
#boxplot(df$Salary ~ df$Department, main = "Salary by Department")
#plot(df$Age, df$Salary, main = "Age vs Salary")

6. Homework Assignment

Exercise A: Import a Dataset

  • Download a public dataset (e.g., from Kaggle or UCI Machine Learning Repository )

  • Load it into R using read_csv() or read_excel() or import()

  • Print first few rows and dimensions

Exercise B: Clean the Data

  • Identify and handle missing values

  • Detect and flag outliers in at least one numeric column

  • Convert character columns to factor if appropriate

Exercise C: Perform EDA

  • Compute mean, median, standard deviation for at least two numeric variables

  • Create:

    • One histogram

    • One boxplot

    • One scatter plot

    • One bar chart or frequency table

  • Write a short interpretation of your findings