By the end of this session, you will be able to:
Import datasets from various formats (CSV, Excel, etc.)
Clean real-world datasets by handling missing values and outliers
Calculate and interpret summary statistics
Perform basic EDA using visualizations
Prepare data for further modeling or analysis
R supports multiple file formats for importing data. The most common are:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Real-world datasets often contain missing values , inconsistent formatting , and outliers .
## 'data.frame': 994 obs. of 9 variables:
## $ id : int 1 2 4 14 16 19 21 27 28 30 ...
## $ sex : chr "MALE" "FEMALE" "FEMALE" "FEMALE" ...
## $ degree : chr "BACHELOR" "BACHELOR" "BACHELOR" "HIGH SCHOOL" ...
## $ income : num 60968 60968 10161 17551 17551 ...
## $ marital: chr "DIVORCED" "MARRIED" "MARRIED" "MARRIED" ...
## $ age : int 53 26 56 40 56 51 30 35 57 54 ...
## $ height : int 72 60 68 65 66 68 62 70 71 71 ...
## $ weight : int 190 97 160 156 210 170 115 180 225 165 ...
## $ hrswrk : int 60 40 20 37 6 50 38 40 40 40 ...
## id sex degree income
## Min. : 1.0 Length:994 Length:994 Min. : 369.5
## 1st Qu.: 648.2 Class :character Class :character 1st Qu.: 15703.8
## Median :1254.5 Mode :character Mode :character Median : 27712.5
## Mean :1271.2 Mean : 36887.2
## 3rd Qu.:1915.8 3rd Qu.: 49882.5
## Max. :2538.0 Max. :158657.0
## marital age height weight
## Length:994 Min. :19.00 Min. :57.00 Min. : 90.0
## Class :character 1st Qu.:33.00 1st Qu.:64.00 1st Qu.:150.0
## Mode :character Median :44.00 Median :67.00 Median :175.0
## Mean :44.49 Mean :67.41 Mean :181.3
## 3rd Qu.:55.00 3rd Qu.:70.00 3rd Qu.:205.0
## Max. :79.00 Max. :79.00 Max. :410.0
## hrswrk
## Min. : 1.00
## 1st Qu.:38.00
## Median :40.00
## Mean :42.64
## 3rd Qu.:50.00
## Max. :89.00
## id sex degree income marital age height weight hrswrk
## 1 1 MALE BACHELOR 60967.50 DIVORCED 53 72 190 60
## 2 2 FEMALE BACHELOR 60967.50 MARRIED 26 60 97 40
## 3 4 FEMALE BACHELOR 10161.25 MARRIED 56 68 160 20
## 4 14 FEMALE HIGH SCHOOL 17551.25 MARRIED 40 65 156 37
## 5 16 MALE HIGH SCHOOL 17551.25 MARRIED 56 66 210 6
## 6 19 MALE LT HIGH SCHOOL 15703.75 MARRIED 51 68 170 50
## id sex degree income marital age height weight hrswrk
## 989 2530 MALE HIGH SCHOOL 33255.00 MARRIED 62 71 215 48
## 990 2531 MALE HIGH SCHOOL 1478.00 NEVER MARRIED 40 71 230 48
## 991 2535 MALE HIGH SCHOOL 33255.00 DIVORCED 56 72 195 46
## 992 2536 MALE HIGH SCHOOL 8313.75 NEVER MARRIED 24 68 145 40
## 993 2537 MALE HIGH SCHOOL 27712.50 NEVER MARRIED 27 68 180 40
## 994 2538 FEMALE HIGH SCHOOL 15703.75 WIDOWED 71 63 140 48
## [1] 994 9
## [1] "id" "sex" "degree" "income" "marital" "age" "height"
## [8] "weight" "hrswrk"
##Check for Missing Values
## [1] 0
## id sex degree income marital age height weight hrswrk
## 0 0 0 0 0 0 0 0 0
Use these to understand central tendency and spread:
## [1] 44.48893
## [1] 44
## [1] 13.12686
## [1] 172.3146
## 25% 50% 75%
## 33 44 55
## [1] 19 79
For categorical variables:
##
## FEMALE MALE
## 487 507
##
## FEMALE MALE
## 0.4899396 0.5100604
To summarize the entire dataset:
## id sex degree income
## Min. : 1.0 Length:994 Length:994 Min. : 369.5
## 1st Qu.: 648.2 Class :character Class :character 1st Qu.: 15703.8
## Median :1254.5 Mode :character Mode :character Median : 27712.5
## Mean :1271.2 Mean : 36887.2
## 3rd Qu.:1915.8 3rd Qu.: 49882.5
## Max. :2538.0 Max. :158657.0
## marital age height weight
## Length:994 Min. :19.00 Min. :57.00 Min. : 90.0
## Class :character 1st Qu.:33.00 1st Qu.:64.00 1st Qu.:150.0
## Mode :character Median :44.00 Median :67.00 Median :175.0
## Mean :44.49 Mean :67.41 Mean :181.3
## 3rd Qu.:55.00 3rd Qu.:70.00 3rd Qu.:205.0
## Max. :79.00 Max. :79.00 Max. :410.0
## hrswrk Age Salary outlier_flag.V1
## Min. : 1.00 Min. :19.00 Min. : 369.5 Mode :logical
## 1st Qu.:38.00 1st Qu.:33.00 1st Qu.: 15703.8 FALSE:948
## Median :40.00 Median :44.00 Median : 27712.5 TRUE :46
## Mean :42.64 Mean :44.49 Mean : 36887.2
## 3rd Qu.:50.00 3rd Qu.:55.00 3rd Qu.: 49882.5
## Max. :89.00 Max. :79.00 Max. :158657.0
## outlier_iqr
## Mode :logical
## FALSE:933
## TRUE :61
##
##
##
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
## vars n mean sd median trimmed mad min max
## id 1 994 1271.18 734.07 1254.5 1270.81 951.83 1.0 2538
## sex* 2 994 1.51 0.50 2.0 1.51 0.00 1.0 2
## degree* 3 994 2.70 1.13 3.0 2.65 1.48 1.0 5
## income 4 994 36887.22 34702.26 27712.5 30548.51 19173.72 369.5 158657
## marital* 5 994 2.28 0.92 2.0 2.21 1.48 1.0 5
## age 6 994 44.49 13.13 44.0 44.30 16.31 19.0 79
## height 7 994 67.41 4.04 67.0 67.41 4.45 57.0 79
## weight 8 994 181.32 41.64 175.0 178.15 37.06 90.0 410
## hrswrk 9 994 42.64 14.39 40.0 42.43 11.86 1.0 89
## Age 10 994 44.49 13.13 44.0 44.30 16.31 19.0 79
## Salary 11 994 36887.22 34702.26 27712.5 30548.51 19173.72 369.5 158657
## outlier_flag 12 994 NaN NA NA NaN NA Inf -Inf
## outlier_iqr 13 994 NaN NA NA NaN NA Inf -Inf
## range skew kurtosis se
## id 2537.0 0.02 -1.20 23.28
## sex* 1.0 -0.04 -2.00 0.02
## degree* 4.0 0.05 -0.43 0.04
## income 158287.5 2.07 4.56 1100.69
## marital* 4.0 0.84 1.03 0.03
## age 60.0 0.11 -0.86 0.42
## height 22.0 0.00 -0.64 0.13
## weight 320.0 0.89 1.59 1.32
## hrswrk 88.0 0.32 1.20 0.46
## Age 60.0 0.11 -0.86 0.42
## Salary 158287.5 2.07 4.56 1100.69
## outlier_flag -Inf NA NA NA
## outlier_iqr -Inf NA NA NA
Histogram – Distribution of Numerical Variables
Boxplot – Detect Outliers
Scatter Plot – Relationship Between Two Numerical Variable
plot(gss$age, gss$income,
main = "Age vs income",
xlab = "Age", ylab = "Income",
pch = 16, col = "red")
Bar Chart – Categorical Variables
barplot(table(gss$marital),
main = "Number of Employees by Marital",
xlab = "Marital", ylab = "Count",
col = "tomato")
Correlation Matrix (for numeric columns only)
Correlation is a statistical measure that describes the strength and direction of a linear relationship between two quantitative variables.
Key Concepts:
Range: Correlation values range from –1 to +1.
+1: Perfect positive linear relationship (as one variable increases, the other increases).
–1: Perfect negative linear relationship (as one increases, the other decreases).
0: No linear relationship
## id income age height weight
## id 1.0000000000 -0.1375496 0.006290912 -0.0003195252 0.10036872
## income -0.1375495601 1.0000000 0.211181494 0.2705246369 0.16934018
## age 0.0062909115 0.2111815 1.000000000 0.0358444335 0.06989901
## height -0.0003195252 0.2705246 0.035844434 1.0000000000 0.48243338
## weight 0.1003687178 0.1693402 0.069899013 0.4824333755 1.00000000
## hrswrk 0.0400281251 0.3430680 -0.013437836 0.2210808260 0.20172883
## Age 0.0062909115 0.2111815 1.000000000 0.0358444335 0.06989901
## Salary -0.1375495601 1.0000000 0.211181494 0.2705246369 0.16934018
## hrswrk Age Salary
## id 0.04002813 0.006290912 -0.1375496
## income 0.34306802 0.211181494 1.0000000
## age -0.01343784 1.000000000 0.2111815
## height 0.22108083 0.035844434 0.2705246
## weight 0.20172883 0.069899013 0.1693402
## hrswrk 1.00000000 -0.013437836 0.3430680
## Age -0.01343784 1.000000000 0.2111815
## Salary 0.34306802 0.211181494 1.0000000
Visualize correlation matrix:
## corrplot 0.95 loaded
corrplot(cor_matrix,
method = "circle", # Shape of the correlation indicators
type = "upper", # Show only the upper triangle
tl.col = "black", # Text label color
tl.srt = 45, # Text label rotation
addCoef.col = "black", # Add correlation coefficients
number.cex = 0.7, # Size of the coefficients
diag = FALSE) # Hide the diagonal
Task 1: Import and Inspect a Dataset
Task 2: Clean the Data
# Handle missing values
#df <- df %>%
# mutate(
# Salary = ifelse(is.na(Salary), median(Salary, na.rm = TRUE), Salary),
# Gender = ifelse(is.na(Gender), "Unknown", Gender))
# Flag outliers
#df$outlier <- ifelse(scale(df$Salary) > 3, TRUE, FALSE)
Task 3: Perform EDA
# Summary stats
#mean(df$Salary, na.rm = TRUE)
#median(df$Salary, na.rm = TRUE)
#sd(df$Salary, na.rm = TRUE)
# Visualization
#hist(df$Salary, col = "lightgreen", main = "Salary Distribution")
#boxplot(df$Salary ~ df$Department, main = "Salary by Department")
#plot(df$Age, df$Salary, main = "Age vs Salary")
Download a public dataset (e.g., from Kaggle or UCI Machine Learning Repository )
Load it into R using read_csv() or read_excel() or import()
Print first few rows and dimensions
Identify and handle missing values
Detect and flag outliers in at least one numeric column
Convert character columns to factor if appropriate
Compute mean, median, standard deviation for at least two numeric variables
Create:
One histogram
One boxplot
One scatter plot
One bar chart or frequency table
Write a short interpretation of your findings