This report demonstrates three techniques for outlier detection and handling on a weather dataset:
Z-Score Normalization
IQR (Interquartile Range) Outlier Detection
Winsorizing Extreme Values
# Load required packages
library(readxl)  # read_excel()
library(dplyr)   # %>%, mutate(), filter()
# Load the dataset
weather_data <- read_excel("6 Data Transformation – Data Science Programming.xlsx", sheet = "Sheet1", skip = 1)
## New names:
## • `` -> `...1`
# Rename columns for clarity
colnames(weather_data) <- c("ID", "Observation_ID", "Date", "Location", "Season", "Temperature", "Humidity", "Rainfall", "Wind_Speed")
# Display first few rows
head(weather_data)
## # A tibble: 6 × 9
## ID Observation_ID Date Location Season Temperature Humidity
## <dbl> <chr> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 JpeDLWuzIJdl 2021-07-14 00:00:00 Jakarta Dry Se… 30.7 89.5
## 2 2 EF8r6hXBCqfr 2020-11-16 00:00:00 Bandung Rainy … 26.2 70.1
## 3 3 cov0TYDwyQoF 2023-03-22 00:00:00 Makassar Transi… 27.5 80.7
## 4 4 aRB0N7xSEnfa 2023-01-02 00:00:00 Surabaya Rainy … 26.8 90.4
## 5 5 fjf9bvNshjHQ 2023-06-05 00:00:00 Medan Dry Se… 24.9 85.4
## 6 6 jnQGmBA0NZn0 2023-03-15 00:00:00 Jakarta Transi… 31.1 68.2
## # ℹ 2 more variables: Rainfall <dbl>, Wind_Speed <dbl>
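Detecting outliers in the Temperature column using the Z-Score method: a value is flagged when its z-score (its distance from the mean, measured in standard deviations) exceeds 3 in absolute value.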
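# Z-score method for detecting outliers in Temperature
# Note: scale() returns a one-column matrix, which is why the new column prints
# as `Outlier_Zscore[,1]` in the output; as.numeric(scale(...)) would give a plain vector
z_scores <- scale(weather_data$Temperature)
weather_data <- weather_data %>%
  mutate(Outlier_Zscore = ifelse(abs(z_scores) > 3, "Outlier", "Normal"))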
# Show result
head(weather_data[, c("Temperature", "Outlier_Zscore")])
## # A tibble: 6 × 2
## Temperature Outlier_Zscore[,1]
## <dbl> <chr>
## 1 30.7 Normal
## 2 26.2 Normal
## 3 27.5 Normal
## 4 26.8 Normal
## 5 24.9 Normal
## 6 31.1 Normal
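The flagging rule above can be sketched on a small, made-up vector. This is an illustration under stated assumptions, not the report's own code: the `temps` values are hypothetical, and the z-score is computed by hand as (x - mean) / sd so that the result is a plain numeric vector rather than the one-column matrix that `scale()` returns.

```r
# Hypothetical temperatures: 18 ordinary readings plus one extreme value
temps <- c(rep(c(26, 27, 28), times = 6), 45)

# z-score: distance from the mean in units of the sample standard deviation
z <- (temps - mean(temps, na.rm = TRUE)) / sd(temps, na.rm = TRUE)

# Flag observations more than 3 standard deviations from the mean
flag <- ifelse(abs(z) > 3, "Outlier", "Normal")

# Count the flags and show which values were flagged
table(flag)
temps[flag == "Outlier"]
```

One design caveat: with the sample standard deviation, the largest possible |z| in a sample of size n is (n - 1) / sqrt(n), so a threshold of 3 can never be reached with ten or fewer observations.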
Detecting and filtering outliers in the Rainfall column using the IQR method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
# IQR method for detecting and filtering outliers in Rainfall
Q1 <- quantile(weather_data$Rainfall, 0.25, na.rm = TRUE)
Q3 <- quantile(weather_data$Rainfall, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
# Filtered dataset without outliers
filtered_weather <- weather_data %>%
  filter(Rainfall > (Q1 - 1.5 * IQR_val) & Rainfall < (Q3 + 1.5 * IQR_val))
# Show result
head(filtered_weather)
## # A tibble: 6 × 10
## ID Observation_ID Date Location Season Temperature Humidity
## <dbl> <chr> <dttm> <chr> <chr> <dbl> <dbl>
## 1 1 JpeDLWuzIJdl 2021-07-14 00:00:00 Jakarta Dry Se… 30.7 89.5
## 2 2 EF8r6hXBCqfr 2020-11-16 00:00:00 Bandung Rainy … 26.2 70.1
## 3 3 cov0TYDwyQoF 2023-03-22 00:00:00 Makassar Transi… 27.5 80.7
## 4 4 aRB0N7xSEnfa 2023-01-02 00:00:00 Surabaya Rainy … 26.8 90.4
## 5 5 fjf9bvNshjHQ 2023-06-05 00:00:00 Medan Dry Se… 24.9 85.4
## 6 6 jnQGmBA0NZn0 2023-03-15 00:00:00 Jakarta Transi… 31.1 68.2
## # ℹ 3 more variables: Rainfall <dbl>, Wind_Speed <dbl>,
## # Outlier_Zscore <chr[,1]>
Optional: You can also tag outliers instead of filtering them:
# Add outlier flag instead of filtering
weather_data <- weather_data %>%
  mutate(
    Outlier_IQR = ifelse(
      Rainfall < (Q1 - 1.5 * IQR_val) | Rainfall > (Q3 + 1.5 * IQR_val),
      "Outlier", "Normal"
    )
  )
head(weather_data[, c("Rainfall", "Outlier_IQR")])
## # A tibble: 6 × 2
## Rainfall Outlier_IQR
## <dbl> <chr>
## 1 7.9 Normal
## 2 4.6 Normal
## 3 11 Normal
## 4 8.7 Normal
## 5 7.5 Normal
## 6 3.2 Normal
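The IQR thresholds used in this section can also be wrapped in a small reusable function. The sketch below is only an illustration: `iqr_bounds()` is a hypothetical helper, not part of any package, and the `rain` values are made up (including one missing value to show how `NA` behaves).

```r
# Hypothetical helper: lower and upper IQR fences of a numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Made-up rainfall values with one extreme reading and one missing value
rain <- c(3.2, 4.6, 7.5, 7.9, 8.7, 11, 60, NA)

b <- iqr_bounds(rain)

# A missing value stays NA in the flag instead of being labelled either way
flag <- ifelse(rain < b[["lower"]] | rain > b[["upper"]], "Outlier", "Normal")

data.frame(rain, flag)
```

The two approaches treat missing values differently: tagging with `ifelse()` keeps the row with an `NA` flag, whereas `filter()` drops any row whose condition evaluates to `NA`, so the filtered dataset also loses rows with missing Rainfall.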
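Winsorizing reduces the effect of extreme values by capping them at chosen percentiles instead of removing them. With the default cutoff of 5% at each end, values below the 5th percentile are raised to it and values above the 95th percentile are lowered to it.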
Make sure the DescTools package is installed and loaded
Remove NA values from the Temperature column, if any
Apply Winsorizing to handle outliers in the Temperature column (capping the most extreme 5% at each end of the distribution)
Add the winsorized_temperature column to the dataset
Display the first few rows of the data to compare the original values with the Winsorized ones
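The code for these steps is not shown in the report, so the following is a minimal sketch of what they might look like, assuming caps at the 5th and 95th percentiles. It uses base R (`quantile()`, `pmax()`, `pmin()`) instead of DescTools to avoid depending on a particular package version; `winsorize_vec()` is a hypothetical helper and `weather_demo` is a made-up stand-in for `weather_data`.

```r
# Hypothetical helper: cap values below/above the given quantiles (base R only)
winsorize_vec <- function(x, probs = c(0.05, 0.95)) {
  cuts <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, cuts[1]), cuts[2])
}

# Made-up stand-in for weather_data: typical temperatures, two extremes, one NA
weather_demo <- data.frame(Temperature = c(10, seq(24, 34, by = 0.5), 45, NA))

# Remove rows with a missing Temperature, if any
weather_demo <- weather_demo[!is.na(weather_demo$Temperature), , drop = FALSE]

# Add the winsorized column next to the original one
weather_demo$winsorized_temperature <- winsorize_vec(weather_demo$Temperature)

# Compare original and winsorized values at both ends of the data
head(weather_demo)
tail(weather_demo)
```

Unlike the IQR filter, this keeps every non-missing row: only values beyond the two cutoffs are changed, and they are moved to the cutoff rather than dropped.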
In this section, we applied three important techniques for handling outliers:
Z-Score Normalization helped us identify values far from the mean.
IQR Method provided statistical thresholds to flag or filter extreme values.
Winsorizing modified extreme values to minimize their impact on the analysis.
Proper outlier detection and handling are essential for improving data quality and building robust statistical models.