1. Introduction

This report demonstrates three techniques for outlier detection and handling on a weather dataset: the Z-score method, the IQR method, and Winsorizing.

2. Data Loading

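# Load the packages used in this report
library(readxl)  # read_excel()
library(dplyr)   # %>%, mutate(), filter()

# Load the dataset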
weather_data <- read_excel("6 Data Transformation – Data Science Programming.xlsx", sheet = "Sheet1", skip = 1)
## New names:
## • `` -> `...1`
# Rename columns for clarity
colnames(weather_data) <- c("ID", "Observation_ID", "Date", "Location", "Season", "Temperature", "Humidity", "Rainfall", "Wind_Speed")

# Display first few rows
head(weather_data)
## # A tibble: 6 × 9
##      ID Observation_ID Date                Location Season  Temperature Humidity
##   <dbl> <chr>          <dttm>              <chr>    <chr>         <dbl>    <dbl>
## 1     1 JpeDLWuzIJdl   2021-07-14 00:00:00 Jakarta  Dry Se…        30.7     89.5
## 2     2 EF8r6hXBCqfr   2020-11-16 00:00:00 Bandung  Rainy …        26.2     70.1
## 3     3 cov0TYDwyQoF   2023-03-22 00:00:00 Makassar Transi…        27.5     80.7
## 4     4 aRB0N7xSEnfa   2023-01-02 00:00:00 Surabaya Rainy …        26.8     90.4
## 5     5 fjf9bvNshjHQ   2023-06-05 00:00:00 Medan    Dry Se…        24.9     85.4
## 6     6 jnQGmBA0NZn0   2023-03-15 00:00:00 Jakarta  Transi…        31.1     68.2
## # ℹ 2 more variables: Rainfall <dbl>, Wind_Speed <dbl>

3. Z-Score Method

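This section flags outliers in the Temperature column using the Z-score method: a value is labelled an outlier when it lies more than 3 standard deviations from the column mean.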
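Before the rule is applied to the dataset, it may help to see it on a small vector. The following is a minimal sketch that assumes made-up temperature values (they are not taken from the weather dataset). It computes the z-scores by hand as (x - mean) / sd, checks that they agree with scale(), and applies the same threshold of 3.

```r
# Made-up temperatures: 19 ordinary readings and one extreme value (60)
temperature <- c(rep(c(26, 27, 28), times = 6), 27, 60)

# Z-score by hand: distance from the mean in units of standard deviation
z <- (temperature - mean(temperature)) / sd(temperature)

# scale() produces the same numbers, but as a one-column matrix
stopifnot(isTRUE(all.equal(z, as.numeric(scale(temperature)))))

# Flag values more than 3 standard deviations from the mean
flag <- ifelse(abs(z) > 3, "Outlier", "Normal")
temperature[flag == "Outlier"]
```

With the sample standard deviation, |z| can never exceed (n - 1) / sqrt(n), so a threshold of 3 cannot flag anything in a sample of 10 or fewer values.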

# Z-score method for detecting outliers in Temperature
# scale() returns a one-column matrix; as.numeric() converts it to a plain
# vector so that Outlier_Zscore becomes an ordinary character column
z_scores <- as.numeric(scale(weather_data$Temperature))

weather_data <- weather_data %>%
  mutate(Outlier_Zscore = ifelse(abs(z_scores) > 3, "Outlier", "Normal"))

# Show result
head(weather_data[, c("Temperature", "Outlier_Zscore")])
## # A tibble: 6 × 2
##   Temperature Outlier_Zscore
##         <dbl> <chr>
## 1        30.7 Normal
## 2        26.2 Normal
## 3        27.5 Normal
## 4        26.8 Normal
## 5        24.9 Normal
## 6        31.1 Normal

4. IQR Method

This section detects and filters outliers in the Rainfall column using the IQR method: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.

# IQR method for detecting and filtering outliers in Rainfall
Q1 <- quantile(weather_data$Rainfall, 0.25, na.rm = TRUE)
Q3 <- quantile(weather_data$Rainfall, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

# Filtered dataset without outliers; values exactly on a fence are kept,
# matching the flagging rule used for Outlier_IQR
filtered_weather <- weather_data %>%
  filter(Rainfall >= (Q1 - 1.5 * IQR_val) & Rainfall <= (Q3 + 1.5 * IQR_val))

# Show result
head(filtered_weather)
## # A tibble: 6 × 10
##      ID Observation_ID Date                Location Season  Temperature Humidity
##   <dbl> <chr>          <dttm>              <chr>    <chr>         <dbl>    <dbl>
## 1     1 JpeDLWuzIJdl   2021-07-14 00:00:00 Jakarta  Dry Se…        30.7     89.5
## 2     2 EF8r6hXBCqfr   2020-11-16 00:00:00 Bandung  Rainy …        26.2     70.1
## 3     3 cov0TYDwyQoF   2023-03-22 00:00:00 Makassar Transi…        27.5     80.7
## 4     4 aRB0N7xSEnfa   2023-01-02 00:00:00 Surabaya Rainy …        26.8     90.4
## 5     5 fjf9bvNshjHQ   2023-06-05 00:00:00 Medan    Dry Se…        24.9     85.4
## 6     6 jnQGmBA0NZn0   2023-03-15 00:00:00 Jakarta  Transi…        31.1     68.2
## # ℹ 3 more variables: Rainfall <dbl>, Wind_Speed <dbl>,
## #   Outlier_Zscore <chr>

Optional: You can also tag outliers instead of filtering them:

# Add outlier flag instead of filtering
weather_data <- weather_data %>%
  mutate(
    Outlier_IQR = ifelse(
      Rainfall < (Q1 - 1.5 * IQR_val) | Rainfall > (Q3 + 1.5 * IQR_val),
      "Outlier", "Normal"
    )
  )

head(weather_data[, c("Rainfall", "Outlier_IQR")])
## # A tibble: 6 × 2
##   Rainfall Outlier_IQR
##      <dbl> <chr>      
## 1      7.9 Normal     
## 2      4.6 Normal     
## 3     11   Normal     
## 4      8.7 Normal     
## 5      7.5 Normal     
## 6      3.2 Normal
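
To make the fences concrete, here is a minimal sketch on a small vector. The rainfall values are made up for illustration and are not taken from the weather dataset. It computes the two fences, cross-checks the interquartile range against the base function IQR(), and flags anything outside the fences.

```r
# Made-up rainfall values with one extreme reading and one missing value
rainfall <- c(3.2, 4.6, 7.5, 7.9, 8.7, 11.0, 9.4, 6.1, 55.0, NA)

# Quartiles and interquartile range, ignoring missing values
q1 <- quantile(rainfall, 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(rainfall, 0.75, na.rm = TRUE, names = FALSE)
iqr_val <- q3 - q1

# Cross-check against the built-in helper
stopifnot(isTRUE(all.equal(iqr_val, IQR(rainfall, na.rm = TRUE))))

# Fences at 1.5 * IQR beyond each quartile
lower_fence <- q1 - 1.5 * iqr_val
upper_fence <- q3 + 1.5 * iqr_val

# Missing values are never flagged; values on a fence count as normal
is_outlier <- !is.na(rainfall) & (rainfall < lower_fence | rainfall > upper_fence)
rainfall[is_outlier]
```

Handling NA explicitly keeps the flag a plain TRUE/FALSE vector, whereas the ifelse() version above returns NA for rows with missing Rainfall.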

5. Winsorizing Extreme Values

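Winsorizing reduces the effect of extreme values by capping them at percentile cutoffs instead of removing them. By default, the Winsorize() function in the DescTools package caps values at the 5th and 95th percentiles.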

Make sure the DescTools package is installed and loaded.

Remove NA values from the Temperature column, if any.

Apply Winsorizing to handle outliers in the Temperature column (capping the lowest and highest 5% of the distribution).

Add the winsorized_temperature column to the dataset.

Display the first few rows of the data to compare the original values with the Winsorized ones.
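
The steps above can be sketched as follows. This is not the report's original code. It is a base-R illustration that assumes a hypothetical helper, winsorize_vec(), in place of the DescTools function, and uses made-up temperature values rather than the weather dataset. The helper caps values at the 5th and 95th percentiles.

```r
# Hypothetical helper (not part of DescTools): cap values that fall outside
# the lower and upper quantiles given in `probs`
winsorize_vec <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Made-up values standing in for the Temperature column
temperature <- c(24.9, 26.2, 26.8, 27.5, 30.7, 31.1, 45.0, 5.0, NA)

# Remove NA values before winsorizing
temperature <- temperature[!is.na(temperature)]

# Cap the extremes and keep the result next to the original values
winsorized_temperature <- winsorize_vec(temperature)
comparison <- data.frame(
  Temperature = temperature,
  winsorized_temperature = winsorized_temperature
)

# Compare original and winsorized values
head(comparison, 8)
```

Only the two extreme readings (45.0 and 5.0) are pulled in to the cutoffs. Values already inside the cutoffs are left unchanged, and the number of rows stays the same.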

6. Conclusion

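In this report, we applied three important techniques for handling outliers:

Z-score method: flagged Temperature values more than 3 standard deviations from the mean.

IQR method: detected, filtered, and optionally tagged Rainfall values outside the 1.5 × IQR fences.

Winsorizing: modified extreme values to minimize their impact on the analysis.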

Proper outlier detection and handling are essential for improving data quality and building robust statistical models.