# Week 1: Introduction to Data Analysis
This first week covers the preliminary steps carried out before data analysis begins. We will touch on data cleaning, handling missing values, and data manipulation.
1.1 Loading the Data Set
First, let's load the PISA 2015 data and display its first 6 rows.
# Load the required packages
library(readxl)
library(dplyr)
library(ggplot2)
library(naniar) # missing-data summaries and visualization
library(tidyr)
library(tidyverse)
# Read the PISA 2015 data
pisa_data <- read_excel("PISA2015.xlsx", sheet = "Trend Science")
# Inspect the first rows of the data set
head(pisa_data)
## # A tibble: 6 × 41
## `Generic ID` CBA Item ID in MS Analysis O…¹ PBA Item ID in MS An…² `Unit Name`
## <chr> <chr> <chr> <chr>
## 1 S131Q02 DS131Q02C PS131Q02 Good Vibra…
## 2 S131Q04 DS131Q04C PS131Q04 Good Vibra…
## 3 S252Q01 CS252Q01S PS252Q01S South Rain…
## 4 S252Q02 CS252Q02S PS252Q02S South Rain…
## 5 S252Q03 CS252Q03S PS252Q03S South Rain…
## 6 S256Q01 CS256Q01S PS256Q01S Spoons
## # ℹ abbreviated names: ¹`CBA Item ID in MS Analysis Output`,
## # ²`PBA Item ID in MS Analysis Output`
## # ℹ 37 more variables: `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <chr>,
## # `2015 FT & MS Cluster` <chr>, `Sequence in Cluster` <dbl>,
## # `Sequence in Unit` <chr>, `Item Format - CBA` <chr>,
## # `Item Format - PBA` <chr>, `Context 1\r\n(2015)` <chr>,
## # `Context 1\r\n(2006)` <chr>, `Context 2` <chr>, …
# Variable names
colnames(pisa_data)
## [1] "Generic ID"
## [2] "CBA Item ID in MS Analysis Output"
## [3] "PBA Item ID in MS Analysis Output"
## [4] "Unit Name"
## [5] "Mode\r\n[Paper-Based (PB); Computer-Based (CB)]"
## [6] "2015 FT & MS Cluster"
## [7] "Sequence in Cluster"
## [8] "Sequence in Unit"
## [9] "Item Format - CBA"
## [10] "Item Format - PBA"
## [11] "Context 1\r\n(2015)"
## [12] "Context 1\r\n(2006)"
## [13] "Context 2"
## [14] "Competency\r\n(2015)"
## [15] "Compentency\r\n(2006)"
## [16] "Knowledge\r\n(2015)"
## [17] "Knowledge\r\n(2006)"
## [18] "System\r\n(2015)"
## [19] "System\r\n(2006)"
## [20] "Depth of Knowledge"
## [21] "Unit Origin"
## [22] "Language of Submission"
## [23] "Source"
## [24] "CBA International % Correct"
## [25] "CBA International % Correct S.E."
## [26] "Slope...26"
## [27] "Difficulty...27"
## [28] "Step 1...28"
## [29] "Step 2...29"
## [30] "1...30"
## [31] "2...31"
## [32] "\r\nLevel"
## [33] "PBA International % Correct"
## [34] "PBA International % Correct S.E."
## [35] "Slope...35"
## [36] "Difficulty...36"
## [37] "Step 1...37"
## [38] "Step 2...38"
## [39] "1...39"
## [40] "2...40"
## [41] "Level"
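Several of the names above contain embedded line breaks (`\r\n`) carried over from the Excel header cells, which makes them awkward to type. The following is a minimal sketch of one way to tidy such names in base R. The helper `clean_names_basic()` is hypothetical (it is not part of the original analysis), and the three names are copied from the output above.

```r
# Hypothetical helper: remove line breaks and extra whitespace from column names
clean_names_basic <- function(x) {
  x <- gsub("[\r\n]+", " ", x)  # replace carriage returns / line feeds with a space
  x <- gsub("\\s+", " ", x)     # collapse runs of whitespace into one space
  trimws(x)                     # drop leading and trailing spaces
}

# Three names copied from the colnames() output above
raw_names <- c("Context 1\r\n(2015)", "\r\nLevel", "Unit Name")
cleaned <- clean_names_basic(raw_names)
cleaned
```

Note that cleaning `"\r\nLevel"` yields `"Level"`, which already exists as column 41, so in the full data set the cleaned names would likely need to be disambiguated (for example with `make.unique()`) before being assigned back.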
1.2 Inspecting the Variables
To check for data-entry errors, let's look at the structure of the data and its summary statistics:
# Inspect the overall structure of the data set
glimpse(pisa_data)
## Rows: 85
## Columns: 41
## $ `Generic ID` <chr> "S131Q02", "S131Q04"…
## $ `CBA Item ID in MS Analysis Output` <chr> "DS131Q02C", "DS131Q…
## $ `PBA Item ID in MS Analysis Output` <chr> "PS131Q02", "PS131Q0…
## $ `Unit Name` <chr> "Good Vibrations", "…
## $ `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <chr> "PB and CB", "PB and…
## $ `2015 FT & MS Cluster` <chr> "S03", "S03", "S06",…
## $ `Sequence in Cluster` <dbl> 4, 5, 6, 7, 8, 5, 8,…
## $ `Sequence in Unit` <chr> "1/2", "2/2", "1/3",…
## $ `Item Format - CBA` <chr> "Open Response - Hum…
## $ `Item Format - PBA` <chr> "Open Response - Hum…
## $ `Context 1\r\n(2015)` <chr> "Personal", "Local/ …
## $ `Context 1\r\n(2006)` <chr> "Social", "Social", …
## $ `Context 2` <chr> "Health & Disease", …
## $ `Competency\r\n(2015)` <chr> "Interpret data and …
## $ `Compentency\r\n(2006)` <chr> "Using scientific ev…
## $ `Knowledge\r\n(2015)` <chr> "Procedural", "Proce…
## $ `Knowledge\r\n(2006)` <chr> "Knowledge about Sci…
## $ `System\r\n(2015)` <chr> "Living", "Living", …
## $ `System\r\n(2006)` <chr> "NA", "NA", "NA", "N…
## $ `Depth of Knowledge` <chr> "Low", "Medium", "Me…
## $ `Unit Origin` <chr> "ACER", "ACER", "Kor…
## $ `Language of Submission` <chr> "English", "English"…
## $ Source <chr> "2012", "2012", "200…
## $ `CBA International % Correct` <dbl> 45.02305, 26.78152, …
## $ `CBA International % Correct S.E.` <dbl> 0.2696944, 0.2403569…
## $ Slope...26 <dbl> 1.42528, 1.21112, 0.…
## $ Difficulty...27 <dbl> 0.07477, 0.55121, 0.…
## $ `Step 1...28` <dbl> NA, NA, NA, NA, NA, …
## $ `Step 2...29` <dbl> NA, NA, NA, NA, NA, …
## $ `1...30` <chr> "537", "624", "553",…
## $ `2...31` <chr> NA, NA, NA, NA, NA, …
## $ `\r\nLevel` <chr> "Level 3", "Level 4"…
## $ `PBA International % Correct` <dbl> 32.46211, 18.24561, …
## $ `PBA International % Correct S.E.` <dbl> 0.3933822, 0.3020953…
## $ Slope...35 <dbl> 1.42528, 1.21112, 0.…
## $ Difficulty...36 <dbl> 0.07477, 0.55121, 0.…
## $ `Step 1...37` <dbl> NA, NA, NA, NA, NA, …
## $ `Step 2...38` <dbl> NA, NA, NA, NA, NA, …
## $ `1...39` <chr> "537", "624", "553",…
## $ `2...40` <chr> NA, NA, NA, NA, NA, …
## $ Level <chr> "Level 3", "Level 4"…
# Summary statistics for every variable (character columns report only length and class)
summary(pisa_data)
## Generic ID CBA Item ID in MS Analysis Output
## Length:85 Length:85
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## PBA Item ID in MS Analysis Output Unit Name
## Length:85 Length:85
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## Mode\r\n[Paper-Based (PB); Computer-Based (CB)] 2015 FT & MS Cluster
## Length:85 Length:85
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## Sequence in Cluster Sequence in Unit Item Format - CBA Item Format - PBA
## Min. : 1.000 Length:85 Length:85 Length:85
## 1st Qu.: 4.000 Class :character Class :character Class :character
## Median : 8.000 Mode :character Mode :character Mode :character
## Mean : 8.071
## 3rd Qu.:12.000
## Max. :18.000
##
## Context 1\r\n(2015) Context 1\r\n(2006) Context 2
## Length:85 Length:85 Length:85
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Competency\r\n(2015) Compentency\r\n(2006) Knowledge\r\n(2015)
## Length:85 Length:85 Length:85
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Knowledge\r\n(2006) System\r\n(2015) System\r\n(2006) Depth of Knowledge
## Length:85 Length:85 Length:85 Length:85
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Unit Origin Language of Submission Source
## Length:85 Length:85 Length:85
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## CBA International % Correct CBA International % Correct S.E. Slope...26
## Min. :11.81 Min. :0.1733 Min. :0.4017
## 1st Qu.:36.78 1st Qu.:0.2522 1st Qu.:0.8782
## Median :47.96 Median :0.2647 Median :1.0000
## Mean :47.48 Mean :0.2583 Mean :1.0872
## 3rd Qu.:56.40 3rd Qu.:0.2707 3rd Qu.:1.3339
## Max. :87.63 Max. :0.2974 Max. :2.4747
##
## Difficulty...27 Step 1...28 Step 2...29 1...30
## Min. :-1.41194 Min. :-1.0223 Min. :0.0768 Length:85
## 1st Qu.:-0.30237 1st Qu.:-0.6342 1st Qu.:0.1615 Class :character
## Median :-0.04738 Median :-0.2462 Median :0.2462 Mode :character
## Mean :-0.01182 Mean :-0.4484 Mean :0.4484
## 3rd Qu.: 0.24946 3rd Qu.:-0.1615 3rd Qu.:0.6342
## Max. : 1.52393 Max. :-0.0768 Max. :1.0223
## NA's :82 NA's :82
## 2...31 \r\nLevel PBA International % Correct
## Length:85 Length:85 Min. :11.09
## Class :character Class :character 1st Qu.:26.80
## Mode :character Mode :character Median :37.27
## Mean :37.82
## 3rd Qu.:47.28
## Max. :81.39
##
## PBA International % Correct S.E. Slope...35 Difficulty...36
## Min. :0.2338 Min. :0.4017 Min. :-1.41194
## 1st Qu.:0.3510 1st Qu.:0.8782 1st Qu.:-0.32316
## Median :0.3771 Median :1.0000 Median :-0.09803
## Mean :0.3669 Mean :1.0872 Mean :-0.05153
## 3rd Qu.:0.3929 3rd Qu.:1.3339 3rd Qu.: 0.23450
## Max. :0.4110 Max. :2.4747 Max. : 1.22396
##
## Step 1...37 Step 2...38 1...39 2...40
## Min. :-0.63107 Min. :0.07957 Length:85 Length:85
## 1st Qu.:-0.43861 1st Qu.:0.16286 Class :character Class :character
## Median :-0.24615 Median :0.24615 Mode :character Mode :character
## Mean :-0.31893 Mean :0.31893
## 3rd Qu.:-0.16286 3rd Qu.:0.43861
## Max. :-0.07957 Max. :0.63107
## NA's :82 NA's :82
## Level
## Length:85
## Class :character
## Mode :character
##
##
##
##
To see whether any variables contain extreme values, let's check the minimum and maximum values.
# Minimum and maximum of each numeric variable
# (na.rm is passed through anonymous functions, as required since dplyr 1.1.0)
pisa_data %>%
  summarise(across(where(is.numeric), list(min = \(x) min(x, na.rm = TRUE), max = \(x) max(x, na.rm = TRUE))))
## # A tibble: 1 × 26
## `Sequence in Cluster_min` `Sequence in Cluster_max` CBA International % Corr…¹
## <dbl> <dbl> <dbl>
## 1 1 18 11.8
## # ℹ abbreviated name: ¹`CBA International % Correct_min`
## # ℹ 23 more variables: `CBA International % Correct_max` <dbl>,
## # `CBA International % Correct S.E._min` <dbl>,
## # `CBA International % Correct S.E._max` <dbl>, Slope...26_min <dbl>,
## # Slope...26_max <dbl>, Difficulty...27_min <dbl>, Difficulty...27_max <dbl>,
## # `Step 1...28_min` <dbl>, `Step 1...28_max` <dbl>, `Step 2...29_min` <dbl>,
## # `Step 2...29_max` <dbl>, `PBA International % Correct_min` <dbl>, …
1.3 Missing Data Analysis
Let's check whether any variables contain missing values.
n_miss(pisa_data)
## [1] 492
prop_miss(pisa_data)
## [1] 0.1411765
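The two numbers above are linked: the proportion is the count of missing cells divided by the total number of cells. As a minimal base R sketch, using a hypothetical toy data frame rather than the PISA file, the same quantities can be computed by hand:

```r
# Toy data frame (hypothetical values): 3 rows, 2 columns, 3 missing cells
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, "x"))

n_missing    <- sum(is.na(toy))                       # total number of NA cells
prop_missing <- n_missing / (nrow(toy) * ncol(toy))   # share of all cells that are NA

n_missing
prop_missing

# The same arithmetic with the figures reported above (492 NAs, 85 rows, 41 columns)
pisa_prop <- 492 / (85 * 41)
round(pisa_prop, 7)
```

The last line reproduces the 0.1411765 shown by `prop_miss()`.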
# Number of missing values per variable
colSums(is.na(pisa_data))
## Generic ID
## 0
## CBA Item ID in MS Analysis Output
## 0
## PBA Item ID in MS Analysis Output
## 0
## Unit Name
## 0
## Mode\r\n[Paper-Based (PB); Computer-Based (CB)]
## 0
## 2015 FT & MS Cluster
## 0
## Sequence in Cluster
## 0
## Sequence in Unit
## 0
## Item Format - CBA
## 0
## Item Format - PBA
## 0
## Context 1\r\n(2015)
## 0
## Context 1\r\n(2006)
## 0
## Context 2
## 0
## Competency\r\n(2015)
## 0
## Compentency\r\n(2006)
## 0
## Knowledge\r\n(2015)
## 0
## Knowledge\r\n(2006)
## 0
## System\r\n(2015)
## 0
## System\r\n(2006)
## 0
## Depth of Knowledge
## 0
## Unit Origin
## 0
## Language of Submission
## 0
## Source
## 0
## CBA International % Correct
## 0
## CBA International % Correct S.E.
## 0
## Slope...26
## 0
## Difficulty...27
## 0
## Step 1...28
## 82
## Step 2...29
## 82
## 1...30
## 0
## 2...31
## 82
## \r\nLevel
## 0
## PBA International % Correct
## 0
## PBA International % Correct S.E.
## 0
## Slope...35
## 0
## Difficulty...36
## 0
## Step 1...37
## 82
## Step 2...38
## 82
## 1...39
## 0
## 2...40
## 82
## Level
## 0
# Percentage of missing values per variable
pisa_data %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100))
## # A tibble: 1 × 41
## `Generic ID` CBA Item ID in MS Analysis O…¹ PBA Item ID in MS An…² `Unit Name`
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0
## # ℹ abbreviated names: ¹`CBA Item ID in MS Analysis Output`,
## # ²`PBA Item ID in MS Analysis Output`
## # ℹ 37 more variables: `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <dbl>,
## # `2015 FT & MS Cluster` <dbl>, `Sequence in Cluster` <dbl>,
## # `Sequence in Unit` <dbl>, `Item Format - CBA` <dbl>,
## # `Item Format - PBA` <dbl>, `Context 1\r\n(2015)` <dbl>,
## # `Context 1\r\n(2006)` <dbl>, `Context 2` <dbl>, …
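Because the one-row tibble above is 41 columns wide, most of the percentages are hidden in the printout. A possible alternative, sketched here in base R on a hypothetical toy data frame, is to compute the percentages as a named vector and keep only the variables that actually have missing values:

```r
# Toy data frame (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, NA, 1), c = 1:4)

# is.na() gives a logical matrix; column means of TRUE/FALSE are proportions
pct_missing <- colMeans(is.na(toy)) * 100

# Keep only variables with at least one NA, largest percentage first
pct_nonzero <- sort(pct_missing[pct_missing > 0], decreasing = TRUE)
pct_nonzero
```

Applied to `pisa_data`, the same two lines would list only the six columns that the `colSums()` output shows as incomplete.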
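Let's look at how the missing values are distributed across variables, as a first check on whether the missingness is random or systematic:
# Visualize the number of missing values per variable
gg_miss_var(pisa_data) +
  labs(title = "PISA 2015 - Distribution of Missing Data")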
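A per-variable plot shows how much is missing but not whether the same rows are missing across variables. One way to probe that, sketched below in base R, is to tabulate the missingness pattern of each row. The toy data frame and its column names are hypothetical and only loosely mimic the Step columns.

```r
# Toy data (hypothetical): step1 and step2 are missing on exactly the same rows
toy <- data.frame(
  score = c(10, 20, 30, 40),
  step1 = c(NA, NA, 0.5, NA),
  step2 = c(NA, NA, 0.7, NA)
)

# Encode each row as a string of 0/1 flags, one per column (1 = missing),
# e.g. "011" means score observed, step1 and step2 missing
pattern <- apply(is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))

# Count how often each pattern occurs
pattern_counts <- table(pattern)
pattern_counts
```

When only a few distinct patterns appear and the incomplete variables always go missing together, the missingness is more plausibly structural (for instance, parameters that exist only for certain item types) than random.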
1.4 Dealing with Missing Data
For missing data we can apply methods such as listwise deletion (dropping rows with missing values) or mean imputation.
First, let's drop the rows with missing values:
pisa_clean <- pisa_data %>%
  drop_na()
# Number of rows left after listwise deletion
nrow(pisa_clean)
## [1] 3
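How many rows survive listwise deletion depends on which columns are checked. The sketch below uses base R's `complete.cases()` on a hypothetical toy data frame to compare checking every column with checking only the columns needed for a given analysis.

```r
# Toy data (hypothetical values): only step1 has missing entries
toy <- data.frame(
  id    = c("A", "B", "C", "D"),
  score = c(10, 20, 30, 40),
  step1 = c(NA, 0.5, NA, NA)
)

keep_all   <- complete.cases(toy)                      # TRUE where no column is NA
keep_score <- complete.cases(toy[, c("id", "score")])  # check only the analysis columns

n_all   <- sum(keep_all)    # rows kept when every column must be complete
n_score <- sum(keep_score)  # rows kept when only id and score must be complete
c(all_columns = n_all, selected_columns = n_score)
```

`tidyr::drop_na()` also accepts column selections, so a similar restriction could be written as `drop_na(score)` in the pipeline style used here.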
Listwise deletion keeps only 3 of the 85 rows here. Alternatively, we can replace missing values with the variable's mean:
pisa_imputed <- pisa_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Are any missing values left?
colSums(is.na(pisa_imputed))
## Generic ID
## 0
## CBA Item ID in MS Analysis Output
## 0
## PBA Item ID in MS Analysis Output
## 0
## Unit Name
## 0
## Mode\r\n[Paper-Based (PB); Computer-Based (CB)]
## 0
## 2015 FT & MS Cluster
## 0
## Sequence in Cluster
## 0
## Sequence in Unit
## 0
## Item Format - CBA
## 0
## Item Format - PBA
## 0
## Context 1\r\n(2015)
## 0
## Context 1\r\n(2006)
## 0
## Context 2
## 0
## Competency\r\n(2015)
## 0
## Compentency\r\n(2006)
## 0
## Knowledge\r\n(2015)
## 0
## Knowledge\r\n(2006)
## 0
## System\r\n(2015)
## 0
## System\r\n(2006)
## 0
## Depth of Knowledge
## 0
## Unit Origin
## 0
## Language of Submission
## 0
## Source
## 0
## CBA International % Correct
## 0
## CBA International % Correct S.E.
## 0
## Slope...26
## 0
## Difficulty...27
## 0
## Step 1...28
## 0
## Step 2...29
## 0
## 1...30
## 0
## 2...31
## 82
## \r\nLevel
## 0
## PBA International % Correct
## 0
## PBA International % Correct S.E.
## 0
## Slope...35
## 0
## Difficulty...36
## 0
## Step 1...37
## 0
## Step 2...38
## 0
## 1...39
## 0
## 2...40
## 82
## Level
## 0
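In the output above, `2...31` and `2...40` still have 82 missing values each: they are character columns, so `where(is.numeric)` skips them. Mean imputation also has a cost for the columns it does fill. The following sketch, with a hypothetical helper `impute_mean()` and made-up values, shows both points in base R.

```r
# Hypothetical helper: fill NAs with the mean of the observed values;
# non-numeric vectors are returned unchanged
impute_mean <- function(x) {
  if (!is.numeric(x)) return(x)
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

x     <- c(2, 4, NA, 6, NA, 8)   # hypothetical numeric values
x_imp <- impute_mean(x)
x_imp

sd_observed <- sd(x, na.rm = TRUE)  # spread of the observed values only
sd_imputed  <- sd(x_imp)            # spread after imputation
c(observed = sd_observed, imputed = sd_imputed)

# A character vector passes through untouched, NAs included
chr_imp <- impute_mean(c("537", NA))
chr_imp
```

The imputed values sit exactly at the mean, so the standard deviation shrinks. With 82 of 85 values missing in the Step columns, almost every entry would be the same imputed number, which is worth keeping in mind before using those columns.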
1.5 Inspecting Outliers
Let's use a boxplot to detect outliers:
# Use the full data set: this variable has no missing values, and pisa_clean keeps only 3 rows
ggplot(pisa_data, aes(y = `CBA International % Correct`)) +
  geom_boxplot(fill = "red", alpha = 0.5) +
  labs(title = "PISA 2015 - Outliers in Percent Correct")
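The points a boxplot draws individually are conventionally those lying more than 1.5 times the interquartile range beyond the quartiles. As a minimal sketch of that rule, the base R code below flags such values in a hypothetical vector of percent-correct scores (the numbers are made up, not taken from the PISA file).

```r
# Hypothetical percent-correct values
x <- c(45, 47, 48, 50, 52, 53, 55, 95)

q     <- quantile(x, c(0.25, 0.75))  # first and third quartiles
iqr   <- q[[2]] - q[[1]]             # interquartile range
lower <- q[[1]] - 1.5 * iqr          # lower fence
upper <- q[[2]] + 1.5 * iqr          # upper fence

# Values outside the fences are flagged as potential outliers
outliers <- x[x < lower | x > upper]
outliers
```

Listing the flagged values this way makes it possible to trace them back to specific items instead of reading them off the plot.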
In this week's learning journal, we used the PISA 2015 data set to carry out data inspection, missing data analysis, and data cleaning. First, we checked the data file for errors by examining summary statistics and extreme values. Next, we calculated the proportion of missing values and used visualization to examine whether the missingness was random or systematic. To handle the missing data, we applied listwise deletion (dropping rows with missing values) and mean imputation. Finally, we drew a boxplot to detect outliers. Checking the data set in this way before analysis is an important step toward reliable results.