# Week 1: Introduction to Data Analysis

The first week's lesson covers the preprocessing steps carried out before data analysis begins. We will touch on data cleaning, handling missing values, and data manipulation.

## 1.1 Loading the Data Set

First, let us load the PISA 2015 data and display its first 6 rows.

# Load the required packages
library(readxl)   # read Excel files
library(dplyr)    # data manipulation
library(ggplot2)  # plotting
library(naniar)   # missing-data summaries and visualization
library(tidyr)    # drop_na()

# Read the "Trend Science" sheet of the PISA 2015 workbook
pisa_data <- read_excel("PISA2015.xlsx", sheet = "Trend Science")

# Inspect the first rows of the data set
head(pisa_data)
## # A tibble: 6 × 41
##   `Generic ID` CBA Item ID in MS Analysis O…¹ PBA Item ID in MS An…² `Unit Name`
##   <chr>        <chr>                          <chr>                  <chr>      
## 1 S131Q02      DS131Q02C                      PS131Q02               Good Vibra…
## 2 S131Q04      DS131Q04C                      PS131Q04               Good Vibra…
## 3 S252Q01      CS252Q01S                      PS252Q01S              South Rain…
## 4 S252Q02      CS252Q02S                      PS252Q02S              South Rain…
## 5 S252Q03      CS252Q03S                      PS252Q03S              South Rain…
## 6 S256Q01      CS256Q01S                      PS256Q01S              Spoons     
## # ℹ abbreviated names: ¹​`CBA Item ID in MS Analysis Output`,
## #   ²​`PBA Item ID in MS Analysis Output`
## # ℹ 37 more variables: `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <chr>,
## #   `2015 FT & MS Cluster` <chr>, `Sequence in Cluster` <dbl>,
## #   `Sequence in Unit` <chr>, `Item Format - CBA` <chr>,
## #   `Item Format - PBA` <chr>, `Context 1\r\n(2015)` <chr>,
## #   `Context 1\r\n(2006)` <chr>, `Context 2` <chr>, …
# Variable names
colnames(pisa_data)
##  [1] "Generic ID"                                     
##  [2] "CBA Item ID in MS Analysis Output"              
##  [3] "PBA Item ID in MS Analysis Output"              
##  [4] "Unit Name"                                      
##  [5] "Mode\r\n[Paper-Based (PB); Computer-Based (CB)]"
##  [6] "2015 FT & MS Cluster"                           
##  [7] "Sequence in Cluster"                            
##  [8] "Sequence in Unit"                               
##  [9] "Item Format - CBA"                              
## [10] "Item Format - PBA"                              
## [11] "Context 1\r\n(2015)"                            
## [12] "Context 1\r\n(2006)"                            
## [13] "Context 2"                                      
## [14] "Competency\r\n(2015)"                           
## [15] "Compentency\r\n(2006)"                          
## [16] "Knowledge\r\n(2015)"                            
## [17] "Knowledge\r\n(2006)"                            
## [18] "System\r\n(2015)"                               
## [19] "System\r\n(2006)"                               
## [20] "Depth of Knowledge"                             
## [21] "Unit Origin"                                    
## [22] "Language of Submission"                         
## [23] "Source"                                         
## [24] "CBA International % Correct"                    
## [25] "CBA International  % Correct S.E."              
## [26] "Slope...26"                                     
## [27] "Difficulty...27"                                
## [28] "Step 1...28"                                    
## [29] "Step 2...29"                                    
## [30] "1...30"                                         
## [31] "2...31"                                         
## [32] "\r\nLevel"                                      
## [33] "PBA International % Correct"                    
## [34] "PBA International  % Correct S.E."              
## [35] "Slope...35"                                     
## [36] "Difficulty...36"                                
## [37] "Step 1...37"                                    
## [38] "Step 2...38"                                    
## [39] "1...39"                                         
## [40] "2...40"                                         
## [41] "Level"
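
The names above contain line breaks (`\r\n`), doubled spaces, and position suffixes such as `...26`, which readxl appends to keep duplicated headers unique. A minimal sketch of how such headers could be tidied in base R follows; `clean_header()` is a hypothetical helper written for this illustration and is not part of the original analysis.

```r
# Hypothetical helper: tidy raw Excel headers such as
# "Context 1\r\n(2015)" or "Slope...26".
clean_header <- function(x) {
  x <- gsub("[\r\n]+", " ", x)          # line breaks -> single space
  x <- gsub("\\.\\.\\.[0-9]+$", "", x)  # drop the "...<position>" suffix
  x <- gsub("[^A-Za-z0-9]+", "_", x)    # any run of other symbols -> "_"
  x <- gsub("^_+|_+$", "", x)           # trim leading/trailing "_"
  make.unique(tolower(x), sep = "_")    # lower-case; keep duplicates distinct
}

raw <- c("Context 1\r\n(2015)", "Slope...26", "Slope...35", "\r\nLevel", "Level")
cleaned <- clean_header(raw)
cleaned
```

Duplicates are only told apart by a numeric suffix here. For this sheet, a prefix that records whether a column belongs to the CBA or PBA block would probably be more informative, but that mapping has to be set by hand.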

## 1.2 Examining the Variables

To check for data-entry errors, let us look at the summary statistics:

# Inspect the overall structure of the data set
glimpse(pisa_data)
## Rows: 85
## Columns: 41
## $ `Generic ID`                                      <chr> "S131Q02", "S131Q04"…
## $ `CBA Item ID in MS Analysis Output`               <chr> "DS131Q02C", "DS131Q…
## $ `PBA Item ID in MS Analysis Output`               <chr> "PS131Q02", "PS131Q0…
## $ `Unit Name`                                       <chr> "Good Vibrations", "…
## $ `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <chr> "PB and CB", "PB and…
## $ `2015 FT & MS Cluster`                            <chr> "S03", "S03", "S06",…
## $ `Sequence in Cluster`                             <dbl> 4, 5, 6, 7, 8, 5, 8,…
## $ `Sequence in Unit`                                <chr> "1/2", "2/2", "1/3",…
## $ `Item Format - CBA`                               <chr> "Open Response - Hum…
## $ `Item Format - PBA`                               <chr> "Open Response - Hum…
## $ `Context 1\r\n(2015)`                             <chr> "Personal", "Local/ …
## $ `Context 1\r\n(2006)`                             <chr> "Social", "Social", …
## $ `Context 2`                                       <chr> "Health & Disease", …
## $ `Competency\r\n(2015)`                            <chr> "Interpret data and …
## $ `Compentency\r\n(2006)`                           <chr> "Using scientific ev…
## $ `Knowledge\r\n(2015)`                             <chr> "Procedural", "Proce…
## $ `Knowledge\r\n(2006)`                             <chr> "Knowledge about Sci…
## $ `System\r\n(2015)`                                <chr> "Living", "Living", …
## $ `System\r\n(2006)`                                <chr> "NA", "NA", "NA", "N…
## $ `Depth of Knowledge`                              <chr> "Low", "Medium", "Me…
## $ `Unit Origin`                                     <chr> "ACER", "ACER", "Kor…
## $ `Language of Submission`                          <chr> "English", "English"…
## $ Source                                            <chr> "2012", "2012", "200…
## $ `CBA International % Correct`                     <dbl> 45.02305, 26.78152, …
## $ `CBA International  % Correct S.E.`               <dbl> 0.2696944, 0.2403569…
## $ Slope...26                                        <dbl> 1.42528, 1.21112, 0.…
## $ Difficulty...27                                   <dbl> 0.07477, 0.55121, 0.…
## $ `Step 1...28`                                     <dbl> NA, NA, NA, NA, NA, …
## $ `Step 2...29`                                     <dbl> NA, NA, NA, NA, NA, …
## $ `1...30`                                          <chr> "537", "624", "553",…
## $ `2...31`                                          <chr> NA, NA, NA, NA, NA, …
## $ `\r\nLevel`                                       <chr> "Level 3", "Level 4"…
## $ `PBA International % Correct`                     <dbl> 32.46211, 18.24561, …
## $ `PBA International  % Correct S.E.`               <dbl> 0.3933822, 0.3020953…
## $ Slope...35                                        <dbl> 1.42528, 1.21112, 0.…
## $ Difficulty...36                                   <dbl> 0.07477, 0.55121, 0.…
## $ `Step 1...37`                                     <dbl> NA, NA, NA, NA, NA, …
## $ `Step 2...38`                                     <dbl> NA, NA, NA, NA, NA, …
## $ `1...39`                                          <chr> "537", "624", "553",…
## $ `2...40`                                          <chr> NA, NA, NA, NA, NA, …
## $ Level                                             <chr> "Level 3", "Level 4"…
# Descriptive statistics (character columns only report length, class, and mode)
summary(pisa_data)
##   Generic ID        CBA Item ID in MS Analysis Output
##  Length:85          Length:85                        
##  Class :character   Class :character                 
##  Mode  :character   Mode  :character                 
##                                                      
##                                                      
##                                                      
##                                                      
##  PBA Item ID in MS Analysis Output  Unit Name        
##  Length:85                         Length:85         
##  Class :character                  Class :character  
##  Mode  :character                  Mode  :character  
##                                                      
##                                                      
##                                                      
##                                                      
##  Mode\r\n[Paper-Based (PB); Computer-Based (CB)] 2015 FT & MS Cluster
##  Length:85                                       Length:85           
##  Class :character                                Class :character    
##  Mode  :character                                Mode  :character    
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  Sequence in Cluster Sequence in Unit   Item Format - CBA  Item Format - PBA 
##  Min.   : 1.000      Length:85          Length:85          Length:85         
##  1st Qu.: 4.000      Class :character   Class :character   Class :character  
##  Median : 8.000      Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8.071                                                              
##  3rd Qu.:12.000                                                              
##  Max.   :18.000                                                              
##                                                                              
##  Context 1\r\n(2015) Context 1\r\n(2006)  Context 2        
##  Length:85           Length:85           Length:85         
##  Class :character    Class :character    Class :character  
##  Mode  :character    Mode  :character    Mode  :character  
##                                                            
##                                                            
##                                                            
##                                                            
##  Competency\r\n(2015) Compentency\r\n(2006) Knowledge\r\n(2015)
##  Length:85            Length:85             Length:85          
##  Class :character     Class :character      Class :character   
##  Mode  :character     Mode  :character      Mode  :character   
##                                                                
##                                                                
##                                                                
##                                                                
##  Knowledge\r\n(2006) System\r\n(2015)   System\r\n(2006)   Depth of Knowledge
##  Length:85           Length:85          Length:85          Length:85         
##  Class :character    Class :character   Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  Unit Origin        Language of Submission    Source         
##  Length:85          Length:85              Length:85         
##  Class :character   Class :character       Class :character  
##  Mode  :character   Mode  :character       Mode  :character  
##                                                              
##                                                              
##                                                              
##                                                              
##  CBA International % Correct CBA International  % Correct S.E.   Slope...26    
##  Min.   :11.81               Min.   :0.1733                    Min.   :0.4017  
##  1st Qu.:36.78               1st Qu.:0.2522                    1st Qu.:0.8782  
##  Median :47.96               Median :0.2647                    Median :1.0000  
##  Mean   :47.48               Mean   :0.2583                    Mean   :1.0872  
##  3rd Qu.:56.40               3rd Qu.:0.2707                    3rd Qu.:1.3339  
##  Max.   :87.63               Max.   :0.2974                    Max.   :2.4747  
##                                                                                
##  Difficulty...27     Step 1...28       Step 2...29        1...30         
##  Min.   :-1.41194   Min.   :-1.0223   Min.   :0.0768   Length:85         
##  1st Qu.:-0.30237   1st Qu.:-0.6342   1st Qu.:0.1615   Class :character  
##  Median :-0.04738   Median :-0.2462   Median :0.2462   Mode  :character  
##  Mean   :-0.01182   Mean   :-0.4484   Mean   :0.4484                     
##  3rd Qu.: 0.24946   3rd Qu.:-0.1615   3rd Qu.:0.6342                     
##  Max.   : 1.52393   Max.   :-0.0768   Max.   :1.0223                     
##                     NA's   :82        NA's   :82                         
##     2...31             \r\nLevel       PBA International % Correct
##  Length:85          Length:85          Min.   :11.09              
##  Class :character   Class :character   1st Qu.:26.80              
##  Mode  :character   Mode  :character   Median :37.27              
##                                        Mean   :37.82              
##                                        3rd Qu.:47.28              
##                                        Max.   :81.39              
##                                                                   
##  PBA International  % Correct S.E.   Slope...35     Difficulty...36   
##  Min.   :0.2338                    Min.   :0.4017   Min.   :-1.41194  
##  1st Qu.:0.3510                    1st Qu.:0.8782   1st Qu.:-0.32316  
##  Median :0.3771                    Median :1.0000   Median :-0.09803  
##  Mean   :0.3669                    Mean   :1.0872   Mean   :-0.05153  
##  3rd Qu.:0.3929                    3rd Qu.:1.3339   3rd Qu.: 0.23450  
##  Max.   :0.4110                    Max.   :2.4747   Max.   : 1.22396  
##                                                                       
##   Step 1...37        Step 2...38         1...39             2...40         
##  Min.   :-0.63107   Min.   :0.07957   Length:85          Length:85         
##  1st Qu.:-0.43861   1st Qu.:0.16286   Class :character   Class :character  
##  Median :-0.24615   Median :0.24615   Mode  :character   Mode  :character  
##  Mean   :-0.31893   Mean   :0.31893                                        
##  3rd Qu.:-0.16286   3rd Qu.:0.43861                                        
##  Max.   :-0.07957   Max.   :0.63107                                        
##  NA's   :82         NA's   :82                                             
##     Level          
##  Length:85         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
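
The `glimpse()` and `summary()` output shows that some columns holding numbers, such as `1...30` with values like "537", were read as character, so they get no numeric summary. The sketch below shows one way to find and convert such columns. It uses toy data with hypothetical column names, and it assumes that a column should be converted only if every non-missing entry parses as a number.

```r
# Toy data; `score_1` mimics a numeric column that was read as text
items <- data.frame(
  score_1 = c("537", "624", "553"),
  level   = c("Level 3", "Level 4", "Level 3"),
  stringsAsFactors = FALSE
)

# TRUE when a character vector has at least one non-missing entry
# and all of its non-missing entries parse as numbers
looks_numeric <- function(x) {
  if (!is.character(x)) return(FALSE)
  parsed <- suppressWarnings(as.numeric(x))
  any(!is.na(x)) && all(!is.na(parsed[!is.na(x)]))
}

num_like <- vapply(items, looks_numeric, logical(1))
items[num_like] <- lapply(items[num_like], as.numeric)
str(items)
```

On the real sheet, a column may mix numbers with text notes, in which case it is left untouched and needs a manual look.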

To see whether any variable contains extreme values, let us check the minimum and maximum values.

# Minimum and maximum of every numeric variable
# (na.rm is set inside anonymous functions; passing it through across(...) is deprecated as of dplyr 1.1.0)
pisa_data %>% 
  summarise(across(where(is.numeric),
                   list(min = \(x) min(x, na.rm = TRUE), max = \(x) max(x, na.rm = TRUE))))
## # A tibble: 1 × 26
##   `Sequence in Cluster_min` `Sequence in Cluster_max` CBA International % Corr…¹
##                       <dbl>                     <dbl>                      <dbl>
## 1                         1                        18                       11.8
## # ℹ abbreviated name: ¹​`CBA International % Correct_min`
## # ℹ 23 more variables: `CBA International % Correct_max` <dbl>,
## #   `CBA International  % Correct S.E._min` <dbl>,
## #   `CBA International  % Correct S.E._max` <dbl>, Slope...26_min <dbl>,
## #   Slope...26_max <dbl>, Difficulty...27_min <dbl>, Difficulty...27_max <dbl>,
## #   `Step 1...28_min` <dbl>, `Step 1...28_max` <dbl>, `Step 2...29_min` <dbl>,
## #   `Step 2...29_max` <dbl>, `PBA International % Correct_min` <dbl>, …
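
Minimum and maximum values are most useful when compared against the range a variable can legitimately take. As a sketch under that assumption, the percent-correct columns should lie between 0 and 100. The values below are toy data, and `check_range()` is a hypothetical helper.

```r
# Hypothetical helper: positions of values outside the closed range [lo, hi]
check_range <- function(x, lo, hi) {
  which(!is.na(x) & (x < lo | x > hi))
}

# Toy percentages; the third and fourth entries are deliberately invalid
pct_correct <- c(45.0, 26.8, 103.2, -1.5, 87.6)

bad_rows <- check_range(pct_correct, lo = 0, hi = 100)
bad_rows
```

Flagged positions are candidates for a data-entry error and should be checked against the source file before any value is changed.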

## 1.3 Missing Data Analysis

Let us check whether any variables contain missing values.

n_miss(pisa_data)
## [1] 492
prop_miss(pisa_data)
## [1] 0.1411765
# Number of missing values in each column
colSums(is.na(pisa_data))
##                                      Generic ID 
##                                               0 
##               CBA Item ID in MS Analysis Output 
##                                               0 
##               PBA Item ID in MS Analysis Output 
##                                               0 
##                                       Unit Name 
##                                               0 
## Mode\r\n[Paper-Based (PB); Computer-Based (CB)] 
##                                               0 
##                            2015 FT & MS Cluster 
##                                               0 
##                             Sequence in Cluster 
##                                               0 
##                                Sequence in Unit 
##                                               0 
##                               Item Format - CBA 
##                                               0 
##                               Item Format - PBA 
##                                               0 
##                             Context 1\r\n(2015) 
##                                               0 
##                             Context 1\r\n(2006) 
##                                               0 
##                                       Context 2 
##                                               0 
##                            Competency\r\n(2015) 
##                                               0 
##                           Compentency\r\n(2006) 
##                                               0 
##                             Knowledge\r\n(2015) 
##                                               0 
##                             Knowledge\r\n(2006) 
##                                               0 
##                                System\r\n(2015) 
##                                               0 
##                                System\r\n(2006) 
##                                               0 
##                              Depth of Knowledge 
##                                               0 
##                                     Unit Origin 
##                                               0 
##                          Language of Submission 
##                                               0 
##                                          Source 
##                                               0 
##                     CBA International % Correct 
##                                               0 
##               CBA International  % Correct S.E. 
##                                               0 
##                                      Slope...26 
##                                               0 
##                                 Difficulty...27 
##                                               0 
##                                     Step 1...28 
##                                              82 
##                                     Step 2...29 
##                                              82 
##                                          1...30 
##                                               0 
##                                          2...31 
##                                              82 
##                                       \r\nLevel 
##                                               0 
##                     PBA International % Correct 
##                                               0 
##               PBA International  % Correct S.E. 
##                                               0 
##                                      Slope...35 
##                                               0 
##                                 Difficulty...36 
##                                               0 
##                                     Step 1...37 
##                                              82 
##                                     Step 2...38 
##                                              82 
##                                          1...39 
##                                               0 
##                                          2...40 
##                                              82 
##                                           Level 
##                                               0
# Percentage of missing values in each column
pisa_data %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100))
## # A tibble: 1 × 41
##   `Generic ID` CBA Item ID in MS Analysis O…¹ PBA Item ID in MS An…² `Unit Name`
##          <dbl>                          <dbl>                  <dbl>       <dbl>
## 1            0                              0                      0           0
## # ℹ abbreviated names: ¹​`CBA Item ID in MS Analysis Output`,
## #   ²​`PBA Item ID in MS Analysis Output`
## # ℹ 37 more variables: `Mode\r\n[Paper-Based (PB); Computer-Based (CB)]` <dbl>,
## #   `2015 FT & MS Cluster` <dbl>, `Sequence in Cluster` <dbl>,
## #   `Sequence in Unit` <dbl>, `Item Format - CBA` <dbl>,
## #   `Item Format - PBA` <dbl>, `Context 1\r\n(2015)` <dbl>,
## #   `Context 1\r\n(2006)` <dbl>, `Context 2` <dbl>, …
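
The counts above only cover real `NA` values. The `glimpse()` output shows that `System\r\n(2006)` stores the text `"NA"`, which `is.na()` does not treat as missing. The sketch below, on toy data with hypothetical column names, recodes such placeholders. It assumes that the text `"NA"` really means "missing" rather than "not applicable", which should be confirmed in the PISA documentation.

```r
# Toy data: `system_2006` mimics a column that stores the text "NA"
df <- data.frame(
  id          = c("S131Q02", "S131Q04", "S252Q01"),
  system_2006 = c("NA", "NA", "Living"),
  stringsAsFactors = FALSE
)

before <- colSums(is.na(df))   # the text "NA" is not counted as missing

# Recode placeholder text to a real missing value in character columns
df[] <- lapply(df, function(col) {
  if (is.character(col)) col[col %in% c("NA", "")] <- NA
  col
})

after <- colSums(is.na(df))    # system_2006 now reports its missing entries
after
```

The same recoding can be requested at import time through the `na` argument of `read_excel()`, for example `na = c("", "NA")`.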

Let us look at whether the missing data are random or systematic:

# Visualize the amount of missing data per variable
gg_miss_var(pisa_data) +
  labs(title = "PISA 2015 - Distribution of Missing Data")
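
`gg_miss_var()` shows how much is missing per variable, not whether the same rows are affected. Six columns are each missing in exactly 82 rows, which hints at a shared, systematic pattern. The sketch below tabulates missingness patterns row by row. It uses toy data with hypothetical column names, and the idea that the step columns are filled only for certain item types is an assumption, not something the output confirms.

```r
# Toy data: step_1 and step_2 are missing in exactly the same rows
df <- data.frame(
  slope  = c(1.4, 1.2, 0.9, 1.0, 1.1),
  step_1 = c(NA, NA, -0.25, NA, -0.63),
  step_2 = c(NA, NA,  0.25, NA,  0.63)
)

miss <- is.na(df)

# One pattern string per row: "1" = missing, "0" = observed, in column order
pattern <- apply(miss, 1, function(r) paste(as.integer(r), collapse = ""))
tab <- table(pattern)
tab

# Do the two columns share exactly the same missing rows?
same_rows <- identical(miss[, "step_1"], miss[, "step_2"])
same_rows
```

A small number of distinct patterns, with columns that are always missing together, points to structural missingness rather than values lost at random. naniar's `vis_miss()` gives a row-by-column picture of the same information.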

## 1.4 Handling Missing Data

For missing data, we can apply methods such as listwise deletion (dropping rows with missing values) or mean imputation.

First, let us drop the rows that contain missing values:

pisa_clean <- pisa_data %>%
  drop_na()
# Number of rows left after dropping incomplete rows
nrow(pisa_clean)
## [1] 3
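
Only 3 of the 85 rows remain, because `drop_na()` without arguments discards a row as soon as any column is missing, and six columns are missing in 82 rows. One hedged alternative is to restrict the deletion to the variables that the analysis actually uses. The sketch below illustrates this in base R on toy data; the column names and the choice of analysis variables are hypothetical.

```r
# Toy data: `step_1` is almost always missing, `pct_correct` never is
df <- data.frame(
  pct_correct = c(45.0, 26.8, 60.1, 52.3),
  difficulty  = c(0.07, 0.55, NA, -0.30),
  step_1      = c(NA, NA, NA, -0.25)
)

# Listwise deletion over every column
n_all <- nrow(df[complete.cases(df), ])

# Listwise deletion restricted to the analysis variables
analysis_vars <- c("pct_correct", "difficulty")
df_analysis   <- df[complete.cases(df[, analysis_vars]), ]
n_subset      <- nrow(df_analysis)

c(all_columns = n_all, analysis_columns = n_subset)
```

`drop_na()` from tidyr accepts column selections in the same spirit, so only the named columns are inspected for missing values.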

Alternatively, we can fill in missing values with the mean of the variable:

pisa_imputed <- pisa_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Any missing values left? (only numeric columns were imputed, so character columns can still contain NA)
colSums(is.na(pisa_imputed))
##                                      Generic ID 
##                                               0 
##               CBA Item ID in MS Analysis Output 
##                                               0 
##               PBA Item ID in MS Analysis Output 
##                                               0 
##                                       Unit Name 
##                                               0 
## Mode\r\n[Paper-Based (PB); Computer-Based (CB)] 
##                                               0 
##                            2015 FT & MS Cluster 
##                                               0 
##                             Sequence in Cluster 
##                                               0 
##                                Sequence in Unit 
##                                               0 
##                               Item Format - CBA 
##                                               0 
##                               Item Format - PBA 
##                                               0 
##                             Context 1\r\n(2015) 
##                                               0 
##                             Context 1\r\n(2006) 
##                                               0 
##                                       Context 2 
##                                               0 
##                            Competency\r\n(2015) 
##                                               0 
##                           Compentency\r\n(2006) 
##                                               0 
##                             Knowledge\r\n(2015) 
##                                               0 
##                             Knowledge\r\n(2006) 
##                                               0 
##                                System\r\n(2015) 
##                                               0 
##                                System\r\n(2006) 
##                                               0 
##                              Depth of Knowledge 
##                                               0 
##                                     Unit Origin 
##                                               0 
##                          Language of Submission 
##                                               0 
##                                          Source 
##                                               0 
##                     CBA International % Correct 
##                                               0 
##               CBA International  % Correct S.E. 
##                                               0 
##                                      Slope...26 
##                                               0 
##                                 Difficulty...27 
##                                               0 
##                                     Step 1...28 
##                                               0 
##                                     Step 2...29 
##                                               0 
##                                          1...30 
##                                               0 
##                                          2...31 
##                                              82 
##                                       \r\nLevel 
##                                               0 
##                     PBA International % Correct 
##                                               0 
##               PBA International  % Correct S.E. 
##                                               0 
##                                      Slope...35 
##                                               0 
##                                 Difficulty...36 
##                                               0 
##                                     Step 1...37 
##                                               0 
##                                     Step 2...38 
##                                               0 
##                                          1...39 
##                                               0 
##                                          2...40 
##                                              82 
##                                           Level 
##                                               0
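
Mean imputation removes the `NA`s from the numeric columns, but it has a cost: every filled-in value sits exactly on the mean, so the variable's spread shrinks. That matters here, where the step columns have only 3 observed values and 82 imputed ones. A minimal sketch of the effect on a toy vector:

```r
x <- c(2, 4, NA, 8, NA, 6)

# Mean imputation, same logic as the ifelse() call above
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var_observed <- var(x, na.rm = TRUE)  # variance of the observed values only
var_imputed  <- var(x_imputed)        # smaller after imputation

# A flag keeps imputed entries identifiable in later steps
was_imputed <- is.na(x)

c(observed = var_observed, imputed = var_imputed)
```

Keeping a flag such as `was_imputed` alongside the data is one simple way to rerun an analysis with and without the imputed entries.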

## 1.5 Examining Outliers

Let us use a boxplot to detect outliers:

# Plot from pisa_data: this column has no missing values, whereas pisa_clean keeps only 3 rows
ggplot(pisa_data, aes(y = `CBA International % Correct`)) +
  geom_boxplot(fill = "red", alpha = 0.5) +
  labs(title = "PISA 2015 - Outliers in Percent Correct")
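
By default, a boxplot draws as individual points the values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies the same rule numerically on toy data, so that flagged values can be listed rather than only seen. The values are made up for illustration.

```r
# Toy values; 95 is deliberately extreme
pct <- c(30, 35, 40, 45, 50, 55, 60, 95)

q     <- quantile(pct, c(0.25, 0.75), names = FALSE)  # first and third quartile
fence <- 1.5 * (q[2] - q[1])                          # 1.5 * IQR

# Values beyond the fences are the points a boxplot draws individually
outliers <- pct[pct < q[1] - fence | pct > q[2] + fence]
outliers
```

A flagged value is not necessarily an error. For item statistics, a very easy or very hard item can be a legitimate observation.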

In this week's learning log, we used the PISA 2015 data set to carry out data inspection, missing-data analysis, and data cleaning. First, we checked the data file for entry errors by examining summary statistics and the minimum and maximum values. Next, we computed the share of missing values and used a plot to explore whether the missingness looked random or systematic. To handle missing data, we applied listwise deletion (dropping rows with missing values) and mean imputation. Finally, we drew a boxplot to detect outliers. These checks are an important step before analysis begins, because reliable results depend on a data set that has been verified.