Korelasi

Korelasi merupakan pola hubungan linier antar peubah numerik yang dapat diketahui tingkat keerannya melalui koefisien \(r\). Nilai koefisien \(r\) berada pada rentang \(-1 \leq r \leq 1\). Nilai \(r\) emakin mendekati 1 ataupun -1 maka semakin erat hubungan linier antar peubah numerik tersebut. Sedangkan nilai positif maupun negatif menunjukkan arah hubungan antar peubah numerik tersebut. Nilai \(r\) positif berarti nilai rataan antar peubah yang searah, sedangkan nilai \(r\) yang negatif menunjukkan nilai rataan antar peubah yang berlawanan.

Dataset

Berikut adalah korelasi yang dapat dilakukan pada dataset soil yang tersedia pada paket agricolae. Sebelum melakukan analisis korelasi, dapat dilakukan mengeksplorasi data terlebih dahulu.

library(agricolae)
data(soil)
data_soil <- soil
str(data_soil)
## 'data.frame':    13 obs. of  23 variables:
##  $ place: Factor w/ 13 levels "Chmar","Chz",..: 11 9 10 12 13 3 4 5 2 1 ...
##  $ pH   : num  3.8 7.2 8.4 4.8 4.4 7.6 8.3 8.4 6.1 4.4 ...
##  $ EC   : num  0.72 1 0.32 0.83 0.49 1.09 3.67 7.17 0.15 0.18 ...
##  $ CaCO3: num  0 1.4 9.6 0 0 0.9 7.6 2.7 0 0 ...
##  $ MO   : num  1.7 2.6 2 1.9 2 1.3 1.7 1.1 1.6 3.8 ...
##  $ CIC  : num  5.76 20 18.88 17.28 6.4 ...
##  $ P    : num  8.1 17.5 17.6 15.8 7.2 8.9 20 50 8.5 44.1 ...
##  $ K    : num  28 166 124 86 30 350 537 414 84 240 ...
##  $ sand : num  80 26 44 58 78 48 36 50 52 68 ...
##  $ slime: num  16 42 32 30 14 38 36 28 28 26 ...
##  $ clay : num  4 32 24 12 8 14 28 22 20 6 ...
##  $ Ca   : num  0.74 17.21 14.91 12.49 0.76 ...
##  $ Mg   : num  0.15 2.06 3.37 1.9 0.13 1.79 4.44 2.67 1.54 0.56 ...
##  $ K2   : num  0.1 0.47 0.36 0.26 0.09 0.88 1.42 0.76 0.25 0.56 ...
##  $ Na   : num  0.27 0.26 0.24 0.32 0.26 0.37 0.62 0.92 0.29 0.27 ...
##  $ Al_H : num  1.2 0 0 0.4 0.9 0 0 0 0 2.6 ...
##  $ K_Mg : num  0.667 0.228 0.107 0.137 0.692 ...
##  $ Ca_Mg: num  4.93 8.35 4.42 6.57 5.85 ...
##  $ B    : num  0.9 0.2 0.4 0.5 0.4 2.5 3.1 4.9 0.4 1.1 ...
##  $ Cu   : num  0.4 12.1 2.5 3 0.6 5.1 1.7 1.5 1.9 2.3 ...
##  $ Fe   : num  295.4 12.9 23.2 352.4 334.4 ...
##  $ Mn   : num  1.3 0.5 7.9 5.1 1.8 2.1 2.1 4.6 20.6 7.6 ...
##  $ Zn   : num  2.1 62.2 6.7 7 1.7 3.1 8.3 6.3 3.2 4.5 ...
summary(data_soil)
##      place         pH              EC            CaCO3             MO       
##  Chmar  :1   Min.   :3.800   Min.   :0.090   Min.   :0.000   Min.   :1.100  
##  Chz    :1   1st Qu.:4.700   1st Qu.:0.210   1st Qu.:0.000   1st Qu.:1.700  
##  Cnt1   :1   Median :6.100   Median :0.720   Median :0.000   Median :1.900  
##  Cnt2   :1   Mean   :6.154   Mean   :1.289   Mean   :1.708   Mean   :2.254  
##  Cnt3   :1   3rd Qu.:7.600   3rd Qu.:1.000   3rd Qu.:1.400   3rd Qu.:2.500  
##  Hco1   :1   Max.   :8.400   Max.   :7.170   Max.   :9.600   Max.   :5.400  
##  (Other):7                                                                  
##       CIC              P               K              sand      
##  Min.   : 5.76   Min.   : 3.60   Min.   : 28.0   Min.   :26.00  
##  1st Qu.:13.60   1st Qu.: 8.50   1st Qu.: 65.0   1st Qu.:44.00  
##  Median :17.28   Median :17.50   Median :124.0   Median :48.00  
##  Mean   :16.17   Mean   :20.65   Mean   :202.5   Mean   :51.69  
##  3rd Qu.:20.00   3rd Qu.:20.10   3rd Qu.:350.0   3rd Qu.:58.00  
##  Max.   :25.60   Max.   :50.00   Max.   :537.0   Max.   :80.00  
##                                                                 
##      slime            clay             Ca               Mg       
##  Min.   :14.00   Min.   : 4.00   Min.   : 0.740   Min.   :0.130  
##  1st Qu.:28.00   1st Qu.:10.00   1st Qu.: 7.440   1st Qu.:1.540  
##  Median :32.00   Median :14.00   Median : 9.570   Median :2.060  
##  Mean   :31.85   Mean   :16.46   Mean   : 9.296   Mean   :2.422  
##  3rd Qu.:38.00   3rd Qu.:24.00   3rd Qu.:12.490   3rd Qu.:3.370  
##  Max.   :46.00   Max.   :32.00   Max.   :17.210   Max.   :6.250  
##                                                                  
##        K2               Na              Al_H             K_Mg       
##  Min.   :0.0900   Min.   :0.2400   Min.   :0.0000   Min.   :0.0240  
##  1st Qu.:0.1900   1st Qu.:0.2600   1st Qu.:0.0000   1st Qu.:0.1368  
##  Median :0.3600   Median :0.2700   Median :0.0000   Median :0.2846  
##  Mean   :0.5077   Mean   :0.3569   Mean   :0.5308   Mean   :0.3574  
##  3rd Qu.:0.7600   3rd Qu.:0.3200   3rd Qu.:0.9000   3rd Qu.:0.4916  
##  Max.   :1.4200   Max.   :0.9200   Max.   :2.6000   Max.   :1.0000  
##                                                                     
##      Ca_Mg             B               Cu               Fe        
##  Min.   :1.848   Min.   :0.000   Min.   : 0.400   Min.   :   4.3  
##  1st Qu.:3.514   1st Qu.:0.400   1st Qu.: 1.500   1st Qu.:  16.3  
##  Median :4.933   Median :0.500   Median : 2.300   Median : 295.4  
##  Mean   :4.712   Mean   :1.185   Mean   : 3.131   Mean   : 444.3  
##  3rd Qu.:5.846   3rd Qu.:1.100   3rd Qu.: 3.000   3rd Qu.: 352.4  
##  Max.   :8.354   Max.   :4.900   Max.   :12.100   Max.   :1715.0  
##                                                                   
##        Mn               Zn        
##  Min.   : 0.500   Min.   : 1.700  
##  1st Qu.: 2.100   1st Qu.: 2.900  
##  Median : 4.600   Median : 4.500  
##  Mean   : 5.992   Mean   : 8.892  
##  3rd Qu.: 7.900   3rd Qu.: 6.700  
##  Max.   :20.600   Max.   :62.200  
## 

Statistika Deskripsi

Kita bisa membuat boxplot untuk setiap peubah numerik seperti pada pH, P, K, sand, dan clay sebagai berikut.

new_data <- data_soil[, c("place", "pH", "P", "K", "sand", "clay")]
str(new_data)
## 'data.frame':    13 obs. of  6 variables:
##  $ place: Factor w/ 13 levels "Chmar","Chz",..: 11 9 10 12 13 3 4 5 2 1 ...
##  $ pH   : num  3.8 7.2 8.4 4.8 4.4 7.6 8.3 8.4 6.1 4.4 ...
##  $ P    : num  8.1 17.5 17.6 15.8 7.2 8.9 20 50 8.5 44.1 ...
##  $ K    : num  28 166 124 86 30 350 537 414 84 240 ...
##  $ sand : num  80 26 44 58 78 48 36 50 52 68 ...
##  $ clay : num  4 32 24 12 8 14 28 22 20 6 ...
library(ggpubr)
## Loading required package: ggplot2
# Bar plot (bp)
bp1 <- ggbarplot(new_data, x = "place", y = "clay",
          fill = "place",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate vertically x axis texts
          ) + font("x.text", size = 8)

bp2 <- ggbarplot(new_data, x = "place", y = "sand",
          fill = "place",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate vertically x axis texts
          ) + font("x.text", size = 8)

bp3 <- ggbarplot(new_data, x = "place", y = "P",
          fill = "place",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate vertically x axis texts
          ) + font("x.text", size = 8)

bp4 <- ggbarplot(new_data, x = "place", y = "pH",
          fill = "place",               # change fill color by cyl
          color = "white",            # Set bar border colors to white
          palette = "jco",            # jco journal color palett. see ?ggpar
          sort.val = "asc",           # Sort the value in ascending order
          sort.by.groups = TRUE,      # Sort inside each group
          x.text.angle = 90           # Rotate vertically x axis texts
          ) + font("x.text", size = 8)

bp1
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 13

bp2
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 13

bp3
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 13

bp4
## Warning: This manual palette can handle a maximum of 10 values. You have
## supplied 13

Koefisien r

Selanjutnya kita bisa langsung melakukan korelasi dari seluruh peubah numerik dengan menggunakan heatmap. Berikut coding R nya.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# Fungsi untuk menghitung p-value dari korelasi
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

# Pilih hanya kolom numerik
new <- drop(data_soil[,-1])
d1_numeric <- new

# Hitung matriks korelasi dan p-value
cor_matrix <- cor(d1_numeric, use = "complete.obs")
p_matrix <- cor.mtest(d1_numeric)

# Ubah matriks korelasi dan p-value menjadi format panjang
cor_df <- melt(cor_matrix)
p_df <- melt(p_matrix)

# Gabungkan data korelasi dan p-value
cor_df <- cor_df %>%
  rename(correlation = value) %>%
  mutate(p_value = p_df$value)

# Tambahkan kolom signifikansi
cor_df <- cor_df %>%
  mutate(signif = case_when(
    p_value < 0.01 ~ "**",
    p_value < 0.05 ~ "*",
    TRUE ~ ""
  ))

# Gabungkan nilai korelasi dengan tanda signifikan
cor_df <- cor_df %>%
  mutate(label = ifelse(Var1 == Var2, "", paste0(round(correlation, 2), signif)))  # Hilangkan diagonal

# Buat heatmap dengan nilai korelasi dan signifikansi
pr1 <- ggplot(cor_df, aes(x = Var1, y = Var2, fill = correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = label), color = "black", size = 3) +
  scale_fill_gradient2(low = "red", mid = "darkslategray1", high = "blue", 
                       midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Heatmap Korelasi",
       x = "Variabel X",
       y = "Variabel Y",
       fill = "r")
print(pr1)

# Fungsi untuk menghitung p-value dari korelasi
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}

# Pilih hanya kolom numerik
new <- drop(data_soil[,-1])
new <- new[,c("pH", "CaCO3", "P", "K", "Mg", "Na", "Al_H", "Cu", "Fe", "Zn")]
d1_numeric <- new

# Hitung matriks korelasi dan p-value
cor_matrix <- cor(d1_numeric, use = "complete.obs")
p_matrix <- cor.mtest(d1_numeric)

# Ubah matriks korelasi dan p-value menjadi format panjang
cor_df <- melt(cor_matrix)
p_df <- melt(p_matrix)

# Gabungkan data korelasi dan p-value
cor_df <- cor_df %>%
  rename(correlation = value) %>%
  mutate(p_value = p_df$value)

# Tambahkan kolom signifikansi
cor_df <- cor_df %>%
  mutate(signif = case_when(
    p_value < 0.01 ~ "**",
    p_value < 0.05 ~ "*",
    TRUE ~ ""
  ))

# Gabungkan nilai korelasi dengan tanda signifikan
cor_df <- cor_df %>%
  mutate(label = ifelse(Var1 == Var2, "", paste0(round(correlation, 2), signif)))  # Hilangkan diagonal

# Buat heatmap dengan nilai korelasi dan signifikansi
pr2 <- ggplot(cor_df, aes(x = Var1, y = Var2, fill = correlation)) +
  geom_tile(color = "white") +
  geom_text(aes(label = label), color = "black", size = 3) +
  scale_fill_gradient2(low = "blue", mid = "darkslategray3", high = "skyblue", 
                       midpoint = 0) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Heatmap Korelasi Data soil",
       x = "Variabel X",
       y = "Variabel Y",
       fill = "r")
print(pr2)

Kesimpulan

Berdasarkan heatmap yang dibuat menunjukkan koefisien korelasi dari peubah numerik berada pada rentang -0.2 hingga 0.9 untuk seluruh peubah respon. Pada plot kedua, menunjukkan nilai r dari pH dengan Al_H sebesar -0.65 dan signifikan. Hal ini menunjukkan hubungan linier yang berlawanan dan kuat. Artinya rerata kenaikan pH tidak diikuti oleh rerata Al_H. Selain itu, nilai r yang positif, kuat dan signifikan ditunjukkan oleh Zn dan Cu sehingga keduanya memiliki hubungan linier yang searah sehingga rerata kenaikan Zn diikuti oleh Cu.