Data Exploration

Author

Steven Danendra

Data Exploration of the ‘Alcohol’ Dataset

library(wooldridge)
library(ggplot2)
data("alcohol", package = "wooldridge")
str(alcohol)

'data.frame':   9822 obs. of  33 variables:
 $ abuse     : int  1 0 0 0 0 0 0 0 0 0 ...
 $ status    : int  1 3 3 3 3 3 3 1 1 3 ...
 $ unemrate  : num  4 4 4 3.3 3.3 ...
 $ age       : int  50 37 53 59 43 38 34 45 47 31 ...
 $ educ      : int  4 12 9 11 10 10 10 2 5 12 ...
 $ married   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ famsize   : int  1 5 3 1 1 1 4 2 2 1 ...
 $ white     : int  1 1 1 1 1 1 1 1 0 1 ...
 $ exhealth  : int  0 0 1 1 1 1 0 0 0 1 ...
 $ vghealth  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ goodhealth: int  0 1 0 0 0 0 1 0 0 0 ...
 $ fairhealth: int  0 0 0 0 0 0 0 0 0 0 ...
 $ northeast : int  0 0 0 1 1 1 0 0 0 0 ...
 $ midwest   : int  1 1 1 0 0 0 1 1 1 1 ...
 $ south     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ centcity  : int  0 0 0 1 1 1 0 0 0 1 ...
 $ outercity : int  0 0 0 0 0 0 1 1 1 0 ...
 $ qrt1      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ qrt2      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ qrt3      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ beertax   : num  0.334 0.334 0.334 0.24 0.24 ...
 $ cigtax    : num  38 38 38 26 26 26 20 20 20 20 ...
 $ ethanol   : num  2.04 2.04 2.04 2.45 2.45 ...
 $ mothalc   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ fathalc   : int  0 0 0 0 1 0 1 0 1 0 ...
 $ livealc   : int  0 0 0 0 1 0 1 0 1 0 ...
 $ inwf      : int  0 1 1 1 1 1 1 0 0 1 ...
 $ employ    : int  0 1 1 1 1 1 1 0 0 1 ...
 $ agesq     : int  2500 1369 2809 3481 1849 1444 1156 2025 2209 961 ...
 $ beertaxsq : num  0.1116 0.1116 0.1116 0.0576 0.0576 ...
 $ cigtaxsq  : num  1444 1444 1444 676 676 ...
 $ ethanolsq : num  4.16 4.16 4.16 6 6 ...
 $ educsq    : int  16 144 81 121 100 100 100 4 25 144 ...
 - attr(*, "time.stamp")= chr "22 Jan 2013 14:09"

str(alcohol) shows the structure of the alcohol dataset: the number of observations (9,822), the number of variables (33), each variable's name and data type, and its first few values.
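The str() output suggests that many columns are 0/1 indicators. A minimal sketch for separating them from the other variables follows; the helper names is_binary, binary_vars, and other_vars are illustrative and not part of the original analysis.

```r
library(wooldridge)
data("alcohol", package = "wooldridge")

# Flag columns whose only observed values are 0 and 1
is_binary <- sapply(alcohol, function(col) all(col %in% c(0, 1)))

binary_vars <- names(alcohol)[is_binary]   # indicator (dummy) variables
other_vars  <- names(alcohol)[!is_binary]  # continuous, count, or multi-level variables

length(binary_vars)
other_vars
```

Knowing which variables are indicators helps when choosing plots later: scatter plots and histograms are informative mainly for the variables in other_vars.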

head(alcohol)

  abuse status unemrate age educ married famsize white exhealth vghealth
1     1      1      4.0  50    4       1       1     1        0        0
2     0      3      4.0  37   12       1       5     1        0        0
3     0      3      4.0  53    9       1       3     1        1        0
4     0      3      3.3  59   11       1       1     1        1        0
5     0      3      3.3  43   10       1       1     1        1        0
6     0      3      3.3  38   10       1       1     1        1        0
  goodhealth fairhealth northeast midwest south centcity outercity qrt1 qrt2
1          0          0         0       1     0        0         0    1    0
2          1          0         0       1     0        0         0    1    0
3          0          0         0       1     0        0         0    1    0
4          0          0         1       0     0        1         0    1    0
5          0          0         1       0     0        1         0    1    0
6          0          0         1       0     0        1         0    1    0
  qrt3 beertax cigtax ethanol mothalc fathalc livealc inwf employ agesq
1    0   0.334     38 2.03946       0       0       0    0      0  2500
2    0   0.334     38 2.03946       0       0       0    1      1  1369
3    0   0.334     38 2.03946       0       0       0    1      1  2809
4    0   0.240     26 2.44998       0       0       0    1      1  3481
5    0   0.240     26 2.44998       0       1       1    1      1  1849
6    0   0.240     26 2.44998       0       0       0    1      1  1444
  beertaxsq cigtaxsq ethanolsq educsq
1  0.111556     1444  4.159397     16
2  0.111556     1444  4.159397    144
3  0.111556     1444  4.159397     81
4  0.057600      676  6.002402    121
5  0.057600      676  6.002402    100
6  0.057600      676  6.002402    100

head(alcohol) displays the first six rows of the alcohol dataset, giving a quick look at what the data contain.

summary(alcohol)

     abuse             status         unemrate           age       
 Min.   :0.00000   Min.   :1.000   Min.   : 2.800   Min.   :25.00  
 1st Qu.:0.00000   1st Qu.:3.000   1st Qu.: 4.300   1st Qu.:31.00  
 Median :0.00000   Median :3.000   Median : 5.300   Median :38.00  
 Mean   :0.09917   Mean   :2.829   Mean   : 5.569   Mean   :39.18  
 3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 6.700   3rd Qu.:46.00  
 Max.   :1.00000   Max.   :3.000   Max.   :10.900   Max.   :59.00  
      educ          married          famsize           white       
 Min.   : 0.00   Min.   :0.0000   Min.   : 1.000   Min.   :0.0000  
 1st Qu.:12.00   1st Qu.:1.0000   1st Qu.: 1.000   1st Qu.:1.0000  
 Median :13.00   Median :1.0000   Median : 3.000   Median :1.0000  
 Mean   :13.31   Mean   :0.8164   Mean   : 2.741   Mean   :0.8531  
 3rd Qu.:16.00   3rd Qu.:1.0000   3rd Qu.: 4.000   3rd Qu.:1.0000  
 Max.   :19.00   Max.   :1.0000   Max.   :13.000   Max.   :1.0000  
    exhealth         vghealth        goodhealth       fairhealth     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :0.4159   Mean   :0.3019   Mean   :0.2053   Mean   :0.05345  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
   northeast        midwest           south           centcity     
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.203   Mean   :0.2656   Mean   :0.3183   Mean   :0.3332  
 3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
   outercity           qrt1             qrt2             qrt3       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.4349   Mean   :0.2546   Mean   :0.2527   Mean   :0.2428  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    beertax          cigtax         ethanol         mothalc       
 Min.   :0.045   Min.   : 2.00   Min.   :1.035   Min.   :0.00000  
 1st Qu.:0.145   1st Qu.:13.00   1st Qu.:1.798   1st Qu.:0.00000  
 Median :0.259   Median :20.00   Median :2.016   Median :0.00000  
 Mean   :0.426   Mean   :17.96   Mean   :2.036   Mean   :0.04042  
 3rd Qu.:0.446   3rd Qu.:23.00   3rd Qu.:2.390   3rd Qu.:0.00000  
 Max.   :2.370   Max.   :38.00   Max.   :4.017   Max.   :1.00000  
    fathalc          livealc            inwf            employ      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.0000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :1.0000  
 Mean   :0.1543   Mean   :0.1881   Mean   :0.9304   Mean   :0.8982  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     agesq        beertaxsq           cigtaxsq        ethanolsq     
 Min.   : 625   Min.   :0.002025   Min.   :   4.0   Min.   : 1.071  
 1st Qu.: 961   1st Qu.:0.021025   1st Qu.: 169.0   1st Qu.: 3.231  
 Median :1444   Median :0.067081   Median : 400.0   Median : 4.062  
 Mean   :1628   Mean   :0.378427   Mean   : 375.3   Mean   : 4.286  
 3rd Qu.:2116   3rd Qu.:0.198916   3rd Qu.: 529.0   3rd Qu.: 5.714  
 Max.   :3481   Max.   :5.616899   Max.   :1444.0   Max.   :16.134  
     educsq     
 Min.   :  0.0  
 1st Qu.:144.0  
 Median :169.0  
 Mean   :185.5  
 3rd Qu.:256.0  
 Max.   :361.0

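summary(alcohol) gives descriptive statistics for every variable in the alcohol dataset. Because all 33 columns are numeric or integer, each is summarized by its minimum, first quartile, median, mean, third quartile, and maximum. For the 0/1 indicator variables, the mean is the share of observations equal to 1; for example, abuse has a mean of 0.09917, so about 9.9% of respondents are coded as 1.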
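Two quick checks can complement the summary table. The sketch below recomputes the share of abuse = 1 directly and counts missing values per column; the object names abuse_share and na_counts are illustrative.

```r
library(wooldridge)
data("alcohol", package = "wooldridge")

# Share of respondents with abuse == 1 (the Mean row of summary() for abuse)
abuse_share <- mean(alcohol$abuse)
round(100 * abuse_share, 2)

# Missing values per column; summary() would print an "NA's" row if any existed
na_counts <- colSums(is.na(alcohol))
sum(na_counts)
```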

ggplot(alcohol, aes(x = age, y = abuse)) +
  geom_point() +
  labs(x = "Age", y = "Abuse",
       title = "Scatter Plot of Age and Abuse") +
  theme_minimal()

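This scatter plot places age on the x-axis and abuse on the y-axis. ggplot() sets up the plot, labs() labels the axes and adds the title, and theme_minimal() applies a minimal theme. Because abuse is a binary indicator (0 or 1) rather than a continuous variable, the points fall on two horizontal lines and overlap heavily, so the plot reveals little about how abuse varies with age.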
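One possible alternative for a binary outcome is to plot the share of abuse = 1 at each age instead of the raw 0/1 points. This is a sketch of that idea, not part of the original analysis; abuse_by_age is an illustrative name.

```r
library(wooldridge)
library(ggplot2)
data("alcohol", package = "wooldridge")

# Mean of a 0/1 variable within each age = share of respondents with abuse == 1
abuse_by_age <- aggregate(abuse ~ age, data = alcohol, FUN = mean)

ggplot(abuse_by_age, aes(x = age, y = abuse)) +
  geom_line() +
  geom_point() +
  labs(x = "Age", y = "Share with abuse = 1",
       title = "Share of Alcohol Abuse by Age") +
  theme_minimal()
```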

cor(alcohol$age, alcohol$ethanol, use = "complete.obs")

[1] -0.008264124

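cor() computes the Pearson correlation coefficient between two numeric variables, here age and ethanol. The argument use = "complete.obs" drops observations with missing values (NA) before the calculation. The result, about -0.008, indicates essentially no linear relationship between the two variables.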
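cor() returns only the point estimate. If a significance test and confidence interval are also wanted, cor.test() from base R's stats package provides them; the sketch below assumes the default Pearson method, and age_ethanol_test is an illustrative name.

```r
library(wooldridge)
data("alcohol", package = "wooldridge")

# Pearson correlation with a t-test of H0: correlation = 0
age_ethanol_test <- cor.test(alcohol$age, alcohol$ethanol)

age_ethanol_test$estimate  # same value as cor()
age_ethanol_test$p.value   # p-value of the test
age_ethanol_test$conf.int  # 95% confidence interval by default
```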

ggplot(alcohol, aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black") +
  labs(x = "Age", y = "Frequency",
       title = "Histogram of Age") +
  theme_minimal()

The histogram shows the frequency distribution of a single numeric variable, here age. With binwidth = 1, each bar covers one year of age.

ggplot(alcohol, aes(y = ethanol)) +
  geom_boxplot() +
  labs(y = "Ethanol",
       title = "Boxplot of Ethanol") +
  theme_minimal()

The boxplot summarizes the distribution of the numeric variable ethanol through its median and quartiles; points beyond the whiskers are drawn individually as potential outliers.

ggplot(alcohol, aes(x = factor(status), y = ethanol)) +
  geom_boxplot() +
  labs(x = "Status", y = "Ethanol",
       title = "Boxplot of Ethanol by Status") +
  theme_minimal()

Here the boxplot compares the distribution of ethanol across the groups defined by status. Wrapping status in factor() makes ggplot treat it as a categorical variable, so one box is drawn per status value.

anova_result <- aov(ethanol ~ status, data = alcohol)
summary(anova_result)

              Df Sum Sq Mean Sq F value Pr(>F)  
status         1    0.5  0.4923   3.471 0.0625 .
Residuals   9820 1392.9  0.1418                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

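Analysis of variance (ANOVA) is used here to test whether mean ethanol differs with status. Note that status is stored as an integer and enters the formula without factor(), so aov() treats it as a numeric predictor. This is why it has only 1 degree of freedom, and the F-test is equivalent to testing the slope in a simple linear regression of ethanol on status. With p = 0.0625, the effect is not significant at the 5% level, only at the 10% level.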
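To compare the group means themselves, status can be converted to a factor so that aov() estimates one mean per group. The following is a sketch of that variant, not the analysis reported above; its output is not shown here because it was not part of the original run, and anova_by_group is an illustrative name.

```r
library(wooldridge)
data("alcohol", package = "wooldridge")

# factor(status) gives a one-way ANOVA across status groups;
# Df for the term equals the number of groups minus 1
anova_by_group <- aov(ethanol ~ factor(status), data = alcohol)
summary(anova_by_group)

# The group means being compared
tapply(alcohol$ethanol, alcohol$status, mean)
```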

reg_model <- lm(ethanol ~ age + educ + unemrate, data = alcohol)
summary(reg_model)


Call:
lm(formula = ethanol ~ age + educ + unemrate, data = alcohol)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07722 -0.26551  0.00997  0.24009  1.98782 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.3529074  0.0276563  85.077  < 2e-16 ***
age         -0.0001219  0.0003798  -0.321    0.748    
educ         0.0060111  0.0012634   4.758 1.98e-06 ***
unemrate    -0.0704597  0.0024209 -29.104  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3609 on 9818 degrees of freedom
Multiple R-squared:  0.0823,    Adjusted R-squared:  0.08202 
F-statistic: 293.5 on 3 and 9818 DF,  p-value: < 2.2e-16

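A linear regression models ethanol as a function of the predictors age, educ, and unemrate; lm() fits the model. The overall F-test has a p-value below 2.2e-16, so the null hypothesis that all slope coefficients are zero is rejected. Individually, educ and unemrate are statistically significant, while age is not (p = 0.748). The model nonetheless explains only about 8.2% of the variance in ethanol (R-squared = 0.0823).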
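To make the coefficients concrete, the sketch below computes a fitted value by hand from the estimates printed in the regression output. The respondent described by new_obs is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(reg_model) output
coefs <- c(intercept = 2.3529074, age = -0.0001219,
           educ = 0.0060111, unemrate = -0.0704597)

# Hypothetical respondent (illustrative values, not a row from the data)
new_obs <- c(age = 40, educ = 12, unemrate = 5)

# Fitted value = intercept + sum of coefficient * predictor value
predicted_ethanol <- coefs[["intercept"]] +
  sum(coefs[names(new_obs)] * new_obs)
predicted_ethanol  # approximately 2.068
```

With a fitted model object available, predict(reg_model, newdata = data.frame(age = 40, educ = 12, unemrate = 5)) should give the same value up to rounding of the printed coefficients.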

ggplot(alcohol, aes(x = age, y = ethanol)) +
  geom_point() +
  geom_smooth(method = "lm", col = "pink") +
  labs(x = "Age", y = "Ethanol",
       title = "Scatter Plot of Age and Ethanol with Regression Line") +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

This scatter plot shows the relationship between age and ethanol, with a fitted linear regression line added by geom_smooth(method = "lm"). By default geom_smooth() also draws a confidence band around the line.

ggplot(alcohol, aes(x = ethanol)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.1, fill = "blue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(x = "Ethanol", y = "Density", title = "Histogram and Density Plot of Ethanol") +
  theme_minimal()

The histogram is overlaid with a kernel density curve to show the distribution of ethanol. Mapping y = after_stat(density) rescales the histogram to the density scale so that both layers share the same y-axis.

ggplot(alcohol, aes(sample = ethanol)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Q-Q Plot of Ethanol") +
  theme_minimal()

The Q-Q plot compares the quantiles of ethanol with those of a normal distribution. stat_qq() plots the points and stat_qq_line() adds the reference line; points that stray from the line indicate departures from normality.

set.seed(123)
sample_ethanol <- sample(alcohol$ethanol, 5000)
shapiro_test_sample <- shapiro.test(sample_ethanol)
shapiro_test_sample


    Shapiro-Wilk normality test

data:  sample_ethanol
W = 0.92394, p-value < 2.2e-16

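The Shapiro-Wilk test checks whether ethanol is normally distributed. Because shapiro.test() accepts at most 5000 observations, the test is run on a random sample of 5000 values, and the result is stored in shapiro_test_sample. With W = 0.92394 and a p-value below 2.2e-16, the null hypothesis of normality is rejected.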
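The reason for sampling 5000 values can be checked directly: shapiro.test() raises an error for larger inputs. The sketch below uses simulated data (not the alcohol dataset) purely to show this behavior.

```r
set.seed(123)
x <- rnorm(5001)  # simulated data, one observation over the limit

# Capture the error message instead of stopping the script
result <- tryCatch(
  shapiro.test(x),
  error = function(e) conditionMessage(e)
)
result

# Sampling down to 5000 observations makes the test applicable
shapiro.test(sample(x, 5000))$p.value
```

With samples this large, the test has enough power to flag even small departures from normality, so it is worth reading the result alongside the Q-Q plot rather than on its own.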

mean_ethanol <- mean(alcohol$ethanol, na.rm = TRUE)
var_ethanol <- var(alcohol$ethanol, na.rm = TRUE)
sd_ethanol <- sd(alcohol$ethanol, na.rm = TRUE)
mean_ethanol

[1] 2.035733

var_ethanol

[1] 0.1418819

sd_ethanol

[1] 0.3766722

The mean, variance, and standard deviation of ethanol are computed with mean(), var(), and sd(). The standard deviation is the square root of the variance: sqrt(0.1418819) is approximately 0.3767.
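As a reminder of what these functions compute, the sketch below reproduces them by hand on a small toy vector (not taken from the alcohol dataset). Note that var() and sd() use the sample formula with n - 1 in the denominator.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # toy data for illustration
n <- length(x)

x_mean <- sum(x) / n                    # arithmetic mean
x_var  <- sum((x - x_mean)^2) / (n - 1) # sample variance
x_sd   <- sqrt(x_var)                   # sample standard deviation

c(mean = x_mean, variance = x_var, sd = x_sd)

# The manual results agree with the built-in functions
all.equal(c(x_mean, x_var, x_sd), c(mean(x), var(x), sd(x)))
```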

library(caret)

Loading required package: lattice

set.seed(123)
train_index <- createDataPartition(alcohol$ethanol, p = 0.7, list = FALSE)
train_data <- alcohol[train_index, ]
test_data <- alcohol[-train_index, ]
reg_model <- lm(ethanol ~ age + educ + unemrate, data = train_data)
predictions <- predict(reg_model, newdata = test_data)
library(Metrics)


Attaching package: 'Metrics'

The following objects are masked from 'package:caret':

    precision, recall

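# Metrics functions take (actual, predicted); distinct names avoid shadowing mae() and rmse()
mae_value <- mae(test_data$ethanol, predictions)
rmse_value <- rmse(test_data$ethanol, predictions)
mae_value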

[1] 0.2687811

The alcohol dataset is split into training (70%) and test (30%) sets with createDataPartition() from the caret package. A linear regression model with the predictors age, educ, and unemrate is fitted on the training data to predict ethanol. The model then predicts ethanol for the test data, and its performance is evaluated with the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE), which measure how far the predictions fall from the actual values. The MAE on the test set is about 0.269; the RMSE is computed and stored but not printed.
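For readers unfamiliar with the two error metrics, the sketch below computes them by hand in base R on toy values (not the model's actual predictions). The names actual, predicted, mae_manual, and rmse_manual are illustrative.

```r
actual    <- c(2.0, 2.5, 1.8, 2.2)  # toy "true" values
predicted <- c(2.1, 2.3, 2.0, 2.2)  # toy predictions

errors <- actual - predicted

# MAE: average absolute error
mae_manual <- mean(abs(errors))      # approximately 0.125

# RMSE: square root of the average squared error; penalizes large errors more
rmse_manual <- sqrt(mean(errors^2))  # approximately 0.15

c(MAE = mae_manual, RMSE = rmse_manual)
```

RMSE is never smaller than MAE, and a large gap between the two suggests that a few predictions are far off.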