Data Exploration

Author

Trevin Yoris Kenjiro

Wooldridge Data Exploration

This report explores the “cement” dataset from the “wooldridge” package. The data contain 312 monthly observations from 1964 to 1989, with variables such as the producer price index for cement (prccem), the industrial production index for cement (ipcem), and real residential construction (rresc). I start by inspecting the structure of the data with str() and the descriptive summary statistics with summary(). The ggplot2 visualizations include a time series plot of prccem, a histogram of the log changes in prccem (gprc), a scatter plot of the relationship between ip and rresc, and a boxplot of rnonc by month. This exploration gives an understanding of the data's characteristics as a basis for further analysis in an economic and industrial context.

Central Tendency

library(wooldridge)
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stats)
data("cement")
str(cement)
'data.frame':   312 obs. of  30 variables:
 $ year   : int  1964 1964 1964 1964 1964 1964 1964 1964 1964 1964 ...
 $ month  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ prccem : int  NA NA NA NA NA NA NA NA NA NA ...
 $ ipcem  : num  0.474 0.531 0.643 0.826 1.027 ...
 $ prcpet : num  13.4 13.4 13.4 13.4 13.4 ...
 $ rresc  : num  115401 115118 123663 116178 111034 ...
 $ rnonc  : num  142180 144190 145577 150793 149259 ...
 $ ip     : num  44.6 45.9 46.2 46.9 47.1 ...
 $ rdefs  : num  NA 1.66 1.66 1.66 1.66 ...
 $ milemp : int  2687 2696 2693 2694 2690 2687 2696 2693 2690 2680 ...
 $ gprc   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ gcem   : num  NA 0.113 0.19 0.251 0.218 ...
 $ gprcpet: num  NA 0 0 0 0 0 0 0 0 0 ...
 $ gres   : num  NA -0.00246 0.0716 -0.06244 -0.04529 ...
 $ gnon   : num  NA 0.01404 0.00957 0.0352 -0.01022 ...
 $ gip    : num  NA 0.02873 0.00651 0.01504 0.00426 ...
 $ gdefs  : num  NA NA -0.000481 0 -0.000963 ...
 $ gmilemp: num  NA 0.003344 -0.001113 0.000371 -0.001486 ...
 $ jan    : int  1 0 0 0 0 0 0 0 0 0 ...
 $ feb    : int  0 1 0 0 0 0 0 0 0 0 ...
 $ mar    : int  0 0 1 0 0 0 0 0 0 0 ...
 $ apr    : int  0 0 0 1 0 0 0 0 0 0 ...
 $ may    : int  0 0 0 0 1 0 0 0 0 0 ...
 $ jun    : int  0 0 0 0 0 1 0 0 0 0 ...
 $ jul    : int  0 0 0 0 0 0 1 0 0 0 ...
 $ aug    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ sep    : int  0 0 0 0 0 0 0 0 1 0 ...
 $ oct    : int  0 0 0 0 0 0 0 0 0 1 ...
 $ nov    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ dec    : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
summary(cement)
      year          month           prccem           ipcem       
 Min.   :1964   Min.   : 1.00   Min.   : 287.0   Min.   :0.4072  
 1st Qu.:1970   1st Qu.: 3.75   1st Qu.: 361.0   1st Qu.:0.7632  
 Median :1976   Median : 6.50   Median : 672.0   Median :1.0315  
 Mean   :1976   Mean   : 6.50   Mean   : 672.4   Mean   :0.9724  
 3rd Qu.:1983   3rd Qu.: 9.25   3rd Qu.: 999.0   3rd Qu.:1.1798  
 Max.   :1989   Max.   :12.00   Max.   :1067.0   Max.   :1.4567  
                                NA's   :11       NA's   :2       
     prcpet           rresc            rnonc              ip        
 Min.   : 13.40   Min.   : 80798   Min.   :141753   Min.   : 44.60  
 1st Qu.: 14.50   1st Qu.:106085   1st Qu.:161970   1st Qu.: 62.48  
 Median : 36.10   Median :124003   Median :172170   Median : 76.55  
 Mean   : 43.48   Mean   :128436   Mean   :169424   Mean   : 76.65  
 3rd Qu.: 59.83   3rd Qu.:151440   3rd Qu.:178110   3rd Qu.: 87.80  
 Max.   :114.90   Max.   :177458   Max.   :187406   Max.   :110.90  
                                                                    
     rdefs           milemp          gprc                gcem          
 Min.   :1.620   Min.   :2018   Min.   :-0.025872   Min.   :-0.616466  
 1st Qu.:1.634   1st Qu.:2104   1st Qu.: 0.000000   1st Qu.:-0.082918  
 Median :1.663   Median :2153   Median : 0.000000   Median : 0.031626  
 Mean   :1.667   Mean   :2425   Mean   : 0.004217   Mean   : 0.003198  
 3rd Qu.:1.702   3rd Qu.:2688   3rd Qu.: 0.003188   3rd Qu.: 0.124797  
 Max.   :1.723   Max.   :3547   Max.   : 0.089841   Max.   : 0.469712  
 NA's   :5       NA's   :5      NA's   :12          NA's   :3          
    gprcpet               gres                 gnon           
 Min.   :-0.324846   Min.   :-0.0949657   Min.   :-0.0557258  
 1st Qu.: 0.000000   1st Qu.:-0.0141437   1st Qu.:-0.0116814  
 Median : 0.000000   Median : 0.0006191   Median : 0.0015713  
 Mean   : 0.004810   Mean   : 0.0011140   Mean   : 0.0007989  
 3rd Qu.: 0.008285   3rd Qu.: 0.0175560   3rd Qu.: 0.0136559  
 Max.   : 0.199757   Max.   : 0.0936829   Max.   : 0.0637026  
 NA's   :1           NA's   :1            NA's   :1           
      gip                gdefs              gmilemp               jan         
 Min.   :-0.082037   Min.   :-0.006371   Min.   :-0.023540   Min.   :0.00000  
 1st Qu.:-0.012534   1st Qu.:-0.000177   1st Qu.:-0.003223   1st Qu.:0.00000  
 Median : 0.003210   Median : 0.000000   Median :-0.000472   Median :0.00000  
 Mean   : 0.002808   Mean   : 0.000099   Mean   :-0.000781   Mean   :0.08333  
 3rd Qu.: 0.025387   3rd Qu.: 0.000605   3rd Qu.: 0.002369   3rd Qu.:0.00000  
 Max.   : 0.055362   Max.   : 0.004352   Max.   : 0.019082   Max.   :1.00000  
 NA's   :1           NA's   :6           NA's   :6                            
      feb               mar               apr               may         
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
 Mean   :0.08333   Mean   :0.08333   Mean   :0.08333   Mean   :0.08333  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
                                                                        
      jun               jul               aug               sep         
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
 Mean   :0.08333   Mean   :0.08333   Mean   :0.08333   Mean   :0.08333  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
                                                                        
      oct               nov               dec         
 Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :0.00000   Median :0.00000  
 Mean   :0.08333   Mean   :0.08333   Mean   :0.08333  
 3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
                                                      

Example visualization: time series plot of prccem

sum(is.na(cement$prccem))
[1] 11
# Note: na.omit() drops every row with an NA in any column, not only prccem
cement <- na.omit(cement)
ggplot(cement, aes(x = as.Date(paste(year, month, 1, sep = "-")), y = prccem)) +
  geom_line() +
  labs(title = "BLS PPI for Cement over Time",
       x = "Year",
       y = "PPI for Cement") +
  theme_minimal()
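The na.omit() call above removes rows that are incomplete in any of the 30 columns. The regression output later in this report shows 293 residual degrees of freedom, which implies 295 rows remain, whereas dropping only the 11 missing prccem values would leave 301. A minimal sketch of the difference, using a toy data frame with hypothetical values (not the cement data):

```r
# Toy data frame (hypothetical values, not the cement data)
toy <- data.frame(
  price  = c(NA, 10, 12, 15, 11),
  output = c(1.0, NA, 1.2, 1.1, 0.9),
  other  = c(5, 6, NA, 8, 9)
)

# Listwise deletion: a row is dropped if ANY column contains NA
listwise <- na.omit(toy)

# Targeted filtering: drop rows only where the column of interest is NA
price_only <- toy[!is.na(toy$price), ]

nrow(listwise)    # 2 rows survive
nrow(price_only)  # 4 rows survive
```

If only one or two variables are needed for a plot, filtering on those columns keeps more observations than listwise deletion.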

Histogram of gprc

ggplot(cement, aes(x = gprc)) +
  geom_histogram(binwidth = 0.005, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Log Changes in BLS PPI for Cement",
       x = "Log Changes (gprc)",
       y = "Frequency") +
  theme_minimal()
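The bin width matters here because gprc only spans roughly -0.03 to 0.09 according to the summary above, so a width of 0.1 would put nearly all observations in one or two bars. One common way to pick a width is the Freedman-Diaconis rule. The sketch below applies it to simulated values; the vector x is a hypothetical stand-in for gprc, not the real series.

```r
set.seed(1)
# Simulated small log changes (hypothetical stand-in for gprc)
x <- rnorm(300, mean = 0.004, sd = 0.01)

# Freedman-Diaconis rule: width = 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(x) / length(x)^(1/3)

# Number of bins that this width implies over the observed range
n_bins <- ceiling(diff(range(x)) / fd_width)

fd_width
n_bins
```

The resulting width can be passed to geom_histogram(binwidth = fd_width) as a starting point and adjusted by eye.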

Scatter plot of ip versus rresc

ggplot(cement, aes(x = ip, y = rresc)) +
  geom_point(color = "green") +
  labs(title = "Scatter Plot: Aggregate Index of Industrial Production vs Real Residential Construction",
       x = "Aggregate Index of Industrial Production (ip)",
       y = "Real Residential Construction (rresc)") +
  theme_minimal()

Boxplot of Real Nonresidential Construction (rnonc) by Month

cement$month <- factor(cement$month, labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
ggplot(cement, aes(x = month, y = rnonc, fill = month)) +
  geom_boxplot() +
  labs(title = "Boxplot of Real Nonresidential Construction by Month",
       x = "Month",
       y = "Real Nonresidential Construction (rnonc)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

Scatter Plot Matrix of Macroeconomic Variables

macro_vars <- cement[, c("prccem", "ipcem", "prcpet", "rdefs", "milemp")]
pairs(macro_vars)
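A scatter plot matrix is often read alongside a correlation matrix, which puts a number on each panel. The sketch below uses a toy data frame with hypothetical columns a, b, and c rather than the cement variables; the same cor() call could be applied to macro_vars.

```r
set.seed(7)
# Hypothetical toy series standing in for the macro variables
toy <- data.frame(a = rnorm(50))
toy$b <- 2 * toy$a + rnorm(50, sd = 0.5)  # strongly related to a
toy$c <- rnorm(50)                        # unrelated noise
toy$b[3] <- NA                            # one missing value

# Pearson correlations computed on complete rows only
cor_mat <- cor(toy, use = "complete.obs")
round(cor_mat, 2)
```

The use = "complete.obs" argument makes cor() ignore rows with missing values instead of returning NA.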

Normal Distribution

shapiro.test(cement$prccem)

    Shapiro-Wilk normality test

data:  cement$prccem
W = 0.83808, p-value < 2.2e-16
ggplot(cement, aes(x = prccem)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50, fill = "skyblue", color = "black") +
  geom_density(alpha = 0.2, fill = "blue") +
  labs(title = "Histogram with Kernel Density Estimate of BLS PPI for Cement",
       x = "PPI for Cement",
       y = "Density") +
  theme_minimal()
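The very small p-value in the Shapiro-Wilk output above leads to rejecting the hypothesis that prccem is normally distributed. To illustrate how the test behaves, the sketch below runs it on two simulated samples (hypothetical data, not the cement series): one drawn from a normal distribution and one from a flat, uniform distribution.

```r
set.seed(42)
normal_x  <- rnorm(300)   # drawn from a normal distribution
uniform_x <- runif(300)   # flat distribution, clearly non-normal

# Extract the p-values from the test objects
p_normal  <- shapiro.test(normal_x)$p.value
p_uniform <- shapiro.test(uniform_x)$p.value

p_normal
p_uniform
```

A small p-value is evidence against normality. The test also assumes independent observations, which may not hold for a monthly price index, so the result is best read together with the histogram.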

Analysis of Variance (ANOVA)

anova_result <- aov(prccem ~ factor(month), data = cement)
summary(anova_result)
               Df   Sum Sq Mean Sq F value Pr(>F)
factor(month)  11    10617     965    0.01      1
Residuals     283 26467233   93524               
# month was already converted to a labelled factor for the rnonc boxplot above
ggplot(cement, aes(x = month, y = prccem, fill = month)) +
  geom_boxplot() +
  labs(title = "Boxplot of BLS PPI for Cement by Month",
       x = "Month",
       y = "PPI for Cement") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
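The F value in the ANOVA table above can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below copies those four numbers from the table; the variable names are only illustrative.

```r
# Values copied from the ANOVA table above
ss_between <- 10617
df_between <- 11
ss_within  <- 26467233
df_within  <- 283

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 4)
round(p_value, 4)
```

The variation between monthly means is tiny relative to the variation within months, so the test gives no evidence of a seasonal difference in the level of prccem.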

Linear Regression

lm_model <- lm(prccem ~ ipcem, data = cement)
summary(lm_model)

Call:
lm(formula = prccem ~ ipcem, data = cement)

Residuals:
   Min     1Q Median     3Q    Max 
-392.5 -313.0   -7.5  329.6  399.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   612.66      69.50   8.815   <2e-16 ***
ipcem          56.76      69.08   0.822    0.412    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 300.3 on 293 degrees of freedom
Multiple R-squared:  0.002299,  Adjusted R-squared:  -0.001106 
F-statistic: 0.6752 on 1 and 293 DF,  p-value: 0.4119
ggplot(cement, aes(x = ipcem, y = prccem)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
  labs(title = "Regression Analysis: BLS PPI for Cement vs Industrial Prod. Index (ipcem)",
       x = "Industrial Prod. Index (ipcem)",
       y = "PPI for Cement") +
  theme_minimal()
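Several numbers in the regression summary above follow from one another. The sketch below recomputes the t value, its p-value, the F-statistic, and the adjusted R-squared from the reported estimate, standard error, and R-squared. The sample size n = 295 is an inference from the 293 residual degrees of freedom plus two coefficients.

```r
# Values copied from the regression summary above
estimate  <- 56.76
std_error <- 69.08
r2        <- 0.002299
n <- 295   # inferred: 293 residual df + 2 estimated coefficients
k <- 1     # number of slope coefficients

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = n - k - 1, lower.tail = FALSE)

# With a single regressor, the F-statistic equals the squared t value
f_stat <- t_value^2

# Adjusted R-squared penalizes R-squared for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

round(c(t = t_value, p = p_value, F = f_stat, adj_r2 = adj_r2), 4)
```

The slope is not statistically distinguishable from zero, and the negative adjusted R-squared indicates that ipcem explains essentially none of the variation in prccem in this specification.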

Forecasting

new_data <- data.frame(ipcem = seq(min(cement$ipcem), max(cement$ipcem), length.out = 10))
predicted <- predict(lm_model, newdata = new_data, interval = "prediction")
predicted_df <- cbind(new_data, as.data.frame(predicted))
ggplot() +
  geom_point(data = cement, aes(x = ipcem, y = prccem), color = "blue") +
  geom_line(data = predicted_df, aes(x = ipcem, y = fit), color = "red") +
  geom_ribbon(data = predicted_df, aes(x = ipcem, ymin = lwr, ymax = upr), alpha = 0.3, fill = "grey") +
  labs(title = "Forecasting BLS PPI for Cement based on Industrial Prod. Index (ipcem)",
       x = "Industrial Prod. Index (ipcem)",
       y = "PPI for Cement") +
  theme_minimal()
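The ribbon above uses interval = "prediction", which covers individual new observations and is always wider than the confidence interval for the fitted mean. The sketch below compares the two on simulated data; x, y, and the coefficients used to generate them are hypothetical and only loosely mimic the scale of the cement regression.

```r
set.seed(123)
# Simulated data (hypothetical, not the cement series)
x <- runif(100, 0.4, 1.5)
y <- 600 + 50 * x + rnorm(100, sd = 300)
fit <- lm(y ~ x)

# Evenly spaced grid over the observed range of x
grid <- data.frame(x = seq(min(x), max(x), length.out = 10))

# Interval for the mean response vs. interval for a new observation
conf_band <- predict(fit, newdata = grid, interval = "confidence")
pred_band <- predict(fit, newdata = grid, interval = "prediction")

conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

# The prediction band is wider at every grid point
all(pred_width > conf_width)
```

Because the grid stays inside the observed range of the regressor, this is interpolation along the fitted line rather than a forecast of future months.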