Exploring the “cement” dataset from the “wooldridge” package: the data contain 312 monthly observations from 1964 to 1989, with variables such as the producer price index for cement (prccem), the industrial production index for cement (ipcem), and real residential construction (rresc). I start by inspecting the data structure with str() and reviewing descriptive statistics with summary(). The ggplot2 visualizations include a time series plot of prccem, a histogram of the log changes in prccem (gprc), a scatter plot of the relationship between ip and rresc, and a boxplot of rnonc by month. This exploration gives an understanding of the data's characteristics for further analysis in an economic and industrial context.
Central Tendency of the Data
library(wooldridge)
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stats)
data("cement")
str(cement)
'data.frame': 312 obs. of 30 variables:
$ year : int 1964 1964 1964 1964 1964 1964 1964 1964 1964 1964 ...
$ month : int 1 2 3 4 5 6 7 8 9 10 ...
$ prccem : int NA NA NA NA NA NA NA NA NA NA ...
$ ipcem : num 0.474 0.531 0.643 0.826 1.027 ...
$ prcpet : num 13.4 13.4 13.4 13.4 13.4 ...
$ rresc : num 115401 115118 123663 116178 111034 ...
$ rnonc : num 142180 144190 145577 150793 149259 ...
$ ip : num 44.6 45.9 46.2 46.9 47.1 ...
$ rdefs : num NA 1.66 1.66 1.66 1.66 ...
$ milemp : int 2687 2696 2693 2694 2690 2687 2696 2693 2690 2680 ...
$ gprc : num NA NA NA NA NA NA NA NA NA NA ...
$ gcem : num NA 0.113 0.19 0.251 0.218 ...
$ gprcpet: num NA 0 0 0 0 0 0 0 0 0 ...
$ gres : num NA -0.00246 0.0716 -0.06244 -0.04529 ...
$ gnon : num NA 0.01404 0.00957 0.0352 -0.01022 ...
$ gip : num NA 0.02873 0.00651 0.01504 0.00426 ...
$ gdefs : num NA NA -0.000481 0 -0.000963 ...
$ gmilemp: num NA 0.003344 -0.001113 0.000371 -0.001486 ...
$ jan : int 1 0 0 0 0 0 0 0 0 0 ...
$ feb : int 0 1 0 0 0 0 0 0 0 0 ...
$ mar : int 0 0 1 0 0 0 0 0 0 0 ...
$ apr : int 0 0 0 1 0 0 0 0 0 0 ...
$ may : int 0 0 0 0 1 0 0 0 0 0 ...
$ jun : int 0 0 0 0 0 1 0 0 0 0 ...
$ jul : int 0 0 0 0 0 0 1 0 0 0 ...
$ aug : int 0 0 0 0 0 0 0 1 0 0 ...
$ sep : int 0 0 0 0 0 0 0 0 1 0 ...
$ oct : int 0 0 0 0 0 0 0 0 0 1 ...
$ nov : int 0 0 0 0 0 0 0 0 0 0 ...
$ dec : int 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
summary(cement)
year month prccem ipcem
Min. :1964 Min. : 1.00 Min. : 287.0 Min. :0.4072
1st Qu.:1970 1st Qu.: 3.75 1st Qu.: 361.0 1st Qu.:0.7632
Median :1976 Median : 6.50 Median : 672.0 Median :1.0315
Mean :1976 Mean : 6.50 Mean : 672.4 Mean :0.9724
3rd Qu.:1983 3rd Qu.: 9.25 3rd Qu.: 999.0 3rd Qu.:1.1798
Max. :1989 Max. :12.00 Max. :1067.0 Max. :1.4567
NA's :11 NA's :2
prcpet rresc rnonc ip
Min. : 13.40 Min. : 80798 Min. :141753 Min. : 44.60
1st Qu.: 14.50 1st Qu.:106085 1st Qu.:161970 1st Qu.: 62.48
Median : 36.10 Median :124003 Median :172170 Median : 76.55
Mean : 43.48 Mean :128436 Mean :169424 Mean : 76.65
3rd Qu.: 59.83 3rd Qu.:151440 3rd Qu.:178110 3rd Qu.: 87.80
Max. :114.90 Max. :177458 Max. :187406 Max. :110.90
rdefs milemp gprc gcem
Min. :1.620 Min. :2018 Min. :-0.025872 Min. :-0.616466
1st Qu.:1.634 1st Qu.:2104 1st Qu.: 0.000000 1st Qu.:-0.082918
Median :1.663 Median :2153 Median : 0.000000 Median : 0.031626
Mean :1.667 Mean :2425 Mean : 0.004217 Mean : 0.003198
3rd Qu.:1.702 3rd Qu.:2688 3rd Qu.: 0.003188 3rd Qu.: 0.124797
Max. :1.723 Max. :3547 Max. : 0.089841 Max. : 0.469712
NA's :5 NA's :5 NA's :12 NA's :3
gprcpet gres gnon
Min. :-0.324846 Min. :-0.0949657 Min. :-0.0557258
1st Qu.: 0.000000 1st Qu.:-0.0141437 1st Qu.:-0.0116814
Median : 0.000000 Median : 0.0006191 Median : 0.0015713
Mean : 0.004810 Mean : 0.0011140 Mean : 0.0007989
3rd Qu.: 0.008285 3rd Qu.: 0.0175560 3rd Qu.: 0.0136559
Max. : 0.199757 Max. : 0.0936829 Max. : 0.0637026
NA's :1 NA's :1 NA's :1
gip gdefs gmilemp jan
Min. :-0.082037 Min. :-0.006371 Min. :-0.023540 Min. :0.00000
1st Qu.:-0.012534 1st Qu.:-0.000177 1st Qu.:-0.003223 1st Qu.:0.00000
Median : 0.003210 Median : 0.000000 Median :-0.000472 Median :0.00000
Mean : 0.002808 Mean : 0.000099 Mean :-0.000781 Mean :0.08333
3rd Qu.: 0.025387 3rd Qu.: 0.000605 3rd Qu.: 0.002369 3rd Qu.:0.00000
Max. : 0.055362 Max. : 0.004352 Max. : 0.019082 Max. :1.00000
NA's :1 NA's :6 NA's :6
feb mar apr may
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.08333 Mean :0.08333 Mean :0.08333 Mean :0.08333
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
jun jul aug sep
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.08333 Mean :0.08333 Mean :0.08333 Mean :0.08333
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
oct nov dec
Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.08333 Mean :0.08333 Mean :0.08333
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000 Max. :1.00000
Example visualization: time series plot of the variable prccem
sum(is.na(cement$prccem))
[1] 11
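The count above covers prccem only, while the na.omit() call that follows drops a row if any column is missing. The regression output later reports 293 residual degrees of freedom, which corresponds to 295 remaining rows rather than 312 - 11 = 301. The sketch below shows the difference on a small made-up data frame; the name toy and its values are illustrative assumptions, not part of the cement data.

```r
# Toy data frame standing in for cement (values are invented for illustration)
toy <- data.frame(
  year   = c(1964, 1964, 1964, 1964),
  prccem = c(NA, NA, 287, 290),
  gcem   = c(0.100, 0.113, NA, 0.251)
)

# Missing values per column, not just for one variable
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that has an NA in ANY column
toy_complete <- na.omit(toy)
print(nrow(toy_complete))

# Alternative: drop rows only where the variable of interest is missing
toy_prccem <- toy[!is.na(toy$prccem), ]
print(nrow(toy_prccem))
```

Running colSums(is.na(cement)) in the same way would show which columns drive the extra row losses; filtering on prccem alone keeps more observations for plots that use only that variable.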
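# Drop every row that contains an NA in any column
cement <- na.omit(cement)
# Build a Date from year and month (day fixed to 1) for the x-axis
ggplot(cement, aes(x = as.Date(paste(year, month, 1, sep = "-")), y = prccem)) +
  geom_line() +
  labs(title = "BLS PPI for Cement over Time",
       x = "Year",
       y = "PPI for Cement") +
  theme_minimal()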
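Histogram of the variable gprc
# Note: gprc ranges from about -0.03 to 0.09 (see summary above), so
# binwidth = 0.1 puts nearly all values into one or two bins; a smaller
# width (for example 0.005) would show more of the distribution's shape.
ggplot(cement, aes(x = gprc)) +
  geom_histogram(binwidth = 0.1, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Log Changes in BLS PPI for Cement",
       x = "Log Changes (gprc)",
       y = "Frequency") +
  theme_minimal()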
Scatter plot of the variables ip and rresc
ggplot(cement, aes(x = ip, y = rresc)) +
  geom_point(color = "green") +
  labs(title = "Scatter Plot: Aggregate Index of Industrial Production vs Real Residential Construction",
       x = "Aggregate Index of Industrial Production (ip)",
       y = "Real Residential Construction (rresc)") +
  theme_minimal()
Boxplot of Real Nonresidential Construction (rnonc) by Month
# Convert the integer month (1-12) into a labelled factor for the x-axis
cement$month <- factor(cement$month,
                       labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
ggplot(cement, aes(x = month, y = rnonc, fill = month)) +
  geom_boxplot() +
  labs(title = "Boxplot of Real Nonresidential Construction by Month",
       x = "Month",
       y = "Real Nonresidential Construction (rnonc)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
shapiro.test(cement$prccem)
Shapiro-Wilk normality test
data: cement$prccem
W = 0.83808, p-value < 2.2e-16
# geom_density() overlays a kernel density estimate, not a fitted normal curve
ggplot(cement, aes(x = prccem)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 50,
                 fill = "skyblue", color = "black") +
  geom_density(alpha = 0.2, fill = "blue") +
  labs(title = "Histogram with Kernel Density Curve of BLS PPI for Cement",
       x = "PPI for Cement",
       y = "Density") +
  theme_minimal()
Analysis of Variance (ANOVA)
# One-way ANOVA: does mean prccem differ across calendar months?
anova_result <- aov(prccem ~ factor(month), data = cement)
summary(anova_result)
Df Sum Sq Mean Sq F value Pr(>F)
factor(month) 11 10617 965 0.01 1
Residuals 283 26467233 93524
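As a quick check on how the table above fits together, the F value is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. The sketch below recomputes these from the printed (rounded) figures; the variable names are illustrative assumptions.

```r
# Figures copied from the ANOVA table (already rounded in the printout)
ss_month <- 10617
df_month <- 11
ss_resid <- 26467233
df_resid <- 283

# Mean squares: sum of squares divided by degrees of freedom
ms_month <- ss_month / df_month
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under the F distribution
f_value <- ms_month / ms_resid
p_value <- pf(f_value, df_month, df_resid, lower.tail = FALSE)

print(round(c(ms_month = ms_month, ms_resid = ms_resid, f_value = f_value), 4))
print(p_value)
```

An F value this close to zero means the variation between monthly means is tiny relative to the variation within months, so the test gives no evidence that the mean of prccem differs by month.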
# month was already converted to a labelled factor above; repeating the
# conversion leaves it unchanged and lets this block run on its own
cement$month <- factor(cement$month,
                       labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
ggplot(cement, aes(x = month, y = prccem, fill = month)) +
  geom_boxplot() +
  labs(title = "Boxplot of BLS PPI for Cement by Month",
       x = "Month",
       y = "PPI for Cement") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
Linear Regression
# Simple linear regression of the cement price index on cement production
lm_model <- lm(prccem ~ ipcem, data = cement)
summary(lm_model)
Call:
lm(formula = prccem ~ ipcem, data = cement)
Residuals:
Min 1Q Median 3Q Max
-392.5 -313.0 -7.5 329.6 399.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 612.66 69.50 8.815 <2e-16 ***
ipcem 56.76 69.08 0.822 0.412
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 300.3 on 293 degrees of freedom
Multiple R-squared: 0.002299, Adjusted R-squared: -0.001106
F-statistic: 0.6752 on 1 and 293 DF, p-value: 0.4119
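The figures in the regression summary above are tied together by a few identities that hold for a single-predictor model: the t value is the estimate divided by its standard error, the F statistic equals the squared t value, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below checks this using the printed (rounded) numbers; the variable names are illustrative assumptions.

```r
# Figures copied from the regression summary (rounded in the printout)
estimate <- 56.76
std_err  <- 69.08
df_resid <- 293

# t value and two-sided p-value for the ipcem slope
t_value <- estimate / std_err
p_value <- 2 * pt(-abs(t_value), df_resid)

# With one predictor, F = t^2 and R^2 = F / (F + residual df)
f_value   <- t_value^2
r_squared <- f_value / (f_value + df_resid)

print(round(c(t = t_value, p = p_value, F = f_value, R2 = r_squared), 4))
```

The recomputed values agree with the printout up to rounding: the slope on ipcem is not statistically distinguishable from zero, and the model explains well under 1% of the variation in prccem.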
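# Note: the smoother below fits a quadratic, poly(x, 2), whereas lm_model
# above is a straight-line fit; use formula = y ~ x to draw the same model.
ggplot(cement, aes(x = ipcem, y = prccem)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "red") +
  labs(title = "Regression Analysis: BLS PPI for Cement vs Industrial Prod. Index (ipcem)",
       x = "Industrial Prod. Index (ipcem)",
       y = "PPI for Cement") +
  theme_minimal()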
Forecasting
# Ten evenly spaced ipcem values across the observed range
new_data <- data.frame(ipcem = seq(min(cement$ipcem), max(cement$ipcem), length.out = 10))
# Fitted values with lower and upper prediction bounds (columns fit, lwr, upr)
predicted <- predict(lm_model, newdata = new_data, interval = "prediction")
predicted_df <- cbind(new_data, as.data.frame(predicted))
ggplot() +
  geom_point(data = cement, aes(x = ipcem, y = prccem), color = "blue") +
  geom_line(data = predicted_df, aes(x = ipcem, y = fit), color = "red") +
  geom_ribbon(data = predicted_df, aes(x = ipcem, ymin = lwr, ymax = upr),
              alpha = 0.3, fill = "grey") +
  labs(title = "Forecasting BLS PPI for Cement based on Industrial Prod. Index (ipcem)",
       x = "Industrial Prod. Index (ipcem)",
       y = "PPI for Cement") +
  theme_minimal()