Eksplorasi Data Wooldridge

Author

Aveline Ong

Eksplorasi Data

Eksplorasi data adalah proses awal dalam analisis data yang melibatkan pemeriksaan dan visualisasi dataset untuk memahami struktur, pola, dan karakteristiknya sebelum melakukan analisis lebih lanjut. Tujuan utamanya adalah mengidentifikasi distribusi, hubungan, dan potensi anomali dalam data.

Packeges Wooldridge : Datasets Wage1

Pemusatan Data

library(wooldridge)
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stats)
data("wage1")
str(wage1)
'data.frame':   526 obs. of  24 variables:
 $ wage    : num  3.1 3.24 3 6 5.3 ...
 $ educ    : int  11 12 11 8 12 16 18 12 12 17 ...
 $ exper   : int  2 22 2 44 7 9 15 5 26 22 ...
 $ tenure  : int  0 2 0 28 2 8 7 3 4 21 ...
 $ nonwhite: int  0 0 0 0 0 0 0 0 0 0 ...
 $ female  : int  1 1 0 0 0 0 0 1 1 0 ...
 $ married : int  0 1 0 1 1 1 0 0 0 1 ...
 $ numdep  : int  2 3 2 0 1 0 0 0 2 0 ...
 $ smsa    : int  1 1 0 1 0 1 1 1 1 1 ...
 $ northcen: int  0 0 0 0 0 0 0 0 0 0 ...
 $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ west    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
 $ ndurman : int  0 0 0 0 0 0 0 0 0 0 ...
 $ trcommpu: int  0 0 0 0 0 0 0 0 0 0 ...
 $ trade   : int  0 0 1 0 0 0 1 0 1 0 ...
 $ services: int  0 1 0 0 0 0 0 0 0 0 ...
 $ profserv: int  0 0 0 0 0 1 0 0 0 0 ...
 $ profocc : int  0 0 0 0 0 1 1 1 1 1 ...
 $ clerocc : int  0 0 0 1 0 0 0 0 0 0 ...
 $ servocc : int  0 1 0 0 0 0 0 0 0 0 ...
 $ lwage   : num  1.13 1.18 1.1 1.79 1.67 ...
 $ expersq : int  4 484 4 1936 49 81 225 25 676 484 ...
 $ tenursq : int  0 4 0 784 4 64 49 9 16 441 ...
 - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
summary(wage1)
      wage             educ           exper           tenure      
 Min.   : 0.530   Min.   : 0.00   Min.   : 1.00   Min.   : 0.000  
 1st Qu.: 3.330   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
 Median : 4.650   Median :12.00   Median :13.50   Median : 2.000  
 Mean   : 5.896   Mean   :12.56   Mean   :17.02   Mean   : 5.105  
 3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:26.00   3rd Qu.: 7.000  
 Max.   :24.980   Max.   :18.00   Max.   :51.00   Max.   :44.000  
    nonwhite          female          married           numdep     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :1.000  
 Mean   :0.1027   Mean   :0.4791   Mean   :0.6084   Mean   :1.044  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :6.000  
      smsa           northcen         south             west       
 Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :0.000   Median :0.0000   Median :0.0000  
 Mean   :0.7224   Mean   :0.251   Mean   :0.3555   Mean   :0.1692  
 3rd Qu.:1.0000   3rd Qu.:0.750   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
    construc          ndurman          trcommpu           trade       
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
 Median :0.00000   Median :0.0000   Median :0.00000   Median :0.0000  
 Mean   :0.04563   Mean   :0.1141   Mean   :0.04373   Mean   :0.2871  
 3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:1.0000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
    services         profserv         profocc          clerocc      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.1008   Mean   :0.2586   Mean   :0.3669   Mean   :0.1673  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    servocc           lwage            expersq          tenursq       
 Min.   :0.0000   Min.   :-0.6349   Min.   :   1.0   Min.   :   0.00  
 1st Qu.:0.0000   1st Qu.: 1.2030   1st Qu.:  25.0   1st Qu.:   0.00  
 Median :0.0000   Median : 1.5369   Median : 182.5   Median :   4.00  
 Mean   :0.1407   Mean   : 1.6233   Mean   : 473.4   Mean   :  78.15  
 3rd Qu.:0.0000   3rd Qu.: 1.9286   3rd Qu.: 676.0   3rd Qu.:  49.00  
 Max.   :1.0000   Max.   : 3.2181   Max.   :2601.0   Max.   :1936.00  

Data Upah pada Wage1

ggplot(wage1, aes(x = wage)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribusi Upah", x = "Upah", y = "Frekuensi")

Data ini berisi informasi mengenai upah pekerja beserta variabel-variabel lainnya. Pada sumbu x menunjukkan upah dan sumbu y menunjukkan frekuensinya.

Perbandingan upah berdasarkan gender :

ggplot(wage1, aes(x = factor(female, labels = c("Male", "Female")), y = wage)) +
  geom_boxplot(fill = c("lightblue", "pink")) +
  labs(title = "Perbandingan Upah Berdasarkan Gender", x = "Gender", y = "Upah")

Perbandingan rata-rata upah antara pekerja laki-laki dan perempuan menunjukkan perbedaan yang signifikan. Biasanya, rata-rata upah pekerja laki-laki lebih tinggi dibandingkan pekerja perempuan.

Perbandingan upah dengan pendidikan yang dimiliki :

ggplot(wage1, aes(x = factor(educ), y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Perbandingan Upah Berdasarkan Pendidikan", x = "Tingkat Pendidikan (Tahun)", y = "Upah")

Rata-rata upah meningkat seiring dengan meningkatnya tingkat pendidikan. Pekerja dengan pendidikan yang lebih tinggi cenderung mendapatkan upah yang lebih tinggi.

Hubungan pengalaman dengan upah :

ggplot(wage1, aes(x = exper, y = wage)) +
  geom_point(color = "blue") +
  labs(title = "Hubungan antara Pengalaman dan Upah", x = "Pengalaman (Tahun)", y = "Upah")

Terdapat korelasi positif antara pengalaman kerja dan upah. Pekerja dengan pengalaman 20-30 tahun cenderung mendapatkan upah yang lebih tinggi.

Menghitung rata-rata upah berdasarkan gender :

average_wage_gender <- aggregate(wage ~ female, data = wage1, FUN = mean)
colnames(average_wage_gender) <- c("Gender", "Average_Wage")
average_wage_gender$Gender <- factor(average_wage_gender$Gender, labels = c("Male", "Female"))
print(average_wage_gender)
  Gender Average_Wage
1   Male     7.099489
2 Female     4.587659
ggplot(average_wage_gender, aes(x = Gender, y = Average_Wage)) +
  geom_bar(stat = "identity", fill = c("lightblue", "pink")) +
  labs(title = "Rata-rata Upah Berdasarkan Gender", x = "Gender", y = "Rata-rata Upah")

Dari data diatas menunjukkan bahwa rata-rata upah laki-laki lebih tinggi dibandingkan perempuan.

Menghitung rata-rata upah berdasarkan pendidikan :

average_wage_education <- aggregate(wage ~ educ, data = wage1, FUN = mean)
colnames(average_wage_education) <- c("Education_Years", "Average_Wage")
print(average_wage_education)
   Education_Years Average_Wage
1                0     3.530000
2                2     3.750000
3                3     2.920000
4                4     3.170000
5                5     2.900000
6                6     3.985000
7                7     4.387500
8                8     5.038182
9                9     3.275882
10              10     3.835667
11              11     4.185517
12              12     5.371364
13              13     5.598974
14              14     6.231698
15              15     6.321429
16              16     8.041618
17              17    11.343333
18              18    10.678947
ggplot(average_wage_education, aes(x = factor(Education_Years), y = Average_Wage)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  labs(title = "Rata-rata Upah Berdasarkan Pendidikan", x = "Tingkat Pendidikan   (Tahun)", y = "Rata-rata Upah")

Rata-rata upah meningkat dengan semakin tinggi tingkat pendidikan. Rata-rata upah tertinggi dimiliki oleh yang memiliki tingkat pendidikan selama 17 tahun.

Menghitung Distribusi Normal dengan Uji Shapiro-Wilk untuk menguji normalitas

ggplot(wage1, aes(x = wage)) +
  geom_histogram(aes(y = ..density..), binwidth = 1, fill = "blue", color = "black") +
  stat_function(fun = dnorm, args = list(mean = mean(wage1$wage, na.rm = TRUE), sd = sd(wage1$wage, na.rm = TRUE)), color = "red", size = 1) +
  labs(title = "Distribusi Upah dengan Kurva Normal", x = "Upah", y = "Density")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

shapiro_test <- shapiro.test(wage1$wage)
print(shapiro_test)

    Shapiro-Wilk normality test

data:  wage1$wage
W = 0.80273, p-value < 2.2e-16

Uji Shapiro-Wilk memberikan nilai p. Jika nilai p kurang dari 0.05, data tidak berdistribusi normal.

Estimasi Mean Wage dan Confidence Interval:

mean_wage <- mean(wage1$wage, na.rm = TRUE)
std_error <- sd(wage1$wage, na.rm = TRUE) / sqrt(nrow(wage1))
conf_interval <- mean_wage + c(-1, 1) * qt(0.975, df = nrow(wage1) - 1) * std_error
print(conf_interval)
[1] 5.579768 6.212437
conf_df <- data.frame(
  Statistic = c("Mean Wage"),
  Estimate = mean_wage,
  Lower_CI = conf_interval[1],
  Upper_CI = conf_interval[2]
)

ggplot(conf_df, aes(x = Statistic, y = Estimate)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
  labs(title = "Rata-rata Upah dengan Confidence Interval",
       x = "",
       y = "Upah") +
  theme_minimal()

Hasil data menunjukkan bahwa 85% rata-rata upah sebenarnya dari populasi berada di antara 5.579768 dan 6.212437. Interval kepercayaan ini memberikan rentang nilai yang mungkin untuk rata-rata upah sebenarnya, menunjukkan tingkat ketidakpastian dalam estimasi kita.

Analisis varians pada data wage1 :

anova_result <- aov(wage ~ factor(educ), data = wage1)
summary(anova_result)
              Df Sum Sq Mean Sq F value Pr(>F)    
factor(educ)  17   1604   94.33   8.623 <2e-16 ***
Residuals    508   5557   10.94                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(wage1, aes(x = factor(educ), y = wage)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Perbandingan Upah Berdasarkan Pendidikan", x = "Tingkat Pendidikan (Tahun)", y = "Upah")

ANOVA menunjukkan apakah terdapat perbedaan yang signifikan dalam rata-rata upah antara kelompok-kelompok yang berbeda.

Regresi Linear untuk memodelkan hubungan antara upah dengan variabel pendidikan, pengalaman, tenure, dan gender :

model <- lm(wage ~ educ + exper + tenure + female, data = wage1)
summary(model)

Call:
lm(formula = wage ~ educ + exper + tenure + female, data = wage1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.7675 -1.8080 -0.4229  1.0467 14.0075 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.56794    0.72455  -2.164   0.0309 *  
educ         0.57150    0.04934  11.584  < 2e-16 ***
exper        0.02540    0.01157   2.195   0.0286 *  
tenure       0.14101    0.02116   6.663 6.83e-11 ***
female      -1.81085    0.26483  -6.838 2.26e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.958 on 521 degrees of freedom
Multiple R-squared:  0.3635,    Adjusted R-squared:  0.3587 
F-statistic:  74.4 on 4 and 521 DF,  p-value: < 2.2e-16
ggplot(wage1, aes(x = educ, y = wage)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Regresi Upah terhadap Pendidikan", x = "Tingkat Pendidikan (Tahun)", y = "Upah")
`geom_smooth()` using formula = 'y ~ x'

Koefisien regresi menunjukkan seberapa besar pengaruh pendidikan terhadap upah. Nilai p menunjukkan signifikansi pengaruh tersebut.

Prediksi Upah untuk beberapa data baru dan menambahkannya ke data frame new_data

new_data <- data.frame(
  educ = c(12, 16),
  exper = c(10, 5),
  tenure = c(3, 2),
  female = c(0, 1)
)
predictions <- predict(model, newdata = new_data)
new_data$predicted_wage <- predictions
print(new_data)
  educ exper tenure female predicted_wage
1   12    10      3      0       5.967092
2   16     5      2      1       6.174275
ggplot(new_data, aes(x = factor(educ), y = predicted_wage)) +
  geom_point(color = "green", size = 4) +
  labs(title = "Prediksi Upah Berdasarkan Pendidikan", x = "Tingkat Pendidikan (Tahun)", y = "Prediksi Upah")

Model regresi digunakan untuk memprediksi upah berdasarkan variabel-variabel tersebut.