Analisis Regresi dengan Variabel Dummy
Penerapan pada Data Lungcap
Berikut ini akan diberikan contoh analisis hubungan kapasitas paru-paru dengan kelompok tinggi badan. Tinggi badan dalam variabel numerik, untuk menyusun kelompok tinggi badan digunakan variabel dummy untuk tinggi badan pada data kapasitas paru-paru.
Import Data
## Age LungCap Height SqrHeight HeightT SqrHeightT SqrAge AgeT
## 1 9 3.124 57.0 3249.00 -4.143578 17.16923849 81 -0.9311927
## 2 8 3.172 67.5 4556.25 6.356422 40.40410088 64 -1.9311927
## 3 7 3.160 54.5 2970.25 -6.643578 44.13712840 49 -2.9311927
## 4 9 2.674 53.0 2809.00 -8.143578 66.31786234 81 -0.9311927
## 5 9 3.685 57.0 3249.00 -4.143578 17.16923849 81 -0.9311927
## 6 8 5.008 61.0 3721.00 -0.143578 0.02061464 64 -1.9311927
## SqrAgeT AgeHeight AgeTHeightT
## 1 0.8671198 513.0 3.8584694
## 2 3.7295051 540.0 -12.2754755
## 3 8.5918904 381.5 19.4736070
## 4 0.8671198 477.0 7.5832400
## 5 0.8671198 513.0 3.8584694
## 6 3.7295051 488.0 0.2772767
Mengubah variabel numerik menjadi kategorik
#Jika tinggi badan akan dibagi menjadi dua kelompok dengan interval A = (0,60], B = (60,75] #gunakan perintah berikut ini,
CatHeight<-cut(LungCap$Height,breaks=c(0,60,75),labels=c("A","B"))
#Dan jika interval A = [0,60), B = [60,75) yang akan disusun, gunakan perintah berikut ini
CatHeight<-cut(LungCap$Height,breaks=c(0,60,75),labels=c("A","B"),right=F)
# atau
CatHeight<-cut(LungCap$Height,breaks=2,labels=c("A","B"),right=F)
LungCapNew<-data.frame(LungCap,CatHeight)
str(LungCapNew)
## 'data.frame': 654 obs. of 12 variables:
## $ Age : int 9 8 7 9 9 8 6 6 8 9 ...
## $ LungCap : num 3.12 3.17 3.16 2.67 3.68 ...
## $ Height : num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
## $ SqrHeight : num 3249 4556 2970 2809 3249 ...
## $ HeightT : num -4.14 6.36 -6.64 -8.14 -4.14 ...
## $ SqrHeightT : num 17.2 40.4 44.1 66.3 17.2 ...
## $ SqrAge : int 81 64 49 81 81 64 36 36 64 81 ...
## $ AgeT : num -0.931 -1.931 -2.931 -0.931 -0.931 ...
## $ SqrAgeT : num 0.867 3.73 8.592 0.867 0.867 ...
## $ AgeHeight : num 513 540 382 477 513 ...
## $ AgeTHeightT: num 3.86 -12.28 19.47 7.58 3.86 ...
## $ CatHeight : Factor w/ 2 levels "A","B": 1 2 1 1 1 2 1 1 1 2 ...
Eksplorasi Data
# menghitung rata-rata kapasitas paru-paru untuk subyek pada kelompok tinggi A
mean(LungCapNew$LungCap[LungCapNew$CatHeight=="A"])
## [1] 3.576216
# menghitung rata-rata kapasitas paru-paru untuk subyek pada kelompok tinggi B
mean(LungCapNew$LungCap[LungCapNew$CatHeight=="B"])
## [1] 7.308531
Persamaan Regresi
model1<-lm(LungCap~CatHeight,data=LungCapNew)
summary(model1) #reference category adalah A (0) -> B(1)
##
## Call:
## lm(formula = LungCap ~ CatHeight, data = LungCapNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2265 -1.2760 -0.2379 0.8998 8.0705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5762 0.1196 29.91 <2e-16 ***
## CatHeightB 3.7323 0.1512 24.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872 on 652 degrees of freedom
## Multiple R-squared: 0.4831, Adjusted R-squared: 0.4823
## F-statistic: 609.3 on 1 and 652 DF, p-value: < 2.2e-16
# mengganti reference category menjadi B (0) -> A(1)
LungCapNew$CatHeight<-factor(LungCapNew$CatHeight,c("B","A"))
model2<-lm(LungCap~CatHeight,data=LungCapNew)
summary(model2)
##
## Call:
## lm(formula = LungCap ~ CatHeight, data = LungCapNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2265 -1.2760 -0.2379 0.8998 8.0705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.30853 0.09255 78.97 <2e-16 ***
## CatHeightA -3.73231 0.15120 -24.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.872 on 652 degrees of freedom
## Multiple R-squared: 0.4831, Adjusted R-squared: 0.4823
## F-statistic: 609.3 on 1 and 652 DF, p-value: < 2.2e-16
Baik A sebagai refence variable maupun B sebagai refence variable. Nilai dugaan lungcap untuk masing-masing kategori, akan bernilai sama.
Contoh:
- Model 1
\(\hat {Y} = 3.7795+3.6770 *(CatheightB)\)
\(\hat {Y}_1 = 3.7795+3.6770 * 0\)
\(\hat {Y}_1 = 3.7795\)
- Model 2
\(\hat {Y} = 7.45650-3.67705*(CatheightA)\)
\(\hat {Y}_1 = 7.45650-3.67705*1\)
\(\hat {Y}_1 = 3.7795\)
Visualisasi Data
library(ggplot2)
ggplot(LungCapNew,aes(x=CatHeight,y=LungCap,group=1))+ geom_point()+ geom_smooth(method=lm,se=FALSE)+
theme_gray()
ggplot(LungCapNew,aes(x=CatHeight,y=LungCap))+ geom_point()+ geom_smooth(method=lm,se=FALSE)+
theme_classic()
## 'data.frame': 654 obs. of 12 variables:
## $ Age : int 9 8 7 9 9 8 6 6 8 9 ...
## $ LungCap : num 3.12 3.17 3.16 2.67 3.68 ...
## $ Height : num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
## $ SqrHeight : num 3249 4556 2970 2809 3249 ...
## $ HeightT : num -4.14 6.36 -6.64 -8.14 -4.14 ...
## $ SqrHeightT : num 17.2 40.4 44.1 66.3 17.2 ...
## $ SqrAge : int 81 64 49 81 81 64 36 36 64 81 ...
## $ AgeT : num -0.931 -1.931 -2.931 -0.931 -0.931 ...
## $ SqrAgeT : num 0.867 3.73 8.592 0.867 0.867 ...
## $ AgeHeight : num 513 540 382 477 513 ...
## $ AgeTHeightT: num 3.86 -12.28 19.47 7.58 3.86 ...
## $ CatHeight : num 2 1 2 2 2 1 2 2 2 1 ...
model1<-lm(LungCap~Age,data=LungCapNew)
plot(LungCap~Age,data=LungCapNew, col = CatHeight + 1, pch = CatHeight + 1, cex = 0.5)
abline(model1, lwd = 3, col = "grey")
legend("topright", c("B", "A"), col = c(2, 3), pch = c(2, 3))
model2<-lm(LungCap~Age+CatHeight,data=LungCapNew)
plot(LungCap~Age,data=LungCapNew, col = CatHeight + 1, pch = CatHeight + 1, cex = 0.5)
abline(model1, lwd = 3, col = "grey")
legend("topright", c("Height B", "Height A"), col = c(2, 3), pch = c(2, 3))
Penerapan pada Data Brithsmoker
Berikut ini akan diberikan contoh analisis hubungan berat bayi (Wgt), status merokok ibu (Smoke), dan jangka waktu kehamilan dalam satuan minggu (Gest). Perhatikan tipe data dari variabel kategorik dalam data.
Import Data
## 'data.frame': 32 obs. of 3 variables:
## $ Wgt : int 2940 3130 2420 2450 2760 2440 3226 3301 2729 3410 ...
## $ Gest : int 38 38 36 34 39 35 40 42 37 40 ...
## $ Smoke: chr "yes" "no" "yes" "no" ...
Persamaan Regresi
plot(x=Gest, y=Wgt, ylim=c(2300, 3700),
col=ifelse(Smoke=="yes", "red", "blue"),
panel.last = c(lines(sort(Gest[Smoke=="no"]),
fitted(model)[Smoke=="no"][order(Gest[Smoke=="no"])],
col="blue"),
lines(sort(Gest[Smoke=="yes"]),
fitted(model)[Smoke=="yes"][order(Gest[Smoke=="yes"])],
col="red")))
##
## Call:
## lm(formula = Wgt ~ Gest + Smoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
Interval Kepercayaan
## 2.5 % 97.5 %
## (Intercept) -3103.7795 -1675.3663
## Gest 124.4312 161.7694
## Smokeyes -330.4064 -158.6817
Prediksi Data Baru
## fit lwr upr
## 1 2803.693 2740.599 2866.788
## 2 3048.237 2989.120 3107.355
Persamaan Regresi hanya untuk Kategori Tidak
#Ulangi proses analisis hanya untuk subyek yang tidak merokok saja
model.0 <- lm(Wgt ~ Gest, subset=Smoke=="no")
summary(model.0)
##
## Call:
## lm(formula = Wgt ~ Gest, subset = Smoke == "no")
##
## Residuals:
## Min 1Q Median 3Q Max
## -171.52 -101.59 23.28 83.63 139.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2546.14 457.29 -5.568 6.93e-05 ***
## Gest 147.21 11.97 12.294 6.85e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 106.9 on 14 degrees of freedom
## Multiple R-squared: 0.9152, Adjusted R-squared: 0.9092
## F-statistic: 151.1 on 1 and 14 DF, p-value: 6.852e-09
Prediksi Data Baru
## fit lwr upr
## 1 3047.724 2990.298 3105.15
Persamaan Regresi hanya untuk Kategori Ya
#Ulangi proses analisis hanya untuk subyek yang merokok saja
model.1 <- lm(Wgt ~ Gest, subset=Smoke=="yes")
summary(model.1)
##
## Call:
## lm(formula = Wgt ~ Gest, subset = Smoke == "yes")
##
## Residuals:
## Min 1Q Median 3Q Max
## -228.53 -64.86 -19.10 93.89 184.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2474.56 553.97 -4.467 0.000532 ***
## Gest 139.03 14.11 9.851 1.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126.6 on 14 degrees of freedom
## Multiple R-squared: 0.8739, Adjusted R-squared: 0.8649
## F-statistic: 97.04 on 1 and 14 DF, p-value: 1.125e-07
Prediksi Data Baru
## fit lwr upr
## 1 2808.528 2731.726 2885.331
Persamaan Regresi hanya untuk Kategori Ya
dan
Tidak
#Ulangi proses analisis dengan menggunakan kode 1 dan -1
#model3: references : yes
Smoke2 <- ifelse(Smoke=="yes", 1, -1)
model.3 <- lm(Wgt ~ Gest + Smoke2)
summary(model.3)
##
## Call:
## lm(formula = Wgt ~ Gest + Smoke2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2511.845 353.449 -7.107 8.07e-08 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smoke2 -122.272 20.991 -5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
#atau gunakan perintah lain (-1 menjadi yes, 1 menjadi no)
model.3 <- lm(Wgt ~ Gest + Smoke, contrasts=list(Smoke="contr.sum"))
summary(model.3)
##
## Call:
## lm(formula = Wgt ~ Gest + Smoke, contrasts = list(Smoke = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -223.693 -92.063 -9.365 79.663 197.507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2511.845 353.449 -7.107 8.07e-08 ***
## Gest 143.100 9.128 15.677 1.07e-15 ***
## Smoke1 122.272 20.991 5.825 2.58e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
## F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
Penerapan pada data dengan lebih dari dua faktor
Data
Data yang digunakan dalam praktikum ini adalah data mtcars. Data ini merupakan dataset yang ada dalam R. Deskripsi data mtcars:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
data(mtcars)
mtcars$kata[mtcars$vs==1]="SL" #menambahkan variabel kata
mtcars$kata[mtcars$vs==0]="VS"
head(mtcars$kata)
## [1] "VS" "VS" "SL" "SL" "VS" "SL"
## [1] FALSE
## 'data.frame': 32 obs. of 12 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## $ kata: chr "VS" "VS" "SL" "SL" ...
Model Regresi
Model dengan menggunakan variabel hp dan kata.
##
## Call:
## lm(formula = mpg ~ hp + kata, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7131 -2.3336 -0.1332 1.9055 7.9055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.53922 1.67062 17.682 < 2e-16 ***
## hp -0.05453 0.01448 -3.766 0.000752 ***
## kataVS -2.57622 1.96966 -1.308 0.201163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.818 on 29 degrees of freedom
## Multiple R-squared: 0.6246, Adjusted R-squared: 0.5987
## F-statistic: 24.12 on 2 and 29 DF, p-value: 6.768e-07
Model dengan mengganti reference variable
## [1] "SL" "VS"
## [1] "VS" "SL"
##
## Call:
## lm(formula = mpg ~ hp + kata, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7131 -2.3336 -0.1332 1.9055 7.9055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.96300 2.89069 9.328 3.13e-10 ***
## hp -0.05453 0.01448 -3.766 0.000752 ***
## kataSL 2.57622 1.96966 1.308 0.201163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.818 on 29 degrees of freedom
## Multiple R-squared: 0.6246, Adjusted R-squared: 0.5987
## F-statistic: 24.12 on 2 and 29 DF, p-value: 6.768e-07
Model dengan menggunakan variabel yang memiliki lebih dari dua level
## [1] FALSE
## [1] "4" "6" "8"
##
## Call:
## lm(formula = mpg ~ hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.818 -1.959 0.080 1.627 6.812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.65012 1.58779 18.044 < 2e-16 ***
## hp -0.02404 0.01541 -1.560 0.12995
## cyl6 -5.96766 1.63928 -3.640 0.00109 **
## cyl8 -8.52085 2.32607 -3.663 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.146 on 28 degrees of freedom
## Multiple R-squared: 0.7539, Adjusted R-squared: 0.7275
## F-statistic: 28.59 on 3 and 28 DF, p-value: 1.14e-08
int_4cyl = coef(model7)[1]
int_6cyl = coef(model7)[1] + coef(model7)[3]
int_8cyl = coef(model7)[1] + coef(model7)[4]
slope_all = coef(model7)[2]
plot(mpg ~ hp, data = mtcars, col = cyl, pch = as.numeric(cyl))
abline(int_4cyl, slope_all, col = 1, lty = 1, lwd = 2) # add line for auto
abline(int_6cyl, slope_all, col = 2, lty = 2, lwd = 2) # add line for manual
abline(int_8cyl, slope_all, col = 3, lty = 3, lwd = 2) # add line for auto
legend("topright", c("4 cylinder", "6 cylinder","8 cylinder"), col = c(1, 2,3), pch = c(1, 2,3))
Persamaan Regresi:
\(\hat {Y}=28.65012-0.02404*hp-5.96766*Cyl6-8.52085*Cyl8\)