Analisis Regresi dengan Variabel Dummy

Penerapan pada Data `Lungcap`

Berikut ini akan diberikan contoh analisis hubungan kapasitas paru-paru dengan kelompok tinggi badan. Tinggi badan dalam variabel numerik, untuk menyusun kelompok tinggi badan digunakan variabel dummy untuk tinggi badan pada data kapasitas paru-paru.

Import Data

library(readxl)
LungCap<-read.csv("D:/LungCapData2rev.csv")
head(LungCap)

##   Age LungCap Height SqrHeight   HeightT  SqrHeightT SqrAge       AgeT
## 1   9   3.124   57.0   3249.00 -4.143578 17.16923849     81 -0.9311927
## 2   8   3.172   67.5   4556.25  6.356422 40.40410088     64 -1.9311927
## 3   7   3.160   54.5   2970.25 -6.643578 44.13712840     49 -2.9311927
## 4   9   2.674   53.0   2809.00 -8.143578 66.31786234     81 -0.9311927
## 5   9   3.685   57.0   3249.00 -4.143578 17.16923849     81 -0.9311927
## 6   8   5.008   61.0   3721.00 -0.143578  0.02061464     64 -1.9311927
##     SqrAgeT AgeHeight AgeTHeightT
## 1 0.8671198     513.0   3.8584694
## 2 3.7295051     540.0 -12.2754755
## 3 8.5918904     381.5  19.4736070
## 4 0.8671198     477.0   7.5832400
## 5 0.8671198     513.0   3.8584694
## 6 3.7295051     488.0   0.2772767

Mengubah variabel numerik menjadi kategorik

#Jika tinggi badan akan dibagi menjadi dua kelompok dengan interval A = (0,60], B = (60,75] #gunakan perintah berikut ini,
CatHeight<-cut(LungCap$Height,breaks=c(0,60,75),labels=c("A","B"))

#Dan jika interval A = [0,60), B = [60,75) yang akan disusun, gunakan perintah berikut ini

CatHeight<-cut(LungCap$Height,breaks=c(0,60,75),labels=c("A","B"),right=F)

# atau

CatHeight<-cut(LungCap$Height,breaks=2,labels=c("A","B"),right=F)

LungCapNew<-data.frame(LungCap,CatHeight)
str(LungCapNew)

## 'data.frame':    654 obs. of  12 variables:
##  $ Age        : int  9 8 7 9 9 8 6 6 8 9 ...
##  $ LungCap    : num  3.12 3.17 3.16 2.67 3.68 ...
##  $ Height     : num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
##  $ SqrHeight  : num  3249 4556 2970 2809 3249 ...
##  $ HeightT    : num  -4.14 6.36 -6.64 -8.14 -4.14 ...
##  $ SqrHeightT : num  17.2 40.4 44.1 66.3 17.2 ...
##  $ SqrAge     : int  81 64 49 81 81 64 36 36 64 81 ...
##  $ AgeT       : num  -0.931 -1.931 -2.931 -0.931 -0.931 ...
##  $ SqrAgeT    : num  0.867 3.73 8.592 0.867 0.867 ...
##  $ AgeHeight  : num  513 540 382 477 513 ...
##  $ AgeTHeightT: num  3.86 -12.28 19.47 7.58 3.86 ...
##  $ CatHeight  : Factor w/ 2 levels "A","B": 1 2 1 1 1 2 1 1 1 2 ...

Eksplorasi Data

# menghitung rata-rata kapasitas paru-paru untuk subyek pada kelompok tinggi A
mean(LungCapNew$LungCap[LungCapNew$CatHeight=="A"])

## [1] 3.576216

# menghitung rata-rata kapasitas paru-paru untuk subyek pada kelompok tinggi B
mean(LungCapNew$LungCap[LungCapNew$CatHeight=="B"])

## [1] 7.308531

Persamaan Regresi

model1<-lm(LungCap~CatHeight,data=LungCapNew)
summary(model1) #reference category adalah A (0) -> B(1)

## 
## Call:
## lm(formula = LungCap ~ CatHeight, data = LungCapNew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2265 -1.2760 -0.2379  0.8998  8.0705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5762     0.1196   29.91   <2e-16 ***
## CatHeightB    3.7323     0.1512   24.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.872 on 652 degrees of freedom
## Multiple R-squared:  0.4831, Adjusted R-squared:  0.4823 
## F-statistic: 609.3 on 1 and 652 DF,  p-value: < 2.2e-16

# mengganti reference category menjadi B (0) -> A(1)
LungCapNew$CatHeight<-factor(LungCapNew$CatHeight,c("B","A"))
model2<-lm(LungCap~CatHeight,data=LungCapNew)
summary(model2)

## 
## Call:
## lm(formula = LungCap ~ CatHeight, data = LungCapNew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2265 -1.2760 -0.2379  0.8998  8.0705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.30853    0.09255   78.97   <2e-16 ***
## CatHeightA  -3.73231    0.15120  -24.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.872 on 652 degrees of freedom
## Multiple R-squared:  0.4831, Adjusted R-squared:  0.4823 
## F-statistic: 609.3 on 1 and 652 DF,  p-value: < 2.2e-16

Baik A sebagai refence variable maupun B sebagai refence variable. Nilai dugaan lungcap untuk masing-masing kategori, akan bernilai sama.

Contoh:

Model 1

\(\hat {Y} = 3.7795+3.6770 *(CatheightB)\)

\(\hat {Y}_1 = 3.7795+3.6770 * 0\)

\(\hat {Y}_1 = 3.7795\)

Model 2

\(\hat {Y} = 7.45650-3.67705*(CatheightA)\)

\(\hat {Y}_1 = 7.45650-3.67705*1\)

\(\hat {Y}_1 = 3.7795\)

Visualisasi Data

library(ggplot2)

ggplot(LungCapNew,aes(x=CatHeight,y=LungCap,group=1))+ geom_point()+ geom_smooth(method=lm,se=FALSE)+
theme_gray()

ggplot(LungCapNew,aes(x=CatHeight,y=LungCap))+ geom_point()+ geom_smooth(method=lm,se=FALSE)+
theme_classic()

LungCapNew$CatHeight <- as.numeric(LungCapNew$CatHeight)
str(LungCapNew)

## 'data.frame':    654 obs. of  12 variables:
##  $ Age        : int  9 8 7 9 9 8 6 6 8 9 ...
##  $ LungCap    : num  3.12 3.17 3.16 2.67 3.68 ...
##  $ Height     : num  57 67.5 54.5 53 57 61 58 56 58.5 60 ...
##  $ SqrHeight  : num  3249 4556 2970 2809 3249 ...
##  $ HeightT    : num  -4.14 6.36 -6.64 -8.14 -4.14 ...
##  $ SqrHeightT : num  17.2 40.4 44.1 66.3 17.2 ...
##  $ SqrAge     : int  81 64 49 81 81 64 36 36 64 81 ...
##  $ AgeT       : num  -0.931 -1.931 -2.931 -0.931 -0.931 ...
##  $ SqrAgeT    : num  0.867 3.73 8.592 0.867 0.867 ...
##  $ AgeHeight  : num  513 540 382 477 513 ...
##  $ AgeTHeightT: num  3.86 -12.28 19.47 7.58 3.86 ...
##  $ CatHeight  : num  2 1 2 2 2 1 2 2 2 1 ...

model1<-lm(LungCap~Age,data=LungCapNew)
plot(LungCap~Age,data=LungCapNew, col = CatHeight + 1, pch = CatHeight + 1, cex = 0.5)
abline(model1, lwd = 3, col = "grey")
legend("topright", c("B", "A"), col = c(2, 3), pch = c(2, 3))

model2<-lm(LungCap~Age+CatHeight,data=LungCapNew)
plot(LungCap~Age,data=LungCapNew, col = CatHeight + 1, pch = CatHeight + 1, cex = 0.5)
abline(model1, lwd = 3, col = "grey")
legend("topright", c("Height B", "Height A"), col = c(2, 3), pch = c(2, 3))

Penerapan pada Data `Brithsmoker`

Berikut ini akan diberikan contoh analisis hubungan berat bayi (Wgt), status merokok ibu (Smoke), dan jangka waktu kehamilan dalam satuan minggu (Gest). Perhatikan tipe data dari variabel kategorik dalam data.

Import Data

birthsmokers <- read.table("D:\\birthsmokers.txt", header=T)
attach(birthsmokers)
str(birthsmokers)

## 'data.frame':    32 obs. of  3 variables:
##  $ Wgt  : int  2940 3130 2420 2450 2760 2440 3226 3301 2729 3410 ...
##  $ Gest : int  38 38 36 34 39 35 40 42 37 40 ...
##  $ Smoke: chr  "yes" "no" "yes" "no" ...

Persamaan Regresi

model <- lm(Wgt ~ Gest + Smoke)

plot(x=Gest, y=Wgt, ylim=c(2300, 3700),
col=ifelse(Smoke=="yes", "red", "blue"),
panel.last = c(lines(sort(Gest[Smoke=="no"]),
fitted(model)[Smoke=="no"][order(Gest[Smoke=="no"])],
col="blue"),
lines(sort(Gest[Smoke=="yes"]),
fitted(model)[Smoke=="yes"][order(Gest[Smoke=="yes"])],
col="red")))

summary(model)

## 
## Call:
## lm(formula = Wgt ~ Gest + Smoke)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2389.573    349.206  -6.843 1.63e-07 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smokeyes     -244.544     41.982  -5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15

Interval Kepercayaan

confint(model)

##                  2.5 %     97.5 %
## (Intercept) -3103.7795 -1675.3663
## Gest          124.4312   161.7694
## Smokeyes     -330.4064  -158.6817

Prediksi Data Baru

predict(model, interval="confidence",
newdata=data.frame(Gest=c(38, 38), Smoke=c("yes", "no")))

##        fit      lwr      upr
## 1 2803.693 2740.599 2866.788
## 2 3048.237 2989.120 3107.355

Persamaan Regresi hanya untuk Kategori `Tidak`

#Ulangi proses analisis hanya untuk subyek yang tidak merokok saja
model.0 <- lm(Wgt ~ Gest, subset=Smoke=="no")
summary(model.0)

## 
## Call:
## lm(formula = Wgt ~ Gest, subset = Smoke == "no")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -171.52 -101.59   23.28   83.63  139.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2546.14     457.29  -5.568 6.93e-05 ***
## Gest          147.21      11.97  12.294 6.85e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 106.9 on 14 degrees of freedom
## Multiple R-squared:  0.9152, Adjusted R-squared:  0.9092 
## F-statistic: 151.1 on 1 and 14 DF,  p-value: 6.852e-09

Prediksi Data Baru

predict(model.0, interval="confidence",
newdata=data.frame(Gest=38))

##        fit      lwr     upr
## 1 3047.724 2990.298 3105.15

Persamaan Regresi hanya untuk Kategori `Ya`

#Ulangi proses analisis hanya untuk subyek yang merokok saja
model.1 <- lm(Wgt ~ Gest, subset=Smoke=="yes")
summary(model.1)

## 
## Call:
## lm(formula = Wgt ~ Gest, subset = Smoke == "yes")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -228.53  -64.86  -19.10   93.89  184.53 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2474.56     553.97  -4.467 0.000532 ***
## Gest          139.03      14.11   9.851 1.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 126.6 on 14 degrees of freedom
## Multiple R-squared:  0.8739, Adjusted R-squared:  0.8649 
## F-statistic: 97.04 on 1 and 14 DF,  p-value: 1.125e-07

Prediksi Data Baru

predict(model.1, interval="confidence",
newdata=data.frame(Gest=38))

##        fit      lwr      upr
## 1 2808.528 2731.726 2885.331

Persamaan Regresi hanya untuk Kategori `Ya` dan `Tidak`

#Ulangi proses analisis dengan menggunakan kode 1 dan -1
#model3: references : yes
Smoke2 <- ifelse(Smoke=="yes", 1, -1)
model.3 <- lm(Wgt ~ Gest + Smoke2)
summary(model.3)

## 
## Call:
## lm(formula = Wgt ~ Gest + Smoke2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2511.845    353.449  -7.107 8.07e-08 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smoke2       -122.272     20.991  -5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15

#atau gunakan perintah lain (-1 menjadi yes, 1 menjadi no)
model.3 <- lm(Wgt ~ Gest + Smoke, contrasts=list(Smoke="contr.sum"))
summary(model.3)

## 
## Call:
## lm(formula = Wgt ~ Gest + Smoke, contrasts = list(Smoke = "contr.sum"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.693  -92.063   -9.365   79.663  197.507 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2511.845    353.449  -7.107 8.07e-08 ***
## Gest          143.100      9.128  15.677 1.07e-15 ***
## Smoke1        122.272     20.991   5.825 2.58e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 115.5 on 29 degrees of freedom
## Multiple R-squared:  0.8964, Adjusted R-squared:  0.8892 
## F-statistic: 125.4 on 2 and 29 DF,  p-value: 5.289e-15

Penerapan pada data dengan lebih dari dua faktor

Data

Data yang digunakan dalam praktikum ini adalah data mtcars. Data ini merupakan dataset yang ada dalam R. Deskripsi data mtcars:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

A data frame with 32 observations on 11 (numeric) variables.

[, 1] mpg Miles/(US) gallon

[, 2] cyl Number of cylinders

[, 3] disp Displacement (cu.in.)

[, 4] hp Gross horsepower

[, 5] drat Rear axle ratio

[, 6] wt Weight (1000 lbs)

[, 7] qsec 1/4 mile time

[, 8] vs Engine (0 = V-shaped, 1 = straight)

[, 9] am Transmission (0 = automatic, 1 = manual)

[,10] gear Number of forward gears

[,11] carb Number of carburetors

data(mtcars)
mtcars$kata[mtcars$vs==1]="SL" #menambahkan variabel kata
mtcars$kata[mtcars$vs==0]="VS"
head(mtcars$kata)

## [1] "VS" "VS" "SL" "SL" "VS" "SL"

is.factor(mtcars$kata)

## [1] FALSE

str(mtcars)

## 'data.frame':    32 obs. of  12 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
##  $ kata: chr  "VS" "VS" "SL" "SL" ...

mtcars$kata=as.factor(mtcars$kata)

Model Regresi

Model dengan menggunakan variabel hp dan kata.

model5 <- lm(mpg ~ hp + kata, data = mtcars) 
summary(model5)

## 
## Call:
## lm(formula = mpg ~ hp + kata, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7131 -2.3336 -0.1332  1.9055  7.9055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.53922    1.67062  17.682  < 2e-16 ***
## hp          -0.05453    0.01448  -3.766 0.000752 ***
## kataVS      -2.57622    1.96966  -1.308 0.201163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.818 on 29 degrees of freedom
## Multiple R-squared:  0.6246, Adjusted R-squared:  0.5987 
## F-statistic: 24.12 on 2 and 29 DF,  p-value: 6.768e-07

Model dengan mengganti reference variable

levels(mtcars$kata) #perhatikan,urutan level berdasarkan urutan alphabet

## [1] "SL" "VS"

mtcars$kata=factor(mtcars$kata,c("VS","SL")) #cara mengganti reference variable
levels(mtcars$kata)

## [1] "VS" "SL"

model6 <- lm(mpg ~ hp + kata, data = mtcars)
summary(model6)

## 
## Call:
## lm(formula = mpg ~ hp + kata, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7131 -2.3336 -0.1332  1.9055  7.9055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.96300    2.89069   9.328 3.13e-10 ***
## hp          -0.05453    0.01448  -3.766 0.000752 ***
## kataSL       2.57622    1.96966   1.308 0.201163    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.818 on 29 degrees of freedom
## Multiple R-squared:  0.6246, Adjusted R-squared:  0.5987 
## F-statistic: 24.12 on 2 and 29 DF,  p-value: 6.768e-07

Model dengan menggunakan variabel yang memiliki lebih dari dua level

#factors more than two levels
is.factor(mtcars$cyl)

## [1] FALSE

mtcars$cyl=as.factor(mtcars$cyl)
levels(mtcars$cyl)

## [1] "4" "6" "8"

model7<- lm(mpg ~ hp + cyl, data = mtcars)
summary(model7)

## 
## Call:
## lm(formula = mpg ~ hp + cyl, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.818 -1.959  0.080  1.627  6.812 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.65012    1.58779  18.044  < 2e-16 ***
## hp          -0.02404    0.01541  -1.560  0.12995    
## cyl6        -5.96766    1.63928  -3.640  0.00109 ** 
## cyl8        -8.52085    2.32607  -3.663  0.00103 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.146 on 28 degrees of freedom
## Multiple R-squared:  0.7539, Adjusted R-squared:  0.7275 
## F-statistic: 28.59 on 3 and 28 DF,  p-value: 1.14e-08

int_4cyl = coef(model7)[1] 
int_6cyl = coef(model7)[1] + coef(model7)[3]
int_8cyl = coef(model7)[1] + coef(model7)[4]
slope_all = coef(model7)[2]
plot(mpg ~ hp, data = mtcars, col = cyl, pch = as.numeric(cyl))
abline(int_4cyl, slope_all, col = 1, lty = 1, lwd = 2) # add line for auto
abline(int_6cyl, slope_all, col = 2, lty = 2, lwd = 2) # add line for manual
abline(int_8cyl, slope_all, col = 3, lty = 3, lwd = 2) # add line for auto
legend("topright", c("4 cylinder", "6 cylinder","8 cylinder"), col = c(1, 2,3), pch = c(1, 2,3))

Persamaan Regresi:

\(\hat {Y}=28.65012-0.02404*hp-5.96766*Cyl6-8.52085*Cyl8\)

Analisis Regresi dengan Variabel Dummy