데이터분석방법론2 기말고사

서론

데이터 정리

edu<-read.csv('edu.csv')
nrow(edu)                                 #rawdata 행의 개수는 6908개 이다.

[1] 6908

edu$math<-as.character(edu$math)
edu$math<-as.numeric(edu$math)
edu$language<-as.character(edu$language)
edu$language<-as.numeric(edu$language)
edu$foreign<-as.character(edu$foreign)
edu$foreign<-as.numeric(edu$foreign)      #언어,수학,외국어 점수가 facotr로 되어있어 수치형으로 전환한다.
edu.rm<-na.omit(edu)                      #결측치를 모두 제거한다.
edu.sub<-subset(edu.rm,income>=0&educost>=0,select=-c(parreply,stureply))   #소득,교육비에서 무응답과 기재오류는 음수로 기재되어 있으므로 제거한다.
nrow(edu.sub)                            #결측치 제외하고 3612명의 데이터가 존재한다.

[1] 3612

한국연구 종단 데이터 자료 (2010년)

TableID11e82c7132cf

Data: edu.copy • Chart ID: TableID11e82c7132cf • googleVis-0.6.1
R version 3.3.2 (2016-10-31) • Google Terms of Use • Documentation and Data Policy

hist(edu.sub$income,200,xlim=c(0,3000),probability = T,main='월평균 소득 분포의 히스토그램',xlab='월소득(만원)',col='darkgoldenrod1')
lines(density(edu.sub$income),col='navy',lwd=2)
legend(1500,0.002,'평균     406만원',bty='n')  
legend(1500,0.0017,'중앙값  350만원',bty='n')

hist(edu.sub$educost,150,xlim=c(0,800),probability = T,main='월평균 교육비 지출 분포의 히스토그램',xlab='월 교육비 (만원)',col='olivedrab1')
lines(density(edu.sub$educost),col='tomato',lwd=2)
legend(450,0.01,  '평균    58만원',bty='n')  
legend(450,0.0085,'중앙값   50만원',bty='n')

월 수입과 교육비 지출간의 Scattorplot by ggriaphExtra

2010년 수능 등급

표준점수를 등급으로 변환

TableID28dcb824119

Data: edu.cut • Chart ID: TableID28dcb824119 • googleVis-0.6.1
R version 3.3.2 (2016-10-31) • Google Terms of Use • Documentation and Data Policy

소득 구간과 과목별 등급 구간 사이의 누적 로지스틱 회귀모형

소득이 1단위 올라갈 수록 언어 영역 등급이 상위 영역에 속할 누적 로지스틱 회귀모형을 적합하고자 한다. VGAM 패키지의 vglm 함수를 사용하기 위하여 분할표 형식으로 재작성 하였다. 분할표 작성에는 reshape2 패키지의 dcast 함수를 사용하였다.

library(reshape2)
langdf<-dcast(incomegr~langgr,data=edu.sub)[c(1:4),]
colnames(langdf)<-c('incomegr',paste0('Gr',9:1))
langdf

  incomegr Gr9 Gr8 Gr7 Gr6 Gr5 Gr4 Gr3 Gr2 Gr1
1        1  57  72 122 127 180 132  83  41  25
2        2  21  58 115 131 181 131  90  56  26
3        3  15  25  64 129 186 135  97  59  30
4        4  13  28  94 111 197 162 134  69  66

library(VGAM)                           
model1<-vglm(cbind(Gr1,Gr2,Gr3,Gr4,Gr5,Gr6,Gr7,Gr8,Gr9)~incomegr,family = cumulative(parallel = TRUE),data=langdf)
summary(model1)


Call:
vglm(formula = cbind(Gr1, Gr2, Gr3, Gr4, Gr5, Gr6, Gr7, Gr8, 
    Gr9) ~ incomegr, family = cumulative(parallel = TRUE), data = langdf)

Pearson residuals:
  logit(P[Y<=1]) logit(P[Y<=2]) logit(P[Y<=3]) logit(P[Y<=4])
1       0.001378        0.02663         0.4444         1.6254
2      -1.124973        0.81359         0.2252         0.1523
3      -1.180680        0.10719        -0.4300        -0.8246
4       1.784312       -0.77569        -0.1870        -0.8501
  logit(P[Y<=5]) logit(P[Y<=6]) logit(P[Y<=7]) logit(P[Y<=8])
1         1.0148        -0.7160        -1.9547         -2.693
2        -0.4650        -0.8599        -0.4773          1.737
3        -0.2095         3.2028         1.5951          0.639
4        -0.3339        -1.2536         1.6747          1.063

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1 -3.76930    0.11241 -33.531   <2e-16 ***
(Intercept):2 -2.75982    0.09180 -30.062   <2e-16 ***
(Intercept):3 -1.86452    0.08283 -22.510   <2e-16 ***
(Intercept):4 -1.05349    0.07828 -13.458   <2e-16 ***
(Intercept):5 -0.10567    0.07600  -1.390    0.164    
(Intercept):6  0.66926    0.07769   8.615   <2e-16 ***
(Intercept):7  1.68747    0.08817  19.140   <2e-16 ***
(Intercept):8  2.75936    0.11676  23.632   <2e-16 ***
incomegr       0.26936    0.02728   9.873   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of linear predictors:  8 

Dispersion Parameter for cumulative family:   1

Residual deviance: 47.1536 on 23 degrees of freedom

Log-likelihood: -116.6102 on 23 degrees of freedom

Number of iterations: 4 

Exponentiated coefficients:
incomegr 
1.309121

등급이 9개 이므로, 9등급을 기준으로 하여 총 8개의 intercept와 소득 등급에 대한 1개의 coefficient가 추정되었다.

\[logit[p(Y \leq j등급)]= \alpha_{j}+ 0.27\times 소득등급\]

임의의 j 점수 등급에 대하여 소득이 1등급 상승하면, 점수 등급이 j 등급 보다 하위에 속할 오즈보다 상위에 속할 오즈가 exp(0.27)=1.3배 증가한다.

\[logit[p(Y \leq 1등급)]= -3.77+ 0.27\times 소득등급 \]

\[P(Y \leq 1등급)= \dfrac{exp(-3.77+0.27 \times 소득등급)}{1+exp(-3.77+0.27 \times 소득등급)}\]

소득 최하위 등급일 때 언어 영역 1등급 받을 확률= 2.9%

소득 최상위 등급일 때 언어 영역 1등급 받을 확률= 6.3%

누적 로지스틱 모형은 각 j 마다 intercept는 다르나, coefficient는 하나로 공유한다. 즉 비례 오즈 모형(proportional odss model)인데, 과연 비례 오즈의 가정(assumption)이 맞는지 확인해 볼 필요가 있다. 이를 위해서 각 j 마다 서로다른 coefficient를 가지는 모형을 적합한 다음, 가능도비 검정을 통해서 모형 적합도 여부를 확인하였다.

model1.1<-vglm(cbind(Gr1,Gr2,Gr3,Gr4,Gr5,Gr6,Gr7,Gr8,Gr9)~incomegr, family = cumulative,data=langdfrm)

pchisq(2*(logLik(model1.1)-logLik(model1)),df=length(coef(model1.1))-length(coef(model1)),lower.tail = FALSE)

[1] 0.001392138

P-value는 0.05보다 작으므로, 비례오즈 가정이 만족하지 못한다는 것을 확인하였다. 즉 현재 데이터를 누적 로지스틱 모형을 통해 설명하는 것은 부적절하여, 다범주 회귀모형(multinomial regression)을 통해 데이터 분석을 시도하였다.

소득 구간과 과목별 등급 구간 사이의 다범주 로지스틱 회귀모형

library(nnet)
model2<-multinom(langgr~incomegr,data=edu.sub)

# weights:  27 (16 variable)
initial  value 7167.346571 
iter  10 value 6636.127269
iter  20 value 6588.249841
final  value 6588.029446 
converged

summary(model2)

Call:
multinom(formula = langgr ~ incomegr, data = edu.sub)

Coefficients:
     (Intercept)  incomegr
Gr 8   0.1798572 0.1880819
Gr 7   0.4424347 0.4198536
Gr 6   0.4478921 0.5145472
Gr 5   0.6789001 0.5828952
Gr 4   0.3024922 0.6189410
Gr 3  -0.2606483 0.7085328
Gr 2  -0.8360391 0.7048447
Gr 1  -1.8125864 0.9005770

Std. Errors:
     (Intercept)  incomegr
Gr 8   0.2604003 0.1199112
Gr 7   0.2360416 0.1077512
Gr 6   0.2319340 0.1057704
Gr 5   0.2248306 0.1033565
Gr 4   0.2315193 0.1050610
Gr 3   0.2437305 0.1078395
Gr 2   0.2694719 0.1148285
Gr 1   0.3163647 0.1247855

Residual Deviance: 13176.06 
AIC: 13208.06

nnet 패키지의 multinom 함수를 이용하여 다범주 로지스틱 회귀모형을 적합하였다.결과를 보면 9등급을 기준으로 두었어다는 것을 알 수 있다. 1등급에 대한 회귀식을 정리하면 다음과 같다.

\[log(\dfrac{\pi_Gr1}{\pi_Gr9})= -1.81 + 0.90 \times 소득등급 \]

즉 소득이 1등급 증가하면 언어 영역 등급이 9등급이 아니라 1등급일 오즈의 추정값은 exp(0.90)=2.45배 이다.

수리 영역 및 외국어 영역에서도 다범주 로지스틱 회귀분석을 적용하여, 요약표로 정리해 보았다.

과목 별 계수와 오즈비의95% 신뢰구간
	Coef	LowCI	HighCI
언어영역	0.90	1.93	3.14
수리 가	1.46	2.08	9.01
수리 나	0.67	1.60	2.42
외국어영역	0.87	1.82	3.15

library(plotly)
subj<-c('lang','lang','matha','matha','mathb','mathb','fore','fore')
hr<-c(1.92,3.14,2.07,9,1.59,2.41,1.82,3.15)
testtable<-data.frame(subj,hr)

testtable$subj<-factor(testtable$subj,levels=c('fore', 'mathb','matha','lang'), labels=c('외국어','수학 나','수학 가','언어'))
pl<-ggplot(testtable, aes(x=hr, y=subj))+geom_line(size=1, aes(colour=subj))+geom_point()+ggtitle("과목별 1등급에 속할 오즈 비")+ theme(plot.title=element_text(face="bold", size=20, vjust=0, color="darkblue"))
py <- plotly("RgraphingAPI", "ektgzomjbx")
ggplotly(pl, session='knitr')

데이터분석방법론2 기말고사

KO SUNG WOO

2016년 12월 17일

서론

데이터 정리

한국연구 종단 데이터 자료 (2010년)

월 수입과 교육비 지출간의 Scattorplot by ggriaphExtra

2010년 수능 등급

표준점수를 등급으로 변환

소득 구간과 과목별 등급 구간 사이의 누적 로지스틱 회귀모형

등급이 9개 이므로, 9등급을 기준으로 하여 총 8개의 intercept와 소득 등급에 대한 1개의 coefficient가 추정되었다.

\[logit[p(Y \leq j등급)]= \alpha_{j}+ 0.27\times 소득등급\]

임의의 j 점수 등급에 대하여 소득이 1등급 상승하면, 점수 등급이 j 등급 보다 하위에 속할 오즈보다 상위에 속할 오즈가 exp(0.27)=1.3배 증가한다.

\[logit[p(Y \leq 1등급)]= -3.77+ 0.27\times 소득등급 \]

\[P(Y \leq 1등급)= \dfrac{exp(-3.77+0.27 \times 소득등급)}{1+exp(-3.77+0.27 \times 소득등급)}\]

소득 최하위 등급일 때 언어 영역 1등급 받을 확률= 2.9%

소득 최상위 등급일 때 언어 영역 1등급 받을 확률= 6.3%

P-value는 0.05보다 작으므로, 비례오즈 가정이 만족하지 못한다는 것을 확인하였다. 즉 현재 데이터를 누적 로지스틱 모형을 통해 설명하는 것은 부적절하여, 다범주 회귀모형(multinomial regression)을 통해 데이터 분석을 시도하였다.

소득 구간과 과목별 등급 구간 사이의 다범주 로지스틱 회귀모형

nnet 패키지의 multinom 함수를 이용하여 다범주 로지스틱 회귀모형을 적합하였다.결과를 보면 9등급을 기준으로 두었어다는 것을 알 수 있다. 1등급에 대한 회귀식을 정리하면 다음과 같다.

\[log(\dfrac{\pi_Gr1}{\pi_Gr9})= -1.81 + 0.90 \times 소득등급 \]

즉 소득이 1등급 증가하면 언어 영역 등급이 9등급이 아니라 1등급일 오즈의 추정값은 exp(0.90)=2.45배 이다.

수리 영역 및 외국어 영역에서도 다범주 로지스틱 회귀분석을 적용하여, 요약표로 정리해 보았다.