Introduction
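GRE.Score - GRE score
TOEFL.Score - TOEFL score
University.Rating - rating of the undergraduate university
SOP - statement of purpose score
LOR - letter of recommendation score
CGPA - undergraduate GPA
Research - whether the applicant has research experience
Chance.of.Admit - probability of admission to graduate school
The target variable matters because it determines whether the problem is supervised or unsupervised learning.
A target exists here, so the problem is set up as supervised learning.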
setwd('C:/Users/Administrator/Desktop/R Analysis/Fast Campus')
read.csv('university.csv', header = T) -> raw_data1
str(raw_data1)
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
## GRE.Score TOEFL.Score University.Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance.of.Admit
## 0 0 0 0
## [1] 0
#--------------------------------------------------------------------
# Use unique() to inspect the values: look for outliers (too small or too large) and stray special characters
#---------------------------------------------------------------------
unique(raw_data1$GRE.Score)
## [1] 337 324 316 322 314 330 321 308 302 323 325 327 328 307 311 317 319 318 303
## [20] 312 334 336 340 298 295 310 300 338 331 320 299 304 313 332 326 329 339 309
## [39] 315 301 296 294 306 305 290 335 333 297 293
par(mfrow=(c(2,3)))
hist(raw_data1$GRE.Score, xlab = "GRE score")
hist(raw_data1$TOEFL.Score, xlab= "TOEFL score")
hist(raw_data1$SOP)
hist(raw_data1$LOR)
hist(raw_data1$University.Rating)
hist(raw_data1$Chance.of.Admit)
## null device
## 1
# Check each distribution along the x-axis
#-------------------------------------------------------------------------
# UNIQUE Research - Nominal or Binary?
#------------------------------------------------------------------------
table(raw_data1$Research)
##
## 0 1
## 220 280
#-------------------------------------------------------------------------
# Target - Chance of Admit
# 1) Check the min and max: values must lie between 0 and 1 (outlier check)
#------------------------------------------------------------------------
summary(raw_data1$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6300 0.7200 0.7217 0.8200 0.9700
par(mfrow=(c(2,3)))
boxplot(raw_data1$GRE.Score, xlab = "GRE score")
boxplot(raw_data1$TOEFL.Score, xlab= "TOEFL score")
boxplot(raw_data1$SOP)
boxplot(raw_data1$LOR)
boxplot(raw_data1$University.Rating)
boxplot(raw_data1$Chance.of.Admit)
dev.off()
## null device
## 1
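A nominal target calls for classification; a continuous target calls for regression.
Because the target is restricted to the range 0 to 1, these notes start with what they call logistic regression. Note that with a numeric target, caret's method = "glm" fits a Gaussian GLM by default (that is, ordinary linear regression) unless a family is specified, which is why RMSE is reported below.
Logistic regression
Elastic net
Random forest
SVM
Kernel support vector machine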
set.seed(2020)
raw_data1 -> df
sort(sample(nrow(df), nrow(df)*0.7)) -> flag
train <- df[flag,]
test <- df[-flag,]
#---------------------------------------
# Setting the model controls
# metric = RMSE because the target is continuous (Accuracy is for classification)
#-----------------------------------------
trainControl(method = "repeatedcv", repeats = 5) -> ctrl
train(Chance.of.Admit~.,
data= train,
method = "glm",
trControl= ctrl,
preProcess= c("center", "scale"),
metric = "RMSE") -> logistic_fit
logistic_fit
## Generalized Linear Model
##
## 350 samples
## 7 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 314, 314, 314, 317, 317, 315, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06088181 0.8226372 0.04426021
1.1 Ways to compute RMSE
postResample()
RMSE()
Manual calculation
postResample
predict(logistic_fit, newdata = test) -> logistic_pred
postResample(pred = logistic_pred, obs= test$Chance.of.Admit)
## RMSE Rsquared MAE
## 0.05797190 0.81681161 0.04105176
RMSE
## [1] 0.0579719
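Manual calculation
## [1] 0.0579719
Manual calculation (R-squared as 1 - SSE/SST)
y_bar = mean(test$Chance.of.Admit)
1 - (sum((logistic_pred - test$Chance.of.Admit)^2) /sum((test$Chance.of.Admit-y_bar)^2))
## [1] 0.8146877
This differs slightly from the Rsquared of 0.8168 reported by postResample(), which by default is the squared correlation between predictions and observations.
MAE (Mean Absolute Error) - the mean of the absolute errors
|actual value - predicted value|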
#---------------------------------------
# postResample
#-----------------------------------------
postResample(logistic_pred, obs=test$Chance.of.Admit)
## RMSE Rsquared MAE
## 0.05797190 0.81681161 0.04105176
#---------------------------------------
# MAE
#-----------------------------------------
MAE(logistic_pred, test$Chance.of.Admit)
## [1] 0.04105176
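The three metrics above can also be reproduced in base R without caret. The sketch below uses small hypothetical vectors (`obs`, `pred`) in place of `test$Chance.of.Admit` and the model predictions; it also computes R-squared in two ways, since the squared correlation and 1 - SSE/SST generally give different numbers.

```r
# Hypothetical toy vectors standing in for the observed and predicted values.
obs  <- c(0.92, 0.76, 0.72, 0.80, 0.65)
pred <- c(0.90, 0.80, 0.70, 0.78, 0.70)

err <- obs - pred

rmse <- sqrt(mean(err^2))   # root mean squared error
mae  <- mean(abs(err))      # mean absolute error

# Two common R-squared definitions:
r2_cor <- cor(pred, obs)^2                           # squared correlation
r2_sse <- 1 - sum(err^2) / sum((obs - mean(obs))^2)  # 1 - SSE/SST

print(c(RMSE = rmse, MAE = mae, R2_cor = r2_cor, R2_sse = r2_sse))
```

The 1 - SSE/SST version can never exceed the squared correlation, which is consistent with the two slightly different R-squared values seen in this section.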
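The goal of machine learning -> optimization -> objective function
Objective function = reduce the sum of squared errors (error: actual - estimated)
<What does optimizing the objective function mean?>
It means estimating the parameters that minimize the sum of squared errors.
Elastic Net - Penalty: a term that helps optimize the objective function (= constraint, regularization)
Objective function + penalty = optimized parameters
Penalty types
(Figure: penalty types)
The goal is to find the B that minimizes the penalized objective; Elastic Net uses both the L1 (lasso) and L2 (ridge) penalties.
L1: B must lie inside a diamond (a rotated square); lying on the boundary is fine: Lasso regression
L2: the sum of squares of B must stay within a bound (a circular region): Ridge regression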
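As a rough illustration of how the two penalties combine, the sketch below computes the elastic net penalty in the parameterization documented for glmnet, lambda * ((1 - alpha) / 2 * sum(B^2) + alpha * sum(|B|)). The function name `elastic_net_penalty` and the coefficient vector are hypothetical and only serve the example.

```r
# Elastic net penalty (glmnet-style parameterization):
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
elastic_net_penalty <- function(beta, alpha, lambda) {
  l1 <- sum(abs(beta))  # lasso part
  l2 <- sum(beta^2)     # ridge part
  lambda * ((1 - alpha) / 2 * l2 + alpha * l1)
}

beta <- c(0.5, -0.2, 0)  # hypothetical coefficients

lasso_pen <- elastic_net_penalty(beta, alpha = 1,   lambda = 0.1)  # pure lasso
ridge_pen <- elastic_net_penalty(beta, alpha = 0,   lambda = 0.1)  # pure ridge
mixed_pen <- elastic_net_penalty(beta, alpha = 0.1, lambda = 0.1)  # mostly ridge

print(c(lasso = lasso_pen, ridge = ridge_pen, mixed = mixed_pen))
```

With alpha between 0 and 1 the penalty is a weighted mix of the two, and lambda scales the whole term.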
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
train(Chance.of.Admit~.,
data=train,
method = "glmnet",
preProcess= c("center","scale"),
trControl=ctrl,
metric= "RMSE") -> logit_penal_fit
logit_penal_fit
## glmnet
##
## 350 samples
## 7 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 316, 315, 315, 314, 314, 314, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.0002548148 0.06115579 0.8215358 0.04432927
## 0.10 0.0025481481 0.06113681 0.8216559 0.04429890
## 0.10 0.0254814812 0.06225507 0.8180528 0.04486913
## 0.55 0.0002548148 0.06115361 0.8215213 0.04430084
## 0.55 0.0025481481 0.06115367 0.8216010 0.04425493
## 0.55 0.0254814812 0.06404851 0.8204819 0.04686890
## 1.00 0.0002548148 0.06116789 0.8214258 0.04430305
## 1.00 0.0025481481 0.06122845 0.8212661 0.04426004
## 1.00 0.0254814812 0.06831794 0.8094939 0.05130325
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.002548148.
<Interpreting the results>
alpha
The parameter that sets the mixing ratio of the two penalties: alpha = 1 is lasso, alpha = 0 is ridge.
lambda
The overall size of the constraint (penalty).
predict(logit_penal_fit, newdata = test) -> logit_penal_pred
postResample(logit_penal_pred, obs=test$Chance.of.Admit)
## RMSE Rsquared MAE
## 0.05759752 0.81881695 0.04075955
A random forest grows many decision trees and combines the results of the individual trees to make a prediction.
Uses method = 'rf'
ctrl <- trainControl(method ="repeatedcv", repeats = 5)
train(Chance.of.Admit ~ .,
data=train,
method = 'rf',
trControl = ctrl,
preProcess= c("center", "scale"),
metric = "RMSE") -> rf_fit
rf_fit
## Random Forest
##
## 350 samples
## 7 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 315, 314, 316, 314, 315, 315, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 0.06330474 0.8076945 0.04467191
## 4 0.06420105 0.8024873 0.04487456
## 7 0.06622380 0.7907155 0.04640276
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
mtry = 2 is the selected (optimal) value.
Lower RMSE is better (figure below).
## RMSE Rsquared MAE
## 0.05784775 0.81636321 0.04135581
ctrl <- trainControl(method="repeatedcv",repeats = 5)
svm_linear_fit <- train(Chance.of.Admit ~ .,
data = train,
method = "svmLinear",
trControl = ctrl,
preProcess = c("center","scale"),
metric="RMSE")
svm_linear_fit
## Support Vector Machines with Linear Kernel
##
## 350 samples
## 7 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 316, 314, 315, 315, 314, 315, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.06203925 0.8194168 0.04323064
##
## Tuning parameter 'C' was held constant at a value of 1
predict(svm_linear_fit, newdata = test) -> svm_linear_pred
postResample(svm_linear_pred, obs=test$Chance.of.Admit)
## RMSE Rsquared MAE
## 0.05871683 0.82091845 0.04052400
ctrl <- trainControl(method="repeatedcv",repeats = 5)
svm_poly_fit <- train(Chance.of.Admit ~ .,
data = train,
method = "svmPoly",
trControl = ctrl,
preProcess = c("center","scale"),
metric="RMSE")
svm_poly_fit
## Support Vector Machines with Polynomial Kernel
##
## 350 samples
## 7 predictor
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 316, 315, 316, 314, 315, 314, ...
## Resampling results across tuning parameters:
##
## degree scale C RMSE Rsquared MAE
## 1 0.001 0.25 0.11101217 0.7986452 0.08763644
## 1 0.001 0.50 0.09038984 0.8003988 0.06932875
## 1 0.001 1.00 0.07375383 0.8035701 0.05308638
## 1 0.010 0.25 0.06698655 0.8101050 0.04689173
## 1 0.010 0.50 0.06420095 0.8163621 0.04460727
## 1 0.010 1.00 0.06283496 0.8209871 0.04373283
## 1 0.100 0.25 0.06189390 0.8240876 0.04311051
## 1 0.100 0.50 0.06160267 0.8248343 0.04302902
## 1 0.100 1.00 0.06151960 0.8247830 0.04306234
## 2 0.001 0.25 0.09038852 0.8004018 0.06932513
## 2 0.001 0.50 0.07376284 0.8035875 0.05308453
## 2 0.001 1.00 0.06785106 0.8088417 0.04771846
## 2 0.010 0.25 0.06415825 0.8166842 0.04459720
## 2 0.010 0.50 0.06262941 0.8216866 0.04354784
## 2 0.010 1.00 0.06177907 0.8243951 0.04303447
## 2 0.100 0.25 0.06074884 0.8274281 0.04226873
## 2 0.100 0.50 0.06052952 0.8277168 0.04213215
## 2 0.100 1.00 0.06058939 0.8269039 0.04216885
## 3 0.001 0.25 0.07932900 0.8019707 0.05867723
## 3 0.001 0.50 0.06956808 0.8066014 0.04921830
## 3 0.001 1.00 0.06610394 0.8116417 0.04609460
## 3 0.010 0.25 0.06301703 0.8204469 0.04378371
## 3 0.010 0.50 0.06192837 0.8239894 0.04314165
## 3 0.010 1.00 0.06119671 0.8266704 0.04259881
## 3 0.100 0.25 0.06049868 0.8282863 0.04201662
## 3 0.100 0.50 0.06125080 0.8231222 0.04260882
## 3 0.100 1.00 0.06219942 0.8170933 0.04337886
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 0.25.
svm_poly_pred <- predict(svm_poly_fit, newdata=test)
postResample(pred = svm_poly_pred, obs = test$Chance.of.Admit)
## RMSE Rsquared MAE
## 0.05834794 0.82775010 0.04118332
#-----------------------------------------------------------
# STEP 1) Standardization (scaling)
#-----------------------------------------------------------
raw_data1 ->df
df[,-8] ->df
scale(df) -> df.scale
#-----------------------------------------------------------
# STEP 2) Euclidean distance matrix - check which observations are closest
#-----------------------------------------------------------
dist(df.scale, method = 'euclidean') %>%
as.matrix() -> df.dist
#-----------------------------------------------------------
# STEP 3) Choose the number of clusters K: WSS, silhouette
#-----------------------------------------------------------
fviz_nbclust(df.scale,
kmeans,
method = 'wss',
k.max = 10)
#-----------------------------------------------------------
# STEP 4) Modelling: try 4 clusters
#-----------------------------------------------------------
kmeans(df.scale, centers = 4, iter.max = 1000) -> df.kmeans
df.kmeans$centers # mean of each cluster
## GRE.Score TOEFL.Score University.Rating SOP LOR CGPA
## 1 -1.1146540 -1.11227795 -1.07754185 -1.20759122 -1.05344377 -1.174945070
## 2 0.1775932 0.05064234 -0.08683261 -0.02866969 -0.05421848 0.002846773
## 3 -0.4080599 -0.28218199 -0.17021701 -0.03154183 -0.01321067 -0.235724231
## 4 1.1466053 1.15733680 1.16496729 1.08188691 0.96069732 1.216050779
## Research
## 1 -0.6879234
## 2 0.8855184
## 3 -1.1270234
## 4 0.7307075
#-----------------------------------------------------------
# STEP 5) Visualization (7 variables, so 7 colors are needed)
#-----------------------------------------------------------
barplot(t(df.kmeans$centers), beside = T, col=2:8)
legend("topright", colnames(df.scale), fill = 2:8, cex=0.5)
## null device
## 1
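For reference, the within-cluster sum of squares (WSS) curve used in STEP 3 to pick K can also be computed directly with base R's kmeans(). This is only a sketch on hypothetical simulated data (three groups in two dimensions), not the university data, and it does not reproduce what fviz_nbclust() draws.

```r
set.seed(2020)
# Hypothetical data: 60 points in 2 dimensions drawn around three centers.
x <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
x_scaled <- scale(x)

# Total within-cluster sum of squares for K = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(x_scaled, centers = k, nstart = 10, iter.max = 100)$tot.withinss
})

print(round(wss, 2))
```

The "elbow" is the K after which the WSS stops dropping sharply. For K = 1 the WSS equals the total sum of squares, which for standardized data is (n - 1) times the number of columns.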
#-----------------------------------------------------------
# STEP 6) Clustering Allocation
#-----------------------------------------------------------
df$kmean_cluster <- df.kmeans$cluster
#-----------------------------------------------------------
# STEP 1) Finding optimal K using hcut
#-----------------------------------------------------------
fviz_nbclust(df.scale,
hcut,
method ='wss',
k.max = 10)
#-----------------------------------------------------------
# STEP 2) Create clusters using single, complete, ward, average
#-----------------------------------------------------------
hclust(dist(df.scale), method = 'ward.D') -> hlust_ward
plot(hlust_ward)
rect.hclust(hlust_ward, k=3)
library(corrplot)
cor(df.scale) -> M
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method = "color", col = col(200),
type = "upper", order = "hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col = "darkblue", tl.srt = 45, #Text label color and rotation
# Combine with significance level
sig.level = 0.01,
# hide correlation coefficient on the principal diagonal
diag = FALSE
)
#install.packages("PerformanceAnalytics")
library("PerformanceAnalytics")
chart.Correlation(raw_data1, histogram = T, pch=20,cex.cor.scale=2)
setwd('C:/Users/Administrator/Desktop/R Analysis/Fast Campus')
read.csv('university.csv', header = T) -> raw_data1
#----------------------------------------------
# Analyze as a classification problem with a nominal target
#----------------------------------------------
par(mfrow= c(1,2), mar=c(5.1,4.1,4.1,2.1))
hist(raw_data1$Chance.of.Admit) # base hist() respects par(mfrow); lattice histogram() does not
boxplot(raw_data1$Chance.of.Admit)
#----------------------------------------------
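# How should admitted vs. rejected be defined?
# The distribution is not a normal distribution centered at 0.5; its center appears to be around 0.7.
# That is, half of the applicants have a value of about 0.7 or higher.
# Split into rejected/admitted at the median shown in the box plot.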
#----------------------------------------------
summary(raw_data1$Chance.of.Admit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3400 0.6300 0.7200 0.7217 0.8200 0.9700
## [1] 0.72
#----------------------------------------------
# Convert to a 0/1 factor using 0.72 as the threshold
#----------------------------------------------
raw_data1 %>%
mutate(Chance.of.Admit = ifelse(Chance.of.Admit < 0.72, 0,1) %>%
as.factor()) -> df
str(df$Chance.of.Admit)
## Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
## [1] 1 0
## Levels: 0 1
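The median split above can be sketched in base R without dplyr. The vector `chance` below is a hypothetical stand-in for `raw_data1$Chance.of.Admit`, and the cutoff is taken from the data rather than hard-coded.

```r
# Hypothetical admission probabilities standing in for raw_data1$Chance.of.Admit.
chance <- c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)

cutoff <- median(chance)  # data-driven threshold
label  <- factor(ifelse(chance < cutoff, 0, 1), levels = c(0, 1))

print(cutoff)
print(table(label))
```

Splitting at the median gives two classes of roughly equal size, which keeps accuracy a meaningful metric for the classifiers that follow.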
KNN classifies an observation by looking at its K nearest neighbors.
ctrl <- trainControl(method = "repeatedcv",
repeats = 5)
expand.grid(k=1:20) ->customGrid
train(Chance.of.Admit~.,
data=train2,
method = "knn",
trControl=ctrl,
preProcess= c("center","scale"),
tuneGrid = customGrid,
metric="Accuracy") -> knn_fit2
knn_fit2
## k-Nearest Neighbors
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 314, 315, 315, 315, 316, 315, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8370672 0.6724243
## 2 0.8387983 0.6761329
## 3 0.8695565 0.7380581
## 4 0.8616013 0.7219644
## 5 0.8832250 0.7652513
## 6 0.8831447 0.7650922
## 7 0.8860364 0.7708929
## 8 0.8842717 0.7673118
## 9 0.8883212 0.7756319
## 10 0.8848926 0.7686115
## 11 0.8871951 0.7736279
## 12 0.8849244 0.7689364
## 13 0.8911307 0.7815553
## 14 0.8871130 0.7735362
## 15 0.8837488 0.7668651
## 16 0.8820654 0.7635209
## 17 0.8792082 0.7578016
## 18 0.8775415 0.7544751
## 19 0.8831933 0.7657717
## 20 0.8820504 0.7635590
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 13.
k = 13 gives the highest cross-validated accuracy, 0.8911.
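To make the idea of "looking at the K nearest neighbors" concrete, here is a minimal base-R sketch of a KNN classifier using Euclidean distance and a majority vote. The function `knn_predict` and the toy data are hypothetical; caret's "knn" method is a separate implementation and may differ in details such as tie handling.

```r
# Minimal k-nearest-neighbors classifier (Euclidean distance, majority vote).
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(p) {
    d <- sqrt(rowSums(sweep(train_x, 2, p)^2))  # distance to every training row
    nearest <- order(d)[seq_len(k)]             # indices of the k closest rows
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]              # most common label wins
  })
}

# Hypothetical, already-scaled toy data with two classes.
train_x <- matrix(c(-1.0, -1.0,  -1.2, -0.8,  -0.9, -1.1,
                     1.0,  1.0,   1.1,  0.9,   0.8,  1.2),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c(0, 0, 0, 1, 1, 1))
new_x   <- matrix(c(-1.0, -0.9,  0.9, 1.0), ncol = 2, byrow = TRUE)

knn_pred <- knn_predict(train_x, train_y, new_x, k = 3)
print(knn_pred)
```

Because the method relies on distances, centering and scaling the features first (as the preProcess argument does above) matters a great deal.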
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 12
## 1 18 66
##
## Accuracy : 0.8
## 95% CI : (0.727, 0.8608)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 9.977e-13
##
## Kappa : 0.5981
##
## Mcnemar's Test P-Value : 0.3613
##
## Sensitivity : 0.7500
## Specificity : 0.8462
## Pos Pred Value : 0.8182
## Neg Pred Value : 0.7857
## Prevalence : 0.4800
## Detection Rate : 0.3600
## Detection Prevalence : 0.4400
## Balanced Accuracy : 0.7981
##
## 'Positive' Class : 0
##
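The headline statistics in this output follow directly from the four counts in the table. The sketch below recomputes them in base R, with class 0 as the positive class as reported; the variable names are only for illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(54, 18, 12, 66), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)            # correct / total
sensitivity <- cm["0", "0"] / sum(cm[, "0"])      # true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])      # true 1s predicted as 1
ppv         <- cm["0", "0"] / sum(cm["0", ])      # predicted 0s that are truly 0
balanced    <- (sensitivity + specificity) / 2

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, PosPredValue = ppv,
              BalancedAccuracy = balanced), 4))
```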
method = "LogitBoost"
Boosting starts from the simplest possible model and improves it step by step.
ctrl <- trainControl(method ="repeatedcv",
repeats = 5)
train(Chance.of.Admit~.,
data= train2,
method = "LogitBoost",
trControl=ctrl,
preProcess= c("scale","center"),
metric = "Accuracy") -> logit_boost
logit_boost
## Boosted Logistic Regression
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (7), centered (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 315, 314, 315, 315, 316, 314, ...
## Resampling results across tuning parameters:
##
## nIter Accuracy Kappa
## 11 0.8651727 0.7284506
## 21 0.8563576 0.7109251
## 31 0.8513585 0.7005781
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was nIter = 11.
predict(logit_boost, newdata = test2) -> pred_logboost
confusionMatrix(pred_logboost, test2$Chance.of.Admit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 51 9
## 1 21 69
##
## Accuracy : 0.8
## 95% CI : (0.727, 0.8608)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 9.977e-13
##
## Kappa : 0.5968
##
## Mcnemar's Test P-Value : 0.04461
##
## Sensitivity : 0.7083
## Specificity : 0.8846
## Pos Pred Value : 0.8500
## Neg Pred Value : 0.7667
## Prevalence : 0.4800
## Detection Rate : 0.3400
## Detection Prevalence : 0.4000
## Balanced Accuracy : 0.7965
##
## 'Positive' Class : 0
##
method = "plr"
Regularization -> controls model complexity -> prevents overfitting
ctrl <- trainControl(method ="repeatedcv",
repeats = 5)
train(Chance.of.Admit~.,
data= train2,
method = "plr",
trControl=ctrl,
preProcess= c("scale","center"),
metric = "Accuracy") -> logit_penal
logit_penal
## Penalized Logistic Regression
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (7), centered (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 316, 316, 316, 314, 315, 314, ...
## Resampling results across tuning parameters:
##
## lambda Accuracy Kappa
## 0e+00 0.8851914 0.7697735
## 1e-04 0.8851914 0.7697735
## 1e-01 0.8857302 0.7709273
##
## Tuning parameter 'cp' was held constant at a value of bic
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were lambda = 0.1 and cp = bic.
cp (complexity parameter) - the model evaluation criterion; here BIC is used.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 11
## 1 14 67
##
## Accuracy : 0.8333
## 95% CI : (0.7639, 0.8891)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 8.444e-16
##
## Kappa : 0.6656
##
## Mcnemar's Test P-Value : 0.6892
##
## Sensitivity : 0.8056
## Specificity : 0.8590
## Pos Pred Value : 0.8406
## Neg Pred Value : 0.8272
## Prevalence : 0.4800
## Detection Rate : 0.3867
## Detection Prevalence : 0.4600
## Balanced Accuracy : 0.8323
##
## 'Positive' Class : 0
##
Naive Bayes treats the features as independent of each other and classifies using conditional probabilities.
ctrl <- trainControl(method ="repeatedcv",
repeats = 5)
train(Chance.of.Admit~.,
data= train2,
method = "naive_bayes",
trControl=ctrl,
preProcess= c("scale","center"),
metric = "Accuracy") -> nb_fits
nb_fits
## Naive Bayes
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (7), centered (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 315, 315, 315, 314, 314, 316, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.8905910 0.7816522
## TRUE 0.8905742 0.7811957
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
## and adjust = 1.
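usekernel = whether to estimate the feature distributions with a kernel density function (a smoothed, histogram-like estimate)
adjust = adjusts the bandwidth of the kernel density function
laplace = sets the Laplace smoothing parameter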
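The independence assumption can be illustrated with a small Gaussian naive Bayes written by hand, which roughly corresponds to the usekernel = FALSE setting for numeric features. The functions `gnb_fit` and `gnb_predict` and the toy data are hypothetical; this is a sketch of the idea, not the implementation behind caret's "naive_bayes" method.

```r
# Gaussian naive Bayes by hand: one normal density per feature and class,
# combined under the independence assumption.
gnb_fit <- function(x, y) {
  lapply(split(as.data.frame(x), y), function(d) {
    list(mean = colMeans(d), sd = apply(d, 2, sd), prior = nrow(d) / nrow(x))
  })
}

gnb_predict <- function(model, new_x) {
  scores <- matrix(0, nrow(new_x), length(model),
                   dimnames = list(NULL, names(model)))
  for (cls in names(model)) {
    m <- model[[cls]]
    scores[, cls] <- log(m$prior)                 # log prior
    for (j in seq_len(ncol(new_x))) {             # add per-feature log densities
      scores[, cls] <- scores[, cls] +
        dnorm(new_x[, j], mean = m$mean[j], sd = m$sd[j], log = TRUE)
    }
  }
  colnames(scores)[max.col(scores, ties.method = "first")]
}

# Hypothetical, already-scaled toy data with two classes.
x <- matrix(c(-1.0, -0.8,  -1.2, -1.1,  -0.7, -0.9,
               1.0,  0.9,   1.3,  1.1,   0.8,  1.2),
            ncol = 2, byrow = TRUE)
y <- factor(c(0, 0, 0, 1, 1, 1))
new_x <- matrix(c(-0.9, -1.0,  1.1, 1.0), ncol = 2, byrow = TRUE)

nb_pred <- gnb_predict(gnb_fit(x, y), new_x)
print(nb_pred)
```

Summing log densities across features is exactly where the "naive" independence assumption enters: each feature contributes to the class score on its own.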
ctrl <- trainControl(method ="repeatedcv",
repeats = 5)
train(Chance.of.Admit~.,
data= train2,
method = "rf",
trControl=ctrl,
preProcess= c("scale","center"),
metric = "Accuracy") -> rf_fits
rf_fits
## Random Forest
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (7), centered (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 315, 315, 315, 316, 314, 314, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8760831 0.7516461
## 4 0.8727040 0.7445861
## 7 0.8618254 0.7223609
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 53 11
## 1 19 67
##
## Accuracy : 0.8
## 95% CI : (0.727, 0.8608)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 9.977e-13
##
## Kappa : 0.5976
##
## Mcnemar's Test P-Value : 0.2012
##
## Sensitivity : 0.7361
## Specificity : 0.8590
## Pos Pred Value : 0.8281
## Neg Pred Value : 0.7791
## Prevalence : 0.4800
## Detection Rate : 0.3533
## Detection Prevalence : 0.4267
## Balanced Accuracy : 0.7975
##
## 'Positive' Class : 0
##
#SVM
ctrl <- trainControl(method ="repeatedcv",
repeats = 5)
train(Chance.of.Admit~.,
data= train2,
method = "svmLinear",
trControl=ctrl,
preProcess= c("scale","center"),
metric = "Accuracy") -> svm_linear_fit2
svm_linear_fit2
## Support Vector Machines with Linear Kernel
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: scaled (7), centered (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 316, 316, 315, 315, 314, 315, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8898095 0.7793358
##
## Tuning parameter 'C' was held constant at a value of 1
svm_linear_pred2 <- predict(svm_linear_fit2, newdata=test2)
confusionMatrix(svm_linear_pred2, test2$Chance.of.Admit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 56 10
## 1 16 68
##
## Accuracy : 0.8267
## 95% CI : (0.7564, 0.8835)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 3.796e-15
##
## Kappa : 0.6517
##
## Mcnemar's Test P-Value : 0.3268
##
## Sensitivity : 0.7778
## Specificity : 0.8718
## Pos Pred Value : 0.8485
## Neg Pred Value : 0.8095
## Prevalence : 0.4800
## Detection Rate : 0.3733
## Detection Prevalence : 0.4400
## Balanced Accuracy : 0.8248
##
## 'Positive' Class : 0
##
ctrl <- trainControl(method="repeatedcv",repeats = 5)
svm_poly_fit2 <- train(Chance.of.Admit ~ .,
data = train2,
method = "svmPoly",
trControl = ctrl,
preProcess = c("center","scale"),
metric="Accuracy")
svm_poly_fit2
## Support Vector Machines with Polynomial Kernel
##
## 350 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 315, 315, 314, 316, 315, 315, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.5314379 0.0000000
## 1 0.001 0.50 0.8644641 0.7245690
## 1 0.001 1.00 0.8881307 0.7761674
## 1 0.010 0.25 0.8812726 0.7628816
## 1 0.010 0.50 0.8823828 0.7646292
## 1 0.010 1.00 0.8789860 0.7581030
## 1 0.100 0.25 0.8823828 0.7646898
## 1 0.100 0.50 0.8927049 0.7849915
## 1 0.100 1.00 0.8949748 0.7894334
## 2 0.001 0.25 0.8639085 0.7234510
## 2 0.001 0.50 0.8892736 0.7783921
## 2 0.001 1.00 0.8853221 0.7711291
## 2 0.010 0.25 0.8835257 0.7668870
## 2 0.010 0.50 0.8789860 0.7581030
## 2 0.010 1.00 0.8807003 0.7614249
## 2 0.100 0.25 0.8841457 0.7676057
## 2 0.100 0.50 0.8864155 0.7719207
## 2 0.100 1.00 0.8846022 0.7682575
## 3 0.001 0.25 0.8892726 0.7776995
## 3 0.001 0.50 0.8875761 0.7751015
## 3 0.001 1.00 0.8807330 0.7616250
## 3 0.010 0.25 0.8813044 0.7626467
## 3 0.010 0.50 0.8812400 0.7624807
## 3 0.010 1.00 0.8858123 0.7714050
## 3 0.100 0.25 0.8880980 0.7752829
## 3 0.100 0.50 0.8891587 0.7773534
## 3 0.100 1.00 0.8840299 0.7669015
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1, scale = 0.1 and C = 1.
svm_poly_pred2 <- predict(svm_poly_fit2, newdata=test2)
confusionMatrix(svm_poly_pred2, test2$Chance.of.Admit)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 10
## 1 18 68
##
## Accuracy : 0.8133
## 95% CI : (0.7416, 0.8722)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 6.705e-14
##
## Kappa : 0.6245
##
## Mcnemar's Test P-Value : 0.1859
##
## Sensitivity : 0.7500
## Specificity : 0.8718
## Pos Pred Value : 0.8438
## Neg Pred Value : 0.7907
## Prevalence : 0.4800
## Detection Rate : 0.3600
## Detection Prevalence : 0.4267
## Balanced Accuracy : 0.8109
##
## 'Positive' Class : 0
##