| Course: MDS 503 (Statistical Computing with R) |
| Student: Shubham Dhakal (35) |
| Teacher: Shital Bhandary (Associate Professor) |
| School: School of Mathematical Sciences, IOST, TU |
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(car)
## Loading required package: carData
library(foreign)
suppressWarnings({
data <- read.spss("./data/ZZIR62FL.SAV", to.data.frame = TRUE)
})
## re-encoding from CP1252
data<-data[,c("V201","V013","V024","V025","V106","V190")]
head(data)
## V201 V013 V024 V025 V106 V190
## 1 6 30-34 Region 2 Rural No education Poorest
## 2 1 20-24 Region 2 Rural Secondary Middle
## 3 2 40-44 Region 2 Rural No education Middle
## 4 3 25-29 Region 2 Rural Primary Middle
## 5 4 25-29 Region 2 Rural Secondary Poorer
## 6 2 35-39 Region 2 Rural No education Middle
str(data)
## 'data.frame': 8348 obs. of 6 variables:
## $ V201: num 6 1 2 3 4 2 0 1 2 5 ...
## $ V013: Factor w/ 7 levels "15-19","20-24",..: 4 2 6 3 3 5 1 2 7 5 ...
## $ V024: Factor w/ 4 levels "Region 1","Region 2",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ V025: Factor w/ 2 levels "Urban","Rural": 2 2 2 2 2 2 2 2 2 2 ...
## $ V106: Factor w/ 4 levels "No education",..: 1 3 1 2 3 1 3 1 1 2 ...
## $ V190: Factor w/ 5 levels "Poorest","Poorer",..: 1 3 3 3 2 3 3 1 4 3 ...
nrow(data)
## [1] 8348
set.seed(35)
#Do random sampling to divide the cases into two independent samples
index <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))
#Data partition
train.data <- data[index==1,]
test.data <- data[index==2,]
To fit a linear regression model, certain assumptions need to be satisfied. These include linearity (a linear relationship between predictors and the response), independence of observations, homoscedasticity (constant variance of residuals), normality of residuals, and no multicollinearity among predictors.
Using histograms to assess the normality assumption in linear regression.
hist(data$V201,col="blue", main = "Histogram of Data V201", xlab = "Data V201")
Based on the histogram, it can be concluded that the distribution of the dependent variable is not normal. The histogram displays a right-skewed distribution, indicating that the data is concentrated towards the left and has a longer tail on the right side.
To further assess the normality of the dependent variable, we can use a QQ plot.
qqnorm(data$V201)
qqline(data$V201,col="blue",lw=2)
The points in the plot deviate noticeably from the straight QQ line, indicating a departure from normality.
To confirm whether the dependent variable follows a normal distribution, we can conduct a formal test for normality.
ks.test(data$V201,'pnorm')
## Warning in ks.test.default(data$V201, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$V201
## D = 0.57937, p-value < 2.2e-16
## alternative hypothesis: two-sided
The obtained p-value is significantly smaller than 0.05, indicating that the dependent variable does not follow a normal distribution. Therefore, using linear regression to fit the model is not appropriate due to the violation of the normality assumption. We need to explore alternative supervised regression methods that do not rely on normality.
#Decision tree regression
dtr.model<-train(V201~., data = train.data, method="rpart2")
#fit SVM
svm.model<-train(V201~., data=train.data, method="svmRadial")
#predict on train data in decision tree regression
dtr.predict<-predict(dtr.model, train.data)
dtr.R2<-R2(dtr.predict, train.data$V201)
dtr.RMSE<-RMSE(dtr.predict, train.data$V201)
data.frame(R2 = dtr.R2, RMSE = dtr.RMSE)
## R2 RMSE
## 1 0.4822596 1.962232
#svm predict
svm.predict<-predict(svm.model,train.data)
svm.R2<-R2(svm.predict,train.data$V201)
svm.RMSE<-RMSE(svm.predict,train.data$V201)
data.frame(R2 = svm.R2, RMSE = svm.RMSE)
## R2 RMSE
## 1 0.6477173 1.620649
#predict on test data,decision tree regression
dtr.predict.test<-predict(dtr.model,test.data)
dtr.test.R2<-R2(dtr.predict.test,test.data$V201)
dtr.test.RMSE<-RMSE(dtr.predict.test,test.data$V201)
data.frame(R2 = dtr.test.R2, RMSE = dtr.test.RMSE)
## R2 RMSE
## 1 0.4530359 1.931988
#predict on test data, svm
svm.predict.test<-predict(svm.model,test.data)
svm.test.R2<-R2(svm.predict.test,test.data$V201)
svm.test.RMSE<-RMSE(svm.predict.test,test.data$V201)
data.frame(R2 = svm.test.R2, RMSE = svm.test.RMSE)
## R2 RMSE
## 1 0.6185729 1.614154
#loocv
dtr.model.loocv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="loocv"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
dtr.loocv.predict<-predict(dtr.model.loocv,test.data)
dtr.loocv.R2<-R2(dtr.loocv.predict,test.data$V201)
dtr.loocv.RMSE<-RMSE(dtr.loocv.predict,test.data$V201)
#k-fold cross validation
dtr.model.cv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="cv",number = 3))
dtr.cv.predict<-predict(dtr.model.cv,test.data)
dtr.cv.R2<-R2(dtr.cv.predict,test.data$V201)
dtr.cv.RMSE<-RMSE(dtr.cv.predict,test.data$V201)
#repeated k-fold cross validation
dtr.model.rcv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="repeatedcv",number=5,repeats = 5))
dtr.rcv.predict<-predict(dtr.model.rcv,test.data)
dtr.rcv.R2<-R2(dtr.rcv.predict,test.data$V201)
dtr.rcv.RMSE<-RMSE(dtr.rcv.predict,test.data$V201)
models<-data.frame(
model=c(
'SVM',
'decision tree reg validation Set Approach',
'decision tree reg loocv',
'decision tree reg k-fold cross validation',
'decision tree reg repeated k-fold'
),
R2=c(
svm.test.R2,
dtr.test.R2,
dtr.loocv.R2,
dtr.cv.R2,
dtr.rcv.R2
),
RMSE=c(
svm.test.RMSE,
dtr.test.RMSE,
dtr.loocv.RMSE,
dtr.cv.RMSE,
dtr.rcv.RMSE
)
)
models
## model R2 RMSE
## 1 SVM 0.6185729 1.614154
## 2 decision tree reg validation Set Approach 0.4530359 1.931988
## 3 decision tree reg loocv 0.4530359 1.931988
## 4 decision tree reg k-fold cross validation 0.4530359 1.931988
## 5 decision tree reg repeated k-fold 0.4530359 1.931988
The evaluation of different regression models is based on their predictive performance, measured by the R2 value and RMSE. The R2 value indicates the proportion of variance in the dependent variable that can be predicted from the independent variables, with higher values indicating better predictive performance. On the other hand, the RMSE represents the average magnitude of the prediction errors, with lower values indicating higher accuracy.
Among the listed models, the SVM model stands out with the highest value of R2, indicating that it explains a significant portion of the variance in the dependent variable. Additionally, the SVM model also exhibits the lowest value of RMSE, suggesting that it has the highest accuracy in predicting the outcome. These findings highlight the superior performance of the SVM model in terms of both predictive power and accuracy compared to the other models considered.