supervised-regression

Course: MDS 503 (Statistical Computing with R)

Student: Shubham Dhakal (35)

Teacher: Shital Bhandary (Associate Professor)

School: School of Mathematical Sciences, IOST, TU

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(car)

## Loading required package: carData

library(foreign)

1. Download the Individual recode file from: https://dhsprogram.com/data/Download-Model-Datasets.cfm

suppressWarnings({
  data <- read.spss("./data/ZZIR62FL.SAV", to.data.frame = TRUE)
})

## re-encoding from CP1252

data<-data[,c("V201","V013","V024","V025","V106","V190")]

head(data)

##   V201  V013     V024  V025         V106    V190
## 1    6 30-34 Region 2 Rural No education Poorest
## 2    1 20-24 Region 2 Rural    Secondary  Middle
## 3    2 40-44 Region 2 Rural No education  Middle
## 4    3 25-29 Region 2 Rural      Primary  Middle
## 5    4 25-29 Region 2 Rural    Secondary  Poorer
## 6    2 35-39 Region 2 Rural No education  Middle

str(data)

## 'data.frame':    8348 obs. of  6 variables:
##  $ V201: num  6 1 2 3 4 2 0 1 2 5 ...
##  $ V013: Factor w/ 7 levels "15-19","20-24",..: 4 2 6 3 3 5 1 2 7 5 ...
##  $ V024: Factor w/ 4 levels "Region 1","Region 2",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ V025: Factor w/ 2 levels "Urban","Rural": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V106: Factor w/ 4 levels "No education",..: 1 3 1 2 3 1 3 1 1 2 ...
##  $ V190: Factor w/ 5 levels "Poorest","Poorer",..: 1 3 3 3 2 3 3 1 4 3 ...

nrow(data)

## [1] 8348

2. Read it in R Studio and split it into training (80%) and testing (20%) datasets with set.seed as your class roll number

set.seed(35)

#Do random sampling to divide the cases into two independent samples
index <- sample(2, nrow(data), replace = T, prob = c(0.8, 0.2))

#Data partition
train.data <- data[index==1,]
test.data <- data[index==2,]

3. Fit a supervised regression model on the training data with Total Children Ever Born (V201) as dependent variable and age group (V013), region (V024), type of place of residence (V025), highest education level (V106) and wealth index (V190) as independent variables and interpret the result carefully, check VIF too and do the needful statistically if required

To fit a linear regression model, certain assumptions need to be satisfied. These include linearity (a linear relationship between predictors and the response), independence of observations, homoscedasticity (constant variance of residuals), normality of residuals, and no multicollinearity among predictors.

Using histograms to assess the normality assumption in linear regression.

hist(data$V201,col="blue", main = "Histogram of Data V201", xlab = "Data V201")

Based on the histogram, it can be concluded that the distribution of the dependent variable is not normal. The histogram displays a right-skewed distribution, indicating that the data is concentrated towards the left and has a longer tail on the right side.

To further assess the normality of the dependent variable, we can use a QQ plot.

qqnorm(data$V201)
qqline(data$V201,col="blue",lw=2)

The points in the plot deviate noticeably from the straight QQ line, indicating a departure from normality.

To confirm whether the dependent variable follows a normal distribution, we can conduct a formal test for normality.

ks.test(data$V201,'pnorm')

## Warning in ks.test.default(data$V201, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$V201
## D = 0.57937, p-value < 2.2e-16
## alternative hypothesis: two-sided

The obtained p-value is significantly smaller than 0.05, indicating that the dependent variable does not follow a normal distribution. Therefore, using linear regression to fit the model is not appropriate due to the violation of the normality assumption. We need to explore alternative supervised regression methods that do not rely on normality.

#Decision tree regression
dtr.model<-train(V201~., data = train.data, method="rpart2")


#fit SVM
svm.model<-train(V201~., data=train.data, method="svmRadial")

4. Get the R-square and RMSE of this fitted model on training data using caret package

#predict on train data in decision tree regression
dtr.predict<-predict(dtr.model, train.data)
dtr.R2<-R2(dtr.predict, train.data$V201)
dtr.RMSE<-RMSE(dtr.predict, train.data$V201)


data.frame(R2 = dtr.R2, RMSE = dtr.RMSE)

##          R2     RMSE
## 1 0.4822596 1.962232

#svm predict
svm.predict<-predict(svm.model,train.data)
svm.R2<-R2(svm.predict,train.data$V201)
svm.RMSE<-RMSE(svm.predict,train.data$V201)

data.frame(R2 = svm.R2, RMSE = svm.RMSE)

##          R2     RMSE
## 1 0.6477173 1.620649

5. Predict the dependent variable on the test data and get the R-square and RMSE using caret package

#predict on test data,decision tree regression
dtr.predict.test<-predict(dtr.model,test.data)
dtr.test.R2<-R2(dtr.predict.test,test.data$V201)
dtr.test.RMSE<-RMSE(dtr.predict.test,test.data$V201)

data.frame(R2 = dtr.test.R2, RMSE = dtr.test.RMSE)

##          R2     RMSE
## 1 0.4530359 1.931988

#predict on test data, svm
svm.predict.test<-predict(svm.model,test.data)
svm.test.R2<-R2(svm.predict.test,test.data$V201)
svm.test.RMSE<-RMSE(svm.predict.test,test.data$V201)

data.frame(R2 = svm.test.R2, RMSE = svm.test.RMSE)

##          R2     RMSE
## 1 0.6185729 1.614154

6. Tune the R-square and RMSE values of the testing model using LOOCV, k-fold cross validation and k-fold cross-validation with repeated samples using caret package

#loocv
dtr.model.loocv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="loocv"))

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

dtr.loocv.predict<-predict(dtr.model.loocv,test.data)
dtr.loocv.R2<-R2(dtr.loocv.predict,test.data$V201)
dtr.loocv.RMSE<-RMSE(dtr.loocv.predict,test.data$V201)

#k-fold cross validation
dtr.model.cv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="cv",number = 3))
dtr.cv.predict<-predict(dtr.model.cv,test.data)
dtr.cv.R2<-R2(dtr.cv.predict,test.data$V201)
dtr.cv.RMSE<-RMSE(dtr.cv.predict,test.data$V201)

#repeated k-fold cross validation
dtr.model.rcv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="repeatedcv",number=5,repeats = 5))
dtr.rcv.predict<-predict(dtr.model.rcv,test.data)
dtr.rcv.R2<-R2(dtr.rcv.predict,test.data$V201)
dtr.rcv.RMSE<-RMSE(dtr.rcv.predict,test.data$V201)

7. Compare the R-square and RMSE of all the model and choose the one for final prediction

models<-data.frame(
  model=c(
    'SVM',
    'decision tree reg validation Set Approach',
    'decision tree reg loocv',
    'decision tree reg k-fold cross validation',
    'decision tree reg repeated k-fold'
    ),
  R2=c(
    svm.test.R2,
    dtr.test.R2,
    dtr.loocv.R2,
    dtr.cv.R2,
    dtr.rcv.R2
    ),
  RMSE=c(
    svm.test.RMSE,
    dtr.test.RMSE,
    dtr.loocv.RMSE,
    dtr.cv.RMSE,
    dtr.rcv.RMSE
    )
)
models

##                                       model        R2     RMSE
## 1                                       SVM 0.6185729 1.614154
## 2 decision tree reg validation Set Approach 0.4530359 1.931988
## 3                   decision tree reg loocv 0.4530359 1.931988
## 4 decision tree reg k-fold cross validation 0.4530359 1.931988
## 5         decision tree reg repeated k-fold 0.4530359 1.931988

The evaluation of different regression models is based on their predictive performance, measured by the R2 value and RMSE. The R2 value indicates the proportion of variance in the dependent variable that can be predicted from the independent variables, with higher values indicating better predictive performance. On the other hand, the RMSE represents the average magnitude of the prediction errors, with lower values indicating higher accuracy.

Among the listed models, the SVM model stands out with the highest value of R2, indicating that it explains a significant portion of the variance in the dependent variable. Additionally, the SVM model also exhibits the lowest value of RMSE, suggesting that it has the highest accuracy in predicting the outcome. These findings highlight the superior performance of the SVM model in terms of both predictive power and accuracy compared to the other models considered.