School: School of Mathematical Sciences, TU

1. Download the Individual Recode file from: https://dhsprogram.com/data/Download-Model-Datasets.cfm

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(car)

## Loading required package: carData

library(foreign)

suppressWarnings({
  data <- as.data.frame(read.spss("C:/Users/Acer/Desktop/R program/ZZIR62FL.SAV"))
})

## re-encoding from CP1252

data<-data[,c("V201","V013","V024","V025","V106","V190")]
set.seed(30)

2. Read it in R Studio and split it into training (80%) and testing (20%) datasets with set.seed as your class roll number

#Assign each row to the training set (about 80%) or the testing set (about 20%)
idx<-sample(2,nrow(data),replace=TRUE,prob=c(0.8,0.2))
train.data<-data[idx==1,]
test.data<-data[idx==2,]
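Because sample() with prob assigns each row independently, the split above is only approximately 80/20. A minimal sketch of an exact-size alternative follows, using a toy data frame (toy and its columns are hypothetical, not the DHS data):

```r
# Toy data standing in for the DHS extract (hypothetical values).
set.seed(30)
toy <- data.frame(id = 1:100, y = rnorm(100))

# Draw exactly 80% of the row indices without replacement.
n_train <- floor(0.8 * nrow(toy))
train_rows <- sample(nrow(toy), size = n_train)

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Every row lands in exactly one of the two sets.
c(train = nrow(toy_train), test = nrow(toy_test))
```

Either approach is reasonable; the exact-size version simply makes the 80/20 proportion hold by construction.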

3. Fit a supervised regression model on the training data with Total Children Ever Born (V201) as the dependent variable and age group (V013), region (V024), type of place of residence (V025), highest education level (V106) and wealth index (V190) as independent variables. Interpret the result carefully, check the VIF too, and take remedial statistical action if required
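The VIF check requested above can be sketched as follows. This uses the built-in mtcars data with numeric predictors as a stand-in, which is an assumption for illustration only; with factor predictors such as V013 or V024, car::vif() reports generalized VIFs instead.

```r
# VIF for predictor j is 1 / (1 - R2_j), where R2_j comes from
# regressing predictor j on the remaining predictors.
predictors <- c("wt", "hp", "disp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(fit)$r.squared)
})

# Values well above 5 to 10 are commonly read as a sign of multicollinearity.
manual_vif
```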

#Linear regression rests on several assumptions, one of which is normality. Strictly, this assumption concerns the model residuals, but a markedly non-normal dependent variable is an early warning sign, so we examine the distribution of V201 first.

#Let's inspect the distribution visually using a histogram.

hist(data$V201,col="red")

#Based on the provided histogram, it can be observed that the distribution of the dependent variable is not normal. Instead, the distribution appears to be positively skewed or skewed to the right.

#Visualization by Q-Q plot

qqnorm(data$V201)
qqline(data$V201,col="red",lwd=2)

#The Q-Q plot and its reference line again suggest that the dependent variable is not normally distributed.


#To validate and confirm whether the dependent variable follows a normal distribution, a test for normality can be conducted.

ks.test(data$V201,'pnorm')

## Warning in ks.test.default(data$V201, "pnorm"): ties should not be present for
## the Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$V201
## D = 0.57937, p-value < 2.2e-16
## alternative hypothesis: two-sided

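#Since the p-value is far below the conventional 0.05 threshold, we reject the null hypothesis. Note that ks.test(x, 'pnorm') without further arguments compares the data against the standard normal distribution (mean 0, sd 1), and the ties warning reflects that V201 is a discrete count. Read together with the histogram and Q-Q plot, the result indicates that V201 is not normally distributed.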
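As a hedged illustration of how the reference distribution affects the Kolmogorov-Smirnov statistic, the sketch below runs the test twice on simulated counts (hypothetical data, not V201): once against the standard normal and once against a normal with the sample mean and standard deviation.

```r
# Simulated right-skewed counts standing in for V201 (hypothetical data).
set.seed(30)
counts <- rpois(500, lambda = 3)

# Default call: compares against N(0, 1), so almost any count variable fails.
ks_std <- suppressWarnings(ks.test(counts, "pnorm"))

# Compare against a normal with the sample mean and standard deviation.
# Estimating the parameters from the same data makes the p-value approximate.
ks_fit <- suppressWarnings(
  ks.test(counts, "pnorm", mean = mean(counts), sd = sd(counts))
)

c(standard = unname(ks_std$statistic), fitted = unname(ks_fit$statistic))
```

The fitted version gives a smaller D statistic, yet discrete, skewed counts would typically still be flagged as non-normal at this sample size.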

#Given this conclusion, and because V201 is a strongly right-skewed count, the normality assumption of linear regression is unlikely to hold, so linear regression is not used for modeling here.

#Therefore, we fit alternative supervised regression models.

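#Fit a decision tree regression (caret method "rpart2", which tunes the maximum tree depth)

dtr.model<-train(V201~.,data = train.data,method="rpart2")

#Fit a support vector machine with a radial basis kernel
svm.model<-train(V201~.,data=train.data,method="svmRadial")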

#Predict on the training data, decision tree regression
dtr.predict<-predict(dtr.model,train.data)
dtr.R2<-R2(dtr.predict,train.data$V201)
dtr.RMSE<-RMSE(dtr.predict,train.data$V201)


#Predict on the training data, SVM
svm.predict<-predict(svm.model,train.data)
svm.R2<-R2(svm.predict,train.data$V201)
svm.RMSE<-RMSE(svm.predict,train.data$V201)

#Predict on the test data, decision tree regression
dtr.predict.test<-predict(dtr.model,test.data)
dtr.test.R2<-R2(dtr.predict.test,test.data$V201)
dtr.test.RMSE<-RMSE(dtr.predict.test,test.data$V201)

#Predict on the test data, SVM
svm.predict.test<-predict(svm.model,test.data)
svm.test.R2<-R2(svm.predict.test,test.data$V201)
svm.test.RMSE<-RMSE(svm.predict.test,test.data$V201)

6. Tune the R-square and RMSE values of the testing model using LOOCV, k-fold cross validation and k-fold cross-validation with repeated samples using caret package

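#Let's tune the decision tree model with different resampling schemes
#Leave-one-out cross-validation (caret documents this method name as "LOOCV"; the lowercase spelling is the one that produced the recorded output)
dtr.model.loocv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="loocv"))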

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

dtr.loocv.predict<-predict(dtr.model.loocv,test.data)
dtr.loocv.R2<-R2(dtr.loocv.predict,test.data$V201)
dtr.loocv.RMSE<-RMSE(dtr.loocv.predict,test.data$V201)
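One plausible reason for the missing-values warning above is that each leave-one-out hold-out set contains a single observation, and a squared correlation cannot be computed from one point. A small base R sketch of that effect, with made-up numbers:

```r
# One observed value and one prediction, as in a single held-out row.
obs_one  <- 3
pred_one <- 2.5

# Correlation is undefined for a single pair, so R2 comes out as NA.
r2_one <- suppressWarnings(cor(pred_one, obs_one)^2)

# RMSE is still defined: it reduces to the absolute error.
rmse_one <- sqrt(mean((pred_one - obs_one)^2))

c(R2 = r2_one, RMSE = rmse_one)
```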

#k-fold cross validation
dtr.model.cv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="cv",number = 3))
dtr.cv.predict<-predict(dtr.model.cv,test.data)
dtr.cv.R2<-R2(dtr.cv.predict,test.data$V201)
dtr.cv.RMSE<-RMSE(dtr.cv.predict,test.data$V201)

#repeated k-fold cross validation
dtr.model.rcv<-train(V201~.,data=train.data,method="rpart2",trControl=trainControl(method="repeatedcv",number=5,repeats = 5))
dtr.rcv.predict<-predict(dtr.model.rcv,test.data)
dtr.rcv.R2<-R2(dtr.rcv.predict,test.data$V201)
dtr.rcv.RMSE<-RMSE(dtr.rcv.predict,test.data$V201)
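To make the resampling schemes above concrete, here is a minimal sketch of k-fold cross-validation written by hand. It uses lm() on the built-in mtcars data purely for illustration (an assumption; the assignment itself uses caret with "rpart2"), and the names k, fold_id and fold_rmse are hypothetical.

```r
set.seed(30)
k <- 5

# Assign each row to one of k folds of near-equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

fold_rmse <- sapply(seq_len(k), function(i) {
  train_part <- mtcars[fold_id != i, ]
  held_out   <- mtcars[fold_id == i, ]
  fit  <- lm(mpg ~ wt + hp, data = train_part)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((pred - held_out$mpg)^2))
})

# The cross-validated RMSE is the average over the k held-out folds.
mean(fold_rmse)
```

Repeated k-fold runs this procedure several times with fresh random fold assignments and averages the results, while LOOCV is the special case where k equals the number of rows.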

7. Compare the R-square and RMSE of all the models and choose the one for final prediction

models<-data.frame(
  model=c(
    'SVM',
    'decision tree reg validation Set Approach',
    'decision tree reg loocv',
    'decision tree reg k-fold cross validation',
    'decision tree reg repeated k-fold'
    ),
  R2=c(
    svm.test.R2,
    dtr.test.R2,
    dtr.loocv.R2,
    dtr.cv.R2,
    dtr.rcv.R2
    ),
  RMSE=c(
    svm.test.RMSE,
    dtr.test.RMSE,
    dtr.loocv.RMSE,
    dtr.cv.RMSE,
    dtr.rcv.RMSE
    )
)
models

##                                       model        R2     RMSE
## 1                                       SVM 0.6432484 1.633890
## 2 decision tree reg validation Set Approach 0.4895866 1.950743
## 3                   decision tree reg loocv 0.4895866 1.950743
## 4 decision tree reg k-fold cross validation 0.4895866 1.950743
## 5         decision tree reg repeated k-fold 0.4895866 1.950743
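The comparison can also be done programmatically. The sketch below re-enters the values from the table above by hand (the shortened model labels are an assumption) and picks the row with the lowest test RMSE:

```r
results <- data.frame(
  model = c("SVM", "DT validation set", "DT LOOCV",
            "DT k-fold", "DT repeated k-fold"),
  R2    = c(0.6432484, 0.4895866, 0.4895866, 0.4895866, 0.4895866),
  RMSE  = c(1.633890, 1.950743, 1.950743, 1.950743, 1.950743)
)

# Sort from best to worst by test RMSE.
ranked <- results[order(results$RMSE), ]
best_model <- ranked$model[1]
best_model
```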

The analysis compared the regression models by their predictive performance on the test data. The R2 value, which ranges from 0 to 1, indicates the proportion of the dependent variable’s variability that is explained by the independent variables.

A higher R2 value suggests better predictive performance. The RMSE (Root Mean Squared Error) is the square root of the mean squared residual; it measures the typical size of the prediction error in the units of the dependent variable, here the number of children.

Lower RMSE values indicate more accurate predictions. Among the models considered, the SVM (Support Vector Machine) model achieved the highest R2 (0.6432484) and the lowest RMSE (1.633890), so it is the model chosen for final prediction. The four decision tree rows are identical, which suggests that every resampling scheme selected the same tuning value and therefore produced the same fitted tree.
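As a small illustration of the two metrics, the sketch below computes them by hand on made-up vectors. It assumes the squared-correlation form of R2, which is the default in caret's R2(); the helper names are hypothetical.

```r
obs  <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 6)

# RMSE: square root of the mean squared residual.
rmse_by_hand <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R2 as the squared Pearson correlation between predictions and observations.
r2_by_hand <- function(pred, obs) cor(pred, obs)^2

rmse_val <- rmse_by_hand(pred, obs)
r2_val   <- r2_by_hand(pred, obs)
c(R2 = r2_val, RMSE = rmse_val)
```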

Assignment 6.1

Rubal Chakubaji

2023-07-15

Course: MDS 503 (Statistical Computing with R)

Student: Rubal Chakubaji (30)

Teacher: Shital Bhandary (Associate Professor)
