Course: MDS 503 (Statistical Computing with R)

Student: Arpan Sapkota (07)

Teacher: Shital Bhandary (Associate Professor)

School: School of Mathematical Sciences, IOST, TU

1. Download the Individual recode file from: https://dhsprogram.com/data/Download-Model-Datasets.cfm

setwd("/Users/arpan/Desktop/MDS/01 MDS I-I/MDS 503 - Statistical Computing with R/Lab/Data")

library(haven)
data <- read_sav("ZZIR62FL.SAV")
head(data)
## # A tibble: 6 × 4,275
##   CASEID      V000   V001  V002  V003  V004   V005  V006  V007  V008  V009  V010
##   <chr>       <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 "        1… ZZ6       1     1     2     1 1.06e6     6  2015  1386     6  1985
## 2 "        1… ZZ6       1     3     2     1 1.06e6     6  2015  1386     4  1993
## 3 "        1… ZZ6       1     4     2     1 1.06e6     6  2015  1386     4  1973
## 4 "        1… ZZ6       1     4     3     1 1.06e6     6  2015  1386     7  1989
## 5 "        1… ZZ6       1     5     1     1 1.06e6     6  2015  1386     2  1990
## 6 "        1… ZZ6       1     6     2     1 1.06e6     6  2015  1386     2  1978
## # ℹ 4,263 more variables: V011 <dbl>, V012 <dbl>, V013 <dbl+lbl>,
## #   V014 <dbl+lbl>, V015 <dbl+lbl>, V016 <dbl>, V017 <dbl>, V018 <dbl+lbl>,
## #   V019 <dbl+lbl>, V019A <dbl+lbl>, V020 <dbl+lbl>, V021 <dbl>, V022 <dbl>,
## #   V023 <dbl+lbl>, V024 <dbl+lbl>, V025 <dbl+lbl>, V026 <dbl+lbl>, V027 <dbl>,
## #   V028 <dbl>, V029 <dbl>, V030 <dbl>, V031 <dbl>, V032 <dbl>, V034 <dbl+lbl>,
## #   V040 <dbl>, V042 <dbl+lbl>, V044 <dbl+lbl>, V101 <dbl+lbl>, V102 <dbl+lbl>,
## #   V103 <dbl+lbl>, V104 <dbl+lbl>, V105 <dbl+lbl>, V106 <dbl+lbl>, …

2. Read it in R Studio and split it into training (80%) and testing (20%) datasets with set.seed as your class roll number

set.seed(07)
# Split into training and testing datasets
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
# Alternative split using caret: trainIndex <- createDataPartition(data$V201, p = 0.8, list = FALSE)
train_indices <- sample(nrow(data), floor(0.8 * nrow(data)))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
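
A quick check (not in the original output) can confirm the 80/20 split:

# Confirm the split proportions
nrow(train_data)
nrow(test_data)
nrow(train_data) / nrow(data)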

3. Fit a supervised regression model on the training data with Total Children Ever Born (V201) as dependent variable and age group (V013), region (V024), type of place of residence (V025), highest education level (V106) and wealth index (V190) as independent variables and interpret the result carefully, check VIF too and do the needful statistically if required

Test of Normality

#Test of Normality
# Histogram
hist(train_data$V201, breaks = 20, col = "lightblue", main = "Histogram of V201")

# Q-Q plot
qqnorm(train_data$V201)
qqline(train_data$V201)

The histogram of the dependent variable V201 is skewed, indicating that its distribution is not symmetric and deviates from a normal distribution.

In the context of linear regression, a skewed dependent variable can violate the model's assumptions, in particular the assumption of normally distributed residuals. For accurate and reliable regression results it is generally desirable that the residuals (and often the dependent variable) be approximately normally distributed.
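
A formal test can complement these visual checks. The sketch below (not part of the original output) applies shapiro.test() to a random subsample, since the Shapiro-Wilk test accepts at most 5,000 observations; the subsample object name is introduced here only for illustration.

# Shapiro-Wilk test on a random subsample of V201 (the test is limited to 5,000 values)
v201_sample <- sample(train_data$V201, size = min(5000, nrow(train_data)))
shapiro.test(v201_sample)  # a very small p-value indicates departure from normality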

In such cases it may be beneficial to transform the dependent variable to achieve a more symmetric distribution, for example by taking its logarithm, square root, or reciprocal.

Apply logarithm transformation

# Apply logarithm transformation to V201
train_data$log_V201 <- log(train_data$V201)

# Histogram of log_V201
hist(train_data$log_V201, breaks = 20, col = "lightblue", main = "Histogram of log_V201")

# Check for missing values
any(is.na(train_data$log_V201))
## [1] FALSE
# Check for infinite values
any(is.infinite(train_data$log_V201))
## [1] TRUE
# log(0) is -Inf for women with zero children; replace these with the minimum finite value
# (an alternative would be log1p(V201), i.e. log(V201 + 1), which avoids infinities altogether)
train_data$log_V201[is.infinite(train_data$log_V201)] <- min(train_data$log_V201[is.finite(train_data$log_V201)])

# Q-Q plot of log_V201
qqnorm(train_data$log_V201)
qqline(train_data$log_V201)

Apply square root transformation

# Apply square root transformation to V201
train_data$sqrt_V201 <- sqrt(train_data$V201)

# Histogram of sqrt_V201
hist(train_data$sqrt_V201, breaks = 20, col = "lightblue", main = "Histogram of sqrt_V201")

The square root transformation also yields a skewed histogram, so the dependent variable remains non-normal and an ordinary linear model is not well suited in this case.

We therefore turn to Generalized Linear Models (GLMs). GLMs extend linear regression to handle non-normal dependent variables or non-linear relationships between the dependent variable and the independent variables. Common examples include logistic regression for binary outcomes and Poisson regression for count data.
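
For reference, since V201 (total children ever born) is a count, a Poisson GLM would be the textbook choice for this outcome; the sketch below is illustrative only (the object name model_pois is not used elsewhere) and is not the model fitted in the analysis that follows, which uses a gaussian GLM.

# Illustrative alternative: Poisson GLM with log link for the count outcome V201
model_pois <- glm(V201 ~ V013 + V024 + V025 + V106 + V190,
                  data = train_data, family = poisson(link = "log"))
summary(model_pois)
exp(coef(model_pois))  # multiplicative effects on the expected count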

library(caret)

# Fit a GLM on the training data
model <- glm(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, family = gaussian)

# Print the summary of the model
summary(model)
## 
## Call:
## glm(formula = V201 ~ V013 + V024 + V025 + V106 + V190, family = gaussian, 
##     data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.0335  -0.8661  -0.0283   0.8905   9.8510  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.086375   0.149002   0.580  0.56214    
## V013         0.960474   0.011498  83.537  < 2e-16 ***
## V024        -0.001018   0.018434  -0.055  0.95596    
## V025         0.174019   0.055383   3.142  0.00168 ** 
## V106        -0.436162   0.025569 -17.058  < 2e-16 ***
## V190        -0.123216   0.019514  -6.314 2.89e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.752417)
## 
##     Null deviance: 48447  on 6677  degrees of freedom
## Residual deviance: 18364  on 6672  degrees of freedom
## AIC: 25721
## 
## Number of Fisher Scoring iterations: 2
# Check VIF
library(car)
## Loading required package: carData
vif(model)  # Check for multicollinearity
##     V013     V024     V025     V106     V190 
## 1.209411 1.063295 1.797346 1.457527 1.797802

The independent variables V013 (age group), V025 (type of place of residence), V106 (education level), and V190 (wealth index) have significant coefficients (p < 0.01), indicating a strong relationship with the dependent variable V201, whereas the intercept and V024 (region) are not statistically significant. The large drop from the null deviance (48447) to the residual deviance (18364), together with the AIC, indicates a reasonably good fit; the dispersion parameter for the gaussian family is 2.752417. All VIF values are below 2, so multicollinearity is not a concern and no predictor needs to be dropped.
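
Given the earlier concern about normality, the residuals of the fitted model can also be inspected directly; a minimal sketch using the standard base-R diagnostic plots (not shown above):

# Residual diagnostics: residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))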

4. Get the R-square and RMSE of this fitted model on training data using caret package

# Predict on the training data
predictions_train <- predict(model, newdata = train_data)

# Calculate R-square on the training data
rsquare_train <- caret::R2(predictions_train, train_data$V201)
rsquare_train
## [1] 0.6209423
# Calculate RMSE on the training data
rmse_train <- caret::RMSE(predictions_train, train_data$V201)
rmse_train
## [1] 1.658296
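
With its default settings, caret::R2() returns the squared correlation between predictions and observations, and caret::RMSE() returns the root mean squared error; the same values can be verified directly (a sketch, not part of the original output):

# Direct computation of the same metrics
cor(predictions_train, train_data$V201)^2            # R-squared (squared correlation)
sqrt(mean((predictions_train - train_data$V201)^2))  # RMSE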

5. Predict the dependent variable on the test data and get the R-square and RMSE using caret package

# Predict on the test data
predictions_test <- predict(model, newdata = test_data)

# Calculate R-square on the test data
rsquare_test <- caret::R2(predictions_test, test_data$V201)
rsquare_test
## [1] 0.6124119
# Calculate RMSE on the test data
rmse_test <- caret::RMSE(predictions_test, test_data$V201)
rmse_test
## [1] 1.705514

6. Tune the R-square and RMSE values of the testing model using LOOCV, k-fold cross validation and k-fold cross-validation with repeated samples using caret package

# LOOCV
loocv <- trainControl(method = "LOOCV")
model_loocv <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = loocv, family = gaussian)
rsquare_loocv <- model_loocv$results$Rsquared
rmse_loocv <- model_loocv$results$RMSE

# k-fold cross-validation
kfold <- trainControl(method = "cv", number = 5)
model_kfold <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = kfold, family = gaussian)
rsquare_kfold <- model_kfold$results$Rsquared
rmse_kfold <- model_kfold$results$RMSE

# Repeated k-fold cross-validation
repeated_kfold <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_repeated_kfold <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = repeated_kfold, family = gaussian)
rsquare_repeated_kfold <- model_repeated_kfold$results$Rsquared
rmse_repeated_kfold <- model_repeated_kfold$results$RMSE

7. Compare the R-square and RMSE of all the model and choose the one for final prediction

results <- data.frame(Method = c("Training","Testing", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
                      R_Square = c(rsquare_train, rsquare_test, rsquare_loocv, rsquare_kfold, rsquare_repeated_kfold),
                      RMSE = c(rmse_train, rmse_test, rmse_loocv, rmse_kfold, rmse_repeated_kfold))

results
##               Method  R_Square     RMSE
## 1           Training 0.6209423 1.658296
## 2            Testing 0.6124119 1.705514
## 3              LOOCV 0.6202303 1.659853
## 4          k-fold CV 0.6207434 1.659266
## 5 Repeated k-fold CV 0.6206538 1.659355

The differences in R-squared and RMSE across the methods are quite small, so any of them would give very similar predictions. Among the cross-validation approaches, 5-fold CV has the marginally highest R-squared and lowest RMSE, so the model validated with k-fold cross-validation may be preferred for the final prediction.
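
As a final step, the model tuned with 5-fold cross-validation can be used to generate the predictions on the test data; a minimal sketch assuming the objects created above are still in memory:

# Final prediction with the model chosen via 5-fold CV
final_predictions <- predict(model_kfold, newdata = test_data)
caret::R2(final_predictions, test_data$V201)
caret::RMSE(final_predictions, test_data$V201)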