Course: MDS 503 (Statistical Computing with R)
Student: Arpan Sapkota (07)
Teacher: Shital Bhandary (Associate Professor)
School: School of Mathematical Sciences, IOST, TU
setwd("/Users/arpan/Desktop/MDS/01 MDS I-I/MDS 503 - Statistical Computing with R/Lab/Data")
library(haven)
data <- read_sav("ZZIR62FL.SAV")
head(data)
## # A tibble: 6 × 4,275
## CASEID V000 V001 V002 V003 V004 V005 V006 V007 V008 V009 V010
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 " 1… ZZ6 1 1 2 1 1.06e6 6 2015 1386 6 1985
## 2 " 1… ZZ6 1 3 2 1 1.06e6 6 2015 1386 4 1993
## 3 " 1… ZZ6 1 4 2 1 1.06e6 6 2015 1386 4 1973
## 4 " 1… ZZ6 1 4 3 1 1.06e6 6 2015 1386 7 1989
## 5 " 1… ZZ6 1 5 1 1 1.06e6 6 2015 1386 2 1990
## 6 " 1… ZZ6 1 6 2 1 1.06e6 6 2015 1386 2 1978
## # ℹ 4,263 more variables: V011 <dbl>, V012 <dbl>, V013 <dbl+lbl>,
## # V014 <dbl+lbl>, V015 <dbl+lbl>, V016 <dbl>, V017 <dbl>, V018 <dbl+lbl>,
## # V019 <dbl+lbl>, V019A <dbl+lbl>, V020 <dbl+lbl>, V021 <dbl>, V022 <dbl>,
## # V023 <dbl+lbl>, V024 <dbl+lbl>, V025 <dbl+lbl>, V026 <dbl+lbl>, V027 <dbl>,
## # V028 <dbl>, V029 <dbl>, V030 <dbl>, V031 <dbl>, V032 <dbl>, V034 <dbl+lbl>,
## # V040 <dbl>, V042 <dbl+lbl>, V044 <dbl+lbl>, V101 <dbl+lbl>, V102 <dbl+lbl>,
## # V103 <dbl+lbl>, V104 <dbl+lbl>, V105 <dbl+lbl>, V106 <dbl+lbl>, …
set.seed(07)
# Split into training and testing datasets
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
#trainIndex <- createDataPartition(data$V201, p = 0.8, list = FALSE)
train_indices <- sample(nrow(data), floor(0.8 * nrow(data)))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Test of Normality
# Histogram
hist(train_data$V201, breaks = 20, col = "lightblue", main = "Histogram of V201")
# Q-Q plot
qqnorm(train_data$V201)
qqline(train_data$V201)
The histogram of the dependent variable V201 is clearly skewed, indicating that its distribution is not symmetric and deviates from a normal distribution.
In the context of linear regression, a skewed dependent variable can affect the assumptions of the model, in particular the normality of the residuals. For accurate and reliable regression results, it is generally desirable for the residuals (and often the dependent variable) to be approximately normally distributed.
In such cases, it may be beneficial to transform the dependent variable to obtain a more symmetric distribution, for example by taking the logarithm, square root, or reciprocal of the variable.
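In addition to the histogram and Q-Q plot, a formal normality test can be applied. The lines below are a minimal sketch and not part of the original analysis: because shapiro.test() in R accepts at most 5,000 observations, the test is run on a random subsample of the training data, and the subsample size is an illustrative choice.
# Formal normality check (illustrative sketch): Shapiro-Wilk test on a
# random subsample, since shapiro.test() allows at most 5000 observations
v201_sub <- sample(as.numeric(train_data$V201), size = min(5000, nrow(train_data)))
shapiro.test(v201_sub)
# A very small p-value indicates a significant departure from normality,
# consistent with the skewed histogram above.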
# Apply logarithm transformation to V201
train_data$log_V201 <- log(train_data$V201)
# Histogram of log_V201
hist(train_data$log_V201, breaks = 20, col = "lightblue", main = "Histogram of log_V201")
# Check for missing values
any(is.na(train_data$log_V201))
## [1] FALSE
# Check for infinite values
any(is.infinite(train_data$log_V201))
## [1] TRUE
# log(0) = -Inf for observations with zero children; replace these -Inf values
# with the smallest finite log value (an alternative is to use log(V201 + 1))
min_finite <- min(train_data$log_V201[is.finite(train_data$log_V201)])
train_data$log_V201[is.infinite(train_data$log_V201)] <- min_finite
# Q-Q plot of log_V201
qqnorm(train_data$log_V201)
qqline(train_data$log_V201)
# Apply square root transformation to V201
train_data$sqrt_V201 <- sqrt(train_data$V201)
# Histogram of sqrt_V201
hist(train_data$sqrt_V201, breaks = 20, col = "lightblue", main = "Histogram of sqrt_V201")
The transformations above also give skewed histograms, so the dependent variable is not approximately normally distributed and a standard linear model is not well suited in this case.
We therefore use Generalized Linear Models (GLMs). GLMs are an extension of linear regression that can handle non-normal dependent variables and non-linear relationships between the dependent and independent variables. They are suitable when the dependent variable follows a non-normal distribution or when the relationship with the predictors is not linear. Examples of GLMs include logistic regression for binary outcomes and Poisson regression for count data.
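Since V201 (total children ever born) is a count variable, a Poisson GLM is one natural alternative to the Gaussian fit used below. The following is a minimal sketch with the same predictors; it is not part of the original analysis, and its coefficients are on the log scale.
# Illustrative alternative (not fitted in the original analysis):
# Poisson regression treats V201 as count data with a log link
model_pois <- glm(V201 ~ V013 + V024 + V025 + V106 + V190,
                  data = train_data, family = poisson)
summary(model_pois)
# exp(coef(model_pois)) gives multiplicative effects on the expected count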
library(caret)
# Fit a GLM on the training data
model <- glm(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, family = gaussian)
# Print the summary of the model
summary(model)
##
## Call:
## glm(formula = V201 ~ V013 + V024 + V025 + V106 + V190, family = gaussian,
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.0335 -0.8661 -0.0283 0.8905 9.8510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.086375 0.149002 0.580 0.56214
## V013 0.960474 0.011498 83.537 < 2e-16 ***
## V024 -0.001018 0.018434 -0.055 0.95596
## V025 0.174019 0.055383 3.142 0.00168 **
## V106 -0.436162 0.025569 -17.058 < 2e-16 ***
## V190 -0.123216 0.019514 -6.314 2.89e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.752417)
##
## Null deviance: 48447 on 6677 degrees of freedom
## Residual deviance: 18364 on 6672 degrees of freedom
## AIC: 25721
##
## Number of Fisher Scoring iterations: 2
# Check VIF
library(car)
## Loading required package: carData
vif(model) # Check for multicollinearity
## V013 V024 V025 V106 V190
## 1.209411 1.063295 1.797346 1.457527 1.797802
The independent variables V013, V025, V106, and V190 have significant coefficients (p < 0.01), indicating a strong relationship with the dependent variable V201; the intercept and V024 are not statistically significant. All VIF values lie between about 1.1 and 1.8, well below the usual cut-off of 5, so multicollinearity is not a concern. The model has a reasonably good fit, as indicated by the drop from the null deviance to the residual deviance and by the AIC. The dispersion parameter for the Gaussian family is 2.752.
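The reported dispersion estimate can be verified directly: for a Gaussian GLM it is the residual deviance divided by the residual degrees of freedom shown in the summary above.
# Dispersion estimate for a gaussian GLM = residual deviance / residual df
deviance(model) / df.residual(model)   # 18364 / 6672 = 2.752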
# Predict on the training data
predictions_train <- predict(model, newdata = train_data)
# Calculate R-square on the training data
rsquare_train <- caret::R2(predictions_train, train_data$V201)
rsquare_train
## [1] 0.6209423
# Calculate RMSE on the training data
rmse_train <- caret::RMSE(predictions_train, train_data$V201)
rmse_train
## [1] 1.658296
The model has an R-squared value of 0.6209 on the training data, which means that approximately 62.1% of the variability in the dependent variable (total children ever born) is explained by the independent variables included in the model (age group, region, type of place of residence, highest education level, and wealth index). This indicates a moderate level of explanatory power.
The root mean squared error (RMSE) of 1.658 means that, on average, the predicted values deviate from the observed values by approximately 1.66 children. Lower RMSE values indicate a better fit, so this value represents moderate prediction accuracy.
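As a cross-check, the same quantities can be computed by hand from the objects created above; note that caret::R2() by default reports the squared correlation between predictions and observations. A minimal sketch:
# Manual cross-check of the caret metrics
obs <- as.numeric(train_data$V201)
sqrt(mean((predictions_train - obs)^2))   # RMSE
cor(predictions_train, obs)^2             # R-squared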
# Predict on the test data
predictions_test <- predict(model, newdata = test_data)
# Calculate R-square on the test data
rsquare_test <- caret::R2(predictions_test, test_data$V201)
rsquare_test
## [1] 0.6124119
# Calculate RMSE on the test data
rmse_test <- caret::RMSE(predictions_test, test_data$V201)
rmse_test
## [1] 1.705514
# LOOCV
loocv <- trainControl(method = "LOOCV")
model_loocv <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = loocv, family = gaussian)
rsquare_loocv <- model_loocv$results$Rsquared
rmse_loocv <- model_loocv$results$RMSE
# k-fold cross-validation
kfold <- trainControl(method = "cv", number = 5)
model_kfold <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = kfold, family = gaussian)
rsquare_kfold <- model_kfold$results$Rsquared
rmse_kfold <- model_kfold$results$RMSE
# Repeated k-fold cross-validation
repeated_kfold <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_repeated_kfold <- train(V201 ~ V013 + V024 + V025 + V106 + V190, data = train_data, method = "glm", trControl = repeated_kfold, family = gaussian)
rsquare_repeated_kfold <- model_repeated_kfold$results$Rsquared
rmse_repeated_kfold <- model_repeated_kfold$results$RMSE
results <- data.frame(Method = c("Training","Testing", "LOOCV", "k-fold CV", "Repeated k-fold CV"),
R_Square = c(rsquare_train, rsquare_test, rsquare_loocv, rsquare_kfold, rsquare_repeated_kfold),
RMSE = c(rmse_train, rmse_test, rmse_loocv, rmse_kfold, rmse_repeated_kfold))
results
## Method R_Square RMSE
## 1 Training 0.6209423 1.658296
## 2 Testing 0.6124119 1.705514
## 3 LOOCV 0.6202303 1.659853
## 4 k-fold CV 0.6207434 1.659266
## 5 Repeated k-fold CV 0.6206538 1.659355
The differences in R-squared and RMSE across all the methods are quite small, so any of them could reasonably be used for the final prediction. However, among the cross-validation methods, k-fold CV gives the slightly higher R-squared and slightly lower RMSE, so it may be preferred.
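As a closing sketch of how the preferred k-fold model could be used, the caret train object can predict directly on the held-out test set. Since train() refits the final GLM on the full training split, these metrics should closely match the test-set values reported above.
# Final predictions on the test set using the k-fold CV model
final_pred <- predict(model_kfold, newdata = test_data)
caret::R2(final_pred, test_data$V201)
caret::RMSE(final_pred, test_data$V201)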