Based on Rezaei-Dehaghani, Keshvari and Paki (2018), family relationships are vital to a student’s academic performance. Utilizing the Student Alcohol Consumption dataset from Kaggle which contains information on student academic performance and family relationships information, we aim to create a model that could predict whether a student will pass an exam and also predict the marks they will acquire. By doing so, we aim to filter and assist students to perform better academically based on how well they perform.
library(caret)
Warning: package ‘caret’ was built under R version 4.3.1Loading required package: ggplot2
Warning: package ‘ggplot2’ was built under R version 4.3.1Loading required package: lattice
library(superml)
Warning: package ‘superml’ was built under R version 4.3.1Loading required package: R6
Warning: package ‘R6’ was built under R version 4.3.1
library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.3.1
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
Warning: package ‘corrplot’ was built under R version 4.3.2corrplot 0.92 loaded
library(patchwork)
Warning: package ‘patchwork’ was built under R version 4.3.2
library(randomForest)
Warning: package ‘randomForest’ was built under R version 4.3.2randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:ggplot2’:
margin
library(e1071)
Warning: package ‘e1071’ was built under R version 4.3.1
set.seed(42)
We will first import the required libraries. dplyr is used for data manipulation, ggplot2, patchwork and corrplot are used for data visualization while caret, randomForest, e1071 and superml are used for predictive modelling. At the same time, the seed is seeded to 42 to ensure reproducibility.
df_mat<-read.csv("student-mat.csv")
df_por<-read.csv("student-por.csv")
df<-rbind(df_mat,df_por)
df
The data is stored in 2 separate CSV based on the subject that the student is taking. Due to the limited samples of data, we will combine the 2 dataset to create a more comprehensive dataset. The resulting dataset consists of 1044 rows and 33 columns. We will then check whether there is any missing data in the dataset.
df %>%
select(everything()) %>%
summarise_all(funs(sum(is.na(.))))
Warning: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
Upon analysis, we verified that there are no missing values across the columns, thus the data imputation step can be omitted later on.
sample <- sample.int(n = nrow(df), size = floor(.8*nrow(df)), replace = F)
train <- df[sample, ]
test <- df[-sample, ]
The data is split in 8:2 ratio into train and test data. The resulting data will always be the same since we have seeded it. The train data will be used for EDA and training while the test data will be used to evaluate our models.
str(train)
'data.frame': 835 obs. of 33 variables:
$ school : chr "GP" "GP" "GP" "GP" ...
$ sex : chr "F" "F" "F" "M" ...
$ age : int 15 17 16 15 16 18 17 16 15 16 ...
$ address : chr "R" "U" "U" "U" ...
$ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
$ Pstatus : chr "T" "A" "T" "T" ...
$ Medu : int 3 4 4 4 2 3 1 2 2 1 ...
$ Fedu : int 3 3 4 2 2 3 1 2 2 1 ...
$ Mjob : chr "services" "services" "teacher" "teacher" ...
$ Fjob : chr "services" "services" "services" "other" ...
$ reason : chr "reputation" "course" "home" "home" ...
$ guardian : chr "other" "mother" "mother" "mother" ...
$ traveltime: int 2 1 1 1 2 1 4 1 1 1 ...
$ studytime : int 3 2 3 2 2 2 2 2 3 4 ...
$ failures : int 0 0 0 0 0 0 3 0 0 0 ...
$ schoolsup : chr "no" "no" "no" "no" ...
$ famsup : chr "yes" "yes" "yes" "yes" ...
$ paid : chr "yes" "yes" "no" "yes" ...
$ activities: chr "yes" "no" "yes" "no" ...
$ nursery : chr "yes" "yes" "no" "yes" ...
$ higher : chr "yes" "yes" "yes" "yes" ...
$ internet : chr "yes" "yes" "yes" "no" ...
$ romantic : chr "yes" "yes" "no" "no" ...
$ famrel : int 4 5 5 4 5 5 5 3 4 2 ...
$ freetime : int 2 2 3 3 4 3 3 3 5 2 ...
$ goout : int 1 2 2 3 4 4 5 4 2 1 ...
$ Dalc : int 2 1 1 2 2 1 1 1 1 1 ...
$ Walc : int 3 2 1 2 4 1 5 1 1 1 ...
$ health : int 3 5 5 5 5 5 5 4 3 5 ...
$ absences : int 2 23 4 2 0 0 0 0 0 0 ...
$ G1 : int 13 13 15 15 13 10 5 13 14 14 ...
$ G2 : int 13 13 16 15 13 9 8 13 14 14 ...
$ G3 : int 13 13 16 14 12 9 7 13 15 14 ...
Firstly, we have a look at the overview structure of the dataset to understand the values of each feature and their respective data type. There are columns which are logical, integer, and character. Some of the character columns have only 2 values thus we can encode them into 0 or 1.
label_encoders<-c()
binary_col<-c()
for (col in names(train)){
if (nrow(unique(train[col]))==2){
label_encoders<-c(label_encoders,LabelEncoder$new())
binary_col<-c(binary_col,col)
train[col]<-label_encoders[[length(label_encoders)]]$fit_transform(train[[col]])
}
}
names(label_encoders)<-binary_col
Each of the columns with only 2 values will be encoded into 0 or 1 to allow the relationship between variables to be easily visualized. A label encoder will be used to fit each columns and transform it. The label encoders trained will be stored in a vector so that we could reuse it with the test data.
M<-cor(mutate(subset(train,select=-c(Mjob,Fjob,reason,guardian))))
corrplot(M,method = 'color',tl.cex = 0.8, cl.cex = 0.8)
In the above figure, we can see the correlation between all the numerical features. We can see that G1. G2 and G3 seems to be highly correlated.
ggplot(train,aes(G3>=10))+labs(x="G3",y="Number of Students",title="Student Performance in G3")+geom_bar()
In the figure above, we can see that around 1 out of 4 students failed the G3 exam. We will look into some of the features that might have contributed to it.
grouped<-subset(train,select=-c(Mjob,Fjob,reason,guardian))
grouped["G1"]<-grouped["G1"]>=10
grouped["G2"]<-grouped["G2"]>=10
grouped["G1_2"]<-grouped["G1"]+grouped["G2"]
ggplot(grouped %>% group_by(G1_2) %>% summarise(total=sum(G3>=10)/n()),aes(x=G1_2,y=total))+geom_bar(stat='identity',fill = "steelblue")+labs(x="Total tests passed",y="Percentage of students who passed G3",title="Percentage of students who passed G3 grouped by total tests passed")
In the correlation graph, we have observed that G1 and G2 are highly correlated to G3. As shown in the figure above, almost all of the students who have passed both the previous tests passed G3. For those who have not passed any tests previously, only about 20% passed G3. Thus, we will analyze what makes the two groups perform so differently.
numeric_cols<-c()
for (col in names(grouped)){
if (is.numeric(grouped[[col]])){
numeric_cols<-c(numeric_cols,col)
}
}
grouped_scaled<-scale(grouped[numeric_cols])
grouped[numeric_cols]<-grouped_scaled
summarised<-group_by(grouped,G1_2) %>% summarise(across(everything(), mean))
summarised<-subset(summarised,select=-c(G1,G2,G1_2,G3))
diff2_0<-t(abs(summarised[3,]-summarised[1,]))
colnames(diff2_0)<-c("Difference")
barchart(diff2_0[order(diff2_0),],xlab="Normalized Mean Difference",ylab="Characteristic",main = "Normalized Mean Difference between Different Characteristics")
We grouped the students based on the number of previous tests passed and calculate the difference of standardized mean between different features and visualized in the figure above. The main feature that contributed to their difference is the number of failures further before in the class, which corresponded to our figure previously. Another key difference is the desire to pursue further education, followed by the parents’ education level and the time they have spent in studies.
grouped<-subset(train,select=-c(Mjob,Fjob,reason,guardian))
grouped["G1"]<-grouped["G1"]>=10
grouped["G2"]<-grouped["G2"]>=10
grouped["G1_2"]<-grouped["G1"]+grouped["G2"]
ggplot(grouped %>% group_by(G1_2) %>% summarise(failures=mean(failures)),aes(x=G1_2,y=failures))+geom_bar(stat='identity')+labs(x="Total tests passed",y="Percentage of student with previous failures",title="Percentage of student with previous failures based on tests passed")
As we can observe in the figure above, almost 80% of students who have failed G1 and G2 have failures previously.
ggplot(grouped %>% group_by(G1_2) %>% summarise(higher=mean(higher)),aes(x=G1_2,y=higher))+geom_bar(stat='identity')+labs(x="Total tests passed",y="Percentage of student desiring further education",title="Percentage of student desiring further education based on tests passed")
At the same time, more than 20% of the students who have not passed any tests do not desire to further their studies.
ggplot(grouped %>% group_by(G1_2) %>% summarise(studytime=mean(studytime)),aes(x=G1_2,y=studytime))+geom_bar(stat='identity')+labs(x="Total tests passed",y="Mean studytime",title="Mean studytime based on tests passed")
They also study half an hour less compared to those who pass all the test.
ggplot(grouped %>% group_by(G1_2) %>% summarise(Pedu=mean(Medu+Fedu)),aes(x=G1_2,y=Pedu))+geom_bar(stat='identity')+labs(x="Total tests passed",y="Parents' average education level",title="Parents' average education level based on tests passed")
The students’ parents also have a lower average education level compared to the students who are performing better.
ggplot(train,aes(x=famrel))+geom_histogram()+labs(x="Family relationship",y="Number of students",title="Family relationship distribution")
The figure above shows the distribution of the students’ family relationships. Most of the students have a satisfactory relationship with their families as 4 is the mode of the feature.
grouped<-subset(train,select=-c(Mjob,Fjob,reason,guardian))
grouped["G1"]<-grouped["G1"]>=10
grouped["G2"]<-grouped["G2"]>=10
grouped["G1_2"]<-grouped["G1"]+grouped["G2"]
grouped["famrel"]<-grouped["famrel"]>=4
#Scale numerical columns
numeric_cols<-c()
for (col in names(grouped)){
if (is.numeric(grouped[[col]])){
numeric_cols<-c(numeric_cols,col)
}
}
grouped_scaled<-scale(grouped[numeric_cols])
grouped[numeric_cols]<-grouped_scaled
summarised<-group_by(grouped,famrel) %>% summarise(across(everything(), mean))
summarised<-subset(summarised,select=-(famrel))
difference<-t(summarised[2,]-summarised[1,])
colnames(difference)<-c("Difference")
barchart(difference[order(difference),],xlab="Normalized Mean Difference",ylab="Characteristic")
If we compare those students with good family relationship (>=4) with those who have worse, we can observe that these students consume more amounts of alcohol in the weekends. They also have less free time and study time on hand.
# Create custom color palette
custom.col <- c("#abebad", "#ffff99", "#ADC2FF", "#f0b2f0", "#ff9999")
# Create the plots for weekend alcohol consumption
plot_weekend <- ggplot(df, aes(x = as.factor(Walc), y = G1, fill = as.factor(Walc))) +
geom_boxplot() +
scale_fill_manual(values = custom.col) +
theme_bw() +
theme(legend.position = "none") +
xlab("Alcohol Consumption Level") +
ylab("First Period Grade (G1)") +
ggtitle("Weekends") +
scale_x_discrete(labels = c("Very Low", "Low", "Moderate", "High", "Very High"))
# Create the plots for weekday alcohol consumption
plot_weekday <- ggplot(df, aes(x = as.factor(Dalc), y = G1, fill = as.factor(Dalc))) +
geom_boxplot() +
scale_fill_manual(values = custom.col) +
theme_bw() +
theme(legend.position = "none") +
xlab("Alcohol Consumption Level") +
ylab("First Period Grade (G1)") +
ggtitle("Weekdays") +
scale_x_discrete(labels = c("Very Low", "Low", "Moderate", "High", "Very High"))
# Create common title
common_title <- plot_annotation(title = "G1 Grade vs Alcohol Consumption Level for Weekdays and Weekends ",
theme = theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")))
# Combine the plots and add the common title
combined_plot <- plot_weekend + plot_weekday + plot_layout(ncol = 2, widths = c(2, 2)) + common_title
# Display the combined plot
print(combined_plot)
Median grades tend to decrease as alcohol consumption increases for both weekends and weekdays. The ‘Very High’ and ‘High’ alcohol consumption groups appear to have the lowest median grade, as well as the smaller interquartile range, indicating that students who consume a lot of alcohol tend to have lower and more consistent grades.
# Create custom color palette
custom.col <- c("#abebad", "#ffff99", "#ADC2FF", "#f0b2f0", "#ff9999")
# Create the plots for weekend alcohol consumption
plot_weekend <- ggplot(df, aes(x = as.factor(Dalc), y = G2, fill = as.factor(Dalc))) +
geom_boxplot() +
scale_fill_manual(values = custom.col)+
theme_bw() +
theme(legend.position = "none") +
xlab("Alcohol Consumption Level") +
ylab("Second Period Grade (G2)") +
ggtitle("Weekends") +
scale_x_discrete(labels = c("Very Low", "Low", "Moderate", "High", "Very High"))
# Create the plots for weekday alcohol consumption
plot_weekday <- ggplot(df, aes(x = as.factor(Walc), y = G2, fill = as.factor(Walc))) +
geom_boxplot() +
scale_fill_manual(values = custom.col)+
theme_bw() +
theme(legend.position = "none") +
xlab("Alcohol Consumption Level") +
ylab("Second Period Grade (G2)") +
ggtitle("Weekdays") +
scale_x_discrete(labels = c("Very Low", "Low", "Moderate", "High", "Very High"))
# Create common title
common_title <- plot_annotation(title = "G2 Grade vs Alcohol Consumption Level for Weekdays and Weekends ",
theme = theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")))
# Combine the plots and add the common title
combined_plot <- plot_weekend + plot_weekday + plot_layout(ncol = 2, widths = c(2, 2)) + common_title
# Display the combined plot
print(combined_plot)
Similar to Grade 1, the ‘Very High’, ‘High’, and ‘Moderate’ alcohol consumption groups are associated with lower median grades. This suggests that increased alcohol consumption tends to correspond with lower academic performance.
X_train<-subset(train,select=-c(G3))
y_train<-train$G3
X_test<-subset(test,select=-c(G3))
y_test<-test$G3
y_train<-ifelse(y_train>=10,TRUE,FALSE)
y_test<-ifelse(y_test>=10,TRUE,FALSE)
The target feature is used as the y while the rest of the features are used as the X. The y feature will be G3. For regression, it will be the score from 0-20, while for classification, it will be true or false based on whether the students passed the exam.
#Scale numerical columns
numeric_cols<-c()
for (col in names(X_train)){
if (is.numeric(X_train[[col]])){
numeric_cols<-c(numeric_cols,col)
}
}
X_train_scaled<-scale(X_train[numeric_cols])
X_train[numeric_cols]<-X_train_scaled
The numerical columns in X_train is scaled so that each of the different features have the same scale. This allow better interpretation of the data and speeds up training as the range is reduced. A copy of the scaled data is stored to allow the scale to be reused on the testing data.
# One hot encoding
dmy <- dummyVars(" ~ .",data=X_train)
X_train <- data.frame(predict(dmy,newdata=X_train))
X_train
The remaining categorical features in the training features will be encoded using one-hot encoding. The encoder is stored to be reused on testing data.
for (col in names(label_encoders)){
X_test[col]<-label_encoders[[col]]$transform(X_test[[col]])
}
X_test[numeric_cols]<-scale(X_test[numeric_cols],center=attr(X_train_scaled, "scaled:center"),
scale=attr(X_train_scaled, "scaled:scale"))
X_test <- data.frame(predict(dmy,newdata=X_test,na.action = na.pass))
X_test
The same transformation is applied to the test data so it is of the same scale with the train data and have the same features. It will similarly undergo data encoding and data normalization.
In our classification analysis, we employed four machine learning models which were Random Forest, Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), and Naive Bayes to predict whether the students will pass or fail in the third grade (G3). Each model underwent training on the X_train with the target variable in y_train. The models were then evaluated with a test set. The evaluation metrics include a confusion matrix, along with key performance metrics like accuracy, precision, recall, and F1 score.
evaluate_model <- function(predictions, actual, model_name) {
confusion_matrix <- table(Actual = as.factor(actual), Predicted = as.factor(predictions))
cat("Confusion Matrix for", model_name,"\n")
print.table(confusion_matrix)
cat("\n")
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
precision <- confusion_matrix["TRUE", "TRUE"] / sum(confusion_matrix[, "TRUE"])
recall <- confusion_matrix["TRUE", "TRUE"] / sum(confusion_matrix["TRUE", ])
f1_score <- 2 * (precision * recall) / (precision + recall)
metrics <- list(Accuracy = accuracy, Precision = precision, Recall = recall, F1_Score = f1_score)
return(metrics)
}
perform_classification <- function(X_train, y_train, X_test, y_test) {
set.seed(42)
model_names <- c("Random_Forest", "SVM", "k_NN", "Naive_Bayes")
results <- list()
models<-list()
for (model_name in model_names) {
if (model_name == "Random_Forest") {
model <- randomForest(x = X_train, y = as.factor(y_train),
ntree = 100,
mtry = sqrt(ncol(X_train) - 1),
nodesize = 1,
importance = TRUE,
type = "classification")
} else if (model_name == "SVM") {
model <- svm(as.factor(y_train) ~ ., data = X_train, kernel = "radial", cost = 1)
} else if (model_name == "k_NN") {
model <- train(x = X_train, y = as.factor(y_train),
method = "knn",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = data.frame(k = 15))
} else if (model_name == "Naive_Bayes") {
model <- naiveBayes(x = X_train, y = as.factor(y_train))
}
predictions <- predict(model, X_test)
metrics <- evaluate_model(predictions, y_test, model_name)
results[[model_name]] <- metrics
}
results_df <- do.call(rbind, results)
return(results_df)
}
results <- perform_classification(X_train, y_train, X_test, y_test)
Confusion Matrix for Random_Forest
Predicted
Actual FALSE TRUE
FALSE 32 10
TRUE 8 159
Confusion Matrix for SVM
Predicted
Actual FALSE TRUE
FALSE 22 20
TRUE 7 160
Confusion Matrix for k_NN
Predicted
Actual FALSE TRUE
FALSE 9 33
TRUE 2 165
Confusion Matrix for Naive_Bayes
Predicted
Actual FALSE TRUE
FALSE 29 13
TRUE 22 145
print(results)
Accuracy Precision Recall F1_Score
Random_Forest 0.9138756 0.9408284 0.9520958 0.9464286
SVM 0.8708134 0.8888889 0.9580838 0.9221902
k_NN 0.8325359 0.8333333 0.988024 0.9041096
Naive_Bayes 0.8325359 0.9177215 0.8682635 0.8923077
Among the models employed, the Random Forest emerged as the top performer for the given classification task. It achieved an impressive accuracy of 91.39%, showcasing a well-balanced precision of 94.08%, recall of 95.21% and F1 score of 94.64%. These results indicate the model’s ability to effectively classify the students’ grades.
model <- randomForest(x = X_train, y = as.factor(y_train), # Convert y_train to factor for classification
ntree = 100,
mtry = sqrt(ncol(X_train) - 1),
nodesize = 1,
importance = TRUE,
type = "classification")
# Calculate feature importance for Rf (best model)
feature_importance <- importance(model)
# Create a data frame for feature importance
feature_importance_df <- data.frame(
Features = row.names(feature_importance),
Importance = feature_importance[, "MeanDecreaseAccuracy"]
)
# Sort the data frame by importance (descending order)
feature_importance_df <- feature_importance_df[order(-feature_importance_df$Importance), ]
# Create a bar plot of feature importance
ggplot(data = feature_importance_df, aes(x = Importance, y = reorder(Features, Importance))) +
geom_bar(stat = "identity", fill = "steelblue", width = 0.8) +
geom_text(aes(label = round(Importance, 2)), hjust = -0.1, vjust = 0.5, size = 2, color = "black") +
labs(x = "Importance", y = "Features") +
theme(axis.text.y = element_text(hjust = 1)) +
ggtitle("Feature Importance")
The bar plot provides a visual representation of feature importance in our trained Random Forest model. Longer bars indicate greater importance in predicting outcomes. From the plot, Grade 1 and Grade 2 stand out as substantial contributors to the model’s performance. This observation is consistent with the understanding that early grades often serve as fundamental indicators for the following academic success.
In our study, G3 has been identified as the only appropriate predictor for our regression models due to the limited predictive capability of other features. Initially, the focus was to predict the binary outcome of students’ success or failure. However, to gain a deeper insight into student performance, we shifted our attention to predicting the actual G3 scores using regression models. These models, including Linear Regression, Random Forest, and Support Vector Regression (SVR), are meticulously evaluated using metrics like RMSE and R². Our approach encompasses data preparation, model building, and in-depth analysis of feature importance, aimed at providing a comprehensive understanding of factors influencing G3 scores.
set.seed(42)
y_train<-train$G3
y_test<-test$G3
train_new<-cbind(X_train,G3=y_train)
train_new
linear_model <- lm(G3 ~ ., train_new)
summary(linear_model)
Call:
lm(formula = G3 ~ ., data = train_new)
Residuals:
Min 1Q Median 3Q Max
-9.0153 -0.5077 0.1268 0.7718 5.4630
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.0941620 0.3841245 28.882 < 2e-16 ***
school 0.0461210 0.0658997 0.700 0.484216
sex -0.0159701 0.0648252 -0.246 0.805470
age -0.0516032 0.0659433 -0.783 0.434132
address 0.0675822 0.0623198 1.084 0.278498
famsize -0.0009285 0.0579925 -0.016 0.987230
Pstatus 0.0594224 0.0589365 1.008 0.313644
Medu 0.0267174 0.0893644 0.299 0.765040
Fedu -0.0277902 0.0778462 -0.357 0.721195
Mjobat_home -0.0806190 0.2644034 -0.305 0.760515
Mjobhealth 0.0150217 0.2658821 0.056 0.954960
Mjobother -0.0658686 0.2229282 -0.295 0.767712
Mjobservices 0.0646415 0.2142298 0.302 0.762930
Mjobteacher NA NA NA NA
Fjobat_home 0.5911588 0.3612100 1.637 0.102109
Fjobhealth 0.4200493 0.3677385 1.142 0.253695
Fjobother 0.3059813 0.2670139 1.146 0.252167
Fjobservices 0.1132040 0.2700200 0.419 0.675152
Fjobteacher NA NA NA NA
reasoncourse 0.2579934 0.1515152 1.703 0.089006 .
reasonhome 0.1369730 0.1668578 0.821 0.411952
reasonother 0.1080025 0.2203820 0.490 0.624220
reasonreputation NA NA NA NA
guardianfather -0.1271201 0.2643065 -0.481 0.630680
guardianmother -0.1517286 0.2408441 -0.630 0.528885
guardianother NA NA NA NA
traveltime 0.0832493 0.0622915 1.336 0.181786
studytime -0.0109625 0.0617616 -0.177 0.859164
failures -0.1701065 0.0646666 -2.631 0.008691 **
schoolsup 0.0117131 0.0586787 0.200 0.841833
famsup -0.0926991 0.0588590 -1.575 0.115670
paid 0.1797759 0.0581375 3.092 0.002056 **
activities 0.0481574 0.0574982 0.838 0.402538
nursery 0.0203182 0.0565288 0.359 0.719368
higher -0.0527546 0.0608266 -0.867 0.386043
internet 0.0024146 0.0594991 0.041 0.967639
romantic 0.0061503 0.0579977 0.106 0.915575
famrel 0.0925326 0.0583883 1.585 0.113415
freetime 0.0108051 0.0611041 0.177 0.859686
goout -0.0468975 0.0647874 -0.724 0.469361
Dalc -0.0587007 0.0737803 -0.796 0.426494
Walc 0.1208723 0.0807811 1.496 0.134975
health -0.0354293 0.0584351 -0.606 0.544488
absences 0.2048107 0.0594102 3.447 0.000596 ***
G1 0.4392294 0.1139398 3.855 0.000125 ***
G2 3.1119953 0.1114711 27.918 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.565 on 793 degrees of freedom
Multiple R-squared: 0.8443, Adjusted R-squared: 0.8362
F-statistic: 104.9 on 41 and 793 DF, p-value: < 2.2e-16
linear_model_summary <- summary(linear_model)
r_squared <- linear_model_summary$r.squared
print(r_squared)
[1] 0.8442896
predictions <- predict(linear_model, newdata = X_test)
rmse <- sqrt(mean((predictions - y_test)^2))
r_squared <- 1 - sum((predictions - y_test)^2) / sum((y_test - mean(y_test))^2)
cat("RMSE:", rmse, "\n")
RMSE: 1.624044
cat("R²:", r_squared, "\n")
R²: 0.8228375
comparison_df <- data.frame(Actual = y_test, Predicted = predictions)
ggplot(comparison_df, aes(x = Actual, y = Predicted)) +
geom_point() + # Add Scatter Plot
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") + # Add y=x Diagonal Line
labs(title = "Linear Regression", x = "Actual G3", y = "Predicted G3") # Add Title and Axis Labels
The scatter plot shows a comparison between the actual G3 grades and the predicted grades from the Linear Regression model. The red dashed line represents the ideal where the predicted grades perfectly match the actual grades. Points clustering close to this line suggest good model accuracy. The general trend indicates that the Linear Regression model predicts lower and mid-range grades quite well, as points in these areas are closer to the line. The spread of points away from the line at the extremities could indicate less accuracy for very high or very low grades. This pattern might suggest that the model is better at predicting average performances than extremes.
rf_model <- randomForest(G3 ~ ., data = train_new, ntree = 500)
print(rf_model)
Call:
randomForest(formula = G3 ~ ., data = train_new, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 15
Mean of squared residuals: 2.29217
% Var explained: 84.65
r_squared_train <- rf_model$rsq
r_squared_train_last <- tail(r_squared_train, 1)
cat("R² for Training Set:", r_squared_train_last, "\n")
R² for Training Set: 0.8464827
rf_predictions <- predict(rf_model, newdata = X_test)
rf_rmse <- sqrt(mean((rf_predictions - y_test)^2))
r_squared_test <- 1 - sum((rf_predictions - y_test)^2) / sum((y_test - mean(y_test))^2)
cat("R² for Test Set:", r_squared_test, "\n")
R² for Test Set: 0.8132012
cat("RMSE:", rf_rmse, "\n")
RMSE: 1.667627
feature_importance_raw <- importance(rf_model)
feature_importance <- data.frame(Feature = rownames(feature_importance_raw), Importance = feature_importance_raw[, "IncNodePurity"])
ggplot(feature_importance, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() + # Flipping the coordinates for better readability of feature names
theme_minimal() +
labs(title = "Feature Importance in Random Forest Model", x = "Features", y = "Importance")
This horizontal bar chart displays the feature importance as determined by a Random Forest model. The features are listed on the y-axis and their importance scores are on the x-axis. It’s clear from the chart that ‘G2’ and ‘G1’ are the most significant predictors of the target variable, which, in the context of student performance, are likely previous grades. ‘Absences’ also appears to be a significant feature, followed by ‘failures’. Other variables such as ‘studytime’, ‘Medu’ (mother’s education), ‘Fedu’ (father’s education), and ‘traveltime’ show relatively less importance. The length of the bars represents how influential each feature is in the model’s predictions; longer bars indicate greater importance.
svr_model <- svm(G3 ~ ., data = train_new, type = "eps-regression", kernel = "radial")
svr_predictions <- predict(svr_model, newdata = X_test)
svr_rmse <- sqrt(mean((svr_predictions - y_test)^2))
cat("SVR RMSE:", svr_rmse, "\n")
SVR RMSE: 2.037617
svr_r_squared <- 1 - sum((svr_predictions - y_test)^2) / sum((y_test - mean(y_test))^2)
cat("SVR R²:", svr_r_squared, "\n")
SVR R²: 0.7211175
model_comparison <- data.frame(
Model = c("Linear Regression", "Random Forest", "SVR"),
RMSE = c(rmse, rf_rmse, svr_rmse),
R_Squared = c(r_squared, r_squared_test, svr_r_squared)
)
print(model_comparison)
In the comparison of the three models, the linear regression model outperformed the other models with the lowest RMSE and highest R-square, indicating that it had the best predictive accuracy and explanatory power for the G3 score. Random forest comes second, but with slightly less accuracy and explanatory power. SVR had the highest RMSE and lowest R-squared, making it the least effective at predicting and explaining variance in G3 scores. The results showed that the linear regression model was the most effective for predicting students’ final exam scores (G3).
Rezaei-Dehaghani, A., Keshvari, M., & Paki, S. (2018). The relationship between family functioning and academic achievement in female high school students of Isfahan, Iran, in 2013–2014. Iranian Journal of Nursing and Midwifery Research, 23(3), 183. https://doi.org/10.4103/ijnmr.ijnmr_87_17