WQD7004 Group Assignment

Adrian Chen & Luqman Hakim

7/1/2021

Group Project R

By Adrian Chen & Luqman Hakim

Our chosen dataset covers students in a Mathematics class at 2 Portuguese schools for the 2005-2006 school year. It contains 395 observations of 33 attributes, including performance scores, demographics, family background, behaviours, and more.

These were all collected through school reports and questionnaires.

These attributes let us pose several interesting questions about the dataset.

Exploring the Data

We started off by exploring the data using standard EDA tools.

Continuing with EDA…

There are 395 observations of 33 attributes: 17 are character variables and the remaining 16 are numeric.
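
A count like this can be obtained by tabulating the class of every column. The sketch below is only an illustration: the data frame name students, its three columns, and all values are made up, and the real data would be read from the dataset file instead.

```r
# Tiny made-up data frame standing in for the real dataset (assumption).
students <- data.frame(
  school = c("A", "A", "B"),
  age    = c(18, 17, 15),
  G3     = c(6, 10, 15),
  stringsAsFactors = FALSE
)

# Dimensions: number of observations and number of attributes.
dims <- dim(students)

# Count how many columns are character and how many are numeric.
type_counts <- table(sapply(students, class))

print(dims)
print(type_counts)
```

On the full dataset, the same two calls would report the number of rows and columns and the split between character and numeric attributes.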

Check for missing values & outliers

Most of the values fall within an acceptable range, apart from absences, which has quite a number of outliers. We believe these values are genuine rather than recording errors, since some students really do miss many classes, so we did not drop them.
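
One way to run both checks in base R is sketched below. It is an illustration only: the data frame students and its values are made up, and the 1.5 x IQR boxplot rule is an assumed outlier criterion, since the slides do not say which rule was used.

```r
# Made-up values for illustration only (not from the dataset).
students <- data.frame(
  absences = c(0, 2, 4, 4, 6, 8, 10, 75),
  G3       = c(10, 12, 9, 14, 11, 13, 8, 9)
)

# Number of missing values per column.
na_counts <- colSums(is.na(students))

# Values lying more than 1.5 times the box length beyond the hinges,
# i.e. the points a boxplot would draw as outliers.
absence_outliers <- boxplot.stats(students$absences)$out

print(na_counts)
print(absence_outliers)
```

Flagged values are only candidates for removal; whether to keep them is a judgment about the data, as discussed above.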

Objective/Goal of processing this dataset

Schools that are effective at unlocking students’ potential for outstanding achievement are not necessarily effective at ensuring that all students achieve the same outcome.

Hence, this study looks for insights into the relationships among several indicators of high school performance: final grades, study time, alcohol consumption, and other attributes.

Since preprocessing is done and there are no NA values, let’s do some data visualization.

Data Visualization

We were also intrigued by how alcohol consumption would affect the students’ performance.

Dropping students who dropped the subject

We then ran summary() to check the contents of each subset.

The data are split into two subsets, Data_Drop and Data_Stay, according to whether the students dropped the subject or continued. Data_Drop contains 38 students (23 F, 15 M), while Data_Stay contains 357 students (185 F, 172 M).

For Data_Drop, G1 ranges from 4 to 12, with a median of 7 and a mean of 7.526.

G2 ranges from 0 to 10, with a median of 5 and a mean of 4.658.

For Data_Stay, G1 ranges from 3 to 19, with a median of 11 and a mean of 11.27.

G2 ranges from 5 to 19, with a median of 11 and a mean of 11.36.

G3 ranges from 4 to 20, with a median of 11 and a mean of 11.52.
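
The split and the summaries could be produced along the following lines. This is a sketch under assumptions: the slides do not show how dropped students were identified, so a final grade of 0 is assumed to mark them, the column name sex is assumed, and the data frame contains made-up values.

```r
# Made-up grades for illustration only (not from the dataset).
students <- data.frame(
  sex = c("F", "M", "F", "M", "F"),
  G1  = c(7, 12, 5, 15, 10),
  G2  = c(5, 11, 0, 16, 9),
  G3  = c(0, 11, 0, 17, 10)
)

# Assumption: a final grade of 0 marks a student who dropped the subject.
Data_Drop <- subset(students, G3 == 0)
Data_Stay <- subset(students, G3 > 0)

# Group sizes by sex, then min, quartiles, median, mean and max per grade.
print(table(Data_Drop$sex))
print(table(Data_Stay$sex))
print(summary(Data_Drop[, c("G1", "G2")]))
print(summary(Data_Stay[, c("G1", "G2", "G3")]))
```

Every student lands in exactly one subset, so the two row counts add up to the size of the full data.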

G3 VS Study Time

Description: Study time - weekly study time (numeric: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, or 4 = >10 hours)

Our small sample supports the popular belief that students who study longer get better final grades than students who study less, although the difference is small. Note that there also appears to be an optimal study time, as students who study more than 10 hours have lower G3 scores.
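
A comparison like this can be computed as the mean G3 per study-time category. The sketch below assumes the column is named studytime and uses made-up values, so its output does not reproduce the project's results.

```r
# Made-up values for illustration only (not from the dataset).
# studytime codes: 1 = <2 h, 2 = 2 to 5 h, 3 = 5 to 10 h, 4 = >10 h.
students <- data.frame(
  studytime = c(1, 1, 2, 2, 3, 3, 4, 4),
  G3        = c(9, 11, 10, 12, 12, 14, 11, 13)
)

# Mean final grade for each study-time category.
g3_by_studytime <- aggregate(G3 ~ studytime, data = students, FUN = mean)
print(g3_by_studytime)
```

Replacing FUN = mean with FUN = median gives a comparison that is less sensitive to a few extreme grades.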

Analysis: Questions

We are interested in exploring which attributes affect the students’ final grade (G3).

In this project we ask a few questions and answer some of them with a decision tree and others with linear regression on the same dataset.

Question 1:

Are G1 and G2 correlated with G3? In other words, do students who put in the effort to score well in G1 and G2 also do well in G3?

The p-value is reported as $p < 2.2 \times 10^{-16}$, which is effectively zero.

In R's notation, 2.2e-16 means 0.00000000000000022, so the p-value is (very much) less than 0.05.

The plot shows a stronger linear relationship between G3 and G2 than between G3 and G1.

From the p-value in the summary, we can conclude that G1 and G2 are correlated with G3.
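
The slides do not show the call that produced the p-value. One plausible way to obtain such a result, sketched here on made-up grades, is a Pearson correlation test for each period grade plus a linear model of G3 on G1 and G2.

```r
# Made-up grades for illustration only (not from the dataset).
students <- data.frame(
  G1 = c(5, 7, 8, 10, 12, 13, 15, 16),
  G2 = c(6, 6, 9, 10, 11, 14, 15, 17),
  G3 = c(6, 7, 9, 10, 12, 14, 15, 18)
)

# Pearson correlation of each period grade with the final grade.
test_g1 <- cor.test(students$G1, students$G3)
test_g2 <- cor.test(students$G2, students$G3)

# Linear model of G3 on G1 and G2; summary() reports the overall p-value.
fit <- lm(G3 ~ G1 + G2, data = students)

print(test_g1$estimate)
print(test_g2$estimate)
print(summary(fit))
```

A small p-value here indicates that a linear association is unlikely to be due to chance; it does not by itself say how strong the association is, which is what the correlation estimates show.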

Question 2:

Does health affect the students’ final grade G3?

Question 3:

What are the key predictors for Final Grade G3?

Model 1: without a train/test split

Decision Tree Analysis

We use R to create Decision Trees to predict the final performance grade using all the variables in the dataset.

We find that the grades in the 1st and 2nd exams are the key predictors, followed by attendance level, alcohol consumption, and parents’ jobs.

The tree itself uses only attendance, father’s job, and the grades in the 1st and 2nd exams as split variables, because of correlation and collinearity among some of the other variables.
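
A tree of this kind could be fitted as sketched below. Several things here are assumptions: the slides do not name the package, so rpart is used as a common choice; a regression tree (method = "anova") is fitted, whereas the project may have used a classification tree; and the data are randomly generated stand-ins, not the real dataset.

```r
library(rpart)  # assumption: the source does not name the tree package

# Randomly generated stand-in data for illustration only.
set.seed(1)
n <- 60
students <- data.frame(
  G1       = sample(3:19, n, replace = TRUE),
  absences = sample(0:20, n, replace = TRUE)
)
students$G2 <- pmin(pmax(students$G1 + sample(-2:2, n, replace = TRUE), 0), 20)
students$G3 <- pmin(pmax(students$G2 + sample(-1:1, n, replace = TRUE), 0), 20)

# Regression tree for the numeric final grade, using all other columns.
tree <- rpart(G3 ~ ., data = students, method = "anova")

# Variable importance indicates which predictors drive the splits.
print(tree$variable.importance)
print(tree)

# Predicted final grade for every student in the data.
predicted <- predict(tree, students)
```

Printing the fitted tree lists each split rule, which is how the variables actually used by the tree can be read off.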

Model 2

Splitting the dataset into training and test sets for G3

Create the training and test sets

Check the dimensions of both the training and test datasets
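
These two steps can be sketched in base R as follows. The 80/20 ratio and the seed are assumptions, since the slides do not state them, and the data frame holds made-up values.

```r
# Made-up data for illustration only (not from the dataset).
students <- data.frame(
  G2 = c(6, 6, 9, 10, 11, 14, 15, 17, 8, 12),
  G3 = c(6, 7, 9, 10, 12, 14, 15, 18, 8, 11)
)

# Assumption: 80% of the rows for training, 20% for testing.
set.seed(123)  # fixed seed so the split is reproducible
train_idx <- sample(seq_len(nrow(students)), size = floor(0.8 * nrow(students)))

train_data <- students[train_idx, ]
test_data  <- students[-train_idx, ]

# Check the dimensions of both sets.
print(dim(train_data))
print(dim(test_data))
```

Sampling row indices without replacement guarantees that no observation appears in both sets.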

Model - Decision tree

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (for instance whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label.

Question 4:

Does alcohol affect the students’ G3 grade? (Detailed Regression)

Before proceeding, we checked the variables to see if they are correlated:

The only correlated variables are G1, G2 & G3, which is expected, as G3 is thought to be derived from G1 and G2. This is not a problem, as we will refer to G3 only.

Question 4: Model with only alcohol variables

A model with only the 2 alcohol variables does not explain G3 significantly.

However, the coefficients of both variables are negative, which suggests that higher alcohol consumption is associated with lower G3 scores.
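
A two-variable model of this kind can be fitted with lm(), as in the sketch below. The column names Dalc (workday consumption) and Walc (weekend consumption) are assumptions, and the values are made up, so the printed coefficients are not the project's results.

```r
# Made-up data for illustration only (not from the dataset).
# Dalc / Walc: workday / weekend alcohol consumption (assumed column names).
students <- data.frame(
  Dalc = c(1, 1, 2, 1, 3, 2, 4, 1, 5, 2),
  Walc = c(1, 2, 2, 3, 3, 4, 4, 1, 5, 3),
  G3   = c(14, 12, 11, 13, 9, 10, 8, 15, 7, 12)
)

# Linear regression of the final grade on the two alcohol variables.
fit_alc <- lm(G3 ~ Dalc + Walc, data = students)

# Coefficient signs, per-variable p-values and the overall F-test p-value.
print(summary(fit_alc))
print(coef(fit_alc))
```

In the summary, the per-coefficient p-values test each variable separately, while the F-statistic at the bottom tests the model as a whole; the two can disagree.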

Question 4: Model with additional variables

The new regression model has a lower p-value of 0.012, which indicates that the model is significant at the 5% significance level (95% confidence level).

Although the alcohol consumption variables are not individually significant, their coefficients are negative, which suggests they are still associated with lower G3 scores. In addition, consumption during workdays appears to play a bigger role than consumption during the weekends.

RMSE Metric Evaluation
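
The slide does not show how the metric was computed. For reference, the root mean squared error is the square root of the mean squared difference between actual and predicted values. A minimal sketch, with a hypothetical helper named rmse and made-up numbers:

```r
# Hypothetical helper: root mean squared error between two numeric vectors.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up grades and predictions for illustration only.
actual    <- c(10, 12, 8, 15)
predicted <- c(11, 12, 6, 14)

result <- rmse(actual, predicted)
print(result)
```

RMSE is expressed in the same units as G3, so it can be read as a typical prediction error in grade points.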

Question 4:

To examine this further, we drew beeswarm plots of the 2 alcohol variables against the final grade (G3).

It can be seen that those who consumed less alcohol (on either weekdays or weekends) tend to have higher G3 scores. Higher alcohol consumption is associated with poorer results.

It is also worth noting that correlation does not imply causation: alcohol consumption could be an effect of poor grades rather than the cause.

Thank you.