Final Project

Executive Summary

In this project, I examined the ‘student-mat.csv’ dataset to better understand how personal factors relate to students’ academic performance. I focused on alcohol consumption, parents’ educational background, and study routines, analyzing data from 395 students. Using RStudio with the tidyverse packages, including ggplot2, I found the following. There was a small negative relationship between alcohol use and grades, but it was not statistically significant. The mother’s education level showed a statistically significant positive association with students’ final grades, which points to the importance of family background, although the model that included it explained only about 5% of the variation in grades. The link between personal habits such as study time and academic results was weaker and fell just short of conventional significance, suggesting that other factors also play a key role in determining grades.

Impact of Personal Factors on Student Academic Performance

In this presentation, I present an in-depth analysis of several personal factors and how they relate to students’ academic performance. These factors include parental status, gender, personal habits, and alcohol consumption.

Key Points of Data Analysis

The specific things I will focus on are:

  • Correlation between family background and academic success
  • Impact of personal habits, like alcohol consumption, on grades
  • Using R packages in RStudio for data visualization

Dataset Overview

The data I will be exploring is the “student-mat.csv” dataset, which contains a diverse range of variables describing students’ academic status as well as many personal factors. Using this data, we can check whether there are any correlations between students’ personal circumstances and their school performance.

Dataset Highlights

Some important things to know about the dataset:

  • Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/student+performance)
  • Variables: 33 variables, including demographic, academic, and lifestyle factors
  • Observations: 395 students

Key Variables from the dataset

  • Academic Performance: G1, G2, G3 (grades over three periods)
  • Alcohol Consumption: Dalc (workday consumption), Walc (weekend consumption)
  • Family Background: Parental education (Medu, Fedu), family support (famsup)
  • Personal Habits: Study time (studytime), extracurricular activities (activities), internet access (internet)

Structure of Dataset

Rows: 395
Columns: 33
$ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
$ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
$ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
$ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
$ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
$ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
$ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
$ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
$ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
$ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
$ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
$ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
$ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
$ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
$ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
$ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
$ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
$ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
$ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
$ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
$ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
$ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
$ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
$ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
$ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
$ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
$ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
$ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
$ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
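
The listing above has the layout of dplyr's glimpse() output. As a minimal sketch of how the file could be loaded and inspected in base R, assuming it sits in the working directory and uses semicolons as the field separator (as the UCI copy of this file does); the helper name load_student_data and the path are illustrative assumptions, not the code used in this project.

```r
# Hypothetical helper: read the student file into a data frame.
# Assumes a semicolon-separated file with a header row.
load_student_data <- function(path = "student-mat.csv") {
  read.csv(path, sep = ";", header = TRUE, stringsAsFactors = FALSE)
}

# Only run the inspection when the file is actually present.
if (file.exists("student-mat.csv")) {
  student_data <- load_student_data()
  print(dim(student_data))  # number of rows and columns
  str(student_data)         # column names, types, and first values
}
```

If the dimensions printed are not 395 and 33, the separator is the first thing to check: reading a semicolon-separated file with the default comma separator yields a single wide column.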

Research Questions

This analysis focuses on the relationship between students’ lifestyle choices, their family background, and their academic performance. Here are the key research questions:

  • Question 1. Influence of Alcohol Consumption: How does alcohol consumption (both on weekdays and weekends) impact students’ grades?

  • Question 2. Role of Family Background: What effect do parents’ education levels and family support have on students’ academic achievements?

  • Question 3. Personal Habits and Academic Performance: Is there a correlation between students’ study habits, internet usage, and their academic outcomes?

The Approach

To do this data exploration, I will use a variety of packages and functions.

R Packages and Functions

  • tidyverse: For data manipulation and visualization.
  • ggplot2: For creating clear, polished visualizations (loaded as part of tidyverse).
  • cor() and lm() functions: For correlation and linear regression analyses.

Example R Code, Package installation, and Analysis

  mean_age  mean_G3
1  16.6962 10.41519
[1] -0.05466004

From this we have found that the average age of students is approximately 16.7 years, and the average final grade is around 10.4 out of a possible 20. The correlation between weekday alcohol consumption (Dalc) and final grades (G3) is about -0.05, which is negative but very weak: in this simple analysis, higher weekday alcohol consumption has almost no linear relationship with lower grades.
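
The output above reports two means and one correlation. A minimal sketch of how such numbers could be computed in base R, assuming a data frame with age, Dalc, and G3 columns; the helper name summarise_students is made up for illustration, and the usage lines take only the first five rows shown in the structure listing, so their results differ from the full-data output.

```r
# Hypothetical helper: mean age, mean final grade, and the Pearson
# correlation between weekday alcohol use (Dalc) and final grade (G3).
summarise_students <- function(student_data) {
  list(
    means = data.frame(
      mean_age = mean(student_data$age),
      mean_G3  = mean(student_data$G3)
    ),
    cor_Dalc_G3 = cor(student_data$Dalc, student_data$G3)
  )
}

# Usage on the first five rows from the structure listing (a tiny subset).
first_rows <- data.frame(
  age  = c(18, 17, 15, 15, 16),
  Dalc = c(1, 1, 2, 1, 1),
  G3   = c(6, 6, 10, 15, 10)
)
result <- summarise_students(first_rows)
print(result$means)
print(result$cor_Dalc_G3)
```

cor() computes the Pearson correlation by default, which measures only the linear part of a relationship; a value near zero does not rule out a non-linear pattern.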

Data Exploration

Initial Exploration of Student Grade Distribution
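
The grade-distribution chart is not reproduced in this text. As a rough sketch of how a histogram of final grades could be drawn with ggplot2, assuming a data frame with a G3 column; the helper name, bar width, colours, and labels are all illustrative choices rather than the settings of the original chart.

```r
# Hypothetical helper: histogram of final grades (G3), one bar per grade point.
# Calls are namespaced, so ggplot2 only needs to be installed, not attached.
plot_grade_distribution <- function(student_data) {
  ggplot2::ggplot(student_data, ggplot2::aes(x = G3)) +
    ggplot2::geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
    ggplot2::labs(
      title = "Distribution of final grades",
      x = "Final grade (G3)",
      y = "Number of students"
    )
}

# Usage on the first eight G3 values from the structure listing.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  grades <- data.frame(G3 = c(6, 6, 10, 15, 10, 15, 11, 6))
  p <- plot_grade_distribution(grades)
}
```

Calling print(p) in RStudio would display the chart in the Plots pane.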

Question 1: Summary

**Analysis of Alcohol Consumption and Academic Performance:**

  • Investigating how different levels of alcohol consumption (Dalc and Walc) correlate with final grades (G3).

Question 1: Summary Report


Call:
lm(formula = G3 ~ Walc, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6537  -2.0071   0.3463   3.3463   9.3463 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.8385     0.4708  23.019   <2e-16 ***
Walc         -0.1848     0.1792  -1.031    0.303    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.581 on 393 degrees of freedom
Multiple R-squared:  0.002698,  Adjusted R-squared:  0.00016 
F-statistic: 1.063 on 1 and 393 DF,  p-value: 0.3032

Question 1: Chart

Question 1: Explanation

The linear regression model points to a minor negative link between weekend alcohol consumption and final grades. It estimates that each additional level of weekend alcohol consumption (Walc) is associated with a decrease of 0.1848 points in the final grade. However, this estimate is not statistically significant (p-value of 0.303), so the pattern could plausibly be due to random chance.

Looking at the model’s residuals, which range from about -10.65 to 9.35, we see there’s a lot of variation in grades that weekend alcohol consumption doesn’t fully explain. Additionally, the model’s R-squared value is quite low at 0.002698. This means that the model accounts for only about 0.27% of the variability in final grades, indicating that many other factors are likely influencing these grades.
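
The three quantities discussed above (slope, p-value, R-squared) can be pulled out of a fitted model programmatically instead of being read off the printed summary. A minimal sketch, assuming a data frame with Walc and G3 columns; the helper name report_walc_model and the five toy rows are invented for illustration and are not taken from the dataset.

```r
# Hypothetical helper: fit G3 ~ Walc and return the quantities discussed above.
report_walc_model <- function(student_data) {
  fit <- lm(G3 ~ Walc, data = student_data)
  s <- summary(fit)
  list(
    slope          = unname(coef(fit)["Walc"]),
    p_value        = s$coefficients["Walc", "Pr(>|t|)"],
    r_squared      = s$r.squared,
    residual_range = range(residuals(fit))
  )
}

# Invented toy data, only to show the call.
toy <- data.frame(Walc = c(1, 2, 3, 4, 5), G3 = c(12, 11, 9, 10, 7))
report <- report_walc_model(toy)
print(report$slope)
print(report$r_squared)
```

Run on the full data frame, the same call should return the values shown in the summary report, since it fits the same formula.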

Question 2: Summary

**Analysis of the Role of Family Background in Academic Performance:**

  • Exploring the impact of parents’ education levels (Medu, Fedu) and family support (famsup) on final grades (G3)

Question 2: Summary Report


Call:
lm(formula = G3 ~ Medu + Fedu + famsup, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0727  -2.0788   0.5971   2.8808   9.6314 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.0947     0.6665  12.146  < 2e-16 ***
Medu          0.8753     0.2643   3.312  0.00101 ** 
Fedu          0.1589     0.2659   0.598  0.55043    
famsupyes    -0.7945     0.4719  -1.684  0.09306 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.472 on 391 degrees of freedom
Multiple R-squared:  0.05448,   Adjusted R-squared:  0.04723 
F-statistic:  7.51 on 3 and 391 DF,  p-value: 6.734e-05

Question 2: Chart

Question 2: Explanation

The regression model shows that the mother’s education level (Medu) has a statistically significant positive association with students’ final grades (p-value of 0.00101). Each additional level of maternal education is associated with an increase of about 0.88 points in the final grade, holding the father’s education and family support constant. On the other hand, the father’s education level (Fedu) did not show a significant effect, and family educational support (famsupyes) had a negative coefficient that was only marginally significant (p-value of 0.09306) and does not meet the usual 0.05 threshold.

Despite these findings, the model’s residuals indicate a substantial amount of variation in grades that these factors alone do not explain. With a multiple R-squared value of 0.05448, the model accounts for about 5.45% of the variability in final grades. This suggests that while the mother’s education, father’s education, and family support are factors, there are other significant influences affecting student grades.
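
To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below plugs the estimates from the Question 2 summary report into the model formula; the helper name predict_g3 and the example students are illustrative assumptions.

```r
# Coefficients copied from the Question 2 summary report.
coefs <- c(intercept = 8.0947, Medu = 0.8753, Fedu = 0.1589, famsupyes = -0.7945)

# Hypothetical helper: predicted final grade for given predictor values.
# famsup is "yes" or "no"; lm() codes it as a 0/1 dummy named famsupyes.
predict_g3 <- function(Medu, Fedu, famsup) {
  unname(
    coefs["intercept"] +
      coefs["Medu"] * Medu +
      coefs["Fedu"] * Fedu +
      coefs["famsupyes"] * (famsup == "yes")
  )
}

high <- predict_g3(Medu = 4, Fedu = 4, famsup = "yes")
low  <- predict_g3(Medu = 1, Fedu = 4, famsup = "yes")
print(high)        # 11.437
print(high - low)  # three levels of Medu: 3 * 0.8753 = 2.6259
```

The predicted gap between the highest and lowest of these two maternal education levels is about 2.6 points on the 20-point scale, which helps put the size of the significant coefficient in context alongside the low R-squared.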

Question 3: Summary

**Analysis of the Correlation Between Personal Habits and Academic Performance:**

  • Investigating the relationship between students’ study habits (studytime), internet access (internet), and their academic outcomes (G3)

Question 3: Summary Report


Call:
lm(formula = G3 ~ studytime + internet, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5956  -2.0841   0.4121   2.9159   9.0566 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.4396     0.7689  10.976   <2e-16 ***
studytime     0.5038     0.2737   1.841   0.0664 .  
internetyes   1.1407     0.6149   1.855   0.0643 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.551 on 392 degrees of freedom
Multiple R-squared:  0.01819,   Adjusted R-squared:  0.01318 
F-statistic: 3.631 on 2 and 392 DF,  p-value: 0.02739

Question 3: Chart

Question 3: Explanation

The analysis shows positive coefficients for both study time and internet access, but neither is statistically significant at the usual 0.05 level (the p-values are 0.0664 and 0.0643, just above the threshold). Taken together, however, the two predictors are jointly significant according to the overall F-test (p-value of 0.02739). The intercept is the model’s predicted grade when studytime is 0 and internet is “no”, and it differs significantly from zero; because studytime is coded from 1 upward in this dataset, it is an extrapolation rather than a description of a real student.

When we look at the model’s residuals, which are the differences between actual final grades and what the model predicts, we see a wide range from -11.60 to 9.06. This wide range suggests that there’s a lot of variation in grades that can’t be explained just by study time and internet access.

The multiple R-squared value, at 0.01819, tells us that about 1.82% of the variation in final grades can be attributed to this model. This percentage is quite small, indicating that there are other significant factors at play in determining final grades beyond just study time and internet access.
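
R-squared can be computed directly from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch that checks this definition against the value summary() reports; the helper name r_squared_by_hand and the six toy rows are invented for illustration and are not taken from the dataset.

```r
# Hypothetical helper: R-squared = 1 - SS_residual / SS_total.
r_squared_by_hand <- function(fit) {
  observed <- model.response(model.frame(fit))
  ss_res <- sum(residuals(fit)^2)                # variation left unexplained
  ss_tot <- sum((observed - mean(observed))^2)   # total variation in the outcome
  1 - ss_res / ss_tot
}

# Invented toy data, only to show the call.
toy <- data.frame(
  studytime = c(1, 2, 2, 3, 4, 1),
  internet  = c("no", "yes", "no", "yes", "yes", "yes"),
  G3        = c(8, 12, 9, 11, 14, 10)
)
fit <- lm(G3 ~ studytime + internet, data = toy)
by_hand  <- r_squared_by_hand(fit)
reported <- summary(fit)$r.squared
print(all.equal(by_hand, reported))
```

Seen this way, an R-squared of 0.01819 means the residual sum of squares is still about 98% of the total, which is why the residual range stays so wide.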