Final Presentation data 110

Impact of Personal Factors on Student Academic Performance

In this presentation, I analyze in depth how different personal factors influence students’ academic performance. These factors include parents’ cohabitation status, gender, personal habits, and alcohol consumption.

Key Points of Data Analysis

Some of the specific things I will focus on are:

Correlation between family background and academic success
Impact of personal habits, like alcohol consumption, on grades
Using R packages in RStudio for data visualization

Dataset Overview

The data I will be exploring is the “student-mat.csv” dataset, which contains a diverse range of variables describing students’ academic status as well as many personal factors. Using this data, we can check whether personal and family circumstances are correlated with school performance.

Dataset Highlights

Some important things to know about the dataset:

Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/student+performance)
Variables: 33 variables, including demographic, academic, and lifestyle factors
Observations: 395 students

Key Variables from the dataset

Academic Performance: G1, G2, G3 (grades over three periods)
Alcohol Consumption: Dalc (workday consumption), Walc (weekend consumption)
Family Background: Parental education (Medu, Fedu), family support (famsup)
Personal Habits: Study time, extracurricular activities, internet usage

Structure of Dataset

Rows: 395
Columns: 33
$ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
$ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
$ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
$ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
$ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
$ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
$ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
$ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
$ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
$ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
$ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
$ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
$ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
$ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
$ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
$ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
$ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
$ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
$ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
$ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
$ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
$ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
$ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
$ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
$ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
$ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
$ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
$ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
$ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
$ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
$ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
$ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…

Research Questions

The focus of this analysis revolves around understanding the intricate relationship between students’ lifestyle choices, family background, and their academic performance. Here are the key research questions:

Question 1. Influence of Alcohol Consumption: How does alcohol consumption (both on weekdays and weekends) impact students’ grades?
Question 2. Role of Family Background: What effect do parents’ education levels and family support have on students’ academic achievements?
Question 3. Personal Habits and Academic Performance: Is there a correlation between students’ study habits, internet usage, and their academic outcomes?

The Approach

To do this data exploration, I will use a variety of packages and functions.

R Packages and Functions

tidyverse: For data manipulation and visualization.
ggplot2: For creating comprehensive and good-looking visualizations.
cor() and lm() Functions: For conducting correlation and linear regression analyses.

Example R Code, Package installation, and Analysis

  mean_age  mean_G3
1  16.6962 10.41519

[1] -0.05466004

From this we have found that the average age of students is approximately 16.7 years, and the average final grade is around 10.4 out of a possible 20. The correlation between weekday alcohol consumption (Dalc) and final grades (G3) is very weakly negative (about -0.05), meaning that, in this simple analysis, higher alcohol consumption has almost no linear relationship with lower grades.
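The code chunk behind the output above is not shown, so the following is only a minimal base R sketch of how such a summary and correlation could be produced. The small data frame is invented for illustration and stands in for “student-mat.csv”; only the column names (age, Dalc, G3) come from the dataset.

```r
# Toy stand-in for student-mat.csv (values are invented for illustration).
student_data <- data.frame(
  age  = c(18, 17, 15, 15, 16, 16),
  Dalc = c(1, 1, 2, 1, 1, 3),
  G3   = c(6, 6, 10, 15, 10, 8)
)

# Average age and average final grade, returned as a one-row table.
averages <- data.frame(
  mean_age = mean(student_data$age),
  mean_G3  = mean(student_data$G3)
)
print(averages)

# Pearson correlation between workday alcohol use and final grade.
dalc_g3_cor <- cor(student_data$Dalc, student_data$G3)
print(dalc_g3_cor)
```

On the real data, replacing the toy data frame with the loaded file should give numbers in the same layout as the output shown above.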

Data Cleaning

First, let’s check for missing values and outliers.

  school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
1      0   0   0       0       0       0    0    0    0    0      0        0
  traveltime studytime failures schoolsup famsup paid activities nursery higher
1          0         0        0         0      0    0          0       0      0
  internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
1        0        0      0        0     0    0    0      0        0  0  0  0
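A per-column count of missing values like the one above can be sketched in base R as follows. The presentation’s own code is not shown (the output layout suggests a dplyr summary), so this is an equivalent approach rather than the original chunk, and the toy data frame deliberately contains invented gaps so that the counts are not all zero.

```r
# Toy data with two deliberately missing entries (invented for illustration).
student_data <- data.frame(
  sex      = c("F", "M", NA, "F"),
  absences = c(6, NA, 10, 2),
  G3       = c(6, 6, 10, 15)
)

# is.na() marks each missing cell; colSums() counts them per column.
missing_counts <- colSums(is.na(student_data))
print(missing_counts)
```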

Cleaning Outliers

In the first step of cleaning our data, we counted the missing values in every column; as the table above shows, there are none. This check is important for making sure the analysis is accurate, and dplyr made it straightforward. Then we looked for unusually high or low numbers, known as outliers, and found some in the ‘absences’ column. Outliers can distort results, so we set a limit on ‘absences’ based on the interquartile range (IQR), a standard way to handle extreme values. Limiting the values rather than dropping rows keeps all 395 students in the analysis while reducing the influence of a few extreme cases.
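As a rough sketch of an IQR-based limit, the example below caps values that fall outside the common 1.5 × IQR fences. The 1.5 multiplier and the choice to cap at both fences are assumptions, since the presentation does not state the exact rule, and the absence counts are invented.

```r
# Invented absence counts with one extreme value.
absences <- c(0, 2, 4, 4, 6, 6, 10, 16, 75)

# First and third quartiles and the interquartile range.
quartiles <- quantile(absences, probs = c(0.25, 0.75))
iqr_value <- quartiles[[2]] - quartiles[[1]]

# Conventional fences: 1.5 * IQR beyond each quartile (an assumed rule).
lower_limit <- quartiles[[1]] - 1.5 * iqr_value
upper_limit <- quartiles[[2]] + 1.5 * iqr_value

# Cap values at the fences instead of removing rows.
absences_capped <- pmin(pmax(absences, lower_limit), upper_limit)
print(absences_capped)
```

Capping keeps the number of observations unchanged, which matters when later models are fitted on the full set of students.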

Data Exploration

We begin with an initial exploration of how final grades are distributed across student groups, then go a little deeper.

School Type vs Final Grade

Here we observe the relationship between the school type and the students’ final grades. ‘GP’ stands for Gabriel Pereira and ‘MS’ for Mousinho da Silveira.

Gender vs Final Grade

This chart illustrates the average final grade comparison between female and male students.

Address Type vs Final Grade

We’re comparing the final grades of students from urban versus rural addresses.

Family Size vs Final Grade

This visualization compares the average final grades based on family size categories.

Parental Cohabitation Status vs Final Grade

Here we look at how the cohabitation status of parents, whether living together or apart, relates to students’ academic performance.
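Each of the comparisons above rests on the average final grade within each category. A minimal base R sketch of that calculation is shown below, using invented values; the original charts were presumably drawn with ggplot2, which is not reproduced here.

```r
# Toy data (invented values); column names follow the dataset.
student_data <- data.frame(
  school = c("GP", "GP", "GP", "MS", "MS"),
  sex    = c("F", "M", "F", "M", "F"),
  G3     = c(10, 14, 12, 8, 11)
)

# Mean final grade within each level of a grouping variable.
mean_by_school <- tapply(student_data$G3, student_data$school, mean)
mean_by_sex    <- tapply(student_data$G3, student_data$sex, mean)

print(mean_by_school)
print(mean_by_sex)
```

Passing either result to barplot() would draw a simple bar chart of the group averages, and the same pattern applies to address, famsize, and Pstatus.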

Question 1: Summary

**Analysis of Alcohol Consumption and Academic Performance:**

Investigating how different levels of alcohol consumption (Dalc and Walc) correlate with final grades (G3).

Question 1: Summary Report


Call:
lm(formula = G3 ~ Walc, data = student_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.735  -1.735   0.275   3.265   9.265 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.7351     0.3733  28.754   <2e-16 ***
Walc2        -0.6528     0.6221  -1.049    0.295    
Walc3        -0.0101     0.6344  -0.016    0.987    
Walc4        -1.0488     0.7430  -1.412    0.159    
Walc5        -0.5922     0.9439  -0.627    0.531    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.588 on 390 degrees of freedom
Multiple R-squared:  0.007463,  Adjusted R-squared:  -0.002716 
F-statistic: 0.7332 on 4 and 390 DF,  p-value: 0.5698

Question 1: Chart

Question 1: Explanation

The summary report above treats weekend alcohol consumption (Walc) as a categorical variable, so each coefficient compares one consumption level with the baseline of level 1, whose average final grade is the intercept (10.74). All four estimates are negative, but none is statistically significant (p-values range from 0.159 to 0.987), and the overall F-test (p-value of 0.5698) gives no evidence that final grades differ across Walc levels. In the version of the model with Walc entered as a single numeric score, the estimate is a decrease of 0.1848 in the final grade for each additional level of consumption, with a p-value of 0.303 and an R-squared of 0.002698. Either way, we can’t confidently say that this pattern isn’t due to random chance.

Looking at the residuals of the reported model, which range from about -10.74 to 9.27, we see there’s a lot of variation in grades that weekend alcohol consumption doesn’t explain. The R-squared value is also very low at 0.007463, so the model accounts for only about 0.75% of the variability in final grades, and the adjusted R-squared is slightly negative (-0.002716). Many other factors are likely influencing these grades.
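Whether Walc is stored as a number or as a factor changes what lm() estimates: one slope per level step, or one contrast per level against the baseline. The sketch below illustrates the difference on simulated data; the data-generating values are invented and are not estimates from the real dataset.

```r
set.seed(110)

# Simulated data: Walc levels 1 to 5 and a noisy final grade (invented).
student_data <- data.frame(Walc = rep(1:5, times = 8))
student_data$G3 <- 11 - 0.2 * student_data$Walc +
  rnorm(nrow(student_data), sd = 3)

# Numeric Walc: a single slope, the change in G3 per additional level.
fit_numeric <- lm(G3 ~ Walc, data = student_data)

# Factor Walc: one coefficient per level, each compared with level 1.
student_factor <- transform(student_data, Walc = factor(Walc))
fit_factor <- lm(G3 ~ Walc, data = student_factor)

print(names(coef(fit_numeric)))
print(names(coef(fit_factor)))

# The factor model can never have a lower R-squared than the numeric
# one, because a straight line is a special case of five free level means.
print(summary(fit_numeric)$r.squared)
print(summary(fit_factor)$r.squared)
```

This is why a summary with Walc2 to Walc5 rows and a summary with a single Walc row can report different p-values and R-squared values for the same data.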

Question 2: Summary

**Analysis of the Role of Family Background in Academic Performance:**

Exploring the impact of parents’ education levels (Medu, Fedu) and family support (famsup) on final grades (G3).

Question 2: Summary Report


Call:
lm(formula = G3 ~ Medu + Fedu + famsup, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1729  -2.0032   0.6767   2.9712   9.4594 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.0115     4.1108   3.895 0.000116 ***
Medu1        -4.1091     2.6595  -1.545 0.123157    
Medu2        -3.1078     2.6251  -1.184 0.237197    
Medu3        -2.5182     2.6431  -0.953 0.341317    
Medu4        -1.1312     2.6614  -0.425 0.671046    
Fedu1        -3.0909     3.2155  -0.961 0.337024    
Fedu2        -2.5804     3.2103  -0.804 0.422014    
Fedu3        -2.7074     3.2136  -0.842 0.400047    
Fedu4        -2.5016     3.2181  -0.777 0.437417    
famsupyes    -0.7827     0.4724  -1.657 0.098353 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.473 on 385 degrees of freedom
Multiple R-squared:  0.06847,   Adjusted R-squared:  0.0467 
F-statistic: 3.144 on 9 and 385 DF,  p-value: 0.001132

Question 2: Chart

Question 2: Explanation

The summary report above treats each parent’s education level as a categorical variable, with level 0 as the baseline. The model as a whole is statistically significant (F-test p-value of 0.001132), but no individual coefficient reaches the 0.05 level. The Medu estimates rise steadily from -4.11 at level 1 to -1.13 at level 4, a pattern consistent with higher maternal education being associated with better grades; in the version of the model with Medu entered as a numeric score, that positive association is statistically significant (with a p-value of 0.00101). The Fedu estimates show no comparable trend, and family educational support (famsupyes) has a small negative estimate (-0.78) that is only marginally significant (p-value of 0.098).

Despite these findings, the residuals indicate a substantial amount of variation in grades that these factors alone do not explain. With a multiple R-squared value of 0.06847, the reported model accounts for about 6.85% of the variability in final grades (the numeric-score version gives 0.05448, or about 5.45%). This suggests that while the mother’s education, father’s education, and family support are factors, there are other significant influences affecting student grades.
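When a predictor such as Medu enters the model as a factor, its individual level coefficients can each look non-significant even if the variable matters as a whole. One common way to test the variable as a whole is to compare nested models with anova(). The sketch below does this on simulated data; it is an illustration under invented values, not part of the original analysis.

```r
set.seed(110)
n <- 60

# Simulated, balanced data (invented); levels mirror the dataset's coding.
student_data <- data.frame(
  Medu   = factor(rep(0:4, each = 12)),
  Fedu   = factor(rep(0:4, times = 12)),
  famsup = rep(c("no", "yes"), length.out = n)
)
student_data$G3 <- 10 +
  0.5 * as.integer(as.character(student_data$Medu)) +
  rnorm(n, sd = 3)

# Full model versus the same model without Medu.
fit_full    <- lm(G3 ~ Medu + Fedu + famsup, data = student_data)
fit_reduced <- lm(G3 ~ Fedu + famsup, data = student_data)

# F-test for whether the four Medu coefficients jointly improve the fit.
medu_test <- anova(fit_reduced, fit_full)
print(medu_test)
```

The second row of the table reports the joint test; its degrees of freedom equal the number of Medu coefficients dropped.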

Question 3: Summary

**Analysis of the Correlation Between Personal Habits and Academic Performance:**

Investigating the relationship between students’ study habits (studytime), internet usage (internet), and their academic outcomes (G3).

Question 3: Summary Report


Call:
lm(formula = G3 ~ studytime + internet, data = student_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.5956  -2.0841   0.4121   2.9159   9.0566 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.4396     0.7689  10.976   <2e-16 ***
studytime     0.5038     0.2737   1.841   0.0664 .  
internetyes   1.1407     0.6149   1.855   0.0643 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.551 on 392 degrees of freedom
Multiple R-squared:  0.01819,   Adjusted R-squared:  0.01318 
F-statistic: 3.631 on 2 and 392 DF,  p-value: 0.02739

Question 3: Chart

Question 3: Explanation

The analysis shows a positive link between both study time and internet access and students’ final grades, but neither coefficient is statistically significant at the usual 0.05 level (the p-values, 0.0664 and 0.0643, are just above it), even though the model as a whole is (F-test p-value of 0.02739). The intercept (8.44) is the predicted grade for a student with a study time of 0 and no internet access. Because studytime is coded from 1 to 4, this is an extrapolation rather than a realistic student, so the fact that it differs significantly from zero carries little practical meaning.

When we look at the model’s residuals, which are the differences between actual final grades and what the model predicts, we see a wide range from -11.60 to 9.06. This wide range suggests that there’s a lot of variation in grades that can’t be explained just by study time and internet access.

The multiple R-squared value, at 0.01819, tells us that about 1.82% of the variation in final grades can be attributed to this model. This percentage is quite small, indicating that there are other significant factors at play in determining final grades beyond just study time and internet access.
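The multiple R-squared reported by summary() is one minus the ratio of the residual sum of squares to the total sum of squares. The short sketch below recomputes it by hand on simulated data (invented values) to show where the “percentage of variation explained” comes from.

```r
set.seed(110)
n <- 40

# Simulated data (invented); column names follow the dataset.
student_data <- data.frame(
  studytime = rep(1:4, times = 10),
  internet  = rep(c("no", "yes"), each = 20)
)
student_data$G3 <- 8 + 0.5 * student_data$studytime + rnorm(n, sd = 3)

fit <- lm(G3 ~ studytime + internet, data = student_data)

# Residual sum of squares: variation the model leaves unexplained.
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation of G3 around its mean.
tss <- sum((student_data$G3 - mean(student_data$G3))^2)

r_squared_manual <- 1 - rss / tss
print(r_squared_manual)
print(summary(fit)$r.squared)
```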

Essay Part a

For this analysis, I explored the “student-mat.csv” dataset from the UCI Machine Learning Repository. This dataset includes 33 variables for 395 students, covering their background, academic performance, and lifestyle. It records grades over three periods (G1, G2, G3) and personal habits, such as drinking alcohol on workdays (Dalc) and weekends (Walc), and how much students study. First, I checked every column for missing information and found no missing values. Then, I focused on finding any unusual data, especially in the ‘absences’ column. To handle these outliers and prevent them from affecting the results too much, I used the Interquartile Range (IQR) method to set reasonable limits. This careful preparation of the data was a key step to make sure the analysis that followed was accurate and reliable. I chose to work with this dataset because it reflects my own interest in the various factors that can affect a student’s academic performance, a topic that’s closely related to my own academic path and intellectual curiosity.

Essay Part b

There’s been a lot of research into how students’ personal habits affect their school performance. These studies have found that the way students live, including how much alcohol they drink, has a big impact on their academic achievements. For example, the journal Addiction reported that drinking in teenagers could lead to worse educational results later on (Latvala et al., 2014). In the same vein, a study in BMC Public Health showed that lifestyle habits are closely connected to how well students do in school, suggesting that what students do outside of school is important for their academic outcomes (Stea & Torstveit, 2014). Together, these studies highlight how personal choices, like drinking alcohol, can influence academic success. This sets the stage for the analysis we’re doing.

Essay Part c

The graphics we made from this dataset told a story about student life, starting from simple data and leading to some useful findings. The linear regression models and bar charts showed how different factors relate to students’ grades, mostly in subtle ways: alcohol consumption showed almost no relationship with grades, while the mother’s education level showed the clearest association, pointing to the importance of family background. Even so, each model explained only a small share of the variation in final grades. Each chart and model revealed part of the complex relationship between personal choices and the situations students are born into. We couldn’t analyze everything we wanted to because of some limits in the data, and I had to leave out some areas I wanted to dig deeper into. But in the end, these visualizations gave us a revealing snapshot of what affects students’ academic lives, pointing to the many different things that might guide the educational paths of young students.