Final Project

Author

Ben Lopez

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(gapminder)
library(dplyr)
data <- read.csv("StudentPerformanceFactors.csv")

as_tibble(data)

# A tibble: 6,607 × 20
   Hours_Studied Attendance Parental_Involvement Access_to_Resources
           <int>      <int> <chr>                <chr>              
 1            23         84 Low                  High               
 2            19         64 Low                  Medium             
 3            24         98 Medium               Medium             
 4            29         89 Low                  Medium             
 5            19         92 Medium               Medium             
 6            19         88 Medium               Medium             
 7            29         84 Medium               Low                
 8            25         78 Low                  High               
 9            17         94 Medium               High               
10            23         98 Medium               Medium             
# ℹ 6,597 more rows
# ℹ 16 more variables: Extracurricular_Activities <chr>, Sleep_Hours <int>,
#   Previous_Scores <int>, Motivation_Level <chr>, Internet_Access <chr>,
#   Tutoring_Sessions <int>, Family_Income <chr>, Teacher_Quality <chr>,
#   School_Type <chr>, Peer_Influence <chr>, Physical_Activity <int>,
#   Learning_Disabilities <chr>, Parental_Education_Level <chr>,
#   Distance_from_Home <chr>, Gender <chr>, Exam_Score <int>

This data set contains many useful variables. We will use every variable except Exam_Score as a predictor (x) and Exam_Score as the response, then use graphs and tables to see which quantitative and categorical variables have the strongest relationship with exam scores.

As a future teacher, I find this data set interesting and useful. It can show me how to best support students in the classroom, set them up for success on their exams, and better prepare them for their future.

### Graphing Quantitative Variables versus Exam Scores

df_quant <- data %>%
  select(where(is.numeric)) %>%
  pivot_longer(
    cols = -Exam_Score,
    names_to = "variable",
    values_to = "value"
  )

ggplot(data = df_quant, aes(x = value, y = Exam_Score)) +
  geom_point() +
  stat_summary(
    fun = mean,
    geom = "point",
    color = "red",
    size = 2
  ) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(x = "Predictor", y = "Exam Score") +
  theme_minimal()

There is a lot going on in these graphs, and it is hard to pinpoint where some of the means fall. To make this easier to read, we will also display the highest mean and median exam scores for each value of the quantitative variables.

### Displaying tables of mean and median Exam Scores by quantitative variable

qmean <- df_quant %>%
  group_by(variable, value)%>%
  summarize(mean_exam_score = mean(Exam_Score, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by variable and value.
ℹ Output is grouped by variable.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(variable, value))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

qmean %>%
  arrange(desc(mean_exam_score))

# A tibble: 156 × 3
# Groups:   variable [6]
   variable          value mean_exam_score
   <chr>             <int>           <dbl>
 1 Hours_Studied        43            78  
 2 Hours_Studied        39            74.7
 3 Hours_Studied        37            73.3
 4 Hours_Studied        38            72.7
 5 Hours_Studied        35            71.8
 6 Tutoring_Sessions     6            71.7
 7 Hours_Studied        36            71.2
 8 Hours_Studied         1            71  
 9 Hours_Studied        44            71  
10 Hours_Studied        32            70.9
# ℹ 146 more rows

qmedian <- df_quant %>%
  group_by(variable, value)%>%
  summarize(median_exam_score = median(Exam_Score, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by variable and value.
ℹ Output is grouped by variable.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(variable, value))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

qmedian %>%
  arrange(desc(median_exam_score))

# A tibble: 156 × 3
# Groups:   variable [6]
   variable          value median_exam_score
   <chr>             <int>             <dbl>
 1 Hours_Studied        43              78  
 2 Hours_Studied        39              75  
 3 Hours_Studied        37              74  
 4 Hours_Studied        38              73  
 5 Tutoring_Sessions     6              72.5
 6 Hours_Studied        35              72  
 7 Attendance           99              71  
 8 Attendance          100              71  
 9 Hours_Studied        31              71  
10 Hours_Studied        32              71  
# ℹ 146 more rows

Hours studied, tutoring sessions attended, and attendance seem to have the highest mean and median exam scores. Toward the bottom of the tables, when hours studied and attendance are at their lowest values, this is reflected in the students' exam scores as well.
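Some of the rows in these tables may be based on only a handful of students, which makes a single mean or median less reliable. A minimal sketch of how a group size column could be added next to the mean is shown here. The `toy` data frame and the names `toy_summary` and `n` are assumptions made up for illustration; they are not values from the CSV.

```r
library(tidyverse)

# Toy data for illustration only; these values are made up, not from the CSV.
toy <- tibble(
  Hours_Studied = c(10, 10, 10, 20, 20, 43),
  Attendance    = c(60, 60, 80, 80, 95, 95),
  Exam_Score    = c(60, 62, 64, 68, 70, 78)
)

toy_summary <- toy %>%
  pivot_longer(
    cols = -Exam_Score,
    names_to = "variable",
    values_to = "value"
  ) %>%
  group_by(variable, value) %>%
  summarize(
    n = n(),                                        # students in this group
    mean_exam_score = mean(Exam_Score, na.rm = TRUE),
    .groups = "drop"                                # return an ungrouped table
  ) %>%
  arrange(desc(mean_exam_score))

toy_summary
```

In this toy example the top row comes from a single student, which is the kind of case the `n` column is meant to flag.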

### Categorical Variables versus our response, Exam Scores

df_cat <- data %>%
  pivot_longer(
    cols = where(~ !is.numeric(.)),   
    names_to = "variable",
    values_to = "category"
  )

ggplot(data = df_cat, aes(x = category, y = Exam_Score)) +
  geom_boxplot(width =.5) +
  stat_summary(
    fun = mean,
    geom = "point",
    color = "red",
    size = 2
  ) +
  facet_wrap(~ variable, scales = "free_x") +
  labs(x = "Predictor", y = "Exam Score") +
  theme_minimal()

### Displaying tables of mean and median Exam Scores by categorical variable

cmean <- df_cat %>%
  group_by(variable, category)%>%
  summarize(mean_exam_score = mean(Exam_Score, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by variable and category.
ℹ Output is grouped by variable.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(variable, category))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

cmean %>%
  arrange(desc(mean_exam_score))

# A tibble: 37 × 3
# Groups:   variable [13]
   variable                   category     mean_exam_score
   <chr>                      <chr>                  <dbl>
 1 Parental_Involvement       High                    68.1
 2 Access_to_Resources        High                    68.1
 3 Parental_Education_Level   Postgraduate            68.0
 4 Family_Income              High                    67.8
 5 Motivation_Level           High                    67.7
 6 Teacher_Quality            High                    67.7
 7 Peer_Influence             Positive                67.6
 8 Distance_from_Home         Near                    67.5
 9 Extracurricular_Activities Yes                     67.4
10 Learning_Disabilities      No                      67.3
# ℹ 27 more rows

cmedian <- df_cat %>%
  group_by(variable, category)%>%
  summarize(median_exam_score = median(Exam_Score, na.rm = TRUE))

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by variable and category.
ℹ Output is grouped by variable.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(variable, category))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

cmedian%>%
  arrange(desc(median_exam_score))

# A tibble: 37 × 3
# Groups:   variable [13]
   variable                   category     median_exam_score
   <chr>                      <chr>                    <dbl>
 1 Access_to_Resources        High                        68
 2 Family_Income              High                        68
 3 Parental_Education_Level   Postgraduate                68
 4 Parental_Involvement       High                        68
 5 Teacher_Quality            High                        68
 6 Access_to_Resources        Medium                      67
 7 Distance_from_Home         Moderate                    67
 8 Distance_from_Home         Near                        67
 9 Extracurricular_Activities No                          67
10 Extracurricular_Activities Yes                         67
# ℹ 27 more rows

When Access to Resources and Family Income are high, students' exam scores have a higher mean and median.
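Since the top category means are all within a point or so of each other, it may help to look at the gap between the best and worst category within each variable. The sketch below shows one way this could be done. It uses a made-up `toy` data frame, and `toy_spread`, `lowest_mean`, `highest_mean`, and `gap` are hypothetical names chosen for this example.

```r
library(tidyverse)

# Toy data for illustration only; these values are made up, not from the CSV.
toy <- tibble(
  Family_Income   = c("Low", "Low", "Medium", "High", "High", "Medium"),
  Internet_Access = c("Yes", "No", "Yes", "Yes", "No", "No"),
  Exam_Score      = c(60, 62, 66, 70, 72, 66)
)

toy_spread <- toy %>%
  pivot_longer(
    cols = -Exam_Score,
    names_to = "variable",
    values_to = "category"
  ) %>%
  # Mean exam score for every category of every variable
  group_by(variable, category) %>%
  summarize(mean_exam_score = mean(Exam_Score, na.rm = TRUE), .groups = "drop") %>%
  # Gap between the highest and lowest category mean within each variable
  group_by(variable) %>%
  summarize(
    lowest_mean  = min(mean_exam_score),
    highest_mean = max(mean_exam_score),
    gap          = highest_mean - lowest_mean,
    .groups = "drop"
  ) %>%
  arrange(desc(gap))

toy_spread
```

A larger `gap` suggests that the categories of that variable separate exam scores more, although it does not account for how many students are in each category.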

Now let's compare hours studied to exam score across the three levels of Access to Resources (High, Medium, Low) and Parental Involvement (High, Medium, Low).

ggplot(
  data = data,
  aes(
    x = Hours_Studied,
    y = Exam_Score,
    color = Access_to_Resources
  )
) +
  geom_point() +
  facet_wrap(vars(Parental_Involvement))

There does seem to be a positive relationship between hours studied and exam score at every level of Parental Involvement and Access to Resources.
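A visual impression like this can be checked with a number for each panel. As a rough sketch, the correlation between hours studied and exam score could be computed separately for each level of Parental Involvement. The `toy` data frame below is made up for illustration, and `toy_cor` and `r` are hypothetical names.

```r
library(tidyverse)

# Toy data for illustration only; these values are made up, not from the CSV.
toy <- tibble(
  Parental_Involvement = rep(c("Low", "High"), each = 4),
  Hours_Studied        = c(5, 10, 15, 20, 5, 10, 15, 20),
  Exam_Score           = c(58, 61, 65, 66, 62, 66, 69, 73)
)

# Pearson correlation between hours studied and exam score within each panel
toy_cor <- toy %>%
  group_by(Parental_Involvement) %>%
  summarize(r = cor(Hours_Studied, Exam_Score), .groups = "drop")

toy_cor
```

Values of `r` close to 1 in every group would support the positive relationship seen in the plot.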

Now we will see whether the same holds for Attendance, Family Income, and Parental Education Level. This plot has an extra panel because Parental Education Level has some missing values.

ggplot(
  data = data,
  aes(
    x = Attendance,
    y = Exam_Score,
    color = Family_Income
  )
) +
  geom_point() +
  facet_wrap(vars(Parental_Education_Level))

In these graphs, exam scores go up as Attendance goes up in every panel, but there does not seem to be much of a difference between Parental Education Levels. Similarly, for Family Income the points are all intermixed, with no clear top group when it comes to exam scores.
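The extra panel for missing Parental Education Level values could be handled explicitly before plotting. The sketch below assumes the missing entries were read in as empty strings, which is how `read.csv()` treats blank character fields by default; this has not been verified against the CSV. The `toy` data frame is made up, and `toy_clean` and `toy_complete` are hypothetical names.

```r
library(tidyverse)

# Toy data for illustration only; these values are made up, not from the CSV.
toy <- tibble(
  Parental_Education_Level = c("College", "", "High School", "Postgraduate", ""),
  Exam_Score               = c(68, 65, 66, 70, 64)
)

# Assumption: missing values appear as "" rather than NA
toy_clean <- toy %>%
  mutate(Parental_Education_Level = na_if(Parental_Education_Level, ""))

# How many rows are missing a Parental Education Level?
count(toy_clean, Parental_Education_Level)

# Keep only rows with a known level, e.g. before faceting a plot
toy_complete <- toy_clean %>%
  drop_na(Parental_Education_Level)
```

Whether to drop these rows or keep them as their own group is a judgment call; dropping them simply removes the unlabeled panel.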

Based on these two most recent graphs, it appears that our quantitative variables (Attendance and Hours Studied) have a stronger relationship with exam scores than the categorical variables. This matches what we saw when we pulled the means and medians of both the quantitative and categorical variables.
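One way to put a number on "stronger correlation" for the quantitative variables is to compute each predictor's correlation with Exam_Score and sort by its size. This is only a sketch on a made-up `toy` data frame; `toy_cors` and `r` are hypothetical names, and Pearson correlation only applies to the numeric predictors, not the categorical ones.

```r
library(tidyverse)

# Toy data for illustration only; these values are made up, not from the CSV.
toy <- tibble(
  Hours_Studied = c(10, 15, 20, 25, 30, 35),
  Sleep_Hours   = c(7, 6, 8, 7, 6, 8),
  Exam_Score    = c(60, 63, 66, 68, 71, 75)
)

toy_cors <- toy %>%
  select(where(is.numeric)) %>%
  pivot_longer(
    cols = -Exam_Score,
    names_to = "variable",
    values_to = "value"
  ) %>%
  # One Pearson correlation with Exam_Score per predictor
  group_by(variable) %>%
  summarize(r = cor(value, Exam_Score), .groups = "drop") %>%
  # Strongest relationships (positive or negative) first
  arrange(desc(abs(r)))

toy_cors
```

Applied to the real data, the same steps would give one correlation per quantitative variable, which could back up the ranking suggested by the graphs and tables.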