Read in the dataset

student <- read.csv("StudentPerformanceFactors.csv")

Graph 1: Base R Graph

Introduction:

I want to investigate how Hours Studied and Motivation Level together affect Exam Score. In particular, I am interested in whether the effect of study time on exam performance differs by the student’s motivation level. I chose a scatterplot because I have three variables, two continuous and one discrete, and a scatterplot lets me visualize the relationship among them. The main graph plots the two continuous variables, Hours Studied and Exam Score, and color-codes the points by Motivation Level. This graph will help me identify any trends or patterns among the variables.

Graph:

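# convert Motivation Level to a factor, with levels in the same order as
# the color vector below (as.factor() would sort them alphabetically)
student$Motivation_Level <- factor(student$Motivation_Level,
                                   levels = c("Low", "Medium", "High"))

# set colors for the graph
colors <- c("Low" = "red", "Medium" = "darkgoldenrod1", "High" = "darkgreen")

# graph: look colors up by level name, since indexing by a factor
# would use its integer codes instead of its labels
plot(student$Hours_Studied, student$Exam_Score,
     col = colors[as.character(student$Motivation_Level)],
     xlab = "Hours Studied",
     ylab = "Exam Score",
     main = "Exam Score vs. Hours Studied",
     pch = 19)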

# add a legend
legend("topright", legend = names(colors),
       col = colors, pch = 19, title = "Motivation Level")

# add major and minor tick marks
axis(1, at = seq(0, 45, by = 10)) # major x-ticks
axis(1, at = seq(0, 45, by = 5), tck = -0.02, labels = FALSE) # minor x-ticks
axis(2, at = seq(60, 100, by = 10)) # major y-ticks
axis(2, at = seq(60, 100, by = 5), tck = -0.02, labels = FALSE) # minor y-ticks
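One detail worth checking in a plot like this is how a named color vector is indexed. R indexes by a factor's integer codes, not its printed labels, so the order of the factor levels matters. The following is a minimal sketch with made-up values (the `motivation` vector is hypothetical, not taken from the dataset):

```r
# Hypothetical toy values, not taken from the dataset
colors <- c("Low" = "red", "Medium" = "darkgoldenrod1", "High" = "darkgreen")
motivation <- as.factor(c("Low", "High", "Medium"))

# as.factor() sorts levels alphabetically: "High" "Low" "Medium"
print(levels(motivation))

# Indexing by the factor uses its integer codes (Low = 2, High = 1, Medium = 3)
by_code <- unname(colors[motivation])

# Indexing by the character labels looks the colors up by name
by_name <- unname(colors[as.character(motivation)])

print(by_code)
print(by_name)
```

If the two results differ, the points and the legend would disagree, so either matching the level order to the color vector or indexing by name keeps them consistent.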

Discussion:

The scatterplot shows a positive relationship between Hours Studied and Exam Score: in general, students who study more tend to score higher. Motivation Level, however, appears to have little impact on this relationship. The colors, which represent the different motivation levels, are scattered throughout the graph with no distinct clusters or patterns. This suggests that motivation does not noticeably change the effect of study hours on exam performance, and that it may not have a strong direct effect on the final exam score either. Overall, the graph indicates that study hours and exam scores are related, while motivation level has little visible impact on that relationship.
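A visual impression like this could be checked numerically by fitting a linear model with an interaction between study hours and motivation level. The sketch below uses simulated stand-in data (the `toy` data frame and its values are assumptions made for illustration; only the column names mirror the dataset), so it shows the approach rather than any result for the real data:

```r
# Simulated stand-in data; column names mirror the dataset, values are made up
set.seed(1)
n <- 300
toy <- data.frame(
  Hours_Studied = runif(n, 1, 44),
  Motivation_Level = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                            levels = c("Low", "Medium", "High"))
)
toy$Exam_Score <- 60 + 0.3 * toy$Hours_Studied + rnorm(n, sd = 3)

# The interaction terms estimate how much the Medium and High slopes
# differ from the Low slope (the reference level)
fit <- lm(Exam_Score ~ Hours_Studied * Motivation_Level, data = toy)
print(round(coef(fit), 3))
```

Interaction coefficients close to zero would agree with the reading that the effect of study time does not change much across motivation levels.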

Graph 2: ggplot Graph

Introduction:

In this graph, I am investigating how Gender and Internet Access relate to Exam Score. My goal is to see whether having access to the internet affects student performance, and whether this differs between male and female students. Since internet access is often considered an important resource, it is useful to see whether it has an impact on exam performance. A boxplot is useful for comparing the distributions of exam scores across categories. I place Internet Access on the x-axis and use Gender as the fill color, which lets us see how internet access relates to exam performance for both males and females. This can help show whether access provides an advantage across genders or benefits one group more than the other.

Graph:

library(ggplot2)

ggplot(student, aes(x = Internet_Access, y = Exam_Score, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Exam Score by Internet Access",
    x = "Internet Access",
    y = "Exam Score",
    fill = "Gender")

Discussion:

The boxplot shows a slight difference in Exam Scores between students with and without Internet Access. The main part of each distribution (the interquartile range) is similar, showing that most students perform similarly regardless of their access to the internet. The slight difference comes from the median score, which is slightly higher for students with internet access. The biggest difference is in the number of high outliers: students without access have only 6 exam scores in the 75-100 range, while students with access have over 30. This suggests that having access could support high scores for some students. Overall, while internet access does not drastically change exam scores for most students, it might provide a small advantage for those aiming to score above 75.
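Raw counts of high scores depend on how many students are in each group, so it can help to compare the share of high scores alongside the count. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) to show one way to compute both with `tapply()`:

```r
# Hypothetical toy data, not taken from the dataset
toy <- data.frame(
  Internet_Access = c("No", "No", "No", "No",
                      "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
  Exam_Score = c(62, 66, 70, 78, 61, 64, 67, 68, 70, 72, 80, 90)
)

# Median per group (the line drawn inside each box)
medians <- tapply(toy$Exam_Score, toy$Internet_Access, median)

# Count and share of scores at or above 75 in each group
high_count <- tapply(toy$Exam_Score >= 75, toy$Internet_Access, sum)
high_share <- tapply(toy$Exam_Score >= 75, toy$Internet_Access, mean)

print(medians)
print(high_count)
print(high_share)
```

In this toy example the larger group has twice as many high scores but the same share of them, which is why the share is the fairer comparison when group sizes differ.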

Graph 3: Plotly Graph

Introduction:

In this graph, I am investigating how access to resources might affect student performance across different family incomes. My goal is to see whether students from families with different income levels experience different exam scores depending on their access to educational resources. Access to resources is normally an important factor in student performance because it allows students to learn in more ways, and this access could vary with a student’s family income, since some resources may be more easily accessible with a higher income.

A boxplot is useful here because it gives a clear visualization of exam scores across each category, including the interquartile range, mean, and outliers. The family income variable is used as the animation frame, which makes it easier to see how the distributions change as the animation moves through the income levels. This should reveal any notable differences in exam performance across these groups.
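To make the pieces of a boxplot concrete, the sketch below computes them for a small made-up vector (`scores` is hypothetical) using base R's `boxplot.stats()`. Plotting libraries may use slightly different quartile rules, so treat this as an illustration of the idea rather than the exact values plotly would draw:

```r
# Hypothetical scores, not taken from the dataset
scores <- c(60, 62, 64, 66, 68, 70, 72, 95)

bs <- boxplot.stats(scores)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# out: points more than 1.5 box lengths beyond the hinges
print(bs$out)

# The mean is not part of a standard boxplot, so it is computed separately
print(mean(scores))
```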

Graph:

library(plotly)

# order the access levels from Low to High
student$Access_to_Resources <- factor(student$Access_to_Resources, levels = c("Low", "Medium", "High"))

# make the graph: one box per access level, one animation frame per income level
plot_ly(student,
        x = ~Access_to_Resources,
        y = ~Exam_Score,
        color = ~Access_to_Resources,
        type = "box",
        frame = ~Family_Income,
        boxmean = TRUE) %>%
  animation_opts(
    frame = 3000, # slow down the slider speed
    transition = 500 # smooth transition
  ) %>%
  # change the title for the slider
  animation_slider(
    currentvalue = list(
      prefix = "Family Income: "
    )
  ) %>%
  layout(
    title = "Exam Scores by Access to Resources",
    xaxis = list(title = "Access to Resources"),
    yaxis = list(title = "Exam Score")
    )

Discussion:

The boxplot gives a clear view of how exam scores are distributed across the two categorical variables (Family Income and Access to Resources). The family income slider lets us examine how exam scores change across access levels while holding income fixed. From the animation, we can see that students with high access to resources tend to have slightly higher exam scores at every income level. The size of the difference stays about the same in each frame, at roughly two points between low and high access.

Another insight is that the low family income frame has the widest boxplot range (excluding outliers) for all three access levels when compared with the other two frames. This tells us that students with a low family income have a much wider range of likely scores, while the other income levels have more concentrated results.

One interesting trend is in the outliers. The high outliers (>75) are spread fairly evenly across both access to resources and family income. So while the main trend differs slightly across these two variables, getting a high outlier score does not appear to depend on either one. The only exception is the medium family income frame, where high access to resources has somewhat more high outliers than the other groups. Overall, exam scores increase slightly with higher income and higher access to resources, but the difference is not large (only a point or two).
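Differences of a point or two are hard to read off an animation, so a small summary table per income and access group can back up the visual reading. The sketch below uses a made-up data frame (`toy` and its values are hypothetical; only the column names mirror the dataset) and base R's `aggregate()`:

```r
# Hypothetical toy data, not taken from the dataset
toy <- data.frame(
  Family_Income = rep(c("Low", "High"), each = 6),
  Access_to_Resources = rep(rep(c("Low", "High"), each = 3), times = 2),
  Exam_Score = c(64, 66, 68, 66, 68, 70, 65, 67, 69, 67, 69, 71)
)

# Median exam score for every income/access combination
median_tbl <- aggregate(Exam_Score ~ Family_Income + Access_to_Resources,
                        data = toy, FUN = median)

# Interquartile range for the same groups, to compare the spread
iqr_tbl <- aggregate(Exam_Score ~ Family_Income + Access_to_Resources,
                     data = toy, FUN = IQR)

print(median_tbl)
print(iqr_tbl)
```

The same two calls on the real data frame would give the medians and spreads that the boxes display, one row per frame and access level.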

Graph 4: Own Pick

Introduction:

My goal in this graph is to investigate the relationship between student attendance and exam scores, while also looking at the amount of sleep students get per night. The goal is to see whether a student’s sleep patterns influence their attendance and in turn their academic scores. Since both attendance and exam scores are continuous variables, a scatterplot is useful to visualize their correlation. The sleep hours variable is discrete, so it is perfect to use for color-coding to spot any trends. The color will allow us to easily see if sleep hours is having an impact.

Graph:

library(dplyr)

# create bins for sleep hours (labels match the cutoffs used in each condition)
student <- student %>%
  mutate(Sleep_Bin = case_when(
    Sleep_Hours <= 5 ~ "<=5 Hours",
    Sleep_Hours >= 6 & Sleep_Hours <= 7 ~ "6-7 Hours",
    Sleep_Hours >= 8 ~ ">=8 Hours"
  ))

# reorder sleep bins
student$Sleep_Bin <- factor(student$Sleep_Bin, levels = c("<=5 Hours", "6-7 Hours", ">=8 Hours"))

# scatter plot
plot_ly(
  student,
  x = ~jitter(Attendance, amount = 1),
  y = ~Exam_Score,
  color = ~Sleep_Bin,
  colors = c("indianred", "orange3", "seagreen3"),
  type = "scatter",
  mode = "markers",
  marker = list(opacity = 0.5),
  hoverinfo = "text",
  text = ~paste(
    "Attendance (%): ", Attendance,
    "<br>Exam Score: ", Exam_Score,
    "<br>Sleep Hours: ", Sleep_Hours
  )
) %>%
  layout(
    title = "Exam Score vs. Attendance by Sleep Hours",
    xaxis = list(title = "Attendance (%)"),
    yaxis = list(title = "Exam Score")
  )

Discussion:

The first thing I had to do was create bins for the sleep variable. The original variable had too many distinct values, and with that many colors on the graph, clarity was lost. The three bins are 5 hours or fewer, 6-7 hours, and 8 hours or more of sleep, which makes the colors much easier to read. The first thing that stands out is that the 6-7 hours group is the most prevalent in this graph.

The other change was to add jitter and lower the opacity. There is a highly concentrated group of scores around 60-75, which caused overlapping, and these two changes let more points be seen. With them, we can see that in the 60-75 range no single group dominates; instead, all three groups are spread across the entire range both horizontally and vertically. The same random distribution appears in the high outliers. This tells us that sleep hours have little to no impact on the relationship between attendance and exam scores.

Focusing on just attendance and exam scores does reveal a slight trend. Within the main 60-75 range, a positive linear relationship is visible: students who attend class more often are likely to achieve a higher exam score than students who attend less often. This relationship is not present in the high outliers.
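The binning and the "no clear difference between groups" reading can also be sketched numerically. The example below uses simulated stand-in data (`toy` and its values are assumptions; only the column names mirror the dataset), bins sleep hours with base R's `cut()` as an alternative to `case_when()`, and computes the attendance and exam score correlation within each bin:

```r
# Simulated stand-in data; column names mirror the dataset, values are made up
set.seed(42)
n <- 200
toy <- data.frame(
  Sleep_Hours = sample(4:10, n, replace = TRUE),
  Attendance = runif(n, 60, 100)
)
toy$Exam_Score <- 50 + 0.2 * toy$Attendance + rnorm(n, sd = 2)

# cut() uses right-closed intervals here: (-Inf, 5], (5, 7], (7, Inf)
toy$Sleep_Bin <- cut(toy$Sleep_Hours,
                     breaks = c(-Inf, 5, 7, Inf),
                     labels = c("5 or fewer", "6-7", "8 or more"))

# Correlation between attendance and exam score within each sleep bin
cor_by_bin <- sapply(split(toy, toy$Sleep_Bin),
                     function(d) cor(d$Attendance, d$Exam_Score))
print(round(cor_by_bin, 2))
```

Similar correlations across the three bins would be consistent with sleep hours having little effect on the attendance and exam score relationship.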

STAT 5110 - Final Project

Samantha Thomas
