DATA 605 - Discussion 11

Nick Oliver

Discussion 11

I am currently volunteer teaching Python to high school students, and the tool we are using makes the usage data available as a CSV. I wanted to use a linear regression to see whether the amount of time students spend in the tool correlates with their quiz scores.

Loading Data

df <- read.csv('https://raw.githubusercontent.com/nolivercuny/DATA605/main/week%2011/data/student_data.csv')

Cleanup & Aggregation

Filter out instructors, sum each student's minutes in the tool, and take the mean of their quiz scores.

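library(dplyr)

# Drop instructor accounts so only students remain
df <- df %>%
  filter(!grepl("instructor", username))
# Total minutes per student across columns 2 through 226, then keep only
# the username, the total, and the unit quiz score columns
df <- df %>%
  mutate(studyMinsSum = rowSums(select(df, 2:226), na.rm = TRUE)) %>%
  select(c(username,
           studyMinsSum, starts_with("Unit.1.Quiz") |
             starts_with("Unit.2.Quiz") |
             starts_with("Unit.3.Quiz")))
# Mean quiz score per student (columns 3 to 5 are the quiz columns)
df <- df %>%
  mutate(quizMean = rowMeans(select(df, 3:5), na.rm = TRUE))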
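To make the aggregation step easier to follow, here is a minimal sketch of the same idea on a tiny made-up data frame. It uses base R rather than dplyr so it runs on its own. The column names `Lesson.1` and `Lesson.2`, the usernames, and all values are hypothetical and are not taken from the real CSV.

```r
# Hypothetical toy data: two minutes columns and two quiz columns.
toy <- data.frame(
  username    = c("student_a", "student_b", "instructor_x"),
  Lesson.1    = c(10, NA, 5),
  Lesson.2    = c(20, 15, NA),
  Unit.1.Quiz = c(8, 6, NA),
  Unit.2.Quiz = c(NA, 10, NA)
)

# Drop any row whose username contains "instructor"
students <- toy[!grepl("instructor", toy$username), ]

# Row-wise total of minutes; na.rm = TRUE treats a missing value as 0 minutes
students$studyMinsSum <- rowSums(students[, c("Lesson.1", "Lesson.2")],
                                 na.rm = TRUE)

# Row-wise mean of quiz scores; na.rm = TRUE skips quizzes not yet taken
students$quizMean <- rowMeans(students[, c("Unit.1.Quiz", "Unit.2.Quiz")],
                              na.rm = TRUE)

print(students[, c("username", "studyMinsSum", "quizMean")])
```

One thing worth noticing is that `na.rm = TRUE` means a missing quiz is skipped rather than counted as a zero, so a student who has taken fewer quizzes is averaged over fewer scores.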

Analysis

Plotting each student's total time against their mean quiz score, and showing the aggregated data

plot(df$quizMean ~ df$studyMinsSum, main="Effort vs. Scores",
xlab="Total Work Time (m)", ylab="Mean Quiz Score")

df %>%
  knitr::kable()
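| username  | studyMinsSum | Unit.1.Quiz | Unit.2.Quiz | Unit.3.Quiz | quizMean  |
|:----------|-------------:|------------:|------------:|------------:|----------:|
| student1  |         2515 |           9 |          10 |          10 |  9.666667 |
| student2  |         2276 |           4 |           9 |           3 |  5.333333 |
| student3  |         1906 |          14 |           5 |           6 |  8.333333 |
| student4  |         2298 |          11 |           6 |           6 |  7.666667 |
| student5  |         1557 |          13 |          12 |           9 | 11.333333 |
| student6  |         3234 |          12 |          14 |           5 | 10.333333 |
| student7  |         2134 |          12 |          14 |           5 | 10.333333 |
| student8  |         3884 |          11 |          13 |           5 |  9.666667 |
| student9  |         2985 |          10 |          12 |          15 | 12.333333 |
| student10 |         2082 |          14 |          15 |          15 | 14.666667 |
| student11 |         4234 |           6 |          13 |           7 |  8.666667 |
| student12 |         1692 |           1 |          11 |           0 |  4.000000 |
| student13 |         1957 |          10 |          13 |          11 | 11.333333 |
| student14 |         2854 |          12 |          11 |          15 | 12.666667 |
| student15 |         2815 |          11 |           6 |          11 |  9.333333 |
| student16 |         2889 |           8 |          13 |           5 |  8.666667 |
| student17 |         2857 |          12 |          13 |          12 | 12.333333 |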

Use the built-in lm function to fit a linear model. The summary shows a p-value of \(\approx\) 0.69 for the slope, which is much greater than 0.05, so the relationship between study time and quiz score is not statistically significant.

model <- lm(quizMean ~ studyMinsSum, data = df)
summary(model)
## 
## Call:
## lm(formula = quizMean ~ studyMinsSum, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4599 -1.2477 -0.1057  1.9247  5.0587 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  8.8176298  2.5052463   3.520   0.0031 **
## studyMinsSum 0.0003796  0.0009299   0.408   0.6889   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.731 on 15 degrees of freedom
## Multiple R-squared:  0.01099,    Adjusted R-squared:  -0.05495 
## F-statistic: 0.1666 on 1 and 15 DF,  p-value: 0.6889
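The numbers in the summary can also be pulled out programmatically instead of being read off the printout. The sketch below rebuilds the two modelled columns from the table printed above (values copied by hand, so the variable names `scores`, `fit`, and friends are my own) and then extracts the slope, its p-value, R-squared, and a 95% confidence interval. It also runs `cor.test`, since for a single predictor the Pearson correlation test and the slope t-test should give the same p-value.

```r
# Values copied from the printed table (studyMinsSum and quizMean columns)
scores <- data.frame(
  studyMinsSum = c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                   2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857),
  quizMean = c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
               10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
               11.333333, 12.666667, 9.333333, 8.666667, 12.333333)
)

fit <- lm(quizMean ~ studyMinsSum, data = scores)

# Coefficient table: one row per term, columns for estimate, SE, t, p
coefs     <- summary(fit)$coefficients
slope     <- coefs["studyMinsSum", "Estimate"]
slope_p   <- coefs["studyMinsSum", "Pr(>|t|)"]
r_squared <- summary(fit)$r.squared

# 95% confidence interval for the slope
slope_ci <- confint(fit, "studyMinsSum", level = 0.95)

# Pearson correlation test on the same two variables
ct <- cor.test(scores$studyMinsSum, scores$quizMean)

print(c(slope = slope, p_value = slope_p, r_squared = r_squared))
print(slope_ci)
print(ct$p.value)
```

If the interval for the slope spans zero, that is another way of seeing the same non-significant result: the data are compatible with no relationship at all, as well as with small positive or negative ones.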

Using ggplot with stat_smooth, the fitted line shows a very slight positive slope, which matches the coefficient estimate of about 0.0004 quiz points per minute.

library(ggplot2)
ggplot(data = df, aes(x = studyMinsSum, y = quizMean)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Residuals plot

res <- resid(model)
plot(df$studyMinsSum, res,
     ylab = "Residuals", xlab = "Study Time")
abline(h = 0)                  # horizontal reference line at zero

ggplot(data = df, aes(x = model$residuals)) +
    geom_histogram(fill = 'steelblue', color = 'black', binwidth = 1) +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

Shamelessly stealing the idea of using the performance library’s check_model function as well.

performance::check_model(model)

Summary

Does it meet the conditions of linear regression?

Linearity: No. Based on the scatter plot, there really does not appear to be a linear relationship.

Nearly normal residuals: Yes. The histogram of the residuals shows a distribution that roughly resembles the normal distribution.

Constant variability: Yes. The scatter plot of the residuals shows no clear pattern, which suggests the variability is roughly constant.

Independent observations: Yes. Each student's study time and scores are independent of the other students'.
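The normality condition above is judged by eye from a histogram of only 17 residuals, so a numeric check can be a useful complement. As a rough sketch (not something run in the original analysis), the Shapiro-Wilk test from base R's stats package can be applied to the residuals. The vectors below are copied from the printed table, and the names `study_mins`, `quiz_mean`, `fit`, and `sw` are my own.

```r
# Values copied from the printed table (studyMinsSum and quizMean columns)
study_mins <- c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857)
quiz_mean  <- c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
                10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
                11.333333, 12.666667, 9.333333, 8.666667, 12.333333)

fit <- lm(quiz_mean ~ study_mins)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value would be evidence against normality
sw <- shapiro.test(resid(fit))
print(sw)
```

With a sample this small the test has little power, so a large p-value here should be read as "no evidence against normality" rather than as proof that the residuals are normal.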

Is the linear model appropriate?

Hard to say. The data doesn’t seem to meet all the conditions for linear regression as I understand them, and the model explains only about 1% of the variance in quiz scores (R-squared of 0.011). This is data from a single class of 17 students with only 3 quizzes taken so far, so I think there probably just isn’t enough data to draw any significant conclusions.