Discussion 11

I am currently volunteer teaching Python to high school students, and the tool we use makes its usage data available as a CSV. I wanted to run a linear regression to see whether the amount of time students spend in the tool correlates with their quiz scores.

Loading Data

df <- read.csv('https://raw.githubusercontent.com/nolivercuny/DATA605/main/week%2011/data/student_data.csv')

Cleanup & Aggregation

Filter out instructors, sum each student's minutes in the tool, and take the mean of their quiz scores.

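library(dplyr)

# Drop instructor accounts so only students remain
df <- df %>%
  filter(!grepl("instructor", username))

# Sum the minutes columns (2 through 226), then keep only the username,
# the total, and the quiz score columns for Units 1-3
df <- df %>%
  mutate(studyMinsSum = rowSums(select(df, 2:226), na.rm = TRUE)) %>%
  select(c(username,
           studyMinsSum, starts_with("Unit.1.Quiz") |
             starts_with("Unit.2.Quiz") |
             starts_with("Unit.3.Quiz")))

# Average the three quiz columns (now columns 3 through 5)
df <- df %>%
  mutate(quizMean = rowMeans(select(df, 3:5), na.rm = TRUE))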
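To make the three cleanup steps easier to follow, here is a minimal base R sketch of the same idea on a toy data frame. The column names `Lesson.1` and `Lesson.2` and the usernames are hypothetical stand-ins; the real export has many more columns and may name them differently.

```r
# Toy data with hypothetical column names (not the real export)
toy <- data.frame(
  username    = c("instructor1", "studentA", "studentB"),
  Lesson.1    = c(30, 45, NA),
  Lesson.2    = c(20, 60, 15),
  Unit.1.Quiz = c(NA, 12, 9),
  Unit.2.Quiz = c(NA, 10, NA),
  stringsAsFactors = FALSE
)

# 1. Drop instructor rows
toy <- toy[!grepl("instructor", toy$username), ]

# 2. Sum the time columns per student, ignoring missing values
time_cols <- grep("^Lesson\\.", names(toy), value = TRUE)
toy$studyMinsSum <- rowSums(toy[, time_cols], na.rm = TRUE)

# 3. Average the quiz columns per student, ignoring missing values
quiz_cols <- grep("^Unit\\.[0-9]+\\.Quiz", names(toy), value = TRUE)
toy$quizMean <- rowMeans(toy[, quiz_cols], na.rm = TRUE)

toy[, c("username", "studyMinsSum", "quizMean")]
```

Selecting columns by name pattern rather than by position is a design choice here: it keeps working if the export gains or reorders columns.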

Analysis

Plot each student's total time against their mean quiz score and show the raw data.

plot(df$quizMean ~ df$studyMinsSum, main="Effort vs. Scores",
xlab="Total Work Time (m)", ylab="Mean Quiz Score")

library(knitr)  # provides kable()

df %>%
  kable()

username	studyMinsSum	Unit.1.Quiz	Unit.2.Quiz	Unit.3.Quiz	quizMean
student1	2515	9	10	10	9.666667
student2	2276	4	9	3	5.333333
student3	1906	14	5	6	8.333333
student4	2298	11	6	6	7.666667
student5	1557	13	12	9	11.333333
student6	3234	12	14	5	10.333333
student7	2134	12	14	5	10.333333
student8	3884	11	13	5	9.666667
student9	2985	10	12	15	12.333333
student10	2082	14	15	15	14.666667
student11	4234	6	13	7	8.666667
student12	1692	1	11	0	4.000000
student13	1957	10	13	11	11.333333
student14	2854	12	11	15	12.666667
student15	2815	11	6	11	9.333333
student16	2889	8	13	5	8.666667
student17	2857	12	13	12	12.333333

Use the built-in lm function to fit a linear model. The summary shows a p-value of \(\approx\) 0.69, which is much greater than 0.05, so the relationship is not statistically significant.

model <- lm(data=df, quizMean ~ studyMinsSum)
summary(model)

## 
## Call:
## lm(formula = quizMean ~ studyMinsSum, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4599 -1.2477 -0.1057  1.9247  5.0587 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  8.8176298  2.5052463   3.520   0.0031 **
## studyMinsSum 0.0003796  0.0009299   0.408   0.6889   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.731 on 15 degrees of freedom
## Multiple R-squared:  0.01099,    Adjusted R-squared:  -0.05495 
## F-statistic: 0.1666 on 1 and 15 DF,  p-value: 0.6889
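As a cross-check, in a regression with a single predictor the p-value for the slope is the same as the p-value from a Pearson correlation test, and R-squared is the squared correlation. The sketch below rebuilds the two columns from the table above (the `quizMean` values are the rounded ones shown there, so results may differ in the last digits) and adds a 95% confidence interval for the slope, which is my own addition rather than part of the original analysis.

```r
# Values copied from the table above (quizMean is rounded to 6 decimals)
studyMinsSum <- c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                  2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857)
quizMean <- c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
              10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
              11.333333, 12.666667, 9.333333, 8.666667, 12.333333)

fit <- lm(quizMean ~ studyMinsSum)
ct  <- cor.test(studyMinsSum, quizMean)  # Pearson correlation test

# The slope p-value and the correlation-test p-value should agree
slope_p <- summary(fit)$coefficients["studyMinsSum", "Pr(>|t|)"]
print(c(slope_p = slope_p, cor_p = ct$p.value))

# R-squared should equal the squared correlation coefficient
print(c(r_squared = summary(fit)$r.squared, r = unname(ct$estimate)))

# 95% confidence interval for the slope (quiz points per minute)
print(confint(fit, "studyMinsSum", level = 0.95))
```

If the interval for the slope contains zero, that is the same conclusion as a p-value above 0.05, just expressed in the units of the data.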

Using ggplot with stat_smooth, there does appear to be a very slight positive correlation, although with an R-squared of about 0.011 the fitted line is nearly flat.

library(ggplot2)

ggplot(data = df, aes(x = studyMinsSum, y = quizMean)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Residuals plot

res <- resid(model)
plot(df$studyMinsSum, res,
     ylab = "Residuals", xlab = "Study Time")
abline(0, 0)  # horizontal reference line at zero

ggplot(data = df, aes(x = model$residuals)) +
    geom_histogram(fill = 'steelblue', color = 'black', binwidth = 1) +
    labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')

Shamelessly stealing the idea of using the performance library’s check_model function as well.

library(performance)  # provides check_model()
check_model(model)

Summary

Does it meet the conditions of linear regression?

Linearity: No. Based on the scatter plot, there does not appear to be a linear relationship.

Nearly normal residuals: Yes. The histogram of the residuals shows a distribution that roughly resembles a normal distribution.

Constant variability: Yes. The scatter plot of the residuals shows no obvious pattern, which suggests constant variability.

Independent observations: Yes. Each student's study time and scores are independent of the other students'.
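The normality judgment above is made by eye from the histogram. As a rough numeric complement, one could run a Shapiro-Wilk test on the residuals. This is a sketch I am adding, not part of the original analysis; it refits the model from the values in the table above, and with only 17 points the test has little power, so a large p-value is weak evidence at best.

```r
# Values copied from the table above (quizMean is rounded to 6 decimals)
studyMinsSum <- c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                  2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857)
quizMean <- c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
              10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
              11.333333, 12.666667, 9.333333, 8.666667, 12.333333)

fit <- lm(quizMean ~ studyMinsSum)
res <- resid(fit)

# Null hypothesis: the residuals come from a normal distribution
sw <- shapiro.test(res)
print(sw)
```

A p-value above 0.05 would mean the test does not reject normality, which would be consistent with, but not proof of, the "nearly normal" reading of the histogram.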

Is the linear model appropriate?

Hard to say. It doesn’t seem to meet all the conditions for linear regression as I understand them. This is data from a single class of 17 students with only 3 quizzes taken so far. I think there probably just isn’t enough data to draw any meaningful conclusions.
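To put a rough number on "not enough data", here is a back-of-the-envelope sketch using the Fisher z approximation for the sample size needed to detect a correlation. It assumes the true correlation equals the one implied by the R-squared in the summary output, a two-sided test at the 0.05 level, and a target power of 80%; the last two are conventional choices of mine, not values from the analysis.

```r
r_squared <- 0.01099      # Multiple R-squared from the summary output
r <- sqrt(r_squared)      # implied correlation for a one-predictor model
alpha <- 0.05             # assumed two-sided significance level
power <- 0.80             # assumed target power

# Fisher z approximation:
#   n ~ ((z_{1 - alpha/2} + z_{power}) / atanh(r))^2 + 3
n_needed <- ((qnorm(1 - alpha / 2) + qnorm(power)) / atanh(r))^2 + 3
print(ceiling(n_needed))
```

Under these assumptions the answer is on the order of 700 students, far more than a single class of 17, so a non-significant result here says very little about whether a small effect exists.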

DATA 605 - Discussion 11

Nick Oliver
