Discussion 11
I am currently volunteering as a Python teacher for high school students, and the tool we use makes its usage data available as a CSV. I wanted to run a linear regression to see whether the amount of time students spend in the tool correlates with their quiz scores.
Loading Data
df <- read.csv('https://raw.githubusercontent.com/nolivercuny/DATA605/main/week%2011/data/student_data.csv')
Cleanup & Aggregation
Filter out instructors, sum the minutes in the tool and get the mean of their quiz scores
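To make the three steps concrete, here is a minimal sketch on a made-up data frame. The column names `day1` and `day2` and the values are hypothetical (the real export has many more minute columns), and it uses base R rather than dplyr so it runs without any packages.

```r
# Made-up usage export: one row per user, minute columns, then quiz columns.
toy <- data.frame(
  username    = c("instructor1", "studentA", "studentB"),
  day1        = c(30, 10, NA),
  day2        = c(45, 20, 15),
  Unit.1.Quiz = c(NA, 8, 12),
  Unit.2.Quiz = c(NA, 10, NA)
)

# Step 1: drop instructor accounts.
toy_students <- toy[!grepl("instructor", toy$username), ]

# Step 2: total minutes per student, ignoring missing days.
minute_cols <- grep("^day", names(toy_students))
toy_students$studyMinsSum <- rowSums(toy_students[, minute_cols], na.rm = TRUE)

# Step 3: mean quiz score per student, ignoring quizzes not taken.
quiz_cols <- grep("Quiz$", names(toy_students))
toy_students$quizMean <- rowMeans(toy_students[, quiz_cols], na.rm = TRUE)

print(toy_students[, c("username", "studyMinsSum", "quizMean")])
```

With `na.rm = TRUE`, a student who skipped a day or a quiz is summarized over the values that exist instead of ending up with `NA`.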
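library(dplyr) # provides %>%, filter, mutate, select
# Drop instructor accounts so only students remain
df <- df %>%
  filter(!grepl("instructor", username))
# Columns 2:226 hold the minutes in the tool; sum them per student,
# then keep only the username, that total, and the three unit quiz scores
df <- df %>%
  mutate(studyMinsSum = rowSums(select(df, 2:226), na.rm = TRUE)) %>%
  select(c(username,
           studyMinsSum, starts_with("Unit.1.Quiz") |
             starts_with("Unit.2.Quiz") |
             starts_with("Unit.3.Quiz")))
# Columns 3:5 are now the quiz scores; average them per student
df <- df %>%
  mutate(quizMean = rowMeans(select(df, 3:5), na.rm = TRUE))
Analysis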
Plotting students' times vs. scores and showing the raw data
plot(quizMean ~ studyMinsSum, data = df, main = "Effort vs. Scores",
     xlab = "Total Work Time (m)", ylab = "Mean Quiz Score")
df %>%
  knitr::kable()

| username | studyMinsSum | Unit.1.Quiz | Unit.2.Quiz | Unit.3.Quiz | quizMean |
|---|---|---|---|---|---|
| student1 | 2515 | 9 | 10 | 10 | 9.666667 |
| student2 | 2276 | 4 | 9 | 3 | 5.333333 |
| student3 | 1906 | 14 | 5 | 6 | 8.333333 |
| student4 | 2298 | 11 | 6 | 6 | 7.666667 |
| student5 | 1557 | 13 | 12 | 9 | 11.333333 |
| student6 | 3234 | 12 | 14 | 5 | 10.333333 |
| student7 | 2134 | 12 | 14 | 5 | 10.333333 |
| student8 | 3884 | 11 | 13 | 5 | 9.666667 |
| student9 | 2985 | 10 | 12 | 15 | 12.333333 |
| student10 | 2082 | 14 | 15 | 15 | 14.666667 |
| student11 | 4234 | 6 | 13 | 7 | 8.666667 |
| student12 | 1692 | 1 | 11 | 0 | 4.000000 |
| student13 | 1957 | 10 | 13 | 11 | 11.333333 |
| student14 | 2854 | 12 | 11 | 15 | 12.666667 |
| student15 | 2815 | 11 | 6 | 11 | 9.333333 |
| student16 | 2889 | 8 | 13 | 5 | 8.666667 |
| student17 | 2857 | 12 | 13 | 12 | 12.333333 |
Use the built-in lm function to fit a linear model. The summary shows a p-value of \(\approx\) 0.69, which is much greater than 0.05, so the relationship between study time and quiz score is not statistically significant.
model <- lm(quizMean ~ studyMinsSum, data = df)
summary(model)
##
## Call:
## lm(formula = quizMean ~ studyMinsSum, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4599 -1.2477 -0.1057 1.9247 5.0587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8176298 2.5052463 3.520 0.0031 **
## studyMinsSum 0.0003796 0.0009299 0.408 0.6889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.731 on 15 degrees of freedom
## Multiple R-squared: 0.01099, Adjusted R-squared: -0.05495
## F-statistic: 0.1666 on 1 and 15 DF, p-value: 0.6889
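The figures read off the summary can also be pulled out of the fitted model directly. The sketch below rebuilds the two relevant columns from the table above so that it is self-contained. The names `students`, `fit`, `slope_p` and `r_squared` are just illustrative.

```r
# studyMinsSum and quizMean transcribed from the table above (17 students).
students <- data.frame(
  studyMinsSum = c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                   2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857),
  quizMean = c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
               10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
               11.333333, 12.666667, 9.333333, 8.666667, 12.333333)
)

fit <- summary(lm(quizMean ~ studyMinsSum, data = students))

# p-value for the slope: the "Pr(>|t|)" column of the coefficient table.
slope_p <- fit$coefficients["studyMinsSum", "Pr(>|t|)"]

# Share of the variance in quiz means explained by study time.
r_squared <- fit$r.squared

print(slope_p)
print(r_squared)
```

An R-squared this close to zero says the same thing as the large p-value: in this sample, study time explains almost none of the variation in quiz scores.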
Using ggplot with stat_smooth, the fitted line does show a very slight positive slope, in line with the small positive coefficient estimate above.
library(ggplot2)
ggplot(data = df, aes(x = studyMinsSum, y = quizMean)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
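How slight the positive relationship is can be put into a number with the Pearson correlation coefficient. This is a sketch using base R's `cor` and `cor.test` on the values from the table above. The names `students`, `r` and `ct` are illustrative.

```r
# studyMinsSum and quizMean transcribed from the table above (17 students).
students <- data.frame(
  studyMinsSum = c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                   2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857),
  quizMean = c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
               10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
               11.333333, 12.666667, 9.333333, 8.666667, 12.333333)
)

# Pearson correlation between total study minutes and mean quiz score.
r <- cor(students$studyMinsSum, students$quizMean)

# Test of H0: correlation = 0. For simple linear regression this is
# the same test as the one on the slope, so the p-values agree.
ct <- cor.test(students$studyMinsSum, students$quizMean)

print(r)
print(ct$p.value)
```

For a one-predictor model, squaring `r` gives the Multiple R-squared from the regression summary, so the two views of the data agree.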
Residuals plot
res <- resid(model)
plot(df$studyMinsSum, res,
     ylab = "Residuals", xlab = "Study Time")
abline(0, 0) # the horizon
ggplot(data = data.frame(res = res), aes(x = res)) +
  geom_histogram(fill = 'steelblue', color = 'black', binwidth = 1) +
  labs(title = 'Histogram of Residuals', x = 'Residuals', y = 'Frequency')
Shamelessly stealing the idea of using the performance library’s check_model function as well.
performance::check_model(model)
Summary
Does it meet the conditions of linear regression?
Linearity: No. Based on the scatter plot, there does not appear to be a linear relationship.
Nearly normal residuals: Yes. The histogram of the residuals roughly resembles a normal distribution.
Constant variability: Yes. The scatter plot of the residuals shows no pattern, which indicates constant variability.
Independent observations: Yes. The study times and scores of individual students are independent of one another.
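The normality judgment above is made by eye from a histogram of 17 values. A formal check is one possible complement, sketched here with base R's `shapiro.test` on the residuals of the same model, refit from the table values. This test is not part of the original analysis, and with so few observations it has little power, so it should be read alongside the plots rather than instead of them.

```r
# studyMinsSum and quizMean transcribed from the table above (17 students).
students <- data.frame(
  studyMinsSum = c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                   2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857),
  quizMean = c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
               10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
               11.333333, 12.666667, 9.333333, 8.666667, 12.333333)
)

model <- lm(quizMean ~ studyMinsSum, data = students)
res <- resid(model)

# Shapiro-Wilk test: H0 is that the residuals come from a normal distribution.
# A small p-value would be evidence against normality.
sw <- shapiro.test(res)
print(sw)
```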
Is the linear model appropriate?
Hard to say. It doesn’t seem to meet all the conditions of linear regression as I understand them. This is data from a single class of 17 students with only 3 quizzes taken so far. I think there probably just isn’t enough data to draw any significant conclusions.
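One way to see how little 17 students pin down is a confidence interval for the slope. This is a sketch using base R's `confint` on the same model, refit from the table values. The names `students` and `ci` are illustrative.

```r
# studyMinsSum and quizMean transcribed from the table above (17 students).
students <- data.frame(
  studyMinsSum = c(2515, 2276, 1906, 2298, 1557, 3234, 2134, 3884, 2985,
                   2082, 4234, 1692, 1957, 2854, 2815, 2889, 2857),
  quizMean = c(9.666667, 5.333333, 8.333333, 7.666667, 11.333333, 10.333333,
               10.333333, 9.666667, 12.333333, 14.666667, 8.666667, 4.000000,
               11.333333, 12.666667, 9.333333, 8.666667, 12.333333)
)

model <- lm(quizMean ~ studyMinsSum, data = students)

# 95% confidence interval for each coefficient (estimate +/- t * std. error).
ci <- confint(model, level = 0.95)
print(ci)

# Does the interval for the slope contain zero?
slope_ci <- ci["studyMinsSum", ]
print(slope_ci[1] < 0 && slope_ci[2] > 0)
```

The interval for the slope includes both negative and positive values, so the data are compatible with study time helping, hurting, or making no difference. That fits the conclusion that there is not yet enough data.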