library(tidyverse)
library(plotrix)
library(readxl)
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
The week 11 data dive asks us to extend what we did in the week 10 data dive: build GLMs (or other models) for any facet of our data that we are interested in, highlight any issues we find in the models, and interpret at least one of the coefficients. This data dive is fairly open-ended, so I plan to simply extend the week 10 data dive to cover different explanatory variables.
In the week 10 dive I explored a GLM using previous programming experience as the binary response and the responses to Initial7 (i.e., “Creating Visualizations”) as the explanatory variable. I was able to show a slightly positive trend, indicating a weak association between students having previous programming experience and their reported competency with creating visualizations.
For this week’s data dive, I plan to extend this to more of the initial skills. As an extension of last week, I will look at Initial6 (“Testing Code”), Initial8 (“Matrix Operations”), Initial9 (“Differential Equations”), and Initial11 (“Data Analysis”). These are all skills that I would expect previous coding experience to affect.
So let’s go ahead and import the dataset:
data0 <- read_excel("~/IUPUI/By Semester/Spring 2024/R_Stats/F18-F23_Survey_Data_Clean_R.xlsx")
View(data0)
In this dataset, the binary variable comes from column 5, which is labeled “Previous Experience.” Right now, the responses in that column are coded y/n for “yes” and “no,” respectively. To convert those into a 0/1 indicator, we can add the following column:
# Recode the y/n responses as a 0/1 indicator for prior experience
data1 <-
  data0 |>
  mutate(Experience = case_when(
    startsWith(`Previous Experience`, 'y') ~ 1,
    startsWith(`Previous Experience`, 'n') ~ 0))
view(data1)
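As a quick sanity check on the recode, we can cross-tabulate the raw responses against the new indicator; every row should land in a y/1 or n/0 cell (a minimal sketch using only columns already in the data):
# Verify the recode: counts should pair 'y' with 1 and 'n' with 0
data1 |>
  count(`Previous Experience`, Experience)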
Now to build the GLM for Initial6 (“Testing code”):
m6 <-
  glm(Experience ~ Initial6,
      data = data1,
      family = binomial(link = "logit"))
summary(m6)
##
## Call:
## glm(formula = Experience ~ Initial6, family = binomial(link = "logit"),
## data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26869 0.15995 -7.932 2.16e-15 ***
## Initial6 0.10238 0.03231 3.168 0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.90 on 528 degrees of freedom
## Residual deviance: 629.89 on 527 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 633.89
##
## Number of Fisher Scoring iterations: 4
The \(\beta_1\) value is the most important coefficient to examine, as it carries the most direct information about the relationship between the two variables. In this case, \(\beta_1 = 0.10238\), which is greater than zero, indicating a positive relationship: students who report greater initial competency with testing code are more likely to have previous coding experience. The p-value of this coefficient is also well below 0.05, so we can have fairly high confidence that the coefficient is non-zero.
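Because this is a logit model, \(\beta_1\) lives on the log-odds scale, so exponentiating it gives an odds ratio that is easier to interpret: \(e^{0.10238} \approx 1.11\), i.e., each additional point of self-reported competency multiplies the odds of prior experience by roughly 1.11. A minimal sketch using the m6 fit above (the interval uses R’s profile-likelihood method for GLMs via confint()):
# Odds ratio for the Initial6 coefficient (approximately 1.11)
exp(coef(m6)["Initial6"])
# Profile-likelihood 95% CI, also on the odds-ratio scale
exp(confint(m6, "Initial6"))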
Now to build the GLM for Initial8 (“Matrix Operations”):
m8 <-
  glm(Experience ~ Initial8,
      data = data1,
      family = binomial(link = "logit"))
summary(m8)
##
## Call:
## glm(formula = Experience ~ Initial8, family = binomial(link = "logit"),
## data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.95625 0.17069 -5.602 2.11e-08 ***
## Initial8 0.01747 0.03252 0.537 0.591
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.90 on 528 degrees of freedom
## Residual deviance: 639.61 on 527 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 643.61
##
## Number of Fisher Scoring iterations: 4
Interestingly enough, the interpretation of \(\beta_1\) for “Matrix Operations” is entirely different from the two previous GLMs (Initial6 above and Initial7 from week 10): the first thing that jumps out is that the p-value indicates this coefficient is not statistically significant. So while the estimate is positive and could suggest a positive correlation, the p-value indicates low confidence in that estimate, so we cannot draw any conclusions about this coefficient from the model.
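One way to double-check this is a likelihood-ratio test, which compares the fitted model against the intercept-only model; a sketch using the m8 fit above:
# Likelihood-ratio (chi-squared) test of Initial8 against the null model;
# a large p-value here agrees with the non-significant Wald z test above
anova(m8, test = "Chisq")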
Now to build the GLM for Initial9 (“Differential Equations”):
m9 <-
  glm(Experience ~ Initial9,
      data = data1,
      family = binomial(link = "logit"))
summary(m9)
##
## Call:
## glm(formula = Experience ~ Initial9, family = binomial(link = "logit"),
## data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.911370 0.156890 -5.809 6.29e-09 ***
## Initial9 0.008137 0.033037 0.246 0.805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.90 on 528 degrees of freedom
## Residual deviance: 639.84 on 527 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 643.84
##
## Number of Fisher Scoring iterations: 4
So, two things immediately jump out about this GLM. First, \(\beta_1\) is very close to zero, which by itself indicates a very weak correlation, if any. Second, the very high p-value indicates very little confidence in this value being non-zero. Together, these strongly imply that there is no correlation between the two variables: a student’s response about their initial competency with differential equations is very likely not related to whether or not the student has had previous coding experience.
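This flat relationship is easy to see visually. A sketch that overlays the fitted logistic curve on the (jittered) raw responses; with a near-zero \(\beta_1\), the curve should be essentially horizontal:
# Jittered raw responses with the fitted logistic curve overlaid;
# a flat curve reflects the near-zero Initial9 coefficient
ggplot(data1, aes(x = Initial9, y = Experience)) +
  geom_jitter(height = 0.05, width = 0.2, alpha = 0.3) +
  geom_smooth(method = "glm",
              method.args = list(family = binomial(link = "logit")),
              se = TRUE)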
Now to build the GLM for Initial11 (“Data Analysis”):
m11 <-
  glm(Experience ~ Initial11,
      data = data1,
      family = binomial(link = "logit"))
summary(m11)
##
## Call:
## glm(formula = Experience ~ Initial11, family = binomial(link = "logit"),
## data = data1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.22819 0.21859 -5.619 1.92e-08 ***
## Initial11 0.06693 0.03719 1.800 0.0719 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 639.90 on 528 degrees of freedom
## Residual deviance: 636.63 on 527 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 640.63
##
## Number of Fisher Scoring iterations: 4
While not as weak as some of the other \(\beta_1\) values, the value of ~0.07 suggests at most a weak positive association. Its p-value (~0.072) also falls just above the usual 0.05 threshold, so the evidence for a relationship here is marginal at best.
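To make the cross-model comparison easier, we can collect \(\beta_1\), its p-value, and the AIC for all four fits into one table. A sketch using purrr (summarise_glm is a hypothetical helper written just for this comparison):
# Pull the slope estimate, its p-value, and the AIC from one fitted GLM
summarise_glm <- function(m) {
  s <- coef(summary(m))[2, ]
  tibble(beta1 = s["Estimate"], p_value = s["Pr(>|z|)"], AIC = AIC(m))
}
# One row per initial-skill model, labeled by the explanatory variable
map_dfr(list(Initial6 = m6, Initial8 = m8,
             Initial9 = m9, Initial11 = m11),
        summarise_glm, .id = "skill")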
For this data dive, it was shown that only one prompt, “Testing Code,” has a moderate positive correlation with prior programming experience, while the other three do not. This is an interesting result, as it seems to imply that prior programming experience likely doesn’t have an impact on a student’s initial skills specifically related to computational problem-solving in physics: having previous experience with coding doesn’t necessarily translate to increased competency in solving physics problems computationally. This can have some interesting implications for physics education. It would be interesting to further explore this effect by course level, since students in introductory courses are less likely to need rigorous computational skills than students in upper-level courses. This is certainly something worth exploring in greater detail in the future.