The data in a ‘cleaned’ state is in a Dropbox public folder, available to you using the line of code below. The row.names = 1 is included because the “caseid”" is in the first column. Note that I included a ‘set.seed’ line of code where necessary to ensure consistency of output.
caseid <- read.csv("C:/Users/Stephen/Dropbox/Public/caseid.csv", row.names=1)
Figure 3. The first analysis was for the PCA, using the FactoMineR package, so we load that library first
library(FactoMineR)
The r object that is the output is called ‘sheppca’, because John Shepherd asked for this analysis. The first plot is the individuals mapped onto the dimensions. I didn’t include this in the document because it doesn’t tell us much. The caseid 1061 contains has most of its data missing.
sheppca <- PCA(caseid)
to find the descriptions of the dimensions found within the data (there are three)
dimdesc(sheppca)
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## buqu1130gradesdummy 0.6857255 8.939785e-125
## acct3310gradesdummy 0.6382817 4.102455e-103
## acct1110gradesdummy 0.6265694 2.441375e-98
## acct1210gradesdummy 0.6250740 9.607258e-98
## buqu1230gradesdummy 0.5588667 2.476667e-74
## acct2993gradesdummy 0.3136763 8.545717e-22
## agegroupdummy 0.2515512 2.516319e-14
## acct2293attempts -0.1673005 5.106042e-07
## acct1210attempts -0.3441896 3.513021e-26
## acct1110attempts -0.3794373 6.856857e-32
## acct3310attempts -0.3861665 4.605557e-33
##
##
## $Dim.2
## $Dim.2$quanti
## correlation p.value
## acct2993gradesdummy 0.6635881 3.580995e-114
## acct1210attempts 0.4391995 2.587085e-43
## agegroupdummy 0.2581629 4.944079e-15
## acct1110attempts 0.2035927 8.624833e-10
## servicedummy 0.1791803 7.260997e-08
## acct3310gradesdummy 0.1058837 1.550547e-03
## acct3310attempts -0.1283842 1.216613e-04
## acct1110gradesdummy -0.2198317 3.267180e-11
## acct1210gradesdummy -0.2536260 1.517722e-14
## acct2293attempts -0.5495936 1.841851e-71
##
##
## $Dim.3
## $Dim.3$quanti
## correlation p.value
## servicedummy 0.6769616 1.820698e-120
## agegroupdummy 0.5468859 1.221245e-70
## acct2293attempts 0.3063046 8.278089e-21
## acct3310gradesdummy 0.3023576 2.719789e-20
## acct3310attempts 0.1839118 3.216556e-08
## acct1110attempts 0.1778051 9.162759e-08
## acct1210gradesdummy 0.1512526 5.771027e-06
## acct1210attempts 0.1402330 2.659728e-05
## acct1110gradesdummy 0.1060794 1.519495e-03
## buqu1130gradesdummy -0.1651010 7.220859e-07
## buqu1230gradesdummy -0.1690242 3.879395e-07
## acct2993gradesdummy -0.1936067 5.670914e-09
Figure 10. Now the random forest, using the randomforest package
library(randomForest)
set.seed(45)
sheprf <- randomForest(as.factor(acct3310gradesdummy) ~ ., data=caseid, nree=10000, mtry=1, importance=TRUE,na.action=na.roughfix, replace=FALSE)
and get the variable importance list
rn <- round(importance(sheprf), 2)
rn[order(rn[,1], decreasing=TRUE),]
## 0 1 2 3 4 5 6 7 8
## agegroupdummy 3.08 5.85 0.63 -1.01 2.38 -0.72 -4.52 -1.26 -1.90
## acct1210gradesdummy 1.95 14.49 1.82 -2.82 -1.41 1.60 -0.81 -1.13 1.67
## acct2293attempts 1.09 -1.52 2.68 -0.75 0.22 4.73 1.24 -2.49 3.00
## buqu1130gradesdummy 0.53 10.29 2.92 -1.59 -3.49 5.87 -2.66 -1.63 -0.11
## acct2993gradesdummy 0.05 5.86 1.68 -2.17 -0.46 0.34 1.30 -1.59 1.35
## acct3310attempts 0.01 -0.64 2.97 1.36 -2.66 0.23 1.41 -1.89 -1.13
## acct1110attempts -0.11 -3.17 3.76 -1.30 -0.22 1.82 -0.70 -2.87 -0.54
## buqu1230gradesdummy -0.11 10.48 -0.47 1.47 -2.60 3.86 -1.09 0.80 -0.76
## acct1110gradesdummy -0.83 6.48 0.68 -1.72 1.16 -1.00 -2.56 -0.18 1.87
## servicedummy -2.19 0.75 -1.10 -1.72 -2.09 1.39 -1.15 -0.42 -1.13
## acct1210attempts -2.93 -1.73 -1.16 -0.47 -4.28 -0.91 -1.99 -1.15 -0.48
## 9 10 11 MeanDecreaseAccuracy
## agegroupdummy 4.16 4.65 9.00 8.09
## acct1210gradesdummy 1.89 5.37 9.42 13.18
## acct2293attempts 3.88 2.33 4.14 1.84
## buqu1130gradesdummy 5.43 4.95 3.25 9.52
## acct2993gradesdummy -1.72 4.81 -0.14 5.26
## acct3310attempts -2.28 1.30 4.55 -0.62
## acct1110attempts -1.92 0.06 3.34 -1.41
## buqu1230gradesdummy 1.66 2.16 5.31 8.47
## acct1110gradesdummy 4.39 6.27 8.23 8.34
## servicedummy 0.65 -2.31 5.77 -1.00
## acct1210attempts 4.61 -1.67 1.22 -4.37
## MeanDecreaseGini
## agegroupdummy 8.21
## acct1210gradesdummy 10.36
## acct2293attempts 4.63
## buqu1130gradesdummy 10.71
## acct2993gradesdummy 9.23
## acct3310attempts 6.07
## acct1110attempts 4.27
## buqu1230gradesdummy 11.18
## acct1110gradesdummy 10.86
## servicedummy 3.11
## acct1210attempts 3.90
print(sheprf)
##
## Call:
## randomForest(formula = as.factor(acct3310gradesdummy) ~ ., data = caseid, nree = 10000, mtry = 1, importance = TRUE, replace = FALSE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 80.47%
## Confusion matrix:
## 0 1 2 3 4 5 6 7 8 9 10 11 class.error
## 0 0 72 0 0 7 0 0 0 0 2 0 1 1.00000000
## 1 0 153 0 0 8 0 0 0 0 2 0 0 0.06134969
## 2 0 68 0 0 6 3 0 0 0 1 0 0 1.00000000
## 3 0 30 0 0 2 0 0 0 0 0 0 0 1.00000000
## 4 0 106 0 0 9 1 0 0 0 0 0 1 0.92307692
## 5 0 81 0 0 10 4 0 0 1 0 1 1 0.95918367
## 6 0 73 0 0 8 2 0 0 0 1 2 0 1.00000000
## 7 0 61 0 0 2 0 1 0 0 1 0 0 1.00000000
## 8 0 37 0 0 5 2 0 1 0 2 1 4 1.00000000
## 9 0 44 0 0 6 0 1 0 0 2 2 4 0.96610169
## 10 0 25 0 0 2 1 0 0 0 2 0 5 1.00000000
## 11 0 9 0 0 1 1 0 0 0 4 3 6 0.75000000
The non-linear regression using the mgcv package
library(mgcv)
shepmgcv <- gam(acct3310gradesdummy ~ s(agegroupdummy, k=4) + s(acct1110gradesdummy, k = 4) + s(acct1210gradesdummy, k= 4) + s(acct3310attempts, k=4) + s(buqu1130gradesdummy, k=5) + s(buqu1230gradesdummy, k=5) +s(acct1210attempts, k=4), data = caseid)
Get a summary of results
summary(shepmgcv)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## acct3310gradesdummy ~ s(agegroupdummy, k = 4) + s(acct1110gradesdummy,
## k = 4) + s(acct1210gradesdummy, k = 4) + s(acct3310attempts,
## k = 4) + s(buqu1130gradesdummy, k = 5) + s(buqu1230gradesdummy,
## k = 5) + s(acct1210attempts, k = 4)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9864 0.1318 37.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(agegroupdummy) 2.172 2.552 2.182 0.0992 .
## s(acct1110gradesdummy) 1.295 1.523 5.032 0.0141 *
## s(acct1210gradesdummy) 2.562 2.863 9.089 1.47e-05 ***
## s(acct3310attempts) 1.622 1.933 0.508 0.5958
## s(buqu1130gradesdummy) 1.352 1.624 4.615 0.0172 *
## s(buqu1230gradesdummy) 1.249 1.453 5.486 0.0112 *
## s(acct1210attempts) 1.000 1.000 0.259 0.6111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.297 Deviance explained = 31.9%
## GCV = 6.6 Scale est. = 6.3797 n = 367
then a visualization, first of the two variables of interest to readers
vis.gam(shepmgcv, view=c("acct1210gradesdummy","agegroupdummy", col="topo"), plot.type = "contour")
Finally a diagnostics check of the model
gam.check(shepmgcv)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 12 iterations.
## The RMS GCV score gradiant at convergence was 3.894731e-07 .
## The Hessian was positive definite.
## The estimated model rank was 24 (maximum possible: 24)
## Model rank = 24 / 24
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(agegroupdummy) 3.000 2.172 1.008 0.55
## s(acct1110gradesdummy) 3.000 1.295 0.994 0.41
## s(acct1210gradesdummy) 3.000 2.562 1.013 0.62
## s(acct3310attempts) 3.000 1.622 1.009 0.55
## s(buqu1130gradesdummy) 4.000 1.352 1.047 0.84
## s(buqu1230gradesdummy) 4.000 1.249 0.994 0.46
## s(acct1210attempts) 3.000 1.000 0.990 0.40
Now doing the checks on the changes in grade variance, withdrawals, mean attempts and occurrence of A grades. Load a different dataset, and then the changepoint library. The code that follows tells R to show the four plots in a grid, and also provide the date points when changes occurred:
poisson <- read.csv("C:/Users/Stephen/Dropbox/Public/poisson.csv")
library(changepoint)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Successfully loaded changepoint package version 1.1.5
## Created on 2014-06-25
## Substantial changes to the structure of the package have occured from version 1.0. Please see the package NEWS for details.
par(mfcol=c(2,2))
shepmeanvar <- cpt.meanvar(poisson$astudent, test.stat = "Poisson", method = "PELT")
plot(shepmeanvar, cpt.width = 3, main = "'A' Grade Occurrence")
cpts.ts(shepmeanvar)
## [1] 396 431 586 611 786
shepwithdrawals <- cpt.meanvar(poisson$withdraw, test.stat = "Poisson", method = "PELT")
plot(shepwithdrawals, cpt.width = 3, main = "Withdrawals")
cpts.ts(shepwithdrawals)
## [1] 72 79 164
attempts<- cpt.mean(poisson$Attempts, method = "PELT")
plot(attempts, type = "l", cpt.col = "blue", xlab = "Time", cpt.width = 3, main = "Mean Attempts")
names(poisson)
## [1] "Term" "Lettergrade" "Grade" "Attempts" "astudent"
## [6] "withdraw"
gradevar <- cpt.var(poisson$Grade, method = "PELT")
plot(gradevar, type = "l", cpt.col = "blue", xlab = "Time", cpt.width = 3, main = "Grade Variance")
cpts.ts(gradevar)
## [1] 102 146 149 763 766