The data in a ‘cleaned’ state is in a Dropbox public folder, available to you using the line of code below. The row.names = 1 is included because the “caseid”" is in the first column. Note that I included a ‘set.seed’ line of code where necessary to ensure consistency of output.

caseid <- read.csv("C:/Users/Stephen/Dropbox/Public/caseid.csv", row.names=1)

Figure 3. The first analysis was for the PCA, using the FactoMineR package, so we load that library first

library(FactoMineR)

The r object that is the output is called ‘sheppca’, because John Shepherd asked for this analysis. The first plot is the individuals mapped onto the dimensions. I didn’t include this in the document because it doesn’t tell us much. The caseid 1061 contains has most of its data missing.

sheppca <- PCA(caseid)

to find the descriptions of the dimensions found within the data (there are three)

  dimdesc(sheppca)  
## $Dim.1
## $Dim.1$quanti
##                     correlation       p.value
## buqu1130gradesdummy   0.6857255 8.939785e-125
## acct3310gradesdummy   0.6382817 4.102455e-103
## acct1110gradesdummy   0.6265694  2.441375e-98
## acct1210gradesdummy   0.6250740  9.607258e-98
## buqu1230gradesdummy   0.5588667  2.476667e-74
## acct2993gradesdummy   0.3136763  8.545717e-22
## agegroupdummy         0.2515512  2.516319e-14
## acct2293attempts     -0.1673005  5.106042e-07
## acct1210attempts     -0.3441896  3.513021e-26
## acct1110attempts     -0.3794373  6.856857e-32
## acct3310attempts     -0.3861665  4.605557e-33
## 
## 
## $Dim.2
## $Dim.2$quanti
##                     correlation       p.value
## acct2993gradesdummy   0.6635881 3.580995e-114
## acct1210attempts      0.4391995  2.587085e-43
## agegroupdummy         0.2581629  4.944079e-15
## acct1110attempts      0.2035927  8.624833e-10
## servicedummy          0.1791803  7.260997e-08
## acct3310gradesdummy   0.1058837  1.550547e-03
## acct3310attempts     -0.1283842  1.216613e-04
## acct1110gradesdummy  -0.2198317  3.267180e-11
## acct1210gradesdummy  -0.2536260  1.517722e-14
## acct2293attempts     -0.5495936  1.841851e-71
## 
## 
## $Dim.3
## $Dim.3$quanti
##                     correlation       p.value
## servicedummy          0.6769616 1.820698e-120
## agegroupdummy         0.5468859  1.221245e-70
## acct2293attempts      0.3063046  8.278089e-21
## acct3310gradesdummy   0.3023576  2.719789e-20
## acct3310attempts      0.1839118  3.216556e-08
## acct1110attempts      0.1778051  9.162759e-08
## acct1210gradesdummy   0.1512526  5.771027e-06
## acct1210attempts      0.1402330  2.659728e-05
## acct1110gradesdummy   0.1060794  1.519495e-03
## buqu1130gradesdummy  -0.1651010  7.220859e-07
## buqu1230gradesdummy  -0.1690242  3.879395e-07
## acct2993gradesdummy  -0.1936067  5.670914e-09

Figure 10. Now the random forest, using the randomforest package

library(randomForest)
set.seed(45)
sheprf <- randomForest(as.factor(acct3310gradesdummy) ~ ., data=caseid, nree=10000, mtry=1, importance=TRUE,na.action=na.roughfix, replace=FALSE)

and get the variable importance list

rn <- round(importance(sheprf), 2)
rn[order(rn[,1], decreasing=TRUE),]
##                         0     1     2     3     4     5     6     7     8
## agegroupdummy        3.08  5.85  0.63 -1.01  2.38 -0.72 -4.52 -1.26 -1.90
## acct1210gradesdummy  1.95 14.49  1.82 -2.82 -1.41  1.60 -0.81 -1.13  1.67
## acct2293attempts     1.09 -1.52  2.68 -0.75  0.22  4.73  1.24 -2.49  3.00
## buqu1130gradesdummy  0.53 10.29  2.92 -1.59 -3.49  5.87 -2.66 -1.63 -0.11
## acct2993gradesdummy  0.05  5.86  1.68 -2.17 -0.46  0.34  1.30 -1.59  1.35
## acct3310attempts     0.01 -0.64  2.97  1.36 -2.66  0.23  1.41 -1.89 -1.13
## acct1110attempts    -0.11 -3.17  3.76 -1.30 -0.22  1.82 -0.70 -2.87 -0.54
## buqu1230gradesdummy -0.11 10.48 -0.47  1.47 -2.60  3.86 -1.09  0.80 -0.76
## acct1110gradesdummy -0.83  6.48  0.68 -1.72  1.16 -1.00 -2.56 -0.18  1.87
## servicedummy        -2.19  0.75 -1.10 -1.72 -2.09  1.39 -1.15 -0.42 -1.13
## acct1210attempts    -2.93 -1.73 -1.16 -0.47 -4.28 -0.91 -1.99 -1.15 -0.48
##                         9    10    11 MeanDecreaseAccuracy
## agegroupdummy        4.16  4.65  9.00                 8.09
## acct1210gradesdummy  1.89  5.37  9.42                13.18
## acct2293attempts     3.88  2.33  4.14                 1.84
## buqu1130gradesdummy  5.43  4.95  3.25                 9.52
## acct2993gradesdummy -1.72  4.81 -0.14                 5.26
## acct3310attempts    -2.28  1.30  4.55                -0.62
## acct1110attempts    -1.92  0.06  3.34                -1.41
## buqu1230gradesdummy  1.66  2.16  5.31                 8.47
## acct1110gradesdummy  4.39  6.27  8.23                 8.34
## servicedummy         0.65 -2.31  5.77                -1.00
## acct1210attempts     4.61 -1.67  1.22                -4.37
##                     MeanDecreaseGini
## agegroupdummy                   8.21
## acct1210gradesdummy            10.36
## acct2293attempts                4.63
## buqu1130gradesdummy            10.71
## acct2993gradesdummy             9.23
## acct3310attempts                6.07
## acct1110attempts                4.27
## buqu1230gradesdummy            11.18
## acct1110gradesdummy            10.86
## servicedummy                    3.11
## acct1210attempts                3.90
print(sheprf)
## 
## Call:
##  randomForest(formula = as.factor(acct3310gradesdummy) ~ ., data = caseid,      nree = 10000, mtry = 1, importance = TRUE, replace = FALSE,      na.action = na.roughfix) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 80.47%
## Confusion matrix:
##    0   1 2 3  4 5 6 7 8 9 10 11 class.error
## 0  0  72 0 0  7 0 0 0 0 2  0  1  1.00000000
## 1  0 153 0 0  8 0 0 0 0 2  0  0  0.06134969
## 2  0  68 0 0  6 3 0 0 0 1  0  0  1.00000000
## 3  0  30 0 0  2 0 0 0 0 0  0  0  1.00000000
## 4  0 106 0 0  9 1 0 0 0 0  0  1  0.92307692
## 5  0  81 0 0 10 4 0 0 1 0  1  1  0.95918367
## 6  0  73 0 0  8 2 0 0 0 1  2  0  1.00000000
## 7  0  61 0 0  2 0 1 0 0 1  0  0  1.00000000
## 8  0  37 0 0  5 2 0 1 0 2  1  4  1.00000000
## 9  0  44 0 0  6 0 1 0 0 2  2  4  0.96610169
## 10 0  25 0 0  2 1 0 0 0 2  0  5  1.00000000
## 11 0   9 0 0  1 1 0 0 0 4  3  6  0.75000000

The non-linear regression using the mgcv package

library(mgcv)

shepmgcv <- gam(acct3310gradesdummy ~  s(agegroupdummy, k=4) + s(acct1110gradesdummy, k = 4) + s(acct1210gradesdummy, k= 4) + s(acct3310attempts, k=4) + s(buqu1130gradesdummy, k=5) + s(buqu1230gradesdummy, k=5) +s(acct1210attempts, k=4),  data = caseid)

Get a summary of results

summary(shepmgcv)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## acct3310gradesdummy ~ s(agegroupdummy, k = 4) + s(acct1110gradesdummy, 
##     k = 4) + s(acct1210gradesdummy, k = 4) + s(acct3310attempts, 
##     k = 4) + s(buqu1130gradesdummy, k = 5) + s(buqu1230gradesdummy, 
##     k = 5) + s(acct1210attempts, k = 4)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.9864     0.1318   37.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                          edf Ref.df     F  p-value    
## s(agegroupdummy)       2.172  2.552 2.182   0.0992 .  
## s(acct1110gradesdummy) 1.295  1.523 5.032   0.0141 *  
## s(acct1210gradesdummy) 2.562  2.863 9.089 1.47e-05 ***
## s(acct3310attempts)    1.622  1.933 0.508   0.5958    
## s(buqu1130gradesdummy) 1.352  1.624 4.615   0.0172 *  
## s(buqu1230gradesdummy) 1.249  1.453 5.486   0.0112 *  
## s(acct1210attempts)    1.000  1.000 0.259   0.6111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.297   Deviance explained = 31.9%
## GCV =    6.6  Scale est. = 6.3797    n = 367

then a visualization, first of the two variables of interest to readers

vis.gam(shepmgcv, view=c("acct1210gradesdummy","agegroupdummy", col="topo"), plot.type = "contour")

Finally a diagnostics check of the model

gam.check(shepmgcv)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 12 iterations.
## The RMS GCV score gradiant at convergence was 3.894731e-07 .
## The Hessian was positive definite.
## The estimated model rank was 24 (maximum possible: 24)
## Model rank =  24 / 24 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                           k'   edf k-index p-value
## s(agegroupdummy)       3.000 2.172   1.008    0.55
## s(acct1110gradesdummy) 3.000 1.295   0.994    0.41
## s(acct1210gradesdummy) 3.000 2.562   1.013    0.62
## s(acct3310attempts)    3.000 1.622   1.009    0.55
## s(buqu1130gradesdummy) 4.000 1.352   1.047    0.84
## s(buqu1230gradesdummy) 4.000 1.249   0.994    0.46
## s(acct1210attempts)    3.000 1.000   0.990    0.40

Now doing the checks on the changes in grade variance, withdrawals, mean attempts and occurrence of A grades. Load a different dataset, and then the changepoint library. The code that follows tells R to show the four plots in a grid, and also provide the date points when changes occurred:

poisson <- read.csv("C:/Users/Stephen/Dropbox/Public/poisson.csv")
library(changepoint)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Successfully loaded changepoint package version 1.1.5
## Created on 2014-06-25
##  Substantial changes to the structure of the package have occured from version 1.0.  Please see the package NEWS for details.
par(mfcol=c(2,2))

shepmeanvar <- cpt.meanvar(poisson$astudent, test.stat = "Poisson", method = "PELT")
plot(shepmeanvar, cpt.width = 3, main = "'A' Grade Occurrence")
cpts.ts(shepmeanvar)
## [1] 396 431 586 611 786
shepwithdrawals <- cpt.meanvar(poisson$withdraw, test.stat = "Poisson", method = "PELT")
plot(shepwithdrawals, cpt.width = 3, main = "Withdrawals")
cpts.ts(shepwithdrawals)
## [1]  72  79 164
attempts<- cpt.mean(poisson$Attempts, method = "PELT")
plot(attempts, type = "l", cpt.col = "blue", xlab = "Time", cpt.width = 3, main = "Mean Attempts")
names(poisson)
## [1] "Term"        "Lettergrade" "Grade"       "Attempts"    "astudent"   
## [6] "withdraw"
gradevar <- cpt.var(poisson$Grade, method = "PELT")
plot(gradevar, type = "l", cpt.col = "blue", xlab = "Time", cpt.width = 3, main = "Grade Variance")

cpts.ts(gradevar)
## [1] 102 146 149 763 766