Peregrine Overall Results

Bottom Line Up Front: Students performed reasonably well, with our worst student benchmarking at the 58th percentile nationally and our best student at the 98th percentile. In the first section, I pull in the data (having added gender manually).

#####################Read and Pre-Clean the Data#######################
library(psych) #to describe
## Warning: package 'psych' was built under R version 3.5.3
library(reticulate) #to use Python in R as well
## Warning: package 'reticulate' was built under R version 3.5.3
mydata=read.csv("C:/Users/lf25/OneDrive - Texas State University/BHA2/Peregrine Analysis/fall2018posttest.csv")
mydata[,1:2]=NULL   #drop the first two columns, which are not used in the analysis
str(mydata)
## 'data.frame':    33 obs. of  17 variables:
##  $ Gender    : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Ethnicity : Factor w/ 4 levels "A","B","C","H": 3 2 4 4 4 4 3 3 3 2 ...
##  $ Fin       : int  60 50 60 60 50 80 70 40 70 60 ...
##  $ Mgt       : int  70 60 50 70 70 80 100 70 100 70 ...
##  $ Pers      : int  70 80 70 50 60 70 80 80 80 70 ...
##  $ SysOrg    : int  80 50 90 60 70 50 40 60 80 60 ...
##  $ IM        : int  40 60 40 60 60 90 50 50 50 50 ...
##  $ Ldrship   : int  60 80 70 60 70 50 60 90 60 90 ...
##  $ ChangeMgt : int  40 40 70 80 60 60 90 60 40 90 ...
##  $ ClimCult  : int  80 80 80 80 100 70 70 90 90 70 ...
##  $ QI        : int  60 50 50 60 60 60 60 80 80 50 ...
##  $ Strat_Mark: int  70 70 70 70 80 60 40 60 70 90 ...
##  $ Comm_Envir: int  60 80 70 60 60 70 80 80 50 70 ...
##  $ Legal     : int  70 70 60 80 70 70 90 80 70 70 ...
##  $ Score     : num  63.3 64.2 65 65.8 67.5 ...
##  $ Duration  : num  66.7 71.2 57.8 74 53.9 ...
##  $ Percent   : int  58 61 64 67 72 72 77 79 79 79 ...
#########################################################################

Descriptive Statistics / Univariate Graphs / Crosstabs

N=33 students took the exit examination, and 33% (11) of those students were male. The racial / ethnic breakdown (non-Census) follows: 20 Caucasian, 5 African American non-Hispanic, 5 Hispanic, 3 Asian.
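For the joint breakdown, a two-way table gives the gender-by-ethnicity counts (a minimal sketch against the same mydata frame; this crosstab was not run in the original script):

table(mydata$Gender, mydata$Ethnicity)   #rows = gender, columns = ethnicity code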

I manually coded the variable “Minority Group” with non-Census Bureau categories of {C = Caucasian / non-Hispanic, B = African American or Associated Minority Group / non-Hispanic, A = Asian / Other, H = Hispanic regardless of C or B Primary Classification}. I used this coding scheme to reflect that we are a Hispanic-serving institution. Eventually, I will categorize by race and ethnicity separately using Census coding.
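When that recode happens, something along these lines could split the single code into separate ethnicity and race fields (a sketch only; the mapping shown is an assumption for illustration, not the Census Bureau's actual scheme):

#Hypothetical split of the single A/B/C/H code into separate ethnicity and race
#fields; the labels below are illustrative, not an official Census recode.
mydata$HispFlag = ifelse(mydata$Ethnicity == "H", "Hispanic", "Non-Hispanic")
mydata$Race     = factor(mydata$Ethnicity, levels = c("A", "B", "C", "H"),
                         labels = c("Asian/Other", "Black", "White", "Unknown"))
table(mydata$HispFlag, mydata$Race)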

Overall, the students performed well based on both scores and national percentile comparisons. Our worst student was in the 58th percentile, while our best student was in the 98th percentile. The descriptive statistics appear below.

The “average” student earned a 73.76% (sd=5.43%) with a mean national percentile of 84.85 (sd=10.99). That “average” student took 55.53 minutes (sd=12.07 minutes) to complete the 120-question examination, and earned the lowest mean score, 65.45% (sd=16.97%), in Information Management and the highest mean score, 80.30% (sd=12.37%), in HC Personnel. Fifty percent of the students placed at or above the 90th percentile.
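The last claim follows from the median of the Percent column (90 in the describe() output below); a one-line check (a sketch, not part of the original script) confirms the share:

mean(mydata$Percent >= 90)   #proportion of students at or above the 90th national percentile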

Our “low-hanging fruit” will be in the areas of Change Management (highest variation, long left tail) and Information Management (lowest mean score). Looking at the histograms, the distributions for Finance and Quality are nearly uniform, which may reflect the difficulty of the content / instructors (three different individuals for quality).
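One way to rank the domains for improvement is to order the describe() output by spread and skew (a sketch; columns 3:14 are assumed to be the twelve subscores, per str(mydata) above):

library(psych)
domains = describe(mydata[, 3:14])                     #subscore columns only
domains[order(-domains$sd), c("mean", "sd", "skew")]   #largest spread first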

#############################Descriptives 1##############################
describe(mydata[,3:17])
##            vars  n  mean    sd median trimmed   mad   min    max range
## Fin           1 33 70.00 14.36     70   70.00 14.83 40.00 100.00 60.00
## Mgt           2 33 77.88 12.69     80   77.78 14.83 50.00 100.00 50.00
## Pers          3 33 80.30 12.37     80   80.74 14.83 50.00 100.00 50.00
## SysOrg        4 33 70.91 13.78     70   71.48 14.83 40.00 100.00 60.00
## IM            5 33 65.45 16.97     60   64.44 14.83 40.00 100.00 60.00
## Ldrship       6 33 76.97 14.25     80   77.78 14.83 50.00 100.00 50.00
## ChangeMgt     7 33 73.94 18.53     80   75.56 14.83 30.00 100.00 70.00
## ClimCult      8 33 80.00 10.90     80   80.00 14.83 60.00 100.00 40.00
## QI            9 33 70.30 14.89     70   70.74 14.83 40.00  90.00 50.00
## Strat_Mark   10 33 73.03 14.89     70   73.33 14.83 40.00 100.00 60.00
## Comm_Envir   11 33 70.00 11.73     70   70.00 14.83 50.00  90.00 40.00
## Legal        12 33 76.36 11.13     80   76.30 14.83 60.00 100.00 40.00
## Score        13 33 73.76  5.43     75   73.89  6.17 63.33  85.83 22.50
## Duration     14 33 55.53 12.07     56   55.21 12.41 34.37  82.93 48.56
## Percent      15 33 84.85 10.99     90   86.22  7.41 58.00  98.00 40.00
##             skew kurtosis   se
## Fin        -0.12    -0.82 2.50
## Mgt        -0.06    -0.76 2.21
## Pers       -0.34    -0.32 2.15
## SysOrg     -0.37    -0.09 2.40
## IM          0.48    -0.62 2.95
## Ldrship    -0.42    -0.92 2.48
## ChangeMgt  -0.79    -0.44 3.23
## ClimCult    0.00    -0.64 1.90
## QI         -0.16    -1.22 2.59
## Strat_Mark -0.18    -0.72 2.59
## Comm_Envir -0.11    -0.95 2.04
## Legal       0.06    -1.02 1.94
## Score      -0.13    -0.75 0.94
## Duration    0.15    -0.77 2.10
## Percent    -0.93    -0.26 1.91
par(mai=c(.3,.3,.3,.3))
par(mfrow=c(5,4))

table(mydata$Gender)
## 
##  F  M 
## 22 11
table(mydata$Ethnicity)
## 
##  A  B  C  H 
##  3  5 20  5
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
gf=ggplot(mydata, aes(x=Score, fill=Ethnicity))+
  geom_histogram(position="identity", colour="grey40", alpha=0.2, bins =5)+facet_grid(.~Gender)
gf

for (i in 3:ncol(mydata)){
  hist(mydata[,i], ylim=c(0,15), xlim=c(0,100),xlab=NULL, ylab=NULL, main=colnames(mydata)[i])
  boxplot(mydata[,i],horizontal=TRUE,xlab=NULL,ylim=c(0,100), ylab=NULL, main=colnames(mydata)[i])}

myagg=aggregate(mydata[,3:17], by=list(mydata$Gender), mean)
myagg
##   Group.1      Fin      Mgt     Pers   SysOrg       IM  Ldrship ChangeMgt
## 1       F 67.72727 77.27273 79.54545 70.90909 64.09091 74.09091  77.27273
## 2       M 74.54545 79.09091 81.81818 70.90909 68.18182 82.72727  67.27273
##   ClimCult       QI Strat_Mark Comm_Envir    Legal    Score Duration
## 1 78.63636 70.00000   71.81818   67.27273 75.90909 72.87636 55.81136
## 2 82.72727 70.90909   75.45455   75.45455 77.27273 75.52727 54.95909
##    Percent
## 1 83.18182
## 2 88.18182
myagg2=aggregate(mydata[,3:17], by=list(mydata$Ethnicity), mean)
myagg2
##   Group.1      Fin      Mgt Pers SysOrg       IM Ldrship ChangeMgt
## 1       A 66.66667 86.66667   70   70.0 76.66667      70  66.66667
## 2       B 72.00000 72.00000   78   70.0 64.00000      86  72.00000
## 3       C 71.00000 79.50000   86   71.5 65.00000      78  76.00000
## 4       H 66.00000 72.00000   66   70.0 62.00000      68  72.00000
##   ClimCult       QI Strat_Mark Comm_Envir Legal    Score Duration  Percent
## 1 73.33333 76.66667   76.66667   66.66667  90.0 74.16333  52.5500 87.66667
## 2 82.00000 54.00000   76.00000   74.00000  70.0 72.49600  62.1280 82.60000
## 3 81.00000 76.00000   73.50000   70.00000  77.5 75.41400  53.3635 87.95000
## 4 78.00000 60.00000   66.00000   68.00000  70.0 68.16600  59.3680 73.00000
#########################################################################

pd=import("pandas")
df=mydata
pd$DataFrame$describe(df)
##             Fin       Mgt      Pers    SysOrg        IM   Ldrship
## count  33.00000  33.00000  33.00000  33.00000  33.00000  33.00000
## mean   70.00000  77.87879  80.30303  70.90909  65.45455  76.96970
## std    14.36141  12.68798  12.37054  13.77580  16.97257  14.24887
## min    40.00000  50.00000  50.00000  40.00000  40.00000  50.00000
## 25%    60.00000  70.00000  70.00000  60.00000  50.00000  70.00000
## 50%    70.00000  80.00000  80.00000  70.00000  60.00000  80.00000
## 75%    80.00000  90.00000  90.00000  80.00000  80.00000  90.00000
## max   100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
##       ChangeMgt  ClimCult       QI Strat_Mark Comm_Envir     Legal
## count  33.00000  33.00000 33.00000   33.00000   33.00000  33.00000
## mean   73.93939  80.00000 70.30303   73.03030   70.00000  76.36364
## std    18.53028  10.89725 14.89229   14.89229   11.72604  11.12940
## min    30.00000  60.00000 40.00000   40.00000   50.00000  60.00000
## 25%    60.00000  70.00000 60.00000   60.00000   60.00000  70.00000
## 50%    80.00000  80.00000 70.00000   70.00000   70.00000  80.00000
## 75%    90.00000  90.00000 80.00000   80.00000   80.00000  80.00000
## max   100.00000 100.00000 90.00000  100.00000   90.00000 100.00000
##           Score Duration  Percent
## count 33.000000 33.00000 33.00000
## mean  73.760000 55.52727 84.84848
## std    5.425101 12.06972 10.98613
## min   63.330000 34.37000 58.00000
## 25%   70.000000 45.83000 79.00000
## 50%   75.000000 56.00000 90.00000
## 75%   77.500000 64.23000 93.00000
## max   85.830000 82.93000 98.00000

Correlations

The final score is a linear combination of the subscores, and the percentile rank derives from the final score. Percentile rank is therefore omitted from the correlation analysis, and Spearman’s correlation is used to avoid the normality assumption required for Pearson’s. The final score is retained, but it is necessarily correlated with the subscores because it is built from them.

Outside of the final score (which is built from the other components), the strongest negative correlation was between examination duration and performance on HC Personnel (\(\rho\) = -.39, p < .05). The strongest positive Spearman’s correlation was between Legal and Management (\(\rho\) = .46, p < .01).

A scatterplot illustrates that Duration is not related to overall final score. There are some localized associations between duration and subscores, though.
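To see those localized associations, duration can be correlated against each subscore in one pass (a sketch; columns 3:14 are assumed to be the subscores, per str(mydata) above):

sapply(mydata[, 3:14], function(x)
  cor(x, mydata$Duration, method = "spearman"))   #Spearman rho of each subscore with Duration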

#############################Descriptives 2##############################
mycor=round(cor(mydata[,-c(1,2,17)], method="spearman"),2)  #eliminate gender, ethnicity, percentile rank
#by definition, final is a linear combination of other scores
#% rank follows from that.
mycor[upper.tri(mycor)]=NA   #keep the lower triangle only
cor.test(mydata$Pers,mydata$Duration, method="spearman")
## Warning in cor.test.default(mydata$Pers, mydata$Duration, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mydata$Pers and mydata$Duration
## S = 8342.2, p-value = 0.02325
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.394088
cor.test(mydata$Legal, mydata$Mgt, method="spearman")
## Warning in cor.test.default(mydata$Legal, mydata$Mgt, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  mydata$Legal and mydata$Mgt
## S = 3231.7, p-value = 0.007079
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4599441
library(reshape2)
meltcor=melt(mycor)

library(ggplot2)
ggplot(data = meltcor, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white") +
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 10, hjust = 1))+
 coord_fixed()

ggplot(mydata, aes(x=Duration, y=Score)) + 
  geom_point()+
  geom_smooth(method=lm)

# Remove the confidence interval
ggplot(mydata, aes(x=Duration, y=Score)) + 
  geom_point()+
  geom_smooth(method=lm, se=FALSE)

# Loess method
ggplot(mydata, aes(x=Duration, y=Score)) + 
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#########################################################################

Inference

I intend to match student performance in each area with grades in associated classes / overall; however, Catsweb is down.

While normality does not perfectly hold for these distributions, I ran parametric rather than non-parametric tests for ease of interpretation; the final score is not terribly skewed in any case. (Transformations or non-parametric tests would generally be preferred.)
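As a robustness check (a sketch; these rank-based analogues were not run in the original report), the same comparisons can be rerun non-parametrically:

wilcox.test(Score ~ Gender, data = mydata)       #rank-based analogue of the t-test
kruskal.test(Score ~ Ethnicity, data = mydata)   #rank-based analogue of the one-way ANOVA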

A t-test of final score by gender was not significant. An ANOVA of final score by ethnicity was significant (p<.05), with the Hispanic-Caucasian difference being the only statistically significant contrast in the post-hoc (Tukey) analysis.

Insufficient power exists to do much inference right now. As the data set grows over time and we have pre-post scores, the analysis will become more robust.
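As a forward-looking sketch of that pre-post analysis: PreScore below is a hypothetical pretest column that does not exist yet, and the simulated line is only there so the example runs end to end.

#Hypothetical paired pre/post comparison. PreScore is a placeholder for a future
#pretest column; the simulated values exist only to make the sketch runnable.
set.seed(42)
mydata$PreScore = pmax(0, mydata$Score - abs(rnorm(nrow(mydata), mean = 5, sd = 5)))
t.test(mydata$Score, mydata$PreScore, paired = TRUE)   #paired t-test of post vs. pre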

######################Basic Inferentials########################
myt=t.test(mydata$Score~mydata$Gender)
myt
## 
##  Welch Two Sample t-test
## 
## data:  mydata$Score by mydata$Gender
## t = -1.3674, df = 21.236, p-value = 0.1858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.679879  1.378061
## sample estimates:
## mean in group F mean in group M 
##        72.87636        75.52727
myaov=aov(mydata$Score~mydata$Ethnicity)
summary(myaov)
##                  Df Sum Sq Mean Sq F value Pr(>F)  
## mydata$Ethnicity  3  219.7   73.22    2.94 0.0497 *
## Residuals        29  722.2   24.90                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mytukeys=TukeyHSD(myaov)
mytukeys
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mydata$Score ~ mydata$Ethnicity)
## 
## $`mydata$Ethnicity`
##          diff       lwr        upr     p adj
## B-A -1.667333 -11.59633  8.2616640 0.9676142
## C-A  1.250667  -7.16705  9.6683837 0.9771528
## H-A -5.997333 -15.92633  3.9316640 0.3699565
## C-B  2.918000  -3.87992  9.7159197 0.6504030
## H-B -4.330000 -12.92876  4.2687639 0.5263976
## H-C -7.248000 -14.04592 -0.4500803 0.0332620
plot(mytukeys)

################################################################