Bottom Line Up Front: Students performed reasonably well, with our worst student benchmarking at the 58% and our best student benchmarking at the 98%. In the first section, I pull in the data (having added gender manually).
#####################Read and Pre-Clean the Data#######################
library(psych) #to describe
## Warning: package 'psych' was built under R version 3.5.3
library(reticulate) #to use Python in R as well
## Warning: package 'reticulate' was built under R version 3.5.3
mydata=read.csv("C:/Users/lf25/OneDrive - Texas State University/BHA2/Peregrine Analysis/fall2018posttest.csv")
mydata[,1:2]=NULL
str(mydata)
## 'data.frame': 33 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 1 1 1 1 1 ...
## $ Ethnicity : Factor w/ 4 levels "A","B","C","H": 3 2 4 4 4 4 3 3 3 2 ...
## $ Fin : int 60 50 60 60 50 80 70 40 70 60 ...
## $ Mgt : int 70 60 50 70 70 80 100 70 100 70 ...
## $ Pers : int 70 80 70 50 60 70 80 80 80 70 ...
## $ SysOrg : int 80 50 90 60 70 50 40 60 80 60 ...
## $ IM : int 40 60 40 60 60 90 50 50 50 50 ...
## $ Ldrship : int 60 80 70 60 70 50 60 90 60 90 ...
## $ ChangeMgt : int 40 40 70 80 60 60 90 60 40 90 ...
## $ ClimCult : int 80 80 80 80 100 70 70 90 90 70 ...
## $ QI : int 60 50 50 60 60 60 60 80 80 50 ...
## $ Strat_Mark: int 70 70 70 70 80 60 40 60 70 90 ...
## $ Comm_Envir: int 60 80 70 60 60 70 80 80 50 70 ...
## $ Legal : int 70 70 60 80 70 70 90 80 70 70 ...
## $ Score : num 63.3 64.2 65 65.8 67.5 ...
## $ Duration : num 66.7 71.2 57.8 74 53.9 ...
## $ Percent : int 58 61 64 67 72 72 77 79 79 79 ...
#########################################################################
N=33 students took the exit examination, and 33% (11) of the students who took it were male. The racial / ethnic breakdown (non-Census) follows: 20 Caucasion, 5 African American non Hispanic, 5 Hispanic, 3 Asian.
I manually coded the variable “Minority Group” with non-Census Bureau categories of {C = caucasion / non-Hispanic, B = African American or Associated Minority Group / non-Hispanic, A = Asian / Other, H = Hispanic regardless of C or B Primary Classification}. I used this coding scheme to reflect that we are a Hispanic-serving institution. Eventually, I will categorize by race and ethnicity separately using Census coding.
Overall, the students performed well based on both scores and national percentile comparisons. Our worst student was in the 58th percentile, while our best student was in the 98th percentile. You can see this descriptive statistics below.
The “average” student earned a 73.76% (sd=5.43%) with a mean percentile of 84.85% (sd=10.99%). That “average” student took 55.53 minutes (sd=12.07 minutes) to complete the 120-question examination, and earned the lowest mean score of 65.45 (sd=16.97%) in Information Management and the highest mean score of 80.30 (sd=12.37%) in HC Personnel. Fifty percent of the students were in the 90% or above.
Our “low-hanging fruit” will be in the areas of Change Management (highest variation, long left tail) and IM (long left tail). Looking at the histograms, the distribution for Finance and Quality is nearly uniform, which may reflect the difficulty of the content / instructors (three different individuals for quality).
#############################Descriptives 1##############################
describe(mydata[,3:17])
## vars n mean sd median trimmed mad min max range
## Fin 1 33 70.00 14.36 70 70.00 14.83 40.00 100.00 60.00
## Mgt 2 33 77.88 12.69 80 77.78 14.83 50.00 100.00 50.00
## Pers 3 33 80.30 12.37 80 80.74 14.83 50.00 100.00 50.00
## SysOrg 4 33 70.91 13.78 70 71.48 14.83 40.00 100.00 60.00
## IM 5 33 65.45 16.97 60 64.44 14.83 40.00 100.00 60.00
## Ldrship 6 33 76.97 14.25 80 77.78 14.83 50.00 100.00 50.00
## ChangeMgt 7 33 73.94 18.53 80 75.56 14.83 30.00 100.00 70.00
## ClimCult 8 33 80.00 10.90 80 80.00 14.83 60.00 100.00 40.00
## QI 9 33 70.30 14.89 70 70.74 14.83 40.00 90.00 50.00
## Strat_Mark 10 33 73.03 14.89 70 73.33 14.83 40.00 100.00 60.00
## Comm_Envir 11 33 70.00 11.73 70 70.00 14.83 50.00 90.00 40.00
## Legal 12 33 76.36 11.13 80 76.30 14.83 60.00 100.00 40.00
## Score 13 33 73.76 5.43 75 73.89 6.17 63.33 85.83 22.50
## Duration 14 33 55.53 12.07 56 55.21 12.41 34.37 82.93 48.56
## Percent 15 33 84.85 10.99 90 86.22 7.41 58.00 98.00 40.00
## skew kurtosis se
## Fin -0.12 -0.82 2.50
## Mgt -0.06 -0.76 2.21
## Pers -0.34 -0.32 2.15
## SysOrg -0.37 -0.09 2.40
## IM 0.48 -0.62 2.95
## Ldrship -0.42 -0.92 2.48
## ChangeMgt -0.79 -0.44 3.23
## ClimCult 0.00 -0.64 1.90
## QI -0.16 -1.22 2.59
## Strat_Mark -0.18 -0.72 2.59
## Comm_Envir -0.11 -0.95 2.04
## Legal 0.06 -1.02 1.94
## Score -0.13 -0.75 0.94
## Duration 0.15 -0.77 2.10
## Percent -0.93 -0.26 1.91
par(mai=c(.3,.3,.3,.3))
par(mfrow=c(5,4))
table(mydata$Gender)
##
## F M
## 22 11
table(mydata$Ethnicity)
##
## A B C H
## 3 5 20 5
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
gf=ggplot(mydata, aes(x=Score, fill=Ethnicity))+
geom_histogram(position="identity", colour="grey40", alpha=0.2, bins =5)+facet_grid(.~Gender)
gf
for (i in 3:ncol(mydata)){
hist(mydata[,i], ylim=c(0,15), xlim=c(0,100),xlab=NULL, ylab=NULL, main=colnames(mydata)[i])
boxplot(mydata[,i],horizontal=TRUE,xlab=NULL,ylim=c(0,100), ylab=NULL, main=colnames(mydata)[i])}
myagg=aggregate(mydata[,3:17], by=list(mydata$Gender), mean)
myagg
## Group.1 Fin Mgt Pers SysOrg IM Ldrship ChangeMgt
## 1 F 67.72727 77.27273 79.54545 70.90909 64.09091 74.09091 77.27273
## 2 M 74.54545 79.09091 81.81818 70.90909 68.18182 82.72727 67.27273
## ClimCult QI Strat_Mark Comm_Envir Legal Score Duration
## 1 78.63636 70.00000 71.81818 67.27273 75.90909 72.87636 55.81136
## 2 82.72727 70.90909 75.45455 75.45455 77.27273 75.52727 54.95909
## Percent
## 1 83.18182
## 2 88.18182
myagg2=aggregate(mydata[,3:17], by=list(mydata$Ethnicity), mean)
myagg2
## Group.1 Fin Mgt Pers SysOrg IM Ldrship ChangeMgt
## 1 A 66.66667 86.66667 70 70.0 76.66667 70 66.66667
## 2 B 72.00000 72.00000 78 70.0 64.00000 86 72.00000
## 3 C 71.00000 79.50000 86 71.5 65.00000 78 76.00000
## 4 H 66.00000 72.00000 66 70.0 62.00000 68 72.00000
## ClimCult QI Strat_Mark Comm_Envir Legal Score Duration Percent
## 1 73.33333 76.66667 76.66667 66.66667 90.0 74.16333 52.5500 87.66667
## 2 82.00000 54.00000 76.00000 74.00000 70.0 72.49600 62.1280 82.60000
## 3 81.00000 76.00000 73.50000 70.00000 77.5 75.41400 53.3635 87.95000
## 4 78.00000 60.00000 66.00000 68.00000 70.0 68.16600 59.3680 73.00000
#########################################################################
pd=import("pandas")
df=mydata
pd$DataFrame$describe(df)
## Fin Mgt Pers SysOrg IM Ldrship
## count 33.00000 33.00000 33.00000 33.00000 33.00000 33.00000
## mean 70.00000 77.87879 80.30303 70.90909 65.45455 76.96970
## std 14.36141 12.68798 12.37054 13.77580 16.97257 14.24887
## min 40.00000 50.00000 50.00000 40.00000 40.00000 50.00000
## 25% 60.00000 70.00000 70.00000 60.00000 50.00000 70.00000
## 50% 70.00000 80.00000 80.00000 70.00000 60.00000 80.00000
## 75% 80.00000 90.00000 90.00000 80.00000 80.00000 90.00000
## max 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
## ChangeMgt ClimCult QI Strat_Mark Comm_Envir Legal
## count 33.00000 33.00000 33.00000 33.00000 33.00000 33.00000
## mean 73.93939 80.00000 70.30303 73.03030 70.00000 76.36364
## std 18.53028 10.89725 14.89229 14.89229 11.72604 11.12940
## min 30.00000 60.00000 40.00000 40.00000 50.00000 60.00000
## 25% 60.00000 70.00000 60.00000 60.00000 60.00000 70.00000
## 50% 80.00000 80.00000 70.00000 70.00000 70.00000 80.00000
## 75% 90.00000 90.00000 80.00000 80.00000 80.00000 80.00000
## max 100.00000 100.00000 90.00000 100.00000 90.00000 100.00000
## Score Duration Percent
## count 33.000000 33.00000 33.00000
## mean 73.760000 55.52727 84.84848
## std 5.425101 12.06972 10.98613
## min 63.330000 34.37000 58.00000
## 25% 70.000000 45.83000 79.00000
## 50% 75.000000 56.00000 90.00000
## 75% 77.500000 64.23000 93.00000
## max 85.830000 82.93000 98.00000
The final score is a linear combination of the other scores, and percentile rank derives from that. Percentile rank is omitted for the correlation analysis, and Spearman’s is used to avoid assumptions of normality necessary for Pearson’s. While final score is retained, it must be correlated with subscores as it is built from them.
Outside of the final score (which is built from the other components), the strongest negative correlation was between duration of the examination and performance on healthcare personnel ( \(\rho\) =-.39, p<.05). The strongest positive Spearman’s correlation was between Legal and Management (\(\rho\)=.46, p<.01).
A scatterplot illustrates that Duration is not related to overall final grade performance. There are some localized associations between duration and subscores, though.
#############################Descriptives 2##############################
mycor=round(cor(mydata[,-c(1,2,17)], method="spearman"),2) #eliminate final / percent rank
#by definition, final is a linear combination of other scores
#% rank follows from that.
mycor[upper.tri(mycor)==TRUE]=NA
cor.test(mydata$Pers,mydata$Duration, method="spearman")
## Warning in cor.test.default(mydata$Pers, mydata$Duration, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata$Pers and mydata$Duration
## S = 8342.2, p-value = 0.02325
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.394088
cor.test(mydata$Legal, mydata$Mgt, method="spearman")
## Warning in cor.test.default(mydata$Legal, mydata$Mgt, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mydata$Legal and mydata$Mgt
## S = 3231.7, p-value = 0.007079
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.4599441
library(reshape2)
meltcor=melt(mycor)
library(ggplot2)
ggplot(data = meltcor, aes(Var2, Var1, fill = value))+
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 10, hjust = 1))+
coord_fixed()
ggplot(mydata, aes(x=Duration, y=Score)) +
geom_point()+
geom_smooth(method=lm)
# Remove the confidence interval
ggplot(mydata, aes(x=Duration, y=Score)) +
geom_point()+
geom_smooth(method=lm, se=FALSE)
# Loess method
ggplot(mydata, aes(x=Duration, y=Score)) +
geom_point()+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
#########################################################################
I intend to match student performance in each area with grades in associated classes / overall; however, Catsweb is down.
While normality does not perfectly hold true for the distributions, I ran parametric rather than non-parametric for ease in undertanding. Further, the final score is not terribly skewed. (Transformations or non-parametric tests are generally preferred.)
A t-test of the final score vs. gender was not significant. An ANOVA of final score vs. ethnicity was significant (p<.05) with the difference between Hispanic and Caucasian scores being the only statistical difference found in post-hoc analysis.
Insufficient power exists to do much inference right now. As the data set grows over time and we have pre-post scores, the analysis will become more robust.
######################Basic Inferentials########################
myt=t.test(mydata$Score~mydata$Gender)
myt
##
## Welch Two Sample t-test
##
## data: mydata$Score by mydata$Gender
## t = -1.3674, df = 21.236, p-value = 0.1858
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.679879 1.378061
## sample estimates:
## mean in group F mean in group M
## 72.87636 75.52727
myaov=aov(mydata$Score~mydata$Ethnicity)
summary(myaov)
## Df Sum Sq Mean Sq F value Pr(>F)
## mydata$Ethnicity 3 219.7 73.22 2.94 0.0497 *
## Residuals 29 722.2 24.90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mytukeys=TukeyHSD(myaov)
mytukeys
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mydata$Score ~ mydata$Ethnicity)
##
## $`mydata$Ethnicity`
## diff lwr upr p adj
## B-A -1.667333 -11.59633 8.2616640 0.9676142
## C-A 1.250667 -7.16705 9.6683837 0.9771528
## H-A -5.997333 -15.92633 3.9316640 0.3699565
## C-B 2.918000 -3.87992 9.7159197 0.6504030
## H-B -4.330000 -12.92876 4.2687639 0.5263976
## H-C -7.248000 -14.04592 -0.4500803 0.0332620
plot(mytukeys)
################################################################