Student records systems impose their own generic table structures on a database. This means a report based on data sourced via a direct connection to that database necessarily reflects the table structure and field codes imposed by the records system.
Such a report should be reproducible on a different database, at a different institution, as long as it uses the same records system.
The market for student records systems tends to be dominated by one supplier, along with one large competitor, so a report is immediately applicable to a large proportion of other institutions, often the majority.
What opportunities might this create? It could enable analyses to be replicated for peer-to-peer comparison without compromising the underlying data, or reporting learnt at one institution to be integrated into the practice of others.
Processing and analysis of student records could reliably be conducted ‘blind’, for example on examination marks yet to be ratified, or on anonymised personal data. Such analyses could be specified and then run by administrators as third parties, without compromising blind marking or individuals’ privacy.
Such common records systems invite reproducible reporting among the institutions that share them. They are also relatively easy to port to others for similar reporting purposes.
National reporting in education provides another layer of standardised data, in both the format and the coding in which it’s submitted, as does any data derived from it and returned to institutions.
For example, the data structures and coding frames of statutory returns such as the HESA Student Record and others in the UK are fully documented by the national agency, so that all institutions and records system providers can generate data with not just a common structure but also common coding.
Such datasets for statutory returns also invite reproducible reporting, common to all institutions. Some elements are already combined and made available by national agencies, for example the Unistats Dataset in the UK.
Most educational reporting consists of simple descriptive summaries and statistics. The tools and techniques required are also relatively simple. It’s access to the data, and to the tools to analyse it, that is more often problematic.
library('RODBC')       # ODBC connection to the records database
library('YaleToolkit') # whatis() to describe a data frame
library('dplyr')       # data manipulation
library('tidyr')       # reshaping between long and wide formats
library('ggplot2')     # plotting
library('scales')      # percent formatting for axes
library('ggthemes')    # scale_fill_few() and scale_colour_few()
library('ztable')      # HTML tables
Set options for some of them:
options(ztable.type="html") # ztable
knitr::opts_chunk$set(comment = NA) # knitr
theme_set(theme_minimal()) # ggplot2
Direct access to the database that holds the Student Records System is an absolute requirement: it allows the data to be extracted in raw form, with the actual field codes as variable names, rather than obscured by other names or by querying applications.
In this case I use an ODBC connection:
source("../ODBC Source/odbc.source.r") # sourcing my credentials and DSN
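The sourced file itself isn’t shown here. A minimal sketch of what it might contain, assuming a hypothetical DSN and credentials held in environment variables, and using RODBC’s odbcConnect():

# odbc.source.r -- illustrative sketch only; the DSN and variable names are assumptions
# It creates the 'odbc' connection object used by sqlQuery() below
library('RODBC')
odbc <- odbcConnect(dsn = 'SRS_DSN'               # hypothetical Data Source Name
                    , uid = Sys.getenv('SRS_UID') # credentials kept out of the script
                    , pwd = Sys.getenv('SRS_PWD'))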
The query selects data from tables at the core of the records system, reflecting their relations in each join. It also includes a reference table used to relate codes to names or values, in this case Gender:
table <- 'smr'
query <- paste("
select smr.ayr_code, smr_agrm, smr_agrg, smr_rslt, smr_cred
, gen_code, gen_name
from intuit.ins_smr smr -- student module results
left outer join intuit.ins_spr spr -- student programme
on smr.spr_code = spr.spr_code
left outer join intuit.srs_gen gen -- gender table
on spr_gend = gen_code
where smr_yrno between 2010 and 2013
and mod_code like 'ST202'
and smr_mksc = 'BA/BSC'
and smr_rslt in ('P','F')
")
dfr <- sqlQuery(odbc, query, stringsAsFactors = FALSE)
filename <- paste0(table, '.data ', Sys.Date(), '.RData')
# load("smr.data 2015-03-01.RData")
Choosing undergraduate level II bachelors students in Statistics is a reference to Exploring Student Histories, a talk given by Paul Murrell, author of grid, based on records for his STATS 220 students:
The slides from a talk on Exploring Student Histories (pdf) (plus a supporting data set and R code) given at The University of Auckland (August 2009).
In slide 4 of Murrell’s talk there’s some uncertainty over the significance of two variables, Take Prgrs and Pass Prgrs, and the distinction between them. They relate to progression rules, the points taken and passed, which determine whether a student may progress on their course or not.
The confusion is likely to have arisen in preparing the documentation that accompanied the data, an “Excel file (plus file describing variables)”. Perhaps one field definition was copied and pasted in error into the next.
This shows why documenting datasets is crucial, and how open analysis allows confusion or errors to be resolved.
Documenting datasets is necessary to communicate them to others without direct access to such data, the records system or database.
A data.frame in R can be described, in this case using a convenient function from the YaleToolkit. However, the description doesn’t include meaningful labels:
dfr_wha <- whatis(dfr) # YaleToolkit
ztable(dfr_wha, digits = 1)
|   | variable.name | type      | missing | distinct.values | precision | min    | max    |
|---|---------------|-----------|---------|-----------------|-----------|--------|--------|
| 1 | AYR_CODE      | character | 0       | 4.0             |           | 2010/1 | 2013/4 |
save(dfr, file = filename)
Documentation should also include a definition of any query that produced the data.
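One lightweight way to do that, sketched here rather than taken from the original workflow, is to write the query string out alongside the saved data:

# Save the SQL that produced dfr next to the .RData file,
# so the extraction itself is documented
writeLines(query, paste0(table, '.query ', Sys.Date(), '.sql'))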
R largely eschews the labels that seem required to completely document a dataset. They arise at the margins, in relation to transferring data to or from more traditional analytical tools, such as Stata.
A handful of packages, such as Hmisc, epicalc, or DescTools, make use of them as attributes.
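At their simplest, such labels are just character attributes attached to a vector, which base R can set and read directly. A minimal illustration, with made-up values:

marks <- c(54, 67, 71)                 # hypothetical agreed marks
attr(marks, 'label') <- 'Agreed mark'  # attach a variable label
attr(marks, 'label')                   # and read it back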
The variables in this dataset are named by field code. Variable labels should be based on field names in order to communicate their meaning to recipients unfamiliar with the records system and database it’s drawn from.
Field codes and names are held in their own table, which can also be queried:
table <- 'fld'
query <- paste("
select fld_code, fld_name, fld_type
from intuit.men_fld fld -- field table
where fld_dctc = 'SRS'
and fld_entc in ('SMR','GEN')
and fld_code in ('AYR_CODE','SMR_AGRM','SMR_AGRG'
,'SMR_RSLT','SMR_CRED','GEN_CODE','GEN_NAME')
")
labs <- sqlQuery(odbc, query, stringsAsFactors = FALSE)
str(labs)
'data.frame': 7 obs. of 3 variables:
$ FLD_CODE: chr "GEN_CODE" "GEN_NAME" "AYR_CODE" "SMR_AGRG" ...
$ FLD_NAME: chr "Gender code" "Full name" "Academic Year (AYR) code" "Agreed grade" ...
$ FLD_TYPE: chr "S" "S" "S" "S" ...
filename <- paste0(table, '.data ', Sys.Date(), '.RData')
save(labs, file = filename)
# load("fld.data 2015-03-01.RData")
Close the ODBC connections; they’re no longer required:
odbcCloseAll()
remove(odbc)
Tidy up the field names in the labels data frame and create another data frame, where ‘field code’ is the name of each variable in the dataset. Join the labels to field codes.
labs[c(2:3,5), 'FLD_NAME'] <- c('Gender','Academic Year','Agreed mark')
dfr_labs <- data_frame(FLD_CODE = names(dfr)
, class = sapply(dfr, data.class))
dfr_labs <- dfr_labs %>% left_join(labs, by = 'FLD_CODE')
ztable(dfr_labs)
|   | FLD_CODE | class     | FLD_NAME      | FLD_TYPE |
|---|----------|-----------|---------------|----------|
| 1 | AYR_CODE | character | Academic Year | S        |
Field names attached as labels to the original data may be lost in some data manipulation, such as reordering, so any such manipulation must be done first, or the labelling repeated afterwards:
# grades as an ordered factor
dfr$SMR_AGRG <- reorder(dfr$SMR_AGRG, dfr$SMR_AGRM
, mean, na.rm = TRUE
, order = TRUE)
Attach Field Names as label attributes, in this case using Hmisc:
Hmisc::label(dfr, self=FALSE) <- dfr_labs$FLD_NAME
Labels, or Field Names, can then be documented along with other attributes of the data:
dfr_att <- data.frame(Label = substr(
sapply(dfr, attr, which = 'label'), 1, 24)
, Name = names(dfr)
, Class = sapply(dfr, data.class)
, Mode = sapply(dfr, mode)
, row.names = NULL)
ztable(dfr_att)
|   | Label         | Name     | Class    | Mode      |
|---|---------------|----------|----------|-----------|
| 1 | Academic Year | AYR_CODE | labelled | character |
In Hmisc, “The reason for having this class is so that the subscripting method for labelled, [.labelled, can preserve the label attribute in S. Also, the print method for labelled objects prefaces the print with the object’s label.”
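That behaviour can be checked on the labelled data frame; an illustrative aside, output not shown:

# Subscripting keeps the label attribute, and printing a labelled
# variable prefaces the output with its label
attr(dfr$AYR_CODE[1:5], 'label')
print(dfr$SMR_AGRM[1:5])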
Labels can be utilised in plotting, here on the x-axis and the legend name:
dfr %>%
ggplot(aes(AYR_CODE, fill = SMR_AGRG)) +
geom_bar(position = 'dodge', alpha = 2/3) +
facet_wrap(~GEN_NAME, nrow = 1) +
scale_fill_few('medium', drop = FALSE, name = attr(dfr$SMR_AGRG, 'label')) +
theme(legend.position = 'bottom', axis.ticks = element_blank()) +
labs(x = attr(dfr$AYR_CODE, 'label'), y = NULL
, title = 'Grades by Year & Gender')
The field GEN_NAME serves as a value label in the records system wherever a GEN_CODE appears on a record, as distinct from field names used as variable labels. It becomes a variable within the data itself by way of the join to the GEN table in the query.
As a variable, its values are more meaningful than GEN_CODE, and useful in this case because they become the facet labels.
Relatively simple tables and plots of summary statistics are core components of most routine reports. Beyond that, other statistical methods can be applied, but not without risk in the hands of non-experts. See: Reproducible research can still be wrong: Adopting a prevention approach, Jeffrey T. Leek and Roger D. Peng.
A table of counts along with proportions in long format is suitable for plotting from, but not for reading or reporting.
A count joined to percentages in wide format is more readable and suitable to report. Three measures are required: count, proportion and percentage, in two formats.
However, a label attribute prevents use of the most convenient tools for data manipulation, such as dplyr. The label has to be removed, both as an attribute and as a class:
for(i in seq_along(dfr)) {
  attr(dfr[[i]], 'label') <- NULL
  class(dfr[[i]]) <- setdiff(class(dfr[[i]]), 'labelled')
}
remove(i)
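An equivalent sketch using lapply, if the explicit loop is unappealing; it strips the same attribute and class column by column:

dfr[] <- lapply(dfr, function(x) {
  attr(x, 'label') <- NULL                  # drop the label attribute
  class(x) <- setdiff(class(x), 'labelled') # drop the 'labelled' class
  x
})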
This table is suitable for plotting from, but not reading or reporting:
dfr_prop <- dfr %>%
count(AYR_CODE, GEN_NAME) %>%
mutate(prop = n/sum(n))
dfr_prop
Source: local data frame [8 x 4]
Groups: AYR_CODE
AYR_CODE GEN_NAME n prop
1 2010/1 Female 44 0.4036697
2 2010/1 Male 65 0.5963303
3 2011/2 Female 49 0.4152542
4 2011/2 Male 69 0.5847458
5 2012/3 Female 42 0.3471074
6 2012/3 Male 79 0.6528926
7 2013/4 Female 47 0.4563107
8 2013/4 Male 56 0.5436893
Longer tables might be grade distributions for all courses or classifications of awards over several years, or previous results for a class of students. They are unlikely to be compact enough to be readable.
A table of counts joined to the associated distribution of percentages in wide format is more suitable for reading and reporting, once the columns are labelled:
dfr_n <- dfr %>%
  count(AYR_CODE)                         # total students per year
dfr_perc <- dfr_prop %>%
  mutate(perc = round(100*prop,1)) %>%    # proportions as percentages
  select(-n, -prop) %>%
  spread(GEN_NAME, perc) %>%              # one column per gender
  left_join(dfr_n, ., by = 'AYR_CODE')    # counts alongside percentages
names(dfr_perc)[1:2] <- c('Year','Students')
ztab <- ztable(as.data.frame(dfr_perc)
, digits = 1
, caption = 'Students by Year & Gender')
ztab <- addcgroup(ztab, cgroup = c('', 'Gender (%)')
, n.cgroup = c(2,2))
ztab
|   | Year   | Students | Female (%) | Male (%) |
|---|--------|----------|------------|----------|
| 1 | 2010/1 | 109      | 40.4       | 59.6     |
To visualise the same counts and proportions, labels can be added for use in plots, but it’s another awkward operation:
# Is there a general way to match/join label values to names here?
Hmisc::label(dfr_prop[1:2], self=FALSE) <- dfr_labs %>%
filter(FLD_CODE %in% names(dfr_prop)) %>%
select(FLD_NAME) %>%
unlist()
I find defining a plot with some fixed variables first, without any data, is the most flexible approach. This includes using the labels for the legend name and x-axis label:
gg_ayr <- ggplot(data = data.frame()
, aes(x = AYR_CODE, group = GEN_NAME, colour = GEN_NAME)) +
scale_color_few('medium', name = attr(dfr_prop$GEN_NAME, 'label')) +
labs(x = attr(dfr_prop$AYR_CODE, 'label'), y = NULL) +
theme(legend.position = 'bottom', axis.ticks = element_blank())
GEN_NAME is a field that exists for labelling purposes within the records system, but to label values rather than variables.
Add data with %+%, varying any particular aesthetic mapping (n or prop), and specify particular geometric transformations (line and point) along with their own parameters and scales:
gg_ayr %+% dfr_prop +
geom_line(aes(y = n)
, linetype = 2
, size = 1) +
geom_point(aes(y = n)) +
labs(title = 'Gender by Year (n)')
gg_ayr %+% dfr_prop +
geom_line(aes(y = prop)
, linetype = 2
, size = 1) +
geom_point(aes(y = prop)) +
scale_y_continuous(labels = percent, limits = c(0,1)) +
labs(title = 'Gender by Year (%)')
Generating datasets in a format that facilitates ad hoc analyses also allows these to be reliably reproduced in routine administrative reporting.
This might enable more reporting to rise above the level of basic descriptive statistics, towards analyses that would support more meaningful inferences based on robust statistical methods.
It would provide an incentive for institutions to adopt reproducible reporting as standard, to provide the infrastructure required by both faculty and administration, and to facilitate better transmission of information between them.
This might relieve faculty of some of the burden of creating their own ad hoc infrastructure, and reduce institutions’ reliance on analytics and reporting tools restricted by expensive licences.
In education, “Numbers are not enough”.
See: Macfadyen, L. P., & Dawson, S. (2012). Numbers Are Not Enough. Why e-Learning Analytics Failed to Inform an Institutional Strategic Plan. (pdf) in Educational Technology & Society, 15 (3), 149–163.
… through participant observation of committees responsible for moving the institutional LMS review and selection process forward, we noted that subsequent deliberations and decision-making focused almost exclusively on technical questions relating to “ease of migration.” Critical interpretation of the implications of data describing the institution’s current LMS use was almost entirely absent. These observations are reflected in public and private reports documenting the committee’s activities (not shown). While there is an obvious imperative to ensure that any new enterprise technology is functional, scalable and reliable, an exclusive focus on technology integration issues, in the absence of development of a pedagogical vision, quickly neutralizes the likelihood that learning analytics data may catalyze organizational change with a focus on the student experience and learning outcomes. A focus on technological issues merely generates “urgency” around technical systems and integration concerns, and fails to address the complexities and challenges of institutional culture and change.
…senior representatives of university units—such as the Deans, Heads of Departments and other members of the senior administration participating in committees charged with LMS review and selection—are typically senior faculty members rather than professional managers. … this cohort is most likely to evaluate proposed changes to the LMS infrastructure not by coherence with vision or strategy, but by assessing the degree to which any change will burden themselves and their colleagues with the need to learn how to use complex new tools, and/or the need to redesign change their teaching habits and practices, without offering any appreciable advantage or reward. Information technology managers and staff similarly are most likely to assess proposals for new technology innovations from the perspective of workload and technical compatibility with existing systems, and have an even smaller investment in student learning outcomes.
The introduction of technology creates issues for balancing the interests of those involved, and alters the extent to which they can exercise effective autonomy within their domains of expertise:
On the various interests involved, see and hear Analytics: As if learning mattered and The Learning Analytics Threat Rating, both by Adam Cooper, Deputy Director of JISC CETIS.
It is serious though. Control of education is increasingly related to control of educational data and learning analytics derived from it. Adam Cooper cites Professor Dai Griffiths, CETIS Analytics Series: The impact of analytics in Higher Education on academic practice:
The introduction of [learning analytics] techniques cannot be understood in isolation from the methods of educational management as they have grown up over the past two centuries. These methods are conditioned by the fact that educational managers are limited in their capability to monitor and act upon the range of states which are taken up by teachers and learners in their learning activities. … Teachers, meanwhile, deal as best they can with the full variety of learners’ needs in their practice. Over the years, an accommodation has developed between regulatory authorities, management and teaching professionals: educational managers indicate the goals which teachers and learners should work towards, provide a framework for them to act within, and ensure that the results of their activity meet some minimum standards. The rest is left up to the professional skills of teachers and the ethical integrity of both teachers and learners.
This accommodation has been eroded by the efforts of successive governments to increase their control over the education received by both school and higher education students. Learning Analytics radically reduces the effort involved in gathering information on the way in which lecturers deliver the curriculum, and also to automate the work of analysing this information. An alliance of these two trends has the potential to constrain teaching practice, and therefore it is necessary to take a systemic view when assessing the impact of analytics on teaching practice.
Reproducible reporting in education has the potential to appeal to a variety of these interests – academic, administrative, technical – and facilitate more productive collaboration between them.
Students would learn in an environment defined by reproducible reporting, where the skills they’re taught and tools they’re introduced to are actually applied in practice.