Background

Research Question

Ideal question

What effect do teacher-written trimester comments have on future student performance?

Realistic question

How does the language used to describe students in trimester one comments vary depending on level of improvement in trimester two?


Purpose



1. Seek out what an “effective” comment is.

Full disclosure:

I just finished four years at Xiamen International School: a private, MYP K-12 school with three “grading” periods a year. At the end of every trimester, teachers reported each student’s achievement against the MYP subject-specific criteria rubrics (e.g. in Mathematics: A: Knowing and Understanding, B: Investigating Patterns, C: Communicating, D: Applying Math in Real-Life) that were used that trimester. Each rubric has eight achievement levels, from one to eight.

In addition to criterion-based reporting, teachers wrote a three-sentence narrative comment about the student’s achievement in this format:
1. An area in which the student did well
2. An area in which the student could improve
3. A recommendation as to how the student could improve in the above area.

The difficulty with comments is that they take a long time to write, and by the time they are in parents’ hands it may be too late for them to count as effective feedback. Additionally, only a minority of our parents speak English as a first language. These reasons and others made diving into the value of a comment very alluring to me.

The problem with our report cards is that grades and comments are always encoded and not standard-referenced … The report card should, above all else, be user-friendly: Parents must be able to easily understand the information it contains. – Grant Wiggins (Wiggins 1994)

2. Enhance Creativity of Teachers

I love the conversations I’ve had in workshops concerning “data-driven” teaching and learning as well as the idea of action research. However, the majority of tools for educational data analysis (MAP, SAT, Atlas Rubicon) tend not to give actionable suggestions to teachers as to how to improve their practice. I want to use real data to give practice-based recommendations for helping students learn and improve.

3. Learn R

I started this project having only completed the first few courses of the Johns Hopkins Data Science specialization series on Coursera. I knew the basics of R but I hoped to become more fluent as I went. I’ve definitely learned a lot and moved beyond the syntax to understand more about R, especially packages like tm, dplyr, knitr, and ggplot2.


Principles



1. Share results with community

This article is a start. My full GitHub repository is also available.

2. Produce reliable, readable, commented code

I hope I’ve done that! But I certainly can’t claim any efficiency.

3. Strive for “good enough”

I’ve started and stopped this project three times now over the course of a year. I’ve branched off and explored MANY different aspects of this dataset, but in the end the story I’m telling now is the one I set out to tell at the beginning.


Analysis



Set Up and EDA



First I’ll start by loading all the extra packages I used for my analysis.

#load required libraries, data, and created functions
library(tm)
## Loading required package: NLP
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(RWeka)
library(DT)
library(knitr)


#source custom-built functions
source("Functions.R")
## Loading required package: rJava
## Loading required package: xlsxjars
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

Now I’ll load the CSV file I got by exporting term grades from ManageBac.

 t1.report <- GetReportsDFfromMBcsv("data/t1 comments.csv")

#find dimensions (rows & columns) of the table
dim(t1.report)
## [1] 1066   14
#get column names
colnames(t1.report)
##  [1] "Student.ID"      "Last.Name"       "First.Name"     
##  [4] "Class.ID"        "Grade.Level"     "Subject"        
##  [7] "Teacher"         "Cri.A"           "Cri.B"          
## [10] "Cri.C"           "Cri.D"           "Sum"            
## [13] "CriMean"         "Student.Comment"

So this table has 14 columns and 1066 rows, where each row is one student’s report for one class.

#get an example table
t1.report.ex <- t1.report %>%
        #mask student IDs, round CriMean
        mutate(Student.ID = "1000****", CriMean = round(CriMean, digits = 3)) %>%
        #select only what I want
        select(Student.ID, Grade.Level:Cri.D, CriMean, -First.Name, -Last.Name) %>%
        #take 100 rows at random
        sample_n(100)

datatable(t1.report.ex, rownames = FALSE, class = 'compact')



The above table shows a random sample of 100 reports. At XIS it isn’t required that all four criteria be assessed each trimester, so the mean criteria score was calculated based on what was available.
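As a minimal sketch of what “based on what was available” could look like in R, the snippet below averages only the assessed criteria in each report. The three rows are made-up values, not XIS data, and the approach is an assumption: Functions.R may build CriMean differently.

```r
# hypothetical reports (invented values); NA marks a criterion not assessed
reports <- data.frame(
        Cri.A = c(5, 6, NA),
        Cri.B = c(4, NA, NA),
        Cri.C = c(6, 7, 3),
        Cri.D = c(NA, NA, 4)
)

# mean of the criteria that were actually assessed in each report
reports$CriMean <- rowMeans(reports[, c("Cri.A", "Cri.B", "Cri.C", "Cri.D")],
                            na.rm = TRUE)
reports
```

With na.rm = TRUE a report assessed on only two criteria is averaged over those two, rather than returning NA.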


Measuring Improvement



Before we get to measuring how much a student improves from trimester one to two, let’s start with a definition and then an example.

Definition

Improvement – the class-centered increase of mean criteria levels from trimester one to trimester two.

Class-centered in this case signifies that the mean improvement for each class has been subtracted from the improvement each student received in said class.

Example Methodology

Let’s use an example school with two classes of four students each taught by Ms. Blue and Ms. Green.

Generally, Improvement in this instance is calculated as: \[ {\text{Improvement}_{\text{T1-T2}}} = {\text{CriMean}}_{\text{T2}}-{\text{CriMean}}_{\text{T1}} \]



We assume that if a student’s levels went up from trimester one to trimester two then they improved in that class. However, in the above example, who improved more: Denise or Emily? Due to large variation in criteria levels between teachers and classes, this is something that needs addressing.[1]

To account for this difference I decided to normalize improvement against the class each student was in.[2] The improvement metric is modified by subtracting the mean improvement of each class. That is to say, the mean improvement for each individual class is computed, then each individual’s improvement is shifted by the average improvement of their class. Accordingly, the mean and SD of improvement in Ms. Blue’s and Ms. Green’s classes are:

by_teacher <- norm.ex %>%
        group_by(Teacher) %>%
        summarize(Improve.m = mean(Improvement), Improve.sd = sd(Improvement))
Teacher  Improve.m  Improve.sd
Blue         1.625   0.3227486
Green       -0.625   1.3616779



The formula for our improvement adjustment is: \[ {\text{CenteredImprovement}}_{\text{student}} = {\text{Improvement}}_{\text{student}} - {\text{Improvement}}_{\text{class mean}} \]

So using the above formula we get the following column displaying our new improvement metric. It’s worth noting that a negative centered improvement score does not necessarily mean that the student’s performance decreased, only that they improved less than the average of the class.

norm.ex.by_teacher <- left_join(norm.ex, by_teacher, by = "Teacher") %>%
        #center each student's improvement on the class mean (the formula above)
        mutate(Improve.center = Improvement - Improve.m) %>%
        #also scale by the class sd to get a z-score
        mutate(Improve.zgrowth = Improve.center/Improve.sd) %>%
        #round every column except the two label columns
        mutate(across(-c(Student, Teacher), ~ round(.x, 3))) %>%
        select(Student, Teacher, Improvement:Improve.zgrowth)
        
datatable(norm.ex.by_teacher, rownames = FALSE, options = list(dom = 't'))
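
To make the centering step concrete without relying on norm.ex (which is built elsewhere), here is a small base-R sketch. The student labels and improvement values are invented for illustration, chosen only so that the class means come out to the 1.625 and -0.625 reported above; they are not the original example data.

```r
# hypothetical two-class example; labels and values are invented
ex <- data.frame(
        Student     = c("S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"),
        Teacher     = rep(c("Blue", "Green"), each = 4),
        Improvement = c(1.25, 1.5, 1.75, 2, 1, -2.25, -0.25, -1)
)

# mean improvement of each student's class, repeated for every student
ex$ClassMean <- ave(ex$Improvement, ex$Teacher, FUN = mean)

# centered improvement = own improvement minus the class mean
ex$Centered <- ex$Improvement - ex$ClassMean
ex
```

In this made-up data, S1 went up by 1.25 levels yet gets a centered score of -0.375, because the class as a whole went up by 1.625. S5 went up by only 1 level but sits 1.625 above a class whose average fell.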



Improvement at XIS

Now to the XIS data. Below is a table showing the average criteria levels for each subject in the MYP for both trimester one and two, as well as the difference[3] between them.

#combine all three trimesters of data together
year.report <- GetYearReport()

#wrappers for mean and sd with na.rm = TRUE
av <- function(x) {mean(x, na.rm = TRUE)}
s <- function(x) {sd(x, na.rm = TRUE)}

by_subject <- year.report %>%
        group_by(Subject) %>%
        summarize(CriMean.t1 = av(CriMean.t1), CriMean.t2 = av(CriMean.t2),
                  Imp.m = av(t12.growth), Imp.sd = s(t12.growth)) %>%
        mutate(centralized_improvement = Imp.m - mean(Imp.m)) %>%
        mutate(across(-Subject, ~ round(.x, 2))) %>%
        ungroup()

datatable(by_subject, caption = "T1-T2 Average Criteria Levels by Subject*",
          class = 'compact', options = list(pageLength = 12, dom = 't'),
          rownames = FALSE)



We can see from the table that the average criteria improvement varies amongst the subjects. This holds true for grade levels, teachers and individual classes.

I think the be

The Comments



The general format for XIS trimester comments for students is two paragraphs:

  • A paragraph about what happened in class that trimester generally
  • A paragraph of three sentences, each of which says something the student:

    • has done well,
    • struggles with, and
    • can do to improve that with which they struggle.

A typical comment reads like this:

The MYP sixth grade science program at XIS is an intellectually challenging program that results in creative, critical, reflective thinkers. It is designed to help students make connections between science and the real world. Students are developing approaches to learning skills for thinking before writing responses and communicating using tables and graphs. The first trimester focused on cells and disease. The key concept was form. Three criteria A-C were covered with the following summative assessments: Criterion A- Unit test, Criterion B & C- design lab on yeast and Criterion C - vertical leap investigation.

****[4] is willing to work in class. He has shown an improvement in submission and achievement in assessment tasks over the first trimester. Further attention to the detail required for different assessment criteria should allow **** continued improvement. He is encouraged to write independently first so that advice can be provided on written work rather than risk forgetting how the verbal advice affects his achievement level.



As teachers, when we write these comments, we hope that the student and his/her parents read the comment, assimilate the feedback and improve. We have no way to know for sure that this happens. But I set out to learn more…

A breakdown of the comments from trimester 1. Number of:

  • Students: 133
  • Reports: 1066
  • Words: 72,921

Let’s start by getting the top 10 words that were used in trimester one comments.

t1.corpus <- GetReportsDFfromMBcsv("data/t1 comments.csv") %>%
        AnonymizeReport() %>%
        GetCorpusFromReportDF()

t1.top10 <- t1.corpus %>%
        DocumentTermMatrix() %>%
        CollapseAndSortDTM() %>%
        head(10) %>%
        select(Words, freq) %>%
        mutate( per.1000 = round(1000 * freq / 72921,2))
T1 Comments: Top 10 Most Used Words
Words   Frequency   (per 1000 words)
and          3646              50.00
the          3358              46.05
xxx          2076              28.47
his          1554              21.31
her          1502              20.60
she          1317              18.06
has           911              12.49
this          890              12.20
work          859              11.78
for           822              11.27



Wow, how insightful! (not). Let’s look instead at n-grams (i.e. phrases).

#Set tokenizer function to phrases 4 to 8 words in length
DersTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 4, max = 8))}
options(mc.cores=1) #strange RJava workaround


t1.top1000 <- t1.corpus %>% 
        DocumentTermMatrix(control=list(tokenize = DersTokenizer)) %>%
        CollapseAndSortDTM() %>%
        mutate(length = CountWords(Words)) %>%
        mutate(LenNorm = length * freq) %>%
        arrange(desc(LenNorm)) %>%
        head(1000)

t1.pruned <- GetPrunedList(t1.top1000, 100)
T1 Comments: Most Used Phrases – weighted by length
Phrases                                                     Frequency X Length
encouraged that parents review task specific comments and   616
needs to be able to                                         315
and understanding to solve problems set in familiar         304
to improve xxx can work on                                  276
i would like to see                                         260
member of the group who                                     260
proficiency with the mathematical concepts covered in this  248
information to make scientifically supported judgments      246
to improve in this                                          236
good understanding of the                                   232
xxx is able to                                              228
a specific problem or issue                                 225



Now we are getting somewhere! These are the phrases that were most used to describe students in the first trimester.
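
For readers without RWeka installed, the idea behind the n-gram counts and the length weighting (frequency times phrase length, as in LenNorm above) can be sketched in base R. The helper name MakeNGrams and the two sample comments are hypothetical, and this simple whitespace tokenizer is a stand-in, not the tokenizer used in the analysis.

```r
# hypothetical helper: all n-word phrases in one comment
MakeNGrams <- function(text, n) {
        words <- strsplit(tolower(text), "\\s+")[[1]]
        if (length(words) < n) return(character(0))
        starts <- seq_len(length(words) - n + 1)
        vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
               character(1))
}

# two invented comments
comments <- c("xxx is able to solve problems",
              "xxx is able to work well")

# count every 4-word phrase across the comments
grams <- unlist(lapply(comments, MakeNGrams, n = 4))
freq <- table(grams)

# weight by phrase length (all phrases here are 4 words long)
len.norm <- sort(freq * 4, decreasing = TRUE)
head(len.norm, 3)
```

Only “xxx is able to” occurs in both invented comments, so it ends up at the top of the weighted list.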

What I want to do now is compare two groups and the language used to describe each. Students who were in the:

  • Top 25% in terms of improvement, and
  • Bottom 25% in terms of improvement.

To do this I will use the term frequency/inverse document frequency metric (tf-idf), which I discovered from an article published on Nate Silver’s FiveThirtyEight blog titled “These Are The Phrases Each GOP Candidate Repeats Most” (Beckman 2016). In it, Beckman analyzes 2016 GOP debate transcripts to find unique phrases for each candidate.

In this analysis, I employ tf-idf to find phrases that are more likely to have been used to describe students who improved than those who didn’t.
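
To show roughly what tf-idf does with two groups, here is a minimal base-R sketch. The phrase lists are invented, and the base-2 logarithm is just one common choice; the custom GetAllTfIdfMatricesFromCorpus function used below may compute and normalize things differently.

```r
# hypothetical phrase lists, one per group (invented, not XIS data)
groups <- list(
        top    = c("needs to improve", "class discussions",
                   "class discussions", "works well"),
        bottom = c("needs to continue", "works well",
                   "works well", "class discussions")
)
phrases <- sort(unique(unlist(groups)))

# term frequency: each phrase's share of its group's phrases
tf <- sapply(groups, function(g) {
        as.numeric(table(factor(g, levels = phrases))) / length(g)
})
rownames(tf) <- phrases

# inverse document frequency: log2(number of groups / groups using the phrase)
idf <- log2(length(groups) / rowSums(tf > 0))

# tf-idf: phrases used by both groups get idf = 0 and drop out
tfidf <- tf * idf
round(tfidf, 2)
```

With only two documents, any phrase that appears in both groups scores zero, so what remains are the phrases unique to one group. That is the property this analysis relies on.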

Putting it All Together

year.report <- year.report %>%
    #take only needed columns
    select(Student.ID, Class.ID:Teacher,
               CriMean.t1, CriMean.t2, t12.growth, t12.growth.center) %>%
    #add quartile column based on centered growth
    within(t12.growth.center.quartile <- as.integer(cut(t12.growth.center,
                                      quantile(t12.growth.center, probs=0:4/4,
                                      na.rm = TRUE), include.lowest=TRUE))) %>%
    #add index to crossref w/ corpus
    mutate(ID.SUB = paste(Student.ID, Subject))

#get 4 quartiles of ID.SUB's 
quarts <- c(1,2,3,4)
quartiles <- lapply(quarts, function(x) {
        year.report %>% filter(t12.growth.center.quartile == x) %>%
                .$ID.SUB})

#paste each quartile's comments into one comment
quartile.comments <- lapply(quartiles, function(x) {
        idx <- t1.corpus %>% meta(tag = "ID.SUB") %in% x
        do.call(paste,content(t1.corpus[idx])) })

#take only Q1 and Q4
topbot.comments <- quartile.comments[-c(2,3)]

#make corpus (1 quartile = 1 document)
topbot.corpus <- VectorSource(topbot.comments) %>% Corpus
#make tf-idf matrices from corpus with phrases 2 to 5 words long
all.tfidf <- GetAllTfIdfMatricesFromCorpus(topbot.corpus, 2,5, norm = TRUE)

#remove repetitive words
all.pruned <- lapply(all.tfidf, GetPrunedList, prune_thru = 300)

top <- all.pruned[[1]] %>%
  transmute(ngrams, Score = tfidfXlength * 100000) %>%
  filter(Score >= 20)
 
bottom <- all.pruned[[2]]  %>%
  transmute(ngrams, Score = tfidfXlength * 100000) %>%
  filter(Score >= 20)
Top 25% Improved: Most Common Phrases from T1
ngrams                                      Score
a focus for the next                        27.54335
class discussions as                        23.13642
he needs to improve                         22.03468
xxx should try to                           22.03468
a very polite cooperative and               22.03468
a way that they can                         22.03468
achieve to an even higher                   22.03468
and should be written in                    22.03468
became somewhat noticeable while assessing  22.03468
can be measured and evaluated               22.03468
for xxx is to more                          22.03468
goals should be based on                    22.03468
had some difficulty when she                22.03468
of the key concepts i                       22.03468
started the year well she                   22.03468
this was on display when                    22.03468
well throughout the first trimester         22.03468

Bottom 25% Improved: Most Common Phrases from T1
ngrams                                      Score
as the year has progressed                  31.73999
more and more comfortable sharing           31.73999
and i look forward to                       26.44999
xxx needs to continue to                    26.44999
achievement in all tasks xxx                21.15999
an area of focus for                        21.15999
an investigation but performed better       21.15999
can work on explaining her                  21.15999
coherent lines of reasoning xxx             21.15999
for next term i hope                        21.15999
from the text to support                    21.15999
good understanding of the concept           21.15999
it difficult for him to                     21.15999
to plan and create videos                   21.15999
was evident while assessing criterion       21.15999
it is important that                        21.15999
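
The quartile column in the code above comes from as.integer(cut(...)) over quantile breaks. The following sketch, using invented growth values, shows how that construction numbers the groups in base R: bin 1 holds the lowest values and bin 4 the highest. If the XIS pipeline behaves the same way, quartile 1 is the least-improved group, which is worth checking when matching list positions to the “top” and “bottom” labels.

```r
# invented centered-growth values, sorted only to make the result easy to read
growth <- c(-1.5, -0.5, 0, 0.25, 0.5, 1, 1.5, 2)

# same construction as t12.growth.center.quartile above
breaks <- quantile(growth, probs = 0:4/4, na.rm = TRUE)
quartile <- as.integer(cut(growth, breaks, include.lowest = TRUE))

data.frame(growth, quartile)
```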

Findings




Evaluation



This is how good of a job I did.


References & Footnotes

Beckman, Milo. 2016. “These Are The Phrases Each GOP Candidate Repeats Most.” FiveThirtyEight. https://fivethirtyeight.com/features/these-are-the-phrases-each-gop-candidate-repeats-most/.

Wiggins, Grant. 1994. “Toward Better Report Cards.” Educational Leadership 52 (2): 28–37.


  1. asdfasdfasdfasdf

  2. The average growth should be the difference between the Criteria Means of trimester one and two but the average growth excludes mid-year students who: 1) typically don’t do well in their 1st trimester of MYP and 2) are not represented in the T1-T2 growth statistic.

  3. A lot of assumptions going on here….

  4. Name removed for privacy