Background

Research Question

Ideal question

What effect do teacher-written trimester comments have on future student performance?

Realistic question

How does the language used to describe students in trimester one comments vary depending on level of improvement in trimester two?

Purpose

1. Seek out what an “effective” comment is.

Full disclosure:

I struggle writing comments.
I think WAY too much.

I just finished four years at Xiamen International School: a private, MYP K-12 school with three “grading” periods a year. At the end of every trimester, teachers reported each student’s achievement against the MYP subject-specific criteria rubrics that were used that trimester (e.g. in Mathematics: A: Knowing and Understanding, B: Investigating Patterns, C: Communicating, D: Applying Math in Real-Life). Each rubric has eight different achievement levels, from one to eight.

In addition to criterion-based reporting, teachers wrote a three-sentence narrative comment about the student’s achievement in this format:
1. An area in which the student did well
2. An area in which the student could improve
3. A recommendation as to how the student could improve in the above area.

The difficulty with comments is that they take a long time to write, and by the time they are in parents’ hands it may be too late for them to be considered effective feedback. Additionally, only a minority of our parents speak English as a first language. These reasons and others made diving into the value of a comment very alluring to me.

“The problem with our report cards is that grades and comments are always encoded and not standard-referenced … The report card should, above all else, be user-friendly: Parents must be able to easily understand the information it contains.” – Grant Wiggins (Wiggins 1994)

2. Enhance Creativity of Teachers

I love the conversations I’ve had in workshops concerning “data-driven” teaching and learning as well as the idea of action research. However, the majority of tools for educational data analysis (MAP, SAT, Atlas Rubicon) tend not to give actionable suggestions to teachers as to how to improve their practice. I want to use real data to give practice-based recommendations for helping students learn and improve.

3. Learn R

I started this project having only completed the first few courses of the Johns Hopkins Data Science specialization series on Coursera. I knew the basics of R but I hoped to become more fluent as I went. I’ve definitely learned a lot and moved beyond the syntax to understand more about R, especially packages like tm, dplyr, knitr, and ggplot2.

Principles

1. Share results with community

2. Produce reliable, readable, commented code

I hope I’ve done that! But I certainly can’t claim any efficiency.

3. Strive for “good enough”

I’ve started and stopped this project three times now over the course of a year. I’ve branched off and explored MANY different aspects of this dataset, but in the end the question I’m answering now is the one I set out to answer at the beginning.

Analysis

Set Up and EDA

First I’ll start by loading all the extra packages I used for my analysis.

#load required libraries, data, and created functions
library(tm)

## Loading required package: NLP

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(RWeka)
library(DT)
library(knitr)


#source custom-built functions
source("Functions.R")

## Loading required package: rJava

## Loading required package: xlsxjars

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

Now I’ll load the CSV file I got by exporting term grades from ManageBac.

t1.report <- GetReportsDFfromMBcsv("data/t1 comments.csv")

#find dimensions (rows & columns) of the table
dim(t1.report)

## [1] 1066   14

#get column names
colnames(t1.report)

##  [1] "Student.ID"      "Last.Name"       "First.Name"     
##  [4] "Class.ID"        "Grade.Level"     "Subject"        
##  [7] "Teacher"         "Cri.A"           "Cri.B"          
## [10] "Cri.C"           "Cri.D"           "Sum"            
## [13] "CriMean"         "Student.Comment"

So this table has 1066 rows and 14 columns, where each row is one student’s report for one class.

#get an example table
t1.report.ex <- t1.report %>%
        #mask student IDs, round CriMean,
        mutate(Student.ID = "1000****", CriMean = round(CriMean, digits = 3)) %>%
        #select only what I want
        select(Student.ID, Grade.Level:Cri.D, CriMean, -First.Name, -Last.Name) %>%
        #take 100 rows at random
        sample_n(100)

datatable(t1.report.ex, rownames = FALSE, class = 'compact')

The above table shows a random sample of 100 reports. At XIS it isn’t required that all four criteria be assessed each trimester, so the mean criteria score was calculated from whichever criteria were available.
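As a rough illustration of averaging over only the assessed criteria, here is a minimal base R sketch. The three rows are made up for illustration; only the column names come from the table above, and how GetReportsDFfromMBcsv actually computes CriMean is not shown in this write-up.

```r
# Hypothetical reports: NA marks a criterion that was not assessed
reports <- data.frame(
  Cri.A = c(5, 6, NA),
  Cri.B = c(4, NA, NA),
  Cri.C = c(6, 7, 3),
  Cri.D = c(NA, 5, NA)
)

# Average across whichever criteria are present in each row
cri.cols <- c("Cri.A", "Cri.B", "Cri.C", "Cri.D")
reports$CriMean <- rowMeans(reports[, cri.cols], na.rm = TRUE)

print(reports)
```

A report with only one assessed criterion (the third row) simply takes that level as its mean.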

Measuring Improvement

Before we get to measuring how much a student improves from trimester one to two, let’s start with a definition, then an example.

Definition

Improvement – the class-centered increase of mean criteria levels from trimester one to trimester two.

Class-centered in this case signifies that the mean improvement for each class has been subtracted from the improvement each student received in said class.

Example Methodology

Let’s use an example school with two classes of four students each taught by Ms. Blue and Ms. Green.

Generally, Improvement in this instance is calculated as: \[ {\text{Improvement}_{\text{T1-T2}}} = {\text{CriMean}}_{\text{T2}}-{\text{CriMean}}_{\text{T1}} \]

We assume that if a student’s levels went up from trimester one to trimester two then they improved in that class. However, in the above example, who improved more: Denise or Emily? Due to large variation in criteria levels between teachers and classes, this is something that needs addressing¹.

To account for this difference I decided to normalize each student’s improvement against the class they were in². The improvement metric is modified by subtracting the mean improvement of each class. That is to say, the mean improvement for each individual class is computed, then each individual’s improvement is centered on that class average. Accordingly, the mean and SD of improvement for Ms. Green’s and Ms. Blue’s classes are:

by_teacher <- norm.ex %>%
        group_by(Teacher) %>%
        summarize(Improve.m = mean(Improvement), Improve.sd = sd(Improvement))

Teacher	Improve.m	Improve.sd
Blue	1.625	0.3227486
Green	-0.625	1.3616779

The formula for our improvement adjustment is: \[ \text{CenteredImprovement}_{\text{student}} = \text{Improvement}_{\text{student}} - \text{Improvement}_{\text{class mean}} \] Using the above formula we get the following column displaying our new improvement metric. It’s worth noting that a negative centered improvement score does not necessarily mean that the student’s performance decreased, but that they increased less than the average of the class. Note also that the example code below goes one step further and divides by the class SD, so Improve.zgrowth is a z-score rather than the plain centered value.

norm.ex.by_teacher <- left_join(norm.ex, by_teacher, by = "Teacher") %>%
        #standardize Improvement by the class (teacher) mean & SD
        mutate(Improve.zgrowth = (Improvement - Improve.m)/Improve.sd) %>%
        #round every column except the identifiers to 3 decimal places
        mutate(across(-c(Student, Teacher), ~ round(.x, 3))) %>%
        select(Student, Teacher, Improvement:Improve.zgrowth)
        
datatable(norm.ex.by_teacher, rownames = FALSE, options = list(dom = 't'))
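To make the difference between centering and z-scoring concrete, here is a minimal base R sketch. The eight students and their levels are invented for illustration (this is not the norm.ex table used above), and the column names Centered and Zscore are my own.

```r
# Hypothetical data: two classes of four students (NOT the author's norm.ex)
toy <- data.frame(
  Student = paste0("S", 1:8),
  Teacher = rep(c("Blue", "Green"), each = 4),
  CriMean.t1 = c(4, 5, 3, 6, 5, 6, 4, 7),
  CriMean.t2 = c(6, 6, 5, 7, 5, 5, 4, 6)
)

# Raw improvement: T2 mean minus T1 mean
toy$Improvement <- toy$CriMean.t2 - toy$CriMean.t1

# Class mean and SD of improvement, repeated for every student in the class
class.mean <- ave(toy$Improvement, toy$Teacher, FUN = mean)
class.sd <- ave(toy$Improvement, toy$Teacher, FUN = sd)

# Centered improvement (the formula above) and the z-score variant
toy$Centered <- toy$Improvement - class.mean
toy$Zscore <- toy$Centered / class.sd

print(toy)
```

In this toy data S1 (Blue, raw improvement of 2) and S5 (Green, raw improvement of 0) end up with the same centered score of 0.5, because each improved half a level more than their own class average.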

Improvement at XIS

Now to the XIS data. Below is a table showing the average criteria levels for each subject in the MYP for both trimester one and two, as well as the difference³ between them.

#combine all three trimesters of data together
year.report <- GetYearReport()

#wrappers for mean and sd with na.rm = TRUE
av <- function(x) {mean(x, na.rm = TRUE)}
s <- function(x) {sd(x, na.rm = TRUE)}

by_subject <- year.report %>%
        group_by(Subject) %>%
        summarize(CriMean.t1 = av(CriMean.t1), CriMean.t2 = av(CriMean.t2),
                  Imp.m = av(t12.growth), Imp.sd = s(t12.growth)) %>%
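        #center each subject's mean improvement on the mean across subjects
        mutate(centralized_improvement = Imp.m - mean(Imp.m)) %>%
        #round every column except Subject to 2 decimal places
        mutate(across(-Subject, ~ round(.x, 2))) %>%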
        ungroup()

datatable(by_subject, caption = "T1-T2 Average Criteria Levels by Subject*",
          class = 'compact', options = list(pageLength = 12, dom = 't'),
          rownames = FALSE)

We can see from the table that the average criteria improvement varies amongst the subjects. This holds true for grade levels, teachers and individual classes.

I think the be

The Comments

The general format for XIS trimester comments for students is two paragraphs:

1. A paragraph about what happened in class that trimester generally
2. A paragraph of three sentences, each of which says something the student:
- has done well,
- struggles with, and
- can do to improve in the area in which they struggle.

A typical comment reads like this:

The MYP sixth grade science program at XIS is an intellectually challenging program that results in creative, critical, reflective thinkers. It is designed to help students make connections between science and the real world. Students are developing approaches to learning skills for thinking before writing responses and communicating using tables and graphs. The first trimester focused on cells and disease. The key concept was form. Three criteria A-C were covered with the following summative assessments: Criterion A- Unit test, Criterion B & C- design lab on yeast and Criterion C- vertical leap investigation.

****⁴ is willing to work in class. He has shown an improvement in submission and achievement in assessment tasks over the first trimester. Further attention to the detail required for different assessment criteria should allow **** continued improvement. He is encouraged to write independently first so that advice can be provided on written work rather than risk forgetting how the verbal advice affects his achievement level.

As teachers, when we write these comments, we hope that the student and his/her parents read the comment, assimilate the feedback, and improve. We have no way to know for sure that this happens. But I set out to learn more…

A breakdown of the comments from trimester one. Number of:
- Students: 133
- Reports: 1066
- Words: 72,921

Let’s start by getting the top 10 words that were used in trimester one comments.

t1.corpus <- GetReportsDFfromMBcsv("data/t1 comments.csv") %>%
        AnonymizeReport() %>%
        GetCorpusFromReportDF()

t1.top10 <- t1.corpus %>%
        DocumentTermMatrix() %>%
        CollapseAndSortDTM() %>%
        head(10) %>%
        select(Words, freq) %>%
        mutate( per.1000 = round(1000 * freq / 72921,2))

T1 Comments: Top 10 Most Used Words
Words	Frequency	(per 1000 words)
and	3646	50.00
the	3358	46.05
xxx	2076	28.47
his	1554	21.31
her	1502	20.60
she	1317	18.06
has	911	12.49
this	890	12.20
work	859	11.78
for	822	11.27
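The helper functions in Functions.R (CollapseAndSortDTM and friends) are not shown here, so the following is only a base R sketch of the same idea: count words, sort, and express each count per 1000 words. The three comments are invented. As far as I know, tm's DocumentTermMatrix keeps only words of three or more characters by default, which would explain why short words such as "to" and "is" are missing from the table above; the sketch mimics that with an explicit length filter.

```r
# Hypothetical comments (not from the XIS data)
comments <- c(
  "xxx has done good work this term",
  "xxx needs to work on his graphs",
  "her work this term has been good"
)

# Lower-case and split on anything that is not a letter
tokens <- unlist(strsplit(tolower(comments), "[^a-z]+"))
tokens <- tokens[nchar(tokens) > 0]
total.words <- length(tokens)

# Keep words of three or more letters, then count and sort
kept <- tokens[nchar(tokens) >= 3]
counts <- sort(table(kept), decreasing = TRUE)

top.words <- data.frame(
  Words = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)

# Frequency per 1000 words, relative to all words in the comments
top.words$per.1000 <- round(1000 * top.words$freq / total.words, 2)

print(head(top.words, 3))
```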

Wow, how insightful! (not). Let’s look instead at n-grams (i.e. phrases).

#Set tokenizer function to phrases 4 to 8 words in length
DersTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 4, max = 8))}
options(mc.cores=1) #strange RJava workaround


t1.top1000 <- t1.corpus %>% 
        DocumentTermMatrix(control=list(tokenize = DersTokenizer)) %>%
        CollapseAndSortDTM() %>%
        mutate(length = CountWords(Words)) %>%
        mutate(LenNorm = length * freq) %>%
        arrange(desc(LenNorm)) %>%
        head(1000)

t1.pruned <- GetPrunedList(t1.top1000, 100)

T1 Comments: Top 12 Most Used Phrases – weighted by length
Phrases	Frequency X Length
encouraged that parents review task specific comments and	616
needs to be able to	315
and understanding to solve problems set in familiar	304
to improve xxx can work on	276
i would like to see	260
member of the group who	260
proficiency with the mathematical concepts covered in this	248
information to make scientifically supported judgments	246
to improve in this	236
good understanding of the	232
xxx is able to	228
a specific problem or issue	225
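To show what the length weighting does, here is a minimal base R sketch that builds n-grams by hand instead of using RWeka, counts them, and multiplies frequency by phrase length. The comments and the ngrams() helper are hypothetical, and the sketch uses 4- and 5-word phrases rather than the 4 to 8 used above to keep the output short.

```r
# Hypothetical helper: all n-word phrases in one string
ngrams <- function(text, n) {
  w <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical comments (not from the XIS data)
comments <- c(
  "xxx is able to solve problems",
  "xxx is able to explain ideas",
  "she needs to be able to focus"
)

# Collect every 4- and 5-word phrase from every comment
all.ngrams <- unlist(lapply(comments, function(cm) {
  unlist(lapply(4:5, function(n) ngrams(cm, n)))
}))

# Count phrases, then weight each count by the phrase length in words
counts <- table(all.ngrams)
phrases <- data.frame(
  Words = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
phrases$length <- lengths(strsplit(phrases$Words, " "))
phrases$LenNorm <- phrases$length * phrases$freq
phrases <- phrases[order(-phrases$LenNorm), ]

print(head(phrases, 3))
```

Here "xxx is able to" appears twice and scores 2 x 4 = 8, ahead of any 5-word phrase that appears only once (score 5). Weighting by length lets longer repeated phrases compete with the short fragments they contain.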

Now we are getting somewhere! These are the phrases that were most used to describe students in the first trimester.

What I want to do now is compare two groups and the language used to describe each. Students who were in the:
- Top 25% in terms of improvement, and
- Bottom 25% in terms of improvement.

To do this I will use the term-frequency/inverse-document frequency metric (tf-idf), which I discovered through an article published on Nate Silver’s FiveThirtyEight blog titled “These Are The Phrases Each GOP Candidate Repeats Most” by Milo Beckman (Beckman 2016). In it, Beckman analyzes 2016 GOP debate transcripts to find unique phrases for each candidate.

In this analysis, I employ tf-idf to find phrases that are more likely to have been used to describe students who improved than those who didn’t.
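For readers new to tf-idf, here is a minimal base R sketch on two tiny "documents", standing in for the top and bottom groups. The phrases are invented, and the weighting shown (term frequency times log2 of documents over document frequency) is one common form that I am assuming for illustration; the exact weighting inside GetAllTfIdfMatricesFromCorpus is not shown in this write-up.

```r
# Hypothetical phrase lists, one per group (not from the XIS data)
docs <- list(
  top = c("should try to", "class discussions as",
          "needs to improve", "should try to"),
  bottom = c("needs to continue", "needs to improve", "look forward to")
)

# Count each phrase in each document (rows = phrases, columns = documents)
terms <- sort(unique(unlist(docs)))
counts <- sapply(docs, function(d) as.integer(table(factor(d, levels = terms))))
rownames(counts) <- terms

# Term frequency: share of the document's phrases
tf <- sweep(counts, 2, colSums(counts), "/")

# Inverse document frequency: phrases found in every document get zero
doc.freq <- rowSums(counts > 0)
idf <- log2(length(docs) / doc.freq)

# idf has one value per row, so it is recycled down each column
tfidf <- tf * idf

print(round(tfidf, 3))
```

With only two documents, any phrase used for both groups ("needs to improve" here) scores exactly zero under this weighting, so the surviving phrases are those used for one group and not the other.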

Putting it All Together

year.report <- year.report %>%
    #take only needed columns
    select(Student.ID, Class.ID:Teacher,
               CriMean.t1, CriMean.t2, t12.growth, t12.growth.center) %>%
    #add quartile column based on centered growth
    within(t12.growth.center.quartile <- as.integer(cut(t12.growth.center,
                                      quantile(t12.growth.center, probs=0:4/4,
                                      na.rm = TRUE), include.lowest=TRUE))) %>%
    #add index to crossref w/ corpus
    mutate(ID.SUB = paste(Student.ID, Subject))

#get 4 quartiles of ID.SUB's 
quarts <- c(1,2,3,4)
quartiles <- lapply(quarts, function(x) {
        year.report %>% filter(t12.growth.center.quartile == x) %>%
                .$ID.SUB})

#paste each quartile's comments into one comment
quartile.comments <- lapply(quartiles, function(x) {
        idx <- t1.corpus %>% meta(tag = "ID.SUB") %in% x
        do.call(paste,content(t1.corpus[idx])) })

#take only Q1 and Q4
topbot.comments <- quartile.comments[-c(2,3)]

#make corpus (1 quartile = 1 document)
topbot.corpus <- VectorSource(topbot.comments) %>% Corpus

#make tf-idf matrices from corpus with phrases 2 to 5 words long
all.tfidf <- GetAllTfIdfMatricesFromCorpus(topbot.corpus, 2,5, norm = TRUE)

#remove repetitive words
all.pruned <- lapply(all.tfidf, GetPrunedList, prune_thru = 300)

top <- all.pruned[[1]] %>%
  transmute(ngrams, Score = tfidfXlength * 100000) %>%
  filter(Score >= 20)
 
bottom <- all.pruned[[2]]  %>%
  transmute(ngrams, Score = tfidfXlength * 100000) %>%
  filter(Score >= 20)
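The quartile column near the top of the code above comes from cut() applied to quantile() breaks. A minimal sketch on made-up centered growth values (not the XIS data) shows how the labels fall out:

```r
# Hypothetical centered growth values
growth.center <- c(-1.5, -0.5, -0.25, 0, 0.25, 0.5, 1, 2)

# Breaks at the 0%, 25%, 50%, 75% and 100% quantiles
breaks <- quantile(growth.center, probs = 0:4/4, na.rm = TRUE)

# include.lowest = TRUE keeps the minimum value inside the first bin
quartile <- as.integer(cut(growth.center, breaks, include.lowest = TRUE))

print(data.frame(growth.center, quartile))
```

Because the breaks are ascending, quartile 1 holds the smallest centered growth and quartile 4 the largest. If the tf-idf helper keeps the documents in the order they were supplied (an assumption, since the helper is not shown), all.pruned[[1]] would correspond to quartile 1, so it is worth double-checking which list element is labelled top and which bottom.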

Top 25% Improved: Most Common Phrases from T1
ngrams	Score
a focus for the next	27.54335
class discussions as	23.13642
he needs to improve	22.03468
xxx should try to	22.03468
a very polite cooperative and	22.03468
a way that they can	22.03468
achieve to an even higher	22.03468
and should be written in	22.03468
became somewhat noticeable while assessing	22.03468
can be measured and evaluated	22.03468
for xxx is to more	22.03468
goals should be based on	22.03468
had some difficulty when she	22.03468
of the key concepts i	22.03468
started the year well she	22.03468
this was on display when	22.03468
well throughout the first trimester	22.03468

Bottom 25% Improved: Most Common Phrases from T1
ngrams	Score
as the year has progressed	31.73999
more and more comfortable sharing	31.73999
and i look forward to	26.44999
xxx needs to continue to	26.44999
achievement in all tasks xxx	21.15999
an area of focus for	21.15999
an investigation but performed better	21.15999
can work on explaining her	21.15999
coherent lines of reasoning xxx	21.15999
for next term i hope	21.15999
from the text to support	21.15999
good understanding of the concept	21.15999
it difficult for him to	21.15999
to plan and create videos	21.15999
was evident while assessing criterion	21.15999
it is important that	21.15999

Findings

Evaluation

This is how good of a job I did.

References & Footnotes

Beckman, Milo. 2016. “These Are The Phrases Each GOP Candidate Repeats Most.” FiveThirtyEight. https://fivethirtyeight.com/features/these-are-the-phrases-each-gop-candidate-repeats-most/.

Wiggins, Grant. 1994. “Toward Better Report Cards.” Educational Leadership 52 (2): 28–37.

¹ asdfasdfasdfasdf
² The average growth should be the difference between the Criteria Means of trimester one and two but the average growth excludes mid-year students who: 1) typically don’t do well in their 1st trimester of MYP and 2) are not represented in the T1-T2 growth statistic.
³ A lot of assumptions going on here….
⁴ Name removed for privacy

Comment Mining Write Up

Anders Swanson
