CommonLit Readability Prize
https://www.kaggle.com/c/commonlitreadabilityprize
Rate the complexity of literary passages for grades 3-12 classroom use

knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE,
                      warning = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(quanteda)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "numericVector" of class "Mnumeric"; definition not updated
## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "numericVector" of class "Mnumeric"; definition not updated
library(caTools)
library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
Read in the training set as “x”.
x <- read.csv('train.csv')
How many texts are there in the training set?
nrow(x)
## [1] 2834
Who are the passages for?
They are for classroom use by students in grades 3 through 12 in the United States.
Visualize readability targets with a histogram.
x %>% ggplot(aes(x=target)) +
  geom_histogram(binwidth = 0.1)

Which text is the most difficult to read?
x[which.min(x$target),]$excerpt
## [1] "The commutator is peculiar, consisting of only three segments of a copper ring, while in the simplest of other continuous current generators several times that number exist, and frequently 120! segments are to be found. These three segments are made so as to be removable in a moment for cleaning or replacement. They are mounted upon a metal support, and are surrounded on all sides by a free air space, and cannot, therefore, lose their insulated condition. This feature of air insulation is peculiar to this system, and is very important as a factor in the durability of the commutator. Besides this, the commutator is sustained by supports carried in flanges upon the shaft, which flanges, as an additional safeguard, are coated all over with hard rubber, one of the finest known insulators. It may be stated, without fear of contradiction, that no other commutator made is so thoroughly insulated and protected. The three commutator segments virtually constitute a single copper ring, mounted in free air, and cut into three equal pieces by slots across its face."
Which text is easiest to read?
x[which.max(x$target),]$excerpt
## [1] "When you think of dinosaurs and where they lived, what do you picture? Do you see hot, steamy swamps, thick jungles, or sunny plains? Dinosaurs lived in those places, yes. But did you know that some dinosaurs lived in the cold and the darkness near the North and South Poles?\nThis surprised scientists, too. Paleontologists used to believe that dinosaurs lived only in the warmest parts of the world. They thought that dinosaurs could only have lived in places where turtles, crocodiles, and snakes live today. Later, these dinosaur scientists began finding bones in surprising places.\nOne of those surprising fossil beds is a place called Dinosaur Cove, Australia. One hundred million years ago, Australia was connected to Antarctica. Both continents were located near the South Pole. Today, paleontologists dig dinosaur fossils out of the ground. They think about what those ancient bones must mean."
Comment: This is easy? Because it’s about dinosaurs for kids?
I wonder if this text may have been wrongly labeled.
What is the average target value?
x$target %>% mean
## [1] -0.9593188
What does an average text look like?
x[x$target>=-0.959 & x$target <=-0.958,]$excerpt
## [1] "One noon, during a break in the rains, there was a cool soft breeze blowing; the smell of the damp grass and leaves in the hot sun felt like the warm breathing of the tired earth on one's body. A persistent bird went on all the afternoon repeating the burden of its one complaint in Nature's audience chamber.\nThe postmaster had nothing to do. The shimmer of the freshly washed leaves, and the banked-up remnants of the retreating rain-clouds were sights to see; and the postmaster was watching them and thinking to himself: \"Oh, if only some kindred soul were near—just one loving human being whom I could hold near my heart!\" This was exactly, he went on to think, what that bird was trying to say, and it was the same feeling which the murmuring leaves were striving to express. But no one knows, or would believe, that such an idea might also take possession of an ill-paid village postmaster in the deep, silent midday interval of his work."

Model 1

Extract Flesch reading ease scores from the excerpts.
Flesch <- x$excerpt %>% textstat_readability(measure='Flesch')
x$Flesch <- Flesch[,2]
Split the original training set into a train and test set.
set.seed(1066)
spl <- sample.split(x$target, SplitRatio = 0.7)
train <- x[spl==T,]
test <- x[spl==F,]
Run a linear regression model using only the Flesch Reading Ease Scores.
mod <- glm(target~Flesch$Flesch, data = train)
summary(mod)
## 
## Call:
## glm(formula = target ~ Flesch$Flesch, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.61468  -0.60855  -0.00169   0.60381   2.45259  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.838289   0.068551  -41.40   <2e-16 ***
## Flesch$Flesch  0.029756   0.001047   28.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7519595)
## 
##     Null deviance: 2097.2  on 1982  degrees of freedom
## Residual deviance: 1489.6  on 1981  degrees of freedom
## AIC: 5066.2
## 
## Number of Fisher Scoring iterations: 2
Run the model on the test set.
test$preds <- predict(mod, newdata = test)
Plot the targets against the predictions.
test %>% ggplot(aes(x=preds, y=target)) +
  geom_point() +
  stat_smooth(method="lm")

Compute R squared for Flesch.
SSE <- sum((test$target - test$preds)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared1 <- 1 - SSE/SST
Rsquared1
## [1] 0.3410682
Comment: About 34% of the variance in the readability targets is explained by the Flesch Reading Ease Score alone.
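Since the same out-of-sample R squared calculation is repeated for each model below, it could be wrapped in a small helper. A quick sketch (same SSE/SST definitions as above, with SST taken against the mean of the training targets):
rsq <- function(actual, predicted, train_mean) {
  # Out-of-sample R squared: 1 - SSE/SST, with SST against the training mean
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - train_mean)^2)
  1 - sse / sst
}
rsq(test$target, test$preds, mean(train$target)) # should reproduce Rsquared1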

Model 2

Build a model using four variables:
Keep words per sentence (as in Flesch) and replace average syllables per word with average word length.
Add the length of the passage (in tokens) and lexical diversity (TTR) as new variables.
x$words_per_sentence <- ntoken(x$excerpt)/nsentence(x$excerpt)
x$avg_word_length <- nchar(x$excerpt)/ntoken(x$excerpt)
x$words <- ntoken(x$excerpt)
x$TTR <- x$excerpt %>% tokens %>% dfm %>% textstat_lexdiv()
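TTR here is the type-token ratio: the number of unique word types divided by the total number of tokens. A quick sanity check on the first excerpt (a sketch; it lowercases first because dfm() lowercases by default):
# TTR = unique types / total tokens, checked on the first excerpt
toks1 <- tokens(char_tolower(x$excerpt[1]))
ntype(toks1) / ntoken(toks1) # should be close to x$TTR$TTR[1]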
Split the training set into train and test sets again.
set.seed(1066)
spl <- sample.split(x$target, SplitRatio = 0.7)
train <- x[spl==T,]
test <- x[spl==F,]
Run the second model.
mod2 <- glm(target~ words_per_sentence + avg_word_length + words + TTR$TTR, data=train)
summary(mod2)
## 
## Call:
## glm(formula = target ~ words_per_sentence + avg_word_length + 
##     words + TTR$TTR, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.89398  -0.58556   0.04067   0.60441   2.37460  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.3717467  0.3901903  18.893  < 2e-16 ***
## words_per_sentence -0.0291378  0.0019114 -15.244  < 2e-16 ***
## avg_word_length    -0.9376830  0.0473152 -19.818  < 2e-16 ***
## words              -0.0096339  0.0009566 -10.071  < 2e-16 ***
## TTR$TTR            -1.8169489  0.3400357  -5.343 1.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7447657)
## 
##     Null deviance: 2097.2  on 1982  degrees of freedom
## Residual deviance: 1473.1  on 1978  degrees of freedom
## AIC: 5050.1
## 
## Number of Fisher Scoring iterations: 2
Make predictions for the second model on the test set and plot the targets against predictions.
test$preds2 <- predict(mod2, newdata = test)
test %>% ggplot(aes(x=preds2, y=target)) +
  geom_point() +
  stat_smooth(method="lm")

Calculate R squared for the second model.
SSE <- sum((test$target - test$preds2)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared2 <- 1 - SSE/SST
Rsquared2
## [1] 0.3455149

What is the difference between the four-feature model and the Flesch model?

Rsquared2 - Rsquared1
## [1] 0.004446664
Comment: There is less than half a percentage point of difference in R squared between my simple model and the Flesch one.

Model 3

Build a model using forty-eight different readability formulas.
Extract multiple readability scores from the excerpts.
readability <- x$excerpt %>% textstat_readability(measure = c("all", "ARI", "ARI.simple", "Bormuth", "Bormuth.GP", "Coleman", "Coleman.C2", "Coleman.Liau", "Coleman.Liau.grade", "Coleman.Liau.short", "Dale.Chall", "Dale.Chall.old", "Dale.Chall.PSK", "Danielson.Bryan", "Danielson.Bryan.2", "Dickes.Steiwer", "DRP", "ELF", "Farr.Jenkins.Paterson", "Flesch", "Flesch.PSK", "Flesch.Kincaid", "FOG", "FOG.PSK", "FOG.NRI", "FORCAST", "FORCAST.RGL", "Fucks", "Linsear.Write", "LIW", "nWS", "nWS.2", "nWS.3", "nWS.4", "RIX", "Scrabble", "SMOG", "SMOG.C", "SMOG.simple", "SMOG.de", "Spache", "Spache.old", "Strain", "Traenkle.Bailer", "Traenkle.Bailer.2","Wheeler.Smith","meanSentenceLength", "meanWordSyllables"))
Add the targets to the readability data frame.
readability$target <- x$target
Split the readability scores data frame into train and test sets.
set.seed(1066)
spl <- sample.split(readability$target, SplitRatio = 0.7)
train <- readability[spl==T,]
test <- readability[spl==F,]
Run the third model.
mod3 <- glm(target~. -document, data=train)
summary(mod3)
## 
## Call:
## glm(formula = target ~ . - document, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49385  -0.48487  -0.00048   0.48701   2.20060  
## 
## Coefficients: (22 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -5.262e+03  1.508e+03  -3.491 0.000493 ***
## ARI                   -2.704e+02  7.884e+01  -3.430 0.000617 ***
## ARI.simple             1.367e+02  3.950e+01   3.461 0.000551 ***
## ARI.NRI                       NA         NA      NA       NA    
## Bormuth.MC             4.514e+00  1.250e+00   3.612 0.000311 ***
## Bormuth.GP             4.242e-10  8.483e-11   5.001 6.21e-07 ***
## Coleman                5.734e-03  2.653e-02   0.216 0.828918    
## Coleman.C2            -8.536e-02  1.875e-02  -4.553 5.63e-06 ***
## Coleman.Liau.ECP              NA         NA      NA       NA    
## Coleman.Liau.grade            NA         NA      NA       NA    
## Coleman.Liau.short            NA         NA      NA       NA    
## Dale.Chall             3.614e-02  8.134e-03   4.443 9.36e-06 ***
## Dale.Chall.old        -3.916e-02  2.309e-02  -1.696 0.090057 .  
## Dale.Chall.PSK                NA         NA      NA       NA    
## Danielson.Bryan        4.860e+00  1.593e+00   3.052 0.002306 ** 
## Danielson.Bryan.2     -4.182e+00  1.406e+00  -2.974 0.002971 ** 
## Dickes.Steiwer         4.792e-02  6.469e-03   7.408 1.90e-13 ***
## DRP                           NA         NA      NA       NA    
## ELF                   -1.632e-01  1.178e-01  -1.386 0.165905    
## Farr.Jenkins.Paterson         NA         NA      NA       NA    
## Flesch                 7.156e-02  2.738e-02   2.614 0.009017 ** 
## Flesch.PSK                    NA         NA      NA       NA    
## Flesch.Kincaid                NA         NA      NA       NA    
## FOG                    2.395e-01  9.186e-02   2.607 0.009208 ** 
## FOG.PSK                       NA         NA      NA       NA    
## FOG.NRI               -3.391e-02  1.726e-02  -1.964 0.049651 *  
## FORCAST                       NA         NA      NA       NA    
## FORCAST.RGL                   NA         NA      NA       NA    
## Fucks                         NA         NA      NA       NA    
## Linsear.Write          1.028e-02  4.116e-02   0.250 0.802794    
## LIW                   -3.560e-02  1.997e-02  -1.783 0.074818 .  
## nWS                    7.281e-02  6.240e-02   1.167 0.243411    
## nWS.2                         NA         NA      NA       NA    
## nWS.3                         NA         NA      NA       NA    
## nWS.4                         NA         NA      NA       NA    
## RIX                   -6.214e-02  8.409e-02  -0.739 0.460017    
## Scrabble              -4.710e-01  3.333e-01  -1.413 0.157796    
## SMOG                  -1.374e-01  1.240e-01  -1.108 0.267814    
## SMOG.C                -1.095e-02  1.658e-01  -0.066 0.947346    
## SMOG.simple                   NA         NA      NA       NA    
## SMOG.de                       NA         NA      NA       NA    
## Spache                -3.829e-01  6.853e-02  -5.588 2.63e-08 ***
## Spache.old                    NA         NA      NA       NA    
## Strain                 3.934e-01  3.419e-01   1.151 0.249963    
## Traenkle.Bailer        1.076e-02  1.425e-02   0.755 0.450115    
## Traenkle.Bailer.2     -3.630e-04  9.142e-03  -0.040 0.968332    
## Wheeler.Smith                 NA         NA      NA       NA    
## meanSentenceLength            NA         NA      NA       NA    
## meanWordSyllables             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.5479596)
## 
##     Null deviance: 2097.2  on 1982  degrees of freedom
## Residual deviance: 1071.8  on 1956  degrees of freedom
## AIC: 4463.4
## 
## Number of Fisher Scoring iterations: 2
Comment: Multi-collinearity is an issue that needs to be addressed.
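One quick way to gauge the extent of the problem (a sketch, not run above) is to count how many pairs of measures are almost perfectly correlated:
num_vars <- readability[sapply(readability, is.numeric)]
cors <- cor(num_vars, use = "pairwise.complete.obs")
# Number of near-duplicate pairs of measures (|r| > 0.99)
sum(abs(cors) > 0.99 & upper.tri(cors), na.rm = TRUE)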
Simplify the model using the step function.
mod3 <- step(mod3)
## Start:  AIC=4463.44
## target ~ (document + ARI + ARI.simple + ARI.NRI + Bormuth.MC + 
##     Bormuth.GP + Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache + 
##     Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 + 
##     Wheeler.Smith + meanSentenceLength + meanWordSyllables) - 
##     document
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache + 
##     Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 + 
##     Wheeler.Smith + meanSentenceLength
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache + 
##     Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 + 
##     Wheeler.Smith
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache + 
##     Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache + 
##     Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + SMOG.simple + Spache + Strain + 
##     Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX + 
##     Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + 
##     Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + RIX + Scrabble + 
##     SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + nWS.2 + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks + 
##     Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + 
##     Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Linsear.Write + 
##     LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + 
##     Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + FORCAST + Linsear.Write + LIW + 
##     nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + 
##     Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.PSK + FOG.NRI + Linsear.Write + LIW + nWS + RIX + 
##     Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + 
##     Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid + 
##     FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + 
##     SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Farr.Jenkins.Paterson + Flesch + FOG + FOG.NRI + Linsear.Write + 
##     LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + 
##     Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP + 
##     ELF + Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS + 
##     RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + 
##     Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + ELF + 
##     Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX + 
##     Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + 
##     Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + Linsear.Write + 
##     LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + 
##     Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Coleman.Liau.ECP + Dale.Chall + Dale.Chall.old + 
##     Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + ELF + 
##     Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX + 
##     Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + 
##     Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP + 
##     Coleman + Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
## 
## Step:  AIC=4463.44
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman + 
##     Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
## 
##                     Df Deviance    AIC
## - Traenkle.Bailer.2  1   1071.8 4461.4
## - SMOG.C             1   1071.8 4461.4
## - Coleman            1   1071.8 4461.5
## - Linsear.Write      1   1071.8 4461.5
## - RIX                1   1072.1 4462.0
## - Traenkle.Bailer    1   1072.1 4462.0
## - SMOG               1   1072.5 4462.7
## - Strain             1   1072.5 4462.8
## - nWS                1   1072.5 4462.8
## - ELF                1   1072.9 4463.4
## <none>                   1071.8 4463.4
## - Scrabble           1   1072.9 4463.5
## - Dale.Chall.old     1   1073.4 4464.4
## - LIW                1   1073.5 4464.7
## - FOG.NRI            1   1073.9 4465.4
## - FOG                1   1075.5 4468.3
## - Flesch             1   1075.5 4468.4
## - Danielson.Bryan.2  1   1076.7 4470.4
## - Danielson.Bryan    1   1076.9 4470.9
## - ARI                1   1078.2 4473.3
## - ARI.simple         1   1078.4 4473.5
## - Bormuth.MC         1   1079.0 4474.6
## - Dale.Chall         1   1082.6 4481.4
## - Coleman.C2         1   1083.2 4482.3
## - Bormuth.GP         1   1085.5 4486.6
## - Spache             1   1088.9 4492.8
## - Dickes.Steiwer     1   1101.9 4516.3
## 
## Step:  AIC=4461.45
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman + 
##     Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     SMOG.C + Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - SMOG.C             1   1071.8 4459.4
## - Coleman            1   1071.8 4459.5
## - Linsear.Write      1   1071.8 4459.5
## - RIX                1   1072.1 4460.0
## - SMOG               1   1072.5 4460.7
## - Strain             1   1072.5 4460.8
## - nWS                1   1072.6 4460.8
## - ELF                1   1072.9 4461.4
## <none>                   1071.8 4461.4
## - Scrabble           1   1072.9 4461.5
## - Dale.Chall.old     1   1073.4 4462.4
## - LIW                1   1073.5 4462.7
## - FOG.NRI            1   1073.9 4463.4
## - Traenkle.Bailer    1   1074.4 4464.2
## - FOG                1   1075.5 4466.3
## - Flesch             1   1075.5 4466.4
## - Danielson.Bryan.2  1   1076.7 4468.4
## - Danielson.Bryan    1   1076.9 4468.9
## - ARI                1   1078.3 4471.4
## - ARI.simple         1   1078.4 4471.6
## - Bormuth.MC         1   1079.0 4472.6
## - Dale.Chall         1   1082.6 4479.4
## - Coleman.C2         1   1083.2 4480.5
## - Bormuth.GP         1   1085.5 4484.7
## - Spache             1   1088.9 4490.9
## - Dickes.Steiwer     1   1101.9 4514.4
## 
## Step:  AIC=4459.45
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman + 
##     Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + 
##     Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - Coleman            1   1071.8 4457.5
## - Linsear.Write      1   1071.8 4457.5
## - RIX                1   1072.1 4458.0
## - Strain             1   1072.5 4458.8
## - nWS                1   1072.6 4458.8
## - ELF                1   1072.9 4459.4
## <none>                   1071.8 4459.4
## - Scrabble           1   1072.9 4459.5
## - Dale.Chall.old     1   1073.4 4460.4
## - LIW                1   1073.5 4460.7
## - FOG.NRI            1   1073.9 4461.4
## - Traenkle.Bailer    1   1074.4 4462.2
## - FOG                1   1075.5 4464.3
## - Flesch             1   1075.6 4464.4
## - Danielson.Bryan.2  1   1076.7 4466.4
## - Danielson.Bryan    1   1076.9 4466.9
## - ARI                1   1078.3 4469.4
## - ARI.simple         1   1078.4 4469.6
## - Bormuth.MC         1   1079.1 4470.8
## - Dale.Chall         1   1082.7 4477.5
## - Coleman.C2         1   1083.3 4478.6
## - SMOG               1   1084.8 4481.2
## - Bormuth.GP         1   1086.1 4483.6
## - Spache             1   1089.0 4489.0
## - Dickes.Steiwer     1   1101.9 4512.4
## 
## Step:  AIC=4457.5
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + Linsear.Write + 
##     LIW + nWS + RIX + Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - Linsear.Write      1   1071.8 4455.5
## - RIX                1   1072.1 4456.1
## - nWS                1   1072.6 4456.8
## <none>                   1071.8 4457.5
## - Scrabble           1   1073.0 4457.6
## - Strain             1   1073.3 4458.2
## - Dale.Chall.old     1   1073.5 4458.5
## - LIW                1   1073.6 4458.7
## - FOG.NRI            1   1074.0 4459.5
## - Traenkle.Bailer    1   1074.4 4460.2
## - ELF                1   1074.5 4460.4
## - Danielson.Bryan.2  1   1076.7 4464.4
## - Danielson.Bryan    1   1076.9 4464.9
## - FOG                1   1077.7 4466.3
## - ARI                1   1078.3 4467.5
## - ARI.simple         1   1078.5 4467.7
## - Flesch             1   1079.2 4469.1
## - Bormuth.MC         1   1082.3 4474.8
## - Dale.Chall         1   1082.7 4475.5
## - SMOG               1   1085.0 4479.7
## - Spache             1   1089.0 4487.0
## - Coleman.C2         1   1089.5 4488.0
## - Bormuth.GP         1   1092.5 4493.3
## - Dickes.Steiwer     1   1102.0 4510.5
## 
## Step:  AIC=4455.52
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + nWS + 
##     RIX + Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - RIX                1   1072.1 4454.1
## - nWS                1   1072.6 4454.9
## <none>                   1071.8 4455.5
## - Scrabble           1   1073.0 4455.6
## - Dale.Chall.old     1   1073.5 4456.5
## - LIW                1   1073.7 4456.9
## - FOG.NRI            1   1074.0 4457.6
## - Traenkle.Bailer    1   1074.4 4458.3
## - Strain             1   1077.0 4463.1
## - Danielson.Bryan.2  1   1077.1 4463.2
## - ELF                1   1077.3 4463.6
## - Danielson.Bryan    1   1077.3 4463.6
## - ARI                1   1078.9 4466.5
## - ARI.simple         1   1079.0 4466.8
## - Dale.Chall         1   1082.7 4473.5
## - FOG                1   1084.5 4476.8
## - Bormuth.MC         1   1085.2 4478.1
## - SMOG               1   1085.6 4478.8
## - Flesch             1   1088.0 4483.1
## - Spache             1   1089.1 4485.1
## - Coleman.C2         1   1098.3 4501.9
## - Bormuth.GP         1   1100.6 4506.0
## - Dickes.Steiwer     1   1102.0 4508.6
## 
## Step:  AIC=4454.06
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + nWS + 
##     Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - nWS                1   1072.9 4453.4
## <none>                   1072.1 4454.1
## - Scrabble           1   1073.3 4454.1
## - Dale.Chall.old     1   1073.8 4455.2
## - FOG.NRI            1   1074.3 4456.0
## - Traenkle.Bailer    1   1074.7 4456.8
## - Strain             1   1077.1 4461.3
## - Danielson.Bryan    1   1077.3 4461.6
## - ELF                1   1077.4 4461.8
## - Danielson.Bryan.2  1   1077.5 4462.0
## - ARI                1   1079.3 4465.3
## - ARI.simple         1   1079.4 4465.5
## - Dale.Chall         1   1083.0 4472.0
## - FOG                1   1085.0 4475.7
## - Bormuth.MC         1   1085.2 4476.1
## - SMOG               1   1086.2 4477.9
## - Flesch             1   1088.3 4481.7
## - Spache             1   1089.3 4483.5
## - LIW                1   1090.2 4485.3
## - Coleman.C2         1   1098.5 4500.1
## - Bormuth.GP         1   1100.7 4504.2
## - Dickes.Steiwer     1   1102.1 4506.7
## 
## Step:  AIC=4453.43
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + Scrabble + 
##     SMOG + Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## - Scrabble           1   1073.9 4453.4
## <none>                   1072.9 4453.4
## - Dale.Chall.old     1   1074.6 4454.7
## - FOG.NRI            1   1075.0 4455.4
## - Traenkle.Bailer    1   1075.2 4455.7
## - Strain             1   1077.6 4460.1
## - ELF                1   1077.8 4460.4
## - Danielson.Bryan    1   1078.1 4461.0
## - Danielson.Bryan.2  1   1078.3 4461.4
## - ARI                1   1080.1 4464.8
## - ARI.simple         1   1080.3 4465.0
## - Dale.Chall         1   1083.9 4471.7
## - SMOG               1   1086.4 4476.3
## - Bormuth.MC         1   1088.0 4479.2
## - Flesch             1   1088.9 4480.8
## - Spache             1   1089.5 4481.9
## - LIW                1   1090.5 4483.8
## - FOG                1   1093.9 4489.9
## - Coleman.C2         1   1102.7 4505.8
## - Dickes.Steiwer     1   1103.0 4506.3
## - Bormuth.GP         1   1105.6 4510.9
## 
## Step:  AIC=4453.38
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 + 
##     Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 + 
##     Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + SMOG + 
##     Spache + Strain + Traenkle.Bailer
## 
##                     Df Deviance    AIC
## <none>                   1073.9 4453.4
## - Dale.Chall.old     1   1075.7 4454.7
## - FOG.NRI            1   1076.0 4455.2
## - Traenkle.Bailer    1   1076.2 4455.5
## - Strain             1   1078.6 4460.0
## - ELF                1   1078.8 4460.4
## - Danielson.Bryan    1   1079.3 4461.2
## - Danielson.Bryan.2  1   1079.5 4461.7
## - ARI                1   1081.4 4465.1
## - ARI.simple         1   1081.5 4465.3
## - Dale.Chall         1   1085.3 4472.2
## - SMOG               1   1087.6 4476.5
## - Bormuth.MC         1   1089.1 4479.1
## - Flesch             1   1089.8 4480.4
## - Spache             1   1090.0 4480.8
## - LIW                1   1091.6 4483.8
## - FOG                1   1095.2 4490.3
## - Coleman.C2         1   1104.2 4506.5
## - Dickes.Steiwer     1   1105.6 4509.0
## - Bormuth.GP         1   1106.6 4510.7
summary(mod3)
## 
## Call:
## glm(formula = target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + 
##     Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan + 
##     Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG + 
##     FOG.NRI + LIW + SMOG + Spache + Strain + Traenkle.Bailer, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5321  -0.4864   0.0104   0.4903   2.1888  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5.390e+03  1.434e+03  -3.759 0.000175 ***
## ARI               -2.772e+02  7.520e+01  -3.686 0.000234 ***
## ARI.simple         1.401e+02  3.763e+01   3.723 0.000202 ***
## Bormuth.MC         4.580e+00  8.713e-01   5.256 1.63e-07 ***
## Bormuth.GP         4.342e-10  5.622e-11   7.723 1.80e-14 ***
## Coleman.C2        -8.774e-02  1.179e-02  -7.439 1.50e-13 ***
## Dale.Chall         3.675e-02  8.069e-03   4.555 5.57e-06 ***
## Dale.Chall.old    -4.107e-02  2.281e-02  -1.801 0.071905 .  
## Danielson.Bryan    4.723e+00  1.515e+00   3.117 0.001853 ** 
## Danielson.Bryan.2 -4.337e+00  1.358e+00  -3.194 0.001426 ** 
## Dickes.Steiwer     4.860e-02  6.391e-03   7.604 4.42e-14 ***
## ELF               -1.790e-01  5.987e-02  -2.989 0.002832 ** 
## Flesch             7.434e-02  1.382e-02   5.377 8.46e-08 ***
## FOG                2.926e-01  4.694e-02   6.234 5.54e-10 ***
## FOG.NRI           -3.226e-02  1.664e-02  -1.939 0.052680 .  
## LIW               -4.590e-02  8.076e-03  -5.684 1.51e-08 ***
## SMOG              -1.435e-01  2.871e-02  -4.998 6.32e-07 ***
## Spache            -3.678e-01  6.787e-02  -5.420 6.71e-08 ***
## Strain             4.184e-01  1.435e-01   2.916 0.003588 ** 
## Traenkle.Bailer    9.409e-03  4.662e-03   2.018 0.043705 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.5470913)
## 
##     Null deviance: 2097.2  on 1982  degrees of freedom
## Residual deviance: 1073.9  on 1963  degrees of freedom
## AIC: 4453.4
## 
## Number of Fisher Scoring iterations: 2
Make predictions for the third model on the test set and plot the targets against predictions.
test$preds3 <- predict(mod3, newdata = test)
test %>% ggplot(aes(x=preds3, y=target)) +
  geom_point() +
  stat_smooth(method="lm")

Calculate R squared for the third model.
SSE <- sum((test$target - test$preds3)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared3 <- 1 - SSE/SST
Rsquared3
## [1] 0.4554672
What is the difference in R squared between the first and third models?
Rsquared3 - Rsquared1
## [1] 0.114399
Comment 1: This is an improvement of about eleven percentage points in R squared over the Flesch model. I will use this model in reading class.
Comment 2: It’s still no good, but it’s better than the Flesch.
Comment 3: It’s a start. I can come back to this later.
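The Kaggle competition itself is scored on root mean squared error, so it may be worth tracking RMSE alongside R squared. A minimal sketch for the third model:
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$target, test$preds3)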

How many different ways can you choose 19 readability formulae from 48?

choose(48,19)
## [1] 1.154185e+13

What is the probability of winning Loto 7?

choose(37,7)
## [1] 10295472

What is the probability of winning Loto 6?

choose(43,6)
## [1] 6096454
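The winning probabilities themselves are just the reciprocals of these counts:
1 / choose(37, 7) # Loto 7: about 1 in 10.3 million
1 / choose(43, 6) # Loto 6: about 1 in 6.1 million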

How does choosing 19 from 48 compare to winning the lottery?

choose(48,19)/(choose(37,7)*choose(43,5))
## [1] 1.16462

Comment: Roughly speaking, picking the right 19 formulae from 48 by chance is like winning Loto 7 outright and then picking the correct 5 numbers from Loto 6's 43 on a second ticket.

The Elvin Reading Difficulty Index

Create a new data frame for nineteen significant readability indices.
readability <- x$excerpt %>% textstat_readability(measure = c("ARI", "ARI.simple", "Bormuth", "Bormuth.GP", "Coleman.C2", "Dale.Chall", "Dale.Chall.old", "Danielson.Bryan", "Danielson.Bryan.2", "Dickes.Steiwer", "ELF", "Flesch", "FOG", "FOG.NRI", "LIW", "SMOG", "Spache", "Strain", "Traenkle.Bailer"))
readability$target <- x$target
readability$document <- NULL
Rebuild the model without splitting into a train / test set.
mod4 <- glm(target~., data=readability)
preds <- predict(mod4, newdata=readability)
SSE <- sum((readability$target - preds)^2)
SST <- sum((readability$target - mean(readability$target))^2)
Rsquared <- 1 - SSE/SST
Rsquared
## [1] 0.4810783
Add the predictions to the readability dataframe.
readability$preds <- preds
Convert predicted reading ease to a 60-point difficulty scale, with 0 the easiest and 60 the most difficult passage in this grade 3-12 corpus. Texts more difficult than anything in this corpus can still be estimated; they will simply score above 60 (and hopefully few will exceed 100).
rescale <- function(x) (max(x)-x)/(max(x) - min(x)) * 60
readability$ERD <- rescale(readability$preds)
Check the scale.
fivenum(readability$ERD)
## [1]  0.0000 22.2692 29.8053 36.8246 60.0000
Check the relationship between predictions and ERD (Elvin Reading Difficulty).
readability %>% ggplot(aes(x=preds, y=ERD)) +
  geom_line() +
  labs(title = "Elvin Reading Difficulty Against Readability Predictions",
       x = "Readability Predictions")

Find the slope and intercept of ERD vs Readability Predictions.
lm(readability$ERD~readability$preds)
## 
## Call:
## lm(formula = readability$ERD ~ readability$preds)
## 
## Coefficients:
##       (Intercept)  readability$preds  
##             15.87             -14.09

Comment: To calculate ERD from a prediction, the equation is ERD = 15.87 - (14.09 × prediction).
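For convenience, this can be wrapped in a small helper (a sketch; the coefficients are rounded, so it may differ slightly from rescale()):
# Convert a model prediction (reading ease) to the Elvin Reading Difficulty scale
erd <- function(pred) 15.87 - 14.09 * pred
For example, erd(-0.96), roughly the mean target, comes out near 29, close to the median ERD reported by fivenum() above.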

Check the prediction for the most difficult passage.
readability[which.min(readability$preds),]$ERD
## [1] 60
Check the prediction for the easiest passage.
readability[which.max(readability$preds),]$ERD
## [1] 0
Which text is the most difficult according to predictions?
x[which.min(readability$preds),]$excerpt
## [1] "Mr. Perkin, an English chemist, and Messrs. Graebe and Liebermann, German chemists, almost simultaneously applied for patents in 1869, in England, and as their methods were nearly identical they arranged priorities by the exchanging of licenses. The German license became the property of the Badische Aniline Company, and the English license became the property of the predecessors of the North British Alizarine Company. These patents expire in about two months, and the lecturer explained that an attempt made by the German manufacturers to further monopolize this industry (even after the expiry of the patent) proved abortive. He also stated that alizarine, 20 percent quality, is sold today at 2s 6d. per lb., but that if the price were reduced by one-half there will still be a handsome profit to makers, and that the United Kingdom is the largest consumer, absorbing one-third of the entire production, and that England possesses advantages over all other countries for manufacturing alizarine--first, by having a splendid supply of the raw material, anthracine; secondly, cheaper caustic soda in England than in Germany by fully £4 per ton; thirdly, cheaper fuel; fourthly, large consumption at our own doors; and, fifthly, special facilities for exporting."
Which text is the easiest according to predictions?
x[which.max(readability$preds),]$excerpt
## [1] "Cat and Dog look through the window. They look through the window. Then Cat and Dog see a butterfly! The butterfly is pink. Cat and Dog want to catch the butterfly. Cat and Dog follow the butterfly. They follow the butterfly. Cat and Dog follow the butterfly by foot. They walk after the butterfly. But the butterfly is fast. The butterfly is too fast, and Cat and Dog are slow. They are too slow. Cat and Dog follow the butterfly by bike. They ride after the butterfly. But the butterfly is fast. The butterfly is very fast, and Cat and Dog are slow. They are very slow. Cat and Dog follow the butterfly by car. They drive after the butterfly. But the butterfly is fast. The butterfly is still too fast, and Cat and Dog are slow. They are still too slow. Cat and Dog follow the butterfly by boat. They float after the butterfly. But the butterfly is fast. The butterfly is super-fast, and Cat and Dog are slow. They are still super-slow."
Comment: This text looks easier than the one about dinosaurs. I'm content with the model for the time being.

•••••••••••••••••••••••••••••••••••••••••••••••••••••

How to Use in Class

Students send article URLs to me when it’s their turn.
url <- "https://asia.nikkei.com/Politics/China-s-three-child-policy-aims-to-head-off-demographic-crisis"

title <- read_html(url) %>% 
  html_nodes(".article-header__title .ezstring-field") %>% 
  html_text() %>% str_squish

author <- read_html(url) %>% 
  html_node(".article__details") %>% 
  html_text() %>% str_squish

x <- read_html(url) %>% 
  html_nodes("p") %>%
  html_text %>% corpus
Convert the article paragraphs to a single-document corpus.
x <- corpus(texts(x, groups = rep(1, ndoc(x))))
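An alternative that avoids the groups trick (a sketch; it assumes the same scraped paragraphs) is to paste the paragraphs together before building the one-document corpus:
paragraphs <- read_html(url) %>% html_nodes("p") %>% html_text()
x <- corpus(paste(paragraphs, collapse = "\n"))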
Calculate the article readability using my algorithm of 19 readability indices.
article <- x %>% textstat_readability(measure = c("ARI", "ARI.simple", "Bormuth", "Bormuth.GP", "Coleman.C2", "Dale.Chall", "Dale.Chall.old", "Danielson.Bryan", "Danielson.Bryan.2", "Dickes.Steiwer", "ELF", "Flesch", "FOG", "FOG.NRI", "LIW", "SMOG", "Spache", "Strain", "Traenkle.Bailer"))
Predict reading ease for the article.
article$pred <- predict(mod4, newdata = article)
What is the predicted reading ease of the passage?
article$pred
## [1] -4.119014
Convert predicted reading ease to ERD.
(15.87 -14.09*(article$pred)) %>% round
## [1] 74
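(With the erd() helper sketched earlier, round(erd(article$pred)) gives the same value.)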

Comment: The text is more difficult than the most difficult high school text from the original corpus.

Students also look at other features of the article.
x %>% ntoken %>% sum
## [1] 912
Scanning Questions (that students ask each other)
When did China end its one-child policy?
kwic(x, phrase("China ended"), window=10) %>% summarise(keyword, post)
##       keyword                                                       post
## 1 China ended its longtime one-child policy in 2016 , seeking to reverse
What proportion of Chinese people are 65 or older?
kwic(x, phrase("65 or older"), window=20) %>% summarise(keyword, post)
##       keyword
## 1 65 or older
##                                                                                                                 post
## 1 increased by 60 % . Their share of China's population stands at 13.5 % , just below the internationally recognized
What policies did the Politburo announce?
kwic(x, "Politburo", window=20) %>% summarise(keyword, post)
##     keyword
## 1 Politburo
## 2 Politburo
## 3 Politburo
## 4 Politburo
##                                                                                                                           post
## 1 decided Monday to ease the current two-child restriction , without specifying when the change will take effect . China ended
## 2   have stressed plans to raise the statutory retirement age to secure enough workers to man shops and keep factories humming
## 3     laid out steps to support child-rearing , including expanding day care services as well as reducing education costs . It
## 4         emphasized " guiding young people's values on marriage and family after they reach marrying age , " according to the
Frequent words (students should try to be familiar with most of these).
x %>% tokens %>% dfm(remove_punct=T,
          remove_numbers=T) %>% 
  dfm_remove(stopwords("en")) %>%
  textstat_frequency() %>% 
  filter(frequency>=2) %>%
  arrange(-frequency, feature) %>%
  summarise(word = feature, frequency)
##            word frequency
## 1          asia         9
## 2    population         7
## 3      children         6
## 4         china         5
## 5     education         5
## 6           age         4
## 7         aging         4
## 8         child         4
## 9           get         4
## 10       nikkei         4
## 11    politburo         4
## 12       social         4
## 13        ahead         3
## 14      beijing         3
## 15       births         3
## 16       change         3
## 17      china's         3
## 18        costs         3
## 19    country's         3
## 20       crisis         3
## 21   exclusives         3
## 22      experts         3
## 23     insights         3
## 24       number         3
## 25       policy         3
## 26         stay         3
## 27     straight         3
## 28      trusted         3
## 29         well         3
## 30       within         3
## 31        years         3
## 32            $         2
## 33         also         2
## 34          app         2
## 35      appears         2
## 36        asian         2
## 37         bank         2
## 38    birthrate         2
## 39        boost         2
## 40         care         2
## 41      central         2
## 42       cities         2
## 43    communist         2
## 44    continued         2
## 45         cost         2
## 46          day         2
## 47       decade         2
## 48      decided         2
## 49    declining         2
## 50  demographic         2
## 51     discover         2
## 52      dynamic         2
## 53       enough         2
## 54         even         2
## 55    expensive         2
## 56     families         2
## 57         fell         2
## 58    fertility         2
## 59   government         2
## 60   households         2
## 61 improvements         2
## 62    increased         2
## 63         just         2
## 64         last         2
## 65       market         2
## 66     marriage         2
## 67         need         2
## 68          new         2
## 69          one         2
## 70        party         2
## 71      pension         2
## 72       people         2
## 73     people's         2
## 74       public         2
## 75        raise         2
## 76      raising         2
## 77         rate         2
## 78       rising         2
## 79       school         2
## 80     security         2
## 81     services         2
## 82    shrinking         2
## 83     spending         2
## 84 subscription         2
## 85        urban         2
## 86        women         2
## 87      working         2
## 88        world         2
## 89         year         2
## 90      younger         2
## 91         yuan         2
Three-word collocations specific to the article.
x %>% tokens %>% textstat_collocations(size=3) %>% arrange(-count) %>%
  head(30) %>%
  summarise(collocation, count)
##              collocation count
## 1          the number of     3
## 2     exclusives on asia     3
## 3    with our exclusives     3
## 4         ahead with our     3
## 5      our exclusives on     3
## 6    experts within asia     3
## 7  insights from experts     3
## 8  trusted insights from     3
## 9    from experts within     3
## 10    within asia itself     3
## 11       stay ahead with     3
## 12  get trusted insights     3
## 13     a subscription to     2
## 14    the decade through     2
## 15           the all new     2
## 16         market in the     2
## 17      discover the all     2
## 18          in the world     2
## 19          for an aging     2
## 20      the most dynamic     2
## 21           the cost of     2
## 22            as well as     2
## 23   need a subscription     2
## 24       nikkei asia app     2
## 25     dynamic market in     2
## 26            you need a     2
## 27           it would be     2
## 28        all new nikkei     2
## 29       new nikkei asia     2
## 30     day care services     2

Comment: These article-specific trigrams are not quite the same as everyday English collocations, and the students recognize that.

Rare words (students may ignore some of these).
x %>% tokens %>% dfm(remove_punct=T,
          remove_symbols=T,
          remove_numbers = T) %>%
  dfm_remove(stopwords("en")) %>%
  textstat_frequency() %>% 
  filter(frequency==1) %>%
  summarise(word = feature) %>%
  arrange(word)
##                word
## 1        3-year-old
## 2              31st
## 3       35-year-old
## 4         according
## 5        affiliated
## 6            agency
## 7           allowed
## 8             alone
## 9          analyses
## 10        announced
## 11        apartment
## 12            april
## 13            areas
## 14         averaged
## 15            avert
## 16             away
## 17       ballooning
## 18            begin
## 19             best
## 20        betrothal
## 21              big
## 22          billion
## 23            birth
## 24             born
## 25           budget
## 26         burdened
## 27           called
## 28            casts
## 29           census
## 30          century
## 31    child-rearing
## 32     childbearing
## 33          chronic
## 34             come
## 35       compensate
## 36        concluded
## 37     contributing
## 38      coronavirus
## 39          country
## 40          couples
## 41             cram
## 42          current
## 43          customs
## 44             data
## 45          decline
## 46        delivered
## 47        difficult
## 48      discouraged
## 49       disposable
## 50         domestic
## 51         downturn
## 52            drain
## 53             drop
## 54          dropped
## 55          earlier
## 56            early
## 57             ease
## 58           easier
## 59         economic
## 60           effect
## 61      eligibility
## 62       emphasized
## 63            ended
## 64             ends
## 65          ensuing
## 66        estimated
## 67        excessive
## 68           expand
## 69        expanding
## 70         expected
## 71           expert
## 72            faces
## 73           factor
## 74        factories
## 75          falling
## 76           family
## 77           famine
## 78             five
## 79             flip
## 80        following
## 81           formal
## 82          forward
## 83             four
## 84             fund
## 85            gifts
## 86           global
## 87        gradually
## 88            great
## 89          guiding
## 90           hamper
## 91             hard
## 92             high
## 93           hoping
## 94          humming
## 95             idea
## 96         imminent
## 97      importantly
## 98            inbox
## 99        including
## 100          income
## 101         incomes
## 102       insurance
## 103 internationally
## 104        involves
## 105       jinping's
## 106             job
## 107            july
## 108            keep
## 109           known
## 110           labor
## 111            laid
## 112          larger
## 113          latest
## 114      leadership
## 115            leap
## 116           leave
## 117        lifetime
## 118            lift
## 119        longtime
## 120          lowest
## 121           major
## 122            make
## 123             man
## 124            many
## 125       marriages
## 126         married
## 127        marrying
## 128       maternity
## 129         medical
## 130         million
## 131          monday
## 132           money
## 133       mortgages
## 134             net
## 135            news
## 136     newsletters
## 137             now
## 138           offer
## 139           often
## 140           older
## 141       one-child
## 142   opportunities
## 143            pall
## 144           paper
## 145          parent
## 146         party's
## 147            peak
## 148      per-capita
## 149      percentage
## 150          permit
## 151         picking
## 152           plans
## 153          plunge
## 154           point
## 155          points
## 156        policies
## 157            pool
## 158          poorly
## 159     potentially
## 160       predicted
## 161      preschools
## 162       president
## 163      previously
## 164        programs
## 165          proved
## 166            push
## 167          quoted
## 168           rapid
## 169         rapidly
## 170           reach
## 171         reached
## 172        received
## 173      recognized
## 174        reducing
## 175         reeling
## 176           relax
## 177          relied
## 178        removing
## 179        response
## 180     restriction
## 181    restrictions
## 182      retirement
## 183         reverse
## 184          review
## 185          safety
## 186            said
## 187          saying
## 188          second
## 189          secure
## 190         seekers
## 191         seeking
## 192           seven
## 193        shanghai
## 194           share
## 195           sharp
## 196           shift
## 197           shops
## 198     short-lived
## 199        shortage
## 200           shows
## 201            side
## 202            sign
## 203           since
## 204    skyrocketing
## 205           slide
## 206         society
## 207      specifying
## 208          speedy
## 209          stands
## 210       state-run
## 211       statutory
## 212          steady
## 213        stemming
## 214           steps
## 215           still
## 216         stories
## 217        stressed
## 218       suggested
## 219         support
## 220            take
## 221          taking
## 222        targeted
## 223   technological
## 224       threatens
## 225           three
## 226       threshold
## 227           times
## 228           total
## 229           trend
## 230          trying
## 231      turnaround
## 232         turning
## 233             two
## 234       two-child
## 235    unemployment
## 236        unlikely
## 237           urged
## 238          values
## 239            view
## 240           voice
## 241         without
## 242         woman's
## 243         workers
## 244              xi
## 245          xinhua
## 246           young
Guess a word from its context.
kwic(x, "ballooning") %>% summarise(keyword, post)
##      keyword                                   post
## 1 Ballooning apartment costs have burdened families
kwic(x, "chronic") %>% summarise(keyword, post)
##   keyword                          post
## 1 chronic shortage of day care services
kwic(x, "threshold", window=20) %>% summarise(pre, keyword, post)
##                                                                                                          pre
## 1 60 % . Their share of China's population stands at 13.5 % , just below the internationally recognized 14 %
##     keyword
## 1 threshold
##                                                                                                    post
## 1 for an aging society . The flip side of China's rapid aging is a sharp drop in the working population
Students have to include an in-text citation and reference in APA format when they write their summaries and opinions. I have to do this manually here, too.
author
## [1] "IORI KAWATE, Nikkei staff writerMay 31, 2021 16:51 JSTUpdated on June 1, 2021 03:49 JST | China"
In-text citation in APA format

(Kawate, 2021).

title
## [1] "China's three-child policy aims to head off demographic crisis"
Reference in APA format

Kawate, I. (2021, June 1). China’s three-child policy aims to head off demographic crisis. Nikkei Asia. Retrieved June 1, 2021, from https://asia.nikkei.com/Politics/China-s-three-child-policy-aims-to-head-off-demographic-crisis

•••••••••••••••••••••••••••••••••••••••••••••••••••••

Elvin Reading Difficulty Index to be continued…