CommonLit Readability Prize
https://www.kaggle.com/c/commonlitreadabilityprize
Rate the complexity of literary passages for grades 3-12 classroom use
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(quanteda)
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "numericVector" of class "Mnumeric"; definition not updated
## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 4 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
library(caTools)
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
x <- read.csv('train.csv')
nrow(x)
## [1] 2834
x %>% ggplot(aes(x=target)) +
  geom_histogram(binwidth = 0.1)
# Lowest target score: the hardest excerpt in the training set
x[which.min(x$target),]$excerpt
## [1] "The commutator is peculiar, consisting of only three segments of a copper ring, while in the simplest of other continuous current generators several times that number exist, and frequently 120! segments are to be found. These three segments are made so as to be removable in a moment for cleaning or replacement. They are mounted upon a metal support, and are surrounded on all sides by a free air space, and cannot, therefore, lose their insulated condition. This feature of air insulation is peculiar to this system, and is very important as a factor in the durability of the commutator. Besides this, the commutator is sustained by supports carried in flanges upon the shaft, which flanges, as an additional safeguard, are coated all over with hard rubber, one of the finest known insulators. It may be stated, without fear of contradiction, that no other commutator made is so thoroughly insulated and protected. The three commutator segments virtually constitute a single copper ring, mounted in free air, and cut into three equal pieces by slots across its face."
# Highest target score: the easiest excerpt in the training set
x[which.max(x$target),]$excerpt
## [1] "When you think of dinosaurs and where they lived, what do you picture? Do you see hot, steamy swamps, thick jungles, or sunny plains? Dinosaurs lived in those places, yes. But did you know that some dinosaurs lived in the cold and the darkness near the North and South Poles?\nThis surprised scientists, too. Paleontologists used to believe that dinosaurs lived only in the warmest parts of the world. They thought that dinosaurs could only have lived in places where turtles, crocodiles, and snakes live today. Later, these dinosaur scientists began finding bones in surprising places.\nOne of those surprising fossil beds is a place called Dinosaur Cove, Australia. One hundred million years ago, Australia was connected to Antarctica. Both continents were located near the South Pole. Today, paleontologists dig dinosaur fossils out of the ground. They think about what those ancient bones must mean."
x$target %>% mean
## [1] -0.9593188
# An excerpt whose score sits close to the mean
x[x$target>=-0.959 & x$target <=-0.958,]$excerpt
## [1] "One noon, during a break in the rains, there was a cool soft breeze blowing; the smell of the damp grass and leaves in the hot sun felt like the warm breathing of the tired earth on one's body. A persistent bird went on all the afternoon repeating the burden of its one complaint in Nature's audience chamber.\nThe postmaster had nothing to do. The shimmer of the freshly washed leaves, and the banked-up remnants of the retreating rain-clouds were sights to see; and the postmaster was watching them and thinking to himself: \"Oh, if only some kindred soul were near—just one loving human being whom I could hold near my heart!\" This was exactly, he went on to think, what that bird was trying to say, and it was the same feeling which the murmuring leaves were striving to express. But no one knows, or would believe, that such an idea might also take possession of an ill-paid village postmaster in the deep, silent midday interval of his work."
# Flesch Reading Ease for every excerpt; column 2 of the result holds the score
Flesch <- x$excerpt %>% textstat_readability(measure='Flesch')
x$Flesch <- Flesch[,2]
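For reference, the Flesch Reading Ease score is conventionally computed from average sentence length and average syllables per word. The sketch below applies the standard formula in base R; the helper name `flesch_reading_ease` and the counts are made up for illustration and are not taken from `train.csv`.

```r
# Standard Flesch Reading Ease formula (higher score = easier text).
# `flesch_reading_ease` is a hypothetical helper, not a quanteda function.
flesch_reading_ease <- function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)
}

# Illustrative counts only: 100 words, 5 sentences, 150 syllables
fre <- flesch_reading_ease(n_words = 100, n_sentences = 5, n_syllables = 150)
fre
```

Longer sentences and longer words both lower the score, which fits the positive Flesch coefficient in the model that follows.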
set.seed(1066)
# Hold out 30% of the excerpts for testing
spl <- sample.split(x$target, SplitRatio = 0.7)
train <- x[spl,]
test <- x[!spl,]
mod <- glm(target~Flesch$Flesch, data = train)
summary(mod)
##
## Call:
## glm(formula = target ~ Flesch$Flesch, data = train)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.61468  -0.60855  -0.00169   0.60381   2.45259
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -2.838289   0.068551  -41.40   <2e-16 ***
## Flesch$Flesch  0.029756   0.001047   28.43   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7519595)
##
## Null deviance: 2097.2 on 1982 degrees of freedom
## Residual deviance: 1489.6 on 1981 degrees of freedom
## AIC: 5066.2
##
## Number of Fisher Scoring iterations: 2
test$preds <- predict(mod, newdata = test)
test %>% ggplot(aes(x=preds, y=target)) +
  geom_point() +
  stat_smooth(method="lm")
# Out-of-sample R-squared: the baseline prediction is the training mean
SSE <- sum((test$target - test$preds)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared1 <- 1 - SSE/SST
Rsquared1
## [1] 0.3410682
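The same out-of-sample R-squared calculation is repeated for each model, so it can be wrapped in a small function. This is a minimal sketch; `oos_r2` is a hypothetical name and the numbers are toy values, not competition data.

```r
# Out-of-sample R^2 = 1 - SSE/SST, where SST measures the error of always
# predicting the *training* mean. `oos_r2` is a hypothetical helper name.
oos_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - train_mean)^2)
  1 - sse / sst
}

# Toy values for illustration
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 1.5, 3.5, 3.5)
r2 <- oos_r2(actual, predicted, train_mean = 2.5)
r2
```

A value of 0 means the model does no better than predicting the training mean for every excerpt. Unlike in-sample R-squared, it can go negative on held-out data.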
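# Hand-built features: sentence length, word length, passage length, lexical diversity
x$words_per_sentence <- ntoken(x$excerpt)/nsentence(x$excerpt)
# Characters per token; nchar() also counts spaces and punctuation, so this is a rough proxy
x$avg_word_length <- nchar(x$excerpt)/ntoken(x$excerpt)
x$words <- ntoken(x$excerpt)
# textstat_lexdiv() returns a data frame, so the ratio itself is referenced as TTR$TTR
x$TTR <- x$excerpt %>% tokens %>% dfm %>% textstat_lexdiv()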
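The type-token ratio (TTR) used as a feature here is the number of distinct word forms divided by the total number of word tokens. A rough base-R sketch of the idea follows. `simple_ttr` is a hypothetical helper with a crude regex tokenizer, so its values will not match quanteda's exactly.

```r
# Type-token ratio: distinct word forms / total word tokens.
# `simple_ttr` is a hypothetical helper for illustration only.
simple_ttr <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]']+")))
  words <- words[nzchar(words)]          # drop empty strings from the split
  length(unique(words)) / length(words)
}

# 10 tokens, 7 distinct forms ("the" appears three times, "sat" twice)
ttr <- simple_ttr("The cat sat on the mat. The dog sat too.")
ttr
```

Because longer passages repeat words more often, TTR tends to fall as length grows, which is one reason to keep the word count in the model alongside it.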
# Same seed and target as before, so the split matches the one used for the Flesch model
set.seed(1066)
spl <- sample.split(x$target, SplitRatio = 0.7)
train <- x[spl,]
test <- x[!spl,]
mod2 <- glm(target~ words_per_sentence + avg_word_length + words + TTR$TTR, data=train)
summary(mod2)
##
## Call:
## glm(formula = target ~ words_per_sentence + avg_word_length +
## words + TTR$TTR, data = train)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.89398  -0.58556   0.04067   0.60441   2.37460
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         7.3717467  0.3901903  18.893  < 2e-16 ***
## words_per_sentence -0.0291378  0.0019114 -15.244  < 2e-16 ***
## avg_word_length    -0.9376830  0.0473152 -19.818  < 2e-16 ***
## words              -0.0096339  0.0009566 -10.071  < 2e-16 ***
## TTR$TTR            -1.8169489  0.3400357  -5.343 1.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7447657)
##
## Null deviance: 2097.2 on 1982 degrees of freedom
## Residual deviance: 1473.1 on 1978 degrees of freedom
## AIC: 5050.1
##
## Number of Fisher Scoring iterations: 2
test$preds2 <- predict(mod2, newdata = test)
test %>% ggplot(aes(x=preds2, y=target)) +
  geom_point() +
  stat_smooth(method="lm")
SSE <- sum((test$target - test$preds2)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared2 <- 1 - SSE/SST
Rsquared2
## [1] 0.3455149
Rsquared2 - Rsquared1
## [1] 0.004446664
# measure = "all" computes every available index, so the individual names need not be listed
readability <- x$excerpt %>% textstat_readability(measure = "all")
readability$target <- x$target
set.seed(1066)
spl <- sample.split(readability$target, SplitRatio = 0.7)
train <- readability[spl,]
test <- readability[!spl,]
mod3 <- glm(target~. -document, data=train)
summary(mod3)
##
## Call:
## glm(formula = target ~ . - document, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.49385 -0.48487 -0.00048 0.48701 2.20060
##
## Coefficients: (22 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.262e+03 1.508e+03 -3.491 0.000493 ***
## ARI -2.704e+02 7.884e+01 -3.430 0.000617 ***
## ARI.simple 1.367e+02 3.950e+01 3.461 0.000551 ***
## ARI.NRI NA NA NA NA
## Bormuth.MC 4.514e+00 1.250e+00 3.612 0.000311 ***
## Bormuth.GP 4.242e-10 8.483e-11 5.001 6.21e-07 ***
## Coleman 5.734e-03 2.653e-02 0.216 0.828918
## Coleman.C2 -8.536e-02 1.875e-02 -4.553 5.63e-06 ***
## Coleman.Liau.ECP NA NA NA NA
## Coleman.Liau.grade NA NA NA NA
## Coleman.Liau.short NA NA NA NA
## Dale.Chall 3.614e-02 8.134e-03 4.443 9.36e-06 ***
## Dale.Chall.old -3.916e-02 2.309e-02 -1.696 0.090057 .
## Dale.Chall.PSK NA NA NA NA
## Danielson.Bryan 4.860e+00 1.593e+00 3.052 0.002306 **
## Danielson.Bryan.2 -4.182e+00 1.406e+00 -2.974 0.002971 **
## Dickes.Steiwer 4.792e-02 6.469e-03 7.408 1.90e-13 ***
## DRP NA NA NA NA
## ELF -1.632e-01 1.178e-01 -1.386 0.165905
## Farr.Jenkins.Paterson NA NA NA NA
## Flesch 7.156e-02 2.738e-02 2.614 0.009017 **
## Flesch.PSK NA NA NA NA
## Flesch.Kincaid NA NA NA NA
## FOG 2.395e-01 9.186e-02 2.607 0.009208 **
## FOG.PSK NA NA NA NA
## FOG.NRI -3.391e-02 1.726e-02 -1.964 0.049651 *
## FORCAST NA NA NA NA
## FORCAST.RGL NA NA NA NA
## Fucks NA NA NA NA
## Linsear.Write 1.028e-02 4.116e-02 0.250 0.802794
## LIW -3.560e-02 1.997e-02 -1.783 0.074818 .
## nWS 7.281e-02 6.240e-02 1.167 0.243411
## nWS.2 NA NA NA NA
## nWS.3 NA NA NA NA
## nWS.4 NA NA NA NA
## RIX -6.214e-02 8.409e-02 -0.739 0.460017
## Scrabble -4.710e-01 3.333e-01 -1.413 0.157796
## SMOG -1.374e-01 1.240e-01 -1.108 0.267814
## SMOG.C -1.095e-02 1.658e-01 -0.066 0.947346
## SMOG.simple NA NA NA NA
## SMOG.de NA NA NA NA
## Spache -3.829e-01 6.853e-02 -5.588 2.63e-08 ***
## Spache.old NA NA NA NA
## Strain 3.934e-01 3.419e-01 1.151 0.249963
## Traenkle.Bailer 1.076e-02 1.425e-02 0.755 0.450115
## Traenkle.Bailer.2 -3.630e-04 9.142e-03 -0.040 0.968332
## Wheeler.Smith NA NA NA NA
## meanSentenceLength NA NA NA NA
## meanWordSyllables NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5479596)
##
## Null deviance: 2097.2 on 1982 degrees of freedom
## Residual deviance: 1071.8 on 1956 degrees of freedom
## AIC: 4463.4
##
## Number of Fisher Scoring iterations: 2
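The note "22 not defined because of singularities" means R could not estimate those coefficients because the corresponding columns are exact linear combinations of other predictors. Many readability indices are linear functions of the same few counts, which is a plausible cause here. The sketch below reproduces the effect on synthetic data; the variable names and values are invented for illustration.

```r
# Synthetic illustration (not the competition data): x3 is an exact linear
# combination of x1 and x2, so its coefficient cannot be estimated.
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$x3 <- 2 * d$x1 - d$x2
d$y  <- 1 + d$x1 + 0.5 * d$x2 + rnorm(50)

fit <- glm(y ~ x1 + x2 + x3, data = d)

# Aliased terms show up as NA coefficients
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased
```

Listing the `NA` coefficients this way gives the set of redundant measures directly, without reading them off the summary table.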
mod3 <- step(mod3)
## Start: AIC=4463.44
## target ~ (document + ARI + ARI.simple + ARI.NRI + Bormuth.MC +
## Bormuth.GP + Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache +
## Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 +
## Wheeler.Smith + meanSentenceLength + meanWordSyllables) -
## document
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache +
## Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 +
## Wheeler.Smith + meanSentenceLength
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache +
## Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2 +
## Wheeler.Smith
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache +
## Spache.old + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + SMOG.de + Spache +
## Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + SMOG.simple + Spache + Strain +
## Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + nWS.4 + RIX +
## Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer +
## Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + nWS.3 + RIX + Scrabble +
## SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + nWS.2 + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Fucks +
## Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG + SMOG.C +
## Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + FORCAST.RGL + Linsear.Write +
## LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain +
## Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + FORCAST + Linsear.Write + LIW +
## nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain +
## Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.PSK + FOG.NRI + Linsear.Write + LIW + nWS + RIX +
## Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer +
## Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + Flesch.Kincaid +
## FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble +
## SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + Flesch.PSK + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Farr.Jenkins.Paterson + Flesch + FOG + FOG.NRI + Linsear.Write +
## LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain +
## Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + DRP +
## ELF + Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS +
## RIX + Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer +
## Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Dale.Chall.PSK +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + ELF +
## Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX +
## Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer +
## Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Coleman.Liau.short + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Coleman.Liau.grade +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + Linsear.Write +
## LIW + nWS + RIX + Scrabble + SMOG + SMOG.C + Spache + Strain +
## Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Coleman.Liau.ECP + Dale.Chall + Dale.Chall.old +
## Danielson.Bryan + Danielson.Bryan.2 + Dickes.Steiwer + ELF +
## Flesch + FOG + FOG.NRI + Linsear.Write + LIW + nWS + RIX +
## Scrabble + SMOG + SMOG.C + Spache + Strain + Traenkle.Bailer +
## Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + ARI.NRI + Bormuth.MC + Bormuth.GP +
## Coleman + Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
##
## Step: AIC=4463.44
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman +
## Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer + Traenkle.Bailer.2
##
## Df Deviance AIC
## - Traenkle.Bailer.2 1 1071.8 4461.4
## - SMOG.C 1 1071.8 4461.4
## - Coleman 1 1071.8 4461.5
## - Linsear.Write 1 1071.8 4461.5
## - RIX 1 1072.1 4462.0
## - Traenkle.Bailer 1 1072.1 4462.0
## - SMOG 1 1072.5 4462.7
## - Strain 1 1072.5 4462.8
## - nWS 1 1072.5 4462.8
## - ELF 1 1072.9 4463.4
## <none> 1071.8 4463.4
## - Scrabble 1 1072.9 4463.5
## - Dale.Chall.old 1 1073.4 4464.4
## - LIW 1 1073.5 4464.7
## - FOG.NRI 1 1073.9 4465.4
## - FOG 1 1075.5 4468.3
## - Flesch 1 1075.5 4468.4
## - Danielson.Bryan.2 1 1076.7 4470.4
## - Danielson.Bryan 1 1076.9 4470.9
## - ARI 1 1078.2 4473.3
## - ARI.simple 1 1078.4 4473.5
## - Bormuth.MC 1 1079.0 4474.6
## - Dale.Chall 1 1082.6 4481.4
## - Coleman.C2 1 1083.2 4482.3
## - Bormuth.GP 1 1085.5 4486.6
## - Spache 1 1088.9 4492.8
## - Dickes.Steiwer 1 1101.9 4516.3
##
## Step: AIC=4461.45
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman +
## Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## SMOG.C + Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - SMOG.C 1 1071.8 4459.4
## - Coleman 1 1071.8 4459.5
## - Linsear.Write 1 1071.8 4459.5
## - RIX 1 1072.1 4460.0
## - SMOG 1 1072.5 4460.7
## - Strain 1 1072.5 4460.8
## - nWS 1 1072.6 4460.8
## - ELF 1 1072.9 4461.4
## <none> 1071.8 4461.4
## - Scrabble 1 1072.9 4461.5
## - Dale.Chall.old 1 1073.4 4462.4
## - LIW 1 1073.5 4462.7
## - FOG.NRI 1 1073.9 4463.4
## - Traenkle.Bailer 1 1074.4 4464.2
## - FOG 1 1075.5 4466.3
## - Flesch 1 1075.5 4466.4
## - Danielson.Bryan.2 1 1076.7 4468.4
## - Danielson.Bryan 1 1076.9 4468.9
## - ARI 1 1078.3 4471.4
## - ARI.simple 1 1078.4 4471.6
## - Bormuth.MC 1 1079.0 4472.6
## - Dale.Chall 1 1082.6 4479.4
## - Coleman.C2 1 1083.2 4480.5
## - Bormuth.GP 1 1085.5 4484.7
## - Spache 1 1088.9 4490.9
## - Dickes.Steiwer 1 1101.9 4514.4
##
## Step: AIC=4459.45
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman +
## Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + Linsear.Write + LIW + nWS + RIX + Scrabble + SMOG +
## Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - Coleman 1 1071.8 4457.5
## - Linsear.Write 1 1071.8 4457.5
## - RIX 1 1072.1 4458.0
## - Strain 1 1072.5 4458.8
## - nWS 1 1072.6 4458.8
## - ELF 1 1072.9 4459.4
## <none> 1071.8 4459.4
## - Scrabble 1 1072.9 4459.5
## - Dale.Chall.old 1 1073.4 4460.4
## - LIW 1 1073.5 4460.7
## - FOG.NRI 1 1073.9 4461.4
## - Traenkle.Bailer 1 1074.4 4462.2
## - FOG 1 1075.5 4464.3
## - Flesch 1 1075.6 4464.4
## - Danielson.Bryan.2 1 1076.7 4466.4
## - Danielson.Bryan 1 1076.9 4466.9
## - ARI 1 1078.3 4469.4
## - ARI.simple 1 1078.4 4469.6
## - Bormuth.MC 1 1079.1 4470.8
## - Dale.Chall 1 1082.7 4477.5
## - Coleman.C2 1 1083.3 4478.6
## - SMOG 1 1084.8 4481.2
## - Bormuth.GP 1 1086.1 4483.6
## - Spache 1 1089.0 4489.0
## - Dickes.Steiwer 1 1101.9 4512.4
##
## Step: AIC=4457.5
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + Linsear.Write +
## LIW + nWS + RIX + Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - Linsear.Write 1 1071.8 4455.5
## - RIX 1 1072.1 4456.1
## - nWS 1 1072.6 4456.8
## <none> 1071.8 4457.5
## - Scrabble 1 1073.0 4457.6
## - Strain 1 1073.3 4458.2
## - Dale.Chall.old 1 1073.5 4458.5
## - LIW 1 1073.6 4458.7
## - FOG.NRI 1 1074.0 4459.5
## - Traenkle.Bailer 1 1074.4 4460.2
## - ELF 1 1074.5 4460.4
## - Danielson.Bryan.2 1 1076.7 4464.4
## - Danielson.Bryan 1 1076.9 4464.9
## - FOG 1 1077.7 4466.3
## - ARI 1 1078.3 4467.5
## - ARI.simple 1 1078.5 4467.7
## - Flesch 1 1079.2 4469.1
## - Bormuth.MC 1 1082.3 4474.8
## - Dale.Chall 1 1082.7 4475.5
## - SMOG 1 1085.0 4479.7
## - Spache 1 1089.0 4487.0
## - Coleman.C2 1 1089.5 4488.0
## - Bormuth.GP 1 1092.5 4493.3
## - Dickes.Steiwer 1 1102.0 4510.5
##
## Step: AIC=4455.52
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + nWS +
## RIX + Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - RIX 1 1072.1 4454.1
## - nWS 1 1072.6 4454.9
## <none> 1071.8 4455.5
## - Scrabble 1 1073.0 4455.6
## - Dale.Chall.old 1 1073.5 4456.5
## - LIW 1 1073.7 4456.9
## - FOG.NRI 1 1074.0 4457.6
## - Traenkle.Bailer 1 1074.4 4458.3
## - Strain 1 1077.0 4463.1
## - Danielson.Bryan.2 1 1077.1 4463.2
## - ELF 1 1077.3 4463.6
## - Danielson.Bryan 1 1077.3 4463.6
## - ARI 1 1078.9 4466.5
## - ARI.simple 1 1079.0 4466.8
## - Dale.Chall 1 1082.7 4473.5
## - FOG 1 1084.5 4476.8
## - Bormuth.MC 1 1085.2 4478.1
## - SMOG 1 1085.6 4478.8
## - Flesch 1 1088.0 4483.1
## - Spache 1 1089.1 4485.1
## - Coleman.C2 1 1098.3 4501.9
## - Bormuth.GP 1 1100.6 4506.0
## - Dickes.Steiwer 1 1102.0 4508.6
##
## Step: AIC=4454.06
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + nWS +
## Scrabble + SMOG + Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - nWS 1 1072.9 4453.4
## <none> 1072.1 4454.1
## - Scrabble 1 1073.3 4454.1
## - Dale.Chall.old 1 1073.8 4455.2
## - FOG.NRI 1 1074.3 4456.0
## - Traenkle.Bailer 1 1074.7 4456.8
## - Strain 1 1077.1 4461.3
## - Danielson.Bryan 1 1077.3 4461.6
## - ELF 1 1077.4 4461.8
## - Danielson.Bryan.2 1 1077.5 4462.0
## - ARI 1 1079.3 4465.3
## - ARI.simple 1 1079.4 4465.5
## - Dale.Chall 1 1083.0 4472.0
## - FOG 1 1085.0 4475.7
## - Bormuth.MC 1 1085.2 4476.1
## - SMOG 1 1086.2 4477.9
## - Flesch 1 1088.3 4481.7
## - Spache 1 1089.3 4483.5
## - LIW 1 1090.2 4485.3
## - Coleman.C2 1 1098.5 4500.1
## - Bormuth.GP 1 1100.7 4504.2
## - Dickes.Steiwer 1 1102.1 4506.7
##
## Step: AIC=4453.43
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + Scrabble +
## SMOG + Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## - Scrabble 1 1073.9 4453.4
## <none> 1072.9 4453.4
## - Dale.Chall.old 1 1074.6 4454.7
## - FOG.NRI 1 1075.0 4455.4
## - Traenkle.Bailer 1 1075.2 4455.7
## - Strain 1 1077.6 4460.1
## - ELF 1 1077.8 4460.4
## - Danielson.Bryan 1 1078.1 4461.0
## - Danielson.Bryan.2 1 1078.3 4461.4
## - ARI 1 1080.1 4464.8
## - ARI.simple 1 1080.3 4465.0
## - Dale.Chall 1 1083.9 4471.7
## - SMOG 1 1086.4 4476.3
## - Bormuth.MC 1 1088.0 4479.2
## - Flesch 1 1088.9 4480.8
## - Spache 1 1089.5 4481.9
## - LIW 1 1090.5 4483.8
## - FOG 1 1093.9 4489.9
## - Coleman.C2 1 1102.7 4505.8
## - Dickes.Steiwer 1 1103.0 4506.3
## - Bormuth.GP 1 1105.6 4510.9
##
## Step: AIC=4453.38
## target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP + Coleman.C2 +
## Dale.Chall + Dale.Chall.old + Danielson.Bryan + Danielson.Bryan.2 +
## Dickes.Steiwer + ELF + Flesch + FOG + FOG.NRI + LIW + SMOG +
## Spache + Strain + Traenkle.Bailer
##
## Df Deviance AIC
## <none> 1073.9 4453.4
## - Dale.Chall.old 1 1075.7 4454.7
## - FOG.NRI 1 1076.0 4455.2
## - Traenkle.Bailer 1 1076.2 4455.5
## - Strain 1 1078.6 4460.0
## - ELF 1 1078.8 4460.4
## - Danielson.Bryan 1 1079.3 4461.2
## - Danielson.Bryan.2 1 1079.5 4461.7
## - ARI 1 1081.4 4465.1
## - ARI.simple 1 1081.5 4465.3
## - Dale.Chall 1 1085.3 4472.2
## - SMOG 1 1087.6 4476.5
## - Bormuth.MC 1 1089.1 4479.1
## - Flesch 1 1089.8 4480.4
## - Spache 1 1090.0 4480.8
## - LIW 1 1091.6 4483.8
## - FOG 1 1095.2 4490.3
## - Coleman.C2 1 1104.2 4506.5
## - Dickes.Steiwer 1 1105.6 4509.0
## - Bormuth.GP 1 1106.6 4510.7
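By default `step()` prints every elimination round, which accounts for the long trace above. Passing `trace = 0` suppresses that output while leaving the selection unchanged. The sketch below shows this on the built-in `mtcars` data rather than on the readability measures.

```r
# Backward elimination by AIC on a built-in dataset, with the trace silenced.
full    <- glm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)

# The selected model's AIC is no worse than the starting model's
c(full = AIC(full), reduced = AIC(reduced))
formula(reduced)
```

Stepwise selection is greedy and is tuned to the training rows, so the retained set is worth checking on held-out data, as is done next.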
summary(mod3)
##
## Call:
## glm(formula = target ~ ARI + ARI.simple + Bormuth.MC + Bormuth.GP +
## Coleman.C2 + Dale.Chall + Dale.Chall.old + Danielson.Bryan +
## Danielson.Bryan.2 + Dickes.Steiwer + ELF + Flesch + FOG +
## FOG.NRI + LIW + SMOG + Spache + Strain + Traenkle.Bailer,
## data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.5321  -0.4864   0.0104   0.4903   2.1888
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -5.390e+03  1.434e+03  -3.759 0.000175 ***
## ARI               -2.772e+02  7.520e+01  -3.686 0.000234 ***
## ARI.simple         1.401e+02  3.763e+01   3.723 0.000202 ***
## Bormuth.MC         4.580e+00  8.713e-01   5.256 1.63e-07 ***
## Bormuth.GP         4.342e-10  5.622e-11   7.723 1.80e-14 ***
## Coleman.C2        -8.774e-02  1.179e-02  -7.439 1.50e-13 ***
## Dale.Chall         3.675e-02  8.069e-03   4.555 5.57e-06 ***
## Dale.Chall.old    -4.107e-02  2.281e-02  -1.801 0.071905 .
## Danielson.Bryan    4.723e+00  1.515e+00   3.117 0.001853 **
## Danielson.Bryan.2 -4.337e+00  1.358e+00  -3.194 0.001426 **
## Dickes.Steiwer     4.860e-02  6.391e-03   7.604 4.42e-14 ***
## ELF               -1.790e-01  5.987e-02  -2.989 0.002832 **
## Flesch             7.434e-02  1.382e-02   5.377 8.46e-08 ***
## FOG                2.926e-01  4.694e-02   6.234 5.54e-10 ***
## FOG.NRI           -3.226e-02  1.664e-02  -1.939 0.052680 .
## LIW               -4.590e-02  8.076e-03  -5.684 1.51e-08 ***
## SMOG              -1.435e-01  2.871e-02  -4.998 6.32e-07 ***
## Spache            -3.678e-01  6.787e-02  -5.420 6.71e-08 ***
## Strain             4.184e-01  1.435e-01   2.916 0.003588 **
## Traenkle.Bailer    9.409e-03  4.662e-03   2.018 0.043705 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.5470913)
##
## Null deviance: 2097.2 on 1982 degrees of freedom
## Residual deviance: 1073.9 on 1963 degrees of freedom
## AIC: 4453.4
##
## Number of Fisher Scoring iterations: 2
test$preds3 <- predict(mod3, newdata = test)
test %>% ggplot(aes(x=preds3, y=target)) +
geom_point() +
stat_smooth(method="lm")
SSE <- sum((test$target - test$preds3)^2)
SST <- sum((test$target - mean(train$target))^2)
Rsquared3 <- 1 - SSE/SST
Rsquared3
## [1] 0.4554672
Rsquared3 - Rsquared1
## [1] 0.114399
choose(48,19)
## [1] 1.154185e+13
choose(37,7)
## [1] 10295472
choose(43,6)
## [1] 6096454
choose(48,19)/(choose(37,7)*choose(43,5))
## [1] 1.16462
Comment: The odds of picking these 19 measures out of 48 by chance are roughly those of getting five out of six in Loto 6 and seven out of seven in Loto 7 on two tickets.
readability <- x$excerpt %>% textstat_readability(measure = c("ARI", "ARI.simple", "Bormuth", "Bormuth.GP", "Coleman.C2", "Dale.Chall", "Dale.Chall.old", "Danielson.Bryan", "Danielson.Bryan.2", "Dickes.Steiwer", "ELF", "Flesch", "FOG", "FOG.NRI", "LIW", "SMOG", "Spache", "Strain", "Traenkle.Bailer"))
readability$target <- x$target
readability$document <- NULL
mod4 <- glm(target~., data=readability)
preds <- predict(mod4, newdata=readability)
SSE <- sum((readability$target - preds)^2)
SST <- sum((readability$target - mean(readability$target))^2)
Rsquared <- 1 - SSE/SST
Rsquared
## [1] 0.4810783
readability$preds <- preds
rescale <- function(x) (max(x)-x)/(max(x) - min(x)) * 60
readability$ERD <- rescale(readability$preds)
fivenum(readability$ERD)
## [1] 0.0000 22.2692 29.8053 36.8246 60.0000
readability %>% ggplot(aes(x=preds, y=ERD)) +
geom_line() +
labs(title = "Elvin Reading Difficulty Against Readability Predictions",
x = "Readability Predictions")
lm(readability$ERD~readability$preds)
##
## Call:
## lm(formula = readability$ERD ~ readability$preds)
##
## Coefficients:
## (Intercept) readability$preds
## 15.87 -14.09
Comment: ERD can be calculated from a prediction as 15.87 - 14.09 × prediction.
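Comment: The two coefficients follow from the rescale definition. Since ERD = 60 * (max - pred) / (max - min), the slope is -60 / (max - min) and the intercept is 60 * max / (max - min). The sketch below checks this on a small made-up vector; `toy_preds` is hypothetical and not taken from the data.

```r
# Same rescaling as above: the highest prediction (easiest text) maps to 0,
# the lowest prediction (hardest text) maps to 60.
rescale <- function(x) (max(x) - x) / (max(x) - min(x)) * 60

# Hypothetical predictions, for illustration only.
toy_preds <- c(-3.2, -1.5, -0.4, 0.3, 1.1)
toy_erd <- rescale(toy_preds)

# Closed-form slope and intercept implied by rescale().
rng <- max(toy_preds) - min(toy_preds)
slope <- -60 / rng
intercept <- 60 * max(toy_preds) / rng

# lm() on an exact linear relationship recovers the same two numbers.
fit <- lm(toy_erd ~ toy_preds)
all.equal(unname(coef(fit)), c(intercept, slope))
```

Read the other way, the rounded coefficients 15.87 and -14.09 imply a prediction range of about 60 / 14.09 ≈ 4.26, with the highest prediction near 15.87 / 14.09 ≈ 1.13.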
readability[which.min(readability$preds),]$ERD
## [1] 60
readability[which.max(readability$preds),]$ERD
## [1] 0
x[which.min(readability$preds),]$excerpt
## [1] "Mr. Perkin, an English chemist, and Messrs. Graebe and Liebermann, German chemists, almost simultaneously applied for patents in 1869, in England, and as their methods were nearly identical they arranged priorities by the exchanging of licenses. The German license became the property of the Badische Aniline Company, and the English license became the property of the predecessors of the North British Alizarine Company. These patents expire in about two months, and the lecturer explained that an attempt made by the German manufacturers to further monopolize this industry (even after the expiry of the patent) proved abortive. He also stated that alizarine, 20 percent quality, is sold today at 2s 6d. per lb., but that if the price were reduced by one-half there will still be a handsome profit to makers, and that the United Kingdom is the largest consumer, absorbing one-third of the entire production, and that England possesses advantages over all other countries for manufacturing alizarine--first, by having a splendid supply of the raw material, anthracine; secondly, cheaper caustic soda in England than in Germany by fully £4 per ton; thirdly, cheaper fuel; fourthly, large consumption at our own doors; and, fifthly, special facilities for exporting."
x[which.max(readability$preds),]$excerpt
## [1] "Cat and Dog look through the window. They look through the window. Then Cat and Dog see a butterfly! The butterfly is pink. Cat and Dog want to catch the butterfly. Cat and Dog follow the butterfly. They follow the butterfly. Cat and Dog follow the butterfly by foot. They walk after the butterfly. But the butterfly is fast. The butterfly is too fast, and Cat and Dog are slow. They are too slow. Cat and Dog follow the butterfly by bike. They ride after the butterfly. But the butterfly is fast. The butterfly is very fast, and Cat and Dog are slow. They are very slow. Cat and Dog follow the butterfly by car. They drive after the butterfly. But the butterfly is fast. The butterfly is still too fast, and Cat and Dog are slow. They are still too slow. Cat and Dog follow the butterfly by boat. They float after the butterfly. But the butterfly is fast. The butterfly is super-fast, and Cat and Dog are slow. They are still super-slow."
url <- "https://asia.nikkei.com/Politics/China-s-three-child-policy-aims-to-head-off-demographic-crisis"
page <- read_html(url) # download the page once and reuse it
title <- page %>%
html_nodes(".article-header__title .ezstring-field") %>%
html_text() %>% str_squish()
author <- page %>%
html_node(".article__details") %>%
html_text() %>% str_squish()
x <- page %>%
html_nodes("p") %>%
html_text() %>% corpus()
# Merge all paragraphs into one document; corpus_group() replaces the deprecated texts(x, groups = )
x <- corpus_group(x, groups = rep(1, ndoc(x)))
article <- x %>% textstat_readability(measure = c("ARI", "ARI.simple", "Bormuth", "Bormuth.GP", "Coleman.C2", "Dale.Chall", "Dale.Chall.old", "Danielson.Bryan", "Danielson.Bryan.2", "Dickes.Steiwer", "ELF", "Flesch", "FOG", "FOG.NRI", "LIW", "SMOG", "Spache", "Strain", "Traenkle.Bailer"))
article$pred <- predict(mod4, newdata = article)
article$pred
## [1] -4.119014
(15.87 -14.09*(article$pred)) %>% round
## [1] 74
Comment: With an ERD of 74, above the corpus maximum of 60, the article is rated more difficult than the most difficult high school text from the original corpus.
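Comment: The ERD scale was anchored to the lowest and highest predictions in the training corpus, so a new text predicted below that minimum lands above 60. A minimal sketch, assuming the rounded coefficients reported by lm() above; the helpers `erd_score` and `outside_corpus_range` are hypothetical names introduced here.

```r
# Hypothetical helper: apply the fitted linear equation to new predictions.
# The defaults are the rounded coefficients reported by lm() above.
erd_score <- function(pred, intercept = 15.87, slope = -14.09) {
  intercept + slope * pred
}

# The article's prediction from mod4.
article_erd <- erd_score(-4.119014)
round(article_erd)

# TRUE when a score falls outside the 0-60 range spanned by the corpus.
outside_corpus_range <- function(erd) erd < 0 | erd > 60
outside_corpus_range(article_erd)
```

A flag like this makes it explicit when a score is an extrapolation beyond the texts the model was fitted on.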
x %>% ntoken %>% sum
## [1] 912
kwic(x, phrase("China ended"), window=10) %>% summarise(keyword, post)
## keyword post
## 1 China ended its longtime one-child policy in 2016 , seeking to reverse
kwic(x, phrase("65 or older"), window=20) %>% summarise(keyword, post)
## keyword
## 1 65 or older
## post
## 1 increased by 60 % . Their share of China's population stands at 13.5 % , just below the internationally recognized
kwic(x, "Politburo", window=20) %>% summarise(keyword, post)
## keyword
## 1 Politburo
## 2 Politburo
## 3 Politburo
## 4 Politburo
## post
## 1 decided Monday to ease the current two-child restriction , without specifying when the change will take effect . China ended
## 2 have stressed plans to raise the statutory retirement age to secure enough workers to man shops and keep factories humming
## 3 laid out steps to support child-rearing , including expanding day care services as well as reducing education costs . It
## 4 emphasized " guiding young people's values on marriage and family after they reach marrying age , " according to the
# Removal options are set in tokens(), where quanteda 3 expects them
x %>% tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
textstat_frequency() %>%
filter(frequency >= 2) %>%
arrange(-frequency, feature) %>%
summarise(word = feature, frequency)
## word frequency
## 1 asia 9
## 2 population 7
## 3 children 6
## 4 china 5
## 5 education 5
## 6 age 4
## 7 aging 4
## 8 child 4
## 9 get 4
## 10 nikkei 4
## 11 politburo 4
## 12 social 4
## 13 ahead 3
## 14 beijing 3
## 15 births 3
## 16 change 3
## 17 china's 3
## 18 costs 3
## 19 country's 3
## 20 crisis 3
## 21 exclusives 3
## 22 experts 3
## 23 insights 3
## 24 number 3
## 25 policy 3
## 26 stay 3
## 27 straight 3
## 28 trusted 3
## 29 well 3
## 30 within 3
## 31 years 3
## 32 $ 2
## 33 also 2
## 34 app 2
## 35 appears 2
## 36 asian 2
## 37 bank 2
## 38 birthrate 2
## 39 boost 2
## 40 care 2
## 41 central 2
## 42 cities 2
## 43 communist 2
## 44 continued 2
## 45 cost 2
## 46 day 2
## 47 decade 2
## 48 decided 2
## 49 declining 2
## 50 demographic 2
## 51 discover 2
## 52 dynamic 2
## 53 enough 2
## 54 even 2
## 55 expensive 2
## 56 families 2
## 57 fell 2
## 58 fertility 2
## 59 government 2
## 60 households 2
## 61 improvements 2
## 62 increased 2
## 63 just 2
## 64 last 2
## 65 market 2
## 66 marriage 2
## 67 need 2
## 68 new 2
## 69 one 2
## 70 party 2
## 71 pension 2
## 72 people 2
## 73 people's 2
## 74 public 2
## 75 raise 2
## 76 raising 2
## 77 rate 2
## 78 rising 2
## 79 school 2
## 80 security 2
## 81 services 2
## 82 shrinking 2
## 83 spending 2
## 84 subscription 2
## 85 urban 2
## 86 women 2
## 87 working 2
## 88 world 2
## 89 year 2
## 90 younger 2
## 91 yuan 2
x %>% tokens %>% textstat_collocations(size=3) %>% arrange(-count) %>%
head(30) %>%
summarise(collocation, count)
## collocation count
## 1 the number of 3
## 2 exclusives on asia 3
## 3 with our exclusives 3
## 4 ahead with our 3
## 5 our exclusives on 3
## 6 experts within asia 3
## 7 insights from experts 3
## 8 trusted insights from 3
## 9 from experts within 3
## 10 within asia itself 3
## 11 stay ahead with 3
## 12 get trusted insights 3
## 13 a subscription to 2
## 14 the decade through 2
## 15 the all new 2
## 16 market in the 2
## 17 discover the all 2
## 18 in the world 2
## 19 for an aging 2
## 20 the most dynamic 2
## 21 the cost of 2
## 22 as well as 2
## 23 need a subscription 2
## 24 nikkei asia app 2
## 25 dynamic market in 2
## 26 you need a 2
## 27 it would be 2
## 28 all new nikkei 2
## 29 new nikkei asia 2
## 30 day care services 2
Comment: These trigrams, which are specific to the article, are not quite the same as everyday English collocations. The students recognize that.
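Comment: Several of the top trigrams ("stay ahead with", "our exclusives on", "need a subscription") appear to come from the site's promotional text rather than the article body, since every p node was scraped. One possible clean-up, sketched in base R, is to drop such paragraphs before building the corpus. The paragraphs and the pattern list below are hypothetical stand-ins, not the scraped data.

```r
# Hypothetical paragraphs standing in for the scraped `p` nodes.
paragraphs <- c(
  "China ended its longtime one-child policy in 2016.",
  "Stay ahead with our exclusives on Asia",
  "The Politburo laid out steps to support child-rearing.",
  "You need a subscription to read this."
)

# Hypothetical phrases that mark promotional text; adjust for the actual page.
promo_patterns <- c("stay ahead with", "need a subscription", "trusted insights")

# Keep only paragraphs that match none of the promotional phrases.
is_promo <- grepl(paste(promo_patterns, collapse = "|"), paragraphs,
                  ignore.case = TRUE)
body_paragraphs <- paragraphs[!is_promo]
body_paragraphs
```

The filtered vector could then be passed to corpus() in place of the raw paragraphs, which would likely change the collocation counts and the readability prediction.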
# Removal options are set in tokens(), where quanteda 3 expects them
x %>% tokens(remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
textstat_frequency() %>%
filter(frequency == 1) %>%
summarise(word = feature) %>%
arrange(word)
## word
## 1 3-year-old
## 2 31st
## 3 35-year-old
## 4 according
## 5 affiliated
## 6 agency
## 7 allowed
## 8 alone
## 9 analyses
## 10 announced
## 11 apartment
## 12 april
## 13 areas
## 14 averaged
## 15 avert
## 16 away
## 17 ballooning
## 18 begin
## 19 best
## 20 betrothal
## 21 big
## 22 billion
## 23 birth
## 24 born
## 25 budget
## 26 burdened
## 27 called
## 28 casts
## 29 census
## 30 century
## 31 child-rearing
## 32 childbearing
## 33 chronic
## 34 come
## 35 compensate
## 36 concluded
## 37 contributing
## 38 coronavirus
## 39 country
## 40 couples
## 41 cram
## 42 current
## 43 customs
## 44 data
## 45 decline
## 46 delivered
## 47 difficult
## 48 discouraged
## 49 disposable
## 50 domestic
## 51 downturn
## 52 drain
## 53 drop
## 54 dropped
## 55 earlier
## 56 early
## 57 ease
## 58 easier
## 59 economic
## 60 effect
## 61 eligibility
## 62 emphasized
## 63 ended
## 64 ends
## 65 ensuing
## 66 estimated
## 67 excessive
## 68 expand
## 69 expanding
## 70 expected
## 71 expert
## 72 faces
## 73 factor
## 74 factories
## 75 falling
## 76 family
## 77 famine
## 78 five
## 79 flip
## 80 following
## 81 formal
## 82 forward
## 83 four
## 84 fund
## 85 gifts
## 86 global
## 87 gradually
## 88 great
## 89 guiding
## 90 hamper
## 91 hard
## 92 high
## 93 hoping
## 94 humming
## 95 idea
## 96 imminent
## 97 importantly
## 98 inbox
## 99 including
## 100 income
## 101 incomes
## 102 insurance
## 103 internationally
## 104 involves
## 105 jinping's
## 106 job
## 107 july
## 108 keep
## 109 known
## 110 labor
## 111 laid
## 112 larger
## 113 latest
## 114 leadership
## 115 leap
## 116 leave
## 117 lifetime
## 118 lift
## 119 longtime
## 120 lowest
## 121 major
## 122 make
## 123 man
## 124 many
## 125 marriages
## 126 married
## 127 marrying
## 128 maternity
## 129 medical
## 130 million
## 131 monday
## 132 money
## 133 mortgages
## 134 net
## 135 news
## 136 newsletters
## 137 now
## 138 offer
## 139 often
## 140 older
## 141 one-child
## 142 opportunities
## 143 pall
## 144 paper
## 145 parent
## 146 party's
## 147 peak
## 148 per-capita
## 149 percentage
## 150 permit
## 151 picking
## 152 plans
## 153 plunge
## 154 point
## 155 points
## 156 policies
## 157 pool
## 158 poorly
## 159 potentially
## 160 predicted
## 161 preschools
## 162 president
## 163 previously
## 164 programs
## 165 proved
## 166 push
## 167 quoted
## 168 rapid
## 169 rapidly
## 170 reach
## 171 reached
## 172 received
## 173 recognized
## 174 reducing
## 175 reeling
## 176 relax
## 177 relied
## 178 removing
## 179 response
## 180 restriction
## 181 restrictions
## 182 retirement
## 183 reverse
## 184 review
## 185 safety
## 186 said
## 187 saying
## 188 second
## 189 secure
## 190 seekers
## 191 seeking
## 192 seven
## 193 shanghai
## 194 share
## 195 sharp
## 196 shift
## 197 shops
## 198 short-lived
## 199 shortage
## 200 shows
## 201 side
## 202 sign
## 203 since
## 204 skyrocketing
## 205 slide
## 206 society
## 207 specifying
## 208 speedy
## 209 stands
## 210 state-run
## 211 statutory
## 212 steady
## 213 stemming
## 214 steps
## 215 still
## 216 stories
## 217 stressed
## 218 suggested
## 219 support
## 220 take
## 221 taking
## 222 targeted
## 223 technological
## 224 threatens
## 225 three
## 226 threshold
## 227 times
## 228 total
## 229 trend
## 230 trying
## 231 turnaround
## 232 turning
## 233 two
## 234 two-child
## 235 unemployment
## 236 unlikely
## 237 urged
## 238 values
## 239 view
## 240 voice
## 241 without
## 242 woman's
## 243 workers
## 244 xi
## 245 xinhua
## 246 young
kwic(x, "ballooning") %>% summarise(keyword, post)
## keyword post
## 1 Ballooning apartment costs have burdened families
kwic(x, "chronic") %>% summarise(keyword, post)
## keyword post
## 1 chronic shortage of day care services
kwic(x, "threshold", window=20) %>% summarise(pre, keyword, post)
## pre
## 1 60 % . Their share of China's population stands at 13.5 % , just below the internationally recognized 14 %
## keyword
## 1 threshold
## post
## 1 for an aging society . The flip side of China's rapid aging is a sharp drop in the working population
author
## [1] "IORI KAWATE, Nikkei staff writerMay 31, 2021 16:51 JSTUpdated on June 1, 2021 03:49 JST | China"
(Kawate, 2021).
title
## [1] "China's three-child policy aims to head off demographic crisis"
Kawate, I. (2021, June 1). China’s three-child policy aims to head off demographic crisis. Nikkei Asia. Retrieved June 1, 2021, from https://asia.nikkei.com/Politics/China-s-three-child-policy-aims-to-head-off-demographic-crisis
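Comment: The scraped byline runs the author's name, role, and two timestamps together. A minimal base R sketch for pulling out the pieces needed for the reference is shown below; the names `author_name` and `date_pattern` are hypothetical, and the pattern assumes English month names as in this byline.

```r
# Byline exactly as scraped above: name, role, and two timestamps run together.
byline <- "IORI KAWATE, Nikkei staff writerMay 31, 2021 16:51 JSTUpdated on June 1, 2021 03:49 JST | China"

# Author name: everything before the first comma.
author_name <- sub(",.*$", "", byline)

# Dates: a capitalized month name, a day, a comma, and a four-digit year.
date_pattern <- "[A-Z][a-z]+ [0-9]{1,2}, [0-9]{4}"
dates <- regmatches(byline, gregexpr(date_pattern, byline))[[1]]

author_name
dates
```

The first date is the original publication date and the second is the update; the reference above uses the update date.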
Comment: This is easy? Because it’s about dinosaurs for kids?