Introduction

This exploratory analysis uses Read180U data from a California school district, focusing on grade level, Lexile score, school and class name, current reading level, total R180U sessions, total minutes, and other program data from HMH’s Read180U.

This analysis uses a data mining technique called k-means clustering. It is an unsupervised learning technique in which data points or records are grouped together based on similarity. k refers to the number of clusters in the analysis and is usually selected by the data scientist based on domain knowledge and the characteristics of the data. This technique also helps us explore the overall variability of the data and identify commonalities between records. It is a common approach used for market segmentation.

Centroids are imaginary or real locations that represent the center of each cluster. These cluster centers (centroids) are selected by the algorithm and then optimized based on the nearest data points. This means that groupings of the data (student Lexile scores and other variables within the data set) can form organically.

There is no exact science for determining the number of clusters, k. Because of this, data scientists will try different numbers of clusters and compare the results.

Basic Lexile Information per Grade
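
A per-grade summary like the one below can be produced with a simple aggregation. This is a minimal sketch only; it assumes the district export is loaded as df_lexile and that the grade column is named GRADE (an assumed name).

# Mean Lexile score per grade level; GRADE is an assumed column name
grade_means <- aggregate(df_lexile$LEXILE_SCORE,
                         by = list(Grade = df_lexile$GRADE),
                         FUN = mean, na.rm = TRUE)
names(grade_means)[2] <- "Mean_Lexile"
grade_means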

Grade   Mean Lexile Score
2       0.0000
3       136.0000
6       737.3218
7       561.4863
8       643.0370

How many clusters should I pick for this?

I used the following variables (features) to help me form the groupings: LEXILE_SCORE, R180U_TOTAL_SESSIONS, R180U_AVERAGE_SESSION_MINUTES, R180U_TOTAL_MINUTES_SPENT, and R180U_AVERAGE_SESSIONS_PER_WEEK.

Some considerations:

  1. R180U Total Sessions, Average Session Minutes, and Total Minutes Spent may be highly correlated with one another, resulting in multicollinearity; a quick correlation check is sketched just after this list.

  2. We already know that as software usage increases, so too does a student’s Lexile Score.
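
A minimal sketch of that correlation check, assuming the usage columns are named as they appear in the cluster output later in this analysis:

# Pairwise correlations among the usage features (complete cases only)
usage_cols <- c("R180U_TOTAL_SESSIONS",
                "R180U_AVERAGE_SESSION_MINUTES",
                "R180U_TOTAL_MINUTES_SPENT")
round(cor(df_lexile[, usage_cols], use = "complete.obs"), 2)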

Here is a plot to help us determine the number of clusters. This plot is very similar to a scree plot, and we look for the point called the “elbow”, or where the plot first starts leveling off.
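
The elbow plot is built from the total within-cluster sum of squares at each candidate k. A minimal sketch, assuming the clustering variables have been pulled into a data frame named df_features (an assumed object name) and scaled:

# df_features holds the five clustering variables (assumed object name)
df_scaled <- scale(df_features)

set.seed(42)
wss <- sapply(1:10, function(k) kmeans(df_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")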

It appears as though 4 clusters might be an appropriate selection, so we will run our k-means clustering algorithm with k = 4.

PAM Technique (Partitioning Around Medoids)

PAM is a more robust version of k-means. Keep in mind when looking at these data tables that all of the data must be scaled appropriately before clustering. I scale the data first, then pass the scaled data and k = 4 to the PAM algorithm.
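
A minimal sketch of that step using the cluster package (object names are assumptions):

library(cluster)

df_scaled <- scale(df_features)    # scale the clustering variables first
pam_fit4 <- pam(df_scaled, k = 4)  # partition the data around 4 medoids
pam_fit4$clusinfo                  # per-cluster size and dissimilarity summary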

This table shows the information for each of the 4 clusters.

cluster  size  max_diss  av_diss    diameter  separation
1        30    3.879709  0.7193631  4.526746  0.2831597
2        21    3.310668  1.3756077  4.440268  1.1353565
3        22    2.826814  1.2443726  5.100641  1.1353565
4        28    2.470447  0.8807569  3.660073  0.2831597

Cluster Plot

In this section, we finally generate the cluster plot from the PAM results.
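
A minimal sketch of the plotting step; this uses cluster::clusplot, though other visualizations (for example factoextra::fviz_cluster) would work as well:

# Project the PAM clusters onto the first two principal components
clusplot(pam_fit4, main = "PAM Clusters (k = 4)",
         color = TRUE, shade = TRUE, labels = 0, lines = 0)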

As you can see, the cluster plot has some very serious overlap. This is the fun part of data science: there are certainly some interesting clusters within the data set, even if the grouping is not clean. We could try k = 3 instead and pass that to the PAM algorithm. It is also possible that a different algorithm could help us understand this data better; we’ll get to that in a moment.

First, re-execute PAM using k = 3, and print the cluster information table.
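
As before, a brief sketch (object names assumed):

pam_fit3 <- pam(df_scaled, k = 3)  # re-run PAM with 3 medoids
pam_fit3$clusinfo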

cluster  size  max_diss  av_diss   diameter  separation
1        58    3.885278  1.098879  5.560688  1.696264
2        21    3.310668  1.375608  4.440268  1.135357
3        22    2.826814  1.244373  5.100641  1.135357

…and re-display the cluster plot with k = 3, to see if it gets any better.

Better, but let’s continue our exploration.

Traditional k-Means with 3 Clusters

First, I will display the cluster information, followed by a clustering visualization. The data has already been scaled, which is generally a requirement for any algorithm that uses Euclidean distance as its measure.
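
A minimal sketch of this step (object names assumed); the table below shows the first few rows of the scaled data with their assigned cluster:

set.seed(42)
km_fit <- kmeans(df_scaled, centers = 3, nstart = 25)

# Attach the cluster assignment to the scaled data and preview the first rows
df_clustered <- data.frame(df_scaled, cluster = km_fit$cluster)
head(df_clustered)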

LEXILE_SCORE R180U_TOTAL_SESSIONS R180U_AVERAGE_SESSION_MINUTES R180U_TOTAL_MINUTES_SPENT R180U_AVERAGE_SESSIONS_PER_WEEK cluster
1.3373211 0.3441387 0.8880653 0.7445086 0.7851369 2
-0.5068390 -0.3285969 1.0423998 0.1213936 -0.9020721 1
1.2761404 0.4562613 0.2969971 0.4400349 0.7851369 2
0.5070121 -1.6740681 -1.6239745 -1.6636067 -0.9020721 3
-2.7399333 -1.5619455 -1.6715883 -1.6322250 0.7851369 3
-0.0523540 0.2320161 0.6811914 0.4928963 0.7851369 2

Linear Model

We’ll first explore a simple linear regression model to examine the relationship between Lexile Score and total minutes spent in the Read 180U software for this particular district. We’ll then examine a multiple regression model to see what’s going on.

model <- lm(LEXILE_SCORE ~ R180U_TOTAL_MINUTES_SPENT, data=df_lexile,
            na.action = na.omit)
summary(model)
## 
## Call:
## lm(formula = LEXILE_SCORE ~ R180U_TOTAL_MINUTES_SPENT, data = df_lexile, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -649.7 -169.3    9.7  172.6  481.8 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               552.8112    45.7502  12.083  < 2e-16 ***
## R180U_TOTAL_MINUTES_SPENT   0.3271     0.1121   2.919  0.00434 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.7 on 99 degrees of freedom
## Multiple R-squared:  0.07926,    Adjusted R-squared:  0.06996 
## F-statistic: 8.522 on 1 and 99 DF,  p-value: 0.004343

The results show a statistically significant relationship between Lexile Score and Total Minutes Spent at the p < .01 level, although the model explains only about 8% of the variance (R-squared = 0.079). Now, let’s add more features/variables to the regression model.

The ANOVA (Analysis of Variance) results also show that the model is statistically significant, though significance alone does not mean we can fully trust it as a predictive model. We will check predictive accuracy later on in this analysis.

df3_scaled <- as.data.frame(scale(df_lexile))

model2 <- lm(LEXILE_SCORE ~ R180U_TOTAL_SESSIONS + R180U_AVERAGE_SESSION_MINUTES +
                    R180U_AVERAGE_SESSIONS_PER_WEEK,
             data = as.data.frame(df3_scaled), na.action = na.omit)
summary(model2)
## 
## Call:
## lm(formula = LEXILE_SCORE ~ R180U_TOTAL_SESSIONS + R180U_AVERAGE_SESSION_MINUTES + 
##     R180U_AVERAGE_SESSIONS_PER_WEEK, data = as.data.frame(df3_scaled), 
##     na.action = na.omit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8724 -0.7664  0.1691  0.7058  2.0969 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                     -2.454e-17  9.560e-02   0.000   1.0000  
## R180U_TOTAL_SESSIONS             3.976e-01  1.712e-01   2.323   0.0223 *
## R180U_AVERAGE_SESSION_MINUTES    2.355e-03  1.103e-01   0.021   0.9830  
## R180U_AVERAGE_SESSIONS_PER_WEEK -1.014e-01  1.603e-01  -0.633   0.5283  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9607 on 97 degrees of freedom
## Multiple R-squared:  0.1047, Adjusted R-squared:  0.07696 
## F-statistic: 3.779 on 3 and 97 DF,  p-value: 0.01299
anova(model2)
## Analysis of Variance Table
## 
## Response: LEXILE_SCORE
##                                 Df Sum Sq Mean Sq F value   Pr(>F)   
## R180U_TOTAL_SESSIONS             1 10.093 10.0926 10.9341 0.001325 **
## R180U_AVERAGE_SESSION_MINUTES    1  0.003  0.0029  0.0031 0.955696   
## R180U_AVERAGE_SESSIONS_PER_WEEK  1  0.370  0.3696  0.4004 0.528346   
## Residuals                       97 89.535  0.9230                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of this second model show a statistically significant relationship between the total number of Read 180U sessions and the student’s Lexile score, at p < .05. It appears as though the total number of sessions completed matters more than the average minutes per session or the average sessions per week. The ANOVA also indicates that the overall model is statistically significant, although the explained variance remains modest (R-squared = 0.105).

Ultimately, this provides evidence that time spent in the software, specifically the total number of sessions completed, is positively associated with the student’s Lexile score.

Next, a basic scatter plot shows the relationship between Lexile Score and Total Sessions in R180U. Please note that the data has been scaled so that the Lexile Scores and Total # of Sessions are adjusted to be on the same scale.
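
A minimal sketch of this step. The printed model below uses the short names dlex and dtot; here I assume these are simply the scaled Lexile score and total-sessions columns copied into df3_scaled (an assumed mapping).

# Short aliases for the scaled columns used in the model below (assumed mapping)
df3_scaled$dlex <- df3_scaled$LEXILE_SCORE
df3_scaled$dtot <- df3_scaled$R180U_TOTAL_SESSIONS

plot(df3_scaled$dtot, df3_scaled$dlex,
     xlab = "Total R180U Sessions (scaled)",
     ylab = "Lexile Score (scaled)")
abline(lm(dlex ~ dtot, data = df3_scaled), col = "blue")

summary(lm(dlex ~ dtot, data = df3_scaled))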

## 
## Call:
## lm(formula = dlex ~ dtot, data = df3_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9155 -0.7855  0.1897  0.7080  2.0769 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 1.085e-17  9.482e-02   0.000  1.00000   
## dtot        3.177e-01  9.530e-02   3.334  0.00121 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.953 on 99 degrees of freedom
## Multiple R-squared:  0.1009, Adjusted R-squared:  0.09184 
## F-statistic: 11.11 on 1 and 99 DF,  p-value: 0.001207

Random Forest approach…

We’d like to determine whether we can build a solid model for predicting Lexile Scores from program usage.
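
The model call appears in the output below; a minimal sketch of how it might have been set up. The roughly 75/25 train/test split and the data_test object name are assumptions based on the output sizes.

library(randomForest)

set.seed(42)
train_idx  <- sample(nrow(df_lexile), 75)      # roughly 75/25 split (assumed)
data_train <- df_lexile[train_idx, ]
data_test  <- df_lexile[-train_idx, ]

rf_fit <- randomForest(LEXILE_SCORE ~ R180U_TOTAL_MINUTES_SPENT,
                       data = data_train, mtry = 3, ntree = 100,
                       importance = TRUE, na.action = na.omit)
# mtry = 3 exceeds the single predictor available, which triggers the
# "invalid mtry" warning shown below.

summary(rf_fit)             # structure of the fitted object
print(rf_fit)               # OOB error and % variance explained
predict(rf_fit, data_test)  # predicted Lexile scores for the held-out rows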

## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within
## valid range
##                 Length Class  Mode     
## call              7    -none- call     
## type              1    -none- character
## predicted        75    -none- numeric  
## mse             100    -none- numeric  
## rsq             100    -none- numeric  
## oob.times        75    -none- numeric  
## importance        2    -none- numeric  
## importanceSD      1    -none- numeric  
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y                75    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call
## 
## Call:
##  randomForest(formula = LEXILE_SCORE ~ R180U_TOTAL_MINUTES_SPENT,      data = data_train, mtry = 3, ntree = 100, importance = TRUE,      na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 60406.91
##                     % Var explained: -31.02
##        6        8       18       21       28       29       33       45 
## 920.7513 770.3178 660.1872 630.9373 726.3912 663.1308 910.7120 549.1890 
##       46       47       48       52       54       55       57       62 
## 531.6247 432.5033 825.1147 496.5330 573.5425 648.4917 655.0838 683.7882 
##       66       68       70       71       72       75       79       92 
## 519.0692 496.5330 553.1110 731.3358 555.8260 561.6935 732.2073 899.4667 
##      100      101 
## 659.9925 494.2102

Next, let’s examine a Decision Tree

The random forest above explains a negative percentage of the variance (-31%), meaning it predicts worse than the mean for this single-predictor setup. A decision tree is generally less robust than a random forest, but it produces an interpretable tree structure that helps us group students and see which usage patterns are associated with higher Lexile scores.

Let’s take a look at the rules, then:
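
A minimal sketch of the tree and the rule listing. I’m assuming an rpart regression tree on the two predictors that appear in the rules, and rattle::asRules for the rule printout format shown below.

library(rpart)
library(rattle)

# Regression tree on the two predictors that appear in the rules below
tree_fit <- rpart(LEXILE_SCORE ~ R180U_TOTAL_SESSIONS + R180U_TOTAL_MINUTES_SPENT,
                  data = df_lexile, method = "anova")
asRules(tree_fit)  # print each terminal node as a rule with mean Lexile and coverage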

## 
##  Rule number: 15 [LEXILE_SCORE=799.69696969697 cover=33 (33%)]
##    R180U_TOTAL_SESSIONS>=16.5
##    R180U_TOTAL_MINUTES_SPENT< 539.5
##    R180U_TOTAL_MINUTES_SPENT>=407.7
## 
##  Rule number: 4 [LEXILE_SCORE=512.56 cover=25 (25%)]
##    R180U_TOTAL_SESSIONS< 16.5
##    R180U_TOTAL_MINUTES_SPENT>=32.48
## 
##  Rule number: 14 [LEXILE_SCORE=686.333333333333 cover=18 (18%)]
##    R180U_TOTAL_SESSIONS>=16.5
##    R180U_TOTAL_MINUTES_SPENT< 539.5
##    R180U_TOTAL_MINUTES_SPENT< 407.7
## 
##  Rule number: 6 [LEXILE_SCORE=640.111111111111 cover=18 (18%)]
##    R180U_TOTAL_SESSIONS>=16.5
##    R180U_TOTAL_MINUTES_SPENT>=539.5
## 
##  Rule number: 5 [LEXILE_SCORE=655.428571428571 cover=7 (7%)]
##    R180U_TOTAL_SESSIONS< 16.5
##    R180U_TOTAL_MINUTES_SPENT< 32.48

These rules characterize students in terms of the number of sessions and the amount of time in the software associated with better Lexile scores. The best Lexile scores occur for students who complete 17 or more sessions and spend a total of roughly 408 to 540 minutes in the software. Interestingly, there appear to be diminishing returns beyond 540 minutes (9 hours) of total time.

Note that this analysis is based on data from one school district in California during the October 2018 time frame; the data essentially covers the back-to-school period for AY18-19. Export dates were August 8 through October 22, 2019.