Plots

As you can see, there is a crossing pattern in the data: for those who like romantic comedies the correlation is positive (blue), and for those who hate them it is negative (red). The third plot indicates that we should fit separate regression models.

Simple Linear Regression (SLR)

If we try to fit the data with a single linear regression, we fail, as we can see from the plot: one line cannot follow both trends at once, so the fitted line ends up nearly flat.

SLR Model Summary

The model summary indicates a very poor fit, with an adjusted R-squared of 0.0002928 and a slope that is not significantly different from zero (p = 0.208). If you had not noticed the pattern in the data, you would simply conclude that x has essentially no predictive value for y, on average.


Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-537.79 -102.55    1.96  102.96  513.75 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.26748    3.53271   5.171 2.56e-07 ***
x            0.08846    0.07026   1.259    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 158 on 1998 degrees of freedom
Multiple R-squared:  0.0007929, Adjusted R-squared:  0.0002928 
F-statistic: 1.586 on 1 and 1998 DF,  p-value: 0.2081
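
As a quick sanity check on the summary above, the adjusted R-squared can be recomputed from the multiple R-squared. The short Python sketch below assumes the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), and infers n = 2000 and p = 1 from the 1998 residual degrees of freedom; the function name is only illustrative.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the lm() summary: 1998 residual degrees of freedom with an
# intercept and one slope imply n = 2000 observations and p = 1 predictor.
adj = adjusted_r_squared(0.0007929, 2000, 1)
print(f"{adj:.7f}")  # matches the reported 0.0002928
```

The penalty for the single predictor is tiny at this sample size, so both versions of R-squared say the same thing: the single line explains well under 0.1% of the variance.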

Expectation-Maximization (EM)

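The EM algorithm starts from an initial grouping, fits a regression model to each subgroup by maximum likelihood, then reassigns each observation to the model that describes it best, and repeats until things settle down. (Strictly speaking, the reassignment is soft: each observation gets a posterior probability of belonging to each component, and those probabilities weight the next round of fitting.) As you can see from the plot, it works really well.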
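
To make the alternation concrete, here is a minimal Python sketch of EM for a mixture of two straight lines. It is not flexmix's implementation and does not use the data from this post. The simulated data (slopes of +2 and -2, unit noise), the random starting assignment, the fixed iteration count, and all function names are assumptions made for illustration.

```python
import math
import random


def weighted_line_fit(xs, ys, ws):
    """Weighted ML fit of y = a + b*x with Gaussian noise; returns (a, b, sigma)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    a = my - b * mx
    var = sum(w * (y - a - b * x) ** 2 for w, x, y in zip(ws, xs, ys)) / sw
    return a, b, math.sqrt(var)


def normal_pdf(resid, sigma):
    """Gaussian density of a residual with standard deviation sigma."""
    return math.exp(-0.5 * (resid / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def em_two_lines(xs, ys, n_iter=100, seed=0):
    """EM for a two-component mixture of simple linear regressions."""
    rng = random.Random(seed)
    # Assumed initialization: put each point in one of the two groups at random.
    resp = [float(rng.random() < 0.5) for _ in xs]
    for _ in range(n_iter):
        # M-step: refit each line, weighting points by their responsibilities.
        prior = sum(resp) / len(resp)
        fit0 = weighted_line_fit(xs, ys, resp)
        fit1 = weighted_line_fit(xs, ys, [1.0 - r for r in resp])
        # E-step: posterior probability that each point belongs to component 0.
        new_resp = []
        for x, y in zip(xs, ys):
            p0 = prior * normal_pdf(y - fit0[0] - fit0[1] * x, fit0[2])
            p1 = (1.0 - prior) * normal_pdf(y - fit1[0] - fit1[1] * x, fit1[2])
            total = p0 + p1
            new_resp.append(p0 / total if total > 0 else 0.5)
        resp = new_resp
    return fit0, fit1, prior, resp


# Hypothetical crossing data: half the points follow y = 2x, half y = -2x.
data_rng = random.Random(42)
xs = [data_rng.uniform(-5, 5) for _ in range(400)]
ys = [(2.0 if i % 2 == 0 else -2.0) * x + data_rng.gauss(0, 1)
      for i, x in enumerate(xs)]

fit0, fit1, prior, resp = em_two_lines(xs, ys)
print("component 0 (intercept, slope, sigma):", fit0)
print("component 1 (intercept, slope, sigma):", fit1)
print("prior of component 0:", prior)
```

A fixed number of iterations keeps the sketch short. A real implementation would stop once the log-likelihood stops improving, and would usually try several random starts, since EM can settle in a local optimum.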

EM Model Summary

The summary reports, for each component, the estimated prior probability, the number of observations assigned to it (size), the number of observations with a non-negligible posterior probability of belonging to it (post>0), and the ratio of the two. In this case, Comp.2 has a ratio of around 0.78: 1375 points had a non-negligible posterior probability of being in that group, and 78% of those (1067) were best fit by that group.


Call:
flexmix(formula = y ~ x, k = 2)

       prior size post>0 ratio
Comp.1 0.494  933   1501 0.622
Comp.2 0.506 1067   1375 0.776

'log Lik.' -11251.22 (df=7)
AIC: 22516.45   BIC: 22555.66 
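
The numbers in this summary can be cross-checked by hand. The Python sketch below recomputes the ratio column as size divided by post>0, and the AIC and BIC from the log-likelihood using the standard definitions. It takes n = 2000 from the sum of the two sizes, and it assumes that the 7 degrees of freedom are an intercept, a slope, and a residual standard deviation per component plus one free mixing proportion.

```python
import math

# Values copied from the flexmix summary above.
size = {"Comp.1": 933, "Comp.2": 1067}
post_gt0 = {"Comp.1": 1501, "Comp.2": 1375}

# ratio = observations assigned to the component / observations with post > 0
ratio = {name: size[name] / post_gt0[name] for name in size}
for name, value in ratio.items():
    print(name, round(value, 3))

log_lik = -11251.22
df = 7                      # assumed: 2 * (intercept, slope, sigma) + 1 prior
n_obs = sum(size.values())  # 2000

aic = -2 * log_lik + 2 * df
bic = -2 * log_lik + df * math.log(n_obs)
print(f"AIC: {aic:.2f}  BIC: {bic:.2f}")
```

The results agree with the reported AIC and BIC to within about 0.01, which is what the rounding of the printed log-likelihood allows.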

Rootograms

In a rootogram, the heights of the bars correspond to the square roots of the counts rather than the counts themselves, so that low counts are more visible and peaks less so. A peak near probability 1 indicates that many of the points are overwhelmingly well represented by that cluster. A peak near 0 indicates that many points clearly don’t fit the cluster (which is also usually good). Mass in the middle indicates a lack of separation: there are points that are only moderately well described by that group, which should be weighed against the number of groups you think exist.
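
The effect of the square-root scale is easy to see with a toy example. The Python sketch below bins a made-up list of posterior probabilities (not the posteriors from the model above) and takes the square root of each count; the function name and the choice of ten bins are assumptions for illustration.

```python
import math


def rootogram_heights(posteriors, n_bins=10):
    """Bin probabilities in [0, 1] and return (counts, sqrt(counts)) per bin."""
    counts = [0] * n_bins
    for p in posteriors:
        # A probability of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts, [math.sqrt(c) for c in counts]


# Hypothetical posteriors: two well-separated peaks and a few ambiguous points.
posteriors = [0.01] * 400 + [0.5] * 16 + [0.99] * 400
counts, heights = rootogram_heights(posteriors)

for lo, c, h in zip(range(10), counts, heights):
    print(f"[{lo / 10:.1f}, {(lo + 1) / 10:.1f}): count={c:3d}  height={h:.1f}")
```

On the count scale the middle bin is 25 times smaller than the peaks (16 versus 400) and would be almost invisible. On the square-root scale it is only 5 times smaller (4 versus 20), so ambiguous points are much easier to spot.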