Using R for small-N classification problems

Rick Davies just wrote an interesting post which combined thoughts on QCA (and multi-valued QCA or mvQCA) and classification trees with thoughts on INUS causation.

In this little post I will just show an easy way for evaluators to do classification trees using the open-source statistical software R, rather than the RapidMiner and BigML tools which Rick used.

The question was something like: how can we look at a small-to-medium set of cases (like a dozen or a hundred countries or development programs) and tease out which factors are associated with some outcome? In Rick's example, he looked at some African countries to see which characteristics are associated with a higher percentage of women in parliament.

The data is from Krook (2010) and is provided here inline as text, which we read into a variable called ap.

# stringsAsFactors = TRUE keeps the text columns as factors so that ctree can use them
# (the default for this argument changed to FALSE in R 4.0.0).
ap = read.table(header = TRUE, stringsAsFactors = TRUE, text = "
Country ElectoralSystem Quotas WomensStatus LevelOfDevelopment PostConflict PercentageWomenNationalParliament
1     Mozambique              PR    Yes           44               0.39          Yes                              34.8
2    SouthAfrica              PR    Yes           77               0.65          Yes                              32.8
3        Burundi              PR    Yes           32               0.38          Yes                              30.5
4       Tanzania        Majority    Yes           47               0.43           No                              30.4
5         Uganda        Majority    Yes           65               0.50          Yes                              29.8
6        Namibia              PR    Yes           69               0.63          Yes                              26.9
7        Lesotho        Majority     No           66               0.50          Yes                              23.5
8        Senegal           Mixed    Yes           36               0.46           No                              22.0
9       Ethiopia        Majority    Yes           30               0.37          Yes                              21.9
10        Zambia        Majority     No           52               0.41           No                              14.6
11   SierraLeone              PR     No           55               0.34          Yes                              14.5
12 Guinea-Bissau              PR     No           29               0.35          Yes                              14.0
13        Malawi        Majority    Yes           64               0.40           No                              13.6
14         Gabon        Majority     No           68               0.63           No                              12.5
15         Niger           Mixed    Yes           18               0.31           No                              12.4
16   BurkinaFaso              PR     No           23               0.34          Yes                              11.7
17      Botswana        Majority    Yes           72               0.57           No                              11.1
18         Ghana        Majority     No           44               0.53           No                              10.9
19      Djibouti        Majority     No           21               0.50          Yes                              10.8
20          Mali        Majority    Yes           30               0.34           No                              10.2
21        Gambia        Majority     No           50               0.48           No                               9.4
22         Congo        Majority     No           49               0.52          Yes                               8.5
23         Benin              PR     No           41               0.43           No                               8.4
24         Kenya        Majority     No           58               0.50           No                               7.3
25    Madagascar           Mixed     No           55               0.51           No                               6.9
26       Nigeria        Majority     No           50               0.45           No                               6.4
")

There are literally dozens of classification and machine-learning packages to play with in R; in fact there is a whole task view dedicated to them. In this case I used the package party (install.packages('party')). So we just load up that package and ask it to predict PercentageWomenNationalParliament from all the other variables. Note that we aren't reducing the data to binary (yes/no) form here, so as not to lose information.
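To see what the `~ .` shorthand in the model formula means, here is a minimal sketch using only base R. The `toy` data frame is my own cut-down copy of the first four rows of the data above, used purely for illustration.

```r
# Toy subset: the first four rows of the data above (illustration only).
toy <- data.frame(
  Country = c("Mozambique", "SouthAfrica", "Burundi", "Tanzania"),
  ElectoralSystem = c("PR", "PR", "PR", "Majority"),
  Quotas = c("Yes", "Yes", "Yes", "Yes"),
  WomensStatus = c(44, 77, 32, 47),
  LevelOfDevelopment = c(0.39, 0.65, 0.38, 0.43),
  PostConflict = c("Yes", "Yes", "Yes", "No"),
  PercentageWomenNationalParliament = c(34.8, 32.8, 30.5, 30.4),
  stringsAsFactors = TRUE
)

# terms() expands the dot into every column of the data except the outcome.
f <- PercentageWomenNationalParliament ~ .
predictors <- attr(terms(f, data = toy), "term.labels")
print(predictors)
```

Note that Country is in the list, so a formula like this treats the country name itself as a candidate predictor unless that column is dropped first (for example with ap[, -1]).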

library(party)
result = (ctree(PercentageWomenNationalParliament ~ ., data = ap))
plot(result)

[Figure: tree for PercentageWomenNationalParliament, with a single split on Quotas]

Easy. The output isn't terribly pretty but it is packed with information, and clearly shows how the range of the percentage of women in parliament is quite different for countries with and without quotas.
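The difference between the two groups can also be checked without any tree at all. This is a small base-R sketch; the two vectors are copied by hand from the Quotas and PercentageWomenNationalParliament columns above, and the name `summary_by_quota` is just mine.

```r
# Quotas and percentage of women in parliament, copied from the 26 rows above.
quotas <- c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
            "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No",
            "No", "Yes", "No", "No", "No", "No", "No", "No")
pct <- c(34.8, 32.8, 30.5, 30.4, 29.8, 26.9, 23.5, 22.0, 21.9,
         14.6, 14.5, 14.0, 13.6, 12.5, 12.4, 11.7, 11.1, 10.9,
         10.8, 10.2, 9.4, 8.5, 8.4, 7.3, 6.9, 6.4)

# Group size, minimum, median and maximum of the outcome within each Quotas group.
summary_by_quota <- sapply(split(pct, quotas), function(x) {
  c(n = length(x), min = min(x), median = median(x), max = max(x))
})
print(summary_by_quota)
```

The ranges overlap (10.2 to 34.8 with quotas, 6.4 to 23.5 without), but the medians are far apart, which is what the tree is picking up.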

This ctree function from the party package is pretty powerful; predictors can be any kind of variable and it can predict nominal as well as numerical outcomes.

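For example, here we predict ElectoralSystem from the others. PostConflict makes a contribution which is only marginally significant, at p=0.062; the split shows up only because we relax mincriterion, since 1 - 0.062 = 0.938 falls short of the default of 0.95.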

library(party)
result = (ctree(ElectoralSystem ~ ., data = ap[, -1], controls = ctree_control(mincriterion = 0.005)))
plot(result)

[Figure: tree for ElectoralSystem, with a split on PostConflict]
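The mincriterion argument is easy to misread, so here is a tiny sketch of the arithmetic. The party documentation describes it as the value of 1 - p-value that must be exceeded for a split to be made, with a default of 0.95. The two helper functions are hypothetical names of my own, not part of party.

```r
# Hypothetical helper: the p-value cutoff implied by a given mincriterion.
p_cutoff <- function(mincriterion) 1 - mincriterion

# Hypothetical helper: would a split with p-value p be accepted?
would_split <- function(p, mincriterion) (1 - p) > mincriterion

print(p_cutoff(0.95))             # cutoff implied by the documented default
print(p_cutoff(0.005))            # cutoff implied by the relaxed value used here
print(would_split(0.062, 0.95))   # PostConflict split under the default
print(would_split(0.062, 0.005))  # PostConflict split under the relaxed value
```

So a very small mincriterion means a very lenient test, not a strict one.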

But going back to the first plot, we note that this is a boring tree, with only one split. Quotas make a very significant impact at p=0.006. But after that, nothing. Even when we relax the criterion for accepting a split right down to something absurdly lenient (mincriterion = 0.005), we still get exactly the same result with only one split:

library(party)
result = (ctree(PercentageWomenNationalParliament ~ ., data = ap, 
    controls = ctree_control(mincriterion = 0.005)))
plot(result)

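[Figure: tree for PercentageWomenNationalParliament with mincriterion = 0.005, still showing only the single split on Quotas]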
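One thing that may be worth checking here, and this is an assumption rather than something tested in this post: ctree_control() also has a minsplit argument, documented as the minimum sum of weights a node needs in order to be considered for splitting, with a default of 20. With only 26 cases, both groups on either side of a Quotas split are smaller than that, so relaxing mincriterion alone may not be able to produce a second split. A base-R sketch of the arithmetic, with the group sizes counted from the data above:

```r
# Group sizes on either side of a Quotas split, counted from the 26 rows above.
quotas <- rep(c("Yes", "No"), times = c(12, 14))
node_sizes <- table(quotas)

# Assumption: the documented default of ctree_control(minsplit = ...) in party.
minsplit_default <- 20

# A node is only considered for a further split if it is at least this large.
can_split_further <- node_sizes >= minsplit_default
print(node_sizes)
print(can_split_further)
```

If that is what is going on, lowering minsplit and minbucket in ctree_control() alongside mincriterion would be the way to let the tree try further splits, though with so few cases any extra splits deserve the same scepticism.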

And this is the whole problem with decision trees and small-N studies. If the other splits which Rick found don't reach statistical significance, how do we know they aren't just spurious or coincidental results? In small-N studies we have to add a lot of good theory to make these kinds of results more plausible, as indeed Krook does in the original article.
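To make the "could this just be coincidence?" question concrete, here is a minimal permutation-test sketch in base R for the Quotas split. It is not what ctree does internally; it simply shuffles the quota labels many times and asks how often a gap in group means as large as the observed one turns up by chance. The seed, the number of shuffles and the helper name `mean_gap` are arbitrary choices of mine.

```r
# Quotas and percentage of women in parliament, copied from the 26 rows above.
quotas <- c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
            "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No",
            "No", "Yes", "No", "No", "No", "No", "No", "No")
pct <- c(34.8, 32.8, 30.5, 30.4, 29.8, 26.9, 23.5, 22.0, 21.9,
         14.6, 14.5, 14.0, 13.6, 12.5, 12.4, 11.7, 11.1, 10.9,
         10.8, 10.2, 9.4, 8.5, 8.4, 7.3, 6.9, 6.4)

# Hypothetical helper: difference in mean outcome between the two groups.
mean_gap <- function(y, g) mean(y[g == "Yes"]) - mean(y[g == "No"])

observed <- mean_gap(pct, quotas)

# Shuffle the labels to see what gaps arise when Quotas is unrelated to the outcome.
set.seed(1)
n_shuffles <- 9999
shuffled <- replicate(n_shuffles, mean_gap(pct, sample(quotas)))

# Two-sided permutation p-value, counting the observed arrangement itself.
p_value <- (sum(abs(shuffled) >= abs(observed)) + 1) / (n_shuffles + 1)
print(observed)
print(p_value)
```

The same check could be run on any other candidate split; a split whose gap is routinely matched by shuffled labels is exactly the kind that needs theory, not just a tree, to back it up.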