Rick Davies just wrote an interesting post which combined thoughts on QCA (and multi-valued QCA or mvQCA) and classification trees with thoughts on INUS causation.
In this little post I will just show an easy way for evaluators to do classification trees using the open-source statistics software R rather than the Rapid Miner and BigML tools which Rick used.
The question was something like: how can we look at a small-to-medium set of cases (like a dozen or a hundred countries or development programs) and tease out which factors are associated with some outcome? In Rick's example, he looked at some African countries to see which characteristics are associated with a higher percentage of women in parliament.
The data is from Krook (2010) and is provided here as plain text, which we read straight into a variable we will call ap.
# stringsAsFactors = TRUE turns the text columns into factors, which ctree expects
# (only needed explicitly in R >= 4.0, where the default changed)
ap = read.table(header = TRUE, stringsAsFactors = TRUE, text = "
Country ElectoralSystem Quotas WomensStatus LevelOfDevelopment PostConflict PercentageWomenNationalParliament
1 Mozambique PR Yes 44 0.39 Yes 34.8
2 SouthAfrica PR Yes 77 0.65 Yes 32.8
3 Burundi PR Yes 32 0.38 Yes 30.5
4 Tanzania Majority Yes 47 0.43 No 30.4
5 Uganda Majority Yes 65 0.50 Yes 29.8
6 Namibia PR Yes 69 0.63 Yes 26.9
7 Lesotho Majority No 66 0.50 Yes 23.5
8 Senegal Mixed Yes 36 0.46 No 22.0
9 Ethiopia Majority Yes 30 0.37 Yes 21.9
10 Zambia Majority No 52 0.41 No 14.6
11 SierraLeone PR No 55 0.34 Yes 14.5
12 Guinea-Bissau PR No 29 0.35 Yes 14.0
13 Malawi Majority Yes 64 0.40 No 13.6
14 Gabon Majority No 68 0.63 No 12.5
15 Niger Mixed Yes 18 0.31 No 12.4
16 BurkinaFaso PR No 23 0.34 Yes 11.7
17 Botswana Majority Yes 72 0.57 No 11.1
18 Ghana Majority No 44 0.53 No 10.9
19 Djibouti Majority No 21 0.50 Yes 10.8
20 Mali Majority Yes 30 0.34 No 10.2
21 Gambia Majority No 50 0.48 No 9.4
22 Congo Majority No 49 0.52 Yes 8.5
23 Benin PR No 41 0.43 No 8.4
24 Kenya Majority No 58 0.50 No 7.3
25 Madagascar Mixed No 55 0.51 No 6.9
26 Nigeria Majority No 50 0.45 No 6.4
")
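Before fitting anything, it is worth a quick check that the data came in as expected (not essential, just a good habit):
# Check the structure: should be 26 rows and 7 columns, with the text columns as factors
str(ap)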
There are literally dozens of classification and machine-learning packages to play with in R; in fact there is a whole CRAN task view dedicated to machine learning. In this case I used the package party (install.packages('party')). So we just load up that package and ask it to predict PercentageWomenNationalParliament from all the other variables. Note that we aren't reducing the data to binary (yes/no) form here, so we don't lose information.
library(party)
# Fit a conditional inference tree: the outcome against all other variables
result = ctree(PercentageWomenNationalParliament ~ ., data = ap)
plot(result)
Easy. The output isn't terribly pretty but it is packed with information, and clearly shows how the range of the percentage of women in parliament is quite different for countries with and without quotas.
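To see that contrast in numbers as well as in the plot (this is just base R, not part of the tree itself), we can summarise the outcome separately for the quota and no-quota countries:
# Summary of % women in parliament, split by whether the country has quotas
tapply(ap$PercentageWomenNationalParliament, ap$Quotas, summary)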
This ctree function from the party package is pretty powerful; predictors can be any kind of variable and it can predict nominal as well as numerical outcomes.
For example, here we predict ElectoralSystem from the other variables (dropping the country names). PostConflict makes a contribution at p = 0.062, which just misses the conventional 0.05 cut-off, so we relax ctree's mincriterion to make it show the split anyway.
library(party)
# Drop the Country column (ap[, -1]) so country names are not used as a predictor,
# and lower mincriterion (the 1 - p-value needed for a split) so weaker splits still appear
result = ctree(ElectoralSystem ~ ., data = ap[, -1],
               controls = ctree_control(mincriterion = 0.005))
plot(result)
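As a rough in-sample check (not a real validation, with only 26 countries), we can cross-tabulate the tree's fitted values against the observed electoral systems:
# Fitted vs observed electoral system, on the same data the tree was grown on
table(Predicted = predict(result), Observed = ap$ElectoralSystem)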
But going back to the first plot, we note that this is a rather boring tree, with only one split: Quotas has a highly significant effect at p = 0.006, but after that, nothing. Even when we relax the significance threshold we require for a split to something absurdly lenient, we still get exactly the same tree with its single split:
library(party)
# mincriterion = 0.005 means a split only needs 1 - p-value > 0.005, i.e. p < 0.995
result = ctree(PercentageWomenNationalParliament ~ ., data = ap,
               controls = ctree_control(mincriterion = 0.005))
plot(result)
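If you want to see the numbers behind that plot, printing the fitted tree gives a text version of it, with the split variable and test criterion at each inner node (the criterion party prints is 1 minus the p-value):
# Text version of the tree; even with the lenient threshold, only the Quotas split appears
print(result)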
And this is the whole problem with decision trees and small-N studies. If the other splits which Rick found don't reach statistical significance, how do we know they aren't just spurious or coincidental results? In small-N studies we have to add a lot of good theory to make these kinds of results more plausible, as indeed Krook does in the original article.