This is a preliminary attempt to analyze the 2015 Democratic and Republican debates hosted by CNN. I scraped the transcripts of both debates from CBS News so that the two could be compared.
The CBS News URLs for the two debates are:
http://www.cbsnews.com/news/transcript-top-tier-primetime-cnn-gop-republican-debate-2015/
http://www.cbsnews.com/news/the-first-democratic-debate-full-rush-transcript/
This is the function I wrote to scrape the transcripts and get them into a workable form. It still has issues, but it gets the data close. Some additional cleaning is required afterwards, even though both transcripts come from the same source.
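library(stringr)
# Scrape a debate transcript and return a data frame with one turn of talk per row
scrapedebate <- function(url) {
  # Read the raw page, one element per line of HTML
  text2 <- readLines(url)
  # Drop apostrophes so that names like O'MALLEY match the speaker pattern
  text2 <- str_replace_all(string = text2, pattern = "[']", replacement = "")
  # A turn starts at a speaker tag (e.g. "CLINTON:") and runs up to the next tag.
  # Because of the lookahead, the last turn on a line has no following tag and is not captured.
  turns <- unlist(str_extract_all(string = text2, pattern = '[A-Z]{2,}[.]{0,1}:.+?(?=[A-Z]{2,}[.]{0,1}:)'))
  debate <- data.frame(str = turns, stringsAsFactors = FALSE)
  # Replace paragraph breaks inside a turn with a space
  debate$str <- gsub("</p><p>", " ", debate$str)
  # Remove parenthetical notes such as (APPLAUSE); the match is greedy, so it
  # spans from the first "(" to the last ")" within a turn
  debate$str <- gsub("\\(.+\\)", " ", debate$str)
  # Return the cleaned data frame explicitly
  debate
}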
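To see what the speaker pattern actually captures, here is a small offline sketch that applies the same regular expression to a made-up transcript fragment. The fragment and the variable names (`sample_text`, `turn_pattern`) are invented for illustration and are not taken from the real transcripts.

```r
library(stringr)

# Hypothetical transcript fragment, invented for illustration only
sample_text <- "COOPER: Welcome to the debate. CLINTON: Thank you, Anderson. (APPLAUSE) SANDERS: Let me respond. COOPER: Go ahead."

# Same pattern as in scrapedebate(): a speaker tag, then text up to the next tag
turn_pattern <- "[A-Z]{2,}[.]{0,1}:.+?(?=[A-Z]{2,}[.]{0,1}:)"
turns <- unlist(str_extract_all(sample_text, turn_pattern))

# Remove parenthetical notes and surrounding whitespace, as in the cleaning step
turns <- str_trim(gsub("\\(.+\\)", " ", turns))
print(turns)
```

Only three of the four turns come back: the final "COOPER: Go ahead." has no speaker tag after it, so the lookahead never succeeds. That is one of the places where extra cleaning is still needed.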
Right away, something jumps out as different. This is “Turns of Talk” (TOT) data, and the Republican Debate had 745 TOT versus only 372 for the Democratic Debate. That suggests there were more interruptions in the Republican Debate, although it also had more than twice as many candidates on stage (11 versus 5), and the count alone doesn’t show where the interruptions came from. Were the candidates interrupting each other, or were there more interruptions from the moderators (perhaps trying to keep order)?
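One way to start answering that question is to count turns per speaker, which separates moderator turns from candidate turns. The sketch below assumes a data frame shaped like the output of `scrapedebate()` (a single `str` column holding "SPEAKER: text" strings); the rows and the `speaker` and `text` column names are made up for illustration.

```r
# Hypothetical turns in the shape scrapedebate() returns
debate <- data.frame(
  str = c("TAPPER: First question.", "TRUMP: My answer.", "BUSH: I disagree.",
          "TRUMP: Excuse me.", "TAPPER: One at a time, please."),
  stringsAsFactors = FALSE
)

# Split each turn into the speaker tag and the spoken text
debate$speaker <- sub("^([A-Z]{2,})[.]?:.*$", "\\1", debate$str)
debate$text    <- trimws(sub("^[A-Z]{2,}[.]?:", "", debate$str))

# Total turns of talk, and turns per speaker sorted from most to fewest
total_turns <- nrow(debate)
turns_by_speaker <- sort(table(debate$speaker), decreasing = TRUE)
print(total_turns)
print(turns_by_speaker)
```

A high turn count for a moderator relative to the candidates would point toward moderator interventions rather than candidates cutting each other off.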
I use the ‘qdap’ package to try to find out, generating a Gantt plot of each debate, split by sentence.

Figure 1: A Gantt Plot of the CNN Republican Debate, 2015

Figure 2: A Gantt Plot of the CNN Democratic Debate, 2015
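For reference, plots like these can be produced roughly as follows. This is a minimal sketch, not the exact code behind the figures: the sample data and the `person` and `dialogue` column names are hypothetical, and the qdap calls are skipped when the package is not installed.

```r
# Hypothetical turns, already split into speaker and text
debate <- data.frame(
  person = c("COOPER", "CLINTON", "SANDERS", "COOPER"),
  dialogue = c("Welcome. Let us begin.", "Thank you. I am glad to be here.",
               "Thank you.", "First question."),
  stringsAsFactors = FALSE
)

if (requireNamespace("qdap", quietly = TRUE)) {
  library(qdap)
  # Break each turn into sentences; the other columns are carried along
  sentences <- sentSplit(debate, "dialogue")
  # One bar per stretch of talk, measured in words, with one row per speaker
  with(sentences, gantt_plot(text.var = dialogue, grouping.var = person))
}
```

Each bar is a stretch of uninterrupted talk, so many short bars alternating between rows is what a chaotic exchange looks like on the plot.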
As expected, the main moderators (Tapper and Cooper) had an active role. There doesn’t appear to be much difference in how the moderators interacted in the debates, at least visually, although the Republican Debate does appear to be more chaotic…
If you look at the word counts for each debate, it’s easy to see that Tapper had a lot to say. In the Democratic Debate, the two front-runners had the two highest word counts, as I would expect. In the Republican Debate, though, the moderator had the highest word count. Perhaps even more interesting to me is that Carson, who was a front-runner at the time of this debate, was ninth in word count.
Table 1: Word counts by speaker, CNN Democratic Debate, 2015

| Name | Total |
|---|---|
| CLINTON | 4197 |
| SANDERS | 3772 |
| COOPER | 3240 |
| OMALLEY | 2431 |
| WEBB | 2285 |
| CHAFEE | 1133 |
| BASH | 328 |
| LOPEZ | 139 |
| LEMON | 100 |
| WILKINS | 17 |

Table 2: Word counts by speaker, CNN Republican Debate, 2015

| Name | Total |
|---|---|
| TAPPER | 4304 |
| TRUMP | 4241 |
| BUSH | 3241 |
| RUBIO | 3047 |
| FIORINA | 2499 |
| CHRISTIE | 2319 |
| PAUL | 2261 |
| KASICH | 2239 |
| CARSON | 2102 |
| WALKER | 1988 |
| HUCKABEE | 1783 |
| CRUZ | 1670 |
| HEWITT | 675 |
| BASH | 477 |
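Totals like these can be computed in a few lines of base R. The sketch below is an assumption about one reasonable way to do it, not necessarily how the tables above were produced: it counts whitespace-separated tokens, and the sample rows are invented.

```r
# Hypothetical turns, already split into speaker and text
debate <- data.frame(
  speaker = c("TAPPER", "TRUMP", "BUSH", "TRUMP"),
  text = c("Mr. Trump, your response?", "I will build it.",
           "That is not a plan.", "Excuse me, it is."),
  stringsAsFactors = FALSE
)

# Count words in each turn by splitting on runs of whitespace
debate$words <- lengths(strsplit(trimws(debate$text), "\\s+"))

# Sum words per speaker and sort from most to fewest
totals <- aggregate(words ~ speaker, data = debate, FUN = sum)
totals <- totals[order(-totals$words), ]
names(totals) <- c("Name", "Total")
print(totals)
```

Different tokenizers treat punctuation and contractions differently, so totals from another word counter may not match these exactly.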
Of course, this doesn’t mean that there’s any bias going on, but the two debates, on the same network, were pretty different in terms of TOT and word counts. It’s certainly possible that the Republican moderator simply had his hands full trying to keep order.
I plan to keep working through the debate text in different ways, and I have a lot of ideas about how I want to do that, but if you have any questions that you would like investigated, let me know. (Also, I’m just a novice here, so if there are any suggestions for improvement, I’ll take those as well.)
I can be reached at www.twitter.com/dataminerx