The dataset contains every one-word phrase that was mentioned in at least 10 speeches and every two- or three-word phrase that was mentioned in at least five speeches by the State Governors.
[State of the State data web link](https://github.com/fivethirtyeight/data/tree/master/state-of-the-state)
The dataset consists of the following columns:
phrase : one-, two- or three-word phrase
category : thematic categories
d_speeches: number of Democratic speeches
r_speeches: number of Republican speeches
total: total number of speeches
percent_of_d_speeches: percent of the 23 Democratic speeches containing the phrase
percent_of_r_speeches: percent of the 27 Republican speeches containing the phrase
chi2: chi^2 statistic
pval: p-value for chi^2 test
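The derived columns can be reproduced from the raw counts. The sketch below is an illustration, not the documented method. It assumes the percentages are the counts divided by the 23 Democratic and 27 Republican speeches, and that `chi2` is a goodness-of-fit statistic with one degree of freedom comparing a phrase's two counts against a split proportional to 23:27. That assumption reproduces the values listed for "minimum wage" in the sample output (9 Democratic speeches, 0 Republican), but the source does not state the formula.

```r
# Speech totals per party, taken from the column descriptions.
n_speeches <- c(dem = 23, rep = 27)

# Counts for the phrase "minimum wage" (from the dataset's sample output).
observed <- c(dem = 9, rep = 0)

# Percent of each party's speeches containing the phrase.
percent <- round(100 * observed / n_speeches, 2)

# Assumed test: goodness of fit against a split proportional to 23:27.
expected <- sum(observed) * n_speeches / sum(n_speeches)
chi2 <- sum((observed - expected)^2 / expected)
pval <- pchisq(chi2, df = 1, lower.tail = FALSE)

print(percent)
print(c(chi2 = chi2, pval = pval))
```

The statistic is computed by hand rather than with `chisq.test()` only to keep every step visible.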
Load Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## Read the Original data from GitHub link
urlfile <- 'https://raw.githubusercontent.com/fivethirtyeight/data/master/state-of-the-state/words.csv'
datain <- read.csv(urlfile)
speech_data <- data.frame(datain)
Rename the columns
View sample data
colnames(speech_data) <- c("Phrase","Category", "Democratic_Speeches", "Republican_Speeches", "Total_Speeches", "%_of_Dem_Speeches", "%_of_Rep_Speeches", "Chi^2","Probability_Measure")
head(speech_data)
## Phrase Category Democratic_Speeches Republican_Speeches
## 1 minimum wage economy/fiscal issues 9 0
## 2 clean energy energy/environment 11 1
## 3 climate change energy/environment 13 2
## 4 gun violence crime/justice 8 0
## 5 affordable care 10 1
## 6 international 0 10
## Total_Speeches %_of_Dem_Speeches %_of_Rep_Speeches Chi^2
## 1 9 39.13 0.00 10.565217
## 2 12 47.83 3.70 10.074611
## 3 15 56.52 7.41 9.986581
## 4 8 34.78 0.00 9.391304
## 5 11 43.48 3.70 8.931196
## 6 10 0.00 37.04 8.518519
## Probability_Measure
## 1 0.001152355
## 2 0.001503264
## 3 0.001576851
## 4 0.002180170
## 5 0.002803407
## 6 0.003515506
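Renaming by position depends on the column order of the CSV staying fixed. As an alternative sketch, the columns can be renamed through an explicit old-to-new lookup. The syntactic new names chosen here (`Pct_Dem_Speeches`, `Chi2`, `P_Value` and so on) are illustrative assumptions, not names used elsewhere in this analysis, and the one-row data frame stands in for the downloaded file.

```r
# One-row stand-in for the downloaded CSV, using the original column names.
words <- data.frame(
  phrase = "minimum wage", category = "economy/fiscal issues",
  d_speeches = 9, r_speeches = 0, total = 9,
  percent_of_d_speeches = 39.13, percent_of_r_speeches = 0,
  chi2 = 10.565217, pval = 0.001152355,
  stringsAsFactors = FALSE
)

# Old name -> new name. The new names are hypothetical, syntactic alternatives.
lookup <- c(
  phrase = "Phrase", category = "Category",
  d_speeches = "Democratic_Speeches", r_speeches = "Republican_Speeches",
  total = "Total_Speeches",
  percent_of_d_speeches = "Pct_Dem_Speeches",
  percent_of_r_speeches = "Pct_Rep_Speeches",
  chi2 = "Chi2", pval = "P_Value"
)

# Fail early if the file ever gains, loses, or renames a column.
stopifnot(setequal(names(words), names(lookup)))
names(words) <- unname(lookup[names(words)])

print(names(words))
```

Names such as `%_of_Dem_Speeches` or `Chi^2` are valid but must be wrapped in backticks when accessed with `$`, which is the main reason to prefer syntactic names.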
Subset the data frame to the Category and Total_Speeches columns
Display the categories, ordered by the sum of speeches
# Create Sub-set data
speech_byCategory_subdata <- subset(speech_data, select = c("Category","Total_Speeches"))
# Aggregate them
speech_byCategory_aggregate_data <- aggregate(speech_byCategory_subdata$Total_Speeches, by=list(speech_byCategory_subdata$Category), FUN=sum)
#View(speech_byCategory_aggregate_data)
# Order them
speech_byCategory_order_aggregate_data <- speech_byCategory_aggregate_data[order(speech_byCategory_aggregate_data$x),]
colnames(speech_byCategory_order_aggregate_data) <- c("Category", "Total_Associated_Speeches")
# Filter blank Category
speech_byCategory_order_aggregate_data[!(!is.na(speech_byCategory_order_aggregate_data$Category) & speech_byCategory_order_aggregate_data$Category == ""), ]
## Category Total_Associated_Speeches
## 8 mental health/substance abuse 206
## 6 energy/environment 226
## 3 crime/justice 424
## 7 health care 451
## 5 education 1275
## 4 economy/fiscal issues 2651
## 2 30401
View(speech_byCategory_order_aggregate_data)
Of all the named categories, the leading topics by summed speech counts are “Economy/Fiscal issues”, “Education” and “Health Care”.
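The aggregated output still lists one row with no visible category name, which suggests that some category values are whitespace rather than empty strings (an assumption; the source does not confirm it). A minimal sketch of a filter that removes both kinds before aggregating, using the six rows from the sample output, with one blank assumed to be `""` and the other `" "`:

```r
sample_rows <- data.frame(
  Category = c("economy/fiscal issues", "energy/environment",
               "energy/environment", "crime/justice", "", " "),
  Total_Speeches = c(9, 12, 15, 8, 11, 10),
  stringsAsFactors = FALSE
)

# trimws() strips surrounding whitespace, so "" and " " are both dropped.
named <- sample_rows[trimws(sample_rows$Category) != "", ]

# Sum the speech counts per category, then sort ascending by the sum.
by_category <- aggregate(Total_Speeches ~ Category, data = named, FUN = sum)
by_category <- by_category[order(by_category$Total_Speeches), ]

print(by_category)
```

The formula interface of `aggregate()` also keeps the original column names, so no renaming step is needed afterwards.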
Subset the data frame to the phrase, category and per-party speech counts
Display the phrases, ordered by number of speeches for each political affiliation
Top 10 Phrases used by Democratic Governors
# Create Sub-set data
Phrases_subdata <- subset(speech_data, select = c("Phrase", "Category","Democratic_Speeches", "Republican_Speeches"))
library(sqldf)
top_Dem_Phrases <- sqldf( "SELECT * FROM Phrases_subdata WHERE TRIM(Category) != '' ORDER BY Democratic_Speeches DESC LIMIT 10", row.names=FALSE)
knitr::kable(top_Dem_Phrases, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches |
|---|---|---|---|
| health care | health care | 23 | 19 |
| business | economy/fiscal issues | 23 | 24 |
| health | health care | 23 | 25 |
| economic | economy/fiscal issues | 23 | 25 |
| budget | economy/fiscal issues | 23 | 25 |
| students | education | 23 | 25 |
| education | education | 23 | 27 |
| school | education | 23 | 27 |
| working | economy/fiscal issues | 23 | 27 |
| economy | economy/fiscal issues | 22 | 24 |
Top 10 Phrases used by Republican Governors
top_Rep_Phrases <- sqldf( "SELECT * FROM Phrases_subdata WHERE TRIM(Category) != '' ORDER BY Republican_Speeches DESC LIMIT 10", row.names=FALSE)
knitr::kable(top_Rep_Phrases, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches |
|---|---|---|---|
| education | education | 23 | 27 |
| school | education | 23 | 27 |
| working | economy/fiscal issues | 23 | 27 |
| job | economy/fiscal issues | 22 | 26 |
| jobs | economy/fiscal issues | 22 | 26 |
| tax | economy/fiscal issues | 18 | 25 |
| health | health care | 23 | 25 |
| economic | economy/fiscal issues | 23 | 25 |
| budget | economy/fiscal issues | 23 | 25 |
| students | education | 23 | 25 |
Phrases used only by Democrats, not by Republicans
Phrases_DemsOnly <- sqldf( "SELECT * FROM Phrases_subdata WHERE TRIM(Category) != '' AND Republican_Speeches == 0 ORDER BY Democratic_Speeches DESC LIMIT 15", row.names=FALSE)
knitr::kable(Phrases_DemsOnly, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches |
|---|---|---|---|
| minimum wage | economy/fiscal issues | 9 | 0 |
| gun violence | crime/justice | 8 | 0 |
| education need | education | 7 | 0 |
| students state | education | 6 | 0 |
| gun safety | crime/justice | 6 | 0 |
| pre existing conditions | health care | 5 | 0 |
| reproductive health | health care | 5 | 0 |
| educators deserve | education | 5 | 0 |
| energy future | energy/environment | 5 | 0 |
| economy works | economy/fiscal issues | 5 | 0 |
| existing conditions | health care | 5 | 0 |
| cost health | health care | 5 | 0 |
Phrases used only by Republicans, not by Democrats
Phrases_RepsOnly <- sqldf( "SELECT * FROM Phrases_subdata WHERE TRIM(Category) != '' AND Democratic_Speeches == 0 ORDER BY Republican_Speeches DESC LIMIT 15", row.names=FALSE)
knitr::kable(Phrases_RepsOnly, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches |
|---|---|---|---|
| doing business | economy/fiscal issues | 0 | 7 |
| state income | economy/fiscal issues | 0 | 7 |
| savings account | economy/fiscal issues | 0 | 5 |
| schools safer | crime/justice | 0 | 5 |
| local law enforcement | crime/justice | 0 | 5 |
| prison population | crime/justice | 0 | 5 |
| local law | crime/justice | 0 | 5 |
| state income tax | economy/fiscal issues | 0 | 5 |
| education workforce | education | 0 | 5 |
| tax rates | economy/fiscal issues | 0 | 5 |
| fully funding | economy/fiscal issues | 0 | 5 |
Top Phrases Plot by Democratic Governors
library(ggplot2)
library(ggbeeswarm)
dem_plot <- ggplot(data = top_Dem_Phrases,
aes(y =Phrase , x = Democratic_Speeches)) + geom_beeswarm()
dem_plot
Top Phrases Plot by Republican Governors
rep_plot <- ggplot(data = top_Rep_Phrases,
aes(y =Phrase , x = Republican_Speeches)) + geom_boxplot(notch=FALSE)
rep_plot
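Each phrase contributes a single count to these plots, so a boxplot or beeswarm has no distribution to summarise. As an alternative sketch (not the plot used in this analysis), a horizontal bar chart ordered by count shows the same numbers directly. The data frame below re-enters the Republican top-10 table by hand, and the axis labels are illustrative.

```r
library(ggplot2)

top_rep <- data.frame(
  Phrase = c("education", "school", "working", "job", "jobs",
             "tax", "health", "economic", "budget", "students"),
  Republican_Speeches = c(27, 27, 27, 26, 26, 25, 25, 25, 25, 25)
)

# reorder() sorts the bars by count; coord_flip() keeps phrase labels readable.
rep_bar <- ggplot(top_rep,
                  aes(x = reorder(Phrase, Republican_Speeches),
                      y = Republican_Speeches)) +
  geom_col() +
  coord_flip() +
  labs(x = "Phrase", y = "Republican speeches containing the phrase")

rep_bar
```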
Let’s see which Phrases are used by Governors from both sides in the top Category
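Add a new column, Variance, holding the absolute difference between the Democratic and Republican speech counts
List the phrases with zero Variance, that is, phrases used in the same number of speeches by both parties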
# Create Sub-set data
Category_subdata <- subset(speech_data, select = c("Phrase","Category","Democratic_Speeches", "Republican_Speeches"))
Category_subdata$Variance <- abs(Category_subdata$Democratic_Speeches - Category_subdata$Republican_Speeches)
Category_match_data <- sqldf( "SELECT * FROM Category_subdata WHERE TRIM(Category) != '' AND Variance == 0 AND Democratic_Speeches > 5", row.names=FALSE)
knitr::kable(Category_match_data, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches | Variance |
|---|---|---|---|---|
| cost | economy/fiscal issues | 20 | 20 | 0 |
| teachers | education | 19 | 19 | 0 |
| employees | economy/fiscal issues | 19 | 19 | 0 |
| economic development | economy/fiscal issues | 14 | 14 | 0 |
| opioid | mental health/substance abuse | 11 | 11 | 0 |
| spend | economy/fiscal issues | 11 | 11 | 0 |
| educators | education | 10 | 10 | 0 |
| colleges | education | 9 | 9 | 0 |
| careers | economy/fiscal issues | 9 | 9 | 0 |
| education funding | education | 6 | 6 | 0 |
| entrepreneurs | economy/fiscal issues | 6 | 6 | 0 |
| substance abuse | mental health/substance abuse | 6 | 6 | 0 |
List the Phrases with least Variance for “Economy/Fiscal Issues” used by the Governors
TopCategory_match_data <- sqldf( "SELECT * FROM Category_subdata WHERE Category == 'economy/fiscal issues' AND Variance == 0", row.names=FALSE)
knitr::kable(TopCategory_match_data, format="html")
| Phrase | Category | Democratic_Speeches | Republican_Speeches | Variance |
|---|---|---|---|---|
| cost | economy/fiscal issues | 20 | 20 | 0 |
| employees | economy/fiscal issues | 19 | 19 | 0 |
| economic development | economy/fiscal issues | 14 | 14 | 0 |
| spend | economy/fiscal issues | 11 | 11 | 0 |
| careers | economy/fiscal issues | 9 | 9 | 0 |
| entrepreneurs | economy/fiscal issues | 6 | 6 | 0 |
| fiscally | economy/fiscal issues | 5 | 5 | 0 |
| tax credit | economy/fiscal issues | 5 | 5 | 0 |
| tax relief | economy/fiscal issues | 4 | 4 | 0 |
| fully fund | economy/fiscal issues | 3 | 3 | 0 |
| business leaders | economy/fiscal issues | 3 | 3 | 0 |
| budget includes | economy/fiscal issues | 3 | 3 | 0 |
| cut taxes | economy/fiscal issues | 3 | 3 | 0 |
| new taxes | economy/fiscal issues | 3 | 3 | 0 |
To conclude, the analysis shows that Governors mostly talk about economy/fiscal issues, education and health care. Economy/fiscal issues rank first; within that category, Governors of both parties equally often mention cost, careers, employees and economic development, along with tax relief, tax credits, cutting taxes and new taxes.
Democratic Governors also talk about minimum wage, gun violence and education need, phrases that do not appear in any Republican speech in this dataset.
Republican Governors, on the other hand, talk about doing business, state income and savings accounts, phrases that do not appear in any Democratic speech.