This is my first data visualization project that is intended to
reproduce economist plot from this article Political
Science. I will use one of the most used visualization package in R
which is ggplot2. First of all, let’s take a look at the
economist plot that I’m talking about:
knitr::include_graphics("plot.png")Well, it seems a bit complicated at first, but I’m trying to do my best to resemble it. This plot can give us insight that Twitter social media may reflect political polarization especially for Senate Republicans and Democrats. While following the science-related accounts are low overall, we can see that Senate Democrats three times more likely to follow science-related Twitter handles than Republicans. In other word, there is a high chance that Democrate will conduct policy based on scientific knowledge or clear evidence that may lead to a better life in the future.
Let’s import some packages that we’ll be using throughout the recreating plot process, we will get into the more detail explanations once we get to the implementation of these packages
library(dplyr) # data wrangling
library(reshape2) # data wrangling specifically for reshaping the structure
library(showtext) # deal with external fonts
library(ggplot2) # data visualization
library(ggthemes) # visualization theme for ggplot
library(scales) # Scaling function for visualization
library(grid) # Adding graphical object to the plotThe website mentioned above, does not tell us explicitly where is the date come from which they use for creating plot. Fortunately, we can try to search the article that they was referring to and I finally found the data that we can use. Here are some of the data the might be useful for our purpose
data1 <- read.csv("neu_m0404q02m.csv")
data2 <- read.csv("neu_m0404q15p.csv")
data3 <- read.csv("neu_m0404z286.csv")
data4 <- read.csv("politic_science.csv")
data5 <- read.csv("S114_members.csv")Based on that article, the data is meant to be analysed and checked if there is a correlation between follows of science-related twitter accounts by U.S Senate and recent votes on a series of amendments. This behavior will define the “sense of the Senate” on global climate change (GCC). Basically, we can quickly see from the plot above that science-related twitter accounts can reflect polarization among the Senate Democrats and Republicans which belong to distinct group. Senate Democrats are more likely to follow science-related twitter accounts than Republicans. Since our goal is to only create the plot, I will not explain more about the article. If you are interested, you can check it from this link.
Actually, after inspecting some of those datasets, I realize that only the data4 and data5 will be used for creating plot. But, I will still explain what are those five data in just a simple way. Below are the brief description about each of our data:
data1: Column 1 shows the Senator’s name, and all other
columns are Twitter handles followed by at least one Senator. A 1
Indicates that the Senator’s Twitter account follows that Twitter
handle.data2: A square matrix that which US Senators follow
each other on Twitter (1: follow, 0: no follow).data3: Table containing 78,000+ twitter handles
followed by U.S. Senators as of February 2015. Handles are grouped in to
Science (based on focus of organization), Politics, Media, and
Other.data4: Total number of Twitter accounts followed by
each Senator and proportion categorized as Science, Politics, Media or
Other.data5: Table of DW-Nominate measure/score for each
Senate. I actually don’t really know how are these scores
calculated.Remember that I’m only going to use data4 and data5 for creating the plot, but I think we can still utilize other data for a quick analysis. Say we want to analyse which top 5 senates that follow science-related Twitter accounts the most.
First, we take the data3 and select only for science category
science <- data3[data3$Category=="Science",]
glimpse(science)Rows: 77
Columns: 2
$ Category <chr> "Science", "Science", "Science", "Science", "Science", "Scie…
$ Twitter.Handle <chr> "AGUSciPolicy", "MetcalfURI", "OGJOnline", "RANDCorporation"…
We can see that there are 77 science-related twitter accounts.
Next, we want to check whether Senates are following those accounts using data1. Let’s see what does data1 look like
head(data1)It contains a lot of columns(twitter accounts) even though we only need accounts that are science-related so we want to select only those valid columns with these steps:
valid_col <- c()
for(i in 2:length(colnames(data1))) {
if(colnames(data1)[i] %in% science$Twitter.Handle){
valid_col <- append(valid_col, colnames(data1)[i])
}
}
length(valid_col)[1] 75
After that, we can subset data1 which contain only those valid columns
science_tweet <- data1[, append(valid_col, "X",0)]
head(science_tweet)Now, we will use reshape2 package to restructure the
science_tweet dataframe and get the answer that we desire. We will
divide that into these steps:
melt() function to transform all columns into only
one column.table() function to see how
many science related-accounts that are followed by each Senate. It will
be shown as the Frequency column.arrange function to order the frequency from the
highest to the lowest to see the top 5.# melt the twitter account columns.
melted <- melt(science_tweet,"X")
senate_science <- data.frame(table(melted$X,melted$value))
colnames(senate_science) <- c("Name","isScience","Frequency")
senate_science <- senate_science[senate_science$isScience==1,]
# sort the data based on frequency to check the top 5
senate_science %>%
arrange(-Frequency) %>%
head(5)Now, let’s move to the main objective which is recreating the economist plot. Recall that we will only use data4 and data5 so let’s inspect them.
# See the structure
glimpse(data4)Rows: 89
Columns: 7
$ Last.Name <chr> "Alexander", "Ayotte", "Barrasso", "Bennet", "Blumenthal", "Blunt…
$ Party <chr> "R", "R", "R", "D", "D", "R", "R", "D", "D", "R", "D", "R", "D", …
$ Science <dbl> 0.000, 0.002, 0.004, 0.024, 0.014, 0.000, 0.006, 0.026, 0.023, 0.…
$ Politics <dbl> 0.154, 0.054, 0.156, 0.337, 0.086, 0.009, 0.064, 0.553, 0.636, 0.…
$ Media <dbl> 0.076, 0.015, 0.081, 0.000, 0.034, 0.004, 0.013, 0.009, 0.000, 0.…
$ Other <dbl> 0.770, 0.928, 0.759, 0.639, 0.866, 0.987, 0.917, 0.412, 0.341, 0.…
$ Total <int> 409, 1492, 456, 83, 801, 6799, 543, 114, 44, 3358, 579, 935, 263,…
glimpse(data5)Rows: 101
Columns: 22
$ congress <int> 114, 114, 114, 114, 114, 114, 114, 114, 114, …
$ chamber <chr> "President", "Senate", "Senate", "Senate", "S…
$ icpsr <int> 99911, 49700, 94659, 40300, 41500, 15039, 201…
$ state_icpsr <int> 99, 41, 41, 81, 81, 61, 61, 42, 42, 71, 71, 6…
$ district_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ state_abbrev <chr> "USA", "AL", "AL", "AK", "AK", "AZ", "AZ", "A…
$ party_code <int> 100, 200, 200, 200, 200, 200, 200, 200, 200, …
$ occupancy <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ last_means <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ bioname <chr> "OBAMA, Barack", "SESSIONS, Jefferson Beaureg…
$ bioguide_id <chr> "O000167", "S001141", "S000320", "M001153", "…
$ born <int> 1961, 1946, 1934, 1957, 1964, 1936, 1962, 195…
$ died <dbl> NA, NA, NA, NA, NA, 2018, NA, NA, NA, NA, NA,…
$ nominate_dim1 <dbl> -0.358, 0.549, 0.448, 0.210, 0.481, 0.381, 0.…
$ nominate_dim2 <dbl> -0.197, 0.130, 0.575, -0.302, 0.068, -0.626, …
$ nominate_log_likelihood <dbl> NA, -109.63029, -97.49168, -134.57023, -101.4…
$ nominate_geo_mean_probability <dbl> NA, 0.76537, 0.78512, 0.72192, 0.78169, 0.749…
$ nominate_number_of_votes <int> NA, 410, 403, 413, 412, 410, 407, 418, 410, 3…
$ nominate_number_of_errors <int> NA, 46, 42, 65, 35, 59, 70, 24, 27, 22, 30, 4…
$ conditional <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ nokken_poole_dim1 <dbl> NA, 0.635, 0.587, 0.299, 0.460, 0.464, 0.686,…
$ nokken_poole_dim2 <dbl> NA, 0.200, 0.561, -0.253, 0.144, -0.312, -0.7…
head(data4)head(data5)From the data5, we only care about the nominate_dim1 column since this is going to be our x-axis value.
We can see from the result above that there is a different amount of
row between data4 and data5, so we need to match both dataframe so they
contain the exact same Senate. Here, I use grepl() function
to match the Last.Name column from data4 to bioname column from data5.
grepl() function has two main parameters:
pattern: characters or patterns that need to be
matchedx: columns that we want to check whether the patterns
are found or not# conditional subsetting using subset with matching condition
nominate <- subset(data5, grepl(paste(data4$Last.Name, collapse="|"), data5$bioname, ignore.case = T), select = c("bioname","nominate_dim1"))
head(nominate)Remember that the data4 contains only the last name, whereas the
data5 contains full name. Here I want to split the fullname and take
only the last name. We can see from the bioname column, the last nama is
located before comma, so we can try to extract it and make a new column
containing that last name using below code. We can use
strsplit() function to split the characters and we can use
sapply() to apply that function for bioname column
# Use strsplit to split the bioname based on comma and only take the last name
nominate_clean <- nominate %>%
mutate(name = sapply(strsplit(nominate$bioname, ","), head, 1))
head(nominate_clean)Next, we want to match the exact character regardless of what case is
being used(upper or lower), so we lowercase both column with
tolower() function.
# convert both column name to lower case
data4$Last.Name <- tolower(data4$Last.Name)
nominate_clean$name <- tolower(nominate_clean$name)Finally, we will join these two dataframes into one dataframe so we
can use it for our plotting. We can use left_join() from
dplyr package to join two dataframes based on matched
column. The concept of join here is similar to what we have in
SQL join.
Recall that from the reference plot that we have, the y axis is in the form of percentage so we need to multiply that with 100. Furthermore, we also want to match the point color in our plot. The color is based on the Party column so we can convert that column as factor with defined levels as (“D”,“R”,“I”).
# join two dataframes based on the matching names
clean_data <- data4 %>%
left_join(nominate_clean, by = c("Last.Name" = "name"))
# multiply science columns with 100
clean_data$Science <- clean_data$Science*100
# Convert the Party column into factor with defined levels
clean_data$Party <- factor(clean_data$Party, levels = c("D","R","I"))
head(clean_data)Before I create the plot, I want to exclude senates who don’t actually follow science-related account and take the senates who follow less than 7000 account. The decision was based on what I have seen from the real plot.
clean_data <- clean_data[clean_data$Total <= 7000 ,]
clean_data <- clean_data[clean_data$Science > 0,]I try to reproduce the exact same plot even the font style, so I
found some fonts online that hopefully would help us making our plot
looks ‘economist’. In order to load those fonts, we need to use
showtext package that we’ve already imported previously
# Add and load the external fonts
font.add("ITC Officina Book", "itc-officina-sans-std-book.otf")
font.add("ITC Officina Book Italic", "itc-officina-sans-std-book-italic.otf")
font.add("ITC Officina Bold","itc-officina-sans-std-bold.otf")
font.add("ITC Officina Bold Italic", "itc-officina-sans-std-bold-italic.otf")
showtext_auto()Next, we want to only visualize certain senates names based on the real plot
# Take only certain senates
text_label <- c("markey","reid","cochran","sanders","schatz","vitter","inhofe")
data_label <- clean_data[clean_data$Last.Name %in% text_label,]
# Create new columns for senates full name that we want to show in our plot
data_label$Full.Name <- c("Thad Cochran", "Jim Inhofe","Ed Markey","Harry Reid","Bernie Sanders", "Brian Schatz","David Vitter" )Before we code the actual plot, we want to set the theme for our plot as similar as possible to our reference. This actually requires several attempts, so I am sorry if the final result does not satisfy you.
theme_pol <- theme(
# customize the plot title
plot.title = element_text(size=14, family="ITC Officina Bold",
color="black",
hjust=-0.075,
vjust = -5),
# Customize the plot subtitle
plot.subtitle = element_text(size=10, family="ITC Officina Book",
color="black",
hjust = -0.155,
vjust = -8.2),
# Customize the plot grid
panel.border = element_blank(),
panel.grid.major.x= element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.y = element_blank(),
# Customize plot background
plot.background = element_rect(fill = "#FFFFFF"),
panel.background = element_rect(fill = "#FFFFFF"),
# Customize plot axis
axis.title.y = element_text(size = 10,
family="ITC Officina Book Italic",
colour = "black",
hjust = -1.3),
axis.title.x = element_text(size = 10,
family="ITC Officina Bold Italic",
colour = "#696969"),
axis.text = element_text(size=10,
family = "ITC Officina Book",
color="black"),
axis.ticks.length.x = unit(-0.2, "lines"),
# Customize plot legend
legend.position = c(-0.07,0.88),
legend.direction = "horizontal",
legend.justification = "left",
legend.text = element_text(size=10,
family = "ITC Officina Book",
color="black"),
legend.title = element_text(size=10,
family="ITC Officina Bold",
color="black",
margin = margin(r=8)),
legend.background = element_blank(),
legend.key=element_blank(),
legend.spacing.x = unit(c(-0.1), "cm")
) ggplot2 is a package for declaratively creating
graphics, based on The Grammar of Graphics that allows
us to compose graph by combining independent components. Every ggplot2
plot has three key components:
The steps for creating ggplot graph is really straightforward:
ggplot(). The function
has two main parameters which are data and
mapping.Our plot includes point and text object, so let’s do that using
geom_point() and geom_text() function.
gg <- ggplot(data = clean_data, aes(x=nominate_dim1,y=Science)) +
# Adding points to our plot
geom_point(aes(color=Party, size=Total), alpha=0.5) +
# Adding required senate names represented by text object
geom_text(
data=data_label ,
aes(label=Full.Name,color=Party),
nudge_x = 0.15,
nudge_y = 0.2,
family = "ITC Officina Bold Italic",
show.legend = F
)We also need to specify the plot title, axis, scale, and legend so it matches the actual plot.
gg <-
gg +
# Customize axis labels
scale_x_continuous(labels = c("1.0","0.5","0","0.5","1.0")) +
scale_y_continuous(trans=log2_trans(),breaks=c(0.5,1.0,2.0,4.0,6.0,8.0,10.0)) +
# Customize the points size
scale_size(range = c(1,20), guide="none") +
# Customize the points color. This is why converting data types into factor with defined levels comes in handy.
scale_color_manual(
values = c("#8dd4f0","#f48872","#4f9588"),
guide = guide_legend(title="Party affiliation:"),
labels= c("D"="Democrats","R"="Republicans","I"="Independents")
) +
coord_cartesian(xlim=c(-1,1), clip = 'off') +
labs(
title = "Political Science",
subtitle = "United States senator's political ideology and scientific engagement on Twitter*",
x = "Liberal-Conservative Score",
y = "Science-related Twitter accounts followed, % of total, natural log scale"
) +
theme_bw() Next,we want to create additional text and the minus(-) plus(+) sign
for for the x-axis using grid package. I actually tried to
find the most efficient way, and end up finding this approach
# use grobTree to add customized text into our plot
grob = grobTree(textGrob("February 2015", x=-0.06, y=1, hjust=0, vjust=4, gp=gpar(fontsize=8, fontfamily="ITC Officina Book", col="black")))
grob2 = grobTree(textGrob("- +", x=0.37, y=-0.04, hjust=0, vjust=0.4, gp=gpar(fontsize=15, fontfamily="ITC Officina Book", col="black")))
grob3 = grobTree(textGrob("MORE LIBERAL", x=0.05, y=1, hjust=0, vjust=56, gp=gpar(fontsize=8, fontfamily="ITC Officina Bold Italic", col="#8dd4f0")))
grob4 = grobTree(textGrob("MORE CONSERVATIVE", x=0.8, y=1, hjust=0, vjust=56, gp=gpar(fontsize=8, fontfamily="ITC Officina Bold Italic", col="#4f9588")))Now, we should create the coecentric circles instead of the native legend. I don’t know how to customize that if I only use the basic native legend provided by ggplot, so I found that this way would be an option. Besides that, we also need to create customized grid since as we can see from the actual plot the two grids at the top are cut. This will be divided into several steps:
geom_point() for circles and
geom_text() for the labels# Create dataframe for circle legend
legend_bbl <- data.frame(label=c("250","1,000","5,000"),
size=c(250,1000,5000),
pos_bbl=c(8.7,9.5,12), #c(7.7,8.5,11),
pos_txt=c(8.9,12.9,21))
# Create dataframe for y-axis grid
line_segment <- data.frame(xend=c(0.7,0.7,1,1,1,1,1,1),
y=c(10,8,6,4,2,1,0.5,0)
)
dash_segment <- data.frame(y=c(9.8,11.9,19.2))After we prepare the data for creating customized legend, we can add that into our plot
final <- gg +
# Circle legend
geom_point(
data=legend_bbl,
aes(x=0.84, y=pos_bbl, size=size),
shape = 21, color = "black", fill = "#e1e3e5",alpha=0.5
)+
# Label for the legend
geom_text(
data=legend_bbl,
aes(x=1,y=pos_txt,label=label),
hjust=1,
size=3
)+
# Create customized y-axis grid
geom_segment(data=line_segment,
aes(x=-1.1,xend=xend,y=y,yend=y),color="grey66")+
# Create dash line in our legend
geom_segment(data = dash_segment,
aes(x=0.83, xend=1, y=y,yend=y),linetype="dotted") +
# adding our grobText object with annotation_custom
annotation_custom(grob) +
annotation_custom(grob2) +
annotation_custom(grob3) +
annotation_custom(grob4) +
annotate("text", x = 0.75 , y = 30, label = "Twitter accounts \nfollowed, total", fontface = "bold", family = "ITC Officina Bold",lineheight = 0.6, hjust=0 ) +
theme_pol
finalI know that the result does not exactly look like the original one, probably because there is a little difference between the dataset that I use from the economist plot reference. However, at least we have got a quite similar plot even though it still far from perfect.