As posted by Douglas Barley, the analysis is based on the data on Activities of Members of European Parliament (MEP) source
Step 1 is to load the csv from the github library and to load the required R libraries.
knitr::opts_chunk$set(eval = TRUE, results = FALSE)
library(tidyverse)
library(RCurl)
library(stringr)
library(knitr)
library(openxlsx)
members_raw_url <- getURL("https://raw.githubusercontent.com/BharaniNittala/DATA-607/master/Proj2_Case2.csv")
members_raw <- read.csv(text = members_raw_url,na.strings=c("","NA"))
knitr:: kable(members_raw)
| X | Attendees.RCV | Reports | Written.declaration | Speeches | Motions | Opinions | Questions | Amendments |
|---|---|---|---|---|---|---|---|---|
| NA | Absentees (N=10) | NA | NA | NA | NA | NA | NA | NA |
| Mean | 6622 | 0.00 | 0.20 | 32.90 | 0.50 | 0.00 | 29.80 | 1.00 |
| Median | 6188 | 0.00 | 0.00 | 24.00 | 0.50 | 0.00 | 5.50 | 0.00 |
| S-D | 2009 | 0.00 | 0.42 | 24.58 | 0.53 | 0.00 | 61.14 | 1.89 |
| NA | Public Orators (N=19) | NA | NA | NA | NA | NA | NA | NA |
| Mean | 82.16 | 0.00 | 3.42 | 222.05 | 2.89 | 0.16 | 117.63 | 13.79 |
| Median | 83.66 | 0.00 | 0.00 | 101.00 | 2.00 | 0.00 | 101.00 | 2.00 |
| S-D | 12.94 | 0.00 | 7.22 | 319.29 | 3.38 | 0.69 | 129.07 | 30.21 |
| NA | Pragmatists (N=13) | NA | NA | NA | NA | NA | NA | NA |
| Mean | 91.16 | 2.08 | 5.31 | 317.23 | 20.15 | 0.92 | 154.46 | 52.00 |
| Median | 93.62 | 1.00 | 4.00 | 183.00 | 8.00 | 0.00 | 70.00 | 27.00 |
| S-D | 6.95 | 3.17 | 4.96 | 337.80 | 33.31 | 1.50 | 207.44 | 52.84 |
#Not all the columns are required. Seelcting only those that make sense fot the analysis
members_raw_trim <- members_raw[,c(1,2,5)]
names <- c("Metric","Attendees","Speeches")
colnames(members_raw_trim) <- names
#We need to use coalesce function to get the name of the cohort in the first column.
members_raw_trim <-members_raw_trim %>% mutate(Metric = coalesce(Metric,Attendees))
members_raw_trim$Speeches <- as.character(as.numeric(members_raw_trim$Speeches))
# Changing the dataset to long format for analysis
members_raw_trim_v2 <- pivot_longer(members_raw_trim,names[2:3],names_to = "Type", values_to = "Values",values_drop_na = TRUE)
#Applying learnings from previous assignments on regex functions
library(stringr)
members_raw_trim_v2$Audience <- sub("[^a-z A-Z]+", "\\1", members_raw_trim_v2$Values)
members_raw_trim_v2$Audience[members_raw_trim_v2$Audience ==""] <- NA
#Using fill function to impute date in the missing cases and then using regex functions as per the requirement
members_raw_trim_v2 <- members_raw_trim_v2 %>% fill(Audience)
members_raw_trim_v3 <- members_raw_trim_v2 %>% separate(Audience,c("Audience", "Audience_count"),"N=")
members_raw_trim_v3$Audience_count = as.numeric(str_extract(members_raw_trim_v3$Audience_count, "\\-*\\d+\\.*\\d*"))
#We can clean the data further by removing every 7th row in the dataset (reproducibility)
members_tall <- members_raw_trim_v3 %>% dplyr::filter(row_number() %% 7 != 1) ## Delete every 7th row starting from 1
#Re-arranging the columns for better understanding of the data
members_tall <- members_tall[c(4,5,1,2,3)]
knitr:: kable(members_tall)
| Audience | Audience_count | Metric | Type | Values |
|---|---|---|---|---|
| Absentees | 10 | Mean | Attendees | 6622 |
| Absentees | 10 | Mean | Speeches | 32.9 |
| Absentees | 10 | Median | Attendees | 6188 |
| Absentees | 10 | Median | Speeches | 24 |
| Absentees | 10 | S-D | Attendees | 2009 |
| Absentees | 10 | S-D | Speeches | 24.58 |
| Public Orators | 19 | Mean | Attendees | 82.16 |
| Public Orators | 19 | Mean | Speeches | 222.05 |
| Public Orators | 19 | Median | Attendees | 83.66 |
| Public Orators | 19 | Median | Speeches | 101 |
| Public Orators | 19 | S-D | Attendees | 12.94 |
| Public Orators | 19 | S-D | Speeches | 319.29 |
| Pragmatists | 13 | Mean | Attendees | 91.16 |
| Pragmatists | 13 | Mean | Speeches | 317.23 |
| Pragmatists | 13 | Median | Attendees | 93.62 |
| Pragmatists | 13 | Median | Speeches | 183 |
| Pragmatists | 13 | S-D | Attendees | 6.95 |
| Pragmatists | 13 | S-D | Speeches | 337.8 |
members_tall$Values <- round(as.numeric(as.character(members_tall$Values)),0)
dodger = position_dodge(width = 0.9)
ggplot(members_tall,aes(y=Values, x=Audience,color = Type, fill = Type)) +
geom_bar( stat="identity",position = dodger, color="black")+
facet_grid(.~Metric) +
geom_text(aes(label=Values),color = "blue",position = dodger,vjust=-0.5)+
scale_fill_manual(values = alpha(c("#FF5733","#000000"))) +
ylab("Total Number")+
theme(axis.text.x=element_text(angle=45,margin = margin(1, unit = "cm"),vjust =1))
But above graph doesn’t make sense as the values from ‘Attendees’ are way higher than ‘Speeches’ as one may expect. Let’s draw each of them separately and then analyze
Let’s first understand ‘Attendees’ distribution among the different EU MEP
dodger = position_dodge(width = 0.9)
ggplot(subset(members_tall,members_tall$Type=="Attendees"),aes(y=Values, x=Audience, fill = Audience)) +
geom_bar( stat="identity",position = dodger, color="black")+
facet_grid(.~Metric) +
geom_text(aes(label=Values),color = "blue",position = dodger,vjust=-0.5)+
scale_fill_manual(values = alpha(c("#FF5733","#000000","#ADD8E6"))) +
ylab("Number of Attendees")+
theme(axis.text.x=element_text(angle=45,margin = margin(1, unit = "cm"),vjust =1))
We see that there are very high number of attendees from Absentees whereas the number of attendees are comparable for Pragmatists and Public Orators. This insight is obtained from comparable mean and median values in the chart. Also, if we observe closely the standard deviation of Public Orators attendees is relatively higher than Pragmatists. Something, we need to keep in mind if we want to compare these two cohorts. Now, let’s look at speeches.
dodger = position_dodge(width = 0.9)
ggplot(subset(members_tall,members_tall$Type=="Speeches"),aes(y=Values, x=Audience, fill = Audience)) +
geom_bar( stat="identity",position = dodger, color="black")+
facet_grid(.~Metric) +
geom_text(aes(label=Values),color = "blue",position = dodger,vjust=-0.5)+
scale_fill_manual(values = alpha(c("#FF5733","#000000","#ADD8E6"))) +
ylab("Number of Speeches")+
theme(axis.text.x=element_text(angle=45,margin = margin(1, unit = "cm"),vjust =1))
Interestingly, the number of speeches are way higher for Pragmatists and Public Orators compared to number of attendees. Possibly, the meetings are held in smaller groups. Among all the three cohorts, Pragmatists conduct the highest number of speeches and their 50th percentile value is atleast 80% higher than remaining two cohorts. S-D higher than mean shows that there is very high variability in the number of speeches for Pragmatists and Public Orators.
There seems to be inverse correlation between the number of attendees and speeches among the three groups Absentees, Pragmatists and Public Orators. Let’s confirm this hypothesis by calculating correlation value between attendees and speeches.
members_tall_corr <- pivot_wider(members_tall, names_from = "Type", values_from = "Values")
cor(members_tall_corr$Attendees, members_tall_corr$Speeches, method = c("pearson"))
[1] -0.733068
We see a strong negative correlation value (-73%) between the metrics which proves the hypothesis.