India is the largest democracy in the world. Parliamentary elections are held every five years to determine the governing party and, in turn, the Prime Minister of the country.
The Parliament consists of two houses: the Lok Sabha (lower house) and the Rajya Sabha (upper house).
Members of the Lok Sabha (House of the People), the lower house of India’s Parliament, are elected by all adult citizens of India from the candidates who contest in their respective constituencies. Every adult citizen of India can vote only in their own constituency. Candidates who win the Lok Sabha elections are called ‘Members of Parliament’ and hold their seats for five years. Elections take place once every five years to fill the elected seats of the Lok Sabha, which had 545 seats in 2019 (543 elected, with up to 2 nominated by the President).
The Rajya Sabha, also known as the Council of States, is the upper house of India’s Parliament. Its members are not elected directly by the citizens but by the Members of the Legislative Assemblies; in addition, up to 12 members can be nominated by the President of India for their contributions to art, literature, science, and social services. Members of the Rajya Sabha serve a term of six years, with one-third of the members retiring every two years.
The current data set is for the Lok Sabha. Depending on its size, each state has a different number of districts, and each district comprises different constituencies.
R Markdown provides an authoring framework for data science. Its design allows it to be converted to HTML, PDF, or Word output formats. To learn basic R Markdown, use the following link:
knitr is an engine for dynamic report generation with R.
In 2019, India held the largest election in the world, spanning 11 April to 19 May 2019.
election_result <- read.csv(file = "2019_Results.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
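Here the file 2019_Results.csv contains the 2019 Indian election data. The available variables include State, Constituency, O_S_N, Candidate, Party, EVM_Votes, Postal_Votes, Total_Votes, percent_of_Votes, and Candidate_Won.
We can see the structure and dimensions of the data set using the commands: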
str(election_result)
## 'data.frame': 8568 obs. of 10 variables:
## $ State : chr "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" ...
## $ Constituency : chr "Andaman & Nicobar Islands " "Andaman & Nicobar Islands " "Andaman & Nicobar Islands " "Andaman & Nicobar Islands " ...
## $ O_S_N : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Candidate : chr "AYAN MANDAL" "KULDEEP RAI SHARMA" "PRAKASH MINJ" "VISHAL JOLLY" ...
## $ Party : chr "All India Trinamool Congress" "Indian National Congress" "Bahujan Samaj Party" "Bharatiya Janata Party" ...
## $ EVM_Votes : int 1717 95249 2478 93772 2837 212 269 305 221 5339 ...
## $ Postal_Votes : int 4 59 8 129 2 0 6 1 0 2 ...
## $ Total_Votes : int 1721 95308 2486 93901 2839 212 275 306 221 5341 ...
## $ percent_of_Votes: num 0.83 45.98 1.2 45.3 1.37 ...
## $ Candidate_Won : chr "loss" "loss" "loss" "loss" ...
dim(election_result)
## [1] 8568 10
head(election_result,10)
## State Constituency O_S_N
## 1 Andaman & Nicobar Islands Andaman & Nicobar Islands 1
## 2 Andaman & Nicobar Islands Andaman & Nicobar Islands 2
## 3 Andaman & Nicobar Islands Andaman & Nicobar Islands 3
## 4 Andaman & Nicobar Islands Andaman & Nicobar Islands 4
## 5 Andaman & Nicobar Islands Andaman & Nicobar Islands 5
## 6 Andaman & Nicobar Islands Andaman & Nicobar Islands 6
## 7 Andaman & Nicobar Islands Andaman & Nicobar Islands 7
## 8 Andaman & Nicobar Islands Andaman & Nicobar Islands 8
## 9 Andaman & Nicobar Islands Andaman & Nicobar Islands 9
## 10 Andaman & Nicobar Islands Andaman & Nicobar Islands 10
## Candidate Party EVM_Votes
## 1 AYAN MANDAL All India Trinamool Congress 1717
## 2 KULDEEP RAI SHARMA Indian National Congress 95249
## 3 PRAKASH MINJ Bahujan Samaj Party 2478
## 4 VISHAL JOLLY Bharatiya Janata Party 93772
## 5 SANJAY MESHACK Aam Aadmi Party 2837
## 6 C G SAJI KUMAR All India Hindustan Congress Party 212
## 7 K KALIMUTHU Independent 269
## 8 V V KHALID Independent 305
## 9 GOUR CHANDRA MAJUMDER Independent 221
## 10 PARITOSH KUMAR HALDAR Independent 5339
## Postal_Votes Total_Votes percent_of_Votes Candidate_Won
## 1 4 1721 0.83 loss
## 2 59 95308 45.98 loss
## 3 8 2486 1.20 loss
## 4 129 93901 45.30 loss
## 5 2 2839 1.37 loss
## 6 0 212 0.10 loss
## 7 6 275 0.13 loss
## 8 1 306 0.15 loss
## 9 0 221 0.11 loss
## 10 2 5341 2.58 loss
The above code returns the first 10 rows of a data frame in R.
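Before going further, two quick sanity checks can be sketched on a few rows copied from the head() output above: whether Total_Votes equals EVM_Votes plus Postal_Votes, and whether text columns carry stray whitespace (the str() output suggests that Constituency values end with a space). The data frame name sample_rows is an assumption made for this illustration, and the same checks would need to be run on the full election_result data frame to draw any conclusion about it.

```r
# First four rows copied from the head() output above (illustrative subset only).
sample_rows <- data.frame(
  Constituency = rep("Andaman & Nicobar Islands ", 4),  # note the trailing space
  EVM_Votes    = c(1717, 95249, 2478, 93772),
  Postal_Votes = c(4, 59, 8, 129),
  Total_Votes  = c(1721, 95308, 2486, 93901),
  stringsAsFactors = FALSE
)

# Check 1: the total should equal EVM votes plus postal votes in every row.
totals_consistent <- all(
  sample_rows$EVM_Votes + sample_rows$Postal_Votes == sample_rows$Total_Votes
)

# Check 2: strip leading/trailing whitespace so comparisons with == behave as expected.
sample_rows$Constituency <- trimws(sample_rows$Constituency)

print(totals_consistent)
print(unique(sample_rows$Constituency))
```

Trimming matters later on, because a filter such as Constituency == "Mandya" only matches values that have no extra spaces.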
Before cleaning and analyzing the data set, we will load some libraries that are needed in R Markdown.
library(dplyr)
library(ggplot2)
library(plotly)
library("RColorBrewer")
Here, dplyr provides a set of tools for efficiently manipulating data sets in R, with a focus on data frames. ggplot2 is a data visualization package. plotly provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries. The RColorBrewer package offers a variety of sequential, diverging, and qualitative color palettes.
A data set may be incomplete; information that is not available is called a missing value. In R, missing values are represented by the symbol NA (not available). Undefined numerical results (e.g., 0/0) are represented by the symbol NaN (not a number).
new_election_result <- na.omit(election_result)
Now we look at the dimension of the new data frame.
dim(new_election_result)
## [1] 8568 10
We can observe that the dimensions of the original data frame and the new data frame are the same. Thus the given data frame is complete.
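Comparing dimensions shows whether any row was dropped, but not which columns held the missing values. A minimal sketch of a per-column check follows, using a small made-up data frame (toy and its values are assumptions for illustration, not taken from the election data).

```r
# Made-up data with one NA per column plus one NaN.
toy <- data.frame(
  Party       = c("A", "B", NA, "D"),
  Total_Votes = c(100, NA, 250, 0 / 0),  # 0 / 0 evaluates to NaN
  stringsAsFactors = FALSE
)

# is.na() is TRUE for both NA and NaN, so colSums() counts both per column.
missing_per_column <- colSums(is.na(toy))

# complete.cases() flags the rows that have no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]

print(missing_per_column)
print(nrow(complete_rows))
print(nrow(na.omit(toy)))  # same row count as complete_rows
```

If every entry of missing_per_column is zero, na.omit() leaves the data frame unchanged, which matches the observation above.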
To make graphs with ggplot2, the data must be in a data frame.
The following plots help to examine how well correlated two variables are:
The most frequently used plot for data analysis is the scatterplot. Whenever you want to understand the nature of the relationship between two variables, the first choice is the scatterplot. It can be drawn using geom_point(). Additionally, geom_smooth(), which by default draws a smoothing line (based on loess for fewer than 1,000 observations), can instead draw the line of best fit by setting method='lm'.
# Scatterplot
gg <- ggplot(election_result, aes(x = State, y = percent_of_Votes)) +
  geom_point(aes(col = State)) +
  geom_smooth(se = FALSE) +
  labs(y = "percent_of_Votes",
       x = "State",
       title = "State Vs percent_of_Votes") +
  theme_minimal(base_size = 12)
plot(gg)
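A line of best fit is easiest to interpret when both axes are numeric. The sketch below is an illustration rather than part of the original analysis: it plots EVM_Votes against Postal_Votes for the ten rows shown in the head() output and adds a linear fit with method = "lm". The names votes, gg_lm, and fit are assumptions introduced here.

```r
library(ggplot2)

# First ten rows of EVM_Votes and Postal_Votes, copied from the head() output above.
votes <- data.frame(
  EVM_Votes    = c(1717, 95249, 2478, 93772, 2837, 212, 269, 305, 221, 5339),
  Postal_Votes = c(4, 59, 8, 129, 2, 0, 6, 1, 0, 2)
)

# Scatterplot with a straight line of best fit instead of the default smoother.
gg_lm <- ggplot(votes, aes(x = EVM_Votes, y = Postal_Votes)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "EVM_Votes Vs Postal_Votes", x = "EVM_Votes", y = "Postal_Votes")

print(gg_lm)

# The same linear model fitted directly, to inspect the slope.
fit <- lm(Postal_Votes ~ EVM_Votes, data = votes)
print(coef(fit))
```

With only ten points from a single constituency the fit says little about the full data set; it only shows how the arguments fit together.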
Sometimes there may be overlapping points when we use a scatterplot. In such cases we can make a jitter plot using geom_jitter().
Example:
theme_set(theme_bw()) # pre-set the bw theme.
plot_jitter <- ggplot(election_result, aes(x=State, y=percent_of_Votes))
plot_jitter+ theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))+geom_jitter(aes(col=State),width = .5, size=1) +
labs(subtitle="Jittered Points",
y="percent_of_Votes",
x="State",
title=" State Vs percent_of_Votes")
As the name suggests, the overlapping points are randomly jittered around their original positions, within a range controlled by the width argument.
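Because the jitter is random, the picture changes on every run. One possible way to make it reproducible, sketched here on made-up data (toy, its state labels, and its values are assumptions), is to pass position_jitter() with a seed to geom_point(); setting height = 0 keeps the y values exact so that only the horizontal position is perturbed.

```r
library(ggplot2)

# Made-up data with several identical y values per category.
toy <- data.frame(
  State            = rep(c("State A", "State B"), each = 5),
  percent_of_Votes = c(0.1, 0.1, 0.1, 45, 46, 0.2, 0.2, 1.5, 30, 60)
)

# width sets the horizontal spread, height = 0 leaves percentages untouched,
# and seed makes the random offsets identical on every run.
jitter_plot <- ggplot(toy, aes(x = State, y = percent_of_Votes)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))

print(jitter_plot)
```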
The default settings in ggplot work for a simple graph or perhaps one or two variables. However, when we wish to create refined graphs or visualizations that minimize pixels, reduce clutter and eliminate distractions we will need to access the dozens of theme components in ggplot.
To learn more about different theme components in ggplot, we can use the link: (https://www.rdocumentation.org/packages/ggplot2/versions/2.1.0/topics/element_text)
Each of these theme components is something you can manipulate. Often, the element_text() function will be called when you refer to these components in your code.
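When the same theme settings recur across several plots, one option is to store them once and add them to each plot. The sketch below assumes a hypothetical object name, rotated_labels, and a made-up three-row data frame; the element_text() arguments mirror those used elsewhere in this document.

```r
library(ggplot2)

# Reusable theme fragment: small, rotated x-axis labels and small legend text.
rotated_labels <- theme(
  axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5),
  legend.text = element_text(size = 6)
)

# Made-up data, for illustration only.
toy <- data.frame(
  Party            = c("Party A", "Party B", "Party C"),
  percent_of_Votes = c(46, 45, 9)
)

# A theme object is added to a plot with + like any other layer.
p_theme <- ggplot(toy, aes(x = Party, y = percent_of_Votes)) +
  geom_point() +
  rotated_labels

print(p_theme)
```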
graph_plotly <- filter(election_result, State == "Andaman & Nicobar Islands")
p <-ggplot(graph_plotly, aes(x=Party, y=percent_of_Votes, size=percent_of_Votes, text = paste("Candidate:", Candidate),fill=Party)) +
geom_point(alpha = 1,color = "red")
ggplotly(p)
Here we have filtered the data frame and stored the result in the variable graph_plotly, so that graph_plotly contains the election results for the state Andaman & Nicobar Islands.
dim(graph_plotly)
## [1] 16 10
This data frame contains 16 observations and 10 variables.
To rotate the x-axis label text, we can use theme() in the code as follows:
p1 <-ggplot(graph_plotly, aes(x=Party, y=percent_of_Votes, size=percent_of_Votes, fill= Party, text = paste("Candidate:", Candidate))) + theme(axis.text.x = element_text(size = 4,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=3))+
#theme(legend.position = c(0.9, 0.2))+
#theme(legend.position="bottom")+
theme(legend.background = element_rect(color = "black",
size = 0.1, linetype = "solid"))+
geom_point(alpha = 1, color = "red")
ggplotly(p1)
From the graph we can conclude that the Indian National Congress received the highest percentage of votes; its candidate was Kuldeep Rai Sharma. The party that received the lowest percentage of votes is the All India Hindustan Congress Party.
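A reading taken from a graph can be cross-checked numerically. The following sketch uses only the ten rows visible in the head() output (the state has 16 rows in total, so the minimum over the full subset could in principle differ); andaman_head, top, and bottom are names assumed for this illustration.

```r
# Ten rows copied from the head() output above (not the full 16-row subset).
andaman_head <- data.frame(
  Candidate = c("AYAN MANDAL", "KULDEEP RAI SHARMA", "PRAKASH MINJ", "VISHAL JOLLY",
                "SANJAY MESHACK", "C G SAJI KUMAR", "K KALIMUTHU", "V V KHALID",
                "GOUR CHANDRA MAJUMDER", "PARITOSH KUMAR HALDAR"),
  Party = c("All India Trinamool Congress", "Indian National Congress",
            "Bahujan Samaj Party", "Bharatiya Janata Party", "Aam Aadmi Party",
            "All India Hindustan Congress Party",
            "Independent", "Independent", "Independent", "Independent"),
  percent_of_Votes = c(0.83, 45.98, 1.20, 45.30, 1.37, 0.10, 0.13, 0.15, 0.11, 2.58),
  stringsAsFactors = FALSE
)

# which.max() / which.min() return the row index of the largest / smallest value.
top    <- andaman_head[which.max(andaman_head$percent_of_Votes), ]
bottom <- andaman_head[which.min(andaman_head$percent_of_Votes), ]

print(top)
print(bottom)
```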
Now we will apply plotly to the original data and see how it works.
# Candidate and party are combined into one tooltip string, because the text aesthetic can only be mapped once
plot_j <- ggplot(election_result, aes(x=State, y=percent_of_Votes, size=percent_of_Votes, text = paste0("Candidate: ", Candidate, "<br>Party: ", Party)))
# The plot is stored as plot_jt rather than t, which would mask the base function t()
plot_jt <- plot_j+ theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))+geom_jitter(aes(col=State),width = .5, size=1) +
labs(subtitle="Jittered Points",
y="percent_of_Votes",
x="State",
title=" State Vs percent_of_Votes")
ggplotly(plot_jt)
A histogram is an approximate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable.
You have two options for making a histogram with the ggplot2 package. You can use either the qplot() function or the ggplot() function:
qplot()
qplot(election_result$percent_of_Votes, geom="histogram")
ggplot(data = election_result, aes(x = percent_of_Votes)) + geom_histogram()
We can observe that both commands give the same histogram. You can change the bin width, color, etc. of the histogram by specifying the required arguments in geom_histogram().
# breaks sets the bin edges (20 to 50 in steps of 2) and overrides binwidth, so binwidth is omitted
graph_hist <- ggplot(data = election_result, aes(x = percent_of_Votes)) +
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "red",
                 fill = "green") +
  labs(title = "Histogram for percentage of Votes") +
  labs(x = "Percentage of Votes", y = "Count")
graph_hist
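To see what explicit bin edges do, the counting can be sketched by hand in base R. The example below is an illustration on the ten percent_of_Votes values shown in the head() output (pct, bin_edges, and bin_counts are names assumed here); it is not how ggplot2 computes bins internally, only the same idea of assigning values to intervals.

```r
# Ten percent_of_Votes values copied from the head() output above.
pct <- c(0.83, 45.98, 1.20, 45.30, 1.37, 0.10, 0.13, 0.15, 0.11, 2.58)

# Bin edges from 20 to 50 in steps of 2.
bin_edges <- seq(20, 50, by = 2)

# cut() assigns each value to an interval; values outside 20-50 become NA
# and are left out of the table.
bin_counts <- table(cut(pct, breaks = bin_edges))

print(bin_counts[bin_counts > 0])
print(sum(bin_counts))  # number of values that landed in some bin
```

Only values between 20 and 50 fall into a bin, so a histogram drawn with these edges describes the stronger candidates and leaves out the many candidates with very small vote shares.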
A box plot displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum.
geom_boxplot() is used to draw a box plot, and with the help of the ggplotly() command we can inspect this five-number summary interactively.
The following code gives an example for the usage of boxplot:
gb <- filter(election_result, State == "Karnataka")
dim(gb)
## [1] 506 10
graph_boxplot <- ggplot(gb, aes(x = Party, y = Total_Votes, fill = Party)) +
geom_boxplot()+ theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))
ggplotly(graph_boxplot)
Here, with the help of plotly, we can hover over a box and read off its five-number summary.
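The five-number summary can also be computed directly, which is useful for checking the values read from the interactive plot. The sketch below uses the ten Total_Votes values from the head() output (Andaman & Nicobar Islands rather than Karnataka, since those are the only raw values shown); the variable names are assumptions. Note that box plot whiskers conventionally stop at the most extreme point within 1.5 times the interquartile range of the box, so points beyond that are drawn separately as outliers.

```r
# Ten Total_Votes values copied from the head() output above.
total_votes <- c(1721, 95308, 2486, 93901, 2839, 212, 275, 306, 221, 5341)

# Minimum, first quartile, median, third quartile, maximum.
five_num <- quantile(total_votes, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Points above Q3 + 1.5 * IQR would be drawn as individual outliers.
iqr         <- five_num[["75%"]] - five_num[["25%"]]
upper_fence <- five_num[["75%"]] + 1.5 * iqr
outliers    <- total_votes[total_votes > upper_fence]
print(outliers)
```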
Example 2:
gb1 <- filter(election_result, State == "Karnataka" & Constituency == "Mandya")
dim(gb1)
## [1] 23 10
graph_boxplot <- ggplot(gb1, aes(x = Party, y = percent_of_Votes)) +
geom_boxplot(size = 1,width = 0.6)+ coord_flip()+
theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))
ggplotly(graph_boxplot)
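Note: In Example 2, I was expecting the five-number summary in all of the boxplots, but it was visible only for the “Independent” group. This is presumably because every other party has a single candidate in the constituency, so its box collapses to a single line. I also noticed that side-by-side boxplots need a categorical x variable and a numeric y variable with several observations per category. The data frame that I have chosen is not well suited to side-by-side boxplots.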
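Whether a group can show a full box depends on how many observations it contains, and that can be checked by counting rows per party. The sketch below uses the Party column of the ten rows shown in the head() output as a stand-in, since the Mandya rows are not listed in this document; party, candidates_per_party, and single_candidate are assumed names.

```r
# Party values of the ten rows shown in the head() output above.
party <- c("All India Trinamool Congress", "Indian National Congress",
           "Bahujan Samaj Party", "Bharatiya Janata Party", "Aam Aadmi Party",
           "All India Hindustan Congress Party",
           "Independent", "Independent", "Independent", "Independent")

# Number of candidates per party.
candidates_per_party <- table(party)
print(candidates_per_party)

# Parties with one candidate have a single value, so their box is just a line.
single_candidate <- names(candidates_per_party)[candidates_per_party == 1]
print(single_candidate)
```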
A bar plot presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.
Example:
graph_barplot <- filter(election_result, State == "Andaman & Nicobar Islands")
br<-ggplot(graph_barplot, aes(x=Party)) + theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))+
geom_bar(aes(fill = EVM_Votes))
#br
ggplotly(br)
Note: In this example, when I checked the EVM_Votes for each party, I observed that for “Independent” the EVM_Votes value is shown as NA. The reason may be that there is more than one Independent candidate, so the bar has no single EVM_Votes value to display.
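One way around the NA, offered as a sketch rather than as the original approach, is to add up the votes per party first and then draw the totals with geom_col(), which plots the values it is given instead of counting rows. The example uses the ten rows from the head() output; andaman_head, party_totals, and bar_totals are names assumed for this illustration.

```r
library(ggplot2)

# Party and EVM_Votes for the ten rows shown in the head() output above.
andaman_head <- data.frame(
  Party = c("All India Trinamool Congress", "Indian National Congress",
            "Bahujan Samaj Party", "Bharatiya Janata Party", "Aam Aadmi Party",
            "All India Hindustan Congress Party",
            "Independent", "Independent", "Independent", "Independent"),
  EVM_Votes = c(1717, 95249, 2478, 93772, 2837, 212, 269, 305, 221, 5339),
  stringsAsFactors = FALSE
)

# Sum EVM_Votes within each party, giving one row per party.
party_totals <- aggregate(EVM_Votes ~ Party, data = andaman_head, FUN = sum)
print(party_totals)

# geom_col() uses the y values as bar heights.
bar_totals <- ggplot(party_totals, aes(x = Party, y = EVM_Votes)) +
  geom_col() +
  theme(axis.text.x = element_text(size = 6, angle = 90, hjust = 0.5, vjust = 0.5))

print(bar_totals)
```

Here the bar height carries the vote total, so every party, including the Independents, has a well-defined value.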
In information visualization and computing, a treemap is a method for displaying hierarchical data using nested figures, usually rectangles.
library(treemap)
treemap(election_result, index="percent_of_Votes", vSize="Total_Votes",
vColor="EVM_Votes", mapping=c(-10, 10, 30), type="value", palette="RdYlGn")
library(gganimate)
Example 1:
graph_animate <- filter(election_result, State == "Andaman & Nicobar Islands")
theme_set(theme_bw()) # pre-set the bw theme.
## ggplot(election_result, aes(x=State, y=Total_Votes)) +
ggplot(graph_animate, aes(x=Party, y=Postal_Votes)) +
geom_point(aes(col=Party)) + theme(axis.text.x = element_text(size = 6,angle = 90,hjust = 0.5, vjust = 0.5),legend.text = element_text(size=6))+
transition_states(Postal_Votes,
transition_length = 2,
state_length = 1)
#devtools::install_github("thomasp85/gganimate", force = TRUE)
#graph_animate <- ggplot(election_result, aes(Party, percent_of_Votes, size = percent_of_Votes, frame = State)) +
# geom_point(alpha = 0.7, show.legend = FALSE) +
# facet_wrap(~Party)
#gganimate(graph_animate, interval = 0.2)
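Comment: I was interested in working with gganimate but ran into the error “could not find function gganimate” even after installing the package. A likely cause is that current versions of the package no longer provide a gganimate() function; animations are instead built with transition functions such as transition_states() and rendered with animate().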
Example 2, using plotly:
plotly_animat2 <- election_result %>%
plot_ly(
x = ~Party,
y = ~percent_of_Votes,
size = ~percent_of_Votes,
color = ~Party,
frame = ~State,
text = ~Candidate,
hoverinfo = "text",
type = 'scatter',
mode = 'markers'
)
plotly_animat2 %>% animation_button(
x = 1, xanchor = "right", y = 0, yanchor = "bottom"
) %>% animation_slider(
currentvalue = list(prefix = "State ", font = list(color="red"))
)