Loksabha_data_dive

Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <-read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

head(data)

##               Candidate                              Party Criminal.Cases
## 1    Kuldeep Rai Sharma                                INC              0
## 2           Ayan Mandal                               AITC              0
## 3        C G Saji Kumar All India Hindustan Congress Party              0
## 4           C U Rasheed                                IND              0
## 5 Gour Chandra Majumder                                IND              0
## 6                 Henry                                IND              0
##               Education Age Total.Assets                Constituency
## 1 Graduate Professional  52    132233012 Andaman And Nicobar Islands
## 2              Graduate  30      7270440 Andaman And Nicobar Islands
## 3             12th Pass  48       120000 Andaman And Nicobar Islands
## 4             12th Pass  34       202808 Andaman And Nicobar Islands
## 5              Graduate  52      6062000 Andaman And Nicobar Islands
## 6             10th Pass  50        56459 Andaman And Nicobar Islands
##   Liabilities Winner Gender
## 1    80450870      1      M
## 2     1500000      0      M
## 3           0      0      M
## 4     1700000      0      M
## 5           0      0      M
## 6           0      0      M

This dataset does not contain any time or date related field, so we get time series data from the ‘wikipediatrend’ library which aligns closely to our dataset.

library(wikipediatrend)

## Warning: package 'wikipediatrend' was built under R version 4.3.2

## 
##   [wikipediatrend]
##     
##   Note:
##     
##     - Data before 2016-01-01 
##       * is provided by petermeissner.de and
##       * was prepared in a project commissioned by the Hertie School of Governance (Prof. Dr. Simon Munzert)
##       * and supported by the Daimler and Benz Foundation.
##     
##     - Data from 2016-01-01 onwards 
##       * is provided by the Wikipedia Foundation
##       * via its pageviews package and API.
##

library(xts)

## Warning: package 'xts' was built under R version 4.3.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(tsibble)

## Warning: package 'tsibble' was built under R version 4.3.2

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:zoo':
## 
##     index

## The following object is masked from 'package:lubridate':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

Extracting Indian elections data :

Using the timeframe from the year 2014 to 2020:

hits <- wp_trend(page=c("Indian elections","Loksabha","Modi","BJP"), 
                      from="2014-01-01", 
                      to="2020-12-31", 
                      lang="en", 
                      warn=TRUE)

hits <- hits |>
  mutate(article = "Indian election")

hits <- hits |>
        group_by(date) |>
        summarise(views = sum(views))|>
        ungroup()
hits <- hits |>
  mutate(article = "Indian election")

Creating a tsibble of our extracted data:

hits <- hits|>
      select(article, date, views)

hits$date <- as.Date(hits$date)

hits_ts <- as_tsibble(hits,index = date)

Plotting our tsibble:

ggplot(hits_ts, aes(x=date, y=views)) + 
  geom_line(size=1.5, colour="steelblue") + 
  theme_bw()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Plotting for different windows of time:

hits_ts |>
  index_by(month = floor_date(date, 'month')) |>
  summarise(avg_views = mean(views)) |>
  ggplot(mapping = aes(x = month, y = avg_views)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se=FALSE, ) +
  labs(title = "Indian elections",
       subtitle = "(by each month)") +
  scale_x_date(breaks = "5 month", labels = \(x) month(x)) +
  theme_hc()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

hits_ts |>
  index_by(week = floor_date(date, 'week')) |>
  summarise(avg_views = mean(views)) |>
  ggplot(mapping = aes(x = week, y = avg_views)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se=FALSE, ) +
  labs(title = "Indian elections",
       subtitle = "(by weeks)") +
  scale_x_date(breaks = "20 week", labels = \(x) week(x)) +
  theme_hc()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Observation :

There is a large spike which occurs in the months in which the election dates are announced and they are coming closer.

Using Logistic Regression lets find trends in our data!

hits_ts |>
  filter_index("2014-01" ~ "2014-05")|>
  ggplot(mapping = aes(x = date, y = views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Indian Elections") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

hits_ts |>
  filter_index("2019-01" ~ "2019-05")|>
  ggplot(mapping = aes(x = date, y = views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Indian Elections") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

hits_ts |>
  filter_index("2014-06" ~ "2014-12")|>
  ggplot(mapping = aes(x = date, y = views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Indian Elections") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

hits_ts |>
  filter_index("2019-06" ~ "2019-12")|>
  ggplot(mapping = aes(x = date, y = views)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  labs(title = "Indian Elections") +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

Observations:

We take a 5 month span before the election day and after the election day, and clearly can observe that trend line rises as the election day and results come near.

On the other hand, the next 5 months after the election results the trend line has a negative or constant slope.

Finding a season:

hits_ts |>
  index_by(month = floor_date(date, 'month')) |>
  summarise(avg_views = mean(views)) |>
  ggplot(mapping = aes(x = month, y = avg_views)) +
  geom_line() +
  geom_smooth(span = 0.3, color = 'blue', se=FALSE, ) +
  labs(title = "Indian elections",
       subtitle = "(by each month)") +
  scale_x_date(breaks = "5 month", labels = \(x) month(x)) +
  theme_hc()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We can see two distinctive peaks in the 5th month of 2014 and 2019, this was peak time of the elections and the time span between these two is 5 years.

Similarly for 2009, we can also get a same kind of peak in the line plot.

Loksabha_data_dive_12

2023-11-10