Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)

Loading our Dataset

# Loading our dataset
data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

Q1

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

Our dataset is scraped from https://myneta.info/ and has 10 columns in total. Of these, Candidate, Party, Age, Education, Constituency, and Gender hold general information. The Total Assets, Liabilities, and Criminal Cases columns were unclear to me until I read the documentation.
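A quick way to spot columns that need the documentation is to look at their names and types right after loading. The sketch below uses a small made-up data frame rather than the real file; the names `Total.Assets`, `Liabilities`, and `Criminal.Cases` are only assumptions about how `read.csv()` would label those columns, and the values are placeholders.

```r
# Toy stand-in for the real data. Column names other than Party and
# Constituency are assumptions, and every value here is a placeholder.
toy <- data.frame(
  Candidate = c("A", "B", "C"),
  Party = c("IND", "AAA", "BBB"),
  Constituency = c("X", "X", "Y"),
  Total.Assets = c(100, 2500, 0),
  Liabilities = c(0, 300, 0),
  Criminal.Cases = c(0L, 2L, 0L),
  stringsAsFactors = FALSE
)

# Column names and the storage type of each column
print(names(toy))
col_types <- sapply(toy, class)
print(col_types)

# Compact overview of structure and first values
str(toy)
```

Running the same three calls on `data` would show whether the money columns were read as numbers or as text, which is one of the things the documentation has to explain.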

Q2 - Q3

At least one element of your data that is unclear even after reading the documentation

You may need to do some digging, but is there anything about the data that your documentation does not explain?

# Finding independent candidates per constituency
ind_cand <- data |>
  filter(Party == 'IND') |>
  group_by(Constituency) |>
  summarise(count = n()) |>
  arrange(desc(count))
  
print(ind_cand)
## # A tibble: 512 × 2
##    Constituency  count
##    <chr>         <int>
##  1 Nizamabad       176
##  2 Belgaum          51
##  3 Karur            31
##  4 Beed             27
##  5 Chennai South    27
##  6 Thoothukkudi     26
##  7 Jamnagar         25
##  8 Surendranagar    25
##  9 Amritsar         21
## 10 Theni            21
## # ℹ 502 more rows
# Number of candidates per constituency

Count_per_const <- data |>
  group_by(Constituency) |>
  summarise(count = n()) |>
  arrange(desc(count))

print(Count_per_const)
## # A tibble: 542 × 2
##    Constituency    count
##    <chr>           <int>
##  1 Nizamabad         183
##  2 Belgaum            56
##  3 Karur              42
##  4 Chennai South      39
##  5 Thoothukkudi       37
##  6 Beed               36
##  7 Chandigarh         36
##  8 Nalanda            35
##  9 Aurangabad         32
## 10 Chennai Central    31
## # ℹ 532 more rows
# Average number of candidates per constituency
print(mean(Count_per_const$count))
## [1] 14.70111
# Nizamabad constituency details
Nizamabad_info <- data |>
  filter(Constituency == "Nizamabad")

dim(Nizamabad_info)  
## [1] 183  10
# Plotting parties against number of candidates
Nizamabad_info |>
  ggplot() +
  geom_bar(mapping = aes(y = Party, fill = Party), stat = "count")

In Nizamabad, 176 of the 183 candidates are listed under “IND”, which stands for “Independent”, while each of the other parties fields only one candidate.
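The dominance of independents can be put into a single number per constituency by dividing the independent count by the total count. The sketch below shows the idea on a made-up data frame (constituencies `X` and `Y` and parties `AAA` and `BBB` are hypothetical); only the `Party` and `Constituency` column names are taken from the analysis above.

```r
library(dplyr)

# Hypothetical toy data with the two columns used in the analysis
toy <- data.frame(
  Constituency = c("X", "X", "X", "Y", "Y"),
  Party = c("IND", "IND", "AAA", "BBB", "IND"),
  stringsAsFactors = FALSE
)

# Total candidates, independent candidates, and their share per constituency
ind_share <- toy |>
  group_by(Constituency) |>
  summarise(
    total = n(),
    independents = sum(Party == "IND"),
    share = independents / total
  ) |>
  arrange(desc(share))

print(ind_share)
```

Applied to `data`, the Nizamabad row would carry the counts already printed above (176 independents out of 183 candidates, a share of roughly 0.96).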

Why this?

Parties avoid giving tickets to more than one candidate per constituency, so many candidates who do not get a party ticket decide to contest the election independently (IND).
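The one-ticket-per-party explanation can be checked directly by looking for any non-independent party that appears more than once in the same constituency. This is a minimal sketch on made-up data (the constituencies and the parties `AAA` and `BBB` are hypothetical); an empty result on the real `data` would be consistent with the explanation, though it would not prove the motive behind it.

```r
library(dplyr)

# Hypothetical toy data: party AAA fields two candidates in constituency X
toy <- data.frame(
  Constituency = c("X", "X", "X", "Y", "Y", "Y"),
  Party = c("IND", "AAA", "AAA", "IND", "AAA", "BBB"),
  stringsAsFactors = FALSE
)

# Party/constituency pairs with more than one candidate, excluding independents
multi_ticket <- toy |>
  filter(Party != "IND") |>
  count(Constituency, Party, name = "candidates") |>
  filter(candidates > 1)

print(multi_ticket)
```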

Belgaum_info <- data |>
  filter(Constituency == "Belgaum")
  
dim(Belgaum_info)
## [1] 56 10
# Plotting parties against number of candidates
Belgaum_info |>
  ggplot() +
  geom_bar(mapping = aes(y = Party, fill = Party), stat = "count")
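Since the Nizamabad and Belgaum plots use the same code, the filter and the bar chart could be wrapped in one small function. The helper name `plot_party_counts` and the toy data are hypothetical; the sketch only assumes the `Party` and `Constituency` columns used above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: bar chart of candidates per party for one constituency
plot_party_counts <- function(df, constituency) {
  df |>
    filter(Constituency == constituency) |>
    ggplot() +
    geom_bar(mapping = aes(y = Party, fill = Party)) +
    labs(title = constituency, x = "Number of candidates")
}

# Toy data standing in for the real file
toy <- data.frame(
  Constituency = c("X", "X", "X", "Y"),
  Party = c("IND", "IND", "AAA", "BBB"),
  stringsAsFactors = FALSE
)

p <- plot_party_counts(toy, "X")
print(p)
```

With the real data, `plot_party_counts(data, "Nizamabad")` and `plot_party_counts(data, "Belgaum")` would stand in for the two separate blocks.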