Loksabha_data_dive

Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)

Loading our Dataset

# Loading our dataset
data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')
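Since the questions below refer to columns such as Total Assets and Criminal Cases, it may help to check how read.csv() names them once loaded. The sketch below assumes the CSV headers contain spaces; the header names and the single data row are hypothetical stand-ins, not taken from the real file. By default read.csv() uses check.names = TRUE, which turns spaces into dots.

```r
# Hypothetical two-line CSV standing in for the real file; header names are assumed.
csv_text <- "Candidate,Party,Criminal Cases,Total Assets\nA,IND,0,1000"

# Default behaviour: names are made syntactically valid, so spaces become dots.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as written in the file.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

If the real headers do contain spaces, the columns would be reachable as data$Total.Assets and data$Criminal.Cases after the default load.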

Q1

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

E.g., this could be a column name, or just some value inside a cell of your data
Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

Our dataset is scraped from https://myneta.info/ and has a total of 10 columns. The columns Candidate, Party, Age, Education, Constituency and Gender hold general information. The columns Total Assets, Liabilities and Criminal Cases were unclear to me until I read the documentation.

Criminal Cases - Several candidates have many criminal cases against them. Seeing a candidate with even one criminal case, you would naturally ask how they are allowed to contest the election at all.

Going through the documentation, I found that most of these criminal cases are accusations still pending in the criminal courts, where the candidate has not been found guilty.

Also, as per the Indian Constitution, a candidate can contest elections unless he/she has been convicted by a court of any type of crime.
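To see how common declared cases are, one could compute the share of candidates with at least one case. This is a minimal sketch on hypothetical toy data: the column name Criminal.Cases and the values are assumptions, and if the real column is stored as text it would need converting with as.numeric() first.

```r
# Hypothetical toy data; Criminal.Cases is an assumed column name.
toy <- data.frame(
  Candidate = c("A", "B", "C", "D"),
  Criminal.Cases = c(0, 3, 0, 1)
)

# Flag candidates with one or more declared cases.
toy$has_case <- toy$Criminal.Cases > 0

# Share of candidates with at least one declared case.
share_with_case <- mean(toy$has_case)
print(share_with_case)
```

Because the column counts pending accusations as well as convictions, this share says nothing about how many candidates have actually been found guilty.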

Total Assets - Many candidates report the financial value of their assets through a registered Chartered Accountant, while some provide evidence of their assets directly to the Election Commission of India (ECI).

To contest an election, a candidate must pay a deposit of INR 10,000 ($120), but some candidates have reported assets below this deposit amount. So how were they able to contest the election?

The answer is that a candidate can borrow the amount, record it as a liability, and still contest the election.

Liabilities - Several candidates have liabilities greater than their assets and can still contest elections, as the Indian Constitution permits this. A court case against the liability rule started in 2016 and is still running in the Supreme Court of India.
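The two situations described above (assets below the deposit, and liabilities above assets) can be flagged with simple comparisons. This is only a sketch on hypothetical toy data in INR: the column names Total.Assets and Liabilities and all values are assumptions, and the real columns may be stored as text that needs parsing into numbers first.

```r
# Hypothetical toy data in INR; column names and values are assumed.
toy <- data.frame(
  Candidate = c("A", "B", "C"),
  Total.Assets = c(5000, 2500000, 40000),
  Liabilities = c(0, 3000000, 10000)
)

# Deposit amount as quoted in the text above.
deposit <- 10000

# Candidates whose declared assets are smaller than the deposit.
toy$below_deposit <- toy$Total.Assets < deposit

# Candidates whose liabilities exceed their assets.
toy$net_negative <- toy$Liabilities > toy$Total.Assets

print(toy)
```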

Q2 - Q3

At least one element of your data that is unclear even after reading the documentation

You may need to do some digging, but is there anything about the data that your documentation does not explain?

Constituency - A few constituencies have far more candidates than the rest.

For example: on average there are about 15 candidates per constituency (the mean computed below is 14.7), but some constituencies such as Nizamabad, Belgaum and Karur have 30+ candidates.

Nizamabad has a whopping 183 candidates, which is strange, and the documentation does not provide any proper context for it.

# Finding independent candidates per constituency
ind_cand <- data |>
  filter(Party == 'IND') |>
  group_by(Constituency)|>
  summarise(count = n())|>
  arrange(desc(count))
  
print(ind_cand)

## # A tibble: 512 × 2
##    Constituency  count
##    <chr>         <int>
##  1 Nizamabad       176
##  2 Belgaum          51
##  3 Karur            31
##  4 Beed             27
##  5 Chennai South    27
##  6 Thoothukkudi     26
##  7 Jamnagar         25
##  8 Surendranagar    25
##  9 Amritsar         21
## 10 Theni            21
## # ℹ 502 more rows

#Number of candidates per constituency

Count_per_const <- data|>
  group_by(Constituency)|>
  summarise(count = n())|>
  arrange(desc(count))

print(Count_per_const)

## # A tibble: 542 × 2
##    Constituency    count
##    <chr>           <int>
##  1 Nizamabad         183
##  2 Belgaum            56
##  3 Karur              42
##  4 Chennai South      39
##  5 Thoothukkudi       37
##  6 Beed               36
##  7 Chandigarh         36
##  8 Nalanda            35
##  9 Aurangabad         32
## 10 Chennai Central    31
## # ℹ 532 more rows

# Average number of candidates per constituency
print(mean(Count_per_const$count))

## [1] 14.70111
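The two printed tables can be combined to see what fraction of each crowded field is made up of independents. The sketch below copies four constituencies' counts from the outputs above rather than reading the file; the data frame and column names (ind_top, all_top, ind_share) are hypothetical.

```r
# Counts copied from the two printed tables above (four constituencies only).
ind_top <- data.frame(
  Constituency = c("Nizamabad", "Belgaum", "Karur", "Beed"),
  ind_count = c(176, 51, 31, 27)
)
all_top <- data.frame(
  Constituency = c("Nizamabad", "Belgaum", "Karur", "Beed"),
  total_count = c(183, 56, 42, 36)
)

# Join on Constituency and compute the share of independents.
shares <- merge(ind_top, all_top, by = "Constituency")
shares$ind_share <- round(shares$ind_count / shares$total_count, 2)

# Sort so the constituency with the largest independent share comes first.
shares <- shares[order(-shares$ind_share), ]
print(shares)
```

On these numbers, independents account for roughly 96% of the Nizamabad field and 91% of the Belgaum field, so the unusually large candidate counts are driven almost entirely by independents.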

# Nizamabad constituency details
Nizamabad_info <- data |>
  filter(Constituency == "Nizamabad")

dim(Nizamabad_info)

## [1] 183  10

# Plotting parties against number of candidates
Nizamabad_info|>
  ggplot()+
  geom_bar(mapping = aes(y=Party,fill = Party),stat = "count")

We can observe that most candidates are listed under “IND”, which stands for “Independent”, while every other party has only one candidate!

Why is this?

One likely reason: parties avoid giving tickets to more than one candidate per constituency, so many candidates who do not get a ticket decide to contest the election independently (IND).
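The one-candidate-per-party pattern can also be checked numerically instead of reading it off the bar chart. This is a minimal sketch on hypothetical party labels for a single constituency (P1, P2, P3 are placeholder party names, not values from the dataset).

```r
# Hypothetical party labels for one constituency; not taken from the dataset.
toy_party <- c("IND", "IND", "IND", "P1", "P2", "IND", "P3")

# Number of candidates per party.
party_counts <- table(toy_party)

# Parties other than IND that field more than one candidate.
multi <- names(party_counts)[party_counts > 1 & names(party_counts) != "IND"]

print(party_counts)
print(multi)
```

An empty result for multi would mean that no party other than IND fields more than one candidate in that constituency.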

# Belgaum constituency details
Belgaum_info <- data |>
  filter(Constituency == "Belgaum")
  
dim(Belgaum_info)

## [1] 56 10

# Plotting parties against number of candidates
Belgaum_info|>
  ggplot()+
  geom_bar(mapping = aes(y=Party,fill = Party),stat = "count")

Loksabha_data_dive_4

2023-09-24
