Loksabha_data_dive

Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

Creating a copy of our data:

data_cp <- data

Adding four binary columns:

-> age_bin: 1 if Age < 40, else 0 (flags younger candidates).

-> is_rich: 1 if Total.Assets is greater than the mean of the column, else 0.

-> is_educated: 0 if Education is 'Illiterate', else 1.

-> crime_hist: 1 if the candidate has at least one criminal case, else 0.

mean_ta <- mean(data_cp$Total.Assets, na.rm = TRUE)
mean_ta

## [1] 42007516

# Columns are referenced by bare name inside mutate(), so one call creates all four flags
data_cp <- data_cp |>
    mutate(
        age_bin     = ifelse(Age < 40, 1, 0),
        is_rich     = ifelse(Total.Assets > mean_ta, 1, 0),
        is_educated = ifelse(Education == 'Illiterate', 0, 1),
        crime_hist  = ifelse(Criminal.Cases > 0, 1, 0)
    )
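
One detail worth keeping in mind for these flags: ifelse() returns NA when its condition is NA, so a candidate with a missing Total.Assets value gets a missing is_rich flag instead of 0. The sketch below shows this on a small made-up vector (assets and its values are hypothetical, not taken from the dataset).

```r
# Hypothetical asset values, including one missing entry
assets <- c(100, 5000, NA, 250)

# Mean computed the same way as mean_ta, ignoring missing values
mean_assets <- mean(assets, na.rm = TRUE)

# ifelse() propagates NA: the third flag is NA, not 0
rich_flag <- ifelse(assets > mean_assets, 1, 0)
rich_flag

# Count how many flags are missing
sum(is.na(rich_flag))
```

This would be consistent with the 60 rows that ggplot and glm() drop further down, assuming those rows are the ones with a missing Total.Assets value.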

ggplot(data_cp, aes(x = crime_hist)) +
  geom_histogram()+
  xlab("Crime History") +
  ylab("Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cp, aes(x = is_educated)) +
  geom_histogram()+
  xlab("Is educated") +
  ylab("Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cp, aes(x = is_rich)) +
  geom_histogram()+
  xlab("Is rich") +
  ylab("Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 60 rows containing non-finite values (`stat_bin()`).
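
Since the three flags only take the values 0 and 1, a bar chart of counts may be a more natural display than a 30-bin histogram, and it avoids the stat_bin() message. The following is a minimal sketch on a made-up data frame (toy_flags and its values are hypothetical); the same pattern should apply to data_cp.

```r
library(ggplot2)

# Hypothetical 0/1 flag, standing in for a column such as crime_hist
toy_flags <- data.frame(crime_hist = c(0, 0, 0, 1, 1))

# Treat the flag as a category so geom_bar() draws one bar per level
p <- ggplot(toy_flags, aes(x = factor(crime_hist))) +
  geom_bar() +
  xlab("Crime History") +
  ylab("Count")

p
```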

We use ‘Winner’ as the binary response, where 1 represents a winner and 0 represents a candidate who did not win.
The explanatory variables are the binary flags derived from Total Assets, Criminal Cases and Education: is_rich, crime_hist and is_educated.

model <- glm(Winner ~ is_rich + crime_hist + is_educated, data = data_cp, family = "binomial")

summary(model)

## 
## Call:
## glm(formula = Winner ~ is_rich + crime_hist + is_educated, family = "binomial", 
##     data = data_cp)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.162      1.012  -5.102 3.36e-07 ***
## is_rich        2.675      0.125  21.397  < 2e-16 ***
## crime_hist   -17.200    250.366  -0.069    0.945    
## is_educated    1.489      1.014   1.469    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2583.8  on 7907  degrees of freedom
## Residual deviance: 2033.6  on 7904  degrees of freedom
##   (60 observations deleted due to missingness)
## AIC: 2041.6
## 
## Number of Fisher Scoring iterations: 18

Summary of the model:

The is_rich variable has a highly significant positive coefficient (p < 2e-16). Candidates with above-average assets have higher odds of winning: their estimated log-odds of winning are 2.675 higher, which corresponds to an odds ratio of about exp(2.675) ≈ 14.5 with the other two flags held fixed.
The crime_hist variable is not significant (p = 0.945), but this result should be read with caution. An estimate of -17.200 with a standard error of 250.366, together with 18 Fisher scoring iterations, is the typical signature of (quasi-)complete separation, where one level of the flag almost perfectly predicts the outcome. In that situation the Wald test is unreliable, so the model says little about the effect of criminal history either way.
The is_educated variable also does not have a significant coefficient (p = 0.142), so this simplified binary education flag is not shown to predict winning.
Among the three flags, is_rich is the only one with a clearly non-zero effect. Its large z-value reflects the strength of the evidence; the size of the effect is given by the coefficient itself.

In summary, having above-average assets is the only factor that clearly predicts higher odds of winning in this model. The education flag is not significant, and the criminal-history estimate is too unstable to interpret.
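
A very large standard error on a binary predictor, like the one reported for crime_hist, often arises when one level of the predictor contains only winners or only non-winners. The sketch below reproduces that pattern on a small made-up data frame (toy, flag and win are hypothetical names and values). A cross-tabulation such as table(data_cp$crime_hist, data_cp$Winner) would be one way to check whether the same thing happens in the real data.

```r
# Hypothetical data: nobody with flag == 1 wins
toy <- data.frame(
  flag = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  win  = c(1, 0, 1, 0, 1, 0, 0, 0, 0, 0)
)

# The cross-tab has an empty cell for flag == 1, win == 1
tab <- table(flag = toy$flag, win = toy$win)
tab

# glm() still returns a fit, but the flag estimate drifts toward -Inf
# and its standard error becomes very large
fit <- suppressWarnings(glm(win ~ flag, data = toy, family = "binomial"))
coef(summary(fit))
```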

Building a 95% confidence interval (CI) for the ‘is_rich’ coefficient. For a glm, confint() profiles the likelihood and does not simply add and subtract a multiple of the standard error:

model$coefficients[2]

##  is_rich 
## 2.674922

coef_rich <- model$coefficients[2] 


ci_rich <- confint(model, 'is_rich', level = 0.95)

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

ci_rich

##    2.5 %   97.5 % 
## 2.430454 2.920879

Analyzing the CI:

The estimate 2.674922 is the coefficient of is_rich on the log-odds scale: it is the difference in the log-odds of winning between candidates with above-average assets and all other candidates. It is not a standard error, and it is not a direct statement of how many times more likely such a candidate is to win.
The 95% interval [2.430454, 2.920879] is the range of coefficient values that are plausible given the data. Because the whole interval lies above 0, the positive effect is statistically significant at the 5% level. On the odds scale this corresponds to an odds ratio of roughly 11 to 19.
In simple words, candidates with above-average assets are more likely to win, and the interval supports that.
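
Because the coefficient and its interval are on the log-odds scale, exponentiating them gives an odds ratio, which is usually easier to read. The sketch below does this with the numbers printed above, typed in by hand (coef_rich_val and ci_rich_val are illustrative names); with the fitted model available, exp(coef_rich) and exp(ci_rich) should give the same values.

```r
# Estimate and 95% CI for is_rich, copied from the output above (log-odds scale)
coef_rich_val <- 2.674922
ci_rich_val   <- c(lower = 2.430454, upper = 2.920879)

# Exponentiate to move from log-odds to odds ratios
or_rich    <- exp(coef_rich_val)
or_ci_rich <- exp(ci_rich_val)

round(or_rich, 1)
round(or_ci_rich, 1)

# An interval that excludes 0 on the log-odds scale excludes 1 on the odds-ratio scale
all(or_ci_rich > 1)
```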

Scatter plot with a logistic regression (LR) line for the explanatory variable ‘Total.Assets’:

# Log-transform 'Total.Assets' so the x-axis of the plot is on a log scale.
data_cp$totalAssets <- log(data_cp$Total.Assets)

# Create a scatter plot
ggplot(data = data_cp, aes(x = totalAssets, y = Winner)) +
  geom_point(shape = 19) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  xlab("Log(Total Assets)") +
  ylab("Winner")

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 60 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 60 rows containing missing values (`geom_point()`).

The logistic regression line shows how the fitted probability of being a ‘Winner’ changes with log(‘Total Assets’). The relationship is positive: as ‘Total Assets’ increase, the fitted probability of being a ‘Winner’ also increases.
The line is curved because a logistic model is linear on the log-odds scale, which becomes an S-shaped curve on the probability scale. The curvature is therefore a property of the model and not evidence of a non-linear effect.
This plot uses a single predictor, so it cannot show whether other factors not included in the model also influence the probability of being a ‘Winner.’
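
To make the shape of the fitted line concrete, the sketch below fits a one-predictor logistic regression to simulated data and evaluates the fit at a few x values. Everything here is made up for illustration (sim, log_assets, the seed and the true coefficients are assumptions, not estimates from the election data). The fitted values are linear on the log-odds scale and curved on the probability scale.

```r
set.seed(42)

# Simulated predictor, standing in for log(Total.Assets)
n <- 500
log_assets <- rnorm(n, mean = 15, sd = 2)

# Assumed true model: log-odds of winning rise linearly with log_assets
true_prob <- plogis(-12 + 0.7 * log_assets)
sim <- data.frame(log_assets = log_assets,
                  Winner = rbinom(n, size = 1, prob = true_prob))

# Same kind of model that geom_smooth(method = "glm") fits for the plot
fit <- glm(Winner ~ log_assets, data = sim, family = "binomial")

# Fitted values on a grid: straight line in log-odds, curve in probability
grid <- data.frame(log_assets = seq(10, 20, by = 2))
grid$log_odds <- predict(fit, newdata = grid, type = "link")
grid$prob     <- predict(fit, newdata = grid, type = "response")
grid
```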

Loksabha_data_dive_10

2023-10-27