Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)
# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)
# Loading our dataset
data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')
data_cp <- data
-> age_bin: 1 if the candidate is younger than 40, else 0. (Flags young candidates.)
-> is_rich: 1 if the candidate's total assets are greater than the mean of the column, else 0.
-> is_educated: 0 if the candidate's Education is 'Illiterate', else 1.
-> crime_hist: 1 if the candidate has at least one criminal case, else 0.
mean_ta <- mean(data_cp$Total.Assets, na.rm = TRUE)
mean_ta
## [1] 42007516
data_cp <- data_cp |>
  mutate(
    age_bin = ifelse(Age < 40, 1, 0),
    is_rich = ifelse(Total.Assets > mean_ta, 1, 0),
    is_educated = ifelse(Education == 'Illiterate', 0, 1),
    crime_hist = ifelse(Criminal.Cases > 0, 1, 0)
  )
ggplot(data_cp, aes(x = crime_hist)) +
geom_histogram()+
xlab("Crime History") +
ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data_cp, aes(x = is_educated)) +
geom_histogram()+
xlab("Is educated") +
ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data_cp, aes(x = is_rich)) +
geom_histogram()+
xlab("Is rich") +
ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60 rows containing non-finite values (`stat_bin()`).
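The three indicators above take only the values 0 and 1, so a bar chart of counts may read more naturally than a 30-bin histogram. A minimal sketch, assuming a small made-up data frame `toy` in place of `data_cp` (the values are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data standing in for data_cp; the values are made up.
toy <- data.frame(crime_hist = c(0, 0, 1, 0, 1, NA, 0, 1))

# Drop missing values explicitly, then treat the 0/1 indicator as a category.
toy_complete <- toy[!is.na(toy$crime_hist), , drop = FALSE]

p <- ggplot(toy_complete,
            aes(x = factor(crime_hist, levels = c(0, 1), labels = c("No", "Yes")))) +
  geom_bar() +            # one bar per category, height = number of rows
  xlab("Crime History") +
  ylab("Count")

# The counts behind the two bars.
counts <- table(toy_complete$crime_hist)
print(counts)
```

Printing `p` draws the chart. The same pattern should carry over to `is_educated` and `is_rich` by swapping the column name.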
We choose ‘Winner’ as our binary response, where 1 means the candidate won and 0 means the candidate did not win.
Our explanatory variables are the binary indicators is_rich, crime_hist and is_educated, derived from Total Assets, Criminal Cases and Education.
model <- glm(Winner ~ is_rich + crime_hist + is_educated, data = data_cp, family = "binomial")
summary(model)
##
## Call:
## glm(formula = Winner ~ is_rich + crime_hist + is_educated, family = "binomial",
## data = data_cp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.162 1.012 -5.102 3.36e-07 ***
## is_rich 2.675 0.125 21.397 < 2e-16 ***
## crime_hist -17.200 250.366 -0.069 0.945
## is_educated 1.489 1.014 1.469 0.142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2583.8 on 7907 degrees of freedom
## Residual deviance: 2033.6 on 7904 degrees of freedom
## (60 observations deleted due to missingness)
## AIC: 2041.6
##
## Number of Fisher Scoring iterations: 18
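The is_rich variable has a highly significant positive coefficient (p < 2e-16). Holding the other two indicators fixed, the estimated log-odds of winning are 2.675 higher for candidates with above-mean assets, which corresponds to an odds ratio of about exp(2.675) ≈ 14.5.
The crime_hist coefficient is not significant by the Wald test (p = 0.945), but that test is unreliable here. An estimate of -17.2 with a standard error of about 250, together with 18 Fisher scoring iterations, is the typical signature of quasi-complete separation, which would arise if almost no candidate with crime_hist = 1 is recorded as a winner in this data. The model therefore does not show that criminal history has no effect; it shows that this coefficient could not be estimated reliably.
The is_educated variable does not have a significant coefficient either (p = 0.142). This coarse indicator, which only separates ‘Illiterate’ from every other education level, shows no clear association with winning.
The large z-value and very small p-value for is_rich mean its effect is estimated precisely. Significance alone does not measure importance, but the size of the odds ratio suggests it is also the strongest predictor in this model.
In summary, having above-mean assets is the one factor in this model that is clearly associated with higher odds of winning. Education shows no significant association, and the criminal history estimate is inconclusive. These are associations in this dataset, not evidence of cause.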
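A very large standard error on a binary predictor is often caused by an empty cell in the cross-tabulation of predictor and outcome. The sketch below shows that check on simulated data; the variables `x` and `y` are hypothetical and are not the election data.

```r
# Hypothetical toy data (not the election data): no row with x == 1 has y == 1,
# which is the pattern that produces quasi-complete separation.
set.seed(1)
n <- 400
x <- rbinom(n, 1, 0.3)
y <- ifelse(x == 1, 0, rbinom(n, 1, 0.4))

# Step 1: cross-tabulate outcome by predictor and look for an empty cell.
tab <- table(x = x, y = y)
print(tab)

# Step 2: fit the model; a very large standard error on x is the usual symptom.
# suppressWarnings() hides the "fitted probabilities numerically 0 or 1" warning.
fit <- suppressWarnings(glm(y ~ x, family = binomial))
se_x <- summary(fit)$coefficients["x", "Std. Error"]
print(coef(fit)[["x"]])
print(se_x)
```

For the model above, the equivalent first step would be `table(data_cp$crime_hist, data_cp$Winner)`. If one cell is zero or nearly zero, the Wald p-value for crime_hist should not be read as evidence of no effect.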
model$coefficients[2]
## is_rich
## 2.674922
coef_rich <- model$coefficients[2]
ci_rich <- confint(model, 'is_rich', level = 0.95)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
ci_rich
## 2.5 % 97.5 %
## 2.430454 2.920879
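The number 2.674922 is the estimated coefficient of is_rich on the log-odds scale. It is not a standard error, and it is not a ratio of probabilities. Exponentiating it gives an odds ratio of about 14.5: the odds of winning for a candidate with above-mean assets are roughly 14.5 times the odds for other candidates, with the other indicators held fixed.
The 95% profile-likelihood confidence interval [2.430454, 2.920879] lies entirely above 0, so the positive association is statistically significant at the 5% level. On the odds-ratio scale the interval runs from about 11.4 to about 18.6. The repeated glm.fit warnings printed during profiling are consistent with the unstable crime_hist estimate in the model summary.
In simple words, candidates with above-mean assets are much more likely to win in this dataset, and the interval shows that the estimate is reasonably precise.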
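Because the coefficient and its interval are on the log-odds scale, it can help to convert them to odds ratios. A minimal sketch using the numbers printed above; with the fitted model available, `exp(coef_rich)` and `exp(ci_rich)` would do the same thing.

```r
# Values copied from the model output above (log-odds scale).
coef_rich <- 2.674922
ci_rich <- c(lower = 2.430454, upper = 2.920879)

# Exponentiating a logistic coefficient gives an odds ratio.
odds_ratio <- exp(coef_rich)
odds_ratio_ci <- exp(ci_rich)

print(round(odds_ratio, 1))
print(round(odds_ratio_ci, 1))
```

An odds ratio is a ratio of odds, not of probabilities, so it should not be read as "14 times more likely to win".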
# Log-transform 'Total Assets' to see how it relates to the probability of winning.
data_cp$totalAssets <- log(data_cp$Total.Assets)
# Create a scatter plot
ggplot(data = data_cp, aes(x = totalAssets, y = Winner)) +
geom_point(shape = 19) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
xlab("Log(Total Assets)") +
ylab("Winner")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 60 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 60 rows containing missing values (`geom_point()`).
The logistic regression line shows how the fitted probability of being a ‘Winner’ changes as log(Total Assets) increases. The relationship is positive: candidates with larger ‘Total Assets’ have a higher fitted probability of being a ‘Winner’.
The S-shape of the line comes from the logistic link. The model is linear in the log-odds, so the same increase in log(Total Assets) changes the probability most when it is near 0.5 and least when it is near 0 or 1.
The curve by itself is not evidence about factors outside the model, although variables not included here may well influence the probability of being a ‘Winner’.
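The shape of a single-predictor logistic curve can be illustrated with `plogis()`, the inverse logit in base R. The intercept `b0` and slope `b1` below are hypothetical values chosen for illustration, not estimates from the election data.

```r
# Hypothetical intercept and slope, chosen only for illustration;
# they are not estimates from the election data.
b0 <- -12
b1 <- 0.6

# Evenly spaced values of the predictor (log of total assets).
log_assets <- seq(10, 25, by = 2.5)

# The model is linear in the log-odds; plogis() maps log-odds to probability.
p_win <- plogis(b0 + b1 * log_assets)

# Change in probability per equal step of the predictor:
# small in the tails, largest around p = 0.5.
step_change <- diff(p_win)
print(round(data.frame(log_assets, p_win), 3))
print(round(step_change, 3))
```

Equal steps in the predictor give unequal steps in probability even though only one predictor is involved, which is why the bend in the fitted line does not by itself point to omitted variables.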