Dataset

Dataset: Lok Sabha 2019 Candidates General Information (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')
data_cp <- data

Adding four additional binary columns:

-> age_bin: If age < 40 then 1 else 0. (Flags young candidates)

-> is_rich: If the candidate's total assets are greater than the mean of the Total.Assets column then 1 else 0.

-> is_educated: If the candidate's education is anything other than 'Illiterate' then 1 else 0.

-> crime_hist: If the candidate has one or more criminal cases then 1 else 0.

mean_ta <- mean(data_cp$Total.Assets, na.rm = TRUE)
mean_ta
## [1] 42007516
# Derive the four binary indicator columns in a single mutate() call
data_cp <- data_cp |>
    mutate(
        age_bin     = ifelse(Age < 40, 1, 0),
        is_rich     = ifelse(Total.Assets > mean_ta, 1, 0),
        is_educated = ifelse(Education == 'Illiterate', 0, 1),
        crime_hist  = ifelse(Criminal.Cases > 0, 1, 0)
    )
ggplot(data_cp, aes(x = crime_hist)) +
  geom_histogram()+
  xlab("Crime History") +
  ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cp, aes(x = is_educated)) +
  geom_histogram()+
  xlab("Is educated") +
  ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cp, aes(x = is_rich)) +
  geom_histogram()+
  xlab("Is rich") +
  ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60 rows containing non-finite values (`stat_bin()`).
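
Each indicator takes only the values 0 and 1, so a bar chart with one bar per value may be easier to read than a 30-bin histogram, and it avoids the `stat_bin()` message. A minimal sketch using geom_bar() on a hypothetical toy data frame (`toy` and its values are invented for illustration; the real column would be data_cp$crime_hist):

```r
library(ggplot2)

# Hypothetical toy indicator, not the Lok Sabha data.
toy <- data.frame(crime_hist = c(0, 0, 0, 1, 0, 1, 0, 0))

# Treat the 0/1 indicator as a category so each value gets exactly one bar.
p <- ggplot(toy, aes(x = factor(crime_hist))) +
  geom_bar() +
  xlab("Crime History") +
  ylab("Count")
p
```

The same pattern should carry over to is_educated and is_rich by swapping the column name and axis label.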

model <- glm(Winner ~ is_rich + crime_hist + is_educated, data = data_cp, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = Winner ~ is_rich + crime_hist + is_educated, family = "binomial", 
##     data = data_cp)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.162      1.012  -5.102 3.36e-07 ***
## is_rich        2.675      0.125  21.397  < 2e-16 ***
## crime_hist   -17.200    250.366  -0.069    0.945    
## is_educated    1.489      1.014   1.469    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2583.8  on 7907  degrees of freedom
## Residual deviance: 2033.6  on 7904  degrees of freedom
##   (60 observations deleted due to missingness)
## AIC: 2041.6
## 
## Number of Fisher Scoring iterations: 18

Summary of the model:

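In summary, is_rich (total assets above the dataset mean) is the only statistically significant predictor in this model, and it raises the odds of a candidate winning (estimate 2.675, p < 2e-16). Education is not significant (p = 0.142). Criminal history is not significant either (p = 0.945), and its estimate of -17.200 with a standard error of 250.366 is too unstable to support any conclusion about its effect.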
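
A standard error as large as the one on crime_hist (250.366 against an estimate of -17.200) is often a symptom of quasi-complete separation, where one level of a predictor almost never occurs together with one of the outcomes. A minimal sketch of how one might check for this with a cross-tabulation, using a hypothetical toy data frame rather than the actual file (`toy` and its values are invented; whether the real data shows the same pattern is not established here):

```r
# Hypothetical toy data: no row with crime_hist == 1 is a winner,
# which mimics quasi-complete separation.
toy <- data.frame(
  crime_hist = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  Winner     = c(1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
)

# Cross-tabulate the predictor against the outcome.
tab <- table(crime_hist = toy$crime_hist, Winner = toy$Winner)
print(tab)

# TRUE if some predictor/outcome combination never occurs in the data.
has_empty_cell <- any(tab == 0)
has_empty_cell
```

On the real data the equivalent call would be table(data_cp$crime_hist, data_cp$Winner); an empty or nearly empty cell there would explain the unstable estimate.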

Building a 95% confidence interval (CI) for the ‘is_rich’ coefficient with confint(), which profiles the likelihood of the glm fit:

model$coefficients[2]
##  is_rich 
## 2.674922
coef_rich <- model$coefficients[2] 


ci_rich <- confint(model, 'is_rich', level = 0.95)
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
ci_rich
##    2.5 %   97.5 % 
## 2.430454 2.920879

Analyzing the CI:

  • The estimate 2.674922 is the is_rich coefficient on the log-odds scale. Exponentiating it gives an odds ratio of roughly 14.5: the odds of winning for a candidate with above-mean assets are about 14.5 times those of a candidate at or below the mean, holding criminal history and education fixed.

  • The 95% interval [2.430454, 2.920879] is the range of log-odds values consistent with the data at that confidence level. Because the interval excludes 0, the positive effect of is_rich is statistically significant at the 5% level.

  • In simple words, candidates with above-mean assets are more likely to win, and the numbers show it.
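
Logistic regression coefficients are on the log-odds scale, so exponentiating them converts them to odds ratios, which are usually easier to interpret. A minimal sketch of that conversion, with the estimate and interval typed in from the output above instead of being read from the fitted model:

```r
# Values copied from the output above (log-odds scale).
coef_rich <- 2.674922
ci_rich   <- c(lower = 2.430454, upper = 2.920879)

# Exponentiate to move from log-odds to odds ratios.
or_rich    <- exp(coef_rich)
or_ci_rich <- exp(ci_rich)

round(or_rich, 2)
round(or_ci_rich, 2)
```

With the fitted model in memory, exp(coef(model)) and exp(confint(model)) would do the same for every coefficient at once.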

Scatter plot with a logistic regression (LR) line for the explanatory variable ‘Total.Assets’:

# Log-transform 'Total.Assets' to see how it relates to the probability of winning.
data_cp$totalAssets <- log(data_cp$Total.Assets)

# Create a scatter plot
ggplot(data = data_cp, aes(x = totalAssets, y = Winner)) +
  geom_point(shape = 19) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  xlab("Log(Total Assets)") +
  ylab("Winner")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 60 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 60 rows containing missing values (`geom_point()`).
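
The curve drawn by geom_smooth() here is the fitted probability from a logistic regression of Winner on log assets. A minimal sketch of reading predicted probabilities off such a model directly, using simulated data (the names `sim`, `log_assets`, `winner` and every number below are invented for illustration and are not taken from the Lok Sabha file):

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in data: the probability of winning rises with log-assets.
n <- 500
log_assets <- rnorm(n, mean = 15, sd = 2)
winner <- rbinom(n, size = 1, prob = plogis(-12 + 0.7 * log_assets))
sim <- data.frame(log_assets = log_assets, winner = winner)

# Logistic regression of the binary outcome on log-assets.
fit <- glm(winner ~ log_assets, data = sim, family = binomial)

# Predicted win probability at a few log-asset values.
grid <- data.frame(log_assets = c(12, 15, 18))
grid$p_win <- predict(fit, newdata = grid, type = "response")
grid
```

Using type = "response" returns probabilities between 0 and 1; the default type = "link" would return log-odds instead.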