Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

Creating a copy of our data and filling in missing values so that the model output is error-free:

data_cp <- data

Filling missing values with the column mean so that we don’t get the dimension error:

# Age
mean_age <- mean(data_cp$Age, na.rm = TRUE)

data_cp$Age[is.na(data_cp$Age)] <- mean_age

# Total Assets
ta <- mean(data_cp$Total.Assets, na.rm = TRUE)

data_cp$Total.Assets[is.na(data_cp$Total.Assets)] <- ta

# Liabilities
li <- mean(data_cp$Liabilities, na.rm = TRUE)

data_cp$Liabilities[is.na(data_cp$Liabilities)] <- li
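The three imputation steps above follow the same pattern, so they could be folded into one small helper. The sketch below is only an illustration: `impute_mean` is a hypothetical helper name, and the toy data frame (with made-up values) stands in for `data_cp`.

```r
# Hypothetical helper: replace NA values in a numeric vector with the mean
# of its observed (non-missing) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data standing in for data_cp (values are made up for illustration).
toy <- data.frame(
  Age          = c(30, NA, 50),
  Total.Assets = c(100, 200, NA),
  Liabilities  = c(NA, 10, 30)
)

# Apply the helper to each column that needs imputing.
cols <- c("Age", "Total.Assets", "Liabilities")
toy[cols] <- lapply(toy[cols], impute_mean)

toy
```

Mean imputation keeps every row in the model, but it also shrinks the variance of the imputed columns, which is worth keeping in mind when reading the correlations and coefficients that follow.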

Choosing the best variables for our linear model:

cor(select(data_cp, Winner, Total.Assets, Liabilities, Age, Criminal.Cases))
##                     Winner Total.Assets Liabilities        Age Criminal.Cases
## Winner          1.00000000   0.13784068  0.12381314 0.13280653    -0.02815683
## Total.Assets    0.13784068   1.00000000  0.50682529 0.11138832     0.03012918
## Liabilities     0.12381314   0.50682529  1.00000000 0.06541089     0.02640872
## Age             0.13280653   0.11138832  0.06541089 1.00000000     0.02234815
## Criminal.Cases -0.02815683   0.03012918  0.02640872 0.02234815     1.00000000
cor(data_cp$Liabilities, data_cp$Total.Assets, method = "spearman")
## [1] 0.5438489

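From the results: a candidate’s Total Assets and Liabilities are moderately correlated with each other (Pearson 0.51, Spearman 0.54), and each is weakly positively correlated with Winner (0.14 and 0.12). We use them as explanatory variables to model their influence on the likelihood of being a ‘Winner’ in our data.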

model1 <- glm(Winner ~ Total.Assets, data = data_cp,
             family = binomial(link = 'logit'))
model1
## 
## Call:  glm(formula = Winner ~ Total.Assets, family = binomial(link = "logit"), 
##     data = data_cp)
## 
## Coefficients:
##  (Intercept)  Total.Assets  
##   -3.302e+00     1.060e-09  
## 
## Degrees of Freedom: 7967 Total (i.e. Null);  7966 Residual
## Null Deviance:       2589 
## Residual Deviance: 2533  AIC: 2537
model <- glm(Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases,
             data = data_cp,
             family = binomial(link = 'logit'))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
model
## 
## Call:  glm(formula = Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases, 
##     family = binomial(link = "logit"), data = data_cp)
## 
## Coefficients:
##    (Intercept)    Total.Assets     Liabilities             Age  Criminal.Cases  
##     -5.854e+00       1.223e-09       5.156e-09       5.394e-02      -2.284e+01  
## 
## Degrees of Freedom: 7967 Total (i.e. Null);  7963 Residual
## Null Deviance:       2589 
## Residual Deviance: 2215  AIC: 2225
anova(model, model1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases
## Model 2: Winner ~ Total.Assets
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7963     2214.9                          
## 2      7966     2533.2 -3  -318.21 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

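The results show that adding Liabilities, Age and Criminal.Cases reduces the residual deviance by 318.21 on 3 degrees of freedom (p < 2.2e-16).

So in summary, this table provides a statistical test confirming that Model 1 (all four predictors) fits significantly better than Model 2 (Total.Assets only).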
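The p-value in the table comes from a likelihood-ratio (deviance) test: the drop in residual deviance between the two nested models is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. As a rough check, the same calculation can be reproduced by hand from the numbers printed in the table; the variable names here are illustrative.

```r
# Values copied from the Analysis of Deviance table.
dev_drop <- 318.21       # reduction in residual deviance (Deviance column)
df_diff  <- 7966 - 7963  # difference in residual degrees of freedom

# Upper-tail chi-squared probability of a deviance drop at least this large.
p_value <- pchisq(dev_drop, df = df_diff, lower.tail = FALSE)
p_value
```

The result falls far below the `2.2e-16` threshold that R prints, which matches the `< 2.2e-16` entry in the table.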

Diagnostic plots for our models:

Q-Q plots:

qqnorm(residuals(model))
qqline(residuals(model))

  • The normal Q-Q plot suggests that the deviance residuals from ‘model’ are approximately normally distributed. Because Winner is a binary outcome, exact normality is not expected, so this is only a rough check.

  • Additionally, the points depart from the reference line at the right-hand end of the plot. This means that there are a few outliers in the upper tail of the distribution.

qqnorm(residuals(model1))
qqline(residuals(model1))

Residuals vs Fitted plots:

plot(residuals(model, type = "deviance") ~ fitted(model))

plot(residuals(model1, type = "deviance") ~ fitted(model1))

  • The residuals are randomly scattered around the zero line.

  • However, there are a few outliers, which are the observations with large residuals.
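To see which observations lie behind such outliers, the deviance residuals can be filtered by magnitude. The sketch below uses simulated data rather than the election data, and the cutoff of 2 is a common rule of thumb, not something taken from the analysis above; `sim`, `fit`, `dev_res` and `flagged` are illustrative names.

```r
set.seed(1)

# Simulated binary outcome with one predictor (not the Lok Sabha data).
n   <- 500
x   <- rnorm(n)
y   <- rbinom(n, size = 1, prob = plogis(-2 + x))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = sim, family = binomial(link = "logit"))

# Deviance residuals, one per observation.
dev_res <- residuals(fit, type = "deviance")

# Flag rows whose absolute deviance residual exceeds the rule-of-thumb cutoff.
flagged <- which(abs(dev_res) > 2)
sim[flagged, ]
```

Applied to the models in this report, the same filter on `residuals(model, type = "deviance")` would list the candidates whose outcome the model predicts worst.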