Loksabha_data_dive

Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

Creating a copy of our data and filling in some missing values to get error-free model output:

data_cp <- data

Replacing the missing values in Age, Total.Assets, and Liabilities with the column mean so that we don’t get the dimension error:

# Age
mean_age <- mean(data_cp$Age, na.rm = TRUE)

data_cp$Age[is.na(data_cp$Age)] <- mean_age

# Total Assets
ta <- mean(data_cp$Total.Assets, na.rm = TRUE)

data_cp$Total.Assets[is.na(data_cp$Total.Assets)] <- ta

# Liabilities
li <- mean(data_cp$Liabilities, na.rm = TRUE)

data_cp$Liabilities[is.na(data_cp$Liabilities)] <- li
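
The three imputation steps above follow the same pattern, so they could also be written once and applied to several columns. The following is a minimal sketch on made-up toy data; `impute_mean` is a hypothetical helper and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# replace the NAs in a numeric vector with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data standing in for data_cp; the values are made up for illustration.
toy <- data.frame(
  Age          = c(40, NA, 60),
  Total.Assets = c(1e6, 3e6, NA),
  Liabilities  = c(NA, 2e5, 4e5)
)

# Apply the same imputation to each selected column.
cols <- c("Age", "Total.Assets", "Liabilities")
toy[cols] <- lapply(toy[cols], impute_mean)
toy
```

Mean imputation keeps every row in the data, but it also shrinks the variance of the imputed columns, which is worth keeping in mind when reading the model output.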

Choosing the best variables for our logistic regression model:

cor(select(data_cp, Winner, Total.Assets, Liabilities, Age, Criminal.Cases))

##                     Winner Total.Assets Liabilities        Age Criminal.Cases
## Winner          1.00000000   0.13784068  0.12381314 0.13280653    -0.02815683
## Total.Assets    0.13784068   1.00000000  0.50682529 0.11138832     0.03012918
## Liabilities     0.12381314   0.50682529  1.00000000 0.06541089     0.02640872
## Age             0.13280653   0.11138832  0.06541089 1.00000000     0.02234815
## Criminal.Cases -0.02815683   0.03012918  0.02640872 0.02234815     1.00000000

cor(data_cp$Liabilities,data_cp$Total.Assets, method="spearman")

## [1] 0.5438489

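From the results: The candidate’s Total Assets and Liabilities are moderately correlated with each other (Pearson 0.51, Spearman 0.54), while each is only weakly correlated with ‘Winner’ (0.14 and 0.12). Age shows a similarly weak correlation with ‘Winner’ (0.13), and Criminal.Cases is close to zero (-0.03). We use these explanatory variables to model their influence on the likelihood of being a ‘Winner’ in our data.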

model1 <- glm(Winner ~ Total.Assets, data = data_cp,
             family = binomial(link = 'logit'))
model1

## 
## Call:  glm(formula = Winner ~ Total.Assets, family = binomial(link = "logit"), 
##     data = data_cp)
## 
## Coefficients:
##  (Intercept)  Total.Assets  
##   -3.302e+00     1.060e-09  
## 
## Degrees of Freedom: 7967 Total (i.e. Null);  7966 Residual
## Null Deviance:       2589 
## Residual Deviance: 2533  AIC: 2537

The coefficient for Total.Assets, 1.060e-09, is positive, meaning Total.Assets is positively associated with the log-odds (logit) of Winner.
The residual deviance (2533) is lower than the null deviance (2589), indicating that including the predictor improves model fit.
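
Because the coefficients are on the log-odds scale, they are easier to read after a transformation. The sketch below uses the two coefficients printed for model1; the scale factor `1e7` is an assumption chosen for illustration (one crore, if Total.Assets is recorded in rupees), not something stated in the dataset description.

```r
# Coefficients copied from the printed output of model1.
b0 <- -3.302      # (Intercept)
b1 <- 1.060e-09   # Total.Assets

# Predicted probability of Winner when Total.Assets is 0.
baseline_prob <- plogis(b0)

# Assumed rescaling: effect of an increase of 1e7 units in Total.Assets
# (hypothetical unit choice, see the note above).
scale_factor <- 1e7
odds_ratio <- exp(b1 * scale_factor)

baseline_prob   # roughly 0.036
odds_ratio      # roughly 1.011
```

Under that assumption, each additional 1e7 in Total.Assets multiplies the odds of winning by about 1.011, which makes the tiny raw coefficient easier to interpret.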

model <- glm(Winner ~ Total.Assets+ Liabilities+ Age+ Criminal.Cases,
             data = data_cp, 
             family = binomial(link = 'logit'))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

model

## 
## Call:  glm(formula = Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases, 
##     family = binomial(link = "logit"), data = data_cp)
## 
## Coefficients:
##    (Intercept)    Total.Assets     Liabilities             Age  Criminal.Cases  
##     -5.854e+00       1.223e-09       5.156e-09       5.394e-02      -2.284e+01  
## 
## Degrees of Freedom: 7967 Total (i.e. Null);  7963 Residual
## Null Deviance:       2589 
## Residual Deviance: 2215  AIC: 2225

The fitted coefficients for each predictor are:
- Total.Assets: 1.223e-09 (positive)
- Liabilities: 5.156e-09 (positive)
- Age: 5.394e-02 (positive)
- Criminal.Cases: -2.284e+01 (negative)
Including the additional predictors further improves model fit compared to model1:
- Null deviance is still 2589
- Residual deviance is reduced to 2215 (from 2533 with just Total.Assets)
- AIC drops to 2225 (from 2537)
All predictors except Criminal.Cases are positively associated with the log-odds of Winner.
Criminal.Cases has a large negative coefficient, indicating that more criminal cases are associated with a lower probability of Winner. However, a coefficient of this magnitude, together with the warning that fitted probabilities numerically 0 or 1 occurred, suggests possible (quasi-)complete separation, so this estimate should be interpreted with caution.
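
One possible explanation for a coefficient as large as -2.284e+01, alongside the glm.fit warning shown earlier, is (quasi-)complete separation: some value of a predictor never occurs together with a win. The sketch below reproduces that situation on made-up toy data; it is not a diagnosis of the real dataset, and the names `toy`, `x`, and `y` are placeholders.

```r
# Toy data (made up): y is always 0 whenever x is 1.
toy <- data.frame(
  x = c(rep(0, 20), rep(1, 10)),
  y = c(rep(c(0, 1), 10), rep(0, 10))
)

# An empty cell in the cross-tabulation is a sign of separation.
tab <- table(y = toy$y, x = toy$x)
print(tab)

# The maximum likelihood estimate for x does not exist here; glm() stops
# at a large negative value with a very large standard error.
fit <- suppressWarnings(
  glm(y ~ x, data = toy, family = binomial(link = "logit"))
)
print(coef(summary(fit)))
```

Running the equivalent cross-tabulation on the real data (for example Winner against Criminal.Cases) and checking the standard errors in `summary(model)` would show whether the same pattern applies.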

anova(model, model1, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases
## Model 2: Winner ~ Total.Assets
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7963     2214.9                          
## 2      7966     2533.2 -3  -318.21 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show:

- Model 1 (the full model) has lower residual deviance than Model 2 (Total.Assets only): 2214.9 vs 2533.2
- The difference in deviance is 318.21 on 3 degrees of freedom
- The p-value is < 2.2e-16, indicating this difference is highly statistically significant

In summary, this likelihood ratio test confirms that the full model fits significantly better than the model with Total.Assets alone.
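
The numbers in the analysis of deviance table can be checked by hand. The sketch below uses only the rounded values printed above, so the deviance difference comes out as 318.3 rather than the 318.21 that R computes from the unrounded deviances. The AIC check relies on the fact that, for a binary (0/1) response, the residual deviance equals -2 times the log-likelihood.

```r
# Values copied from the printed model output (rounded).
dev_reduced <- 2533.2   # Winner ~ Total.Assets
dev_full    <- 2214.9   # Winner ~ Total.Assets + Liabilities + Age + Criminal.Cases
k_reduced   <- 2        # number of estimated coefficients, incl. intercept
k_full      <- 5

# Likelihood ratio statistic and its chi-squared p-value.
lr_stat <- dev_reduced - dev_full
df_diff <- k_full - k_reduced
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# AIC for a binary logistic model: residual deviance + 2 * parameters.
aic_reduced <- dev_reduced + 2 * k_reduced
aic_full    <- dev_full + 2 * k_full

c(lr_stat = lr_stat, df = df_diff, p_value = p_value)
c(aic_reduced = aic_reduced, aic_full = aic_full)
```

The recomputed AIC values round to 2537 and 2225, matching the model printouts.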

Diagnostic plots for our models:

Q-Q plots:

qqnorm(residuals(model))
qqline(residuals(model))

The normal Q-Q plot suggests that the residuals from ‘model’ are approximately normally distributed.
Additionally, the points drift away from the reference line in the right-hand tail, which indicates a few outliers on the right-hand side of the distribution.
Because Winner is binary, the residuals of a logistic model are not expected to be exactly normal, so this plot serves only as a rough check.

qqnorm(residuals(model1))
qqline(residuals(model1))

Residuals vs Fitted plots:

plot(residuals(model, type = "deviance") ~ fitted(model))

plot(residuals(model1, type = "deviance") ~ fitted(model1))

The residuals are randomly scattered around the zero line.
However, there are a few outliers, which are the observations with large residuals.
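
With a binary response, a raw residuals vs fitted plot can be hard to read because every residual falls on one of two curves. A binned residual plot is one common alternative: observations are grouped by fitted probability and the residuals are averaged within each group. The sketch below uses simulated data, not the Lok Sabha data, and all names in it (`fit`, `p_hat`, `binned`) are placeholders.

```r
# Simulated data (an assumption for illustration, not the real dataset).
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))

# Raw (response) residuals: observed outcome minus fitted probability.
p_hat   <- fitted(fit)
raw_res <- y - p_hat

# Group observations into 10 bins by fitted probability (deciles).
bins <- cut(p_hat,
            breaks = quantile(p_hat, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)

# Average fitted value and average residual within each bin.
binned <- data.frame(
  mean_fitted = as.numeric(tapply(p_hat, bins, mean)),
  mean_resid  = as.numeric(tapply(raw_res, bins, mean))
)

plot(binned$mean_fitted, binned$mean_resid,
     xlab = "Mean fitted probability", ylab = "Mean residual")
abline(h = 0, lty = 2)
```

If the model is adequate, the binned averages should scatter around zero without a visible trend; applying the same steps to `model` and `model1` would only require swapping in their fitted values and the observed Winner column.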

Loksabha_data_dive_11

2023-11-03
