Dataset

Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)

# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)

Loading our Dataset

# Loading our dataset

data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')

Response Variable -

We choose the “Winner” column as the response variable because it records the election outcome, which is the quantity of most interest in our data.

Explanatory Variable -

For the first hypothesis we choose “Party” as the explanatory variable, since a candidate’s party plausibly affects the result.

Null hypothesis: “A candidate’s result is not affected by the party he/she chooses to contest the election with.”

m <- aov(Winner ~ Party, data = data)
summary(m)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## Party        680  100.6 0.14790   5.592 <2e-16 ***
## Residuals   7287  192.8 0.02645                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results,

Df (Degrees of Freedom) - “Party” has 680 degrees of freedom, which is the number of distinct parties minus one (681 parties). The residual degrees of freedom, 7287, equal the number of candidates minus the number of parties (7968 - 681).

Sum Sq (Sum of Squares) - The sum of squares is 100.6 for “Party” and 192.8 for the residuals. The first is the variation in “Winner” attributable to differences between parties; the second is the variation that remains unexplained within parties. “Party” therefore accounts for roughly 34% of the total variation (100.6 / 293.4).

Mean Sq (Mean Sum of Squares) - Each sum of squares divided by its degrees of freedom: 0.14790 for “Party” and 0.02645 for the residuals.

F value - The ratio of the two mean squares (0.14790 / 0.02645 = 5.592). A value well above 1 indicates that the variation between parties is large relative to the variation within parties.

Pr(>F) - The p-value is below 2e-16, which gives strong evidence to reject our null hypothesis.

Hence, the “Party” variable does appear to have a statistically significant association with “Winner.”
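
The quantities in an ANOVA table are linked by simple arithmetic, which can be checked by hand. The sketch below recomputes the mean squares, the F value, the p-value and the share of variation explained from the rounded figures printed in the summary(m) output; the variable names are illustrative, and small differences from the printed output are due to rounding.

```r
# Values copied from the summary(m) output (rounded as printed)
ss_party <- 100.6; df_party <- 680
ss_resid <- 192.8; df_resid <- 7287

# Mean square = sum of squares / degrees of freedom
ms_party <- ss_party / df_party
ms_resid <- ss_resid / df_resid

# F value = ratio of the two mean squares
f_value <- ms_party / ms_resid

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_party, df_resid, lower.tail = FALSE)

# Share of total variation attributed to Party
share_explained <- ss_party / (ss_party + ss_resid)

round(c(ms_party = ms_party, ms_resid = ms_resid,
        F = f_value, share = share_explained), 4)
p_value
```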

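For the second scenario, we look at the relationship between “Winner” and the number of criminal cases. Note that the model below is fitted with “Criminal.Cases” as the response and “Winner” as the grouping variable, so it compares the mean number of criminal cases between winners and non-winners.

Null hypothesis: “Winner is not dependent on the number of criminal cases.”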

m <- aov(Criminal.Cases ~ Winner, data = data)
summary(m)
##               Df Sum Sq Mean Sq F value Pr(>F)  
## Winner         1    104  104.19   6.321  0.012 *
## Residuals   7966 131311   16.48                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results,

We get an F value of 6.321 and a p-value of 0.012 (< 0.05), so the result is statistically significant at the 5% level: the mean number of criminal cases differs between winners and non-winners. Because the model is fitted as Criminal.Cases ~ Winner, this shows an association between the two variables; it does not by itself show that the number of criminal cases influences the outcome.
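
To test the hypothesis in the direction it is stated, with “Winner” as the outcome, one possible alternative is a logistic regression, which suits a 0/1 response. This is a swapped-in technique, not the analysis used in this report, and the sketch runs on simulated data: the `sim` data frame and the effect sizes in it are made up for illustration, so the output says nothing about the real dataset.

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical values)
n <- 1000
sim <- data.frame(Criminal.Cases = rpois(n, lambda = 0.5))
sim$Winner <- rbinom(n, size = 1,
                     prob = plogis(-2 + 0.3 * sim$Criminal.Cases))

# Logistic regression: probability of winning as a function of criminal cases
fit <- glm(Winner ~ Criminal.Cases, family = binomial, data = sim)
coef(summary(fit))

# Odds ratio per additional criminal case
exp(coef(fit)["Criminal.Cases"])
```

On the real data the equivalent call would be glm(Winner ~ Criminal.Cases, family = binomial, data = data), assuming “Winner” is coded as 0/1.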

Linear Regression model: Candidate’s Total Assets v/s Liabilities

Total Assets - All the valuable things owned by the candidate.

Liabilities - All the debts and obligations the candidate owes to others.

data |>
  ggplot(mapping = aes(x = Total.Assets, y = Liabilities)) +
  geom_point(size = 4, color = 'violet') +
  geom_smooth(method = "lm", se = FALSE, color = 'blue') +
  labs(title = "Linear Regression plot", x = "Total Assets", y = "Liabilities") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 60 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 60 rows containing missing values (`geom_point()`).

model <- lm(Liabilities ~ Total.Assets, data = data)

summary(model)
## 
## Call:
## lm(formula = Liabilities ~ Total.Assets, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.032e+09 -2.139e+06 -1.764e+06 -1.689e+06  1.177e+09 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.711e+06  4.553e+05   3.758 0.000172 ***
## Total.Assets 9.302e-02  1.779e-03  52.280  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39940000 on 7906 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.2569, Adjusted R-squared:  0.2568 
## F-statistic:  2733 on 1 and 7906 DF,  p-value: < 2.2e-16
# Diagnostic plot for our LR model
par(mfrow=c(2,2))
plot(model)

# Report whether the slope of Total.Assets is significant at the 5% level
if (summary(model)$coefficients["Total.Assets", "Pr(>|t|)"] < 0.05) {
  cat("The 'Total.Assets' variable has a significant effect on 'Liabilities'.")
} else {
  cat("The 'Total.Assets' variable does not have a significant effect on 'Liabilities'.")
}
## The 'Total.Assets' variable has a significant effect on 'Liabilities'.

Linear Regression model results:

Intercept: This number (about 1.711 million) is the predicted value of “Liabilities” when “Total.Assets” is zero. A positive value is plausible, since a candidate can carry debts while declaring no assets, but it is an extrapolation and should be interpreted with caution.

Multiple R-squared: This number (0.2569) tells us how well “Total.Assets” explains the differences in “Liabilities.” In this case, it explains about 25.69% of the variance.

F-statistic: This tests whether the model as a whole (here, with “Total.Assets” as the only predictor) is useful. A large F-statistic with a small p-value means the model explains significantly more variation than a model with no predictors. In this case, the F-statistic is 2733 and the p-value is below 2.2e-16, so the model is statistically significant.
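
In a regression with a single predictor, the overall F-statistic equals the square of the slope’s t value, and R-squared can be recovered from F and the residual degrees of freedom. The short check below uses the rounded values printed in the summary(model) output for Liabilities ~ Total.Assets; the variable names are illustrative.

```r
# Values copied from the summary(model) output (rounded as printed)
t_slope  <- 52.280   # t value of Total.Assets
f_stat   <- 2733     # overall F-statistic
df_resid <- 7906     # residual degrees of freedom

# With one predictor, F = t^2 (up to rounding)
t_slope^2

# R-squared recovered from F: R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 4)
```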

Including “Criminal.Cases” as an interaction term in the model:

The number of criminal cases may change how “Total.Assets” relates to “Liabilities.” Adding the interaction term lets the slope of “Total.Assets” vary with “Criminal.Cases.”

model <- lm(Liabilities ~ Total.Assets*Criminal.Cases, data = data)

summary(model)
## 
## Call:
## lm(formula = Liabilities ~ Total.Assets * Criminal.Cases, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -902245212   -2000277   -1707158   -1473044 1206371972 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.680e+06  4.531e+05   3.708 0.000211 ***
## Total.Assets                 8.131e-02  1.921e-03  42.336  < 2e-16 ***
## Criminal.Cases              -2.750e+05  1.120e+05  -2.456 0.014065 *  
## Total.Assets:Criminal.Cases  1.234e-02  8.256e-04  14.945  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39390000 on 7904 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.2774, Adjusted R-squared:  0.2772 
## F-statistic:  1012 on 3 and 7904 DF,  p-value: < 2.2e-16

Based on the results,

Intercept: This represents the estimated “Liabilities” when “Total.Assets” and “Criminal.Cases” are both zero. In this case, it’s about 1,680,000.

Total.Assets: Because the model includes an interaction, this is the slope when “Criminal.Cases” is zero: for each additional unit of “Total.Assets,” “Liabilities” goes up by about 0.0813.

Criminal.Cases: Likewise, this is the effect when “Total.Assets” is zero: each additional criminal case is associated with a decrease in “Liabilities” of about 275,000.

Total.Assets:Criminal.Cases: This shows how the effect of “Total.Assets” on “Liabilities” changes with “Criminal.Cases.” Each additional criminal case increases the slope of “Total.Assets” by about 0.01234.

F-statistic: In our case, the F-statistic is 1012 and the p-value is below 2.2e-16, so the model as a whole is statistically significant. R-squared rises from 0.2569 to 0.2774, a modest improvement over the model without the interaction.
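
In a model with an interaction term, the slope of “Total.Assets” is not a single number: it depends on the value of “Criminal.Cases.” The sketch below computes that conditional slope from the coefficients printed in the interaction model output; `slope_at` is a hypothetical helper written for illustration, and the chosen case counts are arbitrary.

```r
# Coefficients copied from the interaction model output (rounded as printed)
b_assets      <- 8.131e-02   # Total.Assets
b_interaction <- 1.234e-02   # Total.Assets:Criminal.Cases

# Estimated change in Liabilities per unit of Total.Assets
# for a candidate with `cases` criminal cases
slope_at <- function(cases) b_assets + b_interaction * cases

data.frame(Criminal.Cases = c(0, 1, 5),
           slope = slope_at(c(0, 1, 5)))
```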

model <- lm(Liabilities ~ Total.Assets*Criminal.Cases*Age, data = data)

summary(model)
## 
## Call:
## lm(formula = Liabilities ~ Total.Assets * Criminal.Cases * Age, 
##     data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -979335132   -2281502   -1703938   -1302600 1173088843 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.532e+05  1.815e+06   0.305    0.760    
## Total.Assets                     8.926e-03  1.241e-02   0.719    0.472    
## Criminal.Cases                   1.531e+05  7.312e+05   0.209    0.834    
## Age                              2.692e+04  3.753e+04   0.717    0.473    
## Total.Assets:Criminal.Cases      7.233e-02  7.452e-03   9.706  < 2e-16 ***
## Total.Assets:Age                 1.259e-03  2.130e-04   5.908 3.62e-09 ***
## Criminal.Cases:Age              -1.291e+04  1.592e+04  -0.811    0.417    
## Total.Assets:Criminal.Cases:Age -1.071e-03  1.338e-04  -8.004 1.37e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39190000 on 7900 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.2852, Adjusted R-squared:  0.2845 
## F-statistic: 450.2 on 7 and 7900 DF,  p-value: < 2.2e-16

This model adds “Age” as a further interaction term alongside “Criminal.Cases,” which gives a three-way interaction with “Total.Assets.”

Based on the results,