Dataset: Loksabha 2019 Candidates General Information. (https://www.kaggle.com/datasets/themlphdstudent/lok-sabha-election-candidate-list-2004-to-2019)
# Importing required libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(ggthemes)
# Loading our dataset
data <- read.csv('C:\\Users\\bhush\\Downloads\\Coursework\\I 590 INTRO TO R\\datasets\\data_final\\LokSabha2019_xl.csv')
We choose the “Winner” column as the response variable because it records the election outcome, the quantity of most interest in our data.
For the first hypothesis, we choose “Party” as the explanatory variable, which we expect to affect the result. The null hypothesis is that the mean of “Winner” is the same for every party.
m <- aov(Winner ~ Party, data = data)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Party 680 100.6 0.14790 5.592 <2e-16 ***
## Residuals 7287 192.8 0.02645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA results,
Df (Degrees of Freedom) - The 680 degrees of freedom for “Party” are the number of parties minus one, so 681 distinct parties are present. The 7287 residual degrees of freedom are the number of candidates minus the number of parties, so there are 7968 candidates overall.
Sum Sq (Sum of Squares) - The sum of squares is 100.6 for “Party” and 192.8 for the residuals. This helps us understand how much of the variation in “Winner” can be attributed to the party (100.6) and how much remains unexplained (192.8). “Party” therefore accounts for roughly 34% of the total variation (100.6 out of 293.4).
Mean Sq (Mean Sum of Squares) - Each sum of squares divided by its degrees of freedom. It equals 0.14790 for “Party” (variation between parties) and 0.02645 for the residuals (variation within parties).
F value - The ratio of the two mean squares, 0.14790 / 0.02645 = 5.592. A value well above 1 means that “Winner” varies more between parties than we would expect from the variation within parties alone.
Pr(>F) - The p-value is below 2e-16, i.e. very close to 0, which gives strong evidence to reject our null hypothesis that “Party” makes no difference to “Winner.”
Hence, the “Party” variable does seem to have a statistically significant effect on “Winner.”
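The quantities in the table are linked by simple arithmetic, which can be checked directly. The sketch below only reuses the rounded numbers printed in the ANOVA output above, so the results agree with the table up to rounding; the variable names are our own.

```r
# Values copied from the ANOVA table for Winner ~ Party
df_party <- 680
df_resid <- 7287
ss_party <- 100.6
ss_resid <- 192.8

# Mean squares: sum of squares divided by degrees of freedom
ms_party <- ss_party / df_party
ms_resid <- ss_resid / df_resid

# F value: ratio of the two mean squares
f_value <- ms_party / ms_resid

# Upper-tail p-value of the F distribution
p_value <- pf(f_value, df_party, df_resid, lower.tail = FALSE)

# Eta squared: share of the total variation attributed to Party
eta_sq <- ss_party / (ss_party + ss_resid)

# Group and sample counts implied by the degrees of freedom
n_parties <- df_party + 1
n_candidates <- df_party + df_resid + 1

round(c(ms_party = ms_party, ms_resid = ms_resid,
        f_value = f_value, eta_sq = eta_sq), 4)
```

Eta squared is not printed by `summary()`, but it is a convenient way to express how much of the variation the grouping variable accounts for.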
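For the second scenario, we look at “Criminal.Cases” together with “Winner.” Note that in the model below “Criminal.Cases” is the response and “Winner” is the explanatory (grouping) variable, so the test compares the mean number of criminal cases between winners and non-winners.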
m <- aov(Criminal.Cases ~ Winner, data = data)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Winner 1 104 104.19 6.321 0.012 *
## Residuals 7966 131311 16.48
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA results,
We get an F value of 6.321 and a p-value of 0.012 (< 0.05), so the difference in the mean number of criminal cases between winners and non-winners is statistically significant at the 5% level. This shows an association between “Winner” and “Criminal Cases”; it does not by itself show that criminal cases influence the outcome. The effect is also weak: “Winner” accounts for a sum of squares of only 104 out of about 131,415 in total.
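With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test, and the F value equals the squared t statistic. The sketch below illustrates this on simulated data, since the real file is not available here; the data frame `sim` and its values are made up and only the column names follow the dataset.

```r
set.seed(42)

# Simulated stand-in for the real data: a 0/1 outcome and a count of cases
sim <- data.frame(Winner = rep(c(0, 1), times = c(180, 20)))
sim$Criminal.Cases <- rpois(nrow(sim), lambda = ifelse(sim$Winner == 1, 1.5, 1))

# One-way ANOVA with Winner as the grouping variable
fit <- aov(Criminal.Cases ~ factor(Winner), data = sim)
f_value <- summary(fit)[[1]][["F value"]][1]

# Pooled-variance two-sample t-test on the same data
tt <- t.test(Criminal.Cases ~ factor(Winner), data = sim, var.equal = TRUE)
t_value <- unname(tt$statistic)

# With two groups, F equals t squared
all.equal(f_value, t_value^2)

# Group means show the direction of the difference
tapply(sim$Criminal.Cases, sim$Winner, mean)
```

The group means are worth printing alongside the test, because the F value alone does not say which group has more criminal cases on average.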
Total Assets - All the valuable things owned by the candidate.
Liabilities - All the debts and obligations the candidate owes to others.
data |>
  ggplot(mapping = aes(x = Total.Assets, y = Liabilities)) +
  geom_point(size = 4, color = 'violet') +
  geom_smooth(method = "lm", se = FALSE, color = 'blue') +
  labs(title = "Linear Regression plot", x = "Total Assets", y = "Liabilities") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 60 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 60 rows containing missing values (`geom_point()`).
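The warnings above report 60 rows dropped from the plot. One way to see which rows are affected is to count incomplete cases in the two plotted columns. This is a minimal sketch on a made-up data frame `toy`; it assumes the dropped rows are caused by `NA` values in these two columns, which the output does not confirm.

```r
# Toy data frame standing in for the real data (values are made up)
toy <- data.frame(
  Total.Assets = c(1e6, NA, 5e5, 2e7, NA),
  Liabilities  = c(2e5, 1e4, NA, 3e6, NA)
)

# TRUE for rows where both plotted columns are present
ok <- complete.cases(toy[, c("Total.Assets", "Liabilities")])

n_dropped <- sum(!ok)  # rows that the plot and lm() would leave out
n_used <- sum(ok)      # rows that remain

# Missing values per column
colSums(is.na(toy))
```

Running the same calls on `data` instead of `toy` would show whether the missing values sit in “Total.Assets,” in “Liabilities,” or in both.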
model <- lm(Liabilities ~ Total.Assets, data = data)
summary(model)
##
## Call:
## lm(formula = Liabilities ~ Total.Assets, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.032e+09 -2.139e+06 -1.764e+06 -1.689e+06 1.177e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.711e+06 4.553e+05 3.758 0.000172 ***
## Total.Assets 9.302e-02 1.779e-03 52.280 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39940000 on 7906 degrees of freedom
## (60 observations deleted due to missingness)
## Multiple R-squared: 0.2569, Adjusted R-squared: 0.2568
## F-statistic: 2733 on 1 and 7906 DF, p-value: < 2.2e-16
# Diagnostic plot for our LR model
par(mfrow=c(2,2))
plot(model)
if (summary(model)$coefficients["Total.Assets", "Pr(>|t|)"] < 0.05) {
  cat("The 'Total.Assets' variable has a significant effect on 'Liabilities'.")
} else {
  cat("The 'Total.Assets' variable does not have a significant effect on 'Liabilities'.")
}
## The 'Total.Assets' variable has a significant effect on 'Liabilities'.
Intercept: This number (about 1.711 million) is the predicted “Liabilities” when “Total.Assets” is zero. In real life it may not be meaningful, because it describes a candidate who owns nothing yet still owes about 1.7 million.
Total.Assets: Each additional unit of “Total.Assets” is associated with about 0.093 more “Liabilities.”
Multiple R-squared: This number (0.2569) tells us how well “Total.Assets” explains the variation in “Liabilities.” In this case, it explains about 25.69% of that variation, so roughly three quarters remains unexplained.
F-statistic: This tests whether the model as a whole (here, with “Total.Assets” as the only predictor) is useful. A large F-statistic with a small p-value means the model fits better than one with no predictors. In this case, the F-statistic is 2733 and the p-value is below 2.2e-16, so the model is useful.
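To make the fitted line concrete, the sketch below plugs a few asset values into the estimated equation. The coefficients are the rounded values from the output above; the asset values and the helper `predict_liabilities()` are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary(model) output above
intercept <- 1.711e6
slope <- 9.302e-02

# Predicted Liabilities for a given value of Total.Assets
predict_liabilities <- function(total_assets) {
  intercept + slope * total_assets
}

# Hypothetical asset values, in the same units as the data
assets <- c(0, 1e7, 1e8)
predict_liabilities(assets)

# With a single predictor, the F-statistic is the square of the slope's t value
t_slope <- 52.280
t_slope^2
```

On the fitted model object itself, `predict(model, newdata = data.frame(Total.Assets = assets))` should give the same predictions without rounding.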
The number of criminal cases may change how “Liabilities” relates to “Total.Assets,” so we add “Criminal.Cases” and its interaction with “Total.Assets” to the linear regression model.
model <- lm(Liabilities ~ Total.Assets*Criminal.Cases, data = data)
summary(model)
##
## Call:
## lm(formula = Liabilities ~ Total.Assets * Criminal.Cases, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -902245212 -2000277 -1707158 -1473044 1206371972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.680e+06 4.531e+05 3.708 0.000211 ***
## Total.Assets 8.131e-02 1.921e-03 42.336 < 2e-16 ***
## Criminal.Cases -2.750e+05 1.120e+05 -2.456 0.014065 *
## Total.Assets:Criminal.Cases 1.234e-02 8.256e-04 14.945 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39390000 on 7904 degrees of freedom
## (60 observations deleted due to missingness)
## Multiple R-squared: 0.2774, Adjusted R-squared: 0.2772
## F-statistic: 1012 on 3 and 7904 DF, p-value: < 2.2e-16
Based on the results,
Intercept: This is the estimated “Liabilities” when “Total.Assets” and “Criminal.Cases” are both zero. In this case, it’s about 1,680,000.
Total.Assets: For a candidate with no criminal cases, each additional unit of “Total.Assets” is associated with about 0.0813 more “Liabilities.”
Criminal.Cases: For a candidate with zero “Total.Assets,” each additional criminal case is associated with about 275,000 less “Liabilities.”
Total.Assets:Criminal.Cases: This shows how the effect of “Total.Assets” on “Liabilities” changes with “Criminal.Cases.” Each additional criminal case raises the slope for “Total.Assets” by about 0.01234, so with one case the slope is about 0.08131 + 0.01234 = 0.09365.
F-statistic: The F-statistic is 1012 and the p-value is below 2.2e-16, so the model as a whole is useful. R-squared rises only slightly, from 0.2569 to 0.2774, so the added terms explain about 2 percentage points more of the variation.
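In a model with an interaction, the slope for one variable depends on the value of the other. The sketch below computes those conditional slopes from the rounded coefficients printed above; the case counts and the helper `assets_slope()` are hypothetical.

```r
# Coefficients copied from the interaction model output above
b_assets <- 8.131e-02        # Total.Assets
b_cases <- -2.750e+05        # Criminal.Cases
b_interaction <- 1.234e-02   # Total.Assets:Criminal.Cases

# Slope of Liabilities with respect to Total.Assets at a given number of cases
assets_slope <- function(criminal_cases) {
  b_assets + b_interaction * criminal_cases
}

cases <- 0:5  # hypothetical numbers of criminal cases
data.frame(Criminal.Cases = cases, slope = assets_slope(cases))

# Asset level at which the effect of one more case changes sign:
# b_cases + b_interaction * assets = 0
break_even_assets <- -b_cases / b_interaction
break_even_assets
```

Below the break-even asset level an extra criminal case is associated with lower predicted liabilities, and above it with higher predicted liabilities, which is why the negative “Criminal.Cases” coefficient should not be read on its own.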
model <- lm(Liabilities ~ Total.Assets*Criminal.Cases*Age, data = data)
summary(model)
##
## Call:
## lm(formula = Liabilities ~ Total.Assets * Criminal.Cases * Age,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -979335132 -2281502 -1703938 -1302600 1173088843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.532e+05 1.815e+06 0.305 0.760
## Total.Assets 8.926e-03 1.241e-02 0.719 0.472
## Criminal.Cases 1.531e+05 7.312e+05 0.209 0.834
## Age 2.692e+04 3.753e+04 0.717 0.473
## Total.Assets:Criminal.Cases 7.233e-02 7.452e-03 9.706 < 2e-16 ***
## Total.Assets:Age 1.259e-03 2.130e-04 5.908 3.62e-09 ***
## Criminal.Cases:Age -1.291e+04 1.592e+04 -0.811 0.417
## Total.Assets:Criminal.Cases:Age -1.071e-03 1.338e-04 -8.004 1.37e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39190000 on 7900 degrees of freedom
## (60 observations deleted due to missingness)
## Multiple R-squared: 0.2852, Adjusted R-squared: 0.2845
## F-statistic: 450.2 on 7 and 7900 DF, p-value: < 2.2e-16
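In this model we add “Age” as a further interaction term alongside “Criminal.Cases.”
Based on the results,
Intercept: This is the estimated “Liabilities” when “Total.Assets,” “Criminal.Cases” and “Age” are all zero. It is about 553,000, is not significant (p = 0.760), and has no practical meaning because no candidate has an age of zero.
Total.Assets, Criminal.Cases, Age: None of the three main effects is significant (p = 0.472, 0.834 and 0.473). In a model with interactions, each one is the slope for that variable when the other two are zero.
Total.Assets:Criminal.Cases: Each additional criminal case raises the slope for “Total.Assets” by about 0.0723 when “Age” is zero.
Total.Assets:Age: Each additional year of age raises the slope for “Total.Assets” by about 0.00126 when “Criminal.Cases” is zero.
Criminal.Cases:Age: This interaction is not significant (p = 0.417).
Total.Assets:Criminal.Cases:Age: The coefficient is about -0.00107, so the “Total.Assets:Criminal.Cases” interaction becomes weaker as “Age” increases.
F-statistic: The F-statistic is 450.2 and the p-value is below 2.2e-16. R-squared rises from 0.2774 to 0.2852, so adding “Age” explains less than 1 percentage point more of the variation.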
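With a three-way interaction, the slope for “Total.Assets” depends on both “Criminal.Cases” and “Age,” which is easier to read from a small table than from the coefficients. The sketch below uses the rounded coefficients from the three-way model output above; the candidate profiles and the helper `assets_slope()` are hypothetical.

```r
# Coefficients copied from the three-way model output above
b <- c(
  assets           = 8.926e-03,   # Total.Assets
  assets_cases     = 7.233e-02,   # Total.Assets:Criminal.Cases
  assets_age       = 1.259e-03,   # Total.Assets:Age
  assets_cases_age = -1.071e-03   # Total.Assets:Criminal.Cases:Age
)

# Slope of Liabilities with respect to Total.Assets for given cases and age
assets_slope <- function(criminal_cases, age) {
  unname(
    b["assets"] +
      b["assets_cases"] * criminal_cases +
      b["assets_age"] * age +
      b["assets_cases_age"] * criminal_cases * age
  )
}

# Hypothetical candidate profiles
grid <- expand.grid(Criminal.Cases = c(0, 1, 5), Age = c(30, 50, 70))
grid$slope <- assets_slope(grid$Criminal.Cases, grid$Age)
grid
```

Only the terms that involve “Total.Assets” enter this slope; the remaining coefficients shift the predicted level of “Liabilities” but not how it changes with assets.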