Due: August 7, 2024
Submit both the .Rmd (R markdown file) and R Pubs published version (with link provided). Copy/paste all the tasks as text and provide the relevant code and text answer when prompted.
First, let’s set up our environment by loading the necessary packages:
Using the Civil dataset, perform a simple linear regression with public_sector_corruption as the dependent variable and polyarchy as the independent variable. Visualize the relationship with a scatter plot and overlay the regression line. Use the sjPlot package to create regression tables and interpret the results.
load("Civil.RData")
Civil <- corruption
# Display the first few rows and structure of the dataset
head(Civil)
## country_name country_text_id year region
## 1 Mexico MEX 2020 Latin America and the Caribbean
## 2 Suriname SUR 2020 Latin America and the Caribbean
## 3 Sweden SWE 2020 Western Europe and North America
## 4 Switzerland CHE 2020 Western Europe and North America
## 5 Ghana GHA 2020 Sub-Saharan Africa
## 6 South Africa ZAF 2020 Sub-Saharan Africa
## disclose_donations_ord public_sector_corruption polyarchy civil_liberties
## 1 3 48.8 64.7 71.2
## 2 1 24.8 76.1 87.7
## 3 2 1.3 90.8 96.9
## 4 0 1.4 89.4 94.8
## 5 2 65.2 72.0 90.4
## 6 1 57.1 70.3 82.2
## disclose_donations iso2c population gdp_percapita capital longitude
## 1 TRUE MX 128932753 8922.612 Mexico City -99.1276
## 2 FALSE SR 586634 7529.614 Paramaribo -55.1679
## 3 FALSE SE 10353442 51541.656 Stockholm 18.0645
## 4 FALSE CH 8636561 85685.290 Bern 7.44821
## 5 FALSE GH 31072945 2020.624 Accra -0.20795
## 6 FALSE ZA 59308690 5659.207 Pretoria 28.1871
## latitude income log_gdp_percapita
## 1 19.427 Upper middle income 9.096344
## 2 5.8232 Upper middle income 8.926599
## 3 59.3327 High income 10.850146
## 4 46.948 High income 11.358436
## 5 5.57045 Lower middle income 7.611162
## 6 -25.746 Upper middle income 8.641039
str(Civil)
## 'data.frame': 168 obs. of 17 variables:
## $ country_name : chr "Mexico" "Suriname" "Sweden" "Switzerland" ...
## $ country_text_id : chr "MEX" "SUR" "SWE" "CHE" ...
## $ year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ region : Factor w/ 6 levels "Eastern Europe and Central Asia",..: 2 2 5 5 4 4 6 6 1 1 ...
## $ disclose_donations_ord : num 3 1 2 0 2 1 3 2 3 2 ...
## $ public_sector_corruption: num 48.8 24.8 1.3 1.4 65.2 57.1 3.7 36.8 70.6 71.2 ...
## $ polyarchy : num 64.7 76.1 90.8 89.4 72 70.3 83.2 43.6 26.2 48.5 ...
## $ civil_liberties : num 71.2 87.7 96.9 94.8 90.4 82.2 92.8 56.9 43 85.5 ...
## $ disclose_donations : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
## $ iso2c : chr "MX" "SR" "SE" "CH" ...
## $ population : num 1.29e+08 5.87e+05 1.04e+07 8.64e+06 3.11e+07 ...
## ..- attr(*, "label")= chr "Population, total"
## $ gdp_percapita : num 8923 7530 51542 85685 2021 ...
## ..- attr(*, "label")= chr "GDP per capita (constant 2015 US$)"
## $ capital : chr "Mexico City" "Paramaribo" "Stockholm" "Bern" ...
## $ longitude : chr "-99.1276" "-55.1679" "18.0645" "7.44821" ...
## $ latitude : chr "19.427" "5.8232" "59.3327" "46.948" ...
## $ income : chr "Upper middle income" "Upper middle income" "High income" "High income" ...
## $ log_gdp_percapita : num 9.1 8.93 10.85 11.36 7.61 ...
## ..- attr(*, "label")= chr "GDP per capita (constant 2015 US$)"
model <- lm(public_sector_corruption ~ polyarchy, data = Civil)
summary(model)
##
## Call:
## lm(formula = public_sector_corruption ~ polyarchy, data = Civil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.498 -14.334 1.448 16.985 44.436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.44444 3.95373 22.62 <2e-16 ***
## polyarchy -0.82641 0.06786 -12.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.68 on 166 degrees of freedom
## Multiple R-squared: 0.4718, Adjusted R-squared: 0.4686
## F-statistic: 148.3 on 1 and 166 DF, p-value: < 2.2e-16
tab_model(model, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
| public sector corruption | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 89.44 | 3.95 | 22.62 | <0.001 |
| polyarchy | -0.83 | 0.07 | -12.18 | <0.001 |
| Observations | 168 | |||
| R2 / R2 adjusted | 0.472 / 0.469 | |||
plot_corruption <- corruption %>%
mutate(highlight = polyarchy == min( polyarchy) | polyarchy == max( polyarchy))
ggplot(plot_corruption, aes(x = polyarchy, y = public_sector_corruption)) +
geom_point(aes(color = highlight)) +
stat_smooth(method = "lm", formula = y ~ x, linewidth = 1, color = "blue") +
geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
aes(label = country_name), seed = 1234) +
scale_color_manual(values = c("grey30", "red"), guide = "none") +
labs(x = "polyarchy", y = "Public Sector Corruption Index") +
theme_minimal()
Intercept: The intercept is 89.44444, it indicates that the average public_sector_corruption will be 89.44444 when the independent variable is zero. polyarchy: The coefficient on polyarchy is -0.82641 which indicates that additional one unit increasing in polyarchy is associated with decreasing in public_sector_corruption by 0.82641 units on average holding other factors constant. The effect of polyarchy on public_sector_corruption is statistically significant at 5% level since p value is 0.000 which is less than 0.05. R squared is 0.4718 which indicates that 47.18% of the variation in public_sector_corruption can be explained by the model. Adjusted R-squared is 0.4686 which indicates that 46.86% of the variation in public_sector_corruption can be expalined by the model when considering the number of independent variables.
Extend the model from Task 1 by adding a quadratic term for polyarchy to capture potential non-linear relationships. Visualize the polynomial relationship using ggplot2. Calculate the marginal effects of polyarchy at different levels (30, 60, 90) using both manual calculations and the marginaleffects package. Interpret the results.
model_sq <- lm(public_sector_corruption ~ polyarchy + I(polyarchy^2), data = Civil)
summary(model_sq)
##
## Call:
## lm(formula = public_sector_corruption ~ polyarchy + I(polyarchy^2),
## data = Civil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.462 -7.475 1.418 14.107 35.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.116104 7.019792 7.567 2.55e-12 ***
## polyarchy 0.974653 0.305335 3.192 0.00169 **
## I(polyarchy^2) -0.017310 0.002874 -6.023 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.69 on 165 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5618
## F-statistic: 108 on 2 and 165 DF, p-value: < 2.2e-16
tab_model(model_sq, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
| public sector corruption | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 53.12 | 7.02 | 7.57 | <0.001 |
| polyarchy | 0.97 | 0.31 | 3.19 | 0.002 |
| polyarchy^2 | -0.02 | 0.00 | -6.02 | <0.001 |
| Observations | 168 | |||
| R2 / R2 adjusted | 0.567 / 0.562 | |||
# Visualizing the polynomial relationship
ggplot(plot_corruption, aes(x = polyarchy, y = public_sector_corruption)) +
geom_point(aes(color = highlight)) +
stat_smooth(method = "lm", formula = y ~ x + I(x^2), linewidth = 1, color = "blue") +
geom_label_repel(data = filter(plot_corruption, highlight == TRUE),
aes(label = country_name), seed = 1234) +
scale_color_manual(values = c("grey30", "red"), guide = "none") +
labs(x = "polyarchy", y = "Public Sector Corruption Index") +
theme_minimal()
# Extract coefficients
civ_pol1 <- coef(model_sq)["polyarchy"]
civ_pol2 <- coef(model_sq)["I(polyarchy^2)"]
# Function to calculate slope
civ_pol_slope <- function(x) civ_pol1 + (2 * civ_pol2 * x)
# Calculate slopes at different levels of civil liberties
civ_pol_slope(c(30, 60, 90))
## [1] -0.06392759 -1.10250800 -2.14108840
Interpretation:
At a polyarchy value of 30, the marginal effect is -0.064. This means that around this level of polyarchy, increasingpolyarchy slightly is associated with an decreasing in public sector corruption.
At a polyarchy value of 60, the marginal effect is -1.103. Here, an increase in polyarch is associated with a decrease in public sector corruption.
At a polyarchy value of 90, the marginal effect is -2.141, indicating a stronger negative relationship at higher levels of polyarchy.
Instead of manually computing slopes, we can use packages to do this automatically.
# Using marginaleffects package
model_sq %>%
slopes(newdata = datagrid(polyarchy = c(30, 60, 90)), eps = 0.001)
##
## Term polyarchy Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## polyarchy 30 -0.0639 0.1402 -0.456 0.648 0.6 -0.339 0.211
## polyarchy 60 -1.1025 0.0768 -14.354 <0.001 152.8 -1.253 -0.952
## polyarchy 90 -2.1411 0.2263 -9.461 <0.001 68.2 -2.585 -1.698
##
## Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, polyarchy, predicted_lo, predicted_hi, predicted, public_sector_corruption
## Type: response
# Using emmeans package
model_sq %>%
emtrends(~ polyarchy, var = "polyarchy", at = list(polyarchy = c(30, 60, 90)), delta.var = 0.001) %>%
test()
## polyarchy polyarchy.trend SE df t.ratio p.value
## 30 -0.0639 0.1408 165 -0.454 0.6503
## 60 -1.1025 0.0768 165 -14.353 <.0001
## 90 -2.1411 0.2268 165 -9.439 <.0001
Using the Civil dataset, fit a logistic regression model predicting the presence of campaign finance disclosure laws (disclose_donations) with public_sector_corruption and log_gdp_percapita as predictors. Use the sjPlot package to create regression tables and interpret the results.
model_logit_fancy <- glm(
disclose_donations ~ public_sector_corruption + log_gdp_percapita,
family = binomial(link = "logit"),
data = Civil
)
summary(model_logit_fancy)
##
## Call:
## glm(formula = disclose_donations ~ public_sector_corruption +
## log_gdp_percapita, family = binomial(link = "logit"), data = Civil)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1642 -0.4770 -0.2445 0.5413 2.3785
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.50466 2.18953 -0.230 0.818
## public_sector_corruption -0.05964 0.01191 -5.007 5.54e-07 ***
## log_gdp_percapita 0.24907 0.21785 1.143 0.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.79 on 167 degrees of freedom
## Residual deviance: 131.30 on 165 degrees of freedom
## AIC: 137.3
##
## Number of Fisher Scoring iterations: 5
tab_model(model_logit_fancy, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
| disclose donations | ||||
|---|---|---|---|---|
| Predictors | Odds Ratios | std. Error | Statistic | p |
| (Intercept) | 0.60 | 1.32 | -0.23 | 0.818 |
| public sector corruption | 0.94 | 0.01 | -5.01 | <0.001 |
|
GDP per capita(constant 2015 US$) |
1.28 | 0.28 | 1.14 | 0.253 |
| Observations | 168 | |||
| R2 Tjur | 0.454 | |||
The intercept is -0.50466 which indicates that the log odds of the presence of campaign finance disclosure laws will be -0.50466 when all the independnet variables are zero. public_sector_corruption: The coefficient on public_sector_corruption is -0.05964 which indicates that additional one unit increasing in public_sector_corruption is associated with decreasing in log odds of the presence of campaign finance disclosure laws by 0.05964 on average holding ohter factors constant. The effect of public_sector_corruption is statistically significant at 5% level. log_gdp_percapita: the coefficient on log_gdp_percapita is 0.24907 which indicates that additional 1% increasing in gdp per capita is associated with increasing in odds of the presence of campaign finance disclosure laws 0.24907% on average holding other factors constant. The effect of log_gdp_percapita is not statistically significant at 5% level.
Calculate the marginal effects of public_sector_corruption from the logistic regression model in Task 3 at representative values (20, 50, 80). Use the marginaleffects and emmeans packages to compute these effects. Visualize the predicted probabilities of having campaign finance disclosure laws across a range of public_sector_corruption values using ggplot2.
# Creating the data grid using marginaleffects
library(marginaleffects)
data_grid_me <- datagrid(model = model_logit_fancy,
public_sector_corruption = c(20, 50,80))
print(data_grid_me)
## log_gdp_percapita public_sector_corruption rowid
## 1 8.567353 20 1
## 2 8.567353 50 2
## 3 8.567353 80 3
library(emmeans)
ref_grid_emmeans <- ref_grid(model_logit_fancy,
at = list(public_sector_corruption = c(20,50, 80)
))
print(ref_grid_emmeans)
## public_sector_corruption log_gdp_percapita prediction SE df
## 20 8.57 0.436 0.300 Inf
## 50 8.57 -1.353 0.279 Inf
## 80 8.57 -3.142 0.567 Inf
##
## Results are given on the logit (not the response) scale.
# Generating fitted values using marginaleffects
fitted_values_me <- model_logit_fancy |>
slopes(variables = "public_sector_corruption",
newdata = datagrid(public_sector_corruption = c(20, 50,80)
))
print(fitted_values_me)
##
## Term public_sector_corruption Estimate Std. Error z
## public_sector_corruption 20 -0.01422 0.002347 -6.06
## public_sector_corruption 50 -0.00973 0.001653 -5.89
## public_sector_corruption 80 -0.00237 0.000819 -2.89
## Pr(>|z|) S 2.5 % 97.5 %
## < 0.001 29.4 -0.01882 -0.009621
## < 0.001 27.9 -0.01297 -0.006495
## 0.00386 8.0 -0.00397 -0.000762
##
## Columns: rowid, term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, public_sector_corruption, predicted_lo, predicted_hi, predicted, log_gdp_percapita, disclose_donations
## Type: response
# Generating predictions for a range of corruption values
logit_predictions <- model_logit_fancy |>
emmeans(~ public_sector_corruption+ log_gdp_percapita, var = "public_sector_corruption",
at = list(public_sector_corruption = seq(0, 90, 1)),
regrid = "response") |>
as_tibble()
#Visualize the predicted probabilities of having campaign finance disclosure laws across a range of public_sector_corruption values using ggplot2
ggplot(logit_predictions, aes(x = public_sector_corruption, y = prob)) +
geom_line(linewidth = 1) +
labs(x = "Public sector corruption", y = "Predicted probability of having\na campaign finance disclosure law", color = NULL) +
theme_minimal() +
theme(legend.position = "bottom")
Explore the interaction effect between public_sector_corruption and region in the logistic regression model from Task 3. Use the datagrid() function from the marginaleffects package to create a dataset with representative values for regions. Fit the logistic regression model with the interaction term and visualize the interaction effects using ggplot2. Interpret the results and discuss the implications of the interaction effect.
model_logit_interact <- glm(
disclose_donations ~ public_sector_corruption + log_gdp_percapita+public_sector_corruption * region,
family = binomial(link = "logit"),
data = Civil
)
summary(model_logit_interact )
##
## Call:
## glm(formula = disclose_donations ~ public_sector_corruption +
## log_gdp_percapita + public_sector_corruption * region, family = binomial(link = "logit"),
## data = Civil)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1469 -0.5134 -0.1072 0.4771 2.6619
##
## Coefficients:
## Estimate
## (Intercept) 3.21658
## public_sector_corruption -0.06335
## log_gdp_percapita 0.01539
## regionLatin America and the Caribbean -2.47593
## regionMiddle East and North Africa -0.65585
## regionSub-Saharan Africa -1.61845
## regionWestern Europe and North America -1.05205
## regionAsia and Pacific -0.92044
## public_sector_corruption:regionLatin America and the Caribbean 0.02499
## public_sector_corruption:regionMiddle East and North Africa -0.05436
## public_sector_corruption:regionSub-Saharan Africa -0.02279
## public_sector_corruption:regionWestern Europe and North America -0.03829
## public_sector_corruption:regionAsia and Pacific -0.03145
## Std. Error
## (Intercept) 3.52849
## public_sector_corruption 0.02299
## log_gdp_percapita 0.34500
## regionLatin America and the Caribbean 1.53294
## regionMiddle East and North Africa 2.36552
## regionSub-Saharan Africa 1.79649
## regionWestern Europe and North America 1.57290
## regionAsia and Pacific 1.85141
## public_sector_corruption:regionLatin America and the Caribbean 0.03045
## public_sector_corruption:regionMiddle East and North Africa 0.06720
## public_sector_corruption:regionSub-Saharan Africa 0.03978
## public_sector_corruption:regionWestern Europe and North America 0.07872
## public_sector_corruption:regionAsia and Pacific 0.04645
## z value
## (Intercept) 0.912
## public_sector_corruption -2.755
## log_gdp_percapita 0.045
## regionLatin America and the Caribbean -1.615
## regionMiddle East and North Africa -0.277
## regionSub-Saharan Africa -0.901
## regionWestern Europe and North America -0.669
## regionAsia and Pacific -0.497
## public_sector_corruption:regionLatin America and the Caribbean 0.821
## public_sector_corruption:regionMiddle East and North Africa -0.809
## public_sector_corruption:regionSub-Saharan Africa -0.573
## public_sector_corruption:regionWestern Europe and North America -0.486
## public_sector_corruption:regionAsia and Pacific -0.677
## Pr(>|z|)
## (Intercept) 0.36198
## public_sector_corruption 0.00586 **
## log_gdp_percapita 0.96442
## regionLatin America and the Caribbean 0.10628
## regionMiddle East and North Africa 0.78158
## regionSub-Saharan Africa 0.36764
## regionWestern Europe and North America 0.50358
## regionAsia and Pacific 0.61908
## public_sector_corruption:regionLatin America and the Caribbean 0.41188
## public_sector_corruption:regionMiddle East and North Africa 0.41851
## public_sector_corruption:regionSub-Saharan Africa 0.56671
## public_sector_corruption:regionWestern Europe and North America 0.62666
## public_sector_corruption:regionAsia and Pacific 0.49833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 217.79 on 167 degrees of freedom
## Residual deviance: 114.37 on 155 degrees of freedom
## AIC: 140.37
##
## Number of Fisher Scoring iterations: 7
tab_model(model_logit_interact , show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
| disclose donations | ||||
|---|---|---|---|---|
| Predictors | Odds Ratios | std. Error | Statistic | p |
| (Intercept) | 24.94 | 88.01 | 0.91 | 0.362 |
| public sector corruption | 0.94 | 0.02 | -2.76 | 0.006 |
|
GDP per capita(constant 2015 US$) |
1.02 | 0.35 | 0.04 | 0.964 |
|
region: Latin America and the Caribbean |
0.08 | 0.13 | -1.62 | 0.106 |
|
region: Middle East and North Africa |
0.52 | 1.23 | -0.28 | 0.782 |
|
region: Sub-Saharan Africa |
0.20 | 0.36 | -0.90 | 0.368 |
|
region: Western Europe and North America |
0.35 | 0.55 | -0.67 | 0.504 |
| region: Asia and Pacific | 0.40 | 0.74 | -0.50 | 0.619 |
| public_sector_corruption:regionLatin America and the Caribbean | 1.03 | 0.03 | 0.82 | 0.412 |
| public_sector_corruption:regionMiddle East and North Africa | 0.95 | 0.06 | -0.81 | 0.419 |
| public_sector_corruption:regionSub-Saharan Africa | 0.98 | 0.04 | -0.57 | 0.567 |
| public_sector_corruption:regionWestern Europe and North America | 0.96 | 0.08 | -0.49 | 0.627 |
| public_sector_corruption:regionAsia and Pacific | 0.97 | 0.05 | -0.68 | 0.498 |
| Observations | 168 | |||
| R2 Tjur | 0.521 | |||
regions_to_use <- c("Western Europe and North America",
"Latin America and the Caribbean",
"Middle East and North Africa")
data_grid_modelr <- data_grid(Civil,
public_sector_corruption = c(20, 50, 80),
region = regions_to_use,
.model = model_logit_interact)
print(data_grid_modelr)
## # A tibble: 9 x 3
## public_sector_corruption region log_gdp_percapita
## <dbl> <chr> <dbl>
## 1 20 Latin America and the Caribbean 8.52
## 2 20 Middle East and North Africa 8.52
## 3 20 Western Europe and North America 8.52
## 4 50 Latin America and the Caribbean 8.52
## 5 50 Middle East and North Africa 8.52
## 6 50 Western Europe and North America 8.52
## 7 80 Latin America and the Caribbean 8.52
## 8 80 Middle East and North Africa 8.52
## 9 80 Western Europe and North America 8.52
# Generating predictions for a range of corruption values
logit_predictions1 <- model_logit_interact |>
emmeans(~ public_sector_corruption + log_gdp_percapita+public_sector_corruption * region, var = "public_sector_corruption",
at = list(public_sector_corruption = seq(0, 90, 1)),
regrid = "response") |>
as_tibble()
#visualize the interaction effects using ggplot2
ggplot(logit_predictions1, aes(x = public_sector_corruption, y = prob, color = region)) +
geom_line(linewidth = 1) +
labs(x = "Public sector corruption", y = "Predicted probability of having\na campaign finance disclosure law", color = NULL) +
theme_minimal() +
theme(legend.position = "bottom")
The intercept is 3.21658 which indicates that the log odds of the presence of campaign finance disclosure laws will be -0.50466 when all the independnet variables are zero. public_sector_corruption: The coefficient on public_sector_corruption is -0.06335 which indicates that additional one unit increasing in public_sector_corruption is associated with decreasing in log odds of the presence of campaign finance disclosure laws by 0.06335 on average holding ohter factors constant. The effect of public_sector_corruption is statistically significant at 5% level. log_gdp_percapita: the coefficient on log_gdp_percapita is 0.01539 which indicates that additional 1% increasing in gdp per capita is associated with increasing in odds of the presence of campaign finance disclosure laws 0.01539% on average holding other factors constant. The effect of log_gdp_percapita is not statistically significant at 5% level. Each of the variables of the dummy variables for region is not statistically significant at 5% level since p value for each of them is greater than 0.05.. Each of the interaction term between region and public_sector_corruption is not statistically significant at 5% level since p value for each of them is greater than 0.05.