library(readxl)
library(gamlss)
library(tidyverse)
library(lmtest)
library(caret)
library(car)
library(ggpubr)
data = read_xlsx("Term Limits.xlsx",sheet = 3)

data

The data obtained contains details about the participation of women in different states of the united states of America in different years, as well as more variables that might be influential to describe the female participation.

In this project, we have the suspicion that implementing term limits in politics can be beneficial for the participation of female population in politics. To test this, we are interested in building a simple linear regression where the explanatory variable is whether term limits were used in a certain year in a certain state or not; and the dependent variable is the proportion of female participation in that specific state that year.

A linear regression was selected, instead of a simple t-test, firstly because we might be interested on adding the effect of other variables, in the future, like the median income.

ggplot(data,aes(x = as.factor(Term_Limit_Dummy),y = PctWomen))+
  geom_boxplot(fill = c("red","green"))+theme_bw()+
  labs(x = "Term Limit",y = "Percentage of women")

Building a boxplot for the percentage of women in politics by the variable “term limits” produces the previous graphic, which leads us to believe that there is in fact a difference in this percentage if we control by term limits. However, this is just a visual idea. We will proceed with the correspondent linear regression to formally test this hypothesis.

1 Assumptions for an ordinary least squares linear regression

In order to properly build a linear regression, we need to be able to meet a series of assumptions, which include:

  • Normality of the residuals and mean equal to zero.

  • Independence of the residuals.

  • Homoscedasticy (constant variance).

Not meeting these assumptions, can prove to be problematic since it may lead to misleading or incorrect results about the hypothesis.

In order to test these assumptions, we first need to calculate the mentioned model, although its values will not be analyzed just yet.

1.1 Partition of the data.

First, we need to divide our data into training data, to use the exact data that will be used later on this project to compare models

set.seed(111)

indexes = createDataPartition(data$PctWomen,p = 0.8,list = F)

train = data[indexes,c(3,4)]

test = data[-indexes,c(3,4)]

For this case we selected a proportion of 80% for the training dataset and 20% for the testing part which results on two datasets with 158 and 38 records respectively.

2 Building the model.

We proceed to build two types of models, one regarding the ordinary least squares regression to explain the linear relationship that might exist between the participation of women in politics and the implementation of term limits.

2.1 Ordinary least squares linear regression.

We now proceed to build the ordinary least squares linear regression to describe the percentage of women participating in politics by the year having term limits or not.

modelOLS = lm(PctWomen~.,data = train)

summary(modelOLS)
## 
## Call:
## lm(formula = PctWomen ~ ., data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.153583 -0.058712 -0.006034  0.044960  0.174790 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.174417   0.008037  21.701  < 2e-16 ***
## Term_Limit_Dummy 0.046200   0.011990   3.853  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07497 on 156 degrees of freedom
## Multiple R-squared:  0.08691,    Adjusted R-squared:  0.08105 
## F-statistic: 14.85 on 1 and 156 DF,  p-value: 0.0001699

In this case we see that the dummy variable for the term limit is in fact significant and the regression itself has a value for \(R^2 = 0.08105\) which means that using this variable and the OLS model, around 8.105% of the variability of the proportion of women in politics is explained. Specifically, this model states that the average percentage of women in these positions, if there is not term limit, is around 17.442%. However, for those cases with term limits implemented, this average goes up by 4.62% resulting in a final 22.062% of female participation.

It is also important to mention that, for comparison purposes, the generalized Pseudo R-Squared for the OLS model was calculated, giving a value for it equal to 0.086907 which represents that approximately 8.6907% of the variability of the proportion of women in politics is explained.

3 Testing Assumptions

It is important to note that the ols regression must follow a number of assumptions in order to be trustworthy.

For that reason, we will proceed to test, normality of the residuals, whether they have constant variance, and whether they are independent from each other or not.

3.1 Normality of the residuals

qqnorm(modelOLS$residuals)
qqline(modelOLS$residuals)

Taking a look at the qqplot, we see that some of the higher and lower sample quantiles do not seem to fit correctly with the theoretical quantiles. This might be a sign of our data not being correctly described by a normal distribution.

Nevertheless, we are interested in formally proving the normality of the variable. For this, a shapiro-wilk test for normality will be performed as shown next:

shapiro.test(modelOLS$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelOLS$residuals
## W = 0.98325, p-value = 0.05273
shapiro.test(data$PctWomen)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$PctWomen
## W = 0.98238, p-value = 0.01459

The normality for the residuals seems to hold for a confidence level of 95%, since we got a \(\text{P value} = 0.05273\) which for an \(\alpha = 0.05\) does not allow to reject the normality hypothesis for the residual. However when performing the same test for the response variable we obtain a smaller p value which makes us reject normality for the percentage of women in politics.

However, we need to be careful with these results, since the p value for the residuals is extremely close to the decision threshold of 0.05.

After not rejecting normality, we want to check if the mean of the residuals is equal to zero or not.

For this, we will use a t-test, to prove the null hypothesis that the true value of the mean for the residuals is zero.

t.test(modelOLS$residuals,mu = 0)
## 
##  One Sample t-test
## 
## data:  modelOLS$residuals
## t = -9.9149e-17, df = 157, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.01174261  0.01174261
## sample estimates:
##     mean of x 
## -5.894492e-19

After performing the t-test, we obtain a very high p value, meaning that we cannot reject the hypothesis of the mean being equal to zero for the residuals of the model.

3.2 Independence of the residuals.

The next step is to see whether the residuals of the model are independent among themselves or not.

plot(1:158,modelOLS$residuals,ylab = "Residuals",xlab = "Time")

If the residuals were completely independent, the plot of time vs residuals would show no clear pattern. However, in this case it might seem as if there is a small positive linear pattern were the residuals seem to increase as time passes which might lead us to believe that there is auto-correlation among the residuals.

Nonetheless, once again it is desired to perform a formal test to prove or reject the independence hypothesis.

dwtest(modelOLS)
## 
##  Durbin-Watson test
## 
## data:  modelOLS
## DW = 0.35649, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

For the Durbin-Watson test we obtain a very small p-value, which means that we reject the hypothesis of independence for the calculated residuals for the model.

3.3 Homoscedasticy

Finally, we want to test whether the variance of the model is constant, which would be ideal, or variable.

Unfortunately, since the independence of the residuals was previously rejected, the statistical test for homoscedasticy loses a lot of power. For that reason, using a formal test might prove to be incorrect since the returned results might be wrong.

Because of this, the proper way to try and prove if the variance is constant, is by graphically observing the residuals for each value of our independent variable in order to see if there is a difference and conclude from it.

plot(train$Term_Limit_Dummy,modelOLS$residuals,
     xlab = "Term Limits",
     ylab = "OLS model residuals")

It does look like the variance for the residuals of observations where term limits were not implemented tends to be a little higher than in those observations where it was implemented. This might be a hint to believe that the variance of the residuals for this model is not constant. However, the difference does not seem to be big enough at least in a visual way to confidently say that they are indeed different. For that reason, a final judgment over the homoscedasticity of the model will not be stated. Because of this uncertainty, we will build a linear model that models the variance and the mean, and another where the variance is assumed constant in order to compare which of these scenarios gives the best results.

It seems that one of the required assumptions is not met with the construction of this model, which means that using this methodology might be misleading and incorrect to explain the percentage of women in politics by term limits.

Nevertheless, this effect might change if we were to add the effect of more variables or if we used a bigger sample size.

Anyway, if we are only interested on the effect that having term limits has on the dependent variable, it may be more accurate to describe it using a generalized linear model like gamlss (generalized additive models for location scale and shape) than using an ordinary least squares linear regression (OLS).

4 Generalized additive model for location scale and shape.

We can now continue building a generalized model using GAMLSS tools. This model will differ in the family of distributions used (normally a normal distribution is assumed); however in this case, using this generalized model, utilizing a beta distribution seems correct since the response variable is a proportion.

We also want to compare whether modeling the \(\sigma\) parameter represents and improvement in the model or not.

modelGAMLSS = gamlss(PctWomen~.,data = train,sigma.formula = ~Term_Limit_Dummy,
                     family = BE(mu.link = "identity"))
## GAMLSS-RS iteration 1: Global Deviance = -294.9347 
## GAMLSS-RS iteration 2: Global Deviance = -373.0989 
## GAMLSS-RS iteration 3: Global Deviance = -376.644 
## GAMLSS-RS iteration 4: Global Deviance = -376.6752 
## GAMLSS-RS iteration 5: Global Deviance = -376.6754
summary(modelGAMLSS)
## ******************************************************************
## Family:  c("BE", "Beta") 
## 
## Call:  gamlss(formula = PctWomen ~ ., sigma.formula = ~Term_Limit_Dummy,  
##     family = BE(mu.link = "identity"), data = train) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  identity
## Mu Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.173836   0.008999  19.316  < 2e-16 ***
## Term_Limit_Dummy 0.046631   0.012108   3.851 0.000172 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.25592    0.09183 -13.676  < 2e-16 ***
## Term_Limit_Dummy -0.36793    0.13348  -2.756  0.00655 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  158 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  154 
##                       at cycle:  5 
##  
## Global Deviance:     -376.6754 
##             AIC:     -368.6754 
##             SBC:     -356.425 
## ******************************************************************
Rsq(modelGAMLSS)
## [1] 0.1478629

After building this model, we see that the average value for the percentage of women in politics if term limits are not implemented is around 17.384%, while the average value for those periods implementing term limits is around 22.047% of female participation in these matters. It is also important to mention that the generalized pseudo R-squared for this model is equal to 0.147863, which means approximately 14.786% of the variance is explained by the term limit under these conditions. And significance at a level of \(\alpha = 0.001\) is fulfilled for the term limits when modeling the mu parameter and at a level of \(\alpha = 0.01\) for the same variable when modeling the sigma parameter; this leads us to conclude that the implementation of term limits does make a difference in the percentage of women participating in politics and its variance.

modelGAMLSSNoVar = gamlss(PctWomen~.,data = train,
                     family = BE(mu.link = "identity"))
## GAMLSS-RS iteration 1: Global Deviance = -289.4829 
## GAMLSS-RS iteration 2: Global Deviance = -365.71 
## GAMLSS-RS iteration 3: Global Deviance = -369.33 
## GAMLSS-RS iteration 4: Global Deviance = -369.3508 
## GAMLSS-RS iteration 5: Global Deviance = -369.3509
summary(modelGAMLSSNoVar)
## ******************************************************************
## Family:  c("BE", "Beta") 
## 
## Call:  gamlss(formula = PctWomen ~ ., family = BE(mu.link = "identity"),  
##     data = train) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  identity
## Mu Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.170370   0.007875  21.633  < 2e-16 ***
## Term_Limit_Dummy 0.053721   0.012354   4.349 2.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.39678    0.06677  -20.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  158 
## Degrees of Freedom for the fit:  3
##       Residual Deg. of Freedom:  155 
##                       at cycle:  5 
##  
## Global Deviance:     -369.3509 
##             AIC:     -363.3509 
##             SBC:     -354.1631 
## ******************************************************************
Rsq(modelGAMLSSNoVar)
## [1] 0.1074296

Finally for the case in which \(\sigma\) is not modeled, we see that the average value for the proportion of female representation when term limits are not implemented is around 17.037%. On the other hand, this model implies that the implementation of term limits increases this proportion by 5.372% resulting in an average value of 22.409%. In addition, as we would expect, this model also demonstrates significance for the used independent variable.

5 Models comparison

The next step in our analysis is to compare which of the built models is more correct when trying to describe the percentage of female participation in politics. For that, we want to compare three key metrics: R-squared, Akaike’s information criterion (AIC) and RMSE with both the training and the testing datasets.

R2 = data.frame(model = c("OLS","GAMLSS (sigma modeled)","GAMLSS (sigma not modeled)"),Pseudo_R_Squared = c(Rsq(modelOLS2),Rsq(modelGAMLSS),Rsq(modelGAMLSSNoVar))) 

AIC = data.frame(model = c("OLS","GAMLSS (sigma modeled)","GAMLSS (sigma not modeled)"),AIC = c(AIC(modelOLS),AIC(modelGAMLSS),AIC(modelGAMLSSNoVar))) 

RMSE = data.frame(model = c("OLS","OLS","GAMLSS (sigma modeled)","GAMLSS (sigma modeled)","GAMLSS (sigma not modeled)","GAMLSS (sigma not modeled)"),dataset = c("train","test","train","test","train","test"),
                  RMSE = c(RMSE(predict(modelOLS,newdata = train),train$PctWomen),RMSE(predict(modelOLS,newdata = test),test$PctWomen),
                           RMSE(predict(modelGAMLSS,newdata = train),train$PctWomen),RMSE(predict(modelGAMLSS,newdata = test),test$PctWomen),RMSE(predict(modelGAMLSSNoVar,newdata = train),train$PctWomen),RMSE(predict(modelGAMLSSNoVar,newdata = test),test$PctWomen))) 
ggplot(R2,aes(model,Pseudo_R_Squared,label = round(Pseudo_R_Squared,5)))+
  geom_point(size = 6,col = "lightblue")+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = " Psuedo R-squared by model type (higher is better)")+
  geom_text(col = "black")

ggplot(AIC,aes(model,AIC,label = round(AIC,5)))+
  geom_point(size = 6,col = "lightblue")+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "Akaike's information criterion by model type (lower is better)")+
  geom_text(col = "black")

ggplot(RMSE,aes(model,RMSE,col = dataset,label = round(RMSE,5)))+
  geom_point(size = 6)+
  theme_bw()+
  labs(title = "RMSE by model type by dataset (lower is better)")+
  geom_text(col = "black")

These three graphics show the all-around better performance coming from the gamlss model using the beta distribution. This is clear since this model has a value for the \(R^2\) at 0.14786 higher than the 0.08105. In addition, the AIC for the generalized model is lower than the one produced by the OLS model trend that is repeated when comparing the RMSE for each model.

This, although minimal, performance improvement in addition to the fact that one of the assumptions for the OLS model was not met; takes us to conclude that the model of choice for this case and problem is the generalized model using the beta distribution.

The problem arises when we realize that the gamlss model that modeled the sigma had better values in two of our metrics, but the model without this variable sigma obtained a lower and better RMSE for the testing dataset.

However, this last difference might be due completely to randomness matters, where maybe the randomly selected testing dataset used, favored this specific model. Either way, we believe it is more accurate to preserve the model in which sigma is modeled.

6 Removing outliers or very influential points from the model.

Although we already selected the generalized model as our final model, we want to use the OLS model in order to calculate those very influential points that might be producing incorrect results in our final model.

influenceIndexPlot(modelOLS)

outlierTest(modelOLS)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferroni p
## 37 2.379856           0.018535           NA

We can see two results from the previous plot and formal test; first, the influence plot shows the two most influential points on the model (number 37 and 101), observations that could be incorrectly influencing a lot on the final model.

It is also noticeable how observation 37 is also an outlier according to the formal outlier test performed.

Removing these two observations can be highly desired to improve the performance of our model.

modelGAMLSS_no_inf = gamlss(PctWomen~.,data = train[-c(37,101),],sigma.formula = ~Term_Limit_Dummy, family = BE(mu.link = "identity"))
## GAMLSS-RS iteration 1: Global Deviance = -294.4576 
## GAMLSS-RS iteration 2: Global Deviance = -374.5635 
## GAMLSS-RS iteration 3: Global Deviance = -378.4484 
## GAMLSS-RS iteration 4: Global Deviance = -378.4846 
## GAMLSS-RS iteration 5: Global Deviance = -378.4849

After making this small tweaking, we want to proceed to compare the same metrics as before but between the two gamlss models, the one with the influential points included and the one where they were removed (all this using the model with the modeled \(\sigma\)).

R2_gamlss = data.frame(model = c("GAMLSS","GAMLSS no influence"),
                       R_Squared = c(Rsq(modelGAMLSS),Rsq(modelGAMLSS_no_inf)))

AIC_gamlss = data.frame(model = c("GAMLSS","GAMLSS no influence"),
                       AIC= c(AIC(modelGAMLSS),AIC(modelGAMLSS_no_inf)))

RMSE_gamlss = data.frame(model = c("GAMLSS","GAMLSS","GAMLSS no influence","GAMLSS no influence"), dataset = c("train","test","train","test"),
                       RMSE= c(RMSE(predict(modelGAMLSS,newdata = train),train$PctWomen),RMSE(predict(modelGAMLSS,newdata = test),test$PctWomen),RMSE(predict(modelGAMLSS_no_inf,newdata = train),train$PctWomen),RMSE(predict(modelGAMLSS_no_inf,newdata = test),test$PctWomen)))
ggplot(R2_gamlss,aes(x = model,y = R_Squared,label = round(R_Squared,5)))+
  geom_point(size = 6,col = "lightblue")+
  geom_text(col = "black")+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "R-Squared by model type (Higher is better)")

 ggplot(AIC_gamlss,aes(x = model,y = AIC,label = round(AIC,5)))+
  geom_point(size = 6,col = "lightblue")+
  geom_text(col = "black")+
  theme_bw()+
  theme(legend.position = "none")+
  labs(title = "Akaike's information criterion by model type (lower is better)"
       )

 ggplot(RMSE_gamlss,aes(x = model,y = RMSE,col = dataset,label = round(RMSE,5)))+
  geom_point(size = 6)+
  geom_text(col = "black")+
  theme_bw()+
  labs(title = "RMSE by model type (lower is better)")

In all of this metrics it is clear that the gamlss model that removes the influential points from the original model, is even better than the first gamlss model created.

The only metric where the original gamlss model performed better, was the RMSE for the training set, which is to be expected taking into account that this model was built with all the observations of this dataset while the final model was not.

Under this reasoning, it is recommended and expected to select the gamlss model without the influential points and modeling the \(\sigma\), as the final predicting type of model for the proportion of female participation.

summary(modelGAMLSS_no_inf)
## Warning in summary.gamlss(modelGAMLSS_no_inf): summary: vcov has failed, option qr is used instead
## ******************************************************************
## Family:  c("BE", "Beta") 
## 
## Call:  gamlss(formula = PctWomen ~ ., sigma.formula = ~Term_Limit_Dummy,  
##     family = BE(mu.link = "identity"), data = train[-c(37, 101),          ]) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  identity
## Mu Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.169763   0.008433  20.132  < 2e-16 ***
## Term_Limit_Dummy 0.050704   0.011621   4.363 2.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.29306    0.08899 -14.531   <2e-16 ***
## Term_Limit_Dummy -0.33080    0.13062  -2.532   0.0123 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  156 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  152 
##                       at cycle:  5 
##  
## Global Deviance:     -378.4849 
##             AIC:     -370.4849 
##             SBC:     -358.2854 
## ******************************************************************

We see that the average value for the proportion of women in politics when not implementing term limits is approximately 16.976%. and including the term limits the proportion goes up by 5.0704% resulting in an average value for this case of 22.0467%.

7 Final model

Having realized the the best model option for our case (gamlss, modeling sigma and without influential observations).

It is important to note that in the training dataset the two most influential observations were number 37 and 101. However for the full data, these observations have different indexes, 45 and 123 but represent the same values.

final_model = gamlss(PctWomen~Term_Limit_Dummy,data = data[-c(45,123),],
                     sigma.formula = ~Term_Limit_Dummy,family = BE(mu.link = "identity"))
## GAMLSS-RS iteration 1: Global Deviance = -370.77 
## GAMLSS-RS iteration 2: Global Deviance = -473.6549 
## GAMLSS-RS iteration 3: Global Deviance = -478.6986 
## GAMLSS-RS iteration 4: Global Deviance = -478.7502 
## GAMLSS-RS iteration 5: Global Deviance = -478.7506
summary(final_model)
## Warning in summary.gamlss(final_model): summary: vcov has failed, option qr is used instead
## ******************************************************************
## Family:  c("BE", "Beta") 
## 
## Call:  
## gamlss(formula = PctWomen ~ Term_Limit_Dummy, sigma.formula = ~Term_Limit_Dummy,  
##     family = BE(mu.link = "identity"), data = data[-c(45, 123),          ]) 
## 
## Fitting method: RS() 
## 
## ------------------------------------------------------------------
## Mu link function:  identity
## Mu Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.166085   0.007537  22.035  < 2e-16 ***
## Term_Limit_Dummy 0.059022   0.010224   5.773 3.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## Sigma link function:  logit
## Sigma Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.30140    0.08072 -16.122  < 2e-16 ***
## Term_Limit_Dummy -0.35948    0.11667  -3.081  0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ------------------------------------------------------------------
## No. of observations in the fit:  194 
## Degrees of Freedom for the fit:  4
##       Residual Deg. of Freedom:  190 
##                       at cycle:  5 
##  
## Global Deviance:     -478.7506 
##             AIC:     -470.7506 
##             SBC:     -457.6791 
## ******************************************************************
AIC(final_model)
## [1] -470.7506
Rsq(final_model)
## [1] 0.201309

The first thing we see is how, as we expected, both the AIC value and the R-squared value become way better when including the entire dataset, with the first of these going from -370.48 all the way down to -470.751 which represents a great improvement for the model. And with the pseudo R-squared going from 0.157 to a much better 0.20131 which means that now that all the observations (minus the most influential ones) were included, we can explain approximately 20.13% of the variance of the percentage of women participating in politics by using whether a term limit is implemented or not.

Finally, from this model we can more faithfully describe the effect of term limits in the proportion of female in politics. Specifically, we see that the average percentage of female participation in political environments when no term limits are implemented is around 16.6085%, while if term limits are implemented the proportion goes up, in average, by 5.9022% resulting in an average total of 22.5107%. In addition, the term limit is significant for both the \(\mu\) parameter and the \(\sigma\) parameter for \(\alpha = 0.001\) and \(\alpha = 0.01\) respectively which is a clear proof of how term limits do influence the percentage of women participating in politics in a positive way.

8 Conclusions

  • As we suspected, there is a notorious difference in the female participation in politics when term limits are not implemented and when they are implemented. Specifically, we saw how the use of a term limit, can increase the possibilities of women to participate in a political environment.

  • Despite producing not very different results, using generalized models to describe variables has shown to be more effective, in this case, than using an ordinary least squares regression, returning metrics that translate in a better accuracy when predicting and describing the desired variable.

  • Even though it might be appealing to use the OLS model for its simplicity and good performance (still worse than the gamlss model though), the use of this model is technically incorrect since the required assumptions were not met.

  • In this case it was desired to find specifically the relation between term limits and the mentioned women participation. However, if we were interested in building as good of a model as possible to predict, leaving the interpretation aside, it would be recommended to use a more robust regression model like a random forest including other important variables.

9 References

  • Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1-46.

  • Montgomery, D. C., Peck, E. A., & Vining, G. G. (2021). Introduction to linear regression analysis. John Wiley & Sons.