Beta Regression

April 1, 2025

Import Data

library(readr) # Loading CSV Files
WDI <- read_csv("data.csv")

Brief Overview of the Data

The data set I will use to fit my beta regression model is the World Bank’s World Development Indicators (2020) data set. It contains information on a range of economic, social, and environmental indicators for over 100 countries. This data set is useful because it lets you track how countries are progressing in areas like economic growth, education, health, poverty, and environmental sustainability.

Cleaning Data

Creating a new data set with variables of interest

library(dplyr) #Data Cleaning
WDI2 <- WDI %>% select(country, year, SP.DYN.TO65.MA.ZS, SH.PRV.SMOK.MA, SH.ALC.PCAP.MA.LI) 

Renaming Variables

WDI2 <- WDI2 %>% rename("Life_Expectancy_65" = "SP.DYN.TO65.MA.ZS", "Smoking_Prevalence" = "SH.PRV.SMOK.MA", "Alcohol_Consumption" = "SH.ALC.PCAP.MA.LI")

View structure of the data

glimpse(WDI2)
## Rows: 266
## Columns: 5
## $ country             <chr> "Afghanistan", "Africa Eastern and Southern", "Afr…
## $ year                <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ Life_Expectancy_65  <dbl> 50.41502, 55.22468, 49.91460, 81.03560, 78.90971, …
## $ Smoking_Prevalence  <dbl> 39.40000, 20.25884, 10.94765, 38.80000, 41.30000, …
## $ Alcohol_Consumption <dbl> 0.020000, 7.054164, 6.744757, 7.370000, 0.940000, …

Convert my DV (Life_Expectancy_65) from a percentage to a proportion

WDI2 <- WDI2 %>% mutate(Life_Expectancy_65 = Life_Expectancy_65/100)
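
Beta regression needs a response that lies strictly inside the open interval (0, 1), so it is worth confirming the rescaled values before modeling. The sketch below uses a few percentages from the glimpse output plus an NA added for illustration; `to_proportion` is a hypothetical helper, not part of any package.

```r
# A few percentages from the glimpse output, plus an NA for illustration
pct <- c(50.41502, 81.03560, 78.90971, NA)

# Hypothetical helper: rescale percent to proportion and verify the (0, 1) range
to_proportion <- function(x) {
  p <- x / 100
  ok <- is.na(p) | (p > 0 & p < 1)  # missing values are allowed through
  if (!all(ok)) stop("values outside the open interval (0, 1)")
  p
}

prop <- to_proportion(pct)
range(prop, na.rm = TRUE)
```

If any value were exactly 0 or 1, the helper would stop with an error rather than pass the data on to the model.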

Code Book for Variables

  • country: “Name of the country or regional aggregate (for example, World)”
  • year: “Year of observation”
  • Life_Expectancy_65: “Proportion of males surviving to the age of 65”
  • Smoking_Prevalence: “Percentage of males who use tobacco”
  • Alcohol_Consumption: “Alcohol consumed per capita among males” (measured in liters of pure alcohol)

Data Table

head(WDI2) #First few rows of data
## # A tibble: 6 × 5
##   country         year Life_Expectancy_65 Smoking_Prevalence Alcohol_Consumption
##   <chr>          <dbl>              <dbl>              <dbl>               <dbl>
## 1 Afghanistan     2020              0.504               39.4                0.02
## 2 Africa Easter…  2020              0.552               20.3                7.05
## 3 Africa Wester…  2020              0.499               10.9                6.74
## 4 Albania         2020              0.810               38.8                7.37
## 5 Algeria         2020              0.789               41.3                0.94
## 6 American Samoa  2020              0.692               NA                 NA
tail(WDI2) #Last few rows of data
## # A tibble: 6 × 5
##   country         year Life_Expectancy_65 Smoking_Prevalence Alcohol_Consumption
##   <chr>          <dbl>              <dbl>              <dbl>               <dbl>
## 1 Virgin Island…  2020              0.710               NA                NA    
## 2 West Bank and…  2020              0.787               NA                NA    
## 3 World           2020              0.714               37.3               7.85 
## 4 Yemen, Rep.     2020              0.601               32.5               0.077
## 5 Zambia          2020              0.553               25.1               5.96 
## 6 Zimbabwe        2020              0.476               21.8               4.82

Methods

Variables of Interest

  • Independent Variables: Smoking_Prevalence, Alcohol_Consumption
  • Dependent Variable: Life_Expectancy_65


1. Research Question

  • Is there a relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65?

Hypothesis

  • Null Hypothesis (H₀): There is no significant relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65.
  • Alternative Hypothesis (H₁): There is a significant relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65.


2. Research Question

  • Is there a relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65?

Hypothesis

  • Null Hypothesis (H₀): There is no significant relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65.
  • Alternative Hypothesis (H₁): There is a significant relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65.

Analysis Plan

  • I will fit a beta regression model to assess the relationship between each independent variable and the dependent variable, and to predict the dependent variable at different values of the predictors.
  • A beta regression model is appropriate because the dependent variable (Life_Expectancy_65) is a proportion that lies strictly between 0 and 1.
  • I will use the betareg package to fit the model and the clarify package to interpret it.
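
As background for this choice, betareg writes the beta distribution in terms of a mean mu and a precision phi, which map to the usual shape parameters as shape1 = mu * phi and shape2 = (1 - mu) * phi. The sketch below uses illustrative values of mu and phi (assumptions, not estimates from the data) to show that the implied variance is mu * (1 - mu) / (1 + phi).

```r
# Illustrative values (assumptions, not estimates from the data)
mu  <- 0.7   # mean of the proportion
phi <- 13.5  # precision; larger values mean less spread around the mean

# Convert to the shape parameters used by base R's beta functions
shape1 <- mu * phi
shape2 <- (1 - mu) * phi

# Variance implied by the mean-precision form
var_formula <- mu * (1 - mu) / (1 + phi)

# Compare against a simulated sample
set.seed(1)
draws <- rbeta(1e5, shape1, shape2)
c(mean = mean(draws), var = var(draws), var_formula = var_formula)
```

The simulated mean and variance should sit close to mu and var_formula, which is why a larger phi in the model output indicates a tighter spread of proportions around the fitted mean.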

Statistical Analysis

Beta Regression Model

library(betareg) #Beta Regression
beta_model <- betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, data = WDI2)
summary(beta_model)
## 
## Call:
## betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, 
##     data = WDI2)
## 
## Quantile residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7210 -0.8208 -0.0291  0.7456  2.7805 
## 
## Coefficients (mean model with logit link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.443642   0.115567   3.839 0.000124 ***
## Smoking_Prevalence  0.004756   0.002978   1.597 0.110328    
## Alcohol_Consumption 0.028690   0.006790   4.225 2.38e-05 ***
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)   13.535      1.295   10.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 149.1 on 4 Df
## Pseudo R-squared: 0.08142
## Number of iterations: 10 (BFGS) + 2 (Fisher scoring)
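
Because the mean model uses a logit link, the coefficients in the summary are on the log-odds scale and cannot be read directly as changes in the proportion. As a rough sketch, the reported estimates can be plugged into the inverse logit (`plogis` in base R) to get a predicted proportion. The predictor values used here (30% smoking, 5 liters) are arbitrary illustrations, and `predict_mean` is a hypothetical helper.

```r
# Coefficients copied from the summary output
b0      <- 0.443642
b_smoke <- 0.004756
b_alc   <- 0.028690

# Hypothetical helper: predicted mean proportion for given predictor values
predict_mean <- function(smoke, alc) {
  plogis(b0 + b_smoke * smoke + b_alc * alc)
}

# Predicted proportion surviving to 65 at illustrative predictor values
predict_mean(smoke = 30, alc = 5)

# Multiplicative change in the odds of surviving to 65 per extra liter
exp(b_alc)
```

With the fitted model object available, `predict(beta_model, newdata = ..., type = "response")` would be the more direct route; the hand calculation is only meant to make the link function visible.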

Interpretation of Results Using Clarify

Average Marginal Effects (Smoking_Prevalence and Life_Expectancy_65)

library(clarify) 
set.seed(123) #Repeatable Results
beta_sim <- sim(beta_model)
ame_beta <- sim_ame(beta_sim, var = "Smoking_Prevalence", contrast = "rd")
summary(ame_beta)
##                              Estimate     2.5 %    97.5 %
## E[dY/d(Smoking_Prevalence)]  0.001003 -0.000357  0.002187
  • The average marginal effect of smoking prevalence is 0.001003, with a 95% interval from -0.000357 to 0.002187. Since zero falls within this interval, the relationship between the percentage of men who smoke and the proportion of men surviving to the age of 65 is not statistically significant. In simpler terms, this model does not provide evidence that smoking prevalence is associated with the proportion of men surviving to the age of 65, which is not the same as showing that smoking has no effect.
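
For a logit-link mean model, the marginal effect of a predictor on the proportion scale at one observation is the coefficient times mu * (1 - mu), and the average marginal effect averages that quantity over observations. The back-of-the-envelope sketch below assumes fitted means of 0.6, 0.7, and 0.8 (illustrative values, not computed from the data) and lands in the same range as the reported estimate.

```r
# Coefficient for Smoking_Prevalence from the model summary
b_smoke <- 0.004756

# Hypothetical fitted means; the real ones would come from fitted(beta_model)
mu_hat <- c(0.60, 0.70, 0.80)

# Pointwise marginal effects on the proportion scale, then their average
me  <- b_smoke * mu_hat * (1 - mu_hat)
ame <- mean(me)
ame
```

This only approximates the point estimate; the interval reported by clarify comes from simulating coefficient draws, which this sketch does not attempt.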

Average Marginal Effects (Alcohol_Consumption and Life_Expectancy_65)

set.seed(1234) #Repeatable Results
beta1_sim <- sim(beta_model)
ame1_beta <- sim_ame(beta1_sim, var = "Alcohol_Consumption", contrast = "rd")
summary(ame1_beta)
##                              Estimate   2.5 %  97.5 %
## E[dY/d(Alcohol_Consumption)]  0.00605 0.00321 0.00884

Average Dose Response (Prediction)

adrf <- sim_adrf(beta1_sim, var = "Alcohol_Consumption", contrast = "adrf")
plot(adrf)

  • Unlike smoking prevalence, alcohol consumption among males has a statistically significant relationship with the proportion of men surviving to age 65, because zero does not fall within the 95% interval of 0.00321 to 0.00884. On average, a 1 liter increase in alcohol consumption is associated with an increase of 0.00605 (about 0.6 percentage points) in the proportion of men surviving to age 65. This is an association in country-level data, not evidence that drinking more causes men to live longer.
  • The graph shows the average predicted proportion of men surviving to age 65 at different levels of alcohol consumption (liters). As alcohol consumption (x axis) increases, the predicted proportion of men surviving to 65 (y axis) also increases, so the graph depicts a positive relationship between the two variables.
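
The mechanics behind an average dose-response curve can be sketched by hand: for each value on a grid of alcohol consumption, set every observation's alcohol value to that grid point, predict the mean proportion, and average over observations. The version below uses the reported point estimates and a made-up vector of smoking values, so it illustrates the procedure only and will not reproduce the plotted curve or its uncertainty bands.

```r
# Point estimates from the model summary
b0      <- 0.443642
b_smoke <- 0.004756
b_alc   <- 0.028690

# Hypothetical smoking prevalences (percent), not the real data
smoke <- c(10, 20, 30, 40)

# Grid of alcohol consumption values (liters) to evaluate
alc_grid <- seq(0, 15, by = 5)

# Average predicted proportion at each grid value
adrf_hat <- sapply(alc_grid, function(a) {
  mean(plogis(b0 + b_smoke * smoke + b_alc * a))
})

data.frame(Alcohol_Consumption = alc_grid, avg_prediction = adrf_hat)
```

Because the alcohol coefficient is positive, the averaged predictions rise across the grid, which matches the upward slope described for the plot.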

Conclusion

I was not expecting these results. I was surprised to find that the percentage of men who smoke was not significantly related to the proportion of men surviving to age 65. I was also surprised to observe a positive relationship between alcohol consumption and male survival, where higher alcohol consumption corresponded to a higher proportion of men reaching the age of 65. The model explains little of the variation in survival (pseudo R-squared of 0.08), so variables left out of the model may be driving these patterns. Would the results change if I added predictors such as GDP, education levels, and health (diet, exercise)? I will have to conduct further analysis to find out.