Import Data
library(readr) # Loading CSV Files
WDI <- read_csv("data.csv")
Brief Overview of the Data
The data set I will use to fit my beta regression model is the World Bank’s World Development Indicators (2020) data set. It contains information on a wide range of economic, social, and environmental indicators; the extract used here has 266 rows, covering individual countries as well as regional aggregates such as “World”. The data set is useful because it lets you track how countries are progressing in areas like economic growth, education, health, poverty, and environmental sustainability.
Cleaning Data
Creating a new data set with variables of interest
library(dplyr) #Data Cleaning
WDI2 <- WDI %>% select(country, year, SP.DYN.TO65.MA.ZS, SH.PRV.SMOK.MA, SH.ALC.PCAP.MA.LI)
Renaming Variables
WDI2 <- WDI2 %>% rename("Life_Expectancy_65" = "SP.DYN.TO65.MA.ZS", "Smoking_Prevalence" = "SH.PRV.SMOK.MA", "Alcohol_Consumption" = "SH.ALC.PCAP.MA.LI")
View structure of the data
glimpse(WDI2)
## Rows: 266
## Columns: 5
## $ country <chr> "Afghanistan", "Africa Eastern and Southern", "Afr…
## $ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
## $ Life_Expectancy_65 <dbl> 50.41502, 55.22468, 49.91460, 81.03560, 78.90971, …
## $ Smoking_Prevalence <dbl> 39.40000, 20.25884, 10.94765, 38.80000, 41.30000, …
## $ Alcohol_Consumption <dbl> 0.020000, 7.054164, 6.744757, 7.370000, 0.940000, …
Convert my DV (Life_Expectancy_65) from a percentage to a proportion
WDI2 <- WDI2 %>% mutate(Life_Expectancy_65 = Life_Expectancy_65/100)
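Classic beta regression requires the response to lie strictly between 0 and 1, so it can be worth checking the rescaled variable before modelling. The following is a minimal sketch of such a check on a small made-up vector; `survival_pct` and `in_open_unit` are hypothetical names, not objects from the analysis above.

```r
# Hypothetical percentages standing in for the SP.DYN.TO65.MA.ZS column
survival_pct <- c(50.41502, 81.03560, NA, 69.2)

# Rescale from percent to proportion, as done for Life_Expectancy_65
survival_prop <- survival_pct / 100

# TRUE only if every non-missing value is strictly inside (0, 1)
in_open_unit <- all(survival_prop > 0 & survival_prop < 1, na.rm = TRUE)
in_open_unit
```

The same check could be run on `WDI2$Life_Expectancy_65` after the `mutate()` step.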
Code Book for Variables
- country: “Name of the country or regional aggregate”
- year: “Year of observation”
- Life_Expectancy_65: “Proportion of males surviving to the age of 65”
- Smoking_Prevalence: “Prevalence of current tobacco use among males (% of adult males)”
- Alcohol_Consumption: “Alcohol consumed per capita among males” (measured in liters of pure alcohol)
Data Table
head(WDI2) #First few rows of data
## # A tibble: 6 × 5
## country year Life_Expectancy_65 Smoking_Prevalence Alcohol_Consumption
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 2020 0.504 39.4 0.02
## 2 Africa Easter… 2020 0.552 20.3 7.05
## 3 Africa Wester… 2020 0.499 10.9 6.74
## 4 Albania 2020 0.810 38.8 7.37
## 5 Algeria 2020 0.789 41.3 0.94
## 6 American Samoa 2020 0.692 NA NA
tail(WDI2) #Last few rows of data
## # A tibble: 6 × 5
## country year Life_Expectancy_65 Smoking_Prevalence Alcohol_Consumption
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Virgin Island… 2020 0.710 NA NA
## 2 West Bank and… 2020 0.787 NA NA
## 3 World 2020 0.714 37.3 7.85
## 4 Yemen, Rep. 2020 0.601 32.5 0.077
## 5 Zambia 2020 0.553 25.1 5.96
## 6 Zimbabwe 2020 0.476 21.8 4.82
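The tables above show two things that may matter for the model: some rows have missing predictors (`NA`), and some rows are aggregates (for example “World”) rather than individual countries. R's model-fitting functions typically drop incomplete rows by default, but aggregates stay in unless they are filtered out. The sketch below illustrates one possible way to handle both on a toy tibble; `toy`, `aggregates`, and `toy_clean` are hypothetical names, and the `aggregates` vector is only a partial, illustrative list.

```r
library(dplyr)

# Toy rows mimicking the structure of WDI2 (values copied from the tables above)
toy <- tibble(
  country = c("Albania", "World", "American Samoa", "Zambia"),
  Life_Expectancy_65 = c(0.810, 0.714, 0.692, 0.553),
  Smoking_Prevalence = c(38.8, 37.3, NA, 25.1),
  Alcohol_Consumption = c(7.37, 7.85, NA, 5.96)
)

# Hypothetical, incomplete list of aggregate labels; a real list would need
# to come from the World Bank's own country/aggregate classification
aggregates <- c("World", "Africa Eastern and Southern")

toy_clean <- toy %>%
  filter(!country %in% aggregates) %>%                             # drop aggregates
  filter(!is.na(Smoking_Prevalence), !is.na(Alcohol_Consumption))  # drop incomplete rows

nrow(toy_clean)
```

Filtering explicitly, rather than relying on the default handling of missing values, makes it easier to report how many observations the model actually uses.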
Methods
Variables of Interest
- Independent Variables: Smoking_Prevalence, Alcohol_Consumption
- Dependent Variable: Life_Expectancy_65
1. Research Question
- Is there a relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65?
Hypothesis
- Null Hypothesis (H₀): There is no significant relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65.
- Alternative Hypothesis (H₁): There is a significant relationship between the percentage of males who smoke and the proportion of males surviving to the age of 65.
2. Research Question
- Is there a relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65?
Hypothesis
- Null Hypothesis (H₀): There is no significant relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65.
- Alternative Hypothesis (H₁): There is a significant relationship between the amount of alcohol consumed per year by males and the proportion of males surviving to the age of 65.
Analysis Plan
- I will conduct a beta regression analysis to assess the relationship between the independent variables and the dependent variable, and to generate predictions of the dependent variable.
- A beta regression model is appropriate because the dependent variable (Life_Expectancy_65) is a proportion bounded between 0 and 1.
- I will use the packages betareg and clarify to conduct the beta regression analysis.
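For reference, the model described in the plan above can be written out as follows. This is a sketch of the standard mean-precision form of beta regression with a logit link; the symbols $\mu_i$, $\phi$, and $\beta_0, \beta_1, \beta_2$ are notation introduced here, not taken from the report.

```latex
y_i \sim \mathrm{Beta}\big(\mu_i \phi,\; (1 - \mu_i)\,\phi\big), \qquad
\log\frac{\mu_i}{1 - \mu_i}
  = \beta_0
  + \beta_1\,\text{Smoking\_Prevalence}_i
  + \beta_2\,\text{Alcohol\_Consumption}_i
```

Here $y_i$ is Life_Expectancy_65 for observation $i$, $\mu_i$ is its expected value, and $\phi$ is a precision parameter (larger $\phi$ means less spread around $\mu_i$).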
Statistical Analysis
Beta Regression Model
library(betareg) #Beta Regression
beta_model <- betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption, data = WDI2)
summary(beta_model)
##
## Call:
## betareg(formula = Life_Expectancy_65 ~ Smoking_Prevalence + Alcohol_Consumption,
## data = WDI2)
##
## Quantile residuals:
## Min 1Q Median 3Q Max
## -2.7210 -0.8208 -0.0291 0.7456 2.7805
##
## Coefficients (mean model with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.443642 0.115567 3.839 0.000124 ***
## Smoking_Prevalence 0.004756 0.002978 1.597 0.110328
## Alcohol_Consumption 0.028690 0.006790 4.225 2.38e-05 ***
##
## Phi coefficients (precision model with identity link):
## Estimate Std. Error z value Pr(>|z|)
## (phi) 13.535 1.295 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 149.1 on 4 Df
## Pseudo R-squared: 0.08142
## Number of iterations: 10 (BFGS) + 2 (Fisher scoring)
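The coefficients in the output above are on the logit scale, so they are not directly readable as changes in the proportion. As a rough illustration, they can be back-transformed by hand. The sketch below copies two estimates from the summary; the variable names are hypothetical.

```r
# Coefficients copied from the summary output above (logit scale)
b0 <- 0.443642         # intercept
b_alcohol <- 0.028690  # Alcohol_Consumption

# Predicted proportion surviving to 65 when both predictors equal 0
p_baseline <- plogis(b0)

# Multiplicative change in the odds of survival per additional liter
or_alcohol <- exp(b_alcohol)

round(c(baseline = p_baseline, odds_ratio = or_alcohol), 3)
```

This gives a baseline proportion of about 0.61 and an odds ratio of about 1.03 per liter. The average marginal effects computed with clarify express the same relationships directly on the proportion scale.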
Interpretation of Results Using Clarify
Average Marginal Effects (Smoking_Prevalence and Life_Expectancy_65)
library(clarify)
set.seed(123) #Repeatable Results
beta_sim <- sim(beta_model)
ame_beta <- sim_ame(beta_sim, var = "Smoking_Prevalence", contrast = "rd")
summary(ame_beta)
## Estimate 2.5 % 97.5 %
## E[dY/d(Smoking_Prevalence)] 0.001003 -0.000357 0.002187
- The estimated average marginal effect of smoking prevalence is 0.001003, with a 95% interval from -0.000357 to 0.002187. Because this interval includes zero, the relationship between the percentage of men who smoke and the proportion of men surviving to the age of 65 is not statistically significant, and I fail to reject the null hypothesis. In simpler terms, this model does not provide evidence of a relationship; that is not the same as showing that smoking has no influence on survival.
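To see roughly where an average marginal effect of this size comes from, recall that under a logit link the effect of a predictor on the mean depends on the mean itself. The worked line below uses $\mu = 0.7$ purely as an illustrative value (an assumption, not a quantity computed from the data) together with the Smoking_Prevalence coefficient from the model summary.

```latex
\frac{\partial \mu}{\partial x_j} = \beta_j\,\mu\,(1 - \mu)
\quad\Longrightarrow\quad
0.004756 \times 0.7 \times (1 - 0.7) \approx 0.0010
```

This is in line with the simulated estimate of 0.001003, which averages the same derivative over all observations rather than evaluating it at a single value of $\mu$.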
Average Marginal Effects (Alcohol_Consumption and Life_Expectancy_65)
set.seed(1234) #Repeatable Results
beta1_sim <- sim(beta_model)
ame1_beta <- sim_ame(beta1_sim, var = "Alcohol_Consumption", contrast = "rd")
summary(ame1_beta)
## Estimate 2.5 % 97.5 %
## E[dY/d(Alcohol_Consumption)] 0.00605 0.00321 0.00884
Average Dose Response (Prediction)
adrf <- sim_adrf(beta1_sim, var = "Alcohol_Consumption", contrast = "adrf")
plot(adrf)
- Unlike smoking prevalence, alcohol consumption among males has a statistically significant relationship with the proportion of men surviving to the age of 65: the 95% interval for the average marginal effect (0.00321 to 0.00884) does not include zero. On average, a 1-liter increase in alcohol consumption is associated with an increase of 0.00605 (about 0.6 percentage points) in the proportion of men surviving to age 65. The graph shows the predicted proportion of men surviving to age 65 (y-axis) at different levels of alcohol consumption in liters (x-axis). As alcohol consumption increases, the predicted proportion also increases, so the graph depicts a positive relationship between the two variables. Because the data are observational, this is an association rather than evidence that alcohol consumption causes higher survival.
Conclusion
I was not expecting these results. I was surprised to find that the percentage of men who smoke did not have a statistically significant relationship with the proportion of men surviving to age 65. I was also shocked to observe a positive relationship between alcohol consumption and male survival, where higher alcohol consumption corresponded to a higher proportion of men reaching the age of 65. The model explains little of the variation in survival (pseudo R-squared of 0.08), so these results may reflect omitted variables. Would the results change if I added more predictors to the model, such as GDP, education levels, and health (diet, exercise)? I will have to conduct further analysis to find out.