The dataset I chose for this project is the one provided in our Google Drive, which contains life expectancy alongside various socio-economic variables. It is derived from The World Bank, a global partnership of five institutions with the goal of eliminating world poverty. As these are major international financial institutions, it makes sense that they are able to provide enough information to compile a dataset on the general socio-economic status of various countries.
The World Bank is associated with the United Nations and is one of the largest sources of financial assistance to developing countries. Its loans and grants to the governments of developing countries are intended to help these countries grow. An interesting point I found is that voting power within The World Bank's governing body is based on each country's capital subscription, so wealthier and more developed countries have more influence.
This dataset includes many variables, but the quantitative ones I have chosen to focus on are: year, life expectancy, undernourishment, CO2 levels, health expenditures, education expenditures, unemployment, sanitation, and injuries. I chose this dataset because I was interested in how developing countries are affected by such variables in comparison to other countries. It is important to know how people around the world are affected by matters that may seem insignificant to us, such as easily treatable injuries or the privilege of high government education expenditures.
Rows: 3306 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Country Name, Country Code, Region, IncomeGroup
dbl (12): Year, Life Expectancy World Bank, Prevelance of Undernourishment, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 16
`Country Name` `Country Code` Region IncomeGroup Year Life Expectancy Worl…¹
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Afghanistan AFG South… Low income 2001 56.3
2 Angola AGO Sub-S… Lower midd… 2001 47.1
3 Albania ALB Europ… Upper midd… 2001 74.3
4 Andorra AND Europ… High income 2001 NA
5 United Arab Em… ARE Middl… High income 2001 74.5
6 Argentina ARG Latin… Upper midd… 2001 73.8
# ℹ abbreviated name: ¹`Life Expectancy World Bank`
# ℹ 10 more variables: `Prevelance of Undernourishment` <dbl>, CO2 <dbl>,
# `Health Expenditure %` <dbl>, `Education Expenditure %` <dbl>,
# Unemployment <dbl>, Corruption <dbl>, Sanitation <dbl>, Injuries <dbl>,
# Communicable <dbl>, NonCommunicable <dbl>
Clean the data
#removing all NA data
data_clean <- data |>
  filter(!is.na(`Life Expectancy World Bank`)) |>
  filter(!is.na(`Prevelance of Undernourishment`)) |>
  filter(!is.na(CO2)) |>
  filter(!is.na(`Health Expenditure %`)) |>
  filter(!is.na(`Education Expenditure %`)) |>
  filter(!is.na(Unemployment)) |>
  filter(!is.na(Sanitation)) |>
  filter(!is.na(Injuries))
#getting rid of columns i dont want/deem unnecessary
data_clean2 <- data_clean[, -c(12, 15, 16)]
#removing the non-numeric columns for the correlation plot
cor_data_clean3 <- data_clean2[, -c(1, 2, 3, 4)]
#renaming for ease
cor_data_clean4 <- cor_data_clean3 |>
  rename(lifeExp = `Life Expectancy World Bank`,
         undernourished = `Prevelance of Undernourishment`,
         healthExp = `Health Expenditure %`,
         eduExp = `Education Expenditure %`,
         unemployment = Unemployment,
         sanitation = Sanitation,
         injuries = Injuries)
#country names + renamed data
data_clean3 <- data_clean2 |>
  rename(lifeExp = `Life Expectancy World Bank`,
         undernourished = `Prevelance of Undernourishment`,
         healthExp = `Health Expenditure %`,
         eduExp = `Education Expenditure %`,
         unemployment = Unemployment,
         sanitation = Sanitation,
         injuries = Injuries,
         coName = `Country Name`,
         coCode = `Country Code`,
         region = Region,
         incGroup = IncomeGroup,
         year = Year)
From the correlation plot, it is easier to see that variables such as sanitation and undernourishment are strongly associated with life expectancy. It also helps show which variables should be taken into account when creating the visualizations.
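A correlation plot of this kind is typically drawn from a correlation matrix. As a minimal sketch, assuming a small made-up data frame (`toy` and its values are invented for illustration and are not taken from the World Bank dataset), base R's `cor()` gives the matrix that such a plot visualizes:

```r
# Toy data, invented purely for illustration (not from the real dataset)
toy <- data.frame(
  lifeExp        = c(52, 58, 63, 70, 75, 80),
  undernourished = c(35, 28, 20, 12, 6, 3),
  sanitation     = c(20, 35, 50, 70, 88, 97)
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(toy, use = "complete.obs")

# Round for readability; values near -1 or 1 indicate a strong linear relationship
round(cor_matrix, 2)
```

The same call on `cor_data_clean4` should give the matrix behind the plot discussed here. A strong correlation only shows that two variables move together; it does not show that one causes the other.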
Scatter Plot
ggplot(cor_data_clean4, aes(x = `undernourished`, y = `lifeExp`)) +
  labs(title = "Undernourishment vs Life Expectancy",
       x = "Undernourishment",
       y = "Life Expectancy",
       caption = "The World Bank") +
  geom_point(color = "pink") +
  geom_smooth(method = lm) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Here I started with a simple scatterplot with a fitted linear regression line to show how undernourishment relates to life expectancy. There is a very clear downward trend: higher rates of undernourishment are associated with lower life expectancy.
Call:
lm(formula = lifeExp ~ undernourished + CO2 + healthExp + eduExp +
unemployment + sanitation + injuries, data = cor_data_clean4)
Residuals:
Min 1Q Median 3Q Max
-19.8491 -2.0877 0.3807 2.9323 20.0466
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.171e+01 6.239e-01 114.929 < 2e-16 ***
undernourished -5.634e-01 1.796e-02 -31.375 < 2e-16 ***
CO2 -2.036e-07 1.803e-07 -1.129 0.2592
healthExp 5.543e-01 5.960e-02 9.301 < 2e-16 ***
eduExp -1.641e-01 8.107e-02 -2.025 0.0431 *
unemployment -1.986e-01 2.460e-02 -8.072 1.57e-15 ***
sanitation 7.684e-02 5.962e-03 12.889 < 2e-16 ***
injuries 5.695e-08 2.657e-08 2.144 0.0323 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.576 on 1288 degrees of freedom
Multiple R-squared: 0.7246, Adjusted R-squared: 0.7231
F-statistic: 484.2 on 7 and 1288 DF, p-value: < 2.2e-16
From this model, you can see that the adjusted R-squared value is 72.31% (the multiple R-squared is 72.46%), which is already fairly high, but let's see if removing the variables with the highest p-values makes a difference:
Here we can see that the R-squared value actually decreased, so I chose to stick with the first fit.
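The comparison between a full and a reduced model can be sketched as follows. This is only an illustration on simulated data: the variable names (`x1`, `x2`, `noise_var`, `fit_full`, `fit_reduced`) are hypothetical and do not come from the project.

```r
# Simulated data, invented for illustration only
set.seed(1)
n         <- 200
x1        <- rnorm(n)
x2        <- rnorm(n)
noise_var <- rnorm(n)                          # unrelated to y by construction
y         <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)

fit_full    <- lm(y ~ x1 + x2 + noise_var)
fit_reduced <- lm(y ~ x1 + x2)                 # drops the weakest predictor

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison
adj_r2 <- c(
  full    = summary(fit_full)$adj.r.squared,
  reduced = summary(fit_reduced)$adj.r.squared
)
adj_r2

# Partial F-test: does the dropped term explain a significant amount of variance?
anova(fit_reduced, fit_full)
```

Plain (multiple) R-squared can never increase when predictors are removed, so a drop in that value alone does not show the larger model is better. Adjusted R-squared or the F-test from `anova()` is the more informative check.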
Diagnostic Plots
par(mfrow = c(2,2))
plot(fit1)
There aren’t many outliers in these plots, and it would seem that a linear model works well.
Equation
The equation for my model is: lifeExp = 71.71 - 0.5634(undernourished) - (2.036e-07)(CO2) + 0.5543(healthExp) - 0.1641(eduExp) - 0.1986(unemployment) + 0.07684(sanitation) + (5.695e-08)(injuries)
Final Visualization
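To show how the equation is used, here is a small worked example. The coefficients are copied from the model summary; the input values in `new_obs` are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the model summary output
coefs <- c(
  intercept      = 71.71,
  undernourished = -0.5634,
  CO2            = -2.036e-07,
  healthExp      = 0.5543,
  eduExp         = -0.1641,
  unemployment   = -0.1986,
  sanitation     = 0.07684,
  injuries       = 5.695e-08
)

# Hypothetical country-year (made-up values, not a row from the dataset)
new_obs <- c(
  intercept      = 1,       # multiplies the intercept term
  undernourished = 10,
  CO2            = 100000,
  healthExp      = 6,
  eduExp         = 4,
  unemployment   = 5,
  sanitation     = 60,
  injuries       = 500000
)

# Predicted life expectancy = sum of coefficient * value over all terms
predicted_lifeExp <- sum(coefs * new_obs[names(coefs)])
predicted_lifeExp  # about 72.37 years
```

With the fitted model object available, `predict(fit1, newdata = ...)` should give the same kind of prediction without copying coefficients by hand.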
#focusing on the year 2003
hc_data <- data_clean3 |>
  filter(year == 2003)
#using the countries in the plot and converting to find the respective long and lat
geo <- read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>%
  select(3,5:6)
Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
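#joining the coordinates onto the 2003 data by ISO alpha-3 code
output <- hc_data %>%
  left_join(y = geo, by = join_by(coCode == `Alpha-3 code`))
vis_map <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)
#interactivity data
popup_data <- paste0("<b>Year: </b>", vis_map$year, "<br>",
                     "<b>Country: </b>", vis_map$coName, "<br>",
                     "<b>Life Expectancy: </b>", vis_map$lifeExp, "<br>",
                     "<b>Education Expenditures in %: </b>", vis_map$eduExp, "<br>")
leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map,
    radius = vis_map$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )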
Assuming "long" and "lat" are longitude and latitude, respectively
#trying other years just for comparison
hc_data2 <- data_clean3 |>
  filter(year == 2010)
hc_data3 <- data_clean3 |>
  filter(year == 2008)
#using the countries in the plot and converting to find the respective long and lat
geo <- read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>%
  select(3,5:6)
Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
output <- hc_data2 %>%
  left_join(y = geo, by = join_by(coCode == `Alpha-3 code`))
vis_map2 <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)
#using the countries in the plot and converting to find the respective long and lat
geo <- read_csv("https://gist.github.com/tadast/8827699/raw/61b2107766d6fd51e2bd02d9f78f6be081340efc/countries_codes_and_coordinates.csv") %>%
  select(3,5:6)
Rows: 262 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Country, Alpha-2 code, Alpha-3 code
dbl (3): Numeric code, Latitude (average), Longitude (average)
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#interactivity data
popup_data <- paste0("<b>Year: </b>", vis_map2$year, "<br>",
                     "<b>Country: </b>", vis_map2$coName, "<br>",
                     "<b>Life Expectancy: </b>", vis_map2$lifeExp, "<br>",
                     "<b>Education Expenditures in %: </b>", vis_map2$eduExp, "<br>")
leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map2,
    radius = vis_map2$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )
Assuming "long" and "lat" are longitude and latitude, respectively
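#joining the coordinates onto the 2008 data by ISO alpha-3 code
output <- hc_data3 %>%
  left_join(y = geo, by = join_by(coCode == `Alpha-3 code`))
vis_map3 <- output |>
  rename(long = `Longitude (average)`, lat = `Latitude (average)`)
#interactivity data
popup_data <- paste0("<b>Year: </b>", vis_map3$year, "<br>",
                     "<b>Country: </b>", vis_map3$coName, "<br>",
                     "<b>Life Expectancy: </b>", vis_map3$lifeExp, "<br>",
                     "<b>Education Expenditures in %: </b>", vis_map3$eduExp, "<br>")
leaflet() |>
  addProviderTiles("Esri.NatGeoWorldMap") |>
  addCircles(
    data = vis_map3,
    radius = vis_map3$eduExp*100000, #to make the visualizations easier to see for the viewer
    color = "#e93323",
    popup = popup_data
  )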
Assuming "long" and "lat" are longitude and latitude, respectively
Conclusion
This visualization is a colored world map with red circles of various sizes on it. The size of each circle represents the percentage each country’s government spends on education. I wanted to see if education spending had an impact on the general life expectancy of citizens. I multiplied the education expenditure values by 100,000 when setting the circle radius so that the circles are easier to see, without changing the underlying data. I also changed the year two other times to compare and see the differences between countries through the years. It was interesting to see countries with massive bubbles having very high life expectancy, which would seem to fit. However, there are also many countries with smaller bubbles that have just as high a life expectancy.
I attempted to add additional code so that users could input a year from 2000-2019, the range of my dataset. However, the issue came when attempting to gather the longitude and latitude for each country, because that information is not included in the dataset. It did not work out the way I hoped, so I kept the year manual. I also attempted to vary the colors of the bubbles so that the user could see the life expectancy of each country at a glance; however, that also did not work out.
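One possible way to approach both ideas is sketched below. This is not the code from the project: the helpers `map_data_for_year()` and `life_exp_color()`, the toy data frames, the bin cut-offs (65 and 75 years), and the color values are all hypothetical. It relies on the fact that coordinates do not depend on the year, so the coordinate table only needs to be joined by country code after filtering.

```r
# Toy stand-ins (invented values) for the cleaned data and the coordinate table
toy_data <- data.frame(
  coCode  = c("AAA", "BBB", "AAA", "BBB"),
  year    = c(2003, 2003, 2010, 2010),
  lifeExp = c(60.1, 78.4, 63.0, 79.9),
  eduExp  = c(3.2, 5.1, 3.8, 5.0)
)
toy_geo <- data.frame(
  coCode = c("AAA", "BBB"),
  lat    = c(10, 50),
  long   = c(20, -5)
)

# Hypothetical helper: keep one year, then attach coordinates by country code
map_data_for_year <- function(df, geo, yr) {
  one_year <- df[df$year == yr, ]
  merge(one_year, geo, by = "coCode", all.x = TRUE)  # left join on country code
}

# Hypothetical helper: bin life expectancy into three colors (low, medium, high)
life_exp_color <- function(x, palette = c("#d73027", "#fee08b", "#1a9850")) {
  bins <- cut(x, breaks = c(-Inf, 65, 75, Inf), labels = FALSE)
  palette[bins]
}

vis_2010 <- map_data_for_year(toy_data, toy_geo, 2010)
vis_2010$color <- life_exp_color(vis_2010$lifeExp)
vis_2010
```

In principle, the year passed to the helper could come from a user input, and the resulting `color` column could be supplied to `addCircles()` in place of the fixed `"#e93323"`. Neither step has been tested against the real dataset here.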