Most individuals are required or expected to receive the MMR vaccine, which protects against measles (along with mumps and rubella). Most measles cases are not life-threatening, and the patient survives. In more severe cases, however, complications can lead to hospitalization and death; this is most common among children under the age of 5. Because measles has a low death rate, we were curious to see the relationship between life expectancy and measles immunization rates across different continents.
Most of the World Bank data are collected from individual countries that are part of the World Bank system and from various international organizations such as the UN. A potential problem with this collection method is that the sample is not a simple random sample: the data come from countries that are willing to contribute them. However, since the data set lists 120 countries spread across different continents and is built from raw data, we treat it as a simple random sample for this analysis.
Fertility: the number of children that would be born to a woman over her childbearing years.
Life_expectancy_female: the number of years a female is expected to live
Life_expectancy_male: the number of years a male is expected to live
Life_expectancy_total: the number of years an individual is expected to live (not specifying gender)
Measles_immunization: the percentage of children aged 12 to 23 months who received a measles immunization before the survey.
Figure 1 shows boxplots of total life expectancy (in years) for each continent, and the accompanying table gives summary statistics. Europe has the highest mean life expectancy (79.46 years) and the highest median (80.95 years), while Africa has the lowest mean (64.08 years) and the lowest median (63.76 years). Africa also has the largest standard deviation (5.85 years), and South America has the smallest (2.50 years). The boxplots for Africa and Oceania are roughly symmetric, the boxplot for North America is right skewed, and the boxplots for South America, Europe, and Asia are all left skewed, with South America the most left skewed of the three. Oceania and North America have outliers only below the lower whisker, while Africa has outliers both below the lower whisker and above the upper whisker.
South America:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.24 74.52 76.52 75.79 77.03 80.04
## [1] 2.504151
Oceania:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.34 77.15 79.86 77.79 81.86 82.75
## [1] 5.568052
North America:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 63.66 73.76 74.68 75.16 78.67 81.95
## [1] 4.447449
Europe:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.58 77.27 80.95 79.46 82.26 83.75
## [1] 3.482971
Asia:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64.49 70.63 75.36 74.91 77.72 84.93
## [1] 5.464881
Africa:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.80 61.18 63.76 64.08 66.60 76.69
## [1] 5.853439
Figure 1 (by Felicia Irawan)
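Per-continent summaries like the ones above can be produced along the following lines. This is only a minimal sketch: the data frame `wb` and its values are invented for illustration, and the column names `continent` and `life_expectancy_total` are assumed to follow the variable list. It is not the code that generated the output in this report.

```r
# Toy data standing in for the World Bank data set (values are invented).
wb <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Europe", "Europe", "Europe"),
  life_expectancy_total = c(60, 64, 68, 78, 80, 82)
)

# Min, quartiles, median, and mean for each continent.
summary_by_continent <- tapply(wb$life_expectancy_total, wb$continent, summary)
print(summary_by_continent)

# Sample standard deviation (sd() divides by n - 1) for each continent.
sd_by_continent <- tapply(wb$life_expectancy_total, wb$continent, sd)
print(sd_by_continent)
```

Using `tapply()` keeps the continents in one call, which avoids copying a block of code six times and pasting the wrong value into a table.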
In addition to the boxplot, we created a histogram of life expectancy by continent, shown in Figure 2. There is a clear difference in life expectancy among continents. Life expectancy in Africa is markedly lower, as shown by the red bars concentrated near the lower end, and the table confirms that Africa has a much lower mean (64.08 years) and median (63.76 years) than the other five continents. South America has the smallest standard deviation (2.50 years). Europe has the highest mean (79.46 years) and the highest median (80.95 years), as can be seen from the yellow bars concentrated at the right end. Oceania has the second-highest life expectancy (mean of 77.79 years and median of 79.86 years).
Figure 2 (by Ying Jung Wu)
Another variable that we investigated is measles immunization. The boxplot below shows the measles immunization percentages by continent, and there is a clear difference among continents. Africa has the largest box and the largest standard deviation (16.00 percentage points), along with a much lower median (76.50%) than the other five continents. Oceania has the smallest standard deviation (1.86 percentage points). Europe has the highest mean immunization percentage (94.45%), whereas Asia has the highest median (96.50%), as can be seen in Table 2 following Figure 3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 74.00 86.00 92.50 89.50 93.75 97.00
## [1] 6.637017
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 92.00 93.50 95.00 94.33 95.50 96.00 2
## [1] 1.861899
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 69.00 86.00 89.00 87.64 91.00 97.00 1
## [1] 7.25718
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 87.00 93.00 95.00 94.45 97.00 99.00
## [1] 2.944856
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 63.00 87.75 96.50 90.66 98.00 99.00 2
## [1] 11.58306
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47.00 59.75 76.50 74.96 88.00 99.00
## [1] 15.99875
Figure 3 (by Ying Jung Wu)
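Some of the immunization summaries above report a count under "NA's", meaning that some countries have no immunization value. The sketch below shows how missing values can be handled when computing the mean and standard deviation in R. The vector `immunization` holds invented numbers, not values from the data set.

```r
# Invented immunization percentages; NA marks a country with no reported value.
immunization <- c(88, 91, NA, 97, NA)

# summary() counts the missing values in an extra "NA's" column.
print(summary(immunization))

# mean() and sd() return NA unless missing values are dropped explicitly.
mean_pct <- mean(immunization, na.rm = TRUE)
sd_pct <- sd(immunization, na.rm = TRUE)

# Number of countries that actually contribute to the statistics.
n_used <- sum(!is.na(immunization))
```

When a formula needs the sample size, `n_used` is the relevant count rather than `length(immunization)`, which also counts the missing entries.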
The histograms in Figure 4 compare female and male life expectancy in two continents, North America and Africa. The mean life expectancy for women is 77.8 years in North America but only 65.8 years in Africa. For men, the mean life expectancy is 72.5 years in North America and 62.4 years in Africa. Both women and men therefore live longer on average in North America than in Africa. Within each continent, the mean life expectancy of men is lower than that of women, so women's life expectancy is longer than men's in both continents. The histogram for Africa is approximately normal in shape, while the histogram for North America is left skewed. The differences between these two continents may stem from environmental, economic (GDP), and cultural differences.
North America, male:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.50 70.79 72.44 72.51 76.16 79.90
Africa, male:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.65 59.60 61.62 62.37 64.90 75.49
North America, female:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.83 76.73 77.51 77.82 80.57 84.10
Africa, female:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.99 61.87 65.14 65.77 68.59 77.94
Figure 4 (by Yi He)
In addition to life expectancies in North America and Africa, we plotted the female life expectancies of the countries in Oceania using a bin width of 5 years (set through the bin width in R). The female life expectancies in Oceania range from 69.178 to 84.900 years. The distribution is unimodal and slightly left-skewed, with an obvious outlier of 69.178 years at the lower end. The median female life expectancy in Oceania is 83.365 years, and the mean is 80.069 years, which is pulled below the median by the outlier. A mean below the median is consistent with the slightly left-skewed shape. Most countries in Oceania have female life expectancies in the interval from 82.5 to 87.5 years. The standard deviation of the female life expectancies in Oceania is about 6.44 years (two standard deviations are about 12.879 years); all the countries in this data set lie within two standard deviations of the mean, and all except the outlier lie within one.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.18 79.30 83.36 80.07 83.60 84.90
Figure 5 (by Xinyi Liu)
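As a check on the spread, the Oceania values can be read back from the summary output. This sketch assumes the Oceania sample consists of the five countries counted later in the report; with five values, R's default quartiles are exactly the 2nd and 4th sorted values, so the summary lists the data themselves (up to rounding).

```r
# Values read off the summary output above (rounded to two decimals).
oceania_female <- c(69.18, 79.30, 83.36, 83.60, 84.90)

mean_f <- mean(oceania_female)  # about 80.07, matching the reported mean
sd_f <- sd(oceania_female)      # sample SD, about 6.44 years

# Share of countries within one and within two SDs of the mean.
within_1sd <- mean(abs(oceania_female - mean_f) <= sd_f)
within_2sd <- mean(abs(oceania_female - mean_f) <= 2 * sd_f)

print(c(mean = mean_f, sd = sd_f, within_1sd = within_1sd, within_2sd = within_2sd))
```

Twice the standard deviation is about 12.88 years, so every country falls within two standard deviations of the mean, while the low outlier falls outside one.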
Furthermore, we plotted the female life expectancies of the countries in South America with a bin width of 2 years. They range from 74.209 years to 82.381 years. The distribution has a gap between 77 and 79 years, which means that no South American country in this data set has a female life expectancy in that range. Ignoring this gap, the distribution is unimodal and left-skewed, with most countries falling in the interval from 79 to 81 years. The median female life expectancy in South America is 79.504 years, and the mean is 78.828 years, which is lower than the median because of the left skew.
The standard deviation of the female life expectancies in South America is about 2.406 years, which is small relative to the life expectancies themselves. Based on the histogram, 50% of the countries lie within one standard deviation of the mean, and 100% lie within two standard deviations, which indicates that there are no apparent outliers in this data set.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 74.21 77.01 79.50 78.83 79.86 82.38
Figure 6 (by Xinyi Liu)
The last variable that we looked at is fertility. In the following histogram (Figure 7), fertility rates are grouped by continent. Every continent contributes countries to the bin around a fertility rate of 2.5. Asia is the only continent that appears in the bin below it, and Africa is the only continent with rates above 5; these countries are outliers in the data. Africa has the largest median and mean, while Europe has the smallest median and mean. The standard deviation printed for Europe (3.482971) is identical to Europe's life expectancy standard deviation and is larger than the entire range of Europe's fertility rates (1.260 to 1.880), so it cannot be Europe's fertility standard deviation; among the remaining continents, Africa has the largest standard deviation (1.08), and many factors could explain that spread. This is not shown on the graph, but we included this additional variable because of our interest in the relationship between fertility rates and measles immunization.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.649 1.849 2.257 2.153 2.388 2.730
## [1] 0.3324557
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.710 1.740 1.970 2.101 2.313 2.774
## [1] 0.3995586
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.035 1.698 2.009 2.030 2.374 2.935
## [1] 0.5359032
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.260 1.448 1.560 1.550 1.680 1.880
## [1] 3.482971
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.977 1.879 2.075 2.211 2.457 4.473
## [1] 0.7815321
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.240 3.523 4.418 4.227 4.716 6.913
## [1] 1.077417
Figure 7 (by Azzurra Cappuccini)
In the EDA section, we focused on two main variables, total life expectancy and measles immunization, and examined whether they differ across continents. The histograms and boxplots show an obvious difference between the life expectancy of Africa and that of the other continents: Africa's mean and median are lower by roughly 11 to 17 years. The spread of measles immunization is also larger for Africa than for the other continents, and its mean and median are lower as well. Since we observed such large differences in the EDA, we decided to focus on 1) the relationship between life expectancy and measles immunization, using linear regression, and 2) the difference in life expectancy between continents, using a hypothesis test.
When we plotted total life expectancy against measles immunization, the data looked approximately football-shaped, so we used linear regression to check whether there is a relationship between these two variables. The x variable is measles_immunization (defined in the Variable Description section) and the y variable is life_expectancy_total. The correlation coefficient r turned out to be 0.701 (closer to 1 than to 0), which indicates a fairly strong positive correlation between the measles immunization percentage and the life expectancy (not specifying gender). We also looked at the plot and did not see any clear outliers, so a linear regression model seems appropriate. This might indicate that countries with a higher measles immunization percentage have, on average, a higher life expectancy.
## [1] 0.7011097
## [1] 0.4061451
## [1] 38.33092
Figure 8 (by Felicia Irawan)
Equation for the regression line from R: y = 0.406x + 38.331, where x is the measles immunization percentage and y is the total life expectancy in years.
r = 0.701
Figure 9 (by Felicia Irawan)
Figure 10 (by Felicia Irawan)
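The regression output can be put to work as follows. This is a hedged sketch: the three coefficients are copied from the output above, while the function name `predict_life` and the vectors `x` and `y` are made up for illustration and are not part of the original analysis.

```r
# Coefficients copied from the R output above.
r <- 0.7011097
slope <- 0.4061451
intercept <- 38.33092

# Predicted total life expectancy (years) for a given immunization percentage.
predict_life <- function(immunization_pct) intercept + slope * immunization_pct
print(predict_life(90))  # roughly 74.9 years

# Share of the variation in life expectancy accounted for by the line.
r_squared <- r^2

# Toy check (invented numbers) that slope = r * sd(y) / sd(x) matches lm().
x <- c(60, 70, 80, 90, 95)
y <- c(62, 66, 71, 74, 78)
slope_from_r <- cor(x, y) * sd(y) / sd(x)
fit <- lm(y ~ x)
print(all.equal(slope_from_r, unname(coef(fit)["x"])))
```

With r = 0.701, r squared is about 0.49, so roughly half of the variation in life expectancy is associated with measles immunization in this simple model.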
We also graphed the same relationship by continent to check whether the relationship observed above changes from one continent to another. The individual scatter plots are shown below, and their r values are in the following table. The number of countries (data points) in each continent is as follows: 26 in Africa, 34 in Asia, 33 in Europe, 12 in North America, 5 in Oceania, and 10 in South America. The correlation coefficients for Asia and Africa are almost identical (about 0.6), while the correlation coefficient for Europe is almost zero. As the Europe scatter plot shows, all the data are clustered near the high end of measles immunization, so life expectancy shows no clear trend over that narrow range. For Oceania, the correlation coefficient is negative, but with only 5 countries listed the estimate is unreliable. From the linear regression analysis, we think that the correlation between measles immunization and life expectancy is fairly strong for Asia and Africa, and thus we focus on those two continents in our hypothesis test.
Table for Figure 10
From all the continents, we chose Africa and Asia for comparing life expectancies because their measles immunization percentages differ sharply: Africa's mean measles immunization percentage is 74.96%, the lowest among the continents, while Asia's median is 96.50%, the highest among the continents (its mean is 90.66%). Comparing the life expectancies of these two continents does not by itself let us conclude anything about the correlation between life expectancy and measles immunization, but it does tell us whether the difference between the two continents' life expectancies is statistically significant. Since both measles immunization and life expectancy are health-related factors, it is worthwhile to look at the two factors side by side for the two continents. Choosing two continents with very different measles immunization percentages is more informative than picking two continents at random. To compare the life expectancies in Asia and Africa, we used a z-test (instead of a t-test) for the difference between the means of two independent samples, because both sample sizes (34 and 26 countries) are large enough. Our hypotheses were as follows:
H0: There’s no real difference between the average life expectancies in Asia and Africa, or any observed difference between the average life expectancies of these two continents is due to chance. (mean_Asia - mean_Africa=0)
H1: There’s a real difference between the average life expectancies in Asia and Africa. (mean_Asia - mean_Africa≠0)
To see whether the difference between the average life expectancies was real or due to chance, we carried out a two-tailed z-test.
The expected difference between the mean life expectancies in the two continents is zero under our null hypothesis, and the observed difference is the difference between the mean life expectancies in Asia and Africa, which we computed in R.
Expected Difference=0
Observed Difference=10.8289
The SD of the life expectancies in each continent is calculated by adjusting the R code to sd()*√((n-1)/n), where n is the number of values in each sample; the factor converts R's sample SD, which divides by n-1, into the SD that divides by n.
sd_Asia=5.439344 sd_Africa=5.853439
The SE of the mean life expectancy in each continent is calculated by dividing its SD by the square root of its sample size.
SE_Asia=0.9328398 SE_Africa=1.147954
The SE of the difference between the mean life expectancies of the two continents is calculated by the formula SE(X-Y)=√(SE(X)^2+SE(Y)^2).
SE_difference=1.479185
Since we are conducting a two-sample z-test, the z test statistic can now be computed.
z=(10.8289-0)/1.479185=7.320854
Since we are conducting a two-sided z-test, and pnorm() in R returns the lower-tail area of the normal curve by default, we first subtract pnorm(z) from 1 (the entire area) to get the upper-tail area, and then multiply it by 2 to account for both tails of the normal curve:
(1-pnorm(7.320854))×2=2.465*10^-13
Whether we use a significance level of 0.05 or 0.01, our p-value is far smaller than the significance level. Therefore, we reject the null hypothesis that the difference is due to chance. It is fair to conclude from the hypothesis test that there is a real difference between the average life expectancy in Asia and the average life expectancy in Africa.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 64.49 70.63 75.36 74.91 77.72 84.93
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.80 61.18 63.76 64.08 66.60 76.69
## [1] 74.91055
## [1] 64.08165
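The z-test steps above can be reproduced from the reported summary values alone. The sketch below uses the means, SDs, and country counts stated in this report (34 countries in Asia, 26 in Africa); the variable names are chosen for illustration and are not the original code.

```r
# Summary values reported in this section.
mean_asia <- 74.91055
mean_africa <- 64.08165
sd_asia <- 5.439344
sd_africa <- 5.853439
n_asia <- 34
n_africa <- 26

# Standard error of each mean, then of the difference between the means.
se_asia <- sd_asia / sqrt(n_asia)
se_africa <- sd_africa / sqrt(n_africa)
se_diff <- sqrt(se_asia^2 + se_africa^2)

# z statistic: (observed difference - expected difference) / SE of the difference.
z <- ((mean_asia - mean_africa) - 0) / se_diff

# Two-sided p-value. lower.tail = FALSE computes the upper tail directly,
# which is numerically safer than 1 - pnorm(z) when z is this large.
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(c(se_diff = se_diff, z = z, p_value = p_value))
```

Asking `pnorm()` for the upper tail gives the same p-value as (1-pnorm(z))×2 here, and it stays accurate for even larger z values, where 1 - pnorm(z) would round to zero.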
We started by looking at the dataset and finding variables that seemed interesting to us. We ended up focusing mainly on life expectancy and measles immunization, broken down by gender and continent. The boxplots and histograms show differences in life expectancy and immunization across continents, and there seems to be a relationship between life expectancy and measles immunization. It makes sense that with higher measles immunization, the survival rate for babies is higher, and those babies are more likely to live on and raise the overall life expectancy. However, there is probably a threshold beyond which increasing the measles immunization percentage no longer raises the average life expectancy of a continent.
Our method of attempting to answer this question was regression analysis. After constructing a scatter plot (and checking that there were no clear outliers), we fitted a linear regression line and obtained a correlation coefficient of 0.701, which indicates a fairly strong positive relationship. We also thought it would be interesting to make a scatter plot for each continent, and it turned out that the correlation coefficients were noticeably higher for Asia and Africa than for Europe and Oceania. For Europe, the measles immunization percentage is already so high that raising it further, even to 100%, probably would not increase life expectancy linearly.
In addition, a hypothesis test was conducted to compare the differences in life expectancy in Africa and Asia, and we wanted to know whether the difference was due to chance. In the end, the hypothesis test showed that it’s not due to chance, which means that the difference in life expectancy in Africa and Asia is real, possibly due to environmental and social factors.
“The World Bank: About Us.” https://data.worldbank.org/about.