We are a new social enterprise looking to help countries with low life expectancy and low government expenditure increase life expectancy with minimal additional cost.
Based on existing lifestyle choices in each selected country, what educational program can we offer to potentially increase life expectancy?
As a new organization, we want to market and sell health-related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential country clients will be the ones with the lowest life expectancy and the lowest health expenditure. Given the researched link between education and life expectancy, we are positioning ourselves as an enterprise that wants to increase global life expectancy by partnering with governments to educate citizens about healthy lifestyle choices (nutrition and alcohol consumption). Addressing BMI, alcohol use, and schooling is crucial to helping populations age more healthily and to preventing future costs for governments.
To conduct our analysis, we use data from the WHO, including the following variables:
Country: Nominal. Name of each country, whose alcohol consumption, BMI, and schooling will be evaluated over the span of 16 years.
Year: Interval. Data was collected over the span of 16 years, starting in 2000 and ending in 2015. The fact that the same data was collected over the years will allow an analysis that can consider the progression or change over time of the dependent variables.
Status: Nominal. Describes whether the country’s economy and societal structure are developed or developing per WHO definitions.
Life Expectancy: Ratio. The average number of years a person in the country is expected to live. We will use this as our dependent variable and will try to determine whether the other variables considered have a direct impact on it.
Alcohol: Ratio. Alcohol consumption per capita among people aged 15 and over (liters of pure alcohol).
Percentage Expenditure: Ratio. The percentage of a country’s Gross Domestic Product per capita that was spent on health.
BMI: Ratio. Average Body Mass Index (BMI) of entire population.
Schooling: Ratio. The average number of schooling years for citizens in each country.
To start our analysis, we structured and modified our data frame. We began by turning the Country, Year, and Status variables into factors. Then, we checked for missing values to decide whether to remove the affected rows or to impute them with averages.
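# Packages used throughout: dplyr (%>%, filter, summarise), ggplot2 (plots), Amelia (missmap)
library(dplyr)
library(ggplot2)
library(Amelia)
LE <- read.csv("/Users/BudsMac/Downloads/STATSII/Life Expectancy Data(1).csv")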
head(LE)
## Country.Updated Year Status Life.expectancy Alcohol
## 1 Afghanistan 2000 Developing 54.8 0.01
## 2 Afghanistan 2001 Developing 55.3 0.01
## 3 Afghanistan 2002 Developing 56.2 0.01
## 4 Afghanistan 2003 Developing 56.7 0.01
## 5 Afghanistan 2004 Developing 57.0 0.02
## 6 Afghanistan 2005 Developing 57.3 0.02
## Percentage.expenditure BMI Schooling
## 1 10.424960 21.7 5.5
## 2 10.574728 21.8 5.9
## 3 16.887351 21.9 6.2
## 4 11.089053 22.0 6.5
## 5 15.296066 22.1 6.8
## 6 1.388648 22.2 7.9
LE$Country.Updated <- factor(LE$Country.Updated)
LE$Year <- factor(LE$Year)
LE$Status <- factor(LE$Status)
To count the missing values, we used the sapply and missmap functions. Only 4% of the data is missing, so we now have to decide whether to remove the missing data or replace it with an average.
# Use `sapply` to identify the number of missing values per column
sapply(LE, function(x) sum(is.na(x)))
## Country.Updated Year Status
## 0 0 0
## Life.expectancy Alcohol Percentage.expenditure
## 3 191 574
## BMI Schooling
## 32 157
# Use `sapply` to identify the number of unique values per column
sapply(LE, function(x) length(unique(x)))
## Country.Updated Year Status
## 181 16 3
## Life.expectancy Alcohol Percentage.expenditure
## 363 1075 2322
## BMI Schooling
## 122 173
# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
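The share of missing cells can also be computed directly, without a plot. The following is a minimal sketch in base R on a hypothetical toy data frame (`toy` and its values are assumptions, not the WHO data):

```r
# Hypothetical toy data frame standing in for LE
toy <- data.frame(
  Alcohol = c(0.01, NA, 0.02, 0.03),
  BMI     = c(21.7, 21.8, NA, NA)
)

# Number and share of missing values per column
missing_count <- sapply(toy, function(x) sum(is.na(x)))
missing_share <- sapply(toy, function(x) mean(is.na(x)))

# Share of missing cells across the whole data frame
overall_share <- mean(is.na(toy))

print(missing_count)
print(round(100 * missing_share, 1))
print(round(100 * overall_share, 1))
```

Applied to `LE`, `mean(is.na(LE))` would give the overall proportion that the text refers to as roughly 4%.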
We now know that about 4% of the data is missing. We considered replacing the missing values with the overall average; however, we assume the data may vary with a country’s status (Developed or Developing), so we first need to check whether the overall average differs substantially from the per-status averages and distributions.
There is a large difference in life expectancy between Developed and Developing countries: on average, people in Developed countries live about 12 years longer (79.2 vs. 67.1 years).
Taking a closer look by graphing a histogram and a boxplot to compare the distributions for Developed and Developing countries, we see that:
- Life expectancy in Developing countries has a wider range than in Developed countries.
- Life expectancy in Developing countries is skewed to the left, while in Developed countries there is less variability.
- We could replace the missing values for Developed countries with the average, but it would not be appropriate to do the same for Developing countries.
- The safest approach is to remove the rows with missing values, which we do with filter and is.na.
# Group by status and calculate averages for life expectancy
by_status <- LE %>%
group_by(Status) %>%
summarise(
Life.expectancy = mean(Life.expectancy, na.rm=TRUE),
n = n()
)
mean.Life.expectancy<-arrange(by_status, Life.expectancy)
print(mean.Life.expectancy)
## # A tibble: 3 x 3
## Status Life.expectancy n
## <fct> <dbl> <int>
## 1 "Developing" 67.1 2366
## 2 "Developed" 79.2 527
## 3 "" NaN 2
# Life expectancy boxplots per status
BoxPlot <- ggplot(LE, aes(x=Status, y=Life.expectancy)) +
geom_boxplot() +
coord_flip()+
ggtitle("Life Expectancy per Status")
print(BoxPlot)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
# Life expectancy histograms per status
Developed.countries <- filter(LE, Status == "Developed")
Developed.Countries.Hist <- ggplot(data=Developed.countries) +
geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
ggtitle("Life expectancy in Developed Countries")
print(Developed.Countries.Hist)
Developing.countries <- filter(LE, Status == "Developing")
Developing.Countries.Hist <- ggplot(data=Developing.countries) +
geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
ggtitle("Life expectancy in Developing Countries")
print(Developing.Countries.Hist)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
We proceed to remove the missing values from the Life expectancy column. The dataset decreases from 2895 observations to 2892.
# Remove missing values.
LE <- LE %>%
filter(!is.na(Life.expectancy))  # the column is numeric, so a comparison with "" is not needed
head(LE)
## Country.Updated Year Status Life.expectancy Alcohol
## 1 Afghanistan 2000 Developing 54.8 0.01
## 2 Afghanistan 2001 Developing 55.3 0.01
## 3 Afghanistan 2002 Developing 56.2 0.01
## 4 Afghanistan 2003 Developing 56.7 0.01
## 5 Afghanistan 2004 Developing 57.0 0.02
## 6 Afghanistan 2005 Developing 57.3 0.02
## Percentage.expenditure BMI Schooling
## 1 10.424960 21.7 5.5
## 2 10.574728 21.8 5.9
## 3 16.887351 21.9 6.2
## 4 11.089053 22.0 6.5
## 5 15.296066 22.1 6.8
## 6 1.388648 22.2 7.9
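The row-removal step can be checked by comparing row counts before and after filtering. This is a small base R sketch on hypothetical toy data (the `toy` data frame is an assumption used only for illustration):

```r
# Hypothetical toy data with one missing Life.expectancy value
toy <- data.frame(
  Life.expectancy = c(54.8, NA, 56.2),
  Alcohol         = c(0.01, 0.01, NA)
)

rows_before <- nrow(toy)

# Keep only rows where Life.expectancy is present; other columns may still hold NAs
toy_clean <- toy[!is.na(toy$Life.expectancy), ]

rows_after <- nrow(toy_clean)
cat("Rows removed:", rows_before - rows_after, "\n")
```

Filtering one column at a time, as done here, keeps rows that are missing values in other columns until those columns are handled.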
We proceed with the second variable: Alcohol. We find that:
- Average alcohol consumption in Developed countries is almost 3 times that of Developing countries (9.90 vs. 3.46 liters, about 186% higher).
- Because of this large disparity between the group averages, the overall average cannot be used to replace the missing values.
- Alcohol consumption in Developing countries is skewed to the right, so replacing the missing values with the average could reduce the validity of the data.
- Since we cannot replace the values for both Developed and Developing countries, it is best to remove the rows with missing values.
- This supports our hypothesis that we need to analyze Developed and Developing countries separately, because the difference is large.
# Calculate average for Alcohol consumption in Developed and Developing countries.
by_status <- LE %>%
group_by(Status) %>%
summarise(
Alcohol = mean(Alcohol, na.rm=TRUE),
n = n()
)
mean.Alcohol<-arrange(by_status, Alcohol)
print(mean.Alcohol)
## # A tibble: 2 x 3
## Status Alcohol n
## <fct> <dbl> <int>
## 1 Developing 3.46 2365
## 2 Developed 9.90 527
# Alcohol consumption boxplot to compare consumption by status.
BoxPlot <- ggplot(LE, aes(x=Status, y=Alcohol)) +
geom_boxplot() +
coord_flip()+
ggtitle("Alcohol consumption per Status")
print(BoxPlot)
## Warning: Removed 188 rows containing non-finite values (stat_boxplot).
# Alcohol consumption histograms per Status
Developed.countries <- filter(LE, Status == "Developed")
Developed.Countries.Hist <- ggplot(data=Developed.countries) +
geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
ggtitle("Alcohol consumption in Developed Countries")
print(Developed.Countries.Hist)
## Warning: Removed 30 rows containing non-finite values (stat_bin).
Developing.countries <- filter(LE, Status == "Developing")
Developing.Countries.Hist <- ggplot(data=Developing.countries) +
geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
ggtitle("Alcohol Consumption in Developing Countries")
print(Developing.Countries.Hist)
## Warning: Removed 158 rows containing non-finite values (stat_bin).
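The two group means printed in the Alcohol summary (3.46 and 9.90 liters) can be turned into a ratio and a percentage difference with a few lines of arithmetic. The variable names below are illustrative assumptions:

```r
# Group means copied from the summary table above
developing_mean <- 3.46
developed_mean  <- 9.90

# How many times larger the Developed mean is
ratio <- developed_mean / developing_mean

# Percentage by which the Developed mean exceeds the Developing mean
pct_higher <- 100 * (developed_mean - developing_mean) / developing_mean

print(round(ratio, 2))    # 2.86
print(round(pct_higher))  # 186
```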
We proceed to remove the missing values from the Alcohol column. The number of observations decreases from 2892 to 2704.
# Remove missing values and verify that there aren't any missing values in that variable.
LE <- LE %>%
filter(!is.na(Alcohol))  # the column is numeric, so a comparison with "" is not needed
head(LE)
## Country.Updated Year Status Life.expectancy Alcohol
## 1 Afghanistan 2000 Developing 54.8 0.01
## 2 Afghanistan 2001 Developing 55.3 0.01
## 3 Afghanistan 2002 Developing 56.2 0.01
## 4 Afghanistan 2003 Developing 56.7 0.01
## 5 Afghanistan 2004 Developing 57.0 0.02
## 6 Afghanistan 2005 Developing 57.3 0.02
## Percentage.expenditure BMI Schooling
## 1 10.424960 21.7 5.5
## 2 10.574728 21.8 5.9
## 3 16.887351 21.9 6.2
## 4 11.089053 22.0 6.5
## 5 15.296066 22.1 6.8
## 6 1.388648 22.2 7.9
# Verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))
## Country.Updated Year Status
## 0 0 0
## Life.expectancy Alcohol Percentage.expenditure
## 0 0 387
## BMI Schooling
## 15 136
# Use `sapply` to identify the number of unique values per column
sapply(LE, function(x) length(unique(x)))
## Country.Updated Year Status
## 180 16 2
## Life.expectancy Alcohol Percentage.expenditure
## 359 1074 2318
## BMI Schooling
## 120 173
# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
As mentioned at the beginning, the Percentage expenditure variable shows the percentage of a country’s Gross Domestic Product that was spent on health (per capita).
- There is a massive difference between expenditure in Developed and Developing countries (group averages of 3328 vs. 383).
- Using the overall average is out of the question.
- Using the per-status average to replace the missing values could also be damaging, because many outliers skew the data. In particular, significant outliers among Developed countries may have inflated that group's average.
- Since we are looking for variables that may impact life expectancy, we will look closely at Percentage expenditure in our model.
# Percentage expenditure average by status.
by_status <- LE %>%
group_by(Status) %>%
summarise(
Percentage.expenditure = mean(Percentage.expenditure, na.rm=TRUE),
n = n()
)
mean.Percentage.expenditure<-arrange(by_status, Percentage.expenditure)
print(mean.Percentage.expenditure)
## # A tibble: 2 x 3
## Status Percentage.expenditure n
## <fct> <dbl> <int>
## 1 Developing 383. 2207
## 2 Developed 3328. 497
# Boxplots of Percentage Expenditure per status
BoxPlot <- ggplot(LE, aes(x=Status, y=Percentage.expenditure)) +
geom_boxplot() +
coord_flip() +
ggtitle("Percentage expenditure per Status")
print(BoxPlot)
## Warning: Removed 387 rows containing non-finite values (stat_boxplot).
# Percentage expenditure histograms per Status
Developed.countries <- filter(LE, Status == "Developed")
Developed.Countries.Hist <- ggplot(data=Developed.countries) +
geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
ggtitle("Percentage Expenditure in Developed Countries")
print(Developed.Countries.Hist)
## Warning: Removed 63 rows containing non-finite values (stat_bin).
Developing.countries <- filter(LE, Status == "Developing")
Developing.Countries.Hist <- ggplot(data=Developing.countries) +
geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
ggtitle("Percentage.expenditure in Developing Countries")
print(Developing.Countries.Hist)
## Warning: Removed 324 rows containing non-finite values (stat_bin).
We proceed to remove the missing values from the Percentage expenditure column. The number of observations decreases from 2704 to 2317.
# Remove missing values and verify that there aren't missing values.
LE <- LE %>%
filter(!is.na(Percentage.expenditure))  # the column is numeric, so a comparison with "" is not needed
head(LE)
## Country.Updated Year Status Life.expectancy Alcohol
## 1 Afghanistan 2000 Developing 54.8 0.01
## 2 Afghanistan 2001 Developing 55.3 0.01
## 3 Afghanistan 2002 Developing 56.2 0.01
## 4 Afghanistan 2003 Developing 56.7 0.01
## 5 Afghanistan 2004 Developing 57.0 0.02
## 6 Afghanistan 2005 Developing 57.3 0.02
## Percentage.expenditure BMI Schooling
## 1 10.424960 21.7 5.5
## 2 10.574728 21.8 5.9
## 3 16.887351 21.9 6.2
## 4 11.089053 22.0 6.5
## 5 15.296066 22.1 6.8
## 6 1.388648 22.2 7.9
# Verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))
## Country.Updated Year Status
## 0 0 0
## Life.expectancy Alcohol Percentage.expenditure
## 0 0 0
## BMI Schooling
## 15 14
# Use `sapply` to identify the number of unique values per column
sapply(LE, function(x) length(unique(x)))
## Country.Updated Year Status
## 157 16 2
## Life.expectancy Alcohol Percentage.expenditure
## 357 1000 2317
## BMI Schooling
## 119 173
# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
BMI is the first variable for which we do not find a large difference between the averages for Developed and Developing countries (25.9 vs. 24.8). However, we need to look at the distributions to see whether we can replace the missing values with the averages.
- As observed in the boxplot, there is one significant outlier in Developing countries.
- The BMI range for the two statuses is narrow, meaning we could replace the missing values with the average.
# BMI average per status
by_status <- LE %>%
group_by(Status) %>%
summarise(
BMI = mean(BMI, na.rm=TRUE),
n = n()
)
mean.BMI<-arrange(by_status, BMI)
print(mean.BMI)
## # A tibble: 2 x 3
## Status BMI n
## <fct> <dbl> <int>
## 1 Developing 24.8 1883
## 2 Developed 25.9 434
# Boxplot to compare BMI distribution per status.
BoxPlot <- ggplot(LE, aes(x=Status, y=BMI)) +
geom_boxplot() +
coord_flip()+
ggtitle("BMI per status")
print(BoxPlot)
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).
# BMI histograms per Status
Developed.countries <- filter(LE, Status == "Developed")
Developed.Countries.Hist <- ggplot(data=Developed.countries) +
geom_histogram(mapping= aes(x=BMI), binwidth=2) +
ggtitle("BMI in Developed Countries")
print(Developed.Countries.Hist)
Developing.countries <- filter(LE, Status == "Developing")
Developing.Countries.Hist <- ggplot(data=Developing.countries) +
geom_histogram(mapping= aes(x=BMI), binwidth=2) +
ggtitle("BMI in Developing Countries")
print(Developing.Countries.Hist)
## Warning: Removed 15 rows containing non-finite values (stat_bin).
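We proceed to replace the missing values. Note that, as written, the code below assigns the entire one-row summary (the mean and the count n) to the missing entries rather than the mean alone. This triggers the warnings shown and turns BMI into a list column, and because every missing BMI belongs to a Developing country, those entries are filled from the Developed summary.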
#Replace the missing BMI values with the average of the BMI for Developed and Developing countries.
Developed.countries.bmi <- filter(LE, Status == "Developed") %>%
summarise(
BMI = mean(BMI, na.rm=TRUE & BMI!=""),
n = n())
## Warning in if (na.rm) x <- x[!is.na(x)]: the condition has length > 1 and only
## the first element will be used
LE$BMI[is.na(LE$BMI)] <- Developed.countries.bmi
## Warning in LE$BMI[is.na(LE$BMI)] <- Developed.countries.bmi: number of items to
## replace is not a multiple of replacement length
Developing.countries.bmi <- filter(LE, Status == "Developing") %>%
summarise(
BMI = mean(BMI, na.rm=TRUE & BMI!=""),
n = n())
## Warning in mean.default(BMI, na.rm = TRUE & BMI != ""): argument is not numeric
## or logical: returning NA
LE$BMI[is.na(LE$BMI)] <- Developing.countries.bmi
# Verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))
## Country.Updated Year Status
## 0 0 0
## Life.expectancy Alcohol Percentage.expenditure
## 0 0 0
## BMI Schooling
## 0 14
# Use `sapply` to identify the number of unique values per column
sapply(LE, function(x) length(unique(x)))
## Country.Updated Year Status
## 157 16 2
## Life.expectancy Alcohol Percentage.expenditure
## 357 1000 2317
## BMI Schooling
## 120 173
# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
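The intent of the replacement step is to fill each missing BMI with the mean of its own status group while keeping the column numeric. Below is a minimal sketch of that idea in base R, using a hypothetical toy data frame (`toy`) rather than the real data. The warnings in the output above suggest the original assignment did not behave this way, so this is offered as one possible alternative and not as the code that produced the reported results.

```r
# Hypothetical toy data: one missing BMI in each status group
toy <- data.frame(
  Status = c("Developed", "Developed", "Developing", "Developing", "Developing"),
  BMI    = c(26.0, NA, 24.0, 25.0, NA)
)

# Mean of the non-missing values within each status group, aligned to the rows
group_mean <- ave(toy$BMI, toy$Status, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; the column stays a plain numeric vector
toy$BMI <- ifelse(is.na(toy$BMI), group_mean, toy$BMI)

print(toy)
```

Because `ave` returns one value per row, the replacement has the same length as the column and no recycling takes place. The same pattern would apply to Schooling.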
### Schooling
The Schooling variable shows the average number of schooling years for citizens in each country.
- At first glance, the difference between the averages (16.0 vs. 11.2 years) is smaller in relative terms than for Alcohol or Percentage expenditure.
- The distribution is skewed neither to the left nor to the right; it is roughly bell-shaped.
- We could replace the missing values with the per-status average: missing values for Developed countries would be replaced with the Schooling average for Developed countries, and missing values for Developing countries with the Schooling average for Developing countries.
# Schooling years average per status
by_status <- LE %>%
group_by(Status) %>%
summarise(
Schooling = mean(Schooling, na.rm=TRUE),
n = n()
)
mean.Schooling<-arrange(by_status, Schooling)
print(mean.Schooling)
## # A tibble: 2 x 3
## Status Schooling n
## <fct> <dbl> <int>
## 1 Developing 11.2 1883
## 2 Developed 16.0 434
# Boxplots for Average Schooling years per status
BoxPlot <- ggplot(LE, aes(x=Status, y=Schooling)) +
geom_boxplot() +
coord_flip()
print(BoxPlot)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
The boxplots show that years of schooling nevertheless differ noticeably between Developing and Developed countries.
# Schooling histograms per Status
Developed.countries <- filter(LE, Status == "Developed")
Developed.Countries.Hist <- ggplot(data=Developed.countries) +
geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
ggtitle("Schooling in Developed Countries")
print(Developed.Countries.Hist)
Developing.countries <- filter(LE, Status == "Developing")
Developing.Countries.Hist <- ggplot(data=Developing.countries) +
geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
ggtitle("Schooling in Developing Countries")
print(Developing.Countries.Hist)
## Warning: Removed 14 rows containing non-finite values (stat_bin).
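We proceed to replace the missing values in the same way as for BMI. As with BMI, the assignment stores the whole summary (mean and n) rather than the mean alone, so Schooling also becomes a list column.
# Replace the missing Schooling values with the average Schooling for Developed and Developing countries.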
Developed.countries.schooling <- filter(LE, Status == "Developed") %>%
summarise(
Schooling = mean(Schooling, na.rm=TRUE & Schooling!=""),
n = n())
## Warning in if (na.rm) x <- x[!is.na(x)]: the condition has length > 1 and only
## the first element will be used
LE$Schooling[is.na(LE$Schooling)] <- Developed.countries.schooling
Developing.countries.schooling <- filter(LE, Status == "Developing") %>%
summarise(
Schooling = mean(Schooling, na.rm=TRUE & Schooling!=""),
n = n())
## Warning in mean.default(Schooling, na.rm = TRUE & Schooling != ""): argument is
## not numeric or logical: returning NA
LE$Schooling[is.na(LE$Schooling)] <- Developing.countries.schooling
# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))
## Country.Updated Year Status
## 0 0 0
## Life.expectancy Alcohol Percentage.expenditure
## 0 0 0
## BMI Schooling
## 0 0
# Use `sapply` to identify the number of unique values per column
sapply(LE, function(x) length(unique(x)))
## Country.Updated Year Status
## 157 16 2
## Life.expectancy Alcohol Percentage.expenditure
## 357 1000 2317
## BMI Schooling
## 120 174
# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
Now we have a complete data set with no missing values.
We concluded that Percentage expenditure, Alcohol consumption and Life expectancy are the variables that present the biggest differences between Developed and Developing countries.
Our assumption that we need to analyze the data by status proved to be correct.
Questions that arise: given that the averages for BMI and Schooling differ less by status, could Percentage expenditure and Alcohol be the variables that influence Life expectancy the most?
As mentioned above, we want to market and sell health-related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential country clients will be the ones with the lowest life expectancy and the lowest health expenditure.
To determine our potential clients, should we choose countries by lowest Life expectancy or by lowest Percentage expenditure? We rank the Developed and Developing countries both ways.
Developed.countries <- filter(LE, Status == "Developed")
by_country<- group_by(Developed.countries, Country.Updated)
Country_sum <-filter(by_country)
Country_sum<- summarise(Country_sum,
Percentage.expenditure=mean(Percentage.expenditure, na.rm=TRUE),
Life.expectancy=mean(Life.expectancy, na.rm=TRUE))
Country_sum <- arrange(Country_sum, Life.expectancy, Percentage.expenditure)
print(Country_sum)
## # A tibble: 29 x 3
## Country.Updated Percentage.expenditure Life.expectancy
## <fct> <dbl> <dbl>
## 1 Bulgaria 374. 72.7
## 2 Lithuania 1083. 72.8
## 3 Latvia 566. 73.7
## 4 Hungary 402. 73.7
## 5 Romania 455. 74.0
## 6 Estonia 917. 74.8
## 7 Poland 331. 75.5
## 8 Denmark 5668. 78.8
## 9 Slovenia 1660. 79.2
## 10 Cyprus 995. 79.3
## # … with 19 more rows
Developing.countries <- filter(LE, Status == "Developing")
by_country<- group_by(Developing.countries, Country.Updated)
Country_sum <-filter(by_country)
Country_sum<- summarise(Country_sum,
Percentage.expenditure=mean(Percentage.expenditure, na.rm=TRUE),
Life.expectancy=mean(Life.expectancy, na.rm=TRUE))
Country_sum <- arrange(Country_sum, Life.expectancy, Percentage.expenditure)
print(Country_sum)
## # A tibble: 128 x 3
## Country.Updated Percentage.expenditure Life.expectancy
## <fct> <dbl> <dbl>
## 1 Sierra Leone 31.0 45.8
## 2 Central African Republic 43.6 48.2
## 3 Lesotho 87.6 48.5
## 4 Angola 109. 48.8
## 5 Malawi 27.6 49.3
## 6 Chad 34.4 50.2
## 7 Swaziland 297. 50.8
## 8 Nigeria 91.1 51.1
## 9 Zimbabwe 32.6 51.6
## 10 Mozambique 39.5 53.1
## # … with 118 more rows
The 5 Developing countries with the lowest Life expectancy are:
1. Sierra Leone
2. Central African Republic
3. Lesotho
4. Angola
5. Malawi
The 5 Developed countries with the lowest Life expectancy are:
1. Bulgaria
2. Lithuania
3. Latvia
4. Hungary
5. Romania
Developed.countries <- filter(LE, Status == "Developed")
by_country<- group_by(Developed.countries, Country.Updated)
Country_sum <-filter(by_country)
Country_sum<- summarise(Country_sum,
Percentage.expenditure=mean(Percentage.expenditure, na.rm=TRUE),
Life.expectancy=mean(Life.expectancy, na.rm=TRUE))
Country_sum <- arrange(Country_sum, Percentage.expenditure, Life.expectancy)
print(Country_sum)
## # A tibble: 29 x 3
## Country.Updated Percentage.expenditure Life.expectancy
## <fct> <dbl> <dbl>
## 1 Poland 331. 75.5
## 2 Bulgaria 374. 72.7
## 3 Hungary 402. 73.7
## 4 Romania 455. 74.0
## 5 Latvia 566. 73.7
## 6 Estonia 917. 74.8
## 7 Cyprus 995. 79.3
## 8 Lithuania 1083. 72.8
## 9 Malta 1260. 80.3
## 10 Slovenia 1660. 79.2
## # … with 19 more rows
Developing.countries <- filter(LE, Status == "Developing")
by_country<- group_by(Developing.countries, Country.Updated)
Country_sum <-filter(by_country)
Country_sum<- summarise(Country_sum,
Percentage.expenditure=mean(Percentage.expenditure, na.rm=TRUE),
Life.expectancy=mean(Life.expectancy, na.rm=TRUE))
Country_sum <- arrange(Country_sum, Percentage.expenditure, Life.expectancy)
print(Country_sum)
## # A tibble: 128 x 3
## Country.Updated Percentage.expenditure Life.expectancy
## <fct> <dbl> <dbl>
## 1 Eritrea 8.58 59.5
## 2 Myanmar 13.7 64.0
## 3 Burundi 16.4 55.3
## 4 Guinea 17.0 55.8
## 5 Tajikistan 18.4 66.5
## 6 Niger 20.4 56.7
## 7 Rwanda 21.5 58.9
## 8 Timor-Leste 22.6 64.5
## 9 Senegal 22.9 62.3
## 10 Guinea-Bissau 23.4 55.1
## # … with 118 more rows
The 5 Developing countries with the lowest Percentage expenditure are:
1. Eritrea
2. Myanmar
3. Burundi
4. Guinea
5. Tajikistan
The 5 Developed countries with the lowest Percentage expenditure are:
1. Poland
2. Bulgaria
3. Hungary
4. Romania
5. Latvia
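The four ranking blocks above repeat the same steps: average each metric per country, sort, and read off the top rows. A minimal base R sketch of that pattern follows, using hypothetical toy data and a hypothetical helper named `lowest_n` (neither appears in the original analysis):

```r
# Hypothetical toy data standing in for one status group of LE
toy <- data.frame(
  Country                = c("A", "A", "B", "B", "C", "C"),
  Life.expectancy        = c(70, 72, 50, 52, 60, 62),
  Percentage.expenditure = c(300, 320, 40, 60, 20, 30)
)

# Average each metric per country across years
country_means <- aggregate(
  cbind(Life.expectancy, Percentage.expenditure) ~ Country,
  data = toy, FUN = mean
)

# Hypothetical helper: the n countries with the lowest average of `metric`
lowest_n <- function(df, metric, n = 5) {
  head(df[order(df[[metric]]), ], n)
}

print(lowest_n(country_means, "Life.expectancy", n = 2))
print(lowest_n(country_means, "Percentage.expenditure", n = 2))
```

Calling the helper once per metric makes it easy to compare the two candidate lists and see which countries appear in both.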
2. Examination of Correlations for all continuous variables.
2.1 First, we noticed that the BMI variable was stored as a list rather than as a numeric vector, a side effect of the replacement step above, so we unlisted it and converted it to numeric. Next, we created a correlation matrix for Life expectancy, Alcohol, Percentage expenditure, and BMI with rcorr from the Hmisc package, naming the result cont.cor (Schooling falls outside the selected column range). Its r component holds the correlation coefficients and its P component the p-values; we printed both with signif, which rounds to six significant digits by default.
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
LE$BMI <- unlist(LE$BMI)
LE$BMI <- as.numeric(LE$BMI)
cont <- LE %>% select(Life.expectancy:BMI)
cont.cor <- rcorr(as.matrix(cont))
signif(cont.cor$r)
## Life.expectancy Alcohol Percentage.expenditure
## Life.expectancy 1.0000000 0.3865380 0.42795800
## Alcohol 0.3865380 1.0000000 0.38041200
## Percentage.expenditure 0.4279580 0.3804120 1.00000000
## BMI 0.0150871 -0.0158426 -0.00338359
## BMI
## Life.expectancy 0.01508710
## Alcohol -0.01584260
## Percentage.expenditure -0.00338359
## BMI 1.00000000
signif(cont.cor$P)
## Life.expectancy Alcohol Percentage.expenditure BMI
## Life.expectancy NA 0.000000 0.00000 0.467919
## Alcohol 0.000000 NA 0.00000 0.445928
## Percentage.expenditure 0.000000 0.000000 NA 0.870690
## BMI 0.467919 0.445928 0.87069 NA
cont.cor
## Life.expectancy Alcohol Percentage.expenditure BMI
## Life.expectancy 1.00 0.39 0.43 0.02
## Alcohol 0.39 1.00 0.38 -0.02
## Percentage.expenditure 0.43 0.38 1.00 0.00
## BMI 0.02 -0.02 0.00 1.00
##
## n= 2317
##
##
## P
## Life.expectancy Alcohol Percentage.expenditure BMI
## Life.expectancy 0.0000 0.0000 0.4679
## Alcohol 0.0000 0.0000 0.4459
## Percentage.expenditure 0.0000 0.0000 0.8707
## BMI 0.4679 0.4459 0.8707
Significant correlations: Alcohol with Life expectancy (r = 0.39), Percentage expenditure with Life expectancy (r = 0.43), and Alcohol with Percentage expenditure (r = 0.38). All of these correlations are positive, but they are only weak to moderate in strength.
We found that BMI is not significantly correlated with any of the other variables (all p-values above 0.44). Because the replacement step above also wrote the group count into some BMI entries, this result may be distorted.
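For a single pair of variables, the coefficient and its p-value can also be obtained with base R's `cor.test`. The sketch below uses small hypothetical vectors (`x` and `y` are assumptions, not columns of LE) to show where the two numbers come from:

```r
# Hypothetical data: y follows x with a small alternating disturbance
x <- 1:20
y <- x + rep(c(-1, 1), times = 10)

# Pearson correlation test: estimate is r, p.value tests the null hypothesis r = 0
res <- cor.test(x, y, method = "pearson")

print(unname(res$estimate))
print(res$p.value)
```

A pair is usually called significantly correlated when the p-value falls below a chosen threshold such as 0.05; the size of r then indicates how strong the linear association is.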
2.2 Scatterplot Matrix. To plot the continuous variables in a scatterplot matrix, we decided to remove BMI because it was not significantly correlated with anything. This means we explored the correlations among Life expectancy, Alcohol, and Percentage expenditure. To do this, we selected those columns from the data set and named the result cont.sig. Then, we used pairs to draw the scatterplot matrix.
cont.sig <- LE %>% select(Life.expectancy:Percentage.expenditure)
pairs(cont.sig, pch=19, lower.panel = NULL)
The top-right panel shows the strongest correlation: Life expectancy versus Percentage expenditure. As life expectancy increases, so does percentage expenditure on health care, so the two are positively correlated. However, beyond a certain life expectancy, percentage expenditure appears to level out rather than continue to increase. The relationship between life expectancy and percentage expenditure therefore looks logarithmic rather than linear. The same may be true of the relationship between alcohol and life expectancy, which are also positively correlated. Alcohol and percentage expenditure are positively correlated as well, but the relationship is weaker than in the other two panels.
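One way to probe a suspected logarithmic relationship is to compare the correlation on the raw scale with the correlation after log-transforming one variable. The sketch below uses hypothetical values constructed to follow an exact log relation, so it only illustrates the technique and says nothing about the real data:

```r
# Hypothetical expenditure values spanning several orders of magnitude
expenditure <- c(10, 30, 100, 300, 1000, 3000)

# Hypothetical life expectancy built as an exact function of log10(expenditure)
life_exp <- 40 + 10 * log10(expenditure)

# Pearson correlation on the raw scale and on the log scale
cor_raw <- cor(life_exp, expenditure)
cor_log <- cor(life_exp, log10(expenditure))

print(cor_raw)
print(cor_log)
```

If the log-scale correlation is clearly higher on the real data, that would support using log(Percentage expenditure) in a linear model. The log transform requires strictly positive values, so any zero expenditures would need to be handled first.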