R Final Project

We are a new social enterprise looking to help countries with low life expectancy and low government expenditure increase life expectancy with minimal additional cost.

Based on these existing lifestyle choices in each country selected, what educational program can we offer them to potentially increase life expectancy?

As a new organization, we want to market and sell health-related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential clients will be the countries with the lowest life expectancy and the lowest health expenditure. Given the researched link between education and life expectancy, we are positioning ourselves as an enterprise that wants to increase global life expectancy by partnering with governments to educate citizens about healthy lifestyle choices (nutrition and alcohol consumption). Addressing BMI, alcohol use, and schooling is crucial to helping populations age more healthily and to preventing future costs for governments.

To conduct our analysis, we will use data from the WHO, including the following variables:

Country: Nominal. Name of each country, whose alcohol consumption, BMI, and schooling will be evaluated over the span of 16 years.

Year: Interval. Data was collected over the span of 16 years, starting in 2000 and ending in 2015. The fact that the same data was collected over the years will allow an analysis that can consider the progression or change over time of the dependent variables.

Status: Nominal. Describes if the country’s economy and societal structure is developed or developing per WHO definitions.

Life Expectancy: Ratio. The average number of years a person in the country is expected to live. We will use it as our dependent variable and try to determine whether the other variables have a measurable impact on it.

Alcohol: Ratio. Alcohol consumption per capita (ages 15+), in liters of pure alcohol.

Percentage Expenditure: Ratio. The percentage of a country’s Gross Domestic Product that was spent on health (per capita).

BMI: Ratio. Average Body Mass Index (BMI) of entire population.

Schooling: Ratio. The average number of schooling years for citizens in each country.

Part 1: Organizing and structuring the Dataset

To start our analysis, we structured and modified our data frame. We began by turning the Country, Year and Status variables into factors. Then, we checked for missing values to determine whether to remove the affected rows (for example by filtering on is.na) or to impute the averages.

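# Packages used throughout: dplyr (%>%, filter, group_by, summarise),
# ggplot2 (plots) and Amelia (missmap)
library(dplyr)
library(ggplot2)
library(Amelia)

LE <- read.csv("/Users/BudsMac/Downloads/STATSII/Life Expectancy Data(1).csv")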
head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1     Afghanistan 2000 Developing            54.8    0.01
## 2     Afghanistan 2001 Developing            55.3    0.01
## 3     Afghanistan 2002 Developing            56.2    0.01
## 4     Afghanistan 2003 Developing            56.7    0.01
## 5     Afghanistan 2004 Developing            57.0    0.02
## 6     Afghanistan 2005 Developing            57.3    0.02
##   Percentage.expenditure  BMI Schooling
## 1              10.424960 21.7       5.5
## 2              10.574728 21.8       5.9
## 3              16.887351 21.9       6.2
## 4              11.089053 22.0       6.5
## 5              15.296066 22.1       6.8
## 6               1.388648 22.2       7.9

LE$Country.Updated <- factor(LE$Country.Updated)
LE$Year <- factor(LE$Year)
LE$Status <- factor(LE$Status)
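
The Status column later shows a third, empty category (""), which suggests that blank cells were read in as empty strings. As a minimal sketch, assuming a small inline CSV that stands in for the real file (its values are illustrative only), the na.strings argument of read.csv turns blank fields into NA so that the factor keeps only the two real categories:

```r
# Toy CSV text standing in for the real file (illustrative values only)
csv_text <- "Country,Year,Status,Life.expectancy
A,2000,Developing,54.8
B,2000,Developed,79.1
C,2000,,NA"

# na.strings makes read.csv treat empty fields as NA instead of ""
toy <- read.csv(text = csv_text, na.strings = c("", "NA"),
                stringsAsFactors = FALSE)

toy$Status <- factor(toy$Status)

status_levels    <- levels(toy$Status)      # only the two real categories
n_missing_status <- sum(is.na(toy$Status))  # blank Status now counted as NA

print(status_levels)
print(n_missing_status)
```

With this option the blank rows would show up in the missing-value counts for Status instead of forming their own group.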

1.1 Missing values

To verify how many missing values we have, we used the sapply and missmap functions (missmap comes from the Amelia package). Only about 4% of the data is missing, so we now have to decide whether to remove the missing data or replace it with an average.

# Use `sapply` to identify the number of missing values per column
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      3                    191                    574 
##                    BMI              Schooling 
##                     32                    157

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    181                     16                      3 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    363                   1075                   2322 
##                    BMI              Schooling 
##                    122                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")
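
The 4% figure can be checked from the counts printed above: 957 missing cells out of 2895 rows times 8 columns. The sketch below redoes that arithmetic and shows, on a hypothetical toy data frame, how the same share could be computed directly from the data:

```r
# Missing cells per column, copied from the sapply output above
missing_counts <- c(Life.expectancy = 3, Alcohol = 191,
                    Percentage.expenditure = 574, BMI = 32, Schooling = 157)

# 2895 rows x 8 columns in the original data frame
pct_missing_reported <- sum(missing_counts) / (2895 * 8) * 100

# Same idea computed from a data frame (toy values, for illustration only)
toy <- data.frame(
  Alcohol   = c(0.01, NA, 0.02, 0.03),
  BMI       = c(21.7, 21.8, NA, 22.0),
  Schooling = c(5.5, 5.9, 6.2, 6.5)
)
pct_missing_by_col <- colMeans(is.na(toy)) * 100  # share of NA per column
pct_missing_total  <- mean(is.na(toy)) * 100      # share of NA over all cells

print(round(pct_missing_reported, 2))
print(pct_missing_by_col)
```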

1.2 Distribution

We now know that we are missing about 4% of the data. We thought about using the general average to replace the missing values; however, one of our assumptions is that the data may vary depending on the country’s status (developed or developing), so we need to check whether the general averages differ substantially from the averages and distributions of the two categories.

1.2.1 Life expectancy

There is a large difference in life expectancy between Developed and Developing countries: people in Developed countries live about 12 years longer on average (79.2 vs. 67.1 years).

Taking a closer look, by graphing a histogram and a boxplot to compare the distributions of Developed and Developing countries, we see that:
- Life expectancy in Developing countries has a wider range than in Developed countries.
- Life expectancy in Developing countries is skewed to the left, while in Developed countries the variability is smaller.
- We could replace the missing values for Developed countries with the average, but it would not be appropriate to do the same with the data for Developing countries.
- The safest approach would be to remove the rows with missing data, for example by filtering on is.na.
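
To illustrate why a skewed distribution makes the average a risky replacement value, here is a minimal sketch on a hypothetical left-skewed sample (the numbers are invented for illustration and are not taken from the WHO data). The mean is pulled toward the low tail, so imputing it would place the missing countries below the typical value:

```r
# Hypothetical left-skewed sample: most values are high, a few are very low
life_exp <- c(45, 50, 62, 68, 70, 71, 72, 73, 74, 75)

m   <- mean(life_exp)    # pulled toward the low tail
med <- median(life_exp)  # less sensitive to the tail

print(c(mean = m, median = med))
```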

# Group by status and calculate averages for life expectancy

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Life.expectancy = mean(Life.expectancy, na.rm=TRUE), 
    n = n()
  )
  
mean.Life.expectancy<-arrange(by_status, Life.expectancy)
print(mean.Life.expectancy)

## # A tibble: 3 x 3
##   Status       Life.expectancy     n
##   <fct>                  <dbl> <int>
## 1 "Developing"            67.1  2366
## 2 "Developed"             79.2   527
## 3 ""                     NaN       2

# Life expectancy boxplots per status

BoxPlot <- ggplot(LE, aes(x=Status, y=Life.expectancy)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Life Expectancy per Status")
print(BoxPlot)

## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

# Life expectancy histograms per status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
  ggtitle("Life expectancy in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
  ggtitle("Life expectancy in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Life expectancy column. The dataset decreases from 2895 observations to 2892.

# Remove missing values. 

LE <- LE %>%
   filter(!is.na(Life.expectancy))  # the column is numeric, so an is.na check is enough

head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1     Afghanistan 2000 Developing            54.8    0.01
## 2     Afghanistan 2001 Developing            55.3    0.01
## 3     Afghanistan 2002 Developing            56.2    0.01
## 4     Afghanistan 2003 Developing            56.7    0.01
## 5     Afghanistan 2004 Developing            57.0    0.02
## 6     Afghanistan 2005 Developing            57.3    0.02
##   Percentage.expenditure  BMI Schooling
## 1              10.424960 21.7       5.5
## 2              10.574728 21.8       5.9
## 3              16.887351 21.9       6.2
## 4              11.089053 22.0       6.5
## 5              15.296066 22.1       6.8
## 6               1.388648 22.2       7.9

1.2.2 Alcohol

We proceed with the second variable: Alcohol. We find that:
- Average alcohol consumption in Developed countries is almost 3 times that of Developing countries (9.90 vs. 3.46 liters, about 186% higher).
- Because the two group averages are so far apart, the general average cannot be used to replace the missing values.
- Alcohol consumption in Developing countries is skewed to the right, so replacing the missing values with the average could lessen the validity of the data.
- Since we cannot impute for both Developed and Developing countries, it is best to remove the missing values.
- This supports our hypothesis that we need to analyze the data comparing Developed vs. Developing countries, because the difference is large.

# Calculate average for Alcohol consumption in Developed and Developing countries. 

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Alcohol = mean(Alcohol, na.rm=TRUE), 
    n = n()
  )
  
mean.Alcohol<-arrange(by_status, Alcohol)
print(mean.Alcohol)

## # A tibble: 2 x 3
##   Status     Alcohol     n
##   <fct>        <dbl> <int>
## 1 Developing    3.46  2365
## 2 Developed     9.90   527
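
As a quick worked check of the gap between the two group means printed above, the ratio and the percentage difference can be computed as follows (the two inputs are the rounded means from the table):

```r
developed_mean  <- 9.90  # Developed, from the summary table above
developing_mean <- 3.46  # Developing, from the summary table above

ratio      <- developed_mean / developing_mean                          # about 2.86
pct_higher <- (developed_mean - developing_mean) / developing_mean * 100  # about 186

print(round(c(ratio = ratio, pct_higher = pct_higher), 2))
```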

# Alcohol consumption boxplot to compare consumption by status.

BoxPlot <- ggplot(LE, aes(x=Status, y=Alcohol)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Alcohol consumption per Status")
print(BoxPlot)

## Warning: Removed 188 rows containing non-finite values (stat_boxplot).

# Alcohol consumption histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
  ggtitle("Alcohol consumption in Developed Countries")
  
print(Developed.Countries.Hist)

## Warning: Removed 30 rows containing non-finite values (stat_bin).

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
  ggtitle("Alcohol Consumption in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 158 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Alcohol column. The number of observations decreases from 2892 to 2704.

# Remove missing values and verify that there aren't any missing values in that variable. 

LE <- LE %>%
   filter(!is.na(Alcohol))  # the column is numeric, so an is.na check is enough

head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1     Afghanistan 2000 Developing            54.8    0.01
## 2     Afghanistan 2001 Developing            55.3    0.01
## 3     Afghanistan 2002 Developing            56.2    0.01
## 4     Afghanistan 2003 Developing            56.7    0.01
## 5     Afghanistan 2004 Developing            57.0    0.02
## 6     Afghanistan 2005 Developing            57.3    0.02
##   Percentage.expenditure  BMI Schooling
## 1              10.424960 21.7       5.5
## 2              10.574728 21.8       5.9
## 3              16.887351 21.9       6.2
## 4              11.089053 22.0       6.5
## 5              15.296066 22.1       6.8
## 6               1.388648 22.2       7.9

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                    387 
##                    BMI              Schooling 
##                     15                    136

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    180                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    359                   1074                   2318 
##                    BMI              Schooling 
##                    120                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.2.3 Percentage expenditure

As mentioned in the beginning, the percentage expenditure variable shows the percentage of a country’s Gross Domestic Product that was spent on health (per capita).
- There is a massive difference between the expenditure in Developed vs. Developing countries.
- Using the general average is out of the question.
- Using the average per status to replace the missing values can also be damaging, as there are many outliers that can skew the data. In particular, there are significant outliers in Developed countries that may have inflated the percentage expenditure average.
- If we are looking for variables that may impact life expectancy, we will look closely at Percentage expenditure in our model.
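
To make the outlier concern concrete, the sketch below uses hypothetical expenditure values with one extreme observation. It relies on boxplot.stats from base R, which flags points beyond 1.5 times the hinge spread, a rule similar to the one behind boxplot whiskers, and then compares the mean with and without the flagged points:

```r
# Hypothetical expenditure values with one extreme observation
expenditure <- c(10, 12, 15, 17, 20, 22, 25, 400)

# Points beyond 1.5 x the hinge spread are reported in $out
outliers <- boxplot.stats(expenditure)$out

mean_all     <- mean(expenditure)                               # inflated by the outlier
mean_trimmed <- mean(expenditure[!expenditure %in% outliers])   # outlier excluded

print(outliers)
print(c(mean_all = mean_all, mean_trimmed = mean_trimmed))
```

A single extreme value more than triples the average here, which is why an imputed group mean could misrepresent a typical country.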

# Percentage expenditure average by status.

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm=TRUE), 
    n = n()
  )
  
mean.Percentage.expenditure<-arrange(by_status, Percentage.expenditure)
print(mean.Percentage.expenditure)

## # A tibble: 2 x 3
##   Status     Percentage.expenditure     n
##   <fct>                       <dbl> <int>
## 1 Developing                   383.  2207
## 2 Developed                   3328.   497

# Boxplots of Percentage Expenditure per status 

BoxPlot <- ggplot(LE, aes(x=Status, y=Percentage.expenditure)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Percentage expenditure per Status")

print(BoxPlot)

## Warning: Removed 387 rows containing non-finite values (stat_boxplot).

# Percentage expenditure histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
  ggtitle("Percentage Expenditure in Developed Countries")
  
print(Developed.Countries.Hist)

## Warning: Removed 63 rows containing non-finite values (stat_bin).

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
  ggtitle("Percentage Expenditure in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 324 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Percentage expenditure column. The number of observations decreases from 2704 to 2317.

# Remove missing values and verify that there aren't missing values. 

LE <- LE %>%
   filter(!is.na(Percentage.expenditure))  # the column is numeric, so an is.na check is enough

head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1     Afghanistan 2000 Developing            54.8    0.01
## 2     Afghanistan 2001 Developing            55.3    0.01
## 3     Afghanistan 2002 Developing            56.2    0.01
## 4     Afghanistan 2003 Developing            56.7    0.01
## 5     Afghanistan 2004 Developing            57.0    0.02
## 6     Afghanistan 2005 Developing            57.3    0.02
##   Percentage.expenditure  BMI Schooling
## 1              10.424960 21.7       5.5
## 2              10.574728 21.8       5.9
## 3              16.887351 21.9       6.2
## 4              11.089053 22.0       6.5
## 5              15.296066 22.1       6.8
## 6               1.388648 22.2       7.9

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                     15                     14

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    157                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                   1000                   2317 
##                    BMI              Schooling 
##                    119                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.2.4 BMI

BMI is the first variable for which the averages do not differ much between Developed and Developing countries (25.9 vs. 24.8). However, we need to look at the distributions to see if we can replace the missing values with the averages.
- As observed in the boxplot, there is one significant outlier in Developing countries.
- The BMI range for the two statuses is narrow, meaning we could replace the missing values with the average.

# BMI average per status

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    BMI = mean(BMI, na.rm=TRUE), 
    n = n()
  )
  
mean.BMI<-arrange(by_status, BMI)
print(mean.BMI)

## # A tibble: 2 x 3
##   Status       BMI     n
##   <fct>      <dbl> <int>
## 1 Developing  24.8  1883
## 2 Developed   25.9   434

# Boxplot to compare BMI distribution per status. 

BoxPlot <- ggplot(LE, aes(x=Status, y=BMI)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("BMI per status")
print(BoxPlot)

## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

# BMI histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=BMI), binwidth=2) +
  ggtitle("BMI in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=BMI), binwidth=2) +
  ggtitle("BMI in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 15 rows containing non-finite values (stat_bin).

We proceed to replace the missing values.

# Replace each missing BMI value with the average BMI of the country's own Status group.
# mean() is evaluated within each group, so every NA receives a single number and
# BMI stays a plain numeric column. na.rm must be a single TRUE/FALSE value.

LE <- LE %>%
  group_by(Status) %>%
  mutate(BMI = ifelse(is.na(BMI), mean(BMI, na.rm = TRUE), BMI)) %>%
  ungroup() %>%
  as.data.frame()
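
Group-wise imputation can be checked on a small example before applying it to the full data. The sketch below uses a hypothetical toy data frame and base R's ave, which returns for every row the mean of that row's own Status group. Only the NA cells are filled, and the column remains an ordinary numeric vector rather than a list:

```r
# Toy data (hypothetical values): one missing BMI in each Status group
toy <- data.frame(
  Status = c("Developed", "Developed", "Developing", "Developing", "Developing"),
  BMI    = c(26, NA, 24, 25, NA)
)

# For each row, the mean BMI of that row's Status group, ignoring NAs
group_mean <- ave(toy$BMI, toy$Status, FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing cells with their group mean
toy$BMI <- ifelse(is.na(toy$BMI), group_mean, toy$BMI)

print(toy)
print(is.numeric(toy$BMI))
```

Assigning a whole summary table to the missing cells, instead of a single number per group, is what turns a column into a list and triggers "number of items to replace" warnings.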

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                      0                     14

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    157                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                   1000                   2317 
##                    BMI              Schooling 
##                    120                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.2.5 Schooling

The Schooling variable shows the average number of schooling years for citizens in each country.
- At first glance, the gap between the averages (11.2 vs. 16.0 years) looks smaller than for Alcohol or Percentage expenditure.
- The distribution is skewed neither to the left nor to the right; it is roughly bell-shaped.
- We could replace the missing values with the average, meaning that missing values for Developed countries will be replaced with the Schooling average for Developed countries, and missing values for Developing countries with the Schooling average for Developing countries.

# Schooling years average per status

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Schooling = mean(Schooling, na.rm=TRUE), 
    n = n()
  )
  
mean.Schooling<-arrange(by_status, Schooling)
print(mean.Schooling)

## # A tibble: 2 x 3
##   Status     Schooling     n
##   <fct>          <dbl> <int>
## 1 Developing      11.2  1883
## 2 Developed       16.0   434

# Boxplots for Average Schooling years per status

BoxPlot <- ggplot(LE, aes(x=Status, y=Schooling)) +
  geom_boxplot() +
  coord_flip()
print(BoxPlot)

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

The boxplots show that years of schooling do differ noticeably between Developing and Developed countries.

# Schooling histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
  ggtitle("Schooling in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
  ggtitle("Schooling in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 14 rows containing non-finite values (stat_bin).

We proceed to replace the missing values.

# Replace each missing Schooling value with the average Schooling of the country's own Status group.
# mean() is evaluated within each group, so every NA receives a single number and
# Schooling stays a plain numeric column.

LE <- LE %>%
  group_by(Status) %>%
  mutate(Schooling = ifelse(is.na(Schooling), mean(Schooling, na.rm = TRUE), Schooling)) %>%
  ungroup() %>%
  as.data.frame()

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                      0                      0

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    157                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                   1000                   2317 
##                    BMI              Schooling 
##                    120                    174

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

Now we have a complete data set with no missing values.

We concluded that Percentage expenditure, Alcohol consumption and Life expectancy are the variables that present the biggest differences between Developed and Developing countries.

Our assumption that we need to analyze the data by status proved to be correct.

Question that arises: could Percentage expenditure and Alcohol be the variables that influence Life expectancy the most, given that the averages for BMI and Schooling differ less by status?

Part 2: Identifying the 10 countries with the lowest life expectancy and the lowest expenditure

As mentioned earlier, we want to market and sell health-related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential clients will be the countries with the lowest life expectancy and the lowest expenditure.

Our potential clients in Developed countries would be: Bulgaria, Lithuania, Latvia, Hungary, Romania, Estonia.

Open question: to determine our potential clients in Developing countries, do we choose them by lowest Life expectancy or by lowest Percentage expenditure?

# Per-country averages for Developed countries, sorted by lowest Life expectancy
Country_sum <- LE %>%
  filter(Status == "Developed") %>%
  group_by(Country.Updated) %>%
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm = TRUE),
    Life.expectancy = mean(Life.expectancy, na.rm = TRUE)
  ) %>%
  arrange(Life.expectancy, Percentage.expenditure)
print(Country_sum)

## # A tibble: 29 x 3
##    Country.Updated Percentage.expenditure Life.expectancy
##    <fct>                            <dbl>           <dbl>
##  1 Bulgaria                          374.            72.7
##  2 Lithuania                        1083.            72.8
##  3 Latvia                            566.            73.7
##  4 Hungary                           402.            73.7
##  5 Romania                           455.            74.0
##  6 Estonia                           917.            74.8
##  7 Poland                            331.            75.5
##  8 Denmark                          5668.            78.8
##  9 Slovenia                         1660.            79.2
## 10 Cyprus                            995.            79.3
## # … with 19 more rows

# Per-country averages for Developing countries, sorted by lowest Life expectancy
Country_sum <- LE %>%
  filter(Status == "Developing") %>%
  group_by(Country.Updated) %>%
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm = TRUE),
    Life.expectancy = mean(Life.expectancy, na.rm = TRUE)
  ) %>%
  arrange(Life.expectancy, Percentage.expenditure)
print(Country_sum)

## # A tibble: 128 x 3
##    Country.Updated          Percentage.expenditure Life.expectancy
##    <fct>                                     <dbl>           <dbl>
##  1 Sierra Leone                               31.0            45.8
##  2 Central African Republic                   43.6            48.2
##  3 Lesotho                                    87.6            48.5
##  4 Angola                                    109.             48.8
##  5 Malawi                                     27.6            49.3
##  6 Chad                                       34.4            50.2
##  7 Swaziland                                 297.             50.8
##  8 Nigeria                                    91.1            51.1
##  9 Zimbabwe                                   32.6            51.6
## 10 Mozambique                                 39.5            53.1
## # … with 118 more rows

The 5 Developing countries with the lowest Life expectancy are:
1. Sierra Leone
2. Central African Republic
3. Lesotho
4. Angola
5. Malawi

The Developed countries with the lowest Life expectancy (the top 5, plus Estonia in sixth place) are:
1. Bulgaria
2. Lithuania
3. Latvia
4. Hungary
5. Romania
6. Estonia

# Per-country averages for Developed countries, sorted by lowest Percentage expenditure
Country_sum <- LE %>%
  filter(Status == "Developed") %>%
  group_by(Country.Updated) %>%
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm = TRUE),
    Life.expectancy = mean(Life.expectancy, na.rm = TRUE)
  ) %>%
  arrange(Percentage.expenditure, Life.expectancy)
print(Country_sum)

## # A tibble: 29 x 3
##    Country.Updated Percentage.expenditure Life.expectancy
##    <fct>                            <dbl>           <dbl>
##  1 Poland                            331.            75.5
##  2 Bulgaria                          374.            72.7
##  3 Hungary                           402.            73.7
##  4 Romania                           455.            74.0
##  5 Latvia                            566.            73.7
##  6 Estonia                           917.            74.8
##  7 Cyprus                            995.            79.3
##  8 Lithuania                        1083.            72.8
##  9 Malta                            1260.            80.3
## 10 Slovenia                         1660.            79.2
## # … with 19 more rows

# Per-country averages for Developing countries, sorted by lowest Percentage expenditure
Country_sum <- LE %>%
  filter(Status == "Developing") %>%
  group_by(Country.Updated) %>%
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm = TRUE),
    Life.expectancy = mean(Life.expectancy, na.rm = TRUE)
  ) %>%
  arrange(Percentage.expenditure, Life.expectancy)
print(Country_sum)

## # A tibble: 128 x 3
##    Country.Updated Percentage.expenditure Life.expectancy
##    <fct>                            <dbl>           <dbl>
##  1 Eritrea                           8.58            59.5
##  2 Myanmar                          13.7             64.0
##  3 Burundi                          16.4             55.3
##  4 Guinea                           17.0             55.8
##  5 Tajikistan                       18.4             66.5
##  6 Niger                            20.4             56.7
##  7 Rwanda                           21.5             58.9
##  8 Timor-Leste                      22.6             64.5
##  9 Senegal                          22.9             62.3
## 10 Guinea-Bissau                    23.4             55.1
## # … with 118 more rows

The 5 Developing countries with the lowest Percentage expenditure are:
1. Eritrea
2. Myanmar
3. Burundi
4. Guinea
5. Tajikistan

The Developed countries with the lowest Percentage expenditure (the top 5, plus Estonia in sixth place) are:
1. Poland
2. Bulgaria
3. Hungary
4. Romania
5. Latvia
6. Estonia
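
The two rankings give different shortlists, so one possible way to combine them is sketched below. This is only an illustration on hypothetical toy data: lowest_n is a helper name introduced here, and the combined score (the sum of the two ranks) is an assumption on our part, not a method used in the analysis above.

```r
# Toy yearly data for three hypothetical countries
toy <- data.frame(
  Country                = c("A", "A", "B", "B", "C", "C"),
  Life.expectancy        = c(70, 72, 60, 62, 80, 82),
  Percentage.expenditure = c(100, 120, 30, 50, 900, 1100)
)

# Average each country over its years
by_country <- aggregate(cbind(Life.expectancy, Percentage.expenditure) ~ Country,
                        data = toy, FUN = mean)

# Hypothetical helper: the n rows with the lowest value of `column`
lowest_n <- function(df, column, n = 5) {
  head(df[order(df[[column]]), ], n)
}

lowest_le <- lowest_n(by_country, "Life.expectancy", n = 2)

# Assumed combined score: rank on life expectancy plus rank on expenditure,
# so countries that are low on both come first
by_country$score <- rank(by_country$Life.expectancy) +
  rank(by_country$Percentage.expenditure)
best_target <- lowest_n(by_country, "score", n = 1)

print(lowest_le)
print(best_target)
```

A rank sum weights both criteria equally; a different weighting would be just as defensible and would need to be agreed on before choosing the clients.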

Part 3: Examination of correlations for all continuous variables

3.1 Correlation between life expectancy, alcohol, and percentage expenditure

First, we make sure BMI is a plain numeric vector: if the imputation step left it stored as a list (which happens when a whole summary table, rather than a single mean, is assigned into the column), unlist and as.numeric convert it back, and on an already numeric column they change nothing. Next, we create a correlation matrix for the continuous variables, naming it cont.cor. The selection Life.expectancy:BMI keeps the four columns from Life expectancy through BMI, so Schooling is not part of this matrix. We use rcorr from the Hmisc package, which returns both the correlation coefficients (cont.cor$r) and their p-values (cont.cor$P). Note that signif does not test significance; it only rounds the printed values to a number of significant digits (6 by default).
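
As a minimal base R sketch of the same ideas on hypothetical toy vectors (not the WHO data): cor gives the Pearson coefficient, cor.test gives the p-value for the null hypothesis of zero correlation, and signif only controls rounding.

```r
# Hypothetical, strongly related toy vectors
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

r <- cor(x, y)                # Pearson correlation coefficient
p <- cor.test(x, y)$p.value   # p-value for H0: the true correlation is zero

rounded <- signif(123.456, 2) # rounds to 2 significant digits, no testing involved

print(signif(r, 3))
print(p < 0.05)
print(rounded)
```

Whether a correlation is statistically significant is therefore read from the p-value, not from the output of signif.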

library("Hmisc")

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

LE$BMI <- unlist(LE$BMI)
LE$BMI <- as.numeric(LE$BMI)
cont <- LE %>% select(Life.expectancy:BMI)

cont.cor <- rcorr(as.matrix(cont))
signif(cont.cor$r)

##                        Life.expectancy    Alcohol Percentage.expenditure
## Life.expectancy              1.0000000  0.3865380             0.42795800
## Alcohol                      0.3865380  1.0000000             0.38041200
## Percentage.expenditure       0.4279580  0.3804120             1.00000000
## BMI                          0.0150871 -0.0158426            -0.00338359
##                                BMI
## Life.expectancy         0.01508710
## Alcohol                -0.01584260
## Percentage.expenditure -0.00338359
## BMI                     1.00000000

signif(cont.cor$P)

##                        Life.expectancy  Alcohol Percentage.expenditure      BMI
## Life.expectancy                     NA 0.000000                0.00000 0.467919
## Alcohol                       0.000000       NA                0.00000 0.445928
## Percentage.expenditure        0.000000 0.000000                     NA 0.870690
## BMI                           0.467919 0.445928                0.87069       NA

cont.cor

##                        Life.expectancy Alcohol Percentage.expenditure   BMI
## Life.expectancy                   1.00    0.39                   0.43  0.02
## Alcohol                           0.39    1.00                   0.38 -0.02
## Percentage.expenditure            0.43    0.38                   1.00  0.00
## BMI                               0.02   -0.02                   0.00  1.00
## 
## n= 2317 
## 
## 
## P
##                        Life.expectancy Alcohol Percentage.expenditure BMI   
## Life.expectancy                        0.0000  0.0000                 0.4679
## Alcohol                0.0000                  0.0000                 0.4459
## Percentage.expenditure 0.0000          0.0000                         0.8707
## BMI                    0.4679          0.4459  0.8707

Significant correlations (p < 0.001): alcohol with life expectancy (r = 0.39), percentage expenditure with life expectancy (r = 0.43), and alcohol with percentage expenditure (r = 0.38). All three are positive, but they are only moderate in strength.

We found that BMI is not significantly correlated with any of the other variables (all p > 0.44, with correlations of about 0.02 or less in absolute value).
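Reading significant pairs off two printed matrices is error-prone, so the pairs can also be extracted programmatically. The sketch below uses base R (`cor` and `cor.test`) as a stand-in for `rcorr`, on simulated data. The variables `a`, `b`, `c`, the 0.05 threshold, and the object names are assumptions for illustration, not part of the original analysis.

```r
set.seed(1)
n <- 200
x <- rnorm(n)

# Simulated data: b is built to correlate with a, c is independent noise
toy <- data.frame(
  a = x,
  b = 0.5 * x + rnorm(n),
  c = rnorm(n)
)

vars <- names(toy)
r <- cor(toy, use = "pairwise.complete.obs")

# Matrix of p-values, one Pearson test per pair; the diagonal stays NA
p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) p[i, j] <- cor.test(toy[[i]], toy[[j]])$p.value
  }
}

# Keep each pair once (upper triangle) and only where p is below the threshold
idx <- which(upper.tri(p) & p < 0.05, arr.ind = TRUE)
sig <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx],
  p    = p[idx]
)
print(sig)
```

The same indexing should work on the `r` and `P` components returned by `rcorr`, since they are matrices with the same layout.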

2.2 Scatterplot Matrices

To plot the continuous variables in a scatterplot matrix, we removed BMI because it was not significantly correlated with any other variable. This left life expectancy, alcohol, and percentage expenditure. We selected these columns from the data set, named the result cont.sig, and used pairs to draw the scatterplot matrix.

cont.sig <- LE %>% select(Life.expectancy:Percentage.expenditure)
pairs(cont.sig, pch=19, lower.panel = NULL)

The top right graph shows the most strongly correlated pair (r = 0.43): life expectancy versus percentage expenditure. The two are positively correlated, so as life expectancy increases, so does the percentage expenditure on health care. However, beyond a certain life expectancy the percentage expenditure appears to level out rather than continue increasing. This suggests that life expectancy and percentage expenditure have a logarithmic rather than a linear relationship. The same could be true of alcohol and life expectancy, which are also positively correlated. Alcohol and percentage expenditure are positively correlated too, but the relationship is weaker than in the other two graphs.
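A quick way to check whether a logarithmic form fits better than a linear one is to fit both models and compare their R-squared values. The sketch below does this on simulated data in base R. The variable names, the simulated coefficients, and the choice to model life expectancy as a function of log expenditure are assumptions for illustration, not results from the WHO data.

```r
set.seed(42)

# Simulated data in which life expectancy rises with the log of expenditure
expenditure <- runif(300, min = 1, max = 5000)
life <- 50 + 3 * log(expenditure) + rnorm(300, sd = 2)

# Candidate models: straight line vs. line in log(expenditure)
fit_linear <- lm(life ~ expenditure)
fit_log    <- lm(life ~ log(expenditure))

# Both models have the same response and number of parameters,
# so their R-squared values are directly comparable
r2 <- c(
  linear = summary(fit_linear)$r.squared,
  log    = summary(fit_log)$r.squared
)
print(round(r2, 3))
```

On real data, rows with an expenditure of zero would need handling first (for example by dropping them or using `log1p`), because `log(0)` is `-Inf`.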