R Final Project

We are a new social enterprise looking to help countries with low life expectancy increase life expectancy with minimal additional cost.

Based on these existing lifestyle choices in each country selected, what educational program can we offer them to potentially increase life expectancy?

As a new organization, we want to market and sell health related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential country clients will be the ones with the lowest life expectancy rates and lowest expenditure rates. Given the researched link between education and life expectancy, we are positioning ourselves as an enterprise who wants to increase global life expectancies by partnering with governments to educate citizens about healthy lifestyle choices (nutrition and alcohol consumption). Addressing rates of BMI, alcohol use, and schooling is crucial in helping populations grow old healthier and prevent future costs for governments.

To conduct our analysis, we will be using data from the WHO. Including the following variables:

Country: Nominal. Name of each country, whose alcohol consumption, BMI, and schooling will be evaluated over the span of 16 years.

Year: Interval. Data was collected over the span of 16 years, starting in 2000 and ending in 2015. The fact that the same data was collected over the years will allow an analysis that can consider the progression or change over time of the dependent variables.

Status: Nominal. Describes if the country’s economy and societal structure is developed or developing per WHO definitions.

Life Expectancy: Ratio. We will use this input as our independent variable and will try to determine if the other variables considered have a direct impact on life expectancy. How many years a person is expected to live, when compared to the average life span of the citizens of his/her own country.

Alcohol: Ratio. Per capita (15+) consumption of alcohol per person (pure alcohol in liters).

Percentage Expenditure: Ratio. The percentage of a country’s Gross Domestic Product that was spent on health (per capita). In other words the USD amount the government spends per person per year.

BMI: Ratio. Average Body Mass Index (BMI) of entire population.

Schooling: Ratio. The average number of schooling years for citizens in each country.

Part 1: Organizing and structuring the Dataset

To start our analysis, we structured and modified our data frame. We began by turning the Country, Year and Status variables into factors and the BMI column as a numeric variable.

LE <- read.csv("/Users/BudsMac/Downloads/Final R Project/Life Expectancy Data(1).csv")
head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1         Bahamas 2000 Developing            72.6   12.15
## 2           Samoa 2015 Developing            74.0      NA
## 3           Tonga 2015 Developing            59.9      NA
## 4           Tonga 2015 Developing            73.5      NA
## 5           Samoa 2014 Developing            73.8    0.01
## 6           Tonga 2014 Developing            73.3    0.01
##   Percentage.expenditure  BMI Schooling
## 1                     NA 54.4      12.0
## 2                     NA 32.1      12.9
## 3                     NA 32.1      12.0
## 4                     NA 32.1      14.3
## 5               660.2778 32.0      12.9
## 6               565.9672 32.0      14.3

LE$Country.Updated <- factor(LE$Country.Updated)
LE$Year <- factor(LE$Year)
LE$Status <- factor(LE$Status)
LE$BMI <- unlist(LE$BMI)
LE$BMI <- as.numeric(LE$BMI)
LE$Schooling <- unlist(LE$Schooling)
LE$Schooling <- as.numeric(LE$Schooling)

head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1         Bahamas 2000 Developing            72.6   12.15
## 2           Samoa 2015 Developing            74.0      NA
## 3           Tonga 2015 Developing            59.9      NA
## 4           Tonga 2015 Developing            73.5      NA
## 5           Samoa 2014 Developing            73.8    0.01
## 6           Tonga 2014 Developing            73.3    0.01
##   Percentage.expenditure  BMI Schooling
## 1                     NA 54.4      12.0
## 2                     NA 32.1      12.9
## 3                     NA 32.1      12.0
## 4                     NA 32.1      14.3
## 5               660.2778 32.0      12.9
## 6               565.9672 32.0      14.3

1.2 Looking at the Dataset

Once the variables were updated, we proceeded to look at the variable distribution to become acquainted with the data and do a scan.

1.2.1 Life Expectancy

We calculated the Life expectancy average, and see that the general average is 69 years. Looking at the Boxplot and Histogram, we see that the data is skewed to the left. There are some outliers, on the lowest side of life expectancy, meaning that some countries have a very low life expectancy. We are thinking that those countries have a “developing” status. It is possible that we may have to take a closer look, which we will do in the next section as we evaluate differences between developed and developing countries.

# Calculate averages for life expectancy

by_lifeexpectancy <- LE %>%  
  summarise(
    Life.expectancy = mean(Life.expectancy, na.rm=TRUE), 
    n = n()
  )
  
mean.Life.expectancy<-arrange(by_lifeexpectancy, Life.expectancy)
print(mean.Life.expectancy)

##   Life.expectancy    n
## 1        69.29063 2895

# Life expectancy boxplots 

BoxPlot <- ggplot(LE, aes(x=Life.expectancy)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Life Expectancy")
print(BoxPlot)

## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

# Life expectancy histogram

LifeExpectancy.Hist <- ggplot(data=LE) +
  geom_histogram(mapping= aes(x=Life.expectancy), binwidth=10) +
  ggtitle("Life expectancy")

print(LifeExpectancy.Hist)

## Warning: Removed 3 rows containing non-finite values (stat_bin).

1.2.2 Alcohol

We calculated the Alcohol consumption average, and see that the general average is 4.6 of alcohol liters consumed per person each year. However, looking at the Boxplot and Histogram, we see that the data is skewed to the right. There are few countries where the consumption rates are much higher. Which countries are these? Are they developed or developing, or a mix?

# Calculate averages for Alcohol

by_alcohol <- LE %>%  
  summarise(
    Alcohol = mean(Alcohol, na.rm=TRUE), 
    n = n()
  )
  
mean.Alcohol<-arrange(by_alcohol, Alcohol)
print(mean.Alcohol)

##    Alcohol    n
## 1 4.639316 2895

# Alcohol boxplot 

BoxPlot <- ggplot(LE, aes(x=Alcohol)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Alcohol")
print(BoxPlot)

## Warning: Removed 191 rows containing non-finite values (stat_boxplot).

# Alcohol histogram

Alcohol.Hist <- ggplot(data=LE) +
  geom_histogram(mapping= aes(x=Alcohol), binwidth=5) +
  ggtitle("Alcohol Consumption")

print(Alcohol.Hist)

## Warning: Removed 191 rows containing non-finite values (stat_bin).

### 1.2.3 Percentage Expenditure

We calculated the Percentage Expenditure average, and looking at the Boxplot and Histogram, we see that the data is skewed.There are multiple outliers that make us wonder whether the ones with a higher percentage share some similarities in terms of status (Developed vs. Developing).It is important to remember that the data accounts for 16 data entries (from 2000 to 2016) for each country. So the outliers may represent fewer countries (optical perception may lead one to think that the number of countries that have a higher Percentage Expenditure is actually higher than it is).

# Calculate average for Percentage Expenditure

by_percentageexpenditure <- LE %>%  
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm=TRUE), 
    n = n()
  )
  
mean.Percentageexpenditure<-arrange(by_percentageexpenditure, Percentage.expenditure)
print(mean.Percentageexpenditure)

##   Percentage.expenditure    n
## 1               933.5579 2895

# Percentage Expenditure Boxplot

BoxPlot <- ggplot(LE, aes(x=Percentage.expenditure)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Percentage Expenditure")
print(BoxPlot)

## Warning: Removed 574 rows containing non-finite values (stat_boxplot).

# Percentage Expenditure histogram

Percentageexpenditure.Hist <- ggplot(data=LE) +
  geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=1000) +
  ggtitle("Percentage Expenditure")

print(Percentageexpenditure.Hist)

## Warning: Removed 574 rows containing non-finite values (stat_bin).

1.2.4 BMI

We calculated the BMI average, which is 25. The histogram shows a more normal distribution, when compared to the last three analyzed. Here, we do not see any skewness or distortion. A BMI of between 18 and 25 indicates that a person is healthy, while 25+ indicates that a person is overweight. In the next section we will try to analyze if there is a big gap between developed and developing countries in terms of BMI. We ask ourselves, is overweight and obesity a Developed Countries’ problem? Or is it the same regardless of status?

# Calculate average for BMI

by_BMI <- LE %>%  
  summarise(
    BMI = mean(BMI, na.rm=TRUE), 
    n = n()
  )
  
mean.BMI<-arrange(by_BMI, BMI)
print(mean.BMI)

##        BMI    n
## 1 25.06699 2895

# BMI Boxplot

BoxPlot <- ggplot(LE, aes(x=BMI)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("BMI")
print(BoxPlot)

## Warning: Removed 32 rows containing non-finite values (stat_boxplot).

# BMI histogram

BMI.Hist <- ggplot(data=LE) +
  geom_histogram(mapping= aes(x=BMI), binwidth=5) +
  ggtitle("BMI")

print(BMI.Hist)

## Warning: Removed 32 rows containing non-finite values (stat_bin).

1.2.5 Schooling

We calculated the Schooling years average, which shows that people go to school for approximately 12 years. As with BMI, the histogram and boxplot show a more normal distribution.We do not see any skewness or distortion.

# Calculate average for Schooling

by_Schooling <- LE %>%  
  summarise(
    Schooling = mean(Schooling, na.rm=TRUE), 
    n = n()
  )
  
mean.Schooling<-arrange(by_Schooling, Schooling)
print(mean.Schooling)

##   Schooling    n
## 1  12.11121 2895

# Schooling Boxplot

BoxPlot <- ggplot(LE, aes(x=Schooling)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Schooling")
print(BoxPlot)

## Warning: Removed 157 rows containing non-finite values (stat_boxplot).

# Schooling histogram

Schooling.Hist <- ggplot(data=LE) +
  geom_histogram(mapping= aes(x=Schooling), binwidth=10) +
  ggtitle("Schooling")

print(Schooling.Hist)

## Warning: Removed 157 rows containing non-finite values (stat_bin).

1.3 Developed vs. Developing Countries and Missing values

As stated in our scope of work, we have a theory that there may be a significant difference between Developing and Developed countries. In this section, we will analyze the differences per status.

Moreover, we know there are missing values in our data set. Before determining whether we can replace the values by using the general or the status average, we will compare and contrast the distributions of Developed and Developing countries.

To verify how many missing values we have, we used the sapply and missmap functions. After running the code, we found that only 4% of the data is missing.

# Use `sapply` to identify the number of missing values per column

sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      3                    191                    574 
##                    BMI              Schooling 
##                     32                    157

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    181                     16                      3 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    363                   1075                   2322 
##                    BMI              Schooling 
##                    122                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.3 Distribution

We now know that we are missing 4% of the data. We thought about using the general average to replace the missing values; however, one of our assumptions is that the data may vary depending on the country’s status (developed or developing country), so we need to verify if there are large differences when looking at the general averages and the the distribution of the two categories.

1.3.1 Life expectancy

There is a significant difference in life expectancy between Developed and Developing countries, with people from Developed countries living 12 more years.

Taking a closer look, by graphic a histogram and boxplot to compare the distributions of Deeveloped and Developing countries, we see that: - Life expectancy in Developing countries has a wider range than in Developed countries. - Life expectancy in Developing countries is skewed, while in Developed, the variability is less. - We could replace the missing values for Developed countries with the average, but it would not bee appropriate to do the same with the data for Developing. - The safest approach would be to remove the missing datam using the rm.na function.

# Group by status and calculate averages for life expectancy

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Life.expectancy = mean(Life.expectancy, na.rm=TRUE), 
    n = n()
  )
  
mean.Life.expectancy<-arrange(by_status, Life.expectancy)
print(mean.Life.expectancy)

## # A tibble: 3 x 3
##   Status       Life.expectancy     n
##   <fct>                  <dbl> <int>
## 1 "Developing"            67.1  2366
## 2 "Developed"             79.2   527
## 3 ""                     NaN       2

# Life expectancy boxplots per status

BoxPlot <- ggplot(LE, aes(x=Status, y=Life.expectancy)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Life Expectancy per Status")
print(BoxPlot)

## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

# Life expectancy histograms per status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
  ggtitle("Life expectancy in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Life.expectancy), binwidth=2) +
  ggtitle("Life expectancy in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 1 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Life expectancy column.

# Remove missing values. 

LE <- LE %>%
   filter(!is.na(Life.expectancy) & Life.expectancy!="")

head(LE)

##   Country.Updated Year     Status Life.expectancy Alcohol
## 1         Bahamas 2000 Developing            72.6   12.15
## 2           Samoa 2015 Developing            74.0      NA
## 3           Tonga 2015 Developing            59.9      NA
## 4           Tonga 2015 Developing            73.5      NA
## 5           Samoa 2014 Developing            73.8    0.01
## 6           Tonga 2014 Developing            73.3    0.01
##   Percentage.expenditure  BMI Schooling
## 1                     NA 54.4      12.0
## 2                     NA 32.1      12.9
## 3                     NA 32.1      12.0
## 4                     NA 32.1      14.3
## 5               660.2778 32.0      12.9
## 6               565.9672 32.0      14.3

1.3.2 Alcohol

We proceed with the second variable: Alcohol. We find that: - Alcohol consumption in developed countries is 2 times the rate for Developing countries (+200%). - There is a larger range of consumption in Developing countries than there is the Developed ones. - There is a massive disparity between the consumption average, meaning the general average cannot be used to replace the missing values. - Alcohol consumption in Developing countries is skewed, so using the average could lessen the validity of the data if we replace the missing values with the average. - If we cannot replace for both Developed and Developing, it is best to remove the missing values. - This proves our hypothesis that we need to analyze the data comparing Developed vs. Developing countries, because the difference is significant.

# Calculate average for Alcohol consumption in Developed and Developing countries. 

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Alcohol = mean(Alcohol, na.rm=TRUE), 
    n = n()
  )
  
mean.Alcohol<-arrange(by_status, Alcohol)
print(mean.Alcohol)

## # A tibble: 2 x 3
##   Status     Alcohol     n
##   <fct>        <dbl> <int>
## 1 Developing    3.46  2365
## 2 Developed     9.90   527

# Alcohol consumption boxplot to compare consumption by status.

BoxPlot <- ggplot(LE, aes(x=Status, y=Alcohol)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("Alcohol consumption per Status")
print(BoxPlot)

## Warning: Removed 188 rows containing non-finite values (stat_boxplot).

# Alcohol consumption histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
  ggtitle("Alcohol consumption in Developed Countries")
  
print(Developed.Countries.Hist)

## Warning: Removed 30 rows containing non-finite values (stat_bin).

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Alcohol), binwidth=2) +
  ggtitle("Alcohol Consumption in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 158 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Alcohol column.

# Remove missing values and verify that there aren't any missing values in that variable. 

LE <- LE %>%
   filter(!is.na(Alcohol) & Alcohol!="")

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                    387 
##                    BMI              Schooling 
##                     15                    136

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    180                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    359                   1074                   2318 
##                    BMI              Schooling 
##                    120                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.3.3 Percentage expenditure

As mentioned in the beginning, the percentage expenditure variable shows the percentage of a country’s Gross Domestic Product that was spent on health (per capita). - This is the column with the most information missing. - There is a massive difference between the expenditure in Developed vs. Developing countries. - Using the general average is out of the question. - Also, using the average per status to replace the missing values can be damaging as there are many outliers that could distort the data. Especially, there are significant outliers in Developed countries that may have increased the percentage expenditure average. - If we are looking for variables that may impact life expectancy, we will look closely at Percentage expenditure in our model.

# Percentage expenditure average by status.

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Percentage.expenditure = mean(Percentage.expenditure, na.rm=TRUE), 
    n = n()
  )
  
mean.Percentage.expenditure<-arrange(by_status, Percentage.expenditure)
print(mean.Percentage.expenditure)

## # A tibble: 2 x 3
##   Status     Percentage.expenditure     n
##   <fct>                       <dbl> <int>
## 1 Developing                   383.  2207
## 2 Developed                   3328.   497

# Boxplots of Percentage Expenditure per status 

BoxPlot <- ggplot(LE, aes(x=Status, y=Percentage.expenditure)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Percentage expenditure per Status")

print(BoxPlot)

## Warning: Removed 387 rows containing non-finite values (stat_boxplot).

# Percentage expenditure histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
  ggtitle("Percentage Expenditure in Developed Countries")
  
print(Developed.Countries.Hist)

## Warning: Removed 63 rows containing non-finite values (stat_bin).

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Percentage.expenditure), binwidth=500) +
  ggtitle("Percentage.expenditure in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 324 rows containing non-finite values (stat_bin).

We proceed to remove the missing values from the Percentage expenditure column.

# Remove missing values and verify that there aren't missing values. 

LE <- LE %>%
   filter(!is.na(Percentage.expenditure) & Percentage.expenditure!="")

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                     15                     14

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    157                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                   1000                   2317 
##                    BMI              Schooling 
##                    119                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

1.3.4 BMI

The BMI averages are the first variable in which we do not find a significant difference between Developed and Developing countries. However, we need to look at their distribution to see if we can replace the missing values with the averages. - Developed countries have a healthier BMI in general. - Developing countries on the other hand, have a larger range, with the lowest and highest BMI numbers. - As with the previous variables, we will remove the missing values.

# BMI average per status

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    BMI = mean(BMI, na.rm=TRUE), 
    n = n()
  )
  
mean.BMI<-arrange(by_status, BMI)
print(mean.BMI)

## # A tibble: 2 x 3
##   Status       BMI     n
##   <fct>      <dbl> <int>
## 1 Developing  24.8  1883
## 2 Developed   25.9   434

# Boxplot to compare BMI distribution per status. 

BoxPlot <- ggplot(LE, aes(x=Status, y=BMI)) +
  geom_boxplot() +
  coord_flip()+
  ggtitle("BMI per status")
print(BoxPlot)

## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

# BMI histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=BMI), binwidth=2) +
  ggtitle("BMI in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=BMI), binwidth=2) +
  ggtitle("BMI in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 15 rows containing non-finite values (stat_bin).

We proceed to remove the missing values.

# Remove missing values and verify that there aren't missing values. 

LE <- LE %>%
   filter(!is.na(BMI) & BMI!="")

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                      0                     14

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    156                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                    998                   2302 
##                    BMI              Schooling 
##                    118                    173

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

### Schooling

The Schooling variable shows the average number of schooling years for citizens in each country. - At first glance, by looking at the averages, the difference does not seem large 11 years for Developing vs 15 for Developed. - The shape of the distribution skewed for Developing countries, which differs from the normal distribution observed on the 1st section. This means that the Developed Countries data does impact the general distribution of the whole data set. - As with the previous variables, we will remove the missing values

# Schooling years average per status

by_status <- LE %>%  
  group_by(Status) %>% 
  summarise(
    Schooling = mean(Schooling, na.rm=TRUE), 
    n = n()
  )
  
mean.Schooling<-arrange(by_status, Schooling)
print(mean.Schooling)

## # A tibble: 2 x 3
##   Status     Schooling     n
##   <fct>          <dbl> <int>
## 1 Developing      11.3  1868
## 2 Developed       16.0   434

# Boxplots for Average Schooling years per status

BoxPlot <- ggplot(LE, aes(x=Status, y=Schooling)) +
  geom_boxplot() +
  coord_flip()
print(BoxPlot)

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

The difference in years of schooling does vary significantly between Developing and Developed countries.

# Schooling histograms per Status

Developed.countries <- filter(LE, Status == "Developed")

Developed.Countries.Hist <- ggplot(data=Developed.countries) +
  geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
  ggtitle("Schooling in Developed Countries")
  
print(Developed.Countries.Hist)

Developing.countries <- filter(LE, Status == "Developing")

Developing.Countries.Hist <- ggplot(data=Developing.countries) +
  geom_histogram(mapping= aes(x=Schooling), binwidth=2) +
  ggtitle("Schooling in Developing Countries")
  
print(Developing.Countries.Hist)

## Warning: Removed 14 rows containing non-finite values (stat_bin).

We proceed to replace the missing values.

# Remove missing values and verify that there aren't missing values. 

LE <- LE %>%
   filter(!is.na(Schooling) & Schooling!="")

# verify that values aren't missing
sapply(LE, function(x) sum(is.na(x)))

##        Country.Updated                   Year                 Status 
##                      0                      0                      0 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                      0                      0                      0 
##                    BMI              Schooling 
##                      0                      0

# Use `sapply` to identify the unique number of values per column
sapply(LE, function(x) length(unique(x)))

##        Country.Updated                   Year                 Status 
##                    156                     16                      2 
##        Life.expectancy                Alcohol Percentage.expenditure 
##                    357                    996                   2288 
##                    BMI              Schooling 
##                    118                    172

# Create a plot that shows the missing values per column
missmap(LE, main="Missing Values vs Observed")

Now we have a complete data set with no missing values.

We concluded that Percentage expenditure, Alcohol consumption and Life expectancy are the variables that present the biggest differences between Developed and Developing countries.

Our assumption that we need to analyze the data by status proved to be correct.

Questions that arise: Could Percentage expenditure and Alcohol be the variables that influence Life expectancy the most, given that the averages for BMI and Schooling do not differ as much by status?

2 Identifying the top 10 countries with the lowest life expectancy rates and lowest expenditure percentages.

Aforementioned, we want to market and sell health related educational programs. Since we are just starting, we will identify 10 countries (5 developing and 5 developed) and target them as our first potential clients. The top potential country clients will be the ones with the lowest life expectancy rates.

Potential clients in Developed and Developing countries

Developed.countries <- filter(LE, Status == "Developed")
by_country<- group_by(Developed.countries, Country.Updated)
  Country_sum <-filter(by_country)
  Country_sum<- summarise(Country_sum,
                          Life.expectancy=mean(Life.expectancy, na.rm=TRUE))

  Country_sum <- arrange(Country_sum, Life.expectancy)
  print(Country_sum)

## # A tibble: 29 x 2
##    Country.Updated Life.expectancy
##    <fct>                     <dbl>
##  1 Bulgaria                   72.7
##  2 Lithuania                  72.8
##  3 Latvia                     73.7
##  4 Hungary                    73.7
##  5 Romania                    74.0
##  6 Estonia                    74.8
##  7 Poland                     75.5
##  8 Denmark                    78.8
##  9 Slovenia                   79.2
## 10 Cyprus                     79.3
## # … with 19 more rows

Developing.countries <- filter(LE, Status == "Developing")
by_country<- group_by(Developing.countries, Country.Updated)
  Country_sum <-filter(by_country)
  Country_sum<- summarise(Country_sum,
                          Life.expectancy=mean(Life.expectancy, na.rm=TRUE))

  Country_sum <- arrange(Country_sum, Life.expectancy)
  print(Country_sum)

## # A tibble: 127 x 2
##    Country.Updated          Life.expectancy
##    <fct>                              <dbl>
##  1 Sierra Leone                        45.8
##  2 Central African Republic            48.2
##  3 Lesotho                             48.5
##  4 Angola                              48.8
##  5 Malawi                              49.3
##  6 Chad                                50.2
##  7 Swaziland                           50.8
##  8 Nigeria                             51.1
##  9 Zimbabwe                            51.6
## 10 Mozambique                          53.1
## # … with 117 more rows

The 5 Developing countries with the lowest Life expectancy are: 1. Sierra Leone 2. Central African Republic 3. Lesotho 4. Angola 5. Malawi

The 5 Developed countries with the lowest Life expectancy are: 1. Bulgaria 2. Lithuania 3. Latvia 4. Hungary 5. Romania 6. Estonia

3. Examination of Correlations for all continuous variables.

3.1 First, we noticed that the BMI and Schooling variables in the dataset were categorized as lists, versus a numeric value. So, the first thing we had to do was unlist them and convert them to a numeric. Next, we created a correlation matrix for the continuous variables, naming it cont.cor.

Hmisc was used to see the correlations more clearly.

library("Hmisc")

## Loading required package: lattice

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:robustbase':
## 
##     heart

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

LE$BMI <- unlist(LE$BMI)
LE$BMI <- as.numeric(LE$BMI)
LE$Schooling <- unlist(LE$Schooling)
LE$Schooling <- as.numeric(LE$Schooling)
cont <- LE %>% dplyr::select(Life.expectancy:Schooling)

cor(cont)

##                        Life.expectancy   Alcohol Percentage.expenditure
## Life.expectancy              1.0000000 0.3849922              0.4276899
## Alcohol                      0.3849922 1.0000000              0.3795351
## Percentage.expenditure       0.4276899 0.3795351              1.0000000
## BMI                          0.5943718 0.2645032              0.1881667
## Schooling                    0.7876933 0.5774616              0.4367901
##                              BMI Schooling
## Life.expectancy        0.5943718 0.7876933
## Alcohol                0.2645032 0.5774616
## Percentage.expenditure 0.1881667 0.4367901
## BMI                    1.0000000 0.6266925
## Schooling              0.6266925 1.0000000

Strong, r >.5 : Life expectancy is the strongest correlation and it is correlated with Schooling, life expectancy is strongly correlated with BMI, Alcohol is strongly correlated with Schooling, and Schooling is strongly correlated with BMI.

Moderate, .2 < r < .5 : Life expectancy is moderately correlated with Alcohol, Life expectancy is moderately correlated with percentage expenditure, Alcohol is moderately correlated with percentage expenditure, Alcohol is moderately correlated with BMI, and Schooling is moderately correlated with percentage expenditure.

Weak, r < .2: BMI is weakly correlated with percentage expenditure.

Correlations were found between each variable, none of them were found to have no correlation at all.

3.2 Scatterplot Matrices

We plotted all continous variables on a scatterplot matrices. Then, we used pairs to do the scatterplot matrices.

cont.sig <- LE %>% dplyr::select(Life.expectancy:Schooling)
pairs(cont.sig, pch=2, lower.panel = NULL)

We noticed potential linear relationships between life expectancy and alcohol, life expectancy and BMI, life expectancy and schooling, and BMI and schooling.

Part 4: Building A Model

Based on the information from our data exploration and our problem statement, our outcome variable will be life expectancy. We will now build a model with the appropriate predictor variables. We started by creating a test and train set and set a random number generator seed.

After seeing that some variables interactions from our fit1 linear regression model were not significant, we ran several iterations until we arrived at a model in the which all variables/interactions were significant: fit3.

We wanted to ensure that using a model that included both the original variables and their interactions was the best fit, so we also created a simplified linear regression model without the interactions: simple.fit1. After seeing that all variables had the same amount of significance as in our fit3 model, and knowing that the interactions might add to the efficacy of our model, we decided to continue our analysis using the fit3 model.

LE.col <-dplyr::select(LE, Life.expectancy:Schooling)

set.seed(7890)

train <- sample_frac(LE.col, 0.8)
test <- LE.col[-(as.numeric(rownames(train))), ]

fit1 <- lm(Life.expectancy~ .*., data=train)
summary(fit1)

## 
## Call:
## lm(formula = Life.expectancy ~ . * ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1919  -3.0993   0.6882   3.5096  16.1592 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.214e+01  6.952e+00   1.746 0.081019 .  
## Alcohol                          -4.464e+00  5.973e-01  -7.473 1.21e-13 ***
## Percentage.expenditure            1.115e-02  1.576e-03   7.078 2.08e-12 ***
## BMI                               1.534e+00  3.062e-01   5.011 5.95e-07 ***
## Schooling                         3.759e+00  6.031e-01   6.233 5.66e-10 ***
## Alcohol:Percentage.expenditure   -7.749e-05  2.339e-05  -3.312 0.000944 ***
## Alcohol:BMI                       1.158e-01  2.756e-02   4.201 2.78e-05 ***
## Alcohol:Schooling                 9.889e-02  1.677e-02   5.897 4.40e-09 ***
## Percentage.expenditure:BMI       -2.398e-04  6.033e-05  -3.975 7.31e-05 ***
## Percentage.expenditure:Schooling -2.322e-04  3.494e-05  -6.644 4.01e-11 ***
## BMI:Schooling                    -8.706e-02  2.559e-02  -3.403 0.000681 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.451 on 1819 degrees of freedom
## Multiple R-squared:  0.6847, Adjusted R-squared:  0.683 
## F-statistic: 395.1 on 10 and 1819 DF,  p-value: < 2.2e-16

fit2 <- lm(Life.expectancy~ .*. - BMI:Schooling, data=train)
summary(fit2)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.8594  -3.2786   0.6505   3.5535  14.0445 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.462e+01  2.168e+00  15.967  < 2e-16 ***
## Alcohol                          -4.196e+00  5.938e-01  -7.066 2.27e-12 ***
## Percentage.expenditure            1.284e-02  1.499e-03   8.566  < 2e-16 ***
## BMI                               5.563e-01  1.059e-01   5.253 1.68e-07 ***
## Schooling                         1.728e+00  8.678e-02  19.915  < 2e-16 ***
## Alcohol:Percentage.expenditure   -6.813e-05  2.330e-05  -2.924   0.0035 ** 
## Alcohol:BMI                       1.124e-01  2.763e-02   4.069 4.92e-05 ***
## Alcohol:Schooling                 8.326e-02  1.617e-02   5.147 2.93e-07 ***
## Percentage.expenditure:BMI       -2.994e-04  5.790e-05  -5.171 2.58e-07 ***
## Percentage.expenditure:Schooling -2.484e-04  3.471e-05  -7.155 1.21e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.467 on 1820 degrees of freedom
## Multiple R-squared:  0.6827, Adjusted R-squared:  0.6812 
## F-statistic: 435.2 on 9 and 1820 DF,  p-value: < 2.2e-16

fit3 <- lm(Life.expectancy~ .*. - BMI:Schooling -Alcohol:Percentage.expenditure, data=train)
summary(fit3)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.859  -3.234   0.809   3.559  14.072 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.374e+01  2.152e+00  15.681  < 2e-16 ***
## Alcohol                          -4.042e+00  5.927e-01  -6.819 1.24e-11 ***
## Percentage.expenditure            1.172e-02  1.452e-03   8.069 1.27e-15 ***
## BMI                               5.726e-01  1.060e-01   5.403 7.40e-08 ***
## Schooling                         1.795e+00  8.385e-02  21.413  < 2e-16 ***
## Alcohol:BMI                       1.082e-01  2.765e-02   3.914 9.40e-05 ***
## Alcohol:Schooling                 7.550e-02  1.599e-02   4.722 2.52e-06 ***
## Percentage.expenditure:BMI       -2.813e-04  5.769e-05  -4.876 1.18e-06 ***
## Percentage.expenditure:Schooling -2.462e-04  3.478e-05  -7.079 2.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.478 on 1821 degrees of freedom
## Multiple R-squared:  0.6812, Adjusted R-squared:  0.6798 
## F-statistic: 486.5 on 8 and 1821 DF,  p-value: < 2.2e-16

vif(fit3)

##                          Alcohol           Percentage.expenditure 
##                       350.060632                       638.422810 
##                              BMI                        Schooling 
##                         3.147164                         4.558708 
##                      Alcohol:BMI                Alcohol:Schooling 
##                       512.298895                        63.716954 
##       Percentage.expenditure:BMI Percentage.expenditure:Schooling 
##                       668.827426                        95.974593

simple.fit1 <- lm(Life.expectancy~ ., data=train)
summary(simple.fit1)

## 
## Call:
## lm(formula = Life.expectancy ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6590  -3.1202   0.8582   3.6367  15.3464 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.742e+01  1.677e+00  16.350  < 2e-16 ***
## Alcohol                -2.555e-01  4.122e-02  -6.198 7.04e-10 ***
## Percentage.expenditure  6.511e-04  6.713e-05   9.700  < 2e-16 ***
## BMI                     7.115e-01  8.131e-02   8.751  < 2e-16 ***
## Schooling               2.033e+00  6.538e-02  31.092  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.697 on 1825 degrees of freedom
## Multiple R-squared:  0.6545, Adjusted R-squared:  0.6538 
## F-statistic: 864.5 on 4 and 1825 DF,  p-value: < 2.2e-16

vif(simple.fit1)

##                Alcohol Percentage.expenditure                    BMI 
##               1.566037               1.262061               1.713181 
##              Schooling 
##               2.562758

Evaluating the Model

To start the evaluation of the model we will examine the following: F Statistic, P Values, Coefficients.

The F-Statistic is not significant. The P-values are significant. The corrplot makes it appear that schooling is significant along with the interaction of schooling and BMI.

summary(fit3)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.859  -3.234   0.809   3.559  14.072 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.374e+01  2.152e+00  15.681  < 2e-16 ***
## Alcohol                          -4.042e+00  5.927e-01  -6.819 1.24e-11 ***
## Percentage.expenditure            1.172e-02  1.452e-03   8.069 1.27e-15 ***
## BMI                               5.726e-01  1.060e-01   5.403 7.40e-08 ***
## Schooling                         1.795e+00  8.385e-02  21.413  < 2e-16 ***
## Alcohol:BMI                       1.082e-01  2.765e-02   3.914 9.40e-05 ***
## Alcohol:Schooling                 7.550e-02  1.599e-02   4.722 2.52e-06 ***
## Percentage.expenditure:BMI       -2.813e-04  5.769e-05  -4.876 1.18e-06 ***
## Percentage.expenditure:Schooling -2.462e-04  3.478e-05  -7.079 2.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.478 on 1821 degrees of freedom
## Multiple R-squared:  0.6812, Adjusted R-squared:  0.6798 
## F-statistic: 486.5 on 8 and 1821 DF,  p-value: < 2.2e-16

anova(fit3)

## Analysis of Variance Table
## 
## Response: Life.expectancy
##                                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## Alcohol                             1  25230   25230  840.6211 < 2.2e-16 ***
## Percentage.expenditure              1  16677   16677  555.6545 < 2.2e-16 ***
## BMI                                 1  38945   38945 1297.5833 < 2.2e-16 ***
## Schooling                           1  31376   31376 1045.4008 < 2.2e-16 ***
## Alcohol:BMI                         1   1434    1434   47.7628 6.629e-12 ***
## Alcohol:Schooling                   1    240     240    7.9888  0.004758 ** 
## Percentage.expenditure:BMI          1   1400    1400   46.6413 1.158e-11 ***
## Percentage.expenditure:Schooling    1   1504    1504   50.1118 2.065e-12 ***
## Residuals                        1821  54655      30                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit3, pch=16)

fit3.df <- augment(fit3)
pairs(fit3.df, pch=16)

corrplot(cor(fit3.df), method="ellipse")

Examine Residuals

At first glance there does not appear to be an evident pattern. However, we are noticing a cluster that could potentially be a circular pattern in the middle right section.

fit3.df <- augment(fit3)
ggplot(fit3.df, aes(x = .fitted, y = .resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

Formal Test for Non-Constant Variance

The p-value is low here so there is an issue.

ncvTest(fit3)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 97.61545, Df = 1, p = < 2.22e-16

Normal Quantile Plot

There is also an issue here as many of the dots fall outside of the dashed lines.

qqPlot(fit3,pch=16)

## [1]  45 830

There is also an issue here as many of the dots fall outside of the dashed lines.

Formal Test for Normality of Residuals

P-value is small here so there is a violation.

shapiro.test(fit3$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.9624, p-value < 2.2e-16

Transformation Process

Box-Cox Test

Since there appears to be non-normal variables we will look at applying a Box-Cox test to help us determine the transformation. After running the test we notice the curve peaks at 3 and suggests that our outcome variable should cubed.

boxCox(fit3, lambda = seq(-2,5))

Transforming the Outcome Variable by Cubing

We will cube our outcome variable now based upon the previous Box-Cox test.

fit4 <- lm(I(Life.expectancy^3) ~ .*. - BMI:Schooling -Alcohol:Percentage.expenditure, data=train)

Re-Testing

We will now re-run tests on the transformation. We will again examine: F Statistic, P Values, Coefficients. We see that the F-Statistic is not significant. The P-values are significant. Schooling and the interaction of percentage expenditure and schooling continue to be significant coefficients and this time percentage expenditure is more significant as well. The corrplot shows schooling to be significant.

summary(fit4)

## 
## Call:
## lm(formula = I(Life.expectancy^3) ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218859  -45201    5554   45440  256355 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.050e+05  2.809e+04  -3.738 0.000191 ***
## Alcohol                          -4.136e+04  7.736e+03  -5.347 1.01e-07 ***
## Percentage.expenditure            1.401e+02  1.895e+01   7.395 2.14e-13 ***
## BMI                               6.814e+03  1.383e+03   4.926 9.16e-07 ***
## Schooling                         2.353e+04  1.095e+03  21.498  < 2e-16 ***
## Alcohol:BMI                       8.791e+02  3.609e+02   2.436 0.014938 *  
## Alcohol:Schooling                 1.223e+03  2.087e+02   5.858 5.54e-09 ***
## Percentage.expenditure:BMI       -3.201e+00  7.530e-01  -4.251 2.24e-05 ***
## Percentage.expenditure:Schooling -2.970e+00  4.540e-01  -6.541 7.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71510 on 1821 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.7039 
## F-statistic: 544.4 on 8 and 1821 DF,  p-value: < 2.2e-16

anova(fit4)

## Analysis of Variance Table
## 
## Response: I(Life.expectancy^3)
##                                    Df     Sum Sq    Mean Sq  F value    Pr(>F)
## Alcohol                             1 5.7382e+12 5.7382e+12 1122.091 < 2.2e-16
## Percentage.expenditure              1 4.0402e+12 4.0402e+12  790.055 < 2.2e-16
## BMI                                 1 6.0746e+12 6.0746e+12 1187.878 < 2.2e-16
## Schooling                           1 5.7293e+12 5.7293e+12 1120.354 < 2.2e-16
## Alcohol:BMI                         1 1.9421e+11 1.9421e+11   37.978 8.787e-10
## Alcohol:Schooling                   1 8.9272e+10 8.9272e+10   17.457 3.078e-05
## Percentage.expenditure:BMI          1 1.8704e+11 1.8704e+11   36.576 1.777e-09
## Percentage.expenditure:Schooling    1 2.1881e+11 2.1881e+11   42.788 7.905e-11
## Residuals                        1821 9.3123e+12 5.1138e+09                   
##                                     
## Alcohol                          ***
## Percentage.expenditure           ***
## BMI                              ***
## Schooling                        ***
## Alcohol:BMI                      ***
## Alcohol:Schooling                ***
## Percentage.expenditure:BMI       ***
## Percentage.expenditure:Schooling ***
## Residuals                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit4, pch=16)

fit4.df <- augment(fit4)
pairs(fit4.df, pch=16)

corrplot(cor(fit4.df), method="ellipse")

Examine Residuals in a histrogram and scatter plot

fit4.df <- augment(fit4)
ggplot(fit4.df, aes(x=.std.resid)) + 
  geom_histogram(bins=20, color = "black", fill="light gray")

fit4.df <- augment(fit4)
ggplot(fit4.df, aes(x = .fitted, y = .std.resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(fit4.df, aes(x=.std.resid)) +
  geom_histogram(aes(y=..density..), bins=20, color = "black", fill="light gray") +
  stat_function(fun=dnorm, color="red", size=1.2,
                args=list(mean=mean(fit4.df$.std.resid),
                          sd=sd(fit4.df$.std.resid)))

Formal Test for Non-Constant Variance

Running this test the p-value is much closer to 1 and appears okay.

ncvTest(fit4)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.0004265518, Df = 1, p = 0.98352

Normal Quantile Plot

There is significant improvement this time in the qqPlot. However, there are still several dots outside of the dashed lines.

qqPlot(fit4,pch=16)

## [1]  853 1025

Formal Test for Normality of Residuals

The p-value is small here so there is still an issue.

shapiro.test(fit4$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit4$residuals
## W = 0.9928, p-value = 8.305e-08

Create Cook’s D Plot

Visually there appear to be outliers, but in actuality none of the values are greater than 1. So this should be okay.

plot(fit4, which = 4)

Identifying Multcollinearity

We will check the Variance Inflation Factor (VIF) first based on examining the correlation plots earlier. First we will check alcohol and schooling. It looks okay as the results are less than 10.

LE.vif1 <- lm(Life.expectancy~Alcohol+Schooling, data=LE.col)
summary(LE.vif1)

## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol + Schooling, data = LE.col)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.6917  -3.3005   0.7604   3.9448  16.0856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.54762    0.50698  78.006  < 2e-16 ***
## Alcohol     -0.25084    0.03741  -6.705 2.52e-11 ***
## Schooling    2.54522    0.04691  54.259  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.895 on 2285 degrees of freedom
## Multiple R-squared:  0.6278, Adjusted R-squared:  0.6275 
## F-statistic:  1927 on 2 and 2285 DF,  p-value: < 2.2e-16

vif(LE.vif1)

##   Alcohol Schooling 
##  1.500289  1.500289

Now we will check BMI and Schooling. It also looks fine with results being less than 10.

LE.vif2 <- lm(Life.expectancy~BMI+Schooling, data=LE.col)
summary(LE.vif2)

## 
## Call:
## lm(formula = Life.expectancy ~ BMI + Schooling, data = LE.col)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.4551  -3.2466   0.8914   3.8893  16.1896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.66671    1.53165   16.76   <2e-16 ***
## BMI          0.74976    0.07309   10.26   <2e-16 ***
## Schooling    2.05165    0.04852   42.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.82 on 2285 degrees of freedom
## Multiple R-squared:  0.6372, Adjusted R-squared:  0.6369 
## F-statistic:  2006 on 2 and 2285 DF,  p-value: < 2.2e-16

vif(LE.vif2)

##       BMI Schooling 
##   1.64675   1.64675

Lastly we will check percentage expenditure and schooling. This is also okay with results being less than 10.

LE.vif3 <- lm(Life.expectancy~Percentage.expenditure+Schooling, data=LE.col)
summary(LE.vif3)

## 
## Call:
## lm(formula = Life.expectancy ~ Percentage.expenditure + Schooling, 
##     data = LE.col)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7012  -3.1553   0.7411   4.0035  15.3538 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.182e+01  5.095e-01  82.087  < 2e-16 ***
## Percentage.expenditure 4.518e-04  6.191e-05   7.297 4.03e-13 ***
## Schooling              2.228e+00  4.250e-02  52.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.884 on 2285 degrees of freedom
## Multiple R-squared:  0.6291, Adjusted R-squared:  0.6288 
## F-statistic:  1938 on 2 and 2285 DF,  p-value: < 2.2e-16

vif(LE.vif3)

## Percentage.expenditure              Schooling 
##               1.235766               1.235766

Review Adjusted R^2

In reviewing the adjusted R^2 for our latest model, approximately 70% of variance is explained. In our first model approximately 68% of variance is explained. So there is a slight improvement.

summary(fit4)

## 
## Call:
## lm(formula = I(Life.expectancy^3) ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218859  -45201    5554   45440  256355 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.050e+05  2.809e+04  -3.738 0.000191 ***
## Alcohol                          -4.136e+04  7.736e+03  -5.347 1.01e-07 ***
## Percentage.expenditure            1.401e+02  1.895e+01   7.395 2.14e-13 ***
## BMI                               6.814e+03  1.383e+03   4.926 9.16e-07 ***
## Schooling                         2.353e+04  1.095e+03  21.498  < 2e-16 ***
## Alcohol:BMI                       8.791e+02  3.609e+02   2.436 0.014938 *  
## Alcohol:Schooling                 1.223e+03  2.087e+02   5.858 5.54e-09 ***
## Percentage.expenditure:BMI       -3.201e+00  7.530e-01  -4.251 2.24e-05 ***
## Percentage.expenditure:Schooling -2.970e+00  4.540e-01  -6.541 7.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71510 on 1821 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.7039 
## F-statistic: 544.4 on 8 and 1821 DF,  p-value: < 2.2e-16

summary(fit1)

## 
## Call:
## lm(formula = Life.expectancy ~ . * ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1919  -3.0993   0.6882   3.5096  16.1592 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.214e+01  6.952e+00   1.746 0.081019 .  
## Alcohol                          -4.464e+00  5.973e-01  -7.473 1.21e-13 ***
## Percentage.expenditure            1.115e-02  1.576e-03   7.078 2.08e-12 ***
## BMI                               1.534e+00  3.062e-01   5.011 5.95e-07 ***
## Schooling                         3.759e+00  6.031e-01   6.233 5.66e-10 ***
## Alcohol:Percentage.expenditure   -7.749e-05  2.339e-05  -3.312 0.000944 ***
## Alcohol:BMI                       1.158e-01  2.756e-02   4.201 2.78e-05 ***
## Alcohol:Schooling                 9.889e-02  1.677e-02   5.897 4.40e-09 ***
## Percentage.expenditure:BMI       -2.398e-04  6.033e-05  -3.975 7.31e-05 ***
## Percentage.expenditure:Schooling -2.322e-04  3.494e-05  -6.644 4.01e-11 ***
## BMI:Schooling                    -8.706e-02  2.559e-02  -3.403 0.000681 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.451 on 1819 degrees of freedom
## Multiple R-squared:  0.6847, Adjusted R-squared:  0.683 
## F-statistic: 395.1 on 10 and 1819 DF,  p-value: < 2.2e-16

Drop Non-Significant Predictor Variables

All the variables and interactions appear to be significant so we will not drop any of the predictor variables.

summary(fit4)

## 
## Call:
## lm(formula = I(Life.expectancy^3) ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -218859  -45201    5554   45440  256355 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.050e+05  2.809e+04  -3.738 0.000191 ***
## Alcohol                          -4.136e+04  7.736e+03  -5.347 1.01e-07 ***
## Percentage.expenditure            1.401e+02  1.895e+01   7.395 2.14e-13 ***
## BMI                               6.814e+03  1.383e+03   4.926 9.16e-07 ***
## Schooling                         2.353e+04  1.095e+03  21.498  < 2e-16 ***
## Alcohol:BMI                       8.791e+02  3.609e+02   2.436 0.014938 *  
## Alcohol:Schooling                 1.223e+03  2.087e+02   5.858 5.54e-09 ***
## Percentage.expenditure:BMI       -3.201e+00  7.530e-01  -4.251 2.24e-05 ***
## Percentage.expenditure:Schooling -2.970e+00  4.540e-01  -6.541 7.90e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71510 on 1821 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.7039 
## F-statistic: 544.4 on 8 and 1821 DF,  p-value: < 2.2e-16

Calculate RMSE and Apply Model to Test Data

The RMSE calculation is 71502.096

RMSE_model <- RMSE(fit4)

When we apply the model to the test data the RMSE is 372980.240. There is a significant difference.

p <- predict(fit4, newdata = test)
error <- (p - test$Life.expectancy)
RMSE_test <- sqrt(mean(error^2))

Combine the two RMSE results into a single table, then view the result

table1 <- data.frame(RMSE_model, RMSE_test)
table1

##   RMSE_model RMSE_test
## 1   71334.97  242989.2

Manual Calculation of Training Data & Back Transformation

Take the cube root of life expectancy variable

error_train <- (train$Life.expectancy - (fit4.df$.fitted)^(1/3))
RMSE_model <- sqrt(mean(error_train^2))

Modify the calculation of the RMSE for the test data

p <- (predict(fit4, newdata = test))^(1/3)
error <- (p - test$Life.expectancy)
RMSE_test <- sqrt(mean(error^2))

Commentary

Our Approach

In our approach we explored the variables of our data set, correlations, and created several fits of a logic model. We found out we had to transform our predictor variable and transform our training data to be cubed regression models. This was because the best model we fitted was violating many of the formal tests. Once we completed the data transformation the model appears to be a decent fit.

With regards to our problem statement, Schooling (defined as the average number of schooling years for citizens) is the most important variable. With this important insight we can pivot our consulting services from implementing an educational program to helping offer programming around keeping students in school. In other words, our services would assist a country’s government and other stakeholders to identify how to keep students in school longer as this variable has the most impact on life expectancy.

Meaningful Discoveries & Surprises

The strongest correlation that we found, and the one we decided to focus on was the correlation between life expectancy and schooling. We decided to focus on this correlation specifically because the other meaningful correlations we found were surprising. We made sure to confirm that this is a causal correlation via outside research.

We found that life expectancy was correlated with BMI. We were initially surprised by this finding because we thought that a high BMI would actually decrease life expectancy because of the health problems associated with obesity, but then we considered that if someone is well fed their life expectancy would be higher than those with low BMI. For these differing reasons, we decided not to focus on this correlation.

Another surprise that we found was that alcohol consumption was strongly correlated with schooling. Obviously, correlation does not equal causation, but we found that alcohol is correlated with schooling, which would imply that the more alcohol a country drinks, the more schooling that country has. This is a correlation we found surprising, and decided not to explore further because as a business that hopes to offer educational programs, we could not responsibly offer alcohol use as a solution for schooling.

Another surprising correlation we identified was within the moderate correlations between life expectancy and alcohol. Again, correlation does not equal causation, but the correlation seemed to suggest that higher rates of alcohol use relates to higher life expectancies. Obviously, this feels irresponsible to suggest as a solution for increasing life expectancy.

All of the correlations we found led us to limit our focus on the connection between life expectancy and schooling, especially since we were able to confirm that schooling and life expectancy are causally correlated.

Assumptions

Our assumptions can be divided into two categories: data assumptions and impact assumptions.

Data assumptions

We had to assume that the majority of our data was accurate and reliable. This is especially pertinent because we are unaware of how the data was initially collected and recorded.

Impact assumptions

We are assuming that this data captures the variables that have the most significant influence on life expectancy. We are also assuming that the correlative link between our variables and life expectancy might also be causal. After some initial research, we did find evidence supporting the fact that education has a causal link to life expectancy and that investing in healthcare education reduces future health costs for countries.

Missing Variables/Variables Wish We Had

One of the variables missing, and that we wish we could have evaluated, was gender. One assumption we have is that there may be a disparity between females and males, and proving that could help us target our services to a more specific audience. If life expectancy is related to years of schooling, and there is a significant difference in schooling between the both, we could offer services to incentivize school attendance by gender.

Given the current pandemic, we are curious to see whether COVID 19 will have a significant impact on life expectancy overall. If we could collect the data, we could analyze the progression of life expectancy as a result of COVID.

A third variable that could provide insights, would be cause of death. If there are certain diseases that are more common in a given country, then educational programs so as to raise awareness on how to prevent such diseases, could have a positive impact on life expectancy.

Another source of insight, would be data that provides specifics on why students are dropping out of school, or why they cannot attend school. If schooling impacts life expectancy, then provided mechanisms to access and remain in school would prove beneficial. This could also be analyzed by obtaining data on education funding and the geographic locations (urban vs. rural) where schooling rate are lower.

How does the model address the basic question?

Using our model, we can predict the life expectancy for any given country based on a variety of lifestyle variables. However, our modeling process led us to focus on the impact of schooling on life expectancy. Because of this, we can use our model to explore which countries we should focus on providing schooling to.

Other Factors

We would have liked to take a closer look at individual countries and how the data changes over time. Trends could provide insights, such as investment, or lack of investment, in education. Or see if a country with a spike in BMI has any potential effects. These trends could indicate a problem and could be addressed through our services or help us identify new services our firm should be offering.

Is the Model Suitable?

Our final regression model explains about 70% of the variance in our data and passed a majority of the residual tests we conducted. Our RMSE test does suggest that there is room for improvement in our model. Given these findings, we would say that our regression model is a fairly suitable fit based on the data we had available.

Appendix

This section includes additional testing and code we ran to help us with our analysis. It includes additional testing of models that ended up not being very relevant, but were still helpful in our analysis. We decided to omit these steps from the main markdown file in order to ensure our work is clear and cohesive to the reader.

Test Model Fit1

We became curious on if Model Fit1 might actual be a good model and decided to test it.This was when we ran into the issue of having to transform Fit4, but were not quite sure how to do it. We decided to test a few models and create some additional ones. They all violated the tests and we moved on from it. The below sections highlight this work.

summary(fit1)

## 
## Call:
## lm(formula = Life.expectancy ~ . * ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1919  -3.0993   0.6882   3.5096  16.1592 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.214e+01  6.952e+00   1.746 0.081019 .  
## Alcohol                          -4.464e+00  5.973e-01  -7.473 1.21e-13 ***
## Percentage.expenditure            1.115e-02  1.576e-03   7.078 2.08e-12 ***
## BMI                               1.534e+00  3.062e-01   5.011 5.95e-07 ***
## Schooling                         3.759e+00  6.031e-01   6.233 5.66e-10 ***
## Alcohol:Percentage.expenditure   -7.749e-05  2.339e-05  -3.312 0.000944 ***
## Alcohol:BMI                       1.158e-01  2.756e-02   4.201 2.78e-05 ***
## Alcohol:Schooling                 9.889e-02  1.677e-02   5.897 4.40e-09 ***
## Percentage.expenditure:BMI       -2.398e-04  6.033e-05  -3.975 7.31e-05 ***
## Percentage.expenditure:Schooling -2.322e-04  3.494e-05  -6.644 4.01e-11 ***
## BMI:Schooling                    -8.706e-02  2.559e-02  -3.403 0.000681 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.451 on 1819 degrees of freedom
## Multiple R-squared:  0.6847, Adjusted R-squared:  0.683 
## F-statistic: 395.1 on 10 and 1819 DF,  p-value: < 2.2e-16

anova(fit1)

## Analysis of Variance Table
## 
## Response: Life.expectancy
##                                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## Alcohol                             1  25230   25230  849.0130 < 2.2e-16 ***
## Percentage.expenditure              1  16677   16677  561.2016 < 2.2e-16 ***
## BMI                                 1  38945   38945 1310.5370 < 2.2e-16 ***
## Schooling                           1  31376   31376 1055.8371 < 2.2e-16 ***
## Alcohol:Percentage.expenditure      1      2       2    0.0549 0.8146902    
## Alcohol:BMI                         1   1495    1495   50.3231 1.860e-12 ***
## Alcohol:Schooling                   1    291     291    9.7914 0.0017811 ** 
## Percentage.expenditure:BMI          1   1515    1515   50.9671 1.352e-12 ***
## Percentage.expenditure:Schooling    1   1530    1530   51.4901 1.043e-12 ***
## BMI:Schooling                       1    344     344   11.5795 0.0006813 ***
## Residuals                        1819  54055      30                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit1, pch=16)

Examine Residuals in a histrogram and scatter plot

fit1.df <- augment(fit1)
ggplot(fit1.df, aes(x=.std.resid)) + 
  geom_histogram(bins=20, color = "black", fill="light gray")

fit1.df <- augment(fit1)
ggplot(fit1.df, aes(x = .fitted, y = .std.resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(fit1.df, aes(x=.std.resid)) +
  geom_histogram(aes(y=..density..), bins=20, color = "black", fill="light gray") +
  stat_function(fun=dnorm, color="red", size=1.2,
                args=list(mean=mean(fit1.df$.std.resid),
                          sd=sd(fit1.df$.std.resid)))

Formal Test for Non-Constant Variance

Fit1 violates this test as very small (not close to 1).

ncvTest(fit1)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 120.2996, Df = 1, p = < 2.22e-16

Normal Quantile Plot

Seems to violate here too as many dots are outside of dashed lines.

qqPlot(fit1,pch=16)

## [1]  45 830

Formal Test for Normality of Residuals

The p-value is small here so there is still an issue with fit1.

shapiro.test(fit1$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.95964, p-value < 2.2e-16

Create Cook’s D Plot

Visually there appear to be outliers, but in actuality none of the values are greater than 1. So this should be okay for fit1.

plot(fit1, which = 4)

Test Model Fit3

We also decided to test Fit3 to see how it would do. It also violated the tests.

summary(fit3)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.859  -3.234   0.809   3.559  14.072 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.374e+01  2.152e+00  15.681  < 2e-16 ***
## Alcohol                          -4.042e+00  5.927e-01  -6.819 1.24e-11 ***
## Percentage.expenditure            1.172e-02  1.452e-03   8.069 1.27e-15 ***
## BMI                               5.726e-01  1.060e-01   5.403 7.40e-08 ***
## Schooling                         1.795e+00  8.385e-02  21.413  < 2e-16 ***
## Alcohol:BMI                       1.082e-01  2.765e-02   3.914 9.40e-05 ***
## Alcohol:Schooling                 7.550e-02  1.599e-02   4.722 2.52e-06 ***
## Percentage.expenditure:BMI       -2.813e-04  5.769e-05  -4.876 1.18e-06 ***
## Percentage.expenditure:Schooling -2.462e-04  3.478e-05  -7.079 2.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.478 on 1821 degrees of freedom
## Multiple R-squared:  0.6812, Adjusted R-squared:  0.6798 
## F-statistic: 486.5 on 8 and 1821 DF,  p-value: < 2.2e-16

anova(fit3)

## Analysis of Variance Table
## 
## Response: Life.expectancy
##                                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## Alcohol                             1  25230   25230  840.6211 < 2.2e-16 ***
## Percentage.expenditure              1  16677   16677  555.6545 < 2.2e-16 ***
## BMI                                 1  38945   38945 1297.5833 < 2.2e-16 ***
## Schooling                           1  31376   31376 1045.4008 < 2.2e-16 ***
## Alcohol:BMI                         1   1434    1434   47.7628 6.629e-12 ***
## Alcohol:Schooling                   1    240     240    7.9888  0.004758 ** 
## Percentage.expenditure:BMI          1   1400    1400   46.6413 1.158e-11 ***
## Percentage.expenditure:Schooling    1   1504    1504   50.1118 2.065e-12 ***
## Residuals                        1821  54655      30                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit3, pch=16)

Examine Residuals in a histrogram and scatter plot

fit3.df <- augment(fit3)
ggplot(fit3.df, aes(x=.std.resid)) + 
  geom_histogram(bins=20, color = "black", fill="light gray")

fit3.df <- augment(fit3)
ggplot(fit1.df, aes(x = .fitted, y = .std.resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(fit3.df, aes(x=.std.resid)) +
  geom_histogram(aes(y=..density..), bins=20, color = "black", fill="light gray") +
  stat_function(fun=dnorm, color="red", size=1.2,
                args=list(mean=mean(fit3.df$.std.resid),
                          sd=sd(fit3.df$.std.resid)))

Formal Test for Non-Constant Variance

Fit3 violates this test as very small (not close to 1).

ncvTest(fit3)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 97.61545, Df = 1, p = < 2.22e-16

Normal Quantile Plot

Seems to violate here too as many dots are outside of dashed lines.

qqPlot(fit3,pch=16)

## [1]  45 830

Formal Test for Normality of Residuals

The p-value is small here so there is still an issue with fit3.

shapiro.test(fit3$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.9624, p-value < 2.2e-16

Create Cook’s D Plot

Visually there appear to be outliers, but in actuality none of the values are greater than 1. So this should be okay for fit3.

plot(fit3, which = 4)

Fit5: Remove interaction of Alcohol & Percentage Expenditure

We created Fit5 after testing Fit1 to make a model removing Alcohol & Percentage Expenditure. This model also violated tests.

fit5 <- lm(Life.expectancy~ .*. - Alcohol:Percentage.expenditure, data=train)
summary(fit5)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1530  -3.0719   0.7875   3.4836  15.9476 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.372e+01  6.954e+00   1.973 0.048617 *  
## Alcohol                          -4.260e+00  5.958e-01  -7.151 1.24e-12 ***
## Percentage.expenditure            1.008e-02  1.546e-03   6.519 9.13e-11 ***
## BMI                               1.441e+00  3.058e-01   4.712 2.64e-06 ***
## Schooling                         3.602e+00  6.029e-01   5.975 2.76e-09 ***
## Alcohol:BMI                       1.107e-01  2.760e-02   4.012 6.28e-05 ***
## Alcohol:Schooling                 8.840e-02  1.651e-02   5.353 9.73e-08 ***
## Percentage.expenditure:BMI       -2.263e-04  6.036e-05  -3.749 0.000183 ***
## Percentage.expenditure:Schooling -2.316e-04  3.504e-05  -6.609 5.05e-11 ***
## BMI:Schooling                    -7.710e-02  2.548e-02  -3.026 0.002511 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.466 on 1820 degrees of freedom
## Multiple R-squared:  0.6828, Adjusted R-squared:  0.6813 
## F-statistic: 435.4 on 9 and 1820 DF,  p-value: < 2.2e-16

Test Model Fit5

Alcohol and percentage expenditure interaction is not significant in Fit5 model.

summary(fit5)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - Alcohol:Percentage.expenditure, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1530  -3.0719   0.7875   3.4836  15.9476 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.372e+01  6.954e+00   1.973 0.048617 *  
## Alcohol                          -4.260e+00  5.958e-01  -7.151 1.24e-12 ***
## Percentage.expenditure            1.008e-02  1.546e-03   6.519 9.13e-11 ***
## BMI                               1.441e+00  3.058e-01   4.712 2.64e-06 ***
## Schooling                         3.602e+00  6.029e-01   5.975 2.76e-09 ***
## Alcohol:BMI                       1.107e-01  2.760e-02   4.012 6.28e-05 ***
## Alcohol:Schooling                 8.840e-02  1.651e-02   5.353 9.73e-08 ***
## Percentage.expenditure:BMI       -2.263e-04  6.036e-05  -3.749 0.000183 ***
## Percentage.expenditure:Schooling -2.316e-04  3.504e-05  -6.609 5.05e-11 ***
## BMI:Schooling                    -7.710e-02  2.548e-02  -3.026 0.002511 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.466 on 1820 degrees of freedom
## Multiple R-squared:  0.6828, Adjusted R-squared:  0.6813 
## F-statistic: 435.4 on 9 and 1820 DF,  p-value: < 2.2e-16

anova(fit5)

## Analysis of Variance Table
## 
## Response: Life.expectancy
##                                    Df Sum Sq Mean Sq   F value    Pr(>F)    
## Alcohol                             1  25230   25230  844.3870 < 2.2e-16 ***
## Percentage.expenditure              1  16677   16677  558.1438 < 2.2e-16 ***
## BMI                                 1  38945   38945 1303.3963 < 2.2e-16 ***
## Schooling                           1  31376   31376 1050.0841 < 2.2e-16 ***
## Alcohol:BMI                         1   1434    1434   47.9768 5.961e-12 ***
## Alcohol:Schooling                   1    240     240    8.0245  0.004666 ** 
## Percentage.expenditure:BMI          1   1400    1400   46.8502 1.044e-11 ***
## Percentage.expenditure:Schooling    1   1504    1504   50.3363 1.848e-12 ***
## BMI:Schooling                       1    274     274    9.1579  0.002511 ** 
## Residuals                        1820  54381      30                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit5, pch=16)

Examine Residuals in a histrogram and scatter plot

fit5.df <- augment(fit5)
ggplot(fit1.df, aes(x=.std.resid)) + 
  geom_histogram(bins=20, color = "black", fill="light gray")

fit5.df <- augment(fit1)
ggplot(fit5.df, aes(x = .fitted, y = .std.resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(fit5.df, aes(x=.std.resid)) +
  geom_histogram(aes(y=..density..), bins=20, color = "black", fill="light gray") +
  stat_function(fun=dnorm, color="red", size=1.2,
                args=list(mean=mean(fit5.df$.std.resid),
                          sd=sd(fit5.df$.std.resid)))

Formal Test for Non-Constant Variance

Fit5 violates this test as very small (not close to 1).

ncvTest(fit5)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 114.622, Df = 1, p = < 2.22e-16

Normal Quantile Plot

Seems to violate here too as many dots are outside of dashed lines.

qqPlot(fit5,pch=16)

## [1]  45 830

Formal Test for Normality of Residuals

The p-value is small here so there is still an issue with Fit5.

shapiro.test(fit5$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit5$residuals
## W = 0.95919, p-value < 2.2e-16

Create Cook’s D Plot

Visually there appear to be outliers, but in actuality none of the values are greater than 1. So this should be okay for fit1.

plot(fit5, which = 4)

Fit6: Remove interaction of Alcohol & Schooling

Fitted another model to remove interaction action of alcohol and schooling in addition to the interaction from Fit 5 (alcohol and percentage expenditure) We tested this model as well and it did not perform well.

fit6 <- lm(Life.expectancy~ .*. - BMI:Schooling  -Alcohol:Percentage.expenditure -Alcohol:Schooling, data=train)
summary(fit6)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure - 
##     Alcohol:Schooling, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.838  -3.134   0.795   3.455  15.295 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.637e+01  2.090e+00  17.399  < 2e-16 ***
## Alcohol                          -5.109e+00  5.511e-01  -9.272  < 2e-16 ***
## Percentage.expenditure            1.169e-02  1.460e-03   8.002 2.15e-15 ***
## BMI                               3.502e-01  9.548e-02   3.668 0.000252 ***
## Schooling                         2.039e+00  6.652e-02  30.649  < 2e-16 ***
## Alcohol:BMI                       1.903e-01  2.163e-02   8.798  < 2e-16 ***
## Percentage.expenditure:BMI       -3.081e-04  5.774e-05  -5.335 1.07e-07 ***
## Percentage.expenditure:Schooling -1.954e-04  3.326e-05  -5.874 5.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.51 on 1822 degrees of freedom
## Multiple R-squared:  0.6773, Adjusted R-squared:  0.6761 
## F-statistic: 546.4 on 7 and 1822 DF,  p-value: < 2.2e-16

Test Model Fit6

summary(fit6)

## 
## Call:
## lm(formula = Life.expectancy ~ . * . - BMI:Schooling - Alcohol:Percentage.expenditure - 
##     Alcohol:Schooling, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.838  -3.134   0.795   3.455  15.295 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.637e+01  2.090e+00  17.399  < 2e-16 ***
## Alcohol                          -5.109e+00  5.511e-01  -9.272  < 2e-16 ***
## Percentage.expenditure            1.169e-02  1.460e-03   8.002 2.15e-15 ***
## BMI                               3.502e-01  9.548e-02   3.668 0.000252 ***
## Schooling                         2.039e+00  6.652e-02  30.649  < 2e-16 ***
## Alcohol:BMI                       1.903e-01  2.163e-02   8.798  < 2e-16 ***
## Percentage.expenditure:BMI       -3.081e-04  5.774e-05  -5.335 1.07e-07 ***
## Percentage.expenditure:Schooling -1.954e-04  3.326e-05  -5.874 5.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.51 on 1822 degrees of freedom
## Multiple R-squared:  0.6773, Adjusted R-squared:  0.6761 
## F-statistic: 546.4 on 7 and 1822 DF,  p-value: < 2.2e-16

anova(fit6)

## Analysis of Variance Table
## 
## Response: Life.expectancy
##                                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## Alcohol                             1  25230   25230  830.909 < 2.2e-16 ***
## Percentage.expenditure              1  16677   16677  549.235 < 2.2e-16 ***
## BMI                                 1  38945   38945 1282.592 < 2.2e-16 ***
## Schooling                           1  31376   31376 1033.323 < 2.2e-16 ***
## Alcohol:BMI                         1   1434    1434   47.211 8.719e-12 ***
## Percentage.expenditure:BMI          1   1427    1427   46.996 9.706e-12 ***
## Percentage.expenditure:Schooling    1   1048    1048   34.498 5.060e-09 ***
## Residuals                        1822  55324      30                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

leveragePlots(fit6, pch=16)

Examine Residuals in a histrogram and scatter plot

fit6.df <- augment(fit6)
ggplot(fit6.df, aes(x=.std.resid)) + 
  geom_histogram(bins=20, color = "black", fill="light gray")

fit6.df <- augment(fit6)
ggplot(fit6.df, aes(x = .fitted, y = .std.resid))+
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(x = "Fitted Values", y = "Residuals")

ggplot(fit6.df, aes(x=.std.resid)) +
  geom_histogram(aes(y=..density..), bins=20, color = "black", fill="light gray") +
  stat_function(fun=dnorm, color="red", size=1.2,
                args=list(mean=mean(fit6.df$.std.resid),
                          sd=sd(fit6.df$.std.resid)))

Formal Test for Non-Constant Variance

Fit1 violates this test as very small (not close to 1).

ncvTest(fit6)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 103.241, Df = 1, p = < 2.22e-16

Normal Quantile Plot

Seems to violate here too as many dots are outside of dashed lines.

qqPlot(fit6,pch=16)

## [1]  45 830

Formal Test for Normality of Residuals

The p-value is small here so there is still an issue with fit6.

shapiro.test(fit6$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  fit6$residuals
## W = 0.95916, p-value < 2.2e-16

Create Cook’s D Plot

Visually there appear to be outliers, but in actuality none of the values are greater than 1. So this should be okay for fit1.

plot(fit6, which = 4)

R Final Project

Zeke Arteaga, Jessica Brotherson, Alejandra Castellanos, Grace Robison

Spring 2021

Part 1: Organizing and structuring the Dataset

1.2 Looking at the Dataset

1.2.1 Life Expectancy

1.2.2 Alcohol

1.2.4 BMI

1.2.5 Schooling

1.3 Developed vs. Developing Countries and Missing values

1.3 Distribution

1.3.1 Life expectancy

1.3.2 Alcohol

1.3.3 Percentage expenditure

1.3.4 BMI

2 Identifying the top 10 countries with the lowest life expectancy rates and lowest expenditure percentages.

Potential clients in Developed and Developing countries

3. Examination of Correlations for all continuous variables.

3.2 Scatterplot Matrices

Part 4: Building A Model

Evaluating the Model

Examine Residuals

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Transformation Process

Box-Cox Test

Transforming the Outcome Variable by Cubing

Re-Testing

Examine Residuals in a histrogram and scatter plot

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Create Cook’s D Plot

Identifying Multcollinearity

Review Adjusted R^2

Drop Non-Significant Predictor Variables

Calculate RMSE and Apply Model to Test Data

Manual Calculation of Training Data & Back Transformation

Commentary

Our Approach

Meaningful Discoveries & Surprises

Assumptions

Data assumptions

Impact assumptions

Missing Variables/Variables Wish We Had

How does the model address the basic question?

Other Factors

Is the Model Suitable?

Appendix

Test Model Fit1

Examine Residuals in a histrogram and scatter plot

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Create Cook’s D Plot

Test Model Fit3

Examine Residuals in a histrogram and scatter plot

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Create Cook’s D Plot

Fit5: Remove interaction of Alcohol & Percentage Expenditure

Test Model Fit5

Examine Residuals in a histrogram and scatter plot

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Create Cook’s D Plot

Fit6: Remove interaction of Alcohol & Schooling

Test Model Fit6

Examine Residuals in a histrogram and scatter plot

Formal Test for Non-Constant Variance

Normal Quantile Plot

Formal Test for Normality of Residuals

Create Cook’s D Plot