Part 1 - Setting Goal and Finding Data

A Study on Life Expectancy among Countries from year 2000 to 2015

This study is done using dataset on statistical analysis to discover factors influencing life expectancy.
Dataset used is from Kaggle provided by user Kumar Rajarshi.
Dataset URL is https://www.kaggle.com/kumarajarshi/life-expectancy-who
The data was collected from WHO (for health factors) and United Nations (for economic factors) website.

Purpose of the Dataset

Domain of the dataset is healthcare.
It is a listing of life expectancy observations collected over 193 countries spanning across year 2000 to 2015 published by World Health Organization (WHO) and United Nations (UN).
Categories of factors that can affect life expectancy:
* Social factors: Status, GDP, Income Composition of Resource, Schooling
* Mortality: Adult mortality, Infant death and Number of under-five deaths
* Healthcare-related: Immunization coverage (Polio, Hepatitis B, Measles and Diphtheria), Alcohol consumption, HIV/AIDS recorded cases, BMI and prevalence of thinness among children and adoslescents.

Motivation

Objective
- To study on factors that are influencing life expectancy
Goal
- To conduct statistical analysis on life expectancy dataset by utilizing Regression and Classification model
Questions
1) Is there correlation between life expectancy and healthcare expenditures spent by each country?
2) Is this country classified as a developing or developed country based on Life Expectancy value?

Structure, Dimension and Summary of the Dataset

lifexpec <- read.csv(file.choose(), header = TRUE)
str(lifexpec)

## 'data.frame':    2938 obs. of  22 variables:
##  $ Country                        : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Year                           : int  2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
##  $ Status                         : chr  "Developing" "Developing" "Developing" "Developing" ...
##  $ Life.expectancy                : num  65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
##  $ Adult.Mortality                : int  263 271 268 272 275 279 281 287 295 295 ...
##  $ infant.deaths                  : int  62 64 66 69 71 74 77 80 82 84 ...
##  $ Alcohol                        : num  0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
##  $ percentage.expenditure         : num  71.3 73.5 73.2 78.2 7.1 ...
##  $ Hepatitis.B                    : int  65 62 64 67 68 66 63 64 63 64 ...
##  $ Measles                        : int  1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
##  $ BMI                            : num  19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
##  $ under.five.deaths              : int  83 86 89 93 97 102 106 110 113 116 ...
##  $ Polio                          : int  6 58 62 67 68 66 63 64 63 58 ...
##  $ Total.expenditure              : num  8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
##  $ Diphtheria                     : int  65 62 64 67 68 66 63 64 63 58 ...
##  $ HIV.AIDS                       : num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
##  $ GDP                            : num  584.3 612.7 631.7 670 63.5 ...
##  $ Population                     : num  33736494 327582 31731688 3696958 2978599 ...
##  $ thinness..1.19.years           : num  17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
##  $ thinness.5.9.years             : num  17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
##  $ Income.composition.of.resources: num  0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
##  $ Schooling                      : num  10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...

dim(lifexpec)

## [1] 2938   22

summary(lifexpec)

##    Country               Year         Status          Life.expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  Adult.Mortality infant.deaths       Alcohol        percentage.expenditure
##  Min.   :  1.0   Min.   :   0.0   Min.   : 0.0100   Min.   :    0.000     
##  1st Qu.: 74.0   1st Qu.:   0.0   1st Qu.: 0.8775   1st Qu.:    4.685     
##  Median :144.0   Median :   3.0   Median : 3.7550   Median :   64.913     
##  Mean   :164.8   Mean   :  30.3   Mean   : 4.6029   Mean   :  738.251     
##  3rd Qu.:228.0   3rd Qu.:  22.0   3rd Qu.: 7.7025   3rd Qu.:  441.534     
##  Max.   :723.0   Max.   :1800.0   Max.   :17.8700   Max.   :19479.912     
##  NA's   :10                       NA's   :194                             
##   Hepatitis.B       Measles              BMI        under.five.deaths
##  Min.   : 1.00   Min.   :     0.0   Min.   : 1.00   Min.   :   0.00  
##  1st Qu.:77.00   1st Qu.:     0.0   1st Qu.:19.30   1st Qu.:   0.00  
##  Median :92.00   Median :    17.0   Median :43.50   Median :   4.00  
##  Mean   :80.94   Mean   :  2419.6   Mean   :38.32   Mean   :  42.04  
##  3rd Qu.:97.00   3rd Qu.:   360.2   3rd Qu.:56.20   3rd Qu.:  28.00  
##  Max.   :99.00   Max.   :212183.0   Max.   :87.30   Max.   :2500.00  
##  NA's   :553                        NA's   :34                       
##      Polio       Total.expenditure   Diphtheria       HIV.AIDS     
##  Min.   : 3.00   Min.   : 0.370    Min.   : 2.00   Min.   : 0.100  
##  1st Qu.:78.00   1st Qu.: 4.260    1st Qu.:78.00   1st Qu.: 0.100  
##  Median :93.00   Median : 5.755    Median :93.00   Median : 0.100  
##  Mean   :82.55   Mean   : 5.938    Mean   :82.32   Mean   : 1.742  
##  3rd Qu.:97.00   3rd Qu.: 7.492    3rd Qu.:97.00   3rd Qu.: 0.800  
##  Max.   :99.00   Max.   :17.600    Max.   :99.00   Max.   :50.600  
##  NA's   :19      NA's   :226       NA's   :19                      
##       GDP              Population        thinness..1.19.years
##  Min.   :     1.68   Min.   :3.400e+01   Min.   : 0.10       
##  1st Qu.:   463.94   1st Qu.:1.958e+05   1st Qu.: 1.60       
##  Median :  1766.95   Median :1.387e+06   Median : 3.30       
##  Mean   :  7483.16   Mean   :1.275e+07   Mean   : 4.84       
##  3rd Qu.:  5910.81   3rd Qu.:7.420e+06   3rd Qu.: 7.20       
##  Max.   :119172.74   Max.   :1.294e+09   Max.   :27.70       
##  NA's   :448         NA's   :652         NA's   :34          
##  thinness.5.9.years Income.composition.of.resources   Schooling    
##  Min.   : 0.10      Min.   :0.0000                  Min.   : 0.00  
##  1st Qu.: 1.50      1st Qu.:0.4930                  1st Qu.:10.10  
##  Median : 3.30      Median :0.6770                  Median :12.30  
##  Mean   : 4.87      Mean   :0.6276                  Mean   :11.99  
##  3rd Qu.: 7.20      3rd Qu.:0.7790                  3rd Qu.:14.30  
##  Max.   :28.60      Max.   :0.9480                  Max.   :20.70  
##  NA's   :34         NA's   :167                     NA's   :163

Part 2 - Collection and Cleaning Data

## Load all required libraries for cleaning and analysis
library(dplyr)
library(stats)
library(e1071)
library(ggplot2)
library(forecast)
library(psych)
library(tidyr)
library(ExcelFunctionsR)
library(aod)
library(tidyverse)
library(modelr)
library(broom)

## Cleaning the data set
## Make sure all the countries have data for all years (2000-2015)
## Count number of country

unique(lifexpec$Country)

##  [1] "Afghanistan"                      "Albania"                         
##  [3] "Algeria"                          "Angola"                          
##  [5] "Antigua and Barbuda"              "Argentina"                       
##  [7] "Armenia"                          "Australia"                       
##  [9] "Austria"                          "Azerbaijan"                      
## [11] "Bahamas"                          "Bahrain"                         
## [13] "Bangladesh"                       "Barbados"                        
## [15] "Belarus"                          "Belgium"                         
## [17] "Belize"                           "Benin"                           
## [19] "Bhutan"                           "Bolivia (Plurinational State of)"
##  [ reached getOption("max.print") -- omitted 173 entries ]

unique(lifexpec$Year)

##  [1] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001
## [16] 2000

## Normally, with 193 countries and and 16 years, we should have 
193 * 16  ## 3088 rows, but instead we have

## [1] 3088

length(lifexpec$Country) ##2938 rows, that's

## [1] 2938

3088 - 2938 ##150 rows less

## [1] 150

countrycount <- data.frame(table(lifexpec$Country))  #countries without 16 counts will be excluded from the data set 

### cleaning out N/A 
is.na(lifexpec)

##         Country  Year Status Life.expectancy Adult.Mortality infant.deaths
##         Alcohol percentage.expenditure Hepatitis.B Measles   BMI
##         under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS   GDP
##         Population thinness..1.19.years thinness.5.9.years
##         Income.composition.of.resources Schooling
##  [ reached getOption("max.print") -- omitted 2938 rows ]

sum(is.na(lifexpec))

## [1] 2563

prelimlifexpect <- na.omit(lifexpec)

##filter final data set to be processed
flifexpect <- left_join(prelimlifexpect, countrycount, by = c("Country" = "Var1"))

sum(is.na(flifexpect))

## [1] 0

head(flifexpect)

##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
## 3 Afghanistan 2013 Developing            59.9             268            66
## 4 Afghanistan 2012 Developing            59.5             272            69
## 5 Afghanistan 2011 Developing            59.2             275            71
## 6 Afghanistan 2010 Developing            58.8             279            74
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01              71.279624          65    1154 19.1                83
## 2    0.01              73.523582          62     492 18.6                86
## 3    0.01              73.219243          64     430 18.1                89
## 4    0.01              78.184215          67    2787 17.6                93
## 5    0.01               7.097109          68    3013 17.2                97
## 6    0.01              79.679367          66    1989 16.7               102
##   Polio Total.expenditure Diphtheria HIV.AIDS       GDP Population
## 1     6              8.16         65      0.1 584.25921   33736494
## 2    58              8.18         62      0.1 612.69651     327582
## 3    62              8.13         64      0.1 631.74498   31731688
## 4    67              8.52         67      0.1 669.95900    3696958
## 5    68              7.87         68      0.1  63.53723    2978599
## 6    66              9.20         66      0.1 553.32894    2883167
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
## 3                 17.7               17.7                           0.470
## 4                 17.9               18.0                           0.463
## 5                 18.2               18.2                           0.454
## 6                 18.4               18.4                           0.448
##   Schooling Freq
## 1      10.1   16
## 2      10.0   16
## 3       9.9   16
## 4       9.8   16
## 5       9.5   16
## 6       9.2   16

Part 3 - Regression Analysis

## Correlation expenditure vs LifeExpectancy
plot(flifexpect$Total.expenditure, flifexpect$Life.expectancy)

corr.test(flifexpect$Total.expenditure, flifexpect$Life.expectancy)

## Call:corr.test(x = flifexpect$Total.expenditure, y = flifexpect$Life.expectancy)
## Correlation matrix 
## [1] 0.17
## Sample Size 
## [1] 1649
## Probability values  adjusted for multiple tests. 
## [1] 0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

corr.test(lifexpec$Total.expenditure, lifexpec$Life.expectancy)

## Call:corr.test(x = lifexpec$Total.expenditure, y = lifexpec$Life.expectancy)
## Correlation matrix 
## [1] 0.22
## Sample Size 
## [1] 2702
## Probability values  adjusted for multiple tests. 
## [1] 0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

## Comments: the correlation of 0.27 implies there is no strong correlation between healthcare expenditure and life expectancy
## in other words, life expectancy doesn't depend much on how much is spent on healthcare.

## Linear regression
y <-  flifexpect$Life.expectancy
x <- flifexpect$Total.expenditure

rmod <- lm(y ~ x)

summary(rmod)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.998  -4.254   2.147   5.632  22.710 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 65.32123    0.59257 110.234  < 2e-16 ***
## x            0.66842    0.09282   7.201 9.03e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.664 on 1647 degrees of freedom
## Multiple R-squared:  0.03053,    Adjusted R-squared:  0.02994 
## F-statistic: 51.86 on 1 and 1647 DF,  p-value: 9.029e-13

attributes(rmod)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"        
## 
## $class
## [1] "lm"

median(flifexpect$Life.expectancy, na.rm = TRUE)

## [1] 71.7

median(flifexpect$Total.expenditure, na.rm = TRUE)

## [1] 5.84

### From this plot, we can see that as the expenditure gets above the median, so does the life expectancy.
### There might be some skewness in the data that prohibit the establishment of a strong trend. 
### We will classify data based on country status: Developing and Developed

### Classification
## Determining the country status by looking at life expectancy

age_group <- group_by(flifexpect, flifexpect$Life.expectancy, na.rm = TRUE)
summarise(age_group,
          avg = mean(flifexpect$Status),
          median = median(flifexpect$Status),
          n = n())

## `summarise()` regrouping output by 'flifexpect$Life.expectancy' (override with `.groups` argument)

## # A tibble: 320 x 5
## # Groups:   flifexpect$Life.expectancy [320]
##    `flifexpect$Life.expectancy` na.rm   avg median         n
##                           <dbl> <lgl> <dbl> <chr>      <int>
##  1                         44   TRUE     NA Developing     1
##  2                         44.3 TRUE     NA Developing     1
##  3                         44.5 TRUE     NA Developing     2
##  4                         44.6 TRUE     NA Developing     2
##  5                         44.8 TRUE     NA Developing     2
##  6                         45.1 TRUE     NA Developing     1
##  7                         45.3 TRUE     NA Developing     3
##  8                         45.4 TRUE     NA Developing     1
##  9                         45.5 TRUE     NA Developing     1
## 10                         45.6 TRUE     NA Developing     1
## # ... with 310 more rows

write.csv(age_group, "age_group.csv")

## The average life expectancy (AVERAGEIF in Excel) gives the following results
## For Developing countries, the average life expectancy is 66.30 years
## For Developed countries, the average life expectancy is 79.65 years.

################## Developing countries #####################
dvping <- filter(age_group, Status == "Developing")

ydeving <- dvping$Life.expectancy
xdeving <- dvping$Total.expenditure
rmoddeving <- lm(ydeving ~ xdeving)
summary(rmoddeving)

## 
## Call:
## lm(formula = ydeving ~ xdeving)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.892  -4.766   1.784   6.174  22.921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  65.5400     0.6269 104.552  < 2e-16 ***
## xdeving       0.3720     0.1016   3.662 0.000259 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.317 on 1405 degrees of freedom
## Multiple R-squared:  0.009456,   Adjusted R-squared:  0.008751 
## F-statistic: 13.41 on 1 and 1405 DF,  p-value: 0.0002592

mean(dvping$Total.expenditure, na.rm = TRUE)

## [1] 5.772374

mean(dvping$Life.expectancy, na.rm = TRUE)

## [1] 67.68735

corr.test(dvping$Total.expenditure, dvping$Life.expectancy)

## Call:corr.test(x = dvping$Total.expenditure, y = dvping$Life.expectancy)
## Correlation matrix 
## [1] 0.1
## Sample Size 
## [1] 1407
## Probability values  adjusted for multiple tests. 
## [1] 0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

median(dvping$Total.expenditure, na.rm = TRUE)

## [1] 5.61

median(dvping$Life.expectancy, na.rm = TRUE)

## [1] 69.2

## Comments: At 4.92%, the p-value is still below 5% but it's way higher than the p-value for all countries data set.
### Hence still falling to confirm a dependence between total expenditure and life expectancy
### However, we observe that as the total expenditure cross its median value so does the life expectancy



################## Developed Countries #####################

dvped <- filter(age_group, Status == "Developed")

ydvped <- dvped$Life.expectancy
xdvped <- dvped$Total.expenditure
rmoddvped <- lm(ydvped ~ xdvped)
summary(rmoddvped)

## 
## Call:
## lm(formula = ydvped ~ xdvped)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.545 -3.118  0.394  2.297 11.881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.6585     0.7699  99.567  < 2e-16 ***
## xdvped        0.2895     0.1026   2.821  0.00519 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.213 on 240 degrees of freedom
## Multiple R-squared:  0.0321, Adjusted R-squared:  0.02807 
## F-statistic: 7.959 on 1 and 240 DF,  p-value: 0.005185

mean(dvped$Total.expenditure, na.rm = TRUE)

## [1] 7.023099

mean(dvped$Life.expectancy, na.rm = TRUE)

## [1] 78.69174

corr.test(dvped$Total.expenditure, dvped$Life.expectancy)

## Call:corr.test(x = dvped$Total.expenditure, y = dvped$Life.expectancy)
## Correlation matrix 
## [1] 0.18
## Sample Size 
## [1] 242
## Probability values  adjusted for multiple tests. 
## [1] 0.01
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

median(dvped$Total.expenditure, na.rm = TRUE)

## [1] 7.405

median(dvped$Life.expectancy, na.rm = TRUE)

## [1] 78.95

## Comments: Developed countries spend more than developing countries; developed countries also have a higher life expectancy.

Part 4 - Classification Analysis

#Brief summary of the cleaned data frame
head(flifexpect)

##       Country Year     Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing            65.0             263            62
## 2 Afghanistan 2014 Developing            59.9             271            64
##   Alcohol percentage.expenditure Hepatitis.B Measles  BMI under.five.deaths
## 1    0.01               71.27962          65    1154 19.1                83
## 2    0.01               73.52358          62     492 18.6                86
##   Polio Total.expenditure Diphtheria HIV.AIDS      GDP Population
## 1     6              8.16         65      0.1 584.2592   33736494
## 2    58              8.18         62      0.1 612.6965     327582
##   thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1                 17.2               17.3                           0.479
## 2                 17.5               17.5                           0.476
##   Schooling Freq
## 1      10.1   16
## 2      10.0   16
##  [ reached 'max' / getOption("max.print") -- omitted 4 rows ]

#Binary response (outcome, dependent) variable is Status
#Status - Developed or Developing status
#Predictor variable is Life Expectancy, GDP, Income Composition, Total Expenditure
#Life Expectancy - Life expectancy in Age
#Income Composition - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
#Total Expenditure - General government expenditure on health as a percentage of total government expenditure

#Prepare the data frame for the model
#Say for the year 2014
logdf <- flifexpect[flifexpect$Year == 2014, ]
row.names(logdf) <- logdf$Country
colnames(logdf)

##  [1] "Country"                         "Year"                           
##  [3] "Status"                          "Life.expectancy"                
##  [5] "Adult.Mortality"                 "infant.deaths"                  
##  [7] "Alcohol"                         "percentage.expenditure"         
##  [9] "Hepatitis.B"                     "Measles"                        
## [11] "BMI"                             "under.five.deaths"              
## [13] "Polio"                           "Total.expenditure"              
## [15] "Diphtheria"                      "HIV.AIDS"                       
## [17] "GDP"                             "Population"                     
## [19] "thinness..1.19.years"            "thinness.5.9.years"             
## [21] "Income.composition.of.resources" "Schooling"                      
## [23] "Freq"

#Subset the data frame for the classification model
logdf <- subset(logdf,select = c(Status, Life.expectancy,Income.composition.of.resources,Total.expenditure))
glimpse(logdf)

## Rows: 131
## Columns: 4
## $ Status                          <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy                 <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure               <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...

#Convert the status to binary
lookup <- c("Developing"=0, "Developed"=1)
logdf$binStatus <- lookup[logdf$Status]
glimpse(logdf)

## Rows: 131
## Columns: 5
## $ Status                          <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy                 <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure               <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...
## $ binStatus                       <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,...

#Drop the string Status column from data frame
logdf <- subset(logdf, select = -c(Status))

#Rename the binary status column to Status
names(logdf)[names(logdf) == "binStatus"] <- "Status"
glimpse(logdf)

## Rows: 131
## Columns: 4
## $ Life.expectancy                 <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure               <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...
## $ Status                          <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,...

#Rank each country by income composition of resources
#Wikipedia on HDI
#A value above 0.800 is classified as very high, 
#between 0.700 and 0.799 high, 
#0.550 to 0.699 as medium 
#and anything below 0.550 as low

#Group each Income Composition into Rank Range
logdft <- logdf %>% mutate(HDI.Rank = cut(logdf$Income.composition.of.resources, c(0, 0.550, 0.700, 0.800, Inf)))

#Convert the HDI Rank to rank 1-4
# 1 - 0.800 and above - Very High
# 2 - 0.700 to 0.799 - High
# 3 - 0.550 to 0.699 - Medium
# 4 - 0.550 and below - Low
lookup2 <- c("(0,0.55]"=4, "(0.55,0.7]"=3, "(0.7,0.8]"=2, "(0.8,Inf]"=1)
logdft$ranked.HDI <- lookup2[logdft$HDI.Rank]

#Drop Income Composition and the HDI Rank column from data frame
#As it represented by ranked.HDI
logdft <- subset(logdft, select = -c(Income.composition.of.resources,HDI.Rank))

glimpse(logdft)

## Rows: 131
## Columns: 4
## $ Life.expectancy   <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6, 82.7, 81.4, 7...
## $ Total.expenditure <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48, 9.42, 11.21, ...
## $ Status            <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ ranked.HDI        <dbl> 4, 2, 2, 4, 1, 2, 1, 1, 2, 3, 2, 1, 2, 4, 3, 2, 3...

#Show summary of the data frame to ensure its good for logit model
summary(logdft)

##  Life.expectancy Total.expenditure     Status        ranked.HDI   
##  Min.   :48.10   Min.   : 1.210    Min.   :0.000   Min.   :1.000  
##  1st Qu.:64.65   1st Qu.: 4.485    1st Qu.:0.000   1st Qu.:2.000  
##  Median :72.00   Median : 5.820    Median :0.000   Median :3.000  
##  Mean   :70.52   Mean   : 6.107    Mean   :0.145   Mean   :2.573  
##  3rd Qu.:75.80   3rd Qu.: 7.630    3rd Qu.:0.000   3rd Qu.:4.000  
##  Max.   :89.00   Max.   :13.730    Max.   :1.000   Max.   :4.000

#To predict the outcome of Status based on Life Expectancy value

# Logistics Regression
# Split the data into two chunks; training and testing set
# Regression is done on training set
train <- logdft[1:65,]
test <- logdft[66:131,]

# Simple Logistic Regression
model1 <- glm(Status ~ Life.expectancy, family = "binomial", data = train)

# Summary of the model
summary(model1)

## 
## Call:
## glm(formula = Status ~ Life.expectancy, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8740  -0.3913  -0.1518  -0.0373   2.3207  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -25.34204    7.80865  -3.245  0.00117 **
## Life.expectancy   0.30578    0.09787   3.124  0.00178 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52.281  on 64  degrees of freedom
## Residual deviance: 30.572  on 63  degrees of freedom
## AIC: 34.572
## 
## Number of Fisher Scoring iterations: 7

# Accessing coefficients
# Shows the coefficient estimates and related information 
# that result from fitting a logistic regression model 
# in order to predict the probability of Developed Status = Yes using Life Expectancy
tidy(model1)

## # A tibble: 2 x 5
##   term            estimate std.error statistic p.value
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)      -25.3      7.81       -3.25 0.00117
## 2 Life.expectancy    0.306    0.0979      3.12 0.00178

# Interpretation of the balance coefficient 
# for every increasing of life expectancy value by 0.1
# the odds of the Status of a country increases by a factor of 1.3577
exp(coef(model1))

##     (Intercept) Life.expectancy 
##    9.864923e-12    1.357680e+00

# Making prediction
# From the coefficients calculated, to compute the probability of
# Country's Status for any given Life Expectancy value
# For example, given Life Expectancy are 68.9 and 73.7 
predict(model1, data.frame(Life.expectancy = c(68.9, 73.7)), type = "response")

##          1          2 
## 0.01373511 0.05698803

# From the result, 
# for Country with Life Expectancy of 68.9, 
# it is ~14% probability that it is a developed country
# However for Country with Life Expectancy of 73.7,
# it is more than 57% likely is a developed country,
# which is a high jump from mere 68.9 to 73.7

WQD7004 Group Project - Group 19

Nurarlisa Sulong (17220304) & Syahirah Hanisah (S2000053)

3/1/2021

Part 1 - Setting Goal and Finding Data

A Study on Life Expectancy among Countries from year 2000 to 2015

Purpose of the Dataset

Motivation

Structure, Dimension and Summary of the Dataset

Part 2 - Collection and Cleaning Data

Part 3 - Regression Analysis

Part 4 - Classification Analysis