This study is done using dataset on statistical analysis to discover factors influencing life expectancy.
Dataset used is from Kaggle provided by user Kumar Rajarshi.
Dataset URL is https://www.kaggle.com/kumarajarshi/life-expectancy-who
The data was collected from WHO (for health factors) and United Nations (for economic factors) website.
Domain of the dataset is healthcare.
It is a listing of life expectancy observations collected over 193 countries spanning across year 2000 to 2015 published by World Health Organization (WHO) and United Nations (UN).
Categories of factors that can affect life expectancy:
* Social factors: Status, GDP, Income Composition of Resource, Schooling
* Mortality: Adult mortality, Infant death and Number of under-five deaths
* Healthcare-related: Immunization coverage (Polio, Hepatitis B, Measles and Diphtheria), Alcohol consumption, HIV/AIDS recorded cases, BMI and prevalence of thinness among children and adoslescents.
Objective
- To study on factors that are influencing life expectancy
Goal
- To conduct statistical analysis on life expectancy dataset by utilizing Regression and Classification model
Questions
1) Is there correlation between life expectancy and healthcare expenditures spent by each country?
2) Is this country classified as a developing or developed country based on Life Expectancy value?
lifexpec <- read.csv(file.choose(), header = TRUE)
str(lifexpec)
## 'data.frame': 2938 obs. of 22 variables:
## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Year : int 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 ...
## $ Status : chr "Developing" "Developing" "Developing" "Developing" ...
## $ Life.expectancy : num 65 59.9 59.9 59.5 59.2 58.8 58.6 58.1 57.5 57.3 ...
## $ Adult.Mortality : int 263 271 268 272 275 279 281 287 295 295 ...
## $ infant.deaths : int 62 64 66 69 71 74 77 80 82 84 ...
## $ Alcohol : num 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.03 0.02 0.03 ...
## $ percentage.expenditure : num 71.3 73.5 73.2 78.2 7.1 ...
## $ Hepatitis.B : int 65 62 64 67 68 66 63 64 63 64 ...
## $ Measles : int 1154 492 430 2787 3013 1989 2861 1599 1141 1990 ...
## $ BMI : num 19.1 18.6 18.1 17.6 17.2 16.7 16.2 15.7 15.2 14.7 ...
## $ under.five.deaths : int 83 86 89 93 97 102 106 110 113 116 ...
## $ Polio : int 6 58 62 67 68 66 63 64 63 58 ...
## $ Total.expenditure : num 8.16 8.18 8.13 8.52 7.87 9.2 9.42 8.33 6.73 7.43 ...
## $ Diphtheria : int 65 62 64 67 68 66 63 64 63 58 ...
## $ HIV.AIDS : num 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## $ GDP : num 584.3 612.7 631.7 670 63.5 ...
## $ Population : num 33736494 327582 31731688 3696958 2978599 ...
## $ thinness..1.19.years : num 17.2 17.5 17.7 17.9 18.2 18.4 18.6 18.8 19 19.2 ...
## $ thinness.5.9.years : num 17.3 17.5 17.7 18 18.2 18.4 18.7 18.9 19.1 19.3 ...
## $ Income.composition.of.resources: num 0.479 0.476 0.47 0.463 0.454 0.448 0.434 0.433 0.415 0.405 ...
## $ Schooling : num 10.1 10 9.9 9.8 9.5 9.2 8.9 8.7 8.4 8.1 ...
dim(lifexpec)
## [1] 2938 22
summary(lifexpec)
## Country Year Status Life.expectancy
## Length:2938 Min. :2000 Length:2938 Min. :36.30
## Class :character 1st Qu.:2004 Class :character 1st Qu.:63.10
## Mode :character Median :2008 Mode :character Median :72.10
## Mean :2008 Mean :69.22
## 3rd Qu.:2012 3rd Qu.:75.70
## Max. :2015 Max. :89.00
## NA's :10
## Adult.Mortality infant.deaths Alcohol percentage.expenditure
## Min. : 1.0 Min. : 0.0 Min. : 0.0100 Min. : 0.000
## 1st Qu.: 74.0 1st Qu.: 0.0 1st Qu.: 0.8775 1st Qu.: 4.685
## Median :144.0 Median : 3.0 Median : 3.7550 Median : 64.913
## Mean :164.8 Mean : 30.3 Mean : 4.6029 Mean : 738.251
## 3rd Qu.:228.0 3rd Qu.: 22.0 3rd Qu.: 7.7025 3rd Qu.: 441.534
## Max. :723.0 Max. :1800.0 Max. :17.8700 Max. :19479.912
## NA's :10 NA's :194
## Hepatitis.B Measles BMI under.five.deaths
## Min. : 1.00 Min. : 0.0 Min. : 1.00 Min. : 0.00
## 1st Qu.:77.00 1st Qu.: 0.0 1st Qu.:19.30 1st Qu.: 0.00
## Median :92.00 Median : 17.0 Median :43.50 Median : 4.00
## Mean :80.94 Mean : 2419.6 Mean :38.32 Mean : 42.04
## 3rd Qu.:97.00 3rd Qu.: 360.2 3rd Qu.:56.20 3rd Qu.: 28.00
## Max. :99.00 Max. :212183.0 Max. :87.30 Max. :2500.00
## NA's :553 NA's :34
## Polio Total.expenditure Diphtheria HIV.AIDS
## Min. : 3.00 Min. : 0.370 Min. : 2.00 Min. : 0.100
## 1st Qu.:78.00 1st Qu.: 4.260 1st Qu.:78.00 1st Qu.: 0.100
## Median :93.00 Median : 5.755 Median :93.00 Median : 0.100
## Mean :82.55 Mean : 5.938 Mean :82.32 Mean : 1.742
## 3rd Qu.:97.00 3rd Qu.: 7.492 3rd Qu.:97.00 3rd Qu.: 0.800
## Max. :99.00 Max. :17.600 Max. :99.00 Max. :50.600
## NA's :19 NA's :226 NA's :19
## GDP Population thinness..1.19.years
## Min. : 1.68 Min. :3.400e+01 Min. : 0.10
## 1st Qu.: 463.94 1st Qu.:1.958e+05 1st Qu.: 1.60
## Median : 1766.95 Median :1.387e+06 Median : 3.30
## Mean : 7483.16 Mean :1.275e+07 Mean : 4.84
## 3rd Qu.: 5910.81 3rd Qu.:7.420e+06 3rd Qu.: 7.20
## Max. :119172.74 Max. :1.294e+09 Max. :27.70
## NA's :448 NA's :652 NA's :34
## thinness.5.9.years Income.composition.of.resources Schooling
## Min. : 0.10 Min. :0.0000 Min. : 0.00
## 1st Qu.: 1.50 1st Qu.:0.4930 1st Qu.:10.10
## Median : 3.30 Median :0.6770 Median :12.30
## Mean : 4.87 Mean :0.6276 Mean :11.99
## 3rd Qu.: 7.20 3rd Qu.:0.7790 3rd Qu.:14.30
## Max. :28.60 Max. :0.9480 Max. :20.70
## NA's :34 NA's :167 NA's :163
## Load all required libraries for cleaning and analysis
library(dplyr)
library(stats)
library(e1071)
library(ggplot2)
library(forecast)
library(psych)
library(tidyr)
library(ExcelFunctionsR)
library(aod)
library(tidyverse)
library(modelr)
library(broom)
## Cleaning the data set
## Make sure all the countries have data for all years (2000-2015)
## Count number of country
unique(lifexpec$Country)
## [1] "Afghanistan" "Albania"
## [3] "Algeria" "Angola"
## [5] "Antigua and Barbuda" "Argentina"
## [7] "Armenia" "Australia"
## [9] "Austria" "Azerbaijan"
## [11] "Bahamas" "Bahrain"
## [13] "Bangladesh" "Barbados"
## [15] "Belarus" "Belgium"
## [17] "Belize" "Benin"
## [19] "Bhutan" "Bolivia (Plurinational State of)"
## [ reached getOption("max.print") -- omitted 173 entries ]
unique(lifexpec$Year)
## [1] 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001
## [16] 2000
## Normally, with 193 countries and and 16 years, we should have
193 * 16 ## 3088 rows, but instead we have
## [1] 3088
length(lifexpec$Country) ##2938 rows, that's
## [1] 2938
3088 - 2938 ##150 rows less
## [1] 150
countrycount <- data.frame(table(lifexpec$Country)) #countries without 16 counts will be excluded from the data set
### cleaning out N/A
is.na(lifexpec)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## Alcohol percentage.expenditure Hepatitis.B Measles BMI
## under.five.deaths Polio Total.expenditure Diphtheria HIV.AIDS GDP
## Population thinness..1.19.years thinness.5.9.years
## Income.composition.of.resources Schooling
## [ reached getOption("max.print") -- omitted 2938 rows ]
sum(is.na(lifexpec))
## [1] 2563
prelimlifexpect <- na.omit(lifexpec)
##filter final data set to be processed
flifexpect <- left_join(prelimlifexpect, countrycount, by = c("Country" = "Var1"))
sum(is.na(flifexpect))
## [1] 0
head(flifexpect)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## 3 Afghanistan 2013 Developing 59.9 268 66
## 4 Afghanistan 2012 Developing 59.5 272 69
## 5 Afghanistan 2011 Developing 59.2 275 71
## 6 Afghanistan 2010 Developing 58.8 279 74
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.279624 65 1154 19.1 83
## 2 0.01 73.523582 62 492 18.6 86
## 3 0.01 73.219243 64 430 18.1 89
## 4 0.01 78.184215 67 2787 17.6 93
## 5 0.01 7.097109 68 3013 17.2 97
## 6 0.01 79.679367 66 1989 16.7 102
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.25921 33736494
## 2 58 8.18 62 0.1 612.69651 327582
## 3 62 8.13 64 0.1 631.74498 31731688
## 4 67 8.52 67 0.1 669.95900 3696958
## 5 68 7.87 68 0.1 63.53723 2978599
## 6 66 9.20 66 0.1 553.32894 2883167
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## 3 17.7 17.7 0.470
## 4 17.9 18.0 0.463
## 5 18.2 18.2 0.454
## 6 18.4 18.4 0.448
## Schooling Freq
## 1 10.1 16
## 2 10.0 16
## 3 9.9 16
## 4 9.8 16
## 5 9.5 16
## 6 9.2 16
## Correlation expenditure vs LifeExpectancy
plot(flifexpect$Total.expenditure, flifexpect$Life.expectancy)
corr.test(flifexpect$Total.expenditure, flifexpect$Life.expectancy)
## Call:corr.test(x = flifexpect$Total.expenditure, y = flifexpect$Life.expectancy)
## Correlation matrix
## [1] 0.17
## Sample Size
## [1] 1649
## Probability values adjusted for multiple tests.
## [1] 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
corr.test(lifexpec$Total.expenditure, lifexpec$Life.expectancy)
## Call:corr.test(x = lifexpec$Total.expenditure, y = lifexpec$Life.expectancy)
## Correlation matrix
## [1] 0.22
## Sample Size
## [1] 2702
## Probability values adjusted for multiple tests.
## [1] 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
## Comments: the correlation of 0.27 implies there is no strong correlation between healthcare expenditure and life expectancy
## in other words, life expectancy doesn't depend much on how much is spent on healthcare.
## Linear regression
y <- flifexpect$Life.expectancy
x <- flifexpect$Total.expenditure
rmod <- lm(y ~ x)
summary(rmod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.998 -4.254 2.147 5.632 22.710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.32123 0.59257 110.234 < 2e-16 ***
## x 0.66842 0.09282 7.201 9.03e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.664 on 1647 degrees of freedom
## Multiple R-squared: 0.03053, Adjusted R-squared: 0.02994
## F-statistic: 51.86 on 1 and 1647 DF, p-value: 9.029e-13
attributes(rmod)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
median(flifexpect$Life.expectancy, na.rm = TRUE)
## [1] 71.7
median(flifexpect$Total.expenditure, na.rm = TRUE)
## [1] 5.84
### From this plot, we can see that as the expenditure gets above the median, so does the life expectancy.
### There might be some skewness in the data that prohibit the establishment of a strong trend.
### We will classify data based on country status: Developing and Developed
### Classification
## Determining the country status by looking at life expectancy
age_group <- group_by(flifexpect, flifexpect$Life.expectancy, na.rm = TRUE)
summarise(age_group,
avg = mean(flifexpect$Status),
median = median(flifexpect$Status),
n = n())
## `summarise()` regrouping output by 'flifexpect$Life.expectancy' (override with `.groups` argument)
## # A tibble: 320 x 5
## # Groups: flifexpect$Life.expectancy [320]
## `flifexpect$Life.expectancy` na.rm avg median n
## <dbl> <lgl> <dbl> <chr> <int>
## 1 44 TRUE NA Developing 1
## 2 44.3 TRUE NA Developing 1
## 3 44.5 TRUE NA Developing 2
## 4 44.6 TRUE NA Developing 2
## 5 44.8 TRUE NA Developing 2
## 6 45.1 TRUE NA Developing 1
## 7 45.3 TRUE NA Developing 3
## 8 45.4 TRUE NA Developing 1
## 9 45.5 TRUE NA Developing 1
## 10 45.6 TRUE NA Developing 1
## # ... with 310 more rows
write.csv(age_group, "age_group.csv")
## The average life expectancy (AVERAGEIF in Excel) gives the following results
## For Developing countries, the average life expectancy is 66.30 years
## For Developed countries, the average life expectancy is 79.65 years.
################## Developing countries #####################
dvping <- filter(age_group, Status == "Developing")
ydeving <- dvping$Life.expectancy
xdeving <- dvping$Total.expenditure
rmoddeving <- lm(ydeving ~ xdeving)
summary(rmoddeving)
##
## Call:
## lm(formula = ydeving ~ xdeving)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.892 -4.766 1.784 6.174 22.921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.5400 0.6269 104.552 < 2e-16 ***
## xdeving 0.3720 0.1016 3.662 0.000259 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.317 on 1405 degrees of freedom
## Multiple R-squared: 0.009456, Adjusted R-squared: 0.008751
## F-statistic: 13.41 on 1 and 1405 DF, p-value: 0.0002592
mean(dvping$Total.expenditure, na.rm = TRUE)
## [1] 5.772374
mean(dvping$Life.expectancy, na.rm = TRUE)
## [1] 67.68735
corr.test(dvping$Total.expenditure, dvping$Life.expectancy)
## Call:corr.test(x = dvping$Total.expenditure, y = dvping$Life.expectancy)
## Correlation matrix
## [1] 0.1
## Sample Size
## [1] 1407
## Probability values adjusted for multiple tests.
## [1] 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
median(dvping$Total.expenditure, na.rm = TRUE)
## [1] 5.61
median(dvping$Life.expectancy, na.rm = TRUE)
## [1] 69.2
## Comments: At 4.92%, the p-value is still below 5% but it's way higher than the p-value for all countries data set.
### Hence still falling to confirm a dependence between total expenditure and life expectancy
### However, we observe that as the total expenditure cross its median value so does the life expectancy
################## Developed Countries #####################
dvped <- filter(age_group, Status == "Developed")
ydvped <- dvped$Life.expectancy
xdvped <- dvped$Total.expenditure
rmoddvped <- lm(ydvped ~ xdvped)
summary(rmoddvped)
##
## Call:
## lm(formula = ydvped ~ xdvped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.545 -3.118 0.394 2.297 11.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.6585 0.7699 99.567 < 2e-16 ***
## xdvped 0.2895 0.1026 2.821 0.00519 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.213 on 240 degrees of freedom
## Multiple R-squared: 0.0321, Adjusted R-squared: 0.02807
## F-statistic: 7.959 on 1 and 240 DF, p-value: 0.005185
mean(dvped$Total.expenditure, na.rm = TRUE)
## [1] 7.023099
mean(dvped$Life.expectancy, na.rm = TRUE)
## [1] 78.69174
corr.test(dvped$Total.expenditure, dvped$Life.expectancy)
## Call:corr.test(x = dvped$Total.expenditure, y = dvped$Life.expectancy)
## Correlation matrix
## [1] 0.18
## Sample Size
## [1] 242
## Probability values adjusted for multiple tests.
## [1] 0.01
##
## To see confidence intervals of the correlations, print with the short=FALSE option
median(dvped$Total.expenditure, na.rm = TRUE)
## [1] 7.405
median(dvped$Life.expectancy, na.rm = TRUE)
## [1] 78.95
## Comments: Developed countries spend more than developing countries; developed countries also have a higher life expectancy.
#Brief summary of the cleaned data frame
head(flifexpect)
## Country Year Status Life.expectancy Adult.Mortality infant.deaths
## 1 Afghanistan 2015 Developing 65.0 263 62
## 2 Afghanistan 2014 Developing 59.9 271 64
## Alcohol percentage.expenditure Hepatitis.B Measles BMI under.five.deaths
## 1 0.01 71.27962 65 1154 19.1 83
## 2 0.01 73.52358 62 492 18.6 86
## Polio Total.expenditure Diphtheria HIV.AIDS GDP Population
## 1 6 8.16 65 0.1 584.2592 33736494
## 2 58 8.18 62 0.1 612.6965 327582
## thinness..1.19.years thinness.5.9.years Income.composition.of.resources
## 1 17.2 17.3 0.479
## 2 17.5 17.5 0.476
## Schooling Freq
## 1 10.1 16
## 2 10.0 16
## [ reached 'max' / getOption("max.print") -- omitted 4 rows ]
#Binary response (outcome, dependent) variable is Status
#Status - Developed or Developing status
#Predictor variable is Life Expectancy, GDP, Income Composition, Total Expenditure
#Life Expectancy - Life expectancy in Age
#Income Composition - Human Development Index in terms of income composition of resources (index ranging from 0 to 1)
#Total Expenditure - General government expenditure on health as a percentage of total government expenditure
#Prepare the data frame for the model
#Say for the year 2014
logdf <- flifexpect[flifexpect$Year == 2014, ]
row.names(logdf) <- logdf$Country
colnames(logdf)
## [1] "Country" "Year"
## [3] "Status" "Life.expectancy"
## [5] "Adult.Mortality" "infant.deaths"
## [7] "Alcohol" "percentage.expenditure"
## [9] "Hepatitis.B" "Measles"
## [11] "BMI" "under.five.deaths"
## [13] "Polio" "Total.expenditure"
## [15] "Diphtheria" "HIV.AIDS"
## [17] "GDP" "Population"
## [19] "thinness..1.19.years" "thinness.5.9.years"
## [21] "Income.composition.of.resources" "Schooling"
## [23] "Freq"
#Subset the data frame for the classification model
logdf <- subset(logdf,select = c(Status, Life.expectancy,Income.composition.of.resources,Total.expenditure))
glimpse(logdf)
## Rows: 131
## Columns: 4
## $ Status <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...
#Convert the status to binary
lookup <- c("Developing"=0, "Developed"=1)
logdf$binStatus <- lookup[logdf$Status]
glimpse(logdf)
## Rows: 131
## Columns: 5
## $ Status <chr> "Developing", "Developing", "Develo...
## $ Life.expectancy <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...
## $ binStatus <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,...
#Drop the string Status column from data frame
logdf <- subset(logdf, select = -c(Status))
#Rename the binary status column to Status
names(logdf)[names(logdf) == "binStatus"] <- "Status"
glimpse(logdf)
## Rows: 131
## Columns: 4
## $ Life.expectancy <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6,...
## $ Income.composition.of.resources <dbl> 0.476, 0.761, 0.741, 0.527, 0.825, ...
## $ Total.expenditure <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48,...
## $ Status <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,...
#Rank each country by income composition of resources
#Wikipedia on HDI
#A value above 0.800 is classified as very high,
#between 0.700 and 0.799 high,
#0.550 to 0.699 as medium
#and anything below 0.550 as low
#Group each Income Composition into Rank Range
logdft <- logdf %>% mutate(HDI.Rank = cut(logdf$Income.composition.of.resources, c(0, 0.550, 0.700, 0.800, Inf)))
#Convert the HDI Rank to rank 1-4
# 1 - 0.800 and above - Very High
# 2 - 0.700 to 0.799 - High
# 3 - 0.550 to 0.699 - Medium
# 4 - 0.550 and below - Low
lookup2 <- c("(0,0.55]"=4, "(0.55,0.7]"=3, "(0.7,0.8]"=2, "(0.8,Inf]"=1)
logdft$ranked.HDI <- lookup2[logdft$HDI.Rank]
#Drop Income Composition and the HDI Rank column from data frame
#As it represented by ranked.HDI
logdft <- subset(logdft, select = -c(Income.composition.of.resources,HDI.Rank))
glimpse(logdft)
## Rows: 131
## Columns: 4
## $ Life.expectancy <dbl> 59.9, 77.5, 75.4, 51.7, 76.2, 74.6, 82.7, 81.4, 7...
## $ Total.expenditure <dbl> 8.18, 5.88, 7.21, 3.31, 4.79, 4.48, 9.42, 11.21, ...
## $ Status <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ ranked.HDI <dbl> 4, 2, 2, 4, 1, 2, 1, 1, 2, 3, 2, 1, 2, 4, 3, 2, 3...
#Show summary of the data frame to ensure its good for logit model
summary(logdft)
## Life.expectancy Total.expenditure Status ranked.HDI
## Min. :48.10 Min. : 1.210 Min. :0.000 Min. :1.000
## 1st Qu.:64.65 1st Qu.: 4.485 1st Qu.:0.000 1st Qu.:2.000
## Median :72.00 Median : 5.820 Median :0.000 Median :3.000
## Mean :70.52 Mean : 6.107 Mean :0.145 Mean :2.573
## 3rd Qu.:75.80 3rd Qu.: 7.630 3rd Qu.:0.000 3rd Qu.:4.000
## Max. :89.00 Max. :13.730 Max. :1.000 Max. :4.000
#To predict the outcome of Status based on Life Expectancy value
# Logistics Regression
# Split the data into two chunks; training and testing set
# Regression is done on training set
train <- logdft[1:65,]
test <- logdft[66:131,]
# Simple Logistic Regression
model1 <- glm(Status ~ Life.expectancy, family = "binomial", data = train)
# Summary of the model
summary(model1)
##
## Call:
## glm(formula = Status ~ Life.expectancy, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8740 -0.3913 -0.1518 -0.0373 2.3207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.34204 7.80865 -3.245 0.00117 **
## Life.expectancy 0.30578 0.09787 3.124 0.00178 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 52.281 on 64 degrees of freedom
## Residual deviance: 30.572 on 63 degrees of freedom
## AIC: 34.572
##
## Number of Fisher Scoring iterations: 7
# Accessing coefficients
# Shows the coefficient estimates and related information
# that result from fitting a logistic regression model
# in order to predict the probability of Developed Status = Yes using Life Expectancy
tidy(model1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -25.3 7.81 -3.25 0.00117
## 2 Life.expectancy 0.306 0.0979 3.12 0.00178
# Interpretation of the balance coefficient
# for every increasing of life expectancy value by 0.1
# the odds of the Status of a country increases by a factor of 1.3577
exp(coef(model1))
## (Intercept) Life.expectancy
## 9.864923e-12 1.357680e+00
# Making prediction
# From the coefficients calculated, to compute the probability of
# Country's Status for any given Life Expectancy value
# For example, given Life Expectancy are 68.9 and 73.7
predict(model1, data.frame(Life.expectancy = c(68.9, 73.7)), type = "response")
## 1 2
## 0.01373511 0.05698803
# From the result,
# for Country with Life Expectancy of 68.9,
# it is ~14% probability that it is a developed country
# However for Country with Life Expectancy of 73.7,
# it is more than 57% likely is a developed country,
# which is a high jump from mere 68.9 to 73.7