Assignment 3

Simple linear regression analysis

Xiyue Shu s3705474, Shan Jiang s3592369, Anna Krinochkina s3712761

Last updated: 19 October, 2018

Introduction

The rationale of the investigation:

High infant mortality rates generally indicate that basic human needs in medical care, nutrition and sanitation are unmet. Many studies suggest that higher income at the country level is closely associated with better health for that country's population. Accordingly, the infant mortality rate (IMR) and GDP per capita are expected to be negatively related.

Problem Statement

This investigation examines the relationship between two quantitative variables, IMR and GDP per capita, with the aim of making accurate predictions. Specifically, it asks whether a country's GDP per capita can be used to predict the infant mortality rate in that country.

Data

library(readr)  # read_csv()
library(dplyr)  # select() and %>%
countries <- read_csv("countries.csv")
countries <- countries %>% select(`Infant mortality (per 1000 births)`, `GDP ($ per capita)`)

Descriptive statistics

Summary statistics for the variables Infant mortality (per 1000 births) and GDP ($ per capita) are as follows:

summary(countries$`Infant mortality (per 1000 births)`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.29    8.15   21.00   35.51   55.70  191.19       3
sd(countries$`Infant mortality (per 1000 births)`, na.rm = T)
## [1] 35.3899
summary(countries$`GDP ($ per capita)`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     500    1900    5550    9690   15700   55100       1
sd(countries$`GDP ($ per capita)`, na.rm = T)
## [1] 10049.14

Data preprocessing

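library(outliers)  # scores()
# Linear (column-major) positions of the missing values
which(is.na(countries))
## [1]  48 222 224 451
countries <- na.omit(countries)
# Flag observations more than 3 standard deviations from the mean
z.score <- countries$`Infant mortality (per 1000 births)` %>% scores(type = "z")
which(abs(z.score) > 3)
## [1]   1   6 183
z.scores <- countries$`GDP ($ per capita)` %>% scores(type = "z")
which(abs(z.scores) > 3)
## [1] 121
# Remove the rows flagged for either variable
countries <- countries[-c(1, 6, 121, 183), ]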
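
The z-score rule used above can be sketched in base R. This is a minimal illustration on hypothetical toy data (the vector `x` is made up and is not taken from countries.csv); it standardises the values by hand instead of calling `scores()`.

```r
# Hypothetical toy data: 19 moderate values plus one extreme value.
x <- c(10, 12, 11, 13, 9, 10, 12, 11, 10, 13,
       12, 11, 9, 10, 12, 11, 13, 10, 12, 200)

# z-score: distance from the mean in units of the standard deviation.
z <- (x - mean(x)) / sd(x)

# Positions of values more than 3 standard deviations from the mean.
outlier_idx <- which(abs(z) > 3)
outlier_idx

# Keep the non-outliers with a logical filter. Unlike x[-outlier_idx],
# this still returns every value when no outliers are found.
x_clean <- x[abs(z) <= 3]
length(x_clean)
```

One caveat with this rule: in a sample of size n, no z-score can exceed (n - 1) / sqrt(n), so the 3-standard-deviation threshold can never flag anything in samples of 10 or fewer observations.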

Data visualisation - part 1

par(mfrow = c(1,2))
hist(countries$`Infant mortality (per 1000 births)`, main = 'Infant Mortality', xlab = 'Infant Mortality', col = "lightblue")
log(countries$`Infant mortality (per 1000 births)`) %>% hist(main = "log(Infant Mortality)", col = "lightblue")

Data visualisation - part 2

par(mfrow=c(1,2))
hist(countries$`GDP ($ per capita)`, main = 'GDP', xlab = 'GDP($per capita)', col = "grey")
log(countries$`GDP ($ per capita)`) %>% hist(main = "log(GDP)", ylim = c(0,35), col = "grey")

Data visualisation - overview

A scatter plot of the log-transformed variables gives an overview of the relationship between them:

plot(log(`Infant mortality (per 1000 births)`) ~ log(`GDP ($ per capita)`), data = countries)

Linear Regression - Overall Model

Linear Regression - Testing the Overall Model

model1 <- lm(log(`Infant mortality (per 1000 births)`) ~ log(`GDP ($ per capita)`), data = countries)
model1 %>% summary()
## 
## Call:
## lm(formula = log(`Infant mortality (per 1000 births)`) ~ log(`GDP ($ per capita)`), 
##     data = countries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64617 -0.32677 -0.01089  0.35710  1.62962 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 9.5800     0.2677   35.79   <2e-16 ***
## log(`GDP ($ per capita)`)  -0.7637     0.0309  -24.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5409 on 218 degrees of freedom
## Multiple R-squared:  0.737,  Adjusted R-squared:  0.7358 
## F-statistic: 610.9 on 1 and 218 DF,  p-value: < 2.2e-16
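
As a rough cross-check of the overall F-test, the reported statistic can be compared with the 5% critical value of the F distribution. The sketch below uses only numbers copied from the summary output above (the 5% significance level is an assumption, since the output does not state one), so it runs without the data file.

```r
# Values copied from the model summary above.
f_stat <- 610.9
df1 <- 1     # numerator degrees of freedom (one predictor)
df2 <- 218   # residual degrees of freedom

# Critical value at an assumed 5% significance level.
f_crit <- qf(0.95, df1, df2)

# Upper-tail probability of observing an F at least this large under H0.
p_overall <- pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat > f_crit
p_overall
```

With a single predictor the F-statistic is the square of the slope's t-statistic, which is why the overall test and the slope test lead to the same conclusion here.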

Linear Regression - Interpreting the Coefficients and \(R^2\)
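
As a hedged reading of the coefficient table in the summary output (natural logarithms, since R's `log()` defaults to base e), the fitted model can be written as

\[ \widehat{\log(\text{IMR})} = 9.5800 - 0.7637 \, \log(\text{GDP per capita}) \]

Because both variables are log-transformed, the slope compares countries in relative terms:

\[ \frac{\widehat{\text{IMR}}_2}{\widehat{\text{IMR}}_1} = \left( \frac{\text{GDP}_2}{\text{GDP}_1} \right)^{-0.7637}, \qquad 2^{-0.7637} \approx 0.59 \]

Under this model, a country with twice the GDP per capita is predicted to have an infant mortality rate about 41% lower. The reported \(R^2 = 0.737\) means that about 73.7% of the variation in log(IMR) is explained by log(GDP per capita).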

Linear Regression - Testing Model Parameters
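
One way to test the slope is to rebuild its t-statistic and a 95% confidence interval from the coefficient table. The sketch below uses only the estimate, standard error and degrees of freedom reported in the summary output; the 95% level is an assumption, and `confint(model1)` would be the direct route when the fitted model is available.

```r
# Values copied from the coefficient table above.
b1 <- -0.7637     # slope estimate for log(GDP per capita)
se_b1 <- 0.0309   # its standard error
df_resid <- 218   # residual degrees of freedom

# t-statistic for H0: slope = 0, with a two-sided p-value.
t_stat <- b1 / se_b1
p_slope <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval: estimate +/- critical t * standard error.
t_crit <- qt(0.975, df = df_resid)
ci_slope <- b1 + c(-1, 1) * t_crit * se_b1

t_stat
ci_slope
```

If the interval excludes zero, the null hypothesis of no linear relationship between the log-transformed variables is rejected at the 5% level.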

Linear Regression - Testing Assumptions visualisation

par(mfrow=c(2,2))
model1 %>% plot(which = 1)
model1 %>% plot(which = 2)
model1 %>% plot(which = 3)
model1 %>% plot(which = 5)

Linear Regression - Testing Assumptions interpretation

Linear Regression - Strength and Direction of Linear Relationships

r <- cor(log(countries$`Infant mortality (per 1000 births)`), log(countries$`GDP ($ per capita)`), use = 'complete.obs')
r
## [1] -0.8584828
library(psychometric)
CIr(r, n = 220, level = .95)
## [1] -0.8897237 -0.8192381
detach('package:psychometric', unload = T)
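
The confidence interval for r can also be obtained in base R through the Fisher z-transformation. This is a sketch that assumes `CIr()` follows that standard approach; it uses only the values of r and n reported above.

```r
# Values copied from the output above.
r <- -0.8584828
n <- 220

# Fisher z-transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3).
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# Build the 95% interval on the z scale, then transform back with tanh().
ci_r <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)
ci_r

# In simple linear regression, r squared equals the model's R-squared.
r^2
```

Doing the calculation this way avoids loading and detaching a package for a single call.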

Linear Regression - Interpretation

Discussion

Based on the investigation, there was a statistically significant negative linear relationship between the log-transformed GDP per capita and IMR of a country (r = -0.86, R-squared = 0.737, p < 0.001). The estimated linear regression model could be used for further predictions.
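
As an illustration of such predictions, the fitted coefficients can be turned into a small helper. This is a sketch only: `predict_imr` is a hypothetical name, the coefficients are copied from the summary output, and the GDP values are arbitrary examples.

```r
# Coefficients copied from the model summary.
b0 <- 9.5800
b1 <- -0.7637

# Hypothetical helper: predicted IMR (per 1000 births) for a given
# GDP per capita, back-transformed from the log scale with exp().
predict_imr <- function(gdp_per_capita) {
  exp(b0 + b1 * log(gdp_per_capita))
}

# Arbitrary example values of GDP per capita in dollars.
pred <- predict_imr(c(1000, 10000, 30000))
pred
```

Back-transforming with `exp()` estimates a typical (median) IMR and not the mean, and predictions are only trustworthy for GDP values inside the range used to fit the model.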