For this exercise, I used the “gapminder” data set from the collection at gapminder.org. It records life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007 (1,704 country-year observations). The tests below analyze two of these variables: life expectancy and GDP per capita.

Summary of Gapminder Data

install.packages("psych")
library("psych")
install.packages('kableExtra')
library("kableExtra")
round(describe(mydata),3)%>%kbl()%>%kable_classic(html_font='Cambria')
variable    vars     n          mean            sd       median       trimmed          mad        min           max         range    skew  kurtosis           se
X              1  1704       852.500  4.920470e+02      852.500       852.500      631.588      1.000  1.704000e+03  1.703000e+03   0.000    -1.202       11.920
country*       2  1704        71.500  4.100300e+01       71.500        71.500       52.632      1.000  1.420000e+02  1.410000e+02   0.000    -1.202        0.993
continent*     3  1704         2.331  1.209000e+00        2.000         2.271        1.483      1.000  5.000000e+00  4.000000e+00   0.255    -1.338        0.029
year           4  1704      1979.500  1.726500e+01     1979.500      1979.500       22.239   1952.000  2.007000e+03  5.500000e+01   0.000    -1.219        0.418
lifeExp        5  1704        59.474  1.291700e+01       60.712        59.915       16.101     23.599  8.260300e+01  5.900400e+01  -0.252    -1.129        0.313
pop            6  1704  29601212.325  1.061579e+08  7023595.500  11399459.451  7841473.624  60011.000  1.318683e+09  1.318623e+09   8.326    77.621  2571683.452
gdpPercap      7  1704      7215.327  9.857455e+03     3531.847      5221.443     4007.608    241.166  1.135231e+05  1.132820e+05   3.843    27.396      238.798
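The X row in the table runs from 1 to 1704, which looks like a row index rather than a measured variable. One plausible explanation, and it is only an assumption because the loading step is not shown, is that mydata was read from a CSV file that had been saved with row names. The sketch below reproduces that behavior on a tiny made-up data frame (the values and the temporary file are purely illustrative):

```r
# Tiny made-up data frame (illustrative values only, not gapminder data)
toy <- data.frame(country = c("A", "B", "C"),
                  lifeExp = c(50.1, 62.3, 71.8))

# write.csv() stores row names as an unnamed first column by default
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp)

# read.csv() names that unnamed column "X"
reread <- read.csv(tmp)
names(reread)

# Writing with row.names = FALSE avoids the extra column
write.csv(toy, tmp, row.names = FALSE)
clean <- read.csv(tmp)
names(clean)
```

If that is what happened here, the X row (and the rows for country* and continent*, which describe() marks with an asterisk because it converted them to numeric codes) can be ignored when reading the summary.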

Correlation Test

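Here we test the null hypothesis that the correlation between life expectancy and GDP per capita is 0. The p-value is very small (< 2.2e-16), so we can reject the null: there is a positive correlation between the two variables (Pearson's r = 0.58, 95% confidence interval 0.55 to 0.61). Note that 0.58 is the correlation coefficient, not a slope; correlation is symmetric, so the dependent/independent distinction only matters for the regression later on.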

x <- mydata$gdpPercap
y <- mydata$lifeExp

cor.test(y, x)
## 
##  Pearson's product-moment correlation
## 
## data:  y and x
## t = 29.658, df = 1702, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5515065 0.6141690
## sample estimates:
##       cor 
## 0.5837062
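As a sanity check on the output above, the t statistic of a Pearson correlation test can be recomputed from r and the degrees of freedom using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below uses only the numbers printed by cor.test(); the variable names are my own.

```r
# Values copied from the cor.test() output above
r  <- 0.5837062
df <- 1702

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 3)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)
p_value < 2.2e-16
```

The recomputed statistic should agree with the reported t = 29.658 up to rounding of r.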

Simple Linear Regression

mylm <- lm(mydata$lifeExp ~ mydata$gdpPercap)
summary(mylm)
## 
## Call:
## lm(formula = mydata$lifeExp ~ mydata$gdpPercap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.754  -7.758   2.176   8.225  18.426 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.396e+01  3.150e-01  171.29   <2e-16 ***
## mydata$gdpPercap 7.649e-04  2.579e-05   29.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared:  0.3407, Adjusted R-squared:  0.3403 
## F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16
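In simple linear regression the coefficients are tied to the summary statistics: slope = r * sd(y) / sd(x) and intercept = mean(y) - slope * mean(x). The sketch below recomputes both from the rounded values in the describe() table and the cor.test() output, so small rounding differences are expected; the variable names are my own.

```r
# Values copied from the describe() table and the cor.test() output
r      <- 0.5837062
mean_x <- 7215.327; sd_x <- 9857.455   # gdpPercap
mean_y <- 59.474;   sd_y <- 12.917     # lifeExp

# Least-squares coefficients for lifeExp ~ gdpPercap
slope     <- r * sd_y / sd_x
intercept <- mean_y - slope * mean_x

signif(slope, 4)
signif(intercept, 4)

# Change in predicted life expectancy per additional 1000 units of GDP per capita
slope * 1000
```

Read this way, the fitted slope of 7.649e-04 corresponds to roughly 0.76 additional years of predicted life expectancy per 1,000 units of GDP per capita under this linear model.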

Box Plot of Residuals

boxplot(mylm$residuals)

The box plot shows the median residual just above zero (2.176), consistent with the residual summary from the linear model above. The first and third quartiles (1Q = -7.758, 3Q = 8.225) are roughly symmetric about zero, but the lower tail is much longer: the minimum residual is -82.754 while the maximum is only 18.426, and there are many more outliers below the fitted line than above it.
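The asymmetry in the outliers can be checked against the usual 1.5 * IQR whisker rule using only the residual summary printed by summary(mylm). This is an approximation: boxplot() computes its hinges with fivenum(), which can differ slightly from the quartiles shown in the summary.

```r
# Quartiles and extremes copied from the residual summary of summary(mylm)
q1 <- -7.758
q3 <- 8.225
res_min <- -82.754
res_max <- 18.426

# Whisker limits under the 1.5 * IQR rule
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

res_min < lower_fence   # TRUE: the minimum lies beyond the lower fence
res_max > upper_fence   # FALSE: the maximum lies inside the upper fence
```

Under this approximation the fences are about -31.7 and 32.2, so all flagged outliers would sit below the box.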

The R-squared value (0.3407) indicates that about 34% of the variance in life expectancy is explained by GDP per capita in this linear model.
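With a single predictor, R-squared is simply the square of the Pearson correlation, and the adjusted value follows from the sample size. The sketch below recomputes both from the numbers reported earlier; the variable names are my own.

```r
# Values copied from the cor.test() output and the describe() table
r <- 0.5837062
n <- 1704
p <- 1   # number of predictors in the model

# Multiple R-squared for a one-predictor model
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
```

Both values should match the 0.3407 and 0.3403 printed by summary(mylm).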

Scatterplot of Data

plot(log10(x), y, ylab = "Life Expectancy", xlab = "log10(GDP per capita)")
# Fitted line for lifeExp on log10(gdpPercap); this is a separate fit from mylm
abline(lm(y ~ log10(x)), col = 2)
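The scatterplot fits life expectancy against log10(GDP per capita), which is a different model from mylm. To see why a log scale can matter for a strongly right-skewed predictor, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration (the sample size, the coefficients -10 and 17, the noise level), and its numbers say nothing about the actual gapminder fit.

```r
set.seed(1)  # reproducible simulated data (NOT the gapminder data)

n   <- 500
gdp <- 10^runif(n, min = 2.4, max = 5)            # right-skewed stand-in for GDP per capita
le  <- -10 + 17 * log10(gdp) + rnorm(n, sd = 5)   # response that is linear in log10(gdp)

# Same response, two choices of predictor scale
fit_linear <- lm(le ~ gdp)
fit_log    <- lm(le ~ log10(gdp))

r2_linear <- summary(fit_linear)$r.squared
r2_log    <- summary(fit_log)$r.squared
c(linear = r2_linear, log10 = r2_log)
```

Running the same comparison on mydata, with summary(lm(y ~ log10(x))) next to summary(mylm), would show whether the log model also explains more variance in the real data.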