For this exercise, I used a data set called “gapminder” from a collection at gapminder.org. The data set records life expectancy, population, and GDP per capita for 142 countries from 1952 to 2007 (1,704 observations in total). The tests below analyze two of these variables: life expectancy and GDP per capita.
install.packages("psych")
library("psych")
install.packages('kableExtra')
library("kableExtra")
round(describe(mydata), 3) %>% kbl() %>% kable_classic(html_font = 'Cambria')
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 1 | 1704 | 852.500 | 4.920470e+02 | 852.500 | 852.500 | 631.588 | 1.000 | 1.704000e+03 | 1.703000e+03 | 0.000 | -1.202 | 11.920 |
| country* | 2 | 1704 | 71.500 | 4.100300e+01 | 71.500 | 71.500 | 52.632 | 1.000 | 1.420000e+02 | 1.410000e+02 | 0.000 | -1.202 | 0.993 |
| continent* | 3 | 1704 | 2.331 | 1.209000e+00 | 2.000 | 2.271 | 1.483 | 1.000 | 5.000000e+00 | 4.000000e+00 | 0.255 | -1.338 | 0.029 |
| year | 4 | 1704 | 1979.500 | 1.726500e+01 | 1979.500 | 1979.500 | 22.239 | 1952.000 | 2.007000e+03 | 5.500000e+01 | 0.000 | -1.219 | 0.418 |
| lifeExp | 5 | 1704 | 59.474 | 1.291700e+01 | 60.712 | 59.915 | 16.101 | 23.599 | 8.260300e+01 | 5.900400e+01 | -0.252 | -1.129 | 0.313 |
| pop | 6 | 1704 | 29601212.325 | 1.061579e+08 | 7023595.500 | 11399459.451 | 7841473.624 | 60011.000 | 1.318683e+09 | 1.318623e+09 | 8.326 | 77.621 | 2571683.452 |
| gdpPercap | 7 | 1704 | 7215.327 | 9.857455e+03 | 3531.847 | 5221.443 | 4007.608 | 241.166 | 1.135231e+05 | 1.132820e+05 | 3.843 | 27.396 | 238.798 |
Here we test the null hypothesis that the correlation between life expectancy (dependent variable) and GDP per capita (independent variable) is 0. The very small p-value (< 2.2e-16) means we can reject the null hypothesis: there is a positive correlation between the two variables (r = 0.58).
x <- mydata$gdpPercap
y <- mydata$lifeExp
cor.test(y, x)
##
## Pearson's product-moment correlation
##
## data: y and x
## t = 29.658, df = 1702, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5515065 0.6141690
## sample estimates:
## cor
## 0.5837062
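As a rough check on the output above, the t statistic can be recovered from the sample correlation and the sample size alone. The sketch below hard-codes the values printed by `cor.test()`; the variable names `r`, `n`, `t_stat`, and `p_value` are illustrative and not part of the original analysis.

```r
# Values copied from the cor.test() output above
r <- 0.5837062   # sample correlation
n <- 1704        # number of observations (df = n - 2 = 1702)

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 3)     # should agree with the reported t up to rounding
p_value < 2.2e-16    # compares against the threshold printed by cor.test()
```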
#Simple Linear Regression
mylm <- lm(mydata$lifeExp ~ mydata$gdpPercap)
summary(mylm)
##
## Call:
## lm(formula = mydata$lifeExp ~ mydata$gdpPercap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.754 -7.758 2.176 8.225 18.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.396e+01 3.150e-01 171.29 <2e-16 ***
## mydata$gdpPercap 7.649e-04 2.579e-05 29.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
## F-statistic: 879.6 on 1 and 1702 DF, p-value: < 2.2e-16
boxplot(mylm$residuals)
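The box plot shows the median of the residuals just above zero (2.176), consistent with the residual summary given by the linear model above. The first and third quartiles (-7.758 and 8.225) sit at roughly similar distances on either side of zero; however, there are many more outliers below the box than above it, and the minimum residual (-82.754) is far more extreme than the maximum (18.426).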
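The asymmetry in the outliers can be illustrated with the usual 1.5 × IQR whisker rule. This is a minimal sketch that assumes the quartiles printed by `summary(mylm)` are close to the hinges that `boxplot()` actually uses (they can differ slightly); the variable names are illustrative.

```r
# Quartiles and extremes copied from the "Residuals" line of summary(mylm)
q1      <- -7.758
q3      <-  8.225
res_min <- -82.754
res_max <-  18.426

iqr         <- q3 - q1          # interquartile range
lower_fence <- q1 - 1.5 * iqr   # residuals below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr   # residuals above this are drawn as outliers

c(lower = lower_fence, upper = upper_fence)
res_min < lower_fence   # is the smallest residual beyond the lower fence?
res_max > upper_fence   # is the largest residual beyond the upper fence?
```

Under these assumptions, the smallest residual lies well beyond the lower fence while the largest residual stays inside the upper one, which matches the lopsided box plot.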
The R-squared value (0.3407) indicates that about 34% of the variance in life expectancy is explained by GDP per capita in this linear model.
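In a simple linear regression with one predictor, R-squared equals the squared Pearson correlation, and the slope equals the correlation rescaled by the ratio of the standard deviations. The sketch below checks both relationships using values copied from the outputs above; the standard deviations are the rounded figures from the `describe()` table, so small rounding differences are expected.

```r
# Values copied from the outputs above
r    <- 0.5837062   # correlation from cor.test()
sd_y <- 12.917      # sd of lifeExp from describe()
sd_x <- 9857.455    # sd of gdpPercap from describe()

r_squared <- r^2               # compare with Multiple R-squared in summary(mylm)
slope     <- r * sd_y / sd_x   # compare with the gdpPercap estimate in summary(mylm)

round(r_squared, 4)
signif(slope, 4)
```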
# GDP per capita is plotted on a log10 scale, so the axis label says so
plot(log10(x), y, ylab = "Life Expectancy", xlab = "log10(GDP per capita)")
abline(lm(y ~ log10(x)), col = 2)
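One possible motivation for the log10 scale is that a predictor spanning several orders of magnitude often relates more linearly to the response after a log transform. The sketch below illustrates this with simulated data only; `x_sim` and `y_sim` are hypothetical and are not the gapminder variables, so the numbers say nothing about the actual fit.

```r
set.seed(1)  # reproducible simulated data, NOT the gapminder data

# Hypothetical predictor spread over several orders of magnitude
x_sim <- 10^runif(500, min = 2.5, max = 5)
# Hypothetical response that is linear in log10(x) plus noise
y_sim <- 20 + 12 * log10(x_sim) + rnorm(500, sd = 5)

fit_raw <- lm(y_sim ~ x_sim)          # straight line on the raw scale
fit_log <- lm(y_sim ~ log10(x_sim))   # straight line on the log10 scale

# R-squared for each fit, side by side
c(raw = summary(fit_raw)$r.squared,
  log = summary(fit_log)$r.squared)
```

The same comparison could be run on the real data by fitting `lm(y ~ log10(x))` and comparing its `summary()` with that of `mylm`.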