We will use Python for this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #Data visualisation libraries
import seaborn as sns
from statsmodels.formula.api import ols
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Anaconda3/Library/plugins/platforms'
We retrieve the who.csv dataset (hosted on Github) which contains real-world data from 2008.
df = pd.read_csv('https://raw.githubusercontent.com/mehtablocker/cuny_605/master/data_files/who.csv', encoding = "ISO-8859-1")
df.head()
## Country LifeExp InfantSurvival ... PersExp GovtExp TotExp
## 0 Afghanistan 42 0.835 ... 20 92 112
## 1 Albania 71 0.985 ... 169 3128 3297
## 2 Algeria 71 0.967 ... 108 5184 5292
## 3 Andorra 82 0.997 ... 2589 169725 172314
## 4 Angola 41 0.846 ... 36 1620 1656
##
## [5 rows x 10 columns]
Scatterplot and regression of LifeExp~TotExp
df.plot.scatter('TotExp', 'LifeExp')
plt.show()
lm_fit = ols('LifeExp ~ TotExp', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: LifeExp R-squared: 0.258
## Model: OLS Adj. R-squared: 0.254
## Method: Least Squares F-statistic: 65.26
## Date: Wed, 17 Apr 2019 Prob (F-statistic): 7.71e-14
## Time: 10:29:00 Log-Likelihood: -693.74
## No. Observations: 190 AIC: 1391.
## Df Residuals: 188 BIC: 1398.
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept 64.7534 0.754 85.933 0.000 63.267 66.240
## TotExp 6.297e-05 7.79e-06 8.079 0.000 4.76e-05 7.83e-05
## ==============================================================================
## Omnibus: 24.152 Durbin-Watson: 1.802
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 30.536
## Skew: -0.982 Prob(JB): 2.34e-07
## Kurtosis: 2.981 Cond. No. 1.07e+05
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.07e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
We can see from the scatterplot that a linear model does not look like a good choice for these two variables as they are. While the low coefficient p-values of standard errors and p-value of F-statistic show that there is something going on beyond chance, the low R2 value confirms that this particular model is not a good fit.
We now transform the two variables and try again.
df = df.assign(TotExp_trans = df.TotExp**0.06, LifeExp_trans = df.LifeExp**4.6)
df.plot.scatter('TotExp_trans', 'LifeExp_trans')
plt.show()
lm_fit = ols('LifeExp_trans ~ TotExp_trans', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: LifeExp_trans R-squared: 0.730
## Model: OLS Adj. R-squared: 0.728
## Method: Least Squares F-statistic: 507.7
## Date: Wed, 17 Apr 2019 Prob (F-statistic): 2.60e-55
## Time: 10:29:00 Log-Likelihood: -3749.5
## No. Observations: 190 AIC: 7503.
## Df Residuals: 188 BIC: 7510.
## Df Model: 1
## Covariance Type: nonrobust
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## Intercept -7.365e+08 4.68e+07 -15.732 0.000 -8.29e+08 -6.44e+08
## TotExp_trans 6.201e+08 2.75e+07 22.532 0.000 5.66e+08 6.74e+08
## ==============================================================================
## Omnibus: 20.315 Durbin-Watson: 1.909
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 24.595
## Skew: -0.736 Prob(JB): 4.56e-06
## Kurtosis: 3.971 Cond. No. 16.3
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
The scatterplot indicates a much more linear relationship. The p-values for the coefficient standard errors and F-statistic have all decreased. And sure enough, the R2 has almost tripled, indicating a much better fit.
When making predictions using the transformed model, we must remember to “undo” the transformation.
new_df = pd.DataFrame(np.array([1.5, 2.5]), columns = ['TotExp_trans'])
y_bars = lm_fit.predict(new_df)**(1/4.6)
print(y_bars)
## 0 63.311533
## 1 86.506448
## dtype: float64
We now try a multivariate regression with an interaction term.
lm_fit = ols('LifeExp ~ PropMD + TotExp + PropMD*TotExp', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: LifeExp R-squared: 0.357
## Model: OLS Adj. R-squared: 0.347
## Method: Least Squares F-statistic: 34.49
## Date: Wed, 17 Apr 2019 Prob (F-statistic): 9.02e-18
## Time: 10:29:00 Log-Likelihood: -680.03
## No. Observations: 190 AIC: 1368.
## Df Residuals: 186 BIC: 1381.
## Df Model: 3
## Covariance Type: nonrobust
## =================================================================================
## coef std err t P>|t| [0.025 0.975]
## ---------------------------------------------------------------------------------
## Intercept 62.7727 0.796 78.899 0.000 61.203 64.342
## PropMD 1497.4940 278.817 5.371 0.000 947.444 2047.544
## TotExp 7.233e-05 8.98e-06 8.053 0.000 5.46e-05 9.01e-05
## PropMD:TotExp -0.0060 0.001 -4.093 0.000 -0.009 -0.003
## ==============================================================================
## Omnibus: 26.740 Durbin-Watson: 1.925
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 33.982
## Skew: -1.027 Prob(JB): 4.18e-08
## Kurtosis: 3.266 Cond. No. 4.24e+07
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.24e+07. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """
Similar to the first model we tried, the p-values of the coefficient standard errors and F-statistic are low, simply indicating that if the true values were 0 then the observed results would be very unlikely due to chance alone. But the low Adjusted R2 value indicates the overall model is not a particularly good fit.
A prediction of Life Expectancy when Proportion of Doctors is 0.03 and Total Expenditure is 14:
new_df = pd.DataFrame({'PropMD':[0.03], 'TotExp':[14]})
y_bars = lm_fit.predict(new_df)
print(y_bars)
## 0 107.696004
## dtype: float64
Forecasting an average age of over 107 years old does not seem realistic. This is in part due to us choosing values that are at the extremes of each variable, but also due to our model not being particularly accurate.