Load modules

We will use Python for this analysis.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt   #Data visualisation libraries 
import seaborn as sns
from statsmodels.formula.api import ols

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Anaconda3/Library/plugins/platforms'

Get data

We retrieve the who.csv dataset (hosted on Github) which contains real-world data from 2008.

df = pd.read_csv('https://raw.githubusercontent.com/mehtablocker/cuny_605/master/data_files/who.csv', encoding = "ISO-8859-1")
df.head()
##        Country  LifeExp  InfantSurvival  ...  PersExp  GovtExp  TotExp
## 0  Afghanistan       42           0.835  ...       20       92     112
## 1      Albania       71           0.985  ...      169     3128    3297
## 2      Algeria       71           0.967  ...      108     5184    5292
## 3      Andorra       82           0.997  ...     2589   169725  172314
## 4       Angola       41           0.846  ...       36     1620    1656
## 
## [5 rows x 10 columns]

Exercise 1

Scatterplot and regression of LifeExp~TotExp

df.plot.scatter('TotExp', 'LifeExp')
plt.show()

lm_fit = ols('LifeExp ~ TotExp', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                LifeExp   R-squared:                       0.258
## Model:                            OLS   Adj. R-squared:                  0.254
## Method:                 Least Squares   F-statistic:                     65.26
## Date:                Wed, 17 Apr 2019   Prob (F-statistic):           7.71e-14
## Time:                        10:29:00   Log-Likelihood:                -693.74
## No. Observations:                 190   AIC:                             1391.
## Df Residuals:                     188   BIC:                             1398.
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     64.7534      0.754     85.933      0.000      63.267      66.240
## TotExp      6.297e-05   7.79e-06      8.079      0.000    4.76e-05    7.83e-05
## ==============================================================================
## Omnibus:                       24.152   Durbin-Watson:                   1.802
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.536
## Skew:                          -0.982   Prob(JB):                     2.34e-07
## Kurtosis:                       2.981   Cond. No.                     1.07e+05
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 1.07e+05. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """

We can see from the scatterplot that a linear model does not look like a good choice for these two variables as they are. While the low coefficient p-values of standard errors and p-value of F-statistic show that there is something going on beyond chance, the low R2 value confirms that this particular model is not a good fit.

Exercise 2

We now transform the two variables and try again.

df = df.assign(TotExp_trans = df.TotExp**0.06, LifeExp_trans = df.LifeExp**4.6)
df.plot.scatter('TotExp_trans', 'LifeExp_trans')
plt.show()

lm_fit = ols('LifeExp_trans ~ TotExp_trans', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:          LifeExp_trans   R-squared:                       0.730
## Model:                            OLS   Adj. R-squared:                  0.728
## Method:                 Least Squares   F-statistic:                     507.7
## Date:                Wed, 17 Apr 2019   Prob (F-statistic):           2.60e-55
## Time:                        10:29:00   Log-Likelihood:                -3749.5
## No. Observations:                 190   AIC:                             7503.
## Df Residuals:                     188   BIC:                             7510.
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## Intercept    -7.365e+08   4.68e+07    -15.732      0.000   -8.29e+08   -6.44e+08
## TotExp_trans  6.201e+08   2.75e+07     22.532      0.000    5.66e+08    6.74e+08
## ==============================================================================
## Omnibus:                       20.315   Durbin-Watson:                   1.909
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               24.595
## Skew:                          -0.736   Prob(JB):                     4.56e-06
## Kurtosis:                       3.971   Cond. No.                         16.3
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

The scatterplot indicates a much more linear relationship. The p-values for the coefficient standard errors and F-statistic have all decreased. And sure enough, the R2 has almost tripled, indicating a much better fit.

Exercise 3

When making predictions using the transformed model, we must remember to “undo” the transformation.

new_df = pd.DataFrame(np.array([1.5, 2.5]), columns = ['TotExp_trans'])
y_bars = lm_fit.predict(new_df)**(1/4.6)
print(y_bars)
## 0    63.311533
## 1    86.506448
## dtype: float64

Exercise 4

We now try a multivariate regression with an interaction term.

lm_fit = ols('LifeExp ~ PropMD + TotExp + PropMD*TotExp', data = df).fit()
lm_fit.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                LifeExp   R-squared:                       0.357
## Model:                            OLS   Adj. R-squared:                  0.347
## Method:                 Least Squares   F-statistic:                     34.49
## Date:                Wed, 17 Apr 2019   Prob (F-statistic):           9.02e-18
## Time:                        10:29:00   Log-Likelihood:                -680.03
## No. Observations:                 190   AIC:                             1368.
## Df Residuals:                     186   BIC:                             1381.
## Df Model:                           3                                         
## Covariance Type:            nonrobust                                         
## =================================================================================
##                     coef    std err          t      P>|t|      [0.025      0.975]
## ---------------------------------------------------------------------------------
## Intercept        62.7727      0.796     78.899      0.000      61.203      64.342
## PropMD         1497.4940    278.817      5.371      0.000     947.444    2047.544
## TotExp         7.233e-05   8.98e-06      8.053      0.000    5.46e-05    9.01e-05
## PropMD:TotExp    -0.0060      0.001     -4.093      0.000      -0.009      -0.003
## ==============================================================================
## Omnibus:                       26.740   Durbin-Watson:                   1.925
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):               33.982
## Skew:                          -1.027   Prob(JB):                     4.18e-08
## Kurtosis:                       3.266   Cond. No.                     4.24e+07
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## [2] The condition number is large, 4.24e+07. This might indicate that there are
## strong multicollinearity or other numerical problems.
## """

Similar to the first model we tried, the p-values of the coefficient standard errors and F-statistic are low, simply indicating that if the true values were 0 then the observed results would be very unlikely due to chance alone. But the low Adjusted R2 value indicates the overall model is not a particularly good fit.

Exercise 5

A prediction of Life Expectancy when Proportion of Doctors is 0.03 and Total Expenditure is 14:

new_df = pd.DataFrame({'PropMD':[0.03], 'TotExp':[14]})
y_bars = lm_fit.predict(new_df)
print(y_bars)
## 0    107.696004
## dtype: float64

Forecasting an average age of over 107 years old does not seem realistic. This is in part due to us choosing values that are at the extremes of each variable, but also due to our model not being particularly accurate.