Instructions

The purpose of this assignment is to review general linear models (i.e. regression) and start thinking about alternative approaches to the analysis of hierarchical data. The data (televisions.dat) for this assignment is non-nested, comes from 40 countries, and was obtained from The World Almanac and Book of Facts 1993.


Setup

Read in data, name variables, inspect dataframe, missing values to NA

## read in data
tv.df <- read.table("C:/Users/User/Downloads/Villano_HW1/televisions.dat", quote="\"", comment.char="")

## assign column names
# country = Country name
# lifeexp = Average life expectancy in years
# ppltv = Number of people per television
# pplmd = Number of people per physician
# femexp = Life expectancy for women
# maleexp = Life expectancy for men
colnames(tv.df) <- c("country", "lifeexp", "ppltv", "pplmd", "femexp", "maleexp")

## inspect 
head(tv.df)
##      country lifeexp ppltv pplmd femexp maleexp
## 1  Argentina    70.5   4.0   370     74      67
## 2 Bangladesh    53.5 315.0  6166     53      54
## 3     Brazil    65.0   4.0   684     68      62
## 4     Canada    76.5   1.7   449     80      73
## 5      China    70.0   8.0   643     72      68
## 6   Colombia    71.0   5.6  1551     74      68
# replace "." with NA
tv.df$ppltv[which(tv.df$ppltv == ".")] <- NA

tv.df$lifeexp <- as.numeric(tv.df$lifeexp)
tv.df$ppltv <- as.numeric(tv.df$ppltv)
tv.df$pplmd <- as.numeric(tv.df$pplmd)
tv.df$femexp <- as.numeric(tv.df$femexp)
tv.df$maleexp <- as.numeric(tv.df$maleexp)

Problem Set

1. Describe the distributions of lifeexp, ppltv, and pplmd, including means, SD, skew, and kurtosis. Include histograms to visualize the data.

The average life expectancy for all countries in the sample was 67.04 (SD = 8.25). Life expectancy was negatively skewed from normal (Skewness = -0.38) and minimally kurtotic (Kurtosis = -1.05), and ranged from 51.5 years to 79 years (range = 27.5). The average number of people per television (termed “people per TV”) was 51.98 (SD = 130.35). People per TV was highly positively skewed (Skewness = 3.14), leptokurtotic (Kurtosis = 8.97), and consisted primarily of smaller values with three high outliers: people per TV ranged from 1.3 to 592 (range = 590.7), while the median number of people per television was 6.3. Two people per TV values were missing (Tanzania and Zaire) and were thus excluded. The average number of people per doctor was 3997.65 (SD = 7666.13). Like the number of people per television, the number of people per doctor was highly positively skewed (Skewness = -0.38) and leptokurtotic (Kurtosis = 7.76). The median number of doctors per person was 990.5, with values ranging from 226 to 36660 doctors per person (range = 36434).

See below for full descriptive statistics for all variables, and figure 1.2 for histograms depicting the frequency distributions for all variables.

Descriptive Statistics
kable(as.data.frame(describe(tv.df[,2:ncol(tv.df)])))
vars n mean sd median trimmed mad min max range skew kurtosis se
lifeexp 1 40 67.03750 8.248844 69.5 67.40625 8.15430 51.5 79 27.5 -0.3809784 -1.0545521 1.304257
ppltv 2 38 51.97895 130.347782 6.3 17.51250 6.07866 1.3 592 590.7 3.1464570 8.9732995 21.145202
pplmd 3 40 3997.65000 7666.131601 990.5 1915.03125 919.95330 226.0 36660 36434.0 2.8408566 7.7578860 1212.121835
femexp 4 40 69.57500 9.162038 72.0 70.00000 9.63690 53.0 82 29.0 -0.3919531 -1.1387802 1.448646
maleexp 5 40 64.50000 7.421383 66.0 64.87500 8.15430 50.0 76 26.0 -0.3778010 -0.9145029 1.173424
Histograms

Note. lifeexp = Life Expectancy, ppltv = People per TV, pplmd = People per Doctor

tv.df.melt <- melt(tv.df, measure.vars = c("lifeexp", "ppltv", "pplmd"))
tv.df.melt$value <- as.numeric(tv.df.melt$value)

ggplot(tv.df.melt, aes(x = value, fill = variable, color = variable)) + 
  geom_histogram(position = "identity", alpha = 0.75, bins = 10) + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none")



2. Generate a scatterplot of lifeexp by pplmd and ppltv, the predictors of interest. Describe the relationships observed. Why are these relationships problematic?

Due to the high degree of positive skew in the frequency distributions for both people per doctor and people per television, the relationships between life expectancy and the number of people per doctor and per television, respectively, were highly nonlinear, and approximated an exponential decay function (see plot below).

tv.df.melt2 <- melt(tv.df, measure.vars = c("ppltv", "pplmd"))
ggplot(tv.df.melt2, aes(x = value, y = lifeexp, color = variable)) + 
  geom_point(size = 2) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none") + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85)



3. Conduct log transformations of ppltv and pplmd, the predictors of interest. Re-run the scatterplots using the log-transformed values. Are the problems you identified in (2) resolved?

Following log transformation, the frequency distributions for both the number of people per television and per doctor better approximated a normal distribution. As a result, the relationships between life expectancy and the log-transformed number of people per television and log-transformed number of people per doctor were roughly linear. Both relationships appeared negative, such that life expectancy appeared to decrease linearly as the number of people per television and per doctor increased, respectively (see plot below).

# log transform
tv.df$pplmd_log <- log(tv.df$pplmd)
tv.df$ppltv <- as.numeric(tv.df$ppltv)
tv.df$ppltv_log <- log(tv.df$ppltv)

tv.df.melt3 <- melt(tv.df, measure.vars = c("ppltv_log", "pplmd_log"))
ggplot(tv.df.melt3, aes(x = value, y = lifeexp, color = variable)) + 
  geom_point(size = 2) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none") + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85)



4. Run a simple regression analysis predicting lifeexp from log(ppltv) and summarize the results, including interpretations of R2, the model F-test, and the intercept and slope coefficients. Venture an explanation for your findings.

Linear regression revealed a strong negative association between the log-transformed people per TV variable and life expectancy (b = -4.26, SE = 0.43, p < 0.0001), such that a one-unit increase in the log-transformed people per TV variable predicted a 4.26 year decrease in life expectancy. The predicted life expectancy for a country with 0 log-transformed people per TV was 77.89 (SE = 1.22). Model R2 indicated that the log-transformed number of people per television explained 73.12% of the total variance in life expectancy across all countries, while the model F-test revealed that the log-transformed number of people per television explained variance in life expectancy better than an intercept-only model (F[1,36] = 97.94, p < 0.0001).

Regression Output
md.1 <- lm(lifeexp ~ 1 + ppltv_log, data = tv.df)
md.1.sum <- summary(md.1)
print(md.1.sum)
## 
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log, data = tv.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.531  -2.567   0.284   2.175  11.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.8873     1.2203  63.827  < 2e-16 ***
## ppltv_log    -4.2597     0.4304  -9.896  8.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.101 on 36 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7312, Adjusted R-squared:  0.7238 
## F-statistic: 97.94 on 1 and 36 DF,  p-value: 8.201e-12


5. Add log(pplmd) as an additional predictor. Interpret R2, the change in R2, the model F-test, and the intercept and slope coefficients. Why is the effect of log(ppltv) smaller than in (4)? Venture an explanation for why the effect of log(ppltv) is larger than the effect of log(pplmd)).

In the two-predictor model, both the log-transformed number of people per doctor and log-transformed number of people per television were significant negative predictors of life expectancy. The intercept for life expectancy was 90.62, indicating that the predicted life expectancy for a hypothetical country with zero log-transformed people per doctor and per television, respectively, was 90.62 years of age. The coefficient for the log-transformed number of people per television remained negative as in the univariate model (b = -2.92, SE = 0.59, p < 0.001), and the slope coefficient indicated that when controlling for the log-transformed number of people per doctor, a one-unit increase in the log-transformed number of people per television predicted a 2.92 year decrease in life expectancy. The coefficient for the log-transformed number of people per doctor was also negative (b = -2.26, SE = 0.74, p < 0.01), such that when controlling for the lo og-transformed number of people per television, a one-unit increase in the log-transformed number of people per doctor predicted a 2.26 year decrease in life expectancy.

After adding the log-transformed number of people per doctor to the linear regression model, the model R2 increased from 0.7312 to 0.7868, which indicates that the log-transformed number of people per doctor explained an additional 5.56% of the variance in life expectancy beyond that explained by the log-transformed number of people per television. The model F-test indicates that the log-transformed number of people per television and log-transformed number of people per doctor jointly explained variance in life expectancy better than an intercept-only model (F[2,35] = 64.6, p < 0.0001).

The effect of the log-transformed number of people per television was smaller in the two-predictor model, likely due to the fact that both the log-transformed number of people per television and per doctor, respectively, explain and compete for similar variance in life expectancy. Both variables covary, and therefore hold similar negative relationships with life expectancy. Thus, with the addition of the log-transformed number of people per doctor predictor, the coefficient for the log-transformed number of people per television, which is now a partial effect in this model, decreases.

Relative to the effect of log-transformed people per TV, the effect of log-transformed people per doctor is smaller. In theory, this may be due to the fact that doctors cannot act on the population to increase life expectancy unless the population can afford and thus access their care. It is reasonable to assume that the inverse number of people per television indicates the degree of wealth in each country, and in the absence of sufficient wealth, the effect of the presence of doctors on life expectancy cannot come to bear. Alternatively, the inverse number of people per television and per doctor may both be proxies for wealth, which predicts life expectancy, and it is also possible that people per television is more related to the third variable of wealth.

Regression Output
md.2 <- lm(lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)

md.2.sum <- summary(md.2)
print(md.2.sum)
## 
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7173 -2.7718  0.9026  2.9923  5.8553 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  90.6222     4.3557  20.806  < 2e-16 ***
## ppltv_log    -2.9156     0.5907  -4.936 1.95e-05 ***
## pplmd_log    -2.2589     0.7474  -3.022  0.00467 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.704 on 35 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7868, Adjusted R-squared:  0.7747 
## F-statistic:  64.6 on 2 and 35 DF,  p-value: 1.788e-12
# md.2.sum$r.squared - md.1.sum$r.squared


6. Suppose that you have data on life expectancy not only in 1993, but 5 repeated measures at 5-year intervals (1993, 1998, 2003, 2008, 2013) for each country.

a. Explain specifically why it would be a bad idea to run another standard GLM, simply including time as another predictor?

With repeated measures occurring within each country, there would be a dependency between life expectancy values recorded across time in each country consistent with nesting. This would violate the GLM assumption that observations are independent. Thus a standard GLM would poorly model the data and standard errors for resultant regression coefficients would likely be inflated.


b. Describe how the different analysis approaches for hierarchical data discussed in class

could be applied to this data. What would be the pros and cons of each approach?

One solution to the nesting conundrum would be to adjust the results of the analysis, mainly the standard errors for resultant regression coefficients. While this approach would improve estimates of the error in regression coefficients, it would preclude comparison of effects across countries in the sample. Another solution would be to add a fixed-effect term to the model that corresponds to a dummy coded variable representing country. This approach accommodates data such as these, where groups (i.e., countries) are static across the population, though it would also preclude the researcher from elucidating differences in effects within- and between-countries, and also from comparing these effects across countries. The most appropriate approach would be to specify a random-effects model in which country is specified as a nesting term. This approach would allow the researcher to explore differences in effects both within- and between-countries in the sample, but might present specification challenges and convergence issues with such a small sample size.


c. For the random effects approach, set up Level-1, Level-2, and Reduced-Form equations

to summarize the hierarchical model you would test, using appropriate symbols and subscripts. Indicate which symbols represent which variables in your dataset.

Level 1 Equation

where xij represents ppltv scores over time within country, β0j represents the intercept, β1j represents the fixed effect of ppltv scores on life expectancy, and eij represents individual differences in life expectancy.

Level 2 Equations

where γ00 represents the estimated grand mean life expectancy, and uij represents group differences in life expectancy.

where γ01 represents the estimated fixed effect of ppltv scores on life expectancy.

Reduced Form Equation