PSY634 - HLM

Problem Set

1. Describe the distributions of lifeexp, ppltv, and pplmd, including means, SD, skew, and kurtosis. Include histograms to visualize the data.

The average life expectancy for all countries in the sample was 67.04 (SD = 8.25). Life expectancy was negatively skewed from normal (Skewness = -0.38) and minimally kurtotic (Kurtosis = -1.05), and ranged from 51.5 years to 79 years (range = 27.5). The average number of people per television (termed “people per TV”) was 51.98 (SD = 130.35). People per TV was highly positively skewed (Skewness = 3.14), leptokurtotic (Kurtosis = 8.97), and consisted primarily of smaller values with three high outliers: people per TV ranged from 1.3 to 592 (range = 590.7), while the median number of people per television was 6.3. Two people per TV values were missing (Tanzania and Zaire) and were thus excluded. The average number of people per doctor was 3997.65 (SD = 7666.13). Like the number of people per television, the number of people per doctor was highly positively skewed (Skewness = -0.38) and leptokurtotic (Kurtosis = 7.76). The median number of doctors per person was 990.5, with values ranging from 226 to 36660 doctors per person (range = 36434).

See below for full descriptive statistics for all variables, and figure 1.2 for histograms depicting the frequency distributions for all variables.

Descriptive Statistics

kable(as.data.frame(describe(tv.df[,2:ncol(tv.df)])))

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
lifeexp	1	40	67.03750	8.248844	69.5	67.40625	8.15430	51.5	79	27.5	-0.3809784	-1.0545521	1.304257
ppltv	2	38	51.97895	130.347782	6.3	17.51250	6.07866	1.3	592	590.7	3.1464570	8.9732995	21.145202
pplmd	3	40	3997.65000	7666.131601	990.5	1915.03125	919.95330	226.0	36660	36434.0	2.8408566	7.7578860	1212.121835
femexp	4	40	69.57500	9.162038	72.0	70.00000	9.63690	53.0	82	29.0	-0.3919531	-1.1387802	1.448646
maleexp	5	40	64.50000	7.421383	66.0	64.87500	8.15430	50.0	76	26.0	-0.3778010	-0.9145029	1.173424

Histograms

Note. lifeexp = Life Expectancy, ppltv = People per TV, pplmd = People per Doctor

tv.df.melt <- melt(tv.df, measure.vars = c("lifeexp", "ppltv", "pplmd"))
tv.df.melt$value <- as.numeric(tv.df.melt$value)

ggplot(tv.df.melt, aes(x = value, fill = variable, color = variable)) + 
  geom_histogram(position = "identity", alpha = 0.75, bins = 10) + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none")

2. Generate a scatterplot of lifeexp by pplmd and ppltv, the predictors of interest. Describe the relationships observed. Why are these relationships problematic?

Due to the high degree of positive skew in the frequency distributions for both people per doctor and people per television, the relationships between life expectancy and the number of people per doctor and per television, respectively, were highly nonlinear, and approximated an exponential decay function (see plot below).

tv.df.melt2 <- melt(tv.df, measure.vars = c("ppltv", "pplmd"))
ggplot(tv.df.melt2, aes(x = value, y = lifeexp, color = variable)) + 
  geom_point(size = 2) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none") + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85)

3. Conduct log transformations of ppltv and pplmd, the predictors of interest. Re-run the scatterplots using the log-transformed values. Are the problems you identified in (2) resolved?

Following log transformation, the frequency distributions for both the number of people per television and per doctor better approximated a normal distribution. As a result, the relationships between life expectancy and the log-transformed number of people per television and log-transformed number of people per doctor were roughly linear. Both relationships appeared negative, such that life expectancy appeared to decrease linearly as the number of people per television and per doctor increased, respectively (see plot below).

# log transform
tv.df$pplmd_log <- log(tv.df$pplmd)
tv.df$ppltv <- as.numeric(tv.df$ppltv)
tv.df$ppltv_log <- log(tv.df$ppltv)

tv.df.melt3 <- melt(tv.df, measure.vars = c("ppltv_log", "pplmd_log"))
ggplot(tv.df.melt3, aes(x = value, y = lifeexp, color = variable)) + 
  geom_point(size = 2) + 
  facet_wrap(~variable, scales="free") + 
  theme(legend.position = "none") + 
  scale_color_viridis_d(begin = 0.5, end = 0.85) + 
  scale_fill_viridis_d(begin = 0.5, end = 0.85)

4. Run a simple regression analysis predicting lifeexp from log(ppltv) and summarize the results, including interpretations of R2, the model F-test, and the intercept and slope coefficients. Venture an explanation for your findings.

Linear regression revealed a strong negative association between the log-transformed people per TV variable and life expectancy (b = -4.26, SE = 0.43, p < 0.0001), such that a one-unit increase in the log-transformed people per TV variable predicted a 4.26 year decrease in life expectancy. The predicted life expectancy for a country with 0 log-transformed people per TV was 77.89 (SE = 1.22). Model R² indicated that the log-transformed number of people per television explained 73.12% of the total variance in life expectancy across all countries, while the model F-test revealed that the log-transformed number of people per television explained variance in life expectancy better than an intercept-only model (F[1,36] = 97.94, p < 0.0001).

Regression Output

md.1 <- lm(lifeexp ~ 1 + ppltv_log, data = tv.df)
md.1.sum <- summary(md.1)
print(md.1.sum)

## 
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log, data = tv.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.531  -2.567   0.284   2.175  11.280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.8873     1.2203  63.827  < 2e-16 ***
## ppltv_log    -4.2597     0.4304  -9.896  8.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.101 on 36 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7312, Adjusted R-squared:  0.7238 
## F-statistic: 97.94 on 1 and 36 DF,  p-value: 8.201e-12

5. Add log(pplmd) as an additional predictor. Interpret R2, the change in R2, the model F-test, and the intercept and slope coefficients. Why is the effect of log(ppltv) smaller than in (4)? Venture an explanation for why the effect of log(ppltv) is larger than the effect of log(pplmd)).

In the two-predictor model, both the log-transformed number of people per doctor and log-transformed number of people per television were significant negative predictors of life expectancy. The intercept for life expectancy was 90.62, indicating that the predicted life expectancy for a hypothetical country with zero log-transformed people per doctor and per television, respectively, was 90.62 years of age. The coefficient for the log-transformed number of people per television remained negative as in the univariate model (b = -2.92, SE = 0.59, p < 0.001), and the slope coefficient indicated that when controlling for the log-transformed number of people per doctor, a one-unit increase in the log-transformed number of people per television predicted a 2.92 year decrease in life expectancy. The coefficient for the log-transformed number of people per doctor was also negative (b = -2.26, SE = 0.74, p < 0.01), such that when controlling for the lo og-transformed number of people per television, a one-unit increase in the log-transformed number of people per doctor predicted a 2.26 year decrease in life expectancy.

After adding the log-transformed number of people per doctor to the linear regression model, the model R² increased from 0.7312 to 0.7868, which indicates that the log-transformed number of people per doctor explained an additional 5.56% of the variance in life expectancy beyond that explained by the log-transformed number of people per television. The model F-test indicates that the log-transformed number of people per television and log-transformed number of people per doctor jointly explained variance in life expectancy better than an intercept-only model (F[2,35] = 64.6, p < 0.0001).

The effect of the log-transformed number of people per television was smaller in the two-predictor model, likely due to the fact that both the log-transformed number of people per television and per doctor, respectively, explain and compete for similar variance in life expectancy. Both variables covary, and therefore hold similar negative relationships with life expectancy. Thus, with the addition of the log-transformed number of people per doctor predictor, the coefficient for the log-transformed number of people per television, which is now a partial effect in this model, decreases.

Relative to the effect of log-transformed people per TV, the effect of log-transformed people per doctor is smaller. In theory, this may be due to the fact that doctors cannot act on the population to increase life expectancy unless the population can afford and thus access their care. It is reasonable to assume that the inverse number of people per television indicates the degree of wealth in each country, and in the absence of sufficient wealth, the effect of the presence of doctors on life expectancy cannot come to bear. Alternatively, the inverse number of people per television and per doctor may both be proxies for wealth, which predicts life expectancy, and it is also possible that people per television is more related to the third variable of wealth.

Regression Output

md.2 <- lm(lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)

md.2.sum <- summary(md.2)
print(md.2.sum)

## 
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7173 -2.7718  0.9026  2.9923  5.8553 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  90.6222     4.3557  20.806  < 2e-16 ***
## ppltv_log    -2.9156     0.5907  -4.936 1.95e-05 ***
## pplmd_log    -2.2589     0.7474  -3.022  0.00467 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.704 on 35 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7868, Adjusted R-squared:  0.7747 
## F-statistic:  64.6 on 2 and 35 DF,  p-value: 1.788e-12

# md.2.sum$r.squared - md.1.sum$r.squared

6. Suppose that you have data on life expectancy not only in 1993, but 5 repeated measures at 5-year intervals (1993, 1998, 2003, 2008, 2013) for each country.

a. Explain specifically why it would be a bad idea to run another standard GLM, simply including time as another predictor?

With repeated measures occurring within each country, there would be a dependency between life expectancy values recorded across time in each country consistent with nesting. This would violate the GLM assumption that observations are independent. Thus a standard GLM would poorly model the data and standard errors for resultant regression coefficients would likely be inflated.

b. Describe how the different analysis approaches for hierarchical data discussed in class

could be applied to this data. What would be the pros and cons of each approach?

One solution to the nesting conundrum would be to adjust the results of the analysis, mainly the standard errors for resultant regression coefficients. While this approach would improve estimates of the error in regression coefficients, it would preclude comparison of effects across countries in the sample. Another solution would be to add a fixed-effect term to the model that corresponds to a dummy coded variable representing country. This approach accommodates data such as these, where groups (i.e., countries) are static across the population, though it would also preclude the researcher from elucidating differences in effects within- and between-countries, and also from comparing these effects across countries. The most appropriate approach would be to specify a random-effects model in which country is specified as a nesting term. This approach would allow the researcher to explore differences in effects both within- and between-countries in the sample, but might present specification challenges and convergence issues with such a small sample size.

c. For the random effects approach, set up Level-1, Level-2, and Reduced-Form equations

to summarize the hierarchical model you would test, using appropriate symbols and subscripts. Indicate which symbols represent which variables in your dataset.

Level 1 Equation

where x_ij represents ppltv scores over time within country, β_0j represents the intercept, β_1j represents the fixed effect of ppltv scores on life expectancy, and e_ij represents individual differences in life expectancy.

Level 2 Equations

where γ₀₀ represents the estimated grand mean life expectancy, and u_ij represents group differences in life expectancy.

where γ₀₁ represents the estimated fixed effect of ppltv scores on life expectancy.

Reduced Form Equation

PSY634 - HLM - HW #1

William Villano

1/28/2022

Instructions

Setup

Problem Set

1. Describe the distributions of lifeexp, ppltv, and pplmd, including means, SD, skew, and kurtosis. Include histograms to visualize the data.

Descriptive Statistics

Histograms

2. Generate a scatterplot of lifeexp by pplmd and ppltv, the predictors of interest. Describe the relationships observed. Why are these relationships problematic?

3. Conduct log transformations of ppltv and pplmd, the predictors of interest. Re-run the scatterplots using the log-transformed values. Are the problems you identified in (2) resolved?

4. Run a simple regression analysis predicting lifeexp from log(ppltv) and summarize the results, including interpretations of R2, the model F-test, and the intercept and slope coefficients. Venture an explanation for your findings.

Regression Output

5. Add log(pplmd) as an additional predictor. Interpret R2, the change in R2, the model F-test, and the intercept and slope coefficients. Why is the effect of log(ppltv) smaller than in (4)? Venture an explanation for why the effect of log(ppltv) is larger than the effect of log(pplmd)).

Regression Output

6. Suppose that you have data on life expectancy not only in 1993, but 5 repeated measures at 5-year intervals (1993, 1998, 2003, 2008, 2013) for each country.

a. Explain specifically why it would be a bad idea to run another standard GLM, simply including time as another predictor?

b. Describe how the different analysis approaches for hierarchical data discussed in class

c. For the random effects approach, set up Level-1, Level-2, and Reduced-Form equations