The purpose of this assignment is to review general linear models (i.e. regression) and start thinking about alternative approaches to the analysis of hierarchical data. The data (televisions.dat) for this assignment is non-nested, comes from 40 countries, and was obtained from The World Almanac and Book of Facts 1993.
Read in data, name variables, inspect dataframe, missing values to NA
## read in data
tv.df <- read.table("C:/Users/User/Downloads/Villano_HW1/televisions.dat", quote="\"", comment.char="")
## assign column names
# country = Country name
# lifeexp = Average life expectancy in years
# ppltv = Number of people per television
# pplmd = Number of people per physician
# femexp = Life expectancy for women
# maleexp = Life expectancy for men
colnames(tv.df) <- c("country", "lifeexp", "ppltv", "pplmd", "femexp", "maleexp")
## inspect
head(tv.df)
## country lifeexp ppltv pplmd femexp maleexp
## 1 Argentina 70.5 4.0 370 74 67
## 2 Bangladesh 53.5 315.0 6166 53 54
## 3 Brazil 65.0 4.0 684 68 62
## 4 Canada 76.5 1.7 449 80 73
## 5 China 70.0 8.0 643 72 68
## 6 Colombia 71.0 5.6 1551 74 68
# replace "." with NA
tv.df$ppltv[which(tv.df$ppltv == ".")] <- NA
tv.df$lifeexp <- as.numeric(tv.df$lifeexp)
tv.df$ppltv <- as.numeric(tv.df$ppltv)
tv.df$pplmd <- as.numeric(tv.df$pplmd)
tv.df$femexp <- as.numeric(tv.df$femexp)
tv.df$maleexp <- as.numeric(tv.df$maleexp)
The average life expectancy for all countries in the sample was 67.04 (SD = 8.25). Life expectancy was negatively skewed from normal (Skewness = -0.38) and minimally kurtotic (Kurtosis = -1.05), and ranged from 51.5 years to 79 years (range = 27.5). The average number of people per television (termed “people per TV”) was 51.98 (SD = 130.35). People per TV was highly positively skewed (Skewness = 3.14), leptokurtotic (Kurtosis = 8.97), and consisted primarily of smaller values with three high outliers: people per TV ranged from 1.3 to 592 (range = 590.7), while the median number of people per television was 6.3. Two people per TV values were missing (Tanzania and Zaire) and were thus excluded. The average number of people per doctor was 3997.65 (SD = 7666.13). Like the number of people per television, the number of people per doctor was highly positively skewed (Skewness = -0.38) and leptokurtotic (Kurtosis = 7.76). The median number of doctors per person was 990.5, with values ranging from 226 to 36660 doctors per person (range = 36434).
See below for full descriptive statistics for all variables, and figure 1.2 for histograms depicting the frequency distributions for all variables.
kable(as.data.frame(describe(tv.df[,2:ncol(tv.df)])))
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lifeexp | 1 | 40 | 67.03750 | 8.248844 | 69.5 | 67.40625 | 8.15430 | 51.5 | 79 | 27.5 | -0.3809784 | -1.0545521 | 1.304257 |
| ppltv | 2 | 38 | 51.97895 | 130.347782 | 6.3 | 17.51250 | 6.07866 | 1.3 | 592 | 590.7 | 3.1464570 | 8.9732995 | 21.145202 |
| pplmd | 3 | 40 | 3997.65000 | 7666.131601 | 990.5 | 1915.03125 | 919.95330 | 226.0 | 36660 | 36434.0 | 2.8408566 | 7.7578860 | 1212.121835 |
| femexp | 4 | 40 | 69.57500 | 9.162038 | 72.0 | 70.00000 | 9.63690 | 53.0 | 82 | 29.0 | -0.3919531 | -1.1387802 | 1.448646 |
| maleexp | 5 | 40 | 64.50000 | 7.421383 | 66.0 | 64.87500 | 8.15430 | 50.0 | 76 | 26.0 | -0.3778010 | -0.9145029 | 1.173424 |
Note. lifeexp = Life Expectancy, ppltv = People per TV, pplmd = People per Doctor
tv.df.melt <- melt(tv.df, measure.vars = c("lifeexp", "ppltv", "pplmd"))
tv.df.melt$value <- as.numeric(tv.df.melt$value)
ggplot(tv.df.melt, aes(x = value, fill = variable, color = variable)) +
geom_histogram(position = "identity", alpha = 0.75, bins = 10) +
scale_color_viridis_d(begin = 0.5, end = 0.85) +
scale_fill_viridis_d(begin = 0.5, end = 0.85) +
facet_wrap(~variable, scales="free") +
theme(legend.position = "none")
Due to the high degree of positive skew in the frequency distributions for both people per doctor and people per television, the relationships between life expectancy and the number of people per doctor and per television, respectively, were highly nonlinear, and approximated an exponential decay function (see plot below).
tv.df.melt2 <- melt(tv.df, measure.vars = c("ppltv", "pplmd"))
ggplot(tv.df.melt2, aes(x = value, y = lifeexp, color = variable)) +
geom_point(size = 2) +
facet_wrap(~variable, scales="free") +
theme(legend.position = "none") +
scale_color_viridis_d(begin = 0.5, end = 0.85) +
scale_fill_viridis_d(begin = 0.5, end = 0.85)
Following log transformation, the frequency distributions for both the number of people per television and per doctor better approximated a normal distribution. As a result, the relationships between life expectancy and the log-transformed number of people per television and log-transformed number of people per doctor were roughly linear. Both relationships appeared negative, such that life expectancy appeared to decrease linearly as the number of people per television and per doctor increased, respectively (see plot below).
# log transform
tv.df$pplmd_log <- log(tv.df$pplmd)
tv.df$ppltv <- as.numeric(tv.df$ppltv)
tv.df$ppltv_log <- log(tv.df$ppltv)
tv.df.melt3 <- melt(tv.df, measure.vars = c("ppltv_log", "pplmd_log"))
ggplot(tv.df.melt3, aes(x = value, y = lifeexp, color = variable)) +
geom_point(size = 2) +
facet_wrap(~variable, scales="free") +
theme(legend.position = "none") +
scale_color_viridis_d(begin = 0.5, end = 0.85) +
scale_fill_viridis_d(begin = 0.5, end = 0.85)
Linear regression revealed a strong negative association between the log-transformed people per TV variable and life expectancy (b = -4.26, SE = 0.43, p < 0.0001), such that a one-unit increase in the log-transformed people per TV variable predicted a 4.26 year decrease in life expectancy. The predicted life expectancy for a country with 0 log-transformed people per TV was 77.89 (SE = 1.22). Model R2 indicated that the log-transformed number of people per television explained 73.12% of the total variance in life expectancy across all countries, while the model F-test revealed that the log-transformed number of people per television explained variance in life expectancy better than an intercept-only model (F[1,36] = 97.94, p < 0.0001).
md.1 <- lm(lifeexp ~ 1 + ppltv_log, data = tv.df)
md.1.sum <- summary(md.1)
print(md.1.sum)
##
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log, data = tv.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.531 -2.567 0.284 2.175 11.280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.8873 1.2203 63.827 < 2e-16 ***
## ppltv_log -4.2597 0.4304 -9.896 8.2e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.101 on 36 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7312, Adjusted R-squared: 0.7238
## F-statistic: 97.94 on 1 and 36 DF, p-value: 8.201e-12
In the two-predictor model, both the log-transformed number of people per doctor and log-transformed number of people per television were significant negative predictors of life expectancy. The intercept for life expectancy was 90.62, indicating that the predicted life expectancy for a hypothetical country with zero log-transformed people per doctor and per television, respectively, was 90.62 years of age. The coefficient for the log-transformed number of people per television remained negative as in the univariate model (b = -2.92, SE = 0.59, p < 0.001), and the slope coefficient indicated that when controlling for the log-transformed number of people per doctor, a one-unit increase in the log-transformed number of people per television predicted a 2.92 year decrease in life expectancy. The coefficient for the log-transformed number of people per doctor was also negative (b = -2.26, SE = 0.74, p < 0.01), such that when controlling for the lo og-transformed number of people per television, a one-unit increase in the log-transformed number of people per doctor predicted a 2.26 year decrease in life expectancy.
After adding the log-transformed number of people per doctor to the linear regression model, the model R2 increased from 0.7312 to 0.7868, which indicates that the log-transformed number of people per doctor explained an additional 5.56% of the variance in life expectancy beyond that explained by the log-transformed number of people per television. The model F-test indicates that the log-transformed number of people per television and log-transformed number of people per doctor jointly explained variance in life expectancy better than an intercept-only model (F[2,35] = 64.6, p < 0.0001).
The effect of the log-transformed number of people per television was smaller in the two-predictor model, likely due to the fact that both the log-transformed number of people per television and per doctor, respectively, explain and compete for similar variance in life expectancy. Both variables covary, and therefore hold similar negative relationships with life expectancy. Thus, with the addition of the log-transformed number of people per doctor predictor, the coefficient for the log-transformed number of people per television, which is now a partial effect in this model, decreases.
Relative to the effect of log-transformed people per TV, the effect of log-transformed people per doctor is smaller. In theory, this may be due to the fact that doctors cannot act on the population to increase life expectancy unless the population can afford and thus access their care. It is reasonable to assume that the inverse number of people per television indicates the degree of wealth in each country, and in the absence of sufficient wealth, the effect of the presence of doctors on life expectancy cannot come to bear. Alternatively, the inverse number of people per television and per doctor may both be proxies for wealth, which predicts life expectancy, and it is also possible that people per television is more related to the third variable of wealth.
md.2 <- lm(lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)
md.2.sum <- summary(md.2)
print(md.2.sum)
##
## Call:
## lm(formula = lifeexp ~ 1 + ppltv_log + pplmd_log, data = tv.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7173 -2.7718 0.9026 2.9923 5.8553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.6222 4.3557 20.806 < 2e-16 ***
## ppltv_log -2.9156 0.5907 -4.936 1.95e-05 ***
## pplmd_log -2.2589 0.7474 -3.022 0.00467 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.704 on 35 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.7747
## F-statistic: 64.6 on 2 and 35 DF, p-value: 1.788e-12
# md.2.sum$r.squared - md.1.sum$r.squared
With repeated measures occurring within each country, there would be a dependency between life expectancy values recorded across time in each country consistent with nesting. This would violate the GLM assumption that observations are independent. Thus a standard GLM would poorly model the data and standard errors for resultant regression coefficients would likely be inflated.
could be applied to this data. What would be the pros and cons of each approach?
One solution to the nesting conundrum would be to adjust the results of the analysis, mainly the standard errors for resultant regression coefficients. While this approach would improve estimates of the error in regression coefficients, it would preclude comparison of effects across countries in the sample. Another solution would be to add a fixed-effect term to the model that corresponds to a dummy coded variable representing country. This approach accommodates data such as these, where groups (i.e., countries) are static across the population, though it would also preclude the researcher from elucidating differences in effects within- and between-countries, and also from comparing these effects across countries. The most appropriate approach would be to specify a random-effects model in which country is specified as a nesting term. This approach would allow the researcher to explore differences in effects both within- and between-countries in the sample, but might present specification challenges and convergence issues with such a small sample size.
to summarize the hierarchical model you would test, using appropriate symbols and subscripts. Indicate which symbols represent which variables in your dataset.
Level 1 Equation
where xij represents ppltv scores over time within country, β0j represents the intercept, β1j represents the fixed effect of ppltv scores on life expectancy, and eij represents individual differences in life expectancy.
Level 2 Equations
where γ00 represents the estimated grand mean life expectancy, and uij represents group differences in life expectancy.
where γ01 represents the estimated fixed effect of ppltv scores on life expectancy.
Reduced Form Equation