Assignment for October 21

Using the data set you chose last week, pick a continuous variable as your dependent variable and conduct a set of regression analyses. Make sure to: include an interaction effect between two independent variables in at least one of your regression models; discuss your regression results, including the interaction terms, in the same way we did in class yesterday; and, to make sure that your interpretation of the interaction effects is correct, conduct a group-wise (i.e., grouped by the two independent variables involved in the interaction) summary of the dependent variable (using tidyverse), similar to what we did yesterday, and verify the summary statistics against your regression results. Use visreg to plot your results (especially the interaction effect). Does your visreg plot agree with your interpretation of the regression results?

Introduction

In this homework, I’m interested in the relationship between the prestige of an applicant’s undergraduate school and the applicant’s grade point average (GPA). Is GPA a good predictor of school prestige? And does that relationship change depending on whether or not the applicant was admitted to graduate school?

(Caveat: I’m not sure that this is the ideal dataset for this week’s homework, but it is the one I used last week, so I’m stuck with it. Realistically, the binary “admit” is the better outcome variable, but “rank” will have to suffice, since it is a 1-to-4 score that can be treated as continuous. I will definitely be switching datasets for future assignments so that I have more options.)

Data

I used the UCLA dataset on graduate school admissions, downloaded from https://stats.idre.ucla.edu/stat/data/binary.csv.

This data set has one binary outcome variable, “admit”, coded 0 (not admitted) or 1 (admitted), and two continuous independent variables, “gre” and “gpa”. The variable “rank” is a measure of the prestige of the applicant’s undergraduate school. Last week, I treated this variable as categorical, from most (1) to least prestigious (4). For the purpose of this assignment, I will treat rank as a continuous outcome variable, even though this is not an ideal use of this dataset.

Results

library(readr)
gradData <- read_csv("/Users/meredithpowers/Desktop/ucla-grad.csv")
tibble::glimpse(gradData)

## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre   <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa   <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank  <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2,...

m1 <- lm(rank ~ gpa, data = gradData)
m2 <- lm(rank ~ gpa + admit, data = gradData)
m3 <- lm(rank ~ gpa * admit, data = gradData)

library(texreg)

## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").

htmlreg(list(m1, m2, m3))

Statistical models
	Model 1	Model 2	Model 3
(Intercept)	2.97^***	2.76^***	2.67^***
	(0.42)	(0.41)	(0.50)
gpa	-0.14	-0.04	-0.01
	(0.12)	(0.12)	(0.15)
admit		-0.49^***	-0.18
		(0.10)	(0.92)
gpa:admit			-0.09
			(0.27)
R²	0.00	0.06	0.06
Adj. R²	0.00	0.05	0.05
Num. obs.	400	400	400
RMSE	0.94	0.92	0.92
*** p < 0.001, ** p < 0.01, * p < 0.05

Model 1 shows that a one-unit increase in GPA is associated with a 0.14-unit decrease in rank score. Remember that the higher the rank value is, the less prestigious the school, so, to rephrase in a more human-friendly way, a GPA that is one unit higher is related to a 0.14-unit increase in prestige. This coefficient is not statistically significant, however (its standard error is 0.12), and the R² of 0.00 indicates that GPA alone explains essentially none of the variation in rank.
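One way to check the significance claim for Model 1 is to rebuild the test by hand. The sketch below is only an approximation: it uses the rounded estimate and standard error from the table (-0.14 and 0.12) and assumes 398 residual degrees of freedom (400 observations minus two estimated coefficients). The variable names are illustrative.

```r
# Rounded values copied from the Model 1 column of the regression table.
est <- -0.14          # gpa coefficient
se <- 0.12            # its standard error
df_resid <- 400 - 2   # assumption: 400 observations, 2 estimated coefficients

# t statistic and two-sided p-value for H0: the gpa coefficient is zero.
t_stat <- est / se
p_value <- 2 * pt(-abs(t_stat), df_resid)

# Approximate 95% confidence interval for the gpa coefficient.
ci_95 <- est + c(-1, 1) * qt(0.975, df_resid) * se

print(round(c(t = t_stat, p = p_value, lower = ci_95[1], upper = ci_95[2]), 3))
```

Because the interval runs from a negative to a positive value, the data are compatible with no relationship at all. Exact figures would come from summary(m1) and confint(m1) rather than from the rounded table entries.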

Model 2 shows that after controlling for admission status, the association between GPA and rank shrinks toward zero, from -0.14 to -0.04, and remains non-significant. Model 2 also shows that admitted applicants (admit = 1) have a rank score 0.49 lower than non-admitted applicants with the same GPA. That is to say, admitted students attended schools about half a rank unit more prestigious, and this result is statistically significant (p < 0.001).

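In Model 3, neither the newly included interaction term between GPA and admission status nor the main effects are statistically significant, but the coefficients still suggest a few things. For non-admitted applicants (admit = 0), the GPA slope is -0.01; for admitted applicants it is -0.01 + (-0.09) = -0.10, so GPA is more strongly associated with prestige among those admitted than among those not admitted. Note that the admit coefficient (-0.18) is now the gap between the two groups at a GPA of 0, which lies far outside the observed GPA range (2.26 to 4.00). Within the observed range, those admitted are still associated with higher prestige (i.e., a lower rank score) than those not admitted. Adding the interaction does not improve the fit: R² stays at 0.06.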
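The interaction coefficients can be turned into group-specific slopes and gaps with a little arithmetic. The sketch below uses the rounded Model 3 estimates from the table, so the results are approximate. `predict_rank` is a hypothetical helper written for this illustration, not part of any package.

```r
# Rounded coefficients copied from the Model 3 column of the regression table.
b <- c(intercept = 2.67, gpa = -0.01, admit = -0.18, gpa_admit = -0.09)

# Slope of rank on GPA within each admission group.
slope_not_admitted <- b[["gpa"]]
slope_admitted <- b[["gpa"]] + b[["gpa_admit"]]

# Hypothetical helper: predicted rank from the Model 3 equation.
predict_rank <- function(gpa, admit) {
  b[["intercept"]] + b[["gpa"]] * gpa +
    b[["admit"]] * admit + b[["gpa_admit"]] * gpa * admit
}

# Predicted gap (admitted minus non-admitted) at two illustrative GPA values.
gap_at_3 <- predict_rank(3, 1) - predict_rank(3, 0)
gap_at_4 <- predict_rank(4, 1) - predict_rank(4, 0)

print(c(slope_not_admitted = slope_not_admitted, slope_admitted = slope_admitted,
        gap_at_3 = gap_at_3, gap_at_4 = gap_at_4))
```

With these rounded values, the GPA slope is about -0.01 for non-admitted and about -0.10 for admitted applicants, and the predicted gap between the groups widens from roughly -0.45 at a GPA of 3 to roughly -0.54 at a GPA of 4. Both gaps bracket the -0.49 estimated in Model 2, which is what one would expect when the interaction is small.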

library(visreg)
visreg(m1, "gpa")

This plot of the relationship between GPA and rank agrees with the interpretation of Model 1, in that you can see a slight decline in rank score as GPA gets higher. (Remember: the lower the rank score, the higher the prestige of the school.) The slope of the line is far from dramatic, which is consistent with the lack of statistical significance.

library(visreg)
visreg(m2, "gpa", by = "admit")

This plot, which displays the relationship between GPA and rank split by admission status, more clearly demonstrates the effect of the “admit” variable. Much like in Model 1, the slopes of the two lines are fairly flat (and identical to each other, because Model 2 has no interaction term), but the difference between non-admitted (0) and admitted (1) students is clearly pronounced. Students who were admitted to grad school had attended an undergrad school with significantly higher prestige than students who were not admitted.

library(visreg)
visreg(m3, "gpa", by = "admit")

The final plot, for Model 3, displays the interaction between GPA and admission status and is similar to, yet distinct from, the Model 2 plot. Again, the interaction is not statistically significant, but the line for admitted students is somewhat steeper in Model 3 than in Model 2. This suggests that the relationship between GPA and rank may differ slightly by admission status, although the effect is small and cannot be distinguished from zero in these data.

Discussion

Realistically, this was not the best use of this dataset. The prestige (rank) of the undergraduate institution would best be used as a predictor of admission status, and not as an outcome variable. However, it is still possible to look at the interaction between GPA and whether or not students were admitted to graduate school in the context of their undergraduate school prestige. For example, one might expect students at a higher-prestige school to be more academically motivated, and more likely to achieve higher GPAs and gain admission to graduate school, than students at lower-prestige schools. The data bear this out only partly: the link between admission and prestige is statistically significant, while the link between GPA and prestige is not, though its direction is consistent with this expectation.
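For comparison, the more natural model described above, with admission as the outcome and rank as a predictor, could be sketched as a logistic regression. This is an illustration rather than part of the assignment: `fit_admit_model` is a hypothetical helper, the decision to include gre and to treat rank as a factor is an assumption, and the function expects a data frame with the same four columns as gradData.

```r
# Hypothetical helper: logistic regression of admission on GRE, GPA and school rank.
# Assumes `df` has columns admit (0/1), gre, gpa and rank (1 to 4), like gradData.
fit_admit_model <- function(df) {
  glm(admit ~ gre + gpa + factor(rank), data = df, family = binomial)
}

# Only runs when the data from the Results section has already been loaded.
if (exists("gradData")) {
  admit_model <- fit_admit_model(gradData)
  print(summary(admit_model))
}
```

Treating rank as a factor avoids assuming that each step in prestige has the same effect on admission, at the cost of estimating three coefficients instead of one.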

As for real-world relevance, undergraduate institutional prestige is often correlated with an institution’s selectivity, and institutional selectivity is often linked to perpetuating income inequality. Numerous studies over the years have pointed to a clear positive link between education and income, and yet income inequality persists even as more individuals attend college (see https://nces.ed.gov/fastfacts/display.asp?id=372). How, then, does income inequality persist? This UCLA dataset is too small and too specific to tease out answers to this question, but it is at least consistent with a pipeline problem: the conflation of undergraduate prestige and selectivity can lead to fewer low-income students attending prestigious undergraduate schools, and fewer still attending grad school, regardless of how well they performed in undergrad.

Interpretation Validation

I’m not sure how to compress the list here (I tried a few commands, but each one either broke it or failed to have any effect I could see), so the table is quite long. Because GPA is continuous, grouping by it produces one row per distinct GPA value, and many cells are NA or rest on very few observations. The full table is below; however, a quick glance at snippets near the top and bottom can more or less validate my interpretation of the models and plots in the Results section. For example, 2.42 is one of the lowest GPA values, yet the admitted group shows a mean rank of 1 (high-prestige undergrad) and the non-admitted group shows a mean rank of 2 (lower in prestige). Similarly, at a GPA of 3.95, the non-admitted group shows a mean rank of 4 (lowest prestige), while the admitted group shows a mean rank of 2.5. Again, these results are neither conclusive nor statistically significant, but they point to the same trends that we observed in the Results section.

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 3.4.4

## Warning: package 'tidyr' was built under R version 3.4.4

## Warning: package 'purrr' was built under R version 3.4.4

## Warning: package 'dplyr' was built under R version 3.4.4

## Warning: package 'stringr' was built under R version 3.4.4

library(knitr)
gradData %>%
  group_by(gpa, admit) %>%
  summarize(mean_rank = mean(rank)) %>%
  spread(admit, mean_rank) %>%
  kable()

gpa	0	1
2.26	4.000000	NA
2.42	2.000000	1.000000
2.48	4.000000	NA
2.52	2.000000	NA
2.55	1.000000	NA
2.56	3.000000	NA
2.62	2.000000	2.000000
2.63	2.000000	NA
2.65	NA	3.000000
2.67	3.000000	2.000000
2.68	NA	3.000000
2.69	2.000000	NA
2.70	2.500000	NA
2.71	2.500000	NA
2.73	2.000000	NA
2.76	2.000000	NA
2.78	2.500000	NA
2.79	3.000000	NA
2.81	3.000000	1.000000
2.82	4.000000	NA
2.83	3.000000	NA
2.84	NA	2.000000
2.85	2.500000	NA
2.86	4.000000	4.000000
2.87	2.000000	NA
2.88	2.000000	NA
2.90	2.500000	NA
2.91	4.000000	2.000000
2.92	3.500000	NA
2.93	3.200000	NA
2.94	2.500000	2.000000
2.95	2.000000	NA
2.96	1.000000	3.000000
2.97	4.000000	2.000000
2.98	2.250000	1.500000
3.00	3.000000	2.666667
3.01	3.500000	NA
3.02	2.333333	1.000000
3.03	3.000000	NA
3.04	2.000000	NA
3.05	2.000000	2.000000
3.06	2.000000	NA
3.07	2.250000	NA
3.08	2.750000	NA
3.09	4.000000	NA
3.10	4.000000	NA
3.11	2.000000	NA
3.12	2.000000	3.000000
3.13	2.500000	2.000000
3.14	2.000000	2.000000
3.15	3.200000	2.000000
3.16	2.000000	NA
3.17	2.333333	1.500000
3.18	NA	2.000000
3.19	3.000000	3.500000
3.20	1.000000	2.000000
3.21	4.000000	NA
3.22	1.333333	1.500000
3.23	4.000000	3.500000
3.24	4.000000	NA
3.25	2.000000	NA
3.27	2.500000	2.000000
3.28	2.000000	NA
3.29	2.500000	NA
3.30	1.666667	2.000000
3.31	2.625000	NA
3.32	2.333333	2.000000
3.33	3.200000	NA
3.34	2.800000	NA
3.35	2.666667	2.000000
3.36	2.333333	1.000000
3.37	4.000000	1.500000
3.38	2.750000	3.000000
3.39	4.000000	2.500000
3.40	2.714286	NA
3.41	4.000000	NA
3.42	NA	2.000000
3.43	2.750000	2.000000
3.44	2.500000	2.000000
3.45	3.250000	2.333333
3.46	3.666667	2.000000
3.47	3.000000	2.500000
3.48	2.500000	2.000000
3.49	3.000000	1.666667
3.50	2.000000	3.000000
3.51	2.200000	NA
3.52	3.000000	4.000000
3.53	4.000000	1.000000
3.54	2.000000	1.000000
3.55	NA	4.000000
3.56	NA	1.666667
3.57	2.666667	NA
3.58	2.000000	1.000000
3.59	2.750000	2.000000
3.60	2.000000	3.000000
3.61	3.000000	1.000000
3.62	3.500000	NA
3.63	2.666667	2.333333
3.64	3.000000	2.000000
3.65	2.000000	4.000000
3.66	NA	1.000000
3.67	2.500000	2.500000
3.69	2.000000	3.000000
3.70	2.000000	2.500000
3.71	NA	2.500000
3.72	2.000000	NA
3.73	2.000000	NA
3.74	4.000000	2.333333
3.75	2.000000	2.000000
3.76	3.000000	3.000000
3.77	3.250000	2.000000
3.78	3.000000	2.000000
3.80	2.000000	3.000000
3.81	2.000000	1.000000
3.82	3.000000	NA
3.83	2.000000	NA
3.84	3.000000	2.000000
3.85	NA	3.000000
3.86	3.000000	2.000000
3.87	4.000000	NA
3.88	3.500000	2.000000
3.89	2.500000	1.000000
3.90	1.500000	3.000000
3.91	3.000000	NA
3.92	3.000000	NA
3.93	2.000000	NA
3.94	3.000000	2.000000
3.95	4.000000	2.500000
3.97	1.000000	NA
3.98	NA	2.000000
3.99	3.000000	3.000000
4.00	2.400000	1.769231
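One possible way to shorten this table is to bin GPA into bands before grouping. The sketch below is an assumption about what a compact version could look like, not output from this analysis: `summarize_by_gpa_band` is a hypothetical helper, the half-point break points are an arbitrary choice, and it keeps the spread() call used above for consistency.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: mean rank by GPA band and admission status.
# The default breaks are arbitrary; they cover the observed GPA range (2.26 to 4.00).
summarize_by_gpa_band <- function(df, breaks = c(2, 2.5, 3, 3.5, 4)) {
  df %>%
    mutate(gpa_band = cut(gpa, breaks = breaks, include.lowest = TRUE)) %>%
    group_by(gpa_band, admit) %>%
    summarize(mean_rank = mean(rank)) %>%
    ungroup() %>%
    spread(admit, mean_rank)
}

# Only runs when the data from the Results section has already been loaded.
if (exists("gradData")) {
  print(summarize_by_gpa_band(gradData))
}
```

With four bands and two admission groups, the result has at most eight cells, which makes it easier to compare against the Model 3 slopes: if the interaction interpretation holds, the gap between the 0 and 1 columns should grow somewhat as the GPA band increases.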