Imagine you have a dataset in which each observation is a school. The dataset has the following variables:
api00 – Academic Performance Index of the school in 2000
enroll – number of students at the school
meals – percentage of students at the school who get free meals
full – percentage of teachers at the school with full teaching credentials

We¹ ran a regression that included these four variables and got the results below.
## Call:
## lm(formula = api00 ~ enroll + meals + full)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
## enroll       -0.05146    0.01384  -3.719 0.000229 ***
## meals        -3.65973    0.10880 -33.639  < 2e-16 ***
## full          1.08109    0.23945   4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.8308
Question 1: What are the independent variables in the regression above? (give a list of variables)
Answer 1: enroll, meals, full
Question 2: What are the dependent variables in the regression above? (give a list of variables)
Answer 2: api00
Question 3: What is the equation predicted by the regression output above? (write out the equation)
Answer 3: \(api00=-0.05(enroll)-3.66(meals)+1.08(full)+801.83\)
Question 4: What does the regression model predict about a school with 100 students, 0 students who get free meals, and 0 teachers with full credentials? Show all of your work. (show your calculation and answer)
Answer 4: \(api00(100,0,0)=-0.05(100)-3.66(0)+1.08(0)+801.83 = -5 + 801.83 = 796.83 \approx 797\) (about 797 API points; using the unrounded coefficients from the table gives 796.68)
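The calculation in Answer 4 can be checked with a few lines of code. The following is a minimal sketch in Python (the regression output itself comes from R); the function name `predict_api00` is hypothetical, and the coefficients are copied unrounded from the table above.

```python
# Coefficients copied from the regression output above (unrounded).
INTERCEPT = 801.82983
B_ENROLL = -0.05146
B_MEALS = -3.65973
B_FULL = 1.08109


def predict_api00(enroll, meals, full):
    """Return the fitted api00 for a school with the given predictor values."""
    return INTERCEPT + B_ENROLL * enroll + B_MEALS * meals + B_FULL * full


# School from Question 4: 100 students, 0% free meals, 0% fully credentialed teachers.
prediction = predict_api00(enroll=100, meals=0, full=0)
print(round(prediction, 2))  # 796.68
```

With the rounded coefficients used in Answer 3, the same calculation gives 796.83; the two results differ only because -0.05146 was rounded to -0.05.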
Question 5: What does the number -0.05146 from the table (the coefficient of enroll) mean? (answer in a single full sentence)
Answer 5: For every additional student enrolled in the school, the school’s API is predicted to be lower by 0.05146 points, holding meals and full constant.²
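One way to see why this coefficient is a slope is to compare the predictions for two schools that differ by exactly one student but have the same values of meals and full. The notation \(\widehat{api00}\) for the predicted value is an assumption added here for illustration. The intercept, meals, and full terms are identical in both predictions and cancel, leaving only the enroll term:

\[
\widehat{api00}(enroll+1,\ meals,\ full) - \widehat{api00}(enroll,\ meals,\ full) = -0.05146\,(enroll+1) - (-0.05146)\,(enroll) = -0.05146
\]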
Question 6: Which variables in the model are statistically significant at the p < 0.05 level? (give a list of variables)
Answer 6: All independent variables
Question 7: What does it mean for a variable in a regression model to be statistically significant? (answer in a single full sentence)
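Answer 7: It means that, if the true slope (coefficient) relating the independent variable to the dependent variable were zero in the full population from which the sample (the data you have) was drawn, a slope at least as far from zero as the estimated one would be expected in fewer than 5% of samples, so we conclude that the relationship is unlikely to be due to sampling chance alone.³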
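The link between the columns of the coefficient table can be sketched in code: each t value is the estimate divided by its standard error, and the p-value measures how unusual that t value would be if the true slope were zero. The sketch below is in Python and rests on one loudly stated assumption: the regression output above does not show the residual degrees of freedom, so the p-values are computed from a standard normal approximation instead of the t distribution that R uses. They will therefore differ slightly from the printed ones. The helper names `t_value` and `two_sided_p_normal` are hypothetical.

```python
import math

# (estimate, standard error) pairs copied from the regression output above.
coefficients = {
    "enroll": (-0.05146, 0.01384),
    "meals": (-3.65973, 0.10880),
    "full": (1.08109, 0.23945),
}


def t_value(estimate, std_error):
    """t statistic for the null hypothesis that the true slope is zero."""
    return estimate / std_error


def two_sided_p_normal(t):
    """Two-sided p-value under a standard normal approximation.

    Assumption: the sample is large enough that the t distribution is
    close to normal; R itself uses the t distribution.
    """
    return math.erfc(abs(t) / math.sqrt(2))


results = {}
for name, (estimate, std_error) in coefficients.items():
    t = t_value(estimate, std_error)
    p = two_sided_p_normal(t)
    results[name] = (t, p, p < 0.05)
    print(f"{name}: t = {t:.3f}, approx. p = {p:.2g}, significant = {p < 0.05}")
```

All three approximate p-values fall well below 0.05, which agrees with Answer 6.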
Question 8: What do we know about the relationship between the actual values of the dependent variable in the dataset and the predicted/fitted values of the dependent variable in the regression model? (answer in 1–3 sentences)
Answer 8: We look at the \(R^2\) statistic for this, which is 0.83 in this regression. \(R^2\) is a measure of the goodness of fit of our regression model: the independent variables in this model account for 83% of the variation in the dependent variable. \(R^2\) is also related to the correlation between the actual and predicted values of the dependent variable in this regression: \(R = \sqrt{R^2} = \sqrt{0.83} \approx 0.91\). A correlation this high is rare in social science data analysis.
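The relationship between \(R^2\) and the correlation of actual and fitted values can be illustrated with a small sketch. The data below are hypothetical toy numbers, not the school dataset, and the fit uses a single predictor to keep the arithmetic short. The same identity holds for any least-squares regression that includes an intercept.

```python
import math

# Hypothetical toy data (not the school dataset): one predictor x, one outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]


def mean(values):
    return sum(values) / len(values)


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


# Ordinary least squares fit of y on x (closed form for one predictor).
mx, my = mean(x), mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx
fitted = [intercept + slope * xi for xi in x]

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - my) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Correlation between actual and fitted values.
r_actual_fitted = pearson_r(y, fitted)

print(round(r_squared, 6), round(r_actual_fitted ** 2, 6))  # the two values agree
print(round(math.sqrt(0.8308), 2))  # 0.91, the value reported in Answer 8
```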
¹ We didn’t actually do it. It came from this source: Introduction to Regression in R (Part 1, Simple and Multiple Regression). IDRE Statistical Consulting Group.
² This is the slope of the relationship between api00 and enroll.
³ The process of using the results of your statistical test or model (in this case a linear regression) to learn something about a broader population is called inference. We use standard errors, confidence intervals, and p-values to do inferential statistics.