The following four variables were measured at the student level (Level 1):
Minority - a dummy variable for student ethnicity (1 = minority, 0 = other)
Female - a dummy variable for student sex (1 = female, 0 = male)
Ses - a composite scale constructed from variables measuring parental education, occupation, and income
Mathach - a measure of mathematics achievement
The following six variables were measured at the school level (Level 2):
Size: school enrollment
Sector - a dummy variable for type of school (1 = Catholic, 0 = public)
Pracad - the proportion of students in the academic track
Disclim - a scale measuring disciplinary climate
Himnty - a dummy variable indicating minority enrollment (1 = more than 40% minority enrollment, 0 = less than 40%)
Meanses - the mean of the SES values for the students in a particular school who are included in the Level-1 data
Read in the HSB.csv file and save it in a data frame called HSB_Data, excluding the variables, pracad and disclim. Verify you have done this correctly by printing the first several lines of the data frame.
HSB_Data <- read.csv("~/Documents/HLM Course/data/HSB.csv")
HSB_Data <- subset(HSB_Data, select = -c(pracad, disclim)) #exclude the two variables
head(HSB_Data)
## School minority female ses mathach size sector himnty meanses
## 1 1224 0 1 -1.53 5.88 842 0 0 -0.43
## 2 1224 0 1 -0.59 19.71 842 0 0 -0.43
## 3 1224 0 0 -0.53 20.35 842 0 0 -0.43
## 4 1224 0 0 -0.67 8.78 842 0 0 -0.43
## 5 1224 0 0 -0.16 17.90 842 0 0 -0.43
## 6 1224 0 0 0.02 4.58 842 0 0 -0.43
Produce basic descriptive information for just these two variables: mathach and ses, using a single command.
describe(HSB_Data[,4:5]) #I indexed the data set to only include mathach and ses to only use a single command
## vars n mean sd median trimmed mad min max range skew
## ses 1 7185 0.00 0.78 0.00 0.02 0.85 -3.76 2.69 6.45 -0.23
## mathach 2 7185 12.75 6.88 13.13 12.92 8.12 -2.83 24.99 27.82 -0.18
## kurtosis se
## ses -0.38 0.01
## mathach -0.92 0.08
What is the overall correlation between mathach and ses?
cor(HSB_Data$mathach, HSB_Data$ses) #print the correlation between mathach and ses
## [1] 0.3607626
Produce a cross-classification table for female and minority. Make sure that the rows and columns of the table have appropriate labels (not just numbers).
counting_table <- table(HSB_Data$minority, HSB_Data$female)
rownames(counting_table) = c("Other", "Minority") #0 = other, 1 = minority
colnames(counting_table) = c("Male", "Female") #0 = male, 1 = female
counting_table #display the cross-classification table
##
## Male Female
## Other 2481 2730
## Minority 909 1065
Are the two variables in Question 4 independent of each other?
Answer: I chose to conduct a Chi-squared test of independence to test whether two categorical variables are related to each other. Since the Chi-squared test produced a non-significant p-value, female and minority are unrelated, and therefore, independent of each other. Short answer: Yes!
chisq.test(counting_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: counting_table
## X-squared = 1.34, df = 1, p-value = 0.247
Produce a histogram for ses. Include a blue vertical line indicating the mean, a red vertical line indicating the median, and the normal density curve (in green). Make sure the axes are appropriately labeled. Do the data seem to be normally distributed?
Answer: Yes, the data seem to be normally distributed, but the distribution is missing some extreme values on the right tail of the distribution.
ggplot(HSB_Data, aes(x = ses)) +
geom_histogram(aes(y = ..density..), binwidth = 0.2, color = "black", fill = "grey", size = 0.1, na.rm = TRUE) + #I made use the histogram plotted the density, not the frequency
coord_cartesian(xlim = c(-4,4)) +
scale_x_continuous(breaks = c(seq(-4,4,1))) +
xlab("SES") +
ylab("Density") +
theme_minimal() +
geom_vline(xintercept = mean(HSB_Data$ses),size=2,color="blue") + # Mean vertical line
geom_vline(xintercept = median(HSB_Data$ses),size=0.5,color="red", position_nudge(x = 0.5)) + # Median vertical line
stat_function(fun = dnorm, args = list(mean = mean(HSB_Data$ses), sd = sd(HSB_Data$ses)), col = "green", size = 1) + # Plotting the normal density curve
annotate("segment",x=2,xend=2.5,y=0.4,yend=0.4,color="blue",linetype=1,size=1.25) +
annotate("text",x=2.6,y=0.4,label="Mean",color="black",size=4,hjust=0) +
annotate("segment",x=2,xend=2.5,y=0.45,yend=0.45,color="red",linetype=1,size=1.25) +
annotate("text",x=2.6,y=0.45,label="Median",color="black",size=4,hjust=0) +
annotate("segment",x=2,xend=2.5,y=0.35,yend=0.35,color="green",linetype=1,size=1.25) +
annotate("text",x=2.6,y=0.35,label="Normal Density",color="black",size=4,hjust=0) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Distribution of SES Values in the HSB Dataset")
## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.
The mean and median overlap at SES = 0, but I did not know how to position the lines so that they weren’t overlapping.
Note on the graph: I did not know how to add a normal density curve, so I referenced this website: https://www.statology.org/overlay-normal-curve-histogram-in-r/.
I also used code from Dr. Strube’s R file on the Basic Two Level Model to annotate the figure.
Provide a formal test of normality for ses. The Shapiro-Wilk test is the most commonly used, but in R it has a sample size limit of 5000. Suggest a work-around for this limit. What does this test tell you about normality for ses?
Answer: Since the HSB_Data set has a large sample size, the sampling distribution is more likely to be a normal distribution based on the Central Limit Theorem. Therefore, I can make a Q-Q plot to draw comparisons to the sample distribution and the theoretical distribution. If the sample distribution is normal, then the sample data points should fall on or near the diagonal line of the Q-Q plot.
There are formal tests of normality for large sample sizes like the Anderson-Darling Test (see the code below), but as we talked about in class, one should always visually inspect their data by plotting histograms and Q-Q plots.
ad.test(HSB_Data$ses) #conducting the Anderson-Darling Test
##
## Anderson-Darling normality test
##
## data: HSB_Data$ses
## A = 12.836, p-value < 2.2e-16
I used Google to partially answer this question: https://universeofdatascience.com/how-to-assess-normality-in-r/
Produce a Q-Q plot for ses. This plot should include a 95% confidence band. What does this plot tell you about the distribution of ses?
Answer: The Q-Q plot tells us that there are less than a typical amount of values on the right end of the distribution.
ggpubr::ggqqplot(data = HSB_Data$ses, conf.int = TRUE) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Theoretical Distribution") +
ylab("Sample Distribution") +
ggtitle("Q-Q Plot for SES")
Produce a scatterplot of meanses (y axis) versus size (x axis). Make sure the axes are appropriately labeled. Color the plot symbols so that public schools are red and Catholic schools are blue. Add an overall best-fitting linear regression line and an overall loess (nonlinear) fit line (just one of each for all of the data, not separately for type of school). What does this figure tell you?
Answer: This figure tells me that there is a slight, negative linear relationship between school enrollment and SES values. The nonlinear fit line shows a slight, downward curvilinear relationship with the greatest slope between school enrollment size and SES present at the greatest school enrollment sizes.
ggplot(HSB_Data, aes(x = size, y = meanses)) +
geom_point(aes(color = factor(sector)), size = 2) +
scale_color_manual(values = c('red', 'blue'), labels=c('Public', 'Catholic')) +
scale_x_continuous(breaks = c(seq(0,3000,500))) +
scale_y_continuous(breaks = c(seq(-1.2,0.8,0.4)), labels = scales::label_number(accuracy = 0.01)) +
geom_smooth(method = lm, se = FALSE, color = "black", size = 1,
linetype = 1, na.rm = TRUE, formula = y ~ x) + # this is the linear line
geom_smooth(method = loess, se = FALSE, color = "black", size = 1,
linetype = 2, na.rm = TRUE, formula = y ~ x) + # this is the non-linear line
xlab("School Enrollment Size") +
ylab("Mean of SES Values") +
labs(color = "Sector") +
ggtitle("School Enrollment by Mean SES with Linear and Nonlinear Best Fit Lines")
Create a scatterplot and best-fitting line (simple linear model) relating mathach to ses. Include the 99% confidence interval around the line (consider mathach to be the outcome variable).
ggplot(HSB_Data, aes(x = ses, y = mathach)) +
geom_point(aes(x = ses, y = mathach), size = 1) +
scale_x_continuous(breaks = c(seq(-4.0,3,1)), labels = scales::label_number(accuracy = 1)) +
geom_smooth(method = lm, color = "blue", size = 2,
linetype = 1, na.rm = TRUE, level = 0.99, formula = y ~ x) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("SES") +
ylab("Math Achievement") +
ggtitle("The Relationship Between SES and Math Achievement")
Produce a two-panel plot. In the upper panel, show the boxplots of mathach separately for each public school. In the lower panel, show the boxplots of mathach separately for each Catholic school. (There will be a total of 160 school-level box plots in this two-panel figure).
# Separate the data into public and catholic schools to make separate graphs
HSB_Public_data <- HSB_Data[which(HSB_Data$sector==0),]
HSB_Catholic_data <- HSB_Data[which(HSB_Data$sector==1),]
# Make a boxplot graph with just the public school data
HSB_Public <- ggplot(HSB_Public_data, aes(x = factor(School), y = mathach)) +
geom_boxplot() +
xlab("Public Schools") +
ylab("Math Achievement") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, size = 5)) +
ggtitle("Match Achievement by School Type")
# Make a boxplot graph with just the catholic school data
HSB_Catholic <- ggplot(HSB_Catholic_data, aes(x = factor(School), y = mathach)) +
geom_boxplot() +
xlab("Catholic Schools") +
ylab("Math Achievement") +
theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, size = 6))
# Combine the two graphs
ggarrange(HSB_Public, HSB_Catholic, nrow = 2)
Starting with the data frame you created in Question 1, create a new data frame that contains school-level information (it will have only 160 lines). This data frame should have the Level 2 variables as well as the school averages for the Level 1 variables. Print the first 10 lines of this data frame.
HSB_Data_q12 <- HSB_Data %>%
group_by(School) %>%
summarize(mathach_mean = mean(mathach), ses_mean = mean(ses), female_mean = mean(female),
minority_mean = mean(minority), sector = mean(sector), size = mean(size),
himnty = mean(himnty)) #I will take the mean of all variables, even the level 2 variables of sector, size and himnty because they are the same values for each school
head(HSB_Data_q12, 10) #print just the first 10 lines
## # A tibble: 10 × 8
## School mathach_mean ses_mean female_mean minority_mean sector size himnty
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1224 9.72 -0.436 0.596 0.0851 0 842 0
## 2 1288 13.5 0.120 0.44 0.12 0 1855 0
## 3 1296 7.64 -0.428 0.646 0.979 0 1719 1
## 4 1308 16.3 0.526 0 0.4 1 716 0
## 5 1317 13.2 0.343 1 0.729 1 455 1
## 6 1358 11.2 -0.0217 0.367 0.1 0 1430 0
## 7 1374 9.73 -0.0146 0.464 0.321 0 2400 0
## 8 1433 19.7 0.71 0 0.0286 1 899 0
## 9 1436 18.1 0.561 0.341 0 1 185 0
## 10 1461 16.8 0.675 0.576 0.0303 0 1672 0
I used code from the Centering.R script in the supplementary materials to produce this data set.
Create a correlation matrix among all variables for the data frame created in Question 12.
round(cor(HSB_Data_q12),2)
## School mathach_mean ses_mean female_mean minority_mean sector
## School 1.00 -0.05 -0.02 0.04 -0.11 -0.04
## mathach_mean -0.05 1.00 0.78 -0.27 -0.49 0.45
## ses_mean -0.02 0.78 1.00 -0.15 -0.48 0.36
## female_mean 0.04 -0.27 -0.15 1.00 0.07 -0.03
## minority_mean -0.11 -0.49 -0.48 0.07 1.00 0.06
## sector -0.04 0.45 0.36 -0.03 0.06 1.00
## size 0.02 -0.10 -0.13 -0.04 0.12 -0.45
## himnty -0.16 -0.38 -0.41 0.09 0.84 0.05
## size himnty
## School 0.02 -0.16
## mathach_mean -0.10 -0.38
## ses_mean -0.13 -0.41
## female_mean -0.04 0.09
## minority_mean 0.12 0.84
## sector -0.45 0.05
## size 1.00 0.11
## himnty 0.11 1.00
Using the new data frame, conduct a multiple regression, predicting school-average math achievement from school-average SES, school-average female representation in the school, sector, and minority enrollment (himnty). Which predictors make unique and significant contributions to math achievement?
Answer: The results of the multiple regression show that school-average SES, school-average female representation, sector, and minority enrollment all make significant contributions to school-average math achievement.
model_q14 <- lm(mathach_mean ~ ses_mean + female_mean + sector + himnty, data = HSB_Data_q12)
summary(model_q14)
##
## Call:
## lm(formula = mathach_mean ~ ses_mean + female_mean + sector +
## himnty, data = HSB_Data_q12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8564 -1.1604 -0.0278 1.2189 4.3898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.2648 0.3481 38.106 < 2e-16 ***
## ses_mean 4.7802 0.4105 11.645 < 2e-16 ***
## female_mean -1.9322 0.5554 -3.479 0.000653 ***
## sector 1.4108 0.3103 4.546 1.09e-05 ***
## himnty -0.7935 0.3527 -2.250 0.025858 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.773 on 155 degrees of freedom
## Multiple R-squared: 0.6849, Adjusted R-squared: 0.6767
## F-statistic: 84.21 on 4 and 155 DF, p-value: < 2.2e-16
Add to the model in Question 14 all two-way interactions among the predictor variables. For each significant interaction, construct a figure (include 95% confidence bands) and provide a verbal description of what the interaction means.
Answer: There is only one significant interaction between ses_mean and sector. This interaction can be interpreted as the slope of SES and Math Achievement decreases by 1.89 for every one unit change in sector, controlling for all other predictors.
# Create a model with all two-way interactions
model_q15 <- lm(mathach_mean ~ ses_mean + female_mean + sector + himnty +
ses_mean*female_mean + #each interaction is listed below
ses_mean*sector +
ses_mean* himnty +
female_mean*sector +
female_mean*himnty +
sector*himnty, data = HSB_Data_q12)
summary(model_q15) # Full model is printed
##
## Call:
## lm(formula = mathach_mean ~ ses_mean + female_mean + sector +
## himnty + ses_mean * female_mean + ses_mean * sector + ses_mean *
## himnty + female_mean * sector + female_mean * himnty + sector *
## himnty, data = HSB_Data_q12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7917 -0.9978 0.0495 1.0608 4.3965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.20018 1.22696 9.943 < 2e-16 ***
## ses_mean 6.30877 0.97563 6.466 1.35e-09 ***
## female_mean 0.02605 2.34225 0.011 0.9911
## sector 2.72871 1.33026 2.051 0.0420 *
## himnty 0.32752 0.87824 0.373 0.7097
## ses_mean:female_mean -1.29175 1.38134 -0.935 0.3512
## ses_mean:sector -1.89470 0.85025 -2.228 0.0274 *
## ses_mean:himnty 0.51556 0.86551 0.596 0.5523
## female_mean:sector -1.84075 2.43347 -0.756 0.4506
## female_mean:himnty -0.56030 1.29932 -0.431 0.6669
## sector:himnty -1.38334 0.80294 -1.723 0.0870 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.765 on 149 degrees of freedom
## Multiple R-squared: 0.6997, Adjusted R-squared: 0.6796
## F-statistic: 34.72 on 10 and 149 DF, p-value: < 2.2e-16
# Graph the interaction between ses and sector below
HSB_Data_public <- HSB_Data_q12 %>% filter(sector == 0)
HSB_Data_catholic <- HSB_Data_q12 %>% filter(sector == 1)
ggplot(HSB_Data_public, aes(x = ses_mean, y = mathach_mean)) +
geom_point(size = 2, color = "grey") +
geom_smooth(method = lm, color = "black", size = 1, linetype = 1, formula = y ~ x) +
geom_point(data = HSB_Data_catholic, aes(x = ses_mean, y = mathach_mean), size = 2, color = "lightblue") +
geom_smooth(data = HSB_Data_catholic, aes(x = ses_mean, y = mathach_mean), method = lm, color = "black", size = 1, linetype = 2, formula = y ~ x) +
scale_x_continuous(breaks = c(seq(-1.2,1.2,0.4)), labels=c("-1.2","0.8","0.4","0","0.4","0.8","1.2"), limits = c(-1.2,1.0)) +
scale_y_continuous(breaks = c(seq(0,25,5)), labels = c("0","5","10","15","20","25"), limits = c(0,23)) +
theme_minimal() +
xlab("Mean SES") +
ylab("Mean Math Achievement") +
labs(color = "Sector") +
annotate("segment",x=.2,xend=.35,y=4,yend=4,color="black",linetype=1,size=1.25) +
annotate("text",x=.4,y=4,label="Public Schools",color="black",size=5,hjust=0) +
annotate("segment",x=.2,xend=.35,y=2,yend=2,color="black",linetype=2,size=1.25) +
annotate("text",x=.4,y=2,label="Catholic Schools",color="black",size=5,hjust=0) +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("The Interaction Between SES and Match Achievement")
I used some code from the Basic Two Level Model R script provided by Dr. Strube to graph and fit the data separately for Public and Catholic schools.