Variables in the HSB Dataset for Reference

The following four variables were measured at the student level (Level 1):

Minority - a dummy variable for student ethnicity (1 = minority, 0 = other)
Female - a dummy variable for student sex (1 = female, 0 = male)
Ses - a composite scale constructed from variables measuring parental education, occupation, and income
Mathach - a measure of mathematics achievement

The following six variables were measured at the school level (Level 2):

Size: school enrollment
Sector - a dummy variable for type of school (1 = Catholic, 0 = public)
Pracad - the proportion of students in the academic track
Disclim - a scale measuring disciplinary climate
Himnty - a dummy variable indicating minority enrollment (1 = more than 40% minority enrollment, 0 = less than 40%)
Meanses - the mean of the SES values for the students in a particular school who are included in the Level-1 data

Loading packages

Question 1

Read in the HSB.csv file and save it in a data frame called HSB_Data, excluding the variables, pracad and disclim. Verify you have done this correctly by printing the first several lines of the data frame.

HSB_Data <- read.csv("~/Documents/HLM Course/data/HSB.csv")
HSB_Data <- subset(HSB_Data, select = -c(pracad, disclim)) #exclude the two variables

head(HSB_Data)

##   School minority female   ses mathach size sector himnty meanses
## 1   1224        0      1 -1.53    5.88  842      0      0   -0.43
## 2   1224        0      1 -0.59   19.71  842      0      0   -0.43
## 3   1224        0      0 -0.53   20.35  842      0      0   -0.43
## 4   1224        0      0 -0.67    8.78  842      0      0   -0.43
## 5   1224        0      0 -0.16   17.90  842      0      0   -0.43
## 6   1224        0      0  0.02    4.58  842      0      0   -0.43

Question 2

Produce basic descriptive information for just these two variables: mathach and ses, using a single command.

describe(HSB_Data[,4:5]) #I indexed the data set to only include mathach and ses to only use a single command

##         vars    n  mean   sd median trimmed  mad   min   max range  skew
## ses        1 7185  0.00 0.78   0.00    0.02 0.85 -3.76  2.69  6.45 -0.23
## mathach    2 7185 12.75 6.88  13.13   12.92 8.12 -2.83 24.99 27.82 -0.18
##         kurtosis   se
## ses        -0.38 0.01
## mathach    -0.92 0.08

Question 3

What is the overall correlation between mathach and ses?

cor(HSB_Data$mathach, HSB_Data$ses) #print the correlation between mathach and ses

## [1] 0.3607626

Question 4

Produce a cross-classification table for female and minority. Make sure that the rows and columns of the table have appropriate labels (not just numbers).

counting_table <- table(HSB_Data$minority, HSB_Data$female)

rownames(counting_table) = c("Other", "Minority") #0 = other, 1 = minority
colnames(counting_table) = c("Male", "Female") #0 = male, 1 = female

counting_table #display the cross-classification table

##           
##            Male Female
##   Other    2481   2730
##   Minority  909   1065

Question 5

Are the two variables in Question 4 independent of each other?

Answer: I chose to conduct a Chi-squared test of independence to test whether two categorical variables are related to each other. Since the Chi-squared test produced a non-significant p-value, female and minority are unrelated, and therefore, independent of each other. Short answer: Yes!

chisq.test(counting_table)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  counting_table
## X-squared = 1.34, df = 1, p-value = 0.247

Question 6

Produce a histogram for ses. Include a blue vertical line indicating the mean, a red vertical line indicating the median, and the normal density curve (in green). Make sure the axes are appropriately labeled. Do the data seem to be normally distributed?

Answer: Yes, the data seem to be normally distributed, but the distribution is missing some extreme values on the right tail of the distribution.

ggplot(HSB_Data, aes(x = ses)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.2, color = "black", fill = "grey", size = 0.1, na.rm = TRUE) + #I made use the histogram plotted the density, not the frequency
  coord_cartesian(xlim = c(-4,4)) +
  scale_x_continuous(breaks = c(seq(-4,4,1))) +
  xlab("SES") +
  ylab("Density") +
  theme_minimal() +
  geom_vline(xintercept = mean(HSB_Data$ses),size=2,color="blue") + # Mean vertical line
  geom_vline(xintercept = median(HSB_Data$ses),size=0.5,color="red", position_nudge(x = 0.5)) + # Median vertical line
  stat_function(fun = dnorm, args = list(mean = mean(HSB_Data$ses), sd = sd(HSB_Data$ses)), col = "green", size = 1) + # Plotting the normal density curve
  annotate("segment",x=2,xend=2.5,y=0.4,yend=0.4,color="blue",linetype=1,size=1.25) +
  annotate("text",x=2.6,y=0.4,label="Mean",color="black",size=4,hjust=0) +
  annotate("segment",x=2,xend=2.5,y=0.45,yend=0.45,color="red",linetype=1,size=1.25) +
  annotate("text",x=2.6,y=0.45,label="Median",color="black",size=4,hjust=0) +
  annotate("segment",x=2,xend=2.5,y=0.35,yend=0.35,color="green",linetype=1,size=1.25) +
  annotate("text",x=2.6,y=0.35,label="Normal Density",color="black",size=4,hjust=0) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("Distribution of SES Values in the HSB Dataset")

## Warning: geom_vline(): Ignoring `mapping` because `xintercept` was provided.

The mean and median overlap at SES = 0, but I did not know how to position the lines so that they weren’t overlapping.

Note on the graph: I did not know how to add a normal density curve, so I referenced this website: https://www.statology.org/overlay-normal-curve-histogram-in-r/.

I also used code from Dr. Strube’s R file on the Basic Two Level Model to annotate the figure.

Question 7

Provide a formal test of normality for ses. The Shapiro-Wilk test is the most commonly used, but in R it has a sample size limit of 5000. Suggest a work-around for this limit. What does this test tell you about normality for ses?

Answer: Since the HSB_Data set has a large sample size, the sampling distribution is more likely to be a normal distribution based on the Central Limit Theorem. Therefore, I can make a Q-Q plot to draw comparisons to the sample distribution and the theoretical distribution. If the sample distribution is normal, then the sample data points should fall on or near the diagonal line of the Q-Q plot.

There are formal tests of normality for large sample sizes like the Anderson-Darling Test (see the code below), but as we talked about in class, one should always visually inspect their data by plotting histograms and Q-Q plots.

ad.test(HSB_Data$ses) #conducting the Anderson-Darling Test

## 
##  Anderson-Darling normality test
## 
## data:  HSB_Data$ses
## A = 12.836, p-value < 2.2e-16

I used Google to partially answer this question: https://universeofdatascience.com/how-to-assess-normality-in-r/

Question 8

Produce a Q-Q plot for ses. This plot should include a 95% confidence band. What does this plot tell you about the distribution of ses?

Answer: The Q-Q plot tells us that there are less than a typical amount of values on the right end of the distribution.

ggpubr::ggqqplot(data = HSB_Data$ses, conf.int = TRUE) +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("Theoretical Distribution") +
  ylab("Sample Distribution") +
  ggtitle("Q-Q Plot for SES")

Question 9

Produce a scatterplot of meanses (y axis) versus size (x axis). Make sure the axes are appropriately labeled. Color the plot symbols so that public schools are red and Catholic schools are blue. Add an overall best-fitting linear regression line and an overall loess (nonlinear) fit line (just one of each for all of the data, not separately for type of school). What does this figure tell you?

Answer: This figure tells me that there is a slight, negative linear relationship between school enrollment and SES values. The nonlinear fit line shows a slight, downward curvilinear relationship with the greatest slope between school enrollment size and SES present at the greatest school enrollment sizes.

ggplot(HSB_Data, aes(x = size, y = meanses)) +
  geom_point(aes(color = factor(sector)), size = 2) +
  scale_color_manual(values = c('red', 'blue'), labels=c('Public', 'Catholic')) +
  scale_x_continuous(breaks = c(seq(0,3000,500))) +
  scale_y_continuous(breaks = c(seq(-1.2,0.8,0.4)), labels = scales::label_number(accuracy = 0.01)) +
  geom_smooth(method = lm, se = FALSE, color = "black", size = 1, 
              linetype = 1, na.rm = TRUE, formula = y ~ x) + # this is the linear line
  geom_smooth(method = loess, se = FALSE, color = "black", size = 1, 
              linetype = 2, na.rm = TRUE, formula = y ~ x) + # this is the non-linear line
  xlab("School Enrollment Size") +
  ylab("Mean of SES Values") +
  labs(color = "Sector") +
  ggtitle("School Enrollment by Mean SES with Linear and Nonlinear Best Fit Lines")

Question 10

Create a scatterplot and best-fitting line (simple linear model) relating mathach to ses. Include the 99% confidence interval around the line (consider mathach to be the outcome variable).

ggplot(HSB_Data, aes(x = ses, y = mathach)) +
  geom_point(aes(x = ses, y = mathach), size = 1) +
  scale_x_continuous(breaks = c(seq(-4.0,3,1)), labels = scales::label_number(accuracy = 1)) +
  geom_smooth(method = lm, color = "blue", size = 2, 
              linetype = 1, na.rm = TRUE, level = 0.99, formula = y ~ x) +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("SES") +
  ylab("Math Achievement") +
  ggtitle("The Relationship Between SES and Math Achievement")

Question 11

Produce a two-panel plot. In the upper panel, show the boxplots of mathach separately for each public school. In the lower panel, show the boxplots of mathach separately for each Catholic school. (There will be a total of 160 school-level box plots in this two-panel figure).

# Separate the data into public and catholic schools to make separate graphs
HSB_Public_data <- HSB_Data[which(HSB_Data$sector==0),]
HSB_Catholic_data <- HSB_Data[which(HSB_Data$sector==1),]

# Make a boxplot graph with just the public school data
HSB_Public <- ggplot(HSB_Public_data, aes(x = factor(School), y = mathach)) +
  geom_boxplot() +
  xlab("Public Schools") +
  ylab("Math Achievement") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, size = 5)) +
  ggtitle("Match Achievement by School Type")

# Make a boxplot graph with just the catholic school data
HSB_Catholic <- ggplot(HSB_Catholic_data, aes(x = factor(School), y = mathach)) +
  geom_boxplot() +
  xlab("Catholic Schools") +
  ylab("Math Achievement") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 90, size = 6))

# Combine the two graphs
ggarrange(HSB_Public, HSB_Catholic, nrow = 2)

Question 12

Starting with the data frame you created in Question 1, create a new data frame that contains school-level information (it will have only 160 lines). This data frame should have the Level 2 variables as well as the school averages for the Level 1 variables. Print the first 10 lines of this data frame.

HSB_Data_q12 <- HSB_Data %>% 
  group_by(School) %>% 
  summarize(mathach_mean = mean(mathach), ses_mean = mean(ses), female_mean = mean(female), 
            minority_mean = mean(minority), sector = mean(sector), size = mean(size), 
            himnty = mean(himnty)) #I will take the mean of all variables, even the level 2 variables of sector, size and himnty because they are the same values for each school

head(HSB_Data_q12, 10) #print just the first 10 lines

## # A tibble: 10 × 8
##    School mathach_mean ses_mean female_mean minority_mean sector  size himnty
##     <int>        <dbl>    <dbl>       <dbl>         <dbl>  <dbl> <dbl>  <dbl>
##  1   1224         9.72  -0.436        0.596        0.0851      0   842      0
##  2   1288        13.5    0.120        0.44         0.12        0  1855      0
##  3   1296         7.64  -0.428        0.646        0.979       0  1719      1
##  4   1308        16.3    0.526        0            0.4         1   716      0
##  5   1317        13.2    0.343        1            0.729       1   455      1
##  6   1358        11.2   -0.0217       0.367        0.1         0  1430      0
##  7   1374         9.73  -0.0146       0.464        0.321       0  2400      0
##  8   1433        19.7    0.71         0            0.0286      1   899      0
##  9   1436        18.1    0.561        0.341        0           1   185      0
## 10   1461        16.8    0.675        0.576        0.0303      0  1672      0

I used code from the Centering.R script in the supplementary materials to produce this data set.

Question 13

Create a correlation matrix among all variables for the data frame created in Question 12.

round(cor(HSB_Data_q12),2)

##               School mathach_mean ses_mean female_mean minority_mean sector
## School          1.00        -0.05    -0.02        0.04         -0.11  -0.04
## mathach_mean   -0.05         1.00     0.78       -0.27         -0.49   0.45
## ses_mean       -0.02         0.78     1.00       -0.15         -0.48   0.36
## female_mean     0.04        -0.27    -0.15        1.00          0.07  -0.03
## minority_mean  -0.11        -0.49    -0.48        0.07          1.00   0.06
## sector         -0.04         0.45     0.36       -0.03          0.06   1.00
## size            0.02        -0.10    -0.13       -0.04          0.12  -0.45
## himnty         -0.16        -0.38    -0.41        0.09          0.84   0.05
##                size himnty
## School         0.02  -0.16
## mathach_mean  -0.10  -0.38
## ses_mean      -0.13  -0.41
## female_mean   -0.04   0.09
## minority_mean  0.12   0.84
## sector        -0.45   0.05
## size           1.00   0.11
## himnty         0.11   1.00

Question 14

Using the new data frame, conduct a multiple regression, predicting school-average math achievement from school-average SES, school-average female representation in the school, sector, and minority enrollment (himnty). Which predictors make unique and significant contributions to math achievement?

Answer: The results of the multiple regression show that school-average SES, school-average female representation, sector, and minority enrollment all make significant contributions to school-average math achievement.

model_q14 <- lm(mathach_mean ~ ses_mean + female_mean + sector + himnty, data = HSB_Data_q12)
summary(model_q14)

## 
## Call:
## lm(formula = mathach_mean ~ ses_mean + female_mean + sector + 
##     himnty, data = HSB_Data_q12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8564 -1.1604 -0.0278  1.2189  4.3898 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.2648     0.3481  38.106  < 2e-16 ***
## ses_mean      4.7802     0.4105  11.645  < 2e-16 ***
## female_mean  -1.9322     0.5554  -3.479 0.000653 ***
## sector        1.4108     0.3103   4.546 1.09e-05 ***
## himnty       -0.7935     0.3527  -2.250 0.025858 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.773 on 155 degrees of freedom
## Multiple R-squared:  0.6849, Adjusted R-squared:  0.6767 
## F-statistic: 84.21 on 4 and 155 DF,  p-value: < 2.2e-16

Question 15

Add to the model in Question 14 all two-way interactions among the predictor variables. For each significant interaction, construct a figure (include 95% confidence bands) and provide a verbal description of what the interaction means.

Answer: There is only one significant interaction between ses_mean and sector. This interaction can be interpreted as the slope of SES and Math Achievement decreases by 1.89 for every one unit change in sector, controlling for all other predictors.

# Create a model with all two-way interactions
model_q15 <- lm(mathach_mean ~ ses_mean + female_mean + sector + himnty + 
                  ses_mean*female_mean + #each interaction is listed below
                  ses_mean*sector + 
                  ses_mean* himnty +
                  female_mean*sector + 
                  female_mean*himnty +
                  sector*himnty, data = HSB_Data_q12)
summary(model_q15) # Full model is printed

## 
## Call:
## lm(formula = mathach_mean ~ ses_mean + female_mean + sector + 
##     himnty + ses_mean * female_mean + ses_mean * sector + ses_mean * 
##     himnty + female_mean * sector + female_mean * himnty + sector * 
##     himnty, data = HSB_Data_q12)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7917 -0.9978  0.0495  1.0608  4.3965 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          12.20018    1.22696   9.943  < 2e-16 ***
## ses_mean              6.30877    0.97563   6.466 1.35e-09 ***
## female_mean           0.02605    2.34225   0.011   0.9911    
## sector                2.72871    1.33026   2.051   0.0420 *  
## himnty                0.32752    0.87824   0.373   0.7097    
## ses_mean:female_mean -1.29175    1.38134  -0.935   0.3512    
## ses_mean:sector      -1.89470    0.85025  -2.228   0.0274 *  
## ses_mean:himnty       0.51556    0.86551   0.596   0.5523    
## female_mean:sector   -1.84075    2.43347  -0.756   0.4506    
## female_mean:himnty   -0.56030    1.29932  -0.431   0.6669    
## sector:himnty        -1.38334    0.80294  -1.723   0.0870 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.765 on 149 degrees of freedom
## Multiple R-squared:  0.6997, Adjusted R-squared:  0.6796 
## F-statistic: 34.72 on 10 and 149 DF,  p-value: < 2.2e-16

# Graph the interaction between ses and sector below

HSB_Data_public <- HSB_Data_q12 %>% filter(sector == 0)
HSB_Data_catholic <- HSB_Data_q12 %>% filter(sector == 1)

ggplot(HSB_Data_public, aes(x = ses_mean, y = mathach_mean)) +
  geom_point(size = 2, color = "grey") +
  geom_smooth(method = lm, color = "black", size = 1, linetype = 1, formula = y ~ x) +
  geom_point(data = HSB_Data_catholic, aes(x = ses_mean, y = mathach_mean), size = 2, color = "lightblue") +
  geom_smooth(data = HSB_Data_catholic, aes(x = ses_mean, y = mathach_mean), method = lm, color = "black", size = 1, linetype = 2, formula = y ~ x) +
  scale_x_continuous(breaks = c(seq(-1.2,1.2,0.4)), labels=c("-1.2","0.8","0.4","0","0.4","0.8","1.2"), limits = c(-1.2,1.0)) +
  scale_y_continuous(breaks = c(seq(0,25,5)), labels = c("0","5","10","15","20","25"), limits = c(0,23)) +
  theme_minimal() + 
  xlab("Mean SES") +
  ylab("Mean Math Achievement") +
  labs(color = "Sector") +
  annotate("segment",x=.2,xend=.35,y=4,yend=4,color="black",linetype=1,size=1.25) +
  annotate("text",x=.4,y=4,label="Public Schools",color="black",size=5,hjust=0) +
  annotate("segment",x=.2,xend=.35,y=2,yend=2,color="black",linetype=2,size=1.25) +
  annotate("text",x=.4,y=2,label="Catholic Schools",color="black",size=5,hjust=0) +
  theme(plot.title = element_text(hjust = 0.5)) +
  ggtitle("The Interaction Between SES and Match Achievement")

I used some code from the Basic Two Level Model R script provided by Dr. Strube to graph and fit the data separately for Public and Catholic schools.

HLM: Homework 1

Rachel Smith Peirce

2023-02-11