1. My variable of interest is “GR3_5_mathlevel”, which gives the percentage of students at grade level in math for grades 3-5.

1a. Produce summary statistics, including the minimum, maximum, mean, standard deviation, etc.

dataschools2$GR3_5_mathlevel<-as.numeric(dataschools2$GR3_5_mathlevel)
sd(dataschools2$GR3_5_mathlevel) #18.9836
## [1] 18.9836
mean(dataschools2$GR3_5_mathlevel) #37.88838
## [1] 37.88838
min(dataschools2$GR3_5_mathlevel) #0
## [1] 0
max(dataschools2$GR3_5_mathlevel) #100
## [1] 100
median(dataschools2$GR3_5_mathlevel) #35.4
## [1] 35.4
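
For reference, base R’s summary() returns most of these statistics (minimum, quartiles, median, mean, maximum) in a single call; only the standard deviation still has to be computed separately with sd():

summary(dataschools2$GR3_5_mathlevel)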

1b. Produce a histogram using ggplot. Place a vertical line on the mean and another on the median.

math3_5_mean<-mean(dataschools2$GR3_5_mathlevel)
math3_5_median<-median(dataschools2$GR3_5_mathlevel)

ggplot(data = dataschools2, aes(x = GR3_5_mathlevel)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  geom_vline(xintercept = math3_5_mean, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = math3_5_median, color = "blue", linetype = "dashed", linewidth = 1) +
  annotate("text", x = math3_5_mean - 0.5, y = 10, label = "mean", color = "red", vjust = -0.5) +
  annotate("text", x = math3_5_median + 0.5, y = 10, label = "median", color = "blue", vjust = -0.5) +
  labs(title = "Histogram of At Level Gr.3-5 Math with Mean and Median Lines",
       x = "At Math Grade Level (Gr. 3-5) Percent",
       y = "Frequency")

1c. Given your responses to 1a and 1b, can you state that the variable is normally distributed? What information did you use to conclude that?

The data do seem approximately normally distributed (though slightly skewed to the right): the histogram looks roughly bell-shaped and most of the data are centered around the mean.
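
One optional way to double-check this visually is a normal Q-Q plot; points lying close to the reference line are consistent with an approximately normal distribution. This is only a sketch and goes beyond what the question asks for:

# Normal Q-Q plot of the Gr. 3-5 math variable
ggplot(dataschools2, aes(sample = GR3_5_mathlevel)) +
  stat_qq() +
  stat_qq_line()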

1d. We can call the variable X. Create a new variable that transforms X into a z-score of the form below, where X̄n is the sample mean and s is the sample standard deviation. Complete the following table:

Zi = (Xi − X̄n) / s

x <- dataschools2$GR3_5_mathlevel
x_mean <- mean(x)
x_sd <- sd(x)

Zi <- (x - x_mean) / x_sd

Zi_table <- seq(-2, 2, by = 0.5)

# Empirical share of observations at or below each z value
mass_belowzi <- sapply(Zi_table, function(z) mean(Zi <= z))

answers <- data.frame(Zi_table, mass_belowzi)

1e. Complete the table from 1d using the standard normal distribution, not your data. Compare the table you obtain with the table from part 1d. What can you conclude about the distribution of your variable?

Zi_snd <- seq(-2, 2, by=0.5)

mass_below_Zisnd <- pnorm(Zi_snd)

table_mass_below_Zisnd <- data.frame(Zi_snd, mass_below_Zisnd)

Since the values in my variable’s table for the mass below Zi are not very different from the corresponding mass below Zi under the standard normal distribution, I would conclude that my variable is approximately normally distributed.
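
To make the comparison concrete, the empirical and standard-normal columns can be placed side by side (a sketch; comparison is a new object name introduced here):

# Empirical CDF values from my data next to the standard normal CDF
comparison <- data.frame(
  z              = Zi_table,
  empirical_mass = mass_belowzi,
  normal_mass    = pnorm(Zi_table)
)
comparison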

2. Pick another variable from your dataset and repeat the steps from part 1. Enumerate your answers using the letters a-e.

My variable of interest is “misconduct_rate”, which tells us the number of misconducts per 100 students.

2a. Produce summary statistics, including the minimum, maximum, mean, standard deviation, etc.

colnames(dataschools2)[6] <- "misconduct_rate"

sd(dataschools2$misconduct_rate) #28.35078
## [1] 28.35078
mean(dataschools2$misconduct_rate) #23.16392
## [1] 23.16392
min(dataschools2$misconduct_rate) #0
## [1] 0
max(dataschools2$misconduct_rate) #230.6
## [1] 230.6
median(dataschools2$misconduct_rate) #13.4
## [1] 13.4

2b. Produce a histogram using ggplot. Place a vertical line on the mean and another on the median.

misconduct_mean<-mean(dataschools2$misconduct_rate)
misconduct_median<-median(dataschools2$misconduct_rate)

ggplot(data = dataschools2, aes(x = misconduct_rate)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  geom_vline(xintercept = misconduct_mean, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = misconduct_median, color = "blue", linetype = "dashed", linewidth = 1) +
  annotate("text", x = misconduct_mean - 0.5, y = 10, label = "mean", color = "red", vjust = -0.5) +
  annotate("text", x = misconduct_median + 0.5, y = 10, label = "median", color = "blue", vjust = -0.5) +
  labs(title = "Histogram of Number of Misconducts per 100 Students with Mean and Median Lines",
       x = "Number of Misconducts per 100 students",
       y = "Frequency")

2c. Given your responses to 2a and 2b, can you state that the variable is normally distributed? What information did you use to conclude that?

The data do not seem normally distributed; they are strongly skewed to the right. The histogram does not look like a bell curve, and most of the data are not centered around the mean (the mean of 23.2 sits well above the median of 13.4).

2d. We can call the variable X. Create a new variable that transforms X into a z-score of the form below, where X̄n is the sample mean and s is the sample standard deviation. Complete the following table:

Zi = (Xi − X̄n) / s

x2 <- dataschools2$misconduct_rate

x2_mean <- mean(dataschools2$misconduct_rate)
x2_sd <- sd(dataschools2$misconduct_rate)

Zi2 <- (x2 - x2_mean) / x2_sd

Zi2_table <- seq(-2, 2, by = 0.5)

# Empirical share of observations at or below each z value
mass_belowzi2 <- sapply(Zi2_table, function(z) mean(Zi2 <= z))

answers2 <- data.frame(Zi2_table, mass_belowzi2)

2e. Complete the table from 2d using the standard normal distribution, not your data. Compare the table you obtain with the table from part 2d. What can you conclude about the distribution of your variable?

Zi_snd <- seq(-2, 2, by=0.5)

mass_below_Zisnd <- pnorm(Zi_snd)

table_mass_below_Zisnd <- data.frame(Zi_snd, mass_below_Zisnd)

It looks like my data track the normal distribution more closely from −0.5 upward, but the empirical mass below Zi for the values from −2 to −1 is 0, which is not what a normal distribution would produce.
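
As in 1e, the two columns can be placed side by side to make the comparison explicit (a sketch; comparison2 is a name introduced here):

comparison2 <- data.frame(
  z              = Zi2_table,
  empirical_mass = mass_belowzi2,
  normal_mass    = pnorm(Zi2_table)
)
comparison2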

3. Pick one variable of interest to conduct a hypothesis test on the population mean.

colnames(dataschools2)[5] <- "avg_student_attendance"

3a. Clearly state the null hypothesis H0 and the alternative hypothesis H1.

Null hypothesis (H0): The average student attendance is 95% or more (μ ≥ 95)

Alternative hypothesis (H1): The average student attendance is below 95% (μ < 95)

3b. Manually calculate the test statistic. Be clear on the procedure you followed to calculate it.

mean(dataschools2$avg_student_attendance) #94.23777
## [1] 94.23777
sd(dataschools2$avg_student_attendance) #2.150204
## [1] 2.150204
length(dataschools2$avg_student_attendance)
## [1] 413
(94.23777-95)/(2.150204/(sqrt(413))) # t = -7.204128
## [1] -7.204128
qt(0.975, 412, ncp = 0, lower.tail = TRUE, log.p = FALSE) # two-sided 5% critical value: 1.965739
## [1] 1.965739
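
To be clear about the procedure, the same calculation can be written with named objects instead of hard-coded numbers (a sketch; the object names here are my own):

# One-sample t statistic: t = (x_bar - mu0) / (s / sqrt(n))
mu0    <- 95
x_bar  <- mean(dataschools2$avg_student_attendance)
s_att  <- sd(dataschools2$avg_student_attendance)
n_att  <- length(dataschools2$avg_student_attendance)
t_stat <- (x_bar - mu0) / (s_att / sqrt(n_att))
t_stat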

3c. Given the test statistic from part 3b, do you reject the null hypothesis? What decision rule did you follow?

Because my t statistic is -7.204128, whose absolute value is far greater than the critical value of 1.96, I reject the null hypothesis. The decision rule I followed was to reject H0 when the test statistic falls beyond the critical value, i.e. in the rejection region. A t statistic of -7.204128 is strong evidence that the average student attendance is not 95% or more.

3d. Calculate the p-value associated with the test statistic from part 3b. Clearly show the procedure and interpret the obtained p-value.

t3 <- (94.23777-95)/(2.150204/(sqrt(413))) # -7.204128
pval_q3 <- pnorm(t3) # one-sided p-value from the standard normal: 2.920824e-13

The p-value associated with the test statistic is far below 0.05 (statistically significant), which means there is strong evidence to reject the null hypothesis.
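
For comparison, the p-value can also be taken from the t distribution with n − 1 = 412 degrees of freedom, either one-sided (matching my alternative hypothesis) or doubled for the two-sided test that t.test reports. A sketch, reusing the t3 object defined above:

pt(t3, df = 412)     # one-sided p-value from the t distribution
2 * pt(t3, df = 412) # two-sided p-value, comparable to the t.test output in 3e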

3e. Use the R function t.test to perform the test. Compare your results from 3a-3d to the ones produced by the t.test function.

t.test(dataschools2$avg_student_attendance, mu = 95)
## 
##  One Sample t-test
## 
## data:  dataschools2$avg_student_attendance
## t = -7.2041, df = 412, p-value = 2.802e-12
## alternative hypothesis: true mean is not equal to 95
## 95 percent confidence interval:
##  94.02979 94.44576
## sample estimates:
## mean of x 
##  94.23777

My results from parts 3a-3d are almost the same as the ones produced by the t.test function in R. The t statistics match (-7.2041). The p-values are both essentially zero but not identical: I calculated 2.920824e-13 using the one-sided standard normal (pnorm), while t.test reports a two-sided p-value of 2.802e-12 based on the t distribution with 412 degrees of freedom.

3f. What did you learn about the variable in question with this hypothesis test?

I learned that the average student attendance is not 95% or more; it is statistically below 95% (though the sample mean is 94.23777, so the average student attendance is not far off from 95%!).

4. Pick one variable of interest and another that will be used to subset the data. You should be able to subset the data into two groups, and the objective is to produce a two sample hypothesis test using the variable of interest and the two subsets.

I am interested in looking at the variable “safety_score” and whether or not having a healthy schools certification (“Yes” or “No” in the variable “health_cert”) affects the safety score.

colnames(dataschools2)[3] <- "safety_score"
colnames(dataschools2)[2] <- "health_cert"

4a. Clearly state the null hypothesis H0 and the alternative hypothesis H1.

Null hypothesis (H0): Schools with and without a healthy schools certification have the same mean safety score

Alternative hypothesis (H1): Schools with and without a healthy schools certification do not have the same mean safety score

4b. Manually calculate the test statistic. Be clear on the procedure you followed to calculate it.

The test statistic is

t = (X̄n − Ȳn) / sqrt(s²X / nX + s²Y / nY)

where X̄n is the sample average of schools that have a healthy schools certification, Ȳn is the sample average of schools that do not have a healthy schools certification, s² is the squared standard deviation (the variance) for each group, and n is the number of observations for each group. What is the value of the test statistic?

# YES
healthy_mean<-mean(dataschools2$safety_score[dataschools2$health_cert=="Yes"]) #62.09091
sd(dataschools2$safety_score[dataschools2$health_cert=="Yes"]) #17.67741
## [1] 17.67741
summary(dataschools2$safety_score[dataschools2$health_cert=="Yes"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   44.00   53.50   57.00   62.09   62.00   99.00
sum(dataschools2$health_cert=="Yes") #11
## [1] 11
# NO
no_healthy_mean<-mean(dataschools2$safety_score[dataschools2$health_cert=="No"]) #49.301
sd(dataschools2$safety_score[dataschools2$health_cert=="No"]) #20.37985
## [1] 20.37985
summary(dataschools2$safety_score[dataschools2$health_cert=="No"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    34.0    47.0    49.3    62.0    99.0
sum(dataschools2$health_cert=="No") #402
## [1] 402
# Yes healthy cert = X,  No healthy cert = Y
# t = (X̄ - Ȳ) / sqrt(s²_X/n_X + s²_Y/n_Y)
(62.09091 - 49.301) / sqrt(((17.67741^2)/11) + ((20.37985^2)/402))
## [1] 2.357154
# t = 2.357154
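
The same statistic can be computed from named objects instead of hard-coded numbers (a sketch; the object names are my own):

# Welch two-sample t statistic for safety scores by certification status
x_yes <- dataschools2$safety_score[dataschools2$health_cert == "Yes"]
x_no  <- dataschools2$safety_score[dataschools2$health_cert == "No"]
t_stat4 <- (mean(x_yes) - mean(x_no)) /
  sqrt(var(x_yes) / length(x_yes) + var(x_no) / length(x_no))
t_stat4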

4c. Given the test statistic from part 4b, do you reject the null hypothesis? What decision rule did you follow?

t95 <- abs(qnorm(0.025)) # two-sided 5% critical value from the standard normal, approximately 1.96

We can reject the null hypothesis at the 5% significance level: t = 2.357154, which is greater than the critical value of 1.96.

4d. Calculate the p-value associated with the test statistic from part 4b. Clearly show the procedure and interpret the obtained p-value.

t4<-2.357154
2 * (1-pnorm(t4)) #0.01841561
## [1] 0.01841561

The p-value is less than 0.05, so at the 5% significance level we can reject the null hypothesis: schools with and without a healthy schools certification do not have the same mean safety score.
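
A p-value closer to what t.test reports can be obtained by using the t distribution with the Welch-Satterthwaite degrees of freedom instead of the standard normal. This sketch reuses the x_yes, x_no, and t_stat4 objects from the sketch in 4b:

# Welch-Satterthwaite approximation to the degrees of freedom
v_yes <- var(x_yes) / length(x_yes)
v_no  <- var(x_no)  / length(x_no)
df_welch <- (v_yes + v_no)^2 /
  (v_yes^2 / (length(x_yes) - 1) + v_no^2 / (length(x_no) - 1))
df_welch                             # roughly 10.7, matching the t.test output in 4e
2 * pt(-abs(t_stat4), df = df_welch) # two-sided p-value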

4e. Use the R function t.test to perform the test. Compare your results from 4a-4d to the ones produced by the t.test function.

t.test(dataschools2$safety_score[dataschools2$health_cert=="Yes"], dataschools2$safety_score[dataschools2$health_cert=="No"])
## 
##  Welch Two Sample t-test
## 
## data:  dataschools2$safety_score[dataschools2$health_cert == "Yes"] and dataschools2$safety_score[dataschools2$health_cert == "No"]
## t = 2.3572, df = 10.74, p-value = 0.03852
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   0.812042 24.767786
## sample estimates:
## mean of x mean of y 
##  62.09091  49.30100

My t value is the same whether computed manually or with the t.test function, but the p-values are quite different (the t.test p-value is 0.03852, while my calculated p-value is 0.01841561). I think this is because my manual calculation used the standard normal distribution, while t.test performs a Welch test that uses a t distribution with only about 10.7 degrees of freedom; since one of my groups has just 11 observations, the t distribution’s heavier tails make the p-value noticeably larger.

4f. What did you learn from this hypothesis test?

From this hypothesis test I think I learned that, for two-sample hypothesis testing, it is really important either to have a similar number of observations in each group or for both groups to have a large number of observations. One of my groups has only 11 observations, which made this hypothesis test less reliable: the p-value I calculated and the p-value from t.test are quite different, and although my test statistic says to reject the null hypothesis, looking at the data I am not convinced that having a healthy schools certification has any real effect on the safety score. Then again, we only have 11 certified schools to base that judgment on.

5. Pick two continuous variables (i.e., not discrete variables) from your data.

For number five I will use the following variables: avg_student_attendance and misconduct_rate.

5a. If you have not done this for any of the two variables above, produce summary statistics and histograms to understand how the variables are distributed.

I have done this for both of the variables above.

5b. Decide which of the two variables will be used as an outcome. This variable should appear in the y-axis in every exercise below. Use ggplot to produce a scatter plot in which you plot the outcome against the other variable of interest. What can you conclude just from the scatter plot? Do you think the best way to describe the relationship is with a line?

Student Misconduct Rate (misconduct_rate) will be my dependent (outcome) variable and average student attendance (avg_student_attendance) will be my independent variable.

ggplot(data = dataschools2, mapping = aes(x = avg_student_attendance, y= misconduct_rate)) +
  geom_point() 

Just from the scatter plot, I’m not sure these variables have a strong relationship, but it does look like the higher the average student attendance, the lower the misconduct rate. I think adding a line would help to see this relationship better.

5c. Use ggplot to add a linear fit on top of the scatter plot. What can you conclude from this visual exercise?

ggplot(data = dataschools2, mapping = aes(x = avg_student_attendance, y= misconduct_rate)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y~x)

The linear fit line makes me think the two variables are more strongly (negatively) related than the scatter plot alone suggested.
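
A correlation coefficient would put a number on this impression (a sketch; the exact value depends on the data and is not reproduced here):

cor(dataschools2$avg_student_attendance, dataschools2$misconduct_rate)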

5d. Use the lm function to estimate a linear regression between the outcome and the independent variable of interest. Interpret the slope (β̂1) in substantive terms.

lm(misconduct_rate ~ avg_student_attendance, data=dataschools2)
## 
## Call:
## lm(formula = misconduct_rate ~ avg_student_attendance, data = dataschools2)
## 
## Coefficients:
##            (Intercept)  avg_student_attendance  
##                 696.94                   -7.15

Coefficients: (Intercept) = 696.94; slope (avg_student_attendance) = -7.15

The slope (β̂1) is a large negative number: for every one-unit increase in X (a one-percentage-point increase in average student attendance), Y (the misconduct rate) decreases by 7.15 misconducts per 100 students, which seems like a substantial decrease. In terms of the data, this tells me that the higher a school’s average student attendance, the less misconduct there is at that school (when more students come to school, they appear less likely to engage in misconduct).

5e. Interpret the intercept (β̂0) in substantive terms.

The intercept for this model is very high (696.94). I imagine this is because an average student attendance of 0% does not occur in the data (and is very unlikely to ever happen): the lowest average student attendance in my data is 80.2, so with a slope of about -7 the fitted line has to cross the y-axis at a high value when extrapolated back to x = 0.
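
To see that the intercept is an extrapolation far outside the observed range, one can compare fitted values at the lowest and highest observed attendance levels (a sketch; fit5 is a name introduced here):

fit5 <- lm(misconduct_rate ~ avg_student_attendance, data = dataschools2)
# Predicted misconduct rates at the minimum and maximum observed attendance
predict(fit5, newdata = data.frame(
  avg_student_attendance = range(dataschools2$avg_student_attendance)
))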

5f. R reports a p-value and a test statistic for each of the model’s coefficients. What is the underlying test behind these quantities?

summary(lm(misconduct_rate ~ avg_student_attendance, data=dataschools2))
## 
## Call:
## lm(formula = misconduct_rate ~ avg_student_attendance, data = dataschools2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.230 -11.489  -4.375   5.052 193.582 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            696.9357    51.5101   13.53   <2e-16 ***
## avg_student_attendance  -7.1497     0.5465  -13.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.85 on 411 degrees of freedom
## Multiple R-squared:  0.294,  Adjusted R-squared:  0.2923 
## F-statistic: 171.2 on 1 and 411 DF,  p-value: < 2.2e-16

R is conducting a hypothesis test on each of the parameters, testing whether the corresponding population parameter is equal to zero. For each row, R performs the following test:

H0 : βk = 0

H1 : βk ≠ 0

The test statistic is computed as:

t = (β̂k − 0) / s.e.(β̂k) ∼ N(0, 1)
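
These t values can be reproduced directly from the coefficient table: each one is simply the estimate divided by its standard error (a sketch; coefs is a name introduced here):

coefs <- summary(lm(misconduct_rate ~ avg_student_attendance, data = dataschools2))$coefficients
coefs[, "Estimate"] / coefs[, "Std. Error"] # matches the reported t values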

5g. What can you say about the statistical significance of the relationship between the outcome and the independent variable derived from the p-value produced by lm?

The relationship between the outcome (the misconduct rate, misconduct_rate) and the independent variable (average student attendance, avg_student_attendance), as judged by the p-value produced by lm, is highly statistically significant, with a p-value of <2e-16 for both the intercept and the slope. With a p-value this small we can reject the null hypothesis that the population parameter is 0: the misconduct rate is related to average student attendance.

5h. Pick another variable of interest to use as an independent variable in a bivariate linear regression. Re-estimate the model using the two independent variables and the same outcome. Interpret the two coefficients associated with the independent variables in substantive and statistical terms.

summary(lm(misconduct_rate ~ avg_student_attendance + safety_score, data=dataschools2))
## 
## Call:
## lm(formula = misconduct_rate ~ avg_student_attendance + safety_score, 
##     data = dataschools2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.919 -11.445  -3.907   4.975 192.104 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            601.44679   60.26407   9.980  < 2e-16 ***
## avg_student_attendance  -6.02736    0.65959  -9.138  < 2e-16 ***
## safety_score            -0.20704    0.06953  -2.978  0.00308 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.62 on 410 degrees of freedom
## Multiple R-squared:  0.309,  Adjusted R-squared:  0.3056 
## F-statistic: 91.66 on 2 and 410 DF,  p-value: < 2.2e-16

The safety score also has a statistically significant p-value (0.00308), although it is not as small as the p-value for average student attendance. Still, with a p-value below 0.05, we can reject the null hypothesis for that coefficient as well: the misconduct rate is associated with both average student attendance and the safety score. The slope for safety_score is also negative (-0.20704), but much smaller in magnitude than the slope for average student attendance. When accounting for safety score, the slope for average student attendance is -6.027, very close to its slope in the bivariate model.