This is the basic setup information for the homework:

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
district <- read_excel("district.xls")
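
As an optional sanity check (output omitted here, since it depends on the file), a quick look at the dimensions confirms that the spreadsheet loaded:

dim(district)  # number of rows (districts) and columns in the spreadsheet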

Question 2:

I am creating a new data frame with the district name, the percent special education, and the money spent on special education.

special_ed <- district |> select(DISTNAME, DPETSPEP, DPFPASPEP)
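
As another optional check (output omitted), glimpse() confirms that the three columns came through with the expected names and types:

glimpse(special_ed)  # DISTNAME should be character, the two numeric columns doubles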

Questions 3 and 4:

Here are summary statistics for DPETSPEP:

summary(special_ed$DPETSPEP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.90   12.10   12.27   14.20   51.70

And DPFPASPEP:

summary(special_ed$DPFPASPEP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.800   8.900   9.711  12.500  49.000       5

DPFPASPEP has 5 missing values.
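
We can confirm that count directly with a one-line check:

sum(is.na(special_ed$DPFPASPEP))  # count of missing values; matches the summary above
## [1] 5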

Question 5:

I am dropping the NA values from DPFPASPEP:

special_ed_clean <- special_ed |> drop_na()
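
Since DPFPASPEP is the only column with missing values, drop_na() with no arguments is enough here. If other columns could also contain NAs, a more targeted version (just a sketch, not needed for this data) would name the column explicitly:

# drop rows only when DPFPASPEP is missing, leaving NAs elsewhere untouched
special_ed_clean <- special_ed |> drop_na(DPFPASPEP)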

I want to check that the code worked correctly:

summary(special_ed_clean$DPFPASPEP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.800   8.900   9.711  12.500  49.000

Hooray! Now, how many observations are left? Let’s make a tibble!

tibble(special_ed_clean)
## # A tibble: 1,202 × 3
##    DISTNAME                     DPETSPEP DPFPASPEP
##    <chr>                           <dbl>     <dbl>
##  1 CAYUGA ISD                       14.6      28.9
##  2 ELKHART ISD                      12.1       8.8
##  3 FRANKSTON ISD                    13.1       8.4
##  4 NECHES ISD                       10.5      10.1
##  5 PALESTINE ISD                    13.5       6.1
##  6 WESTWOOD ISD                     14.5       9.4
##  7 SLOCUM ISD                       14.7       9.9
##  8 ANDREWS ISD                      10.4      10.9
##  9 PINEYWOODS COMMUNITY ACADEMY     11.6       9.2
## 10 HUDSON ISD                       11.9      10.3
## # ℹ 1,192 more rows

There are now 1,202 observations in the data frame.
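
We can also get that count directly, without reading it off the printed tibble:

nrow(special_ed_clean)  # rows remaining after dropping the 5 incomplete districts
## [1] 1202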

Question 6:

Let’s create a beautiful scatterplot to visualize the relationship between DPFPASPEP and DPETSPEP:

ggplot(special_ed_clean, aes(x = DPFPASPEP, y = DPETSPEP)) +
  geom_point() +
  labs(x = "Money Spent on Special Education", y = "Percent Special Education")

Are DPFPASPEP and DPETSPEP correlated? Just by looking at the scatterplot, there appears to be a loose positive trend: most of the data points are clustered at lower values, with some notable outliers.
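
One optional way to make the trend easier to judge by eye (a sketch, not part of the assignment) is to overlay a fitted linear trend line on the same scatterplot:

ggplot(special_ed_clean, aes(x = DPFPASPEP, y = DPETSPEP)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # straight trend line, no confidence band
  labs(x = "Money Spent on Special Education", y = "Percent Special Education")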

Question 7:

However, our eyes can be mistaken. Let’s check the actual correlation between DPFPASPEP and DPETSPEP:

cor(special_ed_clean$DPETSPEP, special_ed_clean$DPFPASPEP)
## [1] 0.3700234

There is a positive relationship between the variables, meaning that as one increases, the other tends to increase slightly as well.
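
As an aside, the same correlation could have been computed from the original special_ed data frame without dropping the NAs first, by telling cor() to use only complete observations; since DPETSPEP has no missing values, this should match the result above:

# use = "complete.obs" ignores the 5 rows where DPFPASPEP is NA
cor(special_ed$DPETSPEP, special_ed$DPFPASPEP, use = "complete.obs")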

Question 8:

0.37 appears to be a relatively weak correlation. This may be due to the outliers at the edge of the scatterplot. In Section 2.4.4 of Modern Statistics with R, Måns Thulin noted that the “cor” function uses “the Pearson correlation formula, which is known to be sensitive to outliers.” Let’s try another method:

cor(special_ed_clean$DPETSPEP, special_ed_clean$DPFPASPEP, method = "spearman")
## [1] 0.346629

Huh! With this method, the correlation appears even slightly weaker, so the outliers do not seem to be hiding a stronger relationship. Keep in mind that the strength of a correlation is a different question from its statistical significance: with over a thousand districts in the sample, even a modest correlation like this one may well be statistically significant. Other statistical tools, like a hypothesis test and its p-value, will help us determine whether there really is a correlation.
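
As a preview, one sketch of that next step is cor.test(), which reports the correlation along with a p-value and a confidence interval (not run here):

# Pearson correlation test: estimate, 95% confidence interval, and p-value
cor.test(special_ed_clean$DPETSPEP, special_ed_clean$DPFPASPEP)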

To be continued…