By the end of this lesson, students will:
Practice uploading and building datasets
Compute descriptive statistics, both with packages and from scratch, if needed
Understand how to calculate correlation coefficients
Build regression models
Understand regression assumptions and how to mitigate broken assumptions
First, always load the packages you need. For this lesson that means tidyverse and readxl, plus pastecs, lmtest, and car for the statistics later on.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(pastecs)
##
## Attaching package: 'pastecs'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## The following object is masked from 'package:tidyr':
##
## extract
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
Then, you have to load your data. Make sure your data is in the same folder as your R Markdown file. If you haven’t saved your R Markdown file to a folder, do it now.
There are several commands for loading files into R. Which one to use depends on the file type:
destination <- read_csv("file.csv")
destination <- read_excel("file.xlsx") {or .xls}
load("file.RData") {for native "R" files; load() restores the saved objects under their original names}
destination <- read_table("file.txt") {for whitespace-separated text tables}
If you come across a weird file type, you can always google “How to load [file type] into R” – there will almost always be a way.
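If you are not sure R can actually see your file, a quick sanity check with base R’s getwd() and list.files() can help (this is just a sketch; the "district" pattern below assumes the file used in this lesson):
# Print the current working directory, where R looks for files by default
getwd()
# List any files in that folder whose names contain "district"
list.files(pattern = "district")
For this lesson, the data lives in "district.xls", which we load next: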
district<-read_excel("district.xls")
We can use the arrow “<-” and pipe “|>” to manipulate the data to our liking. The grammar for this is as follows:
new_data <- old_data |> command(variable)
new_data is the new dataset we want to construct
old_data is the dataset we already have
command() is whatever we are telling R to perform
“variable” is whatever column (if any) we want to perform command() on
That’s it!
new_cars <- mtcars |> filter(hp>100)
head(new_cars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
In this case, we are telling R that we want to create a new dataset called “new_cars” from the old dataset called “mtcars”, and we want to run the command “filter” so that only cars with hp greater than 100 are kept.
Data should almost always be cleaned from its raw form. Typically we want to construct a new dataset from an old one. Three principles are found in the literature, along with a fourth principle of my own:
1. Every observation should be a single row
2. Every variable should be a single column
3. Every cell should be a single value
4. (Erik’s) The final dataset should only contain the columns we are interested in
# for district, this is pretty easy: we just need the columns we want
district_clean <- district |> select(DISTNAME,DZRATING,DZCAMPUS,DPETALLC,DPFEAINSK,DPSTURNR)
Now the data is much easier to inspect.
The first command you should run is “head()”
head(district_clean)
## # A tibble: 6 × 6
## DISTNAME DZRATING DZCAMPUS DPETALLC DPFEAINSK DPSTURNR
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA ISD A 3 574 8234 19.1
## 2 ELKHART ISD A 4 1150 6884 13.9
## 3 FRANKSTON ISD A 3 808 6751 21.6
## 4 NECHES ISD A 2 342 7484 18.3
## 5 PALESTINE ISD B 6 3360 6914 17.9
## 6 WESTWOOD ISD B 4 1332 6472 30.6
(It’s more like the top left corner than “head”, but you get the idea)
The same information can be had by clicking on “district_clean” (or whatever the dataset is named) in the Environment pane in the upper right-hand corner of RStudio.
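Another quick way to inspect a dataset is dplyr’s glimpse(), which is already loaded with the tidyverse; it prints each column’s name, type, and first few values in a compact layout:
# Compact overview of every column: name, type, and first few values
glimpse(district_clean)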
Because we have reduced the dataset to just the columns we want, it’s much easier to get clean summaries of our data:
summary(district_clean)
## DISTNAME DZRATING DZCAMPUS DPETALLC
## Length:1207 Length:1207 Min. : 1.000 Min. : 4.0
## Class :character Class :character 1st Qu.: 2.000 1st Qu.: 337.5
## Mode :character Mode :character Median : 3.000 Median : 884.0
## Mean : 7.428 Mean : 4476.3
## 3rd Qu.: 5.000 3rd Qu.: 2746.0
## Max. :273.000 Max. :193727.0
##
## DPFEAINSK DPSTURNR
## Min. : 3122 Min. : 0.00
## 1st Qu.: 6056 1st Qu.: 14.80
## Median : 6702 Median : 19.50
## Mean : 7074 Mean : 21.51
## 3rd Qu.: 7577 3rd Qu.: 25.90
## Max. :54954 Max. :100.00
## NA's :5 NA's :7
Where are the NA values?
What is the average number of campuses?
What’s the average dollar expenditure per student?
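One way to answer these questions directly in code (a minimal sketch using base R; the column names are the ones we kept in district_clean):
# Count the missing (NA) values in each column
colSums(is.na(district_clean))
# Average number of campuses per district
mean(district_clean$DZCAMPUS)
# Average expenditure per student, ignoring the missing values for now
mean(district_clean$DPFEAINSK, na.rm = TRUE)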
We won’t get far if we try to get more information from columns with “NA” in them. Let’s take out those observations:
district_clean2 <- district_clean |> drop_na()
summary(district_clean2)
## DISTNAME DZRATING DZCAMPUS DPETALLC
## Length:1199 Length:1199 Min. : 1.000 Min. : 4
## Class :character Class :character 1st Qu.: 2.000 1st Qu.: 339
## Mode :character Mode :character Median : 3.000 Median : 886
## Mean : 7.466 Mean : 4500
## 3rd Qu.: 5.000 3rd Qu.: 2767
## Max. :273.000 Max. :193727
## DPFEAINSK DPSTURNR
## Min. : 3122 Min. : 0.00
## 1st Qu.: 6060 1st Qu.: 14.80
## Median : 6702 Median : 19.50
## Mean : 7081 Mean : 21.52
## 3rd Qu.: 7578 3rd Qu.: 25.90
## Max. :54954 Max. :100.00
How many observations are now in district_clean2? (Hint: you can tell from the “length” of DISTNAME; it’s also shown in the Global Environment pane in the upper right-hand corner.)
Notice that the means (averages) for DPFEAINSK and DPSTURNR have changed somewhat now that NA’s are removed.
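You can also confirm the row counts directly, if you prefer code over the Environment pane:
# Rows before and after dropping observations with NA values
nrow(district_clean)
nrow(district_clean2)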
We can get more detailed descriptive statistics on a single variable with stat.desc() from pastecs:
stat.desc(district_clean2$DPFEAINSK)
## nbr.val nbr.null nbr.na min max range
## 1.199000e+03 0.000000e+00 0.000000e+00 3.122000e+03 5.495400e+04 5.183200e+04
## sum median mean SE.mean CI.mean.0.95 var
## 8.490089e+06 6.702000e+03 7.080975e+03 6.514867e+01 1.278182e+02 5.088975e+06
## std.dev coef.var
## 2.255876e+03 3.185826e-01
(The results are in scientific notation)
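If the scientific notation bothers you, one option (purely a display preference) is to raise R’s scipen option so fixed notation is preferred, and round the results:
# Prefer fixed notation over scientific notation, then re-run with rounding
options(scipen = 100)
round(stat.desc(district_clean2$DPFEAINSK), 2)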
If you don’t like this format, you can create your own new table with these values:
DPFEAINSK_descriptives <- district_clean2 |> reframe(n=n(), min=min(DPFEAINSK),max=max(DPFEAINSK),median=median(DPFEAINSK),mean(DPFEAINSK),sd=sd(DPFEAINSK))
head(DPFEAINSK_descriptives)
## # A tibble: 1 × 6
## n min max median `mean(DPFEAINSK)` sd
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1199 3122 54954 6702 7081. 2256.
(The odd column header mean(DPFEAINSK) appears because that value was left unnamed inside reframe(); writing mean = mean(DPFEAINSK) would give a cleaner header, as in the next example.) You can also create summaries by different groups (groups should be character vectors or categories) with the group_by() command. Let’s say you wanted to know the DPFEAINSK stats by DZRATING:
DPFEAINSK_by_DZRATING <- district_clean2 |> group_by(DZRATING) |> reframe(n=n(), min=min(DPFEAINSK),max=max(DPFEAINSK),median=median(DPFEAINSK),mean=mean(DPFEAINSK),sd=sd(DPFEAINSK))
head(DPFEAINSK_by_DZRATING)
## # A tibble: 6 × 7
## DZRATING n min max median mean sd
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 391 4160 54954 6721 7138. 2966.
## 2 B 643 3122 22881 6719 7045. 1670.
## 3 C 112 3801 10760 6480. 6751. 1267.
## 4 Not Rated 9 3392 22476 8837 11097. 6737.
## 5 Not Rated: Data Under Review 3 4591 7685 4906 5727. 1703.
## 6 Not Rated: Senate Bill 1365 41 3625 14549 6741 7217. 2016.
If the column names (like “DPFEAINSK”) are too difficult to remember, you can rename them with rename():
district_clean2 <- district_clean2 |> rename(funding_per_student = DPFEAINSK)
head(district_clean2)
## # A tibble: 6 × 6
## DISTNAME DZRATING DZCAMPUS DPETALLC funding_per_student DPSTURNR
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA ISD A 3 574 8234 19.1
## 2 ELKHART ISD A 4 1150 6884 13.9
## 3 FRANKSTON ISD A 3 808 6751 21.6
## 4 NECHES ISD A 2 342 7484 18.3
## 5 PALESTINE ISD B 6 3360 6914 17.9
## 6 WESTWOOD ISD B 4 1332 6472 30.6
You can rename several columns at once within a single rename() call:
district_clean2 <- district_clean2 |> rename(total_students=DPETALLC, teacher_turnover=DPSTURNR)
head(district_clean2)
## # A tibble: 6 × 6
## DISTNAME DZRATING DZCAMPUS total_students funding_per_student teacher_turnover
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA … A 3 574 8234 19.1
## 2 ELKHART… A 4 1150 6884 13.9
## 3 FRANKST… A 3 808 6751 21.6
## 4 NECHES … A 2 342 7484 18.3
## 5 PALESTI… B 6 3360 6914 17.9
## 6 WESTWOO… B 4 1332 6472 30.6
Once we have our data the way we want it, we can arrange the dataset to examine the data visually. For example, let’s say we want to know which districts have the most total students:
district_clean2 <- district_clean2 |> arrange(desc(total_students))
head(district_clean2)
## # A tibble: 6 × 6
## DISTNAME DZRATING DZCAMPUS total_students funding_per_student teacher_turnover
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 HOUSTON… B 273 193727 5989 22.4
## 2 DALLAS … B 237 143430 6613 17.8
## 3 CYPRESS… A 88 116913 6562 14.3
## 4 NORTHSI… B 122 101584 6129 12.9
## 5 KATY ISD A 72 88165 6642 13.1
## 6 FORT BE… B 82 76543 6175 16
Another great way to visually explore data is through histograms:
hist(district_clean2$total_students)
…is this normally distributed? What about teacher turnover?
hist(district_clean2$teacher_turnover)
A bit more, perhaps, but there is still a skew.
We can transform the data using a log transformation, if needed:
hist(log(district_clean2$total_students))
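Since the tidyverse is already loaded, the same log-transformed histogram can also be drawn with ggplot2 if you prefer that style (a sketch; bins = 30 just makes ggplot2’s default bin count explicit):
# ggplot2 version of the log-transformed histogram
ggplot(district_clean2, aes(x = log(total_students))) +
  geom_histogram(bins = 30)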
We can calculate the correlation between two numeric variables fairly easily:
cor(district_clean2$funding_per_student,district_clean2$teacher_turnover)
## [1] -0.09045629
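If you also want to know whether that correlation is statistically distinguishable from zero, base R’s cor.test() reports the coefficient along with a p-value and a confidence interval:
# Test whether the correlation between funding and turnover differs from zero
cor.test(district_clean2$funding_per_student, district_clean2$teacher_turnover)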
All linear models in R follow the same general pattern:
lm(dependent variable ~ independent variables, data = dataset)
If we want to know what effect spending per student has on teacher turnover, we need to break it down:
turnover_model1 <- lm(teacher_turnover~funding_per_student+DZCAMPUS,data=district_clean2)
summary(turnover_model1)
##
## Call:
## lm(formula = teacher_turnover ~ funding_per_student + DZCAMPUS,
## data = district_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.513 -6.585 -1.901 4.616 77.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.6908247 1.0614356 24.204 < 2e-16 ***
## funding_per_student -0.0005051 0.0001397 -3.615 0.000313 ***
## DZCAMPUS -0.0795724 0.0186521 -4.266 2.15e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1196 degrees of freedom
## Multiple R-squared: 0.02305, Adjusted R-squared: 0.02142
## F-statistic: 14.11 on 2 and 1196 DF, p-value: 8.789e-07
Results look sensible: the coefficient on funding_per_student is negative, so as funding goes down, turnover goes up. Both predictors are significant, too.
But does it meet all the assumptions of a linear model?
The relationship between independent variables and the dependent variable must be linear.
plot(turnover_model1,which=1)
(This is not very linear)
The residuals of the model should be independent of each other; in other words, one observation’s error should not tell us anything about another observation’s error.
durbinWatsonTest(turnover_model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04639874 1.903456 0.09
## Alternative hypothesis: rho != 0
(the p-value is greater than 0.05, so the Durbin-Watson test finds no evidence of autocorrelation; the errors appear to be independent)
The residuals should also have constant variance (homoscedasticity). We can check this with the scale-location plot:
plot(turnover_model1,which=3)
(if the errors were homoscedastic, the points would show a similar spread all the way across and the red line would be roughly flat. They don’t!)
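The lmtest package we loaded at the top also offers a formal check: the Breusch-Pagan test, where a small p-value suggests heteroscedasticity (a sketch to complement the plot, not a replacement for it):
# Breusch-Pagan test: the null hypothesis is constant error variance
bptest(turnover_model1)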
Are the data, or at least the model’s residuals, normally distributed? We can eyeball this with a Q-Q plot.
plot(turnover_model1,which=2)
The residuals are mostly normal until we get to the outliers … then they go off the rails. This is fairly common.
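A formal complement to the Q-Q plot is the Shapiro-Wilk test on the model’s residuals (a sketch; shapiro.test() accepts samples of up to 5,000 observations, which this dataset satisfies). A small p-value suggests the residuals are not normal:
# Shapiro-Wilk normality test on the residuals
shapiro.test(residuals(turnover_model1))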
Finally, the independent variables should not be too strongly correlated with each other (multicollinearity). The variance inflation factor (VIF) checks this:
vif(turnover_model1)
## funding_per_student DZCAMPUS
## 1.012231 1.012231
DZCAMPUS and funding_per_student are both well below the common cutoffs of 5 and 10, so there is probably no multicollinearity to speak of.
To mitigate the broken linearity and homoscedasticity assumptions, one common fix is to log-transform the skewed variables and refit the model:
turnover_model_mitigated <- lm(teacher_turnover~log(DZCAMPUS)+log(funding_per_student),data=district_clean2)
summary(turnover_model_mitigated)
##
## Call:
## lm(formula = teacher_turnover ~ log(DZCAMPUS) + log(funding_per_student),
## data = district_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.774 -6.388 -1.462 4.747 73.934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.2892 13.1371 7.482 1.42e-13 ***
## log(DZCAMPUS) -2.4352 0.3078 -7.910 5.80e-15 ***
## log(funding_per_student) -8.3431 1.4722 -5.667 1.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.65 on 1196 degrees of freedom
## Multiple R-squared: 0.05832, Adjusted R-squared: 0.05674
## F-statistic: 37.03 on 2 and 1196 DF, p-value: 2.481e-16
How does it look now?
Linearity:
plot(turnover_model_mitigated,which=1)
MUCH more linear.
Heteroscedasticity:
plot(turnover_model_mitigated,which=3)
Much more homoskedastic. Not perfect, but better.
Normality:
plot(turnover_model_mitigated,which=2)
Still not perfect, but possibly a bit better.
Now let’s look at the model output again and interpret it:
summary(turnover_model_mitigated)
##
## Call:
## lm(formula = teacher_turnover ~ log(DZCAMPUS) + log(funding_per_student),
## data = district_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.774 -6.388 -1.462 4.747 73.934
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.2892 13.1371 7.482 1.42e-13 ***
## log(DZCAMPUS) -2.4352 0.3078 -7.910 5.80e-15 ***
## log(funding_per_student) -8.3431 1.4722 -5.667 1.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.65 on 1196 degrees of freedom
## Multiple R-squared: 0.05832, Adjusted R-squared: 0.05674
## F-statistic: 37.03 on 2 and 1196 DF, p-value: 2.481e-16
As the log of the number of campuses and the log of funding per student go down, teacher turnover tends to go up. The whole model is significant (p = 2.481e-16), and both log(DZCAMPUS) and log(funding_per_student) are significant, which means the results are unlikely to be due purely to chance.
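If you want to report how precise these estimates are, confidence intervals for the coefficients can be pulled with confint() (shown here for the mitigated model):
# 95% confidence intervals for the mitigated model's coefficients
confint(turnover_model_mitigated)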