Learning Objectives

By the end of this lesson, students will:

  1. Practice uploading and building datasets

  2. Compute descriptive statistics, both with packages and from scratch, if needed

  3. Understand how to calculate correlation coefficients

  4. Build regression models

  5. Understand regression assumptions and how to mitigate broken assumptions

UPLOADING AND BUILDING A DATASET

First, load the packages we will use: tidyverse and readxl for loading and manipulating data, plus pastecs, lmtest, and car for the statistics later in this lesson.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Then, you have to load your data. Make sure your data is in the same folder as your R Markdown file. If you haven’t saved your R Markdown file to a folder, do it now.

There are several commands to load files into R; which one to use depends on the file type:

destination <- read_csv("file.csv")

destination <- read_excel("file.xlsx") {or .xls}

load("file.RData") {for native "R" files; load() restores the saved objects directly, so no arrow is needed}

destination <- read_table("file.txt") {for whitespace-separated text tables}

If you come across a weird file type, you can always google “How to load [file type] into R” – there will almost always be a way.

district <- read_excel("district.xls")

Using tidyverse to manipulate data

We can use the arrow “<-” and pipe “|>” to manipulate the data to our liking. The grammar for this is as follows:

new_data <- old_data |> command(variable)

  1. new_data is the new dataset we want to construct

  2. old_data is the dataset we already have

  3. command() is whatever we are telling R to perform

  4. “variable” is whatever column (if any) we want to perform command() on

That’s it!

new_cars <- mtcars |> filter(hp>100)

head(new_cars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8  360 245 3.21 3.570 15.84  0  0    3    4

In this case, we are telling R that we want to create a new dataset called “new_cars” from the old dataset “mtcars”, and we want to run the command “filter” so that only cars with hp greater than 100 are kept.
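
Pipes can also be chained, with each step handing its result to the next command. Here is a quick sketch using the built-in mtcars data (the name “efficient_power” is just for illustration):

# keep powerful cars, keep three columns, then sort by mpg (highest first)
efficient_power <- mtcars |>
  filter(hp > 100) |>
  select(mpg, hp, wt) |>
  arrange(desc(mpg))

head(efficient_power)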

Building a dataset

Data should almost always be cleaned from its raw form. Typically we want to construct a new dataset from an old one. Three principles are found in the literature, along with a fourth principle of my own:

  1. Every observation should be a single row

  2. Every variable should be a single column

  3. Every cell should be a single value

  4. (Erik’s) The final dataset should contain only the columns we are interested in

# for district, this is pretty easy: we just need the columns we want

district_clean <- district |> select(DISTNAME,DZRATING,DZCAMPUS,DPETALLC,DPFEAINSK,DPSTURNR)

Now the data is much easier to inspect.

DESCRIPTIVE STATISTICS - INSPECTING THE LOADED DATA

The first command you should run is “head()”

head(district_clean)
## # A tibble: 6 × 6
##   DISTNAME      DZRATING DZCAMPUS DPETALLC DPFEAINSK DPSTURNR
##   <chr>         <chr>       <dbl>    <dbl>     <dbl>    <dbl>
## 1 CAYUGA ISD    A               3      574      8234     19.1
## 2 ELKHART ISD   A               4     1150      6884     13.9
## 3 FRANKSTON ISD A               3      808      6751     21.6
## 4 NECHES ISD    A               2      342      7484     18.3
## 5 PALESTINE ISD B               6     3360      6914     17.9
## 6 WESTWOOD ISD  B               4     1332      6472     30.6

(It’s more like the top left corner than “head”, but you get the idea)

You can see the same information by clicking on “district_clean” (or whatever the dataset name is) in the Environment pane in the upper right-hand corner.
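
Another quick option is glimpse() from dplyr (loaded for us by tidyverse), which prints every column along with its type and first few values:

glimpse(district_clean)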

Because we have reduced the dataset to just the columns we want, it’s much easier to get clean summaries of our data:

summary(district_clean)
##    DISTNAME           DZRATING            DZCAMPUS          DPETALLC       
##  Length:1207        Length:1207        Min.   :  1.000   Min.   :     4.0  
##  Class :character   Class :character   1st Qu.:  2.000   1st Qu.:   337.5  
##  Mode  :character   Mode  :character   Median :  3.000   Median :   884.0  
##                                        Mean   :  7.428   Mean   :  4476.3  
##                                        3rd Qu.:  5.000   3rd Qu.:  2746.0  
##                                        Max.   :273.000   Max.   :193727.0  
##                                                                            
##    DPFEAINSK        DPSTURNR     
##  Min.   : 3122   Min.   :  0.00  
##  1st Qu.: 6056   1st Qu.: 14.80  
##  Median : 6702   Median : 19.50  
##  Mean   : 7074   Mean   : 21.51  
##  3rd Qu.: 7577   3rd Qu.: 25.90  
##  Max.   :54954   Max.   :100.00  
##  NA's   :5       NA's   :7

Where are the NA values?

What is the average number of campuses?

What’s the average dollar expenditure per student?
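
If you would rather answer these questions with code than by squinting at the summary table, here is one sketch: is.na() flags the missing cells, and na.rm = TRUE tells mean() to skip them.

# count the NA's in each column
colSums(is.na(district_clean))

# average number of campuses (DZCAMPUS has no NA's)
mean(district_clean$DZCAMPUS)

# average expenditure per student, skipping the NA's
mean(district_clean$DPFEAINSK, na.rm = TRUE)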

Remove NA’s

We won’t get far if we try to get more information from columns with “NA” in them. Let’s take out those observations:

district_clean2 <- district_clean |> drop_na()

summary(district_clean2)
##    DISTNAME           DZRATING            DZCAMPUS          DPETALLC     
##  Length:1199        Length:1199        Min.   :  1.000   Min.   :     4  
##  Class :character   Class :character   1st Qu.:  2.000   1st Qu.:   339  
##  Mode  :character   Mode  :character   Median :  3.000   Median :   886  
##                                        Mean   :  7.466   Mean   :  4500  
##                                        3rd Qu.:  5.000   3rd Qu.:  2767  
##                                        Max.   :273.000   Max.   :193727  
##    DPFEAINSK        DPSTURNR     
##  Min.   : 3122   Min.   :  0.00  
##  1st Qu.: 6060   1st Qu.: 14.80  
##  Median : 6702   Median : 19.50  
##  Mean   : 7081   Mean   : 21.52  
##  3rd Qu.: 7578   3rd Qu.: 25.90  
##  Max.   :54954   Max.   :100.00

How many observations are now in district_clean2? (Hint: you can tell by the “length” of DISTNAME.) It’s also shown in the Global Environment in the upper right-hand corner.
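
You can also ask for the row count directly with base R’s nrow():

nrow(district_clean2)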

Notice that the means (averages) for DPFEAINSK and DPSTURNR have changed somewhat now that NA’s are removed.

More detailed descriptives with stat.desc()

We can get more detailed descriptive statistics on a single variable with stat.desc() from pastecs:

stat.desc(district_clean2$DPFEAINSK)
##      nbr.val     nbr.null       nbr.na          min          max        range 
## 1.199000e+03 0.000000e+00 0.000000e+00 3.122000e+03 5.495400e+04 5.183200e+04 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 8.490089e+06 6.702000e+03 7.080975e+03 6.514867e+01 1.278182e+02 5.088975e+06 
##      std.dev     coef.var 
## 2.255876e+03 3.185826e-01

(The results are in scientific notation)
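
If the scientific notation bothers you, one option (a sketch using base R’s scipen display option) is to penalize scientific notation before printing:

options(scipen = 100)   # discourage scientific notation in printed output
round(stat.desc(district_clean2$DPFEAINSK), 2)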

Alternatively, you can create your own new table with just the values you want:

DPFEAINSK_descriptives <- district_clean2 |> reframe(n=n(), min=min(DPFEAINSK),max=max(DPFEAINSK),median=median(DPFEAINSK),mean(DPFEAINSK),sd=sd(DPFEAINSK))

head(DPFEAINSK_descriptives)
## # A tibble: 1 × 6
##       n   min   max median `mean(DPFEAINSK)`    sd
##   <int> <dbl> <dbl>  <dbl>             <dbl> <dbl>
## 1  1199  3122 54954   6702             7081. 2256.

(Notice that we forgot to name the mean column above, which is why it prints as “mean(DPFEAINSK)”; writing mean=mean(DPFEAINSK), as below, gives it a clean name.) You can also create summaries by different groups (groups should be character vectors or categories) with the group_by() command. Let’s say you wanted to know the DPFEAINSK stats by DZRATING:

DPFEAINSK_by_DZRATING <- district_clean2 |> group_by(DZRATING) |> reframe(n=n(), min=min(DPFEAINSK),max=max(DPFEAINSK),median=median(DPFEAINSK),mean=mean(DPFEAINSK),sd=sd(DPFEAINSK))

head(DPFEAINSK_by_DZRATING)
## # A tibble: 6 × 7
##   DZRATING                         n   min   max median   mean    sd
##   <chr>                        <int> <dbl> <dbl>  <dbl>  <dbl> <dbl>
## 1 A                              391  4160 54954  6721   7138. 2966.
## 2 B                              643  3122 22881  6719   7045. 1670.
## 3 C                              112  3801 10760  6480.  6751. 1267.
## 4 Not Rated                        9  3392 22476  8837  11097. 6737.
## 5 Not Rated: Data Under Review     3  4591  7685  4906   5727. 1703.
## 6 Not Rated: Senate Bill 1365     41  3625 14549  6741   7217. 2016.

If the column names (like “DPFEAINSK”) are too difficult to remember, you can rename them with rename():

district_clean2 <- district_clean2 |> rename(funding_per_student = DPFEAINSK)
head(district_clean2)
## # A tibble: 6 × 6
##   DISTNAME      DZRATING DZCAMPUS DPETALLC funding_per_student DPSTURNR
##   <chr>         <chr>       <dbl>    <dbl>               <dbl>    <dbl>
## 1 CAYUGA ISD    A               3      574                8234     19.1
## 2 ELKHART ISD   A               4     1150                6884     13.9
## 3 FRANKSTON ISD A               3      808                6751     21.6
## 4 NECHES ISD    A               2      342                7484     18.3
## 5 PALESTINE ISD B               6     3360                6914     17.9
## 6 WESTWOOD ISD  B               4     1332                6472     30.6

You can also rename several columns at once within a single rename() call:

district_clean2 <- district_clean2 |> rename(total_students=DPETALLC, teacher_turnover=DPSTURNR)
head(district_clean2)
## # A tibble: 6 × 6
##   DISTNAME DZRATING DZCAMPUS total_students funding_per_student teacher_turnover
##   <chr>    <chr>       <dbl>          <dbl>               <dbl>            <dbl>
## 1 CAYUGA … A               3            574                8234             19.1
## 2 ELKHART… A               4           1150                6884             13.9
## 3 FRANKST… A               3            808                6751             21.6
## 4 NECHES … A               2            342                7484             18.3
## 5 PALESTI… B               6           3360                6914             17.9
## 6 WESTWOO… B               4           1332                6472             30.6

Arrange

Once we have our data the way we want it, we can arrange the dataset to examine the data visually. For example, let’s say we want to know which districts have the most total students:

district_clean2 <- district_clean2 |> arrange(desc(total_students))

head(district_clean2)
## # A tibble: 6 × 6
##   DISTNAME DZRATING DZCAMPUS total_students funding_per_student teacher_turnover
##   <chr>    <chr>       <dbl>          <dbl>               <dbl>            <dbl>
## 1 HOUSTON… B             273         193727                5989             22.4
## 2 DALLAS … B             237         143430                6613             17.8
## 3 CYPRESS… A              88         116913                6562             14.3
## 4 NORTHSI… B             122         101584                6129             12.9
## 5 KATY ISD A              72          88165                6642             13.1
## 6 FORT BE… B              82          76543                6175             16

Histograms for Exploration

Another great way to visually explore data is through histograms:

hist(district_clean2$total_students)

…is this normally distributed? What about teacher turnover?

hist(district_clean2$teacher_turnover)

A bit more, perhaps, but there is still a skew.

We can transform the data using a log transformation, if needed:

hist(log(district_clean2$total_students))
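
One caution before logging other variables: teacher_turnover has a minimum of 0 (see the summary above), and log(0) is -Inf. A common workaround is log1p(), which computes log(1 + x):

# log1p() handles the zeros that plain log() would turn into -Inf
hist(log1p(district_clean2$teacher_turnover))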

CORRELATION

We can calculate the correlation between numeric variables fairly easily:

cor(district_clean2$funding_per_student,district_clean2$teacher_turnover)
## [1] -0.09045629
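
The correlation is negative but small. If you also want a significance test, base R’s cor.test() reports a p-value and confidence interval for the same correlation:

cor.test(district_clean2$funding_per_student, district_clean2$teacher_turnover)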

BUILDING LINEAR MODELS

All linear models in R follow the same general form:

dependent variable ~ independent variables, data = data

If we want to know what effect spending per student has on teacher turnover, we need to break it down:

  1. Dependent variable (teacher turnover) – the variable we are trying to explain
  2. Independent variable (spending per student) – the variable we think will explain the dependent variable
  3. data (our dataset) – district_clean2

We will also include a second independent variable, DZCAMPUS (the number of campuses), as a control:

turnover_model1 <- lm(teacher_turnover~funding_per_student+DZCAMPUS,data=district_clean2)

summary(turnover_model1)
## 
## Call:
## lm(formula = teacher_turnover ~ funding_per_student + DZCAMPUS, 
##     data = district_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.513  -6.585  -1.901   4.616  77.292 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         25.6908247  1.0614356  24.204  < 2e-16 ***
## funding_per_student -0.0005051  0.0001397  -3.615 0.000313 ***
## DZCAMPUS            -0.0795724  0.0186521  -4.266 2.15e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.84 on 1196 degrees of freedom
## Multiple R-squared:  0.02305,    Adjusted R-squared:  0.02142 
## F-statistic: 14.11 on 2 and 1196 DF,  p-value: 8.789e-07

Results look good: the coefficient on funding_per_student is negative (as funding goes down, turnover goes up), and both predictors are statistically significant.

But does it meet all the assumptions of a linear model?

Assumption #1: Linearity

The relationship between independent variables and the dependent variable must be linear.

plot(turnover_model1,which=1)

(This is not very linear)
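
For a closer look at linearity predictor by predictor, the car package (already loaded) offers component-plus-residual plots; this is an optional extra check:

crPlots(turnover_model1)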

Assumption #2: Independence of Errors

The residuals of the model should be independent of each other; in other words, knowing one residual should tell us nothing about any other (no autocorrelation).

durbinWatsonTest(turnover_model1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.04639874      1.903456    0.09
##  Alternative hypothesis: rho != 0

(the p-value is greater than 0.05, so the Durbin-Watson test fails to reject independence; the errors appear to be independent)

Assumption #3: Homoscedasticity

plot(turnover_model1,which=3)

(if the model were homoscedastic, the dots would be spread evenly around the red line across its whole range. They aren’t!)
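
A formal complement to the plot is the Breusch-Pagan test from lmtest, which we loaded at the top; a small p-value suggests heteroscedasticity:

bptest(turnover_model1)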

Assumption #4: Normality

Is the data (or, more precisely, its residuals) normal? We can eyeball this from a Q-Q plot.

plot(turnover_model1,which=2)

The data is mostly normal until it gets to the outliers … then it goes off the rails. This is fairly common.
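
If you want a numeric check to go with the Q-Q plot, one option is base R’s Shapiro-Wilk test on the residuals (it accepts samples of up to 5,000 observations):

shapiro.test(residuals(turnover_model1))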

Assumption #5: Multicollinearity

vif(turnover_model1)
## funding_per_student            DZCAMPUS 
##            1.012231            1.012231

DZCAMPUS and funding_per_student both have VIFs well below the common cutoffs of 5 and 10, so there is probably no multicollinearity to speak of.

Mitigation – What do we need to fix?

  1. Linearity
  2. Heteroscedasticity
  3. Normality

A common mitigation for all three is to log-transform the skewed variables:

turnover_model_mitigated <- lm(teacher_turnover~log(DZCAMPUS)+log(funding_per_student),data=district_clean2)

summary(turnover_model_mitigated)
## 
## Call:
## lm(formula = teacher_turnover ~ log(DZCAMPUS) + log(funding_per_student), 
##     data = district_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.774  -6.388  -1.462   4.747  73.934 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               98.2892    13.1371   7.482 1.42e-13 ***
## log(DZCAMPUS)             -2.4352     0.3078  -7.910 5.80e-15 ***
## log(funding_per_student)  -8.3431     1.4722  -5.667 1.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.65 on 1196 degrees of freedom
## Multiple R-squared:  0.05832,    Adjusted R-squared:  0.05674 
## F-statistic: 37.03 on 2 and 1196 DF,  p-value: 2.481e-16

How does it look now?

Linearity:

plot(turnover_model_mitigated,which=1)

MUCH more linear.

Heteroscedasticity:

plot(turnover_model_mitigated,which=3)

Much more homoscedastic. Not perfect, but better.
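
If the remaining heteroscedasticity still worries you, one further mitigation (a sketch using coeftest() from lmtest and hccm() from car, both already loaded) is to keep the model but report heteroscedasticity-consistent standard errors:

# robust (HC3) standard errors for the mitigated model
coeftest(turnover_model_mitigated, vcov. = hccm(turnover_model_mitigated))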

Normality:

plot(turnover_model_mitigated,which=2)

Still not perfect, but possibly a bit better.


Looking back at the model summary: as the log of campuses and the log of funding per student go down, teacher turnover tends to go up. The whole model is significant (p = 2.481e-16), and both log(DZCAMPUS) and log(funding_per_student) are individually significant, which means the results are unlikely to be caused purely by chance.