When you make an Rmarkdown file, always keep this chunk:
By now you have some familiarity with R as a concept. Let’s discuss data sources and how they can be brought into R.
#set your working directory to the folder with "district.xls"
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
district<-read_excel("district.xls")
#notice the "quotation marks" around "district.xls". R is picky about syntax, so if you get an error, check your quotation marks, parentheses, and spelling.
#also, you can make comments in code by starting the line with "#"
#click on district in the "Global Environment" to the right, and take a moment to consider what you see
#in the "Files" section on the lower right, click on "district.lyt" to open the data dictionary
head(district)
## # A tibble: 6 × 137
## DISTNAME DISTRICT DZCNTYNM REGION DZRATING DZCAMPUS DPETALLC DPETBLAP DPETHISP
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA … 001902 001 AND… 07 A 3 574 4.4 11.5
## 2 ELKHART… 001903 001 AND… 07 A 4 1150 4 11.8
## 3 FRANKST… 001904 001 AND… 07 A 3 808 8.5 11.3
## 4 NECHES … 001906 001 AND… 07 A 2 342 8.2 13.5
## 5 PALESTI… 001907 001 AND… 07 B 6 3360 25.1 42.9
## 6 WESTWOO… 001908 001 AND… 07 B 4 1332 19.7 26.2
## # ℹ 128 more variables: DPETWHIP <dbl>, DPETINDP <dbl>, DPETASIP <dbl>,
## # DPETPCIP <dbl>, DPETTWOP <dbl>, DPETECOP <dbl>, DPETLEPP <dbl>,
## # DPETSPEP <dbl>, DPETBILP <dbl>, DPETVOCP <dbl>, DPETGIFP <dbl>,
## # DA0AT21R <dbl>, DA0912DR21R <dbl>, DAGC4X21R <dbl>, DAGC5X20R <dbl>,
## # DAGC6X19R <dbl>, DA0GR21N <dbl>, DA0GS21N <dbl>, DDA00A001S22R <dbl>,
## # DDA00A001222R <dbl>, DDA00A001322R <dbl>, DDA00AR01S22R <dbl>,
## # DDA00AR01222R <dbl>, DDA00AR01322R <dbl>, DDA00AM01S22R <dbl>, …
There’s a lot going on here:
summary(district$DZCAMPUS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 7.428 5.000 273.000
A better way to examine the data type of every column at once is str():
str(district)
## tibble [1,207 × 137] (S3: tbl_df/tbl/data.frame)
## $ DISTNAME : chr [1:1207] "CAYUGA ISD" "ELKHART ISD" "FRANKSTON ISD" "NECHES ISD" ...
## $ DISTRICT : chr [1:1207] "001902" "001903" "001904" "001906" ...
## $ DZCNTYNM : chr [1:1207] "001 ANDERSON" "001 ANDERSON" "001 ANDERSON" "001 ANDERSON" ...
## $ REGION : chr [1:1207] "07" "07" "07" "07" ...
## $ DZRATING : chr [1:1207] "A" "A" "A" "A" ...
## $ DZCAMPUS : num [1:1207] 3 4 3 2 6 4 2 6 4 5 ...
## $ DPETALLC : num [1:1207] 574 1150 808 342 3360 ...
## $ DPETBLAP : num [1:1207] 4.4 4 8.5 8.2 25.1 19.7 0.3 0.8 15.7 7.2 ...
## $ DPETHISP : num [1:1207] 11.5 11.8 11.3 13.5 42.9 26.2 8.6 68.7 31.2 27.9 ...
## $ DPETWHIP : num [1:1207] 79.1 80.3 75.2 75.1 27.3 48 87 28.2 48.5 60.6 ...
## $ DPETINDP : num [1:1207] 0 0.3 0.4 0.3 0.2 0.7 0 0.3 0.1 0.3 ...
## $ DPETASIP : num [1:1207] 0.5 0.2 1 0.3 0.7 0.5 0.6 0.3 1 1 ...
## $ DPETPCIP : num [1:1207] 0 0 0 0 0.1 0.1 0 0 0.1 0.1 ...
## $ DPETTWOP : num [1:1207] 4.5 3.4 3.6 2.6 3.7 4.9 3.6 1.7 3.4 3 ...
## $ DPETECOP : num [1:1207] 40.8 45.4 54.2 54.1 81.6 74 46.8 49.6 57.8 50.1 ...
## $ DPETLEPP : num [1:1207] 1 2.8 4.1 2 17.7 7.1 0.6 14.2 5.1 6.9 ...
## $ DPETSPEP : num [1:1207] 14.6 12.1 13.1 10.5 13.5 14.5 14.7 10.4 11.6 11.9 ...
## $ DPETBILP : num [1:1207] 1 2.7 4.1 2 16.1 6.8 0.6 15.2 5 6 ...
## $ DPETVOCP : num [1:1207] 30.5 31.8 43.9 29.5 30.6 38.7 37.7 24.8 18.9 34.4 ...
## $ DPETGIFP : num [1:1207] 6.1 4.6 7.3 5.6 2.3 3.2 3.3 6.8 9.2 6 ...
## $ DA0AT21R : num [1:1207] 96.7 96 95.4 95.8 93.7 94.5 96.7 92.8 97.3 95.2 ...
## $ DA0912DR21R : num [1:1207] 0 0.3 0.4 0 0 0 0 0.4 0.4 0.7 ...
## $ DAGC4X21R : num [1:1207] 100 100 95.2 95.8 99 97.8 100 96.8 100 94.1 ...
## $ DAGC5X20R : num [1:1207] 100 98.9 100 97 99.6 97 100 97.2 100 95.6 ...
## $ DAGC6X19R : num [1:1207] 96 98.8 33.3 100 98.6 97.4 100 96.7 100 95.9 ...
## $ DA0GR21N : num [1:1207] 36 91 41 23 201 95 32 293 52 196 ...
## $ DA0GS21N : num [1:1207] 34 79 40 17 198 77 27 238 52 154 ...
## $ DDA00A001S22R: num [1:1207] 84 85 83 90 74 69 86 76 82 86 ...
## $ DDA00A001222R: num [1:1207] 62 59 57 64 46 40 55 47 56 60 ...
## $ DDA00A001322R: num [1:1207] 33 30 25 27 20 16 25 21 30 31 ...
## $ DDA00AR01S22R: num [1:1207] 81 85 84 87 72 70 86 75 82 84 ...
## $ DDA00AR01222R: num [1:1207] 67 64 63 67 48 45 66 50 60 62 ...
## $ DDA00AR01322R: num [1:1207] 39 34 24 30 20 19 31 22 31 31 ...
## $ DDA00AM01S22R: num [1:1207] 88 84 85 94 75 66 81 76 81 88 ...
## $ DDA00AM01222R: num [1:1207] 65 49 57 69 44 34 42 44 53 62 ...
## $ DDA00AM01322R: num [1:1207] 34 23 26 27 20 14 19 21 29 33 ...
## $ DDA00AC01S22R: num [1:1207] 85 86 81 90 78 73 96 75 83 84 ...
## $ DDA00AC01222R: num [1:1207] 54 63 49 54 48 41 45 46 57 52 ...
## $ DDA00AC01322R: num [1:1207] 22 29 21 23 22 15 16 18 27 21 ...
## $ DDA00AS01S22R: num [1:1207] 78 90 74 83 72 68 92 81 82 87 ...
## $ DDA00AS01222R: num [1:1207] 47 63 48 51 42 38 73 50 51 60 ...
## $ DDA00AS01322R: num [1:1207] 21 42 26 26 20 15 38 27 32 36 ...
## $ DDB00A001S22R: num [1:1207] 60 46 74 88 64 56 -1 71 68 71 ...
## $ DDB00A001222R: num [1:1207] 17 22 38 48 33 26 -1 41 38 37 ...
## $ DDB00A001322R: num [1:1207] 3 8 6 19 11 11 -1 13 14 14 ...
## $ DDH00A001S22R: num [1:1207] 74 85 75 91 73 69 87 72 81 81 ...
## $ DDH00A001222R: num [1:1207] 53 56 46 69 44 36 57 42 50 53 ...
## $ DDH00A001322R: num [1:1207] 24 25 19 26 19 12 20 17 24 24 ...
## $ DDW00A001S22R: num [1:1207] 87 88 85 89 83 75 86 84 88 89 ...
## $ DDW00A001222R: num [1:1207] 66 61 62 66 60 48 55 58 67 66 ...
## $ DDW00A001322R: num [1:1207] 35 32 28 29 29 21 26 29 40 35 ...
## $ DDI00A001S22R: num [1:1207] NA 100 80 -1 75 NA NA 83 -1 62 ...
## $ DDI00A001222R: num [1:1207] NA 100 20 -1 50 NA NA 28 -1 8 ...
## $ DDI00A001322R: num [1:1207] NA 100 20 -1 17 NA NA 6 -1 0 ...
## $ DD300A001S22R: num [1:1207] 33 -1 84 -1 85 100 NA 100 93 97 ...
## $ DD300A001222R: num [1:1207] 33 -1 53 -1 77 100 NA 87 73 82 ...
## $ DD300A001322R: num [1:1207] 17 -1 16 -1 44 88 NA 67 53 56 ...
## $ DD400A001S22R: num [1:1207] NA NA NA NA -1 -1 NA NA -1 -1 ...
## $ DD400A001222R: num [1:1207] NA NA NA NA -1 -1 NA NA -1 -1 ...
## $ DD400A001322R: num [1:1207] NA NA NA NA -1 -1 NA NA -1 -1 ...
## $ DD200A001S22R: num [1:1207] 83 77 75 -1 74 62 88 85 74 83 ...
## $ DD200A001222R: num [1:1207] 54 46 58 -1 44 38 50 58 48 50 ...
## $ DD200A001322R: num [1:1207] 34 23 28 -1 18 13 6 31 13 29 ...
## $ DDE00A001S22R: num [1:1207] 76 77 77 86 70 65 81 67 77 78 ...
## $ DDE00A001222R: num [1:1207] 50 42 49 53 40 34 45 36 48 46 ...
## $ DDE00A001322R: num [1:1207] 23 19 17 17 16 14 17 14 23 19 ...
## $ DA0CT21R : num [1:1207] 58.3 51.6 92.7 87 43.3 40 12.5 42 9.6 38.3 ...
## $ DA0CC21R : num [1:1207] 19 27.7 36.8 15 49.4 28.9 -1 35.8 60 60 ...
## $ DA0CSA21R : num [1:1207] 980 979 980 1007 1048 ...
## $ DA0CAA21R : num [1:1207] NA -1 -1 18.8 21 -1 -1 22.3 NA 23.1 ...
## $ DPSATOFC : num [1:1207] 99.9 186.6 146.7 60.1 553.4 ...
## $ DPSTTOFC : num [1:1207] 46.7 104.9 74.5 30.2 260.3 ...
## $ DPSCTOFP : num [1:1207] 1.5 1.1 1.4 3.1 2.1 1.1 4.1 1.5 4.5 0.9 ...
## $ DPSSTOFP : num [1:1207] 5 2.1 3.5 5 3.4 4.6 3.4 2.6 3.1 3.9 ...
## $ DPSUTOFP : num [1:1207] 5.4 4.9 2 1.7 8.3 4.4 3 5.8 10 6 ...
## $ DPSTTOFP : num [1:1207] 46.8 56.2 50.8 50.3 47 45.5 56.7 50.8 50 49.7 ...
## $ DPSETOFP : num [1:1207] 14.8 16.2 15 13.7 19.7 19.2 9.8 15.4 11.1 8.2 ...
## $ DPSXTOFP : num [1:1207] 26.5 19.5 27.4 26.2 19.5 25.2 23 23.9 21.4 31.3 ...
## $ DPSCTOSA : num [1:1207] 93333 100313 98293 85537 99324 ...
## $ DPSSTOSA : num [1:1207] 73300 79305 71215 81593 80415 ...
## $ DPSUTOSA : num [1:1207] 59550 60616 58022 77642 63829 ...
## $ DPSTTOSA : num [1:1207] 55570 47916 50382 55346 48825 ...
## $ DPSAMIFP : num [1:1207] 15.6 13.4 10.9 16.3 32.1 29.9 1.9 41.3 22.2 18.8 ...
## $ DPSAKIDR : num [1:1207] 5.7 6.2 5.5 5.7 6.1 5 5.2 7.3 7.4 6.5 ...
## $ DPSTKIDR : num [1:1207] 12.3 11 10.8 11.3 12.9 11 9.3 14.4 14.8 13.2 ...
## $ DPST05FP : num [1:1207] 10.4 23.8 32.7 9.7 33.8 44.8 17.9 21.5 35 21.9 ...
## $ DPSTEXPA : num [1:1207] 16.7 13.5 12.8 14.8 12.7 10.3 15.4 13.8 10.2 13.8 ...
## $ DPSTADFP : num [1:1207] 14.8 19 30.7 9.6 15.4 17.4 16.9 24.3 18.5 22.4 ...
## $ DPSTURNR : num [1:1207] 19.1 13.9 21.6 18.3 17.9 30.6 14.6 11.5 17 9.5 ...
## $ DPSTBLFP : num [1:1207] 8.3 2.9 4 6.5 9.6 11.6 0 1.4 4.4 0.5 ...
## $ DPSTHIFP : num [1:1207] 0 6.7 1.3 0 13.8 6.6 0 25.7 8.9 5.6 ...
## $ DPSTWHFP : num [1:1207] 91.7 90.5 93.3 93.5 74.6 80.9 100 69 86.7 93.9 ...
## $ DPSTINFP : num [1:1207] 0 0 0 0 0 0.8 0 0.3 0 0 ...
## $ DPSTASFP : num [1:1207] 0 0 0 0 0 0 0 0.7 0 0 ...
## $ DPSTPIFP : num [1:1207] 0 0 0 0 0 0 0 0 0 0 ...
## $ DPSTTWFP : num [1:1207] 0 0 1.3 0 1.9 0 0 2.8 0 0 ...
## $ DPSTREFP : num [1:1207] 81.6 71.5 87.6 70 71.4 71.4 61 41.7 82.7 66.4 ...
## $ DPSTSPFP : num [1:1207] 9.9 8.4 7.5 5.5 10.2 6.4 5.8 14.4 6.8 9.6 ...
## $ DPSTCOFP : num [1:1207] 0 4.9 2.7 12 5 6.1 19.2 6.5 7.4 9.2 ...
## [list output truncated]
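With 137 columns, the str() listing is a lot to read. One way to condense it is to ask each column for its class and tally the results. This is only a sketch on a small made-up data frame (`toy` and its columns are hypothetical stand-ins, not the real district data):

```r
# Hypothetical stand-in for a wide data frame; NOT the real district data
toy <- data.frame(
  name   = c("A ISD", "B ISD", "C ISD"),
  region = c("07", "07", "10"),
  campus = c(3, 4, 2),
  pct    = c(4.4, 4.0, 8.5),
  stringsAsFactors = FALSE
)

# class of every column, returned as a named character vector
col_types <- sapply(toy, class)
col_types

# how many columns there are of each type
type_counts <- table(col_types)
type_counts
```

The same two lines should work on a larger data frame, as long as every column has a single class (ordered factors have two, so the result would come back as a list instead).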
mean(district$DZCAMPUS)
## [1] 7.428335
mean(district$TAXRATE)
## Warning in mean.default(district$TAXRATE): argument is not numeric or logical:
## returning NA
## [1] NA
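The warning means the column is not stored as numbers, so R refuses to average it. A minimal sketch of how you might check and convert, using a made-up character vector (`rate_chr` is hypothetical; whether TAXRATE in district.xls converts cleanly is an assumption you would need to check yourself):

```r
# Made-up values stored as text, the way a mis-typed column might be
rate_chr <- c("1.05", "0.98", "1.20", NA)

class(rate_chr)                    # "character", so math will not work
suppressWarnings(mean(rate_chr))   # NA, the same behaviour as above

# convert to numeric, then skip the missing value when averaging
rate_num  <- as.numeric(rate_chr)
rate_mean <- mean(rate_num, na.rm = TRUE)
rate_mean
```

If as.numeric() produces extra NAs along with a coercion warning, some of the text could not be read as a number, and the column needs a closer look before you trust the mean.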
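#copy the built-in "diamonds" dataset (from ggplot2) into your Global Environment so you can click on it
diamonds<-diamonds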
head(diamonds)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
We can pull the levels from an ordered factor with levels():
levels(diamonds$cut)
## [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
This will come in handy later.
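To see why the order matters, here is a small sketch that builds an ordered factor by hand. The `sizes` vector and its levels are made up for illustration:

```r
# A made-up ordered factor: the levels argument sets the order
sizes <- factor(
  c("small", "large", "medium", "small"),
  levels  = c("small", "medium", "large"),
  ordered = TRUE
)

levels(sizes)            # the order we declared, not alphabetical order

# comparisons respect the level order
below_large <- sizes < "large"
below_large

# max() also works on ordered factors
max_size <- max(sizes)
max_size
```

A plain character vector cannot do either of the last two steps sensibly, because R would only know alphabetical order.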
How can we categorize characters or factors if we can’t do math on them?
We can count how many there are in the data:
obj<-table(diamonds$cut)
obj1<-data.frame(table(diamonds$cut))
This can also be done more neatly via dplyr:
diamonds %>% count(cut)
## # A tibble: 5 × 2
## cut n
## <ord> <int>
## 1 Fair 1610
## 2 Good 4906
## 3 Very Good 12082
## 4 Premium 13791
## 5 Ideal 21551
We can also count their proportion in the overall data:
proportions(table(diamonds$cut))
##
## Fair Good Very Good Premium Ideal
## 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
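Counts and proportions can also sit side by side in one table. A sketch with dplyr, using a tiny made-up data frame in place of diamonds (`gems` is hypothetical):

```r
library(dplyr)

# Made-up data standing in for diamonds$cut
gems <- data.frame(
  cut = c("Fair", "Good", "Good", "Ideal", "Ideal", "Ideal", "Ideal", "Good")
)

# count each category, then divide by the total number of rows
cut_share <- gems %>%
  count(cut) %>%
  mutate(prop = n / sum(n))

cut_share
```

The `prop` column should always add up to 1, which is a quick sanity check.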
Let’s explore the data a bit using graphs (this is very useful)
ggplot(diamonds,aes(x=carat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
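That message is ggplot2 asking you to choose the bin size yourself instead of accepting the default of 30 bins. A sketch of one way to do it; the value 0.1 is an arbitrary choice for illustration, not a recommendation:

```r
library(ggplot2)

# binwidth is in the units of the x variable (carats here)
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

p_hist
```

Try a few values: very wide bins hide detail, and very narrow bins make the plot noisy.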
Remember “cut” from previously?
ggplot(diamonds,aes(cut)) + geom_bar()
We can use geom_point to quickly compare two numerical variables, in this case – carats vs. price for diamonds
ggplot(diamonds,aes(x=carat,y=price)) + geom_point()
What is going on here?
Do price and carat appear to be correlated?
#We can figure this out mathematically!
cor(diamonds$carat,diamonds$price)
## [1] 0.9215913
#We can also add extra dimensions, such as color:
ggplot(diamonds,aes(x=carat,y=price,color=cut)) + geom_point()
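For the curious, the number cor() reports is the Pearson correlation, which can be computed by hand. The sketch below uses five made-up carat and price values (not taken from diamonds) and checks the manual result against cor():

```r
# Made-up values for illustration only
x <- c(0.2, 0.5, 0.9, 1.2, 2.0)          # think "carat"
y <- c(350, 1400, 3900, 6500, 15000)     # think "price"

# Pearson correlation: how x and y vary together, scaled by their own spread
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

# Spearman correlation uses ranks, so it only asks "do they rise together?"
r_rank <- cor(x, y, method = "spearman")

c(manual = r_manual, builtin = r_builtin, spearman = r_rank)
```

Pearson measures how close the points are to a straight line, so a curved relationship can score lower than a rank-based measure such as Spearman.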
Now for a much bigger question, can we compare groups?
Yes, in fact, character variables and ordered factors make good “groups” to compare!
ggplot(diamonds,aes(clarity)) + geom_bar()
ggplot(diamonds,aes(x=clarity,y=price)) + geom_boxplot()
LOTS of outliers here!
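A boxplot draws a point separately when it falls more than 1.5 times the interquartile range beyond the box. Base R's boxplot.stats() applies a very similar rule, so it can show which values would be flagged. The sketch below uses a made-up data frame (`scores` is hypothetical) and also compares the groups with a median per group:

```r
# Made-up data: two groups, one extreme value in group "a"
scores <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b"),
  value = c(10, 12, 11, 13, 40, 20, 22, 21, 23, 24)
)

# median per group, a number to go with the picture
group_medians <- aggregate(value ~ group, data = scores, FUN = median)
group_medians

# values in group "a" that fall outside the whiskers
a_vals  <- scores$value[scores$group == "a"]
flagged <- boxplot.stats(a_vals)$out
flagged
```

"Outlier" here is just a plotting convention. With skewed data like prices, many perfectly real values land beyond the whiskers.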
Luckily the “district” data is already tidy, for the most part.
head(district)
## # A tibble: 6 × 137
## DISTNAME DISTRICT DZCNTYNM REGION DZRATING DZCAMPUS DPETALLC DPETBLAP DPETHISP
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CAYUGA … 001902 001 AND… 07 A 3 574 4.4 11.5
## 2 ELKHART… 001903 001 AND… 07 A 4 1150 4 11.8
## 3 FRANKST… 001904 001 AND… 07 A 3 808 8.5 11.3
## 4 NECHES … 001906 001 AND… 07 A 2 342 8.2 13.5
## 5 PALESTI… 001907 001 AND… 07 B 6 3360 25.1 42.9
## 6 WESTWOO… 001908 001 AND… 07 B 4 1332 19.7 26.2
## # ℹ 128 more variables: DPETWHIP <dbl>, DPETINDP <dbl>, DPETASIP <dbl>,
## # DPETPCIP <dbl>, DPETTWOP <dbl>, DPETECOP <dbl>, DPETLEPP <dbl>,
## # DPETSPEP <dbl>, DPETBILP <dbl>, DPETVOCP <dbl>, DPETGIFP <dbl>,
## # DA0AT21R <dbl>, DA0912DR21R <dbl>, DAGC4X21R <dbl>, DAGC5X20R <dbl>,
## # DAGC6X19R <dbl>, DA0GR21N <dbl>, DA0GS21N <dbl>, DDA00A001S22R <dbl>,
## # DDA00A001222R <dbl>, DDA00A001322R <dbl>, DDA00AR01S22R <dbl>,
## # DDA00AR01222R <dbl>, DDA00AR01322R <dbl>, DDA00AM01S22R <dbl>, …
Let’s examine just school administrator salaries for 2022:
#district administrative salaries are kept in "DPSCTOSA" per the data dictionary (district.lyt)
#we can select just the variables we need with dplyr's select() (lowercase, since R is case-sensitive)
your_variable_here<-district %>% select(DISTNAME,DPSCTOSA)
head(your_variable_here)
## # A tibble: 6 × 2
## DISTNAME DPSCTOSA
## <chr> <dbl>
## 1 CAYUGA ISD 93333
## 2 ELKHART ISD 100313
## 3 FRANKSTON ISD 98293
## 4 NECHES ISD 85537
## 5 PALESTINE ISD 99324
## 6 WESTWOOD ISD 121228
Must be nice! But there are some problems:
summary(your_variable_here)
## DISTNAME DPSCTOSA
## Length:1207 Min. : -2
## Class :character 1st Qu.: 95459
## Mode :character Median :106674
## Mean :108039
## 3rd Qu.:119540
## Max. :270000
## NA's :10
Ten missing observations and some “-2” salaries?
mean(your_variable_here$DPSCTOSA)
## [1] NA
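# computing the mean directly returns NA because there is missing data. summary() above drops the NAs behind the scenes.
# that's helpful, but we want to be exact. Also, the "-2" salaries are slightly skewing the average. Let's clean it up!
# filter(DPSCTOSA>0) keeps only positive salaries; rows where DPSCTOSA is NA are dropped too, because filter() only keeps rows where the condition is TRUE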
your_variable_here_cleaned<-your_variable_here %>% filter(DPSCTOSA>0)
mean(your_variable_here_cleaned$DPSCTOSA)
## [1] 108401.4
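It helps to see the two problems separately. In this sketch the `salary` vector is made up, with one missing value and one "-2" code mixed in:

```r
# Made-up salaries: one NA and one negative placeholder code
salary <- c(93333, 100313, NA, -2, 121228)

mean(salary)                              # NA: a single missing value is enough
mean_no_na <- mean(salary, na.rm = TRUE)  # drops the NA but still averages in the -2
mean_no_na

# keep only values that are present AND positive
keep       <- !is.na(salary) & salary > 0
clean_mean <- mean(salary[keep])
clean_mean
```

Using na.rm = TRUE alone would not have been enough here, because the placeholder code is a real number as far as R is concerned.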
Let’s graph this a bit
ggplot(your_variable_here_cleaned,aes(DPSCTOSA)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
compare_two<-district %>% select(DISTNAME,DPSCTOSA,DPSTTOSA)
compare_two<-compare_two %>% filter(DPSCTOSA>0)
ggplot(compare_two,aes(DPSCTOSA,DPSTTOSA)) + geom_point()
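Note that the filter above only cleans the administrator column. If the teacher salary column also contains negative placeholder codes or NAs (an assumption worth checking with summary()), both columns can be filtered at once. A sketch on made-up numbers, where `pay`, `admin`, and `teacher` are hypothetical names:

```r
library(dplyr)

# Made-up salaries with placeholder codes and an NA in both columns
pay <- data.frame(
  admin   = c(93333, 100313, -2, 85537, NA, 121228),
  teacher = c(55570, 47916, 50382, -1, 48825, 52000)
)

# conditions separated by commas must ALL be TRUE for a row to be kept
pay_clean <- pay %>% filter(admin > 0, teacher > 0)

nrow(pay_clean)
cor(pay_clean$admin, pay_clean$teacher)
```

Only rows that are valid in both columns survive, so the scatterplot and the correlation describe the same set of districts.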
For your homework this week, please (take a deep breath) do the following:
Due two weeks from now: