Reading in the dataset swiss and taking a look at what it contains
data(swiss)
nrow(swiss)
## [1] 47
names(swiss)
## [1] "Fertility" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
Raw data on its own is generally not useful. How Catholic was Switzerland in 1888, based on the 47 numbers listed below?
swiss$Catholic
## [1] 9.96 84.84 93.40 33.77 5.16 90.57 92.85 97.16 97.67 91.38
## [11] 98.61 8.52 2.27 4.43 2.82 24.20 3.30 12.11 2.15 2.84
## [21] 5.23 4.52 15.14 4.20 2.40 5.23 2.56 7.72 18.46 6.10
## [31] 99.71 99.68 100.00 98.96 98.22 99.06 99.46 96.83 5.62 13.79
## [41] 11.22 16.92 4.97 8.65 42.34 50.43 58.33
Looking at summary statistics for the variable will help us know what is high and low in the data, and what a typical figure is. We can produce summary statistics for one variable…
summary(swiss$Catholic)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.150 5.195 15.140 41.144 93.125 100.000
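Each figure in that summary can also be computed on its own. The sketch below uses only base R functions and should reproduce the values reported by summary(); the object names (catholic, catholic_mean and so on) are illustrative choices, not part of the dataset.

```r
# The Catholic column of the built-in swiss dataset (47 observations)
data(swiss)
catholic <- swiss$Catholic

# Individual statistics that summary() bundles together
catholic_min    <- min(catholic)
catholic_median <- median(catholic)
catholic_mean   <- mean(catholic)
catholic_max    <- max(catholic)

# First and third quartiles (the 25th and 75th percentiles)
catholic_quartiles <- quantile(catholic, probs = c(0.25, 0.75))

print(c(min = catholic_min, median = catholic_median,
        mean = catholic_mean, max = catholic_max))
print(catholic_quartiles)
```

The mean sits well above the median here, which suggests the values are not spread evenly between low and high.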
We can also produce summary statistics for an entire dataset at once.
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
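summary() reports a fixed set of statistics. If a single statistic is wanted for every column, one option is to apply the function column by column with base R's sapply(), as sketched here. The standard deviation is used only as an example of a statistic that summary() leaves out, and the object names are illustrative.

```r
data(swiss)

# Apply one function to every column of the data frame
column_means <- sapply(swiss, mean)
column_sds   <- sapply(swiss, sd)

# Round for display; the means should agree with the Mean row of summary(swiss)
print(round(column_means, 2))
print(round(column_sds, 2))
```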
The package pander helps make output and tables more attractive (IMHO).
You’ll need to install pander before loading it.
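Installation only has to happen once per machine, while library() is needed in every new session. A minimal sketch of one way to handle that follows. ensure_installed is a hypothetical helper name, not part of pander or base R; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}
```

Calling ensure_installed("pander") once before library(pander) avoids reinstalling the package every time the script runs.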
library(pander)
pander(summary(swiss))
| Fertility | Agriculture | Examination | Education |
|---|---|---|---|
| Min. :35.00 | Min. : 1.20 | Min. : 3.00 | Min. : 1.00 |
| 1st Qu.:64.70 | 1st Qu.:35.90 | 1st Qu.:12.00 | 1st Qu.: 6.00 |
| Median :70.40 | Median :54.10 | Median :16.00 | Median : 8.00 |
| Mean :70.14 | Mean :50.66 | Mean :16.49 | Mean :10.98 |
| 3rd Qu.:78.45 | 3rd Qu.:67.65 | 3rd Qu.:22.00 | 3rd Qu.:12.00 |
| Max. :92.50 | Max. :89.70 | Max. :37.00 | Max. :53.00 |
| Catholic | Infant.Mortality |
|---|---|
| Min. : 2.150 | Min. :10.80 |
| 1st Qu.: 5.195 | 1st Qu.:18.15 |
| Median : 15.140 | Median :20.00 |
| Mean : 41.144 | Mean :19.94 |
| 3rd Qu.: 93.125 | 3rd Qu.:21.70 |
| Max. :100.000 | Max. :26.60 |
Install and load the pander package
Read in the dataset USArrests
Create a summary table for the entire dataset
Create summary statistics for just the Assault rate
We were able to produce summary statistics for everything in swiss because every variable in it is numeric. Often, datasets don’t come to us that cleanly, and some variables won’t produce summary statistics as easily.
The dataset MASchools is one example. It is in the package AER, which you’ll need to install before loading.
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
data("MASchools")
head(MASchools)
Notice below that two of the columns show no statistics, because they aren’t numeric. Notice that others have a lot of zeros, making a few of the statistics less informative. And 14 columns of summary statistics may be more than we want.
summary(MASchools)
## district municipality expreg expspecial
## Length:220 Length:220 Min. :2905 Min. : 3832
## Class :character Class :character 1st Qu.:4065 1st Qu.: 7442
## Mode :character Mode :character Median :4488 Median : 8354
## Mean :4605 Mean : 8901
## 3rd Qu.:4972 3rd Qu.: 9722
## Max. :8759 Max. :53569
##
## expbil expocc exptot scratio
## Min. : 0 Min. : 0 Min. :3465 Min. : 2.300
## 1st Qu.: 0 1st Qu.: 0 1st Qu.:4730 1st Qu.: 6.100
## Median : 0 Median : 0 Median :5155 Median : 7.800
## Mean : 3037 Mean : 1104 Mean :5370 Mean : 8.107
## 3rd Qu.: 0 3rd Qu.: 0 3rd Qu.:5789 3rd Qu.: 9.800
## Max. :295140 Max. :15088 Max. :9868 Max. :18.400
## NA's :9
## special lunch stratio income
## Min. : 8.10 Min. : 0.40 Min. :11.40 Min. : 9.686
## 1st Qu.:13.38 1st Qu.: 5.30 1st Qu.:15.80 1st Qu.:15.223
## Median :15.45 Median :10.55 Median :17.10 Median :17.128
## Mean :15.97 Mean :15.32 Mean :17.34 Mean :18.747
## 3rd Qu.:17.93 3rd Qu.:20.02 3rd Qu.:19.02 3rd Qu.:20.376
## Max. :34.30 Max. :76.20 Max. :27.00 Max. :46.855
##
## score4 score8 salary english
## Min. :658.0 Min. :641.0 Min. :24.96 Min. : 0.0000
## 1st Qu.:701.0 1st Qu.:685.0 1st Qu.:33.80 1st Qu.: 0.0000
## Median :711.0 Median :698.0 Median :35.88 Median : 0.0000
## Mean :709.8 Mean :698.4 Mean :35.99 Mean : 1.1177
## 3rd Qu.:720.0 3rd Qu.:712.0 3rd Qu.:37.96 3rd Qu.: 0.8859
## Max. :740.0 Max. :747.0 Max. :44.49 Max. :24.4939
## NA's :40 NA's :25
We can select just the columns we want in our summary statistics, though, by extracting them from our initial dataset. Let’s say we’re only interested in english, salary, and score8.
MASchools2 <- MASchools[, c("english", "salary", "score8")]
pander(summary(MASchools2))
| english | salary | score8 |
|---|---|---|
| Min. : 0.0000 | Min. :24.96 | Min. :641.0 |
| 1st Qu.: 0.0000 | 1st Qu.:33.80 | 1st Qu.:685.0 |
| Median : 0.0000 | Median :35.88 | Median :698.0 |
| Mean : 1.1177 | Mean :35.99 | Mean :698.4 |
| 3rd Qu.: 0.8859 | 3rd Qu.:37.96 | 3rd Qu.:712.0 |
| Max. :24.4939 | Max. :44.49 | Max. :747.0 |
| NA | NA’s :25 | NA’s :40 |
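Columns can also be picked by type rather than by name. The sketch below uses a small made-up data frame (the name schools_demo and its values are invented for illustration and are not taken from MASchools). It keeps only the numeric columns before summarising them, and shows how missing values affect a single statistic.

```r
# Made-up data for illustration only
schools_demo <- data.frame(
  district = c("A", "B", "C", "D"),
  salary   = c(30.5, 35.2, NA, 40.1),
  score    = c(690, 701, 712, 698),
  stringsAsFactors = FALSE
)

# TRUE for each column that holds numbers
is_num <- sapply(schools_demo, is.numeric)

# Keep only the numeric columns, then summarise them
schools_numeric <- schools_demo[, is_num, drop = FALSE]
print(summary(schools_numeric))

# For a single statistic, missing values must be dropped explicitly
mean_salary <- mean(schools_demo$salary, na.rm = TRUE)
print(mean_salary)
```

The same pattern, MASchools[, sapply(MASchools, is.numeric)], should work on MASchools as well, although it would keep all 14 numeric columns rather than a chosen few.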
Install and load the package AER
Load the dataset MASchools
Create summary statistics for lunch, stratio, income, and score4