Summary Stats Notes

Beginning summary stats

Load the built-in dataset swiss and take a look at what it contains.

data(swiss)
nrow(swiss)

## [1] 47

names(swiss)

## [1] "Fertility"        "Agriculture"      "Examination"     
## [4] "Education"        "Catholic"         "Infant.Mortality"

head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Raw data on its own is generally not useful. How Catholic was Switzerland in 1888, based on the 47 numbers listed below?

swiss$Catholic

##  [1]   9.96  84.84  93.40  33.77   5.16  90.57  92.85  97.16  97.67  91.38
## [11]  98.61   8.52   2.27   4.43   2.82  24.20   3.30  12.11   2.15   2.84
## [21]   5.23   4.52  15.14   4.20   2.40   5.23   2.56   7.72  18.46   6.10
## [31]  99.71  99.68 100.00  98.96  98.22  99.06  99.46  96.83   5.62  13.79
## [41]  11.22  16.92   4.97   8.65  42.34  50.43  58.33

Looking at summary statistics for the variable helps us see what counts as high and low in the data, and what a typical figure is. We can produce summary statistics for one variable…

summary(swiss$Catholic)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.150   5.195  15.140  41.144  93.125 100.000
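Each number in that summary can also be computed directly, which is handy when only one or two of them are needed. A minimal sketch using base R functions on the same column (the variable name catholic is only for illustration):

```r
data(swiss)
catholic <- swiss$Catholic  # illustrative name for the column

min(catholic)                              # smallest value
quantile(catholic, probs = c(0.25, 0.75))  # 1st and 3rd quartiles
median(catholic)                           # middle value
mean(catholic)                             # arithmetic average
max(catholic)                              # largest value
```

Here the mean (about 41) sits well above the median (about 15), which suggests the values are not clustered around a single typical figure.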

Or for an entire dataset at once:

summary(swiss)

##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

The package pander helps to make output and tables more attractive (IMHO).

You’ll need to install pander before loading it.
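One way to avoid reinstalling a package every time the notes are run is to install it only when it is missing. The sketch below assumes a CRAN mirror is reachable; ensure_package is a hypothetical helper name, not something provided by pander or base R.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing.
ensure_package("stats")
```

Calling ensure_package("pander") before library(pander) would then install pander only on the first run.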

library(pander)
pander(summary(swiss))

Table continues below
Fertility	Agriculture	Examination	Education
Min. :35.00	Min. : 1.20	Min. : 3.00	Min. : 1.00
1st Qu.:64.70	1st Qu.:35.90	1st Qu.:12.00	1st Qu.: 6.00
Median :70.40	Median :54.10	Median :16.00	Median : 8.00
Mean :70.14	Mean :50.66	Mean :16.49	Mean :10.98
3rd Qu.:78.45	3rd Qu.:67.65	3rd Qu.:22.00	3rd Qu.:12.00
Max. :92.50	Max. :89.70	Max. :37.00	Max. :53.00

Catholic	Infant.Mortality
Min. : 2.150	Min. :10.80
1st Qu.: 5.195	1st Qu.:18.15
Median : 15.140	Median :20.00
Mean : 41.144	Mean :19.94
3rd Qu.: 93.125	3rd Qu.:21.70
Max. :100.000	Max. :26.60

Exercise set 1

install and load the pander package

read in the data set USArrests

create a summary table for the entire data set

create summary statistics for just the Assault Rate

Some challenges

We were able to produce summary statistics for everything in the dataset with swiss because everything was numeric. Often, datasets don’t come to us that cleanly, and some variables won’t yield summary statistics as easily.

The MASchools dataset used below is in the package AER, which you’ll need to install before loading.

library(AER)

## Loading required package: car

## Loading required package: carData

## Loading required package: lmtest

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Loading required package: survival

data("MASchools")
# Pass the data frame itself; head("MASchools") would only echo the string "MASchools".
head(MASchools)

Notice below that two of the columns, district and municipality, show no statistics because they are character rather than numeric. Others, such as expbil and expocc, contain so many zeros that a few of the statistics are less informative. And 14 columns of summary statistics may be more than we want.

summary(MASchools)

##    district         municipality           expreg       expspecial   
##  Length:220         Length:220         Min.   :2905   Min.   : 3832  
##  Class :character   Class :character   1st Qu.:4065   1st Qu.: 7442  
##  Mode  :character   Mode  :character   Median :4488   Median : 8354  
##                                        Mean   :4605   Mean   : 8901  
##                                        3rd Qu.:4972   3rd Qu.: 9722  
##                                        Max.   :8759   Max.   :53569  
##                                                                      
##      expbil           expocc          exptot        scratio      
##  Min.   :     0   Min.   :    0   Min.   :3465   Min.   : 2.300  
##  1st Qu.:     0   1st Qu.:    0   1st Qu.:4730   1st Qu.: 6.100  
##  Median :     0   Median :    0   Median :5155   Median : 7.800  
##  Mean   :  3037   Mean   : 1104   Mean   :5370   Mean   : 8.107  
##  3rd Qu.:     0   3rd Qu.:    0   3rd Qu.:5789   3rd Qu.: 9.800  
##  Max.   :295140   Max.   :15088   Max.   :9868   Max.   :18.400  
##                                                  NA's   :9       
##     special          lunch          stratio          income      
##  Min.   : 8.10   Min.   : 0.40   Min.   :11.40   Min.   : 9.686  
##  1st Qu.:13.38   1st Qu.: 5.30   1st Qu.:15.80   1st Qu.:15.223  
##  Median :15.45   Median :10.55   Median :17.10   Median :17.128  
##  Mean   :15.97   Mean   :15.32   Mean   :17.34   Mean   :18.747  
##  3rd Qu.:17.93   3rd Qu.:20.02   3rd Qu.:19.02   3rd Qu.:20.376  
##  Max.   :34.30   Max.   :76.20   Max.   :27.00   Max.   :46.855  
##                                                                  
##      score4          score8          salary         english       
##  Min.   :658.0   Min.   :641.0   Min.   :24.96   Min.   : 0.0000  
##  1st Qu.:701.0   1st Qu.:685.0   1st Qu.:33.80   1st Qu.: 0.0000  
##  Median :711.0   Median :698.0   Median :35.88   Median : 0.0000  
##  Mean   :709.8   Mean   :698.4   Mean   :35.99   Mean   : 1.1177  
##  3rd Qu.:720.0   3rd Qu.:712.0   3rd Qu.:37.96   3rd Qu.: 0.8859  
##  Max.   :740.0   Max.   :747.0   Max.   :44.49   Max.   :24.4939  
##                  NA's   :40      NA's   :25
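Both problems can be checked in code before reading the full summary. The sketch below uses a small made-up data frame (schools, with hypothetical values) so that it runs without AER; is_num and zero_share are illustrative names.

```r
# Made-up stand-in for a dataset with mixed column types (values are hypothetical).
schools <- data.frame(
  district = c("A", "B", "C", "D"),
  expbil   = c(0, 0, 0, 1200),
  salary   = c(33.8, 35.9, 36.0, 38.0),
  stringsAsFactors = FALSE
)

# TRUE for each column that summary() can describe numerically.
is_num <- sapply(schools, is.numeric)
is_num

# Share of zeros in each numeric column; values near 1 mean mostly zeros.
zero_share <- colMeans(schools[, is_num, drop = FALSE] == 0, na.rm = TRUE)
zero_share

# Summarise only the numeric columns.
summary(schools[, is_num, drop = FALSE])
```

With AER loaded, the same calls should work with MASchools in place of schools.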

We can restrict the summary to just the columns we want by extracting them from our initial dataset. Let’s say we’re only interested in english, salary, and score8.

MASchools2 <- MASchools[, c("english", "salary", "score8")]
pander(summary(MASchools2))

english	salary	score8
Min. : 0.0000	Min. :24.96	Min. :641.0
1st Qu.: 0.0000	1st Qu.:33.80	1st Qu.:685.0
Median : 0.0000	Median :35.88	Median :698.0
Mean : 1.1177	Mean :35.99	Mean :698.4
3rd Qu.: 0.8859	3rd Qu.:37.96	3rd Qu.:712.0
Max. :24.4939	Max. :44.49	Max. :747.0
NA	NA’s :25	NA’s :40
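The same bracket-indexing pattern works on any data frame. A minimal sketch using the built-in swiss data, so it runs without AER (swiss2 and catholic_only are illustrative names):

```r
data(swiss)

# Keep two columns by name; the result is still a data frame with all 47 rows.
swiss2 <- swiss[, c("Catholic", "Education")]
summary(swiss2)

# With a single column, drop = FALSE keeps a data frame rather than a bare vector.
catholic_only <- swiss[, "Catholic", drop = FALSE]
summary(catholic_only)
```

Keeping the selection in a new object leaves the original dataset untouched, so other columns remain available later.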

Exercise set 2

install and load the package AER

load the data set MASchools

create summary statistics for lunch, stratio, income, and score4