Describing distributions with numbers

Five number summary

Load the beer data file

load(file="beer.RData")

See the internal structure (str) of the beer data

str(beer)

## 'data.frame':    212 obs. of  6 variables:
##  $ BEER         : Factor w/ 212 levels "American Amber Lager",..: 1 2 3 5 7 6 9 10 16 17 ...
##  $ Brewery      : Factor w/ 68 levels "A.B. Pripps Bryggerier (Sweden)",..: 61 61 61 4 5 5 2 7 11 11 ...
##  $ Calories     : int  136 132 96 153 95 157 94 155 177 163 ...
##  $ Carbohydrates: num  10.5 10.5 7.6 16 3.2 8.9 2.6 14.2 15.6 13.9 ...
##  $ Alcohol      : num  4.1 4.1 3.2 4.9 4.2 5.9 4.1 4.6 5.2 4.7 ...
##  $ Type         : Factor w/ 2 levels "Domestic","Imported": 1 1 1 1 1 1 1 1 1 1 ...

You can also list the variables in the data

names(beer)

## [1] "BEER"          "Brewery"       "Calories"      "Carbohydrates"
## [5] "Alcohol"       "Type"

To see the five-number summary (summary() also reports the mean)

summary(beer$Calories)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0   135.0   150.0   153.9   163.5   330.0
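
As a minimal sketch of what the summary contains, the same numbers can be pulled out piece by piece with base R. The vector below is made up for illustration and is not taken from beer.RData.

```r
# Hypothetical calorie values (an assumption, not the beer data)
calories <- c(55, 95, 130, 140, 150, 155, 160, 170, 330)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(calories)
print(five)

# First and third quartiles, computed the way summary() computes them
quartiles <- quantile(calories, c(0.25, 0.75))
print(quartiles)

# Interquartile range: third quartile minus first quartile
iqr <- IQR(calories)
print(iqr)
```

For this vector the hinges and the quartiles coincide. In general they can differ slightly, because fivenum() and quantile() use different interpolation rules.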

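Make a boxplot of the five-number summary. The range argument sets how far the whiskers extend, as a multiple of the interquartile range (IQR): range=0 runs the whiskers out to the minimum and maximum, while range=1.5 (the default) stops them at the most extreme points within 1.5 IQR of the box and draws anything beyond as individual outliers.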

boxplot(beer$Calories, range=0)

boxplot(beer$Calories, range=1.5)
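
To see what the range argument changes without drawing anything, one option is boxplot.stats(), whose coef argument plays the same role as range in boxplot(). The sketch below uses a made-up vector, not the beer data.

```r
# Hypothetical calorie values (an assumption, not the beer data)
calories <- c(55, 95, 130, 140, 150, 155, 160, 170, 330)

# coef = 1.5: whiskers stop at the most extreme points within 1.5 * IQR
# of the hinges; points beyond the fences are reported as outliers
with_fences <- boxplot.stats(calories, coef = 1.5)
print(with_fences$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(with_fences$out)    # points flagged as outliers

# coef = 0: whiskers run to the minimum and maximum, nothing is flagged
no_fences <- boxplot.stats(calories, coef = 0)
print(no_fences$stats)
print(no_fences$out)
```

Here the hinges are 130 and 160, so the IQR is 30 and the fences sit at 85 and 205. The values 55 and 330 fall outside them and are the ones drawn as separate points when range=1.5.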

Standard deviation & coefficient of variation

Standard deviation

sd(beer$Calories)

## [1] 41.38973

Coefficient of variation (to compare SD across variables with different means)

sd(beer$Calories)/mean(beer$Calories)

## [1] 0.2688716
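
A short sketch of why the coefficient of variation helps with comparisons: it does not change when the units of measurement change, whereas the standard deviation does. The helper name cv and the height values are assumptions introduced for illustration.

```r
# Hypothetical helper: standard deviation relative to the mean
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# The same made-up heights in centimetres and in metres
height_cm <- c(150, 160, 170, 180, 190)
height_m  <- height_cm / 100

# The standard deviation depends on the unit ...
print(c(sd_cm = sd(height_cm), sd_m = sd(height_m)))

# ... while the coefficient of variation does not
print(c(cv_cm = cv(height_cm), cv_m = cv(height_m)))
```

The ratio is only meaningful for variables measured on a scale with a true zero and a positive mean, such as calories or wages.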

Plot a non-parametric density estimation with a histogram

Use the following code to install the necessary R package. (Remove repos = 'http://cran.us.r-project.org' when installing in your own R session.)

install.packages("tidyverse",  repos = 'http://cran.us.r-project.org')

## Installing package into '/Users/luanpao/Library/R/3.3/library'
## (as 'lib' is unspecified)

## 
## The downloaded binary packages are in
##  /var/folders/wg/j02h67j14d7dthw4m0r0gjk80000gn/T//RtmpOuEHS6/downloaded_packages

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

To plot a smooth density estimation for a variable

ggplot(beer, aes(x=Calories)) + geom_density()

Density estimation and histogram on the same graph, with a transparent density plot

ggplot(beer, aes(x=Calories)) + geom_histogram(aes(y=..density..), binwidth=5, col="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")

To make a nicer graph, experiment with different values of binwidth

ggplot(beer, aes(x=Calories)) + geom_histogram(aes(y=..density..), binwidth=10, colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666")
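
The histogram and the density curve can share one axis because mapping y to density rescales the bars so that their areas sum to 1, which is the same scale a density estimate uses. The base R sketch below checks this numerically on synthetic data. The simulated values and the bin width of 10 are assumptions, not the beer data.

```r
# Synthetic calorie-like values (an assumption, not the beer data)
set.seed(1)
calories <- rnorm(200, mean = 150, sd = 40)

# Bin edges of width 10 that cover the whole range of the data
binwidth <- 10
breaks <- seq(floor(min(calories) / binwidth) * binwidth,
              ceiling(max(calories) / binwidth) * binwidth,
              by = binwidth)

# Histogram on the density scale: bar height = count / (n * bin width)
h <- hist(calories, breaks = breaks, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate evaluated on an evenly spaced grid
d <- density(calories)
kde_area <- sum(d$y) * diff(d$x)[1]

print(c(hist_area = hist_area, kde_area = kde_area))
```

Changing binwidth changes how rough or smooth the bars look, but the total area of the bars stays at 1.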

Normal proportions

pnorm(a, mean=mu, sd=sigma): the share of the population below 'a' if the distribution is normal with mean 'mu' and standard deviation 'sigma'

load(file='ncaa.RData')
str(ncaa)

## 'data.frame':    78 obs. of  5 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GPA        : num  7.94 8.29 4.64 7.47 8.88 ...
##  $ IQ         : int  111 107 100 107 114 115 111 97 100 112 ...
##  $ Gender     : int  2 2 2 2 1 2 2 2 1 2 ...
##  $ SelfConcept: int  67 43 52 66 58 51 71 51 49 51 ...

mean(ncaa$GPA)

## [1] 7.446538

sd(ncaa$GPA)

## [1] 2.099557

Now, find the share of the population with a GPA below 3, using the rounded mean and standard deviation from above

pnorm(3, mean=7.45, sd=2.1)

## [1] 0.01704322
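
The same share can be obtained by standardizing first, and pnorm() also gives upper-tail and in-between shares. A minimal sketch using the rounded mean and standard deviation from above; the cutoff of 9 is an arbitrary value chosen for illustration.

```r
mu <- 7.45
sigma <- 2.1

# Share below 3, computed directly
below_3 <- pnorm(3, mean = mu, sd = sigma)

# Same share via the z-score and the standard normal distribution
z <- (3 - mu) / sigma
below_3_z <- pnorm(z)

# Share above 9 (upper tail) and share between 3 and 9
above_9 <- pnorm(9, mean = mu, sd = sigma, lower.tail = FALSE)
between_3_and_9 <- pnorm(9, mean = mu, sd = sigma) - pnorm(3, mean = mu, sd = sigma)

print(c(below_3 = below_3, below_3_z = below_3_z,
        between_3_and_9 = between_3_and_9, above_9 = above_9))
```

The three shares (below 3, between 3 and 9, above 9) add up to 1, which is a quick sanity check on any calculation of this kind.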

qnorm(p, mean=mu, sd=sigma): the p-th quantile (the value below which a share p of the population falls) if the distribution is normal with mean mu and standard deviation sigma

qnorm(.9, mean=7.45, sd=2.1)

## [1] 10.14126
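
qnorm() is the inverse of pnorm(), so feeding its result back into pnorm() should return the original share. A small sketch with the same rounded mean and standard deviation:

```r
mu <- 7.45
sigma <- 2.1

# Value below which 90% of the population falls
q90 <- qnorm(0.9, mean = mu, sd = sigma)

# Round trip: the share below q90 is 0.9 again
share_below_q90 <- pnorm(q90, mean = mu, sd = sigma)

# Same quantile built from the standard normal quantile
q90_from_z <- mu + sigma * qnorm(0.9)

print(c(q90 = q90, share_below_q90 = share_below_q90, q90_from_z = q90_from_z))
```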

Normal quantile plot

qqnorm(beer$Calories)
qqline(beer$Calories, col="red")
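
A normal quantile plot pairs each sorted observation with the quantile a standard normal distribution would put at the same position. The sketch below rebuilds those pairs by hand and compares them with what qqnorm() returns. The right-skewed sample is synthetic and is an assumption, not the beer data.

```r
# Synthetic right-skewed sample (an assumption, not the beer data)
set.seed(42)
x <- rexp(100, rate = 1 / 150)

# Ask qqnorm() for its coordinates without drawing the plot
qq <- qqnorm(x, plot.it = FALSE)

# Theoretical quantiles built by hand: standard normal quantiles
# at evenly spread probability points
theoretical <- qnorm(ppoints(length(x)))

# The horizontal coordinates are these theoretical quantiles,
# the vertical coordinates are the data themselves
print(all.equal(sort(qq$x), theoretical))
print(all.equal(sort(qq$y), sort(x)))
```

If the data were close to normal, the points would fall near the straight line that qqline() draws. A skewed sample like this one bends away from the line in the tails.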

CPS data

CPS micro data (choose only people from Pennsylvania)

Load the data file

load(file="cps.RData")

Let’s see what the data look like

str(cps$hourwage)

##  atomic [1:297222] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "label")= chr "Hourly wage"
##  - attr(*, "format.stata")= chr "%4.2f"

To see the five-number summary of a subsample (Pennsylvania only) as a boxplot

boxplot(subset(cps, state=="Pennsylvania")$hourwage, range=1.5)
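
The same subsetting step can feed a numeric summary. The sketch below uses a tiny made-up data frame with the column names state and hourwage to stand in for cps, including a missing wage like those visible in the str() output.

```r
# Hypothetical stand-in for the cps data (values are made up)
cps_demo <- data.frame(
  state = c("Pennsylvania", "Ohio", "Pennsylvania", "Pennsylvania",
            "Ohio", "Pennsylvania", "Pennsylvania", "Pennsylvania"),
  hourwage = c(12.5, 20, NA, 18, 15, 25, 9.75, 40)
)

# Keep only the Pennsylvania rows
pa <- subset(cps_demo, state == "Pennsylvania")

# Count missing wages, then summarise the rest
n_missing <- sum(is.na(pa$hourwage))
pa_five <- fivenum(pa$hourwage)  # fivenum() drops NA values by default

print(n_missing)
print(pa_five)
print(summary(pa$hourwage))      # summary() also reports the number of NAs
```

Checking the number of missing values first is worthwhile with survey data, because a summary based on few non-missing wages can be misleading.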

Do your assignment with the vehicle CO2 emissions data

load(file="canfreg.RData")