Stat 20 lecture 2

Histograms (Chapter 3)

We are going to look at the first 6 rows of a dataset called CPS85, which contains data from the 1985 Current Population Survey. The variable wage is the wage in US dollars per hour, and exper is the number of years of work experience. The dataset is displayed as a data frame.

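library(mosaicData)  # provides the CPS85 data frame
library(ggplot2)     # plotting functions used below
library(dplyr)       # provides the pipe operator %>%
head(CPS85)          # first 6 rows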

##   wage educ race sex hispanic south married exper union age   sector
## 1  9.0   10    W   M       NH    NS Married    27   Not  43    const
## 2  5.5   12    W   M       NH    NS Married    20   Not  38    sales
## 3  3.8   12    W   F       NH    NS  Single     4   Not  22    sales
## 4 10.5   12    W   F       NH    NS Married    29   Not  47 clerical
## 5 15.0   12    W   M       NH    NS Married    40 Union  58    const
## 6  9.0   16    W   F       NH    NS Married    27   Not  49 clerical

In statistics, describing data (called descriptive statistics) is often best done visually.

We will be describing the components of graphs made with the ggplot2 package.

We start with just a blank canvas called a frame.

ggplot()

ggplot works by layering graphics in your frame. We pipe our data frame into ggplot using the pipe operator %>%, which passes the data frame on as the first argument of ggplot().
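As a rough illustration of what the pipe does, the sketch below compares a piped call with the equivalent direct call. The data frame toy is made up for this example (it is not CPS85), and magrittr is loaded here only because it is the package that defines %>%.

```r
library(ggplot2)
library(magrittr)  # defines %>% (dplyr re-exports the same operator)

# Hypothetical toy data frame, for illustration only (not CPS85)
toy <- data.frame(wage = c(3.5, 7.0, 9.0, 12.5, 22.0))

# x %>% f(y) is evaluated as f(x, y), so these two give the same result
identical(toy %>% head(2), head(toy, 2))

# Likewise, these two calls build the same empty frame
p_piped  <- toy %>% ggplot()  # data frame piped in as the first argument
p_direct <- ggplot(toy)       # data frame passed directly
```

Either form can be used; the piped form simply reads left to right, starting from the data.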

Aesthetics = properties of the graphics (such as position or color) that relate to variables in the data frame.

CPS85 %>% ggplot(aes(x=wage))

Here, position is an aesthetic of our graphics and we assign the variable wage to the x axis.

Let's make a histogram (a graph consisting of blocks used to summarize data) of the continuous variable wage.

CPS85 %>% ggplot(aes(x=wage)) + geom_histogram(binwidth=10)

The horizontal axis is divided into class intervals, or bins. In this example the bin width is 10 (dollars per hour).
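To make the idea of bins concrete, here is a small base R sketch that sorts values into class intervals of width 10 and counts them by hand. The vector toy_wage and the break points are made up for illustration; they are not CPS85 values, and the bins ggplot2 chooses by default may start at different points.

```r
# Hypothetical hourly wages (not taken from CPS85)
toy_wage <- c(3.5, 4.0, 7.0, 9.0, 12.5, 14.0, 18.0, 22.0, 26.5, 31.0)

# Class intervals of width 10: [0,10), [10,20), [20,30), [30,40)
breaks <- seq(0, 40, by = 10)

# Assign each wage to its bin, then count how many fall in each bin
bins   <- cut(toy_wage, breaks = breaks, right = FALSE)
counts <- table(bins)
counts  # counts per bin: 4, 3, 2, 1
```

Each count is the height of the corresponding block when the histogram is drawn on the count scale.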

We can print the height of each block (its count, shown in parentheses) above the block using stat_bin().

CPS85 %>%
  ggplot(aes(x=wage)) + geom_histogram(binwidth=10) +
  stat_bin(aes(label=sprintf("(%.02f)", after_stat(count))), binwidth=10, geom="text", vjust=-.1)

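It is common to work on a density scale, where the unit of the vertical axis is percent per horizontal unit (in this case, percent per dollar per hour). The height of the first block is (125/534 × 100)/10 ≈ 2.3 percent per dollar per hour, where 125 is the count in the first block, 534 is the total number of people in the study, and 10 is the width of the block. Numerically this is the same as (125/534)*10, which is the quantity printed as the label in the code below.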

CPS85 %>%
  ggplot(aes(x=wage, y=after_stat(density))) + geom_histogram(binwidth=10) +
  stat_bin(aes(label=sprintf("(%.02f)", 10*after_stat(count)/sum(after_stat(count)))), binwidth=10, geom="text", vjust=-.1)

The areas of the blocks are now percents that add up to 100. In the above histogram, the first block has area 23%, which is its height 2.3 times its width 10; the second block has area 66%, which is 6.6 times 10; and so on.
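The density-scale arithmetic can be checked with a short base R sketch. It uses hist() with plot = FALSE only to do the counting, on a made-up vector toy_wage with made-up break points (not CPS85 values), so the numbers differ from the 23% and 66% above.

```r
# Hypothetical hourly wages (not taken from CPS85)
toy_wage <- c(3.5, 4.0, 7.0, 9.0, 12.5, 14.0, 18.0, 22.0, 26.5, 31.0)

# Bins of width 10; plot = FALSE returns the numbers without drawing
h <- hist(toy_wage, breaks = seq(0, 40, by = 10), right = FALSE, plot = FALSE)

width  <- diff(h$breaks)   # width of each block (10 for every block here)
height <- 100 * h$density  # percent per dollar per hour
area   <- height * width   # percent of observations in each block

data.frame(count = h$counts, height = height, area = area)

sum(area)  # the areas add up to 100
```

Note that h$density is a proportion per horizontal unit, so it is multiplied by 100 here to get percent per unit. ggplot2's density values are on the same proportion scale, which is why the plotted axis shows values like 0.023 while the labels show 2.3.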

Adam Lucas

In class exercise: