Compiled on Fri Aug 25 16:12:15 2017.
Source file ⇒ lec2.Rmd
We are going to look at the first 6 lines of a dataset called CPS85, which contains data from the 1985 Current Population Survey. wage is the wage in US dollars per hour, and exper is the number of years of work experience. The dataset is displayed as a data frame.
library(mosaicData)
head(CPS85)
## wage educ race sex hispanic south married exper union age sector
## 1 9.0 10 W M NH NS Married 27 Not 43 const
## 2 5.5 12 W M NH NS Married 20 Not 38 sales
## 3 3.8 12 W F NH NS Single 4 Not 22 sales
## 4 10.5 12 W F NH NS Married 29 Not 47 clerical
## 5 15.0 12 W M NH NS Married 40 Union 58 const
## 6 9.0 16 W F NH NS Married 27 Not 49 clerical
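For readers who do not have mosaicData installed, the sketch below rebuilds part of the output above by hand. It copies four of the numeric columns from the six printed rows into a small data frame; the name cps_head is made up for this illustration and is not part of the package.

```r
# Hand-typed copy of four columns from the six rows printed above.
# The object name cps_head is hypothetical, chosen only for this sketch.
cps_head <- data.frame(
  wage  = c(9.0, 5.5, 3.8, 10.5, 15.0, 9.0),  # US dollars per hour
  educ  = c(10, 12, 12, 12, 12, 16),
  exper = c(27, 20, 4, 29, 40, 27),            # years of work experience
  age   = c(43, 38, 22, 47, 58, 49)
)

str(cps_head)                     # one row per person, one column per variable
mean_wage <- mean(cps_head$wage)  # a one-number summary of the six wages
print(mean_wage)
```

A data frame stores each variable as a column, which is why a single column can be pulled out with $ and summarized on its own.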
In statistics, describing data (called descriptive statistics) is often best done visually.
We will be describing the components of graphs made with the ggplot2 package.
We start with just a blank canvas called a frame.
library(ggplot2)
ggplot()
ggplot works by layering graphics in your frame. We pipe our data frame into ggplot using the %>% operator, which comes from the magrittr package and is also made available by dplyr.
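As a rough illustration of what piping does, the sketch below uses base R's native pipe |> (available in R 4.1 or later) rather than %>%, so that it runs without extra packages. For a simple chain like this one the two pipes behave the same way: the value on the left becomes the first argument of the function on the right. The wages vector is just the six wage values printed earlier.

```r
# The six wage values shown by head(CPS85).
wages <- c(9.0, 5.5, 3.8, 10.5, 15.0, 9.0)

# Nested call: read from the inside out.
nested <- round(mean(wages), 1)

# Piped call: read from left to right; each result feeds the next function.
piped <- wages |> mean() |> round(1)

print(identical(nested, piped))
```

In the same way, CPS85 %>% ggplot(aes(x=wage)) passes CPS85 as the first argument of ggplot(), which is its data argument.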
Aesthetics are properties of the graphics (such as position or color) that are mapped to variables in the data frame.
CPS85 %>% ggplot(aes(x=wage))
Here, position is an aesthetic of our graphics, and we assign the variable wage to the x axis.
Let's make a histogram (a graph consisting of blocks used to summarize data) of the continuous variable wage.
CPS85 %>% ggplot(aes(x=wage)) + geom_histogram(binwidth=10)
The horizontal axis consists of class intervals or bins. In this example the binwidth is 10.
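To make the idea of class intervals concrete, the sketch below counts how many of the six wages printed earlier fall into bins of width 10, using base R only. The bin edges are an assumption: they are centred on multiples of 10 (so -5 to 5, 5 to 15, and so on), which, as far as we know, is what ggplot2 does by default when only binwidth is given.

```r
# The six wage values shown by head(CPS85).
wages <- c(9.0, 5.5, 3.8, 10.5, 15.0, 9.0)

# Assumed bin edges: width 10, centred on 0, 10, 20.
breaks <- seq(-5, 25, by = 10)

# Assign each wage to a class interval; right = TRUE means (a, b].
bins <- cut(wages, breaks = breaks, right = TRUE)

# Count the wages in each interval; empty intervals are kept with count 0.
bin_counts <- table(bins)
print(bin_counts)
```

Each count is the height of one block in a count-scale histogram, so this table is the numeric version of the picture.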
We can print the height of each block (in parentheses) above it using stat_bin().
CPS85 %>%
ggplot(aes(x=wage)) + geom_histogram(binwidth=10) +
stat_bin(aes(label=sprintf("(%.02f)", ..count..)), binwidth=10, geom="text", vjust=-.1)
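It is common to work in a density scale, where the unit of the vertical axis is percent per horizontal unit (in this case, percent per dollar per hour). The height of the first block is (125/534) × 100 / 10 ≈ 2.3 percent per dollar per hour, where 125 is the number of people in the first block, 534 is the total number of people in the study, and 10 is the width of the block. Note that ggplot2 draws the density axis as a proportion (about 0.023) rather than as a percent.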
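The arithmetic for the first block can be checked step by step. This sketch uses only the numbers quoted in the text (125, 534 and a bin width of 10); the variable names are ours.

```r
count_first <- 125   # people in the first block (from the text)
n_total     <- 534   # people in the study (from the text)
bin_width   <- 10    # width of each block, in dollars per hour

share      <- count_first / n_total   # proportion of people in the block
height     <- share / bin_width       # density: proportion per dollar per hour
height_pct <- 100 * height            # the same height in percent per dollar per hour
area_pct   <- height_pct * bin_width  # area of the block, in percent

print(round(c(height = height, height_pct = height_pct, area_pct = area_pct), 3))
```

Dividing by the bin width is what turns a share of the data into a height, and multiplying the height by the width recovers that share as the area of the block.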
CPS85 %>%
ggplot(aes(x=wage,..density..))+ geom_histogram(binwidth=10) +
stat_bin(aes(label=sprintf("(%.02f)", 10*..count../sum(..count..))),binwidth=10, geom="text",vjust=-.1)
The areas of the blocks are now percentages that add up to 100. In the histogram above, the first block has area 23%, which is its height of 2.3 times its width of 10; the second block has area 66%, which is 6.6 times 10; and so on.
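The claim that the block areas add up to 100 can be checked on a small example. The sketch below uses base R's hist() with plot = FALSE on the six wages printed earlier, not on the full dataset, and it assumes bin edges centred on multiples of 10.

```r
# The six wage values shown by head(CPS85).
wages <- c(9.0, 5.5, 3.8, 10.5, 15.0, 9.0)

# Assumed bin edges of width 10; plot = FALSE returns the numbers only.
h <- hist(wages, breaks = seq(-5, 25, by = 10), plot = FALSE)

widths    <- diff(h$breaks)            # width of each block
areas_pct <- 100 * h$density * widths  # area of each block, in percent
total_pct <- sum(areas_pct)

print(round(areas_pct, 1))
print(total_pct)
```

Whatever the data, the areas on the density scale sum to 100 percent, because each area is just the share of observations that falls in that block.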
Please do the following in-class exercise. You may wish to copy and paste the link into your browser's address bar.