The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a numerical and graphical summary of the data as in the previous question.
The summary() command is a quick way to get the usual univariate summary information. Several of the variables have a minimum of 0. According to the description of the variables in the dataset, these values are expected and are not data entry errors: a value of 0 in the race variable indicates White, and similarly a value of 0 in ne indicates that the male worker sampled does not live in the Northeast. The minimum of -2 for experience, however, is out of line, since years of experience cannot be negative. Sorting this variable shows 33 negative values, which are most likely errors. We set these negative values to NA.
require(faraway)
## Loading required package: faraway
summary(uswages)
##       wage              educ           exper            race
##  Min.   :  50.39   Min.   : 0.00   Min.   :-2.00   Min.   :0.000
##  1st Qu.: 308.64   1st Qu.:12.00   1st Qu.: 8.00   1st Qu.:0.000
##  Median : 522.32   Median :12.00   Median :15.00   Median :0.000
##  Mean   : 608.12   Mean   :13.11   Mean   :18.41   Mean   :0.078
##  3rd Qu.: 783.48   3rd Qu.:16.00   3rd Qu.:27.00   3rd Qu.:0.000
##  Max.   :7716.05   Max.   :18.00   Max.   :59.00   Max.   :1.000
##       smsa             ne              mw               so
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:1.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000
##  Median :1.000   Median :0.000   Median :0.0000   Median :0.0000
##  Mean   :0.756   Mean   :0.229   Mean   :0.2485   Mean   :0.3125
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:1.0000
##  Max.   :1.000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000
##        we             pt
##  Min.   :0.00   Min.   :0.0000
##  1st Qu.:0.00   1st Qu.:0.0000
##  Median :0.00   Median :0.0000
##  Mean   :0.21   Mean   :0.0925
##  3rd Qu.:0.00   3rd Qu.:0.0000
##  Max.   :1.00   Max.   :1.0000
uswages$exper[uswages$exper < 0] <- NA
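The count of negative experience values quoted above can be checked directly instead of being read off a sorted listing. The following is a minimal sketch, assuming the faraway package is installed and exposes the dataset as faraway::uswages; the names raw, cleaned and n_negative are illustrative and do not come from the original analysis.

```r
# Work on a fresh copy so the already-cleaned uswages object is left alone.
# `raw`, `cleaned` and `n_negative` are illustrative names (assumptions).
raw <- faraway::uswages

# Smallest experience values: any negative entries sort to the front.
head(sort(raw$exper), 40)

# Count the negative entries directly.
n_negative <- sum(raw$exper < 0)
n_negative

# Same replacement as in the text, applied to the copy.
cleaned <- raw
cleaned$exper[cleaned$exper < 0] <- NA

# Every negative value should now be NA, and no negatives should remain.
stopifnot(sum(is.na(cleaned$exper)) == n_negative)
summary(cleaned$exper)
```

Comparing the NA count with the earlier negative count is a cheap guard against a replacement condition that catches too many or too few rows.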
Following data clean-up, I plot some univariate graphs, namely a histogram, a kernel density estimate and an index plot of the sorted values, for each of the wage, education and experience variables. None of the three distributions is symmetric: each is either left- or right-skewed. Wage, for example, has a long right tail, with a mean of 608.12 against a median of 522.32 and a maximum of 7716.05. I also draw bivariate plots, mainly to check the relationship between wages and education. Wages increase with education, as expected, but there are some outliers: high wages at 0 and at about 3 years of education. Perhaps these outliers correspond to entrepreneurs?
## Loading required package: ggplot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 33 rows containing non-finite values (stat_density).
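The plotting code that produced the messages above is not shown. The following is a minimal sketch of plots of the kind described, assuming the faraway and ggplot2 packages are installed; the object names (wages, p_hist, p_dens, p_sorted, p_scatter, low_educ), the choice of variable for each panel and the cut-off of 3 years of education are illustrative assumptions, not the author's actual code.

```r
library(ggplot2)

# Fresh copy with the same clean-up as in the text (names are illustrative).
wages <- faraway::uswages
wages$exper[wages$exper < 0] <- NA

# Histogram of wage; setting bins explicitly avoids the stat_bin() message.
p_hist <- ggplot(wages, aes(x = wage)) +
  geom_histogram(bins = 30)

# Kernel density estimate of experience; na.rm = TRUE drops the NA rows
# silently instead of emitting the "Removed ... rows" warning.
p_dens <- ggplot(wages, aes(x = exper)) +
  geom_density(na.rm = TRUE)

# Index plot of sorted values, shown here for education.
sorted_educ <- data.frame(
  index = seq_len(nrow(wages)),
  educ  = sort(wages$educ)
)
p_sorted <- ggplot(sorted_educ, aes(x = index, y = educ)) +
  geom_point()

# Bivariate view: wage against years of education.
p_scatter <- ggplot(wages, aes(x = educ, y = wage)) +
  geom_point(alpha = 0.3)

print(p_hist)
print(p_dens)
print(p_sorted)
print(p_scatter)

# List the highest-paid workers with very little schooling
# (educ <= 3 is an assumed cut-off for illustration).
low_educ <- wages[wages$educ <= 3, c("wage", "educ", "exper")]
head(low_educ[order(-low_educ$wage), ])
```

Listing the low-education rows by descending wage makes it easy to look at the individual records behind the outliers seen in the scatter plot.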