Example Using Wage Data

1. Loading Required Packages and Data, and Checking the Data

library(ISLR);
## Warning: package 'ISLR' was built under R version 4.0.3
library(ggplot2);
library(caret);
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
data(Wage)
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 

2. Splitting Data into Training and Test Sets

Outcome: wage. Training set: 70% (2102 data points). Testing set: 30% (898 data points). Total data = 2102 + 898 = 3000.

inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
## [1] 2102   11
dim(testing)
## [1] 898  11
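
Because createDataPartition samples at random, which rows land in each set changes from run to run unless the random seed is fixed. A minimal sketch of a reproducible split follows; the seed value 123 is an arbitrary, illustrative choice and is not the one behind the output shown above.

```r
library(ISLR)
library(caret)

data(Wage)

# Fix the random number generator so the split is repeatable.
# The seed value 123 is an arbitrary, illustrative choice.
set.seed(123)
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
testing <- Wage[-inTrain, ]

# The two pieces should account for every row exactly once.
nrow(training) + nrow(testing) == nrow(Wage)

# Share of rows that ended up in the training set (close to 0.7).
nrow(training) / nrow(Wage)
```

Running the block twice with the same seed should give identical training and testing sets, which makes later plots and tables comparable between runs.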

3. Understanding Our Training Set

3a. Feature Plot

Let's look at the relationship of age, education, and job class with the outcome Y (wage in the training set).

featurePlot(x=training[,c("age","education","jobclass")],y=training$wage, plot="pairs")


3b. The qplot Function in the ggplot2 Package

We plot age versus wage and can see some kind of trend between age and wage. We also see a large chunk of observations that appears to follow a very different relationship from the rest of the plot.

qplot(age,wage, data=training)

3c. Color-Coded qplot

Before we build our model, we want to understand why these distinct patterns exist, so we have to dig a little deeper. Let's color-code the points by job class and see whether that explains anything.

qplot(age, wage, colour=jobclass, data=training)

3d. Regression Smoothers

Most of the people in the top chunk seem to come from information jobs. We can also add regression smoothers to study the data further.

qq<-qplot(age, wage, colour=education, data=training)
qq+geom_smooth(method='lm', formula=y~x)

The geom_smooth function adds a linear smoother for each education level. We can see that people with more education tend to earn higher wages.
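
qplot is a convenience wrapper, and it was deprecated in ggplot2 3.4.0. The same kind of plot can be sketched with the full ggplot interface. This version uses the whole Wage data set so that it runs on its own, whereas the plot above uses the training set.

```r
library(ISLR)
library(ggplot2)

data(Wage)

# Points coloured by education, with one linear fit per education level.
p <- ggplot(Wage, aes(x = age, y = wage, colour = education)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p
```

Because colour is mapped inside aes, geom_smooth fits a separate line for each education level, just as the qplot version does.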

3e. The cut2 Function from the Hmisc Package

Furthermore, we can use the cut2 function to break the wage variable into categories, because we just discovered that different categories seem to have different relationships. The g parameter sets the number of quantile groups. In this example, we divide the training set into 3 categories.

library(Hmisc)
cutWage<-cut2(training$wage, g=3)
table(cutWage)
## cutWage
## [ 20.1, 91.7) [ 91.7,119.0) [119.0,318.3] 
##           702           730           670
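
To see what g=3 does, the grouping can be approximated in base R by cutting at the tertiles. This is a sketch with cut and quantile, not the cut2 implementation itself, and the wage vector is simulated (hypothetical) data so that the example runs on its own.

```r
# Simulated wages; hypothetical data standing in for training$wage.
set.seed(1)
wage <- rlnorm(300, meanlog = 4.65, sdlog = 0.35)

# Four break points (0%, 33.3%, 66.7%, 100%) define three groups.
breaks <- quantile(wage, probs = seq(0, 1, length.out = 4))

# right = FALSE gives intervals closed on the left, [a, b);
# include.lowest = TRUE closes the last interval so the maximum is kept.
wageGroup <- cut(wage, breaks = breaks, right = FALSE, include.lowest = TRUE)
table(wageGroup)
```

Each group should hold about a third of the observations, which is the pattern the cut2 output shows for the real data.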

3f. Boxplots with cut2

As we see, the wage variable has been broken into a factor based on quantile groups. We can now use this factor to make different kinds of plots. For example, to plot the wage groups against age, we call qplot again and pass it the boxplot geometry. Plotting wage groups against age can make emerging trends easier to see; the plot below shows the relationship between age and wage a little more clearly.

p1<-qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot"))
p1

The boxplots suggest that wage is positively associated with age, although three groups are not enough to establish that the relationship is linear. There are also some outliers in each wage group.

3g. Boxplots with Points Overlaid

We might also want to add the data points on top of the box plots, because box plots can hide how many points each box represents. Using the grid.arrange function, we can place the plain box plot and the box plot with points overlaid side by side for easy comparison. Here p1 is the plot we made previously and p2 is the plot with points overlaid.

library(gridExtra)
p2<-qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot", "jitter"))
grid.arrange(p1,p2, ncol=2)

Looking at these plots, we see a large number of points in each box, which suggests that the trend is supported by many observations rather than by a handful of them.

3h. Tables

Another useful step is to tabulate the factor version of the continuous variable produced by cut2 against another variable. Here we compare the wage groups with job class, and we can see, for example, that the lowest wage group contains more industrial jobs than information jobs.

t1<-table(cutWage, training$jobclass)
t1
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 91.7)           434            268
##   [ 91.7,119.0)           376            354
##   [119.0,318.3]           267            403

That trend reverses for the high-wage jobs: there are fewer industrial jobs and more information jobs. We can also use prop.table to get the proportions in each group. Passing 1 as the second argument, as in prop.table(t1, 1), gives the proportion within each row. Passing 2 instead of 1 gives the proportion within each column.

prop.table(t1,1)
##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 91.7)     0.6182336      0.3817664
##   [ 91.7,119.0)     0.5150685      0.4849315
##   [119.0,318.3]     0.3985075      0.6014925

We see that 62% of the low-wage jobs are industrial and 38% are information jobs. These statistics give an idea of how the proportions change across wage levels.
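
The difference between the two margins can be checked by rebuilding the table by hand. The sketch below copies the counts from the t1 output printed above, so it needs neither the data set nor cut2.

```r
# Counts copied from the t1 table printed above.
t1 <- as.table(matrix(
  c(434, 268,
    376, 354,
    267, 403),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    cutWage = c("[ 20.1, 91.7)", "[ 91.7,119.0)", "[119.0,318.3]"),
    jobclass = c("1. Industrial", "2. Information")
  )
))

rowProps <- prop.table(t1, 1)  # proportions within each wage group (rows sum to 1)
colProps <- prop.table(t1, 2)  # proportions within each job class (columns sum to 1)

round(rowProps, 3)
round(colProps, 3)
```

Row proportions answer "within a wage group, how are jobs divided between the two classes?", while column proportions answer "within a job class, how are jobs spread over the wage groups?".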

3i. Density Plots

Finally, we can use density plots to show the distribution of continuous predictors. Let's use the qplot function again to draw a density plot of the wage variable, with one curve per education level. It shows where the bulk of the data lies.

qplot(wage,colour=education, data=training, geom="density")

The x axis shows the wage, and the y axis shows the density, roughly the share of observations that fall near each value on the x axis. High school graduates tend to have more values in the lower part of the range, and people with advanced degrees tend to be somewhat higher. Density plots can sometimes show things that box plots cannot.
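
The y axis of a density plot is a density rather than a raw proportion: the curve is scaled so that the area under it is 1. A small sketch in base R checks this numerically, using simulated (hypothetical) wages rather than the training set.

```r
# Simulated wages; hypothetical data standing in for training$wage.
set.seed(1)
wage <- rlnorm(500, meanlog = 4.65, sdlog = 0.35)

# Kernel density estimate evaluated on an evenly spaced grid of x values.
d <- density(wage)

# Riemann-sum approximation of the area under the estimated curve.
area <- sum(d$y) * diff(d$x)[1]
area
```

The printed area should be close to 1, which is why curves for groups of very different sizes can still be compared on the same axes.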