library(ISLR);
## Warning: package 'ISLR' was built under R version 4.0.3
library(ggplot2);
library(caret);
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
data(Wage)
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
The outcome is wage. About 70% of the data goes to the training set (2102 observations) and about 30% to the testing set (898 observations), for a total of 2102 + 898 = 3000 observations.
inTrain<-createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training<-Wage[inTrain,]
testing<-Wage[-inTrain,]
dim(training)
## [1] 2102 11
dim(testing)
## [1] 898 11
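The split is drawn at random, so the rows that land in each set change from run to run unless the random seed is fixed first. The base R sketch below illustrates the idea with a simple random 70/30 split; it is not caret's stratified procedure, and the seed value 123 and the names idx_a and idx_b are arbitrary choices made for illustration. The same set.seed() call placed before createDataPartition() should make that partition repeatable too.

```r
# Simple random 70/30 split in base R (an illustration, not caret's
# stratified createDataPartition).
n <- 3000                        # number of rows in Wage, per the summary above

set.seed(123)                    # arbitrary seed, chosen only for illustration
idx_a <- sample(n, size = round(0.7 * n))

set.seed(123)                    # resetting the seed reproduces the same draw
idx_b <- sample(n, size = round(0.7 * n))

identical(idx_a, idx_b)          # TRUE: same seed, same training rows
length(idx_a)                    # size of the training set
n - length(idx_a)                # size of the testing set
```

A simple random split gives exactly 2100 and 900 rows here; the 2102 and 898 above differ slightly because createDataPartition samples within groups of the outcome.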
Let's look at the relationship of age, education, and job class with the outcome Y (wage in the training set).
featurePlot(x=training[,c("age","education","jobclass")],y=training$wage, plot="pairs")
Plotting age against wage shows some kind of trend between the two. We also see a big chunk of observations that behaves very differently from the rest of the plot.
qplot(age,wage, data=training)
Before building a model, we want to understand why these distinct patterns exist, so we have to dig a little deeper. Let's colour-code the points by job class and see whether that explains anything.
qplot(age, wage, colour=jobclass, data=training)
Most of the people in the top chunk seem to come from information jobs. We can also add regression smoothers to study the data further, this time colouring the points by education.
qq<-qplot(age, wage, colour=education, data=training)
qq+geom_smooth(method='lm', formula=y~x)
The geom_smooth function with method='lm' draws a linear smoother for each education level. We can see that people with more education tend to earn higher wages.
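When the points are coloured by a factor, the smoother is fitted separately within each group. The base R sketch below shows the equivalent computation on a made-up data frame (the name toy and all of its values are hypothetical, not taken from Wage): one lm(wage ~ age) fit per education level.

```r
# Toy data with made-up values: two education levels, four ages each.
toy <- data.frame(
  age       = c(25, 35, 45, 55, 25, 35, 45, 55),
  wage      = c(60, 70, 80, 90, 90, 110, 130, 150),
  education = rep(c("HS Grad", "College Grad"), each = 4)
)

# Fit one straight line per education level, as a grouped linear smoother does.
fits <- lapply(split(toy, toy$education), function(d) {
  coef(lm(wage ~ age, data = d))
})

fits  # intercept and slope for each group
```

In this toy data the slope for the college group is steeper than for the high school group, which is the kind of difference the coloured smoothers make visible.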
We can also use the cut2 function from the Hmisc package to break the wage variable into categories, since we just saw that different categories seem to have different relationships. The g parameter sets how many quantile groups the data is broken into. In this example, we divide the training-set wages into 3 categories.
library(Hmisc)
cutWage<-cut2(training$wage, g=3)
table(cutWage)
## cutWage
## [ 20.1, 91.7) [ 91.7,119.0) [119.0,318.3]
## 702 730 670
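To get a feel for what quantile grouping does, here is a rough base R analogue built from quantile() and cut(). It is a sketch of the idea only: cut2 chooses and labels its cut points in its own way, and the wage values below are made up for illustration.

```r
# Made-up wages, evenly spaced so the groups are easy to check by eye.
wage <- c(20, 35, 50, 65, 80, 95, 110, 125, 140)

# Cut points at the 0%, 33.3%, 66.7% and 100% quantiles give 3 groups.
breaks <- quantile(wage, probs = seq(0, 1, length.out = 4))

# Left-closed intervals, with the last interval also closed on the right.
groups <- cut(wage, breaks = breaks, include.lowest = TRUE, right = FALSE)

table(groups)  # roughly equal counts per group
```

Because the cut points are quantiles, each group holds about a third of the observations, which matches the near-equal counts in the cutWage table.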
As the cutWage table shows, wage has been turned into a factor of quantile groups. We can now use this factor to make different kinds of plots. For example, to plot wage groups against age, we use qplot again but pass it the box plot geometry. Plotting wage groups against age can make emerging trends easier to see; the relationship between age and wage shows a little more clearly in the plot below.
p1<-qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot"))
p1
The box plots suggest that wage tends to rise with age. There are also some outliers in every wage group.
We might also want to add the data points on top of the box plots, because box plots can obscure how many points are being shown. Using the grid.arrange function from the gridExtra package, we can place the plain box plot and the box plot with points overlaid side by side for easy comparison. Here p1 is the plot we made previously and p2 is the same plot with the points overlaid.
library(gridExtra)
p2<-qplot(cutWage, age, data=training, fill=cutWage, geom=c("boxplot", "jitter"))
grid.arrange(p1,p2, ncol=2)
Looking at these plots, we see a large number of points in each box, which suggests that the trend is backed by many observations rather than a handful.
Another useful thing is that the cut variable gives us a factor version of the continuous variable that we can tabulate. Here we make a table comparing the factor version of wage to job class, and we can see, for example, that the lowest wage group contains more industrial jobs than information jobs.
t1<-table(cutWage, training$jobclass)
t1
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 91.7) 434 268
## [ 91.7,119.0) 376 354
## [119.0,318.3] 267 403
That trend reverses for the high-wage jobs: there are fewer industrial people and more information people. We can also use prop.table to get the proportions in each group. Passing 1, as in prop.table(t1, 1), gives the proportions within each row. Passing 2 instead gives the proportions within each column.
prop.table(t1,1)
##
## cutWage 1. Industrial 2. Information
## [ 20.1, 91.7) 0.6182336 0.3817664
## [ 91.7,119.0) 0.5150685 0.4849315
## [119.0,318.3] 0.3985075 0.6014925
We see that 62% of the low wage jobs go to industrial, and 38% correspond to information. So, we can use these statistics to get an idea of how those proportions change across different wage levels.
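The difference between the two margins can be checked directly. The sketch below rebuilds the count table from the numbers printed above (typed in by hand, so it runs without the Wage data) and computes both versions; the names by_row and by_col are introduced here for illustration.

```r
# Counts copied from the cutWage by jobclass table above.
t1 <- as.table(matrix(
  c(434, 268,
    376, 354,
    267, 403),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    cutWage  = c("[ 20.1, 91.7)", "[ 91.7,119.0)", "[119.0,318.3]"),
    jobclass = c("1. Industrial", "2. Information")
  )
))

by_row <- prop.table(t1, 1)  # each row (wage group) sums to 1
by_col <- prop.table(t1, 2)  # each column (job class) sums to 1

round(by_row, 3)
round(by_col, 3)
```

Row proportions answer "within a wage group, how are jobs split by class?", while column proportions answer "within a job class, how are workers spread across wage groups?".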
Finally, we can use density plots to look at the values of continuous predictors. Let's use the qplot function again to draw a density plot of wage for each education level. It shows where the bulk of the data lies.
qplot(wage,colour=education, data=training, geom="density")
On the x axis is the wage, and on the y axis is the estimated density, which is roughly the proportion of observations that fall near each value on the x axis. High school graduates tend to have more values in the lower part of the range, and people with advanced degrees tend to be a little higher. Sometimes density plots show things that box plots can't.
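One way to read the y axis: a density curve is scaled so that the area under it is about 1, so the height at a given wage reflects how concentrated the observations are around that value. The sketch below checks this with base R's density() on simulated values (the data, the seed, and the name sim_wage are made up for illustration, not drawn from Wage).

```r
# Simulated wage-like values; made up for illustration only.
set.seed(42)
sim_wage <- rnorm(500, mean = 110, sd = 40)

# Kernel density estimate with default settings.
d <- density(sim_wage)

# Trapezoid rule: approximate area under the estimated density curve.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # close to 1
```

Since every curve has about the same total area, curves for groups of very different sizes can be compared on one plot, which is what makes the per-education density plot readable.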