This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
library(ISLR); library(ggplot2); library(caret); library(Hmisc); library(gridExtra)
## Loading required package: lattice
## Loading required package: grid
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
data(Wage)
summary(Wage)
##       year           age               sex                    maritl
##  Min.   :2003   Min.   :18.00   1. Male  :3000   1. Never Married: 648
##  1st Qu.:2004   1st Qu.:33.75   2. Female:   0   2. Married      :2074
##  Median :2006   Median :42.00                    3. Widowed      :  19
##  Mean   :2006   Mean   :42.41                    4. Divorced     : 204
##  3rd Qu.:2008   3rd Qu.:51.00                    5. Separated    :  55
##  Max.   :2009   Max.   :80.00
##
##        race                   education                     region
##  1. White:2480   1. < HS Grad      :268   2. Middle Atlantic   :3000
##  2. Black: 293   2. HS Grad        :971   1. New England       :   0
##  3. Asian: 190   3. Some College   :650   3. East North Central:   0
##  4. Other:  37   4. College Grad   :685   4. West North Central:   0
##                  5. Advanced Degree:426   5. South Atlantic    :   0
##                                           6. East South Central:   0
##                                           (Other)              :   0
##            jobclass               health      health_ins      logwage
##  1. Industrial :1544   1. <=Good     : 858   1. Yes:2083   Min.   :3.000
##  2. Information:1456   2. >=Very Good:2142   2. No : 917   1st Qu.:4.447
##                                                            Median :4.653
##                                                            Mean   :4.654
##                                                            3rd Qu.:4.857
##                                                            Max.   :5.763
##
##       wage
##  Min.   : 20.09
##  1st Qu.: 85.38
##  Median :104.92
##  Mean   :111.70
##  3rd Qu.:128.68
##  Max.   :318.34
##
Get training/test sets
inTrain <- createDataPartition(y=Wage$wage,
                               p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]
dim(training); dim(testing)
## [1] 2102 12
## [1] 898 12
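The split above is random, so the selected rows change on every run. A minimal sketch of a reproducible version follows, assuming the same `ISLR` and `caret` packages; the seed value 32343 is an arbitrary choice for illustration and does not come from these notes.

```r
library(ISLR)   # provides the Wage data set
library(caret)  # provides createDataPartition()

data(Wage)

# Fix the random seed so the same rows are selected on every run.
# The value 32343 is an arbitrary, hypothetical choice.
set.seed(32343)
inTrain <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)

training <- Wage[inTrain, ]
testing  <- Wage[-inTrain, ]

# The two subsets together should cover every row exactly once.
nrow(training) + nrow(testing) == nrow(Wage)

# Share of rows that ended up in the training set (close to 0.7).
nrow(training) / nrow(Wage)
```

Setting the seed immediately before `createDataPartition()` keeps the partition stable even if other random steps are added earlier in the script.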
Feature plot (caret package)
featurePlot(x=training[,c("age","education","jobclass")],
            y = training$wage,
            plot="pairs")
Qplot (ggplot2 package)
qplot(age,wage,data=training)
Qplot with color (ggplot2 package)
qplot(age,wage,colour=jobclass,data=training)
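Recent ggplot2 releases mark `qplot()` as deprecated (from version 3.4.0, as far as I am aware), so it may help to see the same plot written with `ggplot()`. This is a sketch of one equivalent form, not the only one; it uses the full `Wage` data rather than `training` so that it runs on its own.

```r
library(ISLR)     # provides the Wage data set
library(ggplot2)

data(Wage)

# Same mappings as qplot(age, wage, colour = jobclass, data = ...):
# aes() declares the mappings and geom_point() draws the scatter plot.
p <- ggplot(Wage, aes(x = age, y = wage, colour = jobclass)) +
  geom_point()
p
```

Further layers such as `geom_smooth()` can be added to `p` with `+`, exactly as with a `qplot()` object.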
Add regression smoothers (ggplot2 package)
qq <- qplot(age,wage,colour=education,data=training)
qq + geom_smooth(method='lm',formula=y~x)
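Because `colour=education` is mapped for the whole plot, `geom_smooth(method='lm')` draws one fitted line per education level. The sketch below fits the corresponding models directly so the slopes can be inspected; the names `fits` and `coefs` are introduced here for illustration, and the full `Wage` data is used so the block is self-contained.

```r
library(ISLR)  # provides the Wage data set
data(Wage)

# Fit wage ~ age separately within each education level, mirroring the
# one-line-per-colour behaviour of the smoother in the plot.
fits <- lapply(split(Wage, Wage$education),
               function(d) lm(wage ~ age, data = d))

# One row per education level: intercept and slope of each fitted line.
coefs <- t(sapply(fits, coef))
coefs
```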
cut2, making factors (Hmisc package)
cutWage <- cut2(training$wage,g=3)
table(cutWage)
## cutWage
## [ 20.1, 93) [ 93.0,119) [118.9,318]
##         714         715         673
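With `g=3`, `cut2` splits the variable into three groups of roughly equal size. A rough base R analogue is sketched below using `quantile()` and `cut()`. It is an approximation rather than a reimplementation of `cut2`: `cut()` closes intervals on the right by default, so boundaries and counts can differ slightly. The names `breaks` and `wageGroup` are illustrative.

```r
library(ISLR)  # provides the Wage data set
data(Wage)

# Tercile boundaries: the 0%, 33.3%, 66.7% and 100% quantiles of wage.
breaks <- quantile(Wage$wage, probs = seq(0, 1, length.out = 4))

# include.lowest = TRUE keeps the minimum wage inside the first bin.
wageGroup <- cut(Wage$wage, breaks = breaks, include.lowest = TRUE)

table(wageGroup)
```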
Boxplots with cut2
p1 <- qplot(cutWage,age, data=training,fill=cutWage,
            geom=c("boxplot"))
p1
Boxplots with points overlaid
p2 <- qplot(cutWage,age, data=training,fill=cutWage, geom=c("boxplot","jitter"))
grid.arrange(p1,p2,ncol=2)
Tables
t1 <- table(cutWage,training$jobclass)
t1
##
## cutWage       1. Industrial 2. Information
##   [ 20.1, 93)           446            268
##   [ 93.0,119)           361            354
##   [118.9,318]           260            413
prop.table(t1,1)
##
## cutWage       1. Industrial 2. Information
##   [ 20.1, 93)     0.6246499      0.3753501
##   [ 93.0,119)     0.5048951      0.4951049
##   [118.9,318]     0.3863299      0.6136701
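The second argument of `prop.table` selects the margin: `1` gives proportions within each row, `2` within each column. The sketch below rebuilds the table from the counts printed above, so it needs no packages, and computes both; `rowProps` and `colProps` are illustrative names.

```r
# Counts copied from the table above (rows: wage group, columns: job class).
t1 <- matrix(c(446, 268,
               361, 354,
               260, 413),
             nrow = 3, byrow = TRUE,
             dimnames = list(cutWage  = c("[ 20.1, 93)", "[ 93.0,119)", "[118.9,318]"),
                             jobclass = c("1. Industrial", "2. Information")))

# margin = 1: each row sums to 1 (job-class mix within a wage group).
rowProps <- prop.table(t1, 1)

# margin = 2: each column sums to 1 (wage-group mix within a job class).
colProps <- prop.table(t1, 2)

rowSums(rowProps)
colSums(colProps)
```

Row proportions answer "what share of each wage group works in each job class", while column proportions answer "how is each job class spread across the wage groups".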
Density plots
qplot(wage,colour=education,data=training,geom="density")