This is a demo of functions I wrote to use in intial EDA. The function code is included here as an appendix, though the most recent version is centrally stored on GitHub here.
# Load libraries
library(RCurl)
library(pander)
#==============================================================================
# Functions
#==============================================================================
# Create function to source functions from GitHub
source.GitHub = function(url){
require(RCurl)
sapply(url, function(x){
eval(parse(text = getURL(x, followlocation = T,
cainfo = system.file("CurlSSL",
"cacert.pem", package = "RCurl"))),
envir = .GlobalEnv)
})
}
# Assign URL and source functions
url = "http://bit.ly/1T6LhBJ"
source.GitHub(url); rm(url)
This demo uses the Auto dataset from the ISL website (the version included in the {ISLR} package is slightly different). We’ll start by importing these data and doing some minor tweaks based on a data quality check. Our response attribute is mpg, but we are also interested exploring data by the class origin.
# Download and assign data
if(!file.exists("~/Auto.csv")){
URL = getURL("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
auto = read.csv(textConnection(URL), header = T)
rm(URL)
}
The function str() helps us check the dimensions, the first few observations, and attribute classes:
# View structure
str(auto)## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
Next we check our summary statistics:
# View summary statistics
summary(auto)## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 150 : 22
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.0 90 : 20
## Median :23.00 Median :4.000 Median :146.0 88 : 19
## Mean :23.52 Mean :5.458 Mean :193.5 110 : 18
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 100 : 17
## Max. :46.60 Max. :8.000 Max. :455.0 75 : 14
## (Other):287
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2223 1st Qu.:13.80 1st Qu.:73.00 1st Qu.:1.000
## Median :2800 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.56 Mean :75.99 Mean :1.574
## 3rd Qu.:3609 3rd Qu.:17.10 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## ford pinto : 6
## amc matador : 5
## ford maverick : 5
## toyota corolla: 5
## amc gremlin : 4
## amc hornet : 4
## (Other) :368
We notice a few potential issues:
cylinders should probably be a factor attribute;horsepower should probably be a numeric attribute;weight should probably be a numeric attribute (R will automatically convert integer class to numeric for mathematical operations);origin should probably be a factor attribute as it indicates country of origin;name should probably be a character attribute.Additionally, horsepower has a ?. That should probably be a NA value (in case any other attributes use the ? character to denote NA, we’ll apply it to the whole dataset). Note that with horsepower, we first convert it to a character to preserve the numeric values. Let’s make these changes:
# Treat "?" as NA
auto[auto == "?"] = NA
# Convert attributes as appropriate
auto$horsepower = as.numeric(as.character(auto$horsepower))
auto$weight = as.numeric(auto$weight)
auto$cylinders = as.factor(auto$cylinders)
auto$origin = as.factor(auto$origin)
auto$name = as.character(auto$name)For the purposes of this demo, we’ll drop the name attribute and remove the tuples with NA values (this is revisited in the Attribute Manipulations section). We’ll create a version of auto called auto.amas a reference for use in this later section.
# Drop name attribute
auto = subset(auto, select = -name)
# Create separate data.frame for auto
auto.am = auto
# Remove NA values
auto = na.omit(auto)
We’ll start with some visual EDA. The first user-defined function we’ll use is num.plots(). This function produces a histogram, boxplot, scatterplot, and Q-Q plot for each numeric attribute or an individual attribute named in num.plots(df = ). For the scatterplot, the response attribute mpg must be named, and the boxplot can be split by a factor class. The histogram can also plot bins of equally spaced probabilities (num.plots(prob = T)), and the normal curve overlay can be turned on or off (num.plots(norm = F)).
The function is really composed of four separate functions, which can each be called individually: num.hist, num.boxplot, num.scatter, and num.qq.
# Create plots of numeric attributes
# Use [mpg] as response in scatterplot
# Use [origin] as factor to split in boxplots
num.plots(df = auto, df.num = auto$mpg, df.fac = auto$origin)We’ll continue with your visual EDA, but now for factor attributes. This uses two user-defined functions: fac.barplot, and fac.mosaic. The former makes barplots of factor attributes (and will also accept a numeric attribute to segregate on). The latter makes mosaic plots for a named factor. Both will also accept an individual attribute named in num.plots(df = ) - this is done with the fac.mosaic() function.
Note the mosaic plot shows some pretty unbalanced classes when it comes to cylinders by origin.
# Create barplots for factor attributes
fac.barplot(df = auto)# Create mosaic plot for
fac.mosaic(df = auto$origin, df.fac.cn = auto$cylinders)
Now we’ll switch gears (no pun intended) over to quantitative EDA. First, we’ll look at summary statistics for numeric attributes split by a named factor attribute (origin). This user-defined function is called num.freq(). By default, the function will return summary stats for all numeric (or integer) attributes. A second argument, num.freq(df.num.cn = ), can be supplied specifying a single numeric or integer attribute to use. Here we use mpg.
# Summary stats for numeric attributes, split by named factor
num.freq(df.fac = auto$origin, df.num.cn = auto$mpg)## Variable Split On Levels Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 mpg origin 1 9.00 15.00 18.50 20.03 24.00 39.00
## 2 mpg origin 2 16.20 23.75 26.00 27.60 30.12 44.30
## 3 mpg origin 3 18.00 25.70 31.60 30.45 34.05 46.60
Similarly, we can look at occurence rates by factor class - think of this as the quantitative version of the mosaic plot. This user-defined function is called fac.freq(). A second argument, num.freq(df.num.cn = ), can be supplied specifying a single factor attribute to use. A third argument, fac.freq(cat = ), can be supplied, returning counts and frequencies of the factor (the default is FALSE).
# Frequency of occurrence for factor attributes, split by named factor
fac.freq(df.fac = auto$origin)## Variable Split On Levels 3 4 5 6 8
## 1 cylinders origin 1 0.00 28.16 0.00 29.80 42.04
## 2 cylinders origin 2 0.00 89.71 4.41 5.88 0.00
## 3 cylinders origin 3 5.06 87.34 0.00 7.59 0.00
## Variable Split On Levels 1 2 3
## 1 origin origin 1 100.00 0.00 0.00
## 2 origin origin 2 0.00 100.00 0.00
## 3 origin origin 3 0.00 0.00 100.00
# Counts and frequency of occurrence of named factor
fac.freq(df.fac = auto$origin, cat = F)## Variable Type 1 2 3
## 1 origin Count 245 68 79
## 2 origin Percent 62.50 17.35 20.15
The last part of the demo deals with various attribute manipulations: missing flag creation, scaling, trimming, transforming, and creating indicator attributes for each factor attribute level.
The user-defined function miss.flag() creates indicator attributes for attributes with missing data, as sometimes missingness is predictive. The function defaults to automatically do this for all attributes in the dataset, but can be specified to only do it for a specific attribute, or type (e.g. numeric or factor).
# Create temp data.frame
auto.temp = auto.am
# Apply missing flag function
auto.temp = miss.flag(df = auto.temp)When viewing attribute names, we see there’s a new attribute called MF_horsepower, with class factor.
# View names
names(auto.temp)## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "MF_horsepower"
# Check class
class(auto.temp$MF_horsepower)## [1] "factor"
# Validate application
sum(is.na(auto.temp$horsepower))## [1] 5
length(auto.temp$MF_horsepower[auto.temp$MF_horsepower == 1])## [1] 5
Moving more toward data mining, we might want to scale, trim, and transform attributes en masse. The next user-defined functions do just that: num.scale(), num.trims(), num.trans().
num.scale() will center and scale all numeric attributes in the dataset, or an individual attribute named in num.scale(df = ). Note: the final suffix is .V1 as each attribute has two additional attributes (scaled:center and scaled:scale).
# Create temp data.frame
auto.temp = auto.am
# Apply scale function
auto.temp = num.scale(df = auto.temp)When viewing attribute names, we see all numeric attributes have been centered and scaled, with the suffix _cs. Results can be seen with the summary() function.
# View names
colnames(auto.temp[, 9:ncol(auto.temp)])## [1] "mpg_cs" "displacement_cs" "horsepower_cs" "weight_cs"
## [5] "acceleration_cs" "year_cs"
# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])## mpg_cs.V1 displacement_cs.V1 horsepower_cs.V1
## Min. :-1.8508526 Min. :-1.2080194 Min. :-1.519034
## 1st Qu.:-0.8258696 1st Qu.:-0.8544397 1st Qu.:-0.765614
## Median :-0.0891631 Median :-0.4148541 Median :-0.284985
## Mean : 0.0000000 Mean : 0.0000000 Mean : 0.000000
## 3rd Qu.: 0.7116049 3rd Qu.: 0.7772830 3rd Qu.: 0.559365
## Max. : 2.9665675 Max. : 2.4902336 Max. : 3.261284
## weight_cs.V1 acceleration_cs.V1 year_cs.V1
## Min. :-1.6065223 Min. :-2.733490 Min. :-1.6232409
## 1st Qu.:-0.8857216 1st Qu.:-0.640237 1st Qu.:-0.8088504
## Median :-0.2049490 Median :-0.014980 Median : 0.0055401
## Mean : 0.0000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.7501341 3rd Qu.: 0.537784 3rd Qu.: 0.8199306
## Max. : 2.5458080 Max. : 3.355973 Max. : 1.6343210
num.trims() will Winsorize all numeric attributes in the dataset, or an individual attribute named in num.trims(df = ). The attributes are Winsorized at the 1st/99th, 5th/95th, 10th/90th, and 25th/75th percentiles.
# Create temp data.frame
auto.temp = auto.am
# Apply scale function
auto.temp = num.trims(df = auto.temp)When viewing attribute names, we see all numeric attributes have been Winsorized, with the suffixes _T99, _T95, _T90, and _T75. Results can be seen with the summary() function (trims can be seen at _T75).
# View names
colnames(auto.temp[, 9:ncol(auto.temp)])## [1] "mpg_T99" "mpg_T95" "mpg_T90"
## [4] "mpg_T75" "displacement_T99" "displacement_T95"
## [7] "displacement_T90" "displacement_T75" "horsepower_T99"
## [10] "horsepower_T95" "horsepower_T90" "horsepower_T75"
## [13] "weight_T99" "weight_T95" "weight_T90"
## [16] "weight_T75" "acceleration_T99" "acceleration_T95"
## [19] "acceleration_T90" "acceleration_T75" "year_T99"
## [22] "year_T95" "year_T90" "year_T75"
# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])## mpg_T99 mpg_T95 mpg_T90 mpg_T75
## Min. :11.00 Min. :13.00 Min. :14.00 Min. :17.00
## 1st Qu.:17.00 1st Qu.:17.00 1st Qu.:17.00 1st Qu.:17.00
## Median :22.75 Median :22.75 Median :22.75 Median :22.75
## Mean :23.44 Mean :23.34 Mean :23.21 Mean :22.85
## 3rd Qu.:29.00 3rd Qu.:29.00 3rd Qu.:29.00 3rd Qu.:29.00
## Max. :43.45 Max. :37.00 Max. :34.19 Max. :29.00
## displacement_T99 displacement_T95 displacement_T90 displacement_T75
## Min. : 70.91 Min. : 85.0 Min. : 90.0 Min. :105.0
## 1st Qu.:105.00 1st Qu.:105.0 1st Qu.:105.0 1st Qu.:105.0
## Median :151.00 Median :151.0 Median :151.0 Median :151.0
## Mean :194.29 Mean :193.9 Mean :191.0 Mean :179.7
## 3rd Qu.:275.75 3rd Qu.:275.8 3rd Qu.:275.8 3rd Qu.:269.2
## Max. :441.26 Max. :400.0 Max. :350.0 Max. :275.8
## horsepower_T99 horsepower_T95 horsepower_T90 horsepower_T75
## Min. : 48.0 Min. : 60.55 Min. : 67.0 Min. : 75.00
## 1st Qu.: 75.0 1st Qu.: 75.00 1st Qu.: 75.0 1st Qu.: 75.00
## Median : 93.5 Median : 93.50 Median : 93.5 Median : 93.50
## Mean :104.4 Mean :103.59 Mean :102.3 Mean : 97.82
## 3rd Qu.:126.0 3rd Qu.:126.00 3rd Qu.:126.0 3rd Qu.:125.25
## Max. :220.4 Max. :180.00 Max. :157.7 Max. :126.00
## weight_T99 weight_T95 weight_T90 weight_T75
## Min. :1772 Min. :1932 Min. :1990 Min. :2225
## 1st Qu.:2225 1st Qu.:2225 1st Qu.:2225 1st Qu.:2226
## Median :2804 Median :2804 Median :2804 Median :2804
## Mean :2978 Mean :2970 Mean :2960 Mean :2884
## 3rd Qu.:3615 3rd Qu.:3615 3rd Qu.:3615 3rd Qu.:3613
## Max. :4951 Max. :4464 Max. :4278 Max. :3615
## acceleration_T99 acceleration_T95 acceleration_T90 acceleration_T75
## Min. : 9.455 Min. :11.26 Min. :12.00 Min. :13.78
## 1st Qu.:13.775 1st Qu.:13.78 1st Qu.:13.78 1st Qu.:13.79
## Median :15.500 Median :15.50 Median :15.50 Median :15.50
## Mean :15.532 Mean :15.52 Mean :15.48 Mean :15.41
## 3rd Qu.:17.025 3rd Qu.:17.02 3rd Qu.:17.02 3rd Qu.:17.01
## Max. :22.317 Max. :20.23 Max. :19.00 Max. :17.02
## year_T99 year_T95 year_T90 year_T75
## Min. :70.00 Min. :70.00 Min. :71.00 Min. :73.00
## 1st Qu.:73.00 1st Qu.:73.00 1st Qu.:73.00 1st Qu.:73.00
## Median :76.00 Median :76.00 Median :76.00 Median :76.00
## Mean :75.98 Mean :75.98 Mean :75.98 Mean :75.97
## 3rd Qu.:79.00 3rd Qu.:79.00 3rd Qu.:79.00 3rd Qu.:79.00
## Max. :82.00 Max. :82.00 Max. :81.00 Max. :79.00
num.trans() will transform all numeric attributes in the dataset, or an individual attribute named in num.trans(df = ) by taking the natural log (a +1 is added for stability), square root, and square.
# Create temp data.frame
auto.temp = auto.am
# Apply scale function
auto.temp = num.trans(df = auto.temp)When viewing attribute names, we see all numeric attributes have been transformed, with the suffixes _ln (\(\log_{x}\)), _rt (\(\sqrt{x}\)), and _sq (\(X^{2}\)). Results can be seen with the summary() function.
# View names
colnames(auto.temp[, 9:ncol(auto.temp)])## [1] "mpg_ln" "mpg_rt" "mpg_sq"
## [4] "displacement_ln" "displacement_rt" "displacement_sq"
## [7] "horsepower_ln" "horsepower_rt" "horsepower_sq"
## [10] "weight_ln" "weight_rt" "weight_sq"
## [13] "acceleration_ln" "acceleration_rt" "acceleration_sq"
## [16] "year_ln" "year_rt" "year_sq"
# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])## mpg_ln mpg_rt mpg_sq displacement_ln
## Min. :2.303 Min. :3.162 Min. : 81.0 Min. :4.234
## 1st Qu.:2.890 1st Qu.:4.243 1st Qu.: 289.0 1st Qu.:4.663
## Median :3.168 Median :4.873 Median : 517.6 Median :5.024
## Mean :3.145 Mean :4.882 Mean : 610.5 Mean :5.135
## 3rd Qu.:3.401 3rd Qu.:5.477 3rd Qu.: 841.0 3rd Qu.:5.622
## Max. :3.863 Max. :6.899 Max. :2171.6 Max. :6.122
## displacement_rt displacement_sq horsepower_ln horsepower_rt
## Min. : 8.307 Min. : 4624 Min. :3.850 Min. : 6.856
## 1st Qu.:10.296 1st Qu.: 11025 1st Qu.:4.331 1st Qu.: 8.718
## Median :12.329 Median : 22801 Median :4.549 Median : 9.721
## Mean :13.499 Mean : 48718 Mean :4.599 Mean :10.115
## 3rd Qu.:16.630 3rd Qu.: 76268 3rd Qu.:4.844 3rd Qu.:11.269
## Max. :21.354 Max. :207025 Max. :5.442 Max. :15.199
## horsepower_sq weight_ln weight_rt weight_sq
## Min. : 2116 Min. :7.386 Min. :40.17 Min. : 2601769
## 1st Qu.: 5625 1st Qu.:7.708 1st Qu.:47.18 1st Qu.: 4951739
## Median : 8742 Median :7.939 Median :52.96 Median : 7859624
## Mean :12392 Mean :7.960 Mean :54.04 Mean : 9585652
## 3rd Qu.:15879 3rd Qu.:8.193 3rd Qu.:60.13 3rd Qu.:13066427
## Max. :52900 Max. :8.545 Max. :71.70 Max. :26419600
## acceleration_ln acceleration_rt acceleration_sq year_ln
## Min. :2.197 Min. :3.000 Min. : 64.0 Min. :4.263
## 1st Qu.:2.693 1st Qu.:3.844 1st Qu.:189.8 1st Qu.:4.304
## Median :2.803 Median :4.062 Median :240.2 Median :4.344
## Mean :2.792 Mean :4.053 Mean :249.1 Mean :4.342
## 3rd Qu.:2.892 3rd Qu.:4.246 3rd Qu.:289.9 3rd Qu.:4.382
## Max. :3.250 Max. :5.079 Max. :615.0 Max. :4.419
## year_rt year_sq
## Min. :8.426 Min. :4900
## 1st Qu.:8.602 1st Qu.:5329
## Median :8.775 Median :5776
## Mean :8.771 Mean :5786
## 3rd Qu.:8.944 3rd Qu.:6241
## Max. :9.110 Max. :6724
fac.flag() will transform all levels of factor attributes in the dataset to indicator attributes, or an individual attribute named in fac.flag(df = ). Note: original attributes are not removed.
# Create temp data.frame
auto.temp = auto.am
# Apply scale function
auto.temp = fac.flag(df = auto.temp)When viewing attribute names, we see all factor attributes have been transformed, with the suffixes _LEVEL, where LEVEL is a named level in the factor. Results can be seen with the summary() function.
# View names
colnames(auto.temp[, 9:ncol(auto.temp)])## [1] "cylinders_8" "cylinders_4" "cylinders_6" "cylinders_3" "cylinders_5"
## [6] "origin_1" "origin_3" "origin_2"
# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])## cylinders_8 cylinders_4 cylinders_6 cylinders_3 cylinders_5 origin_1
## 0:289 0:193 0:309 0:388 0:389 0:147
## 1:103 1:199 1: 83 1: 4 1: 3 1:245
## origin_3 origin_2
## 0:313 0:324
## 1: 79 1: 68
###############################################################################
# R_EDA_Sandbox_Functions.R
# Last updated: 2016-08-02 by MJG
###############################################################################
# A compilation of useful functions to [ideally] deploy on any data set
# Suggested order of deployment:
# Convert variables as necessary (e.g. to factors)
# Plots for EDA on numeric and factor variables
# Missing flags
# Missing imputes
# Trims
# Transforms
#==============================================================================
# Accuracy
#==============================================================================
#--------------------------------------
# fit()
#--------------------------------------
# Function to add MSE to other measures from forecast::accuracy
fit = function(f, x){
require(forecast)
temp = data.frame(forecast::accuracy(f, x),
forecast::accuracy(f, x)[, 2]^2)
temp = temp[, -c(1)]
colnames(temp)[6] <- "MSE"
temp = temp[c(6, 1, 2, 3, 4, 5)]
print(temp)
}
#==============================================================================
# Missing Observations
#==============================================================================
#--------------------------------------
# miss.flag()
#--------------------------------------
# Function to create indicator variables as missing flags
miss.flag = function(df, df.cn = c("num", "fac")){
# Check for columns to apply
if (missing(df.cn)){
cols = colnames(df)
} else if (df.cn == "num"){
cols = colnames(df[, !sapply(df, is.factor)])
} else if (df.cn == "fac"){
cols = colnames(df[, sapply(df, is.factor)])
}
# Apply function
for (i in cols){
if (sum(is.na(df[, i])) > 0){
df[paste("MF", i, sep = "_")] =
as.factor(ifelse(is.na(df[, i]), 1, 0))
}
}
return(df)
}
#==============================================================================
# Numeric Variables
#==============================================================================
#------------------------------------------------------------------------------
# Plots
#------------------------------------------------------------------------------
#--------------------------------------
# num.boxplot()
#--------------------------------------
# Function to create boxplots of numeric variables
num.boxplot = function(df, df.fac){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, !sapply(df, is.factor)])
}
# Create plot(s)
for (i in cols){
if (missing(df.fac)){
boxplot(df[, i], col = "grey", horizontal = T,
main = paste("Boxplot of ", df.name, "$", i, sep = ""),
xlab = paste(df.name, "$", i, sep = ""),
ylab = "Values")
} else if (!class(df.fac) %in% c("factor")){
stop("Please supply a factor variable to df.fac")
} else {
fac = unlist(strsplit(deparse(substitute(df.fac)),
split = "$", fixed = T))[2]
boxplot(df[, i] ~ df[, fac], col = "grey", horizontal = T,
main = paste(df.name, "$", i," versus ",
deparse(substitute(df.fac)), sep = ""),
ylab = "Values")
}
}
}
#--------------------------------------
# num.hist()
#--------------------------------------
# Function to create histograms of numeric variables
# Optional choice of equal spaced probabilities or normal curve overlay
num.hist = function(df, prob = F, norm = T){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, !sapply(df, is.factor)])
}
# Create plot(s)
for (i in cols){
main = paste("Histogram of ", df.name, "$", i, sep = "")
sub = ifelse(norm, "normal curve overlay (blue)", "")
y = hist(df[, i], plot = F)
if (prob){
seq = seq(0.0, 1.0, by = 0.1)
h = hist(df[, i], col = "grey", main = main, sub = sub,
breaks = quantile(df[, i], probs = seq),
xlab = paste(df.name, "$", i, sep = ""))
}
if (!prob){
h = hist(df[, i], col = "grey", main = main, sub = sub,
ylim = c(0, 1.15*max(y$counts)),
xlab = paste(df.name, "$", i, sep = ""))
}
if (norm){
xfit = seq(min(df[, i]), max(df[, i]), length = 100)
yfit = dnorm(xfit, mean = mean(df[, i]), sd = sd(df[, i]))
if (norm & !prob){
yfit = yfit * diff(h$mids[1:2]) * length(df[, i])
}
lines(xfit, yfit, col = "blue", lwd = 2)
}
}
}
#--------------------------------------
# num.qq()
#--------------------------------------
# Function to create Q-Q plots of numeric variables
num.qq = function(df){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, !sapply(df, is.factor)])
}
# Create plot(s)
for (i in cols){
qqnorm(df[, i], pch = 21, bg = "grey",
main = paste("Normal Q-Q Plot of ", df.name, "$", i, sep = ""))
qqline(df[, i], lwd = 2, col = "blue")
}
}
#--------------------------------------
# num.scatter()
#--------------------------------------
# Function to create scatterplots of numeric variables
num.scatter = function(df, df.num){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, !sapply(df, is.factor)])
}
# Create plot(s)
num = unlist(strsplit(deparse(substitute(df.num)),
split = "$", fixed = T))[2]
for (i in cols){
plot(df[, i], df[, num], pch = 21, bg = "grey",
main = paste(df.name, "$", num, " versus ",
df.name, "$", i, sep = ""),
ylab = paste(df.name, "$", num, sep = ""),
xlab = paste(df.name, "$", i, sep = ""))
}
}
#--------------------------------------
# num.plots()
#--------------------------------------
# Function to produce four plots per variable:
# num.plots(which = ) corresponds as follows:
# 1 = Histogram
# 2 = Scatterplot
# 3 = Boxplot
# 4 = QQ Plot
num.plots = function(df, df.num, df.fac, prob = F, norm = T,
which = c(1, 2, 3, 4)){
# Check for which plots to create
if (missing(which)){
par(mfcol = c(2, 2))
}
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, !sapply(df, is.factor)])
}
# Create plot(s)
for (i in cols){
#------------------------------
# Histograms
#------------------------------
if (1 %in% which){
main = paste("Histogram of ", df.name, "$", i, sep = "")
sub = ifelse(norm, "normal curve overlay (blue)", "")
y = hist(df[, i], plot = F)
if (prob){
seq = seq(0.0, 1.0, by = 0.1)
h = hist(df[, i], col = "grey", main = main, sub = sub,
breaks = quantile(df[, i], probs = seq),
xlab = paste(df.name, "$", i, sep = ""))
}
if (!prob){
h = hist(df[, i], col = "grey", main = main, sub = sub,
ylim = c(0, 1.15*max(y$counts)),
xlab = paste(df.name, "$", i, sep = ""))
}
if (norm){
xfit = seq(min(df[, i]), max(df[, i]), length = 100)
yfit = dnorm(xfit, mean = mean(df[, i]), sd = sd(df[, i]))
if (norm & !prob){
yfit = yfit * diff(h$mids[1:2]) * length(df[, i])
}
lines(xfit, yfit, col = "blue", lwd = 2)
}
}
#------------------------------
# Scatterplots
#------------------------------
if (2 %in% which){
num = unlist(strsplit(deparse(substitute(df.num)),
split = "$", fixed = T))[2]
plot(df[, i], df[, num], pch = 21, bg = "grey",
main = paste(df.name, "$", num, " versus ",
df.name, "$", i, sep = ""),
ylab = paste(df.name, "$", num, sep = ""),
xlab = paste(df.name, "$", i, sep = ""))
}
#------------------------------
# Boxplots
#------------------------------
if (3 %in% which){
if (missing(df.fac)){
boxplot(df[, i], col = "grey", horizontal = T,
main = paste("Boxplot of ", df.name, "$", i, sep = ""),
xlab = paste(df.name, "$", i, sep = ""),
ylab = "Values")
} else if (!class(df.fac) %in% c("factor")){
stop("Please supply a factor variable to df.fac")
} else {
fac = unlist(strsplit(deparse(substitute(df.fac)),
split = "$", fixed = T))[2]
boxplot(df[, i] ~ df[, fac], col = "grey", horizontal = T,
main = paste(df.name, "$", i," versus ",
deparse(substitute(df.fac)), sep = ""),
ylab = "Values")
}
}
#------------------------------
# QQ Plots
#------------------------------
if (4 %in% which){
qqnorm(df[, i], pch = 21, bg = "grey",
main = paste("Normal Q-Q Plot of ", df.name, "$", i, sep = ""))
qqline(df[, i], lwd = 2, col = "blue")
}
}
return(par(mfcol = c(1, 1)))
}
#------------------------------------------------------------------------------
# Variable Manipulation
#------------------------------------------------------------------------------
#--------------------------------------
# num.freq()
#--------------------------------------
# Summary stats for numeric variables, split by named factor
num.freq = function(df.fac, df.num.cn){
table.results = data.frame()
# Check df.fac is factor
if (!class(df.fac) %in% c("factor")){
stop("Please supply a factor variable to df.fac")
}
# Assign data.frame and name
temp = unlist(strsplit(deparse(substitute(df.fac)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
fac = temp[2]
if (missing(df.num.cn)){
cols = colnames(df[, !sapply(df, is.factor)])
} else if (!class(df.num.cn) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df.num.cn")
} else {
cols = unlist(strsplit(deparse(substitute(df.num.cn)),
split = "$", fixed = T))[2]
}
for (i in cols){
name.var = rep(paste(i), each = nlevels(df[, fac]))
name.split = rep(paste(fac), each = nlevels(df[, fac]))
table.level = levels(df[, fac])
table.agg = format(aggregate(df[, i], by = list(Var = df[, fac]),
summary)$x, nsmall = 2)
table.row = as.data.frame(cbind(name.var, name.split,
table.level, table.agg))
table.results = rbind(table.results, table.row)
}
colnames(table.results)[1] = "Variable"
colnames(table.results)[2] = "Split On"
colnames(table.results)[3] = "Levels"
return(table.results)
}
#--------------------------------------
# num.scale()
#--------------------------------------
# Function to scale (normalize: mean = 0, sd = 1) numeric variables
num.scale = function(df){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
cols = temp[2]
} else {
cols = colnames(df[, !sapply(df, is.factor)])
}
# Apply function
for (i in cols){
i_cs = paste(i, "cs", sep = "_")
df[i_cs] = scale(df[, i])
}
return(df)
}
#--------------------------------------
# num.trans()
#--------------------------------------
# Function to transform numeric variables
num.trans = function(df){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
cols = temp[2]
} else {
cols = colnames(df[, !sapply(df, is.factor)])
}
# Apply function
for (i in cols){
# Natural Log
i_ln = paste(i, "ln", sep = "_")
df[i_ln] = (sign(df[, i]) * log(abs(df[, i])+1))
# Square Root
i_rt = paste(i, "rt", sep = "_")
df[i_rt] = (sign(df[, i]) * sqrt(abs(df[, i])+1))
# Square
i_sq = paste(i, "sq", sep = "_")
df[i_sq] = (df[, i] * df[, i])
}
return(df)
}
#--------------------------------------
# num.trims()
#--------------------------------------
# Function to trim numeric variables at various percentiles
num.trims = function(df){
require(scales)
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("numeric", "integer")){
stop("Please supply a numeric or integer variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
cols = temp[2]
} else {
cols = colnames(df[, !sapply(df, is.factor)])
}
# Apply function
for (i in cols){
# 1st and 99th
T99 = quantile(df[, i], c(0.01, 0.99))
df[paste(i, "T99", sep = "_")] = squish(df[, i], T99)
# 5th and 95th
T95 = quantile(df[, i], c(0.05, 0.95))
df[paste(i, "T95", sep = "_")] = squish(df[, i], T95)
# 10th and 90th
T90 = quantile(df[, i], c(0.10, 0.90))
df[paste(i, "T90", sep = "_")] = squish(df[, i], T90)
# 25th and 75th
T75 = quantile(df[, i], c(0.25, 0.75))
df[paste(i, "T75", sep = "_")] = squish(df[, i], T75)
}
return(df)
}
#==============================================================================
# Factor Variables
#==============================================================================
#------------------------------------------------------------------------------
# Plots
#------------------------------------------------------------------------------
#--------------------------------------
# fac.barplot()
#--------------------------------------
# Function to create barplots of factor variables
fac.barplot = function(df, df.num){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("factor")){
stop("Please supply a factor variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
cols = temp[2]
} else {
df.name = deparse(substitute(df))
cols = colnames(df[, sapply(df, is.factor)])
}
# Create plot(s)
for (i in cols){
if (missing(df.num)){
plot(df[, i],
main = paste(df.name, "$", i, sep = ""),
ylim = c(0, 1.15*max(summary(df[, i]))),
ylab = "Frequency")
} else if (!class(df.num) %in% c("numeric", "integer")){
stop("Please supply a numeric variable to df.num")
} else {
num = unlist(strsplit(deparse(substitute(df.num)),
split = "$", fixed = T))[2]
barplot(table(df[, num], df[, i]),
main = paste(df.name, "$", i, " by ",
deparse(substitute(df.num)), sep = ""),
ylim = c(0, 1.15*max(table(df[, i], df[, num]))),
ylab = "Frequency", beside = T)
}
}
}
#--------------------------------------
# fac.mosaic()
#--------------------------------------
# Function to create mosaic plots of factor variables
fac.mosaic = function(df.fac, df.fac.cn){
require(RColorBrewer)
# Check df.fac is factor
if (!class(df.fac) %in% c("factor")){
stop("Please supply a factor variable to df.fac")
}
# Assign data.frame and name
temp = unlist(strsplit(deparse(substitute(df.fac)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
df.name = temp[1]
fac = temp[2]
# Check if df.fac.cn is missing or named (and class if named)
if (missing(df.fac.cn)){
cols = colnames(df[, sapply(df, is.factor)])
} else if (!class(df.fac.cn) %in% c("factor")){
stop("Please supply a factor variable to df.fac.cn")
} else {
cols = unlist(strsplit(deparse(substitute(df.fac.cn)),
split = "$", fixed = T))[2]
}
# Create plot(s)
for (i in cols){
plot(df[, fac], df[, i],
col = brewer.pal(nlevels(df[, i]), "Spectral"),
main = paste(df.name, "$", fac," versus ",
df.name, "$", i, sep = ""),
xlab = paste(df.name, "$", fac, sep = ""),
ylab = paste(df.name, "$", i, sep = ""))
}
}
#------------------------------------------------------------------------------
# Variable Manipulation
#------------------------------------------------------------------------------
#--------------------------------------
# fac.freq()
#--------------------------------------
# Frequency of occurrence for factor variables, split by named factor
fac.freq = function(df.fac, df.fac.cn, cat = T){
table.results = data.frame()
# Check df.fac is factor
if (!class(df.fac) %in% c("factor")){
stop("Please supply a factor variable to df.fac")
}
# Assign data.frame and name
temp = unlist(strsplit(deparse(substitute(df.fac)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
fac = temp[2]
# Check if df.fac.cn is missing or named (and class if named)
if (missing(df.fac.cn)){
cols = colnames(df[, sapply(df, is.factor)])
} else if (!class(df.fac.cn) %in% c("factor")){
stop("Please supply a factor variable to df.fac.cn")
} else {
cols = unlist(strsplit(deparse(substitute(df.fac.cn)),
split = "$", fixed = T))[2]
}
# Factor splits
if (cat){
for (i in cols){
name.var = rep(paste(i), each = nlevels(df[, fac]))
name.split = rep(paste(fac), each = nlevels(df[, fac]))
table.level = levels(df[, fac])
table.agg = aggregate(df[, i], by = list(Var = df[, fac]),
summary)$x
table.prop = format(round(prop.table(table.agg, 1) * 100,
digits = 2), nsmall = 2)
table.results = as.data.frame(cbind(name.var, name.split,
table.level, table.prop))
colnames(table.results)[1] = "Variable"
colnames(table.results)[2] = "Split On"
colnames(table.results)[3] = "Levels"
if (missing(df.fac.cn)){
print(table.results)
} else {
return(table.results)
}
}
}
# Factor counts and frequencies
if (!cat){
name.var = rep(paste(fac), each = 2)
name.type = c("Count", "Percent")
table.agg = t(summary(df[, fac]))
table.prop = format(round(prop.table(table.agg) * 100,
digits = 2), nsmall = 2)
table.row = rbind(table.agg, table.prop)
table.col = cbind(name.var, name.type, table.row)
table.results = as.data.frame(table.col)
colnames(table.results)[1] = "Variable"
colnames(table.results)[2] = "Type"
return(table.results)
}
}
#--------------------------------------
# fac.flag()
#--------------------------------------
# Function to create indicator variables from factor variable levels
fac.flag = function(df){
# Check for data.frame or attribute
if (grepl("$", deparse(substitute(df)), fixed = T)){
if (!class(df) %in% c("factor")){
stop("Please supply a factor variable to df")
}
temp = unlist(strsplit(deparse(substitute(df)),
split = "$", fixed = T))
df = eval(as.name(paste(temp[1])))
cols = temp[2]
} else {
cols = colnames(df[, sapply(df, is.factor)])
}
# Apply function
for (i in cols){
for (level in unique(df[, i])){
df[paste(i, level, sep = "_")] =
as.factor(ifelse(df[, i] == level, 1, 0))
}
}
return(df)
}#==============================================================================
# FIN
#==============================================================================
sessionInfo()## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_0.4.0 RColorBrewer_1.1-2 pander_0.6.0
## [4] RCurl_1.95-4.8 bitops_1.0-6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 digest_0.6.9 plyr_1.8.4 formatR_1.4
## [5] magrittr_1.5 evaluate_0.9 stringi_1.1.1 rmarkdown_1.0
## [9] tools_3.3.1 stringr_1.0.0 munsell_0.4.3 yaml_2.1.13
## [13] colorspace_1.2-6 htmltools_0.3.5 knitr_1.13.1