This is a demo of functions I wrote to use in intial EDA. The function code is included here as an appendix, though the most recent version is centrally stored on GitHub here.

# Load libraries
library(RCurl)
library(pander)

#==============================================================================
# Functions
#==============================================================================
# Create function to source functions from GitHub
source.GitHub = function(url){
    require(RCurl)
    sapply(url, function(x){
        eval(parse(text = getURL(x, followlocation = T,
                                 cainfo = system.file("CurlSSL", 
                                          "cacert.pem", package = "RCurl"))),
             envir = .GlobalEnv)
    })
}

# Assign URL and source functions
url = "http://bit.ly/1T6LhBJ"
source.GitHub(url); rm(url)

Background

This demo uses the Auto dataset from the ISL website (the version included in the {ISLR} package is slightly different). We’ll start by importing these data and doing some minor tweaks based on a data quality check. Our response attribute is mpg, but we are also interested exploring data by the class origin.

# Download and assign data
if(!file.exists("~/Auto.csv")){
    URL = getURL("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv")
    auto = read.csv(textConnection(URL), header = T)
    rm(URL)
}

Data Quality Check

The function str() helps us check the dimensions, the first few observations, and attribute classes:

# View structure
str(auto)

## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Next we check our summary statistics:

# View summary statistics
summary(auto)

##       mpg          cylinders      displacement     horsepower 
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22  
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.0   90     : 20  
##  Median :23.00   Median :4.000   Median :146.0   88     : 19  
##  Mean   :23.52   Mean   :5.458   Mean   :193.5   110    : 18  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14  
##                                                  (Other):287  
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2223   1st Qu.:13.80   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2800   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.56   Mean   :75.99   Mean   :1.574  
##  3rd Qu.:3609   3rd Qu.:17.10   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##              name    
##  ford pinto    :  6  
##  amc matador   :  5  
##  ford maverick :  5  
##  toyota corolla:  5  
##  amc gremlin   :  4  
##  amc hornet    :  4  
##  (Other)       :368

We notice a few potential issues:

cylinders should probably be a factor attribute;
horsepower should probably be a numeric attribute;
weight should probably be a numeric attribute (R will automatically convert integer class to numeric for mathematical operations);
origin should probably be a factor attribute as it indicates country of origin;
name should probably be a character attribute.

Additionally, horsepower has a ?. That should probably be a NA value (in case any other attributes use the ? character to denote NA, we’ll apply it to the whole dataset). Note that with horsepower, we first convert it to a character to preserve the numeric values. Let’s make these changes:

# Treat "?" as NA
auto[auto == "?"] = NA

# Convert attributes as appropriate
auto$horsepower = as.numeric(as.character(auto$horsepower))
auto$weight = as.numeric(auto$weight)
auto$cylinders = as.factor(auto$cylinders)
auto$origin = as.factor(auto$origin)
auto$name = as.character(auto$name)

For the purposes of this demo, we’ll drop the name attribute and remove the tuples with NA values (this is revisited in the Attribute Manipulations section). We’ll create a version of auto called auto.amas a reference for use in this later section.

# Drop name attribute
auto = subset(auto, select = -name)

# Create separate data.frame for auto
auto.am = auto

# Remove NA values
auto = na.omit(auto)

Qualitative EDA

We’ll start with some visual EDA. The first user-defined function we’ll use is num.plots(). This function produces a histogram, boxplot, scatterplot, and Q-Q plot for each numeric attribute or an individual attribute named in num.plots(df = ). For the scatterplot, the response attribute mpg must be named, and the boxplot can be split by a factor class. The histogram can also plot bins of equally spaced probabilities (num.plots(prob = T)), and the normal curve overlay can be turned on or off (num.plots(norm = F)).

The function is really composed of four separate functions, which can each be called individually: num.hist, num.boxplot, num.scatter, and num.qq.

# Create plots of numeric attributes
# Use [mpg] as response in scatterplot
# Use [origin] as factor to split in boxplots
num.plots(df = auto, df.num = auto$mpg, df.fac = auto$origin)

We’ll continue with your visual EDA, but now for factor attributes. This uses two user-defined functions: fac.barplot, and fac.mosaic. The former makes barplots of factor attributes (and will also accept a numeric attribute to segregate on). The latter makes mosaic plots for a named factor. Both will also accept an individual attribute named in num.plots(df = ) - this is done with the fac.mosaic() function.

Note the mosaic plot shows some pretty unbalanced classes when it comes to cylinders by origin.

# Create barplots for factor attributes
fac.barplot(df = auto)

# Create mosaic plot for 
fac.mosaic(df = auto$origin, df.fac.cn = auto$cylinders)

Quantitative EDA

Now we’ll switch gears (no pun intended) over to quantitative EDA. First, we’ll look at summary statistics for numeric attributes split by a named factor attribute (origin). This user-defined function is called num.freq(). By default, the function will return summary stats for all numeric (or integer) attributes. A second argument, num.freq(df.num.cn = ), can be supplied specifying a single numeric or integer attribute to use. Here we use mpg.

# Summary stats for numeric attributes, split by named factor
num.freq(df.fac = auto$origin, df.num.cn = auto$mpg)

##   Variable Split On Levels  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1      mpg   origin      1  9.00   15.00  18.50 20.03   24.00 39.00
## 2      mpg   origin      2 16.20   23.75  26.00 27.60   30.12 44.30
## 3      mpg   origin      3 18.00   25.70  31.60 30.45   34.05 46.60

Similarly, we can look at occurence rates by factor class - think of this as the quantitative version of the mosaic plot. This user-defined function is called fac.freq(). A second argument, num.freq(df.num.cn = ), can be supplied specifying a single factor attribute to use. A third argument, fac.freq(cat = ), can be supplied, returning counts and frequencies of the factor (the default is FALSE).

# Frequency of occurrence for factor attributes, split by named factor
fac.freq(df.fac = auto$origin)

##    Variable Split On Levels     3     4     5     6     8
## 1 cylinders   origin      1  0.00 28.16  0.00 29.80 42.04
## 2 cylinders   origin      2  0.00 89.71  4.41  5.88  0.00
## 3 cylinders   origin      3  5.06 87.34  0.00  7.59  0.00
##   Variable Split On Levels      1      2      3
## 1   origin   origin      1 100.00   0.00   0.00
## 2   origin   origin      2   0.00 100.00   0.00
## 3   origin   origin      3   0.00   0.00 100.00

# Counts and frequency of occurrence of named factor
fac.freq(df.fac = auto$origin, cat = F)

##   Variable    Type     1     2     3
## 1   origin   Count   245    68    79
## 2   origin Percent 62.50 17.35 20.15

Attribute Manipulations

The last part of the demo deals with various attribute manipulations: missing flag creation, scaling, trimming, transforming, and creating indicator attributes for each factor attribute level.

Missing Values

The user-defined function miss.flag() creates indicator attributes for attributes with missing data, as sometimes missingness is predictive. The function defaults to automatically do this for all attributes in the dataset, but can be specified to only do it for a specific attribute, or type (e.g. numeric or factor).

# Create temp data.frame
auto.temp = auto.am

# Apply missing flag function
auto.temp = miss.flag(df = auto.temp)

When viewing attribute names, we see there’s a new attribute called MF_horsepower, with class factor.

# View names
names(auto.temp)

## [1] "mpg"           "cylinders"     "displacement"  "horsepower"   
## [5] "weight"        "acceleration"  "year"          "origin"       
## [9] "MF_horsepower"

# Check class
class(auto.temp$MF_horsepower)

## [1] "factor"

# Validate application
sum(is.na(auto.temp$horsepower))

## [1] 5

length(auto.temp$MF_horsepower[auto.temp$MF_horsepower == 1])

## [1] 5

Scales, Trims, Transforms, and Indicators

Moving more toward data mining, we might want to scale, trim, and transform attributes en masse. The next user-defined functions do just that: num.scale(), num.trims(), num.trans().

num.scale()

num.scale() will center and scale all numeric attributes in the dataset, or an individual attribute named in num.scale(df = ). Note: the final suffix is .V1 as each attribute has two additional attributes (scaled:center and scaled:scale).

# Create temp data.frame
auto.temp = auto.am

# Apply scale function
auto.temp = num.scale(df = auto.temp)

When viewing attribute names, we see all numeric attributes have been centered and scaled, with the suffix _cs. Results can be seen with the summary() function.

# View names
colnames(auto.temp[, 9:ncol(auto.temp)])

## [1] "mpg_cs"          "displacement_cs" "horsepower_cs"   "weight_cs"      
## [5] "acceleration_cs" "year_cs"

# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])

##       mpg_cs.V1        displacement_cs.V1   horsepower_cs.V1  
##  Min.   :-1.8508526   Min.   :-1.2080194   Min.   :-1.519034  
##  1st Qu.:-0.8258696   1st Qu.:-0.8544397   1st Qu.:-0.765614  
##  Median :-0.0891631   Median :-0.4148541   Median :-0.284985  
##  Mean   : 0.0000000   Mean   : 0.0000000   Mean   : 0.000000  
##  3rd Qu.: 0.7116049   3rd Qu.: 0.7772830   3rd Qu.: 0.559365  
##  Max.   : 2.9665675   Max.   : 2.4902336   Max.   : 3.261284  
##      weight_cs.V1     acceleration_cs.V1       year_cs.V1     
##  Min.   :-1.6065223   Min.   :-2.733490   Min.   :-1.6232409  
##  1st Qu.:-0.8857216   1st Qu.:-0.640237   1st Qu.:-0.8088504  
##  Median :-0.2049490   Median :-0.014980   Median : 0.0055401  
##  Mean   : 0.0000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.7501341   3rd Qu.: 0.537784   3rd Qu.: 0.8199306  
##  Max.   : 2.5458080   Max.   : 3.355973   Max.   : 1.6343210

num.trims()

num.trims() will Winsorize all numeric attributes in the dataset, or an individual attribute named in num.trims(df = ). The attributes are Winsorized at the 1st/99th, 5th/95th, 10th/90th, and 25th/75th percentiles.

# Create temp data.frame
auto.temp = auto.am

# Apply scale function
auto.temp = num.trims(df = auto.temp)

When viewing attribute names, we see all numeric attributes have been Winsorized, with the suffixes _T99, _T95, _T90, and _T75. Results can be seen with the summary() function (trims can be seen at _T75).

# View names
colnames(auto.temp[, 9:ncol(auto.temp)])

##  [1] "mpg_T99"          "mpg_T95"          "mpg_T90"         
##  [4] "mpg_T75"          "displacement_T99" "displacement_T95"
##  [7] "displacement_T90" "displacement_T75" "horsepower_T99"  
## [10] "horsepower_T95"   "horsepower_T90"   "horsepower_T75"  
## [13] "weight_T99"       "weight_T95"       "weight_T90"      
## [16] "weight_T75"       "acceleration_T99" "acceleration_T95"
## [19] "acceleration_T90" "acceleration_T75" "year_T99"        
## [22] "year_T95"         "year_T90"         "year_T75"

# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])

##     mpg_T99         mpg_T95         mpg_T90         mpg_T75     
##  Min.   :11.00   Min.   :13.00   Min.   :14.00   Min.   :17.00  
##  1st Qu.:17.00   1st Qu.:17.00   1st Qu.:17.00   1st Qu.:17.00  
##  Median :22.75   Median :22.75   Median :22.75   Median :22.75  
##  Mean   :23.44   Mean   :23.34   Mean   :23.21   Mean   :22.85  
##  3rd Qu.:29.00   3rd Qu.:29.00   3rd Qu.:29.00   3rd Qu.:29.00  
##  Max.   :43.45   Max.   :37.00   Max.   :34.19   Max.   :29.00  
##  displacement_T99 displacement_T95 displacement_T90 displacement_T75
##  Min.   : 70.91   Min.   : 85.0    Min.   : 90.0    Min.   :105.0   
##  1st Qu.:105.00   1st Qu.:105.0    1st Qu.:105.0    1st Qu.:105.0   
##  Median :151.00   Median :151.0    Median :151.0    Median :151.0   
##  Mean   :194.29   Mean   :193.9    Mean   :191.0    Mean   :179.7   
##  3rd Qu.:275.75   3rd Qu.:275.8    3rd Qu.:275.8    3rd Qu.:269.2   
##  Max.   :441.26   Max.   :400.0    Max.   :350.0    Max.   :275.8   
##  horsepower_T99  horsepower_T95   horsepower_T90  horsepower_T75  
##  Min.   : 48.0   Min.   : 60.55   Min.   : 67.0   Min.   : 75.00  
##  1st Qu.: 75.0   1st Qu.: 75.00   1st Qu.: 75.0   1st Qu.: 75.00  
##  Median : 93.5   Median : 93.50   Median : 93.5   Median : 93.50  
##  Mean   :104.4   Mean   :103.59   Mean   :102.3   Mean   : 97.82  
##  3rd Qu.:126.0   3rd Qu.:126.00   3rd Qu.:126.0   3rd Qu.:125.25  
##  Max.   :220.4   Max.   :180.00   Max.   :157.7   Max.   :126.00  
##    weight_T99     weight_T95     weight_T90     weight_T75  
##  Min.   :1772   Min.   :1932   Min.   :1990   Min.   :2225  
##  1st Qu.:2225   1st Qu.:2225   1st Qu.:2225   1st Qu.:2226  
##  Median :2804   Median :2804   Median :2804   Median :2804  
##  Mean   :2978   Mean   :2970   Mean   :2960   Mean   :2884  
##  3rd Qu.:3615   3rd Qu.:3615   3rd Qu.:3615   3rd Qu.:3613  
##  Max.   :4951   Max.   :4464   Max.   :4278   Max.   :3615  
##  acceleration_T99 acceleration_T95 acceleration_T90 acceleration_T75
##  Min.   : 9.455   Min.   :11.26    Min.   :12.00    Min.   :13.78   
##  1st Qu.:13.775   1st Qu.:13.78    1st Qu.:13.78    1st Qu.:13.79   
##  Median :15.500   Median :15.50    Median :15.50    Median :15.50   
##  Mean   :15.532   Mean   :15.52    Mean   :15.48    Mean   :15.41   
##  3rd Qu.:17.025   3rd Qu.:17.02    3rd Qu.:17.02    3rd Qu.:17.01   
##  Max.   :22.317   Max.   :20.23    Max.   :19.00    Max.   :17.02   
##     year_T99        year_T95        year_T90        year_T75    
##  Min.   :70.00   Min.   :70.00   Min.   :71.00   Min.   :73.00  
##  1st Qu.:73.00   1st Qu.:73.00   1st Qu.:73.00   1st Qu.:73.00  
##  Median :76.00   Median :76.00   Median :76.00   Median :76.00  
##  Mean   :75.98   Mean   :75.98   Mean   :75.98   Mean   :75.97  
##  3rd Qu.:79.00   3rd Qu.:79.00   3rd Qu.:79.00   3rd Qu.:79.00  
##  Max.   :82.00   Max.   :82.00   Max.   :81.00   Max.   :79.00

num.trans()

num.trans() will transform all numeric attributes in the dataset, or an individual attribute named in num.trans(df = ) by taking the natural log (a +1 is added for stability), square root, and square.

# Create temp data.frame
auto.temp = auto.am

# Apply scale function
auto.temp = num.trans(df = auto.temp)

When viewing attribute names, we see all numeric attributes have been transformed, with the suffixes _ln (\(\log_{x}\)), _rt (\(\sqrt{x}\)), and _sq (\(X^{2}\)). Results can be seen with the summary() function.

# View names
colnames(auto.temp[, 9:ncol(auto.temp)])

##  [1] "mpg_ln"          "mpg_rt"          "mpg_sq"         
##  [4] "displacement_ln" "displacement_rt" "displacement_sq"
##  [7] "horsepower_ln"   "horsepower_rt"   "horsepower_sq"  
## [10] "weight_ln"       "weight_rt"       "weight_sq"      
## [13] "acceleration_ln" "acceleration_rt" "acceleration_sq"
## [16] "year_ln"         "year_rt"         "year_sq"

# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])

##      mpg_ln          mpg_rt          mpg_sq       displacement_ln
##  Min.   :2.303   Min.   :3.162   Min.   :  81.0   Min.   :4.234  
##  1st Qu.:2.890   1st Qu.:4.243   1st Qu.: 289.0   1st Qu.:4.663  
##  Median :3.168   Median :4.873   Median : 517.6   Median :5.024  
##  Mean   :3.145   Mean   :4.882   Mean   : 610.5   Mean   :5.135  
##  3rd Qu.:3.401   3rd Qu.:5.477   3rd Qu.: 841.0   3rd Qu.:5.622  
##  Max.   :3.863   Max.   :6.899   Max.   :2171.6   Max.   :6.122  
##  displacement_rt  displacement_sq  horsepower_ln   horsepower_rt   
##  Min.   : 8.307   Min.   :  4624   Min.   :3.850   Min.   : 6.856  
##  1st Qu.:10.296   1st Qu.: 11025   1st Qu.:4.331   1st Qu.: 8.718  
##  Median :12.329   Median : 22801   Median :4.549   Median : 9.721  
##  Mean   :13.499   Mean   : 48718   Mean   :4.599   Mean   :10.115  
##  3rd Qu.:16.630   3rd Qu.: 76268   3rd Qu.:4.844   3rd Qu.:11.269  
##  Max.   :21.354   Max.   :207025   Max.   :5.442   Max.   :15.199  
##  horsepower_sq     weight_ln       weight_rt       weight_sq       
##  Min.   : 2116   Min.   :7.386   Min.   :40.17   Min.   : 2601769  
##  1st Qu.: 5625   1st Qu.:7.708   1st Qu.:47.18   1st Qu.: 4951739  
##  Median : 8742   Median :7.939   Median :52.96   Median : 7859624  
##  Mean   :12392   Mean   :7.960   Mean   :54.04   Mean   : 9585652  
##  3rd Qu.:15879   3rd Qu.:8.193   3rd Qu.:60.13   3rd Qu.:13066427  
##  Max.   :52900   Max.   :8.545   Max.   :71.70   Max.   :26419600  
##  acceleration_ln acceleration_rt acceleration_sq    year_ln     
##  Min.   :2.197   Min.   :3.000   Min.   : 64.0   Min.   :4.263  
##  1st Qu.:2.693   1st Qu.:3.844   1st Qu.:189.8   1st Qu.:4.304  
##  Median :2.803   Median :4.062   Median :240.2   Median :4.344  
##  Mean   :2.792   Mean   :4.053   Mean   :249.1   Mean   :4.342  
##  3rd Qu.:2.892   3rd Qu.:4.246   3rd Qu.:289.9   3rd Qu.:4.382  
##  Max.   :3.250   Max.   :5.079   Max.   :615.0   Max.   :4.419  
##     year_rt         year_sq    
##  Min.   :8.426   Min.   :4900  
##  1st Qu.:8.602   1st Qu.:5329  
##  Median :8.775   Median :5776  
##  Mean   :8.771   Mean   :5786  
##  3rd Qu.:8.944   3rd Qu.:6241  
##  Max.   :9.110   Max.   :6724

fac.flag()

fac.flag() will transform all levels of factor attributes in the dataset to indicator attributes, or an individual attribute named in fac.flag(df = ). Note: original attributes are not removed.

# Create temp data.frame
auto.temp = auto.am

# Apply scale function
auto.temp = fac.flag(df = auto.temp)

When viewing attribute names, we see all factor attributes have been transformed, with the suffixes _LEVEL, where LEVEL is a named level in the factor. Results can be seen with the summary() function.

# View names
colnames(auto.temp[, 9:ncol(auto.temp)])

## [1] "cylinders_8" "cylinders_4" "cylinders_6" "cylinders_3" "cylinders_5"
## [6] "origin_1"    "origin_3"    "origin_2"

# Summary stats
summary(auto.temp[, 9:ncol(auto.temp)])

##  cylinders_8 cylinders_4 cylinders_6 cylinders_3 cylinders_5 origin_1
##  0:289       0:193       0:309       0:388       0:389       0:147   
##  1:103       1:199       1: 83       1:  4       1:  3       1:245   
##  origin_3 origin_2
##  0:313    0:324   
##  1: 79    1: 68

Appendix: Function Code

###############################################################################
# R_EDA_Sandbox_Functions.R
# Last updated: 2016-08-02 by MJG
###############################################################################

# A compilation of useful functions to [ideally] deploy on any data set

# Suggested order of deployment:
#   Convert variables as necessary (e.g. to factors)
#   Plots for EDA on numeric and factor variables
#   Missing flags
#   Missing imputes
#   Trims
#   Transforms

#==============================================================================
# Accuracy
#==============================================================================

#--------------------------------------
# fit()
#--------------------------------------
# Function to add MSE to other measures from forecast::accuracy
fit = function(f, x){
    require(forecast)
    temp = data.frame(forecast::accuracy(f, x),
                      forecast::accuracy(f, x)[, 2]^2)
    temp = temp[, -c(1)]
    colnames(temp)[6] <- "MSE"
    temp = temp[c(6, 1, 2, 3, 4, 5)]
    print(temp)
}

#==============================================================================
# Missing Observations
#==============================================================================

#--------------------------------------
# miss.flag()
#--------------------------------------
# Function to create indicator variables as missing flags
miss.flag = function(df, df.cn = c("num", "fac")){
    # Check for columns to apply
    if (missing(df.cn)){
        cols = colnames(df)
    } else if (df.cn == "num"){
        cols = colnames(df[, !sapply(df, is.factor)])
    } else if (df.cn == "fac"){
        cols = colnames(df[, sapply(df, is.factor)])
    }
    # Apply function
    for (i in cols){
        if (sum(is.na(df[, i])) > 0){
            df[paste("MF", i, sep = "_")] =
                as.factor(ifelse(is.na(df[, i]), 1, 0))
        }
    }
    return(df)
}

#==============================================================================
# Numeric Variables
#==============================================================================

#------------------------------------------------------------------------------
# Plots
#------------------------------------------------------------------------------

#--------------------------------------
# num.boxplot()
#--------------------------------------
# Function to create boxplots of numeric variables
num.boxplot = function(df, df.fac){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Create plot(s)
    for (i in cols){
        if (missing(df.fac)){
            boxplot(df[, i], col = "grey", horizontal = T,
                    main = paste("Boxplot of ", df.name, "$", i, sep = ""),
                    xlab = paste(df.name, "$", i, sep = ""),
                    ylab = "Values")
        } else if (!class(df.fac) %in% c("factor")){
            stop("Please supply a factor variable to df.fac")
        } else {
            fac = unlist(strsplit(deparse(substitute(df.fac)),
                                  split = "$", fixed = T))[2]
            boxplot(df[, i] ~ df[, fac], col = "grey", horizontal = T,
                    main = paste(df.name, "$", i," versus ",
                                 deparse(substitute(df.fac)), sep = ""),
                    ylab = "Values")
        }
    }
}

#--------------------------------------
# num.hist()
#--------------------------------------
# Function to create histograms of numeric variables
# Optional choice of equal spaced probabilities or normal curve overlay
num.hist = function(df, prob = F, norm = T){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Create plot(s)
    for (i in cols){
        main = paste("Histogram of ", df.name, "$", i, sep = "")
        sub = ifelse(norm, "normal curve overlay (blue)", "")
        y = hist(df[, i], plot = F)
        if (prob){
            seq = seq(0.0, 1.0, by = 0.1)
            h = hist(df[, i], col = "grey", main = main, sub = sub,
                     breaks = quantile(df[, i], probs = seq),
                     xlab = paste(df.name, "$", i, sep = ""))
        }
        if (!prob){
            h = hist(df[, i], col = "grey", main = main, sub = sub,
                     ylim = c(0, 1.15*max(y$counts)),
                     xlab = paste(df.name, "$", i, sep = ""))
        }
        if (norm){
            xfit = seq(min(df[, i]), max(df[, i]), length = 100)
            yfit = dnorm(xfit, mean = mean(df[, i]), sd = sd(df[, i]))
            if (norm & !prob){
                yfit = yfit * diff(h$mids[1:2]) * length(df[, i])
            }
            lines(xfit, yfit, col = "blue", lwd = 2)
        }
    }
}

#--------------------------------------
# num.qq()
#--------------------------------------
# Function to create Q-Q plots of numeric variables
num.qq = function(df){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Create plot(s)
    for (i in cols){
        qqnorm(df[, i], pch = 21, bg = "grey",
               main = paste("Normal Q-Q Plot of ", df.name, "$", i, sep = ""))
        qqline(df[, i], lwd = 2, col = "blue")
    }
}

#--------------------------------------
# num.scatter()
#--------------------------------------
# Function to create scatterplots of numeric variables
num.scatter = function(df, df.num){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Create plot(s)
    num = unlist(strsplit(deparse(substitute(df.num)),
                          split = "$", fixed = T))[2]
    for (i in cols){
        plot(df[, i], df[, num], pch = 21, bg = "grey",
             main = paste(df.name, "$", num, " versus ",
                          df.name, "$", i, sep = ""),
             ylab = paste(df.name, "$", num, sep = ""),
             xlab = paste(df.name, "$", i, sep = ""))
    }
}

#--------------------------------------
# num.plots()
#--------------------------------------
# Function to produce four plots per variable:
# num.plots(which = ) corresponds as follows:
#   1 = Histogram
#   2 = Scatterplot
#   3 = Boxplot
#   4 = QQ Plot
num.plots = function(df, df.num, df.fac, prob = F, norm = T,
                     which = c(1, 2, 3, 4)){
    # Check for which plots to create
    if (missing(which)){
        par(mfcol = c(2, 2))
    }
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Create plot(s)
    for (i in cols){
        #------------------------------
        # Histograms
        #------------------------------
        if (1 %in% which){
            main = paste("Histogram of ", df.name, "$", i, sep = "")
            sub = ifelse(norm, "normal curve overlay (blue)", "")
            y = hist(df[, i], plot = F)
            if (prob){
                seq = seq(0.0, 1.0, by = 0.1)
                h = hist(df[, i], col = "grey", main = main, sub = sub,
                         breaks = quantile(df[, i], probs = seq),
                         xlab = paste(df.name, "$", i, sep = ""))
            }
            if (!prob){
                h = hist(df[, i], col = "grey", main = main, sub = sub,
                         ylim = c(0, 1.15*max(y$counts)),
                         xlab = paste(df.name, "$", i, sep = ""))
            }
            if (norm){
                xfit = seq(min(df[, i]), max(df[, i]), length = 100)
                yfit = dnorm(xfit, mean = mean(df[, i]), sd = sd(df[, i]))
                if (norm & !prob){
                    yfit = yfit * diff(h$mids[1:2]) * length(df[, i])
                }
                lines(xfit, yfit, col = "blue", lwd = 2)
            }
        }
        #------------------------------
        # Scatterplots
        #------------------------------
        if (2 %in% which){
            num = unlist(strsplit(deparse(substitute(df.num)),
                                  split = "$", fixed = T))[2]
            plot(df[, i], df[, num], pch = 21, bg = "grey",
                 main = paste(df.name, "$", num, " versus ",
                              df.name, "$", i, sep = ""),
                 ylab = paste(df.name, "$", num, sep = ""),
                 xlab = paste(df.name, "$", i, sep = ""))
        }
        #------------------------------
        # Boxplots
        #------------------------------
        if (3 %in% which){
            if (missing(df.fac)){
                boxplot(df[, i], col = "grey", horizontal = T,
                        main = paste("Boxplot of ", df.name, "$", i, sep = ""),
                        xlab = paste(df.name, "$", i, sep = ""),
                        ylab = "Values")
            } else if (!class(df.fac) %in% c("factor")){
                stop("Please supply a factor variable to df.fac")
            } else {
                fac = unlist(strsplit(deparse(substitute(df.fac)),
                                      split = "$", fixed = T))[2]
                boxplot(df[, i] ~ df[, fac], col = "grey", horizontal = T,
                        main = paste(df.name, "$", i," versus ",
                                     deparse(substitute(df.fac)), sep = ""),
                        ylab = "Values")
            }
        }
        #------------------------------
        # QQ Plots
        #------------------------------
        if (4 %in% which){
            qqnorm(df[, i], pch = 21, bg = "grey",
                   main = paste("Normal Q-Q Plot of ", df.name, "$", i, sep = ""))
            qqline(df[, i], lwd = 2, col = "blue")
        }
    }
    return(par(mfcol = c(1, 1)))
}

#------------------------------------------------------------------------------
# Variable Manipulation
#------------------------------------------------------------------------------

#--------------------------------------
# num.freq()
#--------------------------------------
# Summary stats for numeric variables, split by named factor
num.freq = function(df.fac, df.num.cn){
    table.results = data.frame()
    # Check df.fac is factor
    if (!class(df.fac) %in% c("factor")){
        stop("Please supply a factor variable to df.fac")
    }
    # Assign data.frame and name
    temp = unlist(strsplit(deparse(substitute(df.fac)),
                           split = "$", fixed = T))
    df = eval(as.name(paste(temp[1])))
    fac = temp[2]
    if (missing(df.num.cn)){
        cols = colnames(df[, !sapply(df, is.factor)])
    } else if (!class(df.num.cn) %in% c("numeric", "integer")){
        stop("Please supply a numeric or integer variable to df.num.cn")
    } else {
        cols = unlist(strsplit(deparse(substitute(df.num.cn)),
                               split = "$", fixed = T))[2]
    }
    for (i in cols){
        name.var = rep(paste(i), each = nlevels(df[, fac]))
        name.split = rep(paste(fac), each = nlevels(df[, fac]))
        table.level = levels(df[, fac])
        table.agg = format(aggregate(df[, i], by = list(Var = df[, fac]),
                                     summary)$x, nsmall = 2)
        table.row = as.data.frame(cbind(name.var, name.split,
                                        table.level, table.agg))
        table.results = rbind(table.results, table.row)
    }
    colnames(table.results)[1] = "Variable"
    colnames(table.results)[2] = "Split On"
    colnames(table.results)[3] = "Levels"
    return(table.results)
}

#--------------------------------------
# num.scale()
#--------------------------------------
# Function to scale (normalize: mean = 0, sd = 1) numeric variables
num.scale = function(df){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        cols = temp[2]
    } else {
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Apply function
    for (i in cols){
        i_cs = paste(i, "cs", sep = "_")
        df[i_cs] = scale(df[, i])
    }
    return(df)
}

#--------------------------------------
# num.trans()
#--------------------------------------
# Function to transform numeric variables
num.trans = function(df){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        cols = temp[2]
    } else {
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Apply function
    for (i in cols){
        # Natural Log
        i_ln = paste(i, "ln", sep = "_")
        df[i_ln] = (sign(df[, i]) * log(abs(df[, i])+1))
        # Square Root
        i_rt = paste(i, "rt", sep = "_")
        df[i_rt] = (sign(df[, i]) * sqrt(abs(df[, i])+1))
        # Square
        i_sq = paste(i, "sq", sep = "_")
        df[i_sq] = (df[, i] * df[, i])
    }
    return(df)
}

#--------------------------------------
# num.trims()
#--------------------------------------
# Function to trim numeric variables at various percentiles
num.trims = function(df){
    require(scales)
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("numeric", "integer")){
            stop("Please supply a numeric or integer variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        cols = temp[2]
    } else {
        cols = colnames(df[, !sapply(df, is.factor)])
    }
    # Apply function
    for (i in cols){
        # 1st and 99th
        T99 = quantile(df[, i], c(0.01, 0.99))
        df[paste(i, "T99", sep = "_")] = squish(df[, i], T99)
        # 5th and 95th
        T95 = quantile(df[, i], c(0.05, 0.95))
        df[paste(i, "T95", sep = "_")] = squish(df[, i], T95)
        # 10th and 90th
        T90 = quantile(df[, i], c(0.10, 0.90))
        df[paste(i, "T90", sep = "_")] = squish(df[, i], T90)
        # 25th and 75th
        T75 = quantile(df[, i], c(0.25, 0.75))
        df[paste(i, "T75", sep = "_")] = squish(df[, i], T75)
    }
    return(df)
}

#==============================================================================
# Factor Variables
#==============================================================================

#------------------------------------------------------------------------------
# Plots
#------------------------------------------------------------------------------

#--------------------------------------
# fac.barplot()
#--------------------------------------
# Function to create barplots of factor variables
fac.barplot = function(df, df.num){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("factor")){
            stop("Please supply a factor variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        df.name = temp[1]
        cols = temp[2]
    } else {
        df.name = deparse(substitute(df))
        cols = colnames(df[, sapply(df, is.factor)])
    }
    # Create plot(s)
    for (i in cols){
        if (missing(df.num)){
            plot(df[, i],
                 main = paste(df.name, "$", i, sep = ""),
                 ylim = c(0, 1.15*max(summary(df[, i]))),
                 ylab = "Frequency")
        } else if (!class(df.num) %in% c("numeric", "integer")){
            stop("Please supply a numeric variable to df.num")
        } else {
            num = unlist(strsplit(deparse(substitute(df.num)),
                                  split = "$", fixed = T))[2]
            barplot(table(df[, num], df[, i]),
                    main = paste(df.name, "$", i, " by ", 
                                 deparse(substitute(df.num)), sep = ""),
                    ylim = c(0, 1.15*max(table(df[, i], df[, num]))),
                    ylab = "Frequency", beside = T)
        }
    }
}

#--------------------------------------
# fac.mosaic()
#--------------------------------------
# Function to create mosaic plots of factor variables
fac.mosaic = function(df.fac, df.fac.cn){
    require(RColorBrewer)
    # Check df.fac is factor
    if (!class(df.fac) %in% c("factor")){
        stop("Please supply a factor variable to df.fac")
    }
    # Assign data.frame and name
    temp = unlist(strsplit(deparse(substitute(df.fac)),
                           split = "$", fixed = T))
    df = eval(as.name(paste(temp[1])))
    df.name = temp[1]
    fac = temp[2]
    # Check if df.fac.cn is missing or named (and class if named)
    if (missing(df.fac.cn)){
        cols = colnames(df[, sapply(df, is.factor)])
    } else if (!class(df.fac.cn) %in% c("factor")){
        stop("Please supply a factor variable to df.fac.cn")
    } else {
        cols = unlist(strsplit(deparse(substitute(df.fac.cn)),
                               split = "$", fixed = T))[2]
    }
    # Create plot(s)
    for (i in cols){
        plot(df[, fac], df[, i],
             col = brewer.pal(nlevels(df[, i]), "Spectral"),
             main = paste(df.name, "$", fac," versus ",
                          df.name, "$", i, sep = ""),
             xlab = paste(df.name, "$", fac, sep = ""),
             ylab = paste(df.name, "$", i, sep = ""))
    }
}

#------------------------------------------------------------------------------
# Variable Manipulation
#------------------------------------------------------------------------------

#--------------------------------------
# fac.freq()
#--------------------------------------
# Frequency of occurrence for factor variables, split by named factor
fac.freq = function(df.fac, df.fac.cn, cat = T){
    table.results = data.frame()
    # Check df.fac is factor
    if (!class(df.fac) %in% c("factor")){
        stop("Please supply a factor variable to df.fac")
    }
    # Assign data.frame and name
    temp = unlist(strsplit(deparse(substitute(df.fac)),
                           split = "$", fixed = T))
    df = eval(as.name(paste(temp[1])))
    fac = temp[2]
    # Check if df.fac.cn is missing or named (and class if named)
    if (missing(df.fac.cn)){
        cols = colnames(df[, sapply(df, is.factor)])
    } else if (!class(df.fac.cn) %in% c("factor")){
        stop("Please supply a factor variable to df.fac.cn")
    } else {
        cols = unlist(strsplit(deparse(substitute(df.fac.cn)),
                               split = "$", fixed = T))[2]
    }
    # Factor splits
    if (cat){
        for (i in cols){
            name.var = rep(paste(i), each = nlevels(df[, fac]))
            name.split = rep(paste(fac), each = nlevels(df[, fac]))
            table.level = levels(df[, fac])
            table.agg = aggregate(df[, i], by = list(Var = df[, fac]),
                                  summary)$x
            table.prop = format(round(prop.table(table.agg, 1) * 100,
                                      digits = 2), nsmall = 2)
            table.results = as.data.frame(cbind(name.var, name.split,
                                                table.level, table.prop))
            colnames(table.results)[1] = "Variable"
            colnames(table.results)[2] = "Split On"
            colnames(table.results)[3] = "Levels"
            if (missing(df.fac.cn)){
                print(table.results)
            } else {
                return(table.results)
            }
        }
    }
    # Factor counts and frequencies
    if (!cat){
        name.var = rep(paste(fac), each = 2)
        name.type = c("Count", "Percent")
        table.agg = t(summary(df[, fac]))
        table.prop = format(round(prop.table(table.agg) * 100,
                                  digits = 2), nsmall = 2)
        table.row = rbind(table.agg, table.prop)
        table.col = cbind(name.var, name.type, table.row)
        table.results = as.data.frame(table.col)
        colnames(table.results)[1] = "Variable"
        colnames(table.results)[2] = "Type"
        return(table.results)
    }
}

#--------------------------------------
# fac.flag()
#--------------------------------------
# Function to create indicator variables from factor variable levels
fac.flag = function(df){
    # Check for data.frame or attribute
    if (grepl("$", deparse(substitute(df)), fixed = T)){
        if (!class(df) %in% c("factor")){
            stop("Please supply a factor variable to df")
        }
        temp = unlist(strsplit(deparse(substitute(df)),
                               split = "$", fixed = T))
        df = eval(as.name(paste(temp[1])))
        cols = temp[2]
    } else {
        cols = colnames(df[, sapply(df, is.factor)])
    }
    # Apply function
    for (i in cols){
        for (level in unique(df[, i])){
            df[paste(i, level, sep = "_")] =
                as.factor(ifelse(df[, i] == level, 1, 0))
        }
    }
    return(df)
}

#==============================================================================
# FIN
#==============================================================================
sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] scales_0.4.0       RColorBrewer_1.1-2 pander_0.6.0      
## [4] RCurl_1.95-4.8     bitops_1.0-6      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6      digest_0.6.9     plyr_1.8.4       formatR_1.4     
##  [5] magrittr_1.5     evaluate_0.9     stringi_1.1.1    rmarkdown_1.0   
##  [9] tools_3.3.1      stringr_1.0.0    munsell_0.4.3    yaml_2.1.13     
## [13] colorspace_1.2-6 htmltools_0.3.5  knitr_1.13.1

R_EDA_Sandbox_Demo

Michael Gilbert

03 August, 2016