Preliminary

We will use the DescTools package to explore and work with our data. If you do not already have these packages installed, you will first install the packages using the install.packages() function. The package has a very useful and comprehensive vignette which can be accessed through the help documentation in RStudio.

install.packages("DescTools")

Once installed, we load the package for use in the R session using the library() function.

library(DescTools)

In the lessons that follow we will use the “carseats.csv” file. This dataset contains sales of child car seats at 400 different stores. The variables in the dataset include:

  • CompPrice: Price charged by competitor at each location
  • Income: Community income level (in 1,000s of dollars)
  • Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)
  • Population: Population size in region (in 1,000s)
  • Price: Price charged by company for car seats at each locations
  • ShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)
  • Age: Average age of the local population
  • Education: Education level at each location
  • Urban: Indicates if the location is urban (Yes) or rural (No)
  • US: Indicates if the location is in the US (Yes) or outside of the US (No)
  • Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those than are less than or equal to 8 units (in 1,000s)

We use the read.csv() function to import the CSV file into R as a dataframe named cs. We set stringsAsFactors = FALSE to keep any character columns as-is.

cs <- read.csv(file = "carseats.csv", 
               stringsAsFactors = FALSE) 

Data & Variable Types

Before we start working with our data, we first want to learn about the data. We will typically preview the top and bottom observations to ensure that we do not have an issues importing the data. We can use the head() and tail() functions to view the first and last 6 observations in the cs dataframe.

head(cs) # first n (default n = 6) observations
##   CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 1       138     73          11        276   120       Bad  42        17   Yes
## 2       111     48          16        260    83      Good  65        10   Yes
## 3       113     35          10        269    80    Medium  59        12   Yes
## 4       117    100           4        466    97    Medium  55        14   Yes
## 5       141     64           3        340   128       Bad  38        13   Yes
## 6       124    113          13        501    72       Bad  78        16    No
##    US Sales_Lev
## 1 Yes      High
## 2 Yes      High
## 3 Yes      High
## 4 Yes       Low
## 5  No       Low
## 6 Yes      High
tail(cs) # last n (default n = 6) observations
##     CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 395       130     58          19        366   139       Bad  33        16   Yes
## 396       138    108          17        203   128      Good  33        14   Yes
## 397       139     23           3         37   120    Medium  55        11    No
## 398       162     26          12        368   159    Medium  40        18   Yes
## 399       100     79           7        284    95       Bad  50        12   Yes
## 400       134     37           0         27   120      Good  49        16   Yes
##      US Sales_Lev
## 395 Yes       Low
## 396 Yes      High
## 397 Yes       Low
## 398 Yes       Low
## 399 Yes       Low
## 400 Yes      High

We can use the str() function to obtain the structure of the data, including dimensionality of the dataframe (the number of observations (rows) and variables (columns)), the variable types and a preview of the data.

str(cs)
## 'data.frame':    400 obs. of  11 variables:
##  $ CompPrice  : int  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : int  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: int  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : int  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : int  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : chr  "Bad" "Good" "Medium" "Medium" ...
##  $ Age        : int  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : int  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ US         : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Sales_Lev  : chr  "High" "High" "High" "Low" ...

Based on the structure of the cs dataframe, we have a mix of integer(int, numeric) and character (chr) variables. First, we can take a closer look at the character variables. We can created ordered or unordered (the default) factors. Based on the structure information, ShelveLoc and SalesLev are ordered categorical variables and Urban and US are unordered categorical variables. We will set up vectors of the variable names for each type of variable to make it easier to refer to them.

We set up a vector (named facs) with the names of the character variables that are unordered categorical variables and a vector (named ords) with the names of the character variables that are ordered categorical variables to make it easier to refer to them.

facs <- c("Urban", "US")
ords <- c("ShelveLoc", "Sales_Lev")

We can also create a vector of the column names of the numeric variables (named nums), which includes the names of all of the variables (columns) not in (using ! and %in%) the facs and ords vectors.

nums <- names(cs)[!names(cs) %in% c(facs, ords)]

We can use the unique() function to view the unique values that a variable can take on.

unique(cs[ ,"ShelveLoc"])
## [1] "Bad"    "Good"   "Medium"

We can use the lapply() function to apply a function over a list or vector and return a list object as the result. We will use the function to apply the unique function to all of the character variables(facs and ords) in the cs dataframe.

lapply(X = cs[ ,c(facs, ords)], FUN = unique)
## $Urban
## [1] "Yes" "No" 
## 
## $US
## [1] "Yes" "No" 
## 
## $ShelveLoc
## [1] "Bad"    "Good"   "Medium"
## 
## $Sales_Lev
## [1] "High" "Low"

Character variables are typically categorical variables and we will want to convert the ShelveLoc and Sales_Lev to ordered factors and US and Urban to unordered factors. Factors are a special type of variable in R that have levels (unique values that they can assume). To convert a variable to a factor, we use the factor() or as.factor() function. By default, the factor will be unordered.

cs$US <- factor(cs$US)

We can use the lapply() function to convert all of the variables in the facs vector to factor variables.

cs[ ,facs] <- lapply(X = cs[ , facs], FUN = factor)

To convert a character variable to an ordinal variable (ordered factor variable), we use the ordered argument of the factor() function. By default, the ordering is automatically alphabetical, so we need to specify the correct order using the levels argument.

cs$ShelveLoc <- factor(x = cs$ShelveLoc, 
                       levels = c("Bad", "Medium", "Good"),
                       ordered = TRUE)

cs$Sales_Lev <- factor(x = cs$Sales_Lev,
                       levels = c("Low", "High"),
                       ordered = TRUE)

Data Exploration

To explore our data, we can use the Abstract() function in the DescTools package.

Abstract(cs)
## ------------------------------------------------------------------------------ 
## cs
## 
## data frame:  400 obs. of  11 variables
##      400 complete cases (100.0%)
## 
##   Nr  ColName      Class            NAs  Levels                      
##   1   CompPrice    integer          .                                
##   2   Income       integer          .                                
##   3   Advertising  integer          .                                
##   4   Population   integer          .                                
##   5   Price        integer          .                                
##   6   ShelveLoc    ordered, factor  .    (3): 1-Bad, 2-Medium, 3-Good
##   7   Age          integer          .                                
##   8   Education    integer          .                                
##   9   Urban        factor           .    (2): 1-No, 2-Yes            
##   10  US           factor           .    (2): 1-No, 2-Yes            
##   11  Sales_Lev    ordered, factor  .    (2): 1-Low, 2-High

We can use the summary() function in the stats package to obtain basic summary statistic information about the variables in the cs dataframe. For numeric variables, the output includes the minimum, maximum, mean, median and quantiles. For factor variables, the output includes frequency information.

summary(cs)
##    CompPrice       Income        Advertising       Population   
##  Min.   : 77   Min.   : 21.00   Min.   : 0.000   Min.   : 10.0  
##  1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000   1st Qu.:139.0  
##  Median :125   Median : 69.00   Median : 5.000   Median :272.0  
##  Mean   :125   Mean   : 68.66   Mean   : 6.635   Mean   :264.8  
##  3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000   3rd Qu.:398.5  
##  Max.   :175   Max.   :120.00   Max.   :29.000   Max.   :509.0  
##      Price        ShelveLoc        Age          Education    Urban    
##  Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0   No :118  
##  1st Qu.:100.0   Medium:219   1st Qu.:39.75   1st Qu.:12.0   Yes:282  
##  Median :117.0   Good  : 85   Median :54.50   Median :14.0            
##  Mean   :115.8                Mean   :53.32   Mean   :13.9            
##  3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0            
##  Max.   :191.0                Max.   :80.00   Max.   :18.0            
##    US      Sales_Lev 
##  No :142   Low :236  
##  Yes:258   High:164  
##                      
##                      
##                      
## 

Numeric Variables

For more advanced descriptive statistic information, we can use the Desc() function from the DescTools package. In addition to the values from the summary() function, the Desc() function outputs: NA value frequency, the number of unique values, the number and percentage of 0 values, confidence interval for the mean, range, standard deviation, coefficient of variation, median absolute deviation, IQR, skewness, kurtosis and extreme values (highest and lowest). For continuous numeric variables as histogram with a density line, a boxplot and a plot empirical (cumulative) distribution function are also output. For discrete numeric variables the output will include frequency information instead of the extreme values and displays the histogram with lines instead of bars. Below is an example of the output for a continuous numeric variable, Population.

Desc(x = cs$Population)
## ------------------------------------------------------------------------------ 
## cs$Population (integer)
## 
##   length       n     NAs  unique      0s    mean  meanCI'
##      400     400       0     275       0  264.84  250.35
##           100.0%    0.0%            0.0%          279.33
##                                                         
##      .05     .10     .25  median     .75     .90     .95
##    29.00   58.90  139.00  272.00  398.50  467.00  493.15
##                                                         
##    range      sd   vcoef     mad     IQR    skew    kurt
##   499.00  147.38    0.56  191.26  259.50   -0.05   -1.21
##                                                         
## lowest : 10, 12, 13, 14, 16 (2)
## highest: 503, 504, 507, 508 (2), 509
## 
## ' 95%-CI (classic)

We can use the var() function to obtain the variance of a single variable.

var(cs$CompPrice)
## [1] 235.1472

To obtain the variances for all of the numeric variables, we can use the sapply() function to apply the var() function over a list or vector and return a vector as the result.

sapply(X = cs[ ,nums], FUN = var)
##    CompPrice       Income  Advertising   Population        Price          Age 
##   235.147243   783.218239    44.227343 21719.813935   560.584436   262.449618 
##    Education 
##     6.867168

There is no function in R to directly obtain the mode, which is the most frequent value(s) of a variable. There can be none, one or many modes. The modefun() function is a custom function which will return the most frequent value (or values) if one exists, and “No mode exists”, otherwise. The required argument, x, must be a vector or variable (dataframe column).

modefun <- function(x){
  if(any(tabulate(match(x, unique(x))) > 1)){
  outp <- unique(x)[which(tabulate(match(x, unique(x))) == max(tabulate(match(x, unique(x)))))]
  } else {
    outp <- "No mode exists"}
  return(outp)
}

To obtain the mode of a single variable, CompPrice, we can use the modefun() function.

modefun(x = cs$CompPrice)
## [1] 121

To obtain the modes for all of the numeric variables in the cs dataframe, we can use the sapply() function.

sapply(X = cs[ ,nums], FUN = modefun)
## $CompPrice
## [1] 121
## 
## $Income
## [1] 69
## 
## $Advertising
## [1] 0
## 
## $Population
## [1] 276 148 497 170 237 125 220
## 
## $Price
## [1] 120 128
## 
## $Age
## [1] 62
## 
## $Education
## [1] 17 12

Categorical (Factor) Variables

To summarize individual categorical variables (both ordered and unordered), we can use a one-way frequency table, using the table() function.

table(cs$ShelveLoc)
## 
##    Bad Medium   Good 
##     96    219     85

Below is an example of the output for an ordered categorical (factor) variable using the Desc() function in the DescTools package. The output includes frequency, percentage and confidence interval information. Barplots including frequency and percentage information (along with cumulative percentage) are also displayed as output.

Desc(x = cs$ShelveLoc)
## ------------------------------------------------------------------------------ 
## cs$ShelveLoc (ordered, factor)
## 
##   length      n    NAs unique levels  dupes
##      400    400      0      3      3      y
##          100.0%   0.0%                     
## 
##     level  freq   perc  cumfreq  cumperc
## 1     Bad    96  24.0%       96    24.0%
## 2  Medium   219  54.8%      315    78.8%
## 3    Good    85  21.2%      400   100.0%

Two-way frequency tables (sometimes called contingency tables or crosstabs) can be used to display frequency information for two categorical variables. The table() function can be used.

crosstabs <- table(cs$Sales_Lev, cs$ShelveLoc,
                   dnn = c("Level", "Location"))
crosstabs
##       Location
## Level  Bad Medium Good
##   Low   82    135   19
##   High  14     84   66

The Desc() function can be used (suing the formula argument) to produce two-way frequency tables and perform tests of variable association. Mosiac Plots are included in the output.

Desc(formula = Sales_Lev ~ ShelveLoc, data = cs)
## ------------------------------------------------------------------------------ 
## Sales_Lev ~ ShelveLoc (cs)
## 
## Summary: 
## n: 400, rows: 2, columns: 3
## 
## Pearson's Chi-squared test:
##   X-squared = 75.518, df = 2, p-value < 2.2e-16
## Log likelihood ratio (G-test) test of independence:
##   G = 79.788, X-squared df = 2, p-value < 2.2e-16
## Mantel-Haenszel Chi-squared:
##   X-squared = 72.874, df = 1, p-value < 2.2e-16
## 
## Phi-Coefficient        0.435
## Contingency Coeff.     0.399
## Cramer's V             0.435
## 
##                                                    
##             ShelveLoc    Bad   Medium   Good    Sum
## Sales_Lev                                          
##                                                    
## Low         freq          82      135     19    236
##             perc       20.5%    33.8%   4.8%  59.0%
##             p.row      34.7%    57.2%   8.1%      .
##             p.col      85.4%    61.6%  22.4%      .
##                                                    
## High        freq          14       84     66    164
##             perc        3.5%    21.0%  16.5%  41.0%
##             p.row       8.5%    51.2%  40.2%      .
##             p.col      14.6%    38.4%  77.6%      .
##                                                    
## Sum         freq          96      219     85    400
##             perc       24.0%    54.8%  21.3% 100.0%
##             p.row          .        .      .      .
##             p.col          .        .      .      .
## 

To look for differences in distributional properties of numerical variables across categorical variable class levels, we can use the Desc() function (using the formula argument.

Desc(formula = Price ~ ShelveLoc, data = cs)
## ------------------------------------------------------------------------------ 
## Price ~ ShelveLoc (cs)
## 
## Summary: 
## n pairs: 400, valid: 400 (100.0%), missings: 0 (0.0%), groups: 3
## 
##                                  
##             Bad   Medium     Good
## mean    114.271  115.653  117.882
## median  113.500  117.000  122.000
## sd       23.779   23.099   25.129
## IQR      33.250   30.000   29.000
## n            96      219       85
## np      24.000%  54.750%  21.250%
## NAs           0        0        0
## 0s            0        0        0
## 
## Kruskal-Wallis rank sum test:
##   Kruskal-Wallis chi-squared = 1.6523, df = 2, p-value = 0.4377

Variable Association

Categorical Variables

The output from the Desc() function when two categorical variables are provided in the formula argument can be used to test for relationships among the variables. For unordered variables, the output of the Pearson’s Chi-squared test can be used to test for a relationship. For ordered variables, the output of the Mantel-Haenszel Chi-squared test can be used.

Desc(Urban ~ US, data = cs)
## ------------------------------------------------------------------------------ 
## Urban ~ US (cs)
## 
## Summary: 
## n: 400, rows: 2, columns: 2
## 
## Pearson's Chi-squared test (cont. adj):
##   X-squared = 0.68416, df = 1, p-value = 0.4082
## Fisher's exact test p-value = 0.3609
## McNemar's chi-squared = 3.1488, df = 1, p-value = 0.07598
## 
##                     estimate lwr.ci upr.ci'
##                                           
## odds ratio             1.238  0.794  1.931
## rel. risk (col1)       1.145  0.867  1.512
## rel. risk (col2)       0.925  0.783  1.093
## 
## 
## Phi-Coefficient        0.047
## Contingency Coeff.     0.047
## Cramer's V             0.047
## 
##                                   
##         US        No    Yes    Sum
## Urban                             
##                                   
## No      freq      46     72    118
##         perc   11.5%  18.0%  29.5%
##         p.row  39.0%  61.0%      .
##         p.col  32.4%  27.9%      .
##                                   
## Yes     freq      96    186    282
##         perc   24.0%  46.5%  70.5%
##         p.row  34.0%  66.0%      .
##         p.col  67.6%  72.1%      .
##                                   
## Sum     freq     142    258    400
##         perc   35.5%  64.5% 100.0%
##         p.row      .      .      .
##         p.col      .      .      .
##                                   
## 
## ----------
## ' 95% conf. level

Numerical Variables

We use the cor() function to obtain the association, or linear relationship, between numeric variables. The value assumes that the variables have a linear relationship. If the relationship is not linear, correlation should not be used. Correlation provides both the strength (weak, strong) and direction (positive, negative) of the relationship. Correlation values range from -1 to +1.

cor(cs$CompPrice, cs$Price)
## [1] 0.5848478

If more than two variables are provided as input to the cor() function, the output will be a symmetric (since the correlation between a variable x and a variable y is the same as the correlation between the variable y and the variable x) correlation matrix with 1s on the diagonal (since the correlation between a variable and itself is equal to 1). If NA values are present in the data, they should be removed or the use argument should be set equal to “complete.obs”.

cor(x = cs[,nums])
##               CompPrice       Income  Advertising   Population       Price
## CompPrice    1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777
## Income      -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820
## Advertising -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687
## Population  -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362
## Price        0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000
## Age         -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684
## Education    0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660
##                      Age    Education
## CompPrice   -0.100238817  0.025197050
## Income      -0.004670094 -0.056855422
## Advertising -0.004557497 -0.033594307
## Population  -0.042663355 -0.106378231
## Price       -0.102176839  0.011746599
## Age          1.000000000  0.006488032
## Education    0.006488032  1.000000000

For data with high variable dimensionality, it is convenient to use the symnum() function, which provides a matrix with symbols indicating the level (strength) of correlation among the variables, rather than the correlation values themselves. By setting corr = TRUE, the absolute values from the correlation matrix are used. As shown, the strongest correlation among the variables is between Price and CompPrice, with a correlation between 0.3 and 0.6. Note: the output does not include the direction of the relationship, only the strength.

symnum(x = cor(cs[,nums]), 
       corr = TRUE)
##             C I Ad Pp Pr Ag E
## CompPrice   1                
## Income        1              
## Advertising     1            
## Population         1         
## Price       .         1      
## Age                      1   
## Education                   1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Data Visualization

Numerical Variables

Box Plots, also called box-and-whisker plots can be used to produce univariate or grouped visualizations displaying the percentile or quantile information. The whiskers (horizontal lines) represent the extreme values of the distribution that are not outliers. Outliers are displayed as points outside of the whiskers. The box is made up of Q1, Q2 (median) and Q3. The boxplot() function is used to create univariate boxplots (using the x argument) or boxplots grouped by categorical (factor) variables (using the formula argument).

boxplot(x = cs$Price, 
        main = "Price Box Plot")

By specifying a formula of the format numerical variable ~ factor variable, grouped boxplots can be created.

boxplot(formula = Price ~ ShelveLoc, 
        data = cs,
        col = cm.colors(3))

Histograms are used to display the distribution of continuous numerical variables organized into bins, with the height of the bars indicating the frequency of observations in a given range. Histograms are created using the hist() function.

hist(x = cs$Price, 
     main = "Price Histogram", 
     xlab = "", 
     col = "steelblue")

Correlation Plots can be used to create a visualization of a correlation matrix, which indicates the strength and direction of the linear relationship using a color scale. The PlotCorr() function in the DescTools package will create a correlation plot. The x argument must contain a correlation matrix created using the cor() function.

PlotCorr(x = cor(cs[,nums]))

To visualize the relationship between two numerical variables, a Scatter Plot is used, with one variable mapped to the x-axis and another variable mapped to the y-axis. You can either use the x and y arguments to map variables to the axes or you can use the formula argument using the format y ~ x and the data argument to identify the dataframe containing the variables.

plot(formula = Price ~ CompPrice,
     data = cs,
     pch = 16,
     xlab = "Competitor Price",
     ylab = "Price")

A matrix of Scatter Plots can be created using the pairs() function.

pairs(x = cs[,nums])

Categorical Variables

Pie Charts are used to show frequency or proportion information. The pie() function is used to create a Pie Chart with a frequency table created using the table() function as input to the x argument.

SLtab <- table(cs$ShelveLoc)
pie(x = SLtab, 
    main="Shelf Location Pie Chart")

Bar Plots are the most popular plotting method used for categorical and discrete numerical variables (when the number of values the discrete variable can take on is low). The barplot() function can be used to create a Bar Plot, with an object created using the table() function as input to the height argument.

barplot(height = SLtab,
        main = "Shelf Location Bar Plot",
        col = heat.colors(3))

We can use Grouped Bar Plots to visualize the distribution of more than one categorical (factor) variable. We can either display the bars side-by-side or stacked (default). A two-way frequency table created using the table() function is used as input to the height argument. Color is used for the grouping variable and a legend should be included.

barplot(height = crosstabs,
        col = cm.colors(2),
        legend.text = TRUE,
        args.legend = list(x = "topright"),
        main = "Shelf Location by Sales Level")