We will use the DescTools package to explore and work with our data. If you do not already have these packages installed, you will first install the packages using the install.packages() function. The package has a very useful and comprehensive vignette which can be accessed through the help documentation in RStudio.
install.packages("DescTools")
Once installed, we load the package for use in the R session using the library() function.
library(DescTools)
In the lessons that follow we will use the “carseats.csv” file. This dataset contains sales of child car seats at 400 different stores. The variables in the dataset include:
CompPrice: Price charged by competitor at each locationIncome: Community income level (in 1,000s of dollars)Advertising: Local advertising budget for the company at each location (in 1,000s of dollars)Population: Population size in region (in 1,000s)Price: Price charged by company for car seats at each locationsShelveLoc: Quality of shelving location for the car seats at each location (Bad, Medium, Good)Age: Average age of the local populationEducation: Education level at each locationUrban: Indicates if the location is urban (Yes) or rural (No)US: Indicates if the location is in the US (Yes) or outside of the US (No)Sales_Lev: Indicates the sales level of the location. High sales are those that are greater than 8 units (in 1,000s) and Low sales are those than are less than or equal to 8 units (in 1,000s)We use the read.csv() function to import the CSV file into R as a dataframe named cs. We set stringsAsFactors = FALSE to keep any character columns as-is.
cs <- read.csv(file = "carseats.csv",
stringsAsFactors = FALSE)
Before we start working with our data, we first want to learn about the data. We will typically preview the top and bottom observations to ensure that we do not have an issues importing the data. We can use the head() and tail() functions to view the first and last 6 observations in the cs dataframe.
head(cs) # first n (default n = 6) observations
## CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 1 138 73 11 276 120 Bad 42 17 Yes
## 2 111 48 16 260 83 Good 65 10 Yes
## 3 113 35 10 269 80 Medium 59 12 Yes
## 4 117 100 4 466 97 Medium 55 14 Yes
## 5 141 64 3 340 128 Bad 38 13 Yes
## 6 124 113 13 501 72 Bad 78 16 No
## US Sales_Lev
## 1 Yes High
## 2 Yes High
## 3 Yes High
## 4 Yes Low
## 5 No Low
## 6 Yes High
tail(cs) # last n (default n = 6) observations
## CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
## 395 130 58 19 366 139 Bad 33 16 Yes
## 396 138 108 17 203 128 Good 33 14 Yes
## 397 139 23 3 37 120 Medium 55 11 No
## 398 162 26 12 368 159 Medium 40 18 Yes
## 399 100 79 7 284 95 Bad 50 12 Yes
## 400 134 37 0 27 120 Good 49 16 Yes
## US Sales_Lev
## 395 Yes Low
## 396 Yes High
## 397 Yes Low
## 398 Yes Low
## 399 Yes Low
## 400 Yes High
We can use the str() function to obtain the structure of the data, including dimensionality of the dataframe (the number of observations (rows) and variables (columns)), the variable types and a preview of the data.
str(cs)
## 'data.frame': 400 obs. of 11 variables:
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : chr "Bad" "Good" "Medium" "Medium" ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : chr "Yes" "Yes" "Yes" "Yes" ...
## $ US : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Sales_Lev : chr "High" "High" "High" "Low" ...
Based on the structure of the cs dataframe, we have a mix of integer(int, numeric) and character (chr) variables. First, we can take a closer look at the character variables. We can created ordered or unordered (the default) factors. Based on the structure information, ShelveLoc and SalesLev are ordered categorical variables and Urban and US are unordered categorical variables. We will set up vectors of the variable names for each type of variable to make it easier to refer to them.
We set up a vector (named facs) with the names of the character variables that are unordered categorical variables and a vector (named ords) with the names of the character variables that are ordered categorical variables to make it easier to refer to them.
facs <- c("Urban", "US")
ords <- c("ShelveLoc", "Sales_Lev")
We can also create a vector of the column names of the numeric variables (named nums), which includes the names of all of the variables (columns) not in (using ! and %in%) the facs and ords vectors.
nums <- names(cs)[!names(cs) %in% c(facs, ords)]
We can use the unique() function to view the unique values that a variable can take on.
unique(cs[ ,"ShelveLoc"])
## [1] "Bad" "Good" "Medium"
We can use the lapply() function to apply a function over a list or vector and return a list object as the result. We will use the function to apply the unique function to all of the character variables(facs and ords) in the cs dataframe.
lapply(X = cs[ ,c(facs, ords)], FUN = unique)
## $Urban
## [1] "Yes" "No"
##
## $US
## [1] "Yes" "No"
##
## $ShelveLoc
## [1] "Bad" "Good" "Medium"
##
## $Sales_Lev
## [1] "High" "Low"
Character variables are typically categorical variables and we will want to convert the ShelveLoc and Sales_Lev to ordered factors and US and Urban to unordered factors. Factors are a special type of variable in R that have levels (unique values that they can assume). To convert a variable to a factor, we use the factor() or as.factor() function. By default, the factor will be unordered.
cs$US <- factor(cs$US)
We can use the lapply() function to convert all of the variables in the facs vector to factor variables.
cs[ ,facs] <- lapply(X = cs[ , facs], FUN = factor)
To convert a character variable to an ordinal variable (ordered factor variable), we use the ordered argument of the factor() function. By default, the ordering is automatically alphabetical, so we need to specify the correct order using the levels argument.
cs$ShelveLoc <- factor(x = cs$ShelveLoc,
levels = c("Bad", "Medium", "Good"),
ordered = TRUE)
cs$Sales_Lev <- factor(x = cs$Sales_Lev,
levels = c("Low", "High"),
ordered = TRUE)
To explore our data, we can use the Abstract() function in the DescTools package.
Abstract(cs)
## ------------------------------------------------------------------------------
## cs
##
## data frame: 400 obs. of 11 variables
## 400 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 CompPrice integer .
## 2 Income integer .
## 3 Advertising integer .
## 4 Population integer .
## 5 Price integer .
## 6 ShelveLoc ordered, factor . (3): 1-Bad, 2-Medium, 3-Good
## 7 Age integer .
## 8 Education integer .
## 9 Urban factor . (2): 1-No, 2-Yes
## 10 US factor . (2): 1-No, 2-Yes
## 11 Sales_Lev ordered, factor . (2): 1-Low, 2-High
We can use the summary() function in the stats package to obtain basic summary statistic information about the variables in the cs dataframe. For numeric variables, the output includes the minimum, maximum, mean, median and quantiles. For factor variables, the output includes frequency information.
summary(cs)
## CompPrice Income Advertising Population
## Min. : 77 Min. : 21.00 Min. : 0.000 Min. : 10.0
## 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000 1st Qu.:139.0
## Median :125 Median : 69.00 Median : 5.000 Median :272.0
## Mean :125 Mean : 68.66 Mean : 6.635 Mean :264.8
## 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000 3rd Qu.:398.5
## Max. :175 Max. :120.00 Max. :29.000 Max. :509.0
## Price ShelveLoc Age Education Urban
## Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0 No :118
## 1st Qu.:100.0 Medium:219 1st Qu.:39.75 1st Qu.:12.0 Yes:282
## Median :117.0 Good : 85 Median :54.50 Median :14.0
## Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :191.0 Max. :80.00 Max. :18.0
## US Sales_Lev
## No :142 Low :236
## Yes:258 High:164
##
##
##
##
For more advanced descriptive statistic information, we can use the Desc() function from the DescTools package. In addition to the values from the summary() function, the Desc() function outputs: NA value frequency, the number of unique values, the number and percentage of 0 values, confidence interval for the mean, range, standard deviation, coefficient of variation, median absolute deviation, IQR, skewness, kurtosis and extreme values (highest and lowest). For continuous numeric variables as histogram with a density line, a boxplot and a plot empirical (cumulative) distribution function are also output. For discrete numeric variables the output will include frequency information instead of the extreme values and displays the histogram with lines instead of bars. Below is an example of the output for a continuous numeric variable, Population.
Desc(x = cs$Population)
## ------------------------------------------------------------------------------
## cs$Population (integer)
##
## length n NAs unique 0s mean meanCI'
## 400 400 0 275 0 264.84 250.35
## 100.0% 0.0% 0.0% 279.33
##
## .05 .10 .25 median .75 .90 .95
## 29.00 58.90 139.00 272.00 398.50 467.00 493.15
##
## range sd vcoef mad IQR skew kurt
## 499.00 147.38 0.56 191.26 259.50 -0.05 -1.21
##
## lowest : 10, 12, 13, 14, 16 (2)
## highest: 503, 504, 507, 508 (2), 509
##
## ' 95%-CI (classic)
We can use the var() function to obtain the variance of a single variable.
var(cs$CompPrice)
## [1] 235.1472
To obtain the variances for all of the numeric variables, we can use the sapply() function to apply the var() function over a list or vector and return a vector as the result.
sapply(X = cs[ ,nums], FUN = var)
## CompPrice Income Advertising Population Price Age
## 235.147243 783.218239 44.227343 21719.813935 560.584436 262.449618
## Education
## 6.867168
There is no function in R to directly obtain the mode, which is the most frequent value(s) of a variable. There can be none, one or many modes. The modefun() function is a custom function which will return the most frequent value (or values) if one exists, and “No mode exists”, otherwise. The required argument, x, must be a vector or variable (dataframe column).
modefun <- function(x){
if(any(tabulate(match(x, unique(x))) > 1)){
outp <- unique(x)[which(tabulate(match(x, unique(x))) == max(tabulate(match(x, unique(x)))))]
} else {
outp <- "No mode exists"}
return(outp)
}
To obtain the mode of a single variable, CompPrice, we can use the modefun() function.
modefun(x = cs$CompPrice)
## [1] 121
To obtain the modes for all of the numeric variables in the cs dataframe, we can use the sapply() function.
sapply(X = cs[ ,nums], FUN = modefun)
## $CompPrice
## [1] 121
##
## $Income
## [1] 69
##
## $Advertising
## [1] 0
##
## $Population
## [1] 276 148 497 170 237 125 220
##
## $Price
## [1] 120 128
##
## $Age
## [1] 62
##
## $Education
## [1] 17 12
To summarize individual categorical variables (both ordered and unordered), we can use a one-way frequency table, using the table() function.
table(cs$ShelveLoc)
##
## Bad Medium Good
## 96 219 85
Below is an example of the output for an ordered categorical (factor) variable using the Desc() function in the DescTools package. The output includes frequency, percentage and confidence interval information. Barplots including frequency and percentage information (along with cumulative percentage) are also displayed as output.
Desc(x = cs$ShelveLoc)
## ------------------------------------------------------------------------------
## cs$ShelveLoc (ordered, factor)
##
## length n NAs unique levels dupes
## 400 400 0 3 3 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 Bad 96 24.0% 96 24.0%
## 2 Medium 219 54.8% 315 78.8%
## 3 Good 85 21.2% 400 100.0%
Two-way frequency tables (sometimes called contingency tables or crosstabs) can be used to display frequency information for two categorical variables. The table() function can be used.
crosstabs <- table(cs$Sales_Lev, cs$ShelveLoc,
dnn = c("Level", "Location"))
crosstabs
## Location
## Level Bad Medium Good
## Low 82 135 19
## High 14 84 66
The Desc() function can be used (suing the formula argument) to produce two-way frequency tables and perform tests of variable association. Mosiac Plots are included in the output.
Desc(formula = Sales_Lev ~ ShelveLoc, data = cs)
## ------------------------------------------------------------------------------
## Sales_Lev ~ ShelveLoc (cs)
##
## Summary:
## n: 400, rows: 2, columns: 3
##
## Pearson's Chi-squared test:
## X-squared = 75.518, df = 2, p-value < 2.2e-16
## Log likelihood ratio (G-test) test of independence:
## G = 79.788, X-squared df = 2, p-value < 2.2e-16
## Mantel-Haenszel Chi-squared:
## X-squared = 72.874, df = 1, p-value < 2.2e-16
##
## Phi-Coefficient 0.435
## Contingency Coeff. 0.399
## Cramer's V 0.435
##
##
## ShelveLoc Bad Medium Good Sum
## Sales_Lev
##
## Low freq 82 135 19 236
## perc 20.5% 33.8% 4.8% 59.0%
## p.row 34.7% 57.2% 8.1% .
## p.col 85.4% 61.6% 22.4% .
##
## High freq 14 84 66 164
## perc 3.5% 21.0% 16.5% 41.0%
## p.row 8.5% 51.2% 40.2% .
## p.col 14.6% 38.4% 77.6% .
##
## Sum freq 96 219 85 400
## perc 24.0% 54.8% 21.3% 100.0%
## p.row . . . .
## p.col . . . .
##
To look for differences in distributional properties of numerical variables across categorical variable class levels, we can use the Desc() function (using the formula argument.
Desc(formula = Price ~ ShelveLoc, data = cs)
## ------------------------------------------------------------------------------
## Price ~ ShelveLoc (cs)
##
## Summary:
## n pairs: 400, valid: 400 (100.0%), missings: 0 (0.0%), groups: 3
##
##
## Bad Medium Good
## mean 114.271 115.653 117.882
## median 113.500 117.000 122.000
## sd 23.779 23.099 25.129
## IQR 33.250 30.000 29.000
## n 96 219 85
## np 24.000% 54.750% 21.250%
## NAs 0 0 0
## 0s 0 0 0
##
## Kruskal-Wallis rank sum test:
## Kruskal-Wallis chi-squared = 1.6523, df = 2, p-value = 0.4377
The output from the Desc() function when two categorical variables are provided in the formula argument can be used to test for relationships among the variables. For unordered variables, the output of the Pearson’s Chi-squared test can be used to test for a relationship. For ordered variables, the output of the Mantel-Haenszel Chi-squared test can be used.
Desc(Urban ~ US, data = cs)
## ------------------------------------------------------------------------------
## Urban ~ US (cs)
##
## Summary:
## n: 400, rows: 2, columns: 2
##
## Pearson's Chi-squared test (cont. adj):
## X-squared = 0.68416, df = 1, p-value = 0.4082
## Fisher's exact test p-value = 0.3609
## McNemar's chi-squared = 3.1488, df = 1, p-value = 0.07598
##
## estimate lwr.ci upr.ci'
##
## odds ratio 1.238 0.794 1.931
## rel. risk (col1) 1.145 0.867 1.512
## rel. risk (col2) 0.925 0.783 1.093
##
##
## Phi-Coefficient 0.047
## Contingency Coeff. 0.047
## Cramer's V 0.047
##
##
## US No Yes Sum
## Urban
##
## No freq 46 72 118
## perc 11.5% 18.0% 29.5%
## p.row 39.0% 61.0% .
## p.col 32.4% 27.9% .
##
## Yes freq 96 186 282
## perc 24.0% 46.5% 70.5%
## p.row 34.0% 66.0% .
## p.col 67.6% 72.1% .
##
## Sum freq 142 258 400
## perc 35.5% 64.5% 100.0%
## p.row . . .
## p.col . . .
##
##
## ----------
## ' 95% conf. level
We use the cor() function to obtain the association, or linear relationship, between numeric variables. The value assumes that the variables have a linear relationship. If the relationship is not linear, correlation should not be used. Correlation provides both the strength (weak, strong) and direction (positive, negative) of the relationship. Correlation values range from -1 to +1.
cor(cs$CompPrice, cs$Price)
## [1] 0.5848478
If more than two variables are provided as input to the cor() function, the output will be a symmetric (since the correlation between a variable x and a variable y is the same as the correlation between the variable y and the variable x) correlation matrix with 1s on the diagonal (since the correlation between a variable and itself is equal to 1). If NA values are present in the data, they should be removed or the use argument should be set equal to “complete.obs”.
cor(x = cs[,nums])
## CompPrice Income Advertising Population Price
## CompPrice 1.00000000 -0.080653423 -0.024198788 -0.094706516 0.58484777
## Income -0.08065342 1.000000000 0.058994706 -0.007876994 -0.05669820
## Advertising -0.02419879 0.058994706 1.000000000 0.265652145 0.04453687
## Population -0.09470652 -0.007876994 0.265652145 1.000000000 -0.01214362
## Price 0.58484777 -0.056698202 0.044536874 -0.012143620 1.00000000
## Age -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684
## Education 0.02519705 -0.056855422 -0.033594307 -0.106378231 0.01174660
## Age Education
## CompPrice -0.100238817 0.025197050
## Income -0.004670094 -0.056855422
## Advertising -0.004557497 -0.033594307
## Population -0.042663355 -0.106378231
## Price -0.102176839 0.011746599
## Age 1.000000000 0.006488032
## Education 0.006488032 1.000000000
For data with high variable dimensionality, it is convenient to use the symnum() function, which provides a matrix with symbols indicating the level (strength) of correlation among the variables, rather than the correlation values themselves. By setting corr = TRUE, the absolute values from the correlation matrix are used. As shown, the strongest correlation among the variables is between Price and CompPrice, with a correlation between 0.3 and 0.6. Note: the output does not include the direction of the relationship, only the strength.
symnum(x = cor(cs[,nums]),
corr = TRUE)
## C I Ad Pp Pr Ag E
## CompPrice 1
## Income 1
## Advertising 1
## Population 1
## Price . 1
## Age 1
## Education 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
Box Plots, also called box-and-whisker plots can be used to produce univariate or grouped visualizations displaying the percentile or quantile information. The whiskers (horizontal lines) represent the extreme values of the distribution that are not outliers. Outliers are displayed as points outside of the whiskers. The box is made up of Q1, Q2 (median) and Q3. The boxplot() function is used to create univariate boxplots (using the x argument) or boxplots grouped by categorical (factor) variables (using the formula argument).
boxplot(x = cs$Price,
main = "Price Box Plot")
By specifying a formula of the format numerical variable ~ factor variable, grouped boxplots can be created.
boxplot(formula = Price ~ ShelveLoc,
data = cs,
col = cm.colors(3))
Histograms are used to display the distribution of continuous numerical variables organized into bins, with the height of the bars indicating the frequency of observations in a given range. Histograms are created using the hist() function.
hist(x = cs$Price,
main = "Price Histogram",
xlab = "",
col = "steelblue")
Correlation Plots can be used to create a visualization of a correlation matrix, which indicates the strength and direction of the linear relationship using a color scale. The PlotCorr() function in the DescTools package will create a correlation plot. The x argument must contain a correlation matrix created using the cor() function.
PlotCorr(x = cor(cs[,nums]))
To visualize the relationship between two numerical variables, a Scatter Plot is used, with one variable mapped to the x-axis and another variable mapped to the y-axis. You can either use the x and y arguments to map variables to the axes or you can use the formula argument using the format y ~ x and the data argument to identify the dataframe containing the variables.
plot(formula = Price ~ CompPrice,
data = cs,
pch = 16,
xlab = "Competitor Price",
ylab = "Price")
A matrix of Scatter Plots can be created using the pairs() function.
pairs(x = cs[,nums])
Pie Charts are used to show frequency or proportion information. The pie() function is used to create a Pie Chart with a frequency table created using the table() function as input to the x argument.
SLtab <- table(cs$ShelveLoc)
pie(x = SLtab,
main="Shelf Location Pie Chart")
Bar Plots are the most popular plotting method used for categorical and discrete numerical variables (when the number of values the discrete variable can take on is low). The barplot() function can be used to create a Bar Plot, with an object created using the table() function as input to the height argument.
barplot(height = SLtab,
main = "Shelf Location Bar Plot",
col = heat.colors(3))
We can use Grouped Bar Plots to visualize the distribution of more than one categorical (factor) variable. We can either display the bars side-by-side or stacked (default). A two-way frequency table created using the table() function is used as input to the height argument. Color is used for the grouping variable and a legend should be included.
barplot(height = crosstabs,
col = cm.colors(2),
legend.text = TRUE,
args.legend = list(x = "topright"),
main = "Shelf Location by Sales Level")