Exploratory data analysis (EDA) is the process of examining a dataset to build a deeper understanding of its contents and the patterns within it, using a combination of summary statistics and visualizations.
Examining summary statistics provides numerical answers to specific questions about the data. Alternatively, graphs can be used to explore the data visually, often providing more insight than a single summary statistic.
Gaming is a very big industry now. Millions of dollars are invested in esports every year, and many new companies want to get into the esports scene. One of the biggest deals ever was when Mixer launched and brought Ninja and Shroud over to its platform from Twitch. But Twitch has been a home to streamers since day one, and now that Mixer has been shut down, streamers are returning to the platform. Millions, if not billions, of people watch Twitch streams every day, and I like to watch Twitch streams myself. So I put together a dataset of the top 1000 streamers who were streaming on Twitch over the past year.
This data includes the number of viewers, number of active viewers, followers gained, and other relevant attributes for each streamer. It has 11 columns containing all the information needed for this analysis.
# Load the dataset
df <- read.csv("twitch.csv")
head(df, n=10)
## Channel Watch.time.Minutes. Stream.time.minutes. Peak.viewers
## 1 xQcOW 6196161750 215250 222720
## 2 summit1g 6091677300 211845 310998
## 3 Gaules 5644590915 515280 387315
## 4 ESL_CSGO 3970318140 517740 300575
## 5 Tfue 3671000070 123660 285644
## 6 Asmongold 3668799075 82260 263720
## 7 NICKMERCS 3360675195 136275 115633
## 8 Fextralife 3301867485 147885 68795
## 9 loltyler1 2928356940 122490 89387
## 10 Anomaly 2865429915 92880 125408
## Average.viewers Followers Followers.gained Views.gained Partnered Mature
## 1 27716 3246298 1734810 93036735 True False
## 2 25610 5310163 1370184 89705964 True False
## 3 10976 1767635 1023779 102611607 True True
## 4 7714 3944850 703986 106546942 True False
## 5 29602 8938903 2068424 78998587 True False
## 6 42414 1563438 554201 61715781 True False
## 7 24181 4074287 1089824 46084211 True False
## 8 18985 508816 425468 670137548 True False
## 9 22381 3530767 951730 51349926 True False
## 10 12377 2607076 1532689 36350662 True False
## Language
## 1 English
## 2 English
## 3 Portuguese
## 4 English
## 5 English
## 6 English
## 7 English
## 8 English
## 9 English
## 10 English
1. Take a peek into the data
2. Explore the structure of the data
3. Discover the data by asking questions, visualizing and analysing the data
4. Identification of missing values and missing value treatment
5. Exploring numeric data attributes (Univariate analysis)
5.1 Measure the central tendency
5.2 Visualize the shape of the distribution
5.3 Explore the spread using summary statistics
5.4 Explore the spread using visualization plots
5.5 Identify outliers and unusual observations
6. Exploring categorical data attributes (Univariate analysis)
6.1 Measure the central tendency
6.2 Explore the frequency of categories using a frequency table
6.3 Explore the frequency of categories using visualization plots
7. Exploring relationships between features (Bivariate & Multivariate analysis)
7.1 Cross table counts
7.2 Correlation
7.3 Visualization of relationships using scatter plots & heatmaps
Let us take a look at some raw data from the dataset.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DT)
datatable(data= df)
Let us now explore how the data is organised in terms of shape or dimensions of the data and also the data types of the features.
We can use the dim() function to see the number of rows and columns in the dataset. We can see that the dataset has 1000 observations or records and 11 features or variables.
library(mlbench)
# display the dimensions of the dataset
dim(df)
## [1] 1000 11
Let's explore the data types of the attributes in the dataset.
# list data types for each features
sapply(df,class)
## Channel Watch.time.Minutes. Stream.time.minutes.
## "character" "numeric" "integer"
## Peak.viewers Average.viewers Followers
## "integer" "integer" "integer"
## Followers.gained Views.gained Partnered
## "integer" "integer" "character"
## Mature Language
## "character" "character"
![Data Types](images/Data Types.png)
A numerical variable can take a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. On the other hand, we would not classify a variable reporting telephone area codes as numerical, since the average, sum, and difference of area codes have no clear meaning. Instead, we would consider area codes a categorical variable.
Some numerical variables can only take whole non-negative numbers. Such a variable is said to be discrete, since it can only take numerical values with jumps. Variables that can take any value within an interval, on the other hand, are continuous.
Because its values are categories, a variable such as Language is called a categorical variable, and the possible values are called the variable's levels.
Some categorical variables are a hybrid: their levels have a natural ordering. A variable with these properties is called an ordinal variable, while a regular categorical variable without this type of special ordering is called a nominal variable. To simplify the analysis, every categorical variable here will be treated as a nominal (unordered) categorical variable.
# Convert data types
df$Channel <- as.factor(df$Channel)
df$Partnered <- as.factor(df$Partnered)
df$Mature <- as.factor(df$Mature)
df$Language <- as.factor(df$Language)
sapply(df, class)
## Channel Watch.time.Minutes. Stream.time.minutes.
## "factor" "numeric" "integer"
## Peak.viewers Average.viewers Followers
## "integer" "integer" "integer"
## Followers.gained Views.gained Partnered
## "integer" "integer" "factor"
## Mature Language
## "factor" "factor"
Ask Questions
Before exploring further, it helps to note a few questions that the rest of this analysis tries to answer: How are watch time, viewer counts, and followers distributed, and are they skewed? Which languages are most common among the top 1000 streamers? And how are the features related to one another, for example followers and average viewers, or the Mature and Partnered flags?
Treatment for missing values
#df <- na.omit(df)
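A minimal sketch of a missing-value check (assuming df is loaded as above); only if some counts were non-zero would a treatment such as na.omit() be needed:
# Count missing values in each column
colSums(is.na(df))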
Measures of central tendency are used to identify a value that falls in the middle of a set of data.
The mean is a common way to measure the center of a distribution of data. The sample mean can be calculated as the sum of the observed values divided by the number of observations. The sample mean is often labeled \(\bar{x}\).
\(\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\)
df %>%
summarise(mean = mean(Peak.viewers), n = n())
## mean n
## 1 37065.05 1000
Although the mean is by far the most commonly cited statistic for measuring the center of a dataset, it is not always the most appropriate one. Another commonly used measure of central tendency is the median, which is the value that occurs at the midpoint of an ordered list of values.
At first glance, the median and mean seem to be very similar measures. Why have two measures of central tendency? The reason is that the mean and median are affected differently by values falling at the far ends of the range. In particular, the mean is highly sensitive to outliers, or values that are atypically high or low relative to the majority of the data, and is therefore more likely to be shifted higher or lower by a small number of extreme values.
Since the mean is more sensitive to extreme values than the median, the fact that the mean is much higher than the median might lead us to suspect that there are some streamers in the dataset with extremely high peak viewer counts.
df %>%
summarise(median = median(Peak.viewers), n = n())
## median n
## 1 16676 1000
Sometimes we are interested in the distribution of a single variable. In these cases, a dot plot provides the most basic of displays. A dot plot is a one-variable scatterplot
ggplot(df, aes(x = Peak.viewers)) +
geom_dotplot()
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df, aes(x = Peak.viewers)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A histogram is another way to visualize the spread of a numeric variable. Histograms provide a view of the data density. It is composed of a series of bars with heights indicating the count, or frequency, of values falling within each of the equal-width bins partitioning the values. Higher bars represent where the data are relatively more common.
We can create a histogram using the hist() function in base R; the plot above uses ggplot2's geom_histogram() instead.
Histograms are especially convenient for understanding the shape of the data distribution. When the distribution of a variable trails off to the right in this way and has a longer right tail, the shape is said to be right skewed. Variables with a long, thinner tail to the left are said to be left skewed; we also say that such a distribution has a long left tail. Variables that show roughly equal trailing off in both directions are called symmetric.
If a distribution looks nearly Gaussian but is pushed far left or right, it is useful to know the skew. Getting a feel for the skew is much easier with plots of the data, such as a histogram or density plot, than from means, standard deviations, and quartiles alone. Nevertheless, calculating the skew up front gives you a reference that you can use later if you decide to correct the skew for an attribute.
This characteristic is known as skew, or more specifically right skew, because the values on the high end (right side) are far more spread out than the values on the low end (left side); histograms of skewed data look stretched on one side.
In addition to looking at whether a distribution is skewed or symmetric, histograms can be used to identify modes. A mode is represented by a prominent peak in the distribution. A definition of mode sometimes taught in math classes is the value with the most occurrences in the dataset. Histograms may have one, two, or three prominent peaks; such distributions are called unimodal, bimodal, and multimodal, respectively. Any distribution with more than two prominent peaks is called multimodal.
Histograms, boxplots, and statistics describing the center and spread provide ways to examine the distribution of a variable’s values. A variable’s distribution describes how likely a value is to fall within various ranges.
If all values are equally likely to occur—say, for instance, in a dataset recording the values rolled on a fair six-sided die—the distribution is said to be uniform. A uniform distribution is easy to detect with a histogram because the bars are approximately the same height.
A distribution is clearly not uniform when some values are far more likely to occur than others. In many cases, values become less likely the further they are from the center of the distribution, which results in a bell-shaped distribution of data. This characteristic is so common in real-world data that it is the hallmark of the so-called normal distribution, with its stereotypical bell-shaped curve.
Although there are numerous types of non-normal distributions, many real-world phenomena generate data that can be described by the normal distribution. Therefore, the normal distribution's properties have been studied in great detail.
Explore the spread using 5 point summary, range, variance & standard deviation
The mean and median provide ways to quickly summarize values, but these measures of center tell us little about whether or not there is diversity in the measurements. To measure the diversity, we need to employ another type of summary statistics concerned with the spread of the data, or how tightly or loosely the values are spaced.
Knowing about the spread provides a sense of the data’s highs and lows, and whether most values are like or unlike the mean and median.
The five-number summary is a set of five statistics that defines the spread of a feature’s values. We can use the summary() function to obtain the five number summary.
Let's take a look at Peak.viewers:
summary(df$Peak.viewers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 496 9114 16676 37065 37570 639375
summary(df)
## Channel Watch.time.Minutes. Stream.time.minutes.
## _연두부_ (lovelyyeon): 1 Min. :1.222e+08 Min. : 3465
## 10000DAYS : 1 1st Qu.:1.632e+08 1st Qu.: 73759
## 1DrakoNz : 1 Median :2.350e+08 Median :108240
## 1PVCS : 1 Mean :4.184e+08 Mean :120515
## 39daph : 1 3rd Qu.:4.337e+08 3rd Qu.:141844
## 72hrs : 1 Max. :6.196e+09 Max. :521445
## (Other) :994
## Peak.viewers Average.viewers Followers Followers.gained
## Min. : 496 Min. : 235 Min. : 3660 Min. : -15772
## 1st Qu.: 9114 1st Qu.: 1458 1st Qu.: 170546 1st Qu.: 43758
## Median : 16676 Median : 2425 Median : 318063 Median : 98352
## Mean : 37065 Mean : 4781 Mean : 570054 Mean : 205519
## 3rd Qu.: 37570 3rd Qu.: 4786 3rd Qu.: 624332 3rd Qu.: 236131
## Max. :639375 Max. :147643 Max. :8938903 Max. :3966525
##
## Views.gained Partnered Mature Language
## Min. : 175788 False: 22 False:770 English :485
## 1st Qu.: 3880602 True :978 True :230 Korean : 77
## Median : 6456324 Russian : 74
## Mean : 11668166 Spanish : 68
## 3rd Qu.: 12196762 French : 66
## Max. :670137548 Portuguese: 61
## (Other) :169
Here are some observations from the summary statistics: the mean of each viewer and follower metric is far larger than its median (for Peak.viewers, 37,065 versus 16,676), which suggests strongly right-skewed distributions; Followers.gained has a negative minimum (-15,772), so at least one channel lost followers over the year; 978 of the 1000 channels are Partnered while 230 are flagged Mature; and English is by far the most common language (485 channels), followed by Korean (77) and Russian (74).
The middle 50 percent of the data, found between the first and third quartiles, is known as the interquartile range (IQR), a simple measure of spread that can be calculated with the IQR() function:
IQR(df$Peak.viewers)
## [1] 28456
The span between the minimum and maximum value is known as the range. In R, the range() function returns both the minimum and maximum value, and wrapping it in diff() gives the span itself:
diff(range(df$Peak.viewers))
## [1] 638879
The quartiles divide a dataset into four portions, each with the same number of values. The sequence function seq() generates vectors of evenly-spaced values. This makes it easy to obtain other slices of data, such as the quintiles (five groups) shown in the following command:
quantile(df$Peak.viewers, seq(from = 0, to = 1, by = 0.2))
## 0% 20% 40% 60% 80% 100%
## 496.0 7934.2 13377.0 22177.0 45895.6 639375.0
The mean was introduced as a method to describe the center of a variable, and variability in the data is also important. Here, we introduce two measures of variability: the variance and the standard deviation. Both of these are very useful in data analysis, even though their formulas are a bit tedious to calculate by hand. The standard deviation is the easier of the two to comprehend, as it roughly describes how far away the typical observation is from the mean.
We call the distance of an observation from its mean its deviation. If we square these deviations and then take an average, the result is equal to the sample variance, denoted by \(s^2\)
We divide by (n-1), rather than dividing by n, when computing a sample’s variance. There’s some mathematical nuance here, but the end result is that doing this makes this statistic slightly more reliable and useful. Notice that squaring the deviations does two things. First, it makes large values relatively much larger. Second, it gets rid of any negative signs.
The sample standard deviation can be calculated as the square root of the sum of the squared distance of each value from the mean divided by the number of observations minus one:
\(s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\)
The standard deviation is defined as the square root of the variance:
The variance is the average squared distance from the mean. The standard deviation is the square root of the variance. The standard deviation is useful when considering how far the data are distributed from the mean.
The standard deviation represents the typical deviation of observations from the mean. Often about 68% of the data will be within one standard deviation of the mean and about 95% will be within two standard deviations. However, these percentages are not strict rules.
Like the mean, the population values for variance and standard deviation have special symbols: \(\sigma^2\) for the variance and \(\sigma\) for the standard deviation.
In practice, the variance and standard deviation are sometimes used as a means to an end, where the “end” is being able to accurately estimate the uncertainty associated with a sample statistic. For example, in Chapter 13 the standard deviation is used in calculations that help us understand how much a sample mean varies from one sample to the next.
Distributions allow us to characterize a large number of values using a smaller number of parameters. The normal distribution, which describes many types of real-world data, can be defined with just two: center and spread. The center of the normal distribution is defined by its mean value, which we have used before. The spread is measured by a statistic called the standard deviation.
In order to calculate the standard deviation, we must first obtain the variance, which is defined as the average of the squared differences between each value and the mean value. In mathematical notation, the variance of a set of n values named x is defined by the following formula, where the Greek letter mu denotes the mean of the values and the variance itself is denoted by sigma squared:
\(\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2\)
The standard deviation is the square root of the variance, and is denoted by sigma as shown in the following formula:
\(\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2}\)
In R, the var() and sd() functions can be used to obtain the variance and standard deviation. For example, computing them for the Peak.viewers vector, we find:
var(df$Peak.viewers)
## [1] 3637815712
sd(df$Peak.viewers)
## [1] 60314.31
When interpreting the variance, larger numbers indicate that the data is spread more widely around the mean. The standard deviation indicates, on average, how much each value differs from the mean.
The standard deviation can be used to quickly estimate how extreme a given value is under the assumption that it came from a normal distribution. The 68–95–99.7 rule states that 68 percent of values in a normal distribution fall within one standard deviation of the mean, while 95 percent and 99.7 percent of values fall within two and three standard deviations, respectively.
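As a rough, hedged check of this rule on Peak.viewers (which we already suspect is right skewed rather than normal, so the percentages should not match closely):
# Proportion of observations within one and two standard deviations of the mean
m <- mean(df$Peak.viewers)
s <- sd(df$Peak.viewers)
mean(abs(df$Peak.viewers - m) <= 1 * s)
mean(abs(df$Peak.viewers - m) <= 2 * s)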
# load packages
#library(mlbench)
#library(e1071)
# load the dataset
#data(PimaIndiansDiabetes)
# calculate skewness for each variable
#skew <- apply(PimaIndiansDiabetes[,1:8], 2, skewness)
# display skewness, larger/smaller deviations from 0 show more skew
#print(skew)
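The commented recipe above targets the PimaIndiansDiabetes example from mlbench; a hedged sketch of the same idea for the numeric Twitch columns (assuming the e1071 package is installed) could look like this:
# Skewness of each numeric attribute; values far from 0 indicate stronger skew
library(e1071)
num_cols <- sapply(df, is.numeric)
skew <- apply(df[, num_cols], 2, skewness)
print(skew)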
The further the skew value is from zero, the larger the skew to the left (negative skew value) or right (positive skew value).
A box plot summarizes a dataset using five statistics while also identifying unusual observations.
The dark line inside the box represents the median, which splits the data in half. 50% of the data fall below this value and 50% fall above it.
Median: the number in the middle.
If the data are ordered from smallest to largest, the median is the observation right in the middle. If there are an even number of observations, there will be two values in the middle, and the median is taken as their average.
The second step in building a box plot is drawing a rectangle to represent the middle 50% of the data. The length of the box is called the interquartile range, or IQR for short. It, like the standard deviation, is a measure of variability in data. The more variable the data, the larger the standard deviation and IQR tend to be. The two boundaries of the box are called the first quartile (the 25th percentile, i.e., 25% of the data fall below this value) and the third quartile (the 75th percentile, i.e., 75% of the data fall below this value), and these are often labeled Q1 and Q3, respectively.
ggplot(data = df, mapping = aes(y = Peak.viewers)) +
geom_boxplot()
ggplot(df, aes(x = 1, y = Peak.viewers)) +
geom_boxplot() +
scale_x_continuous(breaks = NULL) +
theme(axis.title.x = element_blank())
Interquartile range (IQR).
The IQR interquartile range is the length of the box in a box plot. It is computed as \(IQR = Q_3 - Q_1\), where Q1 and Q3 are the 25th and 75th percentiles, respectively.
The α percentile is a number with α% of the observations below it and (100−α)% of the observations above it. For example, the 90th percentile of SAT scores is the SAT score with 90% of students below that value and 10% of students above that value.
Extending out from the box, the whiskers attempt to capture the data outside of the box. The whiskers of a box plot reach to the minimum and the maximum values in the data, unless there are points that are considered unusually high or unusually low, which are identified as potential outliers by the box plot. These are labeled with a dot on the box plot. The purpose of labeling the outlying points – instead of extending the whiskers to the minimum and maximum observed values – is to help identify any observations that appear to be unusually distant from the rest of the data. There are a variety of formulas for determining whether a particular data point is considered an outlier, and different statistical software use different formulas. A commonly used formula is that any observation beyond 1.5× IQR away from the first or the third quartile is considered an outlier. In a sense, the box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the data, up to the outliers.
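As a minimal sketch of identifying outliers this way (item 5.5 above), we can compute the 1.5 × IQR fences for Peak.viewers and count how many channels fall outside them:
# 1.5 * IQR rule: observations beyond these fences are flagged as potential outliers
q <- quantile(df$Peak.viewers, c(0.25, 0.75))
iqr <- IQR(df$Peak.viewers)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
sum(df$Peak.viewers < lower | df$Peak.viewers > upper)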
Visualizing numeric variables can be helpful for diagnosing data problems.
A common visualization of the five-number summary is a boxplot, also known as a box-and-whisker plot. The boxplot displays the center and spread of a numeric variable in a format that allows you to quickly obtain a sense of the range and skew of a variable or compare it to other variables.
Let's take a look at a boxplot for the used car price and mileage data. To obtain a boxplot for a variable, we will use the boxplot() function. We will also specify a pair of extra parameters, main and ylab, to add a title to the figure and label the y axis (the vertical axis), respectively. The command to create the price boxplot is: boxplot(usedcars$price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")
A boxplot depicts the five-number summary using horizontal lines and dots. The horizontal lines forming the box in the middle of each figure represent Q1, Q2 (the median), and Q3 when reading the plot from bottom to top. The median is denoted by the dark line, which lines up with $13,592 on the vertical axis for price and 36,385 mi. on the vertical axis for mileage.
The minimum and maximum values can be illustrated using the whiskers that extend below and above the box; however, a widely used convention only allows the whiskers to extend to a minimum or maximum of 1.5 times the IQR below Q1 or above Q3. Any values that fall beyond this threshold are considered outliers and are denoted as circles or dots. For example, recall that the IQR for the price variable was 3,909 with Q1 of 10,995 and Q3 of 14,904. An outlier is therefore any value that is less than 10995 - 1.5 * 3909 = 5131.5 or greater than 14904 + 1.5 * 3909 = 20767.5. The price boxplot shows two outliers on both the high and low ends. On the mileage boxplot, there are no outliers on the low end and thus the bottom whisker extends to the minimum value of 4,867. On the high end, we see several outliers beyond the 100,000 mile mark. These outliers are responsible for our earlier finding, which noted that the mean value was much greater than the median.
Outliers are extreme.
An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers serves many useful purposes, including
identifying strong skew in the distribution, identifying possible data collection or data entry errors, and providing insight into interesting properties of the data. Keep in mind, however, that some datasets have a naturally long skew and outlying points do not represent any sort of problem in the dataset.
Measuring the central tendency – the mode In statistics terminology, the mode of a feature is the value occurring most often. Like the mean and median, the mode is another measure of central tendency. It is typically used for categorical data, since the mean and median are not defined for nominal variables.
For example, in the used car data, the mode of the year variable is 2010, while the modes for the model and color variables are SE and Black, respectively. A variable may have more than one mode; a variable with a single mode is unimodal, while a variable with two modes is bimodal. Data with multiple modes is more generally called multimodal.
Although you might suspect that you could use the mode() function, R uses this to obtain the type of variable (as in numeric, list, and so on) rather than the statistical mode. Instead, to find the statistical mode, simply look at the table() output for the category with the greatest number of values.
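For example, a one-line sketch of the statistical mode of the Language variable:
# The level of Language with the most channels (the statistical mode)
names(which.max(table(df$Language)))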
The mode or modes are used in a qualitative sense to gain an understanding of important values. Even so, it would be dangerous to place too much emphasis on the mode since the most common value is not necessarily a majority. For instance, although black was the single most common car color, it was only about a quarter of all advertised cars.
It is best to think about the modes in relation to the other categories. Is there one category that dominates all others, or are there several? Thinking about modes this way may help to generate testable hypotheses by raising questions about what makes certain values more common than others. If black and silver are common used car colors, we might believe that the data represents luxury cars, which tend to be sold in more conservative colors. Alternatively, these colors could indicate economy cars, which are sold with fewer color options. We will keep these questions in mind as we continue to examine this data.
Thinking about the modes as common values allows us to apply the concept of the statistical mode to numeric data. Strictly speaking, it would be unlikely to have a mode for a continuous variable, since no two values are likely to repeat. However, if we think about modes as the highest bars on a histogram, we can discuss the modes of variables such as price and mileage. It can be helpful to consider the mode when exploring numeric data, particularly to examine whether or not the data is multimodal.
Exploring categorical variables
If you recall, the Twitch dataset contains several categorical variables: Channel, Partnered, Mature, and Language. read.csv() initially stored these as character (chr) vectors rather than factors, which is why we converted them to factors earlier.
In contrast to numeric data, categorical data is typically examined using tables rather than summary statistics. A table that presents a single categorical variable is known as a one-way table. The table() function can be used to generate a one-way table, for example for the Language variable:
table(df$Language)
##
## Arabic Chinese Czech English Finnish French German
## 5 30 6 485 1 66 49
## Greek Hungarian Italian Japanese Korean Other Polish
## 1 2 17 10 77 1 12
## Portuguese Russian Slovak Spanish Swedish Thai Turkish
## 61 74 1 68 1 11 22
The table() output lists the categories of the nominal variable and a count of the number of values falling into each category. Since we know there are 1000 streamers in the dataset, we can determine that nearly half of them stream in English, given that 485 / 1000 = 0.485.
R can also perform the calculation of table proportions directly, by using the prop.table() command on a table produced by the table() function:
> model_table <- table(usedcars$model)
> color_table <- table(usedcars$color)
> color_pct <- prop.table(color_table) * 100
> round(color_pct, digits = 1)
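Applied to our data, a minimal sketch of the same idea (assuming df is loaded as above):
# Percentage of top streamers per language
lang_pct <- prop.table(table(df$Language)) * 100
round(lang_pct, digits = 1)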
Class Distribution
In a classification problem you must know the proportion of instances that belong to each class label. This is important because it may highlight an imbalance in the data, that if severe may need to be addressed with rebalancing techniques. In the case of a multi-class classification problem it may expose a class with a small or zero instances that may be candidates for removing from the dataset.
# load the packages
#library(mlbench)
# load the dataset
#data(PimaIndiansDiabetes)
# distribution of class variable
#y <- PimaIndiansDiabetes$diabetes
#cbind(freq=table(y), percentage=prop.table(table(y))*100)
This recipe creates a useful table showing the number of instances that belong to each class as well as the percentage that this represents from the entire dataset.
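Applied to our data, a hedged sketch of the same recipe, treating the Mature flag as a hypothetical class label:
# Frequency and percentage of each level of the Mature flag
y <- df$Mature
cbind(freq = table(y), percentage = prop.table(table(y)) * 100)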
A bar plot is a common way to display a single categorical variable, showing either the count of observations in each level or, equivalently, the proportion of observations in each level.
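A minimal ggplot2 sketch of a bar plot for the Language variable (assuming the tidyverse is loaded as above):
# Number of top channels per language; coord_flip() keeps the labels readable
ggplot(df, aes(x = Language)) +
geom_bar() +
coord_flip()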
These types of questions can be addressed by looking at bivariate relationships, which consider the relationship between two variables. Relationships of more than two variables are called multivariate relationships. Let’s begin with the bivariate case.
A table that summarizes data for two categorical variables in this way is called a contingency table. Each value in the table represents the number of times a particular combination of variable outcomes occurred.
Row and column totals are also included: the row totals provide the total counts across each row, and the column totals provide the totals down each column. We can also create a table that shows only the overall percentages or proportions for each combination of categories, or a table for a single variable.
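For our data, a minimal sketch of a contingency table with row and column totals:
# Two-way contingency table of Partnered by Mature, with margins added
addmargins(table(df$Partnered, df$Mature))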
Examining relationships – two-way cross-tabulations
To examine a relationship between two nominal variables, a two-way cross-tabulation is used (also known as a crosstab or contingency table). A cross-tabulation is similar to a scatterplot in that it allows you to examine how the values of one variable vary by the values of another. The format is a table in which the rows are the levels of one variable, while the columns are the levels of another. Counts in each of the table's cells indicate the number of values falling into the particular row and column combination.
To answer our earlier question about whether there is a relationship between model and color, we will examine a crosstab. There are several functions to produce two-way tables in R, including table(), which we used for one-way tables. The CrossTable() function in the gmodels package by Gregory R. Warnes is perhaps the most user-friendly, as it presents the row, column, and margin percentages in a single table, saving us the trouble of computing them ourselves. To install the gmodels package, type: install.packages("gmodels")
After the package installs, type library(gmodels) to load the package. You will need to do this during each R session in which you plan to use the CrossTable() function. Before proceeding with our analysis, let’s simplify our project by reducing the number of levels in the color variable. This variable has nine levels, but we don’t really need this much detail. What we are actually interested in is whether or not the car’s color is conservative. Toward this end, we’ll divide the nine colors into two groups—the first group will include the conservative colors Black, Gray, Silver, and White; the second group will include Blue, Gold, Green, Red, and Yellow. We will create a binary indicator variable (often called a dummy variable), indicating whether or not the car’s color is conservative by our definition. Its value will be 1 if true and 0 otherwise:
#usedcars$conservative <-
# usedcars$color %in% c("Black", "Gray", "Silver", "White")
You may have noticed a new command here: the %in% operator returns TRUE or FALSE for each value in the vector on the left-hand side of the operator, indicating whether the value is found in the vector on the right-hand side. In simple terms, you can translate this line as “is the used car color in the set of black, gray, silver, and white?”
Examining the table() output for our newly created variable, we see that about two-thirds of the cars have conservative colors while one-third does not:
#table(usedcars$conservative)
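The same trick works on our data. As a hypothetical sketch (the major_language column and the choice of languages are illustrative, not part of the original dataset), we could flag channels streaming in one of the four most common languages:
# Hypothetical indicator: TRUE if the channel streams in one of the top four languages
df$major_language <- df$Language %in% c("English", "Korean", "Russian", "Spanish")
table(df$major_language)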
Now, let's look at a cross-tabulation to see how the proportion of conservatively-colored cars varies by model. Since we're assuming that the model of car dictates the choice of color, we'll treat the conservative color indicator as the dependent (y) variable. The CrossTable() command is therefore: CrossTable(x = usedcars$model, y = usedcars$conservative)
There is a wealth of data in the CrossTable() output. The legend at the top (labeled Cell Contents) indicates how to interpret each value. The table rows indicate the three models of used cars: SE, SEL, and SES (plus an additional row for the total across all models). The columns indicate whether or not the car’s color is conservative (plus a column totaling across both types of color).
The first value in each cell indicates the number of cars with that combination of model and color. The proportions indicate each cell’s contribution to the chi-square statistic, the row total, the column total, and the table’s overall total.
What we are most interested in is the proportion of conservative cars for each model. The row proportions tell us that 0.654 (65 percent) of SE cars are colored conservatively, in comparison to 0.696 (70 percent) of SEL cars, and 0.653 (65 percent) of SES. These differences are relatively small, which suggests that there are no substantial differences in the types of colors chosen for each model of car.
The chi-square values refer to the cell’s contribution in the Pearson’s chi-squared test for independence between two variables. This test measures how likely it is that the difference in cell counts in the table is due to chance alone. If the probability is very low, it provides strong evidence that the two variables are associated.
You can obtain the chi-squared test results by adding an additional parameter specifying chisq = TRUE when calling the CrossTable() function. In this case, the probability is about 93 percent, suggesting that it is very likely that the variations in cell count are due to chance alone, and not due to a true association between model and color.
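As a hedged sketch applied to our data (assuming the gmodels package is installed), we could ask whether mature-rated channels are more or less likely to be Twitch partners:
# Cross-tabulate Mature (rows) against Partnered (columns), with a chi-squared test
library(gmodels)
CrossTable(x = df$Mature, y = df$Partnered, chisq = TRUE)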
Perhaps a more interesting finding is the fact that there are very few cars that have both high price and high mileage, aside from a lone outlier at about 125,000 miles and $14,000. The absence of more points like this provides evidence to support a conclusion that our dataset is unlikely to include any high-mileage luxury cars. All of the most expensive cars in the data, particularly those above $17,500, seem to have extraordinarily low mileage, which implies that we could be looking at a single type of car that retails for a price around $20,000 when new.
The relationship we’ve observed between car prices and mileage is known as a negative association because it forms a pattern of dots in a line sloping downward. A positive association would appear to form a line sloping upward. A flat line,or a seemingly random scattering of dots, is evidence that the two variables are not associated at all. The strength of a linear association between two variables is measured by a statistic known as correlation. Correlations are discussed in detail in Chapter 6, Forecasting Numeric Data – Regression Methods, which covers methods for modeling linear relationships.
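As a quick, hedged sketch of item 7.2 (Correlation) on our numeric columns (using dplyr from the tidyverse load above):
# Correlation matrix of the numeric attributes, rounded for readability
num_vars <- df %>% select(where(is.numeric))
round(cor(num_vars), 2)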
Visualizing relationships – scatterplots
A scatterplot is a diagram that visualizes a bivariate relationship between numeric features. It is a two-dimensional figure in which dots are drawn on a coordinate plane using the values of one feature to provide the horizontal x coordinates, and the values of another feature to provide the vertical y coordinates. Patterns in the placement of dots reveal underlying associations between the two features.
To answer our question about the relationship between price and mileage, we will examine a scatterplot. We’ll use the plot() function, along with the main, xlab, and ylab parameters used previously to label the diagram.
To use plot(), we need to specify x and y vectors containing the values used to position the dots on the figure. Although the conclusions would be the same regardless of the variable used to supply the x and y coordinates, convention dictates that the y variable is the one that is presumed to depend on the other (and is therefore known as the dependent variable). Since a seller cannot modify a car’s odometer reading, mileage is unlikely to be dependent on the car’s price. Instead, our hypothesis is that a car’s price depends on the odometer mileage. Therefore, we will select price as the dependent y variable.
The full command to create our scatterplot is: > plot(x = usedcars$mileage, y = usedcars$price,
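For our Twitch data, a minimal ggplot2 sketch of the same idea, treating Average.viewers as the dependent (y) variable and Followers as the explanatory (x) variable:
# Scatterplot of average viewers against followers; log scales tame the heavy right skew
ggplot(df, aes(x = Followers, y = Average.viewers)) +
geom_point(alpha = 0.4) +
scale_x_log10() +
scale_y_log10()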
Scatterplots are one type of graph used to study the relationship between two numerical variables. Figure 1.2 displays the relationship between the variables homeownership and multi_unit, which is the percent of housing units that are in multi-unit structures (e.g., apartments, condos). Each point on the plot represents a single county. For instance, the highlighted dot corresponds to County 413 in the county dataset: Chattahoochee County, Georgia, which has 39.4% of housing units that are in multi-unit structures and a homeownership rate of 31.3%. The scatterplot suggests a relationship between the two variables: counties with a higher rate of housing units that are in multi-unit structures tend to have lower homeownership rates. We might brainstorm as to why this relationship exists and investigate each idea to determine which are the most reasonable explanations.
The multi-unit and homeownership rates are said to be associated because the plot shows a discernible pattern. When two variables show some connection with one another, they are called associated variables.
A pair of variables are either related in some way (associated) or not (independent). No pair of variables is both associated and independent.
Because there is a downward trend in Figure 1.2 – counties with more housing units that are in multi-unit structures are associated with lower homeownership – these variables are said to be negatively associated. A positive association is shown in the relationship between the median_hh_income and pop_change variables in Figure 1.3, where counties with higher median household income tend to have higher rates of population growth.
Explanatory and response variables.
When we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable. We also use the terms explanatory and response to describe variables where the response might be predicted using the explanatory even if there is no causal relationship.
explanatory variable → might affect → response variable
For many pairs of variables, there is no hypothesized relationship, and these labels would not be applied to either variable in such cases.
In general, association does not imply causation. An advantage of a randomized experiment is that it is easier to establish causal relationships with such a study. The main reason for this is that observational studies do not control for confounding variables, and hence establishing causal relationships with observational studies requires advanced statistical methods.