Data Pre-processing, Descriptive Statistics, and Hypothesis Testing
Lab Overview
- Data pre-processing
- Exploratory data analysis (EDA)
- Hypothesis testing
In this session, we shift from foundational R skills to real-world data analysis. Imagine yourselves as data analysts working for a company, ready to tackle a practical business case. Today’s case involves delving into customer data to gain insights and solve business challenges.
We have a thriving e-commerce company that exclusively operates online. Our aim is to customize our marketing strategy by understanding the factors that drive varying levels of customer spending on our website. By analyzing customer behavior and characteristics, we hope to enhance our marketing approach and improve customer engagement.
To this end, we will analyze a data set that includes customer demographics and purchase history. The data set covers a sample of 1,000 registered customers and their browsing activity on the company’s website during the previous month (30 days).
The data set includes the following attributes:
- Age: customer age, categorized into young, middle-aged, and old
- Gender: customer gender, categorized into male or female
- OwnHome: customer home ownership, categorized into own home or rented home
- Married: customer marital status, categorized into single or married
- Location: customer location, categorized into whether the customer is close or far from the nearest brick-and-mortar store selling similar products
- Children: how many children the customer has
- History: past purchasing history categorized into low, medium, or high, or NA if the customer has not purchased anything in the past
- Visits: the total number of visits to the website in the previous month
- AmountSpent: the total amount of money the customer has spent on purchases during the previous month (in US dollars)
- TimeSpent: the average time the customer spent on the website per visit
1. Getting Started
Download Lab 3’s materials from Moodle:
- Save the provided data set in your data folder in the BRM-Labs project folder.
- Save the provided R script in your code folder in the BRM-Labs project folder.
Open the provided Lab 3 R script.
Package installation
# Install packages
install.packages("infer")
install.packages("janitor")
- Set up your R environment.
# Clean work environment
rm(list = ls()) # USE with CAUTION: this will delete everything in your environment
# Load packages
library(tidyverse)
library(stargazer)
library(ggthemes)
library(GGally)
library(skimr)
library(corrr)
library(infer)
library(janitor)
- Load the data.
# Load data
load("data/ecommerce.RData")
2. Data Pre-Processing
Whenever working with a new data set, you should first try to get to know the key features of the data and prepare it for analysis.
2.1 Considerations
Possible questions you should consider include:
How many observations does the data set contain? How many are complete?
Which variables (if any) suffer from missing data? Will the missing data be a problem for the analysis? How and why?
Is missing data indicated correctly and consistently by NA, or is there “hidden” missing data, e.g. “N/A” and “Not Applicable”?
Are variables being measured on the right scale and do they have the right data type?
Should any variables be transformed? Are the factor variables ordered?
Do the minimum and maximum values of each variable make sense? Do there seem to be any mistaken data entries? Are there any outliers?
Are there duplicate entries in the data set?
Should the qualitative variables be encoded as binary (0/1) indicator variables?
Are there any typos and/or inconsistent capitalization that should be fixed?
Does the data need to be aggregated for the analysis?
Do you need to create any new variables based on the ones available?
To answer the questions above you can use the functions and tools covered in Lab 2. For instance, a combination of skim, summary statistics, and plotting can be used. At the end of this process you should be able to write a paragraph describing the main features of the data set.
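A few of the checks listed above can be sketched in base R. The snippet below is a minimal illustration on a small made-up data frame (`toy` and its values are hypothetical, not the lab data); the same calls should work on the lab data set once it is loaded.

```r
# Hypothetical toy data, used only to illustrate the checks
toy <- data.frame(
  id      = c(1, 2, 2, 3, 4),
  history = c("Low", "N/A", "N/A", NA, "High"),
  spent   = c(10.5, 0, 0, 25, 310),
  stringsAsFactors = FALSE
)

# How many observations are complete (no NA in any column)?
n_complete <- sum(complete.cases(toy))

# Which variables have missing values, and how many?
n_missing <- colSums(is.na(toy))

# Are there duplicate entries (rows identical to an earlier row)?
n_duplicates <- sum(duplicated(toy))

# "Hidden" missing data: text codes that should really be NA
hidden_codes <- c("N/A", "Not Applicable", "")
n_hidden <- sum(toy$history %in% hidden_codes)

# Recode hidden codes to NA so that missing data is counted consistently
toy$history[toy$history %in% hidden_codes] <- NA
```

Note that the hidden codes are invisible to `is.na()` until they are recoded, which is why the count of missing values changes after the last step.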
2.2 Skim
We will use skim to get a quick overview of the data.
# Skim data set
skim(tb.ecommerce)
Name | tb.ecommerce |
Number of rows | 1000 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
factor | 6 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Age | 0 | 1.0 | FALSE | 3 | Mid: 508, You: 287, Old: 205 |
Gender | 0 | 1.0 | FALSE | 2 | Fem: 506, Mal: 494 |
OwnHome | 0 | 1.0 | FALSE | 2 | Own: 516, Ren: 484 |
Married | 0 | 1.0 | FALSE | 2 | Mar: 502, Sin: 498 |
Location | 0 | 1.0 | FALSE | 2 | Clo: 710, Far: 290 |
History | 303 | 0.7 | FALSE | 3 | Hig: 255, Low: 230, Med: 212 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Children | 0 | 1 | 0.93 | 1.05 | 0.00 | 0.0 | 1.00 | 2.00 | 3.00 | ▇▅▁▂▂ |
Visits | 0 | 1 | 2.66 | 1.74 | 0.00 | 1.0 | 2.00 | 4.00 | 8.00 | ▆▇▃▃▁ |
AmountSpent | 0 | 1 | 54.64 | 91.17 | 0.00 | 0.0 | 0.00 | 81.03 | 621.70 | ▇▁▁▁▁ |
TimeSpent | 0 | 1 | 12.37 | 12.70 | 1.75 | 3.2 | 6.11 | 19.15 | 88.25 | ▇▂▁▁▁ |
The data set contains 1000 rows (customers) and 10 columns. Six of the variables are categorical and four are numeric. We can see that variable names are descriptive and contain both upper and lower case letters. The only variable affected by missing data is “History”, with 303 missing observations. This is likely because these customers are first-time registered customers for whom a previous purchase history is not available. There does not seem to be any “hidden” missing data. In total, we have 697 complete observations, so dropping the incomplete rows would mean a data loss of about 30%. We can keep all the data if we recode the missing data in the variable “History” as “New Customer”. All variables seem to be using the correct scales and units, but it will be useful to ensure proper ordering of some of the categorical variables (e.g., Age). The minimum and maximum values of the variables do not suggest any data entry errors. There seem to be some outliers present in the data; this will have to be investigated further.
2.3 Pre-processing
Now that we are familiar with the data set, it is time to prepare it for analysis. This includes correcting any errors, making sure variables are in the desired format, dealing with missing data and/or outliers, aggregating the data (if necessary), creating new variables, etc.
We will start by converting all variable names to lower case. This is not a necessary step, simply a preference.
# Set all variable names to lower case (this is a preference)
tb.ecommerce <- rename_with(tb.ecommerce, tolower)
Next, we will order the ordinal categorical variables \(age\) and \(history\).
# Check factor levels for age
tb.ecommerce %>% distinct(age)
# Re-order factor variable age
tb.ecommerce <- tb.ecommerce %>%
mutate(age = fct_relevel(age, "Young", "Middle", "Old"))
# Check factor levels for history
tb.ecommerce %>% distinct(history)
# Re-order factor variable history
tb.ecommerce <- tb.ecommerce %>%
mutate(history = fct_relevel(history, "Low", "Medium", "High"))
Finally, we will deal with the missing data in the “history” variable. We will create a new variable “new_hist” that will take on the value “New Customer” whenever the original “history” variable is missing.
# Deal with NAs in history
tb.ecommerce <- tb.ecommerce %>%
mutate(new_hist = if_else(is.na(history), "New Customer", as.character(history)))
# Convert new_hist to factor variable and order levels
tb.ecommerce <- tb.ecommerce %>%
mutate(new_hist = factor( x = new_hist
, levels = c("New Customer", "Low", "Medium", "High")))
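One way to sanity-check a recode like this is to cross-tabulate the old and new variables. The sketch below does so in base R on a made-up factor: the vector `history` is hypothetical toy data, and `factor()` with explicit levels is used as a base R stand-in for the tidyverse calls above.

```r
# Hypothetical toy version of the history variable
history <- factor(c("Low", NA, "High", "Medium", NA, "Low"),
                  levels = c("Low", "Medium", "High"))

# Same logic as above: label missing histories as "New Customer"
new_hist <- ifelse(is.na(history), "New Customer", as.character(history))
new_hist <- factor(new_hist,
                   levels = c("New Customer", "Low", "Medium", "High"))

# Cross-tabulate old vs. new values; useNA = "ifany" keeps the NA row visible
check <- table(history, new_hist, useNA = "ifany")
print(check)

# After recoding there should be no missing values left
n_na_after <- sum(is.na(new_hist))
```

Every NA in the original variable should land in the “New Customer” column, and all other values should sit on the diagonal of the table.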
Our data set is now ready for analysis.
3. Descriptive Statistics
Now that the data set is ready to be analyzed, you should produce the key descriptive statistics that will allow your audience to get a good grasp of the data. You should consider producing and presenting the following:
A table of summary statistics for the full sample.
Summary statistics for specific groups (if relevant).
A correlation table or plot.
Histograms and/or box plots of the numerical variables of interest.
Bar plots or frequency tables of the qualitative variables of interest.
Scatterplots and side by side boxplots for depicting relationships among the different variables.
3.1 Summary Statistics
We start by using the stargazer() function to produce a summary statistics table.
# Summary statistics table
stargazer(data.frame(tb.ecommerce), type = "text"
, iqr = FALSE, median = TRUE, no.space = TRUE
, title = "Summary Statistics")
Summary Statistics
======================================================
Statistic N Mean St. Dev. Min Median Max
------------------------------------------------------
children 1,000 0.934 1.051 0 1 3
visits 1,000 2.659 1.738 0 2 8
amountspent 1,000 54.637 91.169 0.000 0.000 621.700
timespent 1,000 12.368 12.698 1.753 6.107 88.248
------------------------------------------------------
We can see that the customers in our sample visited the website an average of 2.7 times during the previous month, spent an average of approximately 12 minutes per visit on the website, spent an average of $55 on purchases during the previous month, and have, on average, one child.
Next, we look at subgroup summary statistics.
# Mean amountspent by age group
tb.ecommerce %>%
group_by(age) %>%
summarise( mean_amount = mean(amountspent)
, sd_amount = sd(amountspent)
, min_amount = min(amountspent)
, max_amount = max(amountspent))
Note that the average amount spent by “Young” customers is much lower than the average expenditure of “Middle” and “Old” aged customers.
# Mean amountspent by marriage status
tb.ecommerce %>%
group_by(married) %>%
summarise( mean_amount = mean(amountspent)
, sd_amount = sd(amountspent)
, min_amount = min(amountspent)
, max_amount = max(amountspent))
Note that, on average, married customers spend more than twice as much as single customers.
3.2 Correlation Table
We will now produce a correlation table to get an overview of the sign and strength of the relationships between the numerical variables in our data.
# Create correlation matrix
cor_matrix <- tb.ecommerce %>%
select(amountspent, timespent, visits, children) %>%
correlate(diagonal = 1)
# Format table for a professional presentation
cor_matrix %>%
rearrange() %>% # rearrange by correlations
shave() %>% # Shave off the upper triangle for a clean result
fashion(decimals = 3) # Clean presentation
Note the moderate positive correlation between the time spent on the website and the amount spent and also the positive moderate correlation between the number of visits made and the amount spent. We can also note the negative but quite weak correlation between the number of children the customer has and the amount spent.
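For reference, each entry in such a table is a Pearson correlation coefficient. The sketch below computes one by hand and compares it with base R's `cor()`; the variables `x` and `y` are hypothetical, simulated only for illustration.

```r
# Simulated data for illustration only
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson correlation: covariance divided by the product of standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```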
3.3 Variable Distributions
We can start by looking at the distributions of the numerical variables.
# Distribution of amountspent
ggplot(data = tb.ecommerce, aes(x = amountspent)) +
geom_histogram()
# Distribution of timespent
ggplot(data = tb.ecommerce, aes(x = timespent)) +
geom_histogram()
# Distribution of children
ggplot(data = tb.ecommerce, aes(x = children)) +
geom_histogram(binwidth = 0.1)
# Distribution of visits
ggplot(data = tb.ecommerce, aes(x = visits)) +
geom_histogram(binwidth = 0.1)
Note that the distribution of the amount spent is strongly positively skewed, with a peak at 0. The distribution of the time spent is also strongly positively skewed.
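Positive skew can also be checked numerically: the mean sits above the median (as in the summary table, where the mean amount spent is 54.64 against a median of 0), and the sample skewness is positive. The base R sketch below uses simulated right-skewed data; the vector `spend` and the helper `skewness_sample()` are hypothetical and defined here only for illustration.

```r
# Hypothetical helper: moment-based sample skewness
skewness_sample <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed values, for illustration only
set.seed(42)
spend <- rexp(1000, rate = 1 / 50)

# A mean above the median suggests positive skew
mean_above_median <- mean(spend) > median(spend)

# A positive skewness value points the same way
skew <- skewness_sample(spend)
```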
We can now look at the distribution of the qualitative variables.
# Distribution of age
ggplot(data = tb.ecommerce, aes(x = age)) + geom_bar()
# Distribution of gender
ggplot(data = tb.ecommerce, aes(x = gender)) + geom_bar()
# Distribution of ownhome
ggplot(data = tb.ecommerce, aes(x = ownhome)) + geom_bar()
# Distribution of married
ggplot(data = tb.ecommerce, aes(x = married)) + geom_bar()
# Distribution of location
ggplot(data = tb.ecommerce, aes(x = location)) + geom_bar()
# Distribution of new_hist
ggplot(data = tb.ecommerce, aes(x = new_hist)) + geom_bar()
Most customers are middle aged, and the sample is fairly balanced in terms of gender, home ownership, and marital status. Most customers live close to a physical store selling similar products.
3.4 Variable Relationships
Let’s now look at the relationship between customers’ time spent on the website and amount spent:
# Scatterplot of amount spent and timespent
ggplot(data = tb.ecommerce, aes(x = timespent, y = amountspent)) +
geom_point() + stat_smooth(method = lm)
Now, suppose you suspect that the relationship between time spent and amount spent is not constant over all locations. You can use facet_wrap to produce a scatterplot for each group:
# Scatterplot of amount spent and timespent by location
ggplot(data = tb.ecommerce, aes(x = timespent, y = amountspent)) +
geom_point() + stat_smooth(method = lm) +
facet_wrap( ~ location)
Now let’s explore the amount spent by different customer segments.
# Distribution of amount spent by gender
ggplot(data = tb.ecommerce, aes(x = gender, y = amountspent)) +
geom_boxplot()
# Distribution of amount spent by location
ggplot(data = tb.ecommerce, aes(x = location, y = amountspent)) +
geom_boxplot()
# Distribution of amount spent by age
ggplot(data = tb.ecommerce, aes(x = age, y = amountspent)) +
geom_boxplot()
# Distribution of amount spent by history
ggplot(data = tb.ecommerce, aes(x = new_hist, y = amountspent)) +
geom_boxplot()
If you decide to include one or more graphs in your final presentation, make sure each graph is adequately titled and labeled. All axes should be clearly labeled, indicating the variables and their units. You can also add a note to the graph briefly describing what is shown and the key message(s) you wish your audience to retain.
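As a minimal sketch of this advice, `labs()` in ggplot2 can set the title, axis labels, and a caption in one call. The data frame `toy` below is simulated and stands in for the lab data, and the label wording is only a suggestion.

```r
library(ggplot2)

# Hypothetical toy data standing in for the lab data set
set.seed(123)
toy <- data.frame(timespent = runif(50, 1, 60))
toy$amountspent <- 2 * toy$timespent + rnorm(50, sd = 20)

# Scatterplot with a title, labeled axes (including units), and a note
p <- ggplot(data = toy, aes(x = timespent, y = amountspent)) +
  geom_point() +
  labs(
    title   = "Amount spent versus time spent per visit",
    x       = "Average time spent per visit (minutes)",
    y       = "Amount spent in previous month (US dollars)",
    caption = "Note: simulated data for illustration only."
  )
p
```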
4. Hypothesis Testing
4.1 Comparing a Mean with a Value
The descriptive statistics above suggest that, on average, customers visited the website more than zero times in the past month. We will now use a one-sample hypothesis test to evaluate whether this observation is statistically significant.
To do this, we will use the base R function t.test(), which relies on the t-distribution to perform the test. Given our large sample size, this approach is appropriate due to the Central Limit Theorem, which ensures the sampling distribution of the mean is approximately normal.
The t.test() function allows us to test whether the mean of a numeric variable differs from a specific value. In our case, we will test whether the mean number of visits is greater than zero.
We will be testing the following hypotheses:
- \(H_0\): \(\mu = 0\) (the true mean number of visits is zero)
- \(H_a\): \(\mu > 0\) (the true mean number of visits is greater than zero)
This is a one-tailed (right-tailed) test. We can specify this using the alternative = "greater" argument in t.test().
# One-sample t-test: is the mean number of visits > 0?
t.test(tb.ecommerce$visits, mu = 0, alternative = "greater")
One Sample t-test
data: tb.ecommerce$visits
t = 48.4, df = 999, p-value <2e-16
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
2.5685 Inf
sample estimates:
mean of x
2.659
The resulting p-value is extremely small (p < 2e-16), allowing us to reject the null hypothesis. This means there is strong statistical evidence that the average number of visits is greater than zero, confirming what the descriptive statistics suggested.
🔍 Note: The t.test() function assumes the data are independent and approximately normally distributed. For large sample sizes, the test is robust to moderate departures from normality.
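As a cross-check, the test statistic and the lower confidence bound can be reproduced from the summary statistics reported earlier for visits (mean 2.659, standard deviation 1.738, n = 1000). This is a sketch of the arithmetic behind the one-sample t-test; because the inputs are rounded, the results match the t.test() output only up to rounding.

```r
# Summary statistics for visits, as reported in the summary table
x_bar <- 2.659
s     <- 1.738
n     <- 1000
mu_0  <- 0

# Standard error of the mean and t statistic
se     <- s / sqrt(n)
t_stat <- (x_bar - mu_0) / se

# Right-tailed p-value from the t-distribution with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval
ci_lower <- x_bar - qt(0.95, df = n - 1) * se

round(t_stat, 1)    # 48.4, matching the t.test() output
round(ci_lower, 2)  # 2.57
```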
4.2 Comparing Two Means
The previous graphical evidence suggested that, on average, male customers spent more than female customers, and that customers living far from a physical store spent more than those living close to one. We will now use hypothesis tests to evaluate whether these observed differences are statistically significant.
To do this, we use the t.test() function from base R. This function performs a two-sample t-test, which compares the means of two independent groups. By default, it assumes unequal variances between the groups (Welch’s t-test), which makes it more robust in practical applications.
We will begin by testing whether the mean amount spent differs by gender. This is a two-tailed test with the following hypotheses:
\(H_0\): \(\mu_{M} = \mu_{F}\)
\(H_a\): \(\mu_{M} \ne \mu_{F}\)
# Two-sample t-test: compare mean amount spent by gender
t.test(amountspent ~ gender, data = tb.ecommerce)
Welch Two Sample t-test
data: amountspent by gender
t = -2.96, df = 969, p-value = 0.0032
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-28.3056 -5.7202
sample estimates:
mean in group Female mean in group Male
46.232 63.245
The resulting p-value is very small, allowing us to reject the null hypothesis. This indicates that the difference in average spending between male and female customers is statistically significant.
We can also customize the test by specifying the confidence level or by performing a one-tailed test if we have a directional hypothesis.
# Increase confidence level to 99%
t.test(amountspent ~ gender, data = tb.ecommerce, conf.level = 0.99)
Welch Two Sample t-test
data: amountspent by gender
t = -2.96, df = 969, p-value = 0.0032
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
99 percent confidence interval:
-31.8648 -2.1611
sample estimates:
mean in group Female mean in group Male
46.232 63.245
# One-tailed test: is mean spending for females < males?
t.test(amountspent ~ gender, data = tb.ecommerce, alternative = "less")
Welch Two Sample t-test
data: amountspent by gender
t = -2.96, df = 969, p-value = 0.0016
alternative hypothesis: true difference in means between group Female and group Male is less than 0
95 percent confidence interval:
-Inf -7.5386
sample estimates:
mean in group Female mean in group Male
46.232 63.245
📌 Note: The default setting assumes unequal variances (var.equal = FALSE). If you are confident the variances are equal, you can specify var.equal = TRUE to perform the classical Student’s t-test. Additionally, alternative = "less" tests whether the first group listed (e.g., Female) has a lower mean than the second (e.g., Male), while alternative = "greater" tests whether it has a higher mean, so group ordering matters when making directional claims.
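To make the role of group ordering concrete, the sketch below computes the Welch t statistic by hand and shows that reversing the factor levels flips its sign. The data frame `toy` and the column `gender_rev` are hypothetical, simulated for illustration; they are not the lab data.

```r
# Hypothetical toy data: spending for two groups
set.seed(2024)
toy <- data.frame(
  gender = factor(rep(c("Female", "Male"), times = c(40, 60))),
  amount = c(rnorm(40, mean = 45, sd = 20), rnorm(60, mean = 60, sd = 35))
)

# Group means, variances, and sizes (order follows the factor levels)
m <- tapply(toy$amount, toy$gender, mean)
v <- tapply(toy$amount, toy$gender, var)
n <- tapply(toy$amount, toy$gender, length)

# Welch t statistic: first level (Female) minus second level (Male)
t_manual <- unname((m["Female"] - m["Male"]) / sqrt(sum(v / n)))

# Welch-Satterthwaite degrees of freedom
df_manual <- unname(sum(v / n)^2 / sum((v / n)^2 / (n - 1)))

# Built-in Welch test for comparison
res <- t.test(amount ~ gender, data = toy)

# Making Male the first level flips the sign of the statistic
toy$gender_rev <- relevel(toy$gender, ref = "Male")
res_rev <- t.test(amount ~ gender_rev, data = toy)
```

With the levels reversed, a directional hypothesis that was tested with alternative = "less" would need alternative = "greater" instead.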
4.3 Comparing Two Proportions
In order to compare two group proportions we use the function prop_test() from the infer package. Suppose we want to compare the proportion of females who are married and the proportion of males who are married. Before conducting the statistical test, we can produce the following two-way table:
# Comparing two proportions
tb.ecommerce %>%
# Create a two-way table (with group counts)
tabyl(gender, married) %>%
# Add column and row totals
adorn_totals(c("row", "col")) %>%
# Convert values to percentages
adorn_percentages("row") %>%
# Format values with percent sign and two decimal places
adorn_pct_formatting(2) %>%
# Add counts along with the percentages
adorn_ns("front")
We can see that 44.5% (0.4447) of females in the sample are married, while 56.1% (0.5607) of the males in the sample are married. Is the difference of -0.116 statistically significant?
\(H_0:\) The two proportions are equal.
\(H_a:\) The two proportions are not equal.
# Test for two proportions
# notice the z argument, a logical value for whether to report the statistic as
# a standard normal deviate or a Pearson's chi-square statistic.
tb.ecommerce %>% prop_test(married ~ gender,
order = c("Female", "Male"),
z = TRUE)
The p-value of the test is well below 0.01 and thus we reject the null hypothesis of equality of proportions and conclude that the observed difference is statistically significant.
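The arithmetic behind this test can be sketched in base R. The counts used below (225 of 506 females and 277 of 494 males married) are not printed in the text; they are reconstructed from the reported group sizes and percentages, so treat them as an assumption. The sketch uses the pooled two-proportion z statistic without continuity correction, which is assumed here to correspond to what prop_test() reports when z = TRUE.

```r
# Counts reconstructed from the reported percentages (assumption)
married <- c(Female = 225, Male = 277)
n       <- c(Female = 506, Male = 494)

p_hat  <- married / n             # sample proportions
p_pool <- sum(married) / sum(n)   # pooled proportion under H0

# Pooled standard error and z statistic (Female minus Male)
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z  <- unname((p_hat["Female"] - p_hat["Male"]) / se)

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

# Cross-check: z^2 equals the chi-square statistic from base R's prop.test()
chk <- prop.test(x = married, n = n, correct = FALSE)
all.equal(z^2, unname(chk$statistic))
```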
4.4 Graphical Comparisons
In order to provide a visual comparison of group means, we can use confidence interval plots. These graphs display the mean of each group and the associated confidence interval. Note that we need to be cautious when interpreting these plots: if there is no overlap between the two intervals, then the difference between the two groups is statistically significant; however, if the two intervals overlap, the difference between the two groups may still be statistically significant. A t-test should be performed to confirm whether the difference between the group means is statistically significant.
# Create data set for plot
tb.plot1 <- tb.ecommerce %>%
group_by(gender) %>%
summarize( n = n()
, mean_amount = mean(amountspent)
, sd_amount = sd(amountspent)
, se_amount = sd_amount/sqrt(n))
# Create plot with confidence interval for each group mean
ggplot(data = tb.plot1
, aes(x = gender, y = mean_amount, color = gender)) +
geom_point() +
geom_errorbar(aes(ymin = mean_amount-1.96*se_amount
, ymax = mean_amount+1.96*se_amount)
, width = 0.2) +
theme_classic() + labs(x = "") +
theme(legend.position = "none") # hide the redundant legend
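The caveat about overlapping intervals can be illustrated with a small numerical example. The group means and standard errors below are made up for illustration, and a normal approximation is used for simplicity.

```r
# Hypothetical group means and standard errors
mean_a <- 0; se_a <- 0.3
mean_b <- 1; se_b <- 0.3

# 95% confidence intervals for each group mean
ci_a <- mean_a + c(-1, 1) * 1.96 * se_a
ci_b <- mean_b + c(-1, 1) * 1.96 * se_b

# The intervals overlap if the upper end of A exceeds the lower end of B
overlap <- ci_a[2] > ci_b[1]

# The test of the difference uses the standard error of the difference,
# which is smaller than the sum of the two individual standard errors
se_diff <- sqrt(se_a^2 + se_b^2)
z       <- (mean_b - mean_a) / se_diff
p_value <- 2 * pnorm(-abs(z))

overlap         # TRUE: the intervals overlap
p_value < 0.05  # TRUE: yet the difference is significant at the 5% level
```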
5. Recommended Assignment
- Complete DataCamp’s fourth chapter of the Introduction to the Tidyverse course: Types of Visualizations