Tylenol preferences among congestion vs. muscle ache sufferers.
Means (Two or More Groups):
t-test: Compare means of two groups (e.g., male vs. female teens’ sports drink consumption).
ANOVA: Compare means across multiple groups (e.g., age-based preferences for cars).
Paired Samples:
Paired t-test: Compare two variables within the same group.
Example: Concern for global warming vs. gasoline emissions.
Agenda
What Are Associative Analyses?
Types of Relationships
Correlation Coefficient
Cross Tabulation and Chi-Square Test
Special Consideration in Associative Analyses
Introduction to Associative Analyses
Marketing researchers often want to go beyond descriptive measures, statistical inference, and differences tests.
Explore relationships among variables in large datasets (hundreds or thousands of survey responses).
What kinds of people buy Frito-Lay snacks such as Cheetos, Fritos, or Lay’s potato chips?
Demographics: Who are the customers?
Circumstances: Under what conditions are these products chosen?
Associative analyses: determine where stable relationships exist between two variables
Relationships between Two Variables
Relationship: a consistent, systematic linkage between the levels or labels for two variables
“Levels” refers to the characteristics of description for interval or ratio scales.
e.g., Older consumers purchase more vitamins.
“Labels” refers to the characteristics of description for nominal or ordinal scales.
e.g., PayPal customers (Yes/No) tend to also be Amazon Prime customers.
Relationships between Two Variables (cont.)
Statistical Linkage (our focus): Indicates a consistent pattern or association between variables, but not causation.
Most daily exercisers purchase sports drinks → suggests correlation but does not prove that exercising causes sports drink purchases.
Causal Linkage: Requires certainty that one variable causes the other.
Increased advertising spend leads to higher sales → a proven cause-and-effect relationship supported by controlled testing or robust analysis.
Why Statistical Relationships Matter?
They offer insights that lead to deeper understanding.
Customer Preferences: Data show that young adults are more likely to purchase plant-based milk. → Marketing efforts can focus on health-conscious messaging for this demographic.
Product Usage Patterns: Consumers who frequently travel tend to buy larger quantities of travel-size toiletries. → Retailers can optimize product placement near travel-related items.
E-Commerce Behaviors: Frequent online shoppers often use digital wallets like PayPal. → Encouraging wallet-based payment options might increase conversion rates.
Cross-Selling Opportunities: Customers who buy high-end smartphones often purchase accessories like cases or earbuds. → Bundle deals can encourage higher spending.
Agenda
What Are Associative Analyses?
Types of Relationships
Correlation Coefficient
Cross Tabulation and Chi-Square Test
Special Consideration in Associative Analyses
Monotonic Linear Relationship
A “straight-line association” between two scale variables.
Formula: \[y=a+bx\]
\(y\): Dependent variable (predicted/estimated).
\(a\): Intercept (value of y when \(x=0\)).
\(b\): Slope (rate of change in \(y\) per unit change in \(x\)).
\(x\): Independent variable.
Predicts \(y\) given any value of \(x\) using known \(a\) and \(b\).
Linear relationships are commonly used in the analysis of interval or ratio scale variables.
We will cover them in more detail when we study regression analysis.
Monotonic Linear Relationship Example
Burger King estimates that every customer will spend about $12 per lunch visit.
It is easy to use a linear relationship to estimate how many dollars of revenue will be associated with the number of customers for any given location.
\[y = 0 + 12x\] where \(x\) = number of customers.
100 customers → 12 × 100 = $1,200 in revenue.
200 customers → 12 × 200 = $2,400 in revenue.
Linear relationship provides an average expectation of future revenue.
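This calculation is easy to script. Below is a minimal R sketch encoding the Burger King relationship \(y = 0 + 12x\) (the function name estimate_revenue is illustrative, not from the slides):

# Minimal sketch of the Burger King linear relationship: y = a + bx
# with intercept a = 0 and slope b = 12 dollars per customer
estimate_revenue <- function(customers, a = 0, b = 12) {
  a + b * customers
}

estimate_revenue(100)  # 1200
estimate_revenue(200)  # 2400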
Nonmonotonic Relationship
A relationship where the presence or absence of one variable’s label is systematically associated with the presence or absence of another variable’s label.
No consistent direction (e.g., increasing or decreasing).
The relationship is general and described verbally.
Describes patterns of association, not precise relationships or quantities.
Nonmonotonic relationships are often used for the analysis of nominal scale variables.
Nonmonotonic Relationship Example
Drinks Ordered at McDonald’s
Morning customers tend to buy coffee and breakfast foods.
Noon customers tend to buy soft drinks and lunch items.
Not exclusive: Labels are associated on average, but exceptions exist.
Characterizing Relationships between Variables
Presence:
Whether a systematic (statistical) relationship exists between two variables.
Pattern:
The general nature of the relationship, including its direction.
Strength of Association:
How consistent (dependable) the relationship is, e.g., strong, moderate, or weak.
Step-By-Step Procedure for Analyzing the Relationship between Two Variables
Agenda
What Are Associative Analyses?
Types of Relationships
Correlation Coefficient
Cross Tabulation and Chi-Square Test
Special Consideration in Associative Analyses
Correlation Coefficients
Definition:
A measure (r) that quantifies the strength and direction of a linear relationship between two scale variables.
Range: −1.0 to +1.0.
The correlation coefficient communicates both the strength and the direction of the linear relationship between two metric variables.
Strength: Determined by the absolute value of \(r\).
Direction: Indicated by the sign of \(r\) (+ or −).
Visualizing Covariation Using Scatter Diagrams
Vertical axis (y): Dependent variable (e.g., sales).
Horizontal axis (x): Independent variable (e.g., number of salespeople).
Points: Represent matched pairs of x and y values.
Systematic covariation forms an ellipse-like pattern.
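A scatter diagram like this takes a single call in base R. Here is a minimal sketch with simulated data (variable names and values are illustrative, echoing the salespeople example):

set.seed(1)
salespeople <- 1:50
sales <- 100 + 20 * salespeople + rnorm(50, sd = 100)

# Each point is a matched (x, y) pair; systematic covariation
# traces out the ellipse-like pattern described above
plot(salespeople, sales,
     xlab = "Number of salespeople", ylab = "Sales")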
Visualizing Covariation Using Scatter Diagrams (cont.)
Patterns can vary depending on the relationship between variables.
The Pearson Product Moment Correlation Coefficient
Measures the linear relationship between two interval or ratio-scaled variables (scale variables).
Based on the closeness of scatter points to a straight line.
Perfect Correlation: All points fall on a straight line; Correlation coefficient (r) = \(\pm\) 1.00.
No Correlation: Scatter points form a ball-shaped pattern with no discernible ellipse; r ≈ 0.0.
Typical Values: Most correlations fall somewhere between the extremes of −1.0 and +1.0; if statistically significant, they are interpreted as strong, moderate, or weak.
Provides insight into the strength and direction of the linear association.
If you are interested in the math behind it, here is the reference.
The calculation of r is VERY tedious, so I don’t even introduce the formula here.
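Even without the formula, you can build intuition with base R’s cor(). A minimal sketch on simulated data (all names and values are illustrative):

set.seed(42)
x <- 1:100

# Perfect positive correlation: y is an exact linear function of x
cor(x, 3 + 2 * x)                        # exactly 1

# Strong but imperfect: the same line plus random noise
cor(x, 3 + 2 * x + rnorm(100, sd = 20))  # high, but below 1

# No correlation: y is unrelated to x
cor(x, rnorm(100))                       # near 0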
Calculating Correlation Coefficient in R
Let’s calculate the correlation coefficient using R.
As always, we will use auto_concept.csv data.
There are six different lifestyles: Novelist, Innovator, Trendsetter, Forerunner, Mainstreamer, and Classic.
Each lifestyle type is measured with a 7-point interval scale
1 = Does not describe me at all to 7 = Describes me perfectly
Correlation analysis can find out which lifestyle profile is associated with a particular automobile model preference.
High Positive Correlations: Indicate consumers prefer a specific model and score high on the associated lifestyle type.
Low or Negative Correlations: Suggest consumers do not align with that model or lifestyle type.
Calculating Correlation Coefficient in R (cont.)
Let’s determine which of the six lifestyle types is associated with preference for the 4-seat economy gasoline automobile model.
The first step is to select the variables to analyze.
Preference for the 4 Seat Economy Gasoline model.
All six lifestyle types: Novelist, Innovator, Trendsetter, Forerunner, Mainstreamer, and Classic.
All of these variables are interval scale variables. → Use Pearson’s correlation to measure the strength and direction of the linear relationship.
Default to a two-tailed test for statistical significance.
Calculating Correlation Coefficient in R (cont.)
We will rely on the psych package for an efficient and clean workflow.
Computes both the correlation coefficient and the p-value in a single step.
Saves time and reduces redundancy compared to cor() and cor.test().
Load Data: Import the dataset into R.
Select Variables: Focus on the 4 Seat Economy Gasoline model preference and the six lifestyle types.
Rename variables to descriptive names (e.g., Novelist, Innovator) for clarity.
# Install and load the psych package
# install.packages("psych") # if not already installed
library(tidyverse)
library(psych)

setwd("~/Documents/GitHub/Marketing-Research-2025-Spring")

# Step 1: Load the dataset
auto_concept <- read_csv("auto_concept.csv")

# Step 2: Select relevant columns
# Include 'economygas4seat' and the six lifestyle variables
selected_data <- auto_concept %>%
  select(economygas4seat, lifestyle1, lifestyle2, lifestyle3,
         lifestyle4, lifestyle5, lifestyle6) %>%
  rename(
    "Economy Gasoline" = economygas4seat,
    "Novelist"     = lifestyle1,
    "Innovator"    = lifestyle2,
    "Trendsetter"  = lifestyle3,
    "Forerunner"   = lifestyle4,
    "Mainstreamer" = lifestyle5,
    "Classic"      = lifestyle6
  )
Calculating Correlation Coefficient in R (cont.)
Use the corr.test() function in the psych package for Pearson correlation.
Correlation matrix: Measures strength and direction.
P-Value: Indicates statistical significance.
# Step 3: Compute correlations and p-values
# Use corr.test() to compute the correlation matrix and significance
results <- corr.test(selected_data, method = "pearson")

# Step 4: Extract correlation coefficients and p-values
results
results$stars
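If you want individual pieces rather than the full printout, the result object can be indexed directly (a sketch assuming the standard components of a psych::corr.test object):

results$r   # matrix of Pearson correlation coefficients
results$p   # matrix of p-values
results$n   # sample size used for each correlation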
The small negative correlation for Innovator indicates that individuals identifying as Innovators are slightly less likely to prefer the economy gasoline model.
Agenda
What Are Associative Analyses?
Types of Relationships
Correlation Coefficient
Cross Tabulation and Chi-Square Test
Special Consideration in Associative Analyses
Cross-Tabulation Analysis
Cross-tabulation is a statistical method used to examine nonmonotonic relationships between two nominally scaled variables.
A cross-tabulation table is sometimes referred to as an “r×c” (r-by-c) table because it consists of rows and columns.
Rows (r): Represent categories of one variable.
Columns (c): Represent categories of another variable.
Cells: The intersection of rows and columns, containing the frequency of occurrences.
Cross-Tabulation Analysis Example
In a survey (200 respondents), researchers examine the relationship between Occupation (nominal variable) and Michelob Ultra Beer Purchasing Behavior (nominal variable).
Two Nominal Variables:
Occupation: White Collar (160) & Blue Collar (40).
Purchase Behavior: Buyer (166) & Nonbuyer (34).
Cross-tabulation analysis uses both nominal variables simultaneously and tallies up the cell frequencies
Occupation Status | Buyer | Nonbuyer | Row Total
White Collar      | 152   | 8        | 160
Blue Collar       | 14    | 26       | 40
Column Total      | 166   | 34       | 200
Raw Percentages Table
In a cross-tabulation table, raw frequencies can be converted into various types of percentages to provide additional insights.
The raw percentages table is derived by dividing each raw frequency by the grand total (200 in this case) and multiplying by 100 (e.g., \(\frac{152}{200}\times100 = 76\%\) for White Collar buyers). \[\text{Raw Percentage}=\frac{\text{Cell Frequency}}{\text{Grand Total}}\times100\]
Occupation Status (% of Total) | Buyer | Nonbuyer | Total
White Collar                   | 76%   | 4%       | 80%
Blue Collar                    | 7%    | 13%      | 20%
Total                          | 83%   | 17%      | 100%
Row Percentages Table
The row percentages table calculates percentages relative to the row totals (e.g., 160 for White Collar, 40 for Blue Collar). \[\text{Row Cell Percentage}=\frac{\text{Cell Frequency}}{\text{Total of Cell Frequencies in that Row}}\times100\]
e.g., The first column is calculated by \(\frac{152}{160}\times100\) and \(\frac{14}{40}\times100\)
Occupation Status | Buyer (%) | Nonbuyer (%) | Row Total (%)
White Collar      | 95%       | 5%           | 100%
Blue Collar       | 35%       | 65%          | 100%
Column Percentages Table
The column percentages table calculates percentages relative to the column totals (e.g., 166 for Buyer, 34 for Nonbuyer). \[\text{Column Cell Percentage}=\frac{\text{Cell Frequency}}{\text{Total of Cell Frequencies in that Column}}\times100\]
e.g., The first row is calculated by \(\frac{152}{166}\times100\) and \(\frac{8}{34}\times100\)
Occupation Status | Buyer (%) | Nonbuyer (%)
White Collar      | 91.57%    | 23.53%
Blue Collar       | 8.43%     | 76.47%
Column Total      | 100%      | 100%
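All three percentage tables can be reproduced with base R’s prop.table(). A minimal sketch, with the observed frequencies entered from the table above:

# Observed frequencies from the Michelob Ultra example
observed <- matrix(c(152, 8,
                     14, 26),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("White Collar", "Blue Collar"),
                                   c("Buyer", "Nonbuyer")))

round(prop.table(observed) * 100, 2)              # raw percentages (of grand total)
round(prop.table(observed, margin = 1) * 100, 2)  # row percentages
round(prop.table(observed, margin = 2) * 100, 2)  # column percentages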
Chi-Square (χ²) Analysis Overview
Chi-square (χ²) analysis is a statistical technique used to examine the frequencies of two nominally scaled variables in a cross-tabulation table.
It assesses nonmonotonic association in a cross-tabulation table based upon differences between observed and expected frequencies
It determines whether there is a statistically significant relationship between the two variables.
Null Hypothesis: Assumes no association between the variables (independence).
e.g., the distribution of buyers and nonbuyers is independent of occupation.
Alternative Hypothesis: Assumes that the variables are associated (dependent).
Observed Frequencies in Chi-Square Analysis
Chi-square analysis compares observed frequencies (actual counts from the data) to expected frequencies (theoretical counts assuming no association). The degree of difference is expressed in the Chi-square test statistic.
Observed Frequencies (\(O_{ij}\)):
These are the actual cell counts in the cross-tabulation table.
Example: 152 White Collar buyers and 8 White Collar nonbuyers.
Occupation Status | Buyer | Nonbuyer | Row Total
White Collar      | 152   | 8        | 160
Blue Collar       | 14    | 26       | 40
Column Total      | 166   | 34       | 200
Expected Frequencies in Chi-Square Analysis
Expected Frequencies (\(E_{ij}\)): These are theoretical frequencies calculated under the null hypothesis of no association: \[E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}\] e.g., \(E_{11} = \frac{160\times166}{200} = 132.8\) expected White Collar buyers.
The Chi-square (χ²) value is a single number summarizing how much the observed frequencies deviate from the expected frequencies in a cross-tabulation table.
It provides a measure of how closely the data align with the null hypothesis of no association. \[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\] Where:
\(O_{ij}\): Observed frequency in row \(i\) and column \(j\)
\(E_{ij}\): Expected frequency in row \(i\) and column \(j\)
\(r\), \(c\): Number of rows and columns in the table
Degrees of Freedom in Chi-Square Analysis
Degrees of freedom represent the number of values in a calculation that are free to vary while still meeting the constraints of the data.
In the context of Chi-square analysis, degrees of freedom are used to determine the critical value from the Chi-square distribution table.
\[
df = (\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)
\]
e.g., for the 2×2 Michelob Ultra table, \(df = (2-1)\times(2-1) = 1\).
Statistical significance is determined by comparing the computed Chi-square value with the critical value from the Chi-square distribution table or by using a p-value.
Modern statistical software automates the process.
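For example (a minimal base-R sketch), the critical value can be looked up without a printed table:

# Critical chi-square value at the 0.05 significance level with df = 1
qchisq(p = 0.95, df = 1)  # about 3.84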
How the Chi-Square Value is Computed
Compare Observed and Expected Frequencies: Subtract the expected frequency \(E\) from the observed frequency \(O\): \((O - E)\).
Adjust for Negative Values: Square the difference to eliminate negative values and prevent cancellation effects: \((O - E)^2\).
Normalize by Expected Frequency: Divide the squared difference by the expected frequency \(E\) to adjust for differences in cell sizes: \(\frac{(O - E)^2}{E}\).
Summation Across All Cells: Add these normalized values across all cells in the cross-tabulation table: \[
\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
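To make these steps concrete, here is a minimal R sketch that computes the chi-square statistic by hand for the 2×2 Michelob Ultra table (the counts come from the observed frequencies shown earlier):

# Observed frequencies from the cross-tabulation table
observed <- matrix(c(152, 8,
                     14, 26),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("White Collar", "Blue Collar"),
                                   c("Buyer", "Nonbuyer")))

# Expected frequencies under independence:
# E_ij = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq  # about 81.6

# Degrees of freedom: (rows - 1) * (columns - 1) = 1
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value from the chi-square distribution
pchisq(chi_sq, df = df, lower.tail = FALSE)

# chisq.test(observed, correct = FALSE) reproduces these numbers;
# correct = FALSE disables the Yates continuity correction applied
# by default to 2x2 tables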
Interpreting the Chi-Square Value
Large Deviations:
If observed frequencies deviate significantly from expected frequencies → the computed Chi-square value increases.
Small Deviations:
If observed frequencies are close to expected frequencies → the computed Chi-square value remains small.
The Chi-square value provides a summary measure of the departure of observed frequencies from expected frequencies.
Cross Tabulation in R
Okay, let’s practice what we have learned in R.
Recall that we have several nominal variables in our auto_concept.csv data.
Marital Status (marital): Unmarried (0) / Married (1) (Nominal Variable)
Favorite Newspaper Section (newspaper): Editorial, Business, Local News, National News, Sports, Entertainment, Do Not Read (Nominal Variable)
We will use the janitor package for the associative analysis.
library(tidyverse)
# install.packages("janitor") # Install if not already installed
library(janitor)

setwd("~/Documents/GitHub/Marketing-Research-2025-Spring")
auto_concept <- read_csv("auto_concept.csv")

# Rename marital and newspaper variables
auto_concept <- auto_concept %>%
  mutate(
    marital = factor(marital,
                     levels = c(0, 1),
                     labels = c("Unmarried", "Married")),
    newspaper = factor(newspaper,
                       levels = c(1, 2, 3, 4, 5, 6, 7),
                       labels = c("Editorial", "Business", "Local News",
                                  "National News", "Sports",
                                  "Entertainment", "Do Not Read"))
  )
Cross Tabulation in R (cont.)
The tabyl() function in the janitor package provides a simple way to calculate the cross-tabulation table.
# Create a cross-tabulation table
cross_tab <- auto_concept %>%
  tabyl(marital, newspaper)
cross_tab
Observed Frequencies: Marital Status vs. Newspaper Section
marital   | Editorial | Business | Local News | National News | Sports | Entertainment | Do Not Read
Unmarried | 0         | 5        | 16         | 4             | 53     | 7             | 25
Married   | 94        | 199      | 301        | 37            | 183    | 52            | 24
Cross Tabulation in R (cont.)
Let’s do the chi-square test.
You just pass the cross_tab object to the chisq.test() function.
The p-value is essentially zero; it indicates a statistically significant association between marital status and newspaper section preferences.
However, it does not specify the nature of the relationship.
To understand this, we must analyze the row or column percentages.
# Perform Chi-square test
chi_test <- chisq.test(cross_tab)

# Display Chi-square test results
chi_test
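To explore the nature of the relationship the test flags, janitor’s adorn_* helpers can convert the same tabyl into row percentages. A minimal sketch (the formatting choices are illustrative):

# Row percentages: within each marital status, the share
# choosing each newspaper section
cross_tab %>%
  adorn_percentages("row") %>%         # convert counts to row proportions
  adorn_pct_formatting(digits = 1) %>% # format proportions as percentages
  adorn_ns()                           # append raw counts in parentheses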