Introduction
This project explores the factors influencing vehicle purchase decisions in Ireland using data from the Central Statistics Office (CSO). The goal is to understand whether demographic variables such as age and sex affect the importance assigned to each influencing factor.
Research Questions
Does age group significantly influence how important various factors are when purchasing a vehicle? (analyzed using ANOVA)
Can we predict the importance placed on vehicle purchase factors using a respondent’s age group and sex? (analyzed using multiple linear regression)
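About the Dataset. The dataset was sourced from the Central Statistics Office (CSO), table NTA42 "Factors that influence a vehicle purchase", for 2019. It contains responses categorized by age group, sex, and influencing factor, with the corresponding importance value (VALUE, reported as a percentage).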
STATISTIC Statistic.Label TLIST.A1. Year
1 NTA42 Factors that influence a vehicle purchase 2019 2019
2 NTA42 Factors that influence a vehicle purchase 2019 2019
3 NTA42 Factors that influence a vehicle purchase 2019 2019
4 NTA42 Factors that influence a vehicle purchase 2019 2019
5 NTA42 Factors that influence a vehicle purchase 2019 2019
6 NTA42 Factors that influence a vehicle purchase 2019 2019
C02076V02508 Age.Group C02199V02655 Sex C03655V04397 Influencing.Factor
1 350 18 - 24 years 1 Male 10 Purchase price
2 350 18 - 24 years 1 Male 40 Reliability
3 350 18 - 24 years 1 Male 70 Engine efficiency
4 350 18 - 24 years 1 Male 110 Size
5 350 18 - 24 years 1 Male 60 Tax
6 350 18 - 24 years 1 Male 80 Insurance
UNIT VALUE
1 % 73.3
2 % 20.4
3 % 19.9
4 % 8.3
5 % 21.6
6 % 66.6
Data Preprocessing. I checked the data for missing values, found some, and filled them using median imputation. I then created a new data frame containing only the variables required for the analysis.
Age.Group Sex Influencing.Factor VALUE
1 18 - 24 years Male Purchase price 73.3
2 18 - 24 years Male Reliability 20.4
3 18 - 24 years Male Engine efficiency 19.9
4 18 - 24 years Male Size 8.3
5 18 - 24 years Male Tax 21.6
6 18 - 24 years Male Insurance 66.6
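The preprocessing steps described above could be sketched as follows. This is a minimal illustration on a made-up four-row data frame, not the original script: the object names `nta_raw` and `nta_clean_demo` and all values are assumptions chosen for the example.

```r
# Toy stand-in for the CSO extract; the rows and values are invented for illustration.
nta_raw <- data.frame(
  Age.Group = c("18 - 24 years", "18 - 24 years", "25 - 34 years", "25 - 34 years"),
  Sex = c("Male", "Female", "Male", "Female"),
  Influencing.Factor = c("Purchase price", "Reliability", "Tax", "Size"),
  VALUE = c(73.3, NA, 21.6, 8.3),
  UNIT = "%"
)

# Count missing values per column before imputing.
missing_counts <- colSums(is.na(nta_raw))

# Median imputation: fill missing VALUE entries with the median of the observed ones.
value_median <- median(nta_raw$VALUE, na.rm = TRUE)
nta_raw$VALUE[is.na(nta_raw$VALUE)] <- value_median

# Keep only the variables used in the analysis.
nta_clean_demo <- nta_raw[, c("Age.Group", "Sex", "Influencing.Factor", "VALUE")]
```

Median imputation keeps every row but pulls imputed values toward the centre of the distribution, which can shrink the variance of VALUE.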
library(ggplot2)

# Box plot of VALUE per age group; points beyond the whiskers are candidate outliers.
ggplot(nta_clean, aes(x = `Age.Group`, y = VALUE, fill = `Age.Group`)) +
  geom_boxplot() +
  labs(title = "Outlier Detection by Age Group", y = "Value", x = "Age Group") +
  theme_minimal()

Exploratory Data Analysis
Average Influence Score by Age Group
Call:
lm(formula = VALUE ~ Age.Group + Sex, data = nta_model)
Residuals:
Min 1Q Median 3Q Max
-21.712 -10.594 -3.894 2.775 59.081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.5173 2.4547 7.136 5.41e-12 ***
Age.Group25 - 34 years 2.1019 3.2472 0.647 0.518
Age.Group35 - 44 years 2.6462 3.2472 0.815 0.416
Age.Group45 - 54 years 2.8173 3.2472 0.868 0.386
Age.Group55 - 64 years 1.2769 3.2472 0.393 0.694
Age.Group65 - 74 years 0.5404 3.2472 0.166 0.868
Age.Group75 years and over -2.4462 3.2472 -0.753 0.452
SexMale 1.3769 1.7357 0.793 0.428
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.56 on 356 degrees of freedom
Multiple R-squared: 0.01247, Adjusted R-squared: -0.006944
F-statistic: 0.6424 on 7 and 356 DF, p-value: 0.7207
Interpretation
Interpretation of Coefficients. - Intercept (17.52): This is the estimated mean VALUE for the baseline group, that is, female respondents in the 18 - 24 years age group. - Age Group Coefficients: Each coefficient represents the difference in VALUE compared to the baseline age group. None of the age groups showed statistically significant differences (all p-values > 0.05). - Sex (Male): The positive coefficient (1.38) implies that, on average, males assign slightly higher VALUE scores than females, but the difference is not statistically significant (p = 0.428).
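How R chooses the baseline group can be illustrated with a small sketch. It assumes the default treatment coding, under which the first level of each factor is absorbed into the intercept and is therefore the level missing from the coefficient table. The data frame `demo` and its values are synthetic.

```r
set.seed(1)

# Synthetic balanced data: 3 age groups x 2 sexes x 5 replicates.
demo <- expand.grid(
  Age.Group = c("18 - 24 years", "25 - 34 years", "35 - 44 years"),
  Sex = c("Female", "Male"),
  replicate = 1:5,
  stringsAsFactors = FALSE
)
demo$VALUE <- round(runif(nrow(demo), 0, 80), 1)
demo$Age.Group <- factor(demo$Age.Group)
demo$Sex <- factor(demo$Sex)

# The first level of each factor is the reference category.
baseline_age <- levels(demo$Age.Group)[1]
baseline_sex <- levels(demo$Sex)[1]

fit <- lm(VALUE ~ Age.Group + Sex, data = demo)
coef_names <- names(coef(fit))   # the reference levels do not appear here

# relevel() switches the reference; the Sex coefficient then flips sign.
demo$Sex2 <- relevel(demo$Sex, ref = "Male")
fit2 <- lm(VALUE ~ Age.Group + Sex2, data = demo)
```

Changing the reference level only re-expresses the same model, so fitted values and the overall F-test stay the same.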
Model Diagnostics. - R-squared = 0.0125: Only about 1.2% of the variance in VALUE is explained by age group and sex combined. This indicates a very poor fit. - Adjusted R-squared = -0.0069: A negative value means that, once the number of predictors is accounted for, the model offers no improvement over predicting the overall mean. - F-statistic p-value = 0.7207: The model as a whole is not statistically significant.
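As a rough check, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared and degrees of freedom. The sketch below takes n = 364, which is inferred from 356 residual degrees of freedom plus 8 estimated coefficients. Small differences from the printed output come from rounding R-squared to 0.01247.

```r
r2 <- 0.01247   # Multiple R-squared from the summary output
n  <- 364       # inferred: 356 residual df + 8 coefficients
k  <- 7         # predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)
```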
Conclusion. The regression provides no evidence that age group or sex affects how individuals rate the importance of influencing factors in vehicle purchases.
Shapiro-Wilk normality test
data: sample(nta_clean$VALUE, 5000, replace = TRUE)
W = 0.82355, p-value < 2.2e-16
Interpretation
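The Shapiro-Wilk test was conducted to evaluate whether the response
variable VALUE follows a normal distribution. Strictly, linear
regression assumes normally distributed residuals rather than a
normally distributed response, so this is only an indirect check.
The test was also run on 5,000 values resampled with replacement from
VALUE, which inflates the sample size and makes rejection more likely.
Since the p-value is far below 0.05,
we reject the null hypothesis of the Shapiro-Wilk test,
which states that the data are normally distributed. This indicates that
the VALUE variable does not follow a normal
distribution.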
Conclusion:
The assumption of normality is violated. While linear regression can
still be used (as it’s relatively robust to this assumption in large
samples), alternative approaches like non-parametric tests or data
transformation may be considered based on the context.
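One possible way to test the assumption more directly is to apply the Shapiro-Wilk test to the model residuals at their original sample size. The sketch below uses synthetic right-skewed data. The data frame `demo` and its columns are assumptions for illustration, not the project's data.

```r
set.seed(42)

# Synthetic right-skewed response in three groups.
demo <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 40)),
  y = rexp(120, rate = 1 / 20)
)

fit <- lm(y ~ group, data = demo)

# Test the residuals, which is what the regression assumption refers to.
res_test <- shapiro.test(residuals(fit))

# For comparison: resampling the response up to 5000 values adds no new
# information, it only makes the test more sensitive.
inflated <- shapiro.test(sample(demo$y, 5000, replace = TRUE))
```

`shapiro.test()` accepts between 3 and 5000 observations, so a sample of the original size needs no resampling.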
Interpretation
To evaluate the assumptions of linear regression, we examined the following diagnostic plots:
Residuals vs Fitted: This plot shows no clear pattern, but a slight funnel shape suggests some heteroscedasticity (non-constant variance). Ideally, residuals should be evenly scattered around zero.
Q-Q Plot: The quantile-quantile plot shows deviation from the diagonal line, especially at the tails. This indicates that the residuals are not normally distributed, which violates the normality assumption.
Scale-Location Plot: The red trend line is slightly sloped, and the spread increases, again hinting at non-constant variance. This reinforces the evidence of heteroscedasticity.
Residuals vs Leverage: No influential points with high leverage or large Cook’s distance are detected, indicating there are no major outliers unduly influencing the model.
Together, these plots suggest that: - The linearity assumption is fairly reasonable. - There are mild violations of normality and homoscedasticity. - No influential outliers are evident. Multicollinearity cannot be judged from these plots and is assessed separately with the VIF.
These insights guide whether additional transformation or robust methods should be considered in future models.
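The four diagnostic plots discussed above are the defaults of `plot()` applied to an `lm` object. A minimal sketch on synthetic data follows. The `4 / n` cut-off for Cook's distance is a common rule of thumb used here as an assumption, not a value taken from this analysis.

```r
set.seed(7)

# Synthetic unbalanced data so that leverage differs between groups.
demo <- data.frame(
  group = factor(rep(c("a", "b", "c", "d"), times = c(20, 25, 25, 30))),
  y = rexp(100, rate = 1 / 20)
)
fit <- lm(y ~ group, data = demo)

pdf(NULL)              # null device so the sketch also runs without a display
par(mfrow = c(2, 2))
plot(fit)              # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
dev.off()

# Numeric counterparts of the Residuals vs Leverage panel.
cooks <- cooks.distance(fit)
lev   <- hatvalues(fit)
flagged <- which(cooks > 4 / nrow(demo))   # rule-of-thumb threshold (assumption)
```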
Q-Q Plot
Interpretation
The Q-Q (Quantile-Quantile) plot helps assess whether the residuals from the regression model are normally distributed, which is a key assumption in linear regression.
Observations - The dots deviate substantially from the red reference line, particularly at the ends (tails) of the distribution. - This S-shaped pattern suggests that the residuals exhibit heavy tails or skewness, indicating a violation of the normality assumption.
Conclusion The Q-Q plot shows that the residuals are not normally distributed, which may affect the reliability of p-values and confidence intervals in the regression output. This supports considering non-parametric tests or transformations in further analysis.
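The non-parametric and transformation routes mentioned above could look roughly like this. The sketch uses the Kruskal-Wallis test as the rank-based counterpart of one-way ANOVA and a `log1p` transform as one possible transformation. Both are illustrative choices, and `demo` is synthetic.

```r
set.seed(3)

# Synthetic skewed VALUE scores in three age groups.
demo <- data.frame(
  Age.Group = factor(rep(c("18 - 24 years", "25 - 34 years", "35 - 44 years"), each = 30)),
  VALUE = rexp(90, rate = 1 / 20)
)

# Rank-based alternative to one-way ANOVA; makes no normality assumption.
kw <- kruskal.test(VALUE ~ Age.Group, data = demo)

# Alternative: model a transformed response to reduce right skew.
fit_log <- lm(log1p(VALUE) ~ Age.Group, data = demo)
```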
Breusch-Pagan Test for Homoscedasticity
studentized Breusch-Pagan test
data: reg_model
BP = 2.3749, df = 7, p-value = 0.9362
Interpretation
The Breusch-Pagan (BP) test is used to check the assumption of homoscedasticity in a regression model—that is, whether the variance of the residuals is constant across all levels of the independent variables.
Test Output - BP Statistic: 2.3749 - Degrees of Freedom (df): 7 - p-value: 0.9362
Interpretation - Since the p-value (0.9362) is much greater than 0.05, we fail to reject the null hypothesis of homoscedasticity. - There is no significant evidence of heteroscedasticity, so the slight funnel shape seen in the residual plots is not confirmed by the formal test.
Conclusion The assumption of homoscedasticity is not rejected, which supports the validity of the regression model’s standard errors and inferential statistics.
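To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R. It regresses the squared residuals on the same predictors and compares n times the auxiliary R-squared with a chi-squared distribution. The data are synthetic with constant error variance by construction. The last line recomputes the p-value from the statistic and degrees of freedom reported above.

```r
set.seed(11)

# Synthetic data with constant error variance.
demo <- data.frame(
  x = runif(200, 0, 10),
  g = factor(rep(c("a", "b"), 100))
)
demo$y <- 5 + 2 * demo$x + rnorm(200, sd = 3)

fit <- lm(y ~ x + g, data = demo)

# Auxiliary regression of squared residuals on the same predictors.
demo$res2 <- residuals(fit)^2
aux <- lm(res2 ~ x + g, data = demo)

bp_stat <- nrow(demo) * summary(aux)$r.squared   # n * R^2
bp_df   <- length(coef(aux)) - 1                 # number of slope coefficients
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# p-value implied by the reported BP = 2.3749 on 7 degrees of freedom.
reported_p <- pchisq(2.3749, df = 7, lower.tail = FALSE)
```

A large p-value means the squared residuals are not explained by the predictors, which is what constant variance implies.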
VIF for Multicollinearity
          GVIF Df GVIF^(1/(2*Df))
Age.Group    1  6               1
Sex          1  1               1
Interpretation
The Variance Inflation Factor (VIF) test assesses whether independent variables in a regression model are highly correlated with each other (i.e., multicollinearity). High multicollinearity can distort the estimation of regression coefficients.
Conclusion Both Age Group and Sex have generalized VIF (GVIF) values equal to 1, indicating that multicollinearity is not a concern in this regression model. The predictors are sufficiently independent for reliable coefficient estimation.
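A GVIF of exactly 1 is what a balanced design produces, where every age group appears equally often for each sex. The sketch below computes the generalized VIF by hand from the correlation matrix of the coefficient estimates. The helper `gvif_table` is a hypothetical name written for this illustration, and the data are synthetic.

```r
set.seed(5)

# Synthetic balanced design: every Age.Group x Sex cell has 4 rows.
demo <- expand.grid(
  Age.Group = c("18 - 24 years", "25 - 34 years", "35 - 44 years"),
  Sex = c("Female", "Male"),
  replicate = 1:4,
  stringsAsFactors = TRUE
)
demo$VALUE <- runif(nrow(demo), 0, 80)
fit <- lm(VALUE ~ Age.Group + Sex, data = demo)

# Hypothetical helper: generalized VIF per model term.
gvif_table <- function(model) {
  assign_idx  <- attr(model.matrix(model), "assign")[-1]   # term index per coefficient
  term_labels <- attr(terms(model), "term.labels")
  R     <- cov2cor(vcov(model)[-1, -1, drop = FALSE])      # intercept dropped
  det_R <- det(R)
  gvif <- sapply(seq_along(term_labels), function(i) {
    idx <- which(assign_idx == i)
    det(R[idx, idx, drop = FALSE]) * det(R[-idx, -idx, drop = FALSE]) / det_R
  })
  df <- sapply(seq_along(term_labels), function(i) sum(assign_idx == i))
  data.frame(GVIF = gvif, Df = df, GVIF_scaled = gvif^(1 / (2 * df)),
             row.names = term_labels)
}

vif_out <- gvif_table(fit)
```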
ANOVA Test
             Df Sum Sq Mean Sq F value Pr(>F)
Age.Group     6   1060   176.7   0.645  0.694
Residuals   357  97773   273.9
To determine whether the mean VALUE differs significantly across different Age Groups, a one-way Analysis of Variance (ANOVA) was conducted.
Null Hypothesis (H₀): There is no significant difference in mean VALUE across age groups.
Alternative Hypothesis (H₁): At least one age group has a significantly different mean VALUE.
F-value = 0.645 is the ratio of the variance between groups to the variance within groups.
p-value = 0.694, which is greater than 0.05, means we fail to reject the null hypothesis.
Conclusion There is no statistically significant difference in VALUE across age groups. The observed variation in group means is consistent with random variation, so the data provide no evidence of true differences in means.
Note: Since the p-value is not significant, a post hoc test is not needed. However, it can still be performed for exploratory purposes if desired.
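The F-value and p-value can be reproduced from the sums of squares and degrees of freedom in the ANOVA table. The sketch below uses the rounded figures as printed, so the results agree with the table only up to rounding.

```r
# Figures taken from the ANOVA table.
ss_between <- 1060;  df_between <- 6
ss_within  <- 97773; df_within  <- 357

# Mean squares: sum of squares divided by degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variance.
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
```

An F-value below 1 means the group means vary less than would be expected from the within-group spread alone.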
Post Hoc Test
Pairwise comparisons using t tests with pooled SD
data: nta_model$VALUE and nta_model$Age.Group
18 - 24 years 25 - 34 years 35 - 44 years 45 - 54 years
25 - 34 years 1 - - -
35 - 44 years 1 1 - -
45 - 54 years 1 1 1 -
55 - 64 years 1 1 1 1
65 - 74 years 1 1 1 1
75 years and over 1 1 1 1
55 - 64 years 65 - 74 years
25 - 34 years - -
35 - 44 years - -
45 - 54 years - -
55 - 64 years - -
65 - 74 years 1 -
75 years and over 1 1
P value adjustment method: bonferroni
Post Hoc Test: Pairwise Comparisons Between Age Groups (Bonferroni Adjusted)
To further investigate the results of the ANOVA test, a pairwise t-test with Bonferroni correction was conducted. This helps identify which specific age groups differ significantly in their VALUE scores.
Conclusion The post hoc test confirms that none of the age groups differ significantly from one another in terms of VALUE. This suggests that age group alone does not substantially influence the outcome variable, within the scope of this dataset.
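The adjusted p-values are all reported as 1 because the Bonferroni method multiplies each raw p-value by the number of comparisons and caps the result at 1. With seven age groups there are 21 pairwise comparisons. The sketch below illustrates this with hypothetical raw p-values and then shows the call on synthetic data. None of the numbers come from the study.

```r
# Seven age groups give choose(7, 2) = 21 pairwise comparisons.
n_pairs <- choose(7, 2)

# Hypothetical raw p-values, chosen only to show the adjustment.
raw_p <- c(0.30, 0.04, 0.001)
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)

# The same kind of test on synthetic data with seven groups.
set.seed(9)
demo <- data.frame(
  g = factor(rep(letters[1:7], each = 10)),
  y = rnorm(70, mean = 20, sd = 15)
)
pw <- pairwise.t.test(demo$y, demo$g, p.adjust.method = "bonferroni")
```

Any raw p-value above 1/21 (about 0.048) is therefore reported as 1 after adjustment.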
Conclusion. This study aimed to identify whether age
group and sex significantly influence the factors considered important
during a vehicle purchase decision in Ireland. Using both multiple
linear regression and ANOVA, we found:
• Neither age group nor sex had a statistically significant impact on
the importance scores (VALUE).
• The regression model explained only a small portion of the variance
(R² ≈ 1.2%) and failed to meet some key assumptions, notably normality
of residuals.
• The ANOVA test also showed no significant difference in mean VALUE
across different age groups (p = 0.694).
• Post hoc tests confirmed there were no statistically meaningful
pairwise differences between any age groups.
These findings suggest that demographic variables alone may not be strong predictors of what influences vehicle purchasing decisions. More nuanced or behavioral variables may play a more dominant role.
Key Findings
• Demographic influence is minimal:
Both regression and ANOVA results indicate that age group and sex do not
significantly affect the importance people assign to vehicle purchase
factors.
• Regression model was not significant:
- R-squared was only 1.2%, suggesting that the model explains very
little variance.
- Neither age group nor sex showed statistically significant
coefficients (all p-values > 0.05).
• Normality assumption violated:
- The Q-Q plot of the residuals and the Shapiro-Wilk test on VALUE both
indicated departures from normality.
- This could affect the reliability of parametric tests and suggests
future analysis should consider robust or non-parametric methods.
• No significant group differences found in ANOVA:
- The ANOVA test returned a p-value of 0.694, showing no significant
difference in VALUE across age groups.
- Post hoc pairwise comparisons also returned no significant
differences, reinforcing the ANOVA result.
Limitations
1. Limited Scope of Predictors: The analysis was restricted to age group and sex. Other potentially important factors such as income, location, and prior ownership experience were not available in the dataset.
2. Cross-Sectional Data: The analysis is based on a snapshot in time and may not capture evolving consumer behavior or trends.
3. Missing Context: Some influencing factors may carry different meanings for different demographics, which could not be captured with numerical encoding alone.
4. Normality Violation: The Shapiro-Wilk test on VALUE rejected normality and the residual Q-Q plot showed deviations at the tails, which can affect inference.
5. Potential Survey Bias: The dataset is based on self-reported survey responses and might be subject to recall or response bias.