Introduction

This project explores the factors influencing vehicle purchase decisions in Ireland using data from the Central Statistics Office (CSO). The goal is to understand whether demographic variables such as age and sex affect the importance assigned to each influencing factor.

Research Questions

Does age group significantly influence how important various factors are when purchasing a vehicle? (analyzed using ANOVA)
Can we predict the importance placed on vehicle purchase factors using a respondent’s age group and sex? (analyzed using multiple linear regression)

Dataset

The dataset was sourced from the Central Statistics Office (CSO). Each row gives the importance value (VALUE, a percentage) for one combination of age group, sex, and influencing factor.

head(nta)

  STATISTIC                           Statistic.Label TLIST.A1. Year
1     NTA42 Factors that influence a vehicle purchase      2019 2019
2     NTA42 Factors that influence a vehicle purchase      2019 2019
3     NTA42 Factors that influence a vehicle purchase      2019 2019
4     NTA42 Factors that influence a vehicle purchase      2019 2019
5     NTA42 Factors that influence a vehicle purchase      2019 2019
6     NTA42 Factors that influence a vehicle purchase      2019 2019
  C02076V02508     Age.Group C02199V02655  Sex C03655V04397 Influencing.Factor
1          350 18 - 24 years            1 Male           10     Purchase price
2          350 18 - 24 years            1 Male           40        Reliability
3          350 18 - 24 years            1 Male           70  Engine efficiency
4          350 18 - 24 years            1 Male          110               Size
5          350 18 - 24 years            1 Male           60                Tax
6          350 18 - 24 years            1 Male           80          Insurance
  UNIT VALUE
1    %  73.3
2    %  20.4
3    %  19.9
4    %   8.3
5    %  21.6
6    %  66.6

Data Preprocessing & EDA

I checked for missing values, found some, and filled them using median imputation. I then created a new data frame containing only the variables required for the analysis.
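The preprocessing steps described above can be sketched as follows. This is a minimal illustration on a small made-up data frame (`nta_toy` and its values are hypothetical stand-ins, not the CSO extract), and it assumes the imputation was applied to the VALUE column.

```r
# Toy data frame standing in for the CSO extract; all values are hypothetical.
nta_toy <- data.frame(
  Age.Group = c("18 - 24 years", "18 - 24 years", "25 - 34 years", "25 - 34 years"),
  Sex = c("Male", "Female", "Male", "Female"),
  Influencing.Factor = c("Purchase price", "Purchase price", "Tax", "Tax"),
  UNIT = "%",
  VALUE = c(73.3, NA, 21.6, 30.0)
)

# Count missing values per column.
missing_counts <- colSums(is.na(nta_toy))

# Replace missing VALUE entries with the median of the observed values.
value_median <- median(nta_toy$VALUE, na.rm = TRUE)
nta_toy$VALUE[is.na(nta_toy$VALUE)] <- value_median

# Keep only the variables used in the analysis.
nta_selected_toy <- nta_toy[, c("Age.Group", "Sex", "Influencing.Factor", "VALUE")]
```

Median imputation keeps the centre of the distribution unchanged but shrinks its spread slightly, which is worth bearing in mind when many values are missing.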

head(nta_selected)

      Age.Group  Sex Influencing.Factor VALUE
1 18 - 24 years Male     Purchase price  73.3
2 18 - 24 years Male        Reliability  20.4
3 18 - 24 years Male  Engine efficiency  19.9
4 18 - 24 years Male               Size   8.3
5 18 - 24 years Male                Tax  21.6
6 18 - 24 years Male          Insurance  66.6

Missing Values Check

Outlier Detection

The dataset contains outliers, but I did not remove them because they are fundamental to the analysis.

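library(ggplot2)

# Boxplot of VALUE per age group; points beyond the whiskers (1.5 x IQR) are drawn as outliers.
ggplot(nta_clean, aes(x = `Age.Group`, y = VALUE, fill = `Age.Group`)) +
  geom_boxplot() +
  labs(title = "Outlier Detection by Age Group", y = "Value", x = "Age Group") +
  theme_minimal()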
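For readers who want the flagged points as numbers rather than dots on a plot, the rule that a boxplot applies can be reproduced with base R's `boxplot.stats()`. The vector `x` below is hypothetical and only illustrates the 1.5 x IQR rule; it is not taken from the dataset.

```r
# Hypothetical importance scores with one extreme value.
x <- c(5, 8, 10, 12, 14, 15, 18, 20, 75)

# boxplot.stats() returns the five-number summary and the points outside the whiskers.
stats <- boxplot.stats(x, coef = 1.5)

# Lower and upper hinges (approximately the first and third quartiles).
hinges <- stats$stats[c(2, 4)]

# Fences: anything outside hinge -/+ 1.5 * (upper hinge - lower hinge) is flagged.
fences <- c(hinges[1] - 1.5 * diff(hinges), hinges[2] + 1.5 * diff(hinges))

flagged <- stats$out
```

Applied per age group, for example with `tapply()`, the same call would list the values that each boxplot draws as outliers.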

Exploratory Data Analysis

Average Influence Score by Age Group

Value by Age Group and Sex

Distribution of value by Gender

Influence scores for each gender across all influencing factors

Regression

Null Hypothesis: Age group and sex do not significantly predict the importance score (VALUE).
Alternative Hypothesis: At least one of the predictors (age group or sex) significantly predicts the importance score (VALUE).


Call:
lm(formula = VALUE ~ Age.Group + Sex, data = nta_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.712 -10.594  -3.894   2.775  59.081 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 17.5173     2.4547   7.136 5.41e-12 ***
Age.Group25 - 34 years       2.1019     3.2472   0.647    0.518    
Age.Group35 - 44 years       2.6462     3.2472   0.815    0.416    
Age.Group45 - 54 years       2.8173     3.2472   0.868    0.386    
Age.Group55 - 64 years       1.2769     3.2472   0.393    0.694    
Age.Group65 - 74 years       0.5404     3.2472   0.166    0.868    
Age.Group75 years and over  -2.4462     3.2472  -0.753    0.452    
SexMale                      1.3769     1.7357   0.793    0.428    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.56 on 356 degrees of freedom
Multiple R-squared:  0.01247,   Adjusted R-squared:  -0.006944 
F-statistic: 0.6424 on 7 and 356 DF,  p-value: 0.7207

Interpretation

Interpretation of Coefficients
- Intercept (17.52): This is the estimated mean VALUE for the baseline group, females in the 18 - 24 years age group.
- Age Group Coefficients: Each coefficient is the difference in mean VALUE relative to the baseline age group (18 - 24 years). None of the age groups showed a statistically significant difference (all p-values > 0.05).
- Sex (Male): The positive coefficient (1.38) implies that, on average, males assign slightly higher VALUE scores than females, but the difference is not statistically significant (p = 0.428).
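The baseline group follows from how R orders factor levels: by default the first level in sorted order is the reference, and `lm()` reports one coefficient for each remaining level. A small sketch, using the level labels that appear in the regression output (the object names are illustrative):

```r
# Age group and sex labels as they appear in the regression output.
age <- factor(c("18 - 24 years", "25 - 34 years", "35 - 44 years", "45 - 54 years",
                "55 - 64 years", "65 - 74 years", "75 years and over"))
sex <- factor(c("Male", "Female"))

# The first level is the reference category absorbed into the intercept.
age_reference <- levels(age)[1]
sex_reference <- levels(sex)[1]

# Treatment contrasts name the coefficient after the non-reference level.
sex_coefficient_level <- colnames(contrasts(sex))

# relevel() changes the reference if a different baseline is preferred.
sex_male_ref <- relevel(sex, ref = "Male")
```

With "Female" as the reference, the `SexMale` coefficient is the estimated male-minus-female difference, so its sign gives the direction of the effect directly.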

Model Diagnostics
- R-squared = 0.0125: Only about 1.2% of the variance in VALUE is explained by age group and sex combined. This indicates a very poor fit.
- Adjusted R-squared = -0.0069: After penalizing for the seven predictors, the model does no better than predicting the overall mean.
- F-statistic p-value = 0.7207: The model as a whole is not statistically significant.

Conclusion: We fail to reject the null hypothesis. The analysis finds no statistically significant effect of age group or sex on how individuals rate the importance of influencing factors in vehicle purchases.
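The fit statistics in the summary are linked by simple formulas, so they can be cross-checked from the reported R-squared alone. The sketch below uses the numbers printed in the regression output (R-squared 0.01247, 7 predictors, 356 residual degrees of freedom, hence n = 364); small differences from the printed values come from rounding of R-squared.

```r
r2 <- 0.01247   # Multiple R-squared from the summary
p  <- 7         # number of estimated slopes (6 age dummies + 1 sex dummy)
n  <- 356 + p + 1  # residual df + slopes + intercept = 364 observations

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: explained variance per predictor over residual variance per df.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution gives the model p-value.
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```

An F-statistic below 1 means the predictors explain less variance per degree of freedom than is left in the residuals, which is why the adjusted R-squared turns negative.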

Assumptions Check

Shapiro-Wilk Normality Test

shapiro.test(sample(nta_clean$VALUE, 5000, replace = TRUE))


    Shapiro-Wilk normality test

data:  sample(nta_clean$VALUE, 5000, replace = TRUE)
W = 0.82355, p-value < 2.2e-16

Interpretation

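The Shapiro-Wilk test was conducted to evaluate whether VALUE follows a normal distribution. Two caveats apply. First, the test was run on the response variable, whereas the linear regression assumption concerns the residuals. Second, VALUE was resampled with replacement to 5,000 values; the regression itself used 364 observations, and resampling to more values than were observed makes the test more likely to reject, so the p-value should be read with caution.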

Test Statistic (W): 0.82355
p-value: < 2.2e-16

Since the p-value is far below 0.05, we reject the null hypothesis of the Shapiro-Wilk test, which states that the data are normally distributed. This indicates that the VALUE variable does not follow a normal distribution.

Conclusion:
Normality is rejected for VALUE. Linear regression can still be used, since inference on the coefficients is relatively robust to non-normality in large samples, but alternative approaches such as non-parametric tests or a data transformation may be considered depending on the context.
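Because the regression assumption is about the residuals, a common alternative is to apply the test to `residuals()` of the fitted model, at the original sample size. The sketch below uses simulated, right-skewed data (`sim` and `sim_model` are hypothetical and only mimic the shape of the design: 7 age groups, 2 sexes, 364 rows); it does not reproduce the results reported in this document.

```r
set.seed(42)

# Simulated stand-in for the modelling data: 7 groups x 2 sexes, 364 rows.
n <- 364
sim <- data.frame(
  Age.Group = factor(rep(paste("group", 1:7), each = 52)),
  Sex = factor(rep(c("Female", "Male"), times = 182)),
  VALUE = rexp(n, rate = 1 / 18)  # right-skewed response, mean about 18
)

sim_model <- lm(VALUE ~ Age.Group + Sex, data = sim)

# Test the residuals directly; shapiro.test() accepts between 3 and 5000 values,
# so no resampling is needed for 364 observations.
res_test <- shapiro.test(residuals(sim_model))
res_test$statistic
res_test$p.value
```

Testing the residuals at the observed sample size keeps the p-value tied to the amount of data actually collected.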

Residual Plots

par(mfrow = c(2, 2))
plot(reg_model)

Interpretation

To evaluate the assumptions of linear regression, we examined the following diagnostic plots:

Residuals vs Fitted: This plot shows no clear pattern, but a slight funnel shape suggests some heteroscedasticity (non-constant variance). Ideally, residuals should be evenly scattered around zero.
Q-Q Plot: The quantile-quantile plot shows deviation from the diagonal line, especially at the tails. This indicates that the residuals are not normally distributed, which violates the normality assumption.
Scale-Location Plot: The red trend line is slightly sloped, and the spread increases, again hinting at non-constant variance. This reinforces the evidence of heteroscedasticity.
Residuals vs Leverage: No influential points with high leverage or large Cook’s distance are detected, indicating there are no major outliers unduly influencing the model.

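Together, these plots suggest that:
- The linearity assumption is fairly reasonable.
- There are mild violations of normality and possibly of homoscedasticity, although the Breusch-Pagan test reported later in this document does not find significant heteroscedasticity.
- No influential outliers are evident. Multicollinearity cannot be judged from these plots; it is assessed separately with VIF.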

These insights guide whether additional transformation or robust methods should be considered in future models.

Q-Q Plot

qqnorm(residuals(reg_model))
qqline(residuals(reg_model), col = "red")

Interpretation

The Q-Q (Quantile-Quantile) plot helps assess whether the residuals from the regression model are normally distributed, which is a key assumption in linear regression.

In the plot above,

The black dots represent the actual sample quantiles of the residuals.
The red line represents the expected theoretical quantiles if the residuals were perfectly normal.

Observations
- The dots deviate substantially from the red line, particularly at the ends (tails) of the distribution.
- This S-shaped pattern suggests that the residuals exhibit heavy tails or skewness, indicating a violation of the normality assumption.

Conclusion:
The Q-Q plot shows that the residuals are not normally distributed, which may affect the reliability of p-values and confidence intervals in the regression output. This justifies considering non-parametric tests or transformations in further analysis.

Breusch-Pagan Test for Homoscedasticity

library(lmtest)
bptest(reg_model)


    studentized Breusch-Pagan test

data:  reg_model
BP = 2.3749, df = 7, p-value = 0.9362

Interpretation

The Breusch-Pagan (BP) test is used to check the assumption of homoscedasticity in a regression model—that is, whether the variance of the residuals is constant across all levels of the independent variables.

Test Output
- BP Statistic: 2.3749
- Degrees of Freedom (df): 7
- p-value: 0.9362

Interpretation
- Since the p-value (0.9362) is much greater than 0.05, we fail to reject the null hypothesis of homoscedasticity.
- This indicates that there is no significant evidence of heteroscedasticity, and the residuals appear to have constant variance.

Conclusion:
The test gives no evidence against homoscedasticity, which supports the validity of the regression model’s standard errors and inferential statistics.
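To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand in base R: regress the squared residuals on the same predictors and multiply the resulting R-squared by the sample size. The data below are simulated (`sim` is a hypothetical stand-in with the same design shape), so the numbers will not match the output above.

```r
set.seed(1)

# Simulated stand-in data: 7 age groups x 2 sexes, 364 rows.
sim <- data.frame(
  Age.Group = factor(rep(paste("group", 1:7), each = 52)),
  Sex = factor(rep(c("Female", "Male"), times = 182)),
  VALUE = rexp(364, rate = 1 / 18)
)

fit <- lm(VALUE ~ Age.Group + Sex, data = sim)

# Auxiliary regression: do the predictors explain the squared residuals?
sim$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ Age.Group + Sex, data = sim)

# Studentized (Koenker) BP statistic: n times the auxiliary R-squared.
bp_stat <- nrow(sim) * summary(aux)$r.squared

# Degrees of freedom: number of slopes in the auxiliary regression.
bp_df <- length(coef(aux)) - 1

# Compare against a chi-squared distribution.
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
```

A large p-value here means the predictors explain almost none of the variation in the squared residuals, which is the sense in which the variance looks constant across groups.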

VIF for Multicollinearity

library(car)
vif(reg_model)

          GVIF Df GVIF^(1/(2*Df))
Age.Group    1  6               1
Sex          1  1               1

Interpretation

The Variance Inflation Factor (VIF) assesses whether independent variables in a regression model are highly correlated with each other (i.e., multicollinearity). High multicollinearity can distort the estimation of regression coefficients. For categorical predictors with more than one dummy variable, the generalized version (GVIF) is reported.

VIF values close to 1 indicate no multicollinearity.
Common threshold: VIF > 5 or GVIF^(1/(2*Df)) > 2 suggests problematic multicollinearity.

Conclusion:
Both Age Group and Sex have GVIF values equal to 1, indicating that multicollinearity is not a concern in this regression model. A value of exactly 1 is consistent with a balanced design, in which every age group contains the same mix of sexes, so the two predictors are uncorrelated.
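The GVIF for a term can be reproduced in base R from the correlation matrix of the coefficient estimates, following the determinant ratio of Fox and Monette. The helper `gvif_by_term()` and the simulated data are illustrative assumptions, not the code behind the output above; the sketch mainly shows that a balanced design yields values of 1.

```r
# Hypothetical helper: GVIF per model term from the coefficient correlation matrix.
gvif_by_term <- function(model) {
  assign_idx <- attr(model.matrix(model), "assign")
  keep <- assign_idx != 0                       # drop the intercept column
  R <- cov2cor(vcov(model)[keep, keep, drop = FALSE])
  assign_idx <- assign_idx[keep]
  term_labels <- attr(terms(model), "term.labels")
  out <- sapply(seq_along(term_labels), function(i) {
    idx <- which(assign_idx == i)
    # det(block for this term) * det(block for the other terms) / det(whole matrix)
    det(R[idx, idx, drop = FALSE]) * det(R[-idx, -idx, drop = FALSE]) / det(R)
  })
  setNames(out, term_labels)
}

set.seed(7)

# Balanced simulated design: each age group has 26 females and 26 males.
sim <- data.frame(
  Age.Group = factor(rep(paste("group", 1:7), each = 52)),
  Sex = factor(rep(c("Female", "Male"), times = 182)),
  VALUE = rexp(364, rate = 1 / 18)
)

gvif_values <- gvif_by_term(lm(VALUE ~ Age.Group + Sex, data = sim))
```

If the groups were unbalanced, for example with more males in the older age groups, the values would rise above 1.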

ANOVA

Null Hypothesis: There is no significant difference in the mean importance score (VALUE) across different age groups.
Alternative Hypothesis: At least one age group has a significantly different mean importance score (VALUE).

aov_model <- aov(VALUE ~ `Age.Group`, data = nta_model)
summary(aov_model)

             Df Sum Sq Mean Sq F value Pr(>F)
Age.Group     6   1060   176.7   0.645  0.694
Residuals   357  97773   273.9

ANOVA Test

To determine whether the mean VALUE differs significantly across different Age Groups, a one-way Analysis of Variance (ANOVA) was conducted.

Null Hypothesis (H₀): There is no significant difference in VALUE across age groups.
Alternative Hypothesis (H₁): At least one age group has a significantly different mean VALUE.
F-value = 0.645 is the ratio of the variance between groups to the variance within groups; a value below 1 means the group means differ less than would be expected from the within-group spread alone.
p-value = 0.694, which is greater than 0.05, means we fail to reject the null hypothesis.

Conclusion:
There is no statistically significant difference in VALUE across age groups. The observed variations in VALUE among the groups are consistent with random variation rather than true differences in means.
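The F-value and p-value follow directly from the sums of squares in the ANOVA table, which makes for an easy cross-check. The sketch below uses the rounded figures printed above (1060 on 6 df, 97773 on 357 df), so the results agree with the table up to rounding.

```r
# Values taken from the ANOVA table.
ss_between <- 1060;  df_between <- 6
ss_within  <- 97773; df_within  <- 357

# Mean squares: sum of squares divided by degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
```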

Note: Since the p-value is not significant, a post hoc test is not needed. However, it can still be performed for exploratory purposes if desired.

Post Hoc Test

pairwise.t.test(nta_model$VALUE, nta_model$`Age.Group`, p.adjust.method = "bonferroni")


    Pairwise comparisons using t tests with pooled SD 

data:  nta_model$VALUE and nta_model$Age.Group 

                  18 - 24 years 25 - 34 years 35 - 44 years 45 - 54 years
25 - 34 years     1             -             -             -            
35 - 44 years     1             1             -             -            
45 - 54 years     1             1             1             -            
55 - 64 years     1             1             1             1            
65 - 74 years     1             1             1             1            
75 years and over 1             1             1             1            
                  55 - 64 years 65 - 74 years
25 - 34 years     -             -            
35 - 44 years     -             -            
45 - 54 years     -             -            
55 - 64 years     -             -            
65 - 74 years     1             -            
75 years and over 1             1            

P value adjustment method: bonferroni

Post Hoc Test: Pairwise Comparisons Between Age Groups (Bonferroni Adjusted)

To further investigate the results of the ANOVA test, a pairwise t-test with Bonferroni correction was conducted. This helps identify which specific age groups differ significantly in their VALUE scores.

Each comparison between two age groups yielded a p-value of 1, indicating no statistically significant differences between any pairs of age groups after adjusting for multiple comparisons.
The Bonferroni correction is a conservative method, reducing the risk of Type I errors, but increasing the likelihood of Type II errors.
These results reinforce the ANOVA findings, where we earlier failed to reject the null hypothesis.

Conclusion:
The post hoc test confirms that none of the age groups differ significantly from one another in terms of VALUE. This suggests that age group alone does not substantially influence the outcome variable, within the scope of this dataset.
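The uniform p-values of 1 are a direct consequence of how the Bonferroni adjustment works: each raw p-value is multiplied by the number of comparisons and capped at 1. With seven age groups there are 21 pairwise comparisons, so any raw p-value above roughly 0.048 is reported as 1. The raw p-values below are hypothetical and only illustrate the arithmetic.

```r
# Number of pairwise comparisons among 7 age groups.
m <- choose(7, 2)

# Hypothetical raw (unadjusted) p-values.
raw_p <- c(0.05, 0.30, 0.90)

# Built-in Bonferroni adjustment.
adjusted <- p.adjust(raw_p, method = "bonferroni", n = m)

# Equivalent manual calculation: multiply by m, cap at 1.
manual <- pmin(1, raw_p * m)

# Raw p-values above this threshold are all reported as 1 after adjustment.
cap_threshold <- 1 / m
```

An adjusted p-value of 1 therefore says only that the raw p-value exceeded 1/21; it does not mean the group means are identical.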

Conclusion & Key Findings

Conclusion
This study aimed to identify whether age group and sex significantly influence the factors considered important during a vehicle purchase decision in Ireland. Using both multiple linear regression and ANOVA, we found:
• Neither age group nor sex had a statistically significant impact on the importance scores (VALUE).
• The regression model explained only a small portion of the variance (R² ≈ 1.2%) and failed to meet some key assumptions, notably normality of residuals.
• The ANOVA test also showed no significant difference in mean VALUE across different age groups (p = 0.694).
• Post hoc tests confirmed there were no statistically meaningful pairwise differences between any age groups.

These findings suggest that demographic variables alone may not be strong predictors of what influences vehicle purchasing decisions. More nuanced or behavioral variables may play a more dominant role.

Key Findings
• Demographic influence is minimal: Both regression and ANOVA results indicate that age group and sex do not significantly affect the importance people assign to vehicle purchase factors.
• Regression model was not significant:
- R-squared was only 1.2%, suggesting that the model explains very little variance.
- Neither age group nor sex showed statistically significant coefficients (all p-values > 0.05).
• Normality assumption violated:
- The Q-Q plot showed that the residuals were not normally distributed, and the Shapiro-Wilk test rejected normality for VALUE.
- This could affect the reliability of parametric tests and suggests future analysis should consider robust or non-parametric methods.
• No significant group differences found in ANOVA:
- The ANOVA test returned a p-value of 0.694, showing no significant difference in VALUE across age groups.
- Post hoc pairwise comparisons also returned no significant differences, reinforcing the ANOVA result.

Limitations of the Project

1.  Limited Scope of Predictors: The analysis was restricted to age group and sex. Other impactful factors like income, location, and prior ownership experience were not available in the dataset.  
2.  Cross-Sectional Data: The analysis is based on a snapshot in time and may not capture evolving consumer behavior or trends.  
3.  Missing Context: Some influencing factors may carry different meanings for different demographics, which could not be captured with numerical encoding alone.  
4.  Normality Violation: The Shapiro-Wilk test rejected normality for VALUE and the regression residuals deviated from the reference line in the Q-Q plot, which can affect inference.  
5.  Potential Survey Bias: The dataset is based on self-reported survey responses and might be subject to recall or response bias.