Introduction

This project explores the factors influencing vehicle purchase decisions in Ireland using data from the Central Statistics Office (CSO). The goal is to understand whether demographic variables such as age and sex affect the importance assigned to each influencing factor.

Research Questions

  1. Does age group significantly influence how important various factors are when purchasing a vehicle? (Analyzed using ANOVA.)

  2. Can we predict the importance placed on vehicle purchase factors using a respondent’s age group and sex? (Analyzed using multiple linear regression.)

Dataset

About Dataset. The dataset was sourced from the Central Statistics Office (CSO), table NTA42. Each row gives an importance value (VALUE, in %) for one combination of age group, sex, and influencing factor.

head(nta)
  STATISTIC                           Statistic.Label TLIST.A1. Year
1     NTA42 Factors that influence a vehicle purchase      2019 2019
2     NTA42 Factors that influence a vehicle purchase      2019 2019
3     NTA42 Factors that influence a vehicle purchase      2019 2019
4     NTA42 Factors that influence a vehicle purchase      2019 2019
5     NTA42 Factors that influence a vehicle purchase      2019 2019
6     NTA42 Factors that influence a vehicle purchase      2019 2019
  C02076V02508     Age.Group C02199V02655  Sex C03655V04397 Influencing.Factor
1          350 18 - 24 years            1 Male           10     Purchase price
2          350 18 - 24 years            1 Male           40        Reliability
3          350 18 - 24 years            1 Male           70  Engine efficiency
4          350 18 - 24 years            1 Male          110               Size
5          350 18 - 24 years            1 Male           60                Tax
6          350 18 - 24 years            1 Male           80          Insurance
  UNIT VALUE
1    %  73.3
2    %  20.4
3    %  19.9
4    %   8.3
5    %  21.6
6    %  66.6

Data Preprocessing & EDA

Data Preprocessing. I checked for missing values, found some, and filled them using median imputation. I then created a new data frame containing only the variables required for the analysis.

head(nta_selected)
      Age.Group  Sex Influencing.Factor VALUE
1 18 - 24 years Male     Purchase price  73.3
2 18 - 24 years Male        Reliability  20.4
3 18 - 24 years Male  Engine efficiency  19.9
4 18 - 24 years Male               Size   8.3
5 18 - 24 years Male                Tax  21.6
6 18 - 24 years Male          Insurance  66.6
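
The preprocessing steps described above can be sketched as follows. This is a minimal illustration on a made-up four-row data frame, not the actual CSO extract; the object names `nta_toy` and `nta_toy_selected` are hypothetical, and the original code may have differed.

```r
# Toy data frame standing in for the CSO extract (values are made up).
nta_toy <- data.frame(
  Age.Group = c("18 - 24 years", "18 - 24 years", "25 - 34 years", "25 - 34 years"),
  Sex = c("Male", "Female", "Male", "Female"),
  Influencing.Factor = c("Purchase price", "Tax", "Purchase price", "Tax"),
  UNIT = "%",
  VALUE = c(73.3, NA, 70.1, 20.4)
)

# Count missing values per column.
missing_counts <- colSums(is.na(nta_toy))

# Replace missing VALUE entries with the median of the observed values.
value_median <- median(nta_toy$VALUE, na.rm = TRUE)
nta_toy$VALUE[is.na(nta_toy$VALUE)] <- value_median

# Keep only the variables used in the analysis.
nta_toy_selected <- nta_toy[, c("Age.Group", "Sex", "Influencing.Factor", "VALUE")]

print(missing_counts)
print(nta_toy_selected)
```

Median imputation keeps every row but pulls imputed values toward the centre of the distribution, so it is worth reporting how many values were filled in.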

Missing Values Check

Outlier Detection
There were outliers present in the dataset, but I didn’t remove them because they are fundamental to my analysis.
ggplot(nta_clean, aes(x = `Age.Group`, y = VALUE, fill = `Age.Group`)) +
  geom_boxplot() +
  labs(title = "Outlier Detection by Age Group", y = "Value", x = "Age Group") +
  theme_minimal()
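
To list the points a boxplot would flag rather than only see them, the usual 1.5 × IQR rule can be applied within each age group. The sketch below uses made-up numbers and a hypothetical helper `flag_outliers()`; it approximates what the boxplot draws and is not taken from the original analysis.

```r
# Made-up values for two age groups; 70 is an obvious high value in the first group.
toy <- data.frame(
  Age.Group = rep(c("18 - 24 years", "25 - 34 years"), each = 10),
  VALUE = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 70,
            4, 6, 8, 10, 12, 14, 16, 18, 20, 22)
)

# Flag values more than 1.5 * IQR below the first or above the third quartile.
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Apply the rule separately within each age group (ave() returns 0/1 here).
toy$is_outlier <- ave(toy$VALUE, toy$Age.Group, FUN = flag_outliers) == 1

print(toy[toy$is_outlier, ])
```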

Exploratory Data Analysis

Average Influence Score by Age Group
Value by Age Group and Sex
Distribution of value by Gender
Influence scores for each gender across all influencing factors

Regression


Call:
lm(formula = VALUE ~ Age.Group + Sex, data = nta_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-21.712 -10.594  -3.894   2.775  59.081 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 17.5173     2.4547   7.136 5.41e-12 ***
Age.Group25 - 34 years       2.1019     3.2472   0.647    0.518    
Age.Group35 - 44 years       2.6462     3.2472   0.815    0.416    
Age.Group45 - 54 years       2.8173     3.2472   0.868    0.386    
Age.Group55 - 64 years       1.2769     3.2472   0.393    0.694    
Age.Group65 - 74 years       0.5404     3.2472   0.166    0.868    
Age.Group75 years and over  -2.4462     3.2472  -0.753    0.452    
SexMale                      1.3769     1.7357   0.793    0.428    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.56 on 356 degrees of freedom
Multiple R-squared:  0.01247,   Adjusted R-squared:  -0.006944 
F-statistic: 0.6424 on 7 and 356 DF,  p-value: 0.7207

Interpretation

Interpretation of Coefficients.
- Intercept (17.52): This is the estimated VALUE for the baseline group, that is, females in the 18 - 24 years age group.
- Age Group Coefficients: Each coefficient represents the estimated difference in VALUE compared to the baseline age group (18 - 24 years). None of the age groups showed statistically significant differences (all p-values > 0.05).
- Sex (Male): The positive coefficient (1.38) implies that, on average, VALUE is slightly higher for males than for females, but the difference is not statistically significant (p = 0.428).
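
To make the role of the baseline group concrete, here is a small sketch on simulated, balanced data (not the CSO data). It shows that R takes the first level of each factor as the reference, that the intercept is the fitted value for that reference group, and that `SexMale` is the fitted male-minus-female difference. All object names are illustrative assumptions.

```r
# Simulated, balanced toy data; the labels mirror the report but the values are random.
set.seed(42)
toy <- expand.grid(
  Age.Group = c("18 - 24 years", "25 - 34 years", "35 - 44 years"),
  Sex = c("Female", "Male"),
  replicate_id = 1:20,
  stringsAsFactors = TRUE
)
toy$VALUE <- 20 + rnorm(nrow(toy), sd = 5)

fit <- lm(VALUE ~ Age.Group + Sex, data = toy)

# The first level of each factor is the reference level absorbed by the intercept.
baseline_levels <- c(levels(toy$Age.Group)[1], levels(toy$Sex)[1])

# Fitted values for a baseline-age female and a baseline-age male.
new_rows <- data.frame(
  Age.Group = factor(rep("18 - 24 years", 2), levels = levels(toy$Age.Group)),
  Sex = factor(c("Female", "Male"), levels = levels(toy$Sex))
)
fitted_pair <- unname(predict(fit, newdata = new_rows))

intercept <- unname(coef(fit)["(Intercept)"])
sex_male  <- unname(coef(fit)["SexMale"])

print(baseline_levels)
print(c(baseline_fit = fitted_pair[1], intercept = intercept))
print(c(male_minus_female = fitted_pair[2] - fitted_pair[1], SexMale = sex_male))
```

If a different reference group is preferred, `relevel()` can change it before fitting; the fit itself is unchanged, only the coefficients' meaning shifts.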

Model Diagnostics.
- R-squared = 0.0125: Only about 1.2% of the variance in VALUE is explained by age group and sex combined. This indicates a very poor fit.
- Adjusted R-squared = -0.0069: A negative adjusted R-squared means the model fits no better than predicting the overall mean for every observation.
- F-statistic p-value = 0.7207: The model as a whole is not statistically significant.
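
The adjusted R-squared and the F-statistic both follow from the reported R-squared and the degrees of freedom. The sketch below redoes that arithmetic from the numbers in the summary output; the results should agree with the printed values up to rounding of the reported R-squared. The variable names are my own.

```r
r_squared <- 0.01247  # Multiple R-squared from the summary output
n_pred    <- 7        # six age-group dummies plus one sex dummy
resid_df  <- 356      # residual degrees of freedom from the summary output
n_obs     <- resid_df + n_pred + 1  # observations implied by the degrees of freedom

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic: explained variance per predictor over unexplained variance per residual df.
f_statistic <- (r_squared / n_pred) / ((1 - r_squared) / resid_df)
p_value <- pf(f_statistic, n_pred, resid_df, lower.tail = FALSE)

print(c(n_obs = n_obs, adj_r_squared = adj_r_squared,
        f_statistic = f_statistic, p_value = p_value))
```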

Conclusion. The analysis finds no statistically significant effect of age group or sex on the importance assigned to influencing factors in vehicle purchases.

Assumptions Check

Shapiro-Wilk Normality Test
shapiro.test(sample(nta_clean$VALUE, 5000, replace = TRUE))

    Shapiro-Wilk normality test

data:  sample(nta_clean$VALUE, 5000, replace = TRUE)
W = 0.82355, p-value < 2.2e-16

Interpretation

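The Shapiro-Wilk test was conducted to evaluate whether the response variable VALUE follows a normal distribution. Strictly, linear regression assumes normally distributed residuals rather than a normally distributed response, so this is a rough first check; the Q-Q plot of the residuals addresses the assumption directly. The test was also run on 5,000 values resampled with replacement from VALUE, which inflates the apparent sample size, so the p-value should be read with caution.

Since the p-value is far below 0.05, we reject the null hypothesis of the Shapiro-Wilk test, which states that the data are normally distributed. This indicates that the VALUE variable does not follow a normal distribution.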

Conclusion:
The assumption of normality is violated. While linear regression can still be used (as it’s relatively robust to this assumption in large samples), alternative approaches like non-parametric tests or data transformation may be considered based on the context.
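
Because the normality assumption concerns the residuals, a variant of this check applies `shapiro.test()` to `residuals()` of the fitted model. `shapiro.test()` accepts between 3 and 5,000 observations, so no resampling is needed for a dataset of this size. The sketch below uses simulated right-skewed data as a stand-in (an assumption, not the CSO data).

```r
# Simulated stand-in: 7 age groups x 2 sexes x 26 replicates = 364 rows.
set.seed(123)
toy <- expand.grid(
  Age.Group = c("18 - 24 years", "25 - 34 years", "35 - 44 years", "45 - 54 years",
                "55 - 64 years", "65 - 74 years", "75 years and over"),
  Sex = c("Female", "Male"),
  replicate_id = 1:26,
  stringsAsFactors = TRUE
)

# Right-skewed response (exponential), chosen only to illustrate a non-normal case.
toy$VALUE <- rexp(nrow(toy), rate = 1 / 18)

fit <- lm(VALUE ~ Age.Group + Sex, data = toy)

# Test the residuals directly, without resampling.
sw <- shapiro.test(residuals(fit))
print(sw)
```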

Residual Plots
par(mfrow = c(2, 2))
plot(reg_model)

Interpretation

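To evaluate the assumptions of linear regression, we examined the four diagnostic plots produced by plot(reg_model): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

Together, these plots suggest that:
- The linearity assumption is fairly reasonable.
- There are mild violations of normality and homoscedasticity.
- No severely influential outliers are evident.

Multicollinearity cannot be judged from these plots; it is checked separately with the VIF.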

These insights guide whether additional transformation or robust methods should be considered in future models.

Q-Q Plot
qqnorm(residuals(reg_model))
qqline(residuals(reg_model), col = "red")

Interpretation

The Q-Q (Quantile-Quantile) plot helps assess whether the residuals from the regression model are normally distributed, which is a key assumption in linear regression.

Observations. In the plot above:
- The dots deviate substantially from the red line, particularly at the ends (tails) of the distribution.
- This S-shaped pattern suggests that the residuals exhibit heavy tails or skewness, indicating a violation of the normality assumption.

Conclusion. The Q-Q plot shows that the residuals are not normally distributed, which may impact the reliability of p-values and confidence intervals in the regression output. This justifies the use of non-parametric tests or transformations if needed in further analysis.
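
One non-parametric option for comparing age groups is the Kruskal-Wallis rank-sum test, which does not assume normal residuals. It was not part of the original analysis; the sketch below only illustrates the call on simulated skewed data.

```r
# Simulated stand-in data with a skewed response (not the CSO data).
set.seed(2024)
toy <- data.frame(
  Age.Group = factor(rep(c("18 - 24 years", "25 - 34 years", "35 - 44 years"), each = 40))
)
toy$VALUE <- rexp(nrow(toy), rate = 1 / 18)

# Kruskal-Wallis compares the groups using ranks instead of raw values.
kw <- kruskal.test(VALUE ~ Age.Group, data = toy)
print(kw)
```

The test has one fewer degree of freedom than there are groups, so the seven age groups in the report would give 6 degrees of freedom.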

Breusch-Pagan Test for Homoscedasticity
library(lmtest)
bptest(reg_model)

    studentized Breusch-Pagan test

data:  reg_model
BP = 2.3749, df = 7, p-value = 0.9362

Interpretation

The Breusch-Pagan (BP) test is used to check the assumption of homoscedasticity in a regression model—that is, whether the variance of the residuals is constant across all levels of the independent variables.

Test Output
- BP Statistic: 2.3749
- Degrees of Freedom (df): 7
- p-value: 0.9362

Interpretation
- Since the p-value (0.9362) is much greater than 0.05, we fail to reject the null hypothesis of homoscedasticity.
- This indicates that there is no significant evidence of heteroscedasticity, and the residuals appear to have constant variance.

Conclusion. The test gives no evidence against homoscedasticity, which supports the validity of the regression model’s standard errors and inferential statistics.
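
To show where the BP statistic and its degrees of freedom come from, the studentized version of the test can be reproduced in base R: regress the squared residuals on the same predictors and multiply the R-squared of that auxiliary regression by the sample size. The sketch below does this on simulated data (an assumption, not the CSO data); on the same model it should agree with `bptest()` at its default settings.

```r
# Simulated, homoscedastic toy data.
set.seed(7)
toy <- expand.grid(
  Age.Group = c("18 - 24 years", "25 - 34 years", "35 - 44 years"),
  Sex = c("Female", "Male"),
  replicate_id = 1:30,
  stringsAsFactors = TRUE
)
toy$VALUE <- 20 + rnorm(nrow(toy), sd = 5)

fit <- lm(VALUE ~ Age.Group + Sex, data = toy)

# Auxiliary regression: do the predictors explain the squared residuals?
toy$sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ Age.Group + Sex, data = toy)

# Studentized BP statistic: n * R^2 of the auxiliary regression.
bp_stat <- nrow(toy) * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1  # number of predictors, excluding the intercept
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

In the report's model the seven dummy variables (six for age group, one for sex) give df = 7.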

VIF for Multicollinearity
library(car)
vif(reg_model)
          GVIF Df GVIF^(1/(2*Df))
Age.Group    1  6               1
Sex          1  1               1

Interpretation

The Variance Inflation Factor (VIF) assesses whether independent variables in a regression model are highly correlated with each other (i.e., multicollinearity). High multicollinearity inflates the standard errors of the regression coefficients. Because Age Group has more than one degree of freedom, the output reports the generalized VIF (GVIF).

Conclusion. Both Age Group and Sex have GVIF values equal to 1, indicating that multicollinearity is not a concern in this regression model. The predictors are sufficiently independent for reliable coefficient estimation.
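
A GVIF of exactly 1 is what a balanced design produces, since every age group then contains the same mix of sexes. The sketch below computes the generalized VIF by hand from the correlation matrix of the dummy variables, following the Fox and Monette definition, on a balanced toy design. The helper `gvif()` and the toy layout are illustrative assumptions.

```r
# Balanced toy design: every age group appears equally often for each sex.
age_levels <- c("18 - 24 years", "25 - 34 years", "35 - 44 years", "45 - 54 years",
                "55 - 64 years", "65 - 74 years", "75 years and over")
toy <- expand.grid(Age.Group = age_levels, Sex = c("Female", "Male"),
                   replicate_id = 1:5, stringsAsFactors = TRUE)

# Dummy-coded predictors without the intercept column, and their correlations.
X <- model.matrix(~ Age.Group + Sex, data = toy)[, -1]
R <- cor(X)

# GVIF for a term: det(R_term) * det(R_others) / det(R).
gvif <- function(cols) {
  det(R[cols, cols, drop = FALSE]) * det(R[-cols, -cols, drop = FALSE]) / det(R)
}

age_cols <- which(startsWith(colnames(X), "Age.Group"))
sex_cols <- which(startsWith(colnames(X), "Sex"))

gvif_age <- gvif(age_cols)
gvif_sex <- gvif(sex_cols)

# Scaling by 1 / (2 * Df) makes terms with different Df comparable.
gvif_age_scaled <- gvif_age^(1 / (2 * length(age_cols)))

print(c(gvif_age = gvif_age, gvif_sex = gvif_sex, gvif_age_scaled = gvif_age_scaled))
```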

ANOVA

aov_model <- aov(VALUE ~ `Age.Group`, data = nta_model)
summary(aov_model)
             Df Sum Sq Mean Sq F value Pr(>F)
Age.Group     6   1060   176.7   0.645  0.694
Residuals   357  97773   273.9               

ANOVA Test

To determine whether the mean VALUE differs significantly across different Age Groups, a one-way Analysis of Variance (ANOVA) was conducted.

Conclusion. With F = 0.645 and p = 0.694, there is no statistically significant difference in mean VALUE across age groups. The observed variation among the group means is consistent with random variation.

Note: Since the p-value is not significant, a post hoc test is not needed. However, it can still be performed for exploratory purposes if desired.
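
The F value and p-value in the ANOVA table follow directly from the sums of squares and degrees of freedom. This short check redoes the arithmetic from the printed table; the variable names are my own, and small differences from the printed values are due to rounding.

```r
# Values copied from the ANOVA table above.
ss_between <- 1060;  df_between <- 6
ss_within  <- 97773; df_within  <- 357

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ms_between = ms_between, ms_within = ms_within,
        f_value = f_value, p_value = p_value))
```

An F value below 1 means the age-group means vary less than would be expected from the within-group spread alone.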

Post Hoc Test
pairwise.t.test(nta_model$VALUE, nta_model$`Age.Group`, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  nta_model$VALUE and nta_model$Age.Group 

                  18 - 24 years 25 - 34 years 35 - 44 years 45 - 54 years
25 - 34 years     1             -             -             -            
35 - 44 years     1             1             -             -            
45 - 54 years     1             1             1             -            
55 - 64 years     1             1             1             1            
65 - 74 years     1             1             1             1            
75 years and over 1             1             1             1            
                  55 - 64 years 65 - 74 years
25 - 34 years     -             -            
35 - 44 years     -             -            
45 - 54 years     -             -            
55 - 64 years     -             -            
65 - 74 years     1             -            
75 years and over 1             1            

P value adjustment method: bonferroni 

Post Hoc Test: Pairwise Comparisons Between Age Groups (Bonferroni Adjusted)

To further investigate the results of the ANOVA test, a pairwise t-test with Bonferroni correction was conducted. This helps identify which specific age groups differ significantly in their VALUE scores.

Conclusion. Every Bonferroni-adjusted p-value equals 1, so the post hoc test confirms that none of the age groups differ significantly from one another in terms of VALUE. This suggests that age group alone does not substantially influence the outcome variable, within the scope of this dataset.
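
The table shows 1 everywhere because of how the Bonferroni correction works: each raw p-value is multiplied by the number of comparisons and capped at 1. With seven age groups there are 21 pairwise comparisons, so any raw p-value above 1/21 (about 0.048) is reported as 1. The raw p-values in this sketch are hypothetical and chosen only to show the capping.

```r
n_groups <- 7
n_pairs  <- choose(n_groups, 2)  # number of pairwise comparisons

# Hypothetical raw p-values (not taken from the analysis).
raw_p <- c(0.386, 0.05, 0.002)

# Bonferroni by hand: multiply by the number of comparisons and cap at 1.
manual_adj <- pmin(1, raw_p * n_pairs)

# The same adjustment via p.adjust(), telling it the total number of comparisons.
builtin_adj <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)

print(n_pairs)
print(rbind(raw = raw_p, manual = manual_adj, builtin = builtin_adj))
```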

Conclusion & Key Findings

Conclusion. This study aimed to identify whether age group and sex significantly influence the factors considered important during a vehicle purchase decision in Ireland. Using both multiple linear regression and ANOVA, we found:
• Neither age group nor sex had a statistically significant impact on the importance scores (VALUE).
• The regression model explained only a small portion of the variance (R² ≈ 1.2%) and failed to meet some key assumptions, notably normality of residuals.
• The ANOVA test also showed no significant difference in mean VALUE across different age groups (p = 0.694).
• Post hoc tests confirmed there were no statistically meaningful pairwise differences between any age groups.

These findings suggest that demographic variables alone may not be strong predictors of what influences vehicle purchasing decisions. More nuanced or behavioral variables may play a more dominant role.

Key Findings
• Demographic influence is minimal: Both regression and ANOVA results indicate that age group and sex do not significantly affect the importance people assign to vehicle purchase factors.
• Regression model was not significant:
- R-squared was only 1.2%, suggesting that the model explains very little variance.
- Neither age group nor sex showed statistically significant coefficients (all p-values > 0.05).
• Normality assumption violated:
- The Shapiro-Wilk test on VALUE and the Q-Q plot of the residuals both indicated departures from normality.
- This could affect the reliability of parametric tests and suggests future analysis should consider robust or non-parametric methods.
• No significant group differences found in ANOVA:
- The ANOVA test returned a p-value of 0.694, showing no significant difference in VALUE across age groups.
- Post hoc pairwise comparisons also returned no significant differences, reinforcing the ANOVA result.

Limitations of the Project

1.  Limited Scope of Predictors: The analysis was restricted to age group and sex. Other impactful factors like income, location, and prior ownership experience were not available in the dataset.  
2.  Cross-Sectional Data: The analysis is based on a snapshot in time and may not capture evolving consumer behavior or trends.  
3.  Missing Context: Some influencing factors may carry different meanings for different demographics, which could not be captured with numerical encoding alone.  
4.  Normality Violation: The response variable failed the Shapiro-Wilk test and the regression residuals deviated from the reference line in the Q-Q plot, which can affect inference.  
5.  Potential Survey Bias: The dataset is based on self-reported survey responses and might be subject to recall or response bias.