Introduction

The purpose of this final project was to perform basic predictive modelling on a suitable dataset selected by our group. Although we initially struggled to find an appropriate dataset, we ultimately chose Australian tax information for the 2011-2012 income year. This dataset contained over 300 variables and over 900 observations. From the extensive list of variables, we selected a target variable for the models to predict as well as a variety of predictor variables. The target variable, called ‘experience’, was derived from the ‘age range’ variable. The predictor variables were taxable income, medicare levy, supplement income, child support, self education expenses, and gifts and donations, along with the respective count of respondents for each. Prior to the predictive modelling, the dataset was cleaned and transformed to address the data issues we identified.

Data Preparation

The Australian taxation data was acquired from the Australian Taxation Office for the 2011-2012 income year. A significant issue with this dataset is that it had undergone prior processing and aggregation. Specifically, the dataset was grouped by state, age group, gender, taxable status, and lodgement method. As a result, the 934 observations were only a summary of a much larger raw dataset. This required the data to be cleaned before predictive modelling could be performed.

The dataset was first subset to retain only the columns pertinent to the variables of interest: our target and predictor variables. The column names were then changed to make clearer what each variable represented. The target variable was derived from the age range variable. Before it could be derived, a character string was removed from the age range values to give a more concise classification. A new categorical variable called ‘Experience’ was then created based on the age ranges. The categories for the target variable were Low (under 18 - 29), Medium (30 - 49), High (50 - 69), and Retired (70 and above).

# make a new categorical column to identify 'Low', 'Medium', 'High', 'Retired' experience. This will be the target variable
data_sub$`Experience` <- as.factor(ifelse(data_sub$`Age Range` %in% c('Under 18', '18 - 24', '25 - 29'), 'Low', 
                                          ifelse(data_sub$`Age Range` %in% c('30 - 34', '35 - 39', '40 - 44', '45 - 49'), 'Medium',
                                          ifelse(data_sub$`Age Range` %in% c('50 - 54', '55 - 59', '60 - 64', '65 - 69'), 'High', 'Retired'))))

Since the dataset had already been aggregated, an average and a ratio had to be calculated for each predictor. The average (the dollar total divided by the count of respondents reporting that item) gave the average income or expense per respondent within each observation, which in theory estimates what an individual in that group claimed. The ratio (the count of respondents reporting that item divided by the total number of respondents for the observation, expressed as a percentage) showed how common each item was. The dataset was then subset again to contain only the new adjusted predictors and any other predictors that were still required. During the calculation of the adjusted predictors, some observations had no respondents for a given item, so the division produced NaN (not a number) values. These NaN values were changed to 0, as the absence of any reports was itself important to include.

# since the data is grouped, get the average and ratio of each predictor
average <- data_sub %>%
  mutate(`Average Income` = `Income $`/`Income No`,
         `Average Medicare` = `Medicare $`/`Medicare No`,
         `Medicare Ratio` = (`Medicare No`/`Income No`)*100,
         `Average Supplement Income` = `Supplement $`/`Supplement No`,
         `Supplement Income Ratio` = (`Supplement No`/`Income No`)*100,
         `Average Child Support` = `Child Support $`/`Child Support No`,
         `Child Support Ratio` = (`Child Support No`/`Income No`)*100,
         `Average SEE` = `SEE $`/`SEE No`,
         `SEE Ratio` = (`SEE No`/`Income No`)*100,
         `Average Gifts` = `Gifts $`/`Gifts No`,
         `Gifts Ratio` = (`Gifts No`/`Income No`)*100)

# get a new subset with just the adjusted predictors including any other predictors
ave_sub <- average %>%
  select(c(14:25))
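
The NaN replacement described above is not shown in the code. A minimal sketch of one way to do it, assuming dplyr 1.0 or later; the tibble `toy` and its values are invented for illustration and only stand in for `average`:

```r
library(dplyr)

# Hypothetical toy data standing in for `average`; the values are invented.
toy <- tibble(
  `SEE $`     = c(1200, 0, 450),
  `SEE No`    = c(4, 0, 3),
  `Income No` = c(40, 25, 30)
)

toy_adj <- toy %>%
  mutate(`Average SEE` = `SEE $` / `SEE No`,              # row 2 is 0/0, which gives NaN
         `SEE Ratio`   = (`SEE No` / `Income No`) * 100) %>%
  # replace NaN with 0 in every numeric column
  mutate(across(where(is.numeric), ~ replace(.x, is.nan(.x), 0)))

toy_adj
```

Note that `is.nan()` only catches the 0/0 case. A non-zero total divided by a zero count would give `Inf` rather than `NaN` and would need separate handling.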

After visualizing the data, we discovered a single outlier in the low experience class. It was removed because some models are sensitive to outliers, leaving us with 933 observations. The data was then scaled using a z-transformation so that variables measured on larger scales would not bias the models. The dataset was also converted from a tibble to a data frame, as some aspects of the modelling and of predicting variable importance were not possible with a tibble. Finally, to ensure an equal number of observations for each experience level, a new dataset was created consisting of 560 observations, 140 for each experience level.
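
The scaling, conversion, and balancing steps can be sketched as follows. This is an illustration under assumptions rather than the code we ran: the data frame `toy`, its two predictor columns, and the seed are invented, and only the class counts (288, 213, 288, 144) are taken from the cleaned data.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical stand-in for the cleaned data (933 rows)
toy <- tibble(
  avg_income  = rlnorm(933, meanlog = 10, sdlog = 0.6),
  gifts_ratio = runif(933, 0, 60),
  Experience  = factor(rep(c("High", "Low", "Medium", "Retired"),
                           times = c(288, 213, 288, 144)))
)

scaled <- toy %>%
  # z-transformation: subtract the column mean, divide by the column sd
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x)))) %>%
  as.data.frame()                      # plain data frame instead of a tibble

# draw 140 observations at random from each experience level
balanced <- scaled %>%
  group_by(Experience) %>%
  slice_sample(n = 140) %>%
  ungroup() %>%
  as.data.frame()

table(balanced$Experience)
```

Because the smallest class (Retired) has 144 observations, 140 per level is close to the largest balanced sample that can be drawn without replacement.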

Exploratory Data Analysis

Following the initial data preparation, we used R’s plotting capabilities to visualize the variation within our variables and the covariation between them. We used a simple scatter plot to observe the relationship between average income and the gifts and donations ratio, categorized by experience level. As mentioned in the data preparation section, we discovered an outlier using this method and removed it. The first plot below includes the outlier; the second shows the data without it.

As displayed in the plots, there is a positive relationship between the two variables. This implies that, as the average income increases, so does the ratio of individuals reporting gifts and donations in their income tax return. Of the experience levels, “Low” had the steepest rate of growth. “Medium” and “High” experience levels had similar rates of growth. However, the “High” experience level maintained a 5% higher ratio of gifts and donations. Overall, the “Retired” experience level had the highest gifts and donations ratio.
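
A grouped scatter plot of this kind can be sketched with ggplot2 as below. The data frame `toy`, its column names, and the simulated values are hypothetical, and the per-level trend lines are an assumption about how the rates of growth might be compared, not a record of the original plotting code.

```r
library(ggplot2)

set.seed(2)  # arbitrary seed for the simulated data

# Hypothetical data: income and a gifts ratio that rises with income
toy <- data.frame(
  avg_income = runif(200, 5000, 90000),
  Experience = factor(sample(c("Low", "Medium", "High", "Retired"),
                             200, replace = TRUE))
)
toy$gifts_ratio <- 10 + toy$avg_income / 3000 + rnorm(200, sd = 4)

p <- ggplot(toy, aes(x = avg_income, y = gifts_ratio, colour = Experience)) +
  geom_point(alpha = 0.6) +
  # one straight-line fit per experience level, to compare slopes
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Average income ($)", y = "Gifts and donations ratio (%)")

print(p)
```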

Moreover, we created density plots, histograms, and a combined density histogram for the average income across all observations. These revealed a heavy right (positive) skew: most observations sit at the low end of the income range, with a long tail towards higher incomes. This distribution indicated that a substantial portion of the population was earning less than $30,000 annually, while only a modest number of observations reported an annual income greater than $60,000. Specifically, around 150 observations, the highest count of any bin, reported an income of approximately $10,000 annually. In contrast, fewer than 25 observations reported more than $90,000 annually.
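
A combined density histogram, together with a quick numerical check of the direction of skew, can be sketched as follows. The simulated incomes, the bin width, and the colours are assumptions made for illustration; `after_stat()` requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(3)  # arbitrary seed for the simulated data

# Hypothetical incomes with most mass at the low end and a long upper tail
toy <- data.frame(avg_income = rlnorm(933, meanlog = log(20000), sdlog = 0.7))

p <- ggplot(toy, aes(x = avg_income)) +
  # histogram on the density scale so the density curve can be overlaid
  geom_histogram(aes(y = after_stat(density)), binwidth = 5000,
                 fill = "grey80", colour = "white") +
  geom_density(colour = "steelblue") +
  labs(x = "Average income ($)", y = "Density")

# moment-based sample skewness: a positive value means a long right tail
skewness <- mean((toy$avg_income - mean(toy$avg_income))^3) /
  sd(toy$avg_income)^3

c(mean = mean(toy$avg_income), median = median(toy$avg_income),
  skewness = skewness)
```

For a distribution with a long upper tail, the mean sits above the median and the skewness is positive.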

Finally, we created a scatter matrix showing the relationships between the different combinations of variables. The correlations were also broken down by experience level. The most notable positive correlations were between the ratio of people who reported medicare levy and average income, and between the ratio of individuals who reported gifts and donations and the ratio of those who reported medicare levy. Conversely, the most pronounced negative correlations were between the average amount reported for gifts and donations and the ratio of individuals who reported self education expenses, and between the ratio of people who reported self education expenses and the ratio of people who reported supplement income.
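
The numbers behind a scatter matrix like this one can be reproduced with base R alone. The sketch below uses an invented data frame `toy` with hypothetical column names; it computes the overall correlation matrix, the correlations within each experience level, and a basic scatter matrix.

```r
set.seed(4)  # arbitrary seed for the simulated data
n <- 200
income <- rnorm(n)

# Hypothetical predictors: one positively and one negatively related to income
toy <- data.frame(
  avg_income     = income,
  medicare_ratio = 0.8 * income + rnorm(n, sd = 0.5),
  see_ratio      = -0.5 * income + rnorm(n, sd = 0.8),
  Experience     = factor(sample(c("Low", "Medium", "High", "Retired"),
                                 n, replace = TRUE))
)

num <- toy[sapply(toy, is.numeric)]      # numeric predictors only

overall  <- round(cor(num), 2)           # Pearson correlations, all rows
by_level <- lapply(split(num, toy$Experience),
                   function(d) round(cor(d), 2))  # one matrix per level

overall

# scatter matrix with points coloured by experience level
pairs(num, col = as.integer(toy$Experience))
```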

As displayed in the scatter matrix, these relationships are not very strong. This is likely due to the aggregated nature of the dataset we acquired. However, we still wanted to test whether this dataset would produce acceptable results when predictive modelling was carried out.

##    High     Low  Medium Retired 
##     288     213     288     144

Predictive Modelling

The predictive models we chose were Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), and k-Nearest Neighbors (KNN). Before running the models, we created training and validation data from the subset of 560 random observations, with 140 samples for each level. We used a 75-25 training to validation ratio, which left 35 validation observations per level.

##    High     Low  Medium Retired 
##      35      35      35      35
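
A stratified 75-25 split that yields these validation counts can be sketched in base R as follows. The data frame `balanced`, its single predictor `x1`, and the seed are hypothetical; if caret is being used, its `createDataPartition()` function offers a comparable stratified split.

```r
set.seed(5)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical balanced data: 140 rows per experience level
balanced <- data.frame(
  x1         = rnorm(560),
  Experience = factor(rep(c("High", "Low", "Medium", "Retired"), each = 140))
)

# sample 75% of the row indices within each level
train_idx <- unlist(lapply(
  split(seq_len(nrow(balanced)), balanced$Experience),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

train_set <- balanced[train_idx, ]
valid_set <- balanced[-train_idx, ]

table(valid_set$Experience)
```

With 140 rows per level, 75% is exactly 105 training rows, leaving 35 validation rows per level and 140 validation rows in total.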

We chose the LDA model for its efficiency in distinguishing between and predicting different classes. However, a notable drawback of this model is its assumption of normally distributed predictors, which leads to suboptimal results when the assumption is not met. As seen in the density histogram, the distribution was not normal, likely a consequence of our dataset’s aggregated nature. Overall, the model accuracy was 74.29%.

Similarly, the SVM model is well suited to classification tasks. It is also less prone to over-fitting, which is desirable in machine learning. Despite its computational demands, especially with larger datasets, the SVM model achieved an overall accuracy of 79.29%.

To conclude our model evaluation, KNN, like the LDA and SVM models, is well suited to classification tasks. However, as with the SVM model, its downside is computation time, which can be slow for larger datasets. Nonetheless, the model accuracy was 77.86%.
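
Validation accuracy is the share of validation rows whose predicted class matches the true class. The sketch below shows the calculation for LDA and KNN using the MASS and class packages that ship with R. It is not the code we ran: the built-in `iris` data stands in for the tax data, the seed and `k = 5` are arbitrary, and SVM is left out only because it needs an additional package.

```r
library(MASS)    # lda()
library(class)   # knn()

set.seed(6)  # arbitrary seed, only so the sketch is reproducible

# stratified 75-25 split of the stand-in data
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                     function(i) sample(i, size = floor(0.75 * length(i)))))
train_set <- iris[idx, ]
valid_set <- iris[-idx, ]

# LDA: fit on the training rows, predict the validation rows
fit_lda  <- lda(Species ~ ., data = train_set)
pred_lda <- predict(fit_lda, newdata = valid_set)$class

# KNN: z-score both sets using the training means and sds, then classify
mu  <- colMeans(train_set[, 1:4])
sdv <- apply(train_set[, 1:4], 2, sd)
train_x  <- scale(train_set[, 1:4], center = mu, scale = sdv)
valid_x  <- scale(valid_set[, 1:4], center = mu, scale = sdv)
pred_knn <- knn(train = train_x, test = valid_x, cl = train_set$Species, k = 5)

# accuracy = proportion of correct predictions
accuracy <- function(pred, truth) mean(pred == truth)
acc <- c(LDA = accuracy(pred_lda, valid_set$Species),
         KNN = accuracy(pred_knn, valid_set$Species))
acc

# confusion matrix for LDA
table(Predicted = pred_lda, Actual = valid_set$Species)
```

On a validation set of 140 observations, the reported accuracies of 74.29%, 79.29%, and 77.86% correspond to 104, 111, and 109 correct predictions respectively.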

The mean accuracy of each of the three predictive models is visualized below. Notably, the Support Vector Machine emerged with the highest mean accuracy at 78.98%, followed closely by k-Nearest Neighbours at 78.56%. In contrast, Linear Discriminant Analysis yielded the lowest result at 74.82% mean accuracy, although this remains a respectable predictive score.
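
Mean accuracy across resamples can be obtained with the caret package as sketched below. This assumes caret is installed along with MASS and kernlab, uses the built-in `iris` data as a stand-in, and picks 10-fold cross-validation and the radial SVM kernel arbitrarily; none of these settings is confirmed as the configuration behind the figures above.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)  # assumed resampling scheme

# same seed before each call so all three models see the same folds
set.seed(7)
fit_lda <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl)
set.seed(7)
fit_svm <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)
set.seed(7)
fit_knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

# collect the per-fold results for a side-by-side comparison
res <- resamples(list(LDA = fit_lda, SVM = fit_svm, KNN = fit_knn))
summary(res)

# mean accuracy over the folds for each model
mean_acc <- sapply(list(LDA = fit_lda, SVM = fit_svm, KNN = fit_knn),
                   function(f) mean(f$resample$Accuracy))
mean_acc
```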

Evaluation of Results

Unfortunately, we were unable to produce reliable variable importance values. All three models output identical predictor importance values for every class of our target variable, so the results cannot be used to compare how each model weighs the predictors. These results could be a product of the models performing suboptimally with summarized data. Another possible explanation is that, for models without a built-in importance measure, the varImp function in the caret package falls back on a model-independent, ROC-based measure that depends only on the data, which would give the same values whichever model is passed in. The output below shows the results of the varImp function for each of the models, along with a representative plot of what the variable importance plot looks like for all three models.

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                           High    Low Medium Retired
## Gifts & Donations       100.00 100.00 100.00   46.94
## Supplement Income Ratio  85.07  85.07  85.07   24.14
## SEE Ratio                68.92  79.82  59.25   79.82
## Gifts & Donations Ratio  72.50  72.50  72.50   19.83
## SEE                      59.88  68.10  32.06   68.10
## Medicare                 58.19  58.19  64.38   25.88
## Child Support Ratio      56.71  52.68  63.29   56.71
## Child Support            24.72  54.18  62.48   54.18
## Income                   53.25  53.25  53.25    0.00
## Medicare Ratio           34.81  34.81  39.13   21.58
## Supplement Income        18.45  17.39  11.95   18.45
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                           High    Low Medium Retired
## Gifts & Donations       100.00 100.00 100.00   46.94
## Supplement Income Ratio  85.07  85.07  85.07   24.14
## SEE Ratio                68.92  79.82  59.25   79.82
## Gifts & Donations Ratio  72.50  72.50  72.50   19.83
## SEE                      59.88  68.10  32.06   68.10
## Medicare                 58.19  58.19  64.38   25.88
## Child Support Ratio      56.71  52.68  63.29   56.71
## Child Support            24.72  54.18  62.48   54.18
## Income                   53.25  53.25  53.25    0.00
## Medicare Ratio           34.81  34.81  39.13   21.58
## Supplement Income        18.45  17.39  11.95   18.45
## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##                           High    Low Medium Retired
## Gifts & Donations       100.00 100.00 100.00   46.94
## Supplement Income Ratio  85.07  85.07  85.07   24.14
## SEE Ratio                68.92  79.82  59.25   79.82
## Gifts & Donations Ratio  72.50  72.50  72.50   19.83
## SEE                      59.88  68.10  32.06   68.10
## Medicare                 58.19  58.19  64.38   25.88
## Child Support Ratio      56.71  52.68  63.29   56.71
## Child Support            24.72  54.18  62.48   54.18
## Income                   53.25  53.25  53.25    0.00
## Medicare Ratio           34.81  34.81  39.13   21.58
## Supplement Income        18.45  17.39  11.95   18.45
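
One way to check whether identical importance tables come from the data rather than from the models is to compare `varImp()` across two fitted models and against caret's model-independent `filterVarImp()`. The sketch below assumes caret and MASS are installed and uses the built-in `iris` data as a stand-in; the seed and the 5-fold cross-validation are arbitrary.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)  # assumed resampling scheme

set.seed(8)
fit_lda <- train(Species ~ ., data = iris, method = "lda", trControl = ctrl)
set.seed(8)
fit_knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

# importance as reported for each fitted model (scaled to 0-100)
imp_lda <- varImp(fit_lda)$importance
imp_knn <- varImp(fit_knn)$importance

# model-independent, ROC-based importance computed from the data alone
imp_filter <- filterVarImp(x = iris[, 1:4], y = iris$Species)

# TRUE here means the two models were given the same importance values
isTRUE(all.equal(imp_lda, imp_knn))
imp_filter
```

If the comparison returns TRUE, the importance values describe each predictor's ability to separate the classes on its own and say nothing about how a particular model uses it.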

However, this does not mean that the data was entirely void of value. Our scatter matrix shows some usable correlations between variables, which hint at their importance in determining which experience bin an observation belongs to. For example, Medicare Ratio with Average Income and Gifts Ratio with Medicare Ratio show the strongest positive correlations, while Average Gifts with SEE Ratio and SEE Ratio with Supplement Income Ratio show the most pronounced negative correlations.

Conclusion

In conclusion, while our models did not provide reliable predictor importance, they offer insightful comparative data on the variables and their effects, and they classified the experience bins with a mean accuracy above 74% for every model. This data could give governments, charities, cities, and other decision making bodies valuable insight into identifying and understanding vulnerable populations and the reasons they may be vulnerable, and help them build effective avenues and solutions for a healthier socioeconomic society. With better raw data and more tailored variables and models, this type of machine learning holds promise for assisting many who are marginalized or fall through the cracks of society, and for illuminating areas where support is most needed.

References:

Alboukadel. (2019, December 25). How to change ggplot labels: Title, axis and legend. Datanovia. https://www.datanovia.com/en/blog/how-to-change-ggplot-labels/

Australian Taxation Office. (2014). Individual tax - Table 2: Selected items, by state/territory, age, gender, taxable status and lodgment method, 2011–12 income year [Data set]. https://data.gov.au/dataset/ds-dga-95d9e550-8b36-4273-8df7-2b76c140e73a/distribution/dist-dga-10356b15-37f7-4f6e-80f9-df7f2db5017b/details?q=

Edwards, S. (2023). Lecture notes. Nova Scotia Community College - COGS.

Holtz, Y. (n.d.). Reorder a variable with ggplot2. The R Graph Gallery. https://r-graph-gallery.com/267-reorder-a-variable-in-ggplot2.html