1. ABSTRACT


The purpose of this study is to evaluate the accuracy of two machine learning classification algorithms, Decision Tree and Random Forest, when attempting to classify individuals who may be likely to suffer from Autism Spectrum Disorder.

This study details the data cleaning and pre-processing steps taken to prepare the data-set for classification such as dealing with missing values, outlier removal, variable selection, and the partitioning of the data into training and testing subsets. It also walks through the validation and evaluation methods used when choosing the most appropriate models for each algorithm.

Evaluation of the predictions of the two classification algorithms indicates that while both algorithms predicted the binary target moderately well, the Random Forest model was significantly more accurate than the Decision Tree model, resulting in fewer false negative and false positive predictions.

The authors of this paper noted that the available data is overwhelmingly biased towards a negative response to the target variable. They have concluded that further study on a much larger, balanced data-set, utilising several different classification algorithms, would be required to gain more insight into this data and possibly improve prediction accuracy.

2. INTRODUCTION


In recent times, the application of Machine Learning to cross-disciplinary subjects has been very active and successful, especially in the fields of biology and neurology. Many researchers are interested in creating computational frameworks for automatically discovering patterns and trends in large medical data-sets. A learned data representation can help visualise data to assist humans in clinical decision making and predict a target variable from a set of input features (Bone, Goodwin, Black, Lee, Audhkhasi and Narayanan, 2014).

The data selected for this project is the Autistic Spectrum Disorder (ASD) screening data for adults. ASD refers to several related disorders that normally begin in childhood and continue into adulthood. There is no cure for ASD, but treatments can help to improve symptoms (HSE, 2017).

ASD symptoms can vary from person to person, and the condition can be classified into three main types. The most typical type is “autistic disorder”, followed by “Asperger syndrome” and “pervasive developmental disorder” (PDD), the third of which is also known as “atypical autism”. ASD is estimated to affect 1 in every 100 children, and boys are four times more likely to develop ASD than girls (HSE, 2017).

Research and studies on the classification of autism-related data have been conducted mainly by clinical experts and data scientists. A few ASD studies have analysed functional connectivity MRI (fcMRI) scan data to classify whether a scan comes from an ASD participant or a typically developing participant, based solely on functional connectivity (Chen, Keown, Jahedi, Nair, Pflieger, Bailey and Muller, 2015).

Another stream is to develop diagnostic algorithms using machine learning on human behaviour data. The Autism Diagnostic Interview-Revised and the Autism Diagnostic Observation Schedule demonstrated a certain level of usefulness of objective machine learning methods for diagnosing autism (Lord, Risi, Lambrecht, Cook, Leventhal, DiLavore, Pickles and Rutter, 2000).

3. CLASSIFICATION ALGORITHMS


Classification is a technique to predict which group a certain instance belongs to. To create classifiers, we learn from a given training data-set and evaluate on test samples, so that it is possible to predict which class a new instance belongs to. For Witten et al. (2016), “classification is sometimes called supervised because, in a sense, the scheme operates under supervision by being provided with the actual outcome for each of the examples.”

3.1. DECISION TREE

The decision tree is a visual representation that is used as part of a selection criterion, or even to support the selection of specific data, considering the overall structure. It represents choices and their results in the form of a tree. It can start with simple questions that have two or more answers, each leading to a further question, and so on. This supports identifying and classifying the data. Decision trees are widely used in Data Mining and Machine Learning applications, including in R (Brown, 2012).

The example below shows a decision tree in which an incoming error condition can be classified.

fig 3.1. - Decision Tree (Brown, 2012)

A decision tree divides the data through a series of nodes, each of which represents an attribute, down to the leaf nodes. In a nutshell, a decision tree is a splitting method that is applied to demonstrate every possible outcome of a decision (Jain, 2016).

The name decision tree already implies the meaning of the technique. From the root to the leaves, it can predict and classify outcomes, with each split leading to a new question. The tree is conventionally drawn upside down, with the root at the top representing the original data-set and the leaves indicating the outcomes. Zhang (2016) affirms the following: “Because the parent population can be split in numerous patterns, we are interested in the one with the greatest purity. In technical terminology, purity can be described by entropy.”

The complexity parameter (cp) is used to control the size of the tree and to select the optimal tree size. Tree construction stops when a proposed split does not improve the overall fit by at least the value of cp. In effect, it stipulates how the cost of a tree is penalised by the number of terminal nodes (Williams, 2010).
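As a rough illustration of how cp behaves in R (a minimal sketch, assuming the rpart package and a cleaned data frame named dataASDclean with a factor target Class_ASD, as prepared later in this report), a tree can be grown with a small cp and then pruned back:

# Grow a deliberately large tree, inspect candidate cp values, then prune
library(rpart)

fit <- rpart(Class_ASD ~ ., data = dataASDclean, method = "class",
             control = rpart.control(cp = 0.001))

printcp(fit)                      # cross-validated error for each candidate cp value
pruned <- prune(fit, cp = 0.01)   # a larger cp penalises additional terminal nodes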

3.2. RANDOM FOREST

The Random Forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates a forest, which in turn grows a large number of random decision trees analysing sets of variables (Polamuri, 2017).

Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random set of the training data points. This increases diversity in the forest leading to more robust overall predictions and the name ‘random forest.’

Random Forest works by splitting the data set into three parts:

  • Training
  • Testing
  • Cross validation

Training

The training process uses a separate subset of the data. Training works through a series of questions, each of which is used to narrow down the range of the data until we arrive at a prediction.

Even a simple query, such as predicting tomorrow’s maximum temperature, requires working through an entire series of questions (e.g. what season it is, the season’s maximum temperature, the lowest temperature, yesterday’s temperature, the average seasonal temperature, etc.) to arrive at an answer.

Humans usually ask questions that could have multiple answers. This is not the case for a decision tree implemented in machine learning: it lists all possible alternatives to every question and answers every question in True/False form. This concept can be tough to grasp because it is not how humans naturally think. The number of queries has a diminishing effect, and each query needs to be relevant to your data-set. This branching of questions is also what produces the tree structure.

The model has no prior knowledge of the data and learns everything from the data provided to it, which can also include historic data.

The fundamental idea behind a random forest is to combine many decision trees into a single model. Individually, predictions made by decision trees (or humans) may not be accurate, but combined, the predictions will be closer to the mark on average (Koehrsen, 2017).

Testing

The trained model is then evaluated against the test data. Each tree in the Random Forest may predict a different value from the same input values, and the forest’s final prediction is decided by the most common prediction among the trees. This concept of voting is known as majority voting.

Cross Validation

Cross-validation is a technique used to protect against overfitting in a predictive model, especially in cases where the amount of data may be limited. It is used to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equally sized subsamples (English, 2018). This process is sometimes called rotation estimation.

In cross-validation, you make a fixed number of folds (or partitions) of the data, run the analysis on each fold, and then average the overall error estimate (www-bcf.usc.edu, 2018).
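To make the idea concrete, the sketch below shows how k folds could be built and averaged over manually in R (illustrative only; it assumes the caret and rpart packages and a generic data frame df with a factor target y, not the exact code used later in this study):

# Minimal sketch of 10-fold cross-validation with a classification tree
library(caret)
library(rpart)

folds <- createFolds(df$y, k = 10)              # 10 held-out index sets
errs <- sapply(folds, function(test_idx) {
  fit  <- rpart(y ~ ., data = df[-test_idx, ], method = "class")
  pred <- predict(fit, df[test_idx, ], type = "class")
  mean(pred != df$y[test_idx])                  # misclassification rate on this fold
})
mean(errs)                                      # error estimate averaged across the folds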

The Random Forest algorithm has three concepts:

  • Bootstrap
  • Bagging
  • Decision Trees

Bootstrap

Bootstrap methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample to obtain additional information about the fitted model.

There are two types of statistics, descriptive and inferential. Bootstrapping uses inferential statistics, which are produced through calculations that allow scientists to infer trends about a larger population based on a study of a sample taken from it.

Inferential statistics start with a sample and then generalizes to a population. This information about a population is not stated as a number. Instead, scientists express these parameters as a range of potential numbers, along with a degree of confidence (ThoughtCo., 2018).

The bootstrap samples with replacement from the original sample; because of the replacement, no two bootstrap samples are likely to be identical. Bootstrapping allows some data points to be duplicated and others to be omitted. Computers can create thousands of bootstrap samples in a relatively short time. The technique has been around since a 1979 paper by Bradley Efron and has increased in popularity as computing power increased and costs reduced.
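A minimal sketch of bootstrap resampling in base R (the sample x below is hypothetical, not data from this study):

# Bootstrap estimate of the standard error of a sample mean
set.seed(42)
x     <- rnorm(100, mean = 30, sd = 5)                     # hypothetical sample
boots <- replicate(2000, mean(sample(x, replace = TRUE)))  # resample with replacement
sd(boots)                                                  # bootstrap standard error
quantile(boots, c(0.025, 0.975))                           # rough 95% interval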

Bagging

Bagging is used to improve the performance of a predictive model. It does this by taking multiple samples, with replacement, from the training set. This method is most useful when the predictors are not stable, i.e. when the random samples drawn from the training set are very different from one another. The greater variability should hopefully lead to a greater chance of better results (Paruchuri, 2018). When the samples are very similar, the use of bagging is superfluous.
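A rough sketch of bagging with classification trees is shown below (assuming the rpart package and training/testing data frames named train and test with a factor target Class_ASD; these names are illustrative, not the objects used later in this report):

# Fit trees to bootstrap samples of the training data, combine predictions by majority vote
library(rpart)

n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(train), replace = TRUE)    # bootstrap sample of the rows
  rpart(Class_ASD ~ ., data = train[idx, ], method = "class")
})

# Each tree votes on the test set; the most common class wins
votes <- sapply(trees, function(t) as.character(predict(t, test, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))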

MTRY

In Random Forest, mtry is the number of variables available for splitting at each tree node; the default value of this parameter depends on which R package is used to fit the model (Cran.r-project.org, 2018).

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that utilise bootstrap aggregating (bagging) to sub-sample the data used for training. The OOB error is the mean prediction error on each training sample x1, using only the trees that did not have x1 in their bootstrap sample.

The random selection of variables at each node splitting step is what makes it a random forest, as opposed to just a bagged estimator. As noted in The Elements of Statistical Learning (p. 588, second edition), reducing mtry (the number of random variables considered at each split) reduces both the correlation between trees and the strength of the individual trees, and increasing it increases both.

Somewhere in between is an “optimal” range of mtry, which is usually quite wide. Using the out-of-bag (OOB) error rate, a value of mtry in this range can quickly be found. This is the only adjustable parameter to which random forests are somewhat sensitive.
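One way to search this range is to compare the OOB error across candidate mtry values, as in the hedged sketch below (assuming the randomForest package and a training data frame named train with a factor target Class_ASD):

# Compare OOB error for several candidate mtry values
library(randomForest)

for (m in c(2, 5, 10, 14, 20, 27)) {
  rf  <- randomForest(Class_ASD ~ ., data = train, mtry = m, ntree = 500)
  oob <- rf$err.rate[nrow(rf$err.rate), "OOB"]   # OOB error after the final tree
  cat("mtry =", m, " OOB error =", round(oob, 4), "\n")
}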

fig 3.2. - Random Forest Simplified (Koehrsen, 2018)

4. THE DATA-SET


Our chosen data-set, entitled “Autism Screening Adult Data Set”, is an open source data-set available from The UCI Machine Learning Repository. It was donated to the repository on the 24th of December 2017 by Dr. Fadi Fayez Thabtah (fadi.fayez@manukau.ac.nz), Department of Digital Technology, Manukau Institute of Technology, Auckland, New Zealand.

The data-set consists of 704 observations of 21 variables. There are just two numeric variables present (age and result), with the remaining variables being categorical and binary in nature. The data describes ASD screening results, some of which appear to have been harvested from an app developed by Dr. Fadi Fayez named “ASDQuiz”, which is available for both Android and iOS devices.

The raw data-set contains ten binary variables representing the screening questions (A1_Score to A10_Score), as well as the categorical variables of gender, ethnicity, jaundice, autism, country_of_res, used_app_before, age_desc, relation and Class/ASD. There are also two numeric variables named age and result.

Please see the official description of the variables below. Note, we have included the actual questions associated with the A1_Score to A10_Score variables. The questions ask for a simple binary answer of either agree or disagree.

Variables Description

  • Age: Age in years
  • Gender: Male or female
  • Ethnicity: List of common ethnicities in text format
  • Born with Jaundice: Whether the case was born with jaundice
  • Family member with PDD: Whether any immediate family member has a PDD
  • Who is completing the test: Parent, self, caregiver, medical staff, clinician, etc.
  • Country of Residence: List of countries in text format
  • Used the screening app before: Whether the user has used the screening app before
  • Screening Method Type: Type of screening method chosen based on age category
  • Question 1 Answer: I often notice small sounds when others do not
  • Question 2 Answer: I usually concentrate more on the whole picture, rather than the small details
  • Question 3 Answer: I find it easy to do more than one thing at once
  • Question 4 Answer: If there is an interruption, I can switch back to what I was doing very quickly
  • Question 5 Answer: I find it easy to read between the lines when someone is talking to me
  • Question 6 Answer: I know how to tell if someone listening to me is getting bored
  • Question 7 Answer: When I’m reading a story I find it difficult to work out the characters’ intentions
  • Question 8 Answer: I like to collect information about categories of things (e.g. types of cars, types of birds, types of trains, types of plants, etc.)
  • Question 9 Answer: I find it easy to work out what someone is thinking or feeling just by looking at their face
  • Question 10 Answer: I find it difficult to work out people’s intentions
  • Screening Score: Final score obtained based on the scoring algorithm of the screening method used

We found three analyses of the same data-set. All of them were written by Dr. Kanad Basu and they aim to explore several competing supervised machine learning techniques, such as Decision Trees, Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Naïve Bayes, Logistic Regression, Linear Discriminant Analysis (LDA) and Multi-Layer Perceptron (MLP), to solve the classification problem of predicting whether an adult individual with certain characteristics has Autistic Spectrum Disorder (ASD) (Basu, 2018).

5. DATA EXPLORATION


names(dataASD)
 [1] "A1_Score"        "A2_Score"        "A3_Score"       
 [4] "A4_Score"        "A5_Score"        "A6_Score"       
 [7] "A7_Score"        "A8_Score"        "A9_Score"       
[10] "A10_Score"       "age"             "gender"         
[13] "ethnicity"       "jundice"         "austim"         
[16] "contry_of_res"   "used_app_before" "result"         
[19] "age_desc"        "relation"        "Class/ASD"      

The first step in the analysis of the raw data-set is to explore it. By doing this we may discover changes that need to be made that will make the data-set easier to work with and could potentially increase the accuracy of our results.

Below you can see the first five rows of the data-set. This gives a good overview of what a typical observation from the data looks like.

Table continues below
A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score
1 1 1 1 0 0 1
1 1 0 1 0 0 0
1 1 0 1 1 0 1
1 1 0 1 0 0 1
1 0 0 0 0 0 0
Table continues below
A8_Score A9_Score A10_Score age gender ethnicity jundice
1 0 0 26 f White-European no
1 0 1 24 m Latino no
1 1 1 27 m Latino yes
1 0 1 35 f White-European no
1 0 0 40 f NA no
Table continues below
austim contry_of_res used_app_before result age_desc relation
no United States no 6 18 and more Self
yes Brazil no 5 18 and more Self
yes Spain no 8 18 and more Parent
yes United States no 6 18 and more Self
no Egypt no 2 18 and more NA
Class/ASD
NO
NO
YES
NO
NO

We can also check the structure of the data-set. Here we can see the levels of our categorical variables. From this we can see that there are some spelling errors in the variable names (jundice, austim, contry_of_res), and that the Class/ASD variable will cause trouble due to the illegal forward slash character.

The age_desc variable appears to have just one level, making it a likely candidate to be dropped from the data set. Also, the contry_of_res variable has 67 levels, which is too many for the Random Forest algorithm to handle, so this can probably be dropped as well.

'data.frame':   704 obs. of  21 variables:
 $ A1_Score       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 2 2 ...
 $ A2_Score       : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
 $ A3_Score       : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 2 ...
 $ A4_Score       : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 1 2 1 2 ...
 $ A5_Score       : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 2 1 ...
 $ A6_Score       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
 $ A7_Score       : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 1 1 1 2 ...
 $ A8_Score       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 1 2 2 ...
 $ A9_Score       : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 2 2 ...
 $ A10_Score      : Factor w/ 2 levels "0","1": 1 2 2 2 1 2 1 1 2 1 ...
 $ age            : num  26 24 27 35 40 36 17 64 29 17 ...
 $ gender         : Factor w/ 2 levels "f","m": 1 2 2 1 1 2 1 2 2 2 ...
 $ ethnicity      : Factor w/ 11 levels "Asian","Black",..: 11 4 4 11 NA 7 2 11 11 1 ...
 $ jundice        : Factor w/ 2 levels "no","yes": 1 1 2 1 1 2 1 1 1 2 ...
 $ austim         : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 1 1 1 2 ...
 $ contry_of_res  : Factor w/ 67 levels "Afghanistan",..: 65 14 57 65 23 65 65 44 65 10 ...
 $ used_app_before: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ result         : num  6 5 8 6 2 9 2 5 6 8 ...
 $ age_desc       : Factor w/ 1 level "18 and more": 1 1 1 1 1 1 1 1 1 1 ...
 $ relation       : Factor w/ 5 levels "Health care professional",..: 5 5 3 5 NA 5 5 3 5 1 ...
 $ Class/ASD      : Factor w/ 2 levels "NO","YES": 1 1 2 1 1 2 1 1 1 2 ...

Before proceeding any further with the exploration, we will first fix some of the variable names to aid in readability.

Here, we have truncated the variable names of the A1_Score to A10_Score variables to A1 to A10. Also, we have corrected the spelling errors/illegal characters in the austim, jundice, contry_of_res and Class/ASD variables.
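A minimal sketch of how such a renaming could be carried out in base R (assuming the data frame is named dataASD, as in the output above):

# Truncate the question score names and fix the misspelled/illegal variable names
names(dataASD) <- gsub("_Score$", "", names(dataASD))    # A1_Score -> A1, ..., A10_Score -> A10
names(dataASD)[names(dataASD) == "jundice"]       <- "jaundice"
names(dataASD)[names(dataASD) == "austim"]        <- "autism"
names(dataASD)[names(dataASD) == "contry_of_res"] <- "country"
names(dataASD)[names(dataASD) == "Class/ASD"]     <- "Class_ASD"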

Let’s have a look at the corrected variable names.

 [1] "A1"              "A2"              "A3"             
 [4] "A4"              "A5"              "A6"             
 [7] "A7"              "A8"              "A9"             
[10] "A10"             "age"             "gender"         
[13] "ethnicity"       "jaundice"        "autism"         
[16] "country"         "used_app_before" "result"         
[19] "age_desc"        "relation"        "Class_ASD"      

Let’s also check the levels of our categorical variables. While they are mostly binary in nature, the relation, ethnicity and country variables have more than 2 levels. We will be dropping the country variable later, so let’s have a look at relation.

[1] "Health care professional" "Others"                  
[3] "Parent"                   "Relative"                
[5] "Self"                    

And ethnicity. There appears to be a duplicate category of others which will need to be corrected.

 [1] "Asian"           "Black"           "Hispanic"       
 [4] "Latino"          "Middle Eastern " "others"         
 [7] "Others"          "Pasifika"        "South Asian"    
[10] "Turkish"         "White-European" 

As we will be attempting to classify whether or not an individual may have ASD, it would be useful to know the proportion of the data-set that has been classed YES or NO. Here, we can see that 26.85% of the observations present have been classed as potentially having ASD, with 73.15% being classed as not having ASD. This indicates that there is a large bias towards the No class of our target variable within this data-set.

Class_ASD count
NO 0.7315
YES 0.2685
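One way such a proportion table can be produced is sketched below (assuming the dplyr and pander packages used elsewhere in this report and the renamed data frame dataASD):

# Proportion of each class of the target variable
library(dplyr)
library(pander)

dataASD %>%
  count(Class_ASD) %>%
  mutate(count = round(n / sum(n), 4)) %>%
  select(Class_ASD, count) %>%
  pander()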

By using the summary function, we can see some basic descriptive statistics for the variables. Here, we can see that we have a number of missing values in the ethnicity, relation and age variables. Also, there appears to be an impossibly large Max. value present in the age variable (383), possibly due to a typing error.

Table continues below
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
0:196 0:385 0:382 0:355 0:353 0:504 0:410 0:247 0:476 0:300
1:508 1:319 1:322 1:349 1:351 1:200 1:294 1:457 1:228 1:404
Table continues below
age gender ethnicity jaundice autism
Min. : 17.0 f:337 White-European :233 no :635 no :613
1st Qu.: 21.0 m:367 Asian :123 yes: 69 yes: 91
Median : 27.0 Middle Eastern : 92
Mean : 29.7 Black : 43
3rd Qu.: 35.0 South Asian : 36
Max. :383.0 (Other) : 82
NA’s :2 NA’s : 95
Table continues below
country used_app_before result age_desc
United States :113 no :692 Min. : 0.000 18 and more:704
United Arab Emirates: 82 yes: 12 1st Qu.: 3.000
India : 81 Median : 4.000
New Zealand : 81 Mean : 4.875
United Kingdom : 77 3rd Qu.: 7.000
Jordan : 47 Max. :10.000
(Other) :223
relation Class_ASD
Health care professional: 4 NO :515
Others : 5 YES:189
Parent : 50
Relative : 28
Self :522
NA’s : 95

6. DATA CLEANING


Now that we have transformed the data-set into a much more easily readable state, we can begin to clean the data of unwanted information to prepare it for use with our chosen machine learning algorithms.

6.1. MISSING VALUES

The output below displays a count of the observations with missing values for each variable. There are 95 each for ethnicity and relation, and 2 for age.
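A count like this can be produced with, for example, the base R sketch below (assuming the renamed data frame dataASD):

# Number of missing values in each variable
colSums(is.na(dataASD))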

Table continues below
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 age gender ethnicity
0 0 0 0 0 0 0 0 0 0 2 0 95
Table continues below
jaundice autism country used_app_before result age_desc relation
0 0 0 0 0 0 95
Class_ASD
0

The missing values are predominantly categorical. This makes it difficult to create replacement values, as we cannot substitute the mean or median for non-numeric variables.

The best course of action is to remove all of the observations containing missing values. This is done by creating a copy of our data frame and using the na.omit function to strip out all of the NAs.
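A minimal sketch of this step (the cleaned copy is referred to as dataASDclean in the rest of this report):

# Copy the data frame and drop every row that contains a missing value
dataASDclean <- na.omit(dataASD)
sum(is.na(dataASDclean))   # confirm that no NAs remain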

The output below shows a count of missing values on the data frame and confirms that there are no longer any NAs present.

Table continues below
A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 age gender ethnicity
0 0 0 0 0 0 0 0 0 0 0 0 0
Table continues below
jaundice autism country used_app_before result age_desc relation
0 0 0 0 0 0 0
Class_ASD
0

Counting the observations present in the new data frame without missing values reveals that there are now 609 observations, confirming that we have indeed dropped 95 observations.

n
609

6.2. OUTLIER VALUES

You may recall that we had discovered an impossibly large Max. value of 383 in the age variable. Given that there are already many typographical errors present in the data-set, it is reasonable to assume that this too is the result of a typing error and the intended value was 38.

The code below replaces the value of 383 with 38. Looking at the summary statistics now, we can see that the Max. age is now 64, which is much more in line with what is expected from this data-set.

# Fix outlier in age column. Replace value of 383 with 38
dataASDclean$age[dataASDclean$age == 383] <- 38
pander(summary(dataASDclean$age))
Min. 1st Qu. Median Mean 3rd Qu. Max.
17 22 27 29.65 35 64

6.3. DUPLICATES

There is also a duplicated category in the ethnicity variable. The categories of Others and others are obviously intended to be one value.

The code below amalgamates both values into just one level of the ethnicity factor.

# Fix duplicated "others" category in the ethnicity variable
levels(dataASDclean$ethnicity) <- gsub("others", "Others", levels(dataASDclean$ethnicity))
levels(dataASDclean$ethnicity)
 [1] "Asian"           "Black"           "Hispanic"       
 [4] "Latino"          "Middle Eastern " "Others"         
 [7] "Pasifika"        "South Asian"     "Turkish"        
[10] "White-European" 

7. DATA VISUALISATION


With the missing values, outlier values and duplicates taken care of, we can now start to visually explore the data that we will be working with.

7.1. BOXPLOTS - AGE

The boxplot below represents the age distribution of males and females and their class, ASD YES/ASD NO. We can see that age ranges from the late teens to the early to mid 60s for both genders. Males appear to have a wider distribution of individuals classed as having ASD.
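For reference, a boxplot of this kind could be produced along the lines of the sketch below (assuming the ggplot2 package; this is an approximation of the figure, not necessarily the exact code used):

# Age distribution by gender and ASD class
library(ggplot2)

ggplot(dataASDclean, aes(x = gender, y = age, fill = Class_ASD)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Age", fill = "ASD class")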

fig 7.1. - Boxplots - Age

7.2. BOXPLOTS - RESULT

This boxplot represents the distribution of the screening result scores for both males and females and their class, ASD YES/ASD NO. There is no difference in the distribution between males and females, which indicates that the screening criteria treats both genders equally. We can also see that a score of about 7 results in an individual being classed as having ASD.

fig 7.2. - Boxplots - Result

Querying the data with the code below confirms that a score of 7 or more results in an individual being classed as having ASD. The code simply returns a count of results that are 7 or more, and a count of the observations with Class_ASD = YES. As you can see, they are exactly the same.

# Confirm that a result of 7 or more is classified as ASD
c1 <- dataASDclean %>% count(result >= 7)

c2 <- dataASDclean %>% count(Class_ASD == "YES")

comp_c1_c2 <- cbind(c1,c2)

pander(comp_c1_c2)
result >= 7 n Class_ASD == “YES” n
FALSE 429 FALSE 429
TRUE 180 TRUE 180

7.3. BAR CHART - ETHNICITY

This bar chart represents the proportions of each ethnicity present in the data. White-Europeans account for approximately one third of the data, followed by Asians and Middle Eastern people.

fig 7.3. - Bar Chart - Ethnicity

7.4. BAR CHARTS - JAUNDICE / RELATIVE

The bar chart on the left represents the relationship between a Jaundice diagnosis at birth, and being classed as having ASD. There is no significant relationship between the two present within this data-set. It is clear that the number of individuals diagnosed with Jaundice at birth is approximately the same for a classification of ASD = YES and ASD = NO.

The bar chart on the right represents the relationship between individuals that have a relative that has ASD and being classed as ASD themselves. Again, there does not appear to be any significant relationship between the two variables present in this data-set, as the number of individuals with relatives that have ASD is approximately the same for a classification of ASD = YES and ASD = NO.

fig 7.4. - Bar Charts - Jaundice / Relative

8. DATA PRE-PROCESSING


8.1. VARIABLE REDUCTION

There are a number of variables present within the data-set that do not offer any benefit to our analysis.

  • Country of Residence: In initial testing of the classification models, this variable did not display any significant importance to the accuracy of predictions. As a factor with more than 60 levels, it is too large to be processed by certain classification functions within the R environment. So, to ensure compatibility it will be removed.
  • Used App Before: This variable denotes whether an individual used the screening application or not. It is not a significant indicator of our target variable, so this will also be removed.
  • Age Description: This variable categorises the age range of an individual. Everyone over the age of 17 is classed as an adult. As all of our observations are for adults aged 17 and older, this factor has only one level, therefore it can offer no significant benefit to the analysis.
  • Result: As discussed previously in this report, a result value of 7 or more will always be classified as Class_ASD = YES. Therefore, including this variable would mean that the machine learning algorithms would essentially already have the outcome of the target variable. For the purposes of this analysis it will also be removed.

Below are the remaining variables.

 [1] "A1"        "A2"        "A3"        "A4"        "A5"       
 [6] "A6"        "A7"        "A8"        "A9"        "A10"      
[11] "age"       "gender"    "ethnicity" "jaundice"  "autism"   
[16] "relation"  "Class_ASD"

8.2. ONE HOT ENCODING

One hot encoding is a method by which non-ordinal categorical variables can be converted into numeric data. This may be a necessary pre-processing step depending on the type of classification you intend to carry out, as some algorithms cannot handle categorical data.

To put it simply, one hot encoding takes the levels of a factor variable and gives each of them their own numeric variable, where a 1 denotes a positive response and a 0 denotes a negative response (yes/no). For example, if you had a categorical variable made up of colours, one hot encoding would give each colour its own variable. A value of 1 in a colour’s variable would indicate that the colour does occur in an observation, while 0 indicates that it does not.

In the case of binary variables, such as gender, one hot encoding can be instructed not to create a separate variable for every level (as we have here, see fullRank = TRUE). Using the gender example, one hot encoding then creates a single numeric variable, gender.m, where a value of 1 denotes that the observation is male and a 0 denotes that it is female.

Although it was not necessary for our two chosen algorithms, we implemented one hot encoding early on in the process to give us more analysis options should we need them.

# One Hot Encoding. Transforms all categorical variables (except target variable) into numeric
dummy.vars <- dummyVars(~ ., data = dataASDclean[, -17], fullRank = TRUE)

# The output of predict is a matrix, change it to data frame
oneHotEncoding <- data.frame(predict(dummy.vars, newdata = dataASDclean))

# Append the original target variable (factor)
oneHotEncoding$Class_ASD <- dataASDclean$Class_ASD

Here we have the structure of the one hot encoded data-set.

'data.frame':   609 obs. of  28 variables:
 $ A1.1                     : num  1 1 1 1 1 0 1 1 1 1 ...
 $ A2.1                     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ A3.1                     : num  1 0 0 0 1 0 1 0 1 1 ...
 $ A4.1                     : num  1 1 1 1 1 0 1 0 1 1 ...
 $ A5.1                     : num  0 0 1 0 1 0 0 1 0 1 ...
 $ A6.1                     : num  0 0 0 0 0 0 0 0 1 1 ...
 $ A7.1                     : num  1 0 1 1 1 0 0 0 1 1 ...
 $ A8.1                     : num  1 1 1 1 1 1 0 1 1 1 ...
 $ A9.1                     : num  0 0 1 0 1 0 1 1 1 1 ...
 $ A1.0.1                   : num  0 1 1 1 1 0 0 1 0 1 ...
 $ age                      : num  26 24 27 35 36 17 64 29 17 33 ...
 $ gender.m                 : num  0 1 1 0 1 0 1 1 1 1 ...
 $ ethnicity.Black          : num  0 0 0 0 0 1 0 0 0 0 ...
 $ ethnicity.Hispanic       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity.Latino         : num  0 1 1 0 0 0 0 0 0 0 ...
 $ ethnicity.Middle.Eastern.: num  0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity.Others         : num  0 0 0 0 1 0 0 0 0 0 ...
 $ ethnicity.Pasifika       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity.South.Asian    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity.Turkish        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ethnicity.White.European : num  1 0 0 1 0 0 1 1 0 1 ...
 $ jaundice.yes             : num  0 0 1 0 1 0 0 0 1 0 ...
 $ autism.yes               : num  0 1 1 1 0 0 0 0 1 0 ...
 $ relation.Others          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ relation.Parent          : num  0 0 1 0 0 0 1 0 0 0 ...
 $ relation.Relative        : num  0 0 0 0 0 0 0 0 0 1 ...
 $ relation.Self            : num  1 1 0 1 1 1 0 1 0 0 ...
 $ Class_ASD                : Factor w/ 2 levels "NO","YES": 1 1 2 1 2 1 1 1 2 2 ...

8.3. PARTITIONING

We will be partitioning the data-set into separate training and testing subsets. Two thirds of the data will be randomly sampled and allocated to the training set, and one third will be allocated to the testing set. While we will be using Repeated K-Fold Cross Validation to aid in choosing the most appropriate model, partitioning beforehand allows us to validate the model’s accuracy by feeding it the unseen testing set.
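A sketch of how such a split could be made with the caret package is shown below (assuming the one hot encoded data frame oneHotEncoding created above; the seed, the exact proportion and the object names trainASD/testASD are illustrative):

# Roughly two-thirds training, one-third testing, stratified on the target variable
library(caret)

set.seed(123)                                   # illustrative seed
inTrain  <- createDataPartition(oneHotEncoding$Class_ASD, p = 2/3, list = FALSE)
trainASD <- oneHotEncoding[inTrain, ]
testASD  <- oneHotEncoding[-inTrain, ]
c(Training_Observations = nrow(trainASD), Testing_Observations = nrow(testASD))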

The output below shows the number of observations present in each of the subsets.

Training_Observations Testing_Observations
391 218

9. CONCEPTS & TECHNIQUES


Here, we will briefly describe the concepts and techniques used in the interpretation and evaluation of our classification results.

9.1 OVERFITTING vs. UNDERFITTING

Underfitting occurs when a model is too simple. It has too few features or is regularised too much, which makes it inflexible in learning from the data-set.

Overfitting is where the model works well on the training data and gives a high accuracy rate, but when presented with new data it does not provide accurate results. Too many features can result in overfitting to noise. When combined with bootstrapping and bagging, we can find the optimum number of features to use (if graphed, this would appear as a peak). In predictive modelling, you can think of the “signal” as the true underlying pattern that you wish to learn from the data.

“Noise,” on the other hand, refers to the irrelevant information or randomness in a data-set. Here’s where machine learning comes in. A well-functioning ML algorithm will separate the signal from the noise. If the algorithm is too complex or flexible (e.g. it has too many input features or it’s not properly regularized), it can end up “memorizing the noise” instead of finding the signal. This overfit model will then make predictions based on that noise. It will perform unusually well on its training data but very poorly on new, unseen data (EliteDataScience, 2018).

9.2. K-FOLD CROSS VALIDATION

Generally, a classifier is learned from training data using a classifier learning algorithm. Each classifier has an associated prediction error, also called the true error. Usually, the true error is unknown, cannot be calculated, and must be estimated from data. This estimate is called the estimated prediction error (Rodriguez, Perez and Lozano, 2010). There are several estimators of the classification error, such as bootstrap, resubstitution, hold-out and more.

To measure the performance of the classifiers used in this study, K-Fold Cross Validation was employed. In K-Fold Cross Validation, the data is partitioned into K subsets. A portion of the data is reserved for testing (prediction) and the remaining data is used for training the model (Bone, Goodwin, Black, Lee, Audhkhasi and Narayanan, 2014). In this study, we have used 10-Fold Cross Validation repeated 5 times on 2 different machine learning algorithms (Decision Tree and Random Forest). 10-Fold Cross Validation uses 90% of the data as a training set and 10% as a test set, rotating through all 10 folds, and this entire procedure is then repeated 5 times.

The code below depicts the creation of a trainControl object using the caret package. This object will be called when executing the train function on our training set and will process the algorithm using repeated k-Fold cross validation.

# Set up Repeated K-Fold Cross Validation
trctrl <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 5,
                       classProbs=TRUE,
                       summaryFunction=twoClassSummary)
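For illustration, this trainControl object could then be passed to train roughly as follows (a sketch for the Decision Tree case; the training set name trainASD is an assumption carried over from the partitioning sketch above):

# Train a Decision Tree (CART), selecting the model with the highest specificity
set.seed(123)
dtModel <- train(Class_ASD ~ ., data = trainASD,
                 method = "rpart",
                 metric = "Spec",        # twoClassSummary reports ROC, Sens and Spec
                 trControl = trctrl)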

9.3. CONFUSION MATRIX

The confusion matrix, also known as an error matrix, is a table that describes how well a classification model is predicting its target variable. A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes in the data; in other words, it is a concise summary of the prediction results for a classification problem.

For Susmaga (2004), the confusion matrix is very useful as it, “provides much more detailed information on the results of the test than the mere accuracy or error. It shows which classes were classified properly or almost properly and which where misclassified/confused with other classes and in what degree”.

The performance is frequently evaluated using the data in the matrix, allowing the algorithms to be compared visually. The matrix is N by N, where N is the number of classes, with predicted classes (output classes) set against actual classes (target classes). A predictive model’s performance cannot be taken for granted, so it is very important that it is measured. Below, the confusion matrix for a two-class classifier (Kohavi and Provost, 1998) is shown:

Actual \ Predicted   Negative   Positive
Negative             a          b
Positive             c          d

(Kohavi and Provost, 1998)

  • a: True negatives (TN) - the number of correct negative predictions.
  • b: False positives (FP) - the number of incorrect positive predictions.
  • c: False negatives (FN) - the number of incorrect negative predictions.
  • d: True positives (TP) - the number of correct positive predictions.

For the purposes of our analysis we are mostly concerned with the following metrics from our confusion matrix output (written out in terms of the table above immediately after this list):

  1. Specificity: How often did our model correctly predict NO?
  2. Sensitivity: How often did our model correctly predict YES?
  3. Kappa: A comparison of the observed accuracy vs. the expected accuracy. Put simply, it takes the accuracy of random chance into consideration.
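For reference, in terms of the cell counts a, b, c and d defined above, these metrics can be written as:

Specificity = TN / (TN + FP) = a / (a + b)
Sensitivity = TP / (TP + FN) = d / (d + c)
Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)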

9.4. ROC CURVE

The receiver operating characteristic (ROC) curve, which is defined as a plot of test sensitivity as the y coordinate versus its 1-specificity or false positive rate (FPR) as the x coordinate, is an effective method of evaluating the performance of diagnostic tests (Park, Goo and Jo, 2004).

The area under the ROC curve (AUC), or the equivalent Gini index, is a widely used measure of performance of supervised classification rules (Hand and Till, 2001). This study also utilises ROC curves and the area under the ROC curve to visualise the performance of our prediction models.
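A hedged sketch of how an ROC curve and its AUC could be produced in R is given below (it assumes the pROC package, a fitted caret model dtModel and a testing set testASD with a Class_ASD column; these object names are illustrative):

# ROC curve and AUC from predicted class probabilities on the test set
library(pROC)

probYes <- predict(dtModel, newdata = testASD, type = "prob")[, "YES"]
rocObj  <- roc(response = testASD$Class_ASD, predictor = probYes, levels = c("NO", "YES"))
plot(rocObj)    # sensitivity against 1 - specificity
auc(rocObj)     # area under the ROC curve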

10. RESULTS


We can now begin to train and test our models. Because we are dealing with the classification of a medical condition, we have decided to set our target metric as Specificity. This is a measure of how often our models correctly predicted a NO response.

The reasoning behind this is that we want to reduce the number of false negative predictions as much as possible. In the context of our classification, a false negative would mean that someone who is classed as not having ASD would in fact have ASD and may end up not receiving the care that they require.

Our cross validation process will select the value of the tuning parameters for each of our models that results in the highest specificity score.

However, we will also be paying close attention to the Sensitivity metric, which measures the number of correctly predicted YES responses. False positives (an individual classed as having ASD when they do not) can also have quite serious ramifications. In the context of our study, however, it is likely that these individuals would simply undergo some additional testing until they are correctly diagnosed, as opposed to false negative predictions, where individuals could suffer from a lack of care.

10.1. DECISION TREE RESULTS

Below is the output from the training phase of our Decision Tree model. Here, we can see that our repeated k-Fold cross validation, using specificity as our target metric, has determined that the highest specificity of approximately 76% occurred with a Complexity Parameter (CP) of 0.1735537. We can now test the performance of the model’s predictions using our testing set.

CART 

391 samples
 27 predictor
  2 classes: 'NO', 'YES' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 352, 352, 352, 352, 352, 352, ... 
Resampling results across tuning parameters:

  cp          ROC        Sens       Spec     
  0.03305785  0.8627612  0.9362963  0.7314103
  0.17355372  0.8204511  0.8829630  0.7562821
  0.49586777  0.6458191  0.9074074  0.3842308

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.1735537.

This plot is a visual representation of the above information. It shows us that model specificity drops as the complexity parameter increases.

fig 10.1. - Accuracy of Complexity Parameter (CP)

Here, we can see the level of importance the model has assigned to the variables.

rpart variable importance

  only 20 most important variables shown (out of 27)

                          Overall
A9.1                        70.98
A6.1                        67.56
A5.1                        60.81
A4.1                        44.46
A3.1                        31.98
jaundice.yes                 0.00
ethnicity.Others             0.00
ethnicity.Hispanic           0.00
ethnicity.Turkish            0.00
ethnicity.Black              0.00
autism.yes                   0.00
relation.Self                0.00
gender.m                     0.00
ethnicity.Middle.Eastern.    0.00
A1.0.1                       0.00
relation.Relative            0.00
age                          0.00
relation.Parent              0.00
ethnicity.Pasifika           0.00
A7.1                         0.00

Plotting this information allows us to see that the model assigned the most significant importance to positive responses to the binary variables of:

  • A9: I find it easy to work out what someone is thinking or feeling just by looking at their face?
  • A6: I know how to tell if someone listening to me is getting bored?
  • A5: I find it easy to read between the lines when someone is talking to me?

These three variables, which are questions answered during the ASD screening process, are all related to social interactions. Difficulty with social interactions is a key symptom of ASD, so we can infer from this that answering yes to these three questions is a strong contributor to the model’s prediction of the target variable.

fig 10.2. - D-Tree - Variable Importance

And here we have the actual decision tree.

fig 10.3. - D-Tree

To test the prediction accuracy of the model, we can now introduce new unseen data in the form of our testing data-set. Below is the confusion matrix output after performing predictions on the test set.
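The predictions and confusion matrix could be generated along these lines (a sketch; the model and test set names follow the assumptions used in the earlier sketches, while predASDdt matches the object named in the output below):

# Predict on the unseen test set and build the confusion matrix
predASDdt <- predict(dtModel, newdata = testASD)
confusionMatrix(predASDdt, testASD$Class_ASD, positive = "NO")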

We can see that our target metric of specificity is approximately 80%, which is quite close to that of 76% achieved during the training phase.

Sensitivity is 85%. Kappa is 61%, which, given the unbalanced nature of our data-set, is a more realistic measure than accuracy and is considered to be reasonably good.

Confusion Matrix and Statistics

         
predASDdt  NO YES
      NO  136  12
      YES  23  47
                                          
               Accuracy : 0.8394          
                 95% CI : (0.7839, 0.8856)
    No Information Rate : 0.7294          
    P-Value [Acc > NIR] : 8.481e-05       
                                          
                  Kappa : 0.6158          
 Mcnemar's Test P-Value : 0.09097         
                                          
            Sensitivity : 0.8553          
            Specificity : 0.7966          
         Pos Pred Value : 0.9189          
         Neg Pred Value : 0.6714          
              Precision : 0.9189          
                 Recall : 0.8553          
                     F1 : 0.8860          
             Prevalence : 0.7294          
         Detection Rate : 0.6239          
   Detection Prevalence : 0.6789          
      Balanced Accuracy : 0.8260          
                                          
       'Positive' Class : NO              
                                          

Plotting the confusion matrix, we can see the model’s correct predictions in green, and incorrect predictions in red.

  • The model correctly predicted a class of “NO” 136 times (True Negative), and incorrectly predicted “NO” 12 times (False Negative).
  • The model correctly predicted a class of “YES” 47 times (True Positive), and incorrectly predicted “YES” 23 times (False Positive).

Here we can see that we have 12 false negative predictions. That is 12 people who may have ASD, but were classed as not having ASD, or to look at it another way, 12 people who may not receive the care they require.

Furthermore, we have 23 false positive predictions. That is 23 people who were classed as having ASD when they do not have ASD. These 23 people may end up having to undergo additional ASD testing that they do not require, possibly at great inconvenience and cost to themselves.

fig 10.4. - D-Tree - Confusion Matrix

The ROC curve visually represents the prediction performance of the model. For reference, a model that can predict with 100% accuracy would show a line that rises at a right angle, going all the way up to 1.0 sensitivity before turning sharply to the right. Conversely, if the curve hugs the diagonal line, its prediction accuracy is closer to 50%, or no better than random chance.

fig 10.5. - D-Tree - ROC Curve

10.2. RANDOM FOREST RESULTS

Below is the output from the training phase of our Random Forest model. Here, we can see that our repeated k-Fold cross validation, using specificity as our target metric, has determined that the highest specificity of approximately 92% occurred with an mtry (variables per split) of 14. We can now test the performance of the model’s predictions using our testing set.
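For reference, a sketch of how this Random Forest training could be set up with caret is given below (the tuning grid values match the output that follows; the object names trainASD and trctrl are assumptions carried over from earlier sketches):

# Train a Random Forest, tuning mtry with the same repeated cross-validation scheme
set.seed(123)
rfModel <- train(Class_ASD ~ ., data = trainASD,
                 method = "rf",
                 metric = "Spec",
                 tuneGrid = expand.grid(mtry = c(2, 14, 27)),
                 trControl = trctrl)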

Random Forest 

391 samples
 27 predictor
  2 classes: 'NO', 'YES' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 352, 352, 352, 352, 351, 352, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec     
   2    0.9960613  0.9925926  0.8591026
  14    0.9903965  0.9614815  0.9239744
  27    0.9861491  0.9548148  0.9124359

Spec was used to select the optimal model using the largest value.
The final value used for the model was mtry = 14.

fig 10.6. - Accuracy of Randomly Selected Predictors

Checking the variable importance that the Random Forest model has assigned, we can see that, just like the Decision Tree model, it has selected positive responses to A9, A6 and A5 as the three most important variables when classifying the target variable of Class_ASD. It should be noted, however, that the amount of importance attributed to each of them differs significantly from the Decision Tree model. A9, for example, appears to carry much more relative importance in this model.

rf variable importance

  only 20 most important variables shown (out of 27)

                          Overall
A9.1                      41.5150
A6.1                      28.8997
A5.1                      28.2429
A4.1                      10.9226
A1.0.1                     8.4912
A1.1                       7.3603
A2.1                       7.0332
age                        6.4495
A3.1                       6.0163
A8.1                       4.4875
A7.1                       4.3194
ethnicity.White.European   3.2762
ethnicity.Middle.Eastern.  1.5154
gender.m                   1.3342
relation.Self              1.1672
autism.yes                 1.0019
relation.Parent            0.8579
jaundice.yes               0.8544
ethnicity.Black            0.5773
relation.Relative          0.5155

fig 10.7. - R-Forest - Variable Importance

Testing our Random Forest model on the unseen testing data, we see that our target metric of specificity is approximately 86%, which is lower than the 92% achieved during the training phase.

Sensitivity is 94%. Kappa is 79%. Overall, this model appears to be performing better than the Decision Tree model.

Confusion Matrix and Statistics

         
predASDrf  NO YES
      NO  149   8
      YES  10  51
                                          
               Accuracy : 0.9174          
                 95% CI : (0.8726, 0.9503)
    No Information Rate : 0.7294          
    P-Value [Acc > NIR] : 2.892e-12       
                                          
                  Kappa : 0.7931          
 Mcnemar's Test P-Value : 0.8137          
                                          
            Sensitivity : 0.9371          
            Specificity : 0.8644          
         Pos Pred Value : 0.9490          
         Neg Pred Value : 0.8361          
              Precision : 0.9490          
                 Recall : 0.9371          
                     F1 : 0.9430          
             Prevalence : 0.7294          
         Detection Rate : 0.6835          
   Detection Prevalence : 0.7202          
      Balanced Accuracy : 0.9008          
                                          
       'Positive' Class : NO              
                                          

Plotting the confusion matrix, we can see the Random Forest model’s correct predictions in green, and incorrect predictions in red.

  • The model correctly predicted a class of “NO” 149 times (True Negative), and incorrectly predicted “NO” 8 times (False Negative).
  • The model correctly predicted a class of “YES” 51 times (True Positive), and incorrectly predicted “YES” 10 times (False Positive).

So, this time we have 8 false negative predictions, slightly fewer than the Decision Tree.

We also have far fewer false positive predictions, at 10 compared to the Decision Tree's 23.

fig 10.8. - R-Forest - Confusion Matrix

Plotting the ROC curve for the Random Forest model, we can see that the line comes a lot closer to the right angle shape mentioned previously. Compared to the Decision Tree model, this curve indicates a stronger prediction accuracy.

fig 10.9. - Random Forest ROC Curve

11. CONCLUSION


Our goal with this study was to apply supervised machine learning algorithms (Decision Tree and Random Forest) to a data-set derived from Autism Spectrum Disorder research, with the hope of classifying new observations into the categories of “Has ASD” or “Does Not Have ASD” (in the case of this study, these observations would be new individuals who have undergone the ASD screening process).

We cleansed the data-set by correcting outlier values, removing observations containing missing values and dropping unnecessary variables. The loss of 95 observations is not ideal. However, it is difficult to replace missing values of a categorical nature as there is no median or mean to work with. While there are methods to impute missing categorical data, we decided that this course of action was beyond the scope of this project.

Using a repeated K-Fold cross validation to aid in selecting a model, as well as a partition of our data into training and testing subsets, we were able to build two machine learning models and confirm that they are capable of predicting the target variable of Class_ASD with a moderate/good level of accuracy when given new data.

Given that it takes the expected accuracy of the given data (which is biased towards the class of NO) into account, the Kappa score (Decision Tree: 61%, Random Forest: 79%) is a good choice of metric for comparison. With an increase of almost 20 percentage points over the Decision Tree, it is clear that the Random Forest model is superior.

However, given that both models resulted in a considerable number of both false negative and false positive predictions, we must accept that they are not yet reliable predictors of the target variable. In a real world situation, these errors could result in a lack of care where needed, unnecessary ASD testing/medical bills, or even legal implications for the body that performed the predictions.

We conclude that it may be possible to improve the accuracy of predicting the target variable by collecting a lot more data from which a more balanced data-set, with equal representation of both Class_ASD = YES/NO, could be created. Retraining these algorithms (and other classifiers) on such a data-set would give a clearer picture of the prediction possibilities of this data.

12. REFERENCES


  1. Bone, D., Goodwin, M.S., Black, M., Lee, C., Audhkhasi, K. and Narayanan, C. (2014). ‘Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises’. [Online]. Springer Science+Business Media, New York. Available from: https://vdocuments.site/applying-machine-learning-to-facilitate-autism-diagnostics-pitfalls-and-promises.html. [Accessed 21st April 2018].

  2. Brown, M. (2012). ‘Data Mining and Techniques’. [Online]. IBM developerWorks. Available from: https://www.ibm.com/developerworks/library/ba-data-mining-techniques/. [Accessed 16th April 2018].

  3. Chen, C., Keown, C., Jahedi, A., Nair, A., Pflieger, M.E., Bailey, B. and Muller, R. (2015). ‘Diagnostic classification of intrinsic functional connectivity highlights somatosensory, default mode, and visual regions in autism’. NeuroImage: Clinical, 8, pp. 238-245. [Online]. Available from: http://dx.doi.org/10.1016/j.nicl.2015.04.002. [Accessed 22nd April 2018].

  4. Hand, D. and Till, R. (2001). ‘A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems’. Machine Learning, 45(2), pp. 171-186. Kluwer Academic Publishers, Hingham, MA, USA.

  5. HSE (2017). ‘Asperger syndrome (see Autistic spectrum disorders)’. [Online]. Health Service Executive. Available from: https://www.hse.ie/eng/health/az/a/asperger-syndrome/adults-living-with-autism.html. [Accessed 18th April 2018].

  6. Jain, S. (2016). ‘Analysis and Application of Data Mining Methods used for Customer Churn in Telecom Industry’. [Online]. LinkedIn. Available from: https://www.linkedin.com/pulse/analysis-application-data-mining-methods-used-customer-saurabh-jain/. [Accessed 16th April 2018].

  7. Kohavi, R. and Provost, F. (1998). ‘Glossary of terms. Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process’. [Online]. Kluwer Academic Publishers. Available from: http://robotics.stanford.edu/~ronnyk/glossary.html. [Accessed 22nd April 2018].

  8. Lord, C., Risi, S., Lambrecht, L., Cook, E.H. Jr, Leventhal, B.L., DiLavore, P.C., et al. (2000). ‘The Autism Diagnostic Observation Schedule-Generic: A Standard Measure of Social and Communication Deficits Associated with the Spectrum of Autism’. Journal of Autism and Developmental Disorders, 30(3), pp. 205-223.

  9. Susmaga, R. (2004). ‘Confusion Matrix Visualization’. In: Intelligent Information Processing and Web Mining, Advances in Soft Computing. [Online]. Springer. Available from: https://link.springer.com/chapter/10.1007/978-3-540-39985-8_12. [Accessed 22nd April 2018].

  10. Park, S.H., Goo, J.M. and Jo, C.H. (2004). ‘Receiver Operating Characteristic (ROC) Curve: Practical Review for Radiologists’. [Online]. KoreaMed. Available from: https://www.synapse.koreamed.org/search.php?where=aview&id=10.3348/kjr.2004.5.1.11&code=0068KJR&vmode=FULL. [Accessed 22nd April 2018].

  11. Rodriguez, J.D., Perez, A. and Lozano, J.A. (2010). ‘Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation’. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), pp. 569-575.

  12. Williams, G. (2010). ‘Complexity (cp)’. [Online]. Togaware. Available from: http://datamining.togaware.com/survivor/Complexity_cp.html. [Accessed 20th April 2018].

  13. Witten, I., Frank, E., Hall, M. and Pal, C. (2016). ‘Data Mining: Practical Machine Learning Tools and Techniques’. 4th Ed. [Online]. Morgan Kaufmann. Available from: https://books.google.ie/books?id=1SylCgAAQBAJ&lpg=PP1&dq=classification%20data%20mining%20technique&lr&pg=PP1#v=onepage&q=classification%20concept&f=false. [Accessed 22nd April 2018].

  14. Paruchuri, V. (2018). ‘Improve Predictive Performance in R with Bagging’. [Online]. R-bloggers. Available from: https://www.r-bloggers.com/improve-predictive-performance-in-r-with-bagging/. [Accessed 22nd April 2018].

  15. ThoughtCo. (2018). ‘The Difference Between Descriptive and Inferential Statistics’. [Online]. Available from: https://www.thoughtco.com/differences-in-descriptive-and-inferential-statistics-3126224. [Accessed 18th April 2018].

  16. Koehrsen, W. (2018). ‘Random Forest Simple Explanation’. [Online]. Medium. Available from: https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d. [Accessed 18th April 2018].

  17. Cran.r-project.org (2018). ‘randomForest’ package documentation. [Online]. Available from: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf. [Accessed 25th April 2018].

  18. Basu, K. (2018). ‘Machine learning approaches to the classification problem for autism spectrum disorder’. [Online]. GitHub. Available from: https://github.com/kbasu2016/Autism-Detection-in-Adults/blob/master/report.pdf. [Accessed 29th April 2018].

  19. Basu, K. (2018). ‘Autism Screening Adult Data Set: A Machine Approach’. [Online]. GitHub. Available from: https://github.com/kbasu2016/Autism-Detection-in-Adults/blob/master/report.pdf. [Accessed 29th April 2018].