1 Background

The basis of the problem analyzed in this work is a current argument being debated by a team of researchers lead by Prof. Michael Caliguri. These researchers are exploring the relationship between kinematics of handwritten text and the ability of forensic document examiners to determine if two documents were written by the same writer. However, a side debate has risen among them thinking whether or not statisticians and machine learning experts can build a reasonably accurate classifier that can predict three things at once; (1) The writer of a short note, (2) Among 6 phrases (‘Our London business is good’, ‘but Vienna and Berlin are quiet’, ‘Mr. Lloyd has gone to Switzerland’, ‘and I hope for good news’, ‘He will be there for a week’, ‘And then goes to Turin’), the phrase written by the writer (3) The type/style of writing used (Print or Script) based on the kinematic features recorded by MovAlyzeR. Hence, the objective of this project is to use the provided data to build a classifier that is reasonably accurate in predicting the three things. Two data sets (labeled and unlabeled) were provided. The labeled data has 30 columns with 103,311 rows and the unlabeled data has 150 rows with 27 columns.

2 Exploratory Analysis:

Summary analysis on the labeled data, table 5.1, shows the variable RelativeDurationofPrimary and RelativeSizeofPrimary appears insignificant in predicting the target variables. These variables were excluded from the data. A plot of correlation matrix on the numeric variables, fig 5.6 shows there is strong correlation among most of the predictors. Hence, variables with correlation absolute value greater than 0.75 were removed from the data, figure 5.7. Boxplots were used to explore the remaining predictors graphically in order to investigate if there exist any association with the three target variables, namely: “Group”, “Subject”, “Condition”. From figure 5.4 and figure 5.5, there exist an association between Group and Subject, and majority of the predictors. However, predictors such as StraightnessError, RelativePenDownDuration, Slant, RelativeInitialSlant, Seqment, RelativeTimeToPeakVerticalVelocity and AveragePenPressure seems most likely to be useful in the prediction. From figure 5.1 and figure 5.2, the distribution of most of the predictors are not normal except that of Slant, RelativeInitialSlant, RelativeTimeToPeakVerticalVelocity, AveragePenPressure, VerticalSize and PeakVerticalVelocity which appears to be normally distributed.The final labeled data used for the analysis has 1440 rows and 16 predictors.

3 Model Building Process:

Since the problem is multiclass classification. I approached it with three supervised classifiers including Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). The project was attempted in two ways. First, the three classifiers were built to predict the three target variables at once. Differently, they were also built to predict each of the target variables. The predictors were modeled in three ways. In one approach, I scaled the predictors to have a standard deviation of one and used it for the knn model. In another approach, Principal component analysis (PCA) was modeled on the predictors to determine the components significant in predicting the target variables (This was also used for the knn model). In the third approach, the data was not scaled and used for the svm and lda models. The function pr. comp () which is one of the functions in R that performs PCA was used. pr. comp () centers the variables to have mean value of zero and scale=TRUE scales the variables to have standard deviation of one. A plot of the proportion of variance explained (PVE) by each of the principal components and cumulative proportion of the variance explained by the components (figure 5.8) shows the first 12 principal components explains approximately 99 percent of the variance in the data. Separately, the three models were built on the scaled data and the first 12 principal components and their accuracy rates were compared. All the three classifiers were fitted with leave-one-out cross validation (LOOCV) to partition the data into train and test data, the performance of the models were compared and the best one used for prediction on the unlabeled data.The svm () function in the e1071 library was used to fit the support vector classifiers with two differnt arguments namely: kernel = “linear” and kernel = “radial”. A cost argument was used to specify the cost of violation to the margin.

4 Model Performance:

Among all the classifiers built to predict the three target variables at once, LDA had the highest accuracy rate of 51%, and svm with the argument kernel = “radial” had the least accuracy rate (30.5%), figure 5.10. “NB: The svm model built on the three combined target variables takes pretty long time to run, this was therefore not included in the analysis”. The classifiers were also modeled on the individual target variables (Group, Subject and Condition). The computed accuracy rate from the classifiers modeled with Groups shows svm kernel = “radial” recorded the highest accuracy rate (91.8%). svm kernel = “radial” also recorded the highest accuracy rate (71%) among all the classifiers when modeled with Subject as well as when modeled with Condition (53%). The accuracy rate of the other classifiers on Condition were statistically, figure 5.11.

5 Short Notes on the Models Used:

Principal Component Analysis (PCA): Principal component analysis is an exploratory tool employed in data analysis to study and visualize the correlations that exist between variables in a dataset and to pre-process the data before supervised techniques are applied. It is a non-parametric method of extracting relevant information from dataset with many variables. In exploratory data analysis, PCA focuses on reducing the dimensionality of data while retaining the variations present in the data. It computes a new set of dimensions such that all the dimensions are orthogonal and ranked according to the variance of data along them. Thus, the variation in the original dataset explained by each of the newly computed dimensions also termed principal components varies, decreasing along each subsequent principal component. The first principal component has the largest proportion of variance explained (PVE) in the data. The PVE associated with each of the principal components is a positive quantity. Deciding on the number of principal components to visualiz, interpret or use in a given data is often relative as there is no well accepted objective way to go. However, one must choose the smallest number of principal components that explains sizable amount of the variation captured in the data. There are several functions in R that performs PCA, one of which include prcomp () function. By default, this function centers all the variables in the given dataset to have mean zero, and scale=TRUE which is an option in the function, scales the variables to have standard deviation of one.

K-Nearest Neighbor Model (K-NN): K-Nearest Neighbor is a supervised learning algorithm used for classification and regression. When given dataset with known observations, KNN classifies the unlabeled observation in the given dataset according to the k number or the closest labeled observation. That is, KNN uses labeled observations in a given data to learn how to label other unlabeled observatioons. To assign a label to an unknown observation, KNN chooses the labeled observations nearest to the unknown observation and assign label according to the identity of those observations/neighbors. The output of a KNN classification represents a class membership. The value of k is a positive integer, typically small. In R, KNN is performed with the function knn () which is a part of the library, class. Compared to other model fitting functions which uses two-step approach (model fitting and prediction making from the model) to make predictions, the knn () function uses a single command with four inputs to form predictions.

Linear Discriminant Analysis (LDA): In Machine Learning and pattern classification applications, LDA is one of the many techniques that are used for data classification and dimensionality reduction. As dimensionality reduction technique, LDA transforms the features from higher dimensional space to lower dimensional space, thereby maximizing the ratio of the between-class variance to the within-class variance in a given dataset and guaranteeing maximum class separability. LDA classifier results from the assumption that observations within each class comes from a normal distribution with a class-specific mean vector and a common variance \({\sigma}^{2}\)

Support Vector Machine (SVM): SVM is a classification and regression prediction tool which makes use of machine learning algorithms to maximize predictive accuracy while automatically avoiding over-fit to the data. SVM uses line or hyperplanes as decision boundary to classify observations in a dataset by separating the data into different classes. A hyperplane in a q-dimensional space can be described as an q-1 dimensional subspace. For a given 2-dimension and 3-dimension space respectively, the hyperplane is a line and a plane dividing the dimensional space into halves. Mathematically, any hyperplane can be expressed as \({\beta_{0}} + {\beta_{1}x_1} + {\beta_{2}{x_2}} + \dots + {\beta_{𝑞}} \times {\beta_{𝑞}} = {0}\). When in 2-dimensions, the mathematical expression is given as \({\beta_{0}} + {\beta_{1}{𝑥_1}} + {\beta_{2}{x_2}} = {0}\). Applying hyperplane to a dataset splits the data into distinct classes and classifies any observation with unknow class according to the side of the hyperplane on which it lies. A given hyperplane is chosen on its correctness in separating most of the observations in the dataset into distinct classes. However, most often, some of the obsevations are misclassified. A solution to the misclassification problems that often araises is to aim at optimizing the margin between the distinct classes created by the hyperplane.

#Libraries that were used in the analysis
library(kableExtra)
library(reshape)
library(gridExtra)
library(knitr)
library(GGally)
library(MASS)
library(ggplot2)
library(caret)
library(class)
library(e1071)
library(tidyverse)

#setwd("C:/Users/Owner/OneDrive/R-STUDIO/STAT 602/FINAL PROJECT")
labeled<-read.csv("C:/Users/Owner/OneDrive/R-STUDIO/STAT 602/FINAL PROJECT/labeled.csv",stringsAsFactors = F)[, -1]
unlabeled<-read.csv("C:/Users/Owner/OneDrive/R-STUDIO/STAT 602/FINAL PROJECT/unlab.example.trial.csv",stringsAsFactors = F)[,-1]
dplyr::glimpse(labeled)

## Rows: 103,311
## Columns: 30
## $ Group                              <chr> "CUR", "CUR", "CUR", "CUR", "CUR", ~
## $ Subject                            <chr> "0", "0", "0", "0", "0", "0", "0", ~
## $ Condition                          <chr> "L1", "L1", "L1", "L1", "L1", "L1",~
## $ Trial                              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ Segment                            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ~
## $ Direction                          <int> 1, -1, 1, -1, 1, -1, 1, -1, 1, -1, ~
## $ Duration                           <dbl> 0.0894, 0.2604, 0.2688, 0.6222, 0.2~
## $ VerticalSize                       <dbl> 0.0826, -0.5291, 0.4285, -0.3950, 0~
## $ PeakVerticalVelocity               <dbl> 1.6658, -3.3324, 2.5150, -1.5824, 2~
## $ PeakVerticalAcceleration           <dbl> 53.6843, -40.1977, 56.8252, -23.308~
## $ HorizontalSize                     <dbl> -0.0946, -0.0362, 0.1273, 0.1741, 0~
## $ StraightnessError                  <dbl> 0.0604, 0.1225, 0.1451, 0.2099, 0.0~
## $ Slant                              <dbl> 2.3458, -1.6903, 1.3306, -1.5045, 0~
## $ LoopSurface                        <dbl> 0.0000, 0.0000, 0.0000, 0.0010, 0.0~
## $ RelativeInitialSlant               <dbl> -0.0996, -0.9133, -0.7585, 1.3047, ~
## $ RelativeTimeToPeakVerticalVelocity <dbl> 0.4940, 0.4960, 0.2349, 0.3777, 0.5~
## $ RelativePenDownDuration            <dbl> 1.0000, 1.0000, 1.0000, 0.2903, 1.0~
## $ RelativeDurationofPrimary          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ RelativeSizeofPrimary              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AbsoluteSize                       <dbl> 0.1256, 0.5303, 0.4470, 0.4316, 0.3~
## $ AverageAbsoluteVelocity            <dbl> 1.5377, 2.9476, 2.4795, 1.2614, 1.7~
## $ Roadlength                         <dbl> 0.1375, 0.7674, 0.6664, 0.7848, 0.3~
## $ AbsoluteyJerk                      <dbl> 580.494, 236.992, 370.418, 239.508,~
## $ NormalizedyJerk                    <dbl> 7.8102, 10.9292, 21.9435, 119.8060,~
## $ AverageNormalizedyJerkPerTrial     <dbl> 173.998, 173.998, 173.998, 173.998,~
## $ AbsoluteJerk                       <dbl> 617.420, 338.503, 471.810, 751.200,~
## $ NormalizedJerk                     <dbl> 8.3070, 15.6105, 27.9499, 375.7620,~
## $ AverageNormalizedJerkPerTrial      <dbl> 293.266, 293.266, 293.266, 293.266,~
## $ NumberOfPeakAccelerationPoints     <int> 1, 1, 4, 5, 3, 0, 2, 4, 8, 2, 4, 2,~
## $ AveragePenPressure                 <dbl> 338.2220, 489.3080, 629.0370, 99.70~

labeled[,1:3]<-lapply(labeled[,1:3],as.factor)

summary.stats <-round(as.data.frame(labeled[,-c(1:4)]%>% 
      psych::describe() %>%
          dplyr::select(n,mean, sd, median, min, max, range, se)),3)
            kableExtra::kable(summary.stats, 
      caption="Statistical distribution of the numeric variables in the labeled data",format ="pandoc")%>%
kableExtra::kable_styling(latex_option=c("hold_position"), full_width = F)

Table 5.1: Statistical distribution of the numeric variables in the labeled data
	n	mean	sd	median	min	max	range	se
Segment	103311	36.842	21.219	37.000	1.000	75.000	74.000	0.066
Direction	103311	-0.002	1.000	-1.000	-1.000	1.000	2.000	0.003
Duration	103311	0.157	0.113	0.132	0.012	7.169	7.157	0.000
VerticalSize	103311	-0.005	0.357	-0.002	-4.311	3.964	8.275	0.001
PeakVerticalVelocity	103311	-0.226	4.344	-0.061	-86.473	50.694	137.167	0.014
PeakVerticalAcceleration	103311	-9.907	2100.444	3.483	-509504.000	428441.000	937945.000	6.535
HorizontalSize	103311	0.136	0.381	0.102	-16.956	5.830	22.786	0.001
StraightnessError	103311	0.059	0.054	0.043	0.000	1.761	1.761	0.000
Slant	103311	-0.258	1.420	0.018	-3.141	3.142	6.283	0.004
LoopSurface	103311	0.001	0.028	0.000	-1.323	4.219	5.542	0.000
RelativeInitialSlant	103311	-0.036	0.432	-0.038	-3.092	3.100	6.192	0.001
RelativeTimeToPeakVerticalVelocity	103311	0.476	0.135	0.474	0.013	0.996	0.983	0.000
RelativePenDownDuration	103311	0.735	0.374	1.000	0.000	1.000	1.000	0.001
RelativeDurationofPrimary	103311	0.000	0.000	0.000	0.000	0.000	0.000	0.000
RelativeSizeofPrimary	103311	0.000	0.000	0.000	0.000	0.000	0.000	0.000
AbsoluteSize	103311	0.417	0.343	0.340	0.000	17.070	17.070	0.001
AverageAbsoluteVelocity	103311	3.301	2.267	2.962	0.000	189.784	189.784	0.007
Roadlength	103311	0.505	0.459	0.403	0.000	27.059	27.059	0.001
AbsoluteyJerk	103311	829.182	748.849	659.966	6.721	34146.700	34139.979	2.330
NormalizedyJerk	103311	34.240	705.431	8.811	0.004	127543.000	127542.996	2.195
AverageNormalizedyJerkPerTrial	103311	34.515	337.709	17.280	6.887	20909.900	20903.013	1.051
AbsoluteJerk	103311	1258.579	2052.388	972.555	10.597	176682.000	176671.403	6.385
NormalizedJerk	103311	73.312	3000.535	12.194	0.039	749816.000	749815.961	9.335
AverageNormalizedJerkPerTrial	103311	73.937	1116.676	29.104	11.179	84524.300	84513.121	3.474
NumberOfPeakAccelerationPoints	103311	1.563	1.441	1.000	0.000	76.000	76.000	0.004
AveragePenPressure	103311	381.704	256.278	399.900	0.000	1023.000	1023.000	0.797

grid.arrange(
labeled%>%ggplot() + geom_histogram(aes(x=AveragePenPressure), fill="#A4AA83", col="black")+ labs(title = "AveragePenPressure"),
labeled%>%ggplot() + geom_histogram(aes(x=Direction), fill="#A4AA83", col="black")+ labs(title = "Direction"),
labeled%>%ggplot() + geom_histogram(aes(x=PeakVerticalVelocity), fill="#A4AA83", col="black")+ labs(title = "PeakVerticalVelocity"),
labeled%>%ggplot() + geom_histogram(aes(x=RelativeInitialSlant), fill="#A4AA83", col="black")+ labs(title = "RelativeInitialSlant"),
labeled%>%ggplot() + geom_histogram(aes(x=RelativePenDownDuration), fill="#A4AA83", col="black")+ labs(title = "RelativePenDownDuration"),
labeled%>%ggplot() + geom_histogram(aes(x=RelativeTimeToPeakVerticalVelocity), fill="#A4AA83", col="black")+ labs(title = "RelativeTimeToPeakVerticalVelocity"),
labeled%>%ggplot() + geom_histogram(aes(x=Segment), fill="#A4AA83", col="black")+ labs(title = "Segment"),
labeled%>%ggplot() + geom_histogram(aes(x=Slant), fill="#A4AA83", col="black")+ labs(title = "Slant"),
labeled%>%ggplot() + geom_histogram(aes(x=StraightnessError), fill="#A4AA83", col="black")+ labs(title = "StraightnessError"),
labeled%>%ggplot() + geom_histogram(aes(x=VerticalSize), fill="#A4AA83", col="black")+ labs(title = "VerticalSize"),
ncol=2)

Figure 5.1: Histograms of Variables - Labeled Data

grid.arrange(
labeled%>%ggplot() + geom_histogram(aes(x=Duration), fill="#A4AA83", col="black")+ labs(title = "Duration"),
labeled%>%ggplot() + geom_histogram(aes(x=PeakVerticalAcceleration), fill="#A4AA83", col="black")+ labs(title = "PeakVerticalAcceleration"),
labeled%>%ggplot() + geom_histogram(aes(x=HorizontalSize), fill="#A4AA83", col="black")+ labs(title = "HorizontalSize"),
labeled%>%ggplot() + geom_histogram(aes(x=LoopSurface), fill="#A4AA83", col="black")+ labs(title = "LoopSurface"),
labeled%>%ggplot() + geom_histogram(aes(x=AbsoluteSize), fill="#A4AA83", col="black")+ labs(title = "AbsoluteSize"),
labeled%>%ggplot() + geom_histogram(aes(x=AverageAbsoluteVelocity), fill="#A4AA83", col="black")+ labs(title = "AverageAbsoluteVelocity"),
labeled%>%ggplot() + geom_histogram(aes(x=Roadlength), fill="#A4AA83", col="black")+ labs(title = "Roadlength"),
labeled%>%ggplot() + geom_histogram(aes(x=AbsoluteyJerk), fill="#A4AA83", col="black")+ labs(title = "AbsoluteyJerk"),
labeled%>%ggplot() + geom_histogram(aes(x=NormalizedyJerk), fill="#A4AA83", col="black")+ labs(title = "NormalizedyJerk"),
labeled%>%ggplot() + geom_histogram(aes(x=AverageNormalizedyJerkPerTrial), fill="#A4AA83", col="black")+ labs(title = "AverageNormalizedyJerkPerTrial"),
ncol=2)

Figure 5.2: Histogram of Variables - Labeled Data

grid.arrange(
labeled%>%ggplot() + geom_histogram(aes(x=AbsoluteJerk), fill="#A4AA83", col="black")+ labs(title = "AbsoluteJerk"),
labeled%>%ggplot() + geom_histogram(aes(x=NormalizedJerk), fill="#A4AA83", col="black")+ labs(title = "NormalizedJerk"),
labeled%>%ggplot() + geom_histogram(aes(x=AverageNormalizedJerkPerTrial), fill="#A4AA83", col="black")+ labs(title = "AverageNormalizedJerkPerTrial"),
labeled%>%ggplot() + geom_histogram(aes(x=NumberOfPeakAccelerationPoints), fill="#A4AA83", col="black")+ labs(title = "NumberOfPeakAccelerationPoints"),
labeled%>%ggplot() + geom_histogram(aes(x=Trial), fill="#A4AA83", col="black")+ labs(title = "Trial"),
labeled%>%ggplot() + geom_histogram(aes(x=AverageAbsoluteVelocity), fill="#A4AA83", col="black")+ labs(title = "AverageAbsoluteVelocity"),
ncol=2)

Figure 5.3: Histogram of Variables - Labeled Data

labeled%>%
    select(AveragePenPressure,Direction,PeakVerticalVelocity,RelativeInitialSlant,RelativePenDownDuration,
           RelativeTimeToPeakVerticalVelocity,Segment,Slant,StraightnessError,VerticalSize,Group) %>%
  gather(key   = "key", 
         value = "value",-Group) %>%
  
  #Distribution of the target variable (0=Good,1=Bad) among the numeric variables
  #Since no customer is indebted or has any standing delinquent amount, this variable together with AccountsDQ were removed from the data
  ggplot(aes(y = value)) +
  geom_boxplot(aes(fill = Group),
               alpha  = .6,
               fatten = .7) + 
  labs(x = "",
       y = "",
       title = "Boxplots of Group vs Significant Variables") +
  scale_fill_manual(
    values = c("turquoise2","turquoise4"),
    name   = "Group",
    labels = c("CUR", "PRI")) +
  theme_bw() +
  facet_wrap(~ key, 
             scales = "free", 
             ncol   = 5)

Figure 5.4: Boxplot of distribution of Group among some significant Variables

labeled%>%
    select(AveragePenPressure,Direction,PeakVerticalVelocity,RelativeInitialSlant,RelativePenDownDuration,
           RelativeTimeToPeakVerticalVelocity,Segment,Slant,StraightnessError,VerticalSize,Condition) %>%
  gather(key   = "key", 
         value = "value",-Condition) %>%
  
  #Distribution of the target variable (0=Good,1=Bad) among the numeric variables
  #Since no customer is indebted or has any standing delinquent amount, this variable together with AccountsDQ were removed from the data
  ggplot(aes(y = value)) +
  geom_boxplot(aes(fill = Condition),
               alpha  = .6,
               fatten = .7) + 
  labs(x = "",
       y = "",
       title ="Boxplots of Condition vs Significant Variables") +
  scale_fill_manual(
    values = c("turquoise","turquoise1","turquoise2","turquoise3","turquoise4","skyblue2"),
    name   = "Condition",
    labels = c("L1", "L2","L3","L4","L5","L6")) +
  theme_bw() +
  facet_wrap(~ key, 
             scales = "free", 
             ncol   = 5)

Figure 5.5: Boxplot of distribution of Condition among some significant Variables

df<-labeled[,-c(1:4,18:19)]
corr<-round(cor(df),1)
ggcorrplot::ggcorrplot(corr, hc.order = TRUE, outline.col = "white")

Figure 5.6: Correlation matrix of all the variables

lab<-labeled[,-(18:19)]#Creating a variable "Group" to contain the names of the known species
num_vars<-lab[,!names(lab) %in% c('Group','Subject','Condition','Trial')]
Matrix<- cor(num_vars)  #Computing correlation matrix on the input variables
Matrix[upper.tri(Matrix)]<-0
diag(Matrix)<-0        #Setting the upper triangle and perfectly correlated variables to 0
Non_corr_vars<-num_vars[,!apply(Matrix,2,function(x) any(x > abs(0.75)))]#Removing any input variable with correlation absolute value of 0.8 or above

corr<-round(cor(Non_corr_vars),1)
ggcorrplot::ggcorrplot(corr, hc.order = TRUE, outline.col = "white")

Figure 5.7: Correlation plot of less correlated variables

#The code in this chunk is replicated from what you took us through in the class.
labeled <- labeled %>%
  select(Group,Subject,Condition,Trial,Slant,HorizontalSize,RelativeInitialSlant,
         NormalizedJerk,NumberOfPeakAccelerationPoints, StraightnessError, Segment,
         PeakVerticalAcceleration,AveragePenPressure,LoopSurface,RelativeTimeToPeakVerticalVelocity,
         AbsoluteJerk,AbsoluteyJerk,Roadlength,AverageAbsoluteVelocity)

## Model performance function
perf.measure<-function(Preds, Resp){
  Preds<-as.character(Preds)
  Resp<-as.character(Resp)
  CV.tab.dat <- cbind(Preds, Resp)
  conf.tab <- xtabs(~Preds+Resp, CV.tab.dat)
  #accuracy rate
  accuracy.rate <- round(mean(Preds==Resp),2)
  return(list(accuracy.rate = accuracy.rate,conf.tab <- conf.tab))
}

labeled.means=NULL
Col.means=NULL
response=apply(labeled[,1:4], 1, paste, collapse = ":") 
#Combining the first four variables in the dataset which include the three target variables, the data was subseting using the column means of the variables.
uniq=unique(response)
for (i in uniq){
labeled.nrow=labeled[i==response, ]
colmean.nrow=colMeans(labeled.nrow[,-(1:4)])
Col.means=rbind(Col.means, labeled.nrow[1,(1:3)]) 
labeled.means=rbind(labeled.means, colmean.nrow)
}
dim(labeled.means)

## [1] 1440   15

collapsed<-data.frame(target=apply(Col.means[,1:3], 1, paste, collapse = ":"),labeled.means)
predictors<-as.data.frame(collapsed[,-1])
predictors$target<-as.factor(collapsed$target)


pr.out=prcomp(collapsed[,-1],scale.=TRUE)
df_pca<-as.data.frame(pr.out$x[,1:12])
df_pca$target<- collapsed$target

pr.var=pr.out$sdev^2
pve=pr.var/sum(pr.var)
cat("\nVariability Explained:Group:Subject:Condition(Principal Components):",(cumsum(pr.var/sum(pr.var))[12])*100)

## 
## Variability Explained:Group:Subject:Condition(Principal Components): 98.73105

if (require(gridExtra)){
Plot1<-ggplot(data = c(),aes(1:15,pve)) + geom_line() + geom_point() + labs(x="Principal Components",
                  y = "PVE ", title ="Proportion of Variance Explained") + ylim(0.00,0.1)+ theme_bw()
Plot2<-ggplot(data = c(),aes(1:15,cumsum(pve))) + geom_line() + geom_point() + labs(x="Principal Components",
      y = "Cumulative PVE ", title ="Cumulative Proportion Variance Explained") + ylim(0.00,1.0)+ theme_bw()
grid.arrange(Plot1,Plot2,ncol = 2)
}

Figure 5.8: Proportion of Variance Explained

#Fitting LDA Model and predicting with CV (leave-one-out cross validation)
lda_model<-lda(target~., data = predictors, CV = TRUE)

#LDA CV Model Performance on Predicting Group:Subject:Condition at once
lda_perf<-perf.measure(Preds=lda_model$class, Resp=predictors$target)

set.seed(12345)
scal_predictors<-as.data.frame(scale(predictors[,-16]))
scal_predictors$target<-predictors$target

test_scal<- NULL
for (k in 1:100) {
model_scal<- knn.cv(scal_predictors[,1:15],cl=scal_predictors$target, k)
test_scal[k] <- mean(model_scal==scal_predictors$target)
}
k_test_scal <- which(test_scal==max(test_scal)) # my result is 15/17
knn_model_scal<- knn.cv(scal_predictors[,1:15],
                     cl=scal_predictors$target, k=k_test_scal[1])

test_pca<- NULL
for (k in 1:100) {
model_pca<- knn.cv(df_pca[,1:12],cl=df_pca$target, k)
test_pca[k] <- mean(model_pca==df_pca$target)
}
k_test_pca <- which(test_pca==max(test_pca)) # my result is 15/17
knn_model_pca<- knn.cv(df_pca[,1:12],  
                       cl=df_pca$target, k=k_test_pca[1])

#knn optimual parameters plot
par(mfrow = c(2,1))
plot(test_scal, ylim=c(0.8,0.9), xlab = 'k value', ylab = 'LOOCV accuracy rate', 
     main = 'knn model with all variables')
abline(v=k_test_scal[1], col = 'red')
legend(x=k_test_scal[1], y=0.85, legend = paste('optimal k=',k_test_scal[1]), bty='n')


plot(k_test_pca, ylim=c(0.8,0.9), xlab = 'k value', ylab = 'LOOCV accuracy rate', 
     main = 'knn model with first three pca variables')
abline(v=k_test_pca[1], col = 'red')
legend(x=k_test_pca[1], y=0.88, legend = paste('optimal k=',k_test_pca[1]), bty='n')

Figure 5.9: optimal k value choices plots for knn model

#KNN CV Model Performance on Predicting Group:Subject:Condition at once
knn_scal_perf<-perf.measure(Preds = knn_model_scal, Resp= scal_predictors$target)
knn_pca_perf<-perf.measure(Preds = knn_model_pca, Resp=scal_predictors$target)

#Data frame with each of the individual target variable
df<-predictors %>%
  select(-target) 

df_all_target<-data.frame(df,
                     Col.means[,1:3])

#The labeled data with Group as target variable
df_grp<-df_all_target %>%
  select(-c(Subject,Condition))

#The labeled data with Subject as target variable 
df_sub<-df_all_target %>%
  select(-c(Group,Condition))

#The labeled data with Condition as target variable
df_cond<-df_all_target %>%
  select(-c(Group,Subject))


#Fitting LDA Model and predicting with CV (leave-one-out cross validation) on the scaled data
lda_model.grp<-lda(Group ~., data = df_grp, CV = TRUE)
lda_model.sub<-lda(Subject ~., data = df_sub,CV = TRUE)
lda_model.cond<-lda(Condition ~ ., data = df_cond,CV = TRUE)


#LDA CV performance on the models
tab_lda_model.grp<-table(lda_model.grp$class,df_grp$Group)
lda_grp_perf<-sum(diag(tab_lda_model.grp))/sum(tab_lda_model.grp)
lda_sub.perf<-perf.measure(Preds=lda_model.sub$class, Resp=df_sub$Subject)
lda_cond.perf<-perf.measure(Preds=lda_model.cond$class, Resp=df_cond$Condition)

#The scaled data with the individual target variable
df<-scal_predictors %>%
  select(-target) 

df_all_target<-data.frame(df,
                     Col.means[,1:3])

#The labeled data with Group as target variable
df_grp<-df_all_target %>%
  select(-c(Subject,Condition))

#The labeled data with Subject as target variable 
df_sub<-df_all_target %>%
  select(-c(Group,Condition))

#The labeled data with Condition as target variable
df_cond<-df_all_target %>%
  select(-c(Group,Subject))


#Fitting KNN Model and predicting with CV (leave-one-out cross validation) on the scaled data
set.seed(12345)
test_grp<- NULL
for (k in 1:100) {
model_grp<- knn.cv(df_grp[1:15],cl=df_grp$Group, k)
test_grp[k] <- mean(model_grp==df_grp$Group)
}
k_test_grp<-which(test_grp==max(test_grp)) # my result is 15/17
knn_model_grp<- knn.cv(df_grp[,1:15],
                     cl=df_grp$Group, k=k_test_grp[1])

test_sub<- NULL
for (k in 1:100) {
model_sub<- knn.cv(df_sub[,1:15],cl=df_sub$Subject, k)
test_sub[k] <- mean(model_sub==df_sub$Subject)
}
k_test_sub<-which(test_sub==max(test_sub)) # my result is 15/17
knn_model_sub<- knn.cv(df_sub[,1:15],
                     cl=df_sub$Subject, k=k_test_sub[1])

test_cond<- NULL
for (k in 1:100) {
model_cond<- knn.cv(df_cond[,1:15],cl=df_cond$Condition, k)
test_cond[k] <- mean(model_cond==df_cond$Condition)
}
k_test_cond<-which(test_cond==max(test_cond)) # my result is 15/17
knn_model_cond<- knn.cv(df_cond[,1:15],
                     cl=df_cond$Condition, k=k_test_cond[1])

#Fitting KNN model and predicting with CV on the Principal Component scaled data

#Principal component scaled data with group as target
knn_pca<-as.data.frame(pr.out$x[,1:12])
knn_pca.grp<-data.frame(knn_pca,Col.means[,1])
names(knn_pca.grp)[13]<-"Group"

#Principal component scaled data with Subject as target
knn_pca.sub<-data.frame(knn_pca,Col.means[,2])
names(knn_pca.sub)[13]<-"Subject"

#Principal component scaled data with Condition as target
knn_pca.cond<-data.frame(knn_pca,Col.means[,3])
names(knn_pca.cond)[13]<-"Condition"


test_pca.grp<- NULL
for (k in 1:100) {
model_pca.grp<- knn.cv(knn_pca.grp[,1:12],cl=knn_pca.grp$Group, k)
test_pca.grp[k] <- mean(model_pca.grp==knn_pca.grp$Group)
}
k_test_pca.grp<- which(test_pca.grp==max(test_pca.grp)) # my result is 15/17
knn_model_pca.grp<- knn.cv(knn_pca.grp[,1:12],  
                       cl=knn_pca.grp$Group, k=k_test_pca.grp[1])

test_pca.sub<- NULL
for (k in 1:100) {
model_pca.sub<- knn.cv(knn_pca.sub[,1:12],cl=knn_pca.sub$Subject, k)
test_pca.sub[k] <- mean(model_pca.sub==knn_pca.sub$Subject)
}
k_test_pca.sub<- which(test_pca.sub==max(test_pca.sub)) # my result is 15/17
knn_model_pca.sub<- knn.cv(knn_pca.sub[,1:12],  
                       cl=knn_pca.sub$Subject, k=k_test_pca.grp[1])

test_pca.cond<- NULL
for (k in 1:100) {
model_pca.cond<- knn.cv(knn_pca.cond[,1:12],cl=knn_pca.cond$Condition, k)
test_pca.cond[k] <- mean(model_pca.cond==knn_pca.cond$Condition)
}
k_test_pca.cond <- which(test_pca.cond==max(test_pca.cond)) # my result is 15/17
knn_model_pca.cond<- knn.cv(knn_pca.cond[,1:12],  
                       cl=knn_pca.cond$Condition, k=k_test_pca.cond[1])

##KNN CV Model performance in predicting Group, Subject and Condition separately
tab_knn_model.grp<-table(knn_model_grp,df_grp$Group)
knn_grp_perf<-sum(diag(tab_knn_model.grp))/sum(tab_knn_model.grp)
knn_sub_perf<- perf.measure(Preds = knn_model_sub, Resp=df_sub$Subject)
knn_cond_perf<- perf.measure(Preds = knn_model_cond, Resp=df_cond$Condition)


#LDA CV performance on the PCA models
tab_knn_model_pca.grp<-table(knn_model_pca.grp,knn_pca.grp$Group)
knn_pca_grp.perf<-sum(diag(tab_knn_model_pca.grp))/sum(tab_knn_model_pca.grp)
knn_pca_sub.perf <- perf.measure(Preds = knn_model_pca.sub, Resp=knn_pca.sub$Subject)
knn_pca_cond.perf<- perf.measure(Preds = knn_model_pca.cond, Resp=knn_pca.cond$Condition)

set.seed(702)
#svm_radial with Group as target variable
svm_radial_grp<-as.factor(NULL)
levels(svm_radial_grp)<-levels(df_grp$Group)
for(i in 1:1440){
  trainX <- df_grp[-i,]
  testY <- df_grp[i, ]
  Radial_model.grp<-svm(Group~., data = trainX, kernel = 'radial',cost=10)
  svm_radial_grp[i] <- predict(Radial_model.grp, newdata = testY)
}


#svm_polynomial with Group as target variable
svm_linear_grp<-as.factor(NULL)
levels(svm_linear_grp)<-levels(df_grp$Group)
for(i in 1:1440){
  trainX <- df_grp[-i,]
  testY <- df_grp[i,]
  linear_model<- svm(Group ~., data = trainX, scale = FALSE, kernel = 'linear',cost=10)
  svm_linear_grp[i] <- predict(linear_model, newdata = testY)
}



#svm_radial with Subject as target variable
svm_radial_sub<-as.factor(NULL)
levels(svm_radial_sub)<-levels(df_sub$Subject)
for(i in 1:1440){
  trainX <- df_sub[-i,]
  testY <- df_sub[i, ]
  Radial_model.subj<-svm(Subject~., data = trainX, kernel = 'radial',cost=10)
  svm_radial_sub[i] <- predict(Radial_model.subj, newdata = testY)
}

#svm_polynomial with Subject as target variable
svm_linear_sub<-as.factor(NULL)
levels(svm_linear_sub)<-levels(df_sub$Subject)
for(i in 1:1440){
  trainX <- df_sub[-i,]
  testY <- df_sub[i,]
  linear_model<- svm(Subject~., data = trainX, scale = FALSE, kernel = 'linear',cost=10)
  svm_linear_sub[i] <- predict(linear_model, newdata = testY)
}


#svm_radial with Condition as target variable
svm_radial_cond<-as.factor(NULL)
levels(svm_radial_cond)<-levels(df_cond$Condition)
for(i in 1:1440){
  trainX <- df_cond[-i,]
  testY <- df_cond[i, ]
  Radial_model<-svm(Condition~., data = trainX, kernel = 'radial',cost=10)
  svm_radial_cond[i] <- predict(Radial_model, newdata = testY)
}

#svm_polynomial with Condition as target variable
svm_linear_cond<-as.factor(NULL)
levels(svm_linear_cond)<-levels(df_cond$Condition)
for(i in 1:1440){
  trainX <- df_cond[-i,]
  testY <- df_cond[i,]
  linear_model<- svm(Condition~., data = trainX, scale = FALSE, kernel = 'linear',cost=10,degree=3)
  svm_linear_cond[i] <- predict(linear_model, newdata = testY)
}

tab_svm_radial.grp<-table(svm_radial_grp,df_grp$Group)
svm_radial_grp.perf<-sum(diag(tab_svm_radial.grp))/sum(tab_svm_radial.grp)

tab_svm_linear.grp<-table(svm_linear_grp,df_grp$Group)
svm.linear_grp.perf<-sum(diag(tab_svm_linear.grp))/sum(tab_svm_linear.grp)



svm_radial_sub.perf<-perf.measure(Preds=svm_radial_sub,Resp=df_sub$Subject)
svm_linear_sub.perf<- perf.measure(Preds=svm_linear_sub,Resp=df_sub$Subject)


svm_radial_cond.perf<-perf.measure(Preds=svm_radial_cond,Resp=df_cond$Condition)
svm_linear_cond.perf<-perf.measure(Preds=svm_linear_cond,Resp=df_cond$Condition)

## construct performance table and plot

COL.NAME <- c('Model')
ROW.NAME <- c('lda_all','knn_scal.all','knn_pca.all')

accuracy.rate_all <- as.data.frame(cbind(c(lda_perf$accuracy.rate,
                                    knn_scal_perf$accuracy.rate,knn_pca_perf$accuracy.rate)
                                     ))
colnames(accuracy.rate_all) <- COL.NAME
rownames(accuracy.rate_all) <- ROW.NAME

row.name<-c('lda','knn_scal','knn_PCA','svm_radial','svm_linear')
column.name<-c('Group','Subject','Condition')

accuracy.rate<-as.data.frame(rbind(c(lda_grp_perf,lda_sub.perf$accuracy.rate,lda_cond.perf$accuracy.rate),
                                     c(knn_grp_perf,knn_sub_perf$accuracy.rate,knn_cond_perf$accuracy.rate),
                                    c( knn_pca_grp.perf,knn_pca_sub.perf$accuracy.rate,knn_pca_cond.perf$accuracy.rate),
                                     c(svm_radial_grp.perf,svm_radial_sub.perf$accuracy.rate,svm_radial_cond.perf$accuracy.rate),
                                     c(svm.linear_grp.perf,svm_linear_sub.perf$accuracy.rate,svm_linear_cond.perf$accuracy.rate)
                             ))
                              
colnames(accuracy.rate)<-column.name
row.names(accuracy.rate)<-row.name

#Table of Performance Measures
kable(accuracy.rate_all, caption = 'Average LOOCV Accuracy Rate across GROUP:SUBJECT:CONDITION', format = "pandoc")%>%kable_styling(latex_option=c("hold_position"), full_width = F)

Table 5.2: Average LOOCV Accuracy Rate across GROUP:SUBJECT:CONDITION
	Model
lda_all	0.51
knn_scal.all	0.34
knn_pca.all	0.31

kable(accuracy.rate, caption = 'Average LOOCV Accuracy Rate across "GROUP","SUBJECT","CONDITION"', format = "pandoc")%>%kable_styling(latex_option=c("hold_position"), full_width = F)

Table 5.2: Average LOOCV Accuracy Rate across “GROUP”,“SUBJECT”,“CONDITION”
	Group	Subject	Condition
lda	0.8694444	0.57	0.45
knn_scal	0.9000000	0.61	0.48
knn_PCA	0.8736111	0.55	0.46
svm_radial	0.9180556	0.71	0.53
svm_linear	0.8763889	0.69	0.45

#Graph of model Performance measures
accuracy.rate_all$numb.var <- as.factor(c('lda_all', 'knn_scal.all','knn_pca.all'))
mdata <- melt(accuracy.rate_all, id="numb.var")%>%dplyr::rename(Model="variable")


ggplot(mdata, aes(x=numb.var, y=value, group=Model)) +
  geom_line(aes(color=Model)) +
  geom_point(aes(color=Model)) +
  coord_cartesian(xlim = NULL, ylim = c(0.3,0.6), 
                 expand = TRUE, default = FALSE,clip = "on") + 
 theme(legend.position="top") + labs(title = "Accuracy rate") + 
  xlab("Variable Selection") + ylab("LOOCV Accuracy rate") + 
  guides(fill=guide_legend(title="Model"))

Figure 5.10: Model Performance on Group:Subject:Condition

accuracy.rate$numb.var<-as.factor(c('lda','knn_Scal','knn_PCA',
                                    'svm_radial','svm_linear'))
melt(accuracy.rate, id="numb.var")%>%dplyr::rename(Target_Variable="variable") %>%

ggplot(aes(x=numb.var, y=value, group=Target_Variable)) +
  geom_line(aes(color=Target_Variable)) +
  geom_point(aes(color=Target_Variable)) +
  coord_cartesian(xlim = NULL, ylim = c(0.15,0.95), 
                 expand = TRUE, default = FALSE,clip = "on") + 
 theme(legend.position="top") + labs(title = "Accuracy rate") + 
  xlab("Models") + ylab("LOOCV Accuracy rate") + 
  guides(fill=guide_legend(title="Target_Variable"))

Figure 5.11: Model Performance

set.seed(702)
unlabeled<-unlabeled %>% #Selecting the same variables used in the model building process from the unlabled dataset for the prediction and computing their column means
  select(Trial,Slant,HorizontalSize,RelativeInitialSlant,
         NormalizedJerk,NumberOfPeakAccelerationPoints, StraightnessError, Segment,
         PeakVerticalAcceleration,AveragePenPressure,LoopSurface,RelativeTimeToPeakVerticalVelocity,
         AbsoluteJerk,AbsoluteyJerk,Roadlength,AverageAbsoluteVelocity) 

unlabeled.means=NULL
uni.unlab=unique(unlabeled$Trial) 
for (i in uni.unlab){
unlabeled.nrow=unlabeled[i==unlabeled$Trial, ]
unlab.Col.means=colMeans(unlabeled.nrow[,-1])
unlabeled.means=rbind(unlabeled.means, unlab.Col.means)
}
dim(unlabeled.means)

## [1]  2 15

unlabeled.df<-data.frame(unlabeled.means)

#prediction of the unlabeled data with the best performed model when the three 
#combined target variables were used in the model building process
lda_model<-lda(target~., data = predictors, CV = FALSE)

pred_lda<-predict(lda_model, newdata=unlabeled.df)
pred_lda$class

## [1] CUR:00K:L2 CUR:00K:L2
## 480 Levels: CUR:0:L1 CUR:0:L2 CUR:0:L3 CUR:0:L4 CUR:0:L5 CUR:0:L6 ... PRI:9:L6

#prediction of the unlabeled data with the best performed model when Group was used in the model building process
pred_svm.kernel_grp<-predict(Radial_model.grp,newdata=unlabeled.df)
pred_svm.kernel_grp

##   unlab.Col.means unlab.Col.means.1 
##               CUR               CUR 
## Levels: CUR PRI

#prediction of the unlabeled data with the best performed model when Group was used in the model building process
pred_svm_kernel_subj<-predict(Radial_model.subj,newdata=unlabeled.df)
pred_svm_kernel_subj

##   unlab.Col.means unlab.Col.means.1 
##               00F               00F 
## 40 Levels: 0 00A 00B 00C 00D 00F 00H 00I 00J 00K 00L 00N 00O 00P 00Q ... 9

#prediction of the unlabeled data with the best performed model when Group was used in the model building process
pred_svm_kernel_cond<-predict(Radial_model,newdata = unlabeled.df)
pred_svm_kernel_cond

##   unlab.Col.means unlab.Col.means.1 
##                L1                L1 
## Levels: L1 L2 L3 L4 L5 L6

6 References:

Gareth James, Daniela Witten, Trevor Hastie & Robert Tibshirani. (2013). An Introduction to Statistical Learning with Application in R. 10.1007/978-1-4614-7138-7

Awad, Mariette & Khanna, Rahul. (2015). Support Vector Machines for Classification. 10.1007/978-1-4302-5990-9_3.

Jon Shlens. (2003). A TUTORIAL ON PRINCIPAL COMPONENT ANALYSIS: Derivation, Discussion and Singular Value Decomposition

Oliver Sutton. (2012).Introduction to k Nearest Neighbour Classification andCondensed Nearest Neighbour Data Reduction

Vikramaditya Jakkula., “Tutorial on Support Vector Machine (SVM)”. School of EECS, Washington State University, Pullman 99164.

Burges C., “A tutorial on support vector machines for pattern recognition”, In “Data Mining and Knowledge Discovery”.Kluwer Academic Publishers, Boston, 1998, (Volume 2).

https://pythonprogramminglanguage.com/how-is-the-k-nearest-neighbor-algorithm-different-from-k-means-clustering/

https://medium.com/@aptrishu/understanding-principle-component-analysis-e32be0253ef0

https://www.dezyre.com/data-science-in-python-tutorial/principal-component-analysis-tutorial

https://www.researchgate.net/publication/240093048_Linear_Discriminant_Analysis-A_Brief_Tutorial http://thatdatatho.com/2018/02/19/assumption-checking-lda-vs-qda-r-tutorial-2

ANALYSIS OF HANDWRITTEN DOCUMENTS

Samuel Adjei