General rule and information

You must show your work in order to get points. Please prepare your report according to the rubrics on projects that are given in the syllabus. If a project report contains only codes and their outputs and the project has a total of 100 points, a maximum of 25 points can be taken off. Please note that your need to submit codes that would have been used for your data analysis. Your report can be in .doc, .docx, .html or .pdf format.

The project will assess your skills in support vector machines and dimension reduction, for which visualization techniques you have learnt will be used to illustrate your findings. This project gives you more freedom to use your knowledge and skills in data analysis.

Task A: Analysis of gene expression data

For this task, you need to use PCA and Sparse PCA.

Data set and its description

Please download the data set “TCGA-PANCAN-HiSeq-801x20531.tar.gz” from the website https://archive.ics.uci.edu/ml/machine-learning-databases/00401/. A brief description of the data set is given at https://archive.ics.uci.edu/ml/datasets/gene+expression+cancer+RNA-Seq. Please read the description carefully, and you may need to read a bit more on gene expression data to help you complete this project.

You need to decompress the data file since it is a .tar.gz file. Once uncompressed, the data files are “labels.csv” that contains the cancer type for each sample, and “data.csv” that contains the “gene expression profile” (i.e., expression measurements of a set of genes) for each sample. Here each sample is for a subject and is stored in a row of “data.csv”. In fact, the data set contains the gene expression profiles for 801 subjects, each with a cancer type, where each gene expression profile contains the gene expressions for the same set of 20531 genes. The cancer types are: “BRCA”, “COAD”, “KIRC”, “LUAD” and “PRAD”. In both files “labels.csv” and “data.csv”, each row name records which sample a label or observation is for.

Data processing

Please use set.seed(123) for random sampling via the command sample.

Filter out genes (from “data.csv”) whose expressions are zero for at least 300 subjects, and save the filtered data as R object “gexp2”.
Use the command sample to randomly select 1000 genes and their expressions from “gexp2”, and save the resulting data as R object “gexp3”.
Use the command scale to standardize the gene expressions for each gene in “gexp3”. Save the standardized data as R object “stdgexpProj2”.

You will analyze the standardized data.

#load libraries
library(e1071)

## Warning: package 'e1071' was built under R version 4.4.2

library(caret)

## Warning: package 'caret' was built under R version 4.4.3

## Loading required package: ggplot2

## Loading required package: lattice

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.2

## corrplot 0.95 loaded

library(plotly)

## Warning: package 'plotly' was built under R version 4.4.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(elasticnet)

## Loading required package: lars

## Loaded lars 1.3

library(reshape2)
library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

#read in data
gexp1 <- read.csv("C:\\Users\\marga\\STAT437\\TCGA-PANCAN-HiSeq-801x20531\\TCGA-PANCAN-HiSeq-801x20531\\data.csv")
labels <- read.csv("C:\\Users\\marga\\STAT437\\TCGA-PANCAN-HiSeq-801x20531\\TCGA-PANCAN-HiSeq-801x20531\\labels.csv")

#set seed
set.seed(123)

#filter out genes with less than 300 subjects
gexp2 <- gexp1[, colSums(gexp1==0)<300]

#getting the sample data
gexp3 <- sample(gexp2, 1000)

#standardizing the data
stdgexpProj2 <- scale(gexp3)

Questions to answer when doing data analysis

Please also investigate and address the following when doing data analysis:

(1.a) Are there genes for which linear combinations of their expressions explain a significant proportion of the variation of gene expressions in the data set? Note that each gene corresponds to a feature, and a principal component based on data version is a linear combination of the expression measurements for several genes.

No, only 52.1% of the variance is explained by the first 10 PCs in PCA and only 1.42% is explained by the first 3 PCs from SPCA. In comparison the first 3 PCs of PCA explained 30.7% of the variance, which is much better but still would not constitute a significant proportion. Depending on the research being done, somewhere between 50-200 PCs from PCA might be needed to explain enough of the variance. Since my computer can’t run SPCA with more than 3 PCs, I can’t say for certain, but it’s likely that a similar number of PCs from SPCA would be needed.

(1.b) Ideally, a type of cancer should have its “signature”, i.e., a pattern in the gene expressions that is specific to this cancer type. From the “labels.csv”, you will know which expression measurements belong to which cancer type. Identify the signature of each cancer type (if any) and visualize it. For this, you need to be creative and should try both PCA and Sparse PCA.

The 3D plot for PCA gave good estimates as to whether signatures could be developed. In that plot each cancer type was the best grouped out of all the visualizations. PRAD showed the tightest grouping and was also better separated from the other cancer types. PRAD shows negative associations to PC2 and PC3 and a positive association with PC1. It would be the best candidate for a signature. COAD has a positive association with all three PCs and fairly tight grouping, indicating a signature could be developed for COAD. LUAD is closely grouped around the origin for PC1, similarly slightly above the origin, positive association for PC2 and PC3. KIRC is negatively associated with PC1 and PC2, and is grouped closely around the origin, but slightly positive, for PC3. KIRC and LUAD are not as tightly grouped as COAD or PRAD, a signature would be more difficult to develop. BRCA was the most spread out cancer type, near the origin for each PC, indicating that a signature would be difficult to develop for BRCA. These findings held true in SPCA as well, although it appeared the observations for each cancer type were more spread out.

(1.c) There are 5 cancer types. Would 5 principal components, obtained either from PCA or Sparse PCA, explain a dominant proportion of variability in the data set, and serve as the signatures of the 5 cancer types? Note that the same set of genes were measured for each cancer type.

No, while PCA explains 40.7% of the variance with the first 5 PCs that is not a very high accuracy rate for something as important as identifying cancer type. Which is unsurprising, cancer research and risk evaluation is a complex and field of study. While it would be ground breaking to find gene combinations that can accurately predict type of cancer and the risk level, both PCA and sparse PCA show that the genomic predictive power for type of cancer is complex and accuracy is only achieved with a robust study of many genes.

Identify patterns and low-dimensional structures

Please implement the following:

(2.a) Apply PCA, determine the number of principal components, provide visualizations of low-dimensional structures, and report your findings. Note that you need to use “labels.csv” for the task of discovering patterns such as if different cancer types have distinct transformed gene expressions (that are represented by principal components). For PCA or Sparse PCA, low-dimensional structures are usually represented by the linear space spanned by some principal components.

pca_2a <- prcomp(stdgexpProj2, retx = TRUE, center = TRUE, scale. = TRUE)

#checking dimensions to ensure PCA ran correctly
dim(stdgexpProj2)

## [1]  801 1000

dim(pca_2a$rotation) #loading matrix dimensions

## [1] 1000  801

dim(pca_2a$x) #PC scores

## [1] 801 801

#calculating proportional variance explained
pve <- cumsum((nrow(stdgexpProj2)-1)*((pca_2a$sdev)^2)) / sum(stdgexpProj2^2)

pve[1:200] #looking at first 200 PCs

##   [1] 0.1200590 0.2186996 0.3074862 0.3643454 0.4067436 0.4398219 0.4669024
##   [8] 0.4897715 0.5084996 0.5215233 0.5328420 0.5433123 0.5534031 0.5615880
##  [15] 0.5694001 0.5769628 0.5840573 0.5908778 0.5970822 0.6031929 0.6085813
##  [22] 0.6138562 0.6188784 0.6238570 0.6286509 0.6331363 0.6375263 0.6418964
##  [29] 0.6461300 0.6503140 0.6542844 0.6581694 0.6619621 0.6655938 0.6691875
##  [36] 0.6727028 0.6761552 0.6794505 0.6827091 0.6859172 0.6890220 0.6920757
##  [43] 0.6951221 0.6981174 0.7010190 0.7038521 0.7066473 0.7094123 0.7121150
##  [50] 0.7148072 0.7174054 0.7199576 0.7224725 0.7249607 0.7273991 0.7298160
##  [57] 0.7321836 0.7344932 0.7367818 0.7390245 0.7412290 0.7434070 0.7455498
##  [64] 0.7476442 0.7496653 0.7516785 0.7536726 0.7556535 0.7576096 0.7595493
##  [71] 0.7614716 0.7633586 0.7652176 0.7670737 0.7689085 0.7707151 0.7724904
##  [78] 0.7742634 0.7760043 0.7777145 0.7794017 0.7810637 0.7827128 0.7843417
##  [85] 0.7859605 0.7875336 0.7890889 0.7906402 0.7921795 0.7937041 0.7952185
##  [92] 0.7967014 0.7981626 0.7996120 0.8010322 0.8024448 0.8038445 0.8052337
##  [99] 0.8066127 0.8079774 0.8093399 0.8106818 0.8120029 0.8133179 0.8146178
## [106] 0.8159093 0.8171901 0.8184704 0.8197209 0.8209598 0.8221905 0.8234062
## [113] 0.8246151 0.8258191 0.8270115 0.8281912 0.8293640 0.8305190 0.8316462
## [120] 0.8327701 0.8338899 0.8350046 0.8361155 0.8372182 0.8383148 0.8393969
## [127] 0.8404739 0.8415362 0.8425961 0.8436439 0.8446846 0.8457185 0.8467451
## [134] 0.8477674 0.8487752 0.8497702 0.8507638 0.8517460 0.8527243 0.8536922
## [141] 0.8546538 0.8556091 0.8565537 0.8574898 0.8584223 0.8593438 0.8602593
## [148] 0.8611676 0.8620725 0.8629725 0.8638666 0.8647534 0.8656295 0.8665025
## [155] 0.8673659 0.8682266 0.8690836 0.8699334 0.8707762 0.8716121 0.8724421
## [162] 0.8732719 0.8740951 0.8749129 0.8757196 0.8765199 0.8773165 0.8781108
## [169] 0.8789007 0.8796883 0.8804694 0.8812437 0.8820132 0.8827759 0.8835332
## [176] 0.8842835 0.8850330 0.8857760 0.8865135 0.8872501 0.8879745 0.8886961
## [183] 0.8894136 0.8901247 0.8908330 0.8915354 0.8922362 0.8929317 0.8936248
## [190] 0.8943125 0.8949938 0.8956673 0.8963395 0.8970089 0.8976763 0.8983270
## [197] 0.8989744 0.8996200 0.9002647 0.9009034

scree_values <- pca_2a$sdev^2
prop_var <- scree_values / sum(scree_values)

plot(prop_var[1:100], type = "b", xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     main = "Scree Plot")

cum_var <- cumsum(prop_var)
plot(cum_var[1:100], type = "b", xlab = "Number of Components", 
     ylab = "Cumulative Proportion of Variance", 
     main = "Cumulative Variance Explained")

pca_hist10 <- pca_2a$rotation[,1:10]

colnames(pca_hist10) = paste("PC", 1:10, sep=" ")
colnames(pca_hist10)~paste(colnames(pca_hist10), "Loadings", sep = " ")

## colnames(pca_hist10) ~ paste(colnames(pca_hist10), "Loadings", 
##     sep = " ")

melt_hist <- melt(pca_hist10)

ggplot(melt_hist, aes(x = value)) +
  geom_histogram(bins = 25) +
  facet_wrap(~Var2, scales = "free_x") +
  theme_bw()

#adding labels to the scores
scores <- data.frame(pca_2a$x[, 1:3])
scores$label <- labels$Class

#plot of first 2 PCs
ggplot(scores, aes(x = PC1, y = PC2, color = label)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "PCA: First Two Principal Components",
       x = "PC1", y = "PC2")

#plot of first 3 PCs
plot_ly(scores, x = ~PC1, y = ~PC2, z = ~PC3, color = ~label, type = "scatter3d", mode = "markers")

After looking at the first 200 principal components, it shows that 80.1% of the variance is explained by the first 95 PCs, so the scree and cumulative variance plots were created with those PCs. If more variance needs to be looked at, 90.0% of the variance is explained by the first 199 PCs. Since there was such a small increase in explained variance in over 100 additional PCs, the 80% threshold would most likely be the best option for further analysis.

The scree and cumulative variance plots confirm that the level of value additional PCs have in explaining the variance levels off as 95 is approached. The histograms do not show any obvious genes that are causing significant variance in any direction. The plot of the first 2 principal components shows that PRAD, COAD, and KIRC were fairly accurately grouped, but BRCA and LUAD were not. After adding in the 3 principal component clearer groupings of all the cancer types were achieved; PRAD and KIRC appear to be the most accurately grouped, with some overlap among the other 3 cancer types.

(2.b) Apply Sparse PCA, provide visualizations of low-dimensional structures, and report your findings. Note that you need to use “labels.csv” for the task of discovering patterns. Your laptop may not have sufficient computational power to implement Sparse PCA with many principal components. So, please pick a value for the sparsity controlling parameter and a value for the number of principal components to be computed that suit your computational capabilities.

#applying sparse pca
spca_2b <- spca(stdgexpProj2, K = 3, type = "predictor", sparse = "varnum", para = c(50, 50, 50))

#getting the pevs for each PC
spca_2b$pev

## [1] 0.01316455 0.01394939 0.01427847

#creating a histogram for each of the SPCs
spca_hist = spca_2b$loadings[,1:3]

colnames(spca_hist) = paste("SPC", 1:3, sep = " ")
colnames(spca_hist)~paste(colnames(spca_hist), "Loadings", sep = " ")

## colnames(spca_hist) ~ paste(colnames(spca_hist), "Loadings", 
##     sep = " ")

melt_spca_hist <- melt(spca_hist)

ggplot(melt_spca_hist, aes(x = value)) +
  geom_histogram(bins = 25) +
  facet_wrap(~Var2, scales = "free") +
  theme_bw()

#finding dominant features
LV1 <- spca_hist[,1] #first loading vector

#shows how many dominant features have non zero loadings
DF1 = which(abs(LV1)>0)
length(DF1)

## [1] 50

#shows how many DFs have large loadings
quan_95 <- quantile(abs(LV1[abs(LV1>0)]), probs=c(0.95))

DF2 = which(abs(LV1)>quan_95)
length(DF2)

## [1] 0

#getting scores and labels for the data
spca_scores <- stdgexpProj2 %*% spca_2b$loadings
spca_scores_df <- data.frame(spca_scores)
spca_scores_df$label <- labels$Class

#plot of first 2 PCs
ggplot(spca_scores_df, aes(x = PC1, y = PC2, color = label)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Sparse PCA: First Two Components",
       x = "PC1", y = "PC2")

plot_ly(spca_scores_df, x = ~PC1, y = ~PC2, z = ~PC3, color = ~label, type = "scatter3d", mode = "markers")

The parameters chosen for sparse PCA took approx. 3 minutes to run on my PC, so additional genes or PCs would not be computationally feasible. The histograms did not show any genes that significantly explain variance. The dominant feature analysis showed 50 genes with non-zero values in the first loading vector, however none of the genes were above the 95% threshold, supporting that no one gene is significantly causing the variance.

The scatter plot with only the first 2 PCS shows very poor grouping, which is unsurprising because the first 2 PCs only explain 1.39% of the variance. What is surprising is that the groupings shown in the 3D plot are fairly clear, there is still some overlap, but the 3rd PC only brought explained variance up to 1.42%.

(2.c) Do PCA and Sparse PCA reveal different low-dimensional structures for the gene expressions for different cancer types?

In both PCA and SPCA the 3D plots showed good separation of the cancer types, however PCA far out-performed SPCA in separation based on the first 2 principal components. Neither method showed that specific genes were significant to the variance in the data, the histograms for each were fairly centered over zero. All of the features in the first loading vector of SPCA had non-zero values, but none above the 95%, which further supports what the histograms from both methods indicated. Both of the 3D plots showed that PRAD seems most similar to BRCA and was well separated. LUAD, COAD, and KIRC appeared the most similar to one another, situated opposite BRCA from PRAD. PCA appears to perform slightly better than SPCA at forming groups with all the cancer types, with both methods appearing to group BRCA the worst.

Task B: analysis of SPAM emails data set

For this task, you need to use PCA and SVM.

Dataset and its description

The spam data set ``SPAM.csv’’ is attached and also can be downloaded from https://web.stanford.edu/~hastie/CASI_files/DATA/SPAM.html. More information on this data set can be found at: https://archive.ics.uci.edu/ml/datasets/Spambase. The column “testid” in “SPAM.csv” was used to train a model when the data set was used by other analysts and hence should not be used as a feature or the response, the column “spam” contains the true status for each email, and the rest contain measurements of features. Here each email is represented by a row of features in the .csv file, and a “feature” can be regarded as a “predictor”. Also note that the first 1813 rows, i.e., observations, of the data set are for spam emails, and that the rest for non-spam emails.

Data processing

Please do the following:

Remove rows that have missing values. For a .csv file, usually a blank cell is treated as a missing value.
Check for highly correlated features using the absolute value of sample correlation. Think about if you should include all or some of highly correlated features into an SVM model. For example, “crl.ave” (average length of uninterrupted sequences of capital letters), “crl.long” (length of longest uninterrupted sequence of capital letters) and “crl.tot” (total number of capital letters in the e-mail) may be highly correlated. Whethere you choose to remove some highly correlated features from subsequent analysis or not, you need to provide a justification for your choice.

Note that each feature is stored in a column of the original data set and each observation in a row. You will analyze the processed data set.

# Load data
spam_data <- read.csv("SPAM.csv")

# Remove 'testid' and rows with missing values
spam_data <- spam_data[ , !(names(spam_data) %in% c("testid"))]
spam_data <- na.omit(spam_data)

# Check basic structure
str(spam_data)

## 'data.frame':    4601 obs. of  58 variables:
##  $ spam      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ make      : num  0 0.21 0.06 0 0 0 0 0 0.15 0.06 ...
##  $ address   : num  0.64 0.28 0 0 0 0 0 0 0 0.12 ...
##  $ all       : num  0.64 0.5 0.71 0 0 0 0 0 0.46 0.77 ...
##  $ X3d       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ our       : num  0.32 0.14 1.23 0.63 0.63 1.85 1.92 1.88 0.61 0.19 ...
##  $ over      : num  0 0.28 0.19 0 0 0 0 0 0 0.32 ...
##  $ remove    : num  0 0.21 0.19 0.31 0.31 0 0 0 0.3 0.38 ...
##  $ internet  : num  0 0.07 0.12 0.63 0.63 1.85 0 1.88 0 0 ...
##  $ order     : num  0 0 0.64 0.31 0.31 0 0 0 0.92 0.06 ...
##  $ mail      : num  0 0.94 0.25 0.63 0.63 0 0.64 0 0.76 0 ...
##  $ receive   : num  0 0.21 0.38 0.31 0.31 0 0.96 0 0.76 0 ...
##  $ will      : num  0.64 0.79 0.45 0.31 0.31 0 1.28 0 0.92 0.64 ...
##  $ people    : num  0 0.65 0.12 0.31 0.31 0 0 0 0 0.25 ...
##  $ report    : num  0 0.21 0 0 0 0 0 0 0 0 ...
##  $ addresses : num  0 0.14 1.75 0 0 0 0 0 0 0.12 ...
##  $ free      : num  0.32 0.14 0.06 0.31 0.31 0 0.96 0 0 0 ...
##  $ business  : num  0 0.07 0.06 0 0 0 0 0 0 0 ...
##  $ email     : num  1.29 0.28 1.03 0 0 0 0.32 0 0.15 0.12 ...
##  $ you       : num  1.93 3.47 1.36 3.18 3.18 0 3.85 0 1.23 1.67 ...
##  $ credit    : num  0 0 0.32 0 0 0 0 0 3.53 0.06 ...
##  $ your      : num  0.96 1.59 0.51 0.31 0.31 0 0.64 0 2 0.71 ...
##  $ font      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X000      : num  0 0.43 1.16 0 0 0 0 0 0 0.19 ...
##  $ money     : num  0 0.43 0.06 0 0 0 0 0 0.15 0 ...
##  $ hp        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ hpl       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ george    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X650      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ lab       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ labs      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ telnet    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X857      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data      : num  0 0 0 0 0 0 0 0 0.15 0 ...
##  $ X415      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X85       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ technology: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1999     : num  0 0.07 0 0 0 0 0 0 0 0 ...
##  $ parts     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pm        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ direct    : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ cs        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ meeting   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ original  : num  0 0 0.12 0 0 0 0 0 0.3 0 ...
##  $ project   : num  0 0 0 0 0 0 0 0 0 0.06 ...
##  $ re        : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ edu       : num  0 0 0.06 0 0 0 0 0 0 0 ...
##  $ table     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ conference: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ch.       : num  0 0 0.01 0 0 0 0 0 0 0.04 ...
##  $ ch..1     : num  0 0.132 0.143 0.137 0.135 0.223 0.054 0.206 0.271 0.03 ...
##  $ ch..2     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ch..3     : num  0.778 0.372 0.276 0.137 0.135 0 0.164 0 0.181 0.244 ...
##  $ ch..4     : num  0 0.18 0.184 0 0 0 0.054 0 0.203 0.081 ...
##  $ ch..5     : num  0 0.048 0.01 0 0 0 0 0 0.022 0 ...
##  $ crl.ave   : num  3.76 5.11 9.82 3.54 3.54 ...
##  $ crl.long  : int  61 101 485 40 40 15 4 11 445 43 ...
##  $ crl.tot   : int  278 1028 2259 191 191 54 112 49 1257 749 ...

# Calculate correlation matrix
cor_matrix <- cor(spam_data[ , -which(names(spam_data) == "spam")])
high_corr <- findCorrelation(abs(cor_matrix), cutoff = 0.9)
names(spam_data)[high_corr]  # Highly correlated features

## [1] "telnet"

# remove highly correlated features (justify based on analysis)
spam_data_reduced <- spam_data[ , -high_corr]

Classifiction via SVM

Please do the following:

(3.a) Use set.seed(123) wherever the command sample is used or cross-validation is implemented, randomly select without replacement 300 observations from the data set and save them as training set “train.RData”, and then randomly select without replacement 100 observations from the remaining observations and save them as “test.RData”. You need to check if the training set contains observations from both classes; otherwise, no model can be trained.

set.seed(123)

# Randomly select 300 samples for training
train_idx <- sample(nrow(spam_data_reduced), 300)
train <- spam_data_reduced[train_idx, ]

# Ensure both classes are present
table(train$spam)

## 
## FALSE  TRUE 
##   184   116

# From remaining data, select 100 for testing
remaining <- spam_data_reduced[-train_idx, ]
test_idx <- sample(nrow(remaining), 100)
test <- remaining[test_idx, ]

# Save data
save(train, file = "train.RData")
save(test, file = "test.RData")

(3.b) Apply PCA to the training data “train.RData” and see if you find any pattern that can be used to approximately tell a spam email from a non-spam email.

# Exclude response variable for PCA
train_features <- scale(train[ , -which(names(train) == "spam")])
pca_result <- prcomp(train_features, center = TRUE, scale. = TRUE)

# Plot PCA
pca_df <- data.frame(pca_result$x[,1:2], spam = as.factor(train$spam))
ggplot(pca_df, aes(PC1, PC2, color = spam)) + 
  geom_point(alpha = 0.6) + 
  theme_minimal() + 
  ggtitle("PCA on Training Data")

The PCA plot of the first two principal components shows that spam and non-spam emails exhibit partial separation in the reduced two-dimensional space. While there is some overlap between the two classes, a visible trend appears: many of the spam emails cluster in one region of the plot, while non-spam emails are spread in a different area. This suggests that PCA captures meaningful variation in the data related to the classification label, although the separation is not perfect. This visualization provides preliminary evidence that linear combinations of features can be useful in distinguishing between spam and non-spam emails.

(3.c) Use “train.RData” to build an SVM model with linear kernel, whose cost parameter is determined by 10-fold cross-validation, for which the features are predictors, the status of email is the response, and cost ranges in c(0.01,0.1,1,5,10,50). Apply the obtained optimal model to “test.RData”, and report via a 2-by-2 table on spams that are classified as spams or non-spams and on non-spams that are classified as non-spams or spams.

# Make sure 'spam' is a factor (for classification)
train$spam <- as.factor(train$spam)
test$spam <- as.factor(test$spam)

# Define cross-validation control
ctrl <- trainControl(method = "cv", number = 10)

# Train the SVM model with linear kernel
set.seed(123)
svm_linear <- train(
  spam ~ ., data = train,
  method = "svmLinear",
  trControl = ctrl,
  tuneGrid = expand.grid(C = c(0.01, 0.1, 1, 5, 10, 50))
)

# Predict on test set
pred_linear <- predict(svm_linear, test)

# Show confusion matrix
confusionMatrix(pred_linear, test$spam)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    56   14
##      TRUE      2   28
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7532, 0.9057)
##     No Information Rate : 0.58            
##     P-Value [Acc > NIR] : 2.27e-08        
##                                           
##                   Kappa : 0.6581          
##                                           
##  Mcnemar's Test P-Value : 0.00596         
##                                           
##             Sensitivity : 0.9655          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.8000          
##          Neg Pred Value : 0.9333          
##              Prevalence : 0.5800          
##          Detection Rate : 0.5600          
##    Detection Prevalence : 0.7000          
##       Balanced Accuracy : 0.8161          
##                                           
##        'Positive' Class : FALSE           
##

(3.d) Use “train.RData” to build an SVM model with radial kernel, whose “cost” parameter is determined by 10-fold cross-validation, for which the features are predictors, the status of email is the response, cost ranges in c(0.01,0.1,1,5,10,50), and gamma=c(0.5,1,2,3,4). Report the number of support vectors. Apply the obtained optimal model to “test.RData”, and report via a 2-by-2 table on spams that are classified as spams or non-spams and on non-spams that are classified as non-spams or spams.

# SVM with radial kernel using tune
set.seed(123)
tune_radial <- tune.svm(
  spam ~ ., data = train,
  kernel = "radial",
  cost = c(0.01, 0.1, 1, 5, 10, 50),
  gamma = c(0.5, 1, 2, 3, 4),
  tunecontrol = tune.control(cross = 10)
)

# Best model
best_radial <- tune_radial$best.model

# Number of support vectors
summary(best_radial)$tot.nSV

## [1] 289

# Predictions
pred_radial <- predict(best_radial, test)
table(Predicted = pred_radial, Actual = test$spam)

##          Actual
## Predicted FALSE TRUE
##     FALSE    58   38
##     TRUE      0    4

The classification results demonstrate that the SVM with a linear kernel (3.c) significantly outperforms the SVM with a radial kernel (3.d) in terms of overall accuracy and balanced performance across both classes. The linear SVM achieved an accuracy of 84% with a strong Kappa value of 0.6581, indicating substantial agreement beyond chance. It maintained high sensitivity (96.55%) and reasonable specificity (66.67%), showing that it effectively identified both spam and non-spam emails. In contrast, the radial SVM showed a much lower accuracy of 62% and a poor Kappa value of 0.1088, suggesting weak predictive power. While it achieved perfect sensitivity (100%) by correctly identifying all non-spam messages, it suffered from extremely low specificity (9.52%), missing nearly all spam emails. This imbalance implies that the radial model was heavily biased toward predicting the majority class, making it unreliable for spam detection. Overall, the linear kernel provided a more balanced and effective classification, making it the preferable model in this context.

Stat 437 Project 2

Thi Thanh Thao Tran (011838585) Daisy Spencer (011180264)

General rule and information

Task A: Analysis of gene expression data

Data set and its description

Data processing

Questions to answer when doing data analysis

Identify patterns and low-dimensional structures

Task B: analysis of SPAM emails data set

Dataset and its description

Data processing

Classifiction via SVM