Loading the file

First of all, the file on which the analysis will be performed has been loaded in the working directory.

setwd("C:/Users/user/Desktop/Molecular Biotechnology/Data analysis/Provero")
load("test_2006.RData")
class(data)

## [1] "data.frame"

lapply(data, class)

## $gene1
## [1] "numeric"
## 
## $gene2
## [1] "numeric"
## 
## $metastatic
## [1] "numeric"

head of the test_2006 file
	gene1	gene2	metastatic
66	28.46414	10.566377	0
19	13.18606	17.284885	1
94	20.05122	12.100641	0
88	15.08214	9.087448	0
91	18.74204	8.372440	0
69	16.81824	10.459760	0

The dataset has 100 rows representing 100 tumour patients. For each of them, there are three columns. The first two ones represent the expression of two genes. The third column is composed by a binary variable describing whether the tumour is metastatic or not (0 = non-metastatic; 1 = metastatic).

Regression analysis between the variables

In this section, regression methods will be used as a tool to investigate the relationships among the variables present in the file previously loaded.

Linear regression

Gene 1 and gene 2

The function lm() is used to create a linear model. x is the independent variable, while y is the dependent one.

Histograms are created to get an idea about these numeric variables:

gene1 <- as.matrix(data[, 1])
hist(gene1, main = "Histogram of gene 1")

gene2 <- as.matrix(data[, 2])
hist(gene2, main = "Histogram of gene 2")

The relationship between two numerical variables, in this case the expression of the two genes, is shown in a scatterplot.

# x = gene 1 and y = gene 2
plot(
  x = data$gene1,
  y = data$gene2,
  xlab = "gene1",
  ylab = "gene2"
)

The graph could suggest a linear relationship between the two variables.

lr_genes <- lm(data$gene2 ~ data$gene1)
lr_genes$coefficients

## (Intercept)  data$gene1 
##  17.1401391  -0.3437202

The coefficients beta1 and beta2 of the best linear regression are respectively 17.1401 and -0.3437. Therefore, when gene 1 expression increases of 1 unit, the expression of gene 2 decreases of 0.3437.

#Plotting the data together with the regression line
plot(
    x = data$gene1,
    y = data$gene2,
    xlab = "gene1",
    ylab = "gene2"
)
abline(a = lr_genes$coefficients[1],
       b = lr_genes$coefficients[2],
       col = "red")

sum_genes <- summary(lr_genes)
sum_genes$coefficients

##               Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 17.1401391  0.8656227 19.80093 5.008559e-36
## data$gene1  -0.3437202  0.0501520 -6.85357 6.434938e-10

# p-value: 6.43e-10
sum_genes$r.squared

## [1] 0.3240047

We can conclude that the expressions of the two genes are linearly correlated. Indeed, the p-value is significantly low. The fraction of variance explained by this model is about 32%.

Gene 1 and metastatic

The function lm() is again used to find the best linear regression, even though this time the independent variable is categorical. The aim of this part of the analysis is to investigate whether the state of the tumour “metastatic” predicts the expression of gene 1 or not. To visualize the dependence of a numerical variable on a categorical one a boxplot is created.

boxplot(
    data$gene1 ~ data$metastatic,
    data = data,
    main = "Boxplot",
    xlab = "metastatic",
    ylab = "gene1",
    names = c("non-met", "met")
)

In patients with metastatic tumour the gene looks downregulated.

lm_1 <- lm(data$gene1 ~ data$metastatic)
sum_lm_1 <- summary(lm_1)
sum_lm_1$coefficients #p-value: 1.177240e-02

##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     17.211095  0.6767824 25.430767 5.991135e-45
## data$metastatic -3.171785  1.2356299 -2.566937 1.177240e-02

sum_lm_1$r.squared

## [1] 0.06300047

Having a metastatic tumour lowers the expression of gene 1 of 3.1718 times, confirming the hypotesis inferred by the boxplot. The standard error on this value is 1.2356. The p-value is 1.177240e-02, which can be considered significant as it is lower than 0.05, but it would be out of the range of significance if 0.01 was set as threshold. Considering also the small fraction of the variance explained by the model (about 6%), the expression of gene 1 seems to be not well predicted by the metastatic state.

Gene 2 and metastatic

The aim of this part of the analysis is to investigate whether the state of the tumour “metastatic” predicts the expression of gene 2 or not. The same procedure reported above is followed.

boxplot(
    data$gene2 ~ data$metastatic,
    data = data,
    main = "Boxplot",
    xlab = "metastatic",
    ylab = "gene2",
    names = c("non-met", "met")
)

In metastatic patients, the gene 2 looks upregulated.

lm_2 <- lm(data$gene2 ~ data$metastatic)
sum_lm_2 <- summary(lm_2)
sum_lm_2$coefficients #p-value: 2.912209e-08

##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     10.36064  0.3605789 28.733360 1.549851e-49
## data$metastatic  3.96919  0.6583239  6.029236 2.912209e-08

sum_lm_2$r.squared

## [1] 0.2705711

Having a metastatic tumour increases the expression of gene 2 of 3.9692 times, thus confirming the deduction from the boxplot. The uncertainity (standard error) on this value is 0.6583. The probability of this happening by chance is < 2.91e-08, as indicated by the p-value (which is significant). The fraction of the variance explained by the model is about 27%.

Logistic regression

Gene 1 and metastatic

Regressing a binary variable on a numeric one is done with logistic regression. The purpose of this part of the analysis is investigating whether the expression of the gene is predictive of the metastatic state of the tumour or not.

lg_1 <- glm(data$metastatic ~ data$gene1, family = "binomial")
sum_lg_1 <- summary(lg_1) 
sum_lg_1$coefficients #p-value: 0.01488374

##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept)  0.7710321 0.67554270  1.141352 0.25372345
## data$gene1  -0.1039378 0.04268148 -2.435196 0.01488374

library(DescTools)
PseudoR2(lg_1)

##   McFadden 
## 0.05431035

The obtained results are similar to the ones from the linear regression. Indeed, the fact of having a metastatic tumour lowers the expression of the gene 1 of 0.10394 times. The standard error on this value is 0.0427. The p-value is 0.0149, therefore again it could be considered as significant only if the threshold chosen was 0.05 and not 0.01. The McFadden’s R squared is quite close to 0, suggesting that this model explains a really low part of the variation.

Brief explanation about Pseudo R2:

Logistic regression models are fitted using the method of maximum likelihood. In particular, McFadden’s R squared measure is defined as 1-log(Lc)/log(Lnull), where Lc is the maximized likelihood value from the current fitted model, and Lnull denotes the corresponding value for the null model (the model with only an intercept and no covariates). If the model has no predictive ability, McFadden’s R squared will be closer to 0. On the other hand, if the model has a good predictive power, the value of the PseudoR2 will be closer to 1.

Gene 2 and metastatic

The same analysis is reported considering gene 2.

lg_2 <- glm(data$metastatic ~ data$gene2, family = "binomial")
sum_lg_2 <- summary(lg_2) 
sum_lg_2$coefficients #p-value: 6.941326e-06

##               Estimate Std. Error  z value     Pr(>|z|)
## (Intercept) -5.8763873 1.18955936 -4.93997 7.813468e-07
## data$gene2   0.4090281 0.09098652  4.49548 6.941326e-06

PseudoR2(lg_2)

##  McFadden 
## 0.2437931

The increase in the expression of gene 2 in case of metastatic tumour is confirmed (0.40903 times). The standard error on this value is 0.09099. The probability of this happening by chance is < 6.94e-06, as indicated by the p-value (significant). McFadden’s R2 is not particularly large because we need extremely strong predictors in order for it to get close to 1, indeed the obtained value of 0.2438 can be considered as satisfactory. The model explains a good part of the variation.

General conclusion about univariate regressions

Considering the obtained p-values and Rsquared values, the expression of gene 2 is a significant predictor of the metastatic state (and vice versa) and the relative models explain the majority of the variance with respect to gene 1. The latter has a borderline p-value and furthermore it explains a really low part of the variance.

Multivariable logistic regression

Multivariable logistic regression is performed, considering both genes as predictors of the state “metastatic”.

l_mlt <- glm(data$metastatic ~ data$gene1 + data$gene2, family = "binomial") 
summary(l_mlt)$coefficients

##                Estimate Std. Error    z value     Pr(>|z|)
## (Intercept) -7.20424230  2.0479962 -3.5177030 4.352993e-04
## data$gene1   0.04462363  0.0535634  0.8330992 4.047888e-01
## data$gene2   0.45939414  0.1125034  4.0833814 4.438509e-05

PseudoR2(l_mlt)

##  McFadden 
## 0.2494541

The p-value related to gene 1 is not significant (0.4048), while the one for the gene 2 is significant (4.4385e-05). The pseudoR2 is slightly higher than the one previously calculated in the univariate model for gene 2. This slight increase could be simply explained by the addition of a casual predictor, further confirming the fact that gene 1 seems not to be a good predictor of the metastatic state.

Cross-validation

The analysis will be performed on the univarate model for gene 2 (reduced model) and on the multivariate one (full model). The aim of this part is investigating whether the prediction of the multivariate model is improved because of overfitting. This will be the case if the full model will not perform better than the reduced model on new data.

In order to discuss the results in these terms, the data are divided in training and testing sets. On them, a ROC curve and a confusion matrix are built in order to evaluate the performance of the model.

ROC is a probability curve and AUC (area under the curve) represents a measure of separability, indicating how much model is capable of discriminating between two classes. Indeed, the closer the AUC value is to 1, the better the model is performing. On the contrary, if the AUC value is 0.5 it means that the model has a casual behaviour.

The ROC curve is plotted on TPR (True Positive Rate, y axis) over FPR (False Positive Rate, x axis). TPR can be also called recall or sensitivity and it is defined as TP/TP+FN. Whereas, the specificity is defined as TN/TN+FP. FPR is 1-specificity. Sensitivity and specificity are inversely proportional to each other.

# Defining traning and testing set 
train_set <- data[sample(nrow(data), 70), ] # randomly chosen 70 samples 
test_set <- data[sample(nrow(data), 30), ]  # randomly chosen 30 samples

# Gene 2 (reduced model)
logreg_2 <- glm(
  as.formula(data$metastatic ~ data$gene2),
  family = binomial(link = 'logit'), # logistic 
  data = train_set
) # building the model 

predictedval_2 <-
  predict(logreg_2, newdata = test_set, type = "response")

library(ROCR)
library(Metrics)
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following object is masked from 'package:Metrics':
## 
##     auc

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

auc_2 <- auc(data$metastatic, predictedval_2) # area under the curve: 0.8133

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# there is 81% chance that model will be able to distinguish between positive and negative class

# building the curve 
roc(response = data$metastatic, predictor = predictedval_2, auc = TRUE, ci = TRUE, plot = TRUE, col = "blue", main = "ROC curve", xlab = "FPR", ylab = "TPR")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data$metastatic, predictor = predictedval_2,     auc = TRUE, ci = TRUE, plot = TRUE, col = "blue", main = "ROC curve",     xlab = "FPR", ylab = "TPR")
## 
## Data: predictedval_2 in 70 controls (data$metastatic 0) < 30 cases (data$metastatic 1).
## Area under the curve: 0.8133
## 95% CI: 0.706-0.9206 (DeLong)

ci.thresholds(response = data$metastatic, predictor = predictedval_2, thresholds = "best", conf.level = 0.95, boot.n = 100)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 95% CI (100 stratified bootstrap replicates):
##  thresholds sp.low sp.median sp.high se.low se.median se.high
##    0.432214 0.8143       0.9  0.9571    0.5    0.6667     0.8

# boot.n is the number of bootstrap replicates
# median specificity: 0.9
# median sensitivity 0.6667

fitted.results_2 <- ifelse(predictedval_2 > 0.432214, 1, 0)
# the threshold has been calculated above
# if x (gene expr) is > threshold it returns 1, if not 0 

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:Metrics':
## 
##     precision, recall

## The following objects are masked from 'package:DescTools':
## 
##     MAE, RMSE

library(e1071)
# building the confusion matrix 
cm_2 <- confusionMatrix(table(fitted.results_2, data$metastatic), positive = '1') 
cm_2

## Confusion Matrix and Statistics
## 
##                 
## fitted.results_2  0  1
##                0 63 10
##                1  7 20
##                                           
##                Accuracy : 0.83            
##                  95% CI : (0.7418, 0.8977)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.002163        
##                                           
##                   Kappa : 0.5833          
##                                           
##  Mcnemar's Test P-Value : 0.627626        
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.7407          
##          Neg Pred Value : 0.8630          
##              Prevalence : 0.3000          
##          Detection Rate : 0.2000          
##    Detection Prevalence : 0.2700          
##       Balanced Accuracy : 0.7833          
##                                           
##        'Positive' Class : 1               
##

# The values obtained with the confusion matrix correspond to the ones obtained with the roc curve: 
# Accuracy 0.83
# Specificity 0.9000  
# Sensitivity 0.6667

# Both genes (full model)

logreg_b <- glm(
  as.formula(data$metastatic ~ data$gene1 + data$gene2),
  family = binomial(link = 'logit'),
  data = train_set
) 

predictedval_b <-
  predict(logreg_b, newdata = test_set, type = "response")

auc_b <- auc(data$metastatic, predictedval_b) # auc: 0.8224

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roc(response = data$metastatic, predictor = predictedval_b, auc = TRUE, ci = TRUE, plot = TRUE, col = "blue", main = "ROC curve", xlab = "FPR", ylab = "TPR")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 
## Call:
## roc.default(response = data$metastatic, predictor = predictedval_b,     auc = TRUE, ci = TRUE, plot = TRUE, col = "blue", main = "ROC curve",     xlab = "FPR", ylab = "TPR")
## 
## Data: predictedval_b in 70 controls (data$metastatic 0) < 30 cases (data$metastatic 1).
## Area under the curve: 0.8224
## 95% CI: 0.7161-0.9286 (DeLong)

ci.thresholds(response = data$metastatic, predictor = predictedval_b, thresholds = "best", conf.level = 0.95, boot.n = 100)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## 95% CI (100 stratified bootstrap replicates):
##  thresholds sp.low sp.median sp.high se.low se.median se.high
##   0.3835872 0.7639    0.8571  0.9075 0.6158    0.7333  0.8667

# mean specificity 0.8571
# mean sensitivity 0.7667

fitted.results_b <- ifelse(predictedval_b > 0.3835872, 1, 0)

cm_b <- confusionMatrix(table(fitted.results_b, data$metastatic), positive = '1') 
cm_b

## Confusion Matrix and Statistics
## 
##                 
## fitted.results_b  0  1
##                0 60  8
##                1 10 22
##                                           
##                Accuracy : 0.82            
##                  95% CI : (0.7305, 0.8897)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.004523        
##                                           
##                   Kappa : 0.5794          
##                                           
##  Mcnemar's Test P-Value : 0.813664        
##                                           
##             Sensitivity : 0.7333          
##             Specificity : 0.8571          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.3000          
##          Detection Rate : 0.2200          
##    Detection Prevalence : 0.3200          
##       Balanced Accuracy : 0.7952          
##                                           
##        'Positive' Class : 1               
##

# The values obtained with the confusion matrix correspond to the ones obtained with the roc curve: 
# Accuracy 0.82
# Sensitivity 0.7333           
# Specificity 0.8571

# Building a data.frame for an easier comparison of the obtained results
acc <- c(round(cm_2$overall[1], 2), round(cm_b$overall[1], 2))

auc <- c(auc_2, auc_b)

thr <- c(0.432214, 0.3835872)

sens <- c(cm_2$byClass["Sensitivity"], cm_b$byClass["Sensitivity"])

spec <- c(cm_2$byClass["Specificity"], cm_b$byClass["Specificity"])

over_df <- data.frame(acc, auc, thr, sens, spec)
colnames(over_df) <- c("Accuracy", "AUC", "Threshold", "Sensitivy", "Specificity")
rownames(over_df) <- c("Reduced model", "Full model")
over_df

Overall conclusion

The observations that can be inferred from these values are the following:

The accuracy is a measure of how often the classifier is correct. The accuracy calculated on the models are both around 80%, but the one of the full model is not higher than the one calculated on the reduced model.
The area under the curve is higher for the full model, but this could be due to overfitting.
The sensitivity of the full model is higher (better TPR), but this could also be caused by the choice of a lower threshold.
The specificity is better for the reduced model, but this can be explained by the fact that the threshold used was higher.

The overall results suggest that the multivariate model does not perform better than the univariate model for gene 2. Moreover, the fact of adding predictors increases the probability of overfitting, which is likely the reason for which some parameters are better for the full model with respect to the reduced one. This confirms the previous hypothesis that the expression of gene 1 is not a significant predictor of the metastatic state of the tumour. In conclusion, the only significant predictor of the metastatic or non-metastatic state of the tumour is gene 2 expression.

Data analysis exam

Fiammetta Gatto

21/06/2020

Loading the file

Regression analysis between the variables

Linear regression

Gene 1 and gene 2

Gene 1 and metastatic

Gene 2 and metastatic

Logistic regression

Gene 1 and metastatic

Gene 2 and metastatic

General conclusion about univariate regressions

Multivariable logistic regression

Cross-validation

Overall conclusion