Validation of the classification on the training set itself

As requested, the validation of the model on its own training set follows.

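# Load the training table and the probabilities (tab-separated, no header row)
umbnum = read.csv("umbalanced_num.csv")
probs = read.csv("average.txt", header = FALSE, sep = "\t")
# Keep the gene pair and its S-score, then attach the probability (third column);
# this relies on both files listing the pairs in the same row order
tabderiva = subset(umbnum, select = c("gene1", "gene2", "score"))
tabderiva$prob <- probs$V3
# Scatter plot of S-score against probability, with the fitted line in red
plot(tabderiva$score ~ tabderiva$prob, ylab = "S-score", xlab = "Probability")
lm1 = lm(tabderiva$score ~ tabderiva$prob)
abline(lm1, col = "red")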

[Figure: scatter plot of S-score versus probability, with the fitted regression line in red]

summary(lm1)

## 
## Call:
## lm(formula = tabderiva$score ~ tabderiva$prob)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.624  -2.460  -0.055   1.817  20.638 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.272      0.563    0.48     0.63    
## tabderiva$prob   -6.991      0.936   -7.47  1.7e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.4 on 1011 degrees of freedom
## Multiple R-squared:  0.0523, Adjusted R-squared:  0.0514 
## F-statistic: 55.8 on 1 and 1011 DF,  p-value: 1.71e-13
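The figures in the summary above can also be read from the fitted object rather than from the printed table. The following is a minimal sketch on synthetic data; the `demo` data frame and its values are made up for illustration and are not the data of this report.

```r
set.seed(1)
# Synthetic stand-in for tabderiva; illustrative values, not the report's data
demo <- data.frame(prob = runif(200))
demo$score <- -7 * demo$prob + rnorm(200, sd = 4)

fit <- lm(score ~ prob, data = demo)
# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

slope   <- coefs["prob", "Estimate"]
p_value <- coefs["prob", "Pr(>|t|)"]
r2      <- summary(fit)$r.squared

cat(sprintf("slope = %.3f, p-value = %.3g, R-squared = %.3f\n", slope, p_value, r2))
```

Using the `data =` argument, as in this sketch, gives shorter coefficient names (`prob` instead of `tabderiva$prob`), which makes indexing the matrix less error-prone.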

fpCut <- cut(tabderiva$prob, breaks = 2)
boxplot(tabderiva$score ~ fpCut)

[Figure: boxplot of S-score for the two probability bins]
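Note that `cut(x, breaks = 2)` splits the observed range of `x` into two intervals of equal width, so the boundary falls at the midpoint of the range, not necessarily at 0.5 or at the median. A small sketch with made-up values shows the difference from an explicit threshold:

```r
# Made-up probabilities ranging from 0.2 to 0.6 (midpoint of the range: 0.4)
x <- c(0.2, 0.3, 0.45, 0.55, 0.6)

# Two equal-width bins over the observed range: the boundary sits near 0.4
by_range <- cut(x, breaks = 2)

# Explicit threshold at 0.5 over the full [0, 1] probability scale
by_threshold <- cut(x, breaks = c(0, 0.5, 1), include.lowest = TRUE)

counts_range <- as.integer(table(by_range))
counts_threshold <- as.integer(table(by_threshold))
print(table(by_range))
print(table(by_threshold))
```

If the bins are meant to correspond to the classifier's decision threshold, explicit breaks are the safer choice.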

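The slope is negative (-6.99) with a low p-value (1.7e-13), but R-squared is only 0.052: the probability explains about 5% of the variance of the S-score.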

The reason the model is not more precise is that it was originally binary: it separates two classes rather than predicting the value of the S-score.

fpCut <- cut(tabderiva$score, breaks = c(-20, 0, 20), include.lowest = TRUE)
boxplot(tabderiva$prob ~ fpCut)

[Figure: boxplot of probability for negative and positive S-scores]

lm2 = lm(tabderiva$prob ~ tabderiva$score)
summary(lm2)

## 
## Call:
## lm(formula = tabderiva$prob ~ tabderiva$score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3831 -0.0288  0.0651  0.0890  0.2206 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.55492    0.00592   93.73  < 2e-16 ***
## tabderiva$score -0.00749    0.00100   -7.47  1.7e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.144 on 1011 degrees of freedom
## Multiple R-squared:  0.0523, Adjusted R-squared:  0.0514 
## F-statistic: 55.8 on 1 and 1011 DF,  p-value: 1.71e-13
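The two regressions report the same t value, p-value and R-squared because, with a single predictor, both test the same Pearson correlation. The product of the two slopes equals R-squared: with the values reported above, (-6.991) × (-0.00749) ≈ 0.0524, which agrees with 0.0523 up to rounding. The following sketch checks these identities on synthetic data (the vectors `x` and `y` are made up, not the data of this report).

```r
set.seed(2)
# Synthetic data, for illustration only
x <- runif(300)
y <- -7 * x + rnorm(300, sd = 4)

fit_yx <- lm(y ~ x)  # y regressed on x
fit_xy <- lm(x ~ y)  # x regressed on y

b_yx <- coef(fit_yx)[["x"]]
b_xy <- coef(fit_xy)[["y"]]
r2   <- summary(fit_yx)$r.squared

# t value of the slope in each direction
t_yx <- coef(summary(fit_yx))["x", "t value"]
t_xy <- coef(summary(fit_xy))["y", "t value"]

print(all.equal(b_yx * b_xy, r2))    # product of slopes equals R-squared
print(all.equal(r2, cor(x, y)^2))    # R-squared equals squared correlation
print(all.equal(t_yx, t_xy))         # same t value in both directions
```

Swapping the response and the predictor therefore adds no new evidence; it only rescales the slope.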

Wilcoxon test comparing the probabilities of the ALL and AGG pairs (S-score above and below zero)

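# Probabilities of the pairs with positive and with negative S-score
# (pairs with a score of exactly 0 are left out)
derPos <- tabderiva$prob[tabderiva$score > 0]
derNeg <- tabderiva$prob[tabderiva$score < 0]
wilcox.test(derPos, derNeg)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  derPos and derNeg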
## W = 35483, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
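Since the model was originally binary, the W statistic can also be read as a measure of class separation. Dividing W by the product of the two group sizes gives the probability that a value drawn from the first group exceeds one drawn from the second (ties counted as one half), which is the area under the ROC curve. The group sizes are not shown in this report, so this value cannot be derived from W = 35483 alone. The sketch below uses synthetic probabilities; `p_pos` and `p_neg` are hypothetical names and values.

```r
set.seed(3)
# Synthetic probabilities for the two score classes, for illustration only
p_pos <- rbeta(150, 2, 3)  # stand-in for pairs with score > 0
p_neg <- rbeta(250, 3, 2)  # stand-in for pairs with score < 0

# W counts the (p_pos, p_neg) pairs in which the first value is larger
w <- unname(wilcox.test(p_pos, p_neg)$statistic)
auc <- w / (length(p_pos) * length(p_neg))

# Brute-force check over all pairs, counting ties as one half
auc_check <- mean(outer(p_pos, p_neg, ">") + 0.5 * outer(p_pos, p_neg, "=="))

cat(sprintf("W = %.0f, P(first > second) = %.3f\n", w, auc))
```

A value of 0.5 means no separation between the groups; values near 0 or 1 mean the probabilities separate the two classes almost completely, in one direction or the other.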