As requested, below is the validation of the model on its own training set.
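# Load the S-scores and the probabilities predicted by the model.
umbnum <- read.csv("umbalanced_num.csv")
probs <- read.csv("average.txt", header = FALSE, sep = "\t")
# Keep each gene pair with its S-score and attach the predicted probability
# (this assumes both files list the pairs in the same order).
tabderiva <- subset(umbnum, select = c("gene1", "gene2", "score"))
tabderiva$prob <- probs$V3
# Scatter plot of S-score against probability, with the fitted regression line.
plot(tabderiva$score ~ tabderiva$prob, ylab = "S-score", xlab = "Probability")
lm1 <- lm(tabderiva$score ~ tabderiva$prob)
abline(lm1, col = "red")
summary(lm1)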
##
## Call:
## lm(formula = tabderiva$score ~ tabderiva$prob)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.624  -2.460  -0.055   1.817  20.638
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)       0.272      0.563    0.48     0.63
## tabderiva$prob   -6.991      0.936   -7.47  1.7e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.4 on 1011 degrees of freedom
## Multiple R-squared: 0.0523, Adjusted R-squared: 0.0514
## F-statistic: 55.8 on 1 and 1011 DF, p-value: 1.71e-13
# Split the probabilities into two equal-width intervals and compare the S-scores in each.
fpCut <- cut(tabderiva$prob, breaks = 2)
boxplot(tabderiva$score ~ fpCut)
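The slope is negative (-6.99) with a very low p-value (1.7e-13), although the fit explains only about 5% of the variance in the S-score (R-squared = 0.0523).
The model is not very precise here because it was originally a binary model.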
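Because the model was originally binary, it may be more natural to check it as a classifier than as a regression. The sketch below is only an illustration on simulated data, not on the data of this report. It assumes that a negative S-score is the positive class (suggested by the negative slope, but not confirmed here) and uses 0.5 as an arbitrary threshold. The names `observed`, `predicted`, and `confusion` are hypothetical.

```r
set.seed(42)
# Simulated stand-ins for tabderiva$score and tabderiva$prob (hypothetical data).
score <- rnorm(300, sd = 4)
prob <- plogis(-0.3 * score + rnorm(300))

# Assumption: negative S-score is the positive class; 0.5 is an illustrative threshold.
observed <- score < 0
predicted <- prob > 0.5

# Cross-tabulate predicted and observed classes and compute the accuracy.
confusion <- table(predicted = predicted, observed = observed)
accuracy <- sum(diag(confusion)) / sum(confusion)
confusion
accuracy
```

With the real data, `score` and `prob` would be replaced by `tabderiva$score` and `tabderiva$prob`.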
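# Reverse view: split the S-score at 0 and compare the probabilities in each half.
# Scores outside [-20, 20] fall outside the breaks and become NA.
fpCut <- cut(tabderiva$score, breaks = c(-20, 0, 20), include.lowest = TRUE)
boxplot(tabderiva$prob ~ fpCut)
lm2 <- lm(tabderiva$prob ~ tabderiva$score)
summary(lm2)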
##
## Call:
## lm(formula = tabderiva$prob ~ tabderiva$score)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.3831 -0.0288  0.0651  0.0890  0.2206
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      0.55492    0.00592   93.73  < 2e-16 ***
## tabderiva$score -0.00749    0.00100   -7.47  1.7e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.144 on 1011 degrees of freedom
## Multiple R-squared: 0.0523, Adjusted R-squared: 0.0514
## F-statistic: 55.8 on 1 and 1011 DF, p-value: 1.71e-13
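Both regressions report the same R-squared (0.0523) and the same t value for the slope (-7.47). This is expected: in a simple linear regression, R-squared equals the squared Pearson correlation, which does not depend on which variable is the response. A minimal sketch on simulated data (not the data of this report; `score` and `prob` are hypothetical stand-ins):

```r
set.seed(1)
# Simulated stand-ins for tabderiva$prob and tabderiva$score (hypothetical data).
prob <- runif(200)
score <- -7 * prob + rnorm(200, sd = 4)

# R-squared of each regression direction and the squared Pearson correlation.
r2_score_on_prob <- summary(lm(score ~ prob))$r.squared
r2_prob_on_score <- summary(lm(prob ~ score))$r.squared
r2_from_cor <- cor(score, prob)^2

# Both comparisons should agree up to numerical tolerance.
all.equal(r2_score_on_prob, r2_prob_on_score)
all.equal(r2_score_on_prob, r2_from_cor)
```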
Wilcoxon test comparing the probabilities of ALL x AGG
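# Note: the names are inverted relative to the sign of the score
# (derNeg holds score > 0, derPos holds score < 0); pairs with score == 0 are excluded.
derNeg <- tabderiva$prob[tabderiva$score > 0]
derPos <- tabderiva$prob[tabderiva$score < 0]
wilcox.test(derNeg, derPos)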
##
## Wilcoxon rank sum test with continuity correction
##
## data: derNeg and derPos
## W = 35483, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
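The W statistic can be turned into an effect size. `wilcox.test(x, y)` reports W as the number of pairs in which the value from `x` exceeds the value from `y` (ties count one half), so W divided by the product of the two sample sizes estimates the probability that a random value from the first group exceeds a random value from the second. That quantity equals the AUC of a classifier separating the two groups. The sketch below uses simulated data, since the group sizes are not shown in this report. `x` and `y` are hypothetical stand-ins for `derNeg` and `derPos`.

```r
set.seed(7)
# Simulated stand-ins for derNeg and derPos (hypothetical data).
x <- rbeta(80, 2, 3)
y <- rbeta(120, 3, 2)

# W from the rank sum test, scaled by the number of (x, y) pairs.
w <- unname(wilcox.test(x, y)$statistic)
auc <- w / (length(x) * length(y))

# Direct check: fraction of pairs in which x exceeds y (no ties in continuous data).
auc_pairs <- mean(outer(x, y, ">"))
all.equal(auc, auc_pairs)
```

A value near 0.5 would indicate little separation between the groups, while values near 0 or 1 would indicate strong separation in one direction or the other.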