Predictive Analytics and Data Mining
Caio Quirino - RM47203
Tiago Vinícius - RM47294
26/06/2015
Given source-code metrics, is it possible to build a predictive model that estimates, with some probability, whether a Java class contains bugs?
dataset <- read.csv("lucene.csv", sep = ";", header = TRUE)
str(dataset)
'data.frame': 691 obs. of 24 variables:
$ classname : Factor w/ 691 levels "com::sleepycat::db::DbHandleExtractor ",..: 525 245 210 454 104 96 422 672 516 207 ...
$ cbo : num 17 34 179 5 5 128 0 1 6 16 ...
$ dit : num 3 1 1 1 1 1 2 2 1 2 ...
$ fanIn : num 10 9 174 1 0 122 0 0 4 13 ...
$ fanOut : num 7 27 5 4 5 6 0 1 2 3 ...
$ lcom : num 55 2415 120 3 0 ...
$ noc : num 0 0 0 0 0 0 0 0 0 3 ...
$ numberOfAttributes : num 2 60 6 7 0 11 1 8 18 16 ...
$ numberOfAttributesInherited: num 0 0 0 0 0 0 2 6 0 0 ...
$ numberOfLinesOfCode : num 170 829 138 40 19 240 4 345 44 182 ...
$ numberOfMethods : num 11 70 16 3 1 31 1 7 7 22 ...
$ numberOfMethodsInherited : num 44 9 18 11 10 18 19 35 18 41 ...
$ numberOfPrivateAttributes : num 2 25 5 7 0 3 0 8 4 0 ...
$ numberOfPrivateMethods : num 0 11 0 0 0 4 0 6 0 0 ...
$ numberOfPublicAttributes : num 0 19 1 0 0 7 1 0 1 0 ...
$ numberOfPublicMethods : num 11 0 16 3 1 27 1 1 7 20 ...
$ rfc : num 59 263 99 12 7 72 1 32 20 57 ...
$ wmc : num 29 177 46 10 3 66 1 51 13 60 ...
$ bugs : num 1 2 0 0 0 0 0 0 1 0 ...
$ nonTrivialBugs : num 0 0 0 0 0 0 0 0 0 0 ...
$ majorBugs : num 0 0 0 0 0 0 0 0 0 0 ...
$ criticalBugs : num 0 0 0 0 0 0 0 0 0 0 ...
$ highPriorityBugs : num 0 0 0 0 0 0 0 0 0 0 ...
$ X : logi NA NA NA NA NA NA ...
# Independent nominal qualitative variable: the fully qualified name of the class
classname <- subset(dataset, select = classname)
# Example
as.vector(dataset$classname[1])
[1] "org::apache::lucene::search::spans::SpanOrQuery "
# Independent quantitative variables
quantitativas <- subset(dataset, select = c(2:18))
names(quantitativas)
[1] "cbo" "dit"
[3] "fanIn" "fanOut"
[5] "lcom" "noc"
[7] "numberOfAttributes" "numberOfAttributesInherited"
[9] "numberOfLinesOfCode" "numberOfMethods"
[11] "numberOfMethodsInherited" "numberOfPrivateAttributes"
[13] "numberOfPrivateMethods" "numberOfPublicAttributes"
[15] "numberOfPublicMethods" "rfc"
[17] "wmc"
# Candidate dependent variables
resposta <- subset(dataset, select = c(19:23))
names(resposta)
[1] "bugs" "nonTrivialBugs" "majorBugs"
[4] "criticalBugs" "highPriorityBugs"
source(file="functions.R")
# Split the classname variable into two: package and class
dataset$package <- getPackageName(dataset$classname)
dataset$class <- getClassName(dataset$classname)
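The helpers getPackageName and getClassName come from functions.R, which is not reproduced in this report. A minimal sketch of what they might look like, assuming the class name is the last "::"-separated token and the package is everything before it. Both definitions below are hypothetical stand-ins, not the authors' code:

```r
# Hypothetical stand-ins for the helpers sourced from functions.R.
# Assumption: names look like "pkg::subpkg::ClassName " (note the trailing space).

getClassName <- function(x) {
  x <- trimws(as.character(x))
  # Keep only what follows the last "::"
  sub("^.*::", "", x)
}

getPackageName <- function(x) {
  x <- trimws(as.character(x))
  # Drop the last "::"-separated token; names without "::" have no package
  ifelse(grepl("::", x, fixed = TRUE), sub("::[^:]*$", "", x), "")
}

example <- "org::apache::lucene::search::spans::SpanOrQuery "
getPackageName(example)  # "org::apache::lucene::search::spans"
getClassName(example)    # "SpanOrQuery"
```

Both functions are vectorised, so they can be applied to the whole classname column at once.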
# Drop columns that have no values and rows that have no classname value
dataset <- subset(dataset, !is.na(classname))
dataset <- subset(dataset, select = -c(classname,majorBugs,highPriorityBugs,nonTrivialBugs,criticalBugs,X))
# New set of variables
names(dataset)
[1] "cbo" "dit"
[3] "fanIn" "fanOut"
[5] "lcom" "noc"
[7] "numberOfAttributes" "numberOfAttributesInherited"
[9] "numberOfLinesOfCode" "numberOfMethods"
[11] "numberOfMethodsInherited" "numberOfPrivateAttributes"
[13] "numberOfPrivateMethods" "numberOfPublicAttributes"
[15] "numberOfPublicMethods" "rfc"
[17] "wmc" "bugs"
[19] "package" "class"
# Convert bugs into a dichotomous variable: 0 = no bugs, 1 = has bugs
dataset$bugs <- ifelse(dataset$bugs > 0, 1, 0)
# Class distribution (number of classes without and with bugs)
table(dataset$bugs)
0 1
627 64
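The model in the next step is fitted on log_dataset, whose construction does not appear in this report. One plausible reading, and it is only an assumption, is that the metric columns were put on a log scale to tame their right skew while bugs was left untouched. A sketch on a few rows taken from the str() output (toy, metric_cols and log_toy are illustrative names, and log1p is chosen here because several metrics are zero):

```r
# Assumption: log_dataset = metrics on a log scale + the 0/1 bugs column.
# Values are the first four rows of three metrics from the str() output,
# with bugs already dichotomised.
toy <- data.frame(
  cbo  = c(17, 34, 179, 5),
  lcom = c(55, 2415, 120, 3),
  noc  = c(0, 0, 0, 0),
  bugs = c(1, 1, 0, 0)
)

metric_cols <- setdiff(names(toy), "bugs")

log_toy <- toy
# log1p(x) = log(1 + x), which is defined at zero, unlike log(x)
log_toy[metric_cols] <- lapply(toy[metric_cols], log1p)

log_toy
```

Whatever transformation was actually used, the text columns package and class would also have to be left out, since they do not appear among the predictors of the fitted model.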
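# Logistic regression without an intercept ("- 1"), fitted on log_dataset
# (the construction of log_dataset is not shown in this report)
fit <- glm(bugs ~ . - 1, family = binomial(link = logit), data = log_dataset)
# Stepwise variable selection by AIC, in both directions
stepwise <- step(fit, direction = "both")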
summary(stepwise)
Call:
glm(formula = bugs ~ cbo + dit + fanIn + fanOut + lcom + noc +
numberOfMethods + numberOfMethodsInherited + numberOfPrivateAttributes +
numberOfPublicAttributes - 1, family = binomial(link = logit),
data = log_dataset)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5510 -0.4393 -0.3298 -0.2429 2.7747
Coefficients:
Estimate Std. Error z value Pr(>|z|)
cbo -1.3160 0.6372 -2.065 0.038896 *
dit 1.1561 0.6510 1.776 0.075725 .
fanIn 1.1214 0.4182 2.682 0.007321 **
fanOut 1.2501 0.4481 2.789 0.005279 **
lcom 1.7549 0.4841 3.625 0.000289 ***
noc 0.8447 0.2321 3.640 0.000273 ***
numberOfMethods -4.0403 1.0567 -3.823 0.000132 ***
numberOfMethodsInherited -0.6280 0.2707 -2.320 0.020346 *
numberOfPrivateAttributes 0.4946 0.1841 2.687 0.007218 **
numberOfPublicAttributes 0.3338 0.1827 1.827 0.067667 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 957.93 on 691 degrees of freedom
Residual deviance: 368.72 on 681 degrees of freedom
AIC: 388.72
Number of Fisher Scoring iterations: 6
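Coefficients of a logistic model are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A small sketch using the estimates copied from the summary above. The predictors come from log_dataset, so each ratio refers to a one-unit change on whatever scale that data frame uses, which this report does not show:

```r
# Estimates copied from summary(stepwise)
estimates <- c(
  cbo = -1.3160, dit = 1.1561, fanIn = 1.1214, fanOut = 1.2501,
  lcom = 1.7549, noc = 0.8447, numberOfMethods = -4.0403,
  numberOfMethodsInherited = -0.6280,
  numberOfPrivateAttributes = 0.4946, numberOfPublicAttributes = 0.3338
)

# Odds ratio > 1: higher odds of bugs; < 1: lower odds,
# with the other predictors held fixed
odds_ratios <- exp(estimates)
round(sort(odds_ratios, decreasing = TRUE), 3)
```

With the fitted object at hand, exp(coef(stepwise)) gives the same numbers without retyping them.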
# Model output as probabilities in [0, 1)
predict <- fitted(stepwise)
# Probability bands
fx_predito <- cut(predict, breaks = c(0, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1), right = FALSE)
# Absolute frequencies
table(fx_predito, log_dataset$bugs)
fx_predito 0 1
[0,0.1) 480 22
[0.1,0.2) 101 20
[0.2,0.3) 26 7
[0.3,0.4) 10 6
[0.4,0.5) 5 4
[0.5,0.6) 4 0
[0.6,0.7) 1 2
[0.7,0.8) 0 1
[0.8,0.9) 0 0
[0.9,1) 0 2
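One way to read this table is as a rough calibration check: if the predicted probabilities are sensible, the observed share of buggy classes in each band should fall near that band's range. A sketch that computes the share from the counts above (freq and bug_rate are illustrative names):

```r
# Counts copied from table(fx_predito, log_dataset$bugs)
bands <- c("[0,0.1)", "[0.1,0.2)", "[0.2,0.3)", "[0.3,0.4)", "[0.4,0.5)",
           "[0.5,0.6)", "[0.6,0.7)", "[0.7,0.8)", "[0.8,0.9)", "[0.9,1)")
no_bugs <- c(480, 101, 26, 10, 5, 4, 1, 0, 0, 0)
bugs    <- c(22, 20, 7, 6, 4, 0, 2, 1, 0, 2)

freq <- data.frame(band = bands, no_bugs, bugs)
freq$n <- freq$no_bugs + freq$bugs
# Observed share of buggy classes in each band (NaN where the band is empty)
freq$bug_rate <- round(freq$bugs / freq$n, 3)
freq
```

The five bands from 0.5 upward hold only ten classes in total, so their rates are too noisy to read much into.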
library(caret)
log_dataset$pred.bugs <- ifelse(predict <= 0.5, 0, 1)
table(log_dataset$pred.bugs)
0 1
681 10
atual <- as.factor(log_dataset$bugs)
predito <- as.factor(log_dataset$pred.bugs)
confusionMatrix(predito, atual, positive = "1")
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0 622  59
         1   5   5
Accuracy : 0.9074
95% CI : (0.8833, 0.9279)
No Information Rate : 0.9074
P-Value [Acc > NIR] : 0.5332
Kappa : 0.1129
Mcnemar's Test P-Value : 3.472e-11
Sensitivity : 0.078125
Specificity : 0.992026
Pos Pred Value : 0.500000
Neg Pred Value : 0.913363
Prevalence : 0.092619
Detection Rate : 0.007236
Detection Prevalence : 0.014472
Balanced Accuracy : 0.535075
'Positive' Class : 1
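With only 64 buggy classes out of 691, a 0.5 cutoff flags almost nothing, which is why accuracy merely equals the no-information rate and sensitivity stays below 8%. Lowering the cutoff trades specificity for sensitivity. The sketch below estimates that trade-off from the band counts tabulated earlier, assuming no fitted probability falls exactly on a band boundary (metrics_at is an illustrative name):

```r
# Band counts copied from table(fx_predito, log_dataset$bugs);
# lower[i] is the lower edge of band i
lower   <- seq(0, 0.9, by = 0.1)
no_bugs <- c(480, 101, 26, 10, 5, 4, 1, 0, 0, 0)
bugs    <- c(22, 20, 7, 6, 4, 0, 2, 1, 0, 2)

# Classify as buggy every class whose band starts at or above the cutoff
metrics_at <- function(cutoff) {
  flagged <- lower >= cutoff - 1e-9   # tolerance for floating-point band edges
  tp <- sum(bugs[flagged]);    fn <- sum(bugs[!flagged])
  fp <- sum(no_bugs[flagged]); tn <- sum(no_bugs[!flagged])
  c(cutoff = cutoff,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

round(t(sapply(c(0.1, 0.2, 0.3, 0.5), metrics_at)), 3)
```

The 0.5 row reproduces the sensitivity, specificity and positive predictive value reported by confusionMatrix, which is a useful check that the counts were copied correctly. Which cutoff is appropriate depends on how costly a missed buggy class is compared with a false alarm.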