## Warning: Missing column names filled in: 'X15' [15], 'X16' [16],
## 'X17' [17], 'X19' [19], 'X20' [20]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   aluno = col_character(),
##   X15 = col_character(),
##   X17 = col_logical(),
##   X19 = col_logical(),
##   X20 = col_character()
## )
## See spec(...) for full column specifications.

First, we use PCA to reduce the dimensionality of the data.

The dataset contains a total of 13 variables, all binary.
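
As an illustration, a PCA on binary indicators can be sketched with base R's `prcomp()`. The data below are simulated, not the real dataset: `provas_sim` and its probabilities are hypothetical, the column names are borrowed from the regression output further down, and the duplicated `hashCode` column is only an assumption used to show how a redundant variable yields a component with near-zero standard deviation. Leaving the variables unscaled is also an assumption.

```r
# Simulated stand-in for the real data: 112 rows of binary indicators.
set.seed(42)
provas_sim <- data.frame(
  hasTests     = rbinom(112, 1, 0.6),
  hasDoc       = rbinom(112, 1, 0.9),
  useInterface = rbinom(112, 1, 0.4),
  equals       = rbinom(112, 1, 0.5)
)
# Hypothetical redundancy: hashCode is an exact copy of equals.
provas_sim$hashCode <- provas_sim$equals

# Centered, unscaled PCA (all variables share the same 0/1 scale).
pca <- prcomp(provas_sim, center = TRUE, scale. = FALSE)

# Same kind of table as "Importance of components" in the report.
print(summary(pca))
```

The last component has a standard deviation that is numerically zero, because one column adds no new information.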

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     0.7434 0.6456 0.5568 0.4982 0.47040 0.45063 0.4129
## Proportion of Variance 0.2236 0.1687 0.1255 0.1005 0.08955 0.08218 0.0690
## Cumulative Proportion  0.2236 0.3923 0.5178 0.6183 0.70783 0.79001 0.8590
##                            PC8     PC9   PC10    PC11    PC12      PC13
## Standard deviation     0.37029 0.27778 0.2501 0.22610 0.14307 8.828e-17
## Proportion of Variance 0.05549 0.03123 0.0253 0.02069 0.00828 0.000e+00
## Cumulative Proportion  0.91450 0.94572 0.9710 0.99172 1.00000 1.000e+00

The samples were separated into 2 well-defined groups. PC1 explains about 22% of the variance and PC2 about 17%, so together they account for roughly 39%. Both values are quite low.
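
The percentages can be checked directly from the standard deviations reported in the table above: each component's share is its variance divided by the total variance. The snippet below only re-uses the printed (rounded) values, so small rounding differences are expected.

```r
# Standard deviations copied from the "Importance of components" table.
sdev <- c(0.7434, 0.6456, 0.5568, 0.4982, 0.47040, 0.45063, 0.4129,
          0.37029, 0.27778, 0.2501, 0.22610, 0.14307, 8.828e-17)

# Proportion of variance = variance of each PC / total variance.
var_explained <- sdev^2 / sum(sdev^2)
cum_explained <- cumsum(var_explained)

print(round(var_explained[1:2], 3))  # shares of PC1 and PC2
print(round(cum_explained[2], 3))    # PC1 + PC2 together
```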

Using k-means.

First, we try to identify a good number of clusters.

k = 4 looks good.
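
The report does not show which criterion was used to pick k. One common option is the elbow method: run k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses simulated binary data (`x` is hypothetical), so its curve says nothing about the real dataset.

```r
# Simulated binary data standing in for the real indicators.
set.seed(1)
x <- matrix(rbinom(112 * 6, 1, 0.5), ncol = 6)

# Total within-cluster sum of squares for k = 1..8.
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is the k after which the curve flattens.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```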

## Joining, by = c("X.1", "aluno")

K-means + PCA

Now we want to check whether the clusters produced by k-means form a coherent pattern when plotted on the PCA axes.

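The points did not end up close to the other members of their k-means cluster. This is plausible, since PC1 and PC2 together capture only about 39% of the variance, so clusters that are separated in the full space can overlap in a two-dimensional projection.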
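
A minimal sketch of such an overlay, assuming the clusters come from `kmeans()` on the original variables and the axes from `prcomp()` on the same matrix. The data are simulated and `x` is a hypothetical name.

```r
# Simulated binary data standing in for the real indicators.
set.seed(7)
x <- matrix(rbinom(112 * 6, 1, 0.5), ncol = 6)

# Cluster in the full space, then project onto the first two PCs.
km  <- kmeans(x, centers = 4, nstart = 20)
pca <- prcomp(x)

# Color each point by its k-means cluster.
plot(pca$x[, 1], pca$x[, 2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
```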

Using Multiple Regression

[I don't know how to explain this one]
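
Two general points may help when reading the output that follows. With no `family` argument, `glm()` fits a Gaussian model, so each coefficient is the estimated change in the response when that indicator goes from 0 to 1 with the others held fixed. A coefficient reported as `NA` ("not defined because of singularities") means the predictor is an exact linear combination of the others. The sketch below reproduces that behavior on simulated data; `grade` and the assumption that `hashCode` duplicates `equals` are hypothetical and not taken from the real dataset.

```r
# Simulated data: a response driven by hasTests, plus a redundant column.
set.seed(3)
n <- 112
d <- data.frame(
  hasTests = rbinom(n, 1, 0.5),
  equals   = rbinom(n, 1, 0.5)
)
d$hashCode <- d$equals                        # hypothetical exact duplicate
d$grade    <- 4 + 3 * d$hasTests + rnorm(n)   # hypothetical response

# Default family is gaussian, i.e. ordinary least squares.
fit <- glm(grade ~ hasTests + equals + hashCode, data = d)

# hashCode cannot be estimated separately from equals, so it is NA.
print(coef(fit))
```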

## 
## Call:  glm(formula = X.1 ~ hasTests + hasDoc + hasController + hasFacade + 
##     useInterface + useInheritance + equals + hashCode + useException + 
##     useAbstractClass + usedHashMap + usedHashSet + usedArrayList, 
##     data = provas)
## 
## Coefficients:
##      (Intercept)          hasTests            hasDoc     hasController  
##         4.015307          3.009464          0.207925          0.005925  
##        hasFacade      useInterface    useInheritance            equals  
##        -0.586249          1.501690          0.031041          0.141836  
##         hashCode      useException  useAbstractClass       usedHashMap  
##               NA          0.319071          1.007408          0.305133  
##      usedHashSet     usedArrayList  
##        -0.828431          0.204504  
## 
## Degrees of Freedom: 111 Total (i.e. Null);  99 Residual
## Null Deviance:       557.7 
## Residual Deviance: 214.6     AIC: 418.7
## 
## Call:
## glm(formula = X.1 ~ hasTests + hasDoc + hasController + hasFacade + 
##     useInterface + useInheritance + equals + hashCode + useException + 
##     useAbstractClass + usedHashMap + usedHashSet + usedArrayList, 
##     data = provas)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6880  -0.8456   0.2750   0.9234   3.3355  
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.015307   0.676163   5.938 4.28e-08 ***
## hasTests          3.009464   0.325491   9.246 4.83e-15 ***
## hasDoc            0.207925   0.616683   0.337   0.7367    
## hasController     0.005925   0.299420   0.020   0.9843    
## hasFacade        -0.586249   0.306974  -1.910   0.0591 .  
## useInterface      1.501690   0.327877   4.580 1.36e-05 ***
## useInheritance    0.031041   0.381990   0.081   0.9354    
## equals            0.141836   0.297476   0.477   0.6346    
## hashCode                NA         NA      NA       NA    
## useException      0.319071   0.314363   1.015   0.3126    
## useAbstractClass  1.007408   0.452454   2.227   0.0282 *  
## usedHashMap       0.305133   0.570824   0.535   0.5942    
## usedHashSet      -0.828431   0.844261  -0.981   0.3289    
## usedArrayList     0.204504   0.506349   0.404   0.6872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.167412)
## 
##     Null deviance: 557.68  on 111  degrees of freedom
## Residual deviance: 214.57  on  99  degrees of freedom
## AIC: 418.66
## 
## Number of Fisher Scoring iterations: 2
## Waiting for profiling to be done...
##                       2.5 %     97.5 %
## (Intercept)       2.6900518 5.34056247
## hasTests          2.3715136 3.64741445
## hasDoc           -1.0007517 1.41660206
## hasController    -0.5809281 0.59277816
## hasFacade        -1.1879056 0.01540852
## useInterface      0.8590636 2.14431596
## useInheritance   -0.7176455 0.77972734
## equals           -0.4412062 0.72487861
## hashCode                 NA         NA
## useException     -0.2970691 0.93521078
## useAbstractClass  0.1206139 1.89420281
## usedHashMap      -0.8136607 1.42392635
## usedHashSet      -2.4831530 0.82629041
## usedArrayList    -0.7879211 1.19692877

Using MCA
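
The report does not show which package produced the tables below. As a rough illustration of what MCA computes, the sketch here performs a correspondence analysis of the indicator (one-hot) matrix in base R. The data are simulated, and all object names (`vars`, `Z`, `coord`, and so on) are hypothetical. For binary variables the eigenvalues sum to 1, and the two categories of each variable fall on opposite sides of the origin, which matches the sign pattern seen in the coordinates below.

```r
# Simulated logical variables standing in for the real indicators.
set.seed(11)
n <- 112
vars <- data.frame(
  hasFacade      = rbinom(n, 1, 0.50) == 1,
  hasController  = rbinom(n, 1, 0.55) == 1,
  useInheritance = rbinom(n, 1, 0.45) == 1
)
Q <- ncol(vars)

# Indicator matrix: one column per category, e.g. hasFacade_TRUE.
Z <- do.call(cbind, lapply(names(vars), function(v) {
  m <- cbind(!vars[[v]], vars[[v]]) * 1
  colnames(m) <- paste0(v, c("_FALSE", "_TRUE"))
  m
}))

P  <- Z / sum(Z)       # correspondence matrix
r  <- rowSums(P)       # row masses
cm <- colSums(P)       # column (category) masses

# Standardized residuals and their singular value decomposition.
S  <- diag(1 / sqrt(r)) %*% (P - r %o% cm) %*% diag(1 / sqrt(cm))
sv <- svd(S)

k   <- ncol(Z) - Q     # number of non-trivial dimensions
eig <- sv$d[1:k]^2     # eigenvalues (principal inertias)

# Principal coordinates of the categories.
coord <- diag(1 / sqrt(cm)) %*% sv$v[, 1:k, drop = FALSE] %*% diag(sv$d[1:k], k)
rownames(coord) <- colnames(Z)

print(round(100 * eig / sum(eig), 2))  # percent of variance per dimension
print(round(coord, 3))
```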

##    eigenvalue      variance.percent cumulative.variance.percent
##  Min.   :0.00000   Min.   : 0.000   Min.   : 18.51             
##  1st Qu.:0.04736   1st Qu.: 4.736   1st Qu.: 55.63             
##  Median :0.06833   Median : 6.833   Median : 79.23             
##  Mean   :0.07692   Mean   : 7.692   Mean   : 72.26             
##  3rd Qu.:0.09904   3rd Qu.: 9.904   3rd Qu.: 95.75             
##  Max.   :0.18515   Max.   :18.515   Max.   :100.00

##                           Dim 1      Dim 2        Dim 3       Dim 4
## hasFacade_FALSE       0.2067123 -0.3990625  0.005748542 -0.38302131
## hasFacade_TRUE       -0.2142291  0.4135738 -0.005957580  0.39694936
## hasController_FALSE   0.1842612 -0.4709442 -0.329187897 -0.04362486
## hasController_TRUE   -0.1433143  0.3662899  0.256035031  0.03393045
## useInheritance_FALSE  0.3009261 -0.6126740  0.081698399 -0.33400793
## useInheritance_TRUE  -0.3731484  0.7597157 -0.101306015  0.41416984
##                           Dim 5
## hasFacade_FALSE       0.2274755
## hasFacade_TRUE       -0.2357474
## hasController_FALSE   0.5147276
## hasController_TRUE   -0.4003437
## useInheritance_FALSE -0.2254024
## useInheritance_TRUE   0.2794989
##                           Dim 1     Dim 2        Dim 3       Dim 4
## hasFacade_FALSE      0.04428378 0.1650418 0.0000342474 0.152040062
## hasFacade_TRUE       0.04428378 0.1650418 0.0000342474 0.152040062
## hasController_FALSE  0.02640726 0.1725021 0.0842836333 0.001480211
## hasController_TRUE   0.02640726 0.1725021 0.0842836333 0.001480211
## useInheritance_FALSE 0.11229011 0.4654580 0.0082765393 0.138336012
## useInheritance_TRUE  0.11229011 0.4654580 0.0082765393 0.138336012
##                           Dim 5
## hasFacade_FALSE      0.05362677
## hasFacade_TRUE       0.05362677
## hasController_FALSE  0.20606791
## hasController_TRUE   0.20606791
## useInheritance_FALSE 0.06299972
## useInheritance_TRUE  0.06299972
##                          Dim 1     Dim 2       Dim 3      Dim 4    Dim 5
## hasFacade_FALSE      0.9035007  4.308584 0.001015214 5.79897956 2.235946
## hasFacade_TRUE       0.9363552  4.465260 0.001052131 6.00985154 2.317253
## hasController_FALSE  0.6171418  5.158375 2.861877480 0.06466884 9.841650
## hasController_TRUE   0.4799992  4.012069 2.225904707 0.05029799 7.654617
## useInheritance_FALSE 2.0827284 11.046594 0.223041679 4.79662876 2.387952
## useInheritance_TRUE  2.5825832 13.697777 0.276571682 5.94781966 2.961061