Se investigó la utilidad del KNN para la detección de cáncer de mama con las metricas tomadas de celulas extraídas de masas anormales en el pecho de mujeres. El dataset fue generado por Wisconsin Breast Cancer Diagnostic de UCI (http://archive.ics.uci.edu/ml). Incluye 569 ejemplos de biopsas, cada una con 32 características. El diagnóstico es “M” para maligna y “B” para benigna.
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
##
## B M
## 357 212
##
## Benign Malignant
## 357 212
##
## Benign Malignant
## 62.7 37.3
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
## [1] 0.00 0.25 0.50 0.75 1.00
## [1] 0.00 0.25 0.50 0.75 1.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4532 -0.6666 -0.2949 0.0000 0.3632 5.2459
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
En esta practica de implemento el algoritmo de Bayes para el filtrado de spam de mensajes SMS. El dataset se tomo de: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/, el cual indica con una etiqueta si el mensaje no es deseado. Mensajes de broma son etiquetados como spam, mientras que los mensajes legitimos son etiquetados como ham.
## 'data.frame': 5574 obs. of 2 variables:
## $ type: chr "ham" "ham" "spam" "ham" ...
## $ text: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
## Factor w/ 2 levels "ham","spam": 1 1 2 1 1 2 1 1 2 2 ...
##
## ham spam
## 4827 747
## Loading required package: NLP
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5574
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 2
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 111
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 29
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## $`1`
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
##
## $`2`
## [1] "Ok lar... Joking wif u oni..."
## [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
## [1] "go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat..."
## [1] "learn" "learn" "learn" "learn"
## [1] "Hope you are having a good week. Just checking in"
## [1] "hope good week just check"
## <<DocumentTermMatrix (documents: 5574, terms: 6592)>>
## Non-/sparse entries: 42608/36701200
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
## <<DocumentTermMatrix (documents: 5574, terms: 8361)>>
## Non-/sparse entries: 44214/46560000
## Sparsity : 100%
## Maximal term length: 40
## Weighting : term frequency (tf)
## sms_train_labels
## ham spam
## 0.8648325 0.1351675
## sms_test_labels
## ham spam
## 0.8694405 0.1305595
## Loading required package: RColorBrewer
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation):
## transformation drops documents
## chr [1:1161] "£wk" "abiola" "abl" "abt" "accept" "access" "account" ...
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1394
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1203 | 20 | 1223 |
## | 0.984 | 0.016 | 0.877 |
## | 0.993 | 0.110 | |
## -------------|-----------|-----------|-----------|
## spam | 9 | 162 | 171 |
## | 0.053 | 0.947 | 0.123 |
## | 0.007 | 0.890 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1212 | 182 | 1394 |
## | 0.869 | 0.131 | |
## -------------|-----------|-----------|-----------|
##
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1394
##
##
## | actual
## predicted (improving model) | ham | spam | Row Total |
## ----------------------------|-----------|-----------|-----------|
## ham | 1205 | 28 | 1233 |
## | 0.977 | 0.023 | 0.885 |
## | 0.994 | 0.154 | |
## ----------------------------|-----------|-----------|-----------|
## spam | 7 | 154 | 161 |
## | 0.043 | 0.957 | 0.115 |
## | 0.006 | 0.846 | |
## ----------------------------|-----------|-----------|-----------|
## Column Total | 1212 | 182 | 1394 |
## | 0.869 | 0.131 | |
## ----------------------------|-----------|-----------|-----------|
##
##
Arboles de desicion son ampliamente en la industria bancaria debido a su exactitud y su habilidad para generar modelos estadísticos en lenguaje simple. El dataset utilizado fue donado por CI Machine Learning Data Repository ( http://archive.ics.uci.edu/ml ).
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
## $ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_length : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4 4 3 3 1 3 4 5 ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
## $ other_debtors : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
## $ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ default : int 1 2 1 1 2 1 1 1 1 2 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ job : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
##
## < 0 DM > 200 DM 1 - 200 DM unknown
## 274 63 269 394
##
## < 100 DM > 1000 DM 101 - 500 DM 501 - 1000 DM unknown
## 603 48 103 63 183
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
##
## no yes
## 700 300
## int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
##
## no yes
## 0.7033333 0.2966667
##
## no yes
## 0.67 0.33
##
## Call:
## C5.0.default(x = credit_train[-17], y = as.factor(credit_train$default))
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Jan 25 02:44:35 2019
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (21 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {< 0 DM,1 - 200 DM}:
## :...other_debtors = guarantor:
## :...months_loan_duration > 36: yes (4/1)
## : months_loan_duration <= 36:
## : :...installment_plan in {none,stores}: no (24)
## : installment_plan = bank:
## : :...purpose = car (new): yes (3)
## : purpose in {business,car (used),domestic appliances,education,
## : furniture,others,radio/tv,repairs,
## : retraining}: no (7/1)
## other_debtors in {co-applicant,none}:
## :...credit_history = critical: no (102/30)
## credit_history = fully repaid: yes (27/6)
## credit_history = fully repaid this bank:
## :...other_debtors = co-applicant: no (2)
## : other_debtors = none: yes (26/8)
## credit_history in {delayed,repaid}:
## :...savings_balance in {> 1000 DM,501 - 1000 DM}: no (19/3)
## savings_balance = 101 - 500 DM:
## :...other_debtors = co-applicant: yes (3)
## : other_debtors = none:
## : :...personal_status in {divorced male,
## : : married male}: yes (6/1)
## : personal_status = female:
## : :...installment_rate <= 3: no (4/1)
## : : installment_rate > 3: yes (4)
## : personal_status = single male:
## : :...age <= 41: no (15/2)
## : age > 41: yes (2)
## savings_balance = unknown:
## :...credit_history = delayed: no (8)
## : credit_history = repaid:
## : :...foreign_worker = no: no (2)
## : foreign_worker = yes:
## : :...checking_balance = < 0 DM:
## : :...telephone = none: yes (11/2)
## : : telephone = yes:
## : : :...amount <= 5045: no (5/1)
## : : amount > 5045: yes (2)
## : checking_balance = 1 - 200 DM:
## : :...residence_history > 3: no (9)
## : residence_history <= 3: [S1]
## savings_balance = < 100 DM:
## :...months_loan_duration > 39:
## :...residence_history <= 1: no (2)
## : residence_history > 1: yes (19/1)
## months_loan_duration <= 39:
## :...purpose in {car (new),retraining}: yes (47/16)
## purpose in {domestic appliances,others}: no (3)
## purpose = car (used):
## :...amount <= 8086: no (9/1)
## : amount > 8086: yes (5)
## purpose = education:
## :...checking_balance = < 0 DM: yes (5)
## : checking_balance = 1 - 200 DM: no (2)
## purpose = repairs:
## :...residence_history <= 3: yes (4/1)
## : residence_history > 3: no (3)
## purpose = business:
## :...credit_history = delayed: yes (2)
## : credit_history = repaid:
## : :...age <= 34: no (5)
## : age > 34: yes (2)
## purpose = radio/tv:
## :...employment_length in {0 - 1 yrs,
## : : unemployed}: yes (14/5)
## : employment_length = 4 - 7 yrs: no (3)
## : employment_length = > 7 yrs:
## : :...amount <= 932: yes (2)
## : : amount > 932: no (7)
## : employment_length = 1 - 4 yrs:
## : :...months_loan_duration <= 15: no (6)
## : months_loan_duration > 15:
## : :...amount <= 3275: yes (7)
## : amount > 3275: no (2)
## purpose = furniture:
## :...residence_history <= 1: no (8/1)
## residence_history > 1:
## :...installment_plan in {bank,stores}: no (3/1)
## installment_plan = none:
## :...telephone = yes: yes (7/1)
## telephone = none:
## :...months_loan_duration > 27: yes (3)
## months_loan_duration <= 27: [S2]
##
## SubTree [S1]
##
## property in {building society savings,unknown/none}: yes (4)
## property = other: no (6)
## property = real estate:
## :...job = skilled employee: yes (2)
## job in {mangement self-employed,unemployed non-resident,
## unskilled resident}: no (2)
##
## SubTree [S2]
##
## checking_balance = 1 - 200 DM: yes (5/2)
## checking_balance = < 0 DM:
## :...property in {building society savings,real estate,unknown/none}: no (8)
## property = other:
## :...installment_rate <= 1: no (2)
## installment_rate > 1: yes (4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 54 135(15.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 589 44 (a): class no
## 91 176 (b): class yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 54.22% other_debtors
## 50.00% credit_history
## 32.56% savings_balance
## 25.22% months_loan_duration
## 19.78% purpose
## 10.11% residence_history
## 7.33% installment_plan
## 5.22% telephone
## 4.78% foreign_worker
## 4.56% employment_length
## 4.33% amount
## 3.44% personal_status
## 3.11% property
## 2.67% age
## 1.56% installment_rate
## 0.44% job
##
##
## Time: 0.0 secs
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 60 | 7 | 67 |
## | 0.600 | 0.070 | |
## ---------------|-----------|-----------|-----------|
## yes | 19 | 14 | 33 |
## | 0.190 | 0.140 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
##
## Call:
## C5.0.default(x = credit_train[-17], y =
## as.factor(credit_train$default), trials = 10)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 20
##
## Number of boosting iterations: 10
## Average tree size: 49.7
##
## Non-standard options: attempt to group attributes
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 60 | 7 | 67 |
## | 0.600 | 0.070 | |
## ---------------|-----------|-----------|-----------|
## yes | 17 | 16 | 33 |
## | 0.170 | 0.160 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 77 | 23 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 33 | 34 | 67 |
## | 0.330 | 0.340 | |
## ---------------|-----------|-----------|-----------|
## yes | 7 | 26 | 33 |
## | 0.070 | 0.260 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 40 | 60 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
Para inentificar las reglas para distiguir los hongos venenosos se utlizó el dataset “mushrooms.csv” disponible en UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml ).
## Step 2 - exploring and preparing the data
mushrooms <- read.csv("~/Documents/MCPI/Aprendizaje Automatico/Practicas/Clases/Datsets/mushrooms.csv", stringsAsFactors = TRUE)
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ type : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
## $ cap.shape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
## $ cap.surface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
## $ cap.color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
## $ bruises. : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
## $ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
## $ gill.attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
## $ gill.spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
## $ gill.size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
## $ gill.color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
## $ stalk.shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
## $ stalk.root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
## $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ stalk.color.above.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ stalk.color.below.ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ veil.type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
## $ veil.color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
## $ ring.number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ ring.type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
## $ spore.print.color : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
## $ population : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
## $ habitat : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...
#the feature from the data frame eliminated
mushrooms$veil.type <- NULL
table(mushrooms$type) # to look the distribution of the mushroom type class variable
##
## e p
## 4208 3916
## Step 3 - trining a model on the data
#install.packages("rJava")
#install.packages("RWeka")
#library(RWeka)
#install.packages("OneR")
mushroom_1R = OneR::OneR(type ~ ., data=mushrooms)
mushroom_1R
##
## Call:
## OneR.formula(formula = type ~ ., data = mushrooms)
##
## Rules:
## If odor = a then type = e
## If odor = c then type = p
## If odor = f then type = p
## If odor = l then type = e
## If odor = m then type = p
## If odor = n then type = e
## If odor = p then type = p
## If odor = s then type = p
## If odor = y then type = p
##
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)
#### Step 4 - evaluating model performance
summary(mushroom_1R) # shows classifier ditails
##
## Call:
## OneR.formula(formula = type ~ ., data = mushrooms)
##
## Rules:
## If odor = a then type = e
## If odor = c then type = p
## If odor = f then type = p
## If odor = l then type = e
## If odor = m then type = p
## If odor = n then type = e
## If odor = p then type = p
## If odor = s then type = p
## If odor = y then type = p
##
## Accuracy:
## 8004 of 8124 instances classified correctly (98.52%)
##
## Contingency table:
## odor
## type a c f l m n p s y Sum
## e * 400 0 0 * 400 0 * 3408 0 0 0 4208
## p 0 * 192 * 2160 0 * 36 120 * 256 * 576 * 576 3916
## Sum 400 192 2160 400 36 3528 256 576 576 8124
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 7659.7, df = 8, p-value < 2.2e-16
## Step 5 - improving model performance
#mushroom_JRip <- JRip(class ~ predictors, data = mydata)
# Let's train the JRip() rule learner as we did with OneR()
#mushroom_JRip =RWeka::JRip(type~ ., data=mushrooms)
#mushroom_JRip
Regresion lineal simple como ejercicio introductorio a regresión simple dada por y = a + Bx. Se utilizó el dataset “challenger.csv”. La variable independiente x es la temperatura, la dependiente es distress_ct. Se usó cov() y var() para estimar b.
launch <- read.csv("~/Documents/MCPI/Aprendizaje Automatico/Practicas/Clases/Datsets/challenger.csv")
b <- cov(launch$temperature, launch$distress_ct)/var(launch$temperature)
# We can estimate a using the mean() function:
a <- mean(launch$distress_ct - b * mean(launch$temperature))
# The correlation between the launch temperture and the number of o-ring distress events is:
r <- cov(launch$temperature, launch$distress_ct)/ (sd(launch$temperature)*sd(launch$distress_ct))
reg <- function(y,x) {
x <- as.matrix(x) #convert the data frame into matrix form
x <- cbind(Intercept = 1, x) # bind and additional column onto the x matrix; the command INtercept =1 name
# the new column Intercept and till the column withrepeating 1 values
b <- solve(t(x) %*% x) %*% t(x) %*% y # solve() - inverse of a matrix, t() - transpose a metrix,
# %*% multiplies 2 matrices. The whole command return a vector b,
# which contains the estimated parameter for the linear model
# relating x to y
colnames(b) <- "estimate" # names the vector "b" estimate
print(b)
}
# The data set includes 3 features and the distress count, which is the outcome of interest:
str(launch)
## 'data.frame': 23 obs. of 4 variables:
## $ distress_ct : int 0 1 0 0 0 0 0 0 1 1 ...
## $ temperature : int 66 70 69 68 67 72 73 70 57 63 ...
## $ field_check_pressure: int 50 50 50 50 50 50 100 100 200 200 ...
## $ flight_num : int 1 2 3 4 5 6 7 8 9 10 ...
# Lets apply the function to the shuttle lauch data.
reg(y = launch$distress_ct, x = launch[2])
## estimate
## Intercept 3.69841270
## temperature -0.04753968
# Lets use the function to build a multimple regression model, specifying 3 columns of data instead of just 1:
reg(y = launch$distress_ct, x = launch[2:4])
## estimate
## Intercept 3.527093383
## temperature -0.051385940
## field_check_pressure 0.001757009
## flight_num 0.014292843
El análisis utilizará un dataset con hipotéticos gastos en el archivo insurance.csv los datos de los pacientes para estimar los gastos médicos de los distintos segmentos de la población. Este archivo contiene 13338 ejemplos de beneficiarios con un plan de seguro, con características que indican la descripción del paciente y su total de gastos médicos por año. Estas caracteristicas son: age(integer < 64), sex(male, female), bmi(body mass index 18.5-24.9), children(# of children covered by the insurence), smoker(yes,no), region(NE,NW,SE,SW). Aquí podremos darnos una idea como estas caracteristicas se relacionan entre sí, por ejemplo se puede suponer que las personas de tercera edad y fumadores tengan más altos los gastos medicos.
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
##
## northeast northwest southeast southwest
## 324 325 364 325
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Coefficients:
## (Intercept) age sexmale bmi
## -11941.6 256.8 -131.4 339.3
## children smokeryes regionnorthwest regionsoutheast
## 475.7 23847.5 -352.8 -1035.6
## regionsouthwest
## -959.3
##
## Call:
## lm(formula = charges ~ ., data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11302.7 -2850.9 -979.6 1383.9 29981.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11941.6 987.8 -12.089 < 2e-16 ***
## age 256.8 11.9 21.586 < 2e-16 ***
## sexmale -131.3 332.9 -0.395 0.693255
## bmi 339.3 28.6 11.864 < 2e-16 ***
## children 475.7 137.8 3.452 0.000574 ***
## smokeryes 23847.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -352.8 476.3 -0.741 0.458976
## regionsoutheast -1035.6 478.7 -2.163 0.030685 *
## regionsouthwest -959.3 477.9 -2.007 0.044921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.9 on 8 and 1329 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = charges ~ age + age2 + children + bmi + sex + bmi30 *
## smoker + region, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17297.1 -1656.0 -1262.7 -727.8 24161.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 139.0053 1363.1359 0.102 0.918792
## age -32.6181 59.8250 -0.545 0.585690
## age2 3.7307 0.7463 4.999 6.54e-07 ***
## children 678.6017 105.8855 6.409 2.03e-10 ***
## bmi 119.7715 34.2796 3.494 0.000492 ***
## sexmale -496.7690 244.3713 -2.033 0.042267 *
## bmi30 -997.9355 422.9607 -2.359 0.018449 *
## smokeryes 13404.5952 439.9591 30.468 < 2e-16 ***
## regionnorthwest -279.1661 349.2826 -0.799 0.424285
## regionsoutheast -828.0345 351.6484 -2.355 0.018682 *
## regionsouthwest -1222.1619 350.5314 -3.487 0.000505 ***
## bmi30:smokeryes 19810.1534 604.6769 32.762 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4445 on 1326 degrees of freedom
## Multiple R-squared: 0.8664, Adjusted R-squared: 0.8653
## F-statistic: 781.7 on 11 and 1326 DF, p-value: < 2.2e-16
En esta practica se imitó a un catador de vinos. El dataset implementado “winequality-white.csv” por UCI Machine Learning Data Repository ( http://archive.ics.uci.edu/ml ) incluye 11 propiedades de 4898 muestras, las características son: acidity, sugar, content, chlorides, sulfur, alcohol, pH, and density.
## Step 2 – exploring and preparing the data
wine <- read.csv("~/Documents/MCPI/Aprendizaje Automatico/Practicas/Clases/Datsets/winequality-white.csv", sep=";")
str(wine)
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
#wine<-as.numeric(wine)
#hist(wine$quality)
wine_train <- wine[1:3750, ] #75%
wine_test <- wine[3751:4898, ] #25%
## Step 3 - training a model on the data
#install.packages("rpart")
library(rpart)
m.rpart <- rpart(quality ~ ., data = wine_train)
# For basic information about the tree, simply type the name of the model object:
m.rpart
## n= 3750
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 3750 3140.06000 5.886933
## 2) alcohol< 10.85 2473 1510.66200 5.609381
## 4) volatile.acidity>=0.2425 1406 740.15080 5.402560
## 8) volatile.acidity>=0.4225 182 92.99451 4.994505 *
## 9) volatile.acidity< 0.4225 1224 612.34560 5.463235 *
## 5) volatile.acidity< 0.2425 1067 631.12090 5.881912 *
## 3) alcohol>=10.85 1277 1069.95800 6.424432
## 6) free.sulfur.dioxide< 11.5 93 99.18280 5.473118 *
## 7) free.sulfur.dioxide>=11.5 1184 879.99920 6.499155
## 14) alcohol< 11.85 611 447.38130 6.296236 *
## 15) alcohol>=11.85 573 380.63180 6.715532 *
summary(m.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_train)
## n= 3750
##
## CP nsplit rel error xerror xstd
## 1 0.17816211 0 1.0000000 1.0010391 0.02390494
## 2 0.04439109 1 0.8218379 0.8232523 0.02238387
## 3 0.02890893 2 0.7774468 0.7885622 0.02218078
## 4 0.01655575 3 0.7485379 0.7610867 0.02103390
## 5 0.01108600 4 0.7319821 0.7498581 0.02060005
## 6 0.01000000 5 0.7208961 0.7434623 0.02026718
##
## Variable importance
## alcohol density chlorides
## 38 23 12
## volatile.acidity total.sulfur.dioxide free.sulfur.dioxide
## 12 7 6
## sulphates pH residual.sugar
## 1 1 1
##
## Node number 1: 3750 observations, complexity param=0.1781621
## mean=5.886933, MSE=0.8373493
## left son=2 (2473 obs) right son=3 (1277 obs)
## Primary splits:
## alcohol < 10.85 to the left, improve=0.17816210, (0 missing)
## density < 0.992385 to the right, improve=0.11980970, (0 missing)
## chlorides < 0.0395 to the right, improve=0.08199995, (0 missing)
## total.sulfur.dioxide < 153.5 to the right, improve=0.03875440, (0 missing)
## free.sulfur.dioxide < 11.75 to the left, improve=0.03632119, (0 missing)
## Surrogate splits:
## density < 0.99201 to the right, agree=0.869, adj=0.614, (0 split)
## chlorides < 0.0375 to the right, agree=0.773, adj=0.334, (0 split)
## total.sulfur.dioxide < 102.5 to the right, agree=0.705, adj=0.132, (0 split)
## sulphates < 0.345 to the right, agree=0.670, adj=0.031, (0 split)
## fixed.acidity < 5.25 to the right, agree=0.662, adj=0.009, (0 split)
##
## Node number 2: 2473 observations, complexity param=0.04439109
## mean=5.609381, MSE=0.6108623
## left son=4 (1406 obs) right son=5 (1067 obs)
## Primary splits:
## volatile.acidity < 0.2425 to the right, improve=0.09227123, (0 missing)
## free.sulfur.dioxide < 13.5 to the left, improve=0.04177240, (0 missing)
## alcohol < 10.15 to the left, improve=0.03313802, (0 missing)
## citric.acid < 0.205 to the left, improve=0.02721200, (0 missing)
## pH < 3.325 to the left, improve=0.01860335, (0 missing)
## Surrogate splits:
## total.sulfur.dioxide < 111.5 to the right, agree=0.610, adj=0.097, (0 split)
## pH < 3.295 to the left, agree=0.598, adj=0.067, (0 split)
## alcohol < 10.05 to the left, agree=0.590, adj=0.049, (0 split)
## sulphates < 0.715 to the left, agree=0.584, adj=0.037, (0 split)
## residual.sugar < 1.85 to the right, agree=0.581, adj=0.029, (0 split)
##
## Node number 3: 1277 observations, complexity param=0.02890893
## mean=6.424432, MSE=0.8378682
## left son=6 (93 obs) right son=7 (1184 obs)
## Primary splits:
## free.sulfur.dioxide < 11.5 to the left, improve=0.08484051, (0 missing)
## alcohol < 11.85 to the left, improve=0.06149941, (0 missing)
## fixed.acidity < 7.35 to the right, improve=0.04259695, (0 missing)
## residual.sugar < 1.275 to the left, improve=0.02795662, (0 missing)
## total.sulfur.dioxide < 67.5 to the left, improve=0.02541719, (0 missing)
## Surrogate splits:
## total.sulfur.dioxide < 48.5 to the left, agree=0.937, adj=0.14, (0 split)
##
## Node number 4: 1406 observations, complexity param=0.011086
## mean=5.40256, MSE=0.526423
## left son=8 (182 obs) right son=9 (1224 obs)
## Primary splits:
## volatile.acidity < 0.4225 to the right, improve=0.04703189, (0 missing)
## free.sulfur.dioxide < 17.5 to the left, improve=0.04607770, (0 missing)
## total.sulfur.dioxide < 86.5 to the left, improve=0.02894310, (0 missing)
## alcohol < 10.25 to the left, improve=0.02890077, (0 missing)
## chlorides < 0.0455 to the right, improve=0.02096635, (0 missing)
## Surrogate splits:
## density < 0.99107 to the left, agree=0.874, adj=0.027, (0 split)
## citric.acid < 0.11 to the left, agree=0.873, adj=0.022, (0 split)
## fixed.acidity < 9.85 to the right, agree=0.873, adj=0.016, (0 split)
## chlorides < 0.206 to the right, agree=0.871, adj=0.005, (0 split)
##
## Node number 5: 1067 observations
## mean=5.881912, MSE=0.591491
##
## Node number 6: 93 observations
## mean=5.473118, MSE=1.066482
##
## Node number 7: 1184 observations, complexity param=0.01655575
## mean=6.499155, MSE=0.7432425
## left son=14 (611 obs) right son=15 (573 obs)
## Primary splits:
## alcohol < 11.85 to the left, improve=0.05907511, (0 missing)
## fixed.acidity < 7.35 to the right, improve=0.04400660, (0 missing)
## density < 0.991395 to the right, improve=0.02522410, (0 missing)
## residual.sugar < 1.225 to the left, improve=0.02503936, (0 missing)
## pH < 3.245 to the left, improve=0.02417936, (0 missing)
## Surrogate splits:
## density < 0.991115 to the right, agree=0.710, adj=0.401, (0 split)
## volatile.acidity < 0.2675 to the left, agree=0.665, adj=0.307, (0 split)
## chlorides < 0.0365 to the right, agree=0.631, adj=0.237, (0 split)
## total.sulfur.dioxide < 126.5 to the right, agree=0.566, adj=0.103, (0 split)
## residual.sugar < 1.525 to the left, agree=0.560, adj=0.091, (0 split)
##
## Node number 8: 182 observations
## mean=4.994505, MSE=0.5109588
##
## Node number 9: 1224 observations
## mean=5.463235, MSE=0.5002823
##
## Node number 14: 611 observations
## mean=6.296236, MSE=0.7322117
##
## Node number 15: 573 observations
## mean=6.715532, MSE=0.6642788
## Step 4 - evaluating model performance
p.rpart <- predict(m.rpart, wine_test)
#install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m.rpart, digits = 3)
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
## Step 4 - evaluating model performance
summary(p.rpart)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.995 5.463 5.882 5.999 6.296 6.716
summary(wine_test$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.848 6.000 8.000
cor(p.rpart, wine_test$quality)
## [1] 0.4931608
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# The MAE for our predictions is then:
MAE(p.rpart, wine_test$quality)
## [1] 0.5732104
mean(wine_train$quality)
## [1] 5.886933
# If we predicted the value 5.88 for every wine sample, we would have a mean absolute error of only about 0.57:
MAE(5.88, wine_test$quality)
## [1] 0.5778397
En el campo de la ingeniería es crucial estimar el rendimiento de los materiales de contrucción, el concreto es utilizado en casi todas las construcciones es por ello que es de interés particular. En este análisis se utilizo el dataset de UCI Machine Learning Data Repository ( http://archive.ics.uci.edu/ml ) por I-Cheng Yeh.
## 'data.frame': 1030 obs. of 9 variables:
## $ cement : num 540 540 332 332 199 ...
## $ slag : num 0 0 142 142 132 ...
## $ ash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ water : num 162 162 228 228 192 228 228 228 228 228 ...
## $ superplastic: num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ coarseagg : num 1040 1055 932 932 978 ...
## $ fineagg : num 676 676 594 594 826 ...
## $ age : num 28 28 270 365 360 90 365 28 28 28 ...
## $ strength : num 80 61.9 40.3 41 44.3 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2664 0.4001 0.4172 0.5457 1.0000
## [,1]
## [1,] 0.7210845505
## [,1]
## [1,] 0.6858322557