This report documents a reproducible workflow for predicting how a weight-lifting exercise is performed (target variable classe) using accelerometer data from the belt, forearm, arm, and dumbbell. The pipeline includes data cleaning and variable selection, data partitioning, cross-validation, and model comparison. The final model (Random Forest) is chosen for its high accuracy and its robustness to noise and correlated predictors. The validation set is used to estimate the out-of-sample error, and predictions are generated for the 20 test cases.
set.seed(2025)
# Packages
pkgs <- c("tidyverse","caret","randomForest","gbm","rpart","rpart.plot","pROC")
to_install <- pkgs[!sapply(pkgs, require, character.only = TRUE)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")
invisible(lapply(pkgs, require, character.only = TRUE))
# Global knitr options
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = "center")
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_path <- "pml-training.csv"
test_path <- "pml-testing.csv"
if (!file.exists(train_path)) download.file(train_url, train_path, mode = "wb")
if (!file.exists(test_path)) download.file(test_url, test_path, mode = "wb")
raw_train <- read.csv(train_path, na.strings = c("NA","#DIV/0!",""))
raw_test <- read.csv(test_path, na.strings = c("NA","#DIV/0!",""))
dim(raw_train); dim(raw_test)
## [1] 19622 160
## [1] 20 160
head(raw_train[,1:10])
Target variable: classe (levels A–E). Each row corresponds to a sensor time window with multiple summary statistics.
Cleaning strategy (a brief sanity check of the cleaned data follows the code below):
1) Drop columns that are almost entirely NA.
2) Remove identifier, timestamp, and text fields that do not describe the execution.
3) Keep numeric predictors present in both train and test.
4) Remove near-zero-variance predictors.
# 1) Remove columns with >95% NA
na_prop <- sapply(raw_train, function(x) mean(is.na(x)))
keep_cols <- names(na_prop[na_prop <= 0.95])
train1 <- raw_train[, keep_cols]
test1 <- raw_test[, intersect(keep_cols, names(raw_test))]
# 2) Drop known non-predictive fields
drop_exact <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2",
"cvtd_timestamp","new_window","num_window","problem_id")
train1 <- train1 %>% select(-any_of(drop_exact))
test1 <- test1 %>% select(-any_of(drop_exact))
# 3) Keep numeric predictors + target
target <- "classe"
preds_numeric <- names(train1)[sapply(train1, is.numeric)]
preds_numeric <- setdiff(preds_numeric, target) # in case 'classe' were mistyped as numeric
# Ensure 'classe' is a factor and placed last
train_clean <- train1 %>%
select(all_of(preds_numeric), all_of(target)) %>%
mutate(classe = factor(classe))
# Align columns between train and test
test_clean <- test1 %>%
select(all_of(preds_numeric))
# 4) Remove near-zero-variance predictors
nzv <- nearZeroVar(train_clean %>% select(-classe))
if (length(nzv) > 0) {
keep <- setdiff(names(train_clean), names(train_clean)[nzv])
keep <- c(setdiff(keep, "classe"), "classe")
train_clean <- train_clean[, keep]
test_clean <- test_clean[, setdiff(keep, "classe")]
}
dim(train_clean); dim(test_clean)
## [1] 19622 53
## [1] 20 52
# Class distribution
table(train_clean$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
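As a quick sanity check of the cleaning step, a minimal sketch using the objects created above confirms that no missing values remain and that the test predictors line up with the training predictors:
# Sanity checks on the cleaned data (sketch)
stopifnot(!anyNA(train_clean))                      # no NAs left after dropping sparse columns
stopifnot(identical(setdiff(names(train_clean), "classe"),
                    names(test_clean)))             # predictor columns aligned with the test set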
We use a stratified 70/30 partition to estimate the out-of-sample error, plus 5-fold cross-validation (CV) within the training set to tune hyperparameters (a quick check of the stratification is shown after the code).
set.seed(2025)
inTrain <- createDataPartition(train_clean$classe, p = 0.7, list = FALSE)
train_set <- train_clean[inTrain, ]
valid_set <- train_clean[-inTrain, ]
ctrl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = multiClassSummary,
allowParallel = TRUE)
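Because createDataPartition stratifies by classe, the class proportions in the two splits should be nearly identical; a minimal sketch using the splits defined above:
# Class proportions should match closely between training and validation sets (sketch)
round(rbind(train = prop.table(table(train_set$classe)),
            valid = prop.table(table(valid_set$classe))), 3)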
We evaluate (1) a decision tree (baseline), (2) Random Forest (main model), and optionally (3) GBM, and compare them on the shared CV folds below.
set.seed(2025)
fit_rpart <- train(classe ~ ., data = train_set,
method = "rpart",
trControl = ctrl,
tuneLength = 10)
fit_rpart
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10990, 10990, 10988, 10989
## Resampling results across tuning parameters:
##
## cp logLoss AUC prAUC Accuracy Kappa
## 0.01312176 0.8742083 0.8897714 0.54012433 0.6972473 0.61615990
## 0.01400332 0.9130679 0.8799911 0.52389510 0.6790453 0.59507723
## 0.01576645 0.9423824 0.8732178 0.50957078 0.6617885 0.57419508
## 0.02024209 0.9613445 0.8667003 0.48477415 0.6422066 0.55002609
## 0.02115756 1.0394187 0.8326108 0.40063131 0.6014411 0.48996935
## 0.02736242 1.1693045 0.7779638 0.30676364 0.5318501 0.38859376
## 0.03061743 1.1852265 0.7706454 0.29166417 0.5210750 0.37394296
## 0.04119622 1.2109703 0.7515078 0.27017696 0.4964673 0.34130534
## 0.04948632 1.2252413 0.7393660 0.26274076 0.4852571 0.32759675
## 0.11514597 1.4982008 0.5521919 0.03791104 0.3321580 0.07325014
## Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.6812573 0.6814578 0.9238306 0.7000160
## 0.6661173 0.6684253 0.9201129 0.6913159
## 0.6511804 0.6549743 0.9159941 0.6793588
## 0.6292400 0.6363448 0.9112644 0.6777716
## 0.5770065 0.5801804 0.8981833 0.6475516
## 0.4899294 0.4863105 0.8764727 0.5923106
## 0.4786238 0.4724032 0.8737306 0.5696922
## NaN 0.4425198 0.8676690 NaN
## NaN 0.4315510 0.8651875 NaN
## NaN 0.2528802 0.8134379 NaN
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9254702 0.7000160 0.6814578 0.13944947
## 0.9205416 0.6913159 0.6684253 0.13580906
## 0.9161350 0.6793588 0.6549743 0.13235771
## 0.9120496 0.6777716 0.6363448 0.12844132
## 0.9039694 0.6475516 0.5801804 0.12028821
## 0.8887708 0.5923106 0.4863105 0.10637002
## 0.8864314 0.5696922 0.4724032 0.10421500
## 0.8813871 NaN 0.4425198 0.09929347
## 0.8791201 NaN 0.4315510 0.09705142
## 0.8632920 NaN 0.2528802 0.06643159
## Mean_Balanced_Accuracy
## 0.8026442
## 0.7942691
## 0.7854842
## 0.7738046
## 0.7391818
## 0.6813916
## 0.6730669
## 0.6550944
## 0.6483693
## 0.5331590
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01312176.
rpart.plot::rpart.plot(fit_rpart$finalModel, cex = 0.7)
set.seed(2025)
fit_rf <- train(classe ~ ., data = train_set,
method = "rf",
trControl = ctrl,
tuneLength = 5,
importance = TRUE)
fit_rf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10990, 10990, 10988, 10989
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa Mean_F1
## 2 0.19665883 0.9997627 0.9794005 0.9909008 0.9884888 0.9900102
## 14 0.10696160 0.9998447 0.9210735 0.9920654 0.9899627 0.9913509
## 27 0.09664396 0.9997700 0.8670707 0.9895175 0.9867392 0.9886326
## 39 0.09473062 0.9996671 0.8214595 0.9867512 0.9832393 0.9856651
## 52 0.09753692 0.9993632 0.7527619 0.9813641 0.9764251 0.9800049
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.9896438 0.9977331 0.9904874 0.9977896
## 0.9911163 0.9980234 0.9916278 0.9980623
## 0.9883478 0.9973869 0.9889808 0.9974343
## 0.9853264 0.9966981 0.9860721 0.9967529
## 0.9797080 0.9953568 0.9803972 0.9954060
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.9904874 0.9896438 0.1981802 0.9936884
## 0.9916278 0.9911163 0.1984131 0.9945698
## 0.9889808 0.9883478 0.1979035 0.9928674
## 0.9860721 0.9853264 0.1973502 0.9910122
## 0.9803972 0.9797080 0.1962728 0.9875324
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
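To visualize how mtry was selected, caret's plot method for train objects shows the resampled accuracy across the tuning grid; a minimal sketch (the peak should correspond to mtry = 14):
# CV accuracy profile across the mtry grid (sketch)
plot(fit_rf)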
Variable importance (top 20). Figure 1
vip <- varImp(fit_rf, scale = TRUE)
plot(vip, top = 20, main = "Variable importance (Random Forest)")
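For a numeric view of the ranking behind the plot, here is a sketch; with importance = TRUE, caret reports per-class importances for RF, and averaging them into a single score is a simplifying assumption, not part of the original analysis:
# Top-20 predictors by average importance across the five classes (assumption: simple mean)
imp <- vip$importance
imp$Overall <- rowMeans(imp, na.rm = TRUE)
head(imp[order(-imp$Overall), "Overall", drop = FALSE], 20)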
GBM (optional third model):
set.seed(2025)
fit_gbm <- train(classe ~ ., data = train_set,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
tuneLength = 5)
fit_gbm
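Since the three models were fit with the same trainControl and the same seed (so they should share the same 5-fold scheme), caret's resamples() can compare their CV accuracy side by side; a minimal sketch:
# Compare CV accuracy of the three models on the shared folds (sketch)
res <- resamples(list(CART = fit_rpart, RF = fit_rf, GBM = fit_gbm))
summary(res)$statistics$Accuracy
# bwplot(res, metric = "Accuracy")   # optional visual comparison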
We compute predictions on the validation set, which was not used for training.
pred_rpart <- predict(fit_rpart, valid_set)
pred_rf <- predict(fit_rf, valid_set)
cm_rpart <- confusionMatrix(pred_rpart, valid_set$classe)
cm_rf <- confusionMatrix(pred_rf, valid_set$classe)
cm_rpart$overall["Accuracy"]; cm_rf$overall["Accuracy"]
## Accuracy
## 0.7033135
## Accuracy
## 0.9959218
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 5 0 0 0
## B 1 1130 5 0 0
## C 0 4 1019 5 0
## D 0 0 2 958 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9959
## 95% CI : (0.9939, 0.9974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9948
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9921 0.9932 0.9938 0.9991
## Specificity 0.9988 0.9987 0.9981 0.9994 0.9998
## Pos Pred Value 0.9970 0.9947 0.9912 0.9969 0.9991
## Neg Pred Value 0.9998 0.9981 0.9986 0.9988 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1920 0.1732 0.1628 0.1837
## Detection Prevalence 0.2851 0.1930 0.1747 0.1633 0.1839
## Balanced Accuracy 0.9991 0.9954 0.9957 0.9966 0.9994
Confusion matrix (RF). Figure 2
cm <- cm_rf$table %>% as.data.frame()
ggplot(cm, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_continuous(name = "Frequency") +
labs(title = "Confusion matrix (Random Forest)") +
theme_minimal()
Out-of-sample error estimate: error_out_sample = 1 - Accuracy_RF_valid. With the validation accuracy reported above (0.9959), the estimated out-of-sample error is roughly 0.0041 (about 0.41%).
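A minimal sketch computing this estimate directly from the confusion matrix object above:
# Estimated out-of-sample error = 1 - validation accuracy of the RF model
oos_error <- 1 - as.numeric(cm_rf$overall["Accuracy"])
round(oos_error, 4)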
We choose Random Forest because:
- It achieves the highest validation accuracy.
- It is robust to noise and does not require heavy feature engineering.
- It provides variable importances useful for interpretation.
We align the columns and predict on pml-testing.csv. We also generate the 20 problem_id_#.txt files (Coursera submission format).
# Re-ensure identical columns between train and test
common_predictors <- intersect(names(train_set)[names(train_set) != "classe"], names(test_clean))
testX <- test_clean[, common_predictors]
trainX <- train_set[, c(common_predictors, "classe")]
# (optional) retrain using exactly the common columns
set.seed(2025)
fit_final <- train(classe ~ ., data = trainX,
method = "rf",
trControl = ctrl,
tuneLength = 5,
importance = TRUE)
pred_test <- predict(fit_final, testX)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
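As a convenience view before writing the submission files, here is a sketch pairing each prediction with its case number; the problem IDs are taken simply as the row order of the test set, since the original problem_id column was dropped during cleaning:
# Pair each test case with its predicted classe (sketch)
data.frame(problem_id = seq_along(pred_test), prediction = pred_test)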
Writing the quiz submission files (20 files)
pml_write_files <- function(x){
for(i in seq_along(x)){
filename <- paste0("problem_id_", i, ".txt")
write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
}
}
pml_write_files(pred_test)
list.files(pattern = "^problem_id_\\d+\\.txt$")
Reproducibility: set.seed(2025) is fixed throughout.
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Bogota
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pROC_1.18.5 rpart.plot_3.1.3 rpart_4.1.24
## [4] gbm_2.2.2 randomForest_4.7-1.2 caret_7.0-1
## [7] lattice_0.21-8 lubridate_1.9.4 forcats_1.0.0
## [10] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [16] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
## [7] lifecycle_1.0.4 survival_3.5-5 ROCR_1.0-11
## [10] magrittr_2.0.3 compiler_4.3.1 rlang_1.1.4
## [13] sass_0.4.9 tools_4.3.1 utf8_1.2.4
## [16] yaml_2.3.10 data.table_1.17.0 knitr_1.49
## [19] labeling_0.4.3 plyr_1.8.9 withr_3.0.2
## [22] MLmetrics_1.1.3 nnet_7.3-19 grid_4.3.1
## [25] stats4_4.3.1 fansi_1.0.6 e1071_1.7-16
## [28] colorspace_2.1-1 future_1.34.0 globals_0.16.3
## [31] scales_1.3.0 iterators_1.0.14 MASS_7.3-60
## [34] cli_3.6.3 rmarkdown_2.29 generics_0.1.3
## [37] rstudioapi_0.17.1 future.apply_1.11.3 reshape2_1.4.4
## [40] tzdb_0.5.0 cachem_1.1.0 proxy_0.4-27
## [43] splines_4.3.1 parallel_4.3.1 vctrs_0.6.5
## [46] hardhat_1.4.1 Matrix_1.5-4.1 jsonlite_1.8.9
## [49] hms_1.1.3 listenv_0.9.1 foreach_1.5.2
## [52] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.0
## [55] glue_1.8.0 parallelly_1.40.1 codetools_0.2-19
## [58] stringi_1.8.4 gtable_0.3.6 munsell_0.5.1
## [61] pillar_1.9.0 htmltools_0.5.8.1 ipred_0.9-15
## [64] lava_1.8.1 R6_2.5.1 evaluate_1.0.1
## [67] bslib_0.8.0 class_7.3-22 Rcpp_1.0.13-1
## [70] nlme_3.1-162 prodlim_2024.06.25 xfun_0.52
## [73] pkgconfig_2.0.3 ModelMetrics_1.2.2.2
Tuning and validation: mtry was tuned via 5-fold cross-validation, and the external validation split (30%) was used to estimate the out-of-sample error.