This report documents a reproducible workflow for predicting how a weight-lifting exercise is performed (target variable classe) using accelerometer data from the belt, forearm, arm, and dumbbell. The pipeline includes data cleaning and variable selection, data partitioning, cross-validation, and model comparison. The final model (Random Forest) is chosen for its high accuracy and its robustness to noise and correlated predictors. The validation set is used to estimate the out-of-sample error, and predictions are generated for the 20 test cases.
set.seed(2025)
# Packages
pkgs <- c("tidyverse","caret","randomForest","gbm","rpart","rpart.plot","pROC")
to_install <- pkgs[!sapply(pkgs, require, character.only = TRUE)]
if (length(to_install) > 0) install.packages(to_install, repos = "https://cloud.r-project.org")
invisible(lapply(pkgs, require, character.only = TRUE))
# Global knitr options
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, fig.align = "center")
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_path <- "pml-training.csv"
test_path <- "pml-testing.csv"
if (!file.exists(train_path)) download.file(train_url, train_path, mode = "wb")
if (!file.exists(test_path)) download.file(test_url, test_path, mode = "wb")
raw_train <- read.csv(train_path, na.strings = c("NA","#DIV/0!",""))
raw_test <- read.csv(test_path, na.strings = c("NA","#DIV/0!",""))
dim(raw_train); dim(raw_test)
## [1] 19622 160
## [1] 20 160
head(raw_train[,1:10])
Target variable: classe (levels A–E). Each row corresponds to a sensor time window with multiple summary statistics.
Cleaning strategy (a brief sanity check of the cleaned data follows the code below):
1) Drop columns that are almost entirely NA.
2) Remove identifier, timestamp, and text fields that do not describe the execution.
3) Keep numeric predictors present in both train and test.
4) Remove near-zero-variance predictors.
# 1) Remove columns with >95% NA
na_prop <- sapply(raw_train, function(x) mean(is.na(x)))
keep_cols <- names(na_prop[na_prop <= 0.95])
train1 <- raw_train[, keep_cols]
test1 <- raw_test[, intersect(keep_cols, names(raw_test))]
# 2) Drop known non-predictive fields
drop_exact <- c("X","user_name","raw_timestamp_part_1","raw_timestamp_part_2",
"cvtd_timestamp","new_window","num_window","problem_id")
train1 <- train1 %>% select(-any_of(drop_exact))
test1 <- test1 %>% select(-any_of(drop_exact))
# 3) Keep numeric predictors + target
target <- "classe"
preds_numeric <- names(train1)[sapply(train1, is.numeric)]
preds_numeric <- setdiff(preds_numeric, target) # in case 'classe' were mistyped as numeric
# Ensure 'classe' is a factor and placed last
train_clean <- train1 %>%
select(all_of(preds_numeric), all_of(target)) %>%
mutate(classe = factor(classe))
# Align columns between train and test
test_clean <- test1 %>%
select(all_of(preds_numeric))
# 4) Remove near-zero-variance predictors
nzv <- nearZeroVar(train_clean %>% select(-classe))
if (length(nzv) > 0) {
keep <- setdiff(names(train_clean), names(train_clean)[nzv])
keep <- c(setdiff(keep, "classe"), "classe")
train_clean <- train_clean[, keep]
test_clean <- test_clean[, setdiff(keep, "classe")]
}
dim(train_clean); dim(test_clean)
## [1] 19622 53
## [1] 20 52
# Class distribution
table(train_clean$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
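As a quick sanity check of the cleaning step, a minimal sketch using the objects created above confirms that no missing values remain and that the test predictors line up with the training predictors:
# Sanity checks on the cleaned data (sketch)
stopifnot(!anyNA(train_clean))                      # no NAs left after dropping sparse columns
stopifnot(identical(setdiff(names(train_clean), "classe"),
                    names(test_clean)))             # predictor columns aligned with the test set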
We use a stratified 70/30 partition to estimate the out-of-sample error, plus 5-fold cross-validation (CV) within the training set to tune hyperparameters (a quick check of the stratification is shown after the code).
set.seed(2025)
inTrain <- createDataPartition(train_clean$classe, p = 0.7, list = FALSE)
train_set <- train_clean[inTrain, ]
valid_set <- train_clean[-inTrain, ]
ctrl <- trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = multiClassSummary,
allowParallel = TRUE)
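Because createDataPartition stratifies by classe, the class proportions in the two splits should be nearly identical; a minimal sketch using the splits defined above:
# Class proportions should match closely between training and validation sets (sketch)
round(rbind(train = prop.table(table(train_set$classe)),
            valid = prop.table(table(valid_set$classe))), 3)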
We evaluate (1) a decision tree (baseline), (2) Random Forest (main model), and optionally (3) GBM, and compare them on the shared CV folds below.
set.seed(2025)
fit_rpart <- train(classe ~ ., data = train_set,
method = "rpart",
trControl = ctrl,
tuneLength = 10)
fit_rpart
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10990, 10990, 10988, 10989
## Resampling results across tuning parameters:
##
## cp logLoss AUC prAUC Accuracy Kappa
## 0.01312176 0.8742083 0.8897714 0.54012433 0.6972473 0.61615990
## 0.01400332 0.9130679 0.8799911 0.52389510 0.6790453 0.59507723
## 0.01576645 0.9423824 0.8732178 0.50957078 0.6617885 0.57419508
## 0.02024209 0.9613445 0.8667003 0.48477415 0.6422066 0.55002609
## 0.02115756 1.0394187 0.8326108 0.40063131 0.6014411 0.48996935
## 0.02736242 1.1693045 0.7779638 0.30676364 0.5318501 0.38859376
## 0.03061743 1.1852265 0.7706454 0.29166417 0.5210750 0.37394296
## 0.04119622 1.2109703 0.7515078 0.27017696 0.4964673 0.34130534
## 0.04948632 1.2252413 0.7393660 0.26274076 0.4852571 0.32759675
## 0.11514597 1.4982008 0.5521919 0.03791104 0.3321580 0.07325014
## Mean_F1 Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.6812573 0.6814578 0.9238306 0.7000160
## 0.6661173 0.6684253 0.9201129 0.6913159
## 0.6511804 0.6549743 0.9159941 0.6793588
## 0.6292400 0.6363448 0.9112644 0.6777716
## 0.5770065 0.5801804 0.8981833 0.6475516
## 0.4899294 0.4863105 0.8764727 0.5923106
## 0.4786238 0.4724032 0.8737306 0.5696922
## NaN 0.4425198 0.8676690 NaN
## NaN 0.4315510 0.8651875 NaN
## NaN 0.2528802 0.8134379 NaN
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9254702 0.7000160 0.6814578 0.13944947
## 0.9205416 0.6913159 0.6684253 0.13580906
## 0.9161350 0.6793588 0.6549743 0.13235771
## 0.9120496 0.6777716 0.6363448 0.12844132
## 0.9039694 0.6475516 0.5801804 0.12028821
## 0.8887708 0.5923106 0.4863105 0.10637002
## 0.8864314 0.5696922 0.4724032 0.10421500
## 0.8813871 NaN 0.4425198 0.09929347
## 0.8791201 NaN 0.4315510 0.09705142
## 0.8632920 NaN 0.2528802 0.06643159
## Mean_Balanced_Accuracy
## 0.8026442
## 0.7942691
## 0.7854842
## 0.7738046
## 0.7391818
## 0.6813916
## 0.6730669
## 0.6550944
## 0.6483693
## 0.5331590
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01312176.
rpart.plot::rpart.plot(fit_rpart$finalModel, cex = 0.7)
set.seed(2025)
fit_rf <- train(classe ~ ., data = train_set,
method = "rf",
trControl = ctrl,
tuneLength = 5,
importance = TRUE)
fit_rf
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10991, 10990, 10990, 10988, 10989
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa Mean_F1
## 2 0.19665883 0.9997627 0.9794005 0.9909008 0.9884888 0.9900102
## 14 0.10696160 0.9998447 0.9210735 0.9920654 0.9899627 0.9913509
## 27 0.09664396 0.9997700 0.8670707 0.9895175 0.9867392 0.9886326
## 39 0.09473062 0.9996671 0.8214595 0.9867512 0.9832393 0.9856651
## 52 0.09753692 0.9993632 0.7527619 0.9813641 0.9764251 0.9800049
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 0.9896438 0.9977331 0.9904874 0.9977896
## 0.9911163 0.9980234 0.9916278 0.9980623
## 0.9883478 0.9973869 0.9889808 0.9974343
## 0.9853264 0.9966981 0.9860721 0.9967529
## 0.9797080 0.9953568 0.9803972 0.9954060
## Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.9904874 0.9896438 0.1981802 0.9936884
## 0.9916278 0.9911163 0.1984131 0.9945698
## 0.9889808 0.9883478 0.1979035 0.9928674
## 0.9860721 0.9853264 0.1973502 0.9910122
## 0.9803972 0.9797080 0.1962728 0.9875324
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
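To visualize how mtry was selected, caret's plot method for train objects shows the resampled accuracy across the tuning grid; a minimal sketch (the peak should correspond to mtry = 14):
# CV accuracy profile across the mtry grid (sketch)
plot(fit_rf)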
Variable importance (top 20). Figure 1
vip <- varImp(fit_rf, scale = TRUE)
plot(vip, top = 20, main = "Variable importance (Random Forest)")
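For a numeric view of the ranking behind the plot, here is a sketch; with importance = TRUE, caret reports per-class importances for RF, and averaging them into a single score is a simplifying assumption, not part of the original analysis:
# Top-20 predictors by average importance across the five classes (assumption: simple mean)
imp <- vip$importance
imp$Overall <- rowMeans(imp, na.rm = TRUE)
head(imp[order(-imp$Overall), "Overall", drop = FALSE], 20)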
GBM (optional third model):
set.seed(2025)
fit_gbm <- train(classe ~ ., data = train_set,
method = "gbm",
trControl = ctrl,
verbose = FALSE,
tuneLength = 5)
fit_gbm
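Since the three models were fit with the same trainControl and the same seed (so they should share the same 5-fold scheme), caret's resamples() can compare their CV accuracy side by side; a minimal sketch:
# Compare CV accuracy of the three models on the shared folds (sketch)
res <- resamples(list(CART = fit_rpart, RF = fit_rf, GBM = fit_gbm))
summary(res)$statistics$Accuracy
# bwplot(res, metric = "Accuracy")   # optional visual comparison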
We compute predictions on the validation set, which was not used for training.
pred_rpart <- predict(fit_rpart, valid_set)
pred_rf <- predict(fit_rf, valid_set)
cm_rpart <- confusionMatrix(pred_rpart, valid_set$classe)
cm_rf <- confusionMatrix(pred_rf, valid_set$classe)
cm_rpart$overall["Accuracy"]; cm_rf$overall["Accuracy"]
## Accuracy
## 0.7033135
## Accuracy
## 0.9959218
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 5 0 0 0
## B 1 1130 5 0 0
## C 0 4 1019 5 0
## D 0 0 2 958 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9959
## 95% CI : (0.9939, 0.9974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9948
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9921 0.9932 0.9938 0.9991
## Specificity 0.9988 0.9987 0.9981 0.9994 0.9998
## Pos Pred Value 0.9970 0.9947 0.9912 0.9969 0.9991
## Neg Pred Value 0.9998 0.9981 0.9986 0.9988 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1920 0.1732 0.1628 0.1837
## Detection Prevalence 0.2851 0.1930 0.1747 0.1633 0.1839
## Balanced Accuracy 0.9991 0.9954 0.9957 0.9966 0.9994
Confusion matrix (RF). Figure 2
cm <- cm_rf$table %>% as.data.frame()
ggplot(cm, aes(Prediction, Reference, fill = Freq)) +
geom_tile() + geom_text(aes(label = Freq)) +
scale_fill_continuous(name = "Frequency") +
labs(title = "Confusion matrix (Random Forest)") +
theme_minimal()
Out-of-sample error estimate: error_out_sample = 1 - Accuracy_RF_valid. With the validation accuracy reported above (0.9959), the estimated out-of-sample error is roughly 0.0041 (about 0.41%).
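A minimal sketch computing this estimate directly from the confusion matrix object above:
# Estimated out-of-sample error = 1 - validation accuracy of the RF model
oos_error <- 1 - as.numeric(cm_rf$overall["Accuracy"])
round(oos_error, 4)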
We choose Random Forest because:
- It achieves the highest validation accuracy.
- It is robust to noise and does not require heavy feature engineering.
- It provides variable importances useful for interpretation.
We align the columns and predict on pml-testing.csv. We also generate the 20 problem_id_#.txt files (Coursera submission format).
# Re-ensure identical columns between train and test
common_predictors <- intersect(names(train_set)[names(train_set) != "classe"], names(test_clean))
testX <- test_clean[, common_predictors]
trainX <- train_set[, c(common_predictors, "classe")]
# (optional) retrain using exactly the common columns
set.seed(2025)
fit_final <- train(classe ~ ., data = trainX,
method = "rf",
trControl = ctrl,
tuneLength = 5,
importance = TRUE)
pred_test <- predict(fit_final, testX)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
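As a convenience view before writing the submission files, here is a sketch pairing each prediction with its case number; the problem IDs are taken simply as the row order of the test set, since the original problem_id column was dropped during cleaning:
# Pair each test case with its predicted classe (sketch)
data.frame(problem_id = seq_along(pred_test), prediction = pred_test)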
Writing the quiz submission files (20 files)
pml_write_files <- function(x){
for(i in seq_along(x)){
filename <- paste0("problem_id_", i, ".txt")
write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
}
}
pml_write_files(pred_test)
list.files(pattern = "^problem_id_\\d+\\.txt$")
Reproducibility: set.seed(2025) is fixed throughout.
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/Bogota
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pROC_1.18.5 rpart.plot_3.1.3 rpart_4.1.24
## [4] gbm_2.2.2 randomForest_4.7-1.2 caret_7.0-1
## [7] lattice_0.21-8 lubridate_1.9.4 forcats_1.0.0
## [10] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [13] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
## [16] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
## [7] lifecycle_1.0.4 survival_3.5-5 ROCR_1.0-11
## [10] magrittr_2.0.3 compiler_4.3.1 rlang_1.1.4
## [13] sass_0.4.9 tools_4.3.1 utf8_1.2.4
## [16] yaml_2.3.10 data.table_1.17.0 knitr_1.49
## [19] labeling_0.4.3 plyr_1.8.9 withr_3.0.2
## [22] MLmetrics_1.1.3 nnet_7.3-19 grid_4.3.1
## [25] stats4_4.3.1 fansi_1.0.6 e1071_1.7-16
## [28] colorspace_2.1-1 future_1.34.0 globals_0.16.3
## [31] scales_1.3.0 iterators_1.0.14 MASS_7.3-60
## [34] cli_3.6.3 rmarkdown_2.29 generics_0.1.3
## [37] rstudioapi_0.17.1 future.apply_1.11.3 reshape2_1.4.4
## [40] tzdb_0.5.0 cachem_1.1.0 proxy_0.4-27
## [43] splines_4.3.1 parallel_4.3.1 vctrs_0.6.5
## [46] hardhat_1.4.1 Matrix_1.5-4.1 jsonlite_1.8.9
## [49] hms_1.1.3 listenv_0.9.1 foreach_1.5.2
## [52] gower_1.0.2 jquerylib_0.1.4 recipes_1.3.0
## [55] glue_1.8.0 parallelly_1.40.1 codetools_0.2-19
## [58] stringi_1.8.4 gtable_0.3.6 munsell_0.5.1
## [61] pillar_1.9.0 htmltools_0.5.8.1 ipred_0.9-15
## [64] lava_1.8.1 R6_2.5.1 evaluate_1.0.1
## [67] bslib_0.8.0 class_7.3-22 Rcpp_1.0.13-1
## [70] nlme_3.1-162 prodlim_2024.06.25 xfun_0.52
## [73] pkgconfig_2.0.3 ModelMetrics_1.2.2.2
Tuning and validation: mtry was tuned via 5-fold cross-validation, and the external validation split (30%) was used to estimate the out-of-sample error.