Universidad Católica de Oriente: A la Verdad por la Fe y la Ciencia

1. Loading and cleaning the data

In this chapter we take the first steps of our data science and Big Data project. We cover opening the database, analyzing outliers, and imputing missing data.

1.1. Loading the database
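
The code in section 1.2 assumes a data frame named `data` that contains the columns `Peso`, `Estatura`, `IMC`, `Promedio.Sem` and `Prom.Acum`. A minimal sketch of how such a frame might be loaded, assuming the source is a CSV file. The file and its values are made up for illustration and written to a temporary path so that the example runs on its own:

```r
# Hypothetical example: the real file name and contents are not given here,
# so a tiny CSV with made-up values is written to a temporary file first.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Peso,Estatura,IMC,Promedio.Sem,Prom.Acum",
  "60,1.65,22.0,3.8,3.9",
  "NA,1.70,NA,4.1,4.0",
  "72,NA,NA,3.5,NA",
  "55,1.58,22.0,4.4,4.3"
), csv_path)

# read.csv() treats the string "NA" as missing by default (na.strings = "NA").
data <- read.csv(csv_path)

# Structure of the loaded frame and number of missing values per column.
str(data)
missing_per_column <- colSums(is.na(data))
print(missing_per_column)
```

Counting the missing values per column before imputing gives a quick idea of how much of each variable will be filled in.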

1.2. Data imputation

#### Mean imputation

#install.packages("mice", dependencies = TRUE)

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
###names(data)

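# Numeric variables to impute.
columns <- c("Peso", "Estatura", "IMC", "Promedio.Sem", "Prom.Acum")

# Single imputation (m = 1, one iteration): each missing value is replaced by
# the mean of the observed values in its column.
imputed_data <- mice(data[, names(data) %in% columns], m = 1,
                     maxit = 1, method = "mean", seed = 2018,
                     printFlag = FALSE)

# Extract the completed data set.
complete.data <- mice::complete(imputed_data)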


# Compare densities before imputation (col = 2, observed values only)
# and after imputation (col = 3).
par(mfrow = c(3, 2))

##names(data)

plot(density(data$Peso, na.rm = TRUE), col = 2, main = "Weight")
lines(density(complete.data$Peso), col = 3)

plot(density(data$Estatura, na.rm = TRUE), col = 2, main = "Height")
lines(density(complete.data$Estatura), col = 3)

plot(density(data$IMC, na.rm = TRUE), col = 2, main = "BMI")
lines(density(complete.data$IMC), col = 3)

plot(density(data$Promedio.Sem, na.rm = TRUE), col = 2, main = "Semester average")
lines(density(complete.data$Promedio.Sem), col = 3)

plot(density(data$Prom.Acum, na.rm = TRUE), col = 2, main = "Cumulative average")
lines(density(complete.data$Prom.Acum), col = 3)

dev.off()
## null device 
##           1
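
To see what `method = "mean"` does to a single variable, here is a minimal base R sketch on a made-up vector. The values and the helper name `impute_mean` are illustrative and not taken from the data set. Every missing entry is replaced by the mean of the observed ones, which leaves the mean unchanged but shrinks the variance, so the density of the completed data tends to be more peaked around the mean than the density of the observed data.

```r
# Hypothetical helper: replace each NA with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Made-up weights in kg with two missing entries.
peso <- c(60, NA, 72, 55, NA, 68)
peso_completo <- impute_mean(peso)

mean(peso, na.rm = TRUE)   # mean of the observed values
mean(peso_completo)        # same mean after imputation
var(peso, na.rm = TRUE)    # variance of the observed values
var(peso_completo)         # smaller: the imputed points sit exactly at the mean
```

This loss of variability is one reason to compare mean imputation against a model-based method before settling on it.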
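##### Random forest imputation

# Same call as before, but each missing value is now imputed with
# mice's random forest method (method = "rf").
imputed_data <- mice(data[, names(data) %in% columns], m = 1,
                     maxit = 1, method = "rf", seed = 2018,
                     printFlag = FALSE)

# Extract the completed data set.
complete.data <- mice::complete(imputed_data)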


# Compare densities before imputation (col = 2, observed values only)
# and after imputation (col = 3).
par(mfrow = c(3, 2))

##names(data)

plot(density(data$Peso, na.rm = TRUE), col = 2, main = "Weight")
lines(density(complete.data$Peso), col = 3)

plot(density(data$Estatura, na.rm = TRUE), col = 2, main = "Height")
lines(density(complete.data$Estatura), col = 3)

plot(density(data$IMC, na.rm = TRUE), col = 2, main = "BMI")
lines(density(complete.data$IMC), col = 3)

plot(density(data$Promedio.Sem, na.rm = TRUE), col = 2, main = "Semester average")
lines(density(complete.data$Promedio.Sem), col = 3)

plot(density(data$Prom.Acum, na.rm = TRUE), col = 2, main = "Cumulative average")
lines(density(complete.data$Prom.Acum), col = 3)

dev.off()
## null device 
##           1
help(mice)
## starting httpd help server ...
##  done
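
The five pairs of `plot()` and `lines()` calls used for each method differ only in the column name and the title, so they could be folded into a loop. A sketch under that assumption: `plot_density_comparison` is a hypothetical helper name, and the two small data frames are made up so that the example runs on its own.

```r
# Hypothetical helper: for each column, overlay the density of the observed
# values (col = 2) and the density of the completed values (col = 3).
plot_density_comparison <- function(original, completed, columns,
                                    titles = columns) {
  old_par <- par(mfrow = c(ceiling(length(columns) / 2), 2))
  on.exit(par(old_par))  # restore the previous layout when the function exits
  for (i in seq_along(columns)) {
    col_name <- columns[i]
    plot(density(original[[col_name]], na.rm = TRUE), col = 2,
         main = titles[i])
    lines(density(completed[[col_name]]), col = 3)
  }
  invisible(columns)
}

# Made-up data: two columns, with a mean-imputed copy standing in for the
# output of mice::complete().
set.seed(2018)
original <- data.frame(Peso = rnorm(50, 65, 10),
                       Estatura = rnorm(50, 1.7, 0.1))
original$Peso[1:5] <- NA
completed <- original
completed$Peso[is.na(completed$Peso)] <- mean(original$Peso, na.rm = TRUE)

plotted <- plot_density_comparison(original, completed,
                                   c("Peso", "Estatura"))
```

Restoring the layout with `on.exit(par(old_par))` avoids the need to call `dev.off()` after every comparison, so the figure stays open for inspection.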
