Database cleaning

Step 1: Load the database

# Install the packages once, then load them
install.packages("dplyr")
library(dplyr)
install.packages("tidyverse")
library(tidyverse)
install.packages("janitor")
library(janitor)

# file.choose() opens a dialog and returns the path of the selected file
file.choose()
# Forward slashes avoid having to escape backslashes in a Windows path
bd <- read.csv("C:/Users/lffr1/Downloads/abarrotes.csv")
summary(bd)

Count the values

count(bd, vcClaveTienda, sort = TRUE)
count(bd, DescGiro, sort = TRUE)
count(bd, Marca, sort = TRUE)
count(bd, Fabricante, sort = TRUE)
count(bd, Producto, sort = TRUE)
count(bd, NombreDepartamento, sort = TRUE)
count(bd, NombreFamilia, sort = TRUE)
count(bd, NombreCategoría, sort = TRUE)
count(bd, Estado, sort = TRUE)
count(bd, Giro, sort = TRUE)

tibble(bd)
tail(bd, n = 7)
# Cross-tabulation of store key against department
tabyl(bd, vcClaveTienda, NombreDepartamento)

Database cleaning

# Drop the columns that are not needed
bd1 <- bd
bd1 <- subset(bd1, select = -c(PLU, Codigo.Barras))

# Keep only the rows with a positive price
bd2 <- bd1
bd2 <- bd2[bd2$Precio > 0, ]
summary(bd2)
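One caveat with the bracket filter: if Precio contains NA, the comparison returns NA and base R keeps an all-NA row for it. A minimal sketch on a toy data frame (the values are made up for illustration, not taken from abarrotes.csv) compares it with subset(), which drops those rows:

```r
# Toy data with hypothetical values, not from abarrotes.csv
toy <- data.frame(Producto = c("A", "B", "C"),
                  Precio   = c(5, -2, NA))

# Logical indexing: the NA comparison yields an extra all-NA row
kept_bracket <- toy[toy$Precio > 0, ]

# subset() treats NA in the condition as FALSE and drops the row
kept_subset <- subset(toy, Precio > 0)

nrow(kept_bracket)
nrow(kept_subset)
```

If Precio has no missing values, both forms should return the same rows.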

# Show the duplicated rows, then count them
bd2[duplicated(bd2), ]
sum(duplicated(bd2))

# Remove the duplicated rows
bd3 <- bd2
bd3 <- distinct(bd3)
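To see what the duplicate check flags, here is a small base R sketch on a toy data frame (hypothetical values). duplicated() marks the second and later copies of a row, so removing the flagged rows keeps the first occurrence of each:

```r
# Toy data with one repeated row (hypothetical values)
toy <- data.frame(Producto = c("A", "B", "A"),
                  Precio   = c(10, 5, 10))

dup_flags <- duplicated(toy)   # TRUE only for the repeated third row
n_dups <- sum(dup_flags)       # number of duplicated rows

# Base R counterpart of removing duplicates: keep the unflagged rows
toy_unique <- toy[!dup_flags, ]
```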

# Alternative to filtering: turn negative prices into positive ones
bd4 <- bd1
bd4$Precio <- abs(bd4$Precio)

# Round fractional units up to the next integer
bd5 <- bd4
bd5$Unidades <- ceiling(bd5$Unidades)
summary(bd5)

# Convert the date column from text (day/month/year) to Date
bd6 <- bd3
bd6$Fecha <- as.Date(bd6$Fecha, format = "%d/%m/%Y")
tibble(bd6)
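The format string tells as.Date() how to read each piece of the text. A short sketch with made-up date strings (not taken from the data set) shows a successful parse and what happens with an impossible date:

```r
# Hypothetical date strings in day/month/year order
fechas <- c("18/03/2023", "31/02/2023")

parsed <- as.Date(fechas, format = "%d/%m/%Y")

class(parsed)    # "Date"
is.na(parsed)    # the impossible date (31 February) becomes NA
```

Counting the NA values after the conversion is a quick way to spot dates that did not match the expected format.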

# Keep the first two characters of the time (the hour) and store them as an integer
bd7 <- bd6
bd7$Hora <- substr(bd7$Hora, start = 1, stop = 2)
tibble(bd7)
bd7$Hora <- as.integer(bd7$Hora)
str(bd7)
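The hour extraction can be sketched on a couple of made-up strings. This assumes Hora is stored as text in an "HH:MM:SS" layout, which the source does not state explicitly:

```r
# Hypothetical time strings; the "HH:MM:SS" layout is an assumption
horas <- c("08:15:00", "14:30:59")

hora_txt <- substr(horas, start = 1, stop = 2)   # first two characters
hora_int <- as.integer(hora_txt)                 # leading zero is dropped

hora_int
```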

# Total number of missing values in the cleaned and the original data
sum(is.na(bd7))
sum(is.na(bd))

sapply(bd, function(x) sum(is.na(x)))
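As an illustration of the two counts, the following sketch uses a toy data frame with made-up values. The first call gives a single total, while the sapply() call gives one count per column:

```r
# Toy data with hypothetical values and two missing entries
toy <- data.frame(Precio   = c(10, NA, 7),
                  Unidades = c(1, 2, NA),
                  Marca    = c("X", "Y", "Z"))

total_na   <- sum(is.na(toy))                          # whole data frame
na_per_col <- sapply(toy, function(x) sum(is.na(x)))   # one count per column

total_na
na_per_col
```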

Replace NA with 0

bd9 <- bd
bd9[is.na(bd9)] <- 0
bd9

Replace NA with the mean

bd10 <- bd
# Fill the missing PLU values with the mean of the observed ones
bd10$PLU[is.na(bd10$PLU)] <- mean(bd10$PLU, na.rm = TRUE)
summary(bd10)
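The two replacement strategies can be compared side by side on a toy data frame (hypothetical values). Replacing with 0 touches every column at once, while the mean replacement here is applied to a single column and leaves the others untouched:

```r
# Toy data with hypothetical values
toy <- data.frame(Precio   = c(2, NA, 4),
                  Unidades = c(1, NA, 3))

# Strategy 1: every NA in the data frame becomes 0
with_zero <- toy
with_zero[is.na(with_zero)] <- 0

# Strategy 2: NA in one column becomes that column's mean
with_mean <- toy
with_mean$Precio[is.na(with_mean$Precio)] <- mean(with_mean$Precio, na.rm = TRUE)

with_zero
with_mean
```

Whether 0 or the mean is a sensible fill value depends on what the column represents.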

Technique 6. Verify with summary statistics

bd11 <- bd7
# A horizontal boxplot makes extreme prices easy to spot
boxplot(bd11$Precio, horizontal = TRUE)
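The points a boxplot draws beyond its whiskers can also be listed numerically. This sketch uses base R's boxplot.stats() on a made-up price vector. By default it flags values more than 1.5 times the hinge spread away from the box:

```r
# Hypothetical prices with one extreme value
precios <- c(10, 12, 11, 13, 12, 95)

stats <- boxplot.stats(precios)

stats$stats          # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers <- stats$out
outliers             # values plotted as individual points in the boxplot
```

Applied to the real data, the same call on bd11$Precio would list the prices worth reviewing by hand.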

Step 4. Manipulating the database

Add columns

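# wday() comes from lubridate; by default it returns 1 for Sunday through 7 for Saturday
bd11$diadelasemana <- lubridate::wday(bd11$Fecha)
summary(bd11)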

# Subtotal per row: price times units
bd11$subtotal <- bd11$Precio * bd11$Unidades
summary(bd11)

Step 5. Export the clean database

bd_limpia <- bd11
# row.names = FALSE keeps the row labels out of the exported file
write.csv(bd_limpia, file = "Abarrotes_bdlimpia.csv", row.names = FALSE)
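A quick way to check an export is to read the file back. The sketch below does this with a toy data frame and a temporary file (both are stand-ins for illustration, not the real data or path). With row.names = FALSE the file contains only the original columns, with no extra leading column of row labels:

```r
# Toy data with hypothetical values
toy <- data.frame(Producto = c("A", "B"),
                  Precio   = c(10.5, 3))

# Temporary path used only for this illustration
ruta <- tempfile(fileext = ".csv")

write.csv(toy, file = ruta, row.names = FALSE)
releida <- read.csv(ruta)

names(releida)   # same column names as the data frame that was written
nrow(releida)
```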

Luis Franco

2023-03-18