DATA VISUALIZATION
SOFTWARE FOR STATISTICAL ANALYSIS AND DATA VISUALIZATION
The main software tools for statistical analysis and data visualization include R (commonly used through the RStudio IDE), Python, and SPSS, among others. The visualizations presented below were produced in RStudio.
STEPS FOR VISUALIZING DATA
LOADING THE LIBRARIES
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(knitr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
LOADING THE DATA
df <- read_csv("C:/Users/hp/Downloads/student-mat.csv")
## Rows: 395 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
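The message above suggests specifying the column types. A minimal sketch of that option follows, assuming a small inline CSV in place of student-mat.csv; the object names `toy_csv` and `toy` and the three rows are invented for illustration only.

```r
library(readr)

# Toy CSV text standing in for student-mat.csv (illustrative values only).
toy_csv <- I("school,sex,age,G3\nGP,F,18,6\nGP,M,17,10\nMS,F,16,15\n")

# Declaring the column types explicitly means readr does not have to guess
# them, so the column specification message is not printed.
toy <- read_csv(
  toy_csv,
  col_types = cols(
    school = col_character(),
    sex    = col_character(),
    age    = col_double(),
    G3     = col_double()
  )
)

print(dim(toy))
```

The same `col_types` argument can be passed when reading the real file; listing every one of the 33 columns is optional, since `show_col_types = FALSE` alone also silences the message.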
STATISTICAL SUMMARY OF THE DATA
summary(df)
## school sex age address
## Length:395 Length:395 Min. :15.0 Length:395
## Class :character Class :character 1st Qu.:16.0 Class :character
## Mode :character Mode :character Median :17.0 Mode :character
## Mean :16.7
## 3rd Qu.:18.0
## Max. :22.0
## famsize Pstatus Medu Fedu
## Length:395 Length:395 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :3.000 Median :2.000
## Mean :2.749 Mean :2.522
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:395
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:395 Length:395 Length:395 Length:395
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:395 Length:395 Length:395 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.944
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :3.000 Median :1.000 Median :2.000
## Mean :3.235 Mean :3.109 Mean :1.481 Mean :2.291
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## health absences G1 G2
## Min. :1.000 Min. : 0.000 Min. : 3.00 Min. : 0.00
## 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00 1st Qu.: 9.00
## Median :4.000 Median : 4.000 Median :11.00 Median :11.00
## Mean :3.554 Mean : 5.709 Mean :10.91 Mean :10.71
## 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00 3rd Qu.:13.00
## Max. :5.000 Max. :75.000 Max. :19.00 Max. :19.00
## G3
## Min. : 0.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.42
## 3rd Qu.:14.00
## Max. :20.00
IMPORTANCE OF IDENTIFYING THE CATEGORICAL AND NUMERICAL VARIABLES
# CATEGORICAL VARIABLES
categorical_cols <- sapply(df, is.character)
categorical_cols <- names(df[categorical_cols])
categorical_cols
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Mjob" "Fjob" "reason" "guardian" "schoolsup"
## [11] "famsup" "paid" "activities" "nursery" "higher"
## [16] "internet" "romantic"
# NUMERICAL VARIABLES
numerical_cols <- sapply(df, is.numeric)
numerical_cols <- names(df[numerical_cols])
numerical_cols
## [1] "age" "Medu" "Fedu" "traveltime" "studytime"
## [6] "failures" "famrel" "freetime" "goout" "Dalc"
## [11] "Walc" "health" "absences" "G1" "G2"
## [16] "G3"
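The same split between categorical and numerical columns can also be written with dplyr's `select()` and `where()`. This is only an alternative sketch on a made-up two-row tibble (`toy` is hypothetical, not the student data); it is meant to give the same kind of result as the `sapply()` version above.

```r
library(dplyr)

# Hypothetical miniature data set with two character and two numeric columns.
toy <- tibble(
  school = c("GP", "MS"),
  sex    = c("F", "M"),
  age    = c(18, 17),
  G3     = c(6, 10)
)

# Keep only the columns that satisfy the predicate, then take their names.
categorical_cols <- toy %>% select(where(is.character)) %>% names()
numerical_cols   <- toy %>% select(where(is.numeric)) %>% names()

print(categorical_cols)
print(numerical_cols)
```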
MISSING DATA
# MISSING DATA
md.pattern(df, plot = TRUE, rotate.names = TRUE)
## /\ /\
## { `---' }
## { O O }
## ==> V <== No need for mice. This data set is completely observed.
## \ \|/ /
## `-----'
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 395 1 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0 0
## traveltime studytime failures schoolsup famsup paid activities nursery
## 395 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## higher internet romantic famrel freetime goout Dalc Walc health absences G1
## 395 1 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0 0 0
## G2 G3
## 395 1 1 0
## 0 0 0
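For a quick numeric check that does not depend on mice, the missing values can also be counted with base R. The sketch below uses an invented data frame (`toy`) that contains some `NA` values, since the student data set is completely observed.

```r
# Hypothetical data frame with two missing values, for illustration only.
toy <- data.frame(
  age = c(18, NA, 16),
  sex = c("F", "M", NA),
  G3  = c(6, 10, 15)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows with no missing value in any column.
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

On the student data, `colSums(is.na(df))` would be expected to return zero for every column, in agreement with the `md.pattern()` output above.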
IMPORTANT: Before visualizing the data, we need to identify our target variable. In this data set it is G3, the final grade.
# DISTRIBUTION OF THE TARGET VARIABLE
ggplot(data = df, aes(x = G3)) +
  geom_bar(fill = "purple") +
  labs(title = "FINAL GRADE", x = "", y = "Count")
# BOXPLOTS: NUMERICAL VARIABLES VS G3
# G3 is converted to a factor so that one box is drawn per grade (a continuous
# x aesthetic produces a single box and a warning), and G3 itself is dropped
# from the loop because plotting it against itself is uninformative.
# aes() with .data[[col]] replaces the deprecated aes_string().
for (col in setdiff(numerical_cols, "G3")) {
  p <- ggplot(df, aes(x = factor(G3), y = .data[[col]])) +
    geom_boxplot() +
    coord_flip() +
    labs(title = paste("Boxplot of", col, "by G3"), y = col, x = "G3")
  print(p)
}
VISUALIZATION OF THE CATEGORICAL VARIABLES
# CATEGORICAL VARIABLES VS G3
color_palette <- rainbow(21)  # one color for each possible grade, 0 to 20
for (col in categorical_cols) {
  # Count each grade within each category, then convert to a percentage
  # of that category.
  df_percent <- df %>%
    count(!!sym(col), G3) %>%
    group_by(!!sym(col)) %>%
    mutate(perc = n / sum(n) * 100) %>%
    mutate(G3 = as.factor(G3))
  # aes() with .data[[col]] replaces the deprecated aes_string().
  p <- ggplot(df_percent, aes(x = .data[[col]], y = perc, fill = G3)) +
    geom_bar(stat = "identity", position = "fill") +
    scale_fill_manual(values = color_palette) +
    labs(title = paste("Percentage of G3 by", col), x = col, y = "Percentage (%)") +
    scale_y_continuous(labels = scales::percent_format())
  print(p)
}
ASSIGNMENT No. 1
Consider the house-price data in house_prices.csv, which the instructor will send you by email. Apply the basic data visualization techniques studied in each item of this section to this data set, which relates to predicting house prices.
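As a starting point, the steps of this section can be sketched as a short skeleton. The column names `price`, `area`, and `neighborhood` and the four rows below are assumptions made up for illustration; the real house_prices.csv may use different names, so replace the toy data frame with the file once it is available and adjust the target variable accordingly.

```r
library(ggplot2)

# Hypothetical stand-in for house_prices.csv (invented columns and values).
houses <- data.frame(
  price        = c(120000, 185000, 99000, 240000),
  area         = c(80, 120, 65, 150),
  neighborhood = c("North", "South", "North", "East")
)

# Step 1: statistical summary.
print(summary(houses))

# Step 2: identify the categorical and numerical variables.
categorical_cols <- names(houses)[sapply(houses, is.character)]
numerical_cols   <- names(houses)[sapply(houses, is.numeric)]

# Step 3: check for missing data.
print(colSums(is.na(houses)))

# Step 4: describe the assumed target variable (price) with a histogram.
p_target <- ggplot(houses, aes(x = price)) +
  geom_histogram(bins = 10, fill = "purple") +
  labs(title = "HOUSE PRICE", x = "", y = "Count")
print(p_target)
```

Because the price is continuous rather than a grade on a 0 to 20 scale, a histogram is used here in place of the bar chart, and the comparisons against the other variables would need scatter plots or binned prices rather than one box per value.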