DATA VISUALIZATION

SOFTWARE FOR STATISTICAL ANALYSIS AND DATA VISUALIZATION

The main software tools for statistical analysis and data visualization include R (typically used through the RStudio environment), Python, and SPSS, among others. The visualizations presented below were produced with RStudio.

STEPS TO VISUALIZE DATA

LOADING THE LIBRARIES

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(mice)
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(knitr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggplot2)
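
Before the library() calls above can run, each package has to be installed. The following is a minimal sketch, not part of the original workflow, that checks which of the packages used in this section are available. The variable names pkgs, installed and missing_pkgs are illustrative assumptions.

```r
# Packages attached in this section (taken from the library() calls above)
pkgs <- c("tidyverse", "reshape2", "mice", "knitr", "caret")

# requireNamespace() returns TRUE or FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Print a ready-to-copy install command for whatever is missing
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All packages are available.")
}
```

The startup messages also report that several functions are masked (for example filter()). Writing the package prefix explicitly, as in dplyr::filter(), makes the intended function unambiguous.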

LOADING THE DATA

df <- read_csv("C:/Users/hp/Downloads/student-mat.csv")
## Rows: 395 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
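
The absolute path used above only works on the author's machine. As a hedged sketch, the file can instead be located relative to the project folder and checked before reading. The folder name "data" is a hypothetical choice; show_col_types = FALSE is the readr option that the message above suggests for silencing the column report.

```r
# Hypothetical folder; replace with the directory that holds the CSV
data_dir  <- "data"
data_file <- file.path(data_dir, "student-mat.csv")

if (file.exists(data_file)) {
  # Same reader as above, with the column-type message silenced
  df <- readr::read_csv(data_file, show_col_types = FALSE)
  print(dim(df))
} else {
  # Fail softly and show where the file was expected
  message("File not found: ", normalizePath(data_file, mustWork = FALSE))
}
```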

STATISTICAL SUMMARY OF THE DATA

summary(df)
##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00
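
For character columns, summary() reports only the length and class, so it says nothing about how the categories are distributed. One possible complement, sketched here on a small invented data frame (toy and its values are assumptions, not the real data), is a frequency table per character column.

```r
# Toy data standing in for df; the values are invented for illustration
toy <- data.frame(
  school = c("GP", "GP", "MS", "GP", "MS"),
  sex    = c("F", "M", "F", "F", "M"),
  G3     = c(10, 0, 14, 11, 8),
  stringsAsFactors = FALSE
)

# Names of the character columns
char_cols <- names(toy)[sapply(toy, is.character)]

# One frequency table per character column
freqs <- lapply(toy[char_cols], table)
print(freqs$school)
```

Applied to the real data, the same two lines would use df in place of toy.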

IMPORTANCE OF IDENTIFYING THE CATEGORICAL AND NUMERICAL VARIABLES

# CATEGORICAL VARIABLES
categorical_cols <- sapply(df, is.character)
categorical_cols <- names(df[categorical_cols])
categorical_cols
##  [1] "school"     "sex"        "address"    "famsize"    "Pstatus"   
##  [6] "Mjob"       "Fjob"       "reason"     "guardian"   "schoolsup" 
## [11] "famsup"     "paid"       "activities" "nursery"    "higher"    
## [16] "internet"   "romantic"
# NUMERICAL VARIABLES
numerical_cols <- sapply(df, is.numeric)
numerical_cols <- names(df[numerical_cols])
numerical_cols
##  [1] "age"        "Medu"       "Fedu"       "traveltime" "studytime" 
##  [6] "failures"   "famrel"     "freetime"   "goout"      "Dalc"      
## [11] "Walc"       "health"     "absences"   "G1"         "G2"        
## [16] "G3"
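
The split above is based only on the storage type of each column. A small sketch on invented data (toy is an assumption) shows the same split and one common follow-up step, converting the character columns to factors so that their levels are explicit.

```r
# Toy data standing in for df; the values are invented for illustration
toy <- data.frame(
  address = c("U", "R", "U", "U"),
  age     = c(15, 16, 17, 18),
  Medu    = c(4, 2, 3, 1),
  stringsAsFactors = FALSE
)

# Same type-based split as in the section above
categorical_cols <- names(toy)[sapply(toy, is.character)]
numerical_cols   <- names(toy)[sapply(toy, is.numeric)]

# Convert the character columns to factors so their levels are explicit
toy[categorical_cols] <- lapply(toy[categorical_cols], factor)

print(levels(toy$address))
```

A numeric type does not by itself mean a continuous variable: in the summary above, columns such as Medu only range from 0 to 4, so they may be better treated as ordered categories, depending on the analysis.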

MISSING DATA

# MISSING DATA
md.pattern(df, plot = TRUE, rotate.names = TRUE)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 395      1   1   1       1       1       1    1    1    1    1      1        1
##          0   0   0       0       0       0    0    0    0    0      0        0
##     traveltime studytime failures schoolsup famsup paid activities nursery
## 395          1         1        1         1      1    1          1       1
##              0         0        0         0      0    0          0       0
##     higher internet romantic famrel freetime goout Dalc Walc health absences G1
## 395      1        1        1      1        1     1    1    1      1        1  1
##          0        0        0      0        0     0    0    0      0        0  0
##     G2 G3  
## 395  1  1 0
##      0  0 0

IMPORTANT: Before visualizing the data, we need to identify the target variable. In this data set it is G3, the final grade.

# DESCRIPTION OF THE TARGET VARIABLE
ggplot(data = df, aes(x = G3)) +
  geom_bar(fill = "purple") +
  labs(title = "FINAL GRADE", x = "", y = "Count")

# BOXPLOTS: NUMERICAL VARIABLES VS G3
# G3 is converted to a factor so that one box is drawn per grade;
# G3 itself is excluded from the loop.
for (col in setdiff(numerical_cols, "G3")) {
  p <- ggplot(df, aes(x = factor(G3), y = .data[[col]])) +
    geom_boxplot() +
    coord_flip() +
    labs(title = paste("Boxplot of", col, "by G3"), y = col, x = "G3")
  print(p)
}
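
A boxplot of a numerical variable by G3 summarizes, for each grade, the median and quartiles of that variable. As a rough numeric counterpart, the group medians can be computed directly. The sketch below uses invented data (toy and its values are assumptions) and base R's aggregate().

```r
# Toy data standing in for df; the values are invented for illustration
toy <- data.frame(
  G3       = c(8, 8, 8, 10, 10, 10),
  absences = c(2, 4, 10, 0, 6, 8)
)

# Median of absences within each value of G3
stats <- aggregate(absences ~ G3, data = toy, FUN = median)
print(stats)
```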

VISUALIZATION OF THE CATEGORICAL VARIABLES

# CATEGORICAL VARIABLES VS G3
color_palette <- rainbow(21)  # G3 ranges from 0 to 20 in this data
for (col in categorical_cols) {
  # Percentage of each G3 value within each category of `col`
  df_percent <- df %>%
    group_by(!!sym(col)) %>%
    count(G3) %>%
    group_by(!!sym(col)) %>%
    mutate(perc = n / sum(n) * 100) %>%
    mutate(G3 = as.factor(G3))

  # position = "fill" rescales each bar to 1, so the axis is labelled in percent
  p <- ggplot(df_percent, aes(x = .data[[col]], y = perc, fill = G3)) +
    geom_bar(stat = "identity", position = "fill") +
    scale_fill_manual(values = color_palette) +
    labs(title = paste("Percentage of G3 by", col), x = col, y = "Percentage (%)") +
    scale_y_continuous(labels = scales::percent_format())

  print(p)
}
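
The percentages drawn in each stacked bar are row-wise proportions: within one category, the share of students at each grade. A minimal base R sketch of the same calculation, on invented data (toy is an assumption), uses table() and prop.table().

```r
# Toy data standing in for df; the values are invented for illustration
toy <- data.frame(
  sex = c("F", "F", "F", "F", "M", "M"),
  G3  = c(10, 10, 12, 15, 10, 12)
)

# Counts of each grade within each category
counts <- table(toy$sex, toy$G3)

# margin = 1 normalizes each row, so every category sums to 100 percent
perc <- prop.table(counts, margin = 1) * 100
print(round(perc, 1))
```

Each row of perc corresponds to one bar of the plot, and each column to one fill color.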

TASK No. 1

Consider the house price data in house_prices.csv, which the instructor will send by email. Apply the basic data visualization techniques covered in each part of this section to this data set, which concerns the prediction of house prices.