Objective

Explore movie data.

Description

Load libraries
Load data
Explore data
Visualize data

Theoretical framework

Data exploration is a first step of data analysis, used to get to know and visualize the data, and either to uncover insights from the very start or to identify areas or patterns to examine in more depth later.

Development

Load libraries

library(ggplot2)  # Plotting
library(readr)  # Reading data (loaded here, although read.csv() below is base R)
library(dplyr)  # Grouping and summarizing data

Load data

datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Analisis-Inteligente-de-datos/main/datos/movies-db.csv", stringsAsFactors = TRUE)
datos

##                                         name year length_min     genre
## 1                                  Toy Story 1995         81 Animation
## 2                                      Akira 1998        125 Animation
## 3                         The Breakfast Club 1985         97     Drama
## 4                                 The Artist 2011        100   Romance
## 5                               Modern Times 1936         87    Comedy
## 6                                 Fight Club 1999        139     Drama
## 7                                City of God 2002        130     Crime
## 8                           The Untouchables 1987        119     Drama
## 9                       Star Wars Episode IV 1977        121    Action
## 10                           American Beauty 1999        122     Drama
## 11                                      Room 2015        118     Drama
## 12                           Dr. Strangelove 1964         94    Comedy
## 13                                  The Ring 1998         95    Horror
## 14           Monty Python and the Holy Grail 1975         91    Comedy
## 15                       High School Musical 2006         98    Comedy
## 16                         Shaun of the Dead 2004         99    Horror
## 17                               Taxi Driver 1976        113     Crime
## 18                  The Shawshank Redemption 1994        142     Crime
## 19                              Interstellar 2014        169 Adventure
## 20                                    Casino 1995        178 Biography
## 21                            The Goodfellas 1990        145 Biography
## 22                Blue is the Warmest Colour 2013        179   Romance
## 23                                Black Swan 2010        108  Thriller
## 24                        Back to the Future 1985        116    Sci-fi
## 25                                  The Wave 2008        107  Thriller
## 26                                  Whiplash 2014        106     Drama
## 27                  The Grand Hotel Budapest 2014        100     Crime
## 28                                   Jumanji 1995        104   Fantasy
## 29 The Eternal Sunshine of the Spotless Mind 2004        108     Drama
## 30                                   Chicago 2002        113    Comedy
## 31                                   Jumangi 2020        120    Action
##    average_rating cost_millions foreign age_restriction
## 1             8.3          30.0       0               0
## 2             8.1          10.4       1              14
## 3             7.9           1.0       0              14
## 4             8.0          15.0       1              12
## 5             8.6           1.5       0              10
## 6             8.9          63.0       0              18
## 7             8.7           3.3       1              18
## 8             7.9          25.0       0              14
## 9             8.7          11.0       0              10
## 10            8.4          15.0       0              14
## 11            8.3          13.0       1              14
## 12            8.5           1.8       1              10
## 13            7.3           1.2       1              18
## 14            8.3           0.4       1              18
## 15            5.2           4.2       0               0
## 16            8.0           6.1       1              18
## 17            8.3           1.3       1              14
## 18            9.3          25.0       0              16
## 19            8.6         165.0       0              10
## 20            8.2          50.0       0              18
## 21            8.7          25.0       0              14
## 22            7.8           4.5       1              18
## 23            8.0          13.0       0              16
## 24            8.5          19.0       0               0
## 25            7.6           5.5       1              16
## 26            8.5           3.3       1              12
## 27            8.1          25.5       0              14
## 28            6.9          65.0       0              12
## 29            8.3          20.0       0              14
## 30            7.2          45.0       0              12
## 31            8.0          50.0       0              12

Explore data

head()

head(datos, 10)

##                    name year length_min     genre average_rating cost_millions
## 1             Toy Story 1995         81 Animation            8.3          30.0
## 2                 Akira 1998        125 Animation            8.1          10.4
## 3    The Breakfast Club 1985         97     Drama            7.9           1.0
## 4            The Artist 2011        100   Romance            8.0          15.0
## 5          Modern Times 1936         87    Comedy            8.6           1.5
## 6            Fight Club 1999        139     Drama            8.9          63.0
## 7           City of God 2002        130     Crime            8.7           3.3
## 8      The Untouchables 1987        119     Drama            7.9          25.0
## 9  Star Wars Episode IV 1977        121    Action            8.7          11.0
## 10      American Beauty 1999        122     Drama            8.4          15.0
##    foreign age_restriction
## 1        0               0
## 2        1              14
## 3        0              14
## 4        1              12
## 5        0              10
## 6        0              18
## 7        1              18
## 8        0              14
## 9        0              10
## 10       0              14

tail()

tail(datos, 10)

##                                         name year length_min    genre
## 22                Blue is the Warmest Colour 2013        179  Romance
## 23                                Black Swan 2010        108 Thriller
## 24                        Back to the Future 1985        116   Sci-fi
## 25                                  The Wave 2008        107 Thriller
## 26                                  Whiplash 2014        106    Drama
## 27                  The Grand Hotel Budapest 2014        100    Crime
## 28                                   Jumanji 1995        104  Fantasy
## 29 The Eternal Sunshine of the Spotless Mind 2004        108    Drama
## 30                                   Chicago 2002        113   Comedy
## 31                                   Jumangi 2020        120   Action
##    average_rating cost_millions foreign age_restriction
## 22            7.8           4.5       1              18
## 23            8.0          13.0       0              16
## 24            8.5          19.0       0               0
## 25            7.6           5.5       1              16
## 26            8.5           3.3       1              12
## 27            8.1          25.5       0              14
## 28            6.9          65.0       0              12
## 29            8.3          20.0       0              14
## 30            7.2          45.0       0              12
## 31            8.0          50.0       0              12

summary()

Descriptive statistics

summary(datos)

##                          name         year        length_min          genre  
##  Akira                     : 1   Min.   :1936   Min.   : 81.0   Drama    :7  
##  American Beauty           : 1   1st Qu.:1988   1st Qu.: 99.5   Comedy   :5  
##  Back to the Future        : 1   Median :1999   Median :113.0   Crime    :4  
##  Black Swan                : 1   Mean   :1996   Mean   :116.9   Action   :2  
##  Blue is the Warmest Colour: 1   3rd Qu.:2009   3rd Qu.:123.5   Animation:2  
##  Casino                    : 1   Max.   :2020   Max.   :179.0   Biography:2  
##  (Other)                   :25                                  (Other)  :9  
##  average_rating cost_millions       foreign       age_restriction
##  Min.   :5.20   Min.   :  0.40   Min.   :0.0000   Min.   : 0.0   
##  1st Qu.:7.95   1st Qu.:  3.75   1st Qu.:0.0000   1st Qu.:12.0   
##  Median :8.30   Median : 13.00   Median :0.0000   Median :14.0   
##  Mean   :8.10   Mean   : 23.19   Mean   :0.3871   Mean   :12.9   
##  3rd Qu.:8.50   3rd Qu.: 25.25   3rd Qu.:1.0000   3rd Qu.:16.0   
##  Max.   :9.30   Max.   :165.00   Max.   :1.0000   Max.   :18.0   
##

str()

Structure of the data

str(datos)

## 'data.frame':    31 obs. of  8 variables:
##  $ name           : Factor w/ 31 levels "Akira","American Beauty",..: 30 1 22 21 15 10 8 28 19 2 ...
##  $ year           : int  1995 1998 1985 2011 1936 1999 2002 1987 1977 1999 ...
##  $ length_min     : int  81 125 97 100 87 139 130 119 121 122 ...
##  $ genre          : Factor w/ 12 levels "Action","Adventure",..: 3 3 7 10 5 7 6 7 1 7 ...
##  $ average_rating : num  8.3 8.1 7.9 8 8.6 8.9 8.7 7.9 8.7 8.4 ...
##  $ cost_millions  : num  30 10.4 1 15 1.5 63 3.3 25 11 15 ...
##  $ foreign        : int  0 1 0 1 0 0 1 0 0 0 ...
##  $ age_restriction: int  0 14 14 12 10 18 18 14 10 14 ...
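
The str() output lists name and genre as factors because the file was read with stringsAsFactors = TRUE. The following is a minimal sketch of that effect; it uses a three-row inline CSV (rows copied from the listing above, columns trimmed) instead of the remote file, and the object names csv_text, as_factor and as_text are illustrative only.

```r
# Three rows taken from the movie listing, reduced to three columns for brevity.
csv_text <- "name,year,genre
Toy Story,1995,Animation
Akira,1998,Animation
The Breakfast Club,1985,Drama"

# Same text read twice: once converting strings to factors, once keeping them as text.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_text   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$genre)    # "factor"
class(as_text$genre)      # "character"
nlevels(as_factor$genre)  # 2 (Animation, Drama)
```

With factors, functions such as summary() report counts per level, which is why the summary(datos) output shows frequencies for genre.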

Visualize data

Variable of interest: foreign (language)

Convert the foreign variable, which is coded as 0/1, to a factor (categorical) type.

datos$foreign <- as.factor(as.character(datos$foreign))
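
As a possible variation, factor() can attach readable labels during the conversion so that plots do not show bare 0 and 1. This sketch uses the foreign values of the first ten films from the head(datos, 10) output; the meaning of the codes (0 = not foreign, 1 = foreign) and the labels "No" and "Yes" are assumptions, since the source does not document the coding.

```r
# foreign values for the first ten films, copied from the head(datos, 10) output.
foreign <- c(0, 1, 0, 1, 0, 0, 1, 0, 0, 0)

# Assumption: 0 = not foreign, 1 = foreign. The labels are illustrative only.
foreign_f <- factor(foreign, levels = c(0, 1), labels = c("No", "Yes"))

levels(foreign_f)  # "No" "Yes"
table(foreign_f)   # counts per label
```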

Plot the frequency of foreign (language) with ggplot()

ggplot(data = datos) +
  geom_bar(aes(x = foreign))

Plot the frequency of foreign (language) with barplot()

Unlike the ggplot() call above, the barplot() function does not require the ggplot2 library; however, it needs data that has already been summarized.

Group data with dplyr functions

resumen <- datos %>%
  group_by(foreign) %>%
  summarise(frecuencia = n())
resumen

## # A tibble: 2 × 2
##   foreign frecuencia
##   <fct>        <int>
## 1 0               19
## 2 1               12

barplot()

barplot(height = resumen$frecuencia, names.arg = resumen$foreign)
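
The summarized counts that barplot() needs can also be obtained without dplyr. A minimal sketch in base R, assuming the 31 foreign values shown in the data listing; the name frecuencias is illustrative:

```r
# foreign column for the 31 films, copied from the data listing.
foreign <- c(0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
             1, 1, 1, 1, 0, 1, 1, 0, 0, 0,
             0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0)

# table() counts each distinct value; barplot() accepts the table directly.
frecuencias <- table(foreign)
frecuencias

barplot(frecuencias, xlab = "foreign", ylab = "frecuencia")
```

The counts agree with the tibble above: 19 films coded 0 and 12 coded 1.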

Variable of interest: genre

The genre variable is already a factor (it was read with stringsAsFactors = TRUE), so its frequencies can be counted directly.

ggplot(data = datos) +
  geom_bar(aes(x = genre))

Plot the frequency of genre with barplot()

Unlike the ggplot() call above, the barplot() function does not require the ggplot2 library; however, it needs data that has already been summarized.

Group data with dplyr functions

resumen <- datos %>%
  group_by(genre) %>%
  summarise(frecuencia = n())
resumen

## # A tibble: 12 × 2
##    genre     frecuencia
##    <fct>          <int>
##  1 Action             2
##  2 Adventure          1
##  3 Animation          2
##  4 Biography          2
##  5 Comedy             5
##  6 Crime              4
##  7 Drama              7
##  8 Fantasy            1
##  9 Horror             2
## 10 Romance            2
## 11 Sci-fi             1
## 12 Thriller           2

barplot()

barplot(height = resumen$frecuencia, names.arg = resumen$genre)
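
With twelve genres, the bars are easier to read when sorted by frequency and when the labels are rotated. The following base R sketch is one way to do that; the genre values are copied from the data listing, and the name ordenado and the graphical settings (las, cex.names) are illustrative choices.

```r
# genre column for the 31 films, copied from the data listing.
genre <- c("Animation", "Animation", "Drama", "Romance", "Comedy", "Drama",
           "Crime", "Drama", "Action", "Drama", "Drama", "Comedy", "Horror",
           "Comedy", "Comedy", "Horror", "Crime", "Crime", "Adventure",
           "Biography", "Biography", "Romance", "Thriller", "Sci-fi",
           "Thriller", "Drama", "Crime", "Fantasy", "Drama", "Comedy", "Action")

# Count films per genre and sort from most to least frequent.
ordenado <- sort(table(genre), decreasing = TRUE)
ordenado

# las = 2 draws the labels perpendicular to the axis; cex.names shrinks them.
barplot(ordenado, las = 2, cex.names = 0.8, ylab = "frecuencia")
```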

Interpretation

The data read from the file describes a set of movies; the table (data.frame) contains 31 records and eight fields.

Regarding the age restriction, 3 movies have none, 4 are rated for ten years and over, 5 for twelve, 9 for fourteen, 3 for sixteen, and 7 for eighteen.
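
These counts can be reproduced with table(). A minimal sketch, using the age_restriction values copied from the data listing; the name conteo is illustrative:

```r
# age_restriction column for the 31 films, copied from the data listing.
age_restriction <- c(0, 14, 14, 12, 10, 18, 18, 14, 10, 14,
                     14, 10, 18, 18, 0, 18, 14, 16, 10, 18,
                     14, 18, 16, 0, 16, 12, 14, 12, 14, 12, 12)

# Number of films at each age restriction.
conteo <- table(age_restriction)
conteo
```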

In terms of rating, “The Shawshank Redemption”, a crime movie, has the highest average rating at 9.3, while the lowest is “High School Musical”, a comedy, at 5.2.
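
One way to locate such extremes is with which.max() and which.min(). The sketch below uses only five rows copied from the data listing (a subset chosen to include both extremes, not the full table); the data frame name peliculas is illustrative.

```r
# Five films copied from the data listing; this is a subset, not the full table.
peliculas <- data.frame(
  name = c("Toy Story", "Fight Club", "High School Musical",
           "The Shawshank Redemption", "Jumanji"),
  genre = c("Animation", "Drama", "Comedy", "Crime", "Fantasy"),
  average_rating = c(8.3, 8.9, 5.2, 9.3, 6.9)
)

# which.max()/which.min() return the row index of the extreme value.
peliculas[which.max(peliculas$average_rating), ]
peliculas[which.min(peliculas$average_rating), ]
```

On the full data frame the same calls would be applied to datos$average_rating.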

As for descriptive statistics, the dispersion (sample standard deviation) of the year is 17.586346, and that of the ratings is 0.734847. Neither value appears in the summary() output; both correspond to what sd() returns for those columns.
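
The two dispersion values can be checked with sd(), which computes the sample standard deviation (dividing by n - 1). A minimal sketch, using the year and average_rating columns copied from the data listing:

```r
# year and average_rating columns for the 31 films, copied from the data listing.
year <- c(1995, 1998, 1985, 2011, 1936, 1999, 2002, 1987, 1977, 1999,
          2015, 1964, 1998, 1975, 2006, 2004, 1976, 1994, 2014, 1995,
          1990, 2013, 2010, 1985, 2008, 2014, 2014, 1995, 2004, 2002, 2020)
average_rating <- c(8.3, 8.1, 7.9, 8.0, 8.6, 8.9, 8.7, 7.9, 8.7, 8.4,
                    8.3, 8.5, 7.3, 8.3, 5.2, 8.0, 8.3, 9.3, 8.6, 8.2,
                    8.7, 7.8, 8.0, 8.5, 7.6, 8.5, 8.1, 6.9, 8.3, 7.2, 8.0)

# Sample standard deviation of each column.
sd(year)            # about 17.586346
sd(average_rating)  # about 0.734847
```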

The case leaves room for broad conclusions about what the results can be attributed to: the fields may point to a reason, an idea, or something the records have in common that explains those results.

We can compare different fields, for example to ask why movies of a certain genre are less successful than others, or why some movies within the same genre are more successful than others. Likewise, running time, production cost, and rating may or may not play a role.
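
As a starting point for this kind of comparison, the average rating can be computed per genre. This is only a sketch under the assumption that average_rating is taken as a proxy for success, which the data itself does not establish; with so few films per genre the averages are descriptive, not conclusive. The name media_por_genero is illustrative.

```r
# genre and average_rating columns for the 31 films, copied from the data listing.
genre <- c("Animation", "Animation", "Drama", "Romance", "Comedy", "Drama",
           "Crime", "Drama", "Action", "Drama", "Drama", "Comedy", "Horror",
           "Comedy", "Comedy", "Horror", "Crime", "Crime", "Adventure",
           "Biography", "Biography", "Romance", "Thriller", "Sci-fi",
           "Thriller", "Drama", "Crime", "Fantasy", "Drama", "Comedy", "Action")
average_rating <- c(8.3, 8.1, 7.9, 8.0, 8.6, 8.9, 8.7, 7.9, 8.7, 8.4,
                    8.3, 8.5, 7.3, 8.3, 5.2, 8.0, 8.3, 9.3, 8.6, 8.2,
                    8.7, 7.8, 8.0, 8.5, 7.6, 8.5, 8.1, 6.9, 8.3, 7.2, 8.0)

# Mean rating within each genre, sorted from highest to lowest.
media_por_genero <- sort(tapply(average_rating, genre, mean), decreasing = TRUE)
round(media_por_genero, 2)
```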

Finally, based on the description of the data, we can find both large and small values and from them hypothesize what might drive a movie's success, the average interest in a movie, or the effect of being foreign or not.

Case 1: Exploring movie data

Carlos Daniel Reyes Valenzuela

2022-08-31
