R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
movies_1980_2020_30k <- read_csv("C:/Users/natal/OneDrive/Escritorio/Diplomado de Big data UAO/Rstudio diplomado big data/CLASE 4/CLASE 4 EJERCICIO/movies_1980_2020_30k.csv")
## Rows: 30000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Title, Director, Genre
## dbl  (2): Duration, Rating
## date (1): Release Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(movies_1980_2020_30k)

Piping %>%

The %>% (pipe) operator makes code easier to read by expressing a sequence of operations as a pipeline. Instead of nesting function calls or assigning intermediate results to variables, %>% chains the operations in a clearer, more direct way. The three calls below are equivalent; in the last one, the dot (.) marks where the left-hand value is inserted.

head(movies_1980_2020_30k,10)
## # A tibble: 10 × 6
##    Title                   Director         Genre `Release Date` Duration Rating
##    <chr>                   <chr>            <chr> <date>            <dbl>  <dbl>
##  1 Key entire popular.     Anthony Becker   Horr… 1981-05-12          102    6.8
##  2 Gun husband reveal.     William Johnson  Docu… 2016-06-13           92    7.6
##  3 Crime cover.            Amy Le           Drama 1988-03-22          144    5.5
##  4 Challenge.              Andrea Martinez  Roma… 2013-04-01          161    2  
##  5 Close study.            Michael Rodgers  Fant… 2012-10-18          177    3.7
##  6 Must customer.          Christina Jimen… Roma… 1997-12-22          144    5.2
##  7 Recent example benefit. Megan Sims       Sci-… 1981-05-10          138    8.2
##  8 By product.             Samantha Anders… Roma… 2015-07-18          109    8  
##  9 When drive not.         Margaret Murphy  Roma… 1992-10-02          173    6.9
## 10 Phone garden.           Mandy Brooks     Thri… 2012-07-16          154    1.6
movies_1980_2020_30k %>% head(10)
## # A tibble: 10 × 6
##    Title                   Director         Genre `Release Date` Duration Rating
##    <chr>                   <chr>            <chr> <date>            <dbl>  <dbl>
##  1 Key entire popular.     Anthony Becker   Horr… 1981-05-12          102    6.8
##  2 Gun husband reveal.     William Johnson  Docu… 2016-06-13           92    7.6
##  3 Crime cover.            Amy Le           Drama 1988-03-22          144    5.5
##  4 Challenge.              Andrea Martinez  Roma… 2013-04-01          161    2  
##  5 Close study.            Michael Rodgers  Fant… 2012-10-18          177    3.7
##  6 Must customer.          Christina Jimen… Roma… 1997-12-22          144    5.2
##  7 Recent example benefit. Megan Sims       Sci-… 1981-05-10          138    8.2
##  8 By product.             Samantha Anders… Roma… 2015-07-18          109    8  
##  9 When drive not.         Margaret Murphy  Roma… 1992-10-02          173    6.9
## 10 Phone garden.           Mandy Brooks     Thri… 2012-07-16          154    1.6
10 %>% head(movies_1980_2020_30k, .)
## # A tibble: 10 × 6
##    Title                   Director         Genre `Release Date` Duration Rating
##    <chr>                   <chr>            <chr> <date>            <dbl>  <dbl>
##  1 Key entire popular.     Anthony Becker   Horr… 1981-05-12          102    6.8
##  2 Gun husband reveal.     William Johnson  Docu… 2016-06-13           92    7.6
##  3 Crime cover.            Amy Le           Drama 1988-03-22          144    5.5
##  4 Challenge.              Andrea Martinez  Roma… 2013-04-01          161    2  
##  5 Close study.            Michael Rodgers  Fant… 2012-10-18          177    3.7
##  6 Must customer.          Christina Jimen… Roma… 1997-12-22          144    5.2
##  7 Recent example benefit. Megan Sims       Sci-… 1981-05-10          138    8.2
##  8 By product.             Samantha Anders… Roma… 2015-07-18          109    8  
##  9 When drive not.         Margaret Murphy  Roma… 1992-10-02          173    6.9
## 10 Phone garden.           Mandy Brooks     Thri… 2012-07-16          154    1.6
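
To make the contrast between nesting and piping concrete, here is a minimal sketch on a small made-up table. The name toy_movies and its values are hypothetical stand-ins for the movies data, not part of the original exercise.

```r
library(dplyr)

# Hypothetical toy data standing in for the movies table
toy_movies <- tibble(
  Title  = c("A", "B", "C", "D"),
  Rating = c(6.8, 7.6, 5.5, 2.0)
)

# Nested calls: read from the inside out
nested <- head(arrange(toy_movies, desc(Rating)), 2)

# Piped calls: read from top to bottom
piped <- toy_movies %>%
  arrange(desc(Rating)) %>%
  head(2)

# The dot placeholder sends the left-hand side to an argument
# other than the first one (here, the n argument of head())
with_dot <- 2 %>% head(arrange(toy_movies, desc(Rating)), .)

identical(nested, piped)
identical(nested, with_dot)
```

All three forms should build the same two-row table; the piped version simply lists the steps in the order they run.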

Select

The select() function picks specific columns from a data frame. It is useful when working with large datasets where only a subset of the columns is needed. Columns can be listed by name, given as a range (Director:Genre), renamed on the fly (new_name = old_name), or excluded with a minus sign.

movies_1980_2020_30k %>%
  select(Title, Director, Genre, `Release Date`, Rating)
## # A tibble: 30,000 × 5
##    Title                   Director          Genre       `Release Date` Rating
##    <chr>                   <chr>             <chr>       <date>          <dbl>
##  1 Key entire popular.     Anthony Becker    Horror      1981-05-12        6.8
##  2 Gun husband reveal.     William Johnson   Documentary 2016-06-13        7.6
##  3 Crime cover.            Amy Le            Drama       1988-03-22        5.5
##  4 Challenge.              Andrea Martinez   Romance     2013-04-01        2  
##  5 Close study.            Michael Rodgers   Fantasy     2012-10-18        3.7
##  6 Must customer.          Christina Jimenez Romance     1997-12-22        5.2
##  7 Recent example benefit. Megan Sims        Sci-Fi      1981-05-10        8.2
##  8 By product.             Samantha Anderson Romance     2015-07-18        8  
##  9 When drive not.         Margaret Murphy   Romance     1992-10-02        6.9
## 10 Phone garden.           Mandy Brooks      Thriller    2012-07-16        1.6
## # ℹ 29,990 more rows
movies_1980_2020_30k %>%
  select(Director:Genre, most_popular='Rating')
## # A tibble: 30,000 × 3
##    Director          Genre       most_popular
##    <chr>             <chr>              <dbl>
##  1 Anthony Becker    Horror               6.8
##  2 William Johnson   Documentary          7.6
##  3 Amy Le            Drama                5.5
##  4 Andrea Martinez   Romance              2  
##  5 Michael Rodgers   Fantasy              3.7
##  6 Christina Jimenez Romance              5.2
##  7 Megan Sims        Sci-Fi               8.2
##  8 Samantha Anderson Romance              8  
##  9 Margaret Murphy   Romance              6.9
## 10 Mandy Brooks      Thriller             1.6
## # ℹ 29,990 more rows
movies_1980_2020_30k %>%
  select(-'Release Date', -'Duration')
## # A tibble: 30,000 × 4
##    Title                   Director          Genre       Rating
##    <chr>                   <chr>             <chr>        <dbl>
##  1 Key entire popular.     Anthony Becker    Horror         6.8
##  2 Gun husband reveal.     William Johnson   Documentary    7.6
##  3 Crime cover.            Amy Le            Drama          5.5
##  4 Challenge.              Andrea Martinez   Romance        2  
##  5 Close study.            Michael Rodgers   Fantasy        3.7
##  6 Must customer.          Christina Jimenez Romance        5.2
##  7 Recent example benefit. Megan Sims        Sci-Fi         8.2
##  8 By product.             Samantha Anderson Romance        8  
##  9 When drive not.         Margaret Murphy   Romance        6.9
## 10 Phone garden.           Mandy Brooks      Thriller       1.6
## # ℹ 29,990 more rows
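
The selection styles used above can be tried on a small made-up table. This is a sketch under stated assumptions: toy_movies and its values are hypothetical, and it uses backticks for the column name containing a space, which is the more conventional spelling, although the quoted strings used above also work.

```r
library(dplyr)

# Hypothetical two-row table with the same column names as the movies data
toy_movies <- tibble(
  Title          = c("A", "B"),
  Director       = c("X", "Y"),
  Genre          = c("Drama", "Horror"),
  `Release Date` = as.Date(c("1981-05-12", "2016-06-13")),
  Duration       = c(102, 92),
  Rating         = c(6.8, 7.6)
)

# A range of adjacent columns, plus a renamed column
by_range <- toy_movies %>%
  select(Director:Genre, most_popular = Rating)

# Drop columns with a minus sign; backticks handle the space in the name
dropped <- toy_movies %>%
  select(-`Release Date`, -Duration)

# Select from a character vector of names with all_of()
wanted <- c("Title", "Rating")
by_vector <- toy_movies %>%
  select(all_of(wanted))

names(by_range)
names(dropped)
names(by_vector)
```

all_of() is handy when the column names arrive as strings, for example from a configuration file or a function argument.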

Mutate

The mutate() function adds new columns to a data frame or modifies existing ones. You can perform arithmetic, apply functions to existing columns, and create new variables derived from them.

movies_1980_2020_30k %>%
  select(Director, Genre, most_popular = Rating, everything()) %>%
  mutate(is_collab = grepl('Never', Title) & grepl('Fantasy', Genre)) %>%
  select(Genre, Director, is_collab, everything())
## # A tibble: 30,000 × 7
##    Genre       Director     is_collab most_popular Title `Release Date` Duration
##    <chr>       <chr>        <lgl>            <dbl> <chr> <date>            <dbl>
##  1 Horror      Anthony Bec… FALSE              6.8 Key … 1981-05-12          102
##  2 Documentary William Joh… FALSE              7.6 Gun … 2016-06-13           92
##  3 Drama       Amy Le       FALSE              5.5 Crim… 1988-03-22          144
##  4 Romance     Andrea Mart… FALSE              2   Chal… 2013-04-01          161
##  5 Fantasy     Michael Rod… FALSE              3.7 Clos… 2012-10-18          177
##  6 Romance     Christina J… FALSE              5.2 Must… 1997-12-22          144
##  7 Sci-Fi      Megan Sims   FALSE              8.2 Rece… 1981-05-10          138
##  8 Romance     Samantha An… FALSE              8   By p… 2015-07-18          109
##  9 Romance     Margaret Mu… FALSE              6.9 When… 1992-10-02          173
## 10 Thriller    Mandy Brooks FALSE              1.6 Phon… 2012-07-16          154
## # ℹ 29,990 more rows
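In the rows shown, is_collab is FALSE throughout. Note that the grepl() condition does not actually detect collaborations: it is TRUE only when the Title contains 'Never' and the Genre contains 'Fantasy'. This dataset has no collaborations to detect in any case, because the Director column holds a single name per film, unlike songs, where artists record duets. A row such as "Never effort chair." (Fantasy), which appears in the Filter output further down, would be flagged TRUE, so the column is not FALSE for the whole table.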
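
Since the printed preview only shows ten rows, a quick way to see whether a logical flag is ever TRUE is to build it on a small table and filter on it. The sketch below is illustrative: toy_movies, the Duration values, and the column names never_fantasy and duration_hours are hypothetical.

```r
library(dplyr)

# Hypothetical rows; only the first one matches both patterns
toy_movies <- tibble(
  Title    = c("Never effort chair.", "Close study.", "Never again."),
  Genre    = c("Fantasy", "Fantasy", "Drama"),
  Duration = c(120, 177, 90)
)

flagged <- toy_movies %>%
  mutate(
    # Arithmetic on an existing column
    duration_hours = Duration / 60,
    # Logical flag: both patterns must match for TRUE
    never_fantasy = grepl("Never", Title) & grepl("Fantasy", Genre)
  )

# Keep only the flagged rows to confirm the flag is not always FALSE
flagged %>% filter(never_fantasy)

# Count how many rows are flagged
sum(flagged$never_fantasy)
```

On the full movies table, the same sum() check on is_collab would show how many rows satisfy the condition.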

Filter

The filter() function keeps the rows of a data frame that satisfy the given conditions. You can combine comparisons with logical operators to specify which rows are included in the result. Conditions separated by commas must all hold (logical AND).

movies_1980_2020_30k %>%
  select(Title, Director, Genre, `Release Date`, Rating) %>%
  filter(Rating >= 8, Director == 'Julie Ryan' | Director == 'James Smith')
## # A tibble: 5 × 5
##   Title                   Director    Genre     `Release Date` Rating
##   <chr>                   <chr>       <chr>     <date>          <dbl>
## 1 Try into himself.       James Smith Action    1999-06-01        9.4
## 2 Nor catch.              Julie Ryan  Adventure 2008-03-20       10  
## 3 Eye large.              James Smith Horror    1983-06-10        8.9
## 4 Never effort chair.     James Smith Fantasy   2004-08-22        9.9
## 5 Particular doctor term. James Smith Action    1983-02-25        9.7
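
When one column is compared against several values, the chain of == tests joined by | can be written with %in%. A minimal sketch, assuming a hypothetical toy_movies table with made-up ratings:

```r
library(dplyr)

# Hypothetical data: two directors of interest plus one other
toy_movies <- tibble(
  Director = c("Julie Ryan", "James Smith", "James Smith", "Amy Le"),
  Rating   = c(10, 9.4, 4.8, 8.5)
)

# Comma-separated conditions are combined with AND
with_or <- toy_movies %>%
  filter(Rating >= 8, Director == "Julie Ryan" | Director == "James Smith")

# Same rows, using %in% for the set of directors
with_in <- toy_movies %>%
  filter(Rating >= 8, Director %in% c("Julie Ryan", "James Smith"))

identical(with_or, with_in)
```

The %in% form scales better as the list of values grows, since the vector of names can be stored in a variable.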

Distinct

The distinct() function returns the unique rows of a data frame, or the unique combinations of values in a specified set of columns. Use it to remove duplicates based on one or more columns.

movies_1980_2020_30k %>%
  select(Title:Director, most_popular='Rating') %>%
  filter(Director == 'James Smith') 
## # A tibble: 13 × 3
##    Title                      Director    most_popular
##    <chr>                      <chr>              <dbl>
##  1 Green rule black.          James Smith          1.6
##  2 Close ten instead.         James Smith          5.1
##  3 Try into himself.          James Smith          9.4
##  4 So official.               James Smith          4.8
##  5 Work consider reality.     James Smith          1.1
##  6 Eye large.                 James Smith          8.9
##  7 Never effort chair.        James Smith          9.9
##  8 Painting and different.    James Smith          4.6
##  9 Laugh serious.             James Smith          6.1
## 10 Blue product material use. James Smith          6.8
## 11 Particular doctor term.    James Smith          9.7
## 12 Oil wide.                  James Smith          6.2
## 13 Gun who middle.            James Smith          2.2
distinct_titles <- movies_1980_2020_30k %>%
  select(Title:Director, most_popular='Rating') %>%
  filter(Director == 'James Smith') %>%
  distinct(Title)
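
Because the result above is assigned to a variable, nothing is printed. The following sketch shows what distinct() does on a small table that really contains a duplicate; toy_movies and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data with one fully duplicated row
toy_movies <- tibble(
  Director = c("James Smith", "James Smith", "Amy Le"),
  Title    = c("Eye large.", "Eye large.", "Crime cover."),
  Rating   = c(8.9, 8.9, 5.5)
)

# Unique rows across all columns
unique_rows <- toy_movies %>% distinct()

# Unique values of one column; the other columns are dropped
unique_directors <- toy_movies %>% distinct(Director)

# Unique by one column while keeping the first row of each value
first_per_director <- toy_movies %>% distinct(Director, .keep_all = TRUE)

nrow(toy_movies)
nrow(unique_rows)
names(unique_directors)
names(first_per_director)
```

Comparing nrow() before and after distinct() is a simple check for whether a table contains duplicates at all.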

Group_by & Summarise

The group_by() function groups a data frame by one or more columns, creating one group for each unique combination of values in those columns. Summary functions such as summarise() can then be applied to each group.

The summarise() function computes summaries or aggregations within each group created by group_by(), using functions such as mean(), sum(), min(), and max().

movies_1980_2020_30k %>%
  select(Title:Director, most_popular='Rating') %>%
  filter(Director == 'James Smith') %>%
  group_by(Title) %>%
  summarise(total_most_popular = mean(most_popular))
## # A tibble: 13 × 2
##    Title                      total_most_popular
##    <chr>                                   <dbl>
##  1 Blue product material use.                6.8
##  2 Close ten instead.                        5.1
##  3 Eye large.                                8.9
##  4 Green rule black.                         1.6
##  5 Gun who middle.                           2.2
##  6 Laugh serious.                            6.1
##  7 Never effort chair.                       9.9
##  8 Oil wide.                                 6.2
##  9 Painting and different.                   4.6
## 10 Particular doctor term.                   9.7
## 11 So official.                              4.8
## 12 Try into himself.                         9.4
## 13 Work consider reality.                    1.1
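
In the output above each title forms its own group (13 rows in, 13 rows out), so the mean simply returns each film's rating. Grouping by a coarser key makes the aggregation visible. A minimal sketch with hypothetical data and column names (toy_movies, mean_rating, n_movies):

```r
library(dplyr)

# Hypothetical data: two dramas and one horror film
toy_movies <- tibble(
  Genre  = c("Drama", "Drama", "Horror"),
  Rating = c(5.5, 6.5, 8.0)
)

by_genre <- toy_movies %>%
  group_by(Genre) %>%
  summarise(
    mean_rating = mean(Rating),  # average rating within each genre
    n_movies    = n()            # number of rows in each genre
  )

by_genre
```

Here the two Drama rows collapse into a single row, which is the behavior that grouping by Title could not show.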

Arrange

The arrange() function from the dplyr package sorts the rows of a data frame by one or more columns. Rows are sorted in ascending order by default; wrap a column in desc() to sort it in descending order. Additional columns act as tie-breakers.

movies_1980_2020_30k %>%
  select(Title, Director, Rating) %>%
  filter(Director == 'James Smith') %>%
  group_by(Title) %>%
  summarise(Total_Rating = max(Rating)) %>%
  arrange(desc(Total_Rating), Title) %>%
  head(10)
## # A tibble: 10 × 2
##    Title                      Total_Rating
##    <chr>                             <dbl>
##  1 Never effort chair.                 9.9
##  2 Particular doctor term.             9.7
##  3 Try into himself.                   9.4
##  4 Eye large.                          8.9
##  5 Blue product material use.          6.8
##  6 Oil wide.                           6.2
##  7 Laugh serious.                      6.1
##  8 Close ten instead.                  5.1
##  9 So official.                        4.8
## 10 Painting and different.             4.6
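
The role of the second sort column is easiest to see when there is a tie. A small sketch, assuming a hypothetical toy_movies table in which two films share the top rating:

```r
library(dplyr)

# Hypothetical data with a tie on Rating
toy_movies <- tibble(
  Title  = c("b", "a", "c"),
  Rating = c(9, 9, 7)
)

# Highest rating first; ties are broken alphabetically by Title
sorted <- toy_movies %>%
  arrange(desc(Rating), Title)

sorted$Title
```

Without the Title tie-breaker, the tied rows would stay in their original relative order.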

Count

The count() function counts the number of observations in each group. It is shorthand for group_by() followed by summarise(n = n()), so the grouping columns can be passed to it directly.

movies_1980_2020_30k %>%
  select(Title:Director, most_popular='Rating') %>%
  count(Director) %>%
  arrange(desc(n))
## # A tibble: 25,844 × 2
##    Director              n
##    <chr>             <int>
##  1 James Smith          13
##  2 Christopher Smith    12
##  3 Lisa Smith           11
##  4 David Smith          10
##  5 Jennifer Smith       10
##  6 John Smith           10
##  7 Michael Miller       10
##  8 Michael Smith        10
##  9 David Davis           9
## 10 Jessica Johnson       9
## # ℹ 25,834 more rows
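
The relationship between count() and the group_by() plus summarise() pattern can be sketched as follows. The table toy_movies is hypothetical; sort = TRUE is a count() argument that orders the result by descending n, replacing the separate arrange() step.

```r
library(dplyr)

# Hypothetical data: one director appears twice
toy_movies <- tibble(
  Director = c("James Smith", "Amy Le", "James Smith")
)

# count() with built-in sorting
counted <- toy_movies %>%
  count(Director, sort = TRUE)

# The longer equivalent written out by hand
manual <- toy_movies %>%
  group_by(Director) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

counted
all.equal(counted$n, manual$n)
```

Both pipelines should list James Smith first with a count of 2.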