This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
movies_1980_2020_30k <- read_csv("C:/Users/natal/OneDrive/Escritorio/Diplomado de Big data UAO/Rstudio diplomado big data/CLASE 4/CLASE 4 EJERCICIO/movies_1980_2020_30k.csv")
## Rows: 30000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Title, Director, Genre
## dbl (2): Duration, Rating
## date (1): Release Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(movies_1980_2020_30k)
## Piping %>%
The %>% (pipe) operator makes code more readable by expressing a sequence of operations as a pipeline. Instead of nesting function calls or assigning intermediate results to variables, %>% passes the result of each step to the next one as its first argument, so the operations are chained in the order they run. The . placeholder sends the piped value to a different argument position. The three calls below are equivalent.
head(movies_1980_2020_30k,10)
## # A tibble: 10 × 6
## Title Director Genre `Release Date` Duration Rating
## <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 Key entire popular. Anthony Becker Horr… 1981-05-12 102 6.8
## 2 Gun husband reveal. William Johnson Docu… 2016-06-13 92 7.6
## 3 Crime cover. Amy Le Drama 1988-03-22 144 5.5
## 4 Challenge. Andrea Martinez Roma… 2013-04-01 161 2
## 5 Close study. Michael Rodgers Fant… 2012-10-18 177 3.7
## 6 Must customer. Christina Jimen… Roma… 1997-12-22 144 5.2
## 7 Recent example benefit. Megan Sims Sci-… 1981-05-10 138 8.2
## 8 By product. Samantha Anders… Roma… 2015-07-18 109 8
## 9 When drive not. Margaret Murphy Roma… 1992-10-02 173 6.9
## 10 Phone garden. Mandy Brooks Thri… 2012-07-16 154 1.6
movies_1980_2020_30k %>% head(10)
## # A tibble: 10 × 6
## Title Director Genre `Release Date` Duration Rating
## <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 Key entire popular. Anthony Becker Horr… 1981-05-12 102 6.8
## 2 Gun husband reveal. William Johnson Docu… 2016-06-13 92 7.6
## 3 Crime cover. Amy Le Drama 1988-03-22 144 5.5
## 4 Challenge. Andrea Martinez Roma… 2013-04-01 161 2
## 5 Close study. Michael Rodgers Fant… 2012-10-18 177 3.7
## 6 Must customer. Christina Jimen… Roma… 1997-12-22 144 5.2
## 7 Recent example benefit. Megan Sims Sci-… 1981-05-10 138 8.2
## 8 By product. Samantha Anders… Roma… 2015-07-18 109 8
## 9 When drive not. Margaret Murphy Roma… 1992-10-02 173 6.9
## 10 Phone garden. Mandy Brooks Thri… 2012-07-16 154 1.6
10 %>% head(movies_1980_2020_30k, .)
## # A tibble: 10 × 6
## Title Director Genre `Release Date` Duration Rating
## <chr> <chr> <chr> <date> <dbl> <dbl>
## 1 Key entire popular. Anthony Becker Horr… 1981-05-12 102 6.8
## 2 Gun husband reveal. William Johnson Docu… 2016-06-13 92 7.6
## 3 Crime cover. Amy Le Drama 1988-03-22 144 5.5
## 4 Challenge. Andrea Martinez Roma… 2013-04-01 161 2
## 5 Close study. Michael Rodgers Fant… 2012-10-18 177 3.7
## 6 Must customer. Christina Jimen… Roma… 1997-12-22 144 5.2
## 7 Recent example benefit. Megan Sims Sci-… 1981-05-10 138 8.2
## 8 By product. Samantha Anders… Roma… 2015-07-18 109 8
## 9 When drive not. Margaret Murphy Roma… 1992-10-02 173 6.9
## 10 Phone garden. Mandy Brooks Thri… 2012-07-16 154 1.6
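To make the contrast with nested calls concrete, here is a minimal sketch on a small made-up tibble (the name `toy` and its values are hypothetical, not taken from the movies file). It assumes dplyr is installed and builds the same result twice, once with nested calls and once with the pipe.

```r
library(dplyr)

# Hypothetical toy data, not taken from the movies file
toy <- tibble(
  Title  = c("A", "B", "C", "D"),
  Rating = c(7.5, 3.2, 9.1, 5.0)
)

# Nested calls are read from the inside out
nested <- head(arrange(filter(toy, Rating > 4), desc(Rating)), 2)

# The same steps with the pipe are read from top to bottom
piped <- toy %>%
  filter(Rating > 4) %>%
  arrange(desc(Rating)) %>%
  head(2)

# Both approaches keep the two highest-rated titles above 4
piped$Title
```

Both objects hold the same two rows; only the way the steps are written differs.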
## Select
The select() function picks specific columns from a data frame. It is useful when you work with large data sets and only need a subset of the columns. It can also rename columns as it selects them, select a range with :, and drop columns with a leading minus sign.
movies_1980_2020_30k %>%
select(Title, Director, Genre, `Release Date`, Rating)
## # A tibble: 30,000 × 5
## Title Director Genre `Release Date` Rating
## <chr> <chr> <chr> <date> <dbl>
## 1 Key entire popular. Anthony Becker Horror 1981-05-12 6.8
## 2 Gun husband reveal. William Johnson Documentary 2016-06-13 7.6
## 3 Crime cover. Amy Le Drama 1988-03-22 5.5
## 4 Challenge. Andrea Martinez Romance 2013-04-01 2
## 5 Close study. Michael Rodgers Fantasy 2012-10-18 3.7
## 6 Must customer. Christina Jimenez Romance 1997-12-22 5.2
## 7 Recent example benefit. Megan Sims Sci-Fi 1981-05-10 8.2
## 8 By product. Samantha Anderson Romance 2015-07-18 8
## 9 When drive not. Margaret Murphy Romance 1992-10-02 6.9
## 10 Phone garden. Mandy Brooks Thriller 2012-07-16 1.6
## # ℹ 29,990 more rows
movies_1980_2020_30k %>%
select(Director:Genre, most_popular='Rating')
## # A tibble: 30,000 × 3
## Director Genre most_popular
## <chr> <chr> <dbl>
## 1 Anthony Becker Horror 6.8
## 2 William Johnson Documentary 7.6
## 3 Amy Le Drama 5.5
## 4 Andrea Martinez Romance 2
## 5 Michael Rodgers Fantasy 3.7
## 6 Christina Jimenez Romance 5.2
## 7 Megan Sims Sci-Fi 8.2
## 8 Samantha Anderson Romance 8
## 9 Margaret Murphy Romance 6.9
## 10 Mandy Brooks Thriller 1.6
## # ℹ 29,990 more rows
movies_1980_2020_30k %>%
select(-'Release Date', -'Duration')
## # A tibble: 30,000 × 4
## Title Director Genre Rating
## <chr> <chr> <chr> <dbl>
## 1 Key entire popular. Anthony Becker Horror 6.8
## 2 Gun husband reveal. William Johnson Documentary 7.6
## 3 Crime cover. Amy Le Drama 5.5
## 4 Challenge. Andrea Martinez Romance 2
## 5 Close study. Michael Rodgers Fantasy 3.7
## 6 Must customer. Christina Jimenez Romance 5.2
## 7 Recent example benefit. Megan Sims Sci-Fi 8.2
## 8 By product. Samantha Anderson Romance 8
## 9 When drive not. Margaret Murphy Romance 6.9
## 10 Phone garden. Mandy Brooks Thriller 1.6
## # ℹ 29,990 more rows
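The selection styles above can be tried on a small example. This is a sketch under assumptions: `toy` is a hypothetical tibble that only mimics the column names of the movies data, and `where()` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data that mimics the column names of the movies file
toy <- tibble(
  Title          = c("A", "B"),
  Director       = c("X", "Y"),
  `Release Date` = as.Date(c("2001-01-01", "2002-02-02")),
  Duration       = c(100, 120),
  Rating         = c(7.5, 3.2)
)

# Keep only the numeric columns
numeric_cols <- toy %>% select(where(is.numeric))

# Rename while selecting; backticks are needed for names with spaces
renamed <- toy %>% select(Title, released = `Release Date`)

# Drop several columns at once
dropped <- toy %>% select(-c(`Release Date`, Duration))

names(numeric_cols)
names(renamed)
names(dropped)
```

Backticks are the usual way to refer to a column such as `Release Date`; quoted strings, as used earlier in this document, are also accepted by select().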
## Mutate
The mutate() function adds new columns to a data frame or modifies existing ones. You can perform arithmetic, apply functions to existing columns, and create new variables derived from them.
movies_1980_2020_30k %>%
select(Director, Genre, most_popular = Rating, everything()) %>%
mutate(is_collab = grepl('Never', Title) & grepl('Fantasy', Genre)) %>%
select(Genre, Director, is_collab, everything())
## # A tibble: 30,000 × 7
## Genre Director is_collab most_popular Title `Release Date` Duration
## <chr> <chr> <lgl> <dbl> <chr> <date> <dbl>
## 1 Horror Anthony Bec… FALSE 6.8 Key … 1981-05-12 102
## 2 Documentary William Joh… FALSE 7.6 Gun … 2016-06-13 92
## 3 Drama Amy Le FALSE 5.5 Crim… 1988-03-22 144
## 4 Romance Andrea Mart… FALSE 2 Chal… 2013-04-01 161
## 5 Fantasy Michael Rod… FALSE 3.7 Clos… 2012-10-18 177
## 6 Romance Christina J… FALSE 5.2 Must… 1997-12-22 144
## 7 Sci-Fi Megan Sims FALSE 8.2 Rece… 1981-05-10 138
## 8 Romance Samantha An… FALSE 8 By p… 2015-07-18 109
## 9 Romance Margaret Mu… FALSE 6.9 When… 1992-10-02 173
## 10 Thriller Mandy Brooks FALSE 1.6 Phon… 2012-07-16 154
## # ℹ 29,990 more rows
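In the rows shown above, is_collab is FALSE everywhere. The condition grepl('Never', Title) & grepl('Fantasy', Genre) is TRUE only when the title contains "Never" and the genre contains "Fantasy", and none of the first ten rows meets both requirements. A row such as "Never effort chair." (Fantasy), which appears in the filter() output later in this document, does satisfy it. Despite its name, the column does not detect collaborations: the Director column holds a single name per film, so this data set has no collaborations of the kind seen among singers who record duets.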
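As a small illustration of both uses of mutate() described in this section, the sketch below derives one column by arithmetic and one logical flag with grepl(). The tibble `toy` and the column names `duration_hours` and `never_fantasy` are hypothetical and chosen only for this example.

```r
library(dplyr)

# Hypothetical toy data, not taken from the movies file
toy <- tibble(
  Title    = c("Never again", "Long road", "Never home"),
  Genre    = c("Fantasy", "Fantasy", "Drama"),
  Duration = c(90, 180, 120)
)

with_flags <- toy %>%
  mutate(
    # Arithmetic on an existing column: minutes to hours
    duration_hours = Duration / 60,
    # TRUE only when both patterns match in the same row
    never_fantasy = grepl("Never", Title) & grepl("Fantasy", Genre)
  )

with_flags
```

Only the first row is flagged, because it is the only one whose title contains "Never" and whose genre is Fantasy.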
## Filter
The filter() function keeps the rows of a data frame that satisfy the given conditions. Conditions are built from comparisons and logical operators. Conditions separated by commas are combined with AND, while | expresses OR.
movies_1980_2020_30k %>%
select(Title, Director, Genre, `Release Date`, Rating) %>%
filter(Rating >= 8, Director == 'Julie Ryan' | Director == 'James Smith')
## # A tibble: 5 × 5
## Title Director Genre `Release Date` Rating
## <chr> <chr> <chr> <date> <dbl>
## 1 Try into himself. James Smith Action 1999-06-01 9.4
## 2 Nor catch. Julie Ryan Adventure 2008-03-20 10
## 3 Eye large. James Smith Horror 1983-06-10 8.9
## 4 Never effort chair. James Smith Fantasy 2004-08-22 9.9
## 5 Particular doctor term. James Smith Action 1983-02-25 9.7
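When the same column is compared against several values, %in% is a common alternative to chaining == with |. The following sketch uses a hypothetical tibble `toy` with invented ratings and checks that both spellings select the same rows.

```r
library(dplyr)

# Hypothetical toy data, not taken from the movies file
toy <- tibble(
  Title    = c("A", "B", "C", "D"),
  Director = c("James Smith", "Julie Ryan", "James Smith", "Amy Le"),
  Rating   = c(9.4, 7.0, 8.9, 9.0)
)

# OR written out explicitly
with_or <- toy %>%
  filter(Rating >= 8, Director == "Julie Ryan" | Director == "James Smith")

# The same condition with %in%
with_in <- toy %>%
  filter(Rating >= 8, Director %in% c("Julie Ryan", "James Smith"))

with_in$Title
```

The %in% form stays readable as the list of directors grows.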
## Distinct
The distinct() function returns the unique rows of a data frame, or the unique combinations of values in a chosen set of columns. Use it to remove duplicates based on one or more columns.
movies_1980_2020_30k %>%
select(Title:Director, most_popular='Rating') %>%
filter(Director == 'James Smith')
## # A tibble: 13 × 3
## Title Director most_popular
## <chr> <chr> <dbl>
## 1 Green rule black. James Smith 1.6
## 2 Close ten instead. James Smith 5.1
## 3 Try into himself. James Smith 9.4
## 4 So official. James Smith 4.8
## 5 Work consider reality. James Smith 1.1
## 6 Eye large. James Smith 8.9
## 7 Never effort chair. James Smith 9.9
## 8 Painting and different. James Smith 4.6
## 9 Laugh serious. James Smith 6.1
## 10 Blue product material use. James Smith 6.8
## 11 Particular doctor term. James Smith 9.7
## 12 Oil wide. James Smith 6.2
## 13 Gun who middle. James Smith 2.2
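# Store the result under a name that does not reuse the function name distinct;
# print distinct_titles to inspect it
distinct_titles <- movies_1980_2020_30k %>%
select(Title:Director, most_popular='Rating') %>%
filter(Director == 'James Smith') %>%
distinct(Title)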
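Because the James Smith titles happen to be unique already, the effect of distinct() is easier to see on data that contains repeats. The sketch below uses a hypothetical tibble `toy` with a duplicated Director and Genre pair, and also shows the .keep_all argument, which keeps the remaining columns of the first row found for each distinct value.

```r
library(dplyr)

# Hypothetical toy data with a repeated Director/Genre combination
toy <- tibble(
  Director = c("X", "X", "Y"),
  Genre    = c("Drama", "Drama", "Horror"),
  Rating   = c(5, 6, 7)
)

# Unique combinations of the two columns: the repeated pair collapses to one row
pairs <- toy %>% distinct(Director, Genre)

# One row per director, keeping all columns of the first occurrence
first_per_director <- toy %>% distinct(Director, .keep_all = TRUE)

nrow(pairs)
first_per_director
```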
## Group_by & Summarise
The group_by() function groups a data frame by one or more columns. It creates one group for each unique combination of values in the specified columns, and summary functions such as summarise() are then applied to each group.
The summarise() function computes summaries or aggregations within each group created by group_by(), using functions such as mean(), sum(), min(), and max(). It returns one row per group.
movies_1980_2020_30k %>%
select(Title:Director, most_popular='Rating') %>%
filter(Director == 'James Smith') %>%
group_by(Title) %>%
summarise(total_most_popular = mean(most_popular))
## # A tibble: 13 × 2
## Title total_most_popular
## <chr> <dbl>
## 1 Blue product material use. 6.8
## 2 Close ten instead. 5.1
## 3 Eye large. 8.9
## 4 Green rule black. 1.6
## 5 Gun who middle. 2.2
## 6 Laugh serious. 6.1
## 7 Never effort chair. 9.9
## 8 Oil wide. 6.2
## 9 Painting and different. 4.6
## 10 Particular doctor term. 9.7
## 11 So official. 4.8
## 12 Try into himself. 9.4
## 13 Work consider reality. 1.1
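In the example above every title forms its own group, so each mean is taken over a single rating. Grouping by a column with repeated values shows the aggregation more clearly. This sketch assumes dplyr 1.0.0 or later for the .groups argument; the tibble `toy` and the output column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from the movies file
toy <- tibble(
  Title  = c("A", "B", "C"),
  Genre  = c("Drama", "Horror", "Drama"),
  Rating = c(6, 4, 8)
)

by_genre <- toy %>%
  group_by(Genre) %>%
  summarise(
    mean_rating = mean(Rating),  # average rating within each genre
    n_movies    = n(),           # number of rows within each genre
    .groups     = "drop"         # return an ungrouped tibble
  )

by_genre
```

The result has one row per genre: the two Drama ratings are averaged together, while Horror has a single film.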
## Arrange
The arrange() function from the dplyr package sorts the rows of a data frame by one or more columns. Sorting is ascending by default; wrap a column in desc() to sort it in descending order. Later columns break ties in the earlier ones.
movies_1980_2020_30k %>%
select(Title, Director, Rating) %>%
filter(Director == 'James Smith') %>%
group_by(Title) %>%
summarise(Total_Rating = max(Rating)) %>%
arrange(desc(Total_Rating), Title) %>%
head(10)
## # A tibble: 10 × 2
## Title Total_Rating
## <chr> <dbl>
## 1 Never effort chair. 9.9
## 2 Particular doctor term. 9.7
## 3 Try into himself. 9.4
## 4 Eye large. 8.9
## 5 Blue product material use. 6.8
## 6 Oil wide. 6.2
## 7 Laugh serious. 6.1
## 8 Close ten instead. 5.1
## 9 So official. 4.8
## 10 Painting and different. 4.6
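The role of the second sort key is easiest to see when the first key has ties. A minimal sketch with a hypothetical tibble `toy`, in which two titles share the top rating:

```r
library(dplyr)

# Hypothetical toy data with a tie in Rating
toy <- tibble(
  Title  = c("B", "A", "C"),
  Rating = c(9, 9, 5)
)

# Highest rating first; titles with equal ratings are ordered alphabetically
sorted <- toy %>% arrange(desc(Rating), Title)

sorted$Title
```

Without Title as a second key, the tied rows would keep their original relative order, with "B" before "A".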
## Count
The count() function counts the number of observations in each group. count(Director) is shorthand for group_by(Director) followed by summarise(n = n()), so no explicit group_by() is needed; the counts are returned in a column named n.
movies_1980_2020_30k %>%
select(Title:Director, most_popular='Rating') %>%
count(Director) %>%
arrange(desc(n))
## # A tibble: 25,844 × 2
## Director n
## <chr> <int>
## 1 James Smith 13
## 2 Christopher Smith 12
## 3 Lisa Smith 11
## 4 David Smith 10
## 5 Jennifer Smith 10
## 6 John Smith 10
## 7 Michael Miller 10
## 8 Michael Smith 10
## 9 David Davis 9
## 10 Jessica Johnson 9
## # ℹ 25,834 more rows
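The relationship between count() and the group_by() plus summarise() pattern can be checked directly. The sketch below uses a hypothetical tibble `toy` and the sort argument of count(), which orders the result by descending n and so replaces the separate arrange(desc(n)) step.

```r
library(dplyr)

# Hypothetical toy data, not taken from the movies file
toy <- tibble(
  Director = c("X", "Y", "X", "Z", "X", "Y")
)

# count() with sort = TRUE orders the groups from most to least frequent
via_count <- toy %>% count(Director, sort = TRUE)

# The longer equivalent spelled out step by step
via_group <- toy %>%
  group_by(Director) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

via_count
```

Both versions produce the same directors and counts; count() is simply the more compact form.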