The tidyverse is widely used in data science to manipulate, tidy, transform, and visualize data. It is a collection of packages that help both beginners and experts manage data. In this project I will demonstrate one of those packages: dplyr.
dplyr helps the user clean and manipulate data in whatever way they see fit.
## Calling the Library and Reading the Data
Here we load the tidyverse, which includes dplyr along with other packages such as tibble and readr, as the startup message shows. I also import a data set from Kaggle that contains information about Disney movies.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Disney_data <- read.csv("https://raw.githubusercontent.com/AldataSci/TidyVerse/main/disney_movies.csv",sep=",",header=TRUE)
Let’s take a look at the data and see what we can learn from it. Each row shows a movie's title, release date, genre, MPAA rating, and earnings.
head(Disney_data)
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 20,000 Leagues Under the Sea 1954-12-23 Adventure
## total_gross inflation_adjusted_gross
## 1 184925485 5228953251
## 2 84300000 2188229052
## 3 83320000 2187090808
## 4 65000000 1078510579
## 5 85000000 920608730
## 6 28200000 528279994
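For a more compact view of the column types than head() gives, dplyr also provides glimpse(). The sketch below is only an illustration: it rebuilds the six rows printed above as a small data frame so it runs without downloading the file, and the names disney_sample and col_types are hypothetical, not part of the original script.

```r
library(dplyr)

# Hypothetical six-row sample, copied from the head() output above
disney_sample <- data.frame(
  movie_title = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
                  "Song of the South", "Cinderella",
                  "20,000 Leagues Under the Sea"),
  release_date = c("1937-12-21", "1940-02-09", "1940-11-13",
                   "1946-11-12", "1950-02-15", "1954-12-23"),
  genre = c("Musical", "Adventure", "Musical", "Adventure", "Drama", "Adventure"),
  mpaa_rating = c("G", "G", "G", "G", "G", ""),
  total_gross = c(184925485, 84300000, 83320000, 65000000, 85000000, 28200000),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808,
                               1078510579, 920608730, 528279994),
  stringsAsFactors = FALSE
)

# One line per column: name, type, and the first few values
glimpse(disney_sample)

# Column classes as a named character vector
col_types <- sapply(disney_sample, class)
col_types
```

In this sample release_date is stored as text, so any date arithmetic would need a conversion such as as.Date() first.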
The data has only a handful of columns, and we can drop any that we don't need.
## You can use the select function to select columns based on their names
head(select(Disney_data,movie_title,release_date,genre,mpaa_rating,inflation_adjusted_gross))
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 20,000 Leagues Under the Sea 1954-12-23 Adventure
## inflation_adjusted_gross
## 1 5228953251
## 2 2188229052
## 3 2187090808
## 4 1078510579
## 5 920608730
## 6 528279994
## Alternatively, you can use the pipe operator %>% (which dplyr re-exports from magrittr) to get the same result
head(Disney_data %>%
select(movie_title,release_date,genre,mpaa_rating,inflation_adjusted_gross))
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 20,000 Leagues Under the Sea 1954-12-23 Adventure
## inflation_adjusted_gross
## 1 5228953251
## 2 2188229052
## 3 2187090808
## 4 1078510579
## 5 920608730
## 6 528279994
## A simpler way to get the columns you want without typing every column name is to put a minus sign in front of the column you don't want.
head(Disney_data %>%
select(-total_gross))
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 20,000 Leagues Under the Sea 1954-12-23 Adventure
## inflation_adjusted_gross
## 1 5228953251
## 2 2188229052
## 3 2187090808
## 4 1078510579
## 5 920608730
## 6 528279994
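Beyond naming columns one at a time, select() also accepts selection helpers. The following is a minimal sketch on a hypothetical four-column sample built from the rows printed above (disney_sample and the result names are illustrative only); it shows dropping several columns at once, matching on a name suffix, and keeping columns by type.

```r
library(dplyr)

# Hypothetical sample built from the first rows shown above
disney_sample <- data.frame(
  movie_title = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia"),
  mpaa_rating = c("G", "G", "G"),
  total_gross = c(184925485, 84300000, 83320000),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808),
  stringsAsFactors = FALSE
)

# Drop several columns in one call
no_gross <- disney_sample %>% select(-c(total_gross, inflation_adjusted_gross))

# Keep every column whose name ends in "gross"
gross_only <- disney_sample %>% select(ends_with("gross"))

# Keep only the numeric columns
numeric_only <- disney_sample %>% select(where(is.numeric))

names(no_gross)
names(gross_only)
names(numeric_only)
```

In this sample the last two calls pick the same columns, because the only numeric columns are the two gross columns.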
We can also filter rows to make the data easier to read. In our case, many values in the mpaa_rating column are empty, which clutters the data and makes it harder to interpret.
## The filter() function keeps only the rows that match a condition. Here I drop the observations whose rating is empty or says "Not Rated".
head(filter(Disney_data,mpaa_rating !="" & mpaa_rating !="Not Rated"))
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 Lady and the Tramp 1955-06-22 Drama G
## total_gross inflation_adjusted_gross
## 1 184925485 5228953251
## 2 84300000 2188229052
## 3 83320000 2187090808
## 4 65000000 1078510579
## 5 85000000 920608730
## 6 93600000 1236035515
## We can also use dplyr pipes to filter it out
head(Disney_data %>%
filter(mpaa_rating !="" & mpaa_rating != "Not Rated"))
## movie_title release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 Lady and the Tramp 1955-06-22 Drama G
## total_gross inflation_adjusted_gross
## 1 184925485 5228953251
## 2 84300000 2188229052
## 3 83320000 2187090808
## 4 65000000 1078510579
## 5 85000000 920608730
## 6 93600000 1236035515
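The same condition can be written in a few equivalent ways. The sketch below uses a hypothetical two-column sample taken from the rows printed earlier (disney_sample and the result names are illustrative) and compares the & form with comma-separated conditions and with %in%.

```r
library(dplyr)

# Hypothetical sample; the last movie has an empty rating, as in the raw data
disney_sample <- data.frame(
  movie_title = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
                  "Song of the South", "Cinderella",
                  "20,000 Leagues Under the Sea"),
  mpaa_rating = c("G", "G", "G", "G", "G", ""),
  stringsAsFactors = FALSE
)

# Conditions joined with &
with_and <- disney_sample %>%
  filter(mpaa_rating != "" & mpaa_rating != "Not Rated")

# Comma-separated conditions are combined with a logical AND
with_comma <- disney_sample %>%
  filter(mpaa_rating != "", mpaa_rating != "Not Rated")

# Negated %in% against a vector of unwanted values
with_in <- disney_sample %>%
  filter(!mpaa_rating %in% c("", "Not Rated"))

nrow(with_and)
```

One difference to keep in mind: if the column contained NA values, the != forms would drop those rows, while the %in% form would keep them.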
We can also rename columns in the dataset if a name is unclear or you simply don't like it.
## Base R has no dedicated rename function, but assigning to names() renames a column in place. Here we rename the movie_title column to Movies.
head(names(Disney_data)[names(Disney_data) == "movie_title"] <- "Movies")
## [1] "Movies"
### Alternatively, a simpler way is dplyr's rename() function. We pipe Disney_data into rename() and get back a data frame with the changed column name; Disney_data itself is not modified unless the result is assigned. The new name goes to the left of = (here Movie_Titles) and the old name goes to the right (here Movies).
head(Disney_data %>%
rename(Movie_Titles=Movies))
## Movie_Titles release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 20,000 Leagues Under the Sea 1954-12-23 Adventure
## total_gross inflation_adjusted_gross
## 1 184925485 5228953251
## 2 84300000 2188229052
## 3 83320000 2187090808
## 4 65000000 1078510579
## 5 85000000 920608730
## 6 28200000 528279994
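To make the new_name = old_name order concrete, and to show that the input is left untouched unless the result is assigned, here is a small sketch. The data frame disney_sample and the names renamed and renamed_two are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical sample with the column already named Movies
disney_sample <- data.frame(
  Movies = c("Snow White and the Seven Dwarfs", "Pinocchio"),
  genre = c("Musical", "Adventure"),
  stringsAsFactors = FALSE
)

# new name on the left of =, old name on the right
renamed <- disney_sample %>% rename(Movie_Titles = Movies)

# Several columns can be renamed in one call
renamed_two <- disney_sample %>% rename(Movie_Titles = Movies, Genre = genre)

names(disney_sample)  # the original data frame keeps its names
names(renamed)
names(renamed_two)
```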
## Here we combine all the changes made so far into a single pipeline, which is the real power of the pipe operator, and save the result in a data frame called Clean_Disney.
head(Clean_Disney <- Disney_data %>%
rename(movies=Movies) %>%
select(-total_gross) %>%
filter(mpaa_rating !="" & mpaa_rating !="Not Rated"))
## movies release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 Song of the South 1946-11-12 Adventure G
## 5 Cinderella 1950-02-15 Drama G
## 6 Lady and the Tramp 1955-06-22 Drama G
## inflation_adjusted_gross
## 1 5228953251
## 2 2188229052
## 3 2187090808
## 4 1078510579
## 5 920608730
## 6 1236035515
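A pipeline is just another way of writing nested function calls: x %>% f(y) is evaluated as f(x, y). As a rough illustration, the sketch below runs the same three steps both ways on a hypothetical sample (disney_sample, piped, and nested are illustrative names) and checks that the results match.

```r
library(dplyr)

# Hypothetical sample with the column already renamed to Movies
disney_sample <- data.frame(
  Movies = c("Cinderella", "20,000 Leagues Under the Sea", "Lady and the Tramp"),
  mpaa_rating = c("G", "", "G"),
  total_gross = c(85000000, 28200000, 93600000),
  inflation_adjusted_gross = c(920608730, 528279994, 1236035515),
  stringsAsFactors = FALSE
)

# Piped version: read top to bottom
piped <- disney_sample %>%
  rename(movies = Movies) %>%
  select(-total_gross) %>%
  filter(mpaa_rating != "" & mpaa_rating != "Not Rated")

# Nested version: the same steps, read inside out
nested <- filter(
  select(rename(disney_sample, movies = Movies), -total_gross),
  mpaa_rating != "" & mpaa_rating != "Not Rated"
)

identical(piped, nested)
```

The piped form lists the steps in the order they happen, which is usually easier to follow once there are more than two of them.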
We can also use dplyr to reorganize our data and analyze the cleaned data.
## group_by() splits the data into groups by the values of a column. On its own it does not change the rows; it marks the groups so that later steps such as summarise() work group by group.
head(Clean_Disney %>%
group_by(release_date))
## # A tibble: 6 × 5
## # Groups: release_date [6]
## movies release_date genre mpaa_rating inflation_adjusted_g…
## <chr> <chr> <chr> <chr> <dbl>
## 1 Snow White and the Sev… 1937-12-21 Musical G 5228953251
## 2 Pinocchio 1940-02-09 Advent… G 2188229052
## 3 Fantasia 1940-11-13 Musical G 2187090808
## 4 Song of the South 1946-11-12 Advent… G 1078510579
## 5 Cinderella 1950-02-15 Drama G 920608730
## 6 Lady and the Tramp 1955-06-22 Drama G 1236035515
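## summarise() collapses each group into a single row of summary statistics. Here max() returns the largest inflation_adjusted_gross in each group. The first six release dates are all distinct, so each group shown holds one movie and the maximum is simply that movie's gross.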
head(Clean_Disney %>%
group_by(release_date) %>%
summarise(max(inflation_adjusted_gross)))
## # A tibble: 6 × 2
## release_date `max(inflation_adjusted_gross)`
## <chr> <dbl>
## 1 1937-12-21 5228953251
## 2 1940-02-09 2188229052
## 3 1940-11-13 2187090808
## 4 1946-11-12 1078510579
## 5 1950-02-15 920608730
## 6 1955-06-22 1236035515
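Grouping pays off most when the grouping column has repeated values. As a sketch, the example below groups a hypothetical six-row sample (values copied from the raw head() output earlier; disney_sample and by_genre are illustrative names) by genre and gives each summary column an explicit name instead of the default `max(...)` label.

```r
library(dplyr)

# Hypothetical sample: genre and gross for the first six movies in the raw data
disney_sample <- data.frame(
  genre = c("Musical", "Adventure", "Musical", "Adventure", "Drama", "Adventure"),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808,
                               1078510579, 920608730, 528279994),
  stringsAsFactors = FALSE
)

by_genre <- disney_sample %>%
  group_by(genre) %>%
  summarise(
    movies = n(),                                # rows in each group
    max_gross = max(inflation_adjusted_gross),   # largest value per group
    mean_gross = mean(inflation_adjusted_gross)  # average per group
  )

by_genre
```

With three Adventure movies in the sample, max_gross for that genre now reflects a real comparison rather than a single row.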
### We can also use the arrange() function to sort our data. By default it sorts from low to high; wrapping a column in desc() sorts it from high to low. Sorted data can reveal additional insights.
head(Clean_Disney %>%
arrange(desc(inflation_adjusted_gross)))
## movies release_date genre mpaa_rating
## 1 Snow White and the Seven Dwarfs 1937-12-21 Musical G
## 2 Pinocchio 1940-02-09 Adventure G
## 3 Fantasia 1940-11-13 Musical G
## 4 101 Dalmatians 1961-01-25 Comedy G
## 5 Lady and the Tramp 1955-06-22 Drama G
## 6 Song of the South 1946-11-12 Adventure G
## inflation_adjusted_gross
## 1 5228953251
## 2 2188229052
## 3 2187090808
## 4 1362870985
## 5 1236035515
## 6 1078510579
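To compare the sort directions side by side, here is a minimal sketch on a hypothetical sample built from rows shown earlier (disney_sample and the result names are illustrative). It also shows sorting by two keys, where later keys break ties within earlier ones.

```r
library(dplyr)

# Hypothetical sample taken from the first six rows of the raw data
disney_sample <- data.frame(
  movies = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
             "Song of the South", "Cinderella", "20,000 Leagues Under the Sea"),
  genre = c("Musical", "Adventure", "Musical", "Adventure", "Drama", "Adventure"),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808,
                               1078510579, 920608730, 528279994),
  stringsAsFactors = FALSE
)

# Default: lowest gross first
ascending <- disney_sample %>% arrange(inflation_adjusted_gross)

# desc(): highest gross first
descending <- disney_sample %>% arrange(desc(inflation_adjusted_gross))

# Two keys: genre A to Z, then highest gross first within each genre
by_genre_then_gross <- disney_sample %>%
  arrange(genre, desc(inflation_adjusted_gross))

ascending$movies[1]
descending$movies[1]
by_genre_then_gross$movies
```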
We can use summary() to see how many observations and variables were removed. Comparing the two outputs, 59 observations (579 - 520) were dropped because their ratings were empty or "Not Rated". The total_gross column was dropped because inflation_adjusted_gross is the better basis for comparing earnings across decades.
summary(Disney_data)
## Movies release_date genre mpaa_rating
## Length:579 Length:579 Length:579 Length:579
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total_gross inflation_adjusted_gross
## Min. : 0 Min. :0.000e+00
## 1st Qu.: 12788864 1st Qu.:2.274e+07
## Median : 30702446 Median :5.516e+07
## Mean : 64701789 Mean :1.188e+08
## 3rd Qu.: 75709033 3rd Qu.:1.192e+08
## Max. :936662225 Max. :5.229e+09
summary(Clean_Disney)
## movies release_date genre mpaa_rating
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## inflation_adjusted_gross
## Min. :2.984e+03
## 1st Qu.:2.495e+07
## Median :5.814e+07
## Mean :1.242e+08
## 3rd Qu.:1.258e+08
## Max. :5.229e+09
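The number of dropped rows can also be computed directly instead of being read off the Length lines. The sketch below is an illustration on a hypothetical six-row sample (raw_sample, clean_sample, rows_removed, and rating_counts are made-up names); on the real data the same subtraction would use Disney_data and Clean_Disney.

```r
library(dplyr)

# Hypothetical raw sample; one movie has an empty rating
raw_sample <- data.frame(
  movies = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
             "Song of the South", "Cinderella", "20,000 Leagues Under the Sea"),
  mpaa_rating = c("G", "G", "G", "G", "G", ""),
  stringsAsFactors = FALSE
)

clean_sample <- raw_sample %>%
  filter(mpaa_rating != "" & mpaa_rating != "Not Rated")

# Rows removed by the filter
rows_removed <- nrow(raw_sample) - nrow(clean_sample)
rows_removed

# Rows per rating in the raw data, which shows where the removed rows came from
rating_counts <- raw_sample %>% count(mpaa_rating)
rating_counts
```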
From the first plot we can see that the majority of the total earnings come from G-rated movies. Does that mean G-rated movies also have the highest average earnings per movie? The answer is yes. We first calculate the sum, count, and average per rating, then draw a second plot. G-rated movies still hold the highest average earnings, while PG and PG-13 are now nearly tied. Shareholders might like to see Disney produce more G-rated movies, because the return on G-rated movies may be the best, although that cannot be confirmed without looking at production costs.
ggplot(Clean_Disney, aes(x=mpaa_rating, y=inflation_adjusted_gross)) +
geom_bar(stat="identity", position=position_dodge())
## Total inflation-adjusted gross per rating
rating_summary <- aggregate(x= Clean_Disney$inflation_adjusted_gross,
by= list(Clean_Disney$mpaa_rating),
FUN=sum)
## Number of movies per rating
rating_count <- aggregate(Clean_Disney$inflation_adjusted_gross, by=list(Clean_Disney$mpaa_rating), FUN=length)
rating_summary$count <- rating_count$x
rating_summary$avg <- rating_summary$x/rating_count$x
colnames(rating_summary) <- c('mpaa_rating','inflation_adjusted_gross','count','avg_gross')
rating_summary
## mpaa_rating inflation_adjusted_gross count avg_gross
## 1 G 25048445571 86 291260995
## 2 PG 18988248082 187 101541434
## 3 PG-13 14927544680 145 102948584
## 4 R 5641192166 102 55305806
ggplot(rating_summary, aes(x=mpaa_rating, y=avg_gross)) +
geom_bar(stat="identity", position=position_dodge())
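Since this walkthrough is about dplyr, the same per-rating table can also be built with group_by() and summarise() instead of aggregate(). The sketch below uses made-up toy values, not the Disney figures, and the names ratings_toy, by_rating, and avg_plot are hypothetical. geom_col() is used as the equivalent of geom_bar(stat="identity").

```r
library(dplyr)
library(ggplot2)

# Made-up toy values for illustration only (not taken from the Disney data)
ratings_toy <- data.frame(
  mpaa_rating = c("G", "G", "PG", "PG", "PG", "R"),
  inflation_adjusted_gross = c(300, 100, 50, 150, 100, 40),
  stringsAsFactors = FALSE
)

# Sum, count, and average per rating in one step
by_rating <- ratings_toy %>%
  group_by(mpaa_rating) %>%
  summarise(
    total_gross = sum(inflation_adjusted_gross),
    count = n(),
    avg_gross = mean(inflation_adjusted_gross)
  )

by_rating

# Bar heights come straight from avg_gross
avg_plot <- ggplot(by_rating, aes(x = mpaa_rating, y = avg_gross)) +
  geom_col()
```

Printing avg_plot draws the chart. Keeping the summary in a single pipeline avoids building the sum and the count separately and joining them by position.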