TidyVerse Project

The tidyverse is widely used in data science because it helps data scientists manipulate, tidy, transform, and visualize data. It contains a collection of packages that help beginners and experts alike manage data. In this project I will demonstrate one of those packages: dplyr.

dplyr

dplyr is used in data science to help the user clean and manipulate data in whatever manner they see fit.

Calling the Library and Reading the Data

Here we load the tidyverse, which contains dplyr along with other packages such as tibble and readr. I also import a data set from Kaggle that contains information about Disney movies.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Disney_data <- read.csv("https://raw.githubusercontent.com/AldataSci/TidyVerse/main/disney_movies.csv",sep=",",header=TRUE)

Looking at the data

Let’s take a look at the data and see what we can learn from it. Each row gives a movie’s title, release date, genre, MPAA rating, and earnings.

head(Disney_data)
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6    20,000 Leagues Under the Sea   1954-12-23 Adventure            
##   total_gross inflation_adjusted_gross
## 1   184925485               5228953251
## 2    84300000               2188229052
## 3    83320000               2187090808
## 4    65000000               1078510579
## 5    85000000                920608730
## 6    28200000                528279994

Cleaning The Data

Not every column is needed, so we can remove some of them from our data if we wish.

## You can use the select function to select columns based on their names 
head(select(Disney_data,movie_title,release_date,genre,mpaa_rating,inflation_adjusted_gross))
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6    20,000 Leagues Under the Sea   1954-12-23 Adventure            
##   inflation_adjusted_gross
## 1               5228953251
## 2               2188229052
## 3               2187090808
## 4               1078510579
## 5                920608730
## 6                528279994
## Alternatively, you can use the pipe operator (%>%) with dplyr to get the same result.
head(Disney_data %>%
  select(movie_title,release_date,genre,mpaa_rating,inflation_adjusted_gross))
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6    20,000 Leagues Under the Sea   1954-12-23 Adventure            
##   inflation_adjusted_gross
## 1               5228953251
## 2               2188229052
## 3               2187090808
## 4               1078510579
## 5                920608730
## 6                528279994
## A simpler way to get the columns you want, without writing out every column name, is to put a minus sign in front of each column you don't want.
head(Disney_data %>%
  select(-total_gross))
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6    20,000 Leagues Under the Sea   1954-12-23 Adventure            
##   inflation_adjusted_gross
## 1               5228953251
## 2               2188229052
## 3               2187090808
## 4               1078510579
## 5                920608730
## 6                528279994
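
Beyond listing names or using a minus sign, select also accepts helper functions. Here is a minimal sketch, assuming a small data frame called sample_movies (a hypothetical stand-in that I copied from the first three rows of the head output above) and dplyr 1.0.0 or later for where():

```r
library(dplyr)

# Hypothetical stand-in for Disney_data: first three rows of the head() output
sample_movies <- data.frame(
  movie_title = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia"),
  release_date = c("1937-12-21", "1940-02-09", "1940-11-13"),
  genre = c("Musical", "Adventure", "Musical"),
  mpaa_rating = c("G", "G", "G"),
  total_gross = c(184925485, 84300000, 83320000),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808),
  stringsAsFactors = FALSE
)

# Keep every column whose name ends in "gross"
gross_cols <- sample_movies %>% select(ends_with("gross"))

# Keep only the character columns
text_cols <- sample_movies %>% select(where(is.character))

# Drop several columns at once by wrapping them in -c()
trimmed <- sample_movies %>% select(-c(total_gross, mpaa_rating))

names(gross_cols)
names(text_cols)
names(trimmed)
```

These helpers are handy when a data set has many columns that share a naming pattern or a type.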

Filtering the Data

We can also filter our data to make it easier to read and understand. In our case, many values in the mpaa_rating column are empty, which clutters the data and makes it harder to interpret.

## The filter function keeps only the rows that match a condition. Here I keep the observations whose rating is neither empty nor "Not Rated".
head(filter(Disney_data,mpaa_rating !="" & mpaa_rating !="Not Rated"))
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6              Lady and the Tramp   1955-06-22     Drama           G
##   total_gross inflation_adjusted_gross
## 1   184925485               5228953251
## 2    84300000               2188229052
## 3    83320000               2187090808
## 4    65000000               1078510579
## 5    85000000                920608730
## 6    93600000               1236035515
## We can also use the pipe operator to apply the same filter
head(Disney_data %>%
  filter(mpaa_rating !="" & mpaa_rating != "Not Rated"))
##                       movie_title release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6              Lady and the Tramp   1955-06-22     Drama           G
##   total_gross inflation_adjusted_gross
## 1   184925485               5228953251
## 2    84300000               2188229052
## 3    83320000               2187090808
## 4    65000000               1078510579
## 5    85000000                920608730
## 6    93600000               1236035515
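
When several values need to be excluded, the condition can also be written with %in%. This is only a sketch on a three-row sample taken from the output above (sample_movies, rated, and dropped are names I made up for illustration), and it also counts how many rows the filter removes:

```r
library(dplyr)

# Hypothetical stand-in for Disney_data: three rows from the output above
sample_movies <- data.frame(
  movie_title = c("Cinderella", "20,000 Leagues Under the Sea", "Lady and the Tramp"),
  mpaa_rating = c("G", "", "G"),
  stringsAsFactors = FALSE
)

# Keep rows whose rating is NOT one of the unwanted values
rated <- sample_movies %>%
  filter(!mpaa_rating %in% c("", "Not Rated"))

# Number of rows removed by the filter
dropped <- nrow(sample_movies) - nrow(rated)

rated
dropped
```

One difference to keep in mind: if a rating were NA, the != version would drop that row, while the %in% version would keep it, because NA is not in the set of unwanted values.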

Renaming Columns

We can also rename columns in our dataset if a name doesn’t make sense or is simply unappealing.

## Base R has no rename function, but we can assign to names() to rename a column. In this case we rename the movie_title column to Movies. Note that this changes Disney_data itself.
head(names(Disney_data)[names(Disney_data) == "movie_title"] <- "Movies")
## [1] "Movies"
### Alternatively, a simpler way is dplyr's rename function: we pipe Disney_data into rename and get back a data frame with the changed column names. The new name goes to the left of = (here Movie_Titles) and the old name goes to the right (here Movies). The result is only printed here; assign it to a variable if you want to keep it.
head(Disney_data %>%
  rename(Movie_Titles=Movies))
##                      Movie_Titles release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6    20,000 Leagues Under the Sea   1954-12-23 Adventure            
##   total_gross inflation_adjusted_gross
## 1   184925485               5228953251
## 2    84300000               2188229052
## 3    83320000               2187090808
## 4    65000000               1078510579
## 5    85000000                920608730
## 6    28200000                528279994
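
To make the point about saving the result concrete, here is a small sketch on a made-up two-row data frame (sample_movies, renamed, and upper are hypothetical names). It also shows rename_with, which applies a function to every column name and is available in dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical two-row stand-in for Disney_data after the base R rename
sample_movies <- data.frame(
  Movies = c("Pinocchio", "Fantasia"),
  genre = c("Adventure", "Musical"),
  stringsAsFactors = FALSE
)

# rename() returns a new data frame; the original keeps its names
renamed <- sample_movies %>% rename(Movie_Titles = Movies)

# rename_with() applies a function (here toupper) to all column names
upper <- sample_movies %>% rename_with(toupper)

names(sample_movies)
names(renamed)
names(upper)
```

This is why the combined pipeline later can still refer to the Movies column: the rename to Movie_Titles above was never assigned back to Disney_data.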

Combining Everything with the Pipe Operator

## Here we combine multiple steps into a single pipeline: all the changes we made to the dataset go into one chain of pipe commands, which is the real power of the pipe operator. We save the result in a data frame called Clean_Disney.
head(Clean_Disney <- Disney_data %>%
  rename(movies=Movies) %>%
  select(-total_gross) %>%
  filter(mpaa_rating !="" & mpaa_rating !="Not Rated"))
##                            movies release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4               Song of the South   1946-11-12 Adventure           G
## 5                      Cinderella   1950-02-15     Drama           G
## 6              Lady and the Tramp   1955-06-22     Drama           G
##   inflation_adjusted_gross
## 1               5228953251
## 2               2188229052
## 3               2187090808
## 4               1078510579
## 5                920608730
## 6               1236035515

Advanced Usages of the Pipes

We can also use dplyr to reorganize our cleaned data and perform analysis on it.

## The group_by function groups your data by the values of a column so that later steps, such as summarise, work per group. On its own it does not aggregate anything; it only marks the groups.
head(Clean_Disney %>%
  group_by(release_date))
## # A tibble: 6 × 5
## # Groups:   release_date [6]
##   movies                  release_date genre   mpaa_rating inflation_adjusted_g…
##   <chr>                   <chr>        <chr>   <chr>                       <dbl>
## 1 Snow White and the Sev… 1937-12-21   Musical G                      5228953251
## 2 Pinocchio               1940-02-09   Advent… G                      2188229052
## 3 Fantasia                1940-11-13   Musical G                      2187090808
## 4 Song of the South       1946-11-12   Advent… G                      1078510579
## 5 Cinderella              1950-02-15   Drama   G                       920608730
## 6 Lady and the Tramp      1955-06-22   Drama   G                      1236035515
## The summarise function computes summary statistics for each group. Using the max function, we get the largest inflation-adjusted gross for each release date.
head(Clean_Disney %>%
  group_by(release_date) %>%
  summarise(max(inflation_adjusted_gross)))
## # A tibble: 6 × 2
##   release_date `max(inflation_adjusted_gross)`
##   <chr>                                  <dbl>
## 1 1937-12-21                        5228953251
## 2 1940-02-09                        2188229052
## 3 1940-11-13                        2187090808
## 4 1946-11-12                        1078510579
## 5 1950-02-15                         920608730
## 6 1955-06-22                        1236035515
### We can also use the arrange function to sort our data. By default it sorts from low to high; wrapping a column in desc sorts from high to low. Sorting can surface additional insights, such as which movies earned the most.
head(Clean_Disney %>%
  arrange(desc(inflation_adjusted_gross)))
##                            movies release_date     genre mpaa_rating
## 1 Snow White and the Seven Dwarfs   1937-12-21   Musical           G
## 2                       Pinocchio   1940-02-09 Adventure           G
## 3                        Fantasia   1940-11-13   Musical           G
## 4                  101 Dalmatians   1961-01-25    Comedy           G
## 5              Lady and the Tramp   1955-06-22     Drama           G
## 6               Song of the South   1946-11-12 Adventure           G
##   inflation_adjusted_gross
## 1               5228953251
## 2               2188229052
## 3               2187090808
## 4               1362870985
## 5               1236035515
## 6               1078510579
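
Grouping by release_date gives almost one group per movie, so the summary looks much like the original data. As a sketch of a more informative grouping, the example below groups by genre instead, names the summary columns, and sorts the result. It assumes a six-row sample copied from the Clean_Disney output above; sample_clean and genre_summary are hypothetical names.

```r
library(dplyr)

# Hypothetical stand-in for Clean_Disney: six rows from the output above
sample_clean <- data.frame(
  movies = c("Snow White and the Seven Dwarfs", "Pinocchio", "Fantasia",
             "Song of the South", "Cinderella", "Lady and the Tramp"),
  genre = c("Musical", "Adventure", "Musical", "Adventure", "Drama", "Drama"),
  inflation_adjusted_gross = c(5228953251, 2188229052, 2187090808,
                               1078510579, 920608730, 1236035515),
  stringsAsFactors = FALSE
)

genre_summary <- sample_clean %>%
  group_by(genre) %>%
  summarise(
    max_gross = max(inflation_adjusted_gross),  # largest gross in each genre
    n_movies = n()                              # number of movies in each genre
  ) %>%
  arrange(desc(max_gross))                      # highest max_gross first

genre_summary
```

Naming the columns inside summarise avoids the backticked `max(inflation_adjusted_gross)` header seen in the earlier output.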

The part Jay added

Create a summary to show the result of the data cleaning

We can create summaries to show how many observations and variables have been eliminated. As we can see, 59 observations (579 - 520) were removed because their ratings were empty or "Not Rated". The total_gross column was removed because inflation_adjusted_gross is the better choice for comparing earnings across years.

summary(Disney_data)
##     Movies          release_date          genre           mpaa_rating       
##  Length:579         Length:579         Length:579         Length:579        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   total_gross        inflation_adjusted_gross
##  Min.   :        0   Min.   :0.000e+00       
##  1st Qu.: 12788864   1st Qu.:2.274e+07       
##  Median : 30702446   Median :5.516e+07       
##  Mean   : 64701789   Mean   :1.188e+08       
##  3rd Qu.: 75709033   3rd Qu.:1.192e+08       
##  Max.   :936662225   Max.   :5.229e+09
summary(Clean_Disney)
##     movies          release_date          genre           mpaa_rating       
##  Length:520         Length:520         Length:520         Length:520        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  inflation_adjusted_gross
##  Min.   :2.984e+03       
##  1st Qu.:2.495e+07       
##  Median :5.814e+07       
##  Mean   :1.242e+08       
##  3rd Qu.:1.258e+08       
##  Max.   :5.229e+09

ggplot to show the comparison of data

From plot 1 we can see that the majority of the earnings come from G-rated movies. However, does that mean G-rated movies have the highest average earnings per movie? The answer is yes. We calculate the sum, count, and average first, then make plot 2. G-rated movies still hold the highest average earnings, while PG and PG-13 are now nearly tied. Shareholders might like to see Disney produce more G-rated movies, because the ROI on G-rated movies may be the best, although we have not looked at costs here.

ggplot(Clean_Disney, aes(x=mpaa_rating, y=inflation_adjusted_gross)) +
geom_bar(stat="identity", position=position_dodge())

sum <- aggregate(x= Clean_Disney$inflation_adjusted_gross,
          by= list(Clean_Disney$mpaa_rating),
          FUN=sum)

count <- aggregate(Clean_Disney$inflation_adjusted_gross, by=list(Clean_Disney$mpaa_rating), FUN=length)

sum$count <- count$x

sum$avg <- sum$x/count$x

colnames(sum) <- c('mpaa_rating','inflation_adjusted_gross','count','avg_gross')

sum
##   mpaa_rating inflation_adjusted_gross count avg_gross
## 1           G              25048445571    86 291260995
## 2          PG              18988248082   187 101541434
## 3       PG-13              14927544680   145 102948584
## 4           R               5641192166   102  55305806
ggplot(sum, aes(x=mpaa_rating, y=avg_gross)) +
geom_bar(stat="identity", position=position_dodge())
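
The same sum, count, and average table can also be built in one dplyr pipeline instead of several aggregate calls. This is only a sketch: toy_movies holds made-up numbers (not values from the Disney data), and rating_summary and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with made-up numbers, NOT from the Disney data set
toy_movies <- data.frame(
  mpaa_rating = c("G", "G", "PG"),
  inflation_adjusted_gross = c(100, 300, 50),
  stringsAsFactors = FALSE
)

rating_summary <- toy_movies %>%
  group_by(mpaa_rating) %>%
  summarise(
    total_gross = sum(inflation_adjusted_gross),  # sum per rating
    count = n(),                                  # movies per rating
    avg_gross = mean(inflation_adjusted_gross)    # average per rating
  )

rating_summary
```

A summary built this way avoids naming variables sum and count, which shadow the base R functions of the same name. It can be plotted the same way as before; geom_col() is ggplot2's shorthand for geom_bar(stat="identity").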