This is a continuation of the second phase of the project. We are loading data from the data set. The second phase can be found at the following link
http://rpubs.com/dhirajbasnet/820823
Since this is a different rmd file we will need to load the dataset again. Loading data from the github link.
df <- read.csv("https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/DisneyMoviesDataset.csv")
Cleaning our data.
The dataset consists of a lot of columns that have majority of values as empty. We will remove such columns.We removed 16 variables out of 32 keeping only the ones that are used. Also, the variables that were removed had large number of empty entries. We are also removing any row that has imdb value of “N/A” and Empty Values.
df <- df[c(1:14,22,23)]
df<-df[!(df$imdb=="N/A" | df$imdb ==""),]
df<-df[!(is.na(df$Box.office..float.) | is.na(df$Budget..float.)),]
We will need to create a factor so we will do the same thing that we did in phase II. We will create imdb as a factor. For movies that have a rating of 7.0 and higher, they will be considered Great Movies for our project. Else they will be called “Not So Great Movies”
library(dplyr)
df <- df %>% mutate(
imdb = factor(imdb >= 7.0 , levels = c(TRUE, FALSE),
labels = c('Great Movies', 'Not So Great Movies'))
)
Association
We will need some new libraries for association. Let’s import them first.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
After adding the librarires, we will need to convert the data for association analysis using the following code.
transactions(df)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16 not logical
## or factor. Applying default discretization (see '? discretizeDF').
## transactions in sparse format with
## 252 transactions (rows) and
## 1494 items (columns)
We can see that we have a warning here saying that the above columns were not logical or factor. This includes all columns except imdb that converted into a factor. R uses default method to create transactions.
We will see if we can convert any other column into factors for analysis.
colnames(df)[c(1, 2, 3, 4, 5, 6,7, 8, 9, 10 , 11, 13, 14, 15)]
## [1] "X" "title"
## [3] "Production.company" "Release.date"
## [5] "Running.time" "Country"
## [7] "Language" "Running.time..int."
## [9] "Budget..float." "Box.office..float."
## [11] "Release.date..datetime." "metascore"
## [13] "rotten_tomatoes" "Budget"
We will convert the “running time” column into a factor. There are a lot of empty data on metascore and rotten tomatoes so we will not use that. We cannot do much with title, release dates, production company and budget.
library(dplyr)
df <- df %>% mutate(
Running.time = factor(Running.time..int. >= 90 , levels = c(TRUE, FALSE),
labels = c('Long Duration Movies', 'Short Duration Movies'))
)
We can also turn country into factor. We will do if the movie was shot in USA or Outside.
library(dplyr)
df <- df %>% mutate(
Country = factor(Country == "United States" , levels = c(TRUE, FALSE),
labels = c('In State', 'Out'))
)
We can also turn language into factor and convert it into whether it is “English” or “Other”
library(dplyr)
df <- df %>% mutate(
Language = factor(Language == "English" , levels = c(TRUE, FALSE),
labels = c('English', 'Others'))
)
Now lets convert the data again for analysis.
transactions(df)
## Warning: Column(s) 1, 2, 3, 4, 8, 9, 10, 11, 13, 14, 15, 16 not logical or
## factor. Applying default discretization (see '? discretizeDF').
## transactions in sparse format with
## 252 transactions (rows) and
## 1388 items (columns)
trans <- transactions(df)
## Warning: Column(s) 1, 2, 3, 4, 8, 9, 10, 11, 13, 14, 15, 16 not logical or
## factor. Applying default discretization (see '? discretizeDF').
R does default discretization for the rest of the columns.
summary(trans)
## transactions as itemMatrix in sparse format with
## 252 rows (elements/itemsets/transactions) and
## 1388 columns (items) and a density of 0.01152738
##
## most frequent items:
## Language=English Country=In State
## 236 225
## Running.time=Long Duration Movies imdb=Not So Great Movies
## 172 149
## imdb=Great Movies (Other)
## 103 3147
##
## element (itemset/transaction) length distribution:
## sizes
## 16
## 252
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16 16 16 16 16 16
##
## includes extended item information - examples:
## labels variables levels
## 1 X=[1,214) X [1,214)
## 2 X=[214,307) X [214,307)
## 3 X=[307,425] X [307,425]
##
## includes extended transaction information - examples:
## transactionID
## 1 2
## 2 3
## 3 4
This shows that the most frequent items in our dataset is Language = “English”. There is 236 occurence of items having that attribute. Also, items having Country = “In State” is also one of the most frequent which in our case means the movies were shot in USA. We can see in the summary that, out of the 252 observations, 236 observations had that language and 225 observations had that value for Country.
head(colnames(trans))
## [1] "X=[1,214)" "X=[214,307)"
## [3] "X=[307,425]" "title=101 Dalmatians"
## [5] "title=102 Dalmatians" "title=20,000 Leagues Under the Sea"
Here are the column names for the transactions.
inspect(trans[1:3])
## items transactionID
## [1] {X=[1,214),
## title=Snow White and the Seven Dwarfs,
## Production.company=Walt Disney Productions,
## Release.date=['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA )', 'February 4, 1938 (United States)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[1.96e+08,1.66e+09],
## Release.date..datetime.=12/21/1937,
## imdb=Great Movies,
## metascore=95,
## rotten_tomatoes=,
## Budget=$1.49 million,
## Box.office=$418 million} 2
## [2] {X=[1,214),
## title=Pinocchio,
## Production.company=Walt Disney Productions,
## Release.date=['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=2/7/1940,
## imdb=Great Movies,
## metascore=99,
## rotten_tomatoes=100%,
## Budget=$2.6 million,
## Box.office=$164 million} 3
## [3] {X=[1,214),
## title=Fantasia,
## Production.company=Walt Disney Productions,
## Release.date=['November 13, 1940'],
## Running.time=Long Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[104,168],
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=11/13/1940,
## imdb=Great Movies,
## metascore=96,
## rotten_tomatoes=95%,
## Budget=$2.28 million,
## Box.office=$76.4–$83.3 million} 4
image(trans)
This shows how sparse the data is. The black dots is where it returned true and the white parts are where it was false. The data is scattered in our dataset.
itemFrequencyPlot(trans,topN = 4)
This further validates our finding from the summary. This show the relative frequency of movies having English as language is the highest. Moreover, movies that were shot in USA are also in large numbers just behind language attribute. The next most frequent data in out dataset is the running time being “long duration”.
vertical <- as(trans, "tidLists")
as(vertical, "matrix")[1:10, 1:5]
## 2 3 4 5 6
## X=[1,214) TRUE TRUE TRUE TRUE TRUE
## X=[214,307) FALSE FALSE FALSE FALSE FALSE
## X=[307,425] FALSE FALSE FALSE FALSE FALSE
## title=101 Dalmatians FALSE FALSE FALSE FALSE FALSE
## title=102 Dalmatians FALSE FALSE FALSE FALSE FALSE
## title=20,000 Leagues Under the Sea FALSE FALSE FALSE FALSE FALSE
## title=A Bug's Life FALSE FALSE FALSE FALSE FALSE
## title=A Christmas Carol FALSE FALSE FALSE FALSE FALSE
## title=A Kid in King Arthur's Court FALSE FALSE FALSE FALSE FALSE
## title=A Wrinkle in Time FALSE FALSE FALSE FALSE FALSE
trans
## transactions in sparse format with
## 252 transactions (rows) and
## 1388 items (columns)
trans_USA <- trans[trans %in% "Country=In State"]
trans_USA
## transactions in sparse format with
## 225 transactions (rows) and
## 1388 items (columns)
This shows the transactions having “In State” value for Country variable in our dataset.
inspect(head(trans_USA))
## items transactionID
## [1] {X=[1,214),
## title=Snow White and the Seven Dwarfs,
## Production.company=Walt Disney Productions,
## Release.date=['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA )', 'February 4, 1938 (United States)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[1.96e+08,1.66e+09],
## Release.date..datetime.=12/21/1937,
## imdb=Great Movies,
## metascore=95,
## rotten_tomatoes=,
## Budget=$1.49 million,
## Box.office=$418 million} 2
## [2] {X=[1,214),
## title=Pinocchio,
## Production.company=Walt Disney Productions,
## Release.date=['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=2/7/1940,
## imdb=Great Movies,
## metascore=99,
## rotten_tomatoes=100%,
## Budget=$2.6 million,
## Box.office=$164 million} 3
## [3] {X=[1,214),
## title=Fantasia,
## Production.company=Walt Disney Productions,
## Release.date=['November 13, 1940'],
## Running.time=Long Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[104,168],
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=11/13/1940,
## imdb=Great Movies,
## metascore=96,
## rotten_tomatoes=95%,
## Budget=$2.28 million,
## Box.office=$76.4–$83.3 million} 4
## [4] {X=[1,214),
## title=The Reluctant Dragon,
## Production.company=Walt Disney Productions,
## Release.date=['June 20, 1941'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[36,5.02e+07),
## Release.date..datetime.=6/20/1941,
## imdb=Not So Great Movies,
## metascore=N/A,
## rotten_tomatoes=67%,
## Budget=$600,000 ,
## Box.office=$960,000 (worldwide rentals)} 5
## [5] {X=[1,214),
## title=Dumbo,
## Production.company=Walt Disney Productions,
## Release.date=['October 23, 1941 (New York City)', 'October 31, 1941 (U.S.)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[36,5.02e+07),
## Release.date..datetime.=10/23/1941,
## imdb=Great Movies,
## metascore=96,
## rotten_tomatoes=98%,
## Budget=$950,000 ,
## Box.office=$1.3 million (est. United States/Canada rentals, 1941)} 6
## [6] {X=[1,214),
## title=Bambi,
## Production.company=Walt Disney Productions,
## Release.date=['August 9, 1942 (World Premiere-London)', 'August 13, 1942 (Premiere-New York City)', 'August 21, 1942 (U.S.)'],
## Running.time=Short Duration Movies,
## Country=In State,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[1.96e+08,1.66e+09],
## Release.date..datetime.=8/9/1942,
## imdb=Great Movies,
## metascore=91,
## rotten_tomatoes=90%,
## Budget=$858,000 ,
## Box.office=$267.4 million} 7
We can see the first few items of the transactions here.
Now lets find some frequent itemsets.
its <- apriori(trans, parameter=list(target = "frequent"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 25
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1388 item(s), 252 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [603 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
its
## set of 603 itemsets
inspect(head(its, n = 10))
## items support
## [1] {metascore=N/A} 0.1071429
## [2] {Country=Out} 0.1071429
## [3] {Production.company=} 0.1230159
## [4] {Production.company=Walt Disney Productions} 0.1706349
## [5] {Running.time..int.=[91,104)} 0.3015873
## [6] {Running.time=Short Duration Movies} 0.3174603
## [7] {Budget..float.=[150,1.8e+07)} 0.3253968
## [8] {Running.time..int.=[40,91)} 0.3293651
## [9] {X=[1,214)} 0.3333333
## [10] {Box.office..float.=[5.02e+07,1.96e+08)} 0.3333333
## transIdenticalToItemsets count
## [1] 0 27
## [2] 0 27
## [3] 0 31
## [4] 0 43
## [5] 0 76
## [6] 0 80
## [7] 0 82
## [8] 0 83
## [9] 0 84
## [10] 0 84
ggplot(tibble(`Itemset Size` = factor(size(its))), aes(`Itemset Size`)) + geom_bar()
It looks like most of the itemsets have a size of 3. We can see a bell curve above with itemset size 3 having the highest count.
Now we’ll build some rules. Very similar to looking at frequent!
rules <- apriori(trans, parameter = list(support = 0.05, confidence = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 12
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1388 item(s), 252 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [2444 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(rules))
## lhs rhs support confidence coverage lift count
## [1] {} => {Language=English} 0.93650794 0.9365079 1.00000000 1.000000 236
## [2] {Production.company=Walt Disney Pictures} => {Country=In State} 0.06746032 1.0000000 0.06746032 1.120000 17
## [3] {Production.company=Walt Disney Pictures} => {Language=English} 0.06746032 1.0000000 0.06746032 1.067797 17
## [4] {Production.company=['Walt Disney Pictures', 'Pixar Animation Studios']} => {Box.office..float.=[1.96e+08,1.66e+09]} 0.06746032 0.9444444 0.07142857 2.833333 17
## [5] {Production.company=['Walt Disney Pictures', 'Pixar Animation Studios']} => {Budget..float.=[8e+07,4.11e+08]} 0.06746032 0.9444444 0.07142857 2.800000 17
## [6] {Production.company=['Walt Disney Pictures', 'Pixar Animation Studios']} => {Running.time=Long Duration Movies} 0.06746032 0.9444444 0.07142857 1.383721 17
We have used a support of 0.05 and a confidence of 0.9.
plot(rules,jitter = 1)
The plot shows the data that are positively correlated meaning the itemsets having lift that is positive. Since negative value of lift represents negative correlation, they are faded away in the diagram. Here, we can see that the rules having high lift have low support.
plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Order is how many items are in the rule. That is items implying other items.
plot(head(rules, n = 50), method = "graph")
We can see below that Country = “United States” and Language = “English” has the maximum support.