This document continues the second phase of the project. The second phase can be found at the following link
Load the data set from the GitHub link.
df <- read.csv("https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/DisneyMoviesDataset.csv")
Clean the data. Of the 32 variables, we drop 16 and keep only the ones that are used; the dropped variables also had a large number of empty entries. We also remove every row whose imdb value is “N/A” or empty, and every row with a missing budget or box office value.
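# Keep columns 1 to 14, 22 and 23
df <- df[c(1:14, 22, 23)]
# Drop rows whose imdb rating is "N/A" or an empty string
df <- df[!(df$imdb == "N/A" | df$imdb == ""), ]
# Drop rows with a missing box office or budget value
df <- df[!(is.na(df$Box.office..float.) | is.na(df$Budget..float.)), ]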
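To see what the two row filters do, here is a minimal sketch on a toy data frame. The values are invented for illustration; only the column names are taken from the data set.

```r
# Toy data frame; the values are invented, only the column names match df.
toy <- data.frame(
  imdb = c("7.6", "N/A", "", "6.1", "8.0"),
  Box.office..float. = c(4.18e8, 1.0e8, 2.0e8, NA, 7.6e7),
  Budget..float. = c(1.49e6, 2.0e6, 3.0e6, 4.0e6, NA),
  stringsAsFactors = FALSE
)

# Drop rows whose imdb rating is the text "N/A" or an empty string.
toy <- toy[!(toy$imdb == "N/A" | toy$imdb == ""), ]

# Drop rows where either the box office or the budget is missing.
toy <- toy[!(is.na(toy$Box.office..float.) | is.na(toy$Budget..float.)), ]

print(toy)
```

Only the first toy row survives: two rows fail the imdb filter and two more have a missing box office or budget value.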
Convert imdb to a factor. Movies with a rating of 7.0 or higher are labelled Great Movies; all others are labelled Not So Great Movies.
library(dplyr)
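df <- df %>% mutate(
  # imdb is read as text because of the "N/A" entries, so convert it to a number before comparing
  imdb = factor(as.numeric(as.character(imdb)) >= 7.0, levels = c(TRUE, FALSE),
                labels = c('Great Movies', 'Not So Great Movies'))
)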
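Because the imdb column contained “N/A” entries, it is most likely read in as text rather than as numbers. The sketch below, on an invented vector of ratings, shows how comparing text with a number differs from comparing numbers. For ratings between 1.0 and 9.9 the two approaches agree, so this only matters for a value such as "10".

```r
# Invented ratings stored as text, as they would be after read.csv().
ratings <- c("7.6", "6.9", "10", "7")

# Comparing text with a number coerces the number to text ("7"),
# so the comparison is alphabetical: "10" sorts before "7".
as_text <- ratings >= 7.0

# Converting to numeric first gives the intended numeric comparison.
as_number <- as.numeric(ratings) >= 7.0

rating_class <- factor(as_number, levels = c(TRUE, FALSE),
                       labels = c("Great Movies", "Not So Great Movies"))
print(data.frame(ratings, as_text, as_number, rating_class))
```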
## Association
Importing new libraries for association analysis.
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
With the new libraries loaded, we now convert the data into transactions for association analysis.
transactions(df)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16 not logical
## or factor. Applying default discretization (see '? discretizeDF').
## transactions in sparse format with
## 252 transactions (rows) and
## 1494 items (columns)
Some of the columns may need to be converted into factors before the analysis. That conversion still has to be done here.
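One possible way to prepare the columns is sketched below in base R on a toy data frame. The toy values, the rule for spotting identifier-like columns, and the running-time break points (90 and 120 minutes) are all assumptions made for illustration, not choices taken from the project.

```r
# Toy stand-in for a few columns of df (values are invented).
toy <- data.frame(
  title = c("A", "B", "C", "D"),
  Country = c("United States", "United States", "United Kingdom", "United States"),
  Running.time..int. = c(83, 88, 126, 74),
  stringsAsFactors = FALSE
)

# Flag text columns in which every value is unique (for example titles):
# each value would become its own item and appear in a single transaction.
is_identifier <- vapply(
  toy,
  function(col) is.character(col) && length(unique(col)) == nrow(toy),
  logical(1)
)
prepared <- toy[!is_identifier]

# Convert the remaining text columns to factors.
char_cols <- vapply(prepared, is.character, logical(1))
prepared[char_cols] <- lapply(prepared[char_cols], factor)

# Bin the numeric running time into labelled ranges (break points are assumed).
prepared$Running.time..int. <- cut(
  prepared$Running.time..int.,
  breaks = c(0, 90, 120, Inf),
  labels = c("short", "medium", "long")
)

str(prepared)
```

A data frame prepared this way contains only factors, so passing it to `transactions()` should no longer trigger the default discretization warning.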
trans <- transactions(df)
## Warning: Column(s) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16 not logical
## or factor. Applying default discretization (see '? discretizeDF').
summary(trans)
## transactions as itemMatrix in sparse format with
## 252 rows (elements/itemsets/transactions) and
## 1494 columns (items) and a density of 0.0107095
##
## most frequent items:
##             Language=English        Country=United States
##                          236                          225
##     imdb=Not So Great Movies            imdb=Great Movies
##                          149                          103
## Running.time..int.=[104,168]                      (Other)
##                           93                         3226
##
## element (itemset/transaction) length distribution:
## sizes
## 16
## 252
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16 16 16 16 16 16
##
## includes extended item information - examples:
## labels variables levels
## 1 X=[1,214) X [1,214)
## 2 X=[214,307) X [214,307)
## 3 X=[307,425] X [307,425]
##
## includes extended transaction information - examples:
## transactionID
## 1 2
## 2 3
## 3 4
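The density and the (Other) count in the summary can be checked by hand from the numbers it reports. The short sketch below only redoes that arithmetic; the variable names are ours.

```r
n_transactions <- 252        # rows reported by summary(trans)
n_items <- 1494              # columns (items) reported by summary(trans)
items_per_transaction <- 16  # every transaction has length 16

# Density is the share of filled cells in the transactions-by-items matrix.
total_entries <- n_transactions * items_per_transaction
density <- total_entries / (n_transactions * n_items)

# (Other) is every entry not covered by the five most frequent items.
other <- total_entries - sum(c(236, 225, 149, 103, 93))

print(round(density, 7))
print(other)
```

Since every transaction holds exactly one item per column, the density reduces to 16/1494, which matches the reported 0.0107095, and the remaining entries match the reported 3226.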
head(colnames(trans))
## [1] "X=[1,214)" "X=[214,307)"
## [3] "X=[307,425]" "title=101 Dalmatians"
## [5] "title=102 Dalmatians" "title=20,000 Leagues Under the Sea"
inspect(trans[1:3])
## items transactionID
## [1] {X=[1,214),
## title=Snow White and the Seven Dwarfs,
## Production.company=Walt Disney Productions,
## Release.date=['December 21, 1937 ( Carthay Circle Theatre , Los Angeles , CA )', 'February 4, 1938 (United States)'],
## Running.time=83 minutes,
## Country=United States,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[1.96e+08,1.66e+09],
## Release.date..datetime.=12/21/1937,
## imdb=Great Movies,
## metascore=95,
## rotten_tomatoes=,
## Budget=$1.49 million,
## Box.office=$418 million} 2
## [2] {X=[1,214),
## title=Pinocchio,
## Production.company=Walt Disney Productions,
## Release.date=['February 7, 1940 ( Center Theatre )', 'February 23, 1940 (United States)'],
## Running.time=88 minutes,
## Country=United States,
## Language=English,
## Running.time..int.=[40,91),
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=2/7/1940,
## imdb=Great Movies,
## metascore=99,
## rotten_tomatoes=100%,
## Budget=$2.6 million,
## Box.office=$164 million} 3
## [3] {X=[1,214),
## title=Fantasia,
## Production.company=Walt Disney Productions,
## Release.date=['November 13, 1940'],
## Running.time=126 minutes,
## Country=United States,
## Language=English,
## Running.time..int.=[104,168],
## Budget..float.=[150,1.8e+07),
## Box.office..float.=[5.02e+07,1.96e+08),
## Release.date..datetime.=11/13/1940,
## imdb=Great Movies,
## metascore=96,
## rotten_tomatoes=95%,
## Budget=$2.28 million,
## Box.office=$76.4–$83.3 million} 4
image(trans)
itemFrequencyPlot(trans,topN = 10)
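By default, `itemFrequencyPlot` shows relative frequencies, that is, the share of transactions containing each item (its support). As a rough check, the sketch below recomputes that share for the five most frequent items using the counts from the summary above.

```r
n_transactions <- 252  # number of transactions reported by summary(trans)

# Absolute counts of the five most frequent items, copied from the summary.
counts <- c(
  "Language=English" = 236,
  "Country=United States" = 225,
  "imdb=Not So Great Movies" = 149,
  "imdb=Great Movies" = 103,
  "Running.time..int.=[104,168]" = 93
)

# Relative frequency (support) = count / number of transactions.
support <- counts / n_transactions
print(round(support, 3))
```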
vertical <- as(trans, "tidLists")
as(vertical, "matrix")[1:10, 1:5]
##                                        2     3     4     5     6
## X=[1,214)                           TRUE  TRUE  TRUE  TRUE  TRUE
## X=[214,307)                        FALSE FALSE FALSE FALSE FALSE
## X=[307,425]                        FALSE FALSE FALSE FALSE FALSE
## title=101 Dalmatians               FALSE FALSE FALSE FALSE FALSE
## title=102 Dalmatians               FALSE FALSE FALSE FALSE FALSE
## title=20,000 Leagues Under the Sea FALSE FALSE FALSE FALSE FALSE
## title=A Bug's Life                 FALSE FALSE FALSE FALSE FALSE
## title=A Christmas Carol            FALSE FALSE FALSE FALSE FALSE
## title=A Kid in King Arthur's Court FALSE FALSE FALSE FALSE FALSE
## title=A Wrinkle in Time            FALSE FALSE FALSE FALSE FALSE
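The vertical layout lists, for each item, the transactions that contain it. The base R sketch below illustrates the idea on a small invented incidence matrix. It does not use the arules classes, and all names in it are hypothetical.

```r
# Toy incidence matrix: rows are transactions, columns are items (invented values).
horizontal <- matrix(
  c(TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, TRUE,
    FALSE, TRUE,  TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("t1", "t2", "t3"), c("itemA", "itemB", "itemC"))
)

# Vertical layout: one row per item, one column per transaction.
vertical_toy <- t(horizontal)

# For each item, collect the IDs of the transactions that contain it.
tid_lists <- lapply(
  rownames(vertical_toy),
  function(item) colnames(vertical_toy)[vertical_toy[item, ]]
)
names(tid_lists) <- rownames(vertical_toy)

print(vertical_toy)
print(tid_lists)
```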