Set the working directory. This determines the default location of all your data.
if(!is.null(dev.list())) dev.off()
## null device
## 1
cat("\014")
rm(list=ls())
options(scipen=9)
# Load the readxl package for Excel files; install it first if it is not available
if(!require(readxl)){install.packages("readxl")}
## Loading required package: readxl
library("readxl")
getwd() # print the current working directory
## [1] "C:/Users/jeelv/OneDrive/BDSA 2022 WInter/8430 DATA ANALYSIS ALGO AND MATHEMATICS/WEEK 3"
movieSurvey <- read_excel("2014 and 2015 CSM 22W.xlsx") # 1.1 read the Excel file; it is converted to a data frame on the next line
movieSurvey <- as.data.frame(movieSurvey)
head(movieSurvey)
## Movie Year Ratings Genre Gross Budget
## 1 13 Sins 2014 6.3 8 9130 4000000
## 2 22 Jump Street 2014 7.1 1 192000000 50000000
## 3 3 Days to Kill 2014 6.2 1 30700000 28000000
## 4 300: Rise of an Empire 2014 6.3 1 106000000 110000000
## 5 A Haunted House 2 2014 4.7 8 17300000 3500000
## 6 A Million Ways to Die in the West 2014 6.1 8 42600000 40000000
## Screens Sequel Sentiment Views Likes Dislikes Comments Aggregate Followers
## 1 45 1 0 3280543 4632 425 636 1120000
## 2 3306 2 2 583289 3465 61 186 12350000
## 3 2872 1 0 304861 328 34 47 483000
## 4 3470 2 0 452917 2429 132 590 568000
## 5 2310 2 0 3145573 12163 610 1082 1923800
## 6 3158 1 0 3013011 9595 419 1020 8153000
#1.2 Append JV initials to column names
colnames(movieSurvey) <- paste(colnames(movieSurvey), "JV", sep = "_")
head(movieSurvey,10)
## Movie_JV Year_JV Ratings_JV Genre_JV Gross_JV
## 1 13 Sins 2014 6.3 8 9130
## 2 22 Jump Street 2014 7.1 1 192000000
## 3 3 Days to Kill 2014 6.2 1 30700000
## 4 300: Rise of an Empire 2014 6.3 1 106000000
## 5 A Haunted House 2 2014 4.7 8 17300000
## 6 A Million Ways to Die in the West 2014 6.1 8 42600000
## 7 A Most Violent Year 2014 7.1 1 5750000
## 8 A Walk Among the Tombstones 2014 6.5 10 26000000
## 9 About Last Night 2014 6.1 8 48600000
## 10 American Sniper 2014 7.3 1 350000000
## Budget_JV Screens_JV Sequel_JV Sentiment_JV Views_JV Likes_JV Dislikes_JV
## 1 4000000 45 1 0 3280543 4632 425
## 2 50000000 3306 2 2 583289 3465 61
## 3 28000000 2872 1 0 304861 328 34
## 4 110000000 3470 2 0 452917 2429 132
## 5 3500000 2310 2 0 3145573 12163 610
## 6 40000000 3158 1 0 3013011 9595 419
## 7 20000000 818 1 2 1854103 2207 197
## 8 28000000 2714 1 3 2213659 2210 419
## 9 12500000 2253 1 0 5218079 11709 532
## 10 58800000 3555 1 4 3927600 13143 573
## Comments_JV Aggregate Followers_JV
## 1 636 1120000
## 2 186 12350000
## 3 47 483000
## 4 590 568000
## 5 1082 1923800
## 6 1020 8153000
## 7 593 130655
## 8 382 125646
## 9 770 21697300
## 10 3134 24300
1.3. What are the dimensions of the dataset (rows and columns)? -> The dim() function shows that the dataset has 187 rows and 14 columns.
dim(movieSurvey)
## [1] 187 14
2 Summarizing Data
1. Means and Standard Deviations
a. Calculate the mean and standard deviation for Gross.
b. Use the results above to calculate the coefficient of variation (rounded to 2 decimal places).
c. Calculate the mean and standard deviation for Budget. Also calculate the coefficient of variation (rounded to 2 decimal places).
d. Does the budget or the gross sales of a movie have more variation?
#2.1.a calculated mean and standard deviation for Gross
mean(movieSurvey$Gross_JV)
## [1] 77646944
sd(movieSurvey$Gross_JV)
## [1] 93899208
# 2.1.b Store the mean and SD of Gross in variables, then compute the coefficient of variation (SD divided by mean),
# rounded to 2 decimal places.
mm <- mean(movieSurvey$Gross_JV)
ss <- sd(movieSurvey$Gross_JV)
coeff_var_mov <- ss/mm
round(coeff_var_mov,2)
## [1] 1.21
# 2.1.c Calculate the mean, SD, and coefficient of variation of Budget, with the coefficient rounded to 2 decimal places.
mean(movieSurvey$Budget_JV)
## [1] 53844373
sd(movieSurvey$Budget_JV)
## [1] 57100689
Budget_Mean <- mean(movieSurvey$Budget_JV)
Budget_Sd <- sd(movieSurvey$Budget_JV)
coeff_var_budget <- Budget_Sd/Budget_Mean # separate name so the Gross result is not overwritten
round(coeff_var_budget,2)
## [1] 1.06
# 2.1.d Comparing the two coefficients of variation, Gross (1.21) is larger than Budget (1.06), so the gross sales of a movie have more relative variation than its budget.
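The comparison in 2.1.d can be wrapped in a small helper so that both variables are treated the same way. This is only a minimal sketch: the function name coef_var and the toy vectors are hypothetical and are not part of the movie data.

```r
# Hypothetical helper: coefficient of variation = standard deviation / mean
coef_var <- function(x, digits = 2) {
  round(sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE), digits)
}

# Made-up toy vectors, for illustration only (not the movie data)
toy_a <- c(10, 20, 30, 40, 50)
toy_b <- c(28, 29, 30, 31, 32)

cv_a <- coef_var(toy_a)
cv_b <- coef_var(toy_b)

# The larger coefficient marks the variable with more relative spread
cv_a > cv_b
```

With the survey data, the same helper would be called as coef_var(movieSurvey$Gross_JV) and coef_var(movieSurvey$Budget_JV).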
2.2 Calculate the 32nd percentile of the number of Likes given. This calculation should be rounded to the nearest whole number (no decimal places).
# 2.2 The 32nd percentile of Likes, rounded to the nearest whole number, is 3355.
PercentileOfLikes <- quantile(movieSurvey$Likes_JV, c(.32))
round(PercentileOfLikes)
## 32%
## 3355
# 3.1.a Created a summary table of the average rating by year, rounded to 2 decimal places.
# 3.1.b 2014 has the higher average rating (6.44 versus 6.40 for 2015), although the difference is small. An average does not depend on how many movies were released in a year, so a larger number of 2014 movies would not by itself explain it.
Table1_JV <- aggregate(movieSurvey$Ratings_JV, by=list(movieSurvey$Year_JV), FUN=mean, na.rm=TRUE)
Table1_JV
## Group.1 x
## 1 2014 6.435338
## 2 2015 6.403704
round(Table1_JV,2)
## Group.1 x
## 1 2014 6.44
## 2 2015 6.40
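The summary table above comes back with the generic column names Group.1 and x. As a possible alternative, the formula interface of aggregate() keeps the original column names. The sketch below uses a made-up toy data frame, not the movie data.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  Year_JV    = c(2014, 2014, 2015, 2015),
  Ratings_JV = c(6.0, 7.0, 5.5, 6.5)
)

# Formula interface: the result keeps the names Year_JV and Ratings_JV
avg_by_year <- aggregate(Ratings_JV ~ Year_JV, data = toy, FUN = mean)
avg_by_year$Ratings_JV <- round(avg_by_year$Ratings_JV, 2)
avg_by_year

# Year with the highest average rating
best_year <- avg_by_year$Year_JV[which.max(avg_by_year$Ratings_JV)]
best_year
```

Rounding only the rating column also avoids rounding the year column, which round(Table1_JV,2) does implicitly.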
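# 3.2.a Cross-tabulated the Genre and Sequel columns using with() and table().
# 3.2.b Converted the counts to percentages of all movies with prop.table(), rounded to 2 decimal places.
# 3.2.c Genre 8 movies that are not sequels (Sequel_JV = 1) make up 18.18% of all movies. Because each cell is a share of the whole dataset, the remaining 81.82% covers every other genre and sequel combination, not just genre 8 sequels. Within genre 8 alone, about 79% of movies are not sequels (18.18 out of 18.18 + 4.28 + 0.53 = 22.99).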
crossTab_JV <- with(movieSurvey, table(Genre_JV, Sequel_JV))
crossTab_JV <-round(prop.table(crossTab_JV)*100,2)
crossTab_JV
## Sequel_JV
## Genre_JV 1 2 3 4 5 6 7
## 1 19.25 5.35 1.07 1.60 1.07 0.00 1.07
## 2 2.67 0.53 1.60 0.00 0.53 0.53 0.00
## 3 17.65 1.07 0.53 0.00 0.53 0.00 0.00
## 6 1.07 0.00 0.00 0.00 0.00 0.00 0.00
## 7 0.53 0.00 0.00 0.00 0.00 0.00 0.00
## 8 18.18 4.28 0.53 0.00 0.00 0.00 0.00
## 9 5.35 0.00 0.00 0.00 0.00 0.00 0.00
## 10 4.28 0.53 0.00 0.00 0.00 0.00 0.00
## 12 4.81 1.07 0.53 0.00 0.00 0.00 0.00
## 15 3.74 0.00 0.00 0.00 0.00 0.00 0.00
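The table above divides every cell by the grand total. If the question is instead about the share of non-sequels within each genre, prop.table() accepts a margin argument. The sketch below uses a made-up toy data frame, not the movie data.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  Genre_JV  = c(1, 1, 1, 8, 8, 8, 8, 8),
  Sequel_JV = c(1, 1, 2, 1, 1, 1, 1, 2)
)

counts <- with(toy, table(Genre_JV, Sequel_JV))

# margin = 1 divides each cell by its row total, so every genre row sums to 100
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
row_pct

# Share of genre 8 movies with Sequel_JV = 1
row_pct["8", "1"]
```

Using margin = 2 would instead give percentages within each Sequel_JV column.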
3.3. Bar Plot
a. Create a bar plot of genre of movies.
b. The plot should be:
i. Rank ordered by highest count of genre.
ii. Properly labeled (title, x-axis, etc)
iii. The bars should have a different colour than the one shown in class.
c. Based on the bar plot, (approximately) how many movies are there in genre number 8?
#Bar Plot
# 3.3.a&b Created a bar plot of genre counts in descending order, properly labeled, with cyan bars (col=5).
# 3.3.c From the plot, there are approximately 43 movies in genre 8.
genreForBar_JV <- table(movieSurvey$Genre_JV)
Genre_JV <- genreForBar_JV[order(genreForBar_JV,decreasing=TRUE)]
barplot(Genre_JV,
col=5,
density = 25, angle = 45,
main="Number of Movies by Genre",
xlab="Genre (in descending order of count)",
ylab="Number of movies")
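The answer to 3.3.c is read off the plot, but the sorted table can also be queried directly. A minimal sketch, using a made-up toy vector rather than the movie data:

```r
# Made-up genre codes, for illustration only
toy_genre <- c(8, 1, 8, 3, 1, 8)

# sort() on a table orders the counts; decreasing = TRUE puts the largest first
genre_counts <- sort(table(toy_genre), decreasing = TRUE)
genre_counts

# Exact count for genre 8, looked up by name
n_genre8 <- as.integer(genre_counts["8"])
n_genre8
```

On the survey data the equivalent lookup would be Genre_JV["8"], since Genre_JV holds the sorted table.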
# 4.a&b Created a histogram of the Sentiment column with proper labels and a distinct colour.
# 4.c The histogram below shows that sentiment scores in the -10 to 0 range are the most common.
hist(movieSurvey$Sentiment_JV,
col=8,
density = 20, angle = 120,
xlab="Sentiment score",
ylab = "Frequency",
main="Histogram of Movie Sentiment Scores")
# 5.a&b Created a horizontal boxplot of the number of screens each movie was shown on, labeled and with a distinct colour.
# 5.c About 25% of the movies were shown on fewer than roughly 775 screens; this value is the first quartile.
boxplot(movieSurvey$Screens_JV,
main="Distribution of Number of Screens per Movie",
xlab="Number of screens",
col=8,
horizontal=TRUE,
pch=20)
# 6.a&b Created a scatterplot of budget (x-axis) against gross sales (y-axis), properly labeled.
# 6.c Added the break-even line gross = budget with abline(a = 0, b = 1); it appears at 45 degrees when both axes use the same scale.
# 6.d Many movies lie on or close to the line, where gross sales roughly equal the budget, meaning neither profit nor loss.
# 6.e A movie exactly on the line broke even. Points above/left of the line grossed more than their budget (profit); points below/right of the line grossed less than their budget (loss).
plot(Gross_JV ~ Budget_JV, # formula is y ~ x, so Budget is on the x-axis to match the labels
data=movieSurvey,
col=3,
pch=9,
main="Scatterplot of Gross versus Budget",
xlab="Budget",
ylab="Gross")
abline(a = 0, b = 1) # intercept 0, slope 1: the line gross = budget
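The reading of the line in 6.d and 6.e can also be checked numerically instead of by eye. The sketch below follows the document's interpretation that a movie whose gross exceeds its budget is profitable, and uses a made-up toy data frame, not the movie data.

```r
# Made-up toy data, for illustration only
toy <- data.frame(
  Budget_JV = c(10, 20, 30, 40),
  Gross_JV  = c(15, 20, 12, 90)
)

# Gross greater than budget means the movie is on the profitable side of the break-even line
profit  <- toy$Gross_JV > toy$Budget_JV
on_line <- toy$Gross_JV == toy$Budget_JV

n_profit      <- sum(profit)    # number of profitable movies
n_break_even  <- sum(on_line)   # number of movies exactly on the line
share_profit  <- mean(profit)   # proportion of profitable movies

c(n_profit, n_break_even, share_profit)
```

On the survey data the same comparison would use movieSurvey$Gross_JV and movieSurvey$Budget_JV.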