Set the working directory. This determines the default location of all of your data files.

if(!is.null(dev.list())) dev.off()
## null device 
##           1
cat("\014") 
rm(list=ls())
options(scipen=9)
# Load the package for reading Excel files, installing it first if it is not available
if(!require(readxl)){install.packages("readxl")}
## Loading required package: readxl
library("readxl")
  1. Read in the Excel file and change it to a data frame
getwd()  # show the current working directory
## [1] "C:/Users/jeelv/OneDrive/BDSA 2022 WInter/8430 DATA ANALYSIS ALGO AND MATHEMATICS/WEEK 3"
movieSurvey <- read_excel("2014 and 2015 CSM 22W.xlsx")     # 1.1 Read the Excel file (returned as a tibble), then convert it to a data frame below.
movieSurvey <- as.data.frame(movieSurvey)
head(movieSurvey)
##                               Movie Year Ratings Genre     Gross    Budget
## 1                           13 Sins 2014     6.3     8      9130   4000000
## 2                    22 Jump Street 2014     7.1     1 192000000  50000000
## 3                    3 Days to Kill 2014     6.2     1  30700000  28000000
## 4            300: Rise of an Empire 2014     6.3     1 106000000 110000000
## 5                 A Haunted House 2 2014     4.7     8  17300000   3500000
## 6 A Million Ways to Die in the West 2014     6.1     8  42600000  40000000
##   Screens Sequel Sentiment   Views Likes Dislikes Comments Aggregate Followers
## 1      45      1         0 3280543  4632      425      636             1120000
## 2    3306      2         2  583289  3465       61      186            12350000
## 3    2872      1         0  304861   328       34       47              483000
## 4    3470      2         0  452917  2429      132      590              568000
## 5    2310      2         0 3145573 12163      610     1082             1923800
## 6    3158      1         0 3013011  9595      419     1020             8153000
#1.2 Append JV initials to column names
colnames(movieSurvey) <- paste(colnames(movieSurvey), "JV", sep = "_")
head(movieSurvey,10)
##                             Movie_JV Year_JV Ratings_JV Genre_JV  Gross_JV
## 1                            13 Sins    2014        6.3        8      9130
## 2                     22 Jump Street    2014        7.1        1 192000000
## 3                     3 Days to Kill    2014        6.2        1  30700000
## 4             300: Rise of an Empire    2014        6.3        1 106000000
## 5                  A Haunted House 2    2014        4.7        8  17300000
## 6  A Million Ways to Die in the West    2014        6.1        8  42600000
## 7                A Most Violent Year    2014        7.1        1   5750000
## 8        A Walk Among the Tombstones    2014        6.5       10  26000000
## 9                   About Last Night    2014        6.1        8  48600000
## 10                   American Sniper    2014        7.3        1 350000000
##    Budget_JV Screens_JV Sequel_JV Sentiment_JV Views_JV Likes_JV Dislikes_JV
## 1    4000000         45         1            0  3280543     4632         425
## 2   50000000       3306         2            2   583289     3465          61
## 3   28000000       2872         1            0   304861      328          34
## 4  110000000       3470         2            0   452917     2429         132
## 5    3500000       2310         2            0  3145573    12163         610
## 6   40000000       3158         1            0  3013011     9595         419
## 7   20000000        818         1            2  1854103     2207         197
## 8   28000000       2714         1            3  2213659     2210         419
## 9   12500000       2253         1            0  5218079    11709         532
## 10  58800000       3555         1            4  3927600    13143         573
##    Comments_JV Aggregate Followers_JV
## 1          636                1120000
## 2          186               12350000
## 3           47                 483000
## 4          590                 568000
## 5         1082                1923800
## 6         1020                8153000
## 7          593                 130655
## 8          382                 125646
## 9          770               21697300
## 10        3134                  24300

1.3. What are the dimensions of the dataset (rows and columns)? -> The dim() function returns the number of rows and columns: the dataset has 187 rows and 14 columns.

dim(movieSurvey)
## [1] 187  14

2. Summarizing Data

2.1. Means and Standard Deviations
  a. Calculate the mean and standard deviation for Gross.
  b. Use the results above to calculate the coefficient of variation (rounded to 2 decimal places).
  c. Calculate the mean and standard deviation for Budget. Also calculate the coefficient of variation (rounded to 2 decimal places).
  d. Does the budget or the gross sales of a movie have more variation?

#2.1.a calculated mean and standard deviation for Gross
mean(movieSurvey$Gross_JV)
## [1] 77646944
sd(movieSurvey$Gross_JV)
## [1] 93899208
#2.1.b Store the mean and SD of Gross in variables, then compute the coefficient of variation (SD divided by mean),
#rounded to 2 decimal places.
mm <- mean(movieSurvey$Gross_JV)
ss <- sd(movieSurvey$Gross_JV)

coeff_var_mov <- ss/mm
round(coeff_var_mov,2)
## [1] 1.21
#2.1.c Calculate the mean, SD, and coefficient of variation of Budget, with the coefficient rounded to 2 decimal places.
mean(movieSurvey$Budget_JV)
## [1] 53844373
sd(movieSurvey$Budget_JV)
## [1] 57100689
Budget_Mean <- mean(movieSurvey$Budget_JV)
Budget_Sd <- sd(movieSurvey$Budget_JV)

# Separate variable so the Gross coefficient computed earlier is not overwritten
coeff_var_budget <- Budget_Sd/Budget_Mean
round(coeff_var_budget,2)
## [1] 1.06
#2.1.d Gross sales have more relative variation than budget: the coefficient of variation is 1.21 for Gross versus 1.06 for Budget.
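
The two coefficient-of-variation calculations above repeat the same steps, so they could be wrapped in a small helper. This is only a sketch: `coef_var` is a hypothetical function name, and the vectors hold just the first six Gross and Budget values from the `head()` output, not the full dataset.

```r
# Hypothetical helper: coefficient of variation = standard deviation / mean
coef_var <- function(x, digits = 2) {
  x <- x[!is.na(x)]                 # ignore missing values
  round(sd(x) / mean(x), digits)
}

# Illustrative subset only: first six rows shown by head(movieSurvey)
gross  <- c(9130, 192000000, 30700000, 106000000, 17300000, 42600000)
budget <- c(4000000, 50000000, 28000000, 110000000, 3500000, 40000000)

cv <- c(Gross = coef_var(gross), Budget = coef_var(budget))
cv                      # both coefficients side by side
names(which.max(cv))    # name of the variable with more relative variation
```

On the real data the same call would be `coef_var(movieSurvey$Gross_JV)` and `coef_var(movieSurvey$Budget_JV)`.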

2.2 Calculate the 32nd percentile of the number of Likes given. This calculation should be rounded to the nearest whole number (no decimal places).

# 2.2 The 32nd percentile of Likes, rounded to the nearest whole number, is 3355.
PercentileOfLikes <- quantile(movieSurvey$Likes_JV, c(.32))
round(PercentileOfLikes)
##  32% 
## 3355
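
`quantile()` interpolates between observations, and its default method (`type = 7`) is only one of several definitions of a percentile. The sketch below compares two of them on the first ten Likes values from the output above; it is illustrative only and does not reproduce the 3355 result, which needs the full column.

```r
# Illustrative subset only: first ten Likes values shown by head(movieSurvey, 10)
likes <- c(4632, 3465, 328, 2429, 12163, 9595, 2207, 2210, 11709, 13143)

p32_default <- quantile(likes, probs = 0.32)            # default, type = 7
p32_type6   <- quantile(likes, probs = 0.32, type = 6)  # alternative definition

# unname() drops the "32%" label so only the rounded numbers print
round(unname(c(default = p32_default, type6 = p32_type6)))
```

With a small sample the two definitions can differ noticeably; with 187 rows the difference is usually minor.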
3. Organizing Data

3.1. Summary Table
  a. Create a table showing average rating by year. This should be rounded to two decimal places.
  b. Which year’s movies have the highest rating? What is it?
#3.1.a Created a summary table of the average rating by year, rounded to 2 decimal places.
#3.1.b Movies from 2014 have the highest average rating, 6.44, compared with 6.40 for 2015.

Table1_JV <- aggregate(movieSurvey$Ratings_JV, by=list(movieSurvey$Year_JV), FUN=mean, na.rm=TRUE)
Table1_JV
##   Group.1        x
## 1    2014 6.435338
## 2    2015 6.403704
round(Table1_JV,2)
##   Group.1    x
## 1    2014 6.44
## 2    2015 6.40
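
The default column names `Group.1` and `x` are not very descriptive. A possible alternative, sketched here on a small made-up data frame (`toy` and its values are hypothetical), is the formula interface of `aggregate()`, which keeps the original column names.

```r
# Hypothetical toy data standing in for movieSurvey
toy <- data.frame(
  Year_JV    = c(2014, 2014, 2015, 2015),
  Ratings_JV = c(6.3, 7.1, 5.9, 6.8)
)

# Formula interface: mean of Ratings_JV for each Year_JV
avg_by_year <- aggregate(Ratings_JV ~ Year_JV, data = toy, FUN = mean)
avg_by_year$Ratings_JV <- round(avg_by_year$Ratings_JV, 2)
avg_by_year

# Row with the highest average rating
avg_by_year[which.max(avg_by_year$Ratings_JV), ]
```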
3.2. Cross Tabulation: create a table and cross tabulation
  a. Create a table counting all genres of movies and which sequel number it is.
  b. Change the table to show the percentage of each genre that is the 1st, 2nd, etc. movie in the series. These should be rounded to two decimal places.
  c. What percentage of movies in genre number 8 are not sequel?
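# 3.2.a Cross-tabulated the Genre and Sequel columns using with() and table().
# 3.2.b Converted the counts to percentages rounded to 2 decimal places. Note that prop.table() without a margin
#       divides by the grand total, so each cell is a percentage of all 187 movies, not of its genre.
# 3.2.c Genre 8 makes up 18.18 + 4.28 + 0.53 = 22.99% of all movies, and 18.18% are genre 8 with Sequel = 1 (not a sequel).
#       Within genre 8 that is 18.18 / 22.99, so about 79% of genre 8 movies are not sequels.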
crossTab_JV <- with(movieSurvey, table(Genre_JV, Sequel_JV))
crossTab_JV <-round(prop.table(crossTab_JV)*100,2)
crossTab_JV
##         Sequel_JV
## Genre_JV     1     2     3     4     5     6     7
##       1  19.25  5.35  1.07  1.60  1.07  0.00  1.07
##       2   2.67  0.53  1.60  0.00  0.53  0.53  0.00
##       3  17.65  1.07  0.53  0.00  0.53  0.00  0.00
##       6   1.07  0.00  0.00  0.00  0.00  0.00  0.00
##       7   0.53  0.00  0.00  0.00  0.00  0.00  0.00
##       8  18.18  4.28  0.53  0.00  0.00  0.00  0.00
##       9   5.35  0.00  0.00  0.00  0.00  0.00  0.00
##       10  4.28  0.53  0.00  0.00  0.00  0.00  0.00
##       12  4.81  1.07  0.53  0.00  0.00  0.00  0.00
##       15  3.74  0.00  0.00  0.00  0.00  0.00  0.00
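
To get the percentage within each genre directly, `prop.table()` accepts a `margin` argument: `margin = 1` divides each cell by its row total, so every genre row sums to 100. A minimal sketch on made-up data (the `toy` data frame is hypothetical, not the survey data):

```r
# Hypothetical toy data: genre 1 has 3 movies, genre 8 has 5
toy <- data.frame(
  Genre_JV  = c(1, 1, 1, 8, 8, 8, 8, 8),
  Sequel_JV = c(1, 2, 1, 1, 1, 1, 2, 3)
)

counts  <- with(toy, table(Genre_JV, Sequel_JV))

# margin = 1: percentages are computed within each row (genre)
row_pct <- round(prop.table(counts, margin = 1) * 100, 2)
row_pct

# Percentage of genre 8 movies that are the first in their series
row_pct["8", "1"]
```

Applied to the real table, the equivalent call would be `prop.table(with(movieSurvey, table(Genre_JV, Sequel_JV)), margin = 1)`.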

3.3. Bar Plot
  a. Create a bar plot of genre of movies.
  b. The plot should be:
     i. Rank ordered by highest count of genre.
     ii. Properly labeled (title, x-axis, etc)
     iii. The bars should have a different colour than the one shown in class.
  c. Based on the bar plot, (approximately) how many movies are there in genre number 8?

#Bar Plot
#3.3.a&b Created a bar plot of movie counts by genre, ordered from highest to lowest count, properly labeled, with cyan shaded bars.
#3.3.c From the bar plot, there are approximately 43 movies in genre number 8.

genreForBar_JV <- table(movieSurvey$Genre_JV)
Genre_JV <- genreForBar_JV[order(genreForBar_JV,decreasing=TRUE)]
barplot(Genre_JV,
        col=5,
        density = 25, angle = 45,
        main="Number of Movies by Genre",
        xlab="Genre (ordered by descending count)",
        ylab="Number of movies")
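
Reading a count off a bar plot is only approximate. As a cross-check, the exact count can be taken from the same frequency table. The sketch below uses a made-up `genre` vector (hypothetical values) and `sort()`, which is a shorter way to order a table than indexing with `order()`.

```r
# Hypothetical genre codes standing in for movieSurvey$Genre_JV
genre <- c(1, 8, 1, 1, 8, 8, 1, 10, 8, 1, 3, 3)

# Frequency table sorted from most to least common genre
genre_counts <- sort(table(genre), decreasing = TRUE)
genre_counts

# Exact number of movies in genre 8 (table names are character strings)
genre_counts[["8"]]
```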

4. Histogram
  a. Create a histogram of sentiment.
  b. The plot should be properly labeled and a unique colour.
  c. Which range of sentiment is the most common?
#4.a&b Created a histogram of the Sentiment column with proper labels and a unique colour.
#4.c As the histogram below shows, sentiment scores in the range -10 to 0 are the most common.

hist(movieSurvey$Sentiment_JV,
     col=8,
     density = 20, angle = 120,
     xlab="Sentiment score",
     ylab = "Frequency",
     main="Histogram of Sentiment for Movies")
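
The most common range can also be read from the histogram object instead of by eye: `hist()` with `plot = FALSE` returns the bin breaks and counts. This sketch uses only the first ten Sentiment values from the output above, so its bins will not match those of the full histogram.

```r
# Illustrative subset only: first ten Sentiment values shown by head(movieSurvey, 10)
sentiment <- c(0, 2, 0, 0, 0, 0, 2, 3, 0, 4)

h   <- hist(sentiment, plot = FALSE)   # compute bins without drawing
top <- which.max(h$counts)             # index of the fullest bin

# Lower edge, upper edge, and count of the most common bin
c(lower = h$breaks[top], upper = h$breaks[top + 1], count = h$counts[top])
```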

5. Box plot
  a. Create a horizontal box plot of number of screens the movies were shown on.
  b. The plot should be properly labeled and a unique colour.
  c. Based on the box plot, approximately how many movies were on fewer than 775 screens?
# 5.a&b Created a horizontal box plot of the number of screens each movie was shown on, labeled and in a unique colour.
# 5.c 775 screens is roughly the first quartile, so about 25% of the 187 movies (roughly 47) were shown on fewer than 775 screens.

boxplot(movieSurvey$Screens_JV, 
        main="Distribution of Number of Screens per Movie",
        xlab="Number of screens",
        col=8,
        horizontal=TRUE,
        pch=20)
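
The box plot estimate can be checked by counting directly. A minimal sketch, using only the first ten Screens values from the output above (so the numbers are illustrative, not the answer for the full dataset):

```r
# Illustrative subset only: first ten Screens values shown by head(movieSurvey, 10)
screens <- c(45, 3306, 2872, 3470, 2310, 3158, 818, 2714, 2253, 3555)

n_below     <- sum(screens < 775)    # how many movies were on fewer than 775 screens
share_below <- mean(screens < 775)   # the same as a proportion

n_below
share_below
quantile(screens, 0.25)              # first quartile, the left edge of the box
```

On the real data, `sum(movieSurvey$Screens_JV < 775)` would give the exact count.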

6. Scatter Plot
  a. Create a scatter plot comparing budget and gross sales.
  b. The plot should be properly labeled with a marker type different than the one demonstrated in class.
  c. Add a line at 45 degrees to the chart.
  d. Does there appear to be an association between budget and gross sales for movies?
  e. What does it mean if a movie is plotted below the line?
#6.a&b Created a scatter plot of budget (x-axis) against gross sales (y-axis), properly labeled, with a non-default marker (pch=9).
#6.c Added the 45-degree reference line gross = budget with abline(coef = c(0, 1)), i.e. intercept 0 and slope 1.
#6.d Gross sales generally rise with budget, so there appears to be a positive association. Movies lying on the line broke even (gross equals budget).
#6.e A movie plotted below the line grossed less than its budget, so it made a loss. Movies above the line grossed more than their budget.
plot(Gross_JV ~ Budget_JV,   # formula is y ~ x: gross on the y-axis, budget on the x-axis
     data=movieSurvey,
     col=3,
     pch=9,
     main="Scatter Plot of Gross Sales vs Budget",
     xlab="Budget",
     ylab="Gross sales")
abline(coef = c(0,1))        # reference line y = x
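
The visual impression can be backed up with two numbers: the correlation between budget and gross, and the count of movies that fall below the break-even line. This is a sketch on the first six rows from the `head()` output only (the `toy` data frame is a hypothetical stand-in for `movieSurvey`).

```r
# Illustrative subset only: first six rows shown by head(movieSurvey)
toy <- data.frame(
  Budget_JV = c(4000000, 50000000, 28000000, 110000000, 3500000, 40000000),
  Gross_JV  = c(9130, 192000000, 30700000, 106000000, 17300000, 42600000)
)

# Pearson correlation: positive values indicate gross rises with budget
cor(toy$Budget_JV, toy$Gross_JV)

# Movies below the line gross = budget, i.e. gross less than budget
below <- toy$Gross_JV < toy$Budget_JV
sum(below)
```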