One of the quantitative variables in our dataset is the running time, which records how long each movie is in minutes. The mean running time of the movies is given below. The mean is one measure of the central value of the variable.

mean(df$Running.time..int., na.rm = TRUE)
## [1] 97.30569

Similarly, the standard deviation shows how spread out the data are around the central value.

sd(df$Running.time..int., na.rm = TRUE)
## [1] 18.95949

Five-number summary for our quantitative variable:

Maximum

max(df$Running.time..int., na.rm = TRUE)
## [1] 168

Minimum

min(df$Running.time..int., na.rm = TRUE)
## [1] 40

Median

median(df$Running.time..int., na.rm = TRUE)
## [1] 96

First Quartile Q1

quantile(df$Running.time..int., .25, na.rm = TRUE)
## 25% 
##  86

Third Quartile Q3

quantile(df$Running.time..int., .75, na.rm = TRUE)
##    75% 
## 106.75
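
The five values computed one at a time above can also be obtained in a single call. The sketch below is a minimal illustration on a small made-up vector (`running_time` is hypothetical, not taken from the dataset); with the real data, `df$Running.time..int.` would take its place.

```r
# Hypothetical running times in minutes (illustrative only, not taken from the dataset)
running_time <- c(40, 86, 90, 96, 100, 107, 168, NA)

# These five probabilities give minimum, Q1, median, Q3 and maximum in one call
five_num <- quantile(running_time, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
five_num
```

`quantile()` uses its default interpolation method here, the same one used for the Q1 and Q3 calls above.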

Histogram

Running_Time <- df$Running.time..int.
hist(Running_Time,main="Histogram of Running Time",
xlab="Running Time in minutes")

Box Plot

boxplot(df$Running.time..int.)

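We can see that there are many outliers in the data. Outliers are the values that lie well above the top whisker or below the bottom whisker; by default, boxplot() extends each whisker to the most extreme point within 1.5 times the interquartile range of the box. They represent extreme values in the dataset.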
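
To list the outliers rather than only see them on the plot, the whisker rule can be applied directly. The sketch below is a minimal illustration on a made-up vector (`running_time` and `bp` are hypothetical names, and the values are not from the dataset). It relies on `boxplot.stats()`, which flags points lying more than `coef` times the box length beyond the hinges.

```r
# Hypothetical running times in minutes (illustrative only, not taken from the dataset)
running_time <- c(40, 86, 90, 96, 100, 107, 168, NA)

# coef = 1.5 is the default multiplier, matching the default of boxplot()
bp <- boxplot.stats(running_time, coef = 1.5)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values that would be drawn as individual outlier points
```

With the real data, `boxplot.stats(df$Running.time..int.)$out` would return the running times plotted as outliers.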

ggplot(df,aes(sample = Running.time..int.)) +
  geom_qq() +
  geom_qq_line()
## Warning: Removed 10 rows containing non-finite values (stat_qq).
## Warning: Removed 10 rows containing non-finite values (stat_qq_line).

The distribution is approximately normal. The mean (97.31) and the median (96) are very close. In the box-and-whisker plot, the whiskers are about the same length and the median line sits near the middle of the box, which indicates symmetry. Additionally, the histogram has a single peak near the mean and median and is roughly bell-shaped. Because of the outliers noted above, the fit is not exact, so we describe the distribution as approximately normal.
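
One rough way to put a number on "very close" is to express the gap between the mean and the median in units of the standard deviation. The sketch below reuses the three values reported earlier in this section; the variable names are introduced here only for illustration. A small gap is consistent with symmetry but does not by itself establish normality.

```r
# Summary statistics reported earlier in this section
mean_rt   <- 97.30569
median_rt <- 96
sd_rt     <- 18.95949

# Gap between mean and median, measured in standard deviations
std_gap <- (mean_rt - median_rt) / sd_rt
round(std_gap, 2)
```

The gap comes to about 0.07 standard deviations.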

Graphical display of budget, box office and running time in a Correlation Heatmap

library(corrr)
df_cor <- df %>%
  select(c(Budget..float.,Box.office..float.,Running.time..int.)) %>%
  correlate()
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
df_cor
## # A tibble: 3 × 4
##   term               Budget..float. Box.office..float. Running.time..int.
##   <chr>                       <dbl>              <dbl>              <dbl>
## 1 Budget..float.             NA                  0.740              0.389
## 2 Box.office..float.          0.740             NA                  0.278
## 3 Running.time..int.          0.389              0.278             NA
stretch(df_cor) %>%
  ggplot(aes(x=x, y=y, fill=r, label = round(r,2))) +
  geom_tile()
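
As a cross-check on the correlations above, the same kind of matrix can be computed in base R. This is a minimal sketch on a made-up data frame (`movies` and its values are hypothetical). It uses the Pearson method with pairwise deletion, the settings reported in the corrr output, though base `cor()` puts 1 on the diagonal where `correlate()` shows NA.

```r
# Hypothetical values (illustrative only, not taken from the dataset)
movies <- data.frame(
  budget       = c(10, 30, NA, 100, 250),
  box_office   = c(25, 60, 40, NA, 900),
  running_time = c(80, 95, 101, 110, 130)
)

# Pearson correlations; each pair uses only the rows where both values are present
cor_matrix <- cor(movies, method = "pearson", use = "pairwise.complete.obs")
round(cor_matrix, 2)
```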

Frequency Table for Rotten Tomatoes

df %>%
  group_by(rotten_tomatoes) %>%
  summarise(Frequency = n()) %>%
  mutate(Proportion = Frequency/sum(Frequency))
## # A tibble: 92 × 3
##    rotten_tomatoes Frequency Proportion
##    <chr>               <int>      <dbl>
##  1 ""                     71    0.164  
##  2 "0%"                    1    0.00231
##  3 "100%"                 12    0.0278 
##  4 "12%"                   2    0.00463
##  5 "13%"                   1    0.00231
##  6 "14%"                   3    0.00694
##  7 "17%"                   2    0.00463
##  8 "18%"                   1    0.00231
##  9 "19%"                   2    0.00463
## 10 "20%"                   4    0.00926
## # … with 82 more rows
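
The table above treats each rating as text, so "100%" sorts before "12%" and the empty string appears as its own category. One possible way to make the frequencies easier to read is to convert the ratings to numbers and count them in bands. The sketch below is a minimal illustration on made-up ratings; the vector `ratings` and the band boundaries are assumptions.

```r
# Hypothetical ratings in the same text format as the dataset (illustrative only)
ratings <- c("", "0%", "100%", "12%", "85%", "85%")

# Treat empty strings as missing, then strip the percent sign and convert
ratings[ratings == ""] <- NA
score <- as.numeric(sub("%", "", ratings, fixed = TRUE))

# Count ratings in four bands; include.lowest keeps a score of 0 in the first band
bands <- table(cut(score, breaks = c(0, 25, 50, 75, 100), include.lowest = TRUE))
bands
prop.table(bands)  # proportions among the non-missing ratings
```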

Two way table for categorical variables

 theTable <- xtabs(~imdb+rotten_tomatoes, data = df)
 head(theTable)
##      rotten_tomatoes
## imdb     0% 100% 12% 13% 14% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28%
##       10  0    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   1.5  0  0    0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
##   2.4  0  0    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3.8  0  0    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4    0  0    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4.1  0  0    0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
##      rotten_tomatoes
## imdb  29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46%
##         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   1.5   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   2.4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3.8   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
##   4     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4.1   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##      rotten_tomatoes
## imdb  47% 48% 49% 5% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63%
##         0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   1.5   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   2.4   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3.8   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4     0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4.1   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##      rotten_tomatoes
## imdb  64% 65% 66% 67% 68% 69% 7% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80%
##         0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
##   1.5   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
##   2.4   0   0   0   0   0   0  0   0   1   0   0   0   0   0   0   0   0   0
##   3.8   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0   0   0   0
##   4     0   0   0   0   0   0  1   0   0   0   0   0   0   0   0   0   0   0
##   4.1   0   0   0   0   0   0  1   0   0   0   0   0   0   0   0   0   0   0
##      rotten_tomatoes
## imdb  81% 82% 83% 84% 85% 86% 87% 88% 89% 9% 90% 91% 92% 93% 94% 95% 96% 97%
##         0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0
##   1.5   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0
##   2.4   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0
##   3.8   0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0
##   4     0   0   0   0   0   0   0   0   0  0   0   0   0   0   0   0   0   0
##   4.1   0   0   0   0   0   0   0   0   0  1   0   0   0   0   0   0   0   0
##      rotten_tomatoes
## imdb  98% 99%
##         0   0
##   1.5   0   0
##   2.4   0   0
##   3.8   0   0
##   4     0   0
##   4.1   0   0
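
The two-way table above is very wide and mostly zeros because every distinct score is its own category. A more compact table is possible if both scores are first grouped into bands. The sketch below assumes the scores have already been converted to numbers; the data frame `movies`, the column names, and the band boundaries are all hypothetical.

```r
# Hypothetical numeric scores (illustrative only, not taken from the dataset)
movies <- data.frame(
  imdb = c(7.8, 6.1, 8.3, 5.4, 7.0, NA),
  rt   = c(95, 40, 98, 20, 72, 88)
)

# Group each score into a few labelled bands
movies$imdb_band <- cut(movies$imdb, breaks = c(0, 5, 7.5, 10),
                        labels = c("low", "mid", "high"))
movies$rt_band <- cut(movies$rt, breaks = c(0, 50, 100),
                      labels = c("low", "high"), include.lowest = TRUE)

# Two-way table of the bands; rows with a missing score are left out
two_way <- xtabs(~ imdb_band + rt_band, data = movies)
two_way
```
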
summary(df$Budget..float.)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##       150  10000000  30000000  63588612 100000000 410600000       159

Side by side plot for budget and Rotten Tomatoes

ggplot(data = df, aes(y= Budget..float.,color = rotten_tomatoes)) +
  geom_boxplot()
## Warning: Removed 159 rows containing non-finite values (stat_boxplot).

We can see that the Disney movies that have performed exceptionally well have a really high rating on Rotten Tomatoes. However, not all the movies that have a high rating on Rotten Tomatoes have performed well.

library(ggplot2)
BoxOffice_Collection <- df$Box.office..float.
Rating <- df$rotten_tomatoes
ggplot(df, aes(x = Rating, y = BoxOffice_Collection)) +
  geom_bar(stat = "identity")
## Warning: Removed 77 rows containing missing values (position_stack).
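
The relationship between rating and box office can also be summarised with a single number. The sketch below is a minimal illustration on made-up values (`rating` and `box_office` are hypothetical vectors). It converts the rating text to numbers and computes a Spearman rank correlation, which is one possible choice here rather than a method used elsewhere in this report.

```r
# Hypothetical ratings and box office figures (illustrative only)
rating     <- c("95%", "40%", "98%", "", "72%", "20%")
box_office <- c(500e6, 50e6, 30e6, 80e6, NA, 10e6)

# Turn "95%" into 95; empty strings become NA
rating[rating == ""] <- NA
score <- as.numeric(sub("%", "", rating, fixed = TRUE))

# Rank correlation over the movies that have both a rating and a box office figure
rho <- cor(score, box_office, method = "spearman", use = "complete.obs")
rho
```

A value near 1 would mean that higher-rated movies almost always earn more, while a moderate value is consistent with the observation that some highly rated movies did not perform well.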

  1. The dataset used in this project is the Disney Movies Dataset. It is available at the link below. https://github.com/nurfnick/Data_Sets_For_Stats/blob/e3598535dbd59f319eeedb23abd8a0b2816c2022/CuratedDataSets/DisneyMoviesDataset.csv

The variables that I have used in the first phase of the project are as follows

There are some other variables that I plan to use in future analysis. Some of them are

  1. The dataset is good, but there are a lot of empty fields. For some attributes there are only a few unavailable entries, but other attributes were totally unusable because they had more empty slots than data. Also, there are many duplicate fields, such as budget and then budget (in float), which increases redundancy. For the project, I tried to choose fields that had more information. Many entries still had dirty (incomplete) data; these were ignored. Moreover, many of the fields, such as Language and Country, hold multiple values, which made it harder to organize the data.

  2. The graphs have been explained above.

  3. I think the most interesting feature of my data is the large number of attributes available. There is a lot of information on each instance, which makes visualization easier. The data also include really old movies, which should improve the accuracy of our study and of the predictions that we make. However, there is a lot of noise in the data, with multi-valued fields and many incomplete entries.
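
The share of empty entries per field, mentioned in the first point above, can be measured before choosing which fields to use. The sketch below is a minimal illustration on a made-up data frame (`movies` and its columns are hypothetical). It counts an entry as missing if it is NA or an empty string.

```r
# Hypothetical data frame (illustrative only, not the Disney dataset)
movies <- data.frame(
  title  = c("A", "B", "C", "D"),
  budget = c(1e6, NA, 3e6, NA),
  rating = c("90%", "", "75%", ""),
  stringsAsFactors = FALSE
)

# Count entries per column that are NA or an empty string
missing_count <- sapply(movies, function(col) sum(is.na(col) | col == ""))

# Express the counts as a share of all rows
missing_share <- missing_count / nrow(movies)
missing_share
```

Fields whose share is above one half are the ones with more empty slots than data.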