One of the quantitative variables in our dataset is running time, which shows how long each movie is in minutes. The mean running time of the movies is given below. The mean is one measure of the variable's central value.
mean(df$Running.time..int., na.rm = TRUE)
## [1] 97.30569
Similarly, the standard deviation shows how spread out the data are around the mean.
sd(df$Running.time..int., na.rm = TRUE)
## [1] 18.95949
Five-number summary for our quantitative variable:
Maximum
max(df$Running.time..int., na.rm = TRUE)
## [1] 168
Minimum
min(df$Running.time..int., na.rm = TRUE)
## [1] 40
Median
median(df$Running.time..int., na.rm = TRUE)
## [1] 96
First Quartile Q1
quantile(df$Running.time..int., .25, na.rm = TRUE)
## 25%
## 86
Third Quartile Q3
quantile(df$Running.time..int., .75, na.rm = TRUE)
## 75%
## 106.75
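The five values above can also be obtained in a single call. The following is a minimal sketch on a small made-up vector; `running_time` and its values are hypothetical and are not the project's data.

```r
# Hypothetical running times in minutes (illustrative only, not the Disney data)
running_time <- c(40, 86, 90, 96, 100, 106, 168, NA)

# quantile() with probs 0, .25, .5, .75 and 1 returns min, Q1, median, Q3 and max
five_num <- quantile(running_time, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
print(five_num)

# summary() reports the same five values plus the mean and the NA count
print(summary(running_time))
```

Note that `fivenum()` is another option, but it uses hinges rather than quantiles, so its quartiles can differ slightly from those of `quantile()`.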
Histogram
Running_Time <- df$Running.time..int.
hist(Running_Time,main="Histogram of Running Time",
xlab="Running Time in minutes")
Box Plot
boxplot(df$Running.time..int.)
We can see that there are many outliers in the data. Outliers are values that fall above the upper whisker or below the lower whisker, which by default in boxplot() means more than 1.5 times the interquartile range beyond the box. They represent extreme values in the dataset.
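To list the outliers rather than only see them on the plot, the 1.5 × IQR rule can be applied directly. This is a sketch on made-up values (`running_time` here is hypothetical). It uses `quantile()` for the quartiles, whereas `boxplot()` uses hinges, so the two can disagree slightly on borderline points.

```r
# Hypothetical running times in minutes (illustrative only)
running_time <- c(40, 85, 88, 92, 95, 96, 98, 101, 105, 107, 168, NA)

# Quartiles and interquartile range, ignoring missing values
q1 <- quantile(running_time, 0.25, na.rm = TRUE)
q3 <- quantile(running_time, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- running_time[!is.na(running_time) &
                           (running_time < lower_fence | running_time > upper_fence)]
print(outliers)
```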
ggplot(df,aes(sample = Running.time..int.)) +
geom_qq() +
geom_qq_line()
## Warning: Removed 10 rows containing non-finite values (stat_qq).
## Warning: Removed 10 rows containing non-finite values (stat_qq_line).
The distribution is approximately normal. The mean and the median are very close (97.3 and 96 minutes). In the box-and-whisker plot, the whiskers are about the same length and the median line sits near the middle of the box, which indicates symmetry. Additionally, the histogram has a single peak around the mean and median and forms a bell-shaped curve. The outliers noted above suggest that the tails may be somewhat heavier than those of a perfectly normal distribution.
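The visual checks above could be supplemented with a formal test. The sketch below runs the Shapiro-Wilk test from base R on simulated values. The data are generated with `rnorm()` purely for illustration, so the result says nothing about the actual running times.

```r
# Simulated running times (illustrative only): roughly normal around 97 minutes
set.seed(1)
running_time <- rnorm(200, mean = 97, sd = 19)

# Shapiro-Wilk test: the null hypothesis is that the data come from a normal distribution
result <- shapiro.test(running_time)
print(result$p.value)

# A small p-value (for example below 0.05) is evidence against normality
```

With the real column, `shapiro.test(df$Running.time..int.)` would be the equivalent call, since the function allows missing values as long as between 3 and 5000 values are present.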
Graphical display of budget, box office and running time in a Correlation Heatmap
library(corrr)
df_cor <- df %>%
select(c(Budget..float.,Box.office..float.,Running.time..int.)) %>%
correlate()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
df_cor
## # A tibble: 3 × 4
## term Budget..float. Box.office..float. Running.time..int.
## <chr> <dbl> <dbl> <dbl>
## 1 Budget..float. NA 0.740 0.389
## 2 Box.office..float. 0.740 NA 0.278
## 3 Running.time..int. 0.389 0.278 NA
stretch(df_cor) %>%
ggplot(aes(x=x, y=y, fill=r, label = round(r,2))) +
geom_tile()
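For comparison, the same kind of correlation matrix can be computed with base R's `cor()`. This is a sketch on a made-up data frame (`toy` and its column names are hypothetical). Unlike `correlate()`, `cor()` puts 1 on the diagonal instead of NA.

```r
# Hypothetical data frame with the same kinds of columns (values are made up)
toy <- data.frame(
  budget       = c(10, 30, 100, NA, 200),
  box_office   = c(25, 60, 310, 40, 500),
  running_time = c(85, 96, 120, 90, NA)
)

# Pearson correlations, using every pair of rows where both values are present
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```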
Frequency Table for Rotten Tomatoes
df %>%
group_by(rotten_tomatoes) %>%
summarise(Frequency = n()) %>%
mutate(Proportion = Frequency/sum(Frequency))
## # A tibble: 92 × 3
## rotten_tomatoes Frequency Proportion
## <chr> <int> <dbl>
## 1 "" 71 0.164
## 2 "0%" 1 0.00231
## 3 "100%" 12 0.0278
## 4 "12%" 2 0.00463
## 5 "13%" 1 0.00231
## 6 "14%" 3 0.00694
## 7 "17%" 2 0.00463
## 8 "18%" 1 0.00231
## 9 "19%" 2 0.00463
## 10 "20%" 4 0.00926
## # … with 82 more rows
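The table shows that the ratings are stored as text: the first row is the empty string, and "100%" sorts before "12%". One possible way to handle this, sketched here on a made-up vector (`rating_chr` is hypothetical), is to recode empty strings as NA and convert the percentages to numbers.

```r
# Hypothetical ratings in the same string format as the rotten_tomatoes column
rating_chr <- c("", "0%", "100%", "12%", "85%", "")

# Treat empty strings as missing, then strip the percent sign and convert
rating_chr[rating_chr == ""] <- NA
rating_num <- as.numeric(sub("%", "", rating_chr, fixed = TRUE))
print(rating_num)

# Numeric ratings sort in numeric order rather than alphabetical order
print(sort(rating_num))
```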
Two way table for categorical variables
theTable <- xtabs(~imdb+rotten_tomatoes, data = df)
head(theTable)
## rotten_tomatoes
## imdb 0% 100% 12% 13% 14% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28%
## 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## 2.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4.1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## rotten_tomatoes
## imdb 29% 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44% 45% 46%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4.1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## rotten_tomatoes
## imdb 47% 48% 49% 5% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59% 60% 61% 62% 63%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## rotten_tomatoes
## imdb 64% 65% 66% 67% 68% 69% 7% 70% 71% 72% 73% 74% 75% 76% 77% 78% 79% 80%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2.4 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## 4.1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## rotten_tomatoes
## imdb 81% 82% 83% 84% 85% 86% 87% 88% 89% 9% 90% 91% 92% 93% 94% 95% 96% 97%
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3.8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4.1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## rotten_tomatoes
## imdb 98% 99%
## 0 0
## 1.5 0 0
## 2.4 0 0
## 3.8 0 0
## 4 0 0
## 4.1 0 0
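Because almost every cell in this table is zero, a two-way table of banded ratings may be easier to read. The sketch below uses a made-up data frame with numeric ratings. The name `toy`, the band labels, and the cut points are all arbitrary assumptions for illustration.

```r
# Hypothetical ratings (made up): IMDb on a 0-10 scale, Rotten Tomatoes in percent
toy <- data.frame(
  imdb            = c(4.1, 5.5, 6.8, 7.2, 8.0, 6.1),
  rotten_tomatoes = c(12, 45, 70, 88, 95, 55)
)

# Group each rating into three bands (the cut points are an arbitrary choice)
toy$imdb_band <- cut(toy$imdb, breaks = c(0, 5, 7, 10),
                     labels = c("low", "medium", "high"), include.lowest = TRUE)
toy$rt_band <- cut(toy$rotten_tomatoes, breaks = c(0, 40, 75, 100),
                   labels = c("low", "medium", "high"), include.lowest = TRUE)

# Two-way table of the banded ratings
band_table <- xtabs(~ imdb_band + rt_band, data = toy)
print(band_table)
```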
summary(df$Budget..float.)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 150 10000000 30000000 63588612 100000000 410600000 159
Side-by-side box plots of budget by Rotten Tomatoes rating
ggplot(data = df, aes(y= Budget..float.,color = rotten_tomatoes)) +
geom_boxplot()
## Warning: Removed 159 rows containing non-finite values (stat_boxplot).
We can see that the Disney movies that have performed exceptionally well at the box office have very high ratings on Rotten Tomatoes. However, not every movie with a high Rotten Tomatoes rating has performed well at the box office.
library(ggplot2)
BoxOffice_Collection <- df$Box.office..float.
Rating <- df$rotten_tomatoes
ggplot(df, aes(x = Rating, y=BoxOffice_Collection)) +
geom_bar(stat="identity")
## Warning: Removed 77 rows containing missing values (position_stack).
The variables that I have used in the first phase of the project are as follows:
Running time: This measures how long the movie is in minutes. It is a quantitative variable.
Rotten Tomatoes: This is the rating given to the Disney movies by Rotten Tomatoes. It is treated as a categorical variable.
Box Office: This is the total box office collection of the movie. It is a quantitative variable.
Budget: This is the total expense of creating the movie. It is also a quantitative variable.
There are some other variables that I plan to use in future analysis. Some of them are:
IMDB: This is the IMDB rating for the movie.
Country: This is where the movie was made.
Release Date: This is the date when the movie was released to the public.
The dataset is good, but there are a lot of empty fields. For some attributes, only a few entries are unavailable, but other attributes were totally unusable because they had more empty slots than data. There is also redundancy, because some fields appear twice, such as budget and budget (in float). For the project, I tried to choose fields that had more information. Many entries still had dirty (incomplete) data, and these were ignored. Moreover, many of the fields, such as Language and Country, hold multiple values, which made it harder to organize the data.
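One way to decide which fields carry enough information is to count the missing entries in each column. The following is a minimal sketch on a made-up data frame (`toy` and its columns are hypothetical) in which empty strings stand in for missing values, as they do in the rotten_tomatoes column.

```r
# Hypothetical data frame (made up) with empty strings standing in for missing values
toy <- data.frame(
  title           = c("A", "B", "C", "D"),
  rotten_tomatoes = c("85%", "", "40%", ""),
  budget          = c(1e7, NA, 3e7, 5e6),
  stringsAsFactors = FALSE
)

# Recode empty strings as NA so that every kind of gap is counted the same way
toy$rotten_tomatoes[toy$rotten_tomatoes == ""] <- NA

# Number and share of missing entries in each column
missing_count <- colSums(is.na(toy))
missing_share <- missing_count / nrow(toy)
print(missing_count)
print(round(missing_share, 2))
```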
The graphs have been explained above.
I think the most interesting feature of my data is the large number of attributes available. There is a lot of information on each instance, which makes visualization easier. The data also includes very old movies, which may improve the accuracy of our study and of the predictions that we make. However, there is a lot of noise in the data, with multi-valued fields and many incomplete entries.