I chose this data-set I found on Kaggle as it seemed like an interesting idea to delve into and analyse data & ratings on movies and TV shows available to watch on Netflix, as the winter season is coming up and staying in-doors and watching movies will become a more frequent occurence. Let’s begin!
First, I imported the data set into R Studio and activated the most common libraries we used throughout the Bootcamp (Output is hidden, the libraries are: readr, car, ggplot2, reshape2, onewaytests, dplyr, pastecs, psych, combinat, knitr, readxl, effectsize).The libraries are useful both for functions used in analysis, ease of use and aesthetics.
Netflix_IMDB_Data <- read_csv("~/R-Take home exam/Netflix TV Shows and Movies.csv", show_col_types = FALSE)
After importing, we can see that the set has 5283 observations and 11 variables.
Using the function head we can quickly look at the taken data:
head(Netflix_IMDB_Data)
## # A tibble: 6 × 11
## index id title type description release_year age_certification runtime imdb_id imdb_score imdb_votes
## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 0 tm84618 Taxi Driver MOVIE "A mentall… 1976 R 113 tt0075… 8.3 795222
## 2 1 tm127384 Monty Python an… MOVIE "King Arth… 1975 PG 91 tt0071… 8.2 530877
## 3 2 tm70993 Life of Brian MOVIE "Brian Coh… 1979 R 94 tt0079… 8 392419
## 4 3 tm190788 The Exorcist MOVIE "12-year-o… 1973 R 133 tt0070… 8.1 391942
## 5 4 ts22164 Monty Python's … SHOW "A British… 1969 TV-14 30 tt0063… 8.8 72895
## 6 5 tm14873 Dirty Harry MOVIE "When a ma… 1971 R 102 tt0066… 7.7 153463
The following data is specified in the imported data-set as follows:
Netflix_IMDB_Data <- Netflix_IMDB_Data %>%
rename(
Title = title,
Type = type,
Description = description,
Release_year = release_year,
Age_certification = age_certification,
Runtime = runtime,
IMDB_score = imdb_score,
IMDB_votes = imdb_votes
)
Netflix_IMDB_Data_Clean <- Netflix_IMDB_Data[ , c(-1, -2, -7, -9)]
Netflix_IMDB_Data_Clean$ContentType <- factor(Netflix_IMDB_Data_Clean$Type,
levels = c("MOVIE", "SHOW"),
labels = c(0, 1))
summary_output <- summary(Netflix_IMDB_Data_Clean[ , c(-1, -2, -3)])
print(summary_output)
## Release_year Runtime IMDB_score IMDB_votes ContentType
## Min. :1953 Min. : 0.0 Min. :1.500 Min. : 5 0:3407
## 1st Qu.:2015 1st Qu.: 45.0 1st Qu.:5.800 1st Qu.: 521 1:1876
## Median :2018 Median : 87.0 Median :6.600 Median : 2279
## Mean :2016 Mean : 79.2 Mean :6.533 Mean : 23407
## 3rd Qu.:2020 3rd Qu.:106.0 3rd Qu.:7.400 3rd Qu.: 10144
## Max. :2022 Max. :235.0 Max. :9.600 Max. :2268288
## NA's :16
Netflix_IMDB_Data_Clean <- Netflix_IMDB_Data_Clean %>%
filter(!is.na(IMDB_votes))
After cleaning the data, we can quickly overview the data to see our progress:
print(head(Netflix_IMDB_Data_Clean[ , c(-2, -3)]))
## # A tibble: 6 × 6
## Title Release_year Runtime IMDB_score IMDB_votes ContentType
## <chr> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 Taxi Driver 1976 113 8.3 795222 0
## 2 Monty Python and the Holy Grail 1975 91 8.2 530877 0
## 3 Life of Brian 1979 94 8 392419 0
## 4 The Exorcist 1973 133 8.1 391942 0
## 5 Monty Python's Flying Circus 1969 30 8.8 72895 1
## 6 Dirty Harry 1971 102 7.7 153463 0
summary_output <- summary(Netflix_IMDB_Data_Clean[ , c(-1, -2, -3)])
print(summary_output)
## Release_year Runtime IMDB_score IMDB_votes ContentType
## Min. :1953 Min. : 0.00 Min. :1.500 Min. : 5 0:3391
## 1st Qu.:2015 1st Qu.: 46.00 1st Qu.:5.800 1st Qu.: 521 1:1876
## Median :2018 Median : 87.00 Median :6.600 Median : 2279
## Mean :2016 Mean : 79.31 Mean :6.533 Mean : 23407
## 3rd Qu.:2020 3rd Qu.:106.00 3rd Qu.:7.400 3rd Qu.: 10144
## Max. :2022 Max. :235.00 Max. :9.600 Max. :2268288
We can see there are 3391 movies and 1876 TV shows in the clean data set, which shows a greater representation of movies available on Netflix.
From the summary we can see the earliest movie in the set was released in 1953 and the latest one in 2022, meaning the set only includes movies available to watch up to 2022. - We must also take into account the regional view settings that apply to Netflix, meaning a lot of the movies will not be available for us to watch in Slovenia, but we will still use the full set for the analysis.
The median release year is 2018, indicating that half of the shows and movies were released before 2018 and half of the shows and movies after were released 2018, which is logical as Netflix has recent movies and shows in the catalog.
The data shows us the longest movie is 235 minutes long - No Longer Kids (1979), which is 3,9 hours long. We can not confidently say which movie or TV show is the shortest as the Minimum shown is 0, implying either imperfect data in the set or other anomalies in the content runtime.
The mean runtime is 79.31 minutes, indicating the content is somewhat skewed towards shorter movies or TV show episodes.
The IMDB scores range from 1.5 (reflecting very poorly rated titles) to 9.6 (highly acclaimed titles).
The median score of 6.6 suggests that half of the titles have a rating above this, while half have a rating below.
The number of votes ranges from as few as 5 votes to over 2.26 million votes. This large range shows a huge disparity in the popularity of content in the dataset.
The median number of votes is 2279, indicating that half of the titles received fewer than this amount of votes, suggesting a significant portion of the dataset consists of less popular or less widely viewed content.
As movies and TV shows are two differing types of media, for my further analysis I will differentiate the two and analyse the data. First, we’ll see how the descriptive statistics & insights change as compared to the previous combined data.
I will now use the same type of filter as before to create two new datasets which include only movies and only TV shows.
Netflix_IMDB_Data_Clean_Movies <- Netflix_IMDB_Data_Clean %>%
filter(ContentType == 0)
Netflix_IMDB_Data_Clean_TVShows <- Netflix_IMDB_Data_Clean %>%
filter(ContentType == 1)
We can now run the same summary as before for each data set:
summary_output_Movies <- summary(Netflix_IMDB_Data_Clean_Movies[ , c(-1, -2, -3)])
print(summary_output_Movies)
## Release_year Runtime IMDB_score IMDB_votes ContentType
## Min. :1953 Min. : 8.0 Min. :1.500 Min. : 5 0:3391
## 1st Qu.:2014 1st Qu.: 88.0 1st Qu.:5.600 1st Qu.: 573 1: 0
## Median :2018 Median :100.0 Median :6.400 Median : 2632
## Mean :2015 Mean :101.8 Mean :6.265 Mean : 26683
## 3rd Qu.:2020 3rd Qu.:116.0 3rd Qu.:7.100 3rd Qu.: 12281
## Max. :2022 Max. :235.0 Max. :9.000 Max. :2268288
summary_output_TVShows <- summary(Netflix_IMDB_Data_Clean_TVShows[ , c(-1, -2, -3)])
print(summary_output_TVShows)
## Release_year Runtime IMDB_score IMDB_votes ContentType
## Min. :1969 Min. : 0.00 Min. :1.600 Min. : 5.0 0: 0
## 1st Qu.:2016 1st Qu.: 24.00 1st Qu.:6.400 1st Qu.: 444.5 1:1876
## Median :2018 Median : 41.00 Median :7.200 Median : 1782.0
## Mean :2017 Mean : 38.63 Mean :7.017 Mean : 17485.6
## 3rd Qu.:2020 3rd Qu.: 49.00 3rd Qu.:7.800 3rd Qu.: 7216.8
## Max. :2022 Max. :178.00 Max. :9.600 Max. :1727694.0
Looking at the two outputs we can pinpoint some differences
The mean runtime for movies is 101.8 minutes, while TV Shows average at 38.63 minutes. We see that TV shows are much shorter on average compared to movies.
Movies have an average score of 6.27 and TV shows average at 7.02, which implies TV shows generally receive better ratings on average than movies.
Movies have an average of 26683 votes and TV shows average 17486, which shows movies typically attract more votes than TV shows on average.
Now we’ll attempt to discover some new insights by visualizing our data.
I’ll start by using a simple ggplot function to create a histogram
ggplot_MoviesTV <- ggplot(Netflix_IMDB_Data_Clean, aes(x = Type, fill = Type)) +
geom_bar() +
labs(title = "Distribution of Movies and TV Shows in data",
x = "Type",
y = "Count") +
theme_minimal()
print(ggplot_MoviesTV)
ggplot_MoviesTV_ReleaseYear <- Netflix_IMDB_Data_Clean %>%
group_by(Release_year, Type) %>%
summarise(Count = n(), .groups = 'drop') %>%
ggplot(aes(x = Release_year, y = Count, color = Type)) +
geom_line(linewidth = 1.5) +
labs(title = "Distribution of Movies and TV Shows Over Release Year",
x = "Release Year",
y = "Count",
color = "Type") +
theme_minimal()
print(ggplot_MoviesTV_ReleaseYear)
First we have to import our data and use head to give us a quick overview.
MBA_Data_Excel <- read_excel("~/R-Take home exam/Business School.xlsx")
head(MBA_Data_Excel)
## # A tibble: 6 × 9
## `Student ID` `Undergrad Degree` `Undergrad Grade` `MBA Grade` `Work Experience` `Employability (Before)`
## <dbl> <chr> <dbl> <dbl> <chr> <dbl>
## 1 1 Business 68.4 90.2 No 252
## 2 2 Computer Science 70.2 68.7 Yes 101
## 3 3 Finance 76.4 83.3 No 401
## 4 4 Business 82.6 88.7 No 287
## 5 5 Finance 76.9 75.4 No 275
## 6 6 Computer Science 83.3 82.1 No 254
## # ℹ 3 more variables: `Employability (After)` <dbl>, Status <chr>, `Annual Salary` <dbl>
We can now format the data as a data frame.
MBA_Data <- as.data.frame(MBA_Data_Excel)
We can see from the data that there are 5 types of undergraduate degrees: Art, Business, Computer Science, Engineering and Finance, which we will have to factor for further analysis.
Before I do that I will rename the column to a one word item so it does not cause problems in functions.
colnames(MBA_Data)[2] <- "Degree"
MBA_Data$Degree <- factor(MBA_Data$Degree,
levels = c("Art", "Business", "Computer Science", "Engineering", "Finance"),
labels = c("Art", "Business", "Computer Science", "Engineering", "Finance"))
ggplot_Undergraduate_Degrees <- ggplot(MBA_Data, aes(x = Degree)) +
geom_bar(fill = "chocolate1") +
labs(title = "Undergraduate degrees",
x = "Type of degree",
y = "Frequency")
print(ggplot_Undergraduate_Degrees)
For the purpose of easier analysis I removed the id, work experience and status columns.
MBA_Data_summary_output <- summary(MBA_Data[ , c(-1, -5, -8)])
print(MBA_Data_summary_output)
## Degree Undergrad Grade MBA Grade Employability (Before) Employability (After) Annual Salary
## Art : 6 Min. : 61.20 Min. :58.14 Min. :101.0 Min. :119.0 Min. : 20000
## Business :35 1st Qu.: 71.47 1st Qu.:71.14 1st Qu.:245.8 1st Qu.:312.0 1st Qu.: 87125
## Computer Science:25 Median : 76.65 Median :76.38 Median :256.8 Median :435.6 Median :103500
## Engineering : 9 Mean : 76.90 Mean :76.04 Mean :257.9 Mean :422.7 Mean :109058
## Finance :25 3rd Qu.: 81.70 3rd Qu.:82.15 3rd Qu.:261.0 3rd Qu.:529.0 3rd Qu.:124000
## Max. :100.00 Max. :95.00 Max. :421.0 Max. :631.0 Max. :340000
Now lets look at the annual salary.
MBA_stat.desc.data <- round(stat.desc(MBA_Data$`Annual Salary`))
print(MBA_stat.desc.data)
## nbr.val nbr.null nbr.na min max range sum median mean
## 100 0 0 20000 340000 320000 10905800 103500 109058
## SE.mean CI.mean.0.95 var std.dev coef.var
## 4150 8235 1722373475 41501 0
options(scipen = 10)
ggplot_Salary <- ggplot(MBA_Data, aes(x=`Annual Salary`)) +
geom_histogram(binwidth = 20000, colour="aliceblue", fill = "chocolate3") +
theme_dark() +
ylab("Frequency")
print(ggplot_Salary)
H0: mean MBA grade is equal to 74 H1: mean MBA Grade is not equal to 74
MBA_t.test <- t.test(MBA_Data$`MBA Grade`,
mu = 74,
alternative = "two.sided")
print(MBA_t.test)
##
## One Sample t-test
##
## data: MBA_Data$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
We see that p=0.00915, and basing on that we can reject the null hypothesis H0 and accept the alternate hypothesis. We can predict (with 95% certainty) that the true mean is between 74.51764 and 77.56346.
effectsize::cohens_d(MBA_Data$`MBA Grade`, mu=74)
## Cohen's d | 95% CI
## ------------------------
## 0.27 | [0.07, 0.46]
##
## - Deviation from a difference of 74.
We see the Cohen’s deviation of 0.27, which is a small effect based on the Sawilowsky 2009 rate. - Such a small effect implies further analysis might not yield interesting results.
First, we import the dataset:
Apartment_Data <- read_excel("~/R-Take home exam/Apartments.xlsx")
head(Apartment_Data)
## # A tibble: 6 × 5
## Age Distance Price Parking Balcony
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
We can describe the data as follows: - Age: How old an apartment is - Distance: The distance from the city centre (km) - Price: Price per square meter - Parking: 0 = no, 1 = yes - Balcony: 0 = no, 1 = yes
First we format the data as a data frame.
Apartments <- as.data.frame(Apartment_Data)
head(Apartments)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Apartments$Parking_type <- factor(Apartments$Parking,
levels = c(0,1),
labels = c("No", "Yes"))
Apartments$Balcony_type <- factor(Apartments$Balcony,
levels = c(0,1),
labels = c("No", "Yes"))
Apartments_t.test <- t.test(Apartments$Price,
mu = 1900,
alternative = "two.sided")
print(Apartments_t.test)
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
We see that p=0.004731, and basing on that we can reject the null hypothesis H0.
fit1 <- lm(Price ~ Age,
data = Apartments)
print(summary(fit1))
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Apartments_scatterplot <- scatterplotMatrix(Apartments[,c(-4,-5,-6,-7)],
smooth = FALSE)
print(Apartments_scatterplot)
## NULL
Based on the matrix we can not spot multicolinearity, as the regression function is not steep. Although from just the matrix we can not conclude a clear definite observation.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
print(summary(fit2))
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
print(vif(fit2))
## Age Distance
## 1.001845 1.001845
print(mean(vif(fit2)))
## [1] 1.001845
The VIF statistic shows there should not be a problem with multicolinearity as both statistics are very close to 1 (and very far away from 5).
Apartments$StdResid <- round(rstandard(fit2), 3)
Apartments$CooksDistance <- round(cooks.distance(fit2), 3)
Apartments_Resid_Distance <- hist(Apartments$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Distribution of Standardized residuals")
Apartments_CooksDistance <- hist(Apartments$CooksDistance,
main = "Distribution of Cooks distances",
xlab = "Cooks distances",
ylab = "Frequency")
print(Apartments_CooksDistance)
## $breaks
## [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
##
## $counts
## [1] 80 3 1 0 0 0 1
##
## $density
## [1] 18.8235294 0.7058824 0.2352941 0.0000000 0.0000000 0.0000000 0.2352941
##
## $mids
## [1] 0.025 0.075 0.125 0.175 0.225 0.275 0.325
##
## $xname
## [1] "Apartments$CooksDistance"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
We can not spot any outliers in the graph as there is no standardized residual greater than +3 or smaller than -3. So we did not have to remove any problematic units.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
Apartments$StdFitValues <- scale(fit2$fitted.values)
Heteroskedasticity_Apartments <- ggplot(Apartments,
aes(x=StdFitValues, y=StdResid)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
print(Heteroskedasticity_Apartments)
## `geom_smooth()` using formula = 'y ~ x'
Based on the observations in the scatterplot, the values are heteroskedastic.
ggplot_Apartments <- ggplot(Apartments,
aes(x=StdResid)) +
geom_histogram(color = "gray") +
labs(title = "Histogram of standardized residuals",
x = "Standardized residuals",
y = "Frequency")
print(ggplot_Apartments)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
print(shapiro.test(Apartments$StdResid))
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdResid
## W = 0.95303, p-value = 0.003645
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
print(summary(fit2))
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
Age (-7.934): If the age increases by 1 year, the price decreases on average by 7,934 EUR per square meter.
Distance (-20.677): If the distance from the city centre increases by 1 kilometer, the price decreases on average by 20,667 EUR per square meter.
R-Squared (0.4396 = 43,96%) Relationship between price, age affect the variability of apartment price by 43,96%
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
print(anova (fit2, fit3))
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + Parking + Balcony
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can test the following: H0 = First model is better than second H1 = First model is not better than second
We reject H0 (p=0,01007) and conclude the second model is statistically better.
print(summary(fit3))
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 0.00000000528 ***
## Parking 196.168 62.868 3.120 0.00251 **
## Balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 0.00000000001849
Looking at the explanitory values we see the R-Squared has increased, suggesting the model is better. Age and Distance both negatively influence price, with coefficients of -6.799 and -18.045. We can also see that the apartment having a balcony does not have a statistically significant impact on the price (p = 0.97436).
Apartments$Fitted <- fitted(fit3)
Apartments$Residuals <- residuals(fit3)
print(head(Apartments[colnames(Apartments) %in% c("StdFitValues", "Residuals")]))
## StdFitValues Residuals
## 1 -0.7706550 -110.74150
## 2 1.1084679 442.58893
## 3 -0.7706550 -88.80674
## 4 -1.5182915 260.07897
## 5 -0.2940493 -412.57579
## 6 -0.1157744 -126.69107
Residual for 2nd row (ID 2 apartment) is 442,58893.