Ke Liu (S3871535)
24 May, 2021
Netflix was the largest entertainment and media company by market capitalization as of July 2020 (Netflix Wikipedia 2021).
According to Roy Morgan (2021), the well-known Australian market research company, “Netflix is by far Australia’s most watched subscription television service with 14,168,000 viewers in an average four weeks”.
Netflix Logo (StickPNG 2021)
Hypothesis testing is an essential tool for data-driven decision-making in both academic and professional settings: it uses the statistical evidence in a sample to draw conclusions about a population.
In this project, the aims are to:
summarise the user rating scores of two independent samples of Netflix TV shows and movies, one for each of two maturity ratings.
apply hypothesis testing to compare the average user rating scores of the two ratings and check which is more popular among users.
provide Netflix with a statistical reference for future launches in different classifications, based on the result of the hypothesis test.
Netflix earns its revenue mainly from the subscriptions of a large number of users. Users’ opinions of the different classifications of TV shows and movies, especially the user rating score, are assumed to influence Netflix’s future launches.
The user rating score is also taken as an indicator of how popular a show is among users, although some scores can be biased or subjective.
Therefore, this project focuses on two ratings of TV shows and movies: G (suitable for general audiences) and PG (parental guidance suggested) (Netflix Help Center 2021).
A sample of 50 user rating scores is drawn for each of the two ratings. The average user rating score is the key measure used to investigate which classification is, to some extent, more popular among users.
The data set used for testing contains the rating distributions of 1000 Netflix shows. It was downloaded from the open data website Kaggle (https://www.kaggle.com/chasewillden/netflix-shows) and was collected by Chase Willden (2017) under the CC0: Public Domain license.
Screenshot of Data Source (Chase Willden 2017)
The sampling method used in this project is simple random sampling, so that every user rating score within a rating has an equal probability of being selected. The selection process is as follows:
loading the data set and cleaning it by removing missing values
splitting the data into two subsets by rating (G and PG)
randomly selecting 50 observations from each subset separately
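The three steps above can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the report: the rows are made up (only the subset sizes 57 and 119 mirror the cleaned subsets), and the function name `draw_sample`, the key names and the seeds are assumptions introduced for the example.

```python
import random

def draw_sample(rows, rating, n, seed):
    """Return n user rating scores drawn without replacement for one rating."""
    # Step 1: data cleaning, drop rows whose score is missing (None here).
    clean = [r for r in rows if r["user_rating_score"] is not None]
    # Step 2: keep only the scores of the requested rating level.
    subset = [r["user_rating_score"] for r in clean if r["rating"] == rating]
    # Step 3: simple random sampling, every score in the subset is equally likely.
    return random.Random(seed).sample(subset, n)

# Made-up rows standing in for the Kaggle file (hypothetical values).
rows = (
    [{"rating": "G", "user_rating_score": 55 + k % 40} for k in range(57)]
    + [{"rating": "PG", "user_rating_score": 57 + k % 40} for k in range(119)]
    + [{"rating": "G", "user_rating_score": None},
       {"rating": "PG", "user_rating_score": None}]
)

g_sample = draw_sample(rows, "G", 50, seed=1)
pg_sample = draw_sample(rows, "PG", 50, seed=2)
print(len(g_sample), len(pg_sample))
```

Fixing the seed makes the draw reproducible, which matters when the summary statistics of the samples are reported alongside the test.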
The data set contains 1000 observations of 7 variables. The specific information about all variables is shown in the table below.
| Variable | Description | Data Type | Range |
|---|---|---|---|
| title | Name of show | Character | N/A |
| rating | Maturity ratings and classifications | Factor (including 13 levels) | N/A |
| ratingLevel | Rating criteria | Character | N/A |
| ratingDescription | N/A | Double (numeric) | 10 - 124 |
| release year | Release year of show | Double (numeric) | 1940 - 2017 |
| user rating score | User rating score, full score is 100 | Double (numeric) | 0 - 100 |
| user rating size | Number of user ratings | Double (numeric) | 80 - 82 |
The rating variable is treated as a factor, which indicates the classification of each TV show or movie.
After the data type conversion, the rating factor has 13 levels: “G”, “NR”, “PG”, “PG-13”, “R”, “TV-14”, “TV-G”, “TV-MA”, “TV-PG”, “TV-Y”, “TV-Y7”, “TV-Y7-FV” and “UR”.
“G” denotes TV shows and movies suitable for general audiences, while “PG” denotes those for which parental guidance is suggested. These are the only two levels this project focuses on.
Firstly, the original data set is nearly complete: only the user rating score variable, which is also the target variable, has missing values, and no outliers were observed in it. Then, two subsets without missing values are created, netflix_G_clean (57 observations) and netflix_PG_clean (119 observations). Finally, two samples of 50 randomly selected user rating scores each are created, G_sample and PG_sample.
The summary statistics of the user rating score for TV shows and movies suitable for general audiences (G_sample) are shown below.
| Min | Q1 | Median | Q3 | Max | Mean | SD | n |
|---|---|---|---|---|---|---|---|
| 55 | 64 | 79.5 | 93 | 96 | 77.04 | 15.19527 | 50 |
The summary statistics of the user rating score for TV shows and movies with parental guidance suggested (PG_sample) are shown below.
| Min | Q1 | Median | Q3 | Max | Mean | SD | n |
|---|---|---|---|---|---|---|---|
| 57 | 80 | 86 | 94 | 98 | 85.28 | 10.81767 | 50 |
According to the two tables, the average user rating score of G_sample (77.04) is lower than that of PG_sample (85.28). In other words, within these samples, TV shows and movies rated PG appear to be more popular among Netflix users.
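The columns of the two summary tables can be computed as in the sketch below. It is a Python illustration using the standard `statistics` module rather than the report's R code; the function name `summarise` and the six example scores are made up, and the `inclusive` quantile method is chosen on the assumption that it corresponds to the default quantile definition in R.

```python
import statistics

def summarise(scores):
    """Return the eight summary columns used in the tables above."""
    # Quartiles by linear interpolation between the closest ranks.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "Min": min(scores),
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Max": max(scores),
        "Mean": statistics.mean(scores),
        "SD": statistics.stdev(scores),  # sample SD, divisor n - 1
        "n": len(scores),
    }

# Hypothetical scores, not taken from the Netflix data set.
example = [55, 64, 79, 80, 93, 96]
summary = summarise(example)
for name, value in summary.items():
    print(name, value)
```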
The following two graphs show the sample distributions of user rating score.
As the graphs show, the distribution of the randomly selected G_sample of user rating scores does not fit a normal distribution. The distribution of PG_sample, in turn, is left-skewed. Thus, both randomly selected samples depart clearly from the normal distribution, so the next step is to transform the data in order to reduce the skewness.
After data transformation, G_sample appears to fit a normal distribution.
The distribution of PG_sample, on the other hand, is expected to become approximately normal as the sample size increases.
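Whether a transformation actually reduces skewness can be checked numerically. The report does not name the transformation it used; the sketch below assumes a natural-log transform, which is only a guess consistent with the transformed G_sample mean of about 4.32 in the test output. The scores are made up and `skewness` is a helper written for this example.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed standardised deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD, divisor n
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical left-skewed scores, not taken from the Netflix data set.
scores = [55, 60, 88, 90, 92, 93, 95, 96]
logged = [math.log(s) for s in scores]  # assumed transform (natural log)

print("skewness before:", round(skewness(scores), 3))
print("skewness after: ", round(skewness(logged), 3))
```

A value close to 0 indicates a roughly symmetric sample, so comparing the two printed numbers shows whether the chosen transform moved the data toward symmetry or away from it.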
Given the data preparation and exploration above, a two-sample t-test is used to compare the average user rating scores of the two independent samples, as both transformed samples appear to fit a normal distribution.
null hypothesis: the average user rating scores of these two samples are equal \[H_0: \mu_1 = \mu_2 \]
alternative hypothesis: the average user rating scores of these two samples are different \[H_A: \mu_1 \ne \mu_2\]
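the formula of the t-test is as follows, where \(m_A\) and \(m_B\) are the two sample means, \(n_A\) and \(n_B\) are the sample sizes and \(S^2\) is the pooled estimate of the common variance: \[t = \frac{m_A - m_B}{\sqrt{ \frac{S^2}{n_A} + \frac{S^2}{n_B} }}\]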
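As a worked example of the formula, the sketch below plugs in the means, standard deviations and sample sizes from the two summary tables. It is a Python illustration under two stated assumptions: \(S^2\) is the usual pooled variance, and the inputs are the untransformed scores, so the value is not the statistic of the transformed data that the report tests. The function name `pooled_t` is introduced for the example.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic with a pooled variance, plus its degrees of freedom."""
    df = n_a + n_b - 2
    # Pooled variance S^2: the two sample variances weighted by their df.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(pooled_var / n_a + pooled_var / n_b)
    return (mean_a - mean_b) / std_err, df

# Summary statistics of G_sample (A) and PG_sample (B) from the tables above.
t_stat, df = pooled_t(77.04, 15.19527, 50, 85.28, 10.81767, 50)
print(f"t = {t_stat:.2f}, df = {df}")
```

On these untransformed summaries the statistic is about -3.12 with 98 degrees of freedom; the negative sign reflects the lower mean of G_sample.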
Decision rules are as follows:
reject the null hypothesis if the p-value < 0.05 (significance level) or if the 95% confidence interval of the mean difference does not contain 0, the value stated by the null hypothesis
otherwise, fail to reject the null hypothesis
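The decision rules can be written as a small function. This is a Python sketch; the name `decide`, its parameters and the two sets of input values are hypothetical and are not results from this project.

```python
def decide(p_value, ci_low, ci_high, alpha=0.05, null_difference=0.0):
    """Apply the decision rules to a p-value and a confidence interval."""
    # The interval "captures" the null hypothesis when it contains the
    # hypothesised mean difference (0 for H0: mu_1 = mu_2).
    ci_excludes_null = not (ci_low <= null_difference <= ci_high)
    if p_value < alpha or ci_excludes_null:
        return "reject H0"
    return "fail to reject H0"

# Made-up inputs for illustration only.
print(decide(0.20, -1.5, 4.0))  # interval contains 0 and p >= 0.05
print(decide(0.01, 0.8, 5.2))   # interval excludes 0 and p < 0.05
```

For a two-sided test at the 5% level, the p-value rule and the 95% confidence interval rule lead to the same decision, so either one is sufficient in practice.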
Assumption check:
two independent sample means are compared, with unknown population variances
both samples have a size of 50 (larger than 30)
t-test result summary (the R output below is headed One Sample t-test and tests whether the mean of G_sample differs from 0):
##
## One Sample t-test
##
## data: G_sample
## t = 151.6, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.267346 4.382002
## sample estimates:
## mean of x
## 4.324674
normality is assumed because the sizes of G_sample and PG_sample are both larger than 30
p-value < 2.2e-16, so p < 0.05
Decision: reject the null hypothesis
The project found a statistically significant mean difference in user rating score between the two ratings, p < 2.2e-16.
One limitation of this project is the sample size; a larger sample would be preferable. In addition, only two ratings are examined. Future studies could investigate more ratings.
Finally, the results suggest that TV shows and movies rated PG are more popular than those rated G among Netflix users. Netflix might increase the number of PG shows to attract more potential subscribers.
Chase Willden 2017, Understand the rating distributions of Netflix shows, 1000 Netflix Shows data set, data file, Kaggle Dataset, viewed 16 May 2021, https://www.kaggle.com/chasewillden/netflix-shows
Netflix Help Center 2021, Maturity ratings and classifications on Netflix, viewed 16 May 2021, https://help.netflix.com/es/node/2064
Netflix Wikipedia 2021, Netflix, Wikipedia, viewed 16 May 2021, https://en.wikipedia.org/wiki/Netflix
Roy Morgan 2021, Subscription TV viewers soared to 17.3 million Australians during 2020: Netflix, Foxtel, Stan, Disney+ & Amazon Prime all increased viewership by at least 1.5 million, Findings, viewed 16 May 2021, http://www.roymorgan.com/findings/8606-subscription-pay-tv-services-september-2020-202101120113#:~:text=Netflix%20is%20by%20far%20Australia's,in%20an%20average%20four%20weeks.
StickPNG 2021, Netflix Logo, StickPNG, viewed 16 May 2021, https://www.stickpng.com/img/icons-logos-emojis/tech-companies/netflix-logo