MATH1324 Introduction to Statistics Assignment 2

The Application of Hypothesis Testing for the Rating of Netflix Shows

Ke Liu (S3871535)

24 May, 2021

RPubs link information

The RPubs link is as follows:

https://rpubs.com/kekyleliu/AAassignment2

Introduction

Netflix was the largest entertainment and media company by market capitalization as of July 2020 (Netflix Wikipedia 2021).

According to Roy Morgan (2021), the well-known Australian market research company, “Netflix is by far Australia’s most watched subscription television service with 14,168,000 viewers in an average four weeks”.

Netflix Logo (StickPNG 2021)

Introduction Cont.

Hypothesis testing is an essential tool for data-driven decision-making in both academic and professional environments: it uses a sample to draw statistical inferences about a population.

In this project, the aims are to:

summarise the statistics of two independent samples of user rating scores, one for each of two Netflix TV and movie ratings.
apply hypothesis testing to compare which of the two ratings is more popular among users, based on their average user rating scores.
provide Netflix with a statistical reference for future launches in different classifications, based on the result of the hypothesis test.

Problem Statement

Netflix earns its profits mainly from the subscriptions of a large number of users. Users' opinions of different classifications of TV and movies, especially the user rating score, are assumed to have an impact on Netflix's future launches.

The user rating score may also indicate the popularity of different TV and movies among users, although some rating scores can be biased or subjective.

Therefore, the main focus of this project is two ratings of TV and movies, G (suitable for general audiences) and PG (parental guidance suggested) (Netflix Help Center 2021).

A sample of 50 user rating scores is drawn for each of these two ratings. The average user rating score is the key feature used to investigate which classification of TV and movies is, to some extent, more popular among users.

Data

The data set used for testing contains the rating distributions of 1000 Netflix shows. It was downloaded from the open data website Kaggle (https://www.kaggle.com/chasewillden/netflix-shows) and was collected by Chase Willden (2017) under the CC0: Public Domain license.

Screenshot of Data Source (Chase Willden 2017)

The sampling method used in this project is simple random sampling, so that every user rating score within a rating has an equal probability of being selected. The selection process is as follows:

loading the data set and cleaning it of missing values
creating two subsets, one for each rating (G and PG)
randomly selecting 50 observations from each subset separately
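
The selection steps above can be sketched in Python using only the standard library. This is a minimal illustration, not the code used in the project (the output later in the document suggests the analysis was run in R). The `rows` list, the helper names `clean_scores` and `draw_sample`, and the seed are all hypothetical; the real data set has 1000 rows and the project draws 50 scores per rating.

```python
import random

# Hypothetical rows standing in for the Kaggle file (the real file has 1000 rows).
rows = [
    {"rating": "G", "user rating score": "81"},
    {"rating": "G", "user rating score": "NA"},   # missing score
    {"rating": "PG", "user rating score": "93"},
    {"rating": "PG", "user rating score": "70"},
    {"rating": "TV-MA", "user rating score": "88"},
    {"rating": "G", "user rating score": "64"},
]

def clean_scores(rows, rating):
    """Return the non-missing user rating scores for one rating level."""
    return [
        float(r["user rating score"])
        for r in rows
        if r["rating"] == rating
        and r["user rating score"].strip() not in ("", "NA")
    ]

def draw_sample(scores, k, seed=1324):
    """Simple random sample without replacement; the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return rng.sample(scores, k)

g_clean = clean_scores(rows, "G")      # step 2: subset for G, missing values dropped
pg_clean = clean_scores(rows, "PG")    # step 2: subset for PG
g_sample = draw_sample(g_clean, 2)     # step 3: the project would use k=50
pg_sample = draw_sample(pg_clean, 2)
```

Sampling without replacement from each cleaned subset gives every score in that subset the same chance of being selected, which is the property the text relies on.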

Data Cont.

The data set contains 1000 observations of 7 variables. The specific information about all variables is shown in the table below.

Variable	Description	Data Type	Range
title	Name of show	Character	N/A
rating	Maturity ratings and classifications	Factor (including 13 levels)	N/A
ratingLevel	Rating criteria	Character	N/A
ratingDescription	N/A	Double (numeric)	10 - 124
release year	Release year of show	Double (numeric)	1940 - 2017
user rating score	User rating score, full score is 100	Double (numeric)	0 - 100
user rating size	The amount of user’s rating	Double (numeric)	80 - 82

Data Cont.

The rating variable is treated as a factor in this case, indicating the different classifications of TV and movies.

After the data type conversion, the rating factor includes 13 levels: “G”, “NR”, “PG”, “PG-13”, “R”, “TV-14”, “TV-G”, “TV-MA”, “TV-PG”, “TV-Y”, “TV-Y7”, “TV-Y7-FV” and “UR”.

“G” stands for TV and movies suitable for general audiences, while “PG” means shows for which parental guidance is suggested. These are the only two levels this project focuses on.

Descriptive Statistics and Visualisation

Firstly, the original data set is nearly complete: only the user rating score variable, which is also the target variable, has missing values, and no outliers were apparent on inspection. Then, two subsets without missing values are created, netflix_G_clean (57 observations) and netflix_PG_clean (119 observations). Finally, two samples of 50 randomly selected user rating scores each are drawn, one per rating, as G_sample and PG_sample.

The summary statistics of user rating score for TV and movies suitable for general audiences (G) are shown below.

Summary Statistics of User Rating Score for G
Min	Q1	Median	Q3	Max	Mean	SD	n
55	64	79.5	93	96	77.04	15.19527	50

The summary statistics of user rating score for shows with parental guidance suggested (PG) are displayed below.

Summary Statistics of User Rating Score for PG
Min	Q1	Median	Q3	Max	Mean	SD	n
57	80	86	94	98	85.28	10.81767	50
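
A summary with the same columns as the two tables above can be computed as in the following sketch. The function name `summarise` and the `demo_scores` list are hypothetical; the project's own samples are not reproduced here. The quartiles use `method="inclusive"`, which should correspond to the default interpolation of R's `quantile()`, although that correspondence is stated here as an assumption.

```python
import statistics

def summarise(scores):
    """Return Min, Q1, Median, Q3, Max, Mean, SD and n for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "Min": min(scores),
        "Q1": q1,
        "Median": median,
        "Q3": q3,
        "Max": max(scores),
        "Mean": statistics.mean(scores),
        "SD": statistics.stdev(scores),   # sample SD (n - 1 denominator)
        "n": len(scores),
    }

demo_scores = [55, 64, 79, 80, 93, 96]  # hypothetical scores, not the project data
summary = summarise(demo_scores)
```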

Descriptive Statistics Cont.

According to the two tables, the average user rating score of G_sample is lower than that of PG_sample. In other words, the samples suggest that TV and movies rated PG are more popular among Netflix users.

The following two graphs show the sample distributions of user rating score.

Descriptive Statistics Cont.

As the graphs show, the distribution of user rating scores in the randomly selected G_sample does not fit a normal distribution, and the distribution of PG_sample is left-skewed. Both samples therefore depart clearly from normality, so the next step is to transform the data in order to reduce the skewness.

Data Transformation

After data transformation, the G_sample appears to fit a normal distribution.

Data Transformation Cont.

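On the other hand, the transformed PG_sample is less clearly normal. Its sample mean can nevertheless be treated as approximately normally distributed, since the sample size (n = 50) is large enough for the central limit theorem to apply.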
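
The effect of a transformation on skewness can be checked numerically as in the sketch below. The text does not name the transformation that was applied. A natural logarithm is assumed here only because the transformed mean of about 4.32 in the later test output would be consistent with it. The helper names and the `demo` values are hypothetical.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores (population SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def log_transform(values):
    """Natural-log transform; an assumed choice, not confirmed by the source."""
    return [math.log(v) for v in values]

demo = [57, 80, 86, 94, 98]  # hypothetical left-skewed scores
before = skewness(demo)
after = skewness(log_transform(demo))
```

A negative value indicates left skew and a positive value right skew; comparing `before` and `after` shows whether a given transformation actually moves the sample closer to symmetry.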

Hypothesis Testing

Given the data preparation and exploration above, a two-sample t-test is used to compare the average user rating scores of the two independent samples, since both transformed samples are treated as approximately normal.

null hypothesis: the average user rating scores of the two ratings are equal \[H_0: \mu_1 = \mu_2 \]
alternative hypothesis: the average user rating scores of the two ratings are different \[H_A: \mu_1 \ne \mu_2\]
the formula of the t-test is as follows, where \(\bar{x}_1, \bar{x}_2\) are the sample means, \(n_1, n_2\) the sample sizes and \(S^2\) the pooled sample variance: \[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{ \frac{S^2}{n_1} + \frac{S^2}{n_2} }}\]
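
As a worked illustration of the formula, the sketch below evaluates the pooled-variance t statistic from the summary statistics reported in the two tables earlier (means 77.04 and 85.28, standard deviations 15.19527 and 10.81767, 50 observations each). Those figures describe the untransformed scores, so this is an assumption-laden cross-check and not a reproduction of the test on the transformed samples. The function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic with a pooled variance, plus its degrees of freedom."""
    df = n_a + n_b - 2
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var / n_a + pooled_var / n_b)
    return (mean_a - mean_b) / standard_error, df

# Summary statistics of the untransformed G and PG samples, taken from the tables.
t_stat, df = pooled_t(77.04, 15.19527, 50, 85.28, 10.81767, 50)
```

With these inputs the statistic comes out at about -3.12 on 98 degrees of freedom; the sign is negative because the G mean is the lower of the two.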

Hypothesis Testing Cont.

Decision rules are as follows:

reject the null hypothesis if the p-value < 0.05 (significance level) or if the 95% confidence interval of the mean difference does not contain the hypothesised difference of 0
otherwise, fail to reject the null hypothesis
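
The decision rules above can be expressed as a small function. This is only a sketch: the name `decide`, its parameters and the example inputs are hypothetical and are not results from this project.

```python
def decide(p_value, ci, alpha=0.05, null_value=0.0):
    """Apply the decision rules: reject if p < alpha or the CI excludes null_value."""
    ci_low, ci_high = ci
    reject_by_p = p_value < alpha
    reject_by_ci = not (ci_low <= null_value <= ci_high)
    if reject_by_p or reject_by_ci:
        return "reject null hypothesis"
    return "fail to reject null hypothesis"

# Hypothetical inputs for illustration only.
decision = decide(p_value=0.03, ci=(-13.4, -3.1))
```

For a two-sided test whose confidence level matches the significance level (95% and 0.05 here), the two criteria normally agree, so either one can be checked.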

Assumption check:

the means of two independent samples are compared, with unknown population variances
the sizes of the two samples are both 50 (larger than 30)

Hypothesis Testing Cont.

Two-sample t-test result summary:

## 
##  One Sample t-test
## 
## data:  G_sample
## t = 151.6, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.267346 4.382002
## sample estimates:
## mean of x 
##  4.324674

normality of the sample means is assumed, as the sizes of G_sample and PG_sample are both larger than 30
p-value < 2.2e-16, so p < 0.05

Decision: reject null hypothesis

Discussion

The project found a statistically significant mean difference in user rating score between the two ratings, p < 2.2e-16.

A limitation of this project is the sample size; it would be better to increase it. In addition, only two ratings are considered. In future studies, more ratings can be investigated.

Finally, the results suggest that TV and movies rated PG are more popular than those rated G among Netflix users. Netflix might increase its number of PG shows to attract more potential subscribers.

References

Chase Willden 2017, Understand the rating distributions of Netflix shows, 1000 Netflix Shows data set, data file, Kaggle Dataset, viewed 16 May 2021, https://www.kaggle.com/chasewillden/netflix-shows

Netflix Help Center 2021, Maturity ratings and classifications on Netflix, viewed 16 May 2021, https://help.netflix.com/es/node/2064

Netflix Wikipedia 2021, Netflix, Wikipedia, viewed 16 May 2021, https://en.wikipedia.org/wiki/Netflix

Roy Morgan 2021, Subscription TV viewers soared to 17.3 million Australians during 2020: Netflix, Foxtel, Stan, Disney+ & Amazon Prime all increased viewership by at least 1.5 million, Findings, viewed 16 May 2021, http://www.roymorgan.com/findings/8606-subscription-pay-tv-services-september-2020-202101120113#:~:text=Netflix%20is%20by%20far%20Australia's,in%20an%20average%20four%20weeks.

StickPNG 2021, Netflix Logo, StickPNG, viewed 16 May 2021, https://www.stickpng.com/img/icons-logos-emojis/tech-companies/netflix-logo