Determine whether the mean rating of free apps differs from the mean rating of paid apps in the Google Play Store. Then, briefly examine the average effect of an app being free on its rating.
library(readxl)  # read_excel()
library(dplyr)   # filter(), select(), %>%
googleplaystore <- read_excel("googleplaystore.xlsx",
col_types = c("text", "text", "numeric",
"numeric", "text", "text", "text",
"numeric", "text", "text", "numeric",
"text", "text"), na = "NaN")
paid <- filter(googleplaystore, Type == "Paid") %>%
select(Rating) %>%
na.omit()
free <- filter(googleplaystore, Type == "Free") %>%
select(Rating) %>%
na.omit()
We will compute the required sample size for each group and take the larger of the two as the minimum. For simplicity, we set the bound on the error of estimation to 0.2 rating points.
nrow(paid)
## [1] 647
summary(paid)
## Rating
## Min. :1.000
## 1st Qu.:4.100
## Median :4.400
## Mean :4.267
## 3rd Qu.:4.600
## Max. :5.000
sd(paid$Rating)
## [1] 0.5475231
nrow(free)
## [1] 8720
summary(free)
## Rating
## Min. :1.000
## 1st Qu.:4.000
## Median :4.300
## Mean :4.186
## 3rd Qu.:4.500
## Max. :5.000
sd(free$Rating)
## [1] 0.5128933
# D = B^2 / 4, where B = 0.2 is the bound of error
D <- ((0.2)^2 / 4)
N1 <- nrow(paid)  # 647
N2 <- nrow(free)  # 8720
sd1 <- sd(paid$Rating)
sd2 <- sd(free$Rating)
n1 <- ((N1 * (sd1^2)) / (((N1-1) * D) + (sd1^2)))
n2 <- ((N2 * (sd2^2)) / (((N2-1) * D) + (sd2^2)))
n1
## [1] 28.69304
n2
## [1] 26.22984
Both groups already contain far more than the 29 rows required, so the mean of each group should be estimated to within the 0.2 bound of error.
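For reference, the sample size calculation in the code above can be written out as a formula. This is a transcription of what the code computes; reading D = B²/4 as a bound of about two standard errors (roughly 95% confidence) is an assumption inferred from the code, not something the text states.

```latex
n = \frac{N \sigma^2}{(N - 1) D + \sigma^2},
\qquad D = \frac{B^2}{4}, \quad B = 0.2 \;\Rightarrow\; D = 0.01

% Paid apps: N_1 = 647, s_1 = 0.5475231, s_1^2 \approx 0.29978
n_1 \approx \frac{647 \times 0.29978}{646 \times 0.01 + 0.29978}
    \approx \frac{193.96}{6.7598} \approx 28.69

% Free apps: N_2 = 8720, s_2 = 0.5128933, s_2^2 \approx 0.26306
n_2 \approx \frac{8720 \times 0.26306}{8719 \times 0.01 + 0.26306}
    \approx \frac{2293.88}{87.453} \approx 26.23
```

Here the sample standard deviations stand in for the unknown population values, and rounding the larger result up gives the 29 rows used as the minimum.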
We are performing a two-sample unpaired t-test, so the sample sizes do not have to be equal. However, the free group (8,720 ratings) is more than 13 times the size of the paid group (647 ratings). To balance them, we will randomly draw 500 ratings from each group, which amounts to stratified random sampling with app type as the stratum.
set.seed(100)
paid1 <- sample(paid$Rating, 500)
free1 <- sample(free$Rating, 500)
var.test(free1, paid1, alternative = "two.sided")
##
## F test to compare two variances
##
## data: free1 and paid1
## F = 0.92223, num df = 499, denom df = 499, p-value = 0.3661
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7736429 1.0993472
## sample estimates:
## ratio of variances
## 0.9222267
Since the p-value (0.3661) is well above 0.05, the test fails to reject the null hypothesis of equal variances. We therefore have no evidence against equal variances and proceed with the pooled-variance t-test.
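To make the F test less of a black box, here is a minimal sketch of what var.test() computes for the two-sided case. It runs on simulated ratings, not the Play Store data; the vectors x and y and their parameters are made up for illustration.

```r
# Simulated ratings, for illustration only (not the Play Store data).
set.seed(1)
x <- rnorm(500, mean = 4.2, sd = 0.52)
y <- rnorm(500, mean = 4.3, sd = 0.55)

# F statistic: ratio of the two sample variances.
f_stat <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller tail probability under F(df1, df2).
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

# Compare with the built-in test.
builtin <- var.test(x, y, alternative = "two.sided")
c(manual_F = f_stat, builtin_F = unname(builtin$statistic))
c(manual_p = p_value, builtin_p = builtin$p.value)
```

The manual and built-in values should agree. Note that the F test is sensitive to non-normality, so with bounded, skewed ratings its result is best treated as a rough guide.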
t.test(free1, paid1, alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: free1 and paid1
## t = -1.9596, df = 998, p-value = 0.05032
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.348944e-01 9.442805e-05
## sample estimates:
## mean of x mean of y
## 4.1952 4.2626
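At the 10% significance level the test rejects the null hypothesis, but at the 5% level it narrowly fails to reject it: the p-value (0.05032) is just above 0.05, and the 95% confidence interval just barely includes 0.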
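The link between the p-value and the confidence interval can be checked by computing the pooled-variance t-test by hand. The sketch below uses simulated ratings (the vectors x and y are hypothetical, not the samples drawn above) and compares the manual result with t.test().

```r
# Simulated ratings, for illustration only (not the Play Store data).
set.seed(2)
x <- rnorm(500, mean = 4.2, sd = 0.52)
y <- rnorm(500, mean = 4.3, sd = 0.55)
n_x <- length(x)
n_y <- length(y)

# Pooled variance and standard error of the difference in means.
sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)
se <- sqrt(sp2 * (1 / n_x + 1 / n_y))

# t statistic, degrees of freedom, and two-sided p-value.
t_stat <- (mean(x) - mean(y)) / se
df <- n_x + n_y - 2
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the difference in means.
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

# Rejecting at the 5% level is equivalent to the 95% CI excluding 0.
rejects_at_5pct <- p_value < 0.05
excludes_zero <- ci[1] > 0 || ci[2] < 0

builtin <- t.test(x, y, alternative = "two.sided", var.equal = TRUE)
```

Because the two decisions are equivalent, a p-value just above 0.05 goes together with a 95% interval that just reaches 0, which is the pattern in the output above.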
The t-test leaves it unclear whether the means really differ. To go a little further, let's examine the effect of an app being free in the Google Play Store by fitting a factor effects model.
# type: 0 = paid, 1 = free
ratingDF <- data.frame(rating = c(paid1, free1), type = c(rep(0, 500), rep(1, 500)))
ratingGLM <- lm(rating ~ type, data = ratingDF)
summary(ratingGLM)
##
## Call:
## lm(formula = rating ~ type, data = ratingDF)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2626 -0.1952 0.1048 0.3374 0.8048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.26260 0.02432 175.27 <2e-16 ***
## type -0.06740 0.03439 -1.96 0.0503 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5438 on 998 degrees of freedom
## Multiple R-squared: 0.003833, Adjusted R-squared: 0.002835
## F-statistic: 3.84 on 1 and 998 DF, p-value: 0.05032
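Without yet checking the model assumptions, we can see that the estimated coefficient is small (-0.067) and that type explains less than 0.4% of the variance in rating (R-squared = 0.0038), so the model has little practical use. The negative sign indicates that free apps are rated slightly lower on average than paid apps, but the coefficient is only marginally significant (p = 0.0503), and it describes an association rather than showing that being free causes lower ratings.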
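With a single 0/1 indicator, this regression restates the two-sample comparison: the intercept is the mean of the group coded 0, the slope is the difference in group means, and the slope's p-value matches the pooled t-test. The output above shows the same pattern (intercept 4.2626, slope 4.1952 - 4.2626 = -0.0674, p = 0.0503). A minimal sketch on simulated data, where paid_sim and free_sim are made-up vectors for illustration:

```r
# Simulated ratings, for illustration only (not the Play Store data).
set.seed(3)
paid_sim <- rnorm(500, mean = 4.3, sd = 0.55)
free_sim <- rnorm(500, mean = 4.2, sd = 0.52)

# type: 0 = paid, 1 = free (same coding as ratingDF).
sim <- data.frame(rating = c(paid_sim, free_sim),
                  type = rep(c(0, 1), each = 500))
fit <- lm(rating ~ type, data = sim)

# Intercept equals the paid mean; slope equals (free mean - paid mean).
coef(fit)
mean(paid_sim)
mean(free_sim) - mean(paid_sim)

# The slope's p-value equals the pooled two-sample t-test p-value.
p_lm <- summary(fit)$coefficients["type", "Pr(>|t|)"]
p_t <- t.test(free_sim, paid_sim, var.equal = TRUE)$p.value
c(lm = p_lm, t_test = p_t)
```

This is why the model adds no information beyond the t-test until other predictors are included.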
We have found borderline evidence of a difference in mean rating between free and paid apps; the effect of being a free app needs further analysis. Given this weak evidence, a natural next step is to build a more sophisticated model that includes other variables, in order to draw inferences and generate insights about ratings in the Google Play Store.