0. Objective

Determine whether the mean rating of free apps differs from the mean rating of paid apps in the Google Play Store. Then, briefly examine the average effect of an app being free on its rating.

1. Prep the Data

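library(readxl)  # read_excel()
library(dplyr)   # filter(), select(), %>%

# Read the spreadsheet, treating "NaN" cells as missing values
googleplaystore <- read_excel("googleplaystore.xlsx", 
    col_types = c("text", "text", "numeric", 
        "numeric", "text", "text", "text", 
        "numeric", "text", "text", "numeric", 
        "text", "text"), na = "NaN")
# Keep only the non-missing ratings of each app type
paid <- filter(googleplaystore, Type == "Paid") %>% 
  select(Rating) %>% 
  na.omit()
free <- filter(googleplaystore, Type == "Free") %>% 
  select(Rating) %>% 
  na.omit()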

2. Determine the Stratum Size

We will compute the required sample size for each group (paid and free) and use the larger one. For simplicity, we set the bound on the error of estimation to 0.2.

a. Descriptive Statistics

nrow(paid)
## [1] 647
summary(paid)
##      Rating     
##  Min.   :1.000  
##  1st Qu.:4.100  
##  Median :4.400  
##  Mean   :4.267  
##  3rd Qu.:4.600  
##  Max.   :5.000
sd(paid$Rating)
## [1] 0.5475231
nrow(free)
## [1] 8720
summary(free)
##      Rating     
##  Min.   :1.000  
##  1st Qu.:4.000  
##  Median :4.300  
##  Mean   :4.186  
##  3rd Qu.:4.500  
##  Max.   :5.000
sd(free$Rating)
## [1] 0.5128933

b. Required Sample Size Calculation

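D <- ((0.2)^2 / 4)       # D = B^2 / 4, with bound on the error B = 0.2
N1 <- 647                # number of paid apps with a rating
N2 <- 8720               # number of free apps with a rating
sd1 <- sd(paid$Rating)
sd2 <- sd(free$Rating)

# Required sample size for each group: n = N * sd^2 / ((N - 1) * D + sd^2)
n1 <- ((N1 * (sd1^2)) / (((N1-1) * D) + (sd1^2)))
n2 <- ((N2 * (sd2^2)) / (((N2-1) * D) + (sd2^2)))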
n1
## [1] 28.69304
n2
## [1] 26.22984

The larger requirement, rounded up, is 29 rows. Both groups contain far more rows than that (647 paid and 8720 free), so a sample from either group can estimate its mean rating within the 0.2 bound.
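
The calculation above can be wrapped in a small helper so that the same formula is applied to both groups. This is only a minimal sketch: the function name `required_n` and its argument names are hypothetical (they do not appear in the original analysis), and the standard deviations are hard-coded from the output in section 2a rather than recomputed from the data.

```r
# Hypothetical helper (not part of the original analysis): sample size needed
# to estimate the mean of a finite population of size N within bound B,
# given an estimate sd of the standard deviation.
required_n <- function(N, sd, B) {
  D <- B^2 / 4                          # same D as in the calculation above
  (N * sd^2) / ((N - 1) * D + sd^2)
}

# Standard deviations copied from the descriptive statistics in section 2a
n_paid <- required_n(N = 647,  sd = 0.5475231, B = 0.2)
n_free <- required_n(N = 8720, sd = 0.5128933, B = 0.2)

# A sample size must be a whole number, so round up
ceiling(c(paid = n_paid, free = n_free))
```

Rounding up gives 29 for the paid group and 27 for the free group, which is where the figure of 29 rows comes from.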

c. Decision

We will perform a two-sample unpaired t-test, so the sample sizes do not have to be equal. However, the free group has more than 13 times as many rows as the paid group (8720 vs. 647). To balance the two, we randomly choose 500 ratings from each group, which makes this a stratified random sample.

d. Select the samples

set.seed(100)
paid1 <- sample(paid$Rating, 500)
free1 <- sample(free$Rating, 500)

3. Equal Variance Test

var.test(free1, paid1, alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  free1 and paid1
## F = 0.92223, num df = 499, denom df = 499, p-value = 0.3661
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7736429 1.0993472
## sample estimates:
## ratio of variances 
##          0.9222267

Since the p-value (0.3661) is larger than 0.05, the test fails to reject the null hypothesis of equal variances. Therefore, we proceed under the assumption that the two groups have equal variances.
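
To make the reported p-value and confidence interval less of a black box, they can be recomputed from the F statistic and degrees of freedom alone. The sketch below copies those three numbers from the `var.test()` output above; the variable names are illustrative, and the two-sided p-value is taken as twice the smaller tail probability of the F distribution.

```r
# Values copied from the var.test() output above
f_stat <- 0.9222267   # ratio of sample variances, var(free1) / var(paid1)
df1    <- 499         # numerator degrees of freedom
df2    <- 499         # denominator degrees of freedom

# Two-sided p-value: twice the smaller tail probability
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

# 95% confidence interval for the true ratio of variances
ci <- c(f_stat / qf(0.975, df1, df2),
        f_stat / qf(0.025, df1, df2))

round(p_value, 4)
round(ci, 4)
```

The interval contains 1, which is the same conclusion as the p-value: the data are compatible with equal variances.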

4. Two-Sample Unpaired t-test with Equal Variance

t.test(free1, paid1, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  free1 and paid1
## t = -1.9596, df = 998, p-value = 0.05032
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.348944e-01  9.442805e-05
## sample estimates:
## mean of x mean of y 
##    4.1952    4.2626

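At the 10% significance level (90% confidence), the test rejects the null hypothesis of equal means. At the 5% significance level (95% confidence), it narrowly fails to reject: the p-value (0.0503) is just above 0.05, and the 95% confidence interval for the difference barely includes 0.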
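
Because the result sits so close to the 5% threshold, it may help to see how the test statistic is assembled. The following sketch rebuilds it from summary numbers printed in this report: the two sample means come from the t-test output above, and the pooled standard deviation is taken from the residual standard error (0.5438) reported by the model fit in section 5, which equals the pooled standard deviation for a two-group model. All inputs are rounded, so the results agree with the reported ones only to about three or four digits.

```r
# Summary values copied from the outputs in this report (rounded)
mean_free <- 4.1952
mean_paid <- 4.2626
n         <- 500      # ratings sampled per group
pooled_sd <- 0.5438   # residual standard error from section 5

# Standard error of the difference in means under equal variances
se <- pooled_sd * sqrt(1 / n + 1 / n)

# Test statistic, degrees of freedom, and two-sided p-value
t_stat  <- (mean_free - mean_paid) / se
df      <- 2 * n - 2
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for mean(free) - mean(paid)
ci <- (mean_free - mean_paid) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, p = p_value), 4)
ci
```

The p-value lands just above 0.05 and the interval just crosses 0, so the 5% decision depends on differences in the fourth decimal place.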

5. GLM Model

The t-test left it unclear whether the means differ. To go a little further, let’s examine the effect of an app being free in the Google Play Store by fitting a factor effects model.

a. Construct the data table

# type is a 0/1 indicator: 0 = paid, 1 = free
ratingDF <- data.frame(rating = c(paid1, free1), type = rep(c(0, 1), each = 500))

b. Fit GLM

ratingGLM <- lm(rating ~ type, data = ratingDF)
summary(ratingGLM)
## 
## Call:
## lm(formula = rating ~ type, data = ratingDF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2626 -0.1952  0.1048  0.3374  0.8048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.26260    0.02432  175.27   <2e-16 ***
## type        -0.06740    0.03439   -1.96   0.0503 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5438 on 998 degrees of freedom
## Multiple R-squared:  0.003833,   Adjusted R-squared:  0.002835 
## F-statistic:  3.84 on 1 and 998 DF,  p-value: 0.05032

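Even without checking the assumptions of the GLM, we can see that the estimate may not be practically useful since it is very small (-0.067 on a 1-to-5 rating scale). The negative sign means that free apps are rated slightly lower than paid apps on average in this sample. However, the coefficient is not significant at the 5% level (p = 0.0503), and the model shows an association, not proof that being free reduces the rating.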
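
The t value and p-value of the `type` coefficient are the same as in the t-test of section 4. That is expected: a linear model with a single 0/1 indicator is the equal-variance two-sample t-test in another form. The sketch below illustrates this on synthetic data, not on the Play Store ratings. The seed, the means, and the standard deviations are arbitrary assumptions chosen only to resemble the scale of the real data.

```r
set.seed(1)  # arbitrary seed, only to make this illustration reproducible

# Synthetic ratings (NOT the Play Store data): two groups of 500 values
paid_sim <- rnorm(500, mean = 4.26, sd = 0.55)
free_sim <- rnorm(500, mean = 4.19, sd = 0.51)

sim_df <- data.frame(
  rating = c(paid_sim, free_sim),
  type   = rep(c(0, 1), each = 500)   # 0 = paid, 1 = free
)

tt  <- t.test(free_sim, paid_sim, var.equal = TRUE)
fit <- summary(lm(rating ~ type, data = sim_df))

# The slope is the difference in group means (free minus paid)
slope <- fit$coefficients["type", "Estimate"]
c(slope = slope, mean_diff = mean(free_sim) - mean(paid_sim))

# Its t statistic and p-value match the pooled t-test
c(t_test = unname(tt$statistic), lm = fit$coefficients["type", "t value"])
c(t_test = tt$p.value,           lm = fit$coefficients["type", "Pr(>|t|)"])
```

So the model with only `type` adds no information beyond the t-test; the linear model becomes more informative once other variables are included.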

6. Conclusions

We found weak evidence of a difference between the mean ratings of free and paid apps: the difference is significant at the 10% level but narrowly misses significance at the 5% level. The effect of an app being free therefore needs further analysis. As a next step, we may construct a more sophisticated GLM that includes other variables, in order to build an inference model and generate insights into Google Play Store ratings.