Table of Contents
Abstract
Data Preprocessing
Conducting Analysis
Testing for exponential correlation
Determining correlation between Loves and other variables
Visualizing Data
Conclusion
References
Using the dataset provided by the Scratch community (Benjamin Mako Hill 2016), we find a statistically significant exponential relationship between view count and remix status, as well as a relationship between Loves and Views + Sprites. Finally, we visualize these relationships using ggplot2.
Start by loading the tidyverse, which bundles data science packages such as ggplot2 and dplyr, along with RColorBrewer, because pretty colors are good.
library(tidyverse)
library(RColorBrewer)
# Read the Scratch project data
df <- read.csv("250750views.csv")
# Keep only the columns used in the analysis
dataset <- dplyr::select(df, "is_remixed", "sprites_website",
                         "scripts_website", "viewers_website",
                         "lovers_website")
# Give the columns shorter names
names(dataset) <- c("remix.status", "sprites", "scripts",
                    "views", "loves")
First, we regress Remix status on all of the other variables to see which of them are related to it.
regressor <- lm(formula = remix.status ~., data = dataset)
summary(regressor)
Call:
lm(formula = remix.status ~ ., data = dataset)
Residuals:
Min 1Q Median 3Q Max
-0.7330 -0.4663 0.2772 0.5081 0.5921
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.146e-01 9.726e-03 32.350 <2e-16 ***
sprites 4.390e-05 1.610e-04 0.273 0.7852
scripts -3.999e-06 1.960e-06 -2.040 0.0413 *
views 4.137e-04 2.209e-05 18.730 <2e-16 ***
loves 7.865e-05 3.486e-04 0.226 0.8215
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4938 on 18210 degrees of freedom
(4102 observations deleted due to missingness)
Multiple R-squared: 0.02483, Adjusted R-squared: 0.02462
F-statistic: 115.9 on 4 and 18210 DF, p-value: < 2.2e-16
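A side note on the line "(4102 observations deleted due to missingness)": lm() drops any row that has a missing value in one of the model's variables. A quick way to see where the gaps are is to count missing values per column. The sketch below uses a small made-up data frame (`toy` and its values are illustrative, not taken from the Scratch data):

```r
# Made-up example data with one missing value in each column.
toy <- data.frame(
  remix.status = c(1, 0, NA, 1),
  sprites      = c(3, NA, 5, 2),
  views        = c(300, 450, 600, NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows with no missing values at all; these are the rows
# a model using every column would actually be fitted on.
complete_rows <- sum(complete.cases(toy))

print(na_per_column)
print(complete_rows)
```

Running the same two calls on `dataset` would show which columns account for the dropped rows.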
In the summary of the full model, two variables (scripts and views) have Pr(>|t|) values below 0.05. If we choose 0.05 as the significance level, we can drop the non-significant variables one at a time, re-run the linear regression, and see if we get better results.
We will remove the Loves variable, which has the largest \(p\)-value, and rerun. (The Loves variable could potentially be masking the Scripts variable, so we keep Scripts in the next test.)
regressor2 <- lm(formula = remix.status ~ views + sprites + scripts,
data = dataset)
summary(regressor2)
Call:
lm(formula = remix.status ~ views + sprites + scripts, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-0.7530 -0.4579 -0.4096 0.5161 0.5917
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.032e-01 9.497e-03 31.924 <2e-16 ***
views 4.182e-04 1.912e-05 21.870 <2e-16 ***
sprites 1.877e-04 1.597e-04 1.175 0.2399
scripts -4.079e-06 1.960e-06 -2.082 0.0374 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4937 on 18956 degrees of freedom
(3357 observations deleted due to missingness)
Multiple R-squared: 0.025, Adjusted R-squared: 0.02485
F-statistic: 162 on 3 and 18956 DF, p-value: < 2.2e-16
We can conclude that the Sprites variable is not statistically significant, so we remove it and run the test again.
regressor3 <- lm(formula = remix.status ~ views + scripts, data = dataset)
summary(regressor3)
Call:
lm(formula = remix.status ~ views + scripts, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-0.7066 -0.4446 -0.3966 0.5287 0.6571
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.896e-01 8.798e-03 32.915 <2e-16 ***
views 4.180e-04 1.803e-05 23.189 <2e-16 ***
scripts -3.458e-06 1.941e-06 -1.782 0.0748 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4934 on 21432 degrees of freedom
(882 observations deleted due to missingness)
Multiple R-squared: 0.0246, Adjusted R-squared: 0.02451
F-statistic: 270.3 on 2 and 21432 DF, p-value: < 2.2e-16
Scripts now has a \(p\)-value of 0.0748, above our 0.05 threshold, so the only variable with statistical significance is Views. We will perform the regression once more for our final model.
regressor4 <- lm(formula = remix.status ~ views, data = dataset)
summary(regressor4)
Call:
lm(formula = remix.status ~ views, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-0.7042 -0.4384 -0.3895 0.5331 0.6135
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.798e-01 8.613e-03 32.48 <2e-16 ***
views 4.253e-04 1.769e-05 24.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4928 on 22315 degrees of freedom
Multiple R-squared: 0.02526, Adjusted R-squared: 0.02521
F-statistic: 578.2 on 1 and 22315 DF, p-value: < 2.2e-16
The \(p\)-value for Views is far below 0.001, so the relationship can be deemed statistically significant. However, the \(R^{2}\) and \(R^{2}_{\textrm{adjusted}}\) values are close to 0 (about 0.025), meaning that a straight line explains only about 2.5% of the variation in Remix status. The low \(p\)-value shows that the two variables are related, but the low \(R^{2}\) suggests the relationship is not well described as linear.
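The drop-one-variable-and-refit steps above can also be written as a loop. The following is a rough sketch of that procedure, not the code used for this analysis: `backward_eliminate` is a hypothetical helper, the `toy` data are simulated, and it assumes numeric predictors so that coefficient names match column names.

```r
# Simulated stand-in data: only `views` truly drives the response.
set.seed(42)
n <- 500
toy <- data.frame(
  views   = runif(n, 250, 750),
  sprites = rpois(n, 10),
  scripts = rpois(n, 40)
)
toy$remix.status <- rbinom(n, 1, 0.1 + 0.001 * toy$views)

# Hypothetical helper: refit, drop the predictor with the largest p-value,
# and stop once every remaining predictor is below alpha
# (or only one predictor is left).
backward_eliminate <- function(response, data, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values for the predictors only (row 1 is the intercept)
    pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])
    worst <- names(which.max(pvals))
    if (pvals[[worst]] < alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, worst)
  }
  fit
}

final_fit <- backward_eliminate("remix.status", toy)
kept <- names(coef(final_fit))[-1]
print(kept)
```

One caveat: each lm() call drops rows with missing values in the variables it uses, so, as in the outputs above, successive models may be fitted to different numbers of rows.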
Below we take the Remix Status variable and sum it for each distinct view count, giving the total number of remixed projects per number of views. We then take the log of these Remix Totals and regress it on Views.
data.log <- aggregate(dataset$remix.status, by=list(dataset$views), FUN=sum)
names(data.log) <- c("views", "remix.totals")
log.reg <- lm(formula = log(remix.totals) ~ views, data = data.log)
summary(log.reg)
Call:
lm(formula = log(remix.totals) ~ views, data = data.log)
Residuals:
Min 1Q Median 3Q Max
-2.1150 -0.1776 0.0601 0.2325 1.1970
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.343e+00 4.395e-02 98.81 <2e-16 ***
views -3.196e-03 6.652e-05 -48.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3928 on 746 degrees of freedom
Multiple R-squared: 0.7558, Adjusted R-squared: 0.7555
F-statistic: 2309 on 1 and 746 DF, p-value: < 2.2e-16
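Our \(p\)-value is still far below 0.001, and \(R^{2}\) has risen to about 0.76, much closer to 1. A linear fit on the log scale corresponds to an exponential relationship on the original scale, which leads us to the conclusion that the variables are related exponentially.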
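Because the model is fitted on the log scale, its coefficients can be exponentiated to read them on the original scale. The sketch below plugs in the two estimates reported in the summary of log.reg; `predict_total` is a hypothetical helper name, and the intercept describes views = 0, which may lie outside the observed range of the data.

```r
# Estimates copied from the summary(log.reg) output.
intercept <- 4.343
slope <- -3.196e-03

# log(remix.totals) = intercept + slope * views
# implies remix.totals = exp(intercept) * exp(slope * views).
baseline <- exp(intercept)           # modelled total at views = 0
factor_per_100 <- exp(100 * slope)   # multiplicative change per 100 extra views

# Hypothetical helper: modelled remix total at a given view count.
predict_total <- function(views) baseline * exp(slope * views)

print(baseline)
print(factor_per_100)
print(predict_total(500))
```

With these estimates, `exp(100 * slope)` is about 0.73, so the modelled remix total falls by roughly 27% for every additional 100 views.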
We can run the same tests as before to determine which variables are statistically significant predictors of Loves.
love.reg <- lm(formula = loves ~., data = dataset)
love.reg2 <- lm(formula = loves ~ views + sprites + scripts,
data = dataset)
love.reg3 <- lm(formula = loves ~ views + sprites,
data = dataset)
summary(love.reg3)
Call:
lm(formula = loves ~ views + sprites, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-36.17 -5.92 -1.94 3.91 400.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.6110280 0.2179760 -7.391 1.52e-13 ***
views 0.0306892 0.0004368 70.254 < 2e-16 ***
sprites 0.0365094 0.0035985 10.146 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 11.23 on 18678 degrees of freedom
(3636 observations deleted due to missingness)
Multiple R-squared: 0.2147, Adjusted R-squared: 0.2146
F-statistic: 2553 on 2 and 18678 DF, p-value: < 2.2e-16
This shows that both Views and Sprites are statistically significant predictors of Loves, although with an \(R^{2}\) of about 0.21 they explain only around a fifth of its variation.
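To make the coefficients concrete, the fitted model can be written out and evaluated at an example point. The project with 500 views and 10 sprites is a made-up illustration, and the coefficients are rounded from the output above:

\[
\widehat{\textrm{loves}} = -1.611 + 0.0307 \cdot \textrm{views} + 0.0365 \cdot \textrm{sprites}
\]

\[
\widehat{\textrm{loves}} \approx -1.611 + 0.0307 \cdot 500 + 0.0365 \cdot 10 = -1.611 + 15.35 + 0.365 \approx 14.1
\]

Read this way, each additional 100 views is associated with about 3 more loves when the number of sprites is held fixed.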
Finally, we shall use ggplot2 to visualize the data that we have looked at so far before importing the CSV into Tableau.
# factor() makes the fill discrete, as scale_fill_brewer() requires
ggplot(data = dataset, aes(x = views, fill = factor(remix.status))) +
  geom_histogram(color = "black", alpha = 0.4) +
  ggtitle("Views") + xlab("Views") + ylab("Counts") +
  scale_fill_brewer(palette = "Paired", name = "Remix status") +
  theme(line = element_line(size = 1))
ggplot(data = data.log, aes(x = views, y = log(remix.totals),
color = I("Blue"))) +
geom_point() + geom_smooth(color = I("Green")) +
ggtitle("Log Remix Totals vs. Views") + xlab("Views") +
ylab("Log Remix")
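By default, geom_smooth() draws a flexible smoother rather than the straight line fitted by lm(). To overlay the log-linear fit itself, one option is `method = "lm"`. A minimal sketch, using simulated stand-in data (`toy.log` is illustrative, not the real `data.log`):

```r
library(ggplot2)

# Simulated stand-in for data.log: totals that decay exponentially with views.
set.seed(1)
toy.log <- data.frame(views = 250:750)
toy.log$remix.totals <- exp(4.3 - 0.003 * toy.log$views +
                              rnorm(nrow(toy.log), sd = 0.3))

# Points on the log scale plus the least-squares line and its confidence band.
p <- ggplot(toy.log, aes(x = views, y = log(remix.totals))) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, color = "green") +
  ggtitle("Log Remix Totals vs. Views") +
  xlab("Views") + ylab("Log Remix")

# Call print(p) to display the plot.
```

If the points follow the straight line closely on this log scale, that is visual support for an exponential relationship on the original scale.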
The Scratch data shows an exponential relationship between Remix status and Views. We also learned that there is a relationship between Loves and Views + Sprites. These findings will make our analysis with Tableau much more meaningful.
Feel free to check out the source R code on my GitHub.
Benjamin Mako Hill, Andrés Monroy-Hernández. 2016. “Archival Dataset: A Longitudinal Dataset of Five Years of Public Activity in the Scratch Online Community.” https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KFT8EZ.