Phase 1: Clean up data from the Franklin College Division of Natural Science Facebook Page.
Phase 2: Explore data and find the most interesting relationships.
Phase 3: Use simulation-based statistical inference to dive deeper into two of the most interesting relationships found during Phase 2.

library(readr)
library(fastR2)

LifetimeTalkingAboutThis <- read_csv("~/Prob&Stats/LifetimeTalkingAboutThis.csv", 
                                     col_types = cols(Description = col_skip(), 
                                                      Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

LifetimePostConsumersByType <- read_csv("~/Prob&Stats/LifetimePostConsumersByType.csv", 
                                        col_types = cols(Description = col_skip(), 
                                                         Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

LifetimeNegativeFeedback <- read_csv("~/Prob&Stats/LifetimeNegativeFeedback.csv", 
                                     col_types = cols(Description = col_skip(), 
                                                      Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

ScienceData <- merge(LifetimeTalkingAboutThis, LifetimePostConsumersByType)
ScienceData <- merge(ScienceData, LifetimeNegativeFeedback)

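#drop columns that are entirely NA
ScienceData <- ScienceData[, !apply(is.na(ScienceData), 2, all)]
#posts with no recorded type are labeled 'None'
ScienceData$Type[is.na(ScienceData$Type)] <- 'None'
#remaining NAs are missing counts; a numeric 0 (not the string '0') keeps count columns numeric
ScienceData[is.na(ScienceData)] <- 0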

#shift post times forward by 3 hours (3*60*60 seconds)
ScienceData$Posted <- ScienceData$Posted + 3*60*60

#break the post time into its components once
posted <- as.POSIXlt(ScienceData$Posted)
day <- weekdays(ScienceData$Posted)
#wday runs 0-6 with 0 = Sunday
weekend <- posted$wday %in% c(0, 6)
hour <- posted$hour
workHours <- (hour >= 8 & hour < 17)
#mon runs 0-11 with 0 = January
month <- posted$mon
term <- ifelse(month > 7, "Fall", ifelse(month > 4, "Summer", ifelse(month > 0, "Spring", "Winter")))
#positive activity: sum of every click type except hide clicks
Activity <- ScienceData$like + as.integer(ScienceData$comment) + as.integer(ScienceData$share) +
  ScienceData$`other clicks` + as.integer(ScienceData$`photo view`) +
  as.integer(ScienceData$`link clicks`) + as.integer(ScienceData$`video play`)

ScienceData <- data.frame(ScienceData, day, weekend, hour, workHours, term, Activity)
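
The derived date variables depend on POSIXlt's zero-based indexing (months run 0 to 11, weekdays 0 to 6 with 0 = Sunday), which is easy to get wrong. The following is a small sanity check on made-up dates, not part of the original analysis; `term_from_month` and the `example_*` names are hypothetical and only mirror the nested `ifelse()` used for `term`.

```r
# Hypothetical helper: map a POSIXlt month index (0 = January, ..., 11 = December)
# to a term, using the same cut points as the analysis code.
term_from_month <- function(mon) {
  ifelse(mon > 7, "Fall",
         ifelse(mon > 4, "Summer",
                ifelse(mon > 0, "Spring", "Winter")))
}

# Made-up example dates, fixed to UTC so the result does not depend on the
# local time zone.
example_dates <- as.POSIXct(
  c("2017-01-15 09:00", "2017-03-10 14:30", "2017-07-04 20:00", "2017-10-31 08:15"),
  tz = "UTC"
)
example_lt <- as.POSIXlt(example_dates)

example_terms <- term_from_month(example_lt$mon)
example_weekend <- example_lt$wday %in% c(0, 6)

print(data.frame(date = example_dates, term = example_terms, weekend = example_weekend))
```

Under this mapping January alone counts as "Winter", February through May as "Spring", June through August as "Summer", and September through December as "Fall".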

1.) Do posts made in the morning get more positive activity than those made at night?
The means of the two subsets of posts differed greatly because the distribution of night activity is positively skewed, while the medians were relatively close. Further investigation of the outlier posts in the night subset would help determine whether there is truly a difference in activity between morning and night posts.

This question compares posts made in the morning with posts made at night, relating time of day to Activity. The Activity variable already measures positive activity: it is the sum of all clicks except hide clicks. A difference in means will be used to determine whether the time of day a post was made has an effect on the positive activity the post receives.

library(fastR2)

TimeOfDay <- ifelse(ScienceData$hour >= 3 & ScienceData$hour <= 11, "morning", "night")
#creates side-by-side histograms of the distributions of positive activity for morning and night posts
histogram(~ScienceData$Activity|TimeOfDay, width = 80, main = 'Morning vs. Night Activity', xlab = 'Positive Activity')

#Treat night as not morning  
morning <- ScienceData$hour >= 3 & ScienceData$hour <= 11 
#make a subset of the data  
MorningActivity <- ScienceData$Activity[morning]
NightActivity <- ScienceData$Activity[!morning]

hist(MorningActivity, breaks = 5.5)

hist(NightActivity)

#These histograms have better scaling for clearer observation, but sacrifice side-by-side comparison.
#The histograms match the means and medians discussed below.

mean(MorningActivity)
## [1] 60.22222
mean(NightActivity)
## [1] 147.381
length(MorningActivity)
## [1] 9
length(NightActivity)
## [1] 21
#The mean and median of activity for posts at night are higher than the mean and median for posts in the morning.
#The mean is a lot higher because the distribution of activity on posts at night is skewed right,
#where the outliers raise the mean.


#Difference between Two Independent Means 
origDiff <- mean(MorningActivity)-mean(NightActivity)
origDiff
## [1] -87.15873
#merge both groups
allData <- c(MorningActivity,NightActivity)
nMorning <- length(MorningActivity)
nTotal <- length(allData)

#Storage variable for simulated differences
diffInMeans <- numeric(10000)

for(i in 1:10000)
{
  #shuffle all of the values together
  reordering <- sample(allData,size=nTotal,replace=F)
  
  #split the shuffled values into fake groups with the original sizes (9 morning, 21 night)
  fakeMorningActivity <- reordering[1:nMorning]
  fakeNightActivity <- reordering[(nMorning+1):nTotal]
  
  #computing the difference in means for these 2 fake groups
  newDiff <- mean(fakeMorningActivity)-mean(fakeNightActivity)
  
  #store this difference of means in the vector of simulated differences
  diffInMeans[i] <- newDiff
}

histogram(~diffInMeans)

#proportion of simulated differences at or above the observed difference (upper tail only)
pvalue <- sum(diffInMeans >= origDiff)/10000
pvalue
## [1] 0.8724

First, set the value for alpha to be 0.05 to test our p-value against.
\(H_0\): \(\mu_{m} = \mu_{n}\)

\(H_a\): \(\mu_{m} \neq \mu_{n}\)
The mean activity for posts at night is higher than the mean for posts in the morning. It is a lot higher because the distribution of activity on posts at night is skewed right, where the outliers raise the mean. The test statistic, against which the simulated differences are compared, is the difference between the two sample means. The sample means were found to be 60.2222 for the morning and 147.381 for the night.
\(\bar x_{m} = 60.2222\); \(\bar x_{n} = 147.381\)

Compute test statistic.
\[\bar x_{diff} = \bar x_{m} - \bar x_{n} = -87.1587\]

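Randomization simulation will be done to compute our p-value. The test is two-tailed, so both tails of the simulated distribution should count toward the p-value. The proportion computed above counts only the simulated differences at or above the observed difference of -87.1587, which is the upper tail.
P-value:
The p-value was calculated to be 0.8724, which is more than alpha. It implies that about 0.1276 of the simulated differences fell below the observed difference, so a two-tailed p-value is at least about 0.13 and also exceeds alpha. Thus we fail to reject the null hypothesis: there is not enough evidence to conclude that the time of day has an effect on the amount of activity a post gets.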
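
A two-tailed randomization test that keeps the original group sizes could be sketched as follows. This is a minimal sketch under stated assumptions, not the code that produced the p-value reported above: `perm_test_mean_diff` is a hypothetical helper, and the toy vectors are made-up data used only to show the mechanics.

```r
# Hypothetical helper: two-sided permutation test for a difference in means.
# Each shuffle keeps the original group sizes, and "extreme" is judged by the
# absolute value of the simulated difference.
perm_test_mean_diff <- function(group_a, group_b, n_sims = 10000) {
  observed <- mean(group_a) - mean(group_b)
  pooled <- c(group_a, group_b)
  idx_a <- seq_along(group_a)
  sims <- replicate(n_sims, {
    shuffled <- sample(pooled)  # random reordering without replacement
    mean(shuffled[idx_a]) - mean(shuffled[-idx_a])
  })
  list(observed = observed,
       p_value = mean(abs(sims) >= abs(observed)))
}

# Made-up toy data (not from the Facebook export); unequal group sizes on purpose.
set.seed(1)
toy_a <- c(10, 12, 15, 11)
toy_b <- c(30, 28, 35, 40, 33, 29)
result <- perm_test_mean_diff(toy_a, toy_b, n_sims = 2000)
print(result)
```

With the real data, the call would presumably be `perm_test_mean_diff(MorningActivity, NightActivity)`; because the simulation is random, the resulting p-value will vary slightly from run to run unless a seed is set.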

2.) Is there a relationship between the number of shares and positive activity?
The xyplot of shares and activity showed signs of a linear relationship, which motivates further measurements (such as \(R^2\)). Both the number of shares and Positive Activity are quantitative variables.

x_values <- as.integer(ScienceData$share)
y_values <- as.integer(ScienceData$Activity)

xyplot(y_values~x_values,type = c("p", "r"), xlab = "Shares", ylab = "Activity")

#computes regression line components
model <- lm(y_values~x_values)
model
## 
## Call:
## lm(formula = y_values ~ x_values)
## 
## Coefficients:
## (Intercept)     x_values  
##       15.53        46.63
origCorrelation <- cor(y_values~x_values)
origCorrelation
## [1] 0.6854408
listOfCorrelations <- c()

for (i in 1:10000){
  reordered_y_values <- sample(y_values,size=30,replace=F)
  newCorrelation <- cor(reordered_y_values~x_values)
  listOfCorrelations <- c(listOfCorrelations,newCorrelation)
}

histogram(~listOfCorrelations)

#1-sided p-value
pvalue <- (sum(listOfCorrelations >= origCorrelation))/10000
pvalue
## [1] 0.0011

First, set the value for alpha to be 0.05 to test our p-value against. The null and alternative hypotheses are stated below, where rho represents the correlation coefficient.
\(H_0\): \(\rho = 0\)

\(H_a\): \(\rho > 0\)

Compute test statistic R. The correlation coefficient was computed to be 0.6854408.

Regression line: \(\hat{y} = 15.53 + 46.63x\). This is our line equation for predicting positive activity based on the number of shares.
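
To make the regression line concrete, the sketch below evaluates it for a few share counts. The coefficients are the rounded values printed by `lm()` above; `predict_activity` and `example_shares` are hypothetical names introduced only for illustration.

```r
# Coefficients as reported in the lm() output (rounded to two decimals).
intercept <- 15.53
slope <- 46.63

# Hypothetical helper: predicted positive activity for a given number of shares.
predict_activity <- function(shares) {
  intercept + slope * shares
}

example_shares <- c(0, 1, 3)
predicted <- predict_activity(example_shares)
print(data.frame(shares = example_shares, predicted_activity = predicted))
```

So a post with no shares is predicted to get about 15.5 units of positive activity, and each additional share adds about 46.6. With the fitted object still in memory, `predict(model, newdata = data.frame(x_values = 3))` should give the same kind of prediction from the unrounded coefficients.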

Randomization simulation will be done to compute our p-value. The test is one-sided, so only the upper tail of the simulated distribution counts toward the p-value.
P-value:
The p-value was calculated to be 0.0011, which is less than alpha. Thus we reject the null hypothesis: there is statistically significant evidence of a positive linear correlation between the number of shares a post gets and its positive activity.
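
The same one-sided randomization test can be wrapped in a reusable function. This is a hedged sketch using base R's `cor(x, y)` rather than the formula interface used above; `perm_test_cor` is a hypothetical helper and the toy vectors are made-up data, not values from the Facebook export. It also reports \(R^2\) as the square of the observed correlation.

```r
# Hypothetical helper: one-sided (upper-tail) permutation test for a positive
# correlation. Shuffling y breaks any association with x while keeping both
# marginal distributions intact.
perm_test_cor <- function(x, y, n_sims = 10000) {
  observed <- cor(x, y)
  sims <- replicate(n_sims, cor(x, sample(y)))
  list(observed = observed,
       r_squared = observed^2,
       p_value = mean(sims >= observed))
}

# Made-up toy data with a strong positive trend.
set.seed(1)
toy_shares <- c(0, 0, 1, 1, 2, 3, 4, 5)
toy_activity <- c(12, 20, 55, 60, 110, 150, 210, 240)
toy_result <- perm_test_cor(toy_shares, toy_activity, n_sims = 2000)
print(toy_result)
```

With the real data, the call would presumably be `perm_test_cor(x_values, y_values)`. Taking the vector lengths from the data avoids hard-coding the sample size of 30 inside the loop.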

Idea: Other Franklin College pages should share each other's posts. This could also encourage the followers of those other pages to follow the Science page, which could lead to more activity.