Phase 2

Phase 1

library(readr)

LifetimeTalkingAboutThis <- read_csv("~/Prob&Stats/LifetimeTalkingAboutThis.csv", 
                                     col_types = cols(Description = col_skip(), 
                                                      Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

LifetimePostConsumersByType <- read_csv("~/Prob&Stats/LifetimePostConsumersByType.csv", 
                                        col_types = cols(Description = col_skip(), 
                                                         Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

LifetimeNegativeFeedback <- read_csv("~/Prob&Stats/LifetimeNegativeFeedback.csv", 
                                     col_types = cols(Description = col_skip(), 
                                                      Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))

ScienceData <- merge(LifetimeTalkingAboutThis, LifetimePostConsumersByType)
ScienceData <- merge(ScienceData, LifetimeNegativeFeedback)

!apply(is.na(ScienceData), 2, all)

##            Post ID          Permalink       Post Message 
##               TRUE               TRUE               TRUE 
##               Type          Countries          Languages 
##               TRUE              FALSE              FALSE 
##             Posted Audience Targeting               like 
##               TRUE              FALSE               TRUE 
##            comment              share       other clicks 
##               TRUE               TRUE               TRUE 
##         photo view        link clicks         video play 
##               TRUE               TRUE               TRUE 
##    hide_all_clicks        hide_clicks 
##               TRUE               TRUE

ScienceData <- ScienceData[, !apply(is.na(ScienceData), 2, all)]
ScienceData$Type[is.na(ScienceData$Type)] <- 'None'
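#Fill remaining missing values with numeric 0 (not the string '0') so that count columns stay numeric.
ScienceData[is.na(ScienceData)] <- 0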

#Shift every timestamp forward by three hours (3*60*60 seconds).
ScienceData$Posted <- ScienceData$Posted + 3*60*60

day <- weekdays(ScienceData$Posted)
day.value <- (as.POSIXlt(ScienceData$Posted))
day.value <- strftime(day.value, "%w")
day.value <- as.integer(day.value)
weekend <- day.value == 0 | day.value == 6  #0 = Sunday, 6 = Saturday
hour <- as.POSIXlt(ScienceData$Posted, format = "%m/%d/%y %I:%M %p")$hour
workHours <- (hour >= 8 & hour < 17)
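month <- as.POSIXlt(ScienceData$Posted, format = "%m/%d/%y %I:%M %p")$mon
#$mon is zero-based (0 = January, ..., 11 = December), so:
#Fall = September-December, Summer = June-August, Spring = February-May, Winter = January.
term <- ifelse(month > 7, "Fall", ifelse(month > 4, "Summer", ifelse(month > 0, "Spring", "Winter")))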
Activity  <- (ScienceData$like) + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + (ScienceData$`other clicks`) + as.integer(ScienceData$`photo view`) + as.integer(ScienceData$`link clicks`) + as.integer(ScienceData$`video play`)

ScienceData <- data.frame(ScienceData, day, weekend, hour, workHours, term, Activity)
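The derived columns can be sanity-checked on a few hand-picked timestamps. The sketch below uses made-up dates (they are not from the data set) and the same thresholds as the code above. It relies on `$wday` counting from 0 = Sunday and `$mon` counting from 0 = January.

```r
# Hypothetical timestamps, for illustration only (not taken from the data).
posted <- as.POSIXct(c("2017-01-15 09:30:00", "2017-04-05 20:00:00",
                       "2017-07-08 13:15:00", "2017-10-12 07:45:00"),
                     tz = "UTC")
lt <- as.POSIXlt(posted)

weekend   <- lt$wday == 0 | lt$wday == 6     # 0 = Sunday, 6 = Saturday
workHours <- lt$hour >= 8 & lt$hour < 17     # 8:00 up to, not including, 17:00
month     <- lt$mon                          # 0 = January, ..., 11 = December
term <- ifelse(month > 7, "Fall",
        ifelse(month > 4, "Summer",
        ifelse(month > 0, "Spring", "Winter")))

data.frame(posted, day = weekdays(posted), weekend,
           hour = lt$hour, workHours, term)
```

With these four dates, each term and both values of `weekend` and `workHours` appear at least once, which makes a wrong threshold easy to spot.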

Note: Comments on the R code and analysis of the graphs appear inside the R code chunks, next to the code that produced each result. Every comment starts with a “#” and is italicized. Remember that this is a preliminary analysis of the data; we cannot draw definitive conclusions from the graphical displays.

1.) Distribution of likes, analyzed using the mean and median, a box plot, and a histogram.

library(fastR2)
#Summary Statistics
mean(ScienceData$like)

## [1] 32.1

sd(ScienceData$like)

## [1] 36.82611

median(ScienceData$like)

## [1] 22.5

#Outliers pull the mean upward, which is why the mean is greater than the median.
#Quantiles show the number of likes at the Min, 25%, 50%, 75%, and Max of the set of posts.
#Ex: 25% of the posts have up to 13 likes, and 50% (the median) have up to 22.5 likes.
quantile(ScienceData$like)

##     0%    25%    50%    75%   100% 
##   4.00  13.00  22.50  31.75 172.00
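The pull of an outlier on the mean can be seen on a small made-up vector. The values below are illustrative, not the page's data: adding one very popular post moves the mean a lot and the median only a little.

```r
likes        <- c(10, 13, 20, 25, 32)   # hypothetical like counts
with_outlier <- c(likes, 172)           # same posts plus one very popular post

mean(likes)            # 20
median(likes)          # 20
mean(with_outlier)     # about 45.3
median(with_outlier)   # 22.5
```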

boxplot(ScienceData$like,horizontal = T, main = 'Boxplot of Likes')

#The boxplot matches the calculated quantiles and shows three outliers.
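The points a boxplot draws as outliers can also be listed directly, which helps when deciding which posts to look at more closely. This is a minimal sketch on made-up values, assuming the default rule used by `boxplot()` (points more than 1.5 times the hinge spread beyond the hinges).

```r
# Hypothetical like counts, for illustration only.
likes <- c(4, 10, 13, 15, 20, 22, 25, 30, 32, 140, 172)

stats <- boxplot.stats(likes)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values drawn as individual points: 140 172
```

On the real data, `which(ScienceData$like %in% boxplot.stats(ScienceData$like)$out)` would give the row numbers of the outlying posts.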

hist(ScienceData$like, breaks = 15)

#The plot shows that most of the posts have between 10 and 30 likes.

2.) Posts in the morning vs. posts at night. This relates time of day to Activity. The Activity variable already measures positive activity, which is the sum of all clicks except the hide clicks.

#Make a TotalActivity variable that includes positive and negative activity, for future reference.
TotalActivity <- ScienceData$like + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + as.integer(ScienceData$photo.view) + as.integer(ScienceData$link.clicks) + as.integer(ScienceData$video.play) + as.integer(ScienceData$other.clicks) + as.integer(ScienceData$hide_all_clicks) + as.integer(ScienceData$hide_clicks)

#PositiveActivity <- ScienceData$like + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + as.integer(ScienceData$photo.view) + as.integer(ScienceData$link.clicks) + as.integer(ScienceData$video.play) + as.integer(ScienceData$other.clicks)  
#Same as Activity variable.   


#Treat night as not morning  
morning <- ScienceData$hour >= 3 & ScienceData$hour <= 11  
#Creates side-by-side histograms of the distributions of positive activity for morning and night posts.
histogram(~ScienceData$Activity|morning, width = 80, main = 'Night vs. Morning Activity')

#make a subset of the data  
MorningActivity <- ScienceData$Activity[morning]
NightActivity <- ScienceData$Activity[!morning]

#Two histograms that show the distribution of positive activity based on whether or not the post was made in the morning.

hist(MorningActivity, breaks = 5.5)

hist(NightActivity)

#These histograms have better scaling for clearer observation, but sacrifice side-by-side comparison.
#The histograms match the resulting means and medians below.

mean(MorningActivity)

## [1] 60.22222

median(MorningActivity)

## [1] 53

mean(NightActivity)

## [1] 147.381

median(NightActivity)

## [1] 60

#The mean and median of activity for posts at night are higher than the mean and median for posts in the morning.
#The night mean is much higher because the distribution of activity on night posts is skewed right,
#where the outliers raise the mean.
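The four summary values can also be produced in one step by splitting the activity vector on the morning indicator. The sketch below uses made-up activity counts and posting hours (not the page's data) with the same morning definition as above.

```r
# Hypothetical activity counts and posting hours, for illustration only.
activity <- c(40, 53, 70, 30, 60, 55, 900)
hour     <- c( 5,  9, 10, 14, 20, 22,  23)
morning  <- hour >= 3 & hour <= 11

# One column per group (FALSE = night, TRUE = morning), one row per statistic.
summaryByTime <- sapply(split(activity, morning),
                        function(x) c(mean = mean(x), median = median(x)))
summaryByTime
```

In this toy example the single night post with 900 clicks raises the night mean far above the night median, while the two medians stay close to each other.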

3.) Relationship between number of comments and number of likes. More comments should indicate more likes.

#Scatterplot of the number of likes against the number of comments
#gf_point() takes the plot title through title = (main = is the base-graphics argument name).
gf_point(as.integer(ScienceData$like) ~ as.integer(ScienceData$comment), title = 'Growth of likes by comments')

#The plot shows a roughly linear trend, indicating a positive relationship between the number of comments and the number of likes.
#Encouraging more comments on posts may help boost likes on those posts.
#Followers may like posts that already have interest and/or additional conversation.
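The strength of a trend like this can be summarized with a correlation coefficient. The sketch below uses made-up comment and like counts (not the page's data). A correlation describes association only; by itself it does not show that comments cause likes.

```r
# Hypothetical counts, for illustration only.
comments <- c(0, 1, 2, 3, 5, 8)
likes    <- c(5, 12, 20, 24, 41, 60)

cor(likes, comments)                        # Pearson: strength of the linear trend
cor(likes, comments, method = "spearman")   # rank-based, less sensitive to outliers
```

Comparing the two values is a quick check on whether a few extreme posts are driving the apparent linear relationship.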

4.) Activity of posts during work hours vs not during work hours.

#Two histograms to represent the distribution of positive activity on posts made during and outside of work hours.
WorkHoursActivity <- ScienceData$Activity[ScienceData$workHours == T]
hist(WorkHoursActivity)

#The histogram is skewed right 
mean(WorkHoursActivity)

## [1] 132.55

median(WorkHoursActivity)

## [1] 50.5

FreetimeActivity <- ScienceData$Activity[ScienceData$workHours == F]

hist(FreetimeActivity)

#slightly skewed right, close to uniform distribution
mean(FreetimeActivity)

## [1] 98.6

median(FreetimeActivity)

## [1] 77.5

#The right-skewed distribution for work hours pulls its mean above the mean for posts outside of work hours. The median, however, is higher outside of work hours, and because of the skewness it is the better measure of center.

5.) Relationship between number of shares and Positive Activity. (Idea: other pages should share each other's posts.)

#Scatterplot of the amount of Activity against the number of shares
#gf_point() takes the plot title through title = (main = is the base-graphics argument name).
gf_point(as.integer(ScienceData$Activity) ~ as.integer(ScienceData$share), title = 'Activity and Shares Datapoints')

#Another linear trend is observed: posts with more shares have a higher amount of activity.
#Other pages have many more followers than the Science page; if those pages shared the Science page's posts, activity on those posts would likely be higher.
#This could also encourage the followers of those other pages to follow the Science page.
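One way to put a number on a linear trend is to fit a least-squares line and read off the slope and \(R^2\). This is a sketch on made-up share and activity counts (not the page's data), using base R's `lm()`.

```r
# Hypothetical counts, for illustration only.
shares   <- c(0, 1, 2, 4, 6, 10)
activity <- c(30, 55, 70, 130, 170, 290)

fit <- lm(activity ~ shares)
coef(fit)                  # intercept and slope (extra activity per additional share)
summary(fit)$r.squared     # share of the variation in activity explained by the line
```

On the real data the equivalent call would be `lm(Activity ~ as.integer(share), data = ScienceData)`. With only a few dozen posts, the fit should be read as descriptive rather than conclusive.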

6.) Which days of the week do posts get the most likes?

#Shows the number of posts for each day of the week
table(ScienceData$day)

## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##         3         2         1         1         8         8         7

#Putting the posts in the subsets for each day of the week.
#Then creating a vector of the means for each day of the week.
Friday <- ScienceData$like[ScienceData$day == 'Friday']
Friday <- mean(Friday)

Saturday <- ScienceData$like[ScienceData$day == 'Saturday']
Saturday <- mean(Saturday)

Sunday <- ScienceData$like[ScienceData$day == 'Sunday']
Sunday <- mean(Sunday)

Monday <- ScienceData$like[ScienceData$day == 'Monday']
Monday <- mean(Monday)

Tuesday <- ScienceData$like[ScienceData$day == 'Tuesday']
Tuesday <- mean(Tuesday)

Wednesday <- ScienceData$like[ScienceData$day == 'Wednesday']
Wednesday <- mean(Wednesday)

Thursday <- ScienceData$like[ScienceData$day == 'Thursday']
Thursday <- mean(Thursday)

WeekdayLikes <- c(Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
names(WeekdayLikes) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

#Pie graph of the means of likes for each day of the week.  
pie(WeekdayLikes, labels = names(WeekdayLikes), main = 'Mean of Likes: Days of the Week')

#The graph shows that Monday and Sunday posts have a higher average number of likes than posts on the other days of the week.
#Sunday has only one post, which is also the oldest post.
#This could indicate that its age, rather than the day of the week, is the reason for its high number of likes.
#This leads to another question: the relationship between how old a post is and its number of likes.

WeekdayLikes

##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##  98.00000  28.37500  17.14286  25.62500  19.66667  16.00000 140.00000
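The per-day means can also be computed in one call with `tapply()`, and showing the post count next to each mean makes it clear when a mean rests on a single post. The sketch below uses a made-up data frame named `posts` (hypothetical, not the page's data).

```r
# Hypothetical posts, for illustration only.
posts <- data.frame(
  day  = c("Monday", "Tuesday", "Tuesday", "Friday", "Sunday", "Monday"),
  like = c(100, 20, 30, 18, 140, 96)
)

dayOrder <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")
posts$day <- factor(posts$day, levels = dayOrder)   # fixes the display order

meanLikes  <- tapply(posts$like, posts$day, mean)   # NA for days with no posts
postCounts <- table(posts$day)

data.frame(n = as.vector(postCounts),
           meanLikes = as.vector(meanLikes),
           row.names = dayOrder)
```

Replacing `mean` with `sum` in the `tapply()` call gives the per-day totals in the same way.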

#Making a vector of the sum of likes for each day of week   
Friday <- ScienceData$like[ScienceData$day == 'Friday']
Friday <- sum(Friday)

Saturday <- ScienceData$like[ScienceData$day == 'Saturday']
Saturday <- sum(Saturday)

Sunday <- ScienceData$like[ScienceData$day == 'Sunday']
Sunday <- sum(Sunday)

Monday <- ScienceData$like[ScienceData$day == 'Monday']
Monday <- sum(Monday)

Tuesday <- ScienceData$like[ScienceData$day == 'Tuesday']
Tuesday <- sum(Tuesday)

Wednesday <- ScienceData$like[ScienceData$day == 'Wednesday']
Wednesday <- sum(Wednesday)

Thursday <- ScienceData$like[ScienceData$day == 'Thursday']
Thursday <- sum(Thursday)

WeekdayLikes <- c(Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
names(WeekdayLikes) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Pie graph of the sum of likes on the individual days of the week.
pie(WeekdayLikes, labels = names(WeekdayLikes), main = 'Sum of Likes: Days of the Week')

#The graph shows that posts made from Sunday through Thursday account for a much higher
#   percentage of the total likes on the page than posts made on Friday and Saturday.
#This may be because Sunday through Thursday are work/school nights, which would suggest more social media activity on those nights.
#The sums also depend on how many posts fall on each day (see the table of posts per day above).

WeekdayLikes

##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       196       227       120       205        59        16       140

#More data would be needed to draw any relationship between days and likes.

The three questions that were most interesting were:
1.) Do posts that are posted in the morning get more positive activity than those posted at night?
The means of the two subsets of posts differed greatly because of the positively skewed distribution, while the medians were relatively close. More investigation of the outlier posts in the night subset would help determine whether there is truly a difference in activity between morning and night posts.

2.) What is the relationship between the number of shares and positive activity?
The scatterplot of shares and activity showed signs of a linear relationship, and more measurements (such as \(R^2\)) would help quantify it. It would also be interesting to see whether other pages sharing the Science page’s posts would increase the amount of activity on those posts. This could also encourage the followers of those other pages to follow the Science page, which could lead to more activity.

3.) Which days of the week do posts get the most likes?
This helps determine which day of the week would be best to post in order to receive the most likes. More data points would help because only one post falls on Saturday and only one on Sunday.


Jacob Wolfla and Maria Torres

October 12, 2017