Phase 1
library(readr)
LifetimeTalkingAboutThis <- read_csv("~/Prob&Stats/LifetimeTalkingAboutThis.csv",
col_types = cols(Description = col_skip(),
Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))
LifetimePostConsumersByType <- read_csv("~/Prob&Stats/LifetimePostConsumersByType.csv",
col_types = cols(Description = col_skip(),
Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))
LifetimeNegativeFeedback <- read_csv("~/Prob&Stats/LifetimeNegativeFeedback.csv",
col_types = cols(Description = col_skip(),
Posted = col_datetime(format = "%m/%d/%y %I:%M %p")))
ScienceData <- merge(LifetimeTalkingAboutThis, LifetimePostConsumersByType)
ScienceData <- merge(ScienceData, LifetimeNegativeFeedback)
!apply(is.na(ScienceData), 2, all)
##            Post ID          Permalink       Post Message
##               TRUE               TRUE               TRUE
##               Type          Countries          Languages
##               TRUE              FALSE              FALSE
##             Posted Audience Targeting               like
##               TRUE              FALSE               TRUE
##            comment              share       other clicks
##               TRUE               TRUE               TRUE
##         photo view        link clicks         video play
##               TRUE               TRUE               TRUE
##    hide_all_clicks        hide_clicks
##               TRUE               TRUE
ScienceData <- ScienceData[, !apply(is.na(ScienceData), 2, all)]
ScienceData$Type[is.na(ScienceData$Type)] <- 'None'
ScienceData[is.na(ScienceData)] <- '0'
ScienceData$Posted <- ScienceData$Posted + 3*60*60
day <- weekdays(ScienceData$Posted)
day.value <- (as.POSIXlt(ScienceData$Posted))
day.value <- strftime(day.value, "%w")
day.value <- as.integer(day.value)
weekend <- day.value %in% c(0, 6)
hour <- as.POSIXlt(ScienceData$Posted)$hour
workHours <- (hour >= 8 & hour < 17)
month <- as.POSIXlt(ScienceData$Posted)$mon
term <- ifelse(month > 7, "Fall", ifelse(month > 4, "Summer", ifelse(month > 0, "Spring", "Winter")))
Activity <- (ScienceData$like) + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + (ScienceData$`other clicks`) + as.integer(ScienceData$`photo view`) + as.integer(ScienceData$`link clicks`) + as.integer(ScienceData$`video play`)
ScienceData <- data.frame(ScienceData, day, weekend, hour, workHours, term, Activity)
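The `as.integer()` calls in the `Activity` sum are needed because filling missing values with the string '0' converts every column that contained an NA to character. A minimal sketch on a made-up data frame (the column values below are hypothetical, not taken from the CSV files) shows the effect, along with one possible alternative that keeps the counts numeric:

```r
# Toy data frame; values are invented for illustration only.
toy <- data.frame(like = c(10, 25, 7),
                  share = c(2, NA, 1),
                  stringsAsFactors = FALSE)

# Filling with the string '0' turns the column that had an NA into character.
filled_chr <- toy
filled_chr[is.na(filled_chr)] <- '0'
class(filled_chr$share)   # "character"
class(filled_chr$like)    # "numeric" (no NA, so untouched)

# Filling with the number 0 keeps the column numeric.
filled_num <- toy
filled_num[is.na(filled_num)] <- 0
class(filled_num$share)   # "numeric"
sum(filled_num$share)     # 3
```

With a numeric fill, the count columns could be summed directly; whether that suits the real data depends on which columns actually contain NAs.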
Note: Comments on and analysis of the R code, as well as analysis of the graphs, are placed inside the R code chunks to address each important result of the code. A “#” precedes every comment, and the comments are italicized. Remember that this is a preliminary analysis of the data; we cannot draw definitive conclusions from the graphical displays.
1.) Distribution of likes analyzed by using mean and median, box plot, and a histogram.
library(fastR2)
#Summary Statistics
mean(ScienceData$like)
## [1] 32.1
sd(ScienceData$like)
## [1] 36.82611
median(ScienceData$like)
## [1] 22.5
#Outliers pull the mean upward, which is why the mean is greater than the median.
#Quantiles show the number of likes at the Min, 25%, 50%, 75%, and Max of the set of posts.
#Ex: 25% of the posts have up to 13 likes, and 50% (the median) have up to 22.5 likes.
quantile(ScienceData$like)
## 0% 25% 50% 75% 100%
## 4.00 13.00 22.50 31.75 172.00
boxplot(ScienceData$like,horizontal = T, main = 'Boxplot of Likes')
#boxplot matches the calculated quantiles. Shows that there are three outliers.
hist(ScienceData$like, breaks = 15)
#The plot shows that most of the posts have between 10 and 30 likes.
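The outliers seen in the boxplot can be cross-checked with the usual 1.5 × IQR rule. The sketch below plugs in the quartiles printed by `quantile()` above. Note that `boxplot()` draws its box from the hinges returned by `fivenum()`, which can differ slightly from the default `quantile()` quartiles, so the fences here are approximate.

```r
# Quartiles as printed by quantile(ScienceData$like) above.
q1 <- 13
q3 <- 31.75

iqr <- q3 - q1                   # interquartile range
upper_fence <- q3 + 1.5 * iqr    # values above this are flagged as outliers
lower_fence <- q1 - 1.5 * iqr    # values below this are flagged as outliers

iqr          # 18.75
upper_fence  # 59.875
lower_fence  # -15.125
```

Under this rule, any post with more than about 60 likes would be flagged, which is consistent with the maximum of 172 lying far outside the box.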
2.) Posts in the morning vs posts at night. Relates time of day to Activity. The Activity variable is already positive activity, which is a sum of all the clicks besides the hide clicks.
#Make a Total Activity variable that includes positive and negative activity, for future reference
TotalActivity <- ScienceData$like + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + as.integer(ScienceData$photo.view) + as.integer(ScienceData$link.clicks) + as.integer(ScienceData$video.play) + as.integer(ScienceData$other.clicks) + as.integer(ScienceData$hide_all_clicks) + as.integer(ScienceData$hide_clicks)
#PositiveActivity <- ScienceData$like + as.integer(ScienceData$comment) + as.integer(ScienceData$share) + as.integer(ScienceData$photo.view) + as.integer(ScienceData$link.clicks) + as.integer(ScienceData$video.play) + as.integer(ScienceData$other.clicks)
#Same as Activity variable.
#Treat night as not morning
morning <- ScienceData$hour >= 3 & ScienceData$hour <= 11
#creates a side by side of histograms with the distributions of positive activity on morning and night posts.
histogram(~ScienceData$Activity|morning, width = 80, main = 'Night vs. Morning Activity')
#make a subset of the data
MorningActivity <- ScienceData$Activity[morning]
NightActivity <- ScienceData$Activity[!morning]
#Two histograms that show the distribution of activity for posts made in the morning and posts not made in the morning.
hist(MorningActivity, breaks = 5.5)
hist(NightActivity)
#These histograms have better scaling for clearer observation, but sacrifice side-by-side comparison.
#The histograms match the resulting means and medians below.
mean(MorningActivity)
## [1] 60.22222
median(MorningActivity)
## [1] 53
mean(NightActivity)
## [1] 147.381
median(NightActivity)
## [1] 60
#The mean and median of activity for posts at night are higher than the mean and median for posts in the morning.
#The mean is much higher because the distribution of activity on posts at night is skewed right,
#where the outliers raise the mean.
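The effect of right skew on the two measures of center can be seen with a small made-up example. The numbers below are invented for illustration and are not the page's data; the point is only that a single very popular post moves the mean far more than the median.

```r
# Invented activity counts, for illustration only.
typical <- c(40, 50, 55, 60, 70)
with_outlier <- c(typical, 900)   # add one unusually popular post

mean(typical)         # 55
median(typical)       # 55
mean(with_outlier)    # 195.8333
median(with_outlier)  # 57.5
```

This is the same pattern as the night posts, where the mean (147.381) sits far above the median (60).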
3.) Relationship between number of comments and number of likes. More comments should indicate more likes.
#xyplot of the number of likes against the number of comments
gf_point(as.integer(ScienceData$like) ~ as.integer(ScienceData$comment), main = 'Growth of likes by comments')
#The plot shows a roughly linear trend, indicating a positive relationship between the number of comments and the number of likes.
#Encouraging more comments on posts might boost likes, although the plot alone cannot show that comments cause likes.
#Followers may like posts that already have interest and/or additional conversation.
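A correlation coefficient is one way to put a number on the trend seen in the scatterplot. The sketch below uses invented counts rather than the page's data. On the real data the arguments would presumably be `as.integer(ScienceData$comment)` and `ScienceData$like`, which is an assumption about how the columns are stored.

```r
# Invented counts, for illustration only.
comments <- c(0, 1, 2, 3, 5, 8)
likes    <- c(5, 9, 14, 20, 31, 60)

# Pearson measures linear association; Spearman measures monotone association
# and is less sensitive to a few extreme posts.
cor(comments, likes, method = "pearson")
cor(comments, likes, method = "spearman")   # 1, because likes rise with every increase in comments
```

Reporting both values is a cheap check on whether a few outlier posts are driving the apparent linear trend.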
4.) Activity of posts during work hours vs not during work hours.
#Two histograms to represent the distribution of positive activity on post during and not during work hours.
WorkHoursActivity <- ScienceData$Activity[ScienceData$workHours]
hist(WorkHoursActivity)
#The histogram is skewed right
mean(WorkHoursActivity)
## [1] 132.55
median(WorkHoursActivity)
## [1] 50.5
FreetimeActivity <- ScienceData$Activity[!ScienceData$workHours]
hist(FreetimeActivity)
#slightly skewed right, close to uniform distribution
mean(FreetimeActivity)
## [1] 98.6
median(FreetimeActivity)
## [1] 77.5
#The right-skewed distribution for work hours pulls its mean above the mean for posts outside work hours, but the median outside work hours is higher. Because the median is less affected by skewness, it is the better measure of center here.
5.) Relationship between amount of shares and Positive Activity. (idea: Other pages should share each other's posts)
#xyplot of the amount of Activity and the amount of shares
gf_point(as.integer(ScienceData$Activity) ~ as.integer(ScienceData$share), main = 'Activity and Shares Datapoints')
#Another linear trend is observed: posts with more shares have more activity.
#Other pages have many more followers than the Science page; if those pages shared the Science page's posts, activity on those posts would likely be higher.
#This could also encourage followers of those other pages to follow the Science page.
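Because `Activity` is defined as a sum that already includes `share`, part of the relationship between the two is built in. One way to check how much of the trend remains is to subtract the shares before comparing. This is a sketch on invented numbers, and `ActivityNoShares` is a hypothetical column name introduced only for this illustration.

```r
# Invented counts, for illustration only.
toy <- data.frame(share    = c(0, 1, 2, 4, 6),
                  Activity = c(30, 45, 52, 90, 140))

# Remove the shares from the total so the two variables no longer overlap.
toy$ActivityNoShares <- toy$Activity - toy$share

toy$ActivityNoShares                  # 30 44 50 86 134
cor(toy$share, toy$ActivityNoShares)  # correlation without the built-in overlap
```

If the correlation stays high after the subtraction, the association is not just an artifact of how `Activity` was constructed.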
6.) Which days of the week do posts get the most likes?
#Shows the number of posts for each day of the week
table(ScienceData$day)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 3 2 1 1 8 8 7
#Compute the mean number of likes for each day of the week,
#collected in a named vector ordered Monday through Sunday.
dayOrder <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
WeekdayLikes <- sapply(dayOrder, function(d) mean(ScienceData$like[ScienceData$day == d]))
#Pie graph of the means of likes for each day of the week.
pie(WeekdayLikes, labels = names(WeekdayLikes), main = 'Mean of Likes: Days of the Week')
#The graph shows that Monday and Sunday posts have a higher mean number of likes than posts on the other days of the week.
#Sunday has only one post, which is also the oldest post.
#Its age, rather than the day of the week, could be the reason for its high number of likes.
#This leads to another question: how does the age of a post relate to its number of likes?
WeekdayLikes
## Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## 98.00000 28.37500 17.14286 25.62500 19.66667 16.00000 140.00000
#Compute the sum of likes for each day of the week,
#collected in a named vector ordered Monday through Sunday.
dayOrder <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
WeekdayLikes <- sapply(dayOrder, function(d) sum(ScienceData$like[ScienceData$day == d]))
# Pie graph of the sum of likes on the individual days of the week.
pie(WeekdayLikes, labels = names(WeekdayLikes), main = 'Sum of Likes: Days of the Week')
#The graph shows that posts made from Sunday through Thursday account for a much larger
#share of the page's total likes than posts made on Friday and Saturday.
#One possible explanation is that Sunday through Thursday are work/school nights, when followers may be more active on social media.
#The totals also depend on how many posts fall on each day: Tuesday through Thursday have 7 or 8 posts each, while Friday, Saturday, and Sunday have 3 or fewer.
WeekdayLikes
## Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## 196 227 120 205 59 16 140
#More data would be needed to draw any relationship between days and likes
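The sums and the means tell different stories because the number of posts varies so much by day. The sketch below copies the post counts from the `table()` output and the totals from the sum output above, then compares each day's share of likes with its share of posts. The variable names are chosen here for illustration.

```r
dayOrder <- c("Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday")

# Values copied from the table() and sum-of-likes output above.
posts <- c(2, 8, 7, 8, 3, 1, 1)
likes <- c(196, 227, 120, 205, 59, 16, 140)
names(posts) <- names(likes) <- dayOrder

# Likes per post: the total divided by the number of posts that day.
round(likes / posts, 2)

# Share of all likes versus share of all posts, in percent.
round(100 * likes / sum(likes), 1)
round(100 * posts / sum(posts), 1)
```

Dividing the totals by the post counts reproduces the means from the first pie chart (for example, 196 / 2 = 98 for Monday), and the 963 total likes over 30 posts matches the overall mean of 32.1. A day with many posts can hold a large share of the total likes without its individual posts doing unusually well.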
The three questions that were most interesting were:
1.) Do posts that are posted in the morning get more positive activity than those posted at night?
The means of the two subsets differed greatly because the night distribution is positively skewed, while the medians were relatively close. Further investigation of the outlier posts in the night subset would help determine whether there is truly a difference in activity between morning and night posts.
2.) Relationship between amount of shares and positive activity?
The xyplot of shares and activity showed signs of a linear relationship, and more measurements (such as \(R^2\)) would help quantify it. It would also be interesting to see whether other pages sharing the Science page’s posts would increase activity on those posts. This could also encourage the followers of those other pages to follow the Science page, which could lead to more activity.
3.) Which days of the week do posts get the most likes?
This helps determine which day of the week would be best to post to receive the most likes. More data points would help because only 1 post falls on Saturday and only 1 on Sunday.
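For the second question, \(R^2\) could be obtained by fitting a simple linear model. The sketch below uses invented counts, not the page's data. On the real data the formula would presumably be `Activity ~ as.integer(share)` with `data = ScienceData`, which is an assumption about how the columns are stored.

```r
# Invented counts, for illustration only.
toy <- data.frame(share    = c(0, 1, 2, 4, 6),
                  Activity = c(30, 45, 52, 90, 140))

# Fit a straight line predicting Activity from the number of shares.
fit <- lm(Activity ~ share, data = toy)

coef(fit)                 # intercept and slope of the fitted line
summary(fit)$r.squared    # proportion of variance in Activity explained by share
```

For a model with a single predictor, this \(R^2\) equals the squared Pearson correlation between the two variables, so it summarizes the same linear trend seen in the scatterplot.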