Instructor: Charlie Terng
Student: Xukun LIU
Student: Dzenisa Bihorac
Student: Xicheng LIN
Student: Yanming Shou
1.1 Read in the Facebook data hosted from a link
fb = read.csv(url("http://guides.newman.baruch.cuny.edu/ld.php?content_id=39953204"),sep=";")
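The sep=";" argument matters because the file is semicolon-delimited rather than comma-delimited. A minimal sketch of the difference, using a small inline sample whose column names and values are made up for illustration (they are not read from the real file):

```r
# Hypothetical two-row sample of semicolon-separated text (not the real data)
sample_text <- "Type;like;share\nPhoto;79;17\nStatus;130;29"

# With the matching separator, each field becomes its own column
ok <- read.csv(text = sample_text, sep = ";")
dim(ok)

# With the default comma separator, each line collapses into a single column
bad <- read.csv(text = sample_text)
dim(bad)
```

Checking dim() right after reading, as step 1.2 does, is a quick way to catch a wrong separator.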
1.2 Get an initial understanding of how the data is structured
dim(fb)
## [1] 500 19
str(fb)
## 'data.frame': 500 obs. of 19 variables:
## $ Page.total.likes : int 139441 139441 139441 139441 139441 139441 139441 139441 139441 139441 ...
## $ Type : Factor w/ 4 levels "Link","Photo",..: 2 3 2 2 2 3 2 2 3 2 ...
## $ Category : int 2 2 3 2 2 2 3 3 2 3 ...
## $ Post.Month : int 12 12 12 12 12 12 12 12 12 12 ...
## $ Post.Weekday : int 4 3 3 2 2 1 1 7 7 6 ...
## $ Post.Hour : int 3 10 3 10 3 9 3 9 3 10 ...
## $ Paid : int 0 0 0 1 0 0 1 1 0 0 ...
## $ Lifetime.Post.Total.Reach : int 2752 10460 2413 50128 7244 10472 11692 13720 11844 4694 ...
## $ Lifetime.Post.Total.Impressions : int 5091 19057 4373 87991 13594 20849 19479 24137 22538 8668 ...
## $ Lifetime.Engaged.Users : int 178 1457 177 2211 671 1191 481 537 1530 280 ...
## $ Lifetime.Post.Consumers : int 109 1361 113 790 410 1073 265 232 1407 183 ...
## $ Lifetime.Post.Consumptions : int 159 1674 154 1119 580 1389 364 305 1692 250 ...
## $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page : int 3078 11710 2812 61027 6228 16034 15432 19728 15220 4309 ...
## $ Lifetime.Post.reach.by.people.who.like.your.Page : int 1640 6112 1503 32048 3200 7852 9328 11056 7912 2324 ...
## $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int 119 1108 132 1386 396 1016 379 422 1250 199 ...
## $ comment : int 4 5 0 58 19 1 3 0 0 3 ...
## $ like : int 79 130 66 1572 325 152 249 325 161 113 ...
## $ share : int 17 29 14 147 49 33 27 14 31 26 ...
## $ Total.Interactions : int 100 164 80 1777 393 186 279 339 192 142 ...
1.3 Take a look at the variable types. Do they make sense for the given variable?
fb$Category = as.factor(fb$Category)
fb$Type = as.factor(fb$Type)
fb$Post.Month = as.factor(fb$Post.Month)
fb$Post.Weekday = as.factor(fb$Post.Weekday)
fb$Post.Hour = as.factor(fb$Post.Hour)
fb$Paid = as.factor(fb$Paid)
str(fb)
## 'data.frame': 500 obs. of 19 variables:
## $ Page.total.likes : int 139441 139441 139441 139441 139441 139441 139441 139441 139441 139441 ...
## $ Type : Factor w/ 4 levels "Link","Photo",..: 2 3 2 2 2 3 2 2 3 2 ...
## $ Category : Factor w/ 3 levels "1","2","3": 2 2 3 2 2 2 3 3 2 3 ...
## $ Post.Month : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
## $ Post.Weekday : Factor w/ 7 levels "1","2","3","4",..: 4 3 3 2 2 1 1 7 7 6 ...
## $ Post.Hour : Factor w/ 22 levels "1","2","3","4",..: 3 10 3 10 3 9 3 9 3 10 ...
## $ Paid : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 1 1 ...
## $ Lifetime.Post.Total.Reach : int 2752 10460 2413 50128 7244 10472 11692 13720 11844 4694 ...
## $ Lifetime.Post.Total.Impressions : int 5091 19057 4373 87991 13594 20849 19479 24137 22538 8668 ...
## $ Lifetime.Engaged.Users : int 178 1457 177 2211 671 1191 481 537 1530 280 ...
## $ Lifetime.Post.Consumers : int 109 1361 113 790 410 1073 265 232 1407 183 ...
## $ Lifetime.Post.Consumptions : int 159 1674 154 1119 580 1389 364 305 1692 250 ...
## $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page : int 3078 11710 2812 61027 6228 16034 15432 19728 15220 4309 ...
## $ Lifetime.Post.reach.by.people.who.like.your.Page : int 1640 6112 1503 32048 3200 7852 9328 11056 7912 2324 ...
## $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int 119 1108 132 1386 396 1016 379 422 1250 199 ...
## $ comment : int 4 5 0 58 19 1 3 0 0 3 ...
## $ like : int 79 130 66 1572 325 152 249 325 161 113 ...
## $ share : int 17 29 14 147 49 33 27 14 31 26 ...
## $ Total.Interactions : int 100 164 80 1777 393 186 279 339 192 142 ...
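The column-by-column conversion above can also be written once for a list of columns. This is only a sketch on a toy data frame with invented values; the column names mirror a few of those in fb, and code_cols is a name chosen here for illustration:

```r
# Toy data frame with made-up values; column names mirror a few in fb
toy <- data.frame(
  Category = c(2, 3, 1, 2),
  Paid     = c(0, 1, 0, 0),
  like     = c(79, 130, 66, 1572)
)

# Columns that hold codes rather than quantities
code_cols <- c("Category", "Paid")

# Convert every listed column to a factor in one step
toy[code_cols] <- lapply(toy[code_cols], as.factor)

str(toy)
```

Counts such as like stay numeric, so summary() and hist() still treat them as quantities.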
1.4 The “Category” variable indicates its marketing purpose
levels(fb$Category)
## [1] "1" "2" "3"
fb$Category = as.character(fb$Category)
fb$Category[fb$Category=="1"] = "action"
fb$Category[fb$Category=="2"] = "product"
fb$Category[fb$Category=="3"] = "inspiration"
fb$Category = as.factor(fb$Category)
levels(fb$Category)
## [1] "action" "inspiration" "product"
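The same relabeling can be sketched in one call with factor(), which maps codes to labels directly. The codes below are made up for illustration. One difference from the approach above: the levels keep the order given in labels instead of being sorted alphabetically.

```r
# Made-up category codes, using the 1/2/3 coding from the dataset
codes <- c(2, 2, 3, 1, 3)

# Map each code to its marketing purpose in a single step
purpose <- factor(codes,
                  levels = c(1, 2, 3),
                  labels = c("action", "product", "inspiration"))

levels(purpose)
as.character(purpose)
```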
1.5 How many missing values do we have in the dataset?
colSums(is.na(fb))
## Page.total.likes
## 0
## Type
## 0
## Category
## 0
## Post.Month
## 0
## Post.Weekday
## 0
## Post.Hour
## 0
## Paid
## 1
## Lifetime.Post.Total.Reach
## 0
## Lifetime.Post.Total.Impressions
## 0
## Lifetime.Engaged.Users
## 0
## Lifetime.Post.Consumers
## 0
## Lifetime.Post.Consumptions
## 0
## Lifetime.Post.Impressions.by.people.who.have.liked.your.Page
## 0
## Lifetime.Post.reach.by.people.who.like.your.Page
## 0
## Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post
## 0
## comment
## 0
## like
## 1
## share
## 4
## Total.Interactions
## 0
sum(is.na(fb))
## [1] 6
fb = na.omit(fb)
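Note that na.omit() drops whole rows, so the number of rows removed can be smaller than the number of missing cells. In this lab there are 6 missing cells, while the later counts (356 + 139 posts under Paid) show 495 rows remaining, so 5 rows were dropped. A small sketch of the same effect on a toy data frame with invented values:

```r
# Toy data frame with made-up values and three missing cells
toy <- data.frame(
  Paid  = c(0, 1, NA, 0),
  like  = c(79, NA, 66, 325),
  share = c(17, NA, 14, 49)
)

total_na     <- sum(is.na(toy))            # missing cells
rows_with_na <- sum(!complete.cases(toy))  # rows with at least one missing cell

# na.omit() removes every row that has at least one NA
clean <- na.omit(toy)

c(cells = total_na, rows = rows_with_na, remaining = nrow(clean))
```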
2.1 Provide descriptive statistics of the “like” variable (min, max, median, Q1, Q3, mean, and standard deviation)
summary(fb$like)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 57.0 101.0 179.1 188.0 5172.0
sd(fb$like)
## [1] 324.4122
2.2 What do the summary statistics for “like” tell us about the cosmetic company’s social media performance on Facebook?
summary(fb$like)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 57.0 101.0 179.1 188.0 5172.0
sum(fb$like)
## [1] 88677
sd(fb$like)
## [1] 324.4122
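The seven statistics asked for can be collected in one named vector. The helper below, describe, is a hypothetical function written for this note (it is not part of the lab or of base R), and the input values are made up:

```r
# Hypothetical helper: gathers the requested statistics into one named vector
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(min = min(x), q1 = q[1], median = median(x), mean = mean(x),
    q3 = q[2], max = max(x), sd = sd(x))
}

# Made-up like counts with one very popular post
likes <- c(10, 20, 30, 40, 400)
stats <- describe(likes)
stats

# A mean well above the median hints at a right-skewed distribution
stats["mean"] > stats["median"]
```

The fb output shows the same pattern: the mean (179.1) sits well above the median (101), and the maximum (5172) is far from Q3 (188).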
2.3 Plot a histogram of “like” using the hist( ) function.
hist(fb$like)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post")
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 1)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 10)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 50)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 100)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 200)
hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 200, xlim = c(1,1000))
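When breaks is a single number, hist() treats it as a suggestion and picks "pretty" breakpoints near that count, which is why the bars change in steps as breaks grows. The bin counts can be inspected without drawing by passing plot = FALSE. A sketch on made-up, right-skewed values:

```r
# Made-up like counts: many small values and one extreme post
x <- c(rep(50, 30), rep(150, 15), rep(400, 4), 5000)

# plot = FALSE returns the histogram object instead of drawing it
coarse <- hist(x, breaks = 10,  plot = FALSE)
fine   <- hist(x, breaks = 200, plot = FALSE)

length(coarse$counts)   # bins actually used for breaks = 10
length(fine$counts)     # bins actually used for breaks = 200
coarse$counts           # with few bins, almost everything lands in the first bar
```

Adding xlim only zooms the view; the bins are still computed from the full data.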
2.4 Provide descriptive statistics of the categorical variables with summary( ) function (frequency, proportions)
summary(fb$Type)
## Link Photo Status Video
## 22 421 45 7
summary(fb$Post.Month)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 24 26 36 50 37 49 52 34 35 57 45 50
plot(fb$Type)
plot(fb$Post.Month,las=2,main="Post by Month",xlab="Month",ylab="# of post")
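summary() on a factor gives frequencies only. For the proportions the question also asks about, one option is table() followed by prop.table(). A sketch on a made-up factor (the values are invented, only the level names echo fb$Type):

```r
# Made-up post types
type <- factor(c("Photo", "Photo", "Status", "Photo",
                 "Link", "Photo", "Video", "Photo"))

counts <- table(type)          # frequency of each level
props  <- prop.table(counts)   # share of each level, summing to 1

counts
round(props, 3)
```

On the real data the same two calls would be applied to fb$Type or fb$Post.Month.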
Compare like vs. share using a scatter plot
x=(fb$like)
y=(fb$share)
plot(x,y,xlab="likes", ylab="shares",main = "likes vs shares")
plot(x,y,xlab="likes", ylab="shares",main = "likes vs shares",xlim=c(0,1000),ylim= c(0,200))
x=fb$Post.Month
y=fb$like
plot(x,y,ylim=c(0,500),las=2,xlab="month",ylab = "likes")
PAID
summary(fb$Paid)
## 0 1
## 356 139
aggregate(fb$like~fb$Paid,FUN=summary)
## fb$Paid fb$like.Min. fb$like.1st Qu. fb$like.Median fb$like.Mean
## 1 0 0.0 54.0 96.0 157.1
## 2 1 0.0 65.5 128.0 235.6
## fb$like.3rd Qu. fb$like.Max.
## 1 182.0 1639.0
## 2 214.5 5172.0
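The long column names in the output above (fb$like.Min. and so on) come from writing fb$ inside the formula. Passing the data frame through the data argument gives shorter names. A sketch on a toy data frame with invented values:

```r
# Toy data frame with made-up values
toy <- data.frame(
  Paid = factor(c(0, 0, 0, 1, 1, 1)),
  like = c(50, 100, 150, 60, 130, 2000)
)

# Bare column names in the formula, data frame supplied separately
res <- aggregate(like ~ Paid, data = toy, FUN = median)
res
names(res)
```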
Analyze paid vs non-paid posts
par(mfrow=c(1,2))
#par(mfrow=c(1,2))
#par(mfcol=c(2,2))
x = fb[fb$Paid=="0", "Type"] # This is x axis, type of post for non-paid posts
y = fb[fb$Paid=="0", "like"] # This is y axis, likes for non-paid posts
plot(x, y, las=2,ylab="likes",ylim=c(0,500),main="Non-paid")
x = fb[fb$Paid=="1", "Type"] # This is x axis, type of post for paid posts
y = fb[fb$Paid=="1", "like"] # This is y axis, likes for paid posts
plot(x, y, las=2,ylab="likes",ylim=c(0,500),main="Paid")
CIS/STA 3920 - Data Mining for Business Analytics
1 - How many shares on average does each post get?
summary(fb$share)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 19.00 27.26 32.50 790.00
sum(fb$share)
## [1] 13496
sd(fb$share)
## [1] 42.65639
mean(fb$share)
## [1] 27.26465
2 - What is the most shares a post has gotten? What is the fewest?
max(fb$share)
## [1] 790
min(fb$share)
## [1] 0
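Both extremes can also be read in one call with range(), and which.max() locates the post behind the largest value. A sketch on made-up share counts:

```r
# Made-up share counts for six posts
share <- c(17, 29, 14, 147, 0, 33)

range(share)              # fewest and most shares in one call

top <- which.max(share)   # position of the post with the most shares
top
share[top]
```

On the real data, fb[which.max(fb$share), ] would show the full row for that post.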
3 - Plot a histogram of shares with enough breaks to get a clear look. What is the message you get from this? (refer to section II.3)
par(mfrow=c(1,1))
hist(fb$share)
hist(fb$share,breaks = 200,xlab = "# of share", ylab = "# of post",main = "SHARE")
hist(fb$share,breaks = 200,xlab = "# of share", ylab = "# of post",main = "SHARE",xlim = c(1,300))
4 - Create a box plot to compare shares by each “Type” of post. Which type is most effective? Which is least effective? (refer to section II.4)
class(fb$Type)
## [1] "factor"
levels(fb$Type)
## [1] "Link" "Photo" "Status" "Video"
x=fb$Type
y=fb$share
plot(x,y,xlab="type",ylab="share",main="Type V.S Share")
plot(x,y,xlab="type",ylab="share",main="Type V.S Share",ylim=c(1,200))
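To back up a reading of the box plot with numbers, the five-number summaries behind the boxes can be pulled out with boxplot(..., plot = FALSE). The sketch below uses a toy data frame with invented values, so the ranking it prints says nothing about the real posts:

```r
# Toy data frame with made-up values
toy <- data.frame(
  Type  = factor(c("Photo", "Photo", "Photo",
                   "Status", "Status", "Status",
                   "Video", "Video", "Video")),
  share = c(10, 20, 30, 15, 40, 50, 60, 70, 500)
)

# plot = FALSE returns the statistics without drawing
b <- boxplot(share ~ Type, data = toy, plot = FALSE)

# Row 3 of b$stats holds the median of each group
medians <- setNames(b$stats[3, ], b$names)
sort(medians, decreasing = TRUE)
```

Group sizes are in b$n, which is worth checking before ranking types: in fb, Video has only 7 posts.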
5 - Create a box plot to compare shares by each “Category” of post. How do you interpret this?
class(fb$share)
## [1] "integer"
class(fb$Category)
## [1] "factor"
levels(fb$Category)
## [1] "action" "inspiration" "product"
x=fb$Category
y=fb$share
plot(x,y,xlab="category",ylab="share",main="Category V.S Share")
plot(x,y,xlab="category",ylab="share",main="Category V.S Share",ylim=c(0,150))
6 - Create a separate box plot comparing shares by “Post.Month”, “Post.Weekday”, and “Post.Hour”. Summarize a few key points from this.
plot(fb$Post.Month,fb$share,xlab="Post by Month",ylab="Share",main="Post by Month V.S Share",ylim=c(0,150))
plot(fb$Post.Weekday,fb$share,xlab="Post by Weekday",ylab="Share",main="Post by Weekday V.S Share",ylim=c(0,150))
plot(fb$Post.Hour,fb$share,xlab="Post by Hour",ylab="Share",main="Post by Hour V.S Share",ylim=c(0,150),las=2)
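For the written summary, the median share per month, weekday, and hour can be tabulated with one loop over the three variables. This is a sketch on a toy data frame with invented values; time_vars is a name chosen here for illustration:

```r
# Toy data frame with made-up values; column names mirror those in fb
toy <- data.frame(
  Post.Month   = factor(c(1, 1, 2, 2)),
  Post.Weekday = factor(c(1, 2, 1, 2)),
  Post.Hour    = factor(c(3, 3, 10, 10)),
  share        = c(10, 20, 30, 40)
)

time_vars <- c("Post.Month", "Post.Weekday", "Post.Hour")

# Median share within each level of each time variable
medians <- lapply(time_vars, function(v) tapply(toy$share, toy[[v]], median))
names(medians) <- time_vars
medians
```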
7 - Create a 1x2 panel plot with one being a boxplot comparing shares by “Category” for non-paid posts, and the other for paid posts. Is there any noticeable difference? (IV.2)
par(mfrow=c(1,2))
str(fb$Paid)
## Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 1 1 ...
summary(fb$share)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 19.00 27.26 32.50 790.00
#paid
x=fb$Category[fb$Paid==1]
y=fb$share[fb$Paid==1]
plot(x,y,xlab="category",ylab="share",main="Paid",ylim=c(1,150))
#non-paid
x=fb$Category[fb$Paid==0]
y=fb$share[fb$Paid==0]
plot(x,y,xlab="category",ylab="share",main="Non-Paid",ylim=c(1,150))
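A variant of the panel plot, sketched on a toy data frame with invented values: it saves the old par() settings and restores them afterwards, and gives both panels the same y-axis so the boxes can be compared directly. The names old_par, y_range, and panel_data are chosen here for illustration.

```r
# Toy data frame with made-up values
toy <- data.frame(
  Paid     = factor(c(0, 0, 0, 0, 1, 1, 1, 1)),
  Category = factor(rep(c("action", "product"), times = 4)),
  share    = c(10, 20, 30, 40, 15, 60, 25, 80)
)

old_par <- par(mfrow = c(1, 2))   # 1x2 panel; keep the previous settings
y_range <- range(toy$share)       # shared y-axis for both panels

for (p in levels(toy$Paid)) {
  panel_data <- toy[toy$Paid == p, ]
  boxplot(share ~ Category, data = panel_data, ylim = y_range,
          main = if (p == "1") "Paid" else "Non-paid",
          xlab = "category", ylab = "share")
}

par(old_par)                      # back to a single-panel layout
```

Because levels(toy$Paid) is "0" then "1", the non-paid panel is drawn on the left and the paid panel on the right.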
8 - Compare the mean shares for non-paid and paid posts using the aggregate function. Is the result consistent with the boxplots shown before? (refer to section IV.1)
aggregate(share~Paid,data=fb,FUN=mean)
## Paid share
## 1 0 25.2191
## 2 1 32.5036
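When checking consistency with the box plots, keep in mind that a box plot marks the median, while the call above reports the mean, which a few heavily shared posts can pull upward. A sketch that returns both statistics side by side, using a toy data frame with invented values:

```r
# Toy data frame with made-up values; one paid post is an outlier
toy <- data.frame(
  Paid  = factor(c(0, 0, 0, 1, 1, 1)),
  share = c(10, 20, 30, 10, 20, 300)
)

# FUN may return several values; they end up as a matrix column named share
res <- aggregate(share ~ Paid, data = toy,
                 FUN = function(s) c(mean = mean(s), median = median(s)))
res
```

In this toy example the two groups have the same median but very different means, which is the kind of gap worth looking for on the real data.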
9 - Come up with your own interesting insight from the data. Support your claim with any appropriate statistics and/or visualizations.
CIS/STA 3920 - Data Mining for Business Analytics
HW2