CIS/STA 3920 - Data Mining for Business Analytics

Instructor: Charlie Terng
Student: Xukun LIU
Student: Dzenisa Bihorac
Student: Xicheng LIN
Student: Yanming Shou

Notes from class

1.1 Read in the Facebook data hosted at a link

fb = read.csv(url("http://guides.newman.baruch.cuny.edu/ld.php?content_id=39953204"),sep=";")

1.2 Get an initial understanding of how the data is structured

dim(fb)
## [1] 500  19
str(fb)
## 'data.frame':    500 obs. of  19 variables:
##  $ Page.total.likes                                                   : int  139441 139441 139441 139441 139441 139441 139441 139441 139441 139441 ...
##  $ Type                                                               : Factor w/ 4 levels "Link","Photo",..: 2 3 2 2 2 3 2 2 3 2 ...
##  $ Category                                                           : int  2 2 3 2 2 2 3 3 2 3 ...
##  $ Post.Month                                                         : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ Post.Weekday                                                       : int  4 3 3 2 2 1 1 7 7 6 ...
##  $ Post.Hour                                                          : int  3 10 3 10 3 9 3 9 3 10 ...
##  $ Paid                                                               : int  0 0 0 1 0 0 1 1 0 0 ...
##  $ Lifetime.Post.Total.Reach                                          : int  2752 10460 2413 50128 7244 10472 11692 13720 11844 4694 ...
##  $ Lifetime.Post.Total.Impressions                                    : int  5091 19057 4373 87991 13594 20849 19479 24137 22538 8668 ...
##  $ Lifetime.Engaged.Users                                             : int  178 1457 177 2211 671 1191 481 537 1530 280 ...
##  $ Lifetime.Post.Consumers                                            : int  109 1361 113 790 410 1073 265 232 1407 183 ...
##  $ Lifetime.Post.Consumptions                                         : int  159 1674 154 1119 580 1389 364 305 1692 250 ...
##  $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page       : int  3078 11710 2812 61027 6228 16034 15432 19728 15220 4309 ...
##  $ Lifetime.Post.reach.by.people.who.like.your.Page                   : int  1640 6112 1503 32048 3200 7852 9328 11056 7912 2324 ...
##  $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int  119 1108 132 1386 396 1016 379 422 1250 199 ...
##  $ comment                                                            : int  4 5 0 58 19 1 3 0 0 3 ...
##  $ like                                                               : int  79 130 66 1572 325 152 249 325 161 113 ...
##  $ share                                                              : int  17 29 14 147 49 33 27 14 31 26 ...
##  $ Total.Interactions                                                 : int  100 164 80 1777 393 186 279 339 192 142 ...

1.3 Take a look at the variable types. Do they make sense for the given variable?

fb$Category = as.factor(fb$Category)
fb$Type = as.factor(fb$Type)
fb$Post.Month = as.factor(fb$Post.Month)
fb$Post.Weekday = as.factor(fb$Post.Weekday)
fb$Post.Hour = as.factor(fb$Post.Hour)
fb$Paid = as.factor(fb$Paid)
str(fb)
## 'data.frame':    500 obs. of  19 variables:
##  $ Page.total.likes                                                   : int  139441 139441 139441 139441 139441 139441 139441 139441 139441 139441 ...
##  $ Type                                                               : Factor w/ 4 levels "Link","Photo",..: 2 3 2 2 2 3 2 2 3 2 ...
##  $ Category                                                           : Factor w/ 3 levels "1","2","3": 2 2 3 2 2 2 3 3 2 3 ...
##  $ Post.Month                                                         : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ Post.Weekday                                                       : Factor w/ 7 levels "1","2","3","4",..: 4 3 3 2 2 1 1 7 7 6 ...
##  $ Post.Hour                                                          : Factor w/ 22 levels "1","2","3","4",..: 3 10 3 10 3 9 3 9 3 10 ...
##  $ Paid                                                               : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 1 1 ...
##  $ Lifetime.Post.Total.Reach                                          : int  2752 10460 2413 50128 7244 10472 11692 13720 11844 4694 ...
##  $ Lifetime.Post.Total.Impressions                                    : int  5091 19057 4373 87991 13594 20849 19479 24137 22538 8668 ...
##  $ Lifetime.Engaged.Users                                             : int  178 1457 177 2211 671 1191 481 537 1530 280 ...
##  $ Lifetime.Post.Consumers                                            : int  109 1361 113 790 410 1073 265 232 1407 183 ...
##  $ Lifetime.Post.Consumptions                                         : int  159 1674 154 1119 580 1389 364 305 1692 250 ...
##  $ Lifetime.Post.Impressions.by.people.who.have.liked.your.Page       : int  3078 11710 2812 61027 6228 16034 15432 19728 15220 4309 ...
##  $ Lifetime.Post.reach.by.people.who.like.your.Page                   : int  1640 6112 1503 32048 3200 7852 9328 11056 7912 2324 ...
##  $ Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post: int  119 1108 132 1386 396 1016 379 422 1250 199 ...
##  $ comment                                                            : int  4 5 0 58 19 1 3 0 0 3 ...
##  $ like                                                               : int  79 130 66 1572 325 152 249 325 161 113 ...
##  $ share                                                              : int  17 29 14 147 49 33 27 14 31 26 ...
##  $ Total.Interactions                                                 : int  100 164 80 1777 393 186 279 339 192 142 ...

1.4 The “Category” variable indicates its marketing purpose

levels(fb$Category)
## [1] "1" "2" "3"
fb$Category = as.character(fb$Category)
fb$Category[fb$Category=="1"] = "action"
fb$Category[fb$Category=="2"] = "product"
fb$Category[fb$Category=="3"] = "inspiration"
fb$Category = as.factor(fb$Category)
levels(fb$Category)
## [1] "action"      "inspiration" "product"
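
The recoding above goes through as.character( ) and back to a factor. As a minimal sketch of an alternative, factor( ) can attach the labels in one step. The vector category_codes below is a made-up stand-in for fb$Category, not the real column; the code-to-label mapping is the one used in the notes.

```r
# Toy vector of category codes (illustrative only, not the real column)
category_codes <- c(2, 2, 3, 1, 3)

# Map 1/2/3 to their marketing purpose in a single call
category <- factor(category_codes,
                   levels = c(1, 2, 3),
                   labels = c("action", "product", "inspiration"))

levels(category)
table(category)
```

With this approach the levels keep the order given in labels (action, product, inspiration) rather than the alphabetical order produced by as.factor( ), which changes the left-to-right order of groups in later plots.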

1.5 How many missing values do we have in the dataset?

colSums(is.na(fb))
##                                                    Page.total.likes 
##                                                                   0 
##                                                                Type 
##                                                                   0 
##                                                            Category 
##                                                                   0 
##                                                          Post.Month 
##                                                                   0 
##                                                        Post.Weekday 
##                                                                   0 
##                                                           Post.Hour 
##                                                                   0 
##                                                                Paid 
##                                                                   1 
##                                           Lifetime.Post.Total.Reach 
##                                                                   0 
##                                     Lifetime.Post.Total.Impressions 
##                                                                   0 
##                                              Lifetime.Engaged.Users 
##                                                                   0 
##                                             Lifetime.Post.Consumers 
##                                                                   0 
##                                          Lifetime.Post.Consumptions 
##                                                                   0 
##        Lifetime.Post.Impressions.by.people.who.have.liked.your.Page 
##                                                                   0 
##                    Lifetime.Post.reach.by.people.who.like.your.Page 
##                                                                   0 
## Lifetime.People.who.have.liked.your.Page.and.engaged.with.your.post 
##                                                                   0 
##                                                             comment 
##                                                                   0 
##                                                                like 
##                                                                   1 
##                                                               share 
##                                                                   4 
##                                                  Total.Interactions 
##                                                                   0
sum(is.na(fb))
## [1] 6
fb = na.omit(fb)
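
Note that the number of missing cells and the number of rows dropped by na.omit( ) need not match: the output above reports 6 missing cells, while the Paid counts later in these notes add up to 495 rows, so 5 of the 500 rows were removed. The sketch below shows the difference on a small made-up data frame (toy and its values are assumptions for illustration only).

```r
# Made-up data frame; row 2 has two missing cells, row 4 has one
toy <- data.frame(
  like  = c(10, NA, 30, 40, 50),
  share = c(1, NA, 3, 4, 5),
  Paid  = c(0, 1, 0, NA, 1)
)

missing_cells   <- sum(is.na(toy))            # counts individual NA cells
incomplete_rows <- sum(!complete.cases(toy))  # counts rows with at least one NA
rows_kept       <- nrow(na.omit(toy))         # rows that survive na.omit()

c(missing_cells = missing_cells,
  incomplete_rows = incomplete_rows,
  rows_kept = rows_kept)
```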

2.1 Provide descriptive statistics for the “like” variable (min, max, median, Q1, Q3, mean, and standard deviation)

summary(fb$like)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    57.0   101.0   179.1   188.0  5172.0
sd(fb$like)
## [1] 324.4122

2.2 What do the summary statistics for “like” tell us about the cosmetic company’s social media performance on Facebook?

summary(fb$like)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    57.0   101.0   179.1   188.0  5172.0
sum(fb$like)
## [1] 88677
sd(fb$like)
## [1] 324.4122
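
One possible way to read these numbers, offered as a sketch rather than the expected answer: the mean sits well above the median and the maximum is far above Q3, which is the usual signature of a right-skewed distribution where a few posts collect most of the likes. The snippet below only re-uses the values printed above; the variable names are made up for illustration.

```r
# Values copied from the summary() and sd() output above
like_mean   <- 179.1
like_median <- 101.0
like_sd     <- 324.4122

# A ratio well above 1 suggests the mean is pulled up by a few large posts
mean_to_median <- like_mean / like_median

# Standard deviation relative to the mean (coefficient of variation)
coef_of_variation <- like_sd / like_mean

round(c(mean_to_median = mean_to_median,
        coef_of_variation = coef_of_variation), 2)
```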

2.3 Plot a histogram of “like” using the hist( ) function.

hist(fb$like)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post")

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 1)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 10)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 50)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 100)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 200)

hist(fb$like,main = "LIKE",xlab = "# of like",ylab = "# of post",breaks = 200, xlim = c(1,1000))
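
Two details of hist( ) are worth keeping in mind here. A single number passed to breaks is only a suggestion, so the number of bars drawn may differ from it, and xlim only zooms the view without re-binning the data. The sketch below uses simulated counts (toy_likes is made up, not the Facebook data) and plot = FALSE so that only the bin information is returned.

```r
# Simulated right-skewed counts standing in for fb$like (assumption)
set.seed(3920)
toy_likes <- round(rexp(500, rate = 1 / 180))

# breaks = 50 is a suggestion; check how many bins were actually used
h_all <- hist(toy_likes, breaks = 50, plot = FALSE)
length(h_all$counts)

# To get finer bins in the region of interest, subset first, then bin
under_1000 <- toy_likes[toy_likes <= 1000]
h_zoom <- hist(under_1000, breaks = 50, plot = FALSE)
length(h_zoom$counts)
```

Subsetting before binning gives narrower bars for the bulk of the posts, whereas adding xlim to the full histogram keeps the bin width chosen for the whole range.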

2.4 Provide descriptive statistics of the categorical variables with summary( ) function (frequency, proportions)

summary(fb$Type)
##   Link  Photo Status  Video 
##     22    421     45      7
summary(fb$Post.Month)
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 24 26 36 50 37 49 52 34 35 57 45 50
plot(fb$Type)

plot(fb$Post.Month,las=2,main="Post by Month",xlab="Month",ylab="# of post")
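
The summary( ) calls above give frequencies only. A minimal sketch of the proportions, using the Type counts printed above (the vector type_counts is typed in by hand for illustration):

```r
# Counts copied from the summary(fb$Type) output above
type_counts <- c(Link = 22, Photo = 421, Status = 45, Video = 7)

# prop.table() divides each count by the total
type_share <- prop.table(type_counts)

# Percentages rounded to one decimal place
round(100 * type_share, 1)
##   Link  Photo Status  Video 
##    4.4   85.1    9.1    1.4

# On the data frame itself the equivalent call would be:
# prop.table(table(fb$Type))
```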

Compare Like vs. Share using a scatter plot

x=(fb$like)
y=(fb$share)
plot(x,y,xlab="likes", ylab="shares",main = "likes vs shares")

plot(x,y,xlab="likes", ylab="shares",main = "likes vs shares",xlim=c(0,1000),ylim= c(0,200))
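
A scatter plot can be backed up with a correlation coefficient. The sketch below uses only the first ten like and share values shown in the str( ) output in section 1.2, so the result describes those ten posts and not the full dataset. Spearman is included as an assumption-light alternative because one post in this sample has far more likes than the rest.

```r
# First ten values of like and share from the str(fb) output
like  <- c(79, 130, 66, 1572, 325, 152, 249, 325, 161, 113)
share <- c(17, 29, 14, 147, 49, 33, 27, 14, 31, 26)

# Pearson measures linear association and is sensitive to outliers
pearson <- cor(like, share)

# Spearman works on ranks, so a single extreme post has less influence
spearman <- cor(like, share, method = "spearman")

c(pearson = pearson, spearman = spearman)

# On the cleaned data frame the equivalent call would be:
# cor(fb$like, fb$share)
```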

x=fb$Post.Month
y=fb$like
plot(x,y,ylim=c(0,500),las=2,xlab="month",ylab = "likes")

PAID

summary(fb$Paid)
##   0   1 
## 356 139
aggregate(fb$like~fb$Paid,FUN=summary)
##   fb$Paid fb$like.Min. fb$like.1st Qu. fb$like.Median fb$like.Mean
## 1       0          0.0            54.0           96.0        157.1
## 2       1          0.0            65.5          128.0        235.6
##   fb$like.3rd Qu. fb$like.Max.
## 1           182.0       1639.0
## 2           214.5       5172.0

Analyze paid vs non-paid posts

par(mfrow=c(1,2))
#par(mfrow=c(1,2))
#par(mfcol=c(2,2))

x = fb[fb$Paid=="0", "Type"]    # This is x axis, type of post for non-paid posts
y = fb[fb$Paid=="0", "like"]      # This is y axis, likes for non-paid posts
plot(x, y, las=2,ylab="likes",ylim=c(0,500),main="Non-paid")

x = fb[fb$Paid=="1", "Type"]    # This is x axis, type of post for paid posts
y = fb[fb$Paid=="1", "like"]      # This is y axis, likes for paid posts
plot(x, y, las=2,ylab="likes",ylim=c(0,500),main="Paid")

Group Assignment - HW2

1 - How many shares on average does each post get?

summary(fb$share)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   19.00   27.26   32.50  790.00
sum(fb$share)
## [1] 13496
sd(fb$share)
## [1] 42.65639
mean(fb$share)
## [1] 27.26465

2 - What is the most shares a post has gotten? What is the fewest?

max(fb$share)
## [1] 790
min(fb$share)
## [1] 0

3 - Plot a histogram of “share” with enough breaks to get a clear look. What message do you get from this? (refer to section II.3)

par(mfrow=c(1,1))
hist(fb$share)

hist(fb$share,breaks = 200,xlab = "# of share", ylab = "# of post",main = "SHARE")

hist(fb$share,breaks = 200,xlab = "# of share", ylab = "# of post",main = "SHARE",xlim = c(1,300))

4 - Create a box plot to compare shares by each “Type” of post. Which type is most effective? Which is least effective? (refer to section II.4)

class(fb$Type)
## [1] "factor"
levels(fb$Type)
## [1] "Link"   "Photo"  "Status" "Video"
x=fb$Type
y=fb$share
plot(x,y,xlab="type",ylab="share",main="Type V.S Share")

plot(x,y,xlab="type",ylab="share",main="Type V.S Share",ylim=c(1,200))
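
To back up a visual comparison, the numbers behind a box plot can be pulled out directly. The sketch below does this for the first ten posts only, with Type and share taken from the str( ) output in section 1.2 (Type codes 2 and 3 correspond to Photo and Status), so it illustrates the technique rather than answering the question for the full dataset.

```r
# First ten posts from the str(fb) output
first10 <- data.frame(
  Type  = factor(c("Photo", "Status", "Photo", "Photo", "Photo",
                   "Status", "Photo", "Photo", "Status", "Photo")),
  share = c(17, 29, 14, 147, 49, 33, 27, 14, 31, 26)
)

# Median shares per type, the thick line in each box
median_by_type <- tapply(first10$share, first10$Type, median)
median_by_type

# plot = FALSE returns the box plot statistics without drawing anything
bp <- boxplot(share ~ Type, data = first10, plot = FALSE)
bp$stats  # one column per type: lower whisker, Q1, median, Q3, upper whisker
bp$out    # points that would be drawn as outliers
```

On the full data the same calls would use fb in place of first10.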

5 - Create a box plot to compare shares by each “Category” of post. How do you interpret this?

class(fb$share)
## [1] "integer"
class(fb$Category)
## [1] "factor"
levels(fb$Category)
## [1] "action"      "inspiration" "product"
x=fb$Category
y=fb$share
plot(x,y,xlab="category",ylab="share",main="Category V.S Share")

plot(x,y,xlab="category",ylab="share",main="Category V.S Share",ylim=c(0,150))

6 - Create a separate box plot comparing shares by “Post.Month”, “Post.Weekday”, and “Post.Hour”. Summarize a few key points from this.

plot(fb$Post.Month,fb$share,xlab="Post by Month",ylab="Share",main="Post by Month V.S Share",ylim=c(0,150))

plot(fb$Post.Weekday,fb$share,xlab="Post by Weekday",ylab="Share",main="Post by Weekday V.S Share",ylim=c(0,150))

plot(fb$Post.Hour,fb$share,xlab="Post by Hour",ylab="Share",main="Post by Hour V.S Share",ylim=c(0,150),las=2)

7 - Create a 1x2 panel plot with one being a boxplot comparing shares by “Category” for non-paid posts, and the other for paid posts. Is there any noticeable difference? (IV.2)

par(mfrow=c(1,2))
str(fb$Paid)
##  Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 2 1 1 ...
summary(fb$share)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   19.00   27.26   32.50  790.00
#paid
x=fb$Category[fb$Paid==1]
y=fb$share[fb$Paid==1]
plot(x,y,xlab="category",ylab="share",main="Paid",ylim=c(1,150))
#non-paid
x=fb$Category[fb$Paid==0]
y=fb$share[fb$Paid==0]
plot(x,y,xlab="category",ylab="share",main="Non-Paid",ylim=c(1,150))
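
When two panels are compared side by side, it helps to give them the same y-axis range and to restore the plotting layout afterwards. The sketch below shows one way to do this, again on the first ten posts from the str( ) output in section 1.2 only (Category codes 2 and 3 are product and inspiration); the names first10, shared_ylim and old_par are made up for illustration.

```r
# First ten posts from the str(fb) output
first10 <- data.frame(
  Category = factor(c("product", "product", "inspiration", "product",
                      "product", "product", "inspiration", "inspiration",
                      "product", "inspiration")),
  Paid  = factor(c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0)),
  share = c(17, 29, 14, 147, 49, 33, 27, 14, 31, 26)
)

# One y-axis range for both panels so the boxes are directly comparable
shared_ylim <- range(first10$share)

# par() returns the previous settings, which makes them easy to restore
old_par <- par(mfrow = c(1, 2))
boxplot(share ~ Category, data = first10[first10$Paid == "0", ],
        ylim = shared_ylim, main = "Non-Paid",
        xlab = "category", ylab = "share")
boxplot(share ~ Category, data = first10[first10$Paid == "1", ],
        ylim = shared_ylim, main = "Paid",
        xlab = "category", ylab = "share")
par(old_par)  # back to a single-panel layout
```

A fixed ylim such as c(1,150) keeps the boxes readable on the full data but hides any post above 150 shares, so it is worth stating that the axis was clipped.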

8 - Compare the mean shares for non-paid and paid posts using the aggregate function. Is the result consistent with the boxplots shown before? (refer to section IV.1)

aggregate(share~Paid,data=fb,FUN=mean)
##   Paid   share
## 1    0 25.2191
## 2    1 32.5036
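
A box plot shows medians while the aggregate( ) call above reports means, and the two can disagree when a few posts have very high share counts. As a sketch, aggregate( ) can return both at once. The example uses only the first ten posts from the str( ) output in section 1.2, so the numbers illustrate the effect and say nothing about the full dataset.

```r
# First ten posts from the str(fb) output
first10 <- data.frame(
  Paid  = factor(c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0)),
  share = c(17, 29, 14, 147, 49, 33, 27, 14, 31, 26)
)

# FUN may return several statistics; they end up as columns of a matrix
by_paid <- aggregate(share ~ Paid, data = first10,
                     FUN = function(x) c(mean = mean(x), median = median(x)))
by_paid

# Individual columns can be read from the matrix stored in by_paid$share
by_paid$share[, "median"]
```

In this small sample the paid group has the higher mean but the lower median, because one paid post with 147 shares pulls the mean up.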

9 - Come up with your own interesting insight from the data. Support your claim with any appropriate statistics and/or visualizations.
