Hello! As a new OpenIntro employee, I was assigned the task of reorganizing the company’s website data from our parent company. Our company is deciding whether the OpenIntro website generates enough traffic and ROI to keep the site online. The web domain costs 25k annually to retain, plus an extra 3k for upkeep.
To gauge site activity, our company looks for high consumer interaction. We defined this as populated comment sections, overall views over 50,000, and a high average number of likes.
My first task was renaming the data frame columns, as the file had the wrong column names for sector, views, likes, and comments.
setwd("C:/Users/walki/Documents/")
d<-read.csv("datasets.csv")
colnames(d)<-c('Sector','WebURL','Title','Views','Likes','Comments')
print(d[1,])
## Sector WebURL Title Views Likes Comments NA NA NA
## 1 AER Affairs Fair's Extramarital Affairs Data 601 9 2 0 2 0
## NA NA
## 1 7 https://vincentarelbundock.github.io/Rdatasets/csv/AER/Affairs.csv
## NA
## 1 https://vincentarelbundock.github.io/Rdatasets/doc/AER/Affairs.html
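The NA headers in the output above appear because six names were assigned to a data frame that has more than six columns. One way to avoid this is to rename only the columns of interest by index. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the real master file) to show the idea.

```r
# Hypothetical toy data frame standing in for the master file (values are made up)
toy <- data.frame(a = c("x", "y"), b = 1:2, c = 3:4, d = c("u", "v"))

# Rename only the first three columns; the fourth keeps its original name
colnames(toy)[1:3] <- c("Sector", "Views", "Likes")

print(colnames(toy))
```

Since the extra columns are dropped in the next step anyway, either approach leads to the same final data frame; indexing just keeps the intermediate output easier to read.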
Now, I will extract my sector’s information from the master CSV. My boss noted that I do not need the last two columns in our copy.
openIntro<-subset(d,Sector=='openintro')
openIntro<-openIntro[,1:6]
write.csv(openIntro,file="openIntro_data.csv",row.names = FALSE)
print(openIntro[1,1:6])
## Sector WebURL Title Views
## 1062 openintro absenteeism Absenteeism from school in New South Wales 146
## Likes Comments
## 1062 5 3
For our analytics team, we need to calculate our KPI and add it to the data so it can be searched. Our KPI measures engagement as likes per view, so we included it in the data frame as a new column.
KPI<-openIntro$Likes/openIntro$Views
openIntro$KPI<-KPI
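As a quick sanity check of the calculation, the sketch below applies the same likes-per-view ratio to a tiny data frame. The first row uses the 146 views and 5 likes of the absenteeism post shown earlier; the second row and the name `toy` are made up for illustration.

```r
# Toy data: row 1 mirrors the absenteeism post, row 2 is hypothetical
toy <- data.frame(Views = c(146, 200), Likes = c(5, 10))

# Engagement KPI: likes per view, computed row by row
toy$KPI <- toy$Likes / toy$Views

print(toy)
```

A higher value means a larger share of viewers liked the post, so the ratio is comparable across posts with very different view counts.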
Now we can search by KPI in our data frame. With the new column in place, let’s see which journal entry had the largest engagement.
Top.Post<-openIntro$Title[which(openIntro$KPI==max(openIntro$KPI))]
print(openIntro[1,])
## Sector WebURL Title Views
## 1062 openintro absenteeism Absenteeism from school in New South Wales 146
## Likes Comments KPI
## 1062 5 3 0.03424658
print(Top.Post)
## [1] "Findings on n-3 Fatty Acid Supplement Health Benefits"
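For reference, base R also offers `which.max()`, which returns the position of the first maximum. The sketch below, on made-up data (`toy` is hypothetical), contrasts it with the comparison against `max()`, which returns every tied row.

```r
# Hypothetical posts; B and C tie for the highest KPI
toy <- data.frame(Title = c("A", "B", "C"),
                  KPI = c(0.1, 0.5, 0.5),
                  stringsAsFactors = FALSE)

# which.max() keeps only the first row holding the maximum
first_top <- toy$Title[which.max(toy$KPI)]

# Comparing against max() keeps all tied rows
all_top <- toy$Title[toy$KPI == max(toy$KPI)]

print(first_top)
print(all_top)
```

Which form is preferable depends on whether ties should be reported; both assume the KPI column has no missing values.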
The data frame is now corrected. Let’s print out the summary for OpenIntro’s web journals.
summary(openIntro)
## Sector WebURL Title Views
## Length:206 Length:206 Length:206 Min. : 2.0
## Class :character Class :character Class :character 1st Qu.: 59.5
## Mode :character Mode :character Mode :character Median : 198.0
## Mean : 10182.9
## 3rd Qu.: 1000.0
## Max. :1414593.0
## Likes Comments KPI
## Min. : 1.000 Min. : 0.000 Min. : 0.000004
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 0.003727
## Median : 3.000 Median : 0.500 Median : 0.016097
## Mean : 7.248 Mean : 1.437 Mean : 0.187771
## 3rd Qu.: 7.750 3rd Qu.: 2.000 3rd Qu.: 0.073642
## Max. :123.000 Max. :46.000 Max. :24.000000
From the summary, we can see that the overall mean is about 10,183 views. Our posts have the potential to reach a large audience, as the maximum was 1,414,593 views. However, let’s concentrate our search on the middle of the Views range for clarity.
library(ggplot2)
ggplot(openIntro, aes(x=Views, y=Comments,color=Comments>0)) + geom_point()+xlim(0,10900)+ylim(0,46)
## Warning: Removed 11 rows containing missing values (geom_point).
From this scatter plot, we can see that a majority of OpenIntro’s web journals received no comments. There seems to be a slight trend of higher views getting more comments; however, it is too small to show potential. To confirm, I made a test data frame grouping comments into ranges to see more detail. A discovery was made that this was not possible as the box plots were too
temp<-openIntro
temp$G<-NA
temp$G[temp$Comments<5]<-5
temp$G[temp$Comments>4]<-10
temp$G[temp$Comments>10]<-50
ggplot(temp, aes(x = G, y = Comments,group=G,fill=G)) + geom_boxplot()
In this box plot, the 0 to 5 and 5 to 10 ranges are heavily concentrated compared to the 20 to 50 range. There is little variety in comment counts, and posts rarely receive more than 10 comments.
For Likes, the website pulls at most 123 likes on a post. It appears that most posts fall below 25 likes. Likes are essential for our KPI, so these current statistics are troubling.
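The same grouping can also be written with base R's `cut()`, which assigns each value to a labeled bin in one step. This is only a sketch on made-up comment counts; the vector `comments` and the bin labels are hypothetical, and the breaks are chosen to mirror the thresholds used above (under 5, 5 to 10, over 10).

```r
# Hypothetical comment counts
comments <- c(0, 3, 4, 5, 10, 11, 46)

# Intervals are right-closed by default: (-Inf, 4], (4, 10], (10, Inf)
bins <- cut(comments,
            breaks = c(-Inf, 4, 10, Inf),
            labels = c("0-4", "5-10", "11+"))

# Count how many posts land in each bin
print(table(bins))
```

Because `cut()` returns a factor, the result can be passed directly as the grouping variable of a box plot.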
ggplot(openIntro, aes(x=Views,y=Likes,color=Comments>0)) + geom_point()+xlim(0,50000)
## Warning: Removed 4 rows containing missing values (geom_point).
This raises a concern about views, since our comments are not in their target range. To check our view target, we need to see how frequently posts reached it.
temp<-subset(openIntro,Views<=100000)
ggplot(temp, aes(x=Views))+geom_histogram(color="red",fill='pink')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(temp, aes(x=Views))+geom_histogram(color="red",fill='pink',binwidth = 1000)+geom_vline(aes(xintercept=mean(Views)),color="red", linetype="dashed", size=1)
Unfortunately, the overall views on our domain do not meet the targeted average of 50k. The average is below 5,000.
The average post generates roughly 5k views, and our KPI rates are low. The parent company needs its target range met at a minimum to cover operational costs. One suggestion is revamping the content on this website, but that decision is up to the analytics department after further analysis.
In conclusion, the OpenIntro website is not generating enough traffic to justify the cost to our parent company.
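One way to put a number on how often the target is reached is to compute the share of posts whose views exceed 50,000. The sketch below uses made-up view counts; `views`, `target`, and `share_meeting_target` are hypothetical names, not part of the analysis above.

```r
# Hypothetical view counts for five posts
views <- c(2, 60, 198, 1000, 60000)

# Target from the company definition: overall views over 50,000
target <- 50000

# The mean of a logical vector is the proportion of TRUE values
share_meeting_target <- mean(views > target)

print(share_meeting_target)
```

Applied to the real Views column, the same expression would give the fraction of posts that clear the threshold, which complements the histogram.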