Title: Final-Report-Flickr-Data
Team Members: Blesson Thomas, Shekha Saxena, Shivam Namdeo
What is Flickr? It is one of the largest photo management and sharing applications in the world. It lets users share their photos on its platform and enables new ways of organizing photos and videos.
Flickr API methods: flickr.interestingness.getList, flickr.tags.getListPhoto, flickr.tags.getRelated, flickr.tags.getHotList, flickr.places.placesForTags
We predicted how many photos with similar tags would be uploaded soon after a photo with given tags was uploaded in a given region.
We found that tags were correlated with the regions where they originated.
Predicting the number of uploads Flickr should expect could help it plan storage and server resources for upcoming traffic and data.
We ran prediction algorithms and measured the accuracy of their predictions.
We used the k-nearest neighbors (k-NN) algorithm to predict the photo count.
RMSE: 16772.93, MAE: 6158.55
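The report loads an already trained model from Knn.rds, so the training step itself is not shown. As a rough illustration of how k-NN regression can work on categorical features such as tag and continent, here is a minimal base-R sketch. The function `knn_predict`, the choice of Hamming distance, `k = 2`, and the four-row toy table are all assumptions made for this example and are not the model stored in Knn.rds.

```r
# Minimal k-NN regression sketch on categorical features (illustrative only).
# knn_predict() and the toy data are hypothetical, not the report's model.
knn_predict <- function(train, new_tag, new_continent, k = 2) {
  # Hamming distance: number of categorical features that differ
  d <- (train$tags != new_tag) + (train$continent != new_continent)
  # Take the k closest rows (ties keep row order) and average their counts
  nearest <- order(d)[seq_len(min(k, nrow(train)))]
  mean(train$photoCount[nearest])
}

toy <- data.frame(
  tags       = c("palisades", "palisades", "floridakeys", "floridakeys"),
  continent  = c("North America", "Europe", "North America", "Europe"),
  photoCount = c(7138, 21, 9867, 6),
  stringsAsFactors = FALSE
)

# Average of the two rows closest to (palisades, North America)
pred <- knn_predict(toy, "palisades", "North America", k = 2)
```

With so few features, many rows tie at the same distance, so the tie-breaking rule has a large effect on the prediction. That may be one reason a k-NN model on tag and continent alone shows large errors.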
library(tm)
library(tidyverse)
library(forecast)
library(wordcloud)
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
library(tidytext)
mydata <- read_csv("flickr.csv")
predictData <- read_csv("knn.csv")
KnnModel<-readRDS("Knn.rds")
head(mydata)
## # A tibble: 6 x 3
## tags continent photoCount
## <chr> <chr> <dbl>
## 1 palisades North America 7138
## 2 palisades Europe 21
## 3 palisades Australia 3
## 4 palisades Asia 3
## 5 floridakeys North America 9867
## 6 floridakeys Europe 6
mydata$tags = as.factor(mydata$tags)
mydata$continent = as.factor(mydata$continent)
# partition
set.seed(1)
train.index <- sample(c(1:dim(mydata)[1]), dim(mydata)[1]*0.80)
train <- mydata[train.index, ]
valid <- mydata[-train.index, ]
accuracy(predictData$Prediction, predictData$Actual)
##                 ME     RMSE      MAE  MPE MAPE
## Test set -1896.413 16772.93 6158.554 -Inf  Inf
Implication - Using the data, we tried to predict the expected photo count when a new photo with a given tag is uploaded in a continent. This will help the Flickr team prepare its servers for a surge in uploads. We predict this value because trends spread quickly and are widely followed, so Flickr needs to be prepared for an upcoming trend. Although the error is high (RMSE of about 16,773 photos), Flickr can still use the prediction as a rough estimate of the number of photos that might be uploaded. The model could be trained further and made more accurate, for example with external data.
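To make the error measures reported above concrete, the sketch below computes them by hand in base R, using the same convention as `accuracy()` (error = actual minus prediction). The two vectors are made-up values, not the contents of knn.csv. The example also shows that a single actual value of zero is enough to make MAPE infinite, which is consistent with (though not proof of) zero counts being present in the report's test data.

```r
# Hypothetical actual and predicted photo counts (not the report's data)
actual    <- c(10, 0, 250, 4000)
predicted <- c(12, 5, 200, 3000)

err  <- actual - predicted             # forecast error, actual minus prediction
me   <- mean(err)                      # mean error (bias)
rmse <- sqrt(mean(err^2))              # root mean squared error
mae  <- mean(abs(err))                 # mean absolute error
mape <- mean(abs(err / actual)) * 100  # infinite when any actual value is 0
```

RMSE squares the errors before averaging, so a few badly mispredicted tags dominate it. A large gap between RMSE and MAE, as in the report, usually points to a small number of very large errors.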
plot1 = mydata %>% group_by(continent) %>% summarise(photoCount = sum(photoCount))
ggplot(plot1, aes(x = continent, y = photoCount, fill=continent)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Continents where Flickr is used most") +
theme(legend.position="none", axis.text.x = element_text(hjust = 1))
Implication - As we can see from the graph above, there is huge scope for Flickr to market in Africa, South America and Australia. The biggest market for Flickr is Europe.
plot2 = mydata %>% group_by(tags) %>% summarise(photoCount = sum(photoCount)) %>% filter(photoCount > 800000)
ggplot(plot2, aes(x = tags, y = photoCount, fill=tags)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Top tags by photo count worldwide") +
theme(legend.position="none", axis.text.x = element_text(hjust = 1))
Implication - From the graph above we can see that the most popular tag all over the world is “instagramapp”.
Summary
Since most peer comments suggested that we perform sentiment analysis on comments, below is a sentiment analysis of the comments on the most interesting photos on Flickr.
myData <- read_csv("Comment_flickr.csv")
corpus<-iconv(myData$comments)
corpus<-Corpus(VectorSource(corpus))
inspect(corpus[1:5])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 5
##
## [1] Fantastic love this photo !!
## [2] [https://www.flickr.com/photos/112944336@N06] Thanks !
## [3] Fabulous image, fantastic captured, a very successful job
## [4] There is a big difference between a snapshot and a photo.\nMany who posted on Flickr cannot tell the difference but you definitely could. Thank you for sharing this aesthetically pleasing, attention catching, clean photo.\nps: My comment is not individualised but I really choose photos I comment. If you don’t like having my comment, just block me.
## [5] Excellent image!
# Wrap base functions in content_transformer() so the result stays a corpus
corpus<-tm_map(corpus,content_transformer(tolower))
corpus<-tm_map(corpus,removePunctuation)
corpus<-tm_map(corpus,removeNumbers)
clean_set<-tm_map(corpus,removeWords,stopwords('english'))
length(clean_set)
## [1] 2200
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
clean_set <- tm_map(clean_set, content_transformer(removeURL))
clean_set<-tm_map(clean_set,stripWhitespace)
tdm<-TermDocumentMatrix(clean_set)
tdm<-as.matrix(tdm)
tdm[1:10,1:20]
## Docs
## Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## fantastic 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## love 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## photo 1 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## thanks 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## captured 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## fabulous 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## image 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## job 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## successful 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## aesthetically 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
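The cleaning steps and the term-document matrix above can be mimicked on a tiny input with base R alone, which makes it easier to see what each step does. The two comments below are made up, stop-word removal is left out for brevity, and `tdm_toy` is a hypothetical name. Note the order: because punctuation and digits are stripped first, a URL collapses into a single alphanumeric token starting with "http", which is why the simple pattern used in `removeURL` is enough to delete it.

```r
# Toy comments (hypothetical), cleaned in the same order as the tm pipeline
comments <- c("Fantastic love this photo !!",
              "[https://www.flickr.com/photos/1234] Thanks !")

x <- tolower(comments)
x <- gsub("[[:punct:]]", "", x)        # URL collapses to one alphanumeric token
x <- gsub("[0-9]+", "", x)             # drop digits
x <- gsub("http[[:alnum:]]*", "", x)   # same pattern as removeURL
x <- trimws(gsub("\\s+", " ", x))      # collapse and trim whitespace

# Split into words and count each term per document
words <- strsplit(x, " ")
terms <- sort(unique(unlist(words)))
tdm_toy <- vapply(words,
                  function(w) tabulate(match(w, terms), nbins = length(terms)),
                  integer(length(terms)))
rownames(tdm_toy) <- terms             # rows are terms, columns are documents
```

If the URL pattern were applied before punctuation removal, it would only strip the leading "https" and leave the rest of the address behind.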
# Here we plot the words with a frequency of at least 50
w<-rowSums(tdm)
w<-subset(w,w>=50)
barplot(w,las=2,col=rainbow(50))
Implication - From the plot above we can see the most frequently occurring words in the corpus.
# Wordcloud
w <- sort(rowSums(tdm), decreasing = TRUE)
set.seed(222)
wordcloud(words = names(w),
freq = w,
max.words = 100,
random.order = F,
min.freq = 5,
colors = brewer.pal(8, 'Dark2'),
scale = c(5, 0.3),
rot.per = 0.7)
Implication - Here we have plotted the word cloud, in which the most frequently occurring words appear larger and closer to the center of the cloud.
comments<-iconv(myData$comments)
s<-get_nrc_sentiment(comments)
barplot(colSums(s),
las = 2,
col = rainbow(10),
ylab = 'Count',
main = 'Sentiment Scores for Flickr Comments')
Implication - From the plot above we can see that most comments have a positive sentiment and very few express disgust.
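Lexicon-based scoring of the kind plotted above boils down to counting, for each comment, how many of its words appear under each sentiment category. The sketch below shows that idea in base R. The three-category `lexicon` is a made-up stand-in and does not reproduce the actual NRC word lists, and the two comments are illustrative.

```r
# Hypothetical mini-lexicon; the real NRC lexicon is far larger
lexicon <- list(
  positive = c("fantastic", "love", "excellent", "fabulous"),
  joy      = c("love", "fantastic"),
  disgust  = c("awful")
)

comments <- c("Fantastic love this photo !!", "Excellent image!")

# Tokenise: lower-case, drop punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(comments)), "\\s+")

# One row per comment, one column per category, counting lexicon hits
scores <- t(vapply(tokens,
                   function(w) vapply(lexicon, function(l) sum(w %in% l), numeric(1)),
                   numeric(length(lexicon))))
totals <- colSums(scores)   # the per-category totals a bar plot would show
```

Because the totals are raw word counts, long comments contribute more than short ones, and a word listed under several categories is counted in each of them.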