Presented By

Title: Final-Report-Flickr-Data

Team Members: Blesson Thomas, Shekha Saxena, Shivam Namdeo

Introduction

What is Flickr? It is one of the largest photo management and sharing applications in the world. It lets users share photos on its platform and offers new ways of organizing photos and videos.

Project Proposal

Identify the regions where Flickr is used most and least.
Flickr can direct its marketing efforts toward the regions where it is used least.
Predict the number of uploads based on the tags used and the region where the photo is uploaded.
These predictions can be used to design and strengthen the storage servers and the system architecture so that they can sustain traffic and handle the upcoming load.

Data Collection

We used the following Flickr API methods to collect the data.

flickr.interestingness.getList
flickr.tags.getListPhoto
flickr.tags.getRelated
flickr.tags.getHotList
flickr.places.placesForTags

This step yielded around 28,000 instances of data. Our target variable is the photo count.
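
The collection code is not part of this report. As a minimal sketch, assuming the standard Flickr REST endpoint and a placeholder API key, a request URL for one of the methods above could be built as follows. The helper name flickr_url and the parameter values are hypothetical and chosen only for illustration.

```r
# Build a Flickr REST request URL (sketch; flickr_url is a hypothetical helper).
flickr_url <- function(method, api_key, ...) {
  params <- c(
    list(method = method, api_key = api_key, format = "json", nojsoncallback = 1),
    list(...)
  )
  # Percent-encode every value so the query string is safe to send
  values <- vapply(
    params,
    function(p) URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(
    "https://api.flickr.com/services/rest/?",
    paste0(names(params), "=", values, collapse = "&")
  )
}

# "YOUR_API_KEY" is a placeholder, not a working key
url <- flickr_url("flickr.tags.getHotList", api_key = "YOUR_API_KEY",
                  period = "day", count = 20)
print(url)

# The JSON response could then be read with, for example,
# jsonlite::fromJSON(url); this is not run here because it needs a real key.
```

Keeping URL construction in one helper means each of the five methods only differs in its method name and extra parameters.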

Original Plan

Given a photo uploaded with certain tags in a certain region, we set out to predict how many photos with similar tags would be uploaded soon.
We found that tags were correlated with the regions where they originated.
Predicting the number of uploads Flickr should expect could help it allocate storage and server resources for the upcoming traffic and data.
We planned to run prediction algorithms and measure the accuracy of our predictions.

Analytics

We used the k-nearest neighbors (KNN) algorithm to predict the photo count.

RMSE: 16772, MAE: 6158

Loading libraries

library(tm)
library(tidyverse)
library(forecast)
library(wordcloud)
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
library(tidytext)

AI/ML/NLP Procedure summary

mydata <- read_csv("flickr.csv")
predictData <- read_csv("knn.csv")
KnnModel<-readRDS("Knn.rds")
head(mydata)

## # A tibble: 6 x 3
##   tags        continent     photoCount
##   <chr>       <chr>              <dbl>
## 1 palisades   North America       7138
## 2 palisades   Europe                21
## 3 palisades   Australia              3
## 4 palisades   Asia                   3
## 5 floridakeys North America       9867
## 6 floridakeys Europe                 6

# Treat both predictors as categorical variables
mydata$tags <- as.factor(mydata$tags)
mydata$continent <- as.factor(mydata$continent)

# Partition the data: 80% for training, 20% for validation
set.seed(1)
train.index <- sample(seq_len(nrow(mydata)), floor(0.80 * nrow(mydata)))
train <- mydata[train.index, ]
valid <- mydata[-train.index, ]
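
The model itself is loaded from Knn.rds, so the training step is not shown here. As a rough sketch of how KNN regression on two categorical predictors can work, the base-R code below one-hot encodes tags and continent and averages the photo counts of the k closest training rows. The toy data, the helper names encode and knn_predict, and the choice of k are assumptions for illustration, not the settings used for the saved model.

```r
# Toy data in the same shape as flickr.csv (values made up for illustration)
toy <- data.frame(
  tags       = c("beach", "beach", "snow", "snow", "beach", "snow"),
  continent  = c("Europe", "Asia", "Europe", "Asia", "Europe", "Europe"),
  photoCount = c(100, 20, 80, 10, 120, 90)
)

# One-hot encode both categorical columns against fixed level sets
encode <- function(df, tag_levels, cont_levels) {
  cbind(
    outer(as.character(df$tags), tag_levels, "==") + 0,
    outer(as.character(df$continent), cont_levels, "==") + 0
  )
}

# Predict photoCount as the mean of the k nearest training rows
knn_predict <- function(train, newdata, k = 3) {
  tag_levels  <- unique(as.character(train$tags))
  cont_levels <- unique(as.character(train$continent))
  x_train <- encode(train, tag_levels, cont_levels)
  x_new   <- encode(newdata, tag_levels, cont_levels)
  apply(x_new, 1, function(row) {
    # Squared Euclidean distance from this row to every training row
    d <- rowSums(sweep(x_train, 2, row)^2)
    nearest <- order(d)[seq_len(k)]
    mean(train$photoCount[nearest])
  })
}

new_photo <- data.frame(tags = "beach", continent = "Europe")
pred <- knn_predict(toy, new_photo, k = 2)
print(pred)
```

With one-hot features, two rows are "close" when they share a tag, a continent, or both, which is the only notion of similarity available from these two columns.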

accuracy(predictData$Prediction, predictData$Actual)

##                 ME     RMSE      MAE  MPE MAPE
## Test set -1896.413 16772.93 6158.554 -Inf  Inf

Implication - Using the data, we tried to predict the photo count to expect when a new photo with a given tag is uploaded in a given continent. This will help the Flickr team prepare its servers for a surge in uploads. We predict this value because trends spread quickly and people tend to follow them, so Flickr needs to be prepared for an upcoming trend. Although the error is high (RMSE 16772, MAE 6158), Flickr can still use the prediction as a rough estimate of the number of photos that might be uploaded. The model could be trained further and made more accurate with external data.
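
To make the error table easier to read, the sketch below computes the same kinds of metrics by hand on made-up numbers, assuming the usual definitions with error = actual minus prediction. It also shows one plausible reason for the -Inf and Inf in the MPE and MAPE columns: a percentage error divides by the actual value, so any actual count of zero makes it infinite.

```r
# Toy values, made up for illustration (not the project's data)
actual    <- c(0, 10, 200, 5000)
predicted <- c(3, 12, 150, 5600)

err  <- actual - predicted
me   <- mean(err)                      # mean error; negative means over-prediction
rmse <- sqrt(mean(err^2))              # root mean squared error
mae  <- mean(abs(err))                 # mean absolute error
mpe  <- mean(err / actual) * 100       # -Inf here because one actual value is 0
mape <- mean(abs(err / actual)) * 100  # Inf for the same reason

# Percentage error restricted to rows with a non-zero actual value
nonzero <- actual != 0
mape_nonzero <- mean(abs(err[nonzero] / actual[nonzero])) * 100

print(c(ME = me, RMSE = rmse, MAE = mae, MPE = mpe, MAPE = mape))
print(mape_nonzero)
```

Because RMSE squares the errors, a few very large misses pull it well above MAE, which is consistent with the gap between the two values reported above.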

Analyzing Market for Flickr

plot1 = mydata %>% group_by(continent) %>% summarise(photoCount = sum(photoCount))

ggplot(plot1, aes(x = continent, y = photoCount, fill=continent)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Continents where Flickr is used most") +
  theme(legend.position="none", axis.text.x = element_text(hjust = 1))

Implication - As the graph above shows, there is considerable scope for Flickr to market in Africa, South America and Australia. The biggest market for Flickr is Europe.
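
The comparison between continents can also be expressed as a share of all photos, which is easier to quote than raw bar heights. The sketch below uses base R and made-up counts; only the column names follow flickr.csv.

```r
# Toy counts, made up for illustration (not the project's data)
toy <- data.frame(
  continent  = c("Europe", "Asia", "Europe", "Africa"),
  photoCount = c(60, 25, 10, 5)
)

# Total photos per continent, then each continent's share of the grand total
totals <- aggregate(photoCount ~ continent, data = toy, FUN = sum)
totals$share <- totals$photoCount / sum(totals$photoCount)

# Largest market first, smallest last
totals <- totals[order(-totals$share), ]
print(totals)
```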

Popular Tags as per count of Photos

plot2 = mydata %>% group_by(tags) %>% summarise(photoCount = sum(photoCount)) %>% filter(photoCount > 800000)

ggplot(plot2, aes(x = tags, y = photoCount, fill=tags)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ggtitle("Top tags worldwide by photo count") +
  theme(legend.position="none", axis.text.x = element_text(hjust = 1))

Implication - The graph above shows that the most popular tag worldwide is “instagramapp”.

Peer comments

Summary

Sentiment Analysis of comments in a photo
Locations where Flickr is most popular
Popular tags in a country
Most used camera brands and models

Since most peer comments suggested that we perform sentiment analysis on the comments, below is a sentiment analysis of the comments on the most interesting photos on Flickr.

Sentiment Analysis on Flickr Comments

myData <- read_csv("Comment_flickr.csv")
corpus<-iconv(myData$comments)
corpus<-Corpus(VectorSource(corpus))
inspect(corpus[1:5])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 5
## 
## [1] Fantastic love this photo !!                                                                                                                                                                                                                                                                                                                                 
## [2] [https://www.flickr.com/photos/112944336@N06] Thanks !                                                                                                                                                                                                                                                                                                       
## [3] Fabulous image, fantastic captured, a very successful job                                                                                                                                                                                                                                                                                                    
## [4] There is a big difference between a snapshot and a photo.\nMany who posted on Flickr cannot tell the difference but you definitely could. Thank you for sharing this aesthetically pleasing, attention catching, clean photo.\nps: My comment is not individualised but I really choose photos I comment. If you don’t like having my comment, just block me.
## [5] Excellent image!

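# Normalize the text: lowercase, then remove punctuation, digits and English stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
clean_set <- tm_map(corpus, removeWords, stopwords('english'))
length(clean_set)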

## [1] 2200

removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
clean_set <- tm_map(clean_set, content_transformer(removeURL))

clean_set<-tm_map(clean_set,stripWhitespace)
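
To see what these cleaning steps do to a single comment, the sketch below applies approximate base-R equivalents in the same order. It is not the tm implementation: stop word removal is left out, and the final trimws call is an addition. The helper name clean_comment is hypothetical. Note that once punctuation and digits are stripped, a URL collapses into one token starting with "http", which is why the URL pattern can remove it afterwards.

```r
# Approximate the tm pipeline on plain strings (sketch, stop words not removed)
clean_comment <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)       # similar in spirit to removePunctuation
  x <- gsub("[[:digit:]]", "", x)       # similar in spirit to removeNumbers
  x <- gsub("http[[:alnum:]]*", "", x)  # same pattern as removeURL
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  trimws(x)                             # extra step, not part of stripWhitespace
}

cleaned <- clean_comment("[https://www.flickr.com/photos/112944336@N06] Thanks !")
print(cleaned)
```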

tdm<-TermDocumentMatrix(clean_set)
tdm<-as.matrix(tdm)
tdm[1:10,1:20]

##                Docs
## Terms           1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##   fantastic     1 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   love          1 0 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   photo         1 0 0 2 0 0 1 0 0  0  0  0  0  0  0  0  0  0  0  0
##   thanks        0 1 0 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   captured      0 0 1 0 0 0 0 0 0  0  0  1  0  0  0  0  0  0  0  0
##   fabulous      0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   image         0 0 1 0 1 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   job           0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   successful    0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
##   aesthetically 0 0 0 1 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
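
A term-document matrix has one row per term and one column per document, and each cell holds how often the term occurs in that document. As a small illustration in base R, assuming the comments are already cleaned and split on spaces, the same structure can be built by hand. The three example documents are made up.

```r
# Three already-cleaned toy documents (made up for illustration)
docs <- c("fantastic love photo", "thanks", "fabulous image fantastic captured")

tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- unique(unlist(tokens))

# One column per document: count how often each term occurs in it
toy_tdm <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(toy_tdm) <- terms
colnames(toy_tdm) <- seq_along(docs)
print(toy_tdm)

# Row sums give the overall frequency of each term across all documents
term_freq <- rowSums(toy_tdm)
print(term_freq)
```

The row sums are the quantity that the frequency bar plot and the word cloud are built from.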

# Plot the words with a frequency of at least 50
w<-rowSums(tdm)
w<-subset(w,w>=50)
barplot(w,las=2,col=rainbow(50))

Implication - The plot above shows the most frequently occurring words in the corpus.

Word Cloud

# Wordcloud

w <- sort(rowSums(tdm), decreasing = TRUE)
set.seed(222)
wordcloud(words = names(w),
          freq = w,
          max.words = 100,
          random.order = FALSE,
          min.freq = 5,
          colors = brewer.pal(8, 'Dark2'),
          scale = c(5, 0.3),
          rot.per = 0.7)

Implication - The word cloud shows the most frequent words in larger type, placed toward the center of the cloud.

Sentiment Analysis - NRC

comments<-iconv(myData$comments)
s<-get_nrc_sentiment(comments)

barplot(colSums(s),
        las = 2,
        col = rainbow(10),
        ylab = 'Count',
        main = 'Sentiment Scores for Flickr Comments')

Implication - The plot above shows that most comments carry a positive sentiment and very few express disgust.
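
Lexicon-based scoring of this kind works, roughly, by looking up each word of a comment in a word-to-category table and counting the matches per category. The sketch below illustrates the idea with a tiny made-up lexicon. The entries are not taken from the real NRC lexicon, and the helper name score_comment is hypothetical.

```r
# Tiny made-up lexicon; a word may belong to more than one category
toy_lexicon <- data.frame(
  word      = c("love", "love", "fantastic", "ugly"),
  sentiment = c("positive", "joy", "positive", "negative")
)

# Count lexicon matches per category for one comment
score_comment <- function(comment, lexicon) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", comment)), "\\s+")[[1]]
  hits <- unlist(lapply(words, function(w) lexicon$sentiment[lexicon$word == w]))
  table(factor(hits, levels = unique(lexicon$sentiment)))
}

counts <- score_comment("Fantastic love this photo !!", toy_lexicon)
print(counts)
```

Summing such per-comment counts over all comments gives one total per category, which is the kind of value the bar plot above displays.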