Team members- Shekha Saxena, Blesson Thomas, Shivam Namdeo
What is Flickr? It is one of the largest photo management and sharing applications in the world. It lets users share their photos on its platform and enables new ways of organizing photos and videos.
The primary purpose is to collect data from the Flickr APIs only, to make sure that our analysis is as accurate as possible; an analysis can only be as good as its data source. The Flickr APIs have a vast amount of data available for analysis. If, during the analysis, we feel the need to enrich our data set to further improve accuracy, we may add data from other sources, but for now we are using only the Flickr APIs.
How do we extract/collect the data from the Flickr APIs? The URL to get the data is https://api.flickr.com/services/rest/. These APIs expose several methods, such as tags, photos, groups, comments, interestingness, etc., from which to extract the data. We can filter the results and select only the columns that are relevant to our analysis.
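As a rough sketch, a single call could look like the following in Python (the requests library, the placeholder API key, and the example date are our own assumptions, not part of the proposal):

    import requests

    FLICKR_REST_URL = "https://api.flickr.com/services/rest/"
    API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder; a real key comes from Flickr's developer portal

    def call_flickr(method, **params):
        """Call a Flickr REST method and return the parsed JSON response."""
        params.update({
            "method": method,
            "api_key": API_KEY,
            "format": "json",
            "nojsoncallback": 1,
        })
        response = requests.get(FLICKR_REST_URL, params=params)
        response.raise_for_status()
        return response.json()

    # Example: one page of "interesting" photos for a given date
    interesting = call_flickr("flickr.interestingness.getList",
                              date="2018-10-01", per_page=100, page=1)
    photos = interesting["photos"]["photo"]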
How will we clean and transform the data? We have collected 166,376 instances from the API. We will extract the unique tags from this data source and remove tags that do not have geolocation data. Chosen columns: ‘Tags’, ‘Place_type’, ‘Photo_count’, ‘Photoids’.
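A minimal cleaning sketch with pandas (the input file name and the use of Place_type as the geolocation indicator are assumptions for illustration):

    import pandas as pd

    # Raw instances collected from the Flickr APIs (166,376 records)
    raw = pd.read_csv("flickr_raw.csv")  # hypothetical dump of the collected data

    # Keep only the columns relevant to the analysis
    df = raw[["Tags", "Place_type", "Photo_count", "Photoids"]]

    # Drop records without geolocation information and keep unique tags
    df = df.dropna(subset=["Place_type"]).drop_duplicates(subset=["Tags"])

    print(len(df), "unique tags with geolocation data")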
We are performing analysis using the Flickr APIs to identify regions where Flickr might want to market more aggressively to elevate its use worldwide. The interesting thing is that we are not measuring Flickr's popularity by the number of people who have signed up for the app; we are measuring it by usage. This is interesting because we are identifying regions where Flickr is not very popular and where its marketing team might need to focus.
To achieve our goal, we have identified the tags of interesting pictures and the hot tags of the day and week. Since this list of tags is small, we then find tags related to them, which grows our dataset to more than 20,000 instances. We will then identify the origin of these tags by merging our data with the geolocation API. Once we have the geolocations, we will see which countries have the highest and lowest numbers of originating tags, along with the image counts in those regions. The more tags, the higher the usage. The idea behind this is that people sometimes sign up for an app and never use it, so we want to use this usage data to identify the app's popularity by region. Once we have that data, we can direct Flickr's marketing team to promote the app in the lower-popularity areas.
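A sketch of the country-level aggregation we have in mind, assuming the cleaned tag data and a tag-to-country mapping from the places API (the file and column names are placeholders):

    import pandas as pd

    tags_df = pd.read_csv("flickr_tags.csv")      # hypothetical: Tags, Photo_count
    places_df = pd.read_csv("flickr_places.csv")  # hypothetical: Tags, Country

    merged = tags_df.merge(places_df, on="Tags", how="inner")

    # Usage per country: number of originating tags and total photo count
    usage = (merged.groupby("Country")
                   .agg(tag_count=("Tags", "nunique"),
                        photo_count=("Photo_count", "sum"))
                   .sort_values("tag_count"))

    print(usage.head(10))  # lowest-usage countries: candidate regions for marketing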
In addition to this, we also plan to use the comments on photos to identify which photos are most popular or most talked about in a particular period of time. The more comments, the more popular the photo. If possible, we will also try to find out from which geolocations the most popular photos originate. We can also look for a relationship between tags and popular pictures, i.e., whether a popular picture has a hot tag or no correlation is present.
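A small sketch of how comment counts could be pulled per photo using flickr.photos.comments.getList (the photo ids shown are placeholders, not real data):

    import requests

    def comment_count(photo_id, api_key="YOUR_FLICKR_API_KEY"):
        """Return the number of comments on a photo via flickr.photos.comments.getList."""
        resp = requests.get("https://api.flickr.com/services/rest/", params={
            "method": "flickr.photos.comments.getList",
            "api_key": api_key, "photo_id": photo_id,
            "format": "json", "nojsoncallback": 1,
        }).json()
        return len(resp.get("comments", {}).get("comment", []))

    # Rank a sample of collected photo ids (placeholders) by comment count
    photo_ids = ["45123978324", "45123978999"]
    ranked = sorted(photo_ids, key=comment_count, reverse=True)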
How will you analyze your data? What methods or tools will you use?
We will collect data from the API methods listed below and merge their results to get the final dataset.
flickr.interestingness.getList
flickr.tags.getListPhoto
flickr.tags.getRelated
flickr.tags.getHotList
flickr.places.placesForTags
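One way these methods could be chained is sketched below, under our own assumptions about the JSON field names and the place_type_id value for countries (both should be verified against the Flickr API reference):

    import requests

    API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder

    def call_flickr(method, **params):
        params.update(method=method, api_key=API_KEY, format="json", nojsoncallback=1)
        return requests.get("https://api.flickr.com/services/rest/", params=params).json()

    # 1. Seed the tag list with this week's hot tags
    hot = call_flickr("flickr.tags.getHotList", period="week", count=20)
    seed_tags = [t["_content"] for t in hot["hottags"]["tag"]]

    # 2. Grow the list with related tags
    all_tags = set(seed_tags)
    for tag in seed_tags:
        related = call_flickr("flickr.tags.getRelated", tag=tag)
        all_tags.update(t["_content"] for t in related["tags"]["tag"])

    # 3. Resolve each tag to country-level places (place_type_id=12 is assumed
    #    to mean "country"; verify against the API documentation)
    tag_places = {tag: call_flickr("flickr.places.placesForTags",
                                   tags=tag, place_type_id=12)
                  for tag in all_tags}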
We will also use machine learning techniques to predict the trending tag for a region. This will help Flickr manage its storage and network traffic to handle the large volume of photos that might get uploaded. Another application of this prediction is that if an offensive trend is gaining popularity, it could help governments deploy more forces in regions of unrest.
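The exact techniques are still to be decided; as one possible baseline (our assumption, not a final choice), a scikit-learn classifier could be trained on a hypothetical (Country, Week) → trending-tag table:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hypothetical training table: one row per (Country, Week) with the tag that
    # trended there; the file and column names are placeholders.
    data = pd.read_csv("tag_trends_by_region.csv")  # Country, Week, Trending_tag

    X = DictVectorizer().fit_transform(data[["Country", "Week"]].to_dict(orient="records"))
    y = data["Trending_tag"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))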
How will you evaluate your results? Qualitatively, what kind of results do you expect? Quantitatively, what kind of analyses will you use to evaluate and/or compare your results?
Qualitatively, we will use visualization techniques such as ggplot, bar charts, etc. Quantitatively, we will apply the prediction techniques mentioned above and select the model with the best fit and highest accuracy.
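A simple example of the kind of chart we have in mind, drawn here with matplotlib as a stand-in for ggplot, over the hypothetical per-country usage table from the aggregation sketch above:

    import pandas as pd
    import matplotlib.pyplot as plt

    usage = pd.read_csv("usage_by_country.csv")  # placeholder: Country, tag_count, photo_count

    lowest = usage.nsmallest(10, "tag_count")    # candidate regions for marketing
    plt.figure(figsize=(8, 4))
    plt.bar(lowest["Country"], lowest["tag_count"])
    plt.xticks(rotation=45, ha="right")
    plt.ylabel("Number of originating tags")
    plt.title("Countries with the lowest Flickr tag usage")
    plt.tight_layout()
    plt.show()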