CIS8392 - Group 4 - Project Proposal - Twitter WeRateDogs Data predictive analyses using different ML algorithms

Narender Reddy Konuganti | Sivananda Reddy Bushireddy | Sukumar Reddy kottala | Gopi Sandeep Guntuku

03/23/2022

Data: What primary data will you use? Will you collect secondary data from other sources? How will you collect all your data? How will you clean and transform the data?

The primary data will be the Twitter archive of @dog_rates, also known as WeRateDogs, obtained from an external source; it includes the ID of each tweet. Additional (secondary) data will be gathered by querying Twitter’s API with those tweet IDs.

Data wrangling will consist of gathering data, assessing data, cleaning data (including missing value treatment), and storing/exporting data.

During this project, additional data gathering, assessing, and cleaning will be carried out to support worthwhile analyses and visualizations. The Twitter archive is basic: retweet count and favorite count are two of the notable column omissions. In addition, we will try different machine learning algorithms, both unsupervised (such as data extraction and dimensionality reduction) and supervised (such as random forest and decision tree), which we expect to improve model performance.
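As a rough illustration of how the missing retweet and favorite counts could be attached to the archive by tweet ID, here is a minimal sketch. It is written in Python for illustration only (the project itself plans to use R packages such as httr and jsonlite), it assumes the API responses have been saved as one JSON object per line, and all sample values are made up. The function names `parse_tweet_counts` and `add_counts` and the `tweet_id` column name are hypothetical.

```python
import json


def parse_tweet_counts(json_lines):
    """Map tweet ID -> (retweet_count, favorite_count).

    Assumes one JSON tweet object per line (an assumed storage format).
    """
    counts = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        counts[tweet["id"]] = (tweet["retweet_count"], tweet["favorite_count"])
    return counts


def add_counts(archive_rows, counts):
    """Left-join the counts onto the archive rows.

    Tweets that the API did not return (for example, deleted tweets)
    keep None in both new columns so they can be handled during cleaning.
    """
    merged = []
    for row in archive_rows:
        retweets, favorites = counts.get(row["tweet_id"], (None, None))
        merged.append({**row, "retweet_count": retweets, "favorite_count": favorites})
    return merged


# Toy data: IDs, text, and counts are invented for this example.
archive = [
    {"tweet_id": 101, "text": "This is Bella. 12/10"},
    {"tweet_id": 102, "text": "Meet Max. 13/10"},
]
api_lines = ['{"id": 101, "retweet_count": 10, "favorite_count": 50}']

merged = add_counts(archive, parse_tweet_counts(api_lines))
for row in merged:
    print(row["tweet_id"], row["retweet_count"], row["favorite_count"])
```

Keeping unmatched tweets with empty counts, rather than dropping them at merge time, leaves the decision about missing values to the cleaning step.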

Problem Description: What is the problem that you will be investigating? Why is it interesting?

The project goal is to wrangle WeRateDogs Twitter data to create interesting and trustworthy exploratory and predictive analyses and visualizations using different machine learning algorithms. The Twitter archive is great, but it contains only very basic tweet information.

It is interesting because we will be using a neural network model that can classify dog breeds. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponds to the most confident prediction.

Analytics plan: How will you analyze your data? What methods or tools will you use?

There are three main options for predictive analysis. The first is predicting a dog’s rating, which is set aside because of the ambiguity and subjectivity of the ratings shown in the analysis. The second is predicting a dog’s breed; this option is dropped as well because there are many unique breed values (more than 100) in a very small number of observations (about 1.9K). Therefore the third option, predicting whether or not a dog’s breed is retriever, is the best choice, because retriever is the most common breed in the dataset.
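The binary target for the third option could be derived along the following lines. This is a minimal sketch in Python for illustration only (the project plans to use R), and it assumes that breed labels are strings in which every retriever variety contains the word "retriever"; the sample labels and the helper name `is_retriever` are hypothetical.

```python
from collections import Counter


def is_retriever(breed):
    """Return True when the predicted breed label names a retriever.

    Assumes retriever varieties all contain the substring "retriever";
    a missing prediction (None) is treated as not a retriever.
    """
    return breed is not None and "retriever" in breed.lower()


# Toy breed labels, invented for this example.
breeds = ["golden_retriever", "Labrador_retriever", "pug", None,
          "Chihuahua", "golden_retriever"]

# 1 = retriever, 0 = any other breed or no prediction.
labels = [int(is_retriever(b)) for b in breeds]

# Checking the class balance shows whether accuracy alone would be misleading.
balance = Counter(labels)
print(labels, dict(balance))
```

Collapsing more than 100 breeds into two classes gives each class far more observations, which is the motivation for choosing this option.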

We will import the packages required for analysing the data and later building the models. These include httr, base64enc, jsonlite, stringr, tidyverse, dplyr, and a few others. To fulfill the objective behind developing the project, we will be doing:

Evaluation Plan: How will you evaluate your results?

We will evaluate the results by determining the accuracy of each individual model. We will then judge the individual models by printing accuracy charts and plotting ROC curves. It is important to observe and compare each model's performance on both the train and test datasets to check for overfitting. We can also print a table showing the results of the different models with their accuracies.
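The evaluation steps described above can be sketched as follows. This is a minimal, dependency-free Python illustration (the project itself plans to use R); the labels and scores are toy values, the 0.5 decision threshold is an assumption, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_curve_points(y_true, scores):
    """Return (false positive rate, true positive rate) points for a ROC curve.

    Assumes binary labels (1 = positive) with at least one of each class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels and predicted probabilities, invented for this example.
y_train, train_scores = [1, 0, 1, 0], [0.9, 0.2, 0.8, 0.1]
y_test, test_scores = [1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]

threshold = 0.5  # assumed cut-off for turning scores into class labels
train_acc = accuracy(y_train, [int(s >= threshold) for s in train_scores])
test_acc = accuracy(y_test, [int(s >= threshold) for s in test_scores])
test_auc = auc(roc_curve_points(y_test, test_scores))

# A large gap between train and test accuracy suggests overfitting.
gap = train_acc - test_acc
print(f"{'model':<12}{'train_acc':>10}{'test_acc':>10}{'test_auc':>10}{'gap':>8}")
print(f"{'toy_model':<12}{train_acc:>10.2f}{test_acc:>10.2f}{test_auc:>10.2f}{gap:>8.2f}")
```

Repeating the last two lines for each fitted model would produce the comparison table of models and accuracies mentioned above.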