DATA607- Assignment

Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe. Your deliverables should include your SQL scripts and your R Markdown code, posted to GitHub. This is by design a very open ended assignment. A variety of reasonable approaches are acceptable. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.

Loading packages

The data, sql command and .rmd used in this exercise can be found at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2002

Install the RMySQL and odbc packages by running install.packages(“RMySQL”) and install.packages(“odbc”) in RStudio if you don’t already have it installed and then load the RMySQL package. In addition, i am loading tidyverse and dplyr package.

library(RMySQL)
library(tidyverse)
library(dplyr)
library(DBI)

Establish Connection

To hide the password i store the password in a variable called “localpassword” and run it in the console. Each time we start R we will need to reset this variable, but that is better than publishing a hardcoded password as i will share the .rmd using GitHub. There is a better way of storing the connection info in a configuration file outside R. I used {r echo= FALSE} to hide “localpassword” variable while kniting to HTML.

con <- dbConnect(MySQL(),
                 user="Data607", password= localpassword,
                 dbname="moviesratings", host="localhost")

Load data

#Store sql join quary in a variable
sql<- "SELECT 
tblreviewers.ReviewerName,
tblmovies.MovieTitle,
tblmovieratings.MovieRating
FROM tblmovieratings
JOIN tblreviewers 
ON tblreviewers.ReviewerID = tblmovieratings.ReviewerID
JOIN tblmovies 
ON tblmovies.MovieID = tblmovieratings.MovieID"

#Fetch data from MySQL instance into a R dataframe 
Res_movies<- dbGetQuery(con, sql)
Res_movies

Exploratorory Analysis

#Check class
class (Res_movies)

## [1] "data.frame"

#Check dimension
dim (Res_movies)

## [1] 36  3

# Check column names
names(Res_movies)

## [1] "ReviewerName" "MovieTitle"   "MovieRating"

# Check for null values
any(is.null(Res_movies))

## [1] FALSE

# Check for missing values NA's
any(is.na(Res_movies))

## [1] FALSE

#check summary
summary(Res_movies)

##  ReviewerName        MovieTitle         MovieRating   
##  Length:36          Length:36          Min.   :1.000  
##  Class :character   Class :character   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :4.000  
##                                        Mean   :3.639  
##                                        3rd Qu.:5.000  
##                                        Max.   :5.000

Aggregate

aggregate(MovieRating ~ MovieTitle, data = Res_movies, FUN=mean)

Joker has the best rating

Visualization

qplot(MovieTitle, MovieRating, data=Res_movies,xlab = "Rating", ylab = "Movie", main = "Individual Movie Rating by Reviewer") + facet_wrap(~ReviewerName) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(Res_movies, aes(x = reorder(MovieTitle, MovieRating), y = MovieRating, fill = MovieTitle), xlab = 'Rating',  col = I("grey")) + geom_bar(stat = "identity") + 
  ggtitle("Movie Cummulative Ratings") +  labs(x = "Movie") +  coord_flip()

Discconect MySQL

dbDisconnect(con)

## [1] TRUE

Optional: Google Cloud and Survey Monkey

Mysql Google Cloud Instance

I used google cloud to create a mysql instance and connect the instance with my mysql workbench. So, if i load any file or create a table using workbench i will be able to access same table from google clould mysql instance. The data i am using here is a Survey Monkey Data for this section. I run a survey in my social media and slack class slack channel. I discover people tend not to participate in a survey eventhough the survey i created is takes less than 1 min to complete. My survey reached at least 500 people but i only received 24 responses. I had to manually get the survey monkey data as i am using a free version of survey monkey.

Survey Monkey result link: https://www.surveymonkey.com/results/SM-JNYCRQ3C7/

Connection

Establish Connection to google clound mysql instance. Hide host and password in two varibles.

congsql <- dbConnect(MySQL(),
                 user="root", password= password,
                 dbname="showratings", host=host)

Load data

Load data collected from survey monkey

#Store sql join quary in a variable
sql<- "SELECT * FROM surveymonkeyratings"

#Fetch data from MySQL google cloud instance into a R dataframe 
myQuery <- paste("SELECT * FROM surveymonkeyratings")
df <- dbGetQuery(congsql, myQuery)
df

Exploratorory Analysis

#Check class
class (df)

## [1] "data.frame"

#Check dimension
dim (df)

## [1] 120   3

# Check column names
names(df)

## [1] "ResponseID" "ShowTitle"  "Rating"

# Check for null values
any(is.null(df))

## [1] FALSE

# Check for missing values NA's
any(is.na(df))

## [1] TRUE

#check summary
summary(df)

##    ResponseID     ShowTitle             Rating     
##  Min.   : 1.00   Length:120         Min.   :1.000  
##  1st Qu.: 6.75   Class :character   1st Qu.:3.000  
##  Median :12.50   Mode  :character   Median :4.000  
##  Mean   :12.50                      Mean   :3.926  
##  3rd Qu.:18.25                      3rd Qu.:5.000  
##  Max.   :24.00                      Max.   :5.000  
##                                     NA's   :26

#check missing value
sum(is.na(df))

## [1] 26

mean (is.na(df))

## [1] 0.07222222

There are 26 missing values within 120 rows.

Removing missing values

cleandf<- na.omit(df)
cleandf

Again check for missing value and run summary command to see stats.

# Check for null values
any(is.null(cleandf))

## [1] FALSE

#check summary
summary(cleandf)

##    ResponseID     ShowTitle             Rating     
##  Min.   : 1.00   Length:94          Min.   :1.000  
##  1st Qu.: 6.00   Class :character   1st Qu.:3.000  
##  Median :11.00   Mode  :character   Median :4.000  
##  Mean   :11.78                      Mean   :3.926  
##  3rd Qu.:18.00                      3rd Qu.:5.000  
##  Max.   :24.00                      Max.   :5.000

We were able to get rid of 26 missing values.

Aggregate

aggregate(Rating ~ ShowTitle, data = cleandf, FUN=mean)

Breaking Bad (my favourite show of alltime) has the best rating even from real world survey data.

Visualization

ggplot(cleandf, aes(x = reorder(ShowTitle, Rating), y = Rating, fill = ShowTitle), xlab = 'Rating',  col = I("grey")) + geom_bar(stat = "identity") + 
  ggtitle("Show Cummulative Ratings") +  labs(x = "Show") +  coord_flip()

Survey Monkey Result

dbDisconnect(congsql)

DATA607- Assignment – SQL and R

Assignment 02 by Md Forhad Akbar

2019-09-06

Loading packages

Establish Connection

Load data

Exploratorory Analysis

Aggregate

Visualization

Optional: Google Cloud and Survey Monkey

Connection

Load data

Exploratorory Analysis

Removing missing values

Aggregate

Visualization

Survey Monkey Result