Choose six recent popular movies. Ask at least five people that you know (friends, family, classmates, imaginary friends) to rate each of these movie that they have seen on a scale of 1 to 5. Take the results (observations) and store them in a SQL database. Load the information into an R dataframe. Your deliverables should include your SQL scripts and your R Markdown code, posted to GitHub. This is by design a very open ended assignment. A variety of reasonable approaches are acceptable. You can (and should) blank out your SQL password if your solution requires it; otherwise, full credit requires that your code is “reproducible,” with the assumption that I have the same database server and R software.
The data, sql command and .rmd used in this exercise can be found at https://github.com/forhadakbar/data607fall2019/tree/master/Week%2002
Install the RMySQL and odbc packages by running install.packages(“RMySQL”) and install.packages(“odbc”) in RStudio if you don’t already have it installed and then load the RMySQL package. In addition, i am loading tidyverse and dplyr package.
To hide the password i store the password in a variable called “localpassword” and run it in the console. Each time we start R we will need to reset this variable, but that is better than publishing a hardcoded password as i will share the .rmd using GitHub. There is a better way of storing the connection info in a configuration file outside R. I used {r echo= FALSE} to hide “localpassword” variable while kniting to HTML.
#Store sql join quary in a variable
sql<- "SELECT
tblreviewers.ReviewerName,
tblmovies.MovieTitle,
tblmovieratings.MovieRating
FROM tblmovieratings
JOIN tblreviewers
ON tblreviewers.ReviewerID = tblmovieratings.ReviewerID
JOIN tblmovies
ON tblmovies.MovieID = tblmovieratings.MovieID"## [1] "data.frame"
## [1] 36 3
## [1] "ReviewerName" "MovieTitle" "MovieRating"
## [1] FALSE
## [1] FALSE
## ReviewerName MovieTitle MovieRating
## Length:36 Length:36 Min. :1.000
## Class :character Class :character 1st Qu.:3.000
## Mode :character Mode :character Median :4.000
## Mean :3.639
## 3rd Qu.:5.000
## Max. :5.000
Joker has the best rating
ggplot(Res_movies, aes(x = reorder(MovieTitle, MovieRating), y = MovieRating, fill = MovieTitle), xlab = 'Rating', col = I("grey")) + geom_bar(stat = "identity") +
ggtitle("Movie Cummulative Ratings") + labs(x = "Movie") + coord_flip()Discconect MySQL
## [1] TRUE
Mysql Google Cloud Instance
I used google cloud to create a mysql instance and connect the instance with my mysql workbench. So, if i load any file or create a table using workbench i will be able to access same table from google clould mysql instance. The data i am using here is a Survey Monkey Data for this section. I run a survey in my social media and slack class slack channel. I discover people tend not to participate in a survey eventhough the survey i created is takes less than 1 min to complete. My survey reached at least 500 people but i only received 24 responses. I had to manually get the survey monkey data as i am using a free version of survey monkey.
Survey Monkey result link: https://www.surveymonkey.com/results/SM-JNYCRQ3C7/
Establish Connection to google clound mysql instance. Hide host and password in two varibles.
Load data collected from survey monkey
#Fetch data from MySQL google cloud instance into a R dataframe
myQuery <- paste("SELECT * FROM surveymonkeyratings")
df <- dbGetQuery(congsql, myQuery)
df## [1] "data.frame"
## [1] 120 3
## [1] "ResponseID" "ShowTitle" "Rating"
## [1] FALSE
## [1] TRUE
## ResponseID ShowTitle Rating
## Min. : 1.00 Length:120 Min. :1.000
## 1st Qu.: 6.75 Class :character 1st Qu.:3.000
## Median :12.50 Mode :character Median :4.000
## Mean :12.50 Mean :3.926
## 3rd Qu.:18.25 3rd Qu.:5.000
## Max. :24.00 Max. :5.000
## NA's :26
## [1] 26
## [1] 0.07222222
There are 26 missing values within 120 rows.
Again check for missing value and run summary command to see stats.
## [1] FALSE
## ResponseID ShowTitle Rating
## Min. : 1.00 Length:94 Min. :1.000
## 1st Qu.: 6.00 Class :character 1st Qu.:3.000
## Median :11.00 Mode :character Median :4.000
## Mean :11.78 Mean :3.926
## 3rd Qu.:18.00 3rd Qu.:5.000
## Max. :24.00 Max. :5.000
We were able to get rid of 26 missing values.
Breaking Bad (my favourite show of alltime) has the best rating even from real world survey data.
ggplot(cleandf, aes(x = reorder(ShowTitle, Rating), y = Rating, fill = ShowTitle), xlab = 'Rating', col = I("grey")) + geom_bar(stat = "identity") +
ggtitle("Show Cummulative Ratings") + labs(x = "Show") + coord_flip()Survey Monkey Result
dbDisconnect(congsql)