SQL and R

This projects objective is to load two .csv files in SQL, handle the null values on import, join the tables and load this output into a datframe in R.

Data Collection

the following is a link to the github with the csv files and the SQL program to import them.

Github Link

Importing and Preparing Data

The data is imported and handled in the MySQL program. Input preprocessing is used to handle the null values for each of the movies.

One feature I tried very hard to execute was replacing the null values with the average ranking for each movie.

The tables are normalized and joined using the “inner join” command, pulling the data most relevant to analysis.

Reproducability

All of the files and code for producing this analysis is accesible from this rMarkdown file.

Code for loading data from database into R

# 1. Library
library(RMySQL)
## Loading required package: DBI
# 2. Settings
db_user <- 'root'
db_password <- 'password'
db_name <- 'datacollection'
db_table <- 'output'
db_host <- '127.0.0.1' # for local access
db_port <- 3306

# 3. Read data from db
mydb <-  dbConnect(MySQL(), user = db_user, password = db_password,
                   dbname = db_name, host = db_host, port = db_port)
s <- paste0("select * from ", db_table)
rs <- dbSendQuery(mydb, s)
df <-  fetch(rs, n = -1)
on.exit(dbDisconnect(mydb))
## Warning: Closing open result sets

Exploratory Analysis

I want to see which movie had the highest average ranking, so I will convert the values to integers, make a vector with the mean ratings of each movie and print it.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)

#Convert values to numeric for analysis

df$parasite<-as.integer(df$parasite)
df$joker<-as.integer(df$joker)
df$littleWomen<-as.integer(df$littleWomen)
df$tenet<-as.integer(df$tenet)
df$starWars<-as.integer(df$starWars)
df$batman<-as.integer(df$parasite)

# create  vector of mean movie ranking

ratingMean<-
  c("Mean Rating",
    mean(df$parasite, na.rm=TRUE),
    mean(df$joker, na.rm=TRUE),
    mean(df$littleWomen, na.rm=TRUE), 
    mean(df$tenet, na.rm=TRUE),
    mean(df$starWars, na.rm=TRUE),
    mean(df$batman, na.rm=TRUE))

newdf<-rbind(df, ratingMean)



kable(newdf[6,2:7],caption="Mean rating for each Movie")
Mean rating for each Movie
parasite joker littleWomen tenet starWars batman
6 3.75 4.4 5 5.2 6.2 3.75

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.