This projects objective is to load two .csv files in SQL, handle the null values on import, join the tables and load this output into a datframe in R.
the following is a link to the github with the csv files and the SQL program to import them.
The data is imported and handled in the MySQL program. Input preprocessing is used to handle the null values for each of the movies.
One feature I tried very hard to execute was replacing the null values with the average ranking for each movie.
The tables are normalized and joined using the “inner join” command, pulling the data most relevant to analysis.
All of the files and code for producing this analysis is accesible from this rMarkdown file.
# 1. Library
library(RMySQL)
## Loading required package: DBI
# 2. Settings
db_user <- 'root'
db_password <- 'password'
db_name <- 'datacollection'
db_table <- 'output'
db_host <- '127.0.0.1' # for local access
db_port <- 3306
# 3. Read data from db
mydb <- dbConnect(MySQL(), user = db_user, password = db_password,
dbname = db_name, host = db_host, port = db_port)
s <- paste0("select * from ", db_table)
rs <- dbSendQuery(mydb, s)
df <- fetch(rs, n = -1)
on.exit(dbDisconnect(mydb))
## Warning: Closing open result sets
I want to see which movie had the highest average ranking, so I will convert the values to integers, make a vector with the mean ratings of each movie and print it.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
#Convert values to numeric for analysis
df$parasite<-as.integer(df$parasite)
df$joker<-as.integer(df$joker)
df$littleWomen<-as.integer(df$littleWomen)
df$tenet<-as.integer(df$tenet)
df$starWars<-as.integer(df$starWars)
df$batman<-as.integer(df$parasite)
# create vector of mean movie ranking
ratingMean<-
c("Mean Rating",
mean(df$parasite, na.rm=TRUE),
mean(df$joker, na.rm=TRUE),
mean(df$littleWomen, na.rm=TRUE),
mean(df$tenet, na.rm=TRUE),
mean(df$starWars, na.rm=TRUE),
mean(df$batman, na.rm=TRUE))
newdf<-rbind(df, ratingMean)
kable(newdf[6,2:7],caption="Mean rating for each Movie")
| parasite | joker | littleWomen | tenet | starWars | batman | |
|---|---|---|---|---|---|---|
| 6 | 3.75 | 4.4 | 5 | 5.2 | 6.2 | 3.75 |
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.