607 Week2 assignment-SQL and R

Summary

Watching movie is one of the best thing to do during our free time, and Netflix may become your best friends after Covid-19 hit. I reach out to my friends and ask them to rate 6 of the movies on survey monkey. Let’s see what movie they like, and what movie they are not interested much!

hollywood

Preparing the data

Using Survey Monkey

For this project, I decided to use survey monkey to collect my data from audience

check the survey out here

URL:https://www.surveymonkey.com/r/36V9CBW

Inserting data to SQL local server

After getting the data from survey monkey, I have created following 4 local tables to store the info and data by writing SQL code.

movies: This table stores movies name and assigns movies id friends: This table stores friends name and and assignes friends id review: it stores review data from survey money, csv file are linked down below movieruslt: it join 3 table together to get the final data for anlaysis, SQL code are linked down below

check the review data in cvs format link out here

check the SQL code link out here

Connecting R to local PostgreSQL server

We can install R package and connect R to local PostgreSQL server

library(DBI)

db <- 'movie_assignment'  
host_db <- 'localhost'
db_port <- '5432'  
db_user <- 'jayleesps'  
db_password <- myLocalPassword

con <- dbConnect(RPostgres::Postgres(), dbname = db, host=host_db, port=db_port, user=db_user, password=db_password)  

dbListTables(con)

## [1] "movies"      "friends"     "review"      "movieresult"

## Showing Data

Under this section, I will show you each of the table!

Details of the 4 main tables

##moviestable <- dbSendQuery(con, "SELECT * FROM movies")
##dbFetch(moviestable)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

tbl(con, "movies")

## # Source:   table<movies> [?? x 2]
## # Database: postgres [jayleesps@localhost:5432/movie_assignment]
##   movie_id movie_name
##      <int> <chr>     
## 1        1 SpiderMan 
## 2        2 Sing2     
## 3        3 Uncharted 
## 4        4 Moonfall  
## 5        5 Blacklight
## 6        6 Marry Me

tbl(con, "friends")

## # Source:   table<friends> [?? x 2]
## # Database: postgres [jayleesps@localhost:5432/movie_assignment]
##   friends_id friends_name
##        <int> <chr>       
## 1          1 Victor      
## 2          2 Pak         
## 3          3 Daisy       
## 4          4 Mike        
## 5          5 Greg

tbl(con, "review")

## # Source:   table<review> [?? x 3]
## # Database: postgres [jayleesps@localhost:5432/movie_assignment]
##    movie_id friends_id ranking
##       <int>      <int>   <int>
##  1        1          1       4
##  2        2          1      NA
##  3        3          1       3
##  4        4          1       2
##  5        5          1       3
##  6        6          1       4
##  7        1          2       5
##  8        2          2       5
##  9        3          2       3
## 10        4          2      NA
## # … with more rows

tbl(con, "movieresult")

## # Source:   table<movieresult> [?? x 5]
## # Database: postgres [jayleesps@localhost:5432/movie_assignment]
##    movie_id friends_id ranking movie_name friends_name
##       <int>      <int>   <int> <chr>      <chr>       
##  1        1          1       4 SpiderMan  Victor      
##  2        2          1      NA Sing2      Victor      
##  3        3          1       3 Uncharted  Victor      
##  4        4          1       2 Moonfall   Victor      
##  5        5          1       3 Blacklight Victor      
##  6        6          1       4 Marry Me   Victor      
##  7        1          2       5 SpiderMan  Pak         
##  8        2          2       5 Sing2      Pak         
##  9        3          2       3 Uncharted  Pak         
## 10        4          2      NA Moonfall   Pak         
## # … with more rows

Excluding Null data

We can just pull the data that the ranking is not null since we have already join the table together by using SQL.

finaltable <- dbGetQuery(con,'
  select *
  from "movieresult"
  where ranking >= 1
')

Anlaysis

We can see Spiderman is the most watched movie and people enjoy the most which can explain why it is famous!

As we can see Spiderman has the highest ranking which is 4.4, while Moonfall has the lowest which is 2.33.
Daisy and Pak enjoys the movie the most while Greg may not interest in the movies that I select!
Again Spiderman is the most popular movie out of these 6, and Moonfall only have 3.
Greg watch all 6 while Daisy and Pak watch only 4 of them

Spiderman

aggregate(ranking ~ movie_name,finaltable,mean)

##   movie_name  ranking
## 1 Blacklight 3.500000
## 2   Marry Me 3.000000
## 3   Moonfall 2.333333
## 4      Sing2 3.750000
## 5  SpiderMan 4.400000
## 6  Uncharted 3.000000

aggregate(ranking ~ friends_name,finaltable,mean)

##   friends_name  ranking
## 1        Daisy 4.000000
## 2         Greg 2.666667
## 3         Mike 3.600000
## 4          Pak 4.000000
## 5       Victor 3.200000

aggregate(movie_id ~ movie_name, data = finaltable, FUN = length)

##   movie_name movie_id
## 1 Blacklight        4
## 2   Marry Me        4
## 3   Moonfall        3
## 4      Sing2        4
## 5  SpiderMan        5
## 6  Uncharted        4

aggregate(friends_id ~ friends_name, data = finaltable, FUN = length)

##   friends_name friends_id
## 1        Daisy          4
## 2         Greg          6
## 3         Mike          5
## 4          Pak          4
## 5       Victor          5

Data Visualization

library(ggplot2)
ggplot (finaltable, aes(x=movie_name, y=ranking,fill=friends_name)) + geom_col() + xlab("movie name") + ylab("ranking") +ggtitle("Ranking of the movies")

moviescount <- table(finaltable$friends_name)
lbls <- c("Daisy","Greg","Mike","Pak","Victor")
pct <- round(moviescount/sum(moviescount)*100)
lbls <-paste(lbls,pct)
lbls <-paste(lbls,"%")
pie(moviescount, labels=lbls, main ="Pie chart of movie watched")

peoplecount <- table(finaltable$movie_name)
lbls <- c("Blacklight","Marry Me","Moonfall","Sing2","SpiderMan","Uncharted")
pct <- round(peoplecount/sum(peoplecount)*100)
lbls <-paste(lbls,pct)
lbls <-paste(lbls,"%")
pie(peoplecount, labels=lbls, main ="Movies of people watched")