607 Assignment 2 R and SQL

Introduction:

In this weeks assignment I aim to create a table using SQL workbench and load it to the class Azure server. Then I want to utilize that data stored in the class SQL azure database using RStudio.

Creating an azure MySQL Database, Installing MySQL and MySQL Workbench using the resources provided in the class’s content.
Connecting MySQL Workbench to class Azure Database.

## The following script was provided via email and run to create a hash to use as my password for the connections made from SQL workbench to Azure.

# Load necessary libraries
library(digest)     # For generating the hash
library(clipr)      # For copying to clipboard

## Welcome to clipr. See ?write_clip for advisories on writing to the clipboard in R.

# The string to hash
input_string <- '24610585'

# Generate the hash using SHA-256
hash_hex <- digest(input_string, algo = "sha256", serialize = FALSE)

# Declare a variable with only a quarter of the resulting hash
quarter_hash <- substr(hash_hex, 1, nchar(hash_hex) %/% 4)


## First create connection using the server name, port, and my user and password found above.

## Then connect to database using the URL, cunydata607sql.mysql.database.azure.com, and the user and password from above.

Creating in SQL, on both the class azure database and my own azure database, the table of 6x6 for 1-5 movie ratings by five of my friends to six movies.
Connecting R to the class Azure SQL database.

## Loading required packages for connection

library("RMySQL")

## Loading required package: DBI

## Connecting to the SQL database

mydb <- dbConnect(MySQL(),user= 'brandon.chung85',password= '704c7720437dbcfb',dbname= 'brandon.chung85',host= 'cunydata607sql.mysql.database.azure.com')

dbListTables(mydb)

## [1] "movieratings"

## Importing data table from SQL to R as a dataframe
data <- dbReadTable(mydb, "movieratings")

## Previewing data
head(data)

##               Movie.Name Claudia Raj Ryan Kelly Mark
## 1 A Quiet Place: Day One       3   3    2     5    2
## 2      Avengers End Game       5   5    5     4    3
## 3  Avengers Infinity War       5   5    4     4    3
## 4 Deadpool and Wolverine       4   3    5     2    4
## 5                  Joker       4   4    3     4    5
## 6            The Arrival       5   2    3     3    3

summary(data)

##   Movie.Name           Claudia           Raj             Ryan      
##  Length:6           Min.   :3.000   Min.   :2.000   Min.   :2.000  
##  Class :character   1st Qu.:4.000   1st Qu.:3.000   1st Qu.:3.000  
##  Mode  :character   Median :4.500   Median :3.500   Median :3.500  
##                     Mean   :4.333   Mean   :3.667   Mean   :3.667  
##                     3rd Qu.:5.000   3rd Qu.:4.750   3rd Qu.:4.750  
##                     Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      Kelly            Mark      
##  Min.   :2.000   Min.   :2.000  
##  1st Qu.:3.250   1st Qu.:3.000  
##  Median :4.000   Median :3.000  
##  Mean   :3.667   Mean   :3.333  
##  3rd Qu.:4.000   3rd Qu.:3.750  
##  Max.   :5.000   Max.   :5.000

## A quick summary of the dataframe shows us some descriptive statistics of the data. Claudia has the highest median score out of the group, she may be a more avid movie lover than the others. Other call outs can include that all participants scored a 5 out of 5 rating on at least one of the movies listed, and non of the participants rated any movie below a 2 score.

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- data %>%
  mutate(total_score = Claudia + Raj + Ryan + Kelly + Mark)

ggplot (data = data, aes (x = Movie.Name, y = total_score)) +
  geom_point() +
  scale_x_discrete(guide = guide_axis(angle = 90))

## Here we can see each movie total score (a sum of the scores given by each participant). Avengers end game ranked the highest and A Quiet Place: Day One ranked the lowest in the group.

Missing data strategy: For missing data I would remove the participant altogether, however the dataset is so small that in this example I would use a calculated value. My thought behind this is that if there were for example 1,000 participants, deleting the participant’s inputs would not be as statistically impactful. For this dataset I would propose to use a calculated field, a mean of the other’s ratings for that specific movie. This way we would not lost 1/5th of the recorded data, and can keep the aggregated data’s results somewhat similar to what would likely be if the value were not null.

607 Assignment 2 R and SQL

Brandon Chung

2024-09-15