DATA607 Assignment 2

This assignment requires to survey five friends about six movies, and their rating on a scale of 1 to 5. The assingment also requires to utilize SQL databases Below I am recording the answers in a dataframe called answers

I don’t have much SQL experience besides working with Access databases in R, but I know that SQLite has also good R compatibility. So I decided to import my dataframe via R into an SQL database, and then call it again as demonstration.

library(DBI)
library(RSQLite)

con = dbConnect(SQLite(), dbname = 'movie_survey.db')
dbWriteTable(con, 'movie_survey', answers, overwrite = TRUE)
dbDisconnect(con)

Here I have established a connection to a new datbase, created a database in SQLite and uploaded my dataframe. I will now connect again and import the dataframe into R as sql_df

library(RSQLite)
con2 = dbConnect(SQLite(), dbname = 'movie_survey.db')
sql_df = dbReadTable(con2, 'movie_survey')
head(sql_df)

head() shows that the dataframe was successfully re-imported into R, and it can be used and manipulated. There are some missing values, so complete data analysis is not possible. There are several ways how these could be handled. (1) They could be simply ignored and used as-is, however, this can lead to imbalanced data that can skew results in analysis. (2) The mean of all within subject observations could be computed and used to replace the missing value, however, since the movies are different genres, it is really only a band-aid. (3) If more data was available, one could use algorithms to predict the missing value based on the existing data, with high confidence.

Since there is not much data, I will use the mean of the within-subject observations to replace the missing value.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

sql_df2 = sql_df %>%
  mutate(across(where(is.numeric), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))

Now the missing values are handled, more or less. To get a quick idea of the data, summary() shows key descriptive stats

summary(sql_df2)

##        ID    Wolf_of_Wallstreet Miracle_in_Cell_7  The_Revenant     Barbie   
##  Min.   :1   Min.   :1.0        Min.   :3         Min.   :3     Min.   :2.0  
##  1st Qu.:2   1st Qu.:2.0        1st Qu.:4         1st Qu.:4     1st Qu.:3.0  
##  Median :3   Median :3.0        Median :4         Median :4     Median :4.0  
##  Mean   :3   Mean   :2.8        Mean   :4         Mean   :4     Mean   :3.6  
##  3rd Qu.:4   3rd Qu.:4.0        3rd Qu.:4         3rd Qu.:4     3rd Qu.:4.0  
##  Max.   :5   Max.   :4.0        Max.   :5         Max.   :5     Max.   :5.0  
##   Oppenheimer     Napoleon    
##  Min.   :1.0   Min.   :2.000  
##  1st Qu.:3.0   1st Qu.:3.333  
##  Median :3.0   Median :3.333  
##  Mean   :3.2   Mean   :3.333  
##  3rd Qu.:4.0   3rd Qu.:4.000  
##  Max.   :5.0   Max.   :4.000

As summary() shows, the highest mean rating is a tie between Miracle in Cell 7 and The Revenant with 4, while the lowest mean rating is tied for Oppenheimer and The Wolf of Wallstreet. It would have been cool to ask some additional questions about what makes people rate a movie in a specific way.

DATA607 Assignment 2

Lucas Weyrich

2024-02-02