Intro

The purpose of this assignment was to manage and prepare data for downstream analysis. This was achieved by taking a move ratings data set and storing it in a MySQL database, then retrieving it into R to check for any missing data.

Connect to SQL database and retrieve table into R dataframe

First, we connect to the SQL database and retrieve the table into an R dataframe.

Attempt to connect to SQL database

connection <- dbConnect(
  RMySQL::MySQL(),
  dbname = "emin.allen15",
  host = "cunydata607sql.mysql.database.azure.com",
  port = 3306,
  user = "emin.allen15",
  password = "a3d09bd8a2b71f9f"
)

# Retrieve data from database
movie_ratings_df <- dbGetQuery(connection, "SELECT * FROM movie_ratings")

# Close connection
dbDisconnect(connection)
## [1] TRUE

Check for missing data

After the table from SQL is loaded into R, we check for missing values

missing_data_check <- any(is.na(movie_ratings_df)) 
missing_data_summary <- colSums(is.na(movie_ratings_df))

Missing values can be handled with simple removal or imputation.

Simple removal of rows from the dataframe that contain missing values

  movie_ratings_clean <- na.omit(movie_ratings_df)

Imputation by mean, median, mode, or constant value

movie_ratings_clean <- movie_ratings_df

Mean imputation

movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <-
mean(movie_ratings_clean$rating, na.rm = TRUE)

Median imputation

movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <-       median(movie_ratings_clean$rating, na.rm = TRUE)

Mode imputation

mode_rating <- names(sort(table(movie_ratings_clean$rating), decreasing = TRUE))[1]

movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <- as.numeric(mode_rating)

Constant value imputation

movie_ratings_clean$person[is.na(movie_ratings_clean$person)] <- "Unknown"

Display table

knitr::kable(movie_ratings_df)
id person movie rating
1 Alex Oppenheimer 4
2 Victoria Barbie 5
3 John John Wick 4 5
4 Isabel Mission Impossible 7 3
5 David Dune Part Two 3

Conclusion

No missing data was encountered.

No imputation or removal was necessary.