The purpose of this assignment was to manage and prepare data for downstream analysis. This was achieved by taking a move ratings data set and storing it in a MySQL database, then retrieving it into R to check for any missing data.
First, we connect to the SQL database and retrieve the table into an R dataframe.
connection <- dbConnect(
RMySQL::MySQL(),
dbname = "emin.allen15",
host = "cunydata607sql.mysql.database.azure.com",
port = 3306,
user = "emin.allen15",
password = "a3d09bd8a2b71f9f"
)
# Retrieve data from database
movie_ratings_df <- dbGetQuery(connection, "SELECT * FROM movie_ratings")
# Close connection
dbDisconnect(connection)
## [1] TRUE
After the table from SQL is loaded into R, we check for missing values
missing_data_check <- any(is.na(movie_ratings_df))
missing_data_summary <- colSums(is.na(movie_ratings_df))
missing_data_check
## [1] FALSE
missing_data_summary
## id person movie rating
## 0 0 0 0
Missing data check should result in FALSE, meaning there are no missing values found anywhere in the dataframe.
Furthermore, there should be zero missing values in any column.
movie_ratings_clean <- na.omit(movie_ratings_df)
movie_ratings_clean <- movie_ratings_df
movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <-
mean(movie_ratings_clean$rating, na.rm = TRUE)
movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <- median(movie_ratings_clean$rating, na.rm = TRUE)
mode_rating <- names(sort(table(movie_ratings_clean$rating), decreasing = TRUE))[1]
movie_ratings_clean$rating[is.na(movie_ratings_clean$rating)] <- as.numeric(mode_rating)
movie_ratings_clean$person[is.na(movie_ratings_clean$person)] <- "Unknown"
knitr::kable(movie_ratings_df)
| id | person | movie | rating |
|---|---|---|---|
| 1 | Alex | Oppenheimer | 4 |
| 2 | Victoria | Barbie | 5 |
| 3 | John | John Wick 4 | 5 |
| 4 | Isabel | Mission Impossible 7 | 3 |
| 5 | David | Dune Part Two | 3 |
No missing data was encountered.
No imputation or removal was necessary.