Assignment 2: SQL and R (Importing a table from SQL database to Rstudio)

SQL Code:

create table movie_rating( viewer_name varchar(50), movie varchar(50), ratings float);

insert into movie_rating values(‘amina’, ‘spiderman’,4.5);

insert into movie_rating values(‘amina’, ‘batman’,3.5);

insert into movie_rating values(‘amina’, ‘ironman’,2);

insert into movie_rating values(‘amina’, ‘xmen’,4);

insert into movie_rating values(‘amina’, ‘galdiator’,3);

insert into movie_rating values(‘amina’, ‘superman’,NULL);

insert into movie_rating values(‘haris’, ‘spiderman’,4);

insert into movie_rating values(‘haris’, ‘batman’,4);

insert into movie_rating values(‘haris’, ‘ironman’,2);

insert into movie_rating values(‘haris’, ‘xmen’,4.5);

insert into movie_rating values(‘haris’, ‘galdiator’,NULL);

insert into movie_rating values(‘haris’, ‘superman’,3);

insert into movie_rating values(‘amir’, ‘spiderman’,3.5);

insert into movie_rating values(‘amir’, ‘batman’,3);

insert into movie_rating values(‘amir’, ‘ironman’,2.5);

insert into movie_rating values(‘amir’, ‘xmen’,4.5);

insert into movie_rating values(‘amir’, ‘galdiator’,3.5);

insert into movie_rating values(‘amir’, ‘superman’,3);

insert into movie_rating values(‘ali’, ‘spiderman’,4);

insert into movie_rating values(‘ali’, ‘batman’,3.5);

insert into movie_rating values(‘ali’, ‘ironman’,NULL);

insert into movie_rating values(‘ali’, ‘xmen’,4);

insert into movie_rating values(‘ali’, ‘galdiator’,2.5);

insert into movie_rating values(‘ali’, ‘superman’,2.5);

insert into movie_rating values(‘imran’, ‘spiderman’,3.5);

insert into movie_rating values(‘imran’, ‘batman’,3.5);

insert into movie_rating values(‘imran’, ‘ironman’,2.5);

insert into movie_rating values(‘imran’, ‘xmen’,5);

insert into movie_rating values(‘imran’, ‘galdiator’,1);

insert into movie_rating values(‘imran’, ‘superman’,1);

Note: We can also create the table here in Rstudio using sqlQuerry() function after establishing a connection with the data base.

Loading and setting up the environment:

Acquiring the libraries that will be required to achieve the goal of this R markdown

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.3.0      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RODBC)
library(knitr)

Setting up the connections:

The connection has been setup to a database named as “data607” using odbcConnect() function from RODBC package. This data base contains our desired table that we would like to import into our Rstudio

con <- odbcConnect("data607")

Loading the table into Rstudio:

There are numerous ways of doing that. Over here the sqlQuerry() function has been used to load the table into moving_rating with the help of an SQL Querry.

movie_rating <- sqlQuery(con, "SELECT * FROM movie_rating")

Data Exploration:

Displaying the data:

First of all, we will try to get a preview of data which is loaded into Rstudio from the database “data607”. There a numerous ways of doing it we can use head() function if we want to preview all the columns and n rows of the table or as_tibble() function which can preview the first 10 rows with the columns that can fit into the screen. We can also use View() function if we want to preview the whole table. Over here I’m using as_tibble function since there only three columns

str(movie_rating)

## 'data.frame':    30 obs. of  3 variables:
##  $ viewer_name: chr  "amina" "amina" "amina" "amina" ...
##  $ movie      : chr  "spiderman" "batman" "ironman" "xmen" ...
##  $ ratings    : num  4.5 3.5 2 4 3 NA 4 4 2 4.5 ...

as_tibble(movie_rating)

## # A tibble: 30 × 3
##    viewer_name movie     ratings
##    <chr>       <chr>       <dbl>
##  1 amina       spiderman     4.5
##  2 amina       batman        3.5
##  3 amina       ironman       2  
##  4 amina       xmen          4  
##  5 amina       galdiator     3  
##  6 amina       superman     NA  
##  7 haris       spiderman     4  
##  8 haris       batman        4  
##  9 haris       ironman       2  
## 10 haris       xmen          4.5
## # … with 20 more rows

Checking the table for missing values:

Let’s have a quick check-up for any missing values in the table using is.na() functions.

is.na(movie_rating)

##    viewer_name movie ratings
## 1        FALSE FALSE   FALSE
## 2        FALSE FALSE   FALSE
## 3        FALSE FALSE   FALSE
## 4        FALSE FALSE   FALSE
## 5        FALSE FALSE   FALSE
## 6        FALSE FALSE    TRUE
## 7        FALSE FALSE   FALSE
## 8        FALSE FALSE   FALSE
## 9        FALSE FALSE   FALSE
## 10       FALSE FALSE   FALSE
## 11       FALSE FALSE    TRUE
## 12       FALSE FALSE   FALSE
## 13       FALSE FALSE   FALSE
## 14       FALSE FALSE   FALSE
## 15       FALSE FALSE   FALSE
## 16       FALSE FALSE   FALSE
## 17       FALSE FALSE   FALSE
## 18       FALSE FALSE   FALSE
## 19       FALSE FALSE   FALSE
## 20       FALSE FALSE   FALSE
## 21       FALSE FALSE    TRUE
## 22       FALSE FALSE   FALSE
## 23       FALSE FALSE   FALSE
## 24       FALSE FALSE   FALSE
## 25       FALSE FALSE   FALSE
## 26       FALSE FALSE   FALSE
## 27       FALSE FALSE   FALSE
## 28       FALSE FALSE   FALSE
## 29       FALSE FALSE   FALSE
## 30       FALSE FALSE   FALSE

We can also find out the total number of missing values along side the locations of missing values using sum(is.na()) and which(is.na()) functions as shown below:

sum(is.na(movie_rating))

## [1] 3

which(is.na(movie_rating))

## [1] 66 71 81

Cleaning the data from missing values:

There several way to deal with missing values in R. We can use the na.omit() function to ignore the missing values during calculations in the data frame.

movie_rating_omit <- na.omit(movie_rating)

Now we can see if there is any missing data left in the data frame using na.fail() function. If it returns an error message it means that there are missing values in the data frame

na.fail(movie_rating_omit)

##    viewer_name     movie ratings
## 1        amina spiderman     4.5
## 2        amina    batman     3.5
## 3        amina   ironman     2.0
## 4        amina      xmen     4.0
## 5        amina galdiator     3.0
## 7        haris spiderman     4.0
## 8        haris    batman     4.0
## 9        haris   ironman     2.0
## 10       haris      xmen     4.5
## 12       haris  superman     3.0
## 13        amir spiderman     3.5
## 14        amir    batman     3.0
## 15        amir   ironman     2.5
## 16        amir      xmen     4.5
## 17        amir galdiator     3.5
## 18        amir  superman     3.0
## 19         ali spiderman     4.0
## 20         ali    batman     3.5
## 22         ali      xmen     4.0
## 23         ali galdiator     2.5
## 24         ali  superman     2.5
## 25       imran spiderman     3.5
## 26       imran    batman     3.5
## 27       imran   ironman     2.5
## 28       imran      xmen     5.0
## 29       imran galdiator     1.0
## 30       imran  superman     1.0

As we observe that the table has no missing values which means that now we are clear to analyze it.

Analysis:

Visualization:

Lets make a quick graph using ggplot() to see which moving has the highest overall rating i.e. summation of ratings given by each viewer

ggplot(data = movie_rating_omit, mapping = aes(x=movie, y=ratings, fill=viewer_name))+geom_col()+theme_classic()

It can be clearly observed that movie name “xmen” has the highest overall ratings while “ironman” has lowest overall ratings.

Note: Since we have omitted missing values even if we add maximum ratings to those columns still it will not affect the highest overall ratings

We can also check out the individual ratings given by each viewer

ggplot(data=movie_rating_omit, aes(x=ratings, y=movie, fill=viewer_name))+
  geom_bar(stat="identity")+facet_wrap(~viewer_name)+theme_classic()+
  labs(title = "Individual Viewer Rating", x="Ratings", y="Movie Name")

Conclusion:

The idea of this R-markdown was to import a table from a database that created using SQL Queries and then perform basic data exploration on table imported in Rstudio. Afterwards the R-markdown focused on missing values in data frame and try to remove any missing values from calculations. At the end a quick Bar/column chart were produced using ggplot to gain a quick view of the data.