This markdown aims at creating a table in SQL database using SQL queries and then importing that table into Rstudio. Following are the SQL queries that were used to create the table in database, that was created in SSMS:
create table movie_rating( viewer_name varchar(50), movie varchar(50), ratings float);
insert into movie_rating values(‘amina’, ‘spiderman’,4.5);
insert into movie_rating values(‘amina’, ‘batman’,3.5);
insert into movie_rating values(‘amina’, ‘ironman’,2);
insert into movie_rating values(‘amina’, ‘xmen’,4);
insert into movie_rating values(‘amina’, ‘galdiator’,3);
insert into movie_rating values(‘amina’, ‘superman’,NULL);
insert into movie_rating values(‘haris’, ‘spiderman’,4);
insert into movie_rating values(‘haris’, ‘batman’,4);
insert into movie_rating values(‘haris’, ‘ironman’,2);
insert into movie_rating values(‘haris’, ‘xmen’,4.5);
insert into movie_rating values(‘haris’, ‘galdiator’,NULL);
insert into movie_rating values(‘haris’, ‘superman’,3);
insert into movie_rating values(‘amir’, ‘spiderman’,3.5);
insert into movie_rating values(‘amir’, ‘batman’,3);
insert into movie_rating values(‘amir’, ‘ironman’,2.5);
insert into movie_rating values(‘amir’, ‘xmen’,4.5);
insert into movie_rating values(‘amir’, ‘galdiator’,3.5);
insert into movie_rating values(‘amir’, ‘superman’,3);
insert into movie_rating values(‘ali’, ‘spiderman’,4);
insert into movie_rating values(‘ali’, ‘batman’,3.5);
insert into movie_rating values(‘ali’, ‘ironman’,NULL);
insert into movie_rating values(‘ali’, ‘xmen’,4);
insert into movie_rating values(‘ali’, ‘galdiator’,2.5);
insert into movie_rating values(‘ali’, ‘superman’,2.5);
insert into movie_rating values(‘imran’, ‘spiderman’,3.5);
insert into movie_rating values(‘imran’, ‘batman’,3.5);
insert into movie_rating values(‘imran’, ‘ironman’,2.5);
insert into movie_rating values(‘imran’, ‘xmen’,5);
insert into movie_rating values(‘imran’, ‘galdiator’,1);
insert into movie_rating values(‘imran’, ‘superman’,1);
Note: We can also create the table here in Rstudio using sqlQuerry() function after establishing a connection with the data base.
Acquiring the libraries that will be required to achieve the goal of this R markdown
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RODBC)
library(knitr)
The connection has been setup to a database named as “data607” using odbcConnect() function from RODBC package. This data base contains our desired table that we would like to import into our Rstudio
con <- odbcConnect("data607")
There are numerous ways of doing that. Over here the sqlQuerry() function has been used to load the table into moving_rating with the help of an SQL Querry.
movie_rating <- sqlQuery(con, "SELECT * FROM movie_rating")
First of all, we will try to get a preview of data which is loaded into Rstudio from the database “data607”. There a numerous ways of doing it we can use head() function if we want to preview all the columns and n rows of the table or as_tibble() function which can preview the first 10 rows with the columns that can fit into the screen. We can also use View() function if we want to preview the whole table. Over here I’m using as_tibble function since there only three columns
str(movie_rating)
## 'data.frame': 30 obs. of 3 variables:
## $ viewer_name: chr "amina" "amina" "amina" "amina" ...
## $ movie : chr "spiderman" "batman" "ironman" "xmen" ...
## $ ratings : num 4.5 3.5 2 4 3 NA 4 4 2 4.5 ...
as_tibble(movie_rating)
## # A tibble: 30 × 3
## viewer_name movie ratings
## <chr> <chr> <dbl>
## 1 amina spiderman 4.5
## 2 amina batman 3.5
## 3 amina ironman 2
## 4 amina xmen 4
## 5 amina galdiator 3
## 6 amina superman NA
## 7 haris spiderman 4
## 8 haris batman 4
## 9 haris ironman 2
## 10 haris xmen 4.5
## # … with 20 more rows
Let’s have a quick check-up for any missing values in the table using is.na() functions.
is.na(movie_rating)
## viewer_name movie ratings
## 1 FALSE FALSE FALSE
## 2 FALSE FALSE FALSE
## 3 FALSE FALSE FALSE
## 4 FALSE FALSE FALSE
## 5 FALSE FALSE FALSE
## 6 FALSE FALSE TRUE
## 7 FALSE FALSE FALSE
## 8 FALSE FALSE FALSE
## 9 FALSE FALSE FALSE
## 10 FALSE FALSE FALSE
## 11 FALSE FALSE TRUE
## 12 FALSE FALSE FALSE
## 13 FALSE FALSE FALSE
## 14 FALSE FALSE FALSE
## 15 FALSE FALSE FALSE
## 16 FALSE FALSE FALSE
## 17 FALSE FALSE FALSE
## 18 FALSE FALSE FALSE
## 19 FALSE FALSE FALSE
## 20 FALSE FALSE FALSE
## 21 FALSE FALSE TRUE
## 22 FALSE FALSE FALSE
## 23 FALSE FALSE FALSE
## 24 FALSE FALSE FALSE
## 25 FALSE FALSE FALSE
## 26 FALSE FALSE FALSE
## 27 FALSE FALSE FALSE
## 28 FALSE FALSE FALSE
## 29 FALSE FALSE FALSE
## 30 FALSE FALSE FALSE
We can also find out the total number of missing values along side the locations of missing values using sum(is.na()) and which(is.na()) functions as shown below:
sum(is.na(movie_rating))
## [1] 3
which(is.na(movie_rating))
## [1] 66 71 81
There several way to deal with missing values in R. We can use the na.omit() function to ignore the missing values during calculations in the data frame.
movie_rating_omit <- na.omit(movie_rating)
Now we can see if there is any missing data left in the data frame using na.fail() function. If it returns an error message it means that there are missing values in the data frame
na.fail(movie_rating_omit)
## viewer_name movie ratings
## 1 amina spiderman 4.5
## 2 amina batman 3.5
## 3 amina ironman 2.0
## 4 amina xmen 4.0
## 5 amina galdiator 3.0
## 7 haris spiderman 4.0
## 8 haris batman 4.0
## 9 haris ironman 2.0
## 10 haris xmen 4.5
## 12 haris superman 3.0
## 13 amir spiderman 3.5
## 14 amir batman 3.0
## 15 amir ironman 2.5
## 16 amir xmen 4.5
## 17 amir galdiator 3.5
## 18 amir superman 3.0
## 19 ali spiderman 4.0
## 20 ali batman 3.5
## 22 ali xmen 4.0
## 23 ali galdiator 2.5
## 24 ali superman 2.5
## 25 imran spiderman 3.5
## 26 imran batman 3.5
## 27 imran ironman 2.5
## 28 imran xmen 5.0
## 29 imran galdiator 1.0
## 30 imran superman 1.0
As we observe that the table has no missing values which means that now we are clear to analyze it.
Lets make a quick graph using ggplot() to see which moving has the highest overall rating i.e. summation of ratings given by each viewer
ggplot(data = movie_rating_omit, mapping = aes(x=movie, y=ratings, fill=viewer_name))+geom_col()+theme_classic()
It can be clearly observed that movie name “xmen” has the highest overall ratings while “ironman” has lowest overall ratings.
Note: Since we have omitted missing values even if we add maximum ratings to those columns still it will not affect the highest overall ratings
We can also check out the individual ratings given by each viewer
ggplot(data=movie_rating_omit, aes(x=ratings, y=movie, fill=viewer_name))+
geom_bar(stat="identity")+facet_wrap(~viewer_name)+theme_classic()+
labs(title = "Individual Viewer Rating", x="Ratings", y="Movie Name")
The idea of this R-markdown was to import a table from a database that created using SQL Queries and then perform basic data exploration on table imported in Rstudio. Afterwards the R-markdown focused on missing values in data frame and try to remove any missing values from calculations. At the end a quick Bar/column chart were produced using ggplot to gain a quick view of the data.