Introduction

In this assignment, we will be fetching data from different databases from MySQL and making some basic transformation on source data. Also get some meaningful insights out of the data which we have.

Description

Two CSV data is stored in different databases in MySQL.

  1. TB data is loaded in tb database.
  2. Movie reviews data is loaded in movies database.

Below steps will provide the steps to fetch and analyze the data.

1. Install and load additional packages which are required

library("reshape")
library("DBI")
library("tidyr")
## 
## Attaching package: 'tidyr'
## The following objects are masked from 'package:reshape':
## 
##     expand, smiths
library("ggplot2")

2. Making connection with tb database

con <- dbConnect(RMySQL::MySQL(),dbname="tb",user="root",password="admin")
con
## <MySQLConnection:0,0>

3. Fetching data from tb table and providing a summary

tb_data <- dbGetQuery(con,"Select * from tb")
summary(tb_data)
##    country               year          sex                child        
##  Length:3800        Min.   :1995   Length:3800        Min.   :    0.0  
##  Class :character   1st Qu.:1999   Class :character   1st Qu.:   25.0  
##  Mode  :character   Median :2004   Mode  :character   Median :   76.0  
##                     Mean   :2004                      Mean   :  493.2  
##                     3rd Qu.:2009                      3rd Qu.:  264.5  
##                     Max.   :2013                      Max.   :25661.0  
##                                                       NA's   :396      
##      adult           elderly        
##  Min.   :     0   Min.   :     0.0  
##  1st Qu.:  1128   1st Qu.:    84.5  
##  Median :  2589   Median :   230.0  
##  Mean   : 10864   Mean   :  1253.0  
##  3rd Qu.:  6706   3rd Qu.:   640.0  
##  Max.   :731540   Max.   :125991.0  
##  NA's   :413      NA's   :413
dbDisconnect(con)
## [1] TRUE

4. Making connection with movies database

con_movies <- dbConnect(RMySQL::MySQL(),dbname="movies",user="root",password="admin")
con_movies
## <MySQLConnection:0,1>

5. Fetching data from reviews table and providing a summary of data

moviereviews <- dbGetQuery(con_movies,"select * from reviews")
summary(moviereviews)
##      Votes             Rank          Title          
##  Min.   :   733   Min.   :7.900   Length:26         
##  1st Qu.:  1158   1st Qu.:8.100   Class :character  
##  Median :  1468   Median :8.100   Mode  :character  
##  Mean   : 39102   Mean   :8.204                     
##  3rd Qu.:  4000   3rd Qu.:8.300                     
##  Max.   :683198   Max.   :8.700
head(moviereviews)
##   Votes Rank                       Title
## 1  5999  8.3               Athadu (2005)
## 2  3563  8.5               Sairat (2016)
## 3  6097  8.5         Thani Oruvan (2015)
## 4  1206  8.1 Nuvvu Naaku Nachchav (2001)
## 5   994  8.4        Dipu Number 2 (1996)
## 6  1005  8.3     In Harihar Nagar (1990)

Lets derive some insights out of the data.

Lets organize the input data. We can split the title and year to different columns.

moviereviews_year <- separate(moviereviews, Title, c("Title","Year"),sep="\\(")
## l;;
moviereviews_year$Year <- gsub("\\)","",moviereviews_year$Year)
head(moviereviews_year)
##   Votes Rank                 Title Year
## 1  5999  8.3               Athadu  2005
## 2  3563  8.5               Sairat  2016
## 3  6097  8.5         Thani Oruvan  2015
## 4  1206  8.1 Nuvvu Naaku Nachchav  2001
## 5   994  8.4        Dipu Number 2  1996
## 6  1005  8.3     In Harihar Nagar  1990

Below are the movies released in each year

table(moviereviews_year$Year)
## 
## 1946 1975 1986 1989 1990 1992 1993 1995 1996 1999 2001 2002 2003 2005 2011 
##    1    1    1    1    1    1    2    1    2    1    1    1    1    2    2 
## 2012 2013 2015 2016 
##    3    1    2    1

Historically below are the total words received in each year

moviereviews_consolidate <- aggregate(moviereviews_year$Votes,by=list(Category=moviereviews_year$Year),FUN=sum)

Below is the average reviews which is received in each year

aggregate(moviereviews_year[,2],list(moviereviews_year$Year),mean)
##    Group.1        x
## 1     1946 8.600000
## 2     1975 8.700000
## 3     1986 8.100000
## 4     1989 7.900000
## 5     1990 8.300000
## 6     1992 8.100000
## 7     1993 8.250000
## 8     1995 8.100000
## 9     1996 8.250000
## 10    1999 8.100000
## 11    2001 8.100000
## 12    2002 7.900000
## 13    2003 8.000000
## 14    2005 8.150000
## 15    2011 8.100000
## 16    2012 8.133333
## 17    2013 8.200000
## 18    2015 8.400000
## 19    2016 8.500000

Plot between the movie released Year and Votes received on each year

plot(moviereviews_consolidate$Category,moviereviews_consolidate$x, xlab="Year", ylab = "Votes Received",main="Votes received per year") 

Above plot means that there were more number of reviews in past than the recent years.