In this assignment, we will be fetching data from different databases from MySQL and making some basic transformation on source data. Also get some meaningful insights out of the data which we have.
Two CSV data is stored in different databases in MySQL.
tb
database.movies
database.Below steps will provide the steps to fetch and analyze the data.
library("reshape")
library("DBI")
library("tidyr")
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:reshape':
##
## expand, smiths
library("ggplot2")
tb
databasecon <- dbConnect(RMySQL::MySQL(),dbname="tb",user="root",password="admin")
con
## <MySQLConnection:0,0>
tb
table and providing a summarytb_data <- dbGetQuery(con,"Select * from tb")
summary(tb_data)
## country year sex child
## Length:3800 Min. :1995 Length:3800 Min. : 0.0
## Class :character 1st Qu.:1999 Class :character 1st Qu.: 25.0
## Mode :character Median :2004 Mode :character Median : 76.0
## Mean :2004 Mean : 493.2
## 3rd Qu.:2009 3rd Qu.: 264.5
## Max. :2013 Max. :25661.0
## NA's :396
## adult elderly
## Min. : 0 Min. : 0.0
## 1st Qu.: 1128 1st Qu.: 84.5
## Median : 2589 Median : 230.0
## Mean : 10864 Mean : 1253.0
## 3rd Qu.: 6706 3rd Qu.: 640.0
## Max. :731540 Max. :125991.0
## NA's :413 NA's :413
dbDisconnect(con)
## [1] TRUE
movies
databasecon_movies <- dbConnect(RMySQL::MySQL(),dbname="movies",user="root",password="admin")
con_movies
## <MySQLConnection:0,1>
reviews
table and providing a summary of datamoviereviews <- dbGetQuery(con_movies,"select * from reviews")
summary(moviereviews)
## Votes Rank Title
## Min. : 733 Min. :7.900 Length:26
## 1st Qu.: 1158 1st Qu.:8.100 Class :character
## Median : 1468 Median :8.100 Mode :character
## Mean : 39102 Mean :8.204
## 3rd Qu.: 4000 3rd Qu.:8.300
## Max. :683198 Max. :8.700
head(moviereviews)
## Votes Rank Title
## 1 5999 8.3 Athadu (2005)
## 2 3563 8.5 Sairat (2016)
## 3 6097 8.5 Thani Oruvan (2015)
## 4 1206 8.1 Nuvvu Naaku Nachchav (2001)
## 5 994 8.4 Dipu Number 2 (1996)
## 6 1005 8.3 In Harihar Nagar (1990)
Lets derive some insights out of the data.
moviereviews_year <- separate(moviereviews, Title, c("Title","Year"),sep="\\(")
## l;;
moviereviews_year$Year <- gsub("\\)","",moviereviews_year$Year)
head(moviereviews_year)
## Votes Rank Title Year
## 1 5999 8.3 Athadu 2005
## 2 3563 8.5 Sairat 2016
## 3 6097 8.5 Thani Oruvan 2015
## 4 1206 8.1 Nuvvu Naaku Nachchav 2001
## 5 994 8.4 Dipu Number 2 1996
## 6 1005 8.3 In Harihar Nagar 1990
table(moviereviews_year$Year)
##
## 1946 1975 1986 1989 1990 1992 1993 1995 1996 1999 2001 2002 2003 2005 2011
## 1 1 1 1 1 1 2 1 2 1 1 1 1 2 2
## 2012 2013 2015 2016
## 3 1 2 1
moviereviews_consolidate <- aggregate(moviereviews_year$Votes,by=list(Category=moviereviews_year$Year),FUN=sum)
aggregate(moviereviews_year[,2],list(moviereviews_year$Year),mean)
## Group.1 x
## 1 1946 8.600000
## 2 1975 8.700000
## 3 1986 8.100000
## 4 1989 7.900000
## 5 1990 8.300000
## 6 1992 8.100000
## 7 1993 8.250000
## 8 1995 8.100000
## 9 1996 8.250000
## 10 1999 8.100000
## 11 2001 8.100000
## 12 2002 7.900000
## 13 2003 8.000000
## 14 2005 8.150000
## 15 2011 8.100000
## 16 2012 8.133333
## 17 2013 8.200000
## 18 2015 8.400000
## 19 2016 8.500000
plot(moviereviews_consolidate$Category,moviereviews_consolidate$x, xlab="Year", ylab = "Votes Received",main="Votes received per year")
Above plot means that there were more number of reviews in past than the recent years.