Data Visualization to Explore Datasets:
Data visualization is essential for communicating data efficiently. It takes raw or cleaned data, models it, and presents it so that conclusions can be reached. Data visualization also helps you understand how to work efficiently with a dataset, because it lets you see the different variables and the relationships between them. In this blog, I explore some of the data visualization plots and tools that I usually use to understand a dataset.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(reshape)
##
## Attaching package: 'reshape'
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
## The following object is masked from 'package:openintro':
##
## tips
gifted
## # A tibble: 36 x 8
## score fatheriq motheriq speak count read edutv cartoons
## <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 159 115 117 18 26 1.9 3 2
## 2 164 117 113 20 37 2.5 1.75 3.25
## 3 154 115 118 20 32 2.2 2.75 2.5
## 4 157 113 131 12 24 1.7 2.75 2.25
## 5 156 110 109 17 34 2.2 2.25 2.5
## 6 150 113 109 13 28 1.9 1.25 3.75
## 7 155 118 119 19 24 1.8 2 3
## 8 161 117 120 18 32 2.3 2.25 2.5
## 9 163 111 128 22 28 2.1 1 4
## 10 162 122 120 18 27 2.1 2.25 2.75
## # ... with 26 more rows
data <- gifted
summary(data)
## score fatheriq motheriq speak count
## Min. :150.0 Min. :110.0 Min. :101.0 Min. :10 Min. :21.00
## 1st Qu.:155.0 1st Qu.:112.0 1st Qu.:113.8 1st Qu.:17 1st Qu.:28.00
## Median :159.0 Median :115.0 Median :118.0 Median :18 Median :31.00
## Mean :159.1 Mean :114.8 Mean :118.2 Mean :18 Mean :30.69
## 3rd Qu.:162.0 3rd Qu.:116.2 3rd Qu.:122.2 3rd Qu.:20 3rd Qu.:34.25
## Max. :169.0 Max. :126.0 Max. :131.0 Max. :23 Max. :39.00
## read edutv cartoons
## Min. :1.700 Min. :0.750 Min. :1.750
## 1st Qu.:2.000 1st Qu.:1.750 1st Qu.:2.688
## Median :2.200 Median :2.000 Median :3.000
## Mean :2.136 Mean :1.958 Mean :3.062
## 3rd Qu.:2.300 3rd Qu.:2.250 3rd Qu.:3.500
## Max. :2.500 Max. :3.000 Max. :4.500
plot(data[,c(1,5,6,7,8)])
library(corrplot)
## corrplot 0.90 loaded
# Correlation matrix of score, motheriq, count, read and edutv.
# The gifted data has no price column, so the stray data$price argument is dropped.
corrplot(cor(data[,c(1,3,5,6,7)], use = "na.or.complete"),
type = "lower",
order = "original", tl.col = "black", tl.srt = 45, tl.cex = 0.55, cl.pos='n', addgrid.col = FALSE)
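The correlation plot is a picture of a correlation matrix, so it can help to look at the numbers themselves as well. Here is a minimal sketch, assuming the openintro package is installed. The column names are the ones printed in the tibble earlier, while `vars` and `cor_mat` are names I made up for this example.

```r
library(openintro)  # provides the gifted dataset used in this post

# Same columns as c(1, 3, 5, 6, 7), but selected by name so the
# choice is easier to read. `vars` is a name introduced here.
vars <- c("score", "motheriq", "count", "read", "edutv")

# Pairwise Pearson correlations, rounded for readability.
cor_mat <- round(cor(gifted[, vars], use = "complete.obs"), 2)
print(cor_mat)
```

Selecting by name instead of by position also protects the code if the column order ever changes.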
### A histogram is used to understand the distribution of different variables:
I am exploring the distribution of the score variable.
hist(data$score,col="blue")
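The same idea extends to several variables at once. The sketch below is one possible way to do it in base R, assuming the openintro package is installed. The choice of columns and the name `vars` are my own for illustration.

```r
library(openintro)  # provides the gifted dataset

# Columns to inspect; the names come from the tibble printed earlier.
# `vars` is a name introduced here.
vars <- c("score", "fatheriq", "motheriq", "read")

# Draw the histograms on a 2 x 2 grid, then restore the old layout.
old_par <- par(mfrow = c(2, 2))
for (v in vars) {
  hist(gifted[[v]], main = paste("Histogram of", v), xlab = v, col = "blue")
}
par(old_par)
```

Putting the panels side by side makes it easier to compare the shape and spread of each variable.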