In data analytics, one of the very first and most important steps is to visualize the data. When you plot the data, you gain insight into how it is distributed, and missing values are easier to spot just by looking at a graph. Selecting the right plot is also important: a good choice makes the data easier to read, while a poor choice can lead to wrong assumptions in the analysis. In this blog post, I am going to use some of the data visualization tools that I usually use to get an overview of what I will be working with.
There are many reasons to visualize your data: to see relationships between variables, to examine distributions, to compare variables, and more.
One of the first graphs that I use is “plot”, so I can see the relationships between the variables in the dataset. A correlation plot can also be very useful for seeing how the variables relate to each other.
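library(ggplot2)   # plotting, and the diamonds dataset
library(dplyr)     # %>% and mutate()
library(forcats)   # fct_infreq()
library(reshape2)  # melt()
library(corrplot)  # corrplot()
data(diamonds)
data <- diamonds
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables: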
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(data)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
# Scatterplot matrix of carat, depth, table, price, and x
plot(data[,c(1,5,6,7,8)])
# Correlation of each numeric variable with price
corrplot(cor(data[,c(1,5,6,7,8,9,10)], data$price, use = "na.or.complete"),
         type = "lower",
         order = "original", tl.col = "black", tl.srt = 45, tl.cex = 0.55, cl.pos='n', addgrid.col = FALSE)
The histogram is one of the most widely used plots. In this case it shows the distribution of a single variable, price. The histogram makes the shape of the distribution clear: we can see that most of the diamonds are priced below 5000.
# Reshape the numeric columns to long format and draw one histogram per variable
melt(data[,c(1,5,6,7,8,9,10)]) %>% ggplot(aes(value)) +
  facet_wrap(~variable, scales = "free") +
  geom_histogram()
## No id variables; using all as measure variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
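The `stat_bin()` message above suggests choosing the bin width explicitly instead of relying on the default of 30 bins. Here is a minimal sketch for the price variable. The bin width of 500 is an assumption picked for illustration, not a recommended value, so it is worth trying a few widths to see which one shows the distribution best.

```r
library(ggplot2)  # provides ggplot() and the diamonds dataset

# Histogram of price with an explicit bin width.
# binwidth = 500 is an illustrative assumption, not a tuned value.
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "lightblue", colour = "white") +
  theme_bw() +
  xlab("Price") +
  ylab("Number of diamonds")

print(p)
```

A smaller bin width shows more detail but also more noise, while a larger one smooths the shape and can hide small peaks.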
hist(data$price)
A horizontal bar chart is also a good plot: it makes it simple to compare categories side by side.
data %>%
  # Order the color levels by how often they occur
  mutate(color_freq = fct_infreq(color)) %>%
  ggplot(aes(x = color_freq)) +
  geom_bar(fill = 'lightblue') +
  coord_flip() +
  theme_bw() +
  theme(axis.text.y = element_text(size = rel(0.7), angle = 0)) +
  ggtitle("The Count of Diamonds by Color") +
  xlab('Diamond color letter') +
  ylab('Number of diamonds')
Another good visualization tool is the boxplot. It shows how the data is spread using the median and the 25th and 75th percentiles, and it also helps us see whether there are any outliers in our variables.
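The numbers behind a boxplot can also be computed directly, which helps when reading the plot. The sketch below does this for price over the whole dataset (not per color), and it assumes the usual 1.5 × IQR rule for flagging outliers, which is the default that `geom_boxplot()` uses for its whiskers. The variable names are only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Quartiles of price: the box spans the 25th to the 75th percentile,
# and the line inside the box is the median (50th percentile).
q <- quantile(diamonds$price, probs = c(0.25, 0.5, 0.75))
print(q)

# Interquartile range and the 1.5 * IQR fences (assumed outlier rule)
iqr <- q[["75%"]] - q[["25%"]]
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Points outside the fences are the ones a boxplot draws individually
n_outliers <- sum(diamonds$price < lower_fence | diamonds$price > upper_fence)
print(n_outliers)
```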
ggplot(data, aes(x = color, y = price, fill = color)) +
  geom_boxplot()
Another important plot is one that shows whether there are any missing values in the dataset. The Amelia package lets you plot a dataset and shows where data is missing.
Amelia::missmap(data)
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
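The missingness map can be cross-checked with a quick count per column. This is a minimal sketch using base R. It also counts zeros in x, y, and z, because the summary earlier shows a minimum of 0 for these dimensions. Treating those zeros as suspect values is my assumption, and a missingness map will not flag them since they are not `NA`.

```r
library(ggplot2)  # provides the diamonds dataset

data <- diamonds

# Number of NA values in each column
na_counts <- colSums(is.na(data))
print(na_counts)

# Zeros in the dimension columns are not NA, so they need a separate check.
# Whether they should count as missing is an assumption, not a fact from the data.
zero_counts <- colSums(data[, c("x", "y", "z")] == 0)
print(zero_counts)
```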
We can see that visualization is one of the most important steps when working with data. Having the right plotting tools helps us make faster progress in our analysis. It is always a good idea to choose a simple, clear graph with a good data-ink ratio.