Data 621 - Blog 1

Visualization

In data analytics, one of the first and most important steps is to visualize the data. When you plot the data, you gain insight into how it is distributed, and missing values are easier to spot just by looking at a graph. Choosing the right type of plot is also important: a good choice makes the data easier to read, while a poor choice can lead to wrong assumptions in the analysis. In this blog, I am going to use some of the visualization tools that I usually rely on to get an overview of the data I will be working with.

There are many reasons to visualize your data: to see relationships between variables, to inspect distributions, to compare variables, and more.

One of the first functions I use is “plot”, which shows the relationships between the variables in the dataset. A correlation plot can also be very useful for seeing how the variables relate to each other.

library(ggplot2)  # provides the diamonds dataset
data(diamonds)
data <- diamonds
str(data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

summary(data)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

plot(data[,c(1,5,6,7,8)])

library(corrplot)
# Correlation of each numeric column with price
corrplot(cor(data[, c(1, 5, 6, 7, 8, 9, 10)], data$price, use = "na.or.complete"),
         type = "lower", order = "original", tl.col = "black", tl.srt = 45,
         tl.cex = 0.55, cl.pos = "n", addgrid.col = FALSE)
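The corrplot call above only shows how each variable correlates with price. As a complementary sketch, the full pairwise correlation matrix of the numeric columns can be computed with base R's cor(). The names num_cols and cor_mat are illustrative and are not part of the original analysis.

```r
library(ggplot2)  # only needed for the diamonds dataset

data(diamonds)
data <- diamonds

# Flag the numeric columns (carat, depth, table, price, x, y, z)
num_cols <- sapply(data, is.numeric)

# Pairwise Pearson correlations between all numeric variables
cor_mat <- cor(data[, num_cols], use = "complete.obs")
round(cor_mat, 2)

# Variables ranked by their correlation with price
sort(cor_mat[, "price"], decreasing = TRUE)
```

If the corrplot package is loaded, passing cor_mat to corrplot() with type = "lower" should draw the whole matrix rather than a single column.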

The histogram is one of the most commonly used plots. In this case we can look at the distribution of a single variable, price. The histogram shows the shape of the distribution clearly, and we can see that most of the diamonds are priced below 5000.

library(reshape2)  # melt()
library(dplyr)     # %>%
melt(data[, c(1, 5, 6, 7, 8, 9, 10)]) %>%
    ggplot(aes(value)) +
    facet_wrap(~variable, scales = "free") +
    geom_histogram()

## No id variables; using all as measure variables

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(data$price)
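The stat_bin() message above suggests choosing a bin width explicitly. The sketch below does that and also checks numerically how many diamonds fall below 5000. The bin width of 500 and the name share_below_5000 are arbitrary choices for illustration, not values from the original analysis.

```r
library(ggplot2)

data(diamonds)
data <- diamonds

# Share of diamonds priced below 5000
share_below_5000 <- mean(data$price < 5000)
share_below_5000

# Price histogram with an explicit bin width; 500 is an arbitrary choice
p <- ggplot(data, aes(x = price)) +
    geom_histogram(binwidth = 500, fill = "lightblue", colour = "white") +
    geom_vline(xintercept = 5000, linetype = "dashed") +  # mark the 5000 cutoff
    theme_bw() +
    xlab("Price") +
    ylab("Number of diamonds")
p
```

Trying a few different bin widths is usually worthwhile, since a very wide bin can hide structure and a very narrow one can make the plot noisy.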

A horizontal bar chart is also a good choice; it makes it simple to compare categories side by side.

library(dplyr)    # mutate(), %>%
library(forcats)  # fct_infreq()

data %>%
    mutate(color_freq = fct_infreq(color)) %>%  # order colors by frequency
    ggplot(aes(x = color_freq)) +
    geom_bar(fill = "lightblue") +
    coord_flip() +
    theme_bw() +
    theme(axis.text.y = element_text(size = rel(0.7), angle = 0)) +
    ggtitle("Count of Diamonds by Color") +
    xlab("Diamond Color Letter") +
    ylab("Number of Diamonds")
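The bar heights can be cross-checked against a plain frequency table. This is a minimal sketch using base R's table(); the name color_counts is illustrative.

```r
library(ggplot2)  # only needed for the diamonds dataset

data(diamonds)
data <- diamonds

# Count of diamonds per color grade, most frequent first
color_counts <- sort(table(data$color), decreasing = TRUE)
color_counts

# The same counts as proportions of the whole dataset
round(prop.table(color_counts), 3)
```

The counts should agree with the color column of the summary() output shown earlier.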

Another good visualization tool is the boxplot. It summarizes the data by the median and the 25th and 75th percentiles, and it also helps reveal outliers in our variables.

ggplot(data, aes(x=color, y=price,fill=color)) + 
    geom_boxplot()
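The numbers behind the boxes can also be computed directly. The sketch below gets the quartiles of price for each color and counts points above the upper whisker. It assumes the conventional 1.5 * IQR rule, which is my understanding of the geom_boxplot() default. The helper count_high_outliers is hypothetical and written only for this example.

```r
library(ggplot2)  # only needed for the diamonds dataset

data(diamonds)
data <- diamonds

# Split price by color grade
price_by_color <- split(data$price, data$color)

# Quartiles of price within each color (box edges and middle line)
price_quartiles <- t(sapply(price_by_color, quantile,
                            probs = c(0.25, 0.5, 0.75)))
price_quartiles

# Hypothetical helper: count values above Q3 + 1.5 * IQR
count_high_outliers <- function(p) {
    q <- quantile(p, c(0.25, 0.75))
    sum(p > q[2] + 1.5 * (q[2] - q[1]))
}
high_outliers <- sapply(price_by_color, count_high_outliers)
high_outliers
```

Looking at these numbers next to the plot makes it easier to judge whether the points drawn as outliers are a handful of extreme values or simply the long right tail of the price distribution.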

Another important plot is one that shows whether there are any missing values in the dataset. The Amelia package lets you plot a dataset and shows where data are missing.

Amelia::missmap(data)

## Warning: Unknown or uninitialised column: 'arguments'.

## Warning: Unknown or uninitialised column: 'arguments'.

## Warning: Unknown or uninitialised column: 'imputations'.
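A quick numeric check can accompany the missingness map. The sketch below counts NA values per column and also counts zeros in x, y and z. The summary() output earlier shows a minimum of 0 for these dimension columns, and since a physical dimension of zero is implausible, those zeros may stand in for missing measurements. That reading is my assumption and is not stated in the dataset itself.

```r
library(ggplot2)  # only needed for the diamonds dataset

data(diamonds)
data <- diamonds

# Number of NA values in each column
na_counts <- colSums(is.na(data))
na_counts

# Number of zero values in the dimension columns x, y and z
zero_counts <- colSums(data[, c("x", "y", "z")] == 0)
zero_counts
```

A map of NA values alone would not flag these rows, so it can be worth checking for placeholder values such as 0 before concluding that a dataset is complete.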

Conclusion

We can see that visualization is one of the most important steps when working with data. Having the right plotting tools helps us progress faster in our analysis. It is always a good idea to choose a simple, clear graph with a good data-ink ratio.

Anthony Munoz

4/21/2020
