Visualization Packages in R

The fastest way to improve your understanding of your dataset is to visualize it. Visualization means creating charts and plots from the raw data. Plots of the distribution or spread of attributes can help you spot outliers and strange or invalid data, and can suggest data transformations you could apply.

  1. Visualization Packages: There are many ways to visualize data in R, but a few packages have surfaced as perhaps being the most generally useful.
  2. Univariate Visualization: Plots you can use to understand each attribute standalone.
  3. Multivariate Visualization: Plots that can help you to better understand the interactions between attributes.

Univariate Visualization

Univariate plots are plots of individual attributes without interactions. The goal is to learn something about the distribution, central tendency and spread of each attribute.

Histograms

Histograms provide a bar chart of a numeric attribute split into bins with the height showing the number of instances that fall into each bin. They are useful to get an indication of the distribution of an attribute.
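
As a minimal sketch, the loop below draws one histogram per numeric attribute with base R's hist(). The built-in iris dataset and the 1-by-4 panel layout are assumptions chosen for illustration; substitute your own data frame and column indices.

```r
# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# Put four panels side by side, remembering the old graphics settings.
old_par <- par(mfrow = c(1, 4))

# Columns 1 to 4 of iris are numeric; column 5 is the class label.
for (i in 1:4) {
  hist(iris[, i], main = names(iris)[i], xlab = names(iris)[i])
}

# Restore the previous graphics settings.
par(old_par)
```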

Density Plots

Density plots smooth a histogram into a continuous line. They are useful for a more abstract depiction of the distribution of each variable.
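
One possible way to produce such plots is to pass the result of base R's density() to plot(). This sketch assumes the built-in iris dataset and the default kernel and bandwidth, which may not suit every attribute.

```r
# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# One panel per numeric attribute.
old_par <- par(mfrow = c(1, 4))

for (i in 1:4) {
  # density() computes a kernel density estimate; plot() draws it as a line.
  plot(density(iris[, i]), main = names(iris)[i])
}

par(old_par)
```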

Box And Whisker Plots

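Box and whisker plots summarize the distribution of an attribute in a compact form. The box captures the middle 50% of the data, the line inside the box shows the median, and the whiskers show the reasonable extent of the data (in R's boxplot(), by default, the most extreme points within 1.5 times the interquartile range of the box). Any dots outside the whiskers are good candidates for outliers.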
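
A minimal sketch using base R's boxplot(), again assuming the built-in iris dataset as a stand-in for your own data:

```r
# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# One panel per numeric attribute.
old_par <- par(mfrow = c(1, 4))

for (i in 1:4) {
  # Points drawn beyond the whiskers are outlier candidates.
  boxplot(iris[, i], main = names(iris)[i])
}

par(old_par)
```

If you want the numbers behind a plot, boxplot.stats() returns the five summary values and the flagged points without drawing anything.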

Bar Plots

In datasets that have categorical rather than numeric attributes, we can create bar plots that give an idea of the proportion of instances that belong to each category.
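
The sketch below counts the instances in each category with table() and draws the counts with barplot(). It assumes the mlbench package is installed and uses its BreastCancer dataset purely as an example of categorical data; the choice of columns 2 to 9 is specific to that dataset.

```r
# Assumption: the mlbench package is installed; BreastCancer is an example dataset.
library(mlbench)
data(BreastCancer)

# Eight panels in a 2-by-4 grid.
old_par <- par(mfrow = c(2, 4))

# Column 1 is a sample ID; columns 2 to 9 are categorical attributes.
for (i in 2:9) {
  counts <- table(BreastCancer[, i])   # instances per category
  barplot(counts, main = names(BreastCancer)[i])
}

par(old_par)
```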

## Loading required package: mlbench

Missing Plot

You can create a missing plot to get a quick idea of the amount of missing data in your dataset. The x-axis shows attributes and the y-axis shows instances. Horizontal lines indicate missing data for an instance; vertical blocks represent missing data for an attribute.
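
A minimal sketch of a missing plot, assuming the Amelia package (which provides missmap()) and the mlbench package are installed. The Soybean dataset is an assumption, chosen because it contains missing values; the colors are arbitrary.

```r
# Assumption: the Amelia and mlbench packages are installed.
library(Amelia)
library(mlbench)

# Soybean is used here only because it contains missing values.
data(Soybean)

# First color marks missing cells, second color marks observed cells.
missmap(Soybean, col = c("black", "grey"), legend = FALSE)
```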

## Loading required package: Amelia

Multivariate Visualization

Multivariate plots are plots of the relationship or interactions between attributes. The goal is to learn something about the distribution, central tendency and spread over groups of data, typically pairs of attributes.

Correlation Plot

We can calculate the correlation between each pair of numeric attributes. These pairwise correlations can be plotted in a correlation matrix to give an idea of which attributes change together.
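
As a sketch, the pairwise correlations can be computed with base R's cor() and drawn with corrplot() from the corrplot package. The iris dataset and the "circle" method are assumptions made for illustration.

```r
# Assumption: the corrplot package is installed.
library(corrplot)

# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# Pearson correlation between each pair of the four numeric attributes.
correlations <- cor(iris[, 1:4])

# Draw the matrix with one circle per pair of attributes.
corrplot(correlations, method = "circle")
```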

## Loading required package: corrplot

A dot representation is used, where blue represents positive correlation and red represents negative correlation. The larger the dot, the stronger the correlation. We can see that the matrix is symmetrical and that the diagonal attributes are perfectly positively correlated (because the diagonal shows the correlation of each attribute with itself). We can see that some of the attributes are highly correlated.

Scatter Plot Matrix

A scatter plot plots two variables together, one on each of the x- and y-axes with points showing the interaction. The spread of the points indicates the relationship between the attributes. You can create scatter plots for all pairs of attributes in your dataset, called a scatter plot matrix.
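
Base R's pairs() draws a scatter plot for every pair of columns in one call. The sketch below assumes the built-in iris dataset and keeps only its numeric attributes.

```r
# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# Keep the four numeric attributes and drop the class label.
numeric_attrs <- iris[, 1:4]

# One scatter plot for each pair of attributes.
pairs(numeric_attrs)
```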

Scatter Plot Matrix By Class

The points in a scatter plot matrix can be colored by the class label in classification problems. This can help to spot clear (or unclear) separation of classes and perhaps give an idea of how difficult the problem may be.
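
One simple way to color by class is to pass the integer codes of the class factor as the col argument of pairs(). This sketch assumes the built-in iris dataset, whose Species column is the class label.

```r
# Assumption: the built-in iris dataset stands in for your own data.
data(iris)

# Convert the class factor to integer codes (1, 2, 3), one color per class.
class_colors <- as.integer(iris$Species)

# Scatter plot matrix of the numeric attributes, colored by class.
pairs(iris[, 1:4], col = class_colors)
```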

Density Plots By Class

A density plot of each attribute, broken down by class, can help to see the separation of classes. It can also help to understand the overlap in class values for an attribute.
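
A minimal sketch using featurePlot() from the caret package, which is assumed to be installed. The iris dataset and the free axis scales are illustrative choices, not requirements.

```r
# Assumption: the caret package is installed.
library(caret)

# Assumption: the built-in iris dataset stands in for your own data.
data(iris)
x <- iris[, 1:4]   # numeric attributes
y <- iris[, 5]     # class label

# Let each panel choose its own axis ranges.
scales <- list(x = list(relation = "free"), y = list(relation = "free"))

# One density curve per class for each attribute.
p <- featurePlot(x = x, y = y, plot = "density", scales = scales)
print(p)
```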

## Loading required package: ggplot2

Box And Whisker Plots By Class

Box and whisker plots by class can also help in understanding how each attribute relates to the class value, but from a different perspective to that of the density plots.
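
The same idea can be sketched with featurePlot() from the caret package (assumed to be installed), this time requesting box plots. The iris dataset is again only a stand-in.

```r
# Assumption: the caret package is installed.
library(caret)

# Assumption: the built-in iris dataset stands in for your own data.
data(iris)
x <- iris[, 1:4]   # numeric attributes
y <- iris[, 5]     # class label

# One box per class for each attribute, with a separate y-axis per panel.
p <- featurePlot(x = x, y = y, plot = "box",
                 scales = list(y = list(relation = "free")))
print(p)
```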

Tips For Data Visualization