How To Perform Data Summary With R

Author: Daniel Abban

Some data can look rough and confusing at first glance. However, once you are able to explore the data effectively, you can get a summary of the message it contains.

Summarizing your data is a vital step in any data analysis. Before any real work can be done, you must first get to know your data well in order to make accurate and unbiased inferences from it.

Data summary is the process of getting introduced to your data.

In this lecture, we are going to look at a few ways data analysts summarize their data. I’ll list them below, and we shall go through them in detail as we read along.

There are two types of data summary:

Numerical summaries
Graphical summaries

Numerical summaries include the mean, median, standard deviation, etc. Graphical summaries include boxplots, scatter plots, histograms, etc.
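As a quick illustration of the numerical summaries just named, the sketch below computes them for a small made-up vector. The values are invented for illustration and are not taken from the wine data.

```r
# Toy vector (invented values, for illustration only)
x <- c(12, 13, 14, 15, 16)

mean(x)    # arithmetic mean: 14
median(x)  # middle value: 14
sd(x)      # sample standard deviation (uses the n - 1 denominator)
```

The same three functions can be applied to any numeric column of a data frame.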

Straight away, let's start practicing how to perform these summaries with R. The data we shall be using in this lecture is taken from the UCI Machine Learning Repository; the download URL appears in the code below.

The first step is to read the data into R and replace the default names with the original variable names as listed in the description of the data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

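# URL of the raw wine data on the UCI repository
wine <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"

# The file has no header row, so R assigns default names (V1, V2, ...)
wine_data <- read.csv(wine, header = FALSE)

# Variable names, in the same order as the columns in the file
variable_names <- c("Class_identifier", "Alcohol", "Malic_acid", "Ash", "Alcal_Ash",
                    "Magnessium", "Total_phenols","Flavaniods",
                    "NonFlav_phenols", "phroantocyanins",
                    "color_intensity", "Hue", "OD_of_wines",
                    "proline")

# Replace the default names with the descriptive ones
names(wine_data) <- variable_names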
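After importing, it is usually worth checking the shape and column types of what you have read. The sketch below does this on a small stand-in data frame, toy_wine, whose values are invented for illustration so that the snippet runs without a download; with the real data you would use wine_data instead. It also converts Class_identifier to a factor, on the assumption that it is a cultivar label (1, 2 or 3) rather than a measurement.

```r
# Toy stand-in for wine_data (invented values, for illustration only)
toy_wine <- data.frame(
  Class_identifier = c(1, 1, 2, 2, 3),
  Alcohol          = c(14.2, 13.2, 12.4, 12.3, 13.7),
  proline          = c(1065, 1050, 520, 680, 750)
)

dim(toy_wine)   # number of rows and columns
str(toy_wine)   # type and first few values of each column
head(toy_wine)  # first rows of the data

# Treat the class label as a category rather than a number
toy_wine$Class_identifier <- factor(toy_wine$Class_identifier)
table(toy_wine$Class_identifier)  # observations per class
```

Once the label is a factor, R reports counts per class for it instead of a mean, which is usually the more meaningful description of a label.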

The summary() function is a quick way to get the usual univariate summary statistics for your data.

summary(wine_data)

##  Class_identifier    Alcohol        Malic_acid         Ash       
##  Min.   :1.000    Min.   :11.03   Min.   :0.740   Min.   :1.360  
##  1st Qu.:1.000    1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210  
##  Median :2.000    Median :13.05   Median :1.865   Median :2.360  
##  Mean   :1.938    Mean   :13.00   Mean   :2.336   Mean   :2.367  
##  3rd Qu.:3.000    3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558  
##  Max.   :3.000    Max.   :14.83   Max.   :5.800   Max.   :3.230  
##    Alcal_Ash       Magnessium     Total_phenols     Flavaniods   
##  Min.   :10.60   Min.   : 70.00   Min.   :0.980   Min.   :0.340  
##  1st Qu.:17.20   1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205  
##  Median :19.50   Median : 98.00   Median :2.355   Median :2.135  
##  Mean   :19.49   Mean   : 99.74   Mean   :2.295   Mean   :2.029  
##  3rd Qu.:21.50   3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875  
##  Max.   :30.00   Max.   :162.00   Max.   :3.880   Max.   :5.080  
##  NonFlav_phenols  phroantocyanins color_intensity       Hue        
##  Min.   :0.1300   Min.   :0.410   Min.   : 1.280   Min.   :0.4800  
##  1st Qu.:0.2700   1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825  
##  Median :0.3400   Median :1.555   Median : 4.690   Median :0.9650  
##  Mean   :0.3619   Mean   :1.591   Mean   : 5.058   Mean   :0.9574  
##  3rd Qu.:0.4375   3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200  
##  Max.   :0.6600   Max.   :3.580   Max.   :13.000   Max.   :1.7100  
##   OD_of_wines       proline      
##  Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.938   1st Qu.: 500.5  
##  Median :2.780   Median : 673.5  
##  Mean   :2.612   Mean   : 746.9  
##  3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :4.000   Max.   :1680.0

You may have noticed that the output of summary() contains, for each variable in the data, the minimum, the first quartile, the median, the mean, the third quartile, and the maximum.

This information is very useful for detecting data entry errors that might be present in your data. For instance, in this wine data you might want to find out why the range of the proline variable is so wide (278 to 1680).
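One quick way to judge whether a wide range such as proline's is suspicious is Tukey's 1.5 × IQR rule of thumb. The sketch below applies it to the proline quartiles printed by summary() above. The rule and the variable names are additions for illustration, not part of the original analysis, and the quartiles reported by summary() can differ slightly from the hinges a boxplot uses.

```r
# Proline quartiles and extremes, copied from the summary() output above
q1      <- 500.5
q3      <- 985.0
min_val <- 278.0
max_val <- 1680.0

iqr         <- q3 - q1          # interquartile range
lower_fence <- q1 - 1.5 * iqr   # values below this would be flagged
upper_fence <- q3 + 1.5 * iqr   # values above this would be flagged

# TRUE would mean at least one value lies beyond that fence
c(below = min_val < lower_fence, above = max_val > upper_fence)
```

Both checks come out FALSE here, so by this rule the wide range reflects a spread-out variable rather than an obvious entry error. On the full data, IQR(wine_data$proline) and range(wine_data$proline) give the same ingredients directly.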

Now that we have seen how to produce numerical summaries of our data, let's see how to produce graphical summaries. Plots often make it easier to explore your data and spot errors.

We shall use plots from the ggplot2 package. Make sure you install the package (for example with install.packages("ggplot2")) and load it before continuing.

library(ggplot2)

Our first plot is the histogram, which is commonly used to examine the distribution of a continuous variable.

ggplot(wine_data, aes(x = Alcal_Ash)) +
        geom_histogram(bins = 45)

In your analysis, you may go further and investigate the tall and short bars in your histogram, which represent the most frequent and least frequent ranges of values within the variable.
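To see which value ranges the tall and short bars correspond to, you can also count the observations per bin directly. The sketch below uses base R's cut() and table() on a small made-up vector (invented values, standing in for a column such as wine_data$Alcal_Ash). The break points are an arbitrary choice for illustration.

```r
# Toy vector (invented values, for illustration only)
x <- c(10.6, 16.8, 17.2, 18.5, 19.5, 20.0, 21.5, 24.0, 30.0)

# Bin edges: 10, 15, 20, 25, 30
breaks <- seq(10, 30, by = 5)

# Assign each value to a bin; bins are closed on the right, e.g. (15,20]
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
counts

# Label of the fullest bin (the tallest bar): "(15,20]"
names(counts)[which.max(counts)]
```

The bin with the highest count is the tallest bar of a histogram drawn with the same breaks, and bins with very low counts are the short bars worth a second look.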

Another useful plot that can reveal a lot of information in your data is the scatter plot. It is commonly used to visualize the relationship between two continuous variables.

ggplot(wine_data, aes(x = Alcal_Ash, y = color_intensity)) +
        geom_point(color = "blue")

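It is worth noting that scatter plots are often more useful than histograms for detecting outliers, because a point can look unremarkable on each variable separately and still stand out when the two variables are plotted together.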
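A scatter plot can be paired with a numerical measure of the relationship it shows. The sketch below computes the Pearson correlation with cor() for two short made-up vectors. The values are invented for illustration; with the real data you would pass wine_data$Alcal_Ash and wine_data$color_intensity instead.

```r
# Toy vectors (invented values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear
# relationship, -1 is a perfect decreasing line
r <- cor(x, y)
r  # about 0.77 for these toy values
```

A single number like this can hide outliers and curved patterns, which is why it complements the scatter plot rather than replacing it.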

Can you observe any unusual pattern in the plot above? Are there outliers?

In my next presentation, I’ll show you how to use some other plots to examine your data prior to analysis.

Thanks for reading!!!

For any questions or remarks, you can contact the author using the information below:

danielabban@outlook.com

+233245935470