Some data can look rough and confusing at first glance. Once you explore it effectively, however, you can get a summary of the message it contains.
Summarizing your data is a vital step in any data analysis. Before any real work can be done, you must get to know your data well so that the inferences you draw from it are accurate and unbiased.
Data summary is the process of getting introduced to your data.
In this lecture, we are going to look at a few ways data analysts check the summary of their data. I’ll list them below, and we shall go through them in detail as we read along.
We have two types of data summaries: numerical summaries and graphical summaries.
Numerical summaries include the mean, median, standard deviation, etc. Graphical summaries include boxplots, scatter plots, histograms, etc.
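As a minimal sketch of the numerical summaries named above, the base R functions mean(), median() and sd() can be applied to any numeric vector. The vector `x` below is made-up toy data, not part of the wine data used later in this lecture.

```r
# Toy values, invented for illustration only
x <- c(2, 4, 4, 5, 7, 8)

mean_x   <- mean(x)    # arithmetic mean
median_x <- median(x)  # middle value of the sorted data
sd_x     <- sd(x)      # sample standard deviation (n - 1 denominator)

print(c(mean = mean_x, median = median_x, sd = sd_x))
```

Note that sd() computes the sample standard deviation, so it divides by n - 1 rather than n.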
Straight away, let's start practicing how to perform these summaries with R. The data we shall be using in this lecture is the Wine data set from the UCI Machine Learning Repository; it can be downloaded from the URL used in the code below.
The first step is to read the data into R and replace the default column names with the original variable names listed in the description of the data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.
# Location of the raw data file (it has no header row)
wine <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine_data <- read.csv(wine, header = FALSE)
# One name per column, in the order given in the data description
variable_names <- c("Class_identifier", "Alcohol", "Malic_acid", "Ash", "Alcal_Ash",
                    "Magnessium", "Total_phenols", "Flavaniods",
                    "NonFlav_phenols", "phroantocyanins",
                    "color_intensity", "Hue", "OD_of_wines",
                    "proline")
names(wine_data) <- variable_names
The summary() command is a quick way to get the usual univariate summary information from your data.
summary(wine_data)
##  Class_identifier   Alcohol         Malic_acid      Ash
##  Min.   :1.000      Min.   :11.03   Min.   :0.740   Min.   :1.360
##  1st Qu.:1.000      1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210
##  Median :2.000      Median :13.05   Median :1.865   Median :2.360
##  Mean   :1.938      Mean   :13.00   Mean   :2.336   Mean   :2.367
##  3rd Qu.:3.000      3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558
##  Max.   :3.000      Max.   :14.83   Max.   :5.800   Max.   :3.230
##  Alcal_Ash       Magnessium       Total_phenols   Flavaniods
##  Min.   :10.60   Min.   : 70.00   Min.   :0.980   Min.   :0.340
##  1st Qu.:17.20   1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205
##  Median :19.50   Median : 98.00   Median :2.355   Median :2.135
##  Mean   :19.49   Mean   : 99.74   Mean   :2.295   Mean   :2.029
##  3rd Qu.:21.50   3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875
##  Max.   :30.00   Max.   :162.00   Max.   :3.880   Max.   :5.080
##  NonFlav_phenols   phroantocyanins   color_intensity   Hue
##  Min.   :0.1300    Min.   :0.410     Min.   : 1.280    Min.   :0.4800
##  1st Qu.:0.2700    1st Qu.:1.250     1st Qu.: 3.220    1st Qu.:0.7825
##  Median :0.3400    Median :1.555     Median : 4.690    Median :0.9650
##  Mean   :0.3619    Mean   :1.591     Mean   : 5.058    Mean   :0.9574
##  3rd Qu.:0.4375    3rd Qu.:1.950     3rd Qu.: 6.200    3rd Qu.:1.1200
##  Max.   :0.6600    Max.   :3.580     Max.   :13.000    Max.   :1.7100
##  OD_of_wines     proline
##  Min.   :1.270   Min.   : 278.0
##  1st Qu.:1.938   1st Qu.: 500.5
##  Median :2.780   Median : 673.5
##  Mean   :2.612   Mean   : 746.9
##  3rd Qu.:3.170   3rd Qu.: 985.0
##  Max.   :4.000   Max.   :1680.0
You may have observed that the output of the summary() command contains the minimum, the first quartile, the median, the mean, the third quartile and the maximum of each variable in the data.
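One thing to watch for in this output: Class_identifier is a label for the cultivar, so its mean and quartiles carry little meaning. A minimal sketch of one way to handle this, assuming you want the label treated as categorical, is shown on a small made-up data frame (`toy` and its values are invented for illustration):

```r
# Toy data frame, invented for illustration; it mimics a class label stored as numbers
toy <- data.frame(
  Class_identifier = c(1, 1, 2, 2, 2, 3),
  Alcohol          = c(13.1, 12.8, 12.2, 12.5, 11.9, 13.4)
)

# Treat the label as categorical rather than numeric
toy$Class_identifier <- factor(toy$Class_identifier)

# summary() now reports a count per class instead of a mean and quartiles
class_counts <- summary(toy$Class_identifier)
print(class_counts)
```

The same conversion could be applied to the wine data with wine_data$Class_identifier <- factor(wine_data$Class_identifier) before calling summary().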
This information is very useful for detecting data entry errors that might be present in your data. For instance, in this wine data you might want to find out why the range of the proline variable (278 to 1680) is so wide.
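A range check of this kind can be sketched in a few lines of base R. Everything in the example is an assumption made for illustration: the vector `alcohol` is toy data, and the bounds `lower` and `upper` are hypothetical plausibility limits that an analyst would have to choose from domain knowledge.

```r
# Toy measurements, invented for illustration; 1250 stands in for a possible typo of 12.50
alcohol <- c(12.4, 13.1, 12.9, 1250, 13.6)

# Hypothetical plausibility bounds chosen by the analyst, not taken from the data description
lower <- 5
upper <- 20

value_range <- range(alcohol)                            # minimum and maximum
suspect_idx <- which(alcohol < lower | alcohol > upper)  # positions outside the bounds

print(value_range)
print(alcohol[suspect_idx])
```

A value flagged this way is only a candidate for review; a wide range can also be a genuine feature of the variable.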
Now that we have seen how to perform numerical summaries on our data, let's see how to do graphical summaries. These are often more effective when it comes to exploring your data and spotting errors.
We shall use plots from the ggplot2 package. Make sure you install it (for example with install.packages("ggplot2")) and load it before continuing.
library(ggplot2)
Our first plot is the histogram, which is commonly used to examine the distribution of a continuous variable.
ggplot(wine_data, aes(x = Alcal_Ash)) +
  geom_histogram(bins = 45)
In your analysis, you may go further and investigate the tall and short bars in your histogram, which represent the most frequent and least frequent ranges of values within the variable.
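To see the numbers behind the bars, the bin counts can be computed without drawing anything. The sketch below uses base R's hist() with plot = FALSE on a made-up vector; the values and the break points are assumptions for illustration, and hist() does not necessarily place its bins where geom_histogram() does.

```r
# Toy values, invented for illustration only
x <- c(10, 11, 12, 15, 16, 16, 17, 18, 19, 25)

# Compute the bins without drawing; breaks every 5 units are an arbitrary choice
h <- hist(x, breaks = seq(10, 25, by = 5), plot = FALSE)

# Pair each bin's interval with its count
bins <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
print(bins)

# The tallest bar corresponds to the bin with the largest count
tallest <- bins[which.max(bins$count), ]
print(tallest)
```

By default hist() uses right-closed intervals, so a value that falls exactly on a break (such as 15 here) is counted in the bin to its left.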
Another useful plot that can reveal a lot of information in your data is the scatter plot. It is commonly used to visualize the relationship between two continuous variables.
ggplot(wine_data, aes(x = Alcal_Ash, y = color_intensity)) +
  geom_point(color = "blue")
It is worth noting that scatter plots are often more useful than histograms for detecting outliers, because they can reveal points that are unusual in combination even when each value looks ordinary on its own.
Can you observe any unusual pattern in the plot above? Are there outliers?
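If you would like a numerical cross-check for what your eyes pick out, one common convention (not the only one) is to flag values that lie more than 1.5 times the interquartile spread beyond the quartiles. A minimal sketch using base R's boxplot.stats() on made-up values:

```r
# Toy values, invented for illustration; 30 is deliberately far from the rest
x <- c(4.1, 4.8, 5.2, 5.6, 6.0, 6.3, 30)

# boxplot.stats() returns, among other things, the points beyond the whiskers
flagged <- boxplot.stats(x)$out
print(flagged)
```

This rule looks at one variable at a time, so it will not catch points that are unusual only in combination, which is where the scatter plot helps.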
In my next presentation, I’ll show you how to use some other plots to examine your data prior to analysis.
Thanks for reading!!!
For any questions or remarks, you can contact the author using the information below:
+233245935470