‘As a Data Scientist, if you know nothing else, you need to know how to take some data, munge it, clean it, filter it, mine it, visualize it and then validate it.’ (Thomson Nguyen, CEO and Co-founder of Framed Data. For the full speech please follow this link.)
As the title indicates, this post is a down-and-dirty demo introducing data cleaning, some data manipulation and visualizations to achieve the descriptive analysis necessary to understand patterns in data.
So what exactly is descriptive analytics? The term descriptive analytics refers to the analysis of data that helps describe, show or summarize data in a meaningful way. This is distinguished from predictive/inferential analytics, which will be a follow-up to this post (so stay tuned). At its core, descriptive analytics leverages techniques to explore relationships between variables in a given data set. This step is imperative: it reveals relationships between variables and, for what it’s worth, it is a prerequisite for inferential analytics. But in order to explore variable relationships effectively and meaningfully, data must be cleaned, manipulated and transformed into a form that exposes the structure underlying each individual variable within its value domain. With that said, let’s delve into the fun stuff!
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Notice that I am using the iris data set that ships with base R. This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Now that we have the data, let’s proceed to step 2 and load a few R packages that will be required to perform subsequent steps.
library(dplyr) #for data manipulation grammar (the pipe operator)
library(tidyr) #for tidying up the data
library(ggplot2) #for data viz
In this step we will familiarize ourselves with the data using very simple lines of code. Simple as it is, this step is important for understanding the class and type of data we are dealing with, and it provides intuition into how much effort will be needed to prepare the data for analytics.
summary(iris)#provides stat summary of the data
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
glimpse(iris)#gives a sneak-peek into the data
## Observations: 150
## Variables: 5
## $ Sepal.Length (dbl) 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
## $ Sepal.Width (dbl) 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
## $ Petal.Length (dbl) 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
## $ Petal.Width (dbl) 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
## $ Species (fctr) setosa, setosa, setosa, setosa, setosa, setosa, ...
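The same facts can also be pulled out as plain numbers rather than read off the printed summaries. The lines below are a minimal sketch using base R only; the object name d is just an illustrative copy of the built-in data.

```r
# Work on a fresh copy of the built-in data set (d is a hypothetical name)
d <- datasets::iris

dim(d)                            # number of rows, then number of columns
table(sapply(d, class))           # how many columns there are of each class
names(d)[sapply(d, is.numeric)]   # which columns are numeric
```

Checking dimensions and classes programmatically is handy once a data set is too wide to eyeball.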
As seen from the code output above, the iris data has 150 observations spread across 5 variables. It’s a simple and good data set to work with for data science beginners! 4 of the 5 variables are numeric and one (Species) is categorical, i.e. a factor. Let’s also run an additional piece of code to view the same data a bit differently, like so:
head(iris, n = 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
For brevity only 10 lines have been shown, but we have gotten some intuition into the data we are dealing with. Stop right here: what have you noticed about this data? The first 4 variables capture Length and Width measurements for the 2 parts of the flower (sepal and petal). This data can be described as wide: it captures a measurement, e.g. Length, across multiple columns. We can change that into a more intuitive format. What we will do is create a new data object that is long and rather narrow, bucketing Sepal and Petal under part and Length and Width under measure. In order to do that we will use the gather() & separate() functions from the previously loaded tidyr package.
long_iris <- iris %>%
  gather(part, value, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  separate(part, c('part', 'measure'), sep = '\\.')
Let’s check out the new data object long_iris we just created.
head(long_iris, n = 10)
## Species part measure value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
## 7 setosa Sepal Length 4.6
## 8 setosa Sepal Length 5.0
## 9 setosa Sepal Length 4.4
## 10 setosa Sepal Length 4.9
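In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which can reshape and split the column names in one call. The following is a sketch of the same reshaping under that version assumption; the name long_iris_alt is hypothetical and is used only so the original long_iris is not overwritten.

```r
library(tidyr)  # assumes tidyr >= 1.0.0, where pivot_longer() was introduced

long_iris_alt <- pivot_longer(
  datasets::iris,
  cols      = -Species,               # reshape every column except Species
  names_to  = c("part", "measure"),   # "Sepal.Length" becomes part + measure
  names_sep = "\\.",                  # split the old column name on the dot
  values_to = "value"
)

dim(long_iris_alt)  # rows, then columns
```

The result is a tibble with the same four columns; the rows come out grouped by flower rather than by original column, so the row order differs from the gather() version even though the values are the same.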
Notice the difference? long_iris is longer, with 600 observations & 4 variables. Don’t be confused: the data is the same, just formatted differently. But why is this necessary? Can’t we visually analyze the data as it was? Sure we can…I have seen plenty of demos using iris in its original format but, as you will see later, the form data takes may impose limitations on how well you can graphically analyze it. A good data scientist should understand the data at hand. How you format data directly impacts how well you can analyze it, not only for descriptive analysis but for predictive analysis as well. There are limitations associated with trying to analyze iris in its original form. Later you will see how the long version of iris created here will pose a limitation of its own, but let’s save that for later. For now, let’s make headway onto the next step.
Our workflow model flows from the outside in, in a funnel fashion. Now that we have reshaped the data into a long version called long_iris, let’s perform some housekeeping! This step is always a must. Here a data scientist performs some core data quality check-ups and decides on corrective measures as dictated by various factors. The checks include, but are not limited to, missing values analysis and variable type checking.
So let’s delve into those two checks for this demo and save the others for later analyses.
sapply(long_iris, class)
## Species part measure value
## "factor" "character" "character" "numeric"
Let’s coerce character variables to R factors. This will be handy & facilitate ease of analysis.
fcts <- c('part', 'measure')
long_iris[fcts] <- lapply(long_iris[fcts], as.factor)
sapply(long_iris, class)
## Species part measure value
## "factor" "factor" "factor" "numeric"
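Listing the character columns by hand works for two columns, but they can also be detected automatically. The sketch below rebuilds an equivalent long_iris with base R only, so that it runs on its own, and then converts every character column it finds; is_chr is a hypothetical helper name.

```r
# Rebuild an equivalent long_iris with base R so this example is self-contained
d <- datasets::iris
long_iris <- data.frame(
  Species = rep(d$Species, times = 4),
  part    = rep(c("Sepal", "Sepal", "Petal", "Petal"), each = nrow(d)),
  measure = rep(c("Length", "Width", "Length", "Width"), each = nrow(d)),
  value   = c(d$Sepal.Length, d$Sepal.Width, d$Petal.Length, d$Petal.Width),
  stringsAsFactors = FALSE
)

# Find the character columns instead of naming them by hand
is_chr <- sapply(long_iris, is.character)
long_iris[is_chr] <- lapply(long_iris[is_chr], factor)

sapply(long_iris, class)   # part and measure should now be factors
levels(long_iris$part)     # factor levels are sorted alphabetically
```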
Next up, let’s check for missing values. We will do so by creating a UDF (user-defined function) that we will loop over the entire data set at once to get the percentage of missing values for each variable.
Missing_d <- function(x){sum(is.na(x))/length(x)*100}
Now that we have created a user-defined function, let’s loop it over the columns of long_iris using apply() as follows:
apply(long_iris, 2, Missing_d)
## Species part measure value
## 0 0 0 0
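A row of zeros does not by itself show that the helper works, so it is worth trying it on data where the answer is known. The sketch below blanks out 15 of the 150 sepal lengths in a copy of iris (an artificial change made only for this check) and compares the UDF with a base R one-liner; demo, pct_missing and pct_missing_alt are hypothetical names.

```r
Missing_d <- function(x) sum(is.na(x)) / length(x) * 100

# Copy the data and deliberately remove 15 of 150 values (illustration only)
demo <- datasets::iris
demo$Sepal.Length[1:15] <- NA

# sapply() works column by column and keeps each column's type,
# whereas apply() first converts the data frame to a character matrix
pct_missing <- sapply(demo, Missing_d)

# Equivalent base R one-liner: column means of the logical NA matrix
pct_missing_alt <- colMeans(is.na(demo)) * 100

pct_missing
```

Sepal.Length should report 10 percent missing and every other column 0, which is what we expect from removing 15 out of 150 values. On the real long_iris all four percentages are 0, as shown above.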
Incredible!! This data has zero missing values; never thought I would ever say that! Let’s also check for special values such as NaN, Inf and -Inf, which can slip through as if they were ordinary values. Special values lead to errors or misleading results when performing statistical analyses on the data. We will define a function, as before, that checks whether a variable is numeric and, if so, flags every non-finite value (for other types it falls back to flagging NAs). After that, we will apply this UDF, which I am calling is_special, using the sapply() function from the apply family.
is_special <- function(x){
if(is.numeric(x)) !is.finite(x) else is.na(x)
}
sapply(long_iris, is_special)
Output of the code above is not shown here. It is a logical matrix with one row per observation and one column per variable, verifying the absence of special values (FALSE: not a special value, TRUE: special value).
#lets check and verify the presence of NAs
sum(is.na(long_iris$value)) # zero NAs! looking great!
## [1] 0
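Since the full logical matrix is long, a per-column count is easier to read. The sketch below first tries is_special on a small hand-made vector and then counts flagged values per column; long_iris is rebuilt with base R so the example runs on its own, and demo_values and special_counts are hypothetical names.

```r
is_special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# Hand-made vector: one ordinary value followed by four special ones
demo_values <- c(1.5, NA, NaN, Inf, -Inf)
is_special(demo_values)   # only the first element is not flagged

# Rebuild an equivalent long_iris with base R so this example is self-contained
d <- datasets::iris
long_iris <- data.frame(
  Species = rep(d$Species, times = 4),
  part    = rep(c("Sepal", "Sepal", "Petal", "Petal"), each = nrow(d)),
  measure = rep(c("Length", "Width", "Length", "Width"), each = nrow(d)),
  value   = c(d$Sepal.Length, d$Sepal.Width, d$Petal.Length, d$Petal.Width),
  stringsAsFactors = FALSE
)

# Collapse the 600 x 4 logical matrix into one count per column
special_counts <- colSums(sapply(long_iris, is_special))
special_counts
```

Note that !is.finite() also flags plain NA in numeric columns, so a zero count here covers missing values as well.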
Time out! Let’s recap what we have done up to this point. We performed a quick overview of the raw iris data and then transformed it to create the long version. Then we coerced some character variables to R factors for better graphic analysis later. Then we moved on to some housekeeping, in which we checked for missing and special values. Congratulations, you have come a long way!! This might be a bit overwhelming for aspiring data scientists/analysts but it will all come together eventually.
Now that we have prepared and quality-checked the data (notice we checked long_iris, not the original iris, but the 2 data sets hold the same values, so checking either one covers the other too), let’s move to some exciting stuff. Descriptive analysis!!!
ggplot2 is a robust graphic plotting package in R. It provides very effective and intuitive methods for visually analyzing data. In this post, I will use ggplot2 to perform descriptive analysis. One of the best properties of ggplot2 is its flexibility to create objects that can be added onto thus rendering dynamic visualizations. Enough said, let’s go!
p <- ggplot(long_iris, aes(x = Species, y = value, col = part))
p + geom_jitter(alpha = 0.4, size = 0.8) + facet_grid(.~ measure)
p + geom_jitter(alpha = 0.3, size = 0.8) + stat_boxplot(alpha = 0.5) + facet_grid(.~ measure)
p + geom_jitter(alpha = 0.5, size = 0.8) + stat_boxplot(alpha = 0.5) + facet_grid(.~ part)
Notice how Setosa is rather isolated from Versicolor & Virginica across all measurements. Setosa’s petal & sepal lengths are the shortest of the 3 species and its petals are the narrowest, but notice how its sepals are the widest of the 3. Versicolor and Virginica are quite similar in their length and width properties. In the plot faceted by part, where length and width values are pooled, Setosa’s petal values are the smallest on average, but notice something interesting regarding sepals though…Setosa appears to beat the other 2 species on the typical sepal value in the boxplot, even though the other 2 species reach larger values. Versicolor & Virginica have a larger spread between their minimum and maximum values (a wider range) within each part, with a concentration toward the minimum, which lowers their typical value.
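The visual impressions above can be cross-checked against group means. The following sketch uses base R’s aggregate() on an equivalent long_iris rebuilt with base R so it runs on its own; group_means is a hypothetical name.

```r
# Rebuild an equivalent long_iris with base R so this example is self-contained
d <- datasets::iris
long_iris <- data.frame(
  Species = rep(d$Species, times = 4),
  part    = rep(c("Sepal", "Sepal", "Petal", "Petal"), each = nrow(d)),
  measure = rep(c("Length", "Width", "Length", "Width"), each = nrow(d)),
  value   = c(d$Sepal.Length, d$Sepal.Width, d$Petal.Length, d$Petal.Width),
  stringsAsFactors = FALSE
)

# Mean value for every Species x part x measure combination (12 rows)
group_means <- aggregate(value ~ Species + part + measure,
                         data = long_iris, FUN = mean)

# Sort so the three species sit next to each other for each measurement
group_means[order(group_means$part, group_means$measure, group_means$Species), ]
```

Reading down each block of three rows makes it easy to see where Setosa sits relative to the other 2 species, for example on sepal width.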
Let’s generate some more visualizations and fit a linear model for further analysis. It would be insightful to observe and analyze the relationship between length and width for each species. In order to generate a plot that shows this relationship, length and width will need to be variables, right? That means formatting the data in a way that exposes the 2 variables. If you run the head() function and pass it long_iris, you will notice that length and width are stacked together under measure. So let’s create a new data object with length and width as variables:
iris$Flower <- 1:nrow(iris)
#create wide_iris
wide_iris <- iris %>%
  gather(key, value, -Species, -Flower) %>%
  separate(key, c("Part", "Measure"), sep = "\\.") %>%
  spread(Measure, value)
The above code chunk began by assigning each flower a unique id. It then created a data object called wide_iris from the original iris. Let’s check out this new data object; feel free to run head() on both the long and wide versions to notice the difference.
head(wide_iris, n = 10)
## Species Flower Part Length Width
## 1 setosa 1 Petal 1.4 0.2
## 2 setosa 1 Sepal 5.1 3.5
## 3 setosa 2 Petal 1.4 0.2
## 4 setosa 2 Sepal 4.9 3.0
## 5 setosa 3 Petal 1.3 0.2
## 6 setosa 3 Sepal 4.7 3.2
## 7 setosa 4 Petal 1.5 0.2
## 8 setosa 4 Sepal 4.6 3.1
## 9 setosa 5 Petal 1.4 0.2
## 10 setosa 5 Sepal 5.0 3.6
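With tidyr 1.0.0 or later, the gather/separate/spread chain can be collapsed into a single pivot_longer() call by using the special ".value" name, which turns the second half of each old column name into its own output column. This is a sketch under that version assumption; iris_id and wide_iris_alt are hypothetical names chosen so that iris and wide_iris are left untouched.

```r
library(tidyr)  # assumes tidyr >= 1.0.0 for pivot_longer() and ".value"

# Work on a copy so the built-in iris is not modified
iris_id <- datasets::iris
iris_id$Flower <- seq_len(nrow(iris_id))   # unique id per flower

wide_iris_alt <- pivot_longer(
  iris_id,
  cols      = -c(Species, Flower),   # reshape only the 4 measurement columns
  names_to  = c("Part", ".value"),   # first half -> Part, second half -> column
  names_sep = "\\."
)

dim(wide_iris_alt)   # rows, then columns
```

Each flower contributes two rows, one per part, with Length and Width side by side. The rows come out with Sepal before Petal within each flower, whereas spread() sorts them alphabetically.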
Now that that’s taken care of, let’s proceed to investigating the relationship between length and width by species. We will fit linear models, one drawn over the observed range only (non-predictive) and one extended over the full range of the plot (predictive), to analyze the relationship more effectively. Does the linear model do a good job of fitting the data and describing the length & width relationship for each species?
q <- ggplot(wide_iris, aes(x = Width, y = Length, col = Species))
#one panel per species with a linear fit and no confidence band
q + geom_jitter(alpha = 0.4, size = 0.8) + facet_grid(. ~ Species) +
  stat_smooth(method = 'lm', se = FALSE)
#one panel per flower part, points only
q + geom_jitter(alpha = 0.4, size = 0.8) + facet_grid(. ~ Part)
#single panel with each species' fit extended across the full x range
q + geom_point(alpha = 0.4, size = 0.8) + stat_smooth(method = 'lm', fullrange = TRUE, size = 0.5)
Observe the relationship: is it linear or non-linear? It would be interesting to learn your interpretation…and suggestions.
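One way to put numbers on the fitted lines is to fit the same kind of model outside the plot. The sketch below fits Length ~ Width separately for each species, pooling sepal and petal rows just as the species-coloured lines in the plots do, and collects the coefficients and R-squared. It rebuilds the wide layout with base R so it runs on its own; wide, fits and fit_summary are hypothetical names.

```r
# Rebuild the wide layout (one row per flower part) with base R
d <- datasets::iris
wide <- rbind(
  data.frame(Species = d$Species, Part = "Sepal",
             Length = d$Sepal.Length, Width = d$Sepal.Width),
  data.frame(Species = d$Species, Part = "Petal",
             Length = d$Petal.Length, Width = d$Petal.Width)
)

# One simple linear regression per species
fits <- lapply(split(wide, wide$Species),
               function(df) lm(Length ~ Width, data = df))

# Intercept, slope and R-squared for each species, one row per species
fit_summary <- t(sapply(fits, function(m) {
  c(coef(m), r.squared = summary(m)$r.squared)
}))
round(fit_summary, 3)
```

An R-squared close to 1 would suggest the straight line describes that species well; lower values are a hint to look at the parts separately or to consider a non-linear fit.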
In this post, I demonstrated the basics of data cleaning, manipulation (to be explored further) and descriptive analytics using three (3) packages (dplyr, ggplot2 & tidyr) along with some UDFs. Feel free to recreate what I showed here and modify the code chunks for practice. I recommend that you visit the documentation for the 3 packages, do some research, and then practice with the iris data to get familiar.
Please provide feedback on this post; recommendations and requests for what you’d like to see are all welcome. This post is intended for everyone aspiring to be a data scientist and should be a great place to start. It is also a good place for experienced data scientists to revisit some foundational building blocks of data analytics. I hope you learned something; now go ahead and practice!
Thanks for your time & stay tuned for more!