Required packages

The following packages are used for this assignment; they fulfill minimum requirement #10.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(car)
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Executive Summary

A brief summary of what was done at each pre-processing step is given below:
- Two raw datasets were used - literacy rates (2011) and national census data (2011).
- Filtering, renaming and variable additions were carried out on both datasets, as explained in each section.
- The two datasets were merged using left_join().
- str() and summary() were used on the merged dataset to inspect data types.
- The data types of some variables were converted to factors.
- The untidy dataset was converted into a tidy one using the gather() function.
- mutate() was used to create a new variable from the existing ones.
- The whole dataset was scanned for missing and/or special values, as described in the corresponding section.
- The dataset was also scanned for outliers using the boxplot() method, and these were dealt with accordingly.
- Finally, the data were transformed using centering and scaling techniques to make the variables easier to interpret and to achieve a better understanding of them.

Data

For this assignment, I chose to work with the literacy rates of each Indian state. The primary dataset containing the literacy rates was taken from Kaggle - GOI Literacy Rate (see References). It was read into the dataframe goi using the read_csv function from the readr package and is displayed below. There are 8 variables in the dataset: the Category variable describes whether the observation is a state or a union territory, the next variable gives the name of the state/union territory, and the remaining 6 variables hold the total, rural and urban literacy rates for the years 2001 and 2011.

goi <- read_csv("GOI.csv")
## Parsed with column specification:
## cols(
##   Category = col_character(),
##   `Country/ States/ Union Territories Name` = col_character(),
##   `Literacy Rate (Persons) - Total - 2001` = col_double(),
##   `Literacy Rate (Persons) - Total - 2011` = col_double(),
##   `Literacy Rate (Persons) - Rural - 2001` = col_double(),
##   `Literacy Rate (Persons) - Rural - 2011` = col_double(),
##   `Literacy Rate (Persons) - Urban - 2001` = col_double(),
##   `Literacy Rate (Persons) - Urban - 2011` = col_double()
## )
head(goi)

The first observation, which contains data for the country as a whole, is removed since we are working only with the states and union territories. The name of the second column is also changed for better readability.

goi <- goi[-1,]
colnames(goi)[2] <- "State/UnionTerritory"

We want only the more recent observations, so we keep only the variables with literacy rates for the year 2011 and drop those for 2001. We use the select function from the dplyr package to choose the required variables and assign the result to a new dataframe.

goi1 <- goi %>% select(Category, `State/UnionTerritory`, `Literacy Rate (Persons) - Total - 2011`, `Literacy Rate (Persons) - Urban - 2011`, `Literacy Rate (Persons) - Rural - 2011`)

We rename columns 3, 4 and 5 for better clarity.

colnames(goi1)[3] <- "Literacy Rate - Total"
colnames(goi1)[4] <- "Literacy Rate - Urban"
colnames(goi1)[5] <- "Literacy Rate - Rural"
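The same renaming can also be written with dplyr::rename(), matching on the original 2011 column names. The following is only a sketch of that alternative (it would be run instead of the colnames() assignments above, not in addition to them):

# Alternative renaming with dplyr::rename() (sketch; assumes the
# original 2011 column names are still present in goi1):
goi1 <- goi1 %>%
  rename(`Literacy Rate - Total` = `Literacy Rate (Persons) - Total - 2011`,
         `Literacy Rate - Urban` = `Literacy Rate (Persons) - Urban - 2011`,
         `Literacy Rate - Rural` = `Literacy Rate (Persons) - Rural - 2011`)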

Next we create a variable to store the abbreviations for each state/union territory and re-arrange the columns.

goi1$Abbreviation <- c("AP", "AR", "AS", "BR", "CG", "GA", "GJ", "HR", "HP", "JK", "JH", "KA", "KL", "MP", "MH", "MN", "ML", "MZ", "NL", "OR", "PB", "RJ", "SK", "TN", "TR", "UP", "UK", "WB", "AN", "CH", "DH", "DD", "LD", "DL", "PY")
goi1 <- goi1[,c("Category","State/UnionTerritory","Abbreviation","Literacy Rate - Total","Literacy Rate - Urban","Literacy Rate - Rural")]

We also create a variable to categorize the states into high, average or low based on their total literacy rates.

goi1$`Literacy Level` <- ifelse(goi1$`Literacy Rate - Total`>70, ifelse(goi1$`Literacy Rate - Total`>85, "High", "Average"), "Low")
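An equivalent, arguably more readable way to express the same banding uses dplyr::case_when(), which evaluates its conditions from top to bottom. This is only a sketch of an alternative to the nested ifelse() above:

# Same three-level banding with case_when() (sketch):
goi1 <- goi1 %>% mutate(`Literacy Level` = case_when(
  `Literacy Rate - Total` > 85 ~ "High",
  `Literacy Rate - Total` > 70 ~ "Average",
  TRUE ~ "Low"))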

The first dataset, goi1, ultimately looks as shown below:

head(goi1)

Now that our primary dataset is ready, we move on to pre-processing the secondary dataset. This dataset contains state- and district-wise information about the population and other details recorded in the national census, also from the year 2011. It was taken from Kaggle - India Census 2011 (see References). The dataset is read into the dataframe cen using the read_csv function from the readr package and is displayed below.

cen <- read_csv("india-districts-census-2011.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `State name` = col_character(),
##   `District name` = col_character()
## )
## See spec(...) for full column specifications.
head(cen)

This is a large dataset, but only a few of its variables are needed for the scope of this project. The following code extracts just the variables we require.

cen <- cen %>% select(`State name`, `District name`, Population, Male, Female)

This filtered dataset contains the names of the states, the corresponding districts, their populations and the corresponding male and female populations. Since we are working only with state-wise data, we use group_by on the state column together with summarise to add up the district-level Population, Male and Female values for each state/union territory. The District name column is dropped automatically, since it is neither a grouping variable nor a summary, and we no longer need it.

cen <- cen %>% group_by(`State name`) %>% summarise(Population = sum(Population), Male = sum(Male), Female = sum(Female))
## `summarise()` ungrouping output (override with `.groups` argument)

As a final step before merging, we add a variable containing the abbreviations for the state names and re-arrange the columns just like we did for the last dataset.

cen$Abbreviation <- c("AN", "AP", "AR", "AS", "BR", "CH", "CG", "DH", "DD", "GA", "GJ", "HR", "HP", "JK", "JH", "KA", "KL", "LD", "MP", "MH", "MN", "ML", "MZ", "NL", "DL", "OR", "PY", "PB", "RJ", "SK", "TN", "TR", "UP", "UK", "WB")
cen <- cen[, c("State name", "Abbreviation", "Population", "Male", "Female")]

Now that the necessary coding is done, our secondary dataset ultimately looks as shown below:

head(cen)

We now have our primary and secondary datasets ready for merging. Since both datasets contain only the states/union territories, they can be merged on a common key column, which for us is the Abbreviation column. We merge the two datasets using the left_join() function from the dplyr package; this type of join adds all the columns of the second dataset onto the first while retaining all the observations and variables.

df <- goi1 %>% left_join(cen, by = "Abbreviation")

One might think that we could simply have merged the two datasets on the state-name columns instead of creating the Abbreviation column and merging on that. However, state/union territory names are often written differently across sources (in short forms, in different cases, or under older names), so it is safer to assign a standard abbreviation to each state and merge on that. Notice that we now have two columns with the state/union territory names, so we remove the redundant name column brought in from the second dataset, which is the 8th column of the merged dataframe.

df <- df[,-8]
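As a quick illustration of the name-mismatch problem, an anti_join() on the raw name columns lists the rows of goi1 that fail to find an exact match in cen. This is only a sketch (it assumes both name columns are still available; how many rows it returns depends on how each source spells the names):

# Rows of goi1 whose state/UT name has no exact match in cen (sketch);
# any rows returned here are the reason we merge on Abbreviation instead:
goi1 %>% anti_join(cen, by = c("State/UnionTerritory" = "State name"))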

This is how our final merged dataset df looks; it fulfills minimum requirement #1:

head(df)

Understand

Here, we use the base R functions str and summary on the merged dataset df to inspect the data type of each variable and to obtain summary statistics for the numerical columns.

str(df)
## tibble [35 x 10] (S3: tbl_df/tbl/data.frame)
##  $ Category             : chr [1:35] "State" "State" "State" "State" ...
##  $ State/UnionTerritory : chr [1:35] "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ Abbreviation         : chr [1:35] "AP" "AR" "AS" "BR" ...
##  $ Literacy Rate - Total: num [1:35] 67 65.4 72.2 61.8 70.3 88.7 78 75.6 82.8 67.2 ...
##  $ Literacy Rate - Urban: num [1:35] 80.1 82.9 88.5 76.9 84 90 86.3 83.1 91.1 77.1 ...
##  $ Literacy Rate - Rural: num [1:35] 60.4 59.9 69.3 59.8 66 86.6 71.7 71.4 81.9 63.2 ...
##  $ Literacy Level       : chr [1:35] "Low" "Low" "Average" "Low" ...
##  $ Population           : num [1:35] 8.46e+07 1.38e+06 3.12e+07 1.04e+08 2.55e+07 ...
##  $ Male                 : num [1:35] 42442146 713912 15939443 54278157 12832895 ...
##  $ Female               : num [1:35] 42138631 669815 15266133 49821295 12712303 ...
summary(df)
##    Category         State/UnionTerritory Abbreviation      
##  Length:35          Length:35            Length:35         
##  Class :character   Class :character     Class :character  
##  Mode  :character   Mode  :character     Mode  :character  
##                                                            
##                                                            
##                                                            
##  Literacy Rate - Total Literacy Rate - Urban Literacy Rate - Rural
##  Min.   :61.80         Min.   :75.10         Min.   :59.80        
##  1st Qu.:71.25         1st Qu.:83.15         1st Qu.:65.75        
##  Median :76.90         Median :86.30         Median :72.10        
##  Mean   :77.85         Mean   :86.35         Mean   :73.57        
##  3rd Qu.:85.90         3rd Qu.:89.70         3rd Qu.:81.05        
##  Max.   :94.00         Max.   :97.60         Max.   :93.00        
##  Literacy Level       Population             Male               Female        
##  Length:35          Min.   :    64473   Min.   :    33123   Min.   :   31350  
##  Class :character   1st Qu.:  1421136   1st Qu.:   726526   1st Qu.:  694610  
##  Mode  :character   Median : 16787941   Median :  8987326   Median : 7800615  
##                     Mean   : 34595856   Mean   : 17807722   Mean   :16788135  
##                     3rd Qu.: 60767494   3rd Qu.: 31228958   3rd Qu.:29538536  
##                     Max.   :199812341   Max.   :104480510   Max.   :95331831

We see that the columns in this dataset are of the types num and chr, which represent the numeric and character data types respectively. However, the Category and Literacy Level columns are categorical variables and therefore need to be converted to the factor data type. We convert them with the following code:

df$Category <- factor(df$Category, levels = c("State", "Union Territory"))

This variable is a nominal categorical variable and thus we do not use the ordered parameter inside the factor function. We only use levels to specify the levels of the variable. Now, for the Literacy Level column:

df$`Literacy Level` <- factor(df$`Literacy Level`, levels = c("Low", "Average", "High"), labels = c("Bad", "Average", "Good"), ordered = TRUE)

Here, Literacy Level is an ordinal categorical variable, so we use the extra parameter ordered = TRUE and label the levels with the labels parameter. We have now converted the two character variables into the required factor data types, fulfilling minimum requirements #3 and #4. We check the data type of each variable in the merged data again using the str function:

str(df)
## tibble [35 x 10] (S3: tbl_df/tbl/data.frame)
##  $ Category             : Factor w/ 2 levels "State","Union Territory": 1 1 1 1 1 1 1 1 1 1 ...
##  $ State/UnionTerritory : chr [1:35] "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ Abbreviation         : chr [1:35] "AP" "AR" "AS" "BR" ...
##  $ Literacy Rate - Total: num [1:35] 67 65.4 72.2 61.8 70.3 88.7 78 75.6 82.8 67.2 ...
##  $ Literacy Rate - Urban: num [1:35] 80.1 82.9 88.5 76.9 84 90 86.3 83.1 91.1 77.1 ...
##  $ Literacy Rate - Rural: num [1:35] 60.4 59.9 69.3 59.8 66 86.6 71.7 71.4 81.9 63.2 ...
##  $ Literacy Level       : Ord.factor w/ 3 levels "Bad"<"Average"<..: 1 1 2 1 2 3 2 2 2 1 ...
##  $ Population           : num [1:35] 8.46e+07 1.38e+06 3.12e+07 1.04e+08 2.55e+07 ...
##  $ Male                 : num [1:35] 42442146 713912 15939443 54278157 12832895 ...
##  $ Female               : num [1:35] 42138631 669815 15266133 49821295 12712303 ...

We can see that our dataset now consists of three different data types - character, numeric and factor - thereby fulfilling minimum requirement #2.

Tidy & Manipulate Data I

At first glance the dataset looks tidy, but on closer inspection it is not: the Male and Female columns are really values of a single categorical variable, Gender, rather than variables in their own right. This violates the tidy data principles put forth by Wickham and Grolemund (2016), which state:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
Thus to make the dataset tidy, we use the gather() function from the tidyr package to gather the Male and Female observations under a newly created Gender variable.

df <- df %>% gather(Male, Female, key = "Gender", value = "Gender-wise Population")
df <- df %>% arrange(`State/UnionTerritory`)

We have now used gather() to make the data tidy. To improve readability and the overall structure, we also use the arrange function from the dplyr package to sort the observations by the State/UnionTerritory column. Finally, since we have created a new nominal categorical variable, Gender, we change its data type to factor.

df$Gender <- factor(df$Gender, levels = c("Male", "Female"))
head(df)
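For reference, gather() has since been superseded in newer releases of tidyr; the same reshaping can be written with pivot_longer(). The following sketch assumes tidyr >= 1.0.0 and would be run instead of the gather() call above, on the data that still has the Male and Female columns:

# Equivalent reshaping with pivot_longer() (sketch):
df <- df %>% pivot_longer(cols = c(Male, Female),
                          names_to = "Gender",
                          values_to = "Gender-wise Population")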

Thus we have fulfilled the #5 minimum requirement.

Tidy & Manipulate Data II

Now that we have the total literacy rate (in percentage) of each state and its total population, we can calculate the number of literate people and store it in a newly created column, Literate. For this purpose, we use the mutate() function from the dplyr package. We also change the data type of the new variable to integer with the as.integer() function, since we do not want fractional counts, and we move the new Literate column before the Gender column to improve readability.

df <- df %>% mutate(Literate = (`Literacy Rate - Total`/100)*Population)
df$Literate <- as.integer(df$Literate)
df <- subset(df, select = c(Category : Population, Literate, Gender : `Gender-wise Population`))
head(df)
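An equivalent way to reposition the new column, without spelling out the column ranges, is dplyr::relocate(). This is a sketch only and assumes a dplyr version of at least 1.0.0:

# Move Literate directly after Population (sketch):
df <- df %>% relocate(Literate, .after = Population)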

Thus we have used mutate to create a new meaningful variable from the existing ones and fulfilled the #6 minimum requirement.

Scan I

A very important part of pre-processing a dataset is checking for missing (null) values and special values, which make the dataset inconsistent and can adversely affect the conclusions drawn from the final processed data. We use the base R function is.na() to check for missing values, and wrap it in sum() to obtain the total number of missing values in the entire dataset df:

sum(is.na(df)) 
## [1] 0
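The total here is zero, but if it had not been, a per-column breakdown would help locate the gaps. A small base R sketch:

# Per-column count of missing values (sketch):
colSums(is.na(df))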

Missing values are not the only values we have to look out for. There are also 'special values' which can cause unwanted problems; as identified by R, these are -Inf (negative infinity), Inf (positive infinity) and NaN ('not a number'). To scan our dataset for special values, we use the base R functions is.infinite() and is.nan() together with sum(), as before, to obtain the total number of special values in the dataset df. The only difference is that we wrap the calls in sapply(), since is.infinite() and is.nan() accept only vector input and must therefore be applied column by column.

sum(sapply(df, is.infinite))
## [1] 0
sum(sapply(df, is.nan))
## [1] 0

Thus, we have scanned the entire dataset for missing and special values, and we are satisfied with the findings: the dataset contains zero missing or special values, fulfilling minimum requirement #7. A thorough manual scan of the dataset df also reveals no obvious misfiled entries or typographical errors.

Scan II

Another important aspect of pre-processing is scanning for outliers. An outlier is an observation that lies far away from most of the other observations, which raises suspicion about its legitimacy or reliability. A convenient way to scan a variable for outliers is the boxplot() function, which uses Tukey's method of outlier detection. According to this method, outliers are values that fall below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, where Q1 and Q3 are the first and third quartiles and the inter-quartile range (IQR) is roughly the height of the box. These limits are called the 'outlier fences', and any values lying outside them are depicted with an 'o' or a similar symbol on the box plot (Dr. Anil Dolgun). Here, we use the boxplot() function on the Population column to check for outliers.

df$Population %>% boxplot(main = "Box plot of population", ylab = "Population")


From the plot, we can see that there is a single outlier at a population of roughly 200,000,000. On inspection, it corresponds to the population of Uttar Pradesh (199,812,341). Outliers are usually dealt with by excluding/deleting, imputing, capping or transforming/binning the values. In our case, however, the observations come from a reliable, well-maintained public statistic, so we should not simply exclude or delete the outlier; on the contrary, it conveys important information about the largest state population and how far it lies from the others, and is therefore valuable.
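To make the fences concrete, the same Tukey rule can be applied to Population by hand. This is only a sketch; note that after the gather step each state appears twice in df, so the quartiles are computed over 70 rows:

# Tukey fences for Population computed manually (sketch):
q   <- quantile(df$Population, probs = c(0.25, 0.75))
iqr <- IQR(df$Population)
fences <- c(lower = q[[1]] - 1.5 * iqr, upper = q[[2]] + 1.5 * iqr)
fences
# Population values flagged as outliers by this rule:
unique(df$Population[df$Population < fences["lower"] | df$Population > fences["upper"]])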
We also check for outliers in the Literacy Rate - Total variable using the same boxplot() function.

df$`Literacy Rate - Total` %>% boxplot(main = "Box plot of Literacy Rates", ylab = "Literacy rate (in percentage)")


From this boxplot, we can conclude that there are no outliers in either direction, so we need not worry about any inconsistencies here. This operation of finding and dealing with outliers fulfills minimum requirement #8.

Transform

In this section, we pick a variable and apply a transformation to it. Consider the distribution of the variable Literacy Rate - Total from our dataframe df. We use the hist() function on the variable and obtain this distribution:

hist(df$`Literacy Rate - Total`)


From the histogram we can see that the distribution of the variable is irregular and somewhat multimodal. To make the variable easier to interpret and compare, we apply a transformation technique called Centering and Scaling. This is done by applying the base R function scale(), storing the result in a new object df1, and setting the center and scale parameters as follows:

df1 <- scale(df$`Literacy Rate - Total`, center = TRUE, scale = FALSE)
hist(df1)
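For clarity, centering with scale(center = TRUE, scale = FALSE) is nothing more than subtracting the column mean. A small sketch verifying the equivalence (df1_manual is a helper name used only here):

# Centering is equivalent to subtracting the mean (sketch):
df1_manual <- df$`Literacy Rate - Total` - mean(df$`Literacy Rate - Total`)
all.equal(as.numeric(df1), df1_manual)  # expected to be TRUE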


Applying the hist() function to df1 shows the centered distribution: the shape is unchanged, but the values are now expressed as deviations from the mean. To examine it further, we also create a corresponding density histogram with a normal curve overlaid:

hist(df1, probability = TRUE)
x <- -20:20
y <- dnorm(x = x, mean = mean(df1), sd = sd(df1))
lines(x = x, y = y, col = "blue")


As a final step, we assess the normality of the transformed variable using the qqPlot() function from the car package:

qqPlot(df1, dist = "norm")

## [1] 33 34


From the qq-plot we can see that the observations fall within the confidence limits, with the exception of one or two. Given the sample size (n = 70 rows in our case), these minor deviations are not a serious concern, and we treat the variable as approximately normal for the purposes of this assignment.
As another exercise, to obtain a better feel for the Population variable, which contains very large values (up to hundreds of millions for a single state), we use the scale() function again to scale the variable down to smaller values, improving readability and our understanding of the variable:

df2 <- scale(df$Population, center = FALSE, scale = TRUE)
head(df2)
##             [,1]
## [1,] 0.006768527
## [2,] 0.006768527
## [3,] 1.504245550
## [4,] 1.504245550
## [5,] 0.024609199
## [6,] 0.024609199

Here, we use center = FALSE since we only want to scale the observations, not center them. This makes the observations easier to work with and gives an overall better sense of the variable. Note that the transformed variables are saved in df1 and df2 respectively; the original variables in the dataset itself are unchanged.
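For completeness, when center = FALSE the scale() function divides each value by the column's root-mean-square rather than by its standard deviation. A small sketch verifying this (pop and rms are helper names used only here):

# With center = FALSE, scale() divides by sqrt(sum(x^2) / (n - 1)) (sketch):
pop <- df$Population
rms <- sqrt(sum(pop^2) / (length(pop) - 1))
all.equal(as.numeric(df2), pop / rms)  # expected to be TRUE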
Thus, in this section we have transformed variables using the centering and scaling technique, thereby fulfilling minimum requirement #9.

We have now successfully pre-processed our dataset, made it tidy and applied suitable transformations to improve readability and make it easier to work with. We can now draw inferences and conclusions from the well-processed dataset df.

References

  1. First dataset (GOI literacy rates): https://www.kaggle.com/doncorleone92/govt-of-india-literacy-rate
  2. Second dataset (India census 2011): https://www.kaggle.com/danofer/india-census?select=india-districts-census-2011.csv
  3. Dr. Anil Dolgun, Data Wrangling course website - Module 6
  4. Normal curve overlay: https://rpubs.com/doradu8030/NDistribution
  5. Wickham, H. and Grolemund, G. (2016), tidy data principles