Data preprocessing is an important step in the data analysis process. The objective of this step is to prepare the data for further statistical analysis. To carry out data preprocessing effectively in R, packages such as readr (to ‘Get’ the data), dplyr (to ‘Understand’ it), tidyr (to ‘Manipulate’ it), outliers (to ‘Scan’ it) and many more are used.
# Setting working directory
getwd()
## [1] "/Users/Harsh/Downloads"
setwd("~/Desktop/Week 2")
# Loading packages
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(outliers)
The dataset, “final”, is a combination of 3 datasets covering the state-wise budget variance of the USA for the year 2017-18. The resultant dataset has 50 observations (states) across 12 variables describing different components of the current account. The preprocessing was carried out with the aim of performing a “Budget Variance analysis”, i.e. the analysis of the deviation of the actual budget from the estimated one. A government budget is a document prepared by the government or another political entity presenting its anticipated revenues and proposed spending for the coming financial year. The analysis helps the government plan future income and expenses based on past trends.
To perform the Budget Variance analysis, we merged the three datasets of Budget, Tax Revenue and Debt of the states of the USA, obtained from kaggle.com. We applied the necessary data type conversions - such as factorising Credit Rating using the factor() function and converting Total State Debt and Gross Collections into numeric using as.numeric() - as well as a transformation method (log transformation) to manipulate the data effectively for further statistical analysis. From the primary variables, we calculated new per-state parameters such as Budget Balance and Budget Deficit, using the mutate() function, in order to compute the variances. We used a boxplot, with the help of the outliers package, to detect outliers in one of the variables of the resultant dataset and eliminated them by omission. Finally, to prepare the data for analysis, we applied a mathematical transformation, converting the distribution of the “Total State Debt” variable to a log scale to normalise it.
The source of the datasets - https://www.kaggle.com/beradabhishek/usa-information/version/1#GDP.csv. The resultant dataset, “final”, is a merger of 3 datasets - Budget, Debt and Tax Revenue - that primarily consists of state-wise figures for the components of the current account of the USA. All three files are in .csv format, so read_csv() from the readr package has been used to import the data files into R.
Steps undertaken:
# Importing dataset
budget <- read_csv("Budget.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Budget (billions $)` = col_double(),
## FY = col_character(),
## Reference = col_character(),
## `Budget per capita (in $)` = col_number(),
## `S&P Credit rating in January 2017[1]` = col_character()
## )
head(budget) # Checking if the dataset is imported correctly
str(budget) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 6 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: chr "AA" "AA+" "AA" "AA" ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. `Budget (billions $)` = col_double(),
## .. FY = col_character(),
## .. Reference = col_character(),
## .. `Budget per capita (in $)` = col_number(),
## .. `S&P Credit rating in January 2017[1]` = col_character()
## .. )
taxrevenue <- read_csv("TaxRevenue.csv")
## Parsed with column specification:
## cols(
## Rank = col_double(),
## `Statefederal district or territory` = col_character(),
## `Gross collections[2]` = col_character(),
## `Revenue per capita (est.)` = col_character(),
## `Ratio to GSP` = col_character()
## )
head(taxrevenue) # Checking if the dataset is imported correctly
str(taxrevenue) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 52 obs. of 5 variables:
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Statefederal district or territory: chr "California" "Texas" "New York" "Florida" ...
## $ Gross collections[2] : chr "$405,851,295,000" "$279,904,425,000" "$269,716,999,000" "$177,389,488,000" ...
## $ Revenue per capita (est.) : chr "$10,408" "$10,204" "$13,659" "$8,762" ...
## $ Ratio to GSP : chr "16.60%" "17.10%" "18.50%" "19.90%" ...
## - attr(*, "spec")=
## .. cols(
## .. Rank = col_double(),
## .. `Statefederal district or territory` = col_character(),
## .. `Gross collections[2]` = col_character(),
## .. `Revenue per capita (est.)` = col_character(),
## .. `Ratio to GSP` = col_character()
## .. )
debt <- read_csv("Debt.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Total state debt` = col_character(),
## `State debt per capita` = col_character()
## )
head(debt) # Checking if the dataset is imported correctly
str(debt) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 3 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Total state debt : chr "$8,969,350,000" "$5,727,891,000" "$14,243,659,000" "$4,985,140,000" ...
## $ State debt per capita: chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. `Total state debt` = col_character(),
## .. `State debt per capita` = col_character()
## .. )
names(taxrevenue)[2] <- "State" # Renaming the variable to match the other datasets for merging
# Merging the datasets
data <- merge(budget, debt, by="State")
final <- merge(data, taxrevenue, by="State")
str(final) # Checking variables of the dataset
## 'data.frame': 50 obs. of 12 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: chr "AA" "AA+" "AA" "AA" ...
## $ Total state debt : chr "$8,969,350,000" "$5,727,891,000" "$14,243,659,000" "$4,985,140,000" ...
## $ State debt per capita : chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## $ Rank : num 32 49 23 27 1 22 19 35 4 11 ...
## $ Gross collections[2] : chr "$25,070,261,000" "$5,717,640,000" "$42,631,316,000" "$32,508,761,000" ...
## $ Revenue per capita (est.) : chr "$5,165" "$7,751" "$6,253" "$10,917" ...
## $ Ratio to GSP : chr "12.00%" "10.50%" "14.30%" "26.30%" ...
head(final) # Checking the merged dataset
dim(final) # Checking dimensions of the merged dataset
## [1] 50 12
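Equivalently, the two merge() calls could be chained using dplyr joins; a minimal sketch (the name final2 is illustrative), where inner_join() keeps only the states present in both tables, matching merge()'s default behaviour:
# Equivalent merge using dplyr joins (sketch)
final2 <- budget %>%
  inner_join(debt, by = "State") %>%
  inner_join(taxrevenue, by = "State")
dim(final2) # Expected to match dim(final): 50 rows, 12 columns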
The variables were summarised. They span several data types, such as character (State, FY), numeric (Budget (billions $)), etc.
After checking the data structures, the following steps were taken:
# Summarising variables of final
final %>% summary()
## State Budget (billions $) FY
## Length:50 Min. : 4.10 Length:50
## Class :character 1st Qu.: 9.50 Class :character
## Mode :character Median : 27.65 Mode :character
## Mean : 36.60
## 3rd Qu.: 39.83
## Max. :183.30
## Reference Budget per capita (in $)
## Length:50 Min. : 1730
## Class :character 1st Qu.: 4573
## Mode :character Median : 5653
## Mean : 6841
## 3rd Qu.: 8414
## Max. :18273
## S&P Credit rating in January 2017[1] Total state debt
## Length:50 Length:50
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## State debt per capita Rank Gross collections[2]
## Length:50 Min. : 1.00 Length:50
## Class :character 1st Qu.:13.25 Class :character
## Mode :character Median :25.50 Mode :character
## Mean :25.92
## 3rd Qu.:38.75
## Max. :51.00
## Revenue per capita (est.) Ratio to GSP
## Length:50 Length:50
## Class :character Class :character
## Mode :character Mode :character
##
##
##
# Factorising Credit Rating variable
final$`S&P Credit rating in January 2017[1]` <- factor(final$`S&P Credit rating in January 2017[1]`,
  levels = c("BBB", "A-", "A+", "AA-", "AA", "AA+", "AAA"),
  labels = c("Adequate", "Strong", "Strong", "Very Strong", "Very Strong", "Very Strong", "Extremely Strong"),
  ordered = TRUE)
levels(final$`S&P Credit rating in January 2017[1]`) # Checking levels
## [1] "Adequate" "Strong" "Very Strong" "Extremely Strong"
final$Rank <- factor(final$Rank, ordered = TRUE) # Factorising Rank variable
levels(final$Rank) # Checking levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
## [43] "44" "45" "46" "47" "48" "49" "50" "51"
# Converting variables from character to numeric type and rescaling to billions of dollars
final$`Total state debt` <- as.numeric(gsub('[$,]', '', final$`Total state debt`)) # Strip $ and commas
final$`Total state debt` <- round(final$`Total state debt` / 1e9, 1) # Rescale to billions
final$`Gross collections[2]` <- as.numeric(gsub('[$,]', '', final$`Gross collections[2]`)) # Strip $ and commas
final$`Gross collections[2]` <- round(final$`Gross collections[2]` / 1e9, 1) # Rescale to billions
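As a quick check that the coercion parsed every value cleanly (any stray character other than '$' or ',' would have produced an NA):
sum(is.na(final$`Total state debt`)) # 0 expected if every value parsed cleanly
sum(is.na(final$`Gross collections[2]`)) # 0 expected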
The dataset was found to be tidy, as per the following observations: each variable forms a column, each observation (a state) forms a row, and each cell holds a single value.
Since the data is in a tidy format, we proceeded with data manipulation. To deduce the variance, the following variables were created using the mutate() function of the dplyr package:
# Creating new variables
final <- mutate(final, Budget_Balance = `Gross collections[2]` - `Total state debt`)
final <- mutate(final, Balance_Revenue = Budget_Balance / `Gross collections[2]`)
final$Balance_Revenue <- round(final$Balance_Revenue, digits = 2) # Rounding to 2 decimal places
final <- mutate(final, Budget_Variance = `Budget (billions $)` - Budget_Balance)
# Assigning classes to the variables
Actual_vs_Estimate <- ifelse(final$Budget_Variance > 0, "Unfavourable", "Favourable")
Budget_deficit_surplus <- ifelse(final$Budget_Balance > 0, "Surplus", "Deficit")
# Joining the class variables to the dataset
final <- cbind(final, Budget_deficit_surplus, Actual_vs_Estimate)
# Checking head of final
head(final)
str(final)
## 'data.frame': 50 obs. of 17 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: Ord.factor w/ 4 levels "Adequate"<"Strong"<..: 3 3 3 3 3 3 3 4 4 4 ...
## $ Total state debt : num 9 5.7 14.2 5 151.7 ...
## $ State debt per capita : chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## $ Rank : Ord.factor w/ 50 levels "1"<"2"<"3"<"4"<..: 31 48 23 27 1 22 19 34 4 11 ...
## $ Gross collections[2] : num 25.1 5.7 42.6 32.5 405.9 ...
## $ Revenue per capita (est.) : chr "$5,165" "$7,751" "$6,253" "$10,917" ...
## $ Ratio to GSP : chr "12.00%" "10.50%" "14.30%" "26.30%" ...
## $ Budget_Balance : num 16.1 0 28.4 27.5 254.2 ...
## $ Balance_Revenue : num 0.64 0 0.67 0.85 0.63 0.64 0.4 0.78 0.81 0.85 ...
## $ Budget_Variance : num 14.6 10.1 10.6 4.3 -70.9 ...
## $ Budget_deficit_surplus : Factor w/ 2 levels "Deficit","Surplus": 2 1 2 2 2 2 2 2 2 2 ...
## $ Actual_vs_Estimate : Factor w/ 2 levels "Favourable","Unfavourable": 2 2 2 2 1 1 1 1 1 1 ...
To create the categorical variables (Actual_vs_Estimate, Budget_deficit_surplus), the base R ifelse() function was used. These were added to the “final” dataset using the cbind() function, which binds a column to a dataset with the same number of rows.
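The same two class variables could also be created in a single dplyr mutate() call, avoiding the separate cbind() step; a minimal sketch of an equivalent approach (the name final_alt is illustrative, and factor() is applied explicitly to match the factor columns produced above):
# Equivalent dplyr approach (sketch): both class variables in one mutate()
final_alt <- final %>%
  mutate(Budget_deficit_surplus = factor(ifelse(Budget_Balance > 0, "Surplus", "Deficit")),
         Actual_vs_Estimate = factor(ifelse(Budget_Variance > 0, "Unfavourable", "Favourable")))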
The is.na() function, coupled with sum(), was used to detect any missing values in the dataset.
#Checking for missing values
sum(is.na(final))
## [1] 0
There were no missing values in the dataset.
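Had any values been missing, a per-column count would locate them, for example:
colSums(is.na(final)) # Number of missing values per column (all zeros here)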
Before performing data transformation, it is necessary to check the data for outliers. To detect the outliers, a box plot of the variable Total State Debt was drawn using the base R boxplot() function.
# Boxplot for Total State Debt
final$`Total state debt` %>% boxplot(main = "Box Plot of Debt", ylab = "Debt", col = "grey")
# Creating a separate vector of outliers in the variable
outliers <- boxplot(final$`Total state debt`, plot = FALSE)$out
print(outliers) # checking the outlier values
## [1] 151.7 64.2 75.3 66.9 137.4
final[which(final$`Total state debt` %in% outliers),] # Finding the rows of outlier values
final <- final[-which(final$`Total state debt` %in% outliers),]#Removing the outliers
boxplot(final$`Total state debt`) # Checking the boxplot
Since there were 5 outliers in the variable, we removed them by creating a separate vector of outlier values, locating their rows with the which() function, and then eliminating those rows from the resultant dataset. This approach was used because the number of outliers was small. Through this approach, all the outliers were successfully removed.
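For reference, boxplot() flags as an outlier any value lying beyond 1.5 times the interquartile range from the quartiles (Tukey's fences). A minimal sketch that computes the fences manually (boxplot() uses hinges, which can differ marginally from quantile() for some sample sizes):
# Tukey's fences (sketch): values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged
x <- final$`Total state debt`
q <- quantile(x, c(0.25, 0.75))
lower <- q[1] - 1.5 * IQR(x)
upper <- q[2] + 1.5 * IQR(x)
x[x < lower | x > upper] # Potential outliers remaining after removal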
To check whether the variable has a normal distribution, we plot a histogram of Total State Debt using the hist() function.
# Plotting histogram of Total State Debt
hist(final$`Total state debt`)
# Transforming into log data
log_Debt <- log(final$`Total state debt`)
hist(log_Debt) # Plotting histogram of the log-transformed Total State Debt
According to the histogram, the data is positively skewed (right-skewed). To achieve normality, we transformed the data into log values, as the log compresses high values and spreads out low values by expressing them as orders of magnitude. This transformation is commonly used to reduce right skewness.
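To quantify the improvement, the sample skewness (the third standardised moment) can be computed before and after the transformation; a minimal sketch in base R (the helper skew() is illustrative):
# Sample skewness (sketch): positive values indicate right skew
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)
skew(final$`Total state debt`) # Expected: clearly positive (right-skewed)
skew(log_Debt) # Expected: much closer to zero after the log transformation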