Data preprocessing is an important step in the data analysis process. The objective of this step is to prepare the data for further statistical analysis. To carry out data preprocessing effectively in R, packages such as readr (to ‘Get’ the data), dplyr (to ‘Understand’ it), tidyr (to ‘Manipulate’ it), outliers (to ‘Scan’ it) and many more are used.
# Setting working directory
getwd()
## [1] "/Users/Harsh/Downloads"
setwd("~/Desktop/Week 2")
# Loading packages
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(outliers)
The dataset, “final”, is a combination of 3 datasets covering the state-wise budget variance of the USA for the year 2017-18. The resultant dataset has 50 observations (states) across 12 variables describing different components of the current account. The preprocessing was carried out with the aim of performing a “Budget Variance analysis”, i.e. the analysis of the deviation of the actual budget from the estimated one. A government budget is a document prepared by the government or another political entity presenting its anticipated revenues and proposed spending for the coming financial year. The analysis helps the government plan future income and expenses based on past trends.
To perform the Budget Variance analysis, we merged the three datasets of Budget, Tax Revenue and Debt of the states of the USA, obtained from kaggle.com. We applied the necessary data type conversions - such as factorising Credit Rating using the factor() function and converting Total State Debt and Gross Collections into numeric using as.numeric() - as well as a transformation method (log transformation) to manipulate the data effectively for further statistical analysis. From the primary variables, we calculated new per-state parameters such as Budget Balance and Budget Deficit, using the mutate() function, in order to compute the variances. We used a boxplot, with the help of the outliers package, to detect outliers in one of the variables of the resultant dataset and eliminated them by omission. Finally, to prepare the data for analysis, we applied a mathematical transformation, converting the distribution of the “Total State Debt” variable to a log scale to normalise it.
The source of the datasets - https://www.kaggle.com/beradabhishek/usa-information/version/1#GDP.csv. The resultant dataset, “final”, is a merger of 3 datasets - Budget, Debt and Tax Revenue - that primarily consists of state-wise figures for the components of the current account of the USA. All three files are in .csv format, so read_csv() from the readr package has been used to import the data files into R.
Steps undertaken:
# Importing dataset
budget <- read_csv("Budget.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Budget (billions $)` = col_double(),
## FY = col_character(),
## Reference = col_character(),
## `Budget per capita (in $)` = col_number(),
## `S&P Credit rating in January 2017[1]` = col_character()
## )
head(budget) # Checking if the dataset is imported correctly
str(budget) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 6 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: chr "AA" "AA+" "AA" "AA" ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. `Budget (billions $)` = col_double(),
## .. FY = col_character(),
## .. Reference = col_character(),
## .. `Budget per capita (in $)` = col_number(),
## .. `S&P Credit rating in January 2017[1]` = col_character()
## .. )
taxrevenue <- read_csv("TaxRevenue.csv")
## Parsed with column specification:
## cols(
## Rank = col_double(),
## `Statefederal district or territory` = col_character(),
## `Gross collections[2]` = col_character(),
## `Revenue per capita (est.)` = col_character(),
## `Ratio to GSP` = col_character()
## )
head(taxrevenue) # Checking if the dataset is imported correctly
str(taxrevenue) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 52 obs. of 5 variables:
## $ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Statefederal district or territory: chr "California" "Texas" "New York" "Florida" ...
## $ Gross collections[2] : chr "$405,851,295,000" "$279,904,425,000" "$269,716,999,000" "$177,389,488,000" ...
## $ Revenue per capita (est.) : chr "$10,408" "$10,204" "$13,659" "$8,762" ...
## $ Ratio to GSP : chr "16.60%" "17.10%" "18.50%" "19.90%" ...
## - attr(*, "spec")=
## .. cols(
## .. Rank = col_double(),
## .. `Statefederal district or territory` = col_character(),
## .. `Gross collections[2]` = col_character(),
## .. `Revenue per capita (est.)` = col_character(),
## .. `Ratio to GSP` = col_character()
## .. )
debt <- read_csv("Debt.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Total state debt` = col_character(),
## `State debt per capita` = col_character()
## )
head(debt) # Checking if the dataset is imported correctly
str(debt) # Checking variables of the dataset
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 3 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Total state debt : chr "$8,969,350,000" "$5,727,891,000" "$14,243,659,000" "$4,985,140,000" ...
## $ State debt per capita: chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## - attr(*, "spec")=
## .. cols(
## .. State = col_character(),
## .. `Total state debt` = col_character(),
## .. `State debt per capita` = col_character()
## .. )
names(taxrevenue)[2] <- "State" # Renaming the variable to match the other datasets for merging
# Merging the datasets
data <- merge(budget, debt, by="State")
final <- merge(data, taxrevenue, by="State")
str(final) # Checking variables of the dataset
## 'data.frame': 50 obs. of 12 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: chr "AA" "AA+" "AA" "AA" ...
## $ Total state debt : chr "$8,969,350,000" "$5,727,891,000" "$14,243,659,000" "$4,985,140,000" ...
## $ State debt per capita : chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## $ Rank : num 32 49 23 27 1 22 19 35 4 11 ...
## $ Gross collections[2] : chr "$25,070,261,000" "$5,717,640,000" "$42,631,316,000" "$32,508,761,000" ...
## $ Revenue per capita (est.) : chr "$5,165" "$7,751" "$6,253" "$10,917" ...
## $ Ratio to GSP : chr "12.00%" "10.50%" "14.30%" "26.30%" ...
head(final) # Checking the merged dataset
dim(final) # Checking dimensions of the merged dataset
## [1] 50 12
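Equivalently, the two merge() calls could be chained using dplyr joins; a minimal sketch (the name final2 is illustrative), where inner_join() keeps only the states present in both tables, matching merge()'s default behaviour:
# Equivalent merge using dplyr joins (sketch)
final2 <- budget %>%
  inner_join(debt, by = "State") %>%
  inner_join(taxrevenue, by = "State")
dim(final2) # Expected to match dim(final): 50 rows, 12 columns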
The variables were summarised. They span several data types, such as character (State, FY), numeric (Budget (billions $)), etc.
After checking the data structures, the following steps were taken:
# Summarising variables of final
final %>% summary()
## State Budget (billions $) FY
## Length:50 Min. : 4.10 Length:50
## Class :character 1st Qu.: 9.50 Class :character
## Mode :character Median : 27.65 Mode :character
## Mean : 36.60
## 3rd Qu.: 39.83
## Max. :183.30
## Reference Budget per capita (in $)
## Length:50 Min. : 1730
## Class :character 1st Qu.: 4573
## Mode :character Median : 5653
## Mean : 6841
## 3rd Qu.: 8414
## Max. :18273
## S&P Credit rating in January 2017[1] Total state debt
## Length:50 Length:50
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## State debt per capita Rank Gross collections[2]
## Length:50 Min. : 1.00 Length:50
## Class :character 1st Qu.:13.25 Class :character
## Mode :character Median :25.50 Mode :character
## Mean :25.92
## 3rd Qu.:38.75
## Max. :51.00
## Revenue per capita (est.) Ratio to GSP
## Length:50 Length:50
## Class :character Class :character
## Mode :character Mode :character
##
##
##
# Factorising Credit Rating variable
final$`S&P Credit rating in January 2017[1]` <- factor(final$`S&P Credit rating in January 2017[1]`,
  levels = c("BBB", "A-", "A+", "AA-", "AA", "AA+", "AAA"),
  labels = c("Adequate", "Strong", "Strong", "Very Strong", "Very Strong", "Very Strong", "Extremely Strong"),
  ordered = TRUE)
levels(final$`S&P Credit rating in January 2017[1]`) # Checking levels
## [1] "Adequate" "Strong" "Very Strong" "Extremely Strong"
final$Rank <- factor(final$Rank, ordered = TRUE) # Factorising Rank variable
levels(final$Rank) # Checking levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
## [43] "44" "45" "46" "47" "48" "49" "50" "51"
# Converting variables from character to numeric type and rescaling to billions of dollars
final$`Total state debt` <- as.numeric(gsub('[$,]', '', final$`Total state debt`)) # Strip $ and commas
final$`Total state debt` <- round(final$`Total state debt` / 1e9, 1) # Rescale to billions
final$`Gross collections[2]` <- as.numeric(gsub('[$,]', '', final$`Gross collections[2]`)) # Strip $ and commas
final$`Gross collections[2]` <- round(final$`Gross collections[2]` / 1e9, 1) # Rescale to billions
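As a quick check that the coercion parsed every value cleanly (any stray character other than '$' or ',' would have produced an NA):
sum(is.na(final$`Total state debt`)) # 0 expected if every value parsed cleanly
sum(is.na(final$`Gross collections[2]`)) # 0 expected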
The dataset was found to be tidy, as per the following observations: each variable forms a column, each observation (a state) forms a row, and each cell holds a single value.
Since the data is in a tidy format, we proceeded with data manipulation. To deduce the variance, the following variables were created using the mutate() function of the dplyr package:
# Creating new variables
final <- mutate(final, Budget_Balance = `Gross collections[2]` - `Total state debt`)
final <- mutate(final, Balance_Revenue = Budget_Balance / `Gross collections[2]`)
final$Balance_Revenue <- round(final$Balance_Revenue, digits = 2) # Rounding to 2 decimal places
final <- mutate(final, Budget_Variance = `Budget (billions $)` - Budget_Balance)
# Assigning classes to the variables
Actual_vs_Estimate <- ifelse(final$Budget_Variance > 0, "Unfavourable", "Favourable")
Budget_deficit_surplus <- ifelse(final$Budget_Balance > 0, "Surplus", "Deficit")
# Joining the class variables to the dataset
final <- cbind(final, Budget_deficit_surplus, Actual_vs_Estimate)
# Checking head of final
head(final)
str(final)
## 'data.frame': 50 obs. of 17 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Budget (billions $) : num 30.7 10.1 39 31.8 183.3 ...
## $ FY : chr "2018" "2018" "2018" "2018" ...
## $ Reference : chr "[2]" "[3]" "[4]" "[5]" ...
## $ Budget per capita (in $) : num 6298 13652 5559 10585 4636 ...
## $ S&P Credit rating in January 2017[1]: Ord.factor w/ 4 levels "Adequate"<"Strong"<..: 3 3 3 3 3 3 3 4 4 4 ...
## $ Total state debt : num 9 5.7 14.2 5 151.7 ...
## $ State debt per capita : chr "$1,848" "$7,764" "$2,089" "$1,674" ...
## $ Rank : Ord.factor w/ 50 levels "1"<"2"<"3"<"4"<..: 31 48 23 27 1 22 19 34 4 11 ...
## $ Gross collections[2] : num 25.1 5.7 42.6 32.5 405.9 ...
## $ Revenue per capita (est.) : chr "$5,165" "$7,751" "$6,253" "$10,917" ...
## $ Ratio to GSP : chr "12.00%" "10.50%" "14.30%" "26.30%" ...
## $ Budget_Balance : num 16.1 0 28.4 27.5 254.2 ...
## $ Balance_Revenue : num 0.64 0 0.67 0.85 0.63 0.64 0.4 0.78 0.81 0.85 ...
## $ Budget_Variance : num 14.6 10.1 10.6 4.3 -70.9 ...
## $ Budget_deficit_surplus : Factor w/ 2 levels "Deficit","Surplus": 2 1 2 2 2 2 2 2 2 2 ...
## $ Actual_vs_Estimate : Factor w/ 2 levels "Favourable","Unfavourable": 2 2 2 2 1 1 1 1 1 1 ...
To create the categorical variables (Actual_vs_Estimate, Budget_deficit_surplus), the base R ifelse() function was used. These were added to the “final” dataset using the cbind() function, which binds a column to a dataset with the same number of rows.
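The same two class variables could also be created in a single dplyr mutate() call, avoiding the separate cbind() step; a minimal sketch of an equivalent approach (the name final_alt is illustrative, and factor() is applied explicitly to match the factor columns produced above):
# Equivalent dplyr approach (sketch): both class variables in one mutate()
final_alt <- final %>%
  mutate(Budget_deficit_surplus = factor(ifelse(Budget_Balance > 0, "Surplus", "Deficit")),
         Actual_vs_Estimate = factor(ifelse(Budget_Variance > 0, "Unfavourable", "Favourable")))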
The is.na() function, coupled with sum(), was used to detect any missing values in the dataset.
#Checking for missing values
sum(is.na(final))
## [1] 0
There were no missing values in the dataset.
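Had any values been missing, a per-column count would locate them, for example:
colSums(is.na(final)) # Number of missing values per column (all zeros here)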
Before performing data transformation, it is necessary to check the data for outliers. To detect the outliers, a box plot of the variable Total State Debt was drawn using the base R boxplot() function.
# Boxplot for Total State Debt
final$`Total state debt` %>% boxplot(main = "Box Plot of Debt", ylab = "Debt", col = "grey")
# Creating a separate vector of outliers in the variable
outliers <- boxplot(final$`Total state debt`, plot = FALSE)$out
print(outliers) # checking the outlier values
## [1] 151.7 64.2 75.3 66.9 137.4
final[which(final$`Total state debt` %in% outliers),] # Finding the rows of outlier values
final <- final[-which(final$`Total state debt` %in% outliers),]#Removing the outliers
boxplot(final$`Total state debt`) # Checking the boxplot
Since there were 5 outliers in the variable, we removed them by creating a separate vector of outlier values, locating their rows with the which() function, and then eliminating those rows from the resultant dataset. This approach was used because the number of outliers was small. Through this approach, all the outliers were successfully removed.
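For reference, boxplot() flags as an outlier any value lying beyond 1.5 times the interquartile range from the quartiles (Tukey's fences). A minimal sketch that computes the fences manually (boxplot() uses hinges, which can differ marginally from quantile() for some sample sizes):
# Tukey's fences (sketch): values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are flagged
x <- final$`Total state debt`
q <- quantile(x, c(0.25, 0.75))
lower <- q[1] - 1.5 * IQR(x)
upper <- q[2] + 1.5 * IQR(x)
x[x < lower | x > upper] # Potential outliers remaining after removal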
To check whether the variable has a normal distribution, we plot a histogram of Total State Debt using the hist() function.
# Plotting histogram of Total State Debt
hist(final$`Total state debt`)
# Transforming into log data
log_Debt <- log(final$`Total state debt`)
hist(log_Debt) # Plotting histogram of the log-transformed Total State Debt
According to the histogram, the data is positively skewed (right-skewed). To achieve normality, we transformed the data into log values, as the log compresses high values and spreads out low values by expressing them as orders of magnitude. This transformation is commonly used to reduce right skewness.
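To quantify the improvement, the sample skewness (the third standardised moment) can be computed before and after the transformation; a minimal sketch in base R (the helper skew() is illustrative):
# Sample skewness (sketch): positive values indicate right skew
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)
skew(final$`Total state debt`) # Expected: clearly positive (right-skewed)
skew(log_Debt) # Expected: much closer to zero after the log transformation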