R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.
In this lab we will delve deeper into the marketing data set with R visual analytics and use Tableau to visualize the credit risk data set.
Make sure to download the folder titled ‘bsad_lab04’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, read in the states_purchase_orders.csv
file and make sure all the columns are captured and check the first couple of rows of the dataset “head(mydata)”
#This function will allow the desired file 'states_purchase_order.csv' into R Studio and names it 'mydata'
mydata = read.csv(file="data/states_purchase_orders.csv")
head(mydata)
#The following commands will allow the columns to be called
vendorstate = mydata$VENDOR.STATE
departmentname = mydata$DEPARTMENT.NAME
recordtype = mydata$RECORD.TYPE
purchaseorder = mydata$PURCHASE.ORDER.NUMBER
inputdate = mydata$INPUT.DATE
totalamount = mydata$TOTAL.AMOUNT
departmentnumber = mydata$DEPARTMENT.NUMBER
POcategorydescription = mydata$PO.CATEGORY.DESCRIPTION
vouchedamount = mydata$VOUCHED.AMOUNT
vendorname1 = mydata$VENDOR.NAME.1
vendoraddress1 = mydata$VENDOR.ADDRESS.1
vendorcity = mydata$VENDOR.CITY
vendorzip = mydata$VENDOR.ZIP
totalitems = mydata$VENDOR.ZIP
itemunitcost = mydata$ITEM.UNIT.COST
itemtotalcost = mydata$ITEM.TOTAL.COST
uniqueid = mydata$UNIQUE.ID
Frequency Tables for columns ‘vendor.state’ and ‘department.name’
# Here, I will create a frequency table for the column 'vendor.state' and name it vendorstatefqtable. Using the command "variable_table = table(variable)"
vendorstatefqtable = table(vendorstate)
vendorstatefqtable
vendorstate
, AL AR AZ CA CD CO CT DC FL GA
2 62 400 125 189 877 1 114 99 69 418 555
IA ID IL IN KS KY LA MA MD ME MI MN
1089 14 330 180 76 100 20178 149 32 12 55 233
MO MS NC ND NE NH NJ NM NV NY OH OK
74 361 313 4 56 14 1633 3 12 210 180 9
OR PA RI SC SD TN TX UT VA VT WA WI
27 401 15 41 6 152 2561 77 295 3 60 583
WV
3
The table above is the frequency table that displays the vendors and the amount of orders that they have fulfilled for each state.
# Here, I will create a frequency table for the column 'department.name' and name it deptnamefqtable
deptnamefqtable = table(departmentname)
deptnamefqtable
departmentname
ANIMAL CONTROL CENTER
79
BATON ROUGE CONVENTION & VISTORS COMMISSION
4
BATON ROUGE RIVER CENTER
134
CITY CONSTABLE
115
CITY COURT
1412
COMMUNITY DEVELOPMENT PROGRAMS
1674
CONSTRUCTION PROJECTS
2224
CORONER
47
COUNCIL ADMINISTRATOR
182
COUNCIL BUDGET OFFICE
84
DEPARTMENT OF PUBLIC WORKS
5523
DEPT OF BUILDINGS AND GROUNDS
326
DEPT OF BUSINESS OPERATIONS AND CAPITAL PRO
107
DEPT OF DEVELOPMENT
69
DEPT OF FLEET MANAGEMENT
325
DEPT OF MAINTENANCE
220
DEPT OF TRANSPORTATION AND DRAINAGE
200
DISTRICT COURT
20
DOWNTOWN DEVELOPMENT DISTRICT
182
EMERGENCY MEDICAL SERVICES
1640
FINANCE DEPARTMENT
758
FIRE DEPARTMENT
498
FIRE PROTECTION DISTRICT
217
GREATER BATON ROUGE AIRPORT DISTRICT
1677
HUMAN DEVELOPMENT AND SERVICES
2733
HUMAN RESOURCES
561
INFORMATION SERVICES
1156
JUVENILE COURT
1
JUVENILE SERVICES
376
LIBRARY BOARD OF CONTROL
2620
METROPOLITAN COUNCIL
186
MOSQUITO & RODENT CONTROL
413
MUNICIPAL FIRE & POLICE CIVIL SERVICE BOARD
11
OFC. OF HOMELAND SECURITY & EMERGENCY PREPA
472
OFFICE OF THE MAYOR-PRESIDENT
386
OUTSIDE AGENCIES
161
PARISH ATTORNEY
259
PLANNING COMMISSION
138
POLICE DEPARTMENT
4066
PUBLIC INFORMATION OFFICE
94
PURCHASING
253
REGISTRAR OF VOTERS
43
RETIREMENT OFFICE
4
ROAD REHABILITATION PROGRAM
607
SHERIFF'S OFFICE
195
This frequency table displays the orders placed by each department. I.e. 195 orders placed by the Sheriff’s office, 259 by the Parish attorney, 161 by outside agencies, etc.
Now, we will visualize some of the variables to observe some patterns and differences.
First create a bar chart of the loan purpose frequency table and job frequency table Remember to extract the sales variable before calling it.
BAR PLOTS FOR VENDOR STATE AND DEPARTMENT NAME
VARIABLES: VENDOR STATE AND DEPARTMENT NAME
# Here, I will create a bar plot for the variable vendor state.
barplot(vendorstatefqtable)
It is difficult to make assesments with this barplot since the ranges are so large, therefore I will create another barplot with a certain number of specified rows: 10-18
# Bar plot with variable vendor state- specified rows
barplot(vendorstatefqtable[10:18])
This layout is much easier to view, with noting that Indiana had the most orders in comparison to Idaho with the least in this specified range.
# Bar plot with variable department name with also specified rows to allow better use for assessments.
barplot(deptnamefqtable[12:18])
Histogram of Vendor State and Department Name
Below, I will creat histograms for the variables vendor state and department name using the function “hist(variable)”
# Plot a histogram using the " hist(variable) " command where variable is vendor state
hist(vendorstatefqtable)
# Plot a histogram using the " hist(variable) " command where variable is vendor state
hist(deptnamefqtable)
Pie Chart for Vendor state and Department Name
VARIABLE: VENDOR STATE
# Plot a pie chart using:
# pie(variable) (command to make a pie chart)
# To get only values greather than N from a frequency table use:
# new_table = my_table[ my_table > N ] (command to select values greather than N)
newVendorStateTable = vendorstatefqtable[vendorstatefqtable >500]
pie(newVendorStateTable)
Why is this visual not a good representation of the data? What makes the barplot a better representation? The above visual is not a good representation because it fails to provide an accurate representation of all states. Plus there are states whose some orders encompass a little less than half of what the other states combined ordered which does not leave for a clear and readable graph.
Lets explore the Illinois Tax dataset (READ the “state_tax_illinois.csv” file) and create a scatterplot of corporate tax vs. personal income tax.
The dataset start in 1994 but we want to tax revenue since 2000 for the first quarter of the year “Q1”
VARIABLES: personal_income_tax and corporate_income_tax
mydata1 = read.csv(file="data/state_tax_illinois.csv")
head(mydata1)
# read the dataset extract the year variable and order the dataset by year in decreasing order using the command:
year = mydata1$year
year
[1] 1994 1994 1994 1994 1995 1995 1995 1995 1996 1996 1996 1996 1997 1997
[15] 1997 1997 1998 1998 1998 1998 1999 1999 1999 1999 2000 2000 2000 2000
[29] 2001 2001 2001 2001 2002 2002 2002 2002 2003 2003 2003 2003 2004 2004
[43] 2004 2004 2005 2005 2005 2005 2006 2006 2006 2006 2007 2007 2007 2007
[57] 2008 2008 2008 2008 2009 2009 2009 2009 2010 2010 2010 2010 2011 2011
[71] 2011 2011 2012 2012 2012 2012 2013 2013 2013 2013 2014 2014 2014 2014
[85] 2015 2015 2015 2015 2016 2016
# To change the way that R shows very large or small values we can use the following command:
options(scipen = 9999)
# mydata = mydata[ order(year, decreasing = TRUE) , ] (! Order the data in decreasing order)
mydata1 = mydata1[ order(mydata1$year, decreasing = TRUE),]
# From the ordered dataset Extract personal_income_tax variable
personalincometax = mydata$personal_income_tax
personalincometax
[1] 3746068 3950197 3982255 4527933 3059415 3048487 4505921 4961971 3680480
[10] 3723148 4345335 5353786 3608644 3565618 4197619 4790521 3403778 3435763
[19] 3195378 4096206 3343258 3180912 2515231 2702995 2134865 2291055 2448000
[28] 2899039 2104226 2110793 2803540 3427232 2149834 1968785 2522527 3101128
[37] 2071646 2017821 2264920 2817633 1963943 1820838 2130841 2511261 1817907
[46] 1734644 1902394 2181615 1675372 1661305 1862628 2169229 1600339 1571274
[55] 1945837 2195228 1648066 1661059 1867375 2405920 1679520 1650800 2272950
[64] 2713373 1763852 1631196 1877350 1984974 1644788 1640535 1877350 1984974
[73] 1579539 1545132 1689205 1786043 1579539 1545132 1553374 1642423 1421240
[82] 1390281 1427336 1509160 1306955 1278486 1360884 1425714 1200911 1174752
#From the ordered dataset Extract corporate_income_tax variable
corpincometax = mydata$corporate_income_tax
corpincometax
[1] 903017 1163900 1055932 1425459 674725 625819 1105062 1508593 787764
[10] 785112 1028486 1764708 877978 948199 834075 1365912 778008 891425
[19] 632404 1160954 655115 639437 729180 971001 507167 504933 642923
[28] 1403105 459494 527011 790459 1225043 519552 439074 704820 1158924
[37] 537828 562274 591623 901551 556344 516272 557161 735924 438193
[46] 468903 431059 470507 324041 354688 362344 421964 222017 395572
[55] 362374 432374 261916 246964 583639 786706 274660 313983 792893
[64] 828315 433323 412970 473585 814038 476261 457064 473585 814038
[73] 370340 303603 435527 748621 370340 303603 391428 672820 340579
[82] 279205 357650 614760 306094 250934 281377 532591 279680 229280
The previous task focused on visualizing one variable. A good way to visualize two variables would be a scatter plot or a correlation matrix.
# Plot personal_income_tax vs. corporate_income_tax
# personal_income_tax will be on the x-axis
# corporate_income_tax will be on the y-axis
plot(personalincometax, corpincometax, xlab = "Personal Income Tax", ylab = "Corporate Income Tax")
The above scatter plot features the two variables: personal income tax on the x-axis and corporate income tax on the y-axis. These two variables allow for a clearer represenation.
as.numeric(mydata1$year)
[1] 1994 1994 1994 1994 1995 1995 1995 1995 1996 1996 1996 1996 1997 1997
[15] 1997 1997 1998 1998 1998 1998 1999 1999 1999 1999 2000 2000 2000 2000
[29] 2001 2001 2001 2001 2002 2002 2002 2002 2003 2003 2003 2003 2004 2004
[43] 2004 2004 2005 2005 2005 2005 2006 2006 2006 2006 2007 2007 2007 2007
[57] 2008 2008 2008 2008 2009 2009 2009 2009 2010 2010 2010 2010 2011 2011
[71] 2011 2011 2012 2012 2012 2012 2013 2013 2013 2013 2014 2014 2014 2014
[85] 2015 2015 2015 2015 2016 2016
# To create a new_data filtering for an N NUMERIC value we can do the following
# new_data = mydata[ mydata$NUMERIC_VARIABLE > N, ]
new_data = mydata1[ mydata1$year > 1999,]
# If we need to make another filter on a NON-NUMERIC we can do the following
# new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
new_data = new_data[ new_data$quarter =="Q1" , ]
head(mydata1)
It might be easier to notice a trend if there was a line that fit through the points. So, lets repeat using the function ‘scatter.smooth’. This functions adds a smooth curve computed by loess, a method to locally smooth a curve.
ScatterPlot In this seciton I will create a scatter plot for the dictated variables. I will use the funtion ‘scatter.smooth(variable, variable)’
scatter.smooth(personalincometax,corpincometax, xlab = "Personal Income Tax", ylab = "Corporate Income Tax")
Now, repeat below for total_tax_US vs. total_state, following the example that was just provided. Observe and note any trends in the scatter plots and potential relationships below. Do the same for all numeric variables.
# total_tax_US vs. total_state
# I will first call and extract the desired numeric variables
totaltaxUS = mydata1$total_tax_US
totalstate = mydata1$total_state
generalsalestax = mydata1$general_sales_tax
motorfueltax = mydata1$motor_fuel_tax
scatter.smooth(mydata1$total_tax_US, mydata1$total_state, xlab = "Total Tax US", ylab = "Total State")
To quantify the data, its best to look at the correlation between two variables. What does this correlation explain?
cor(mydata1$total_state,mydata1$personal_income_tax)
[1] 0.9652176
The number above, 0.9652176, indicates that the two variable are strongly and positively correlated.
Below, compute the correlation between every pair of variables (total_tax_US,total_state, personal_income_tax, corporate_income_tax, general_sales_tax, motor_fuel_tax ).
Make sure to extract the variables first and then compute the correlation.
corr01 = mydata1[c(3:6)]
cor(corr01)
total_tax_US total_state personal_income_tax
total_tax_US 1.0000000 0.9621443 0.8921627
total_state 0.9621443 1.0000000 0.9652176
personal_income_tax 0.8921627 0.9652176 1.0000000
corporate_income_tax 0.8378959 0.8547191 0.8794823
corporate_income_tax
total_tax_US 0.8378959
total_state 0.8547191
personal_income_tax 0.8794823
corporate_income_tax 1.0000000
To begin this task, I imported the state_purchases file into Tableau. I selected the dimension ‘Department name’ and measure ‘Number of Records’.
Next, I chose the ‘treemap’ view. This assembled all the departments into squares that were sized accordingly to the number of records they had. Therefore the largest was the Department of Public Works. This was a nice visually appealing representation of the departments and their records. It would be very effective in a presentation setting.
I then created a new sheet and selected ‘Number of records’ and dimension ‘Vendor.zip’. Then I selected the recommended map view. This presented the data in a readable, visually pleasing format. The circles on the map represented the zip codes location and their size represented their number of records that corresponded. A concentration for example, in Louisiana or Georgia is due to the fact that there are many zip codes represented in this particular area.