R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.
In this lab we will delve deeper into the marketing data set with R visual analytics and use Tableau to visualize the credit risk data set.
Make sure to download the folder titled ‘bsad_lab04’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, read in the creditrisk.csv file and make sure all the columns are captured and check the first couple of rows of the dataset “head(mydata)”.
mydata = read.csv(file="data/states_purchase_orders.csv")
record.type = mydata$RECORD.TYPE
purchase.order.number = mydata$PURCHASE.ORDER.NUMBER
input.date = mydata$INPUT.DATE
total.amount = mydata$TOTAL.AMOUNT
department.number = mydata$DEPARTMENT.NUMBER
department.name = mydata$DEPARTMENT.NAME
po.category.description = mydata$PO.CATEGORY.DESCRIPTION
vouched.amount = mydata$VOUCHED.AMOUNT
vendor.name.1 = mydata$VENDOR.NAME.1
vendor.address.1 = mydata$VENDOR.ADDRESS.1
vendor.city = mydata$VENDOR.CITY
vendor.state = mydata$VENDOR.STATE
vendor.zip = mydata$VENDOR.ZIP
total.items = mydata$TOTAL.ITEMS
item.unit.cost = mydata$ITEM.UNIT.COST
item.total.cost = mydata$ITEM.TOTAL.COST
unique.id = mydata$UNIQUE.ID
head(mydata)
## RECORD.TYPE PURCHASE.ORDER.NUMBER INPUT.DATE TOTAL.AMOUNT
## 1 D VO014151 8/13/14 302.51
## 2 D RQ037437 11/12/14 7088.00
## 3 D RQ032962 11/26/12 100000.00
## 4 D RQ029751 8/31/11 172500.00
## 5 D PO170848 6/28/17 590812.00
## 6 D PO170847 6/28/17 4200000.00
## DEPARTMENT.NUMBER DEPARTMENT.NAME
## 1 70 DEPARTMENT OF PUBLIC WORKS
## 2 70 DEPARTMENT OF PUBLIC WORKS
## 3 72 ROAD REHABILITATION PROGRAM
## 4 61 COMMUNITY DEVELOPMENT PROGRAMS
## 5 72 ROAD REHABILITATION PROGRAM
## 6 75 CONSTRUCTION PROJECTS
## PO.CATEGORY.DESCRIPTION VOUCHED.AMOUNT
## 1 302.51
## 2 INFORMAL QUOTATIONS 7088.00
## 3 CONSTRUCTION PROJECTS OVER $25,000. 0.00
## 4 PROFESSIONAL SERVICES 0.00
## 5 CONSTRUCTION PROJECTS OVER $25,000. 0.00
## 6 CONSTRUCTION PROJECTS OVER $25,000. 0.00
## VENDOR.NAME.1 VENDOR.ADDRESS.1 VENDOR.CITY
## 1 CONTRACTORS SUPPLY & EQUIPMENT 14020 AIRLINE HWY SUITE A BATON ROUGE
## 2 WATSON MECHANICAL SERVICES INC 1743 DALLAS DR BATON ROUGE
## 3 CLEAN SWEEP INC 4300 GROOM RD BAKER
## 4 EAST BATON ROUGE REDEVELOPMENT 801 NORTH BLVD SUITE 200 BATON ROUGE
## 5 ERNEST P BREAUX ELECTRICAL INC 2812 BROKEN ARROW RD NEW IBERIA
## 6 ALLEN & LEBLANC LLC PO BOX 15789 BATON ROUGE
## VENDOR.STATE VENDOR.ZIP TOTAL.ITEMS ITEM.UNIT.COST ITEM.TOTAL.COST
## 1 LA 70817 1 302.51 302.51
## 2 LA 70806 1 3544.00 7088.00
## 3 LA 70714 1 100000.00 100000.00
## 4 LA 70802 1 172500.00 172500.00
## 5 LA 70560-0516 1 590812.00 590812.00
## 6 LA 70895 3 890000.00 890000.00
## UNIQUE.ID
## 1 VO0141510001
## 2 RQ0374370001
## 3 RQ0329620001
## 4 RQ0297510001
## 5 PO1708480001
## 6 PO1708470003
For CATEGORICAL variables it is useful to know the frequency of the different categories/levels of the variable. We can do that by using the table() function.
vendor.state.table = table(vendor.state)
department.name.table = table(department.name)
vendor.state.table
## vendor.state
## , AL AR AZ CA CD CO CT DC FL GA
## 2 62 400 125 189 877 1 114 99 69 418 555
## IA ID IL IN KS KY LA MA MD ME MI MN
## 1089 14 330 180 76 100 20178 149 32 12 55 233
## MO MS NC ND NE NH NJ NM NV NY OH OK
## 74 361 313 4 56 14 1633 3 12 210 180 9
## OR PA RI SC SD TN TX UT VA VT WA WI
## 27 401 15 41 6 152 2561 77 295 3 60 583
## WV
## 3
department.name.table
## department.name
## ANIMAL CONTROL CENTER
## 79
## BATON ROUGE CONVENTION & VISTORS COMMISSION
## 4
## BATON ROUGE RIVER CENTER
## 134
## CITY CONSTABLE
## 115
## CITY COURT
## 1412
## COMMUNITY DEVELOPMENT PROGRAMS
## 1674
## CONSTRUCTION PROJECTS
## 2224
## CORONER
## 47
## COUNCIL ADMINISTRATOR
## 182
## COUNCIL BUDGET OFFICE
## 84
## DEPARTMENT OF PUBLIC WORKS
## 5523
## DEPT OF BUILDINGS AND GROUNDS
## 326
## DEPT OF BUSINESS OPERATIONS AND CAPITAL PRO
## 107
## DEPT OF DEVELOPMENT
## 69
## DEPT OF FLEET MANAGEMENT
## 325
## DEPT OF MAINTENANCE
## 220
## DEPT OF TRANSPORTATION AND DRAINAGE
## 200
## DISTRICT COURT
## 20
## DOWNTOWN DEVELOPMENT DISTRICT
## 182
## EMERGENCY MEDICAL SERVICES
## 1640
## FINANCE DEPARTMENT
## 758
## FIRE DEPARTMENT
## 498
## FIRE PROTECTION DISTRICT
## 217
## GREATER BATON ROUGE AIRPORT DISTRICT
## 1677
## HUMAN DEVELOPMENT AND SERVICES
## 2733
## HUMAN RESOURCES
## 561
## INFORMATION SERVICES
## 1156
## JUVENILE COURT
## 1
## JUVENILE SERVICES
## 376
## LIBRARY BOARD OF CONTROL
## 2620
## METROPOLITAN COUNCIL
## 186
## MOSQUITO & RODENT CONTROL
## 413
## MUNICIPAL FIRE & POLICE CIVIL SERVICE BOARD
## 11
## OFC. OF HOMELAND SECURITY & EMERGENCY PREPA
## 472
## OFFICE OF THE MAYOR-PRESIDENT
## 386
## OUTSIDE AGENCIES
## 161
## PARISH ATTORNEY
## 259
## PLANNING COMMISSION
## 138
## POLICE DEPARTMENT
## 4066
## PUBLIC INFORMATION OFFICE
## 94
## PURCHASING
## 253
## REGISTRAR OF VOTERS
## 43
## RETIREMENT OFFICE
## 4
## ROAD REHABILITATION PROGRAM
## 607
## SHERIFF'S OFFICE
## 195
Now, we will visualize some of the veriables to observe some patterns and differences.
First create a bar chart of the loan purpose frequency table and job frequency table Remember to extract the sales variable before calling it.
barplot(vendor.state.table)
barplot(department.name.table)
Next, create a histogram of loan purpose frequency table and job frequency table.
hist(vendor.state.table)
hist(department.name.table)
What are some observable differences between the barplot and the histogram? Please record your observations here.
*The histograms of Vendor State and Department Names account for variable frequency, but do not read for the subcategories of Vendor States and Department Name like the bar plot does. The barplot does a better job of depicting these subcategories. The histogram, however, does represent the frequencies without subjecting them to each individual subcategory. This gives us a good look at frequencies alone.
Lastly, create a pie chart of Loan purpose frequency table. Plot only values greather than 10.
pie(vendor.state.table)
new_table = vendor.state.table[ vendor.state.table > 10 ]
Why is this visual not a good representation of the data? What makes the barplot a better representation?
*This pie chart is not a good representation of the data compared to the barplot because it does not portray the numerical numbers of the frequencies of the subcategories for Vendor State.There are far too many states to represent well in a small pie chart.
Lets explore the Illinois Tax dataset (READ the “state_tax_illinois.csv” file) and create a scatterplot of corporate tax vs. personal income tax.
The dataset start in 1994 but we want to tax revenue since 2000 for the first quarter of the year “Q1”.
VARIABLES: personal_income_tax and corporate_income_tax.
mydata = read.csv(file="data/state_tax_illinois.csv")
head(mydata)
## year quarter total_tax_US total_state personal_income_tax
## 1 1994 Q1 91075715 3684023 1360884
## 2 1994 Q2 104324554 4299796 1425714
## 3 1994 Q3 90160343 3803301 1200911
## 4 1994 Q4 91149689 3627632 1174752
## 5 1995 Q1 93944974 4028215 1427336
## 6 1995 Q2 109609563 4480566 1509160
## corporate_income_tax general_sales_tax motor_fuel_tax
## 1 281377 1113806 248334
## 2 532591 1181050 295800
## 3 279680 1239367 284772
## 4 229280 1271945 266208
## 5 357650 1197459 298489
## 6 614760 1250095 283860
options(scipen = 9999)
year = mydata$year
mydata = mydata[ order(year, decreasing = TRUE) , ]
personal_income_tax = mydata$personal_income_tax
corporate_income_tax = mydata$corporate_income_tax
The previous task focused on visualizing one variable. A good way to visualize two variables would be a scatter plot or a correlation matrix.
plot(personal_income_tax,corporate_income_tax)
Select only entries greater than year 1999. After selecting entries greather than year 1999, select entries for quarter “Q1”.
new_data = mydata[ mydata$NUMERIC_VARIABLE > 1999, ]
new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
head(new_data)
## [1] year quarter total_tax_US
## [4] total_state personal_income_tax corporate_income_tax
## [7] general_sales_tax motor_fuel_tax
## <0 rows> (or 0-length row.names)
It might be easier to notice a trend if there was a line that fit through the points. So, lets repeat using the function ‘scatter.smooth’. This functions adds a smooth curve computed by loess, a method to locally smooth a curve.
scatter.smooth(personal_income_tax,corporate_income_tax)
Now, repeat below for total_tax_US vs. total_state, following the example that was just provided. Observe and note any trends in the scatter plots and potential relationships below. Do the same for all numeric variables.
*Both are positively correlated, increasing from left to right. Both plots are greatly dense on the left half of the graphs.
total_tax_US = mydata$total_tax_US
total_state = mydata$total_state
plot(total_tax_US,total_state)
new_data = mydata[ mydata$NUMERIC_VARIABLE > 1999, ]
new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
head(new_data)
## [1] year quarter total_tax_US
## [4] total_state personal_income_tax corporate_income_tax
## [7] general_sales_tax motor_fuel_tax
## <0 rows> (or 0-length row.names)
scatter.smooth(total_tax_US,total_state)
To quantify the data, its best to look at the correlation between two variables. What does this correlation explain?
*This correlation tells us that the variables have a strong uphill (positive) linear relationship.
cor(total_state,personal_income_tax)
## [1] 0.9652176
Below, compute the correlation between every pair of variables (total_tax_US,total_state, personal_income_tax, corporate_income_tax, general_sales_tax, motor_fuel_tax ).
Make sure to extract the variables first and then compute the correlation.
total_tax_US = mydata$total_tax_US
total_state = mydata$total_state
personal_income_tax = mydata$personal_income_tax
corporate_income_tax = mydata$corporate_income_tax
general_sales_tax = mydata$general_sales_tax
motor_fuel_tax = mydata$motor_fuel_tax
corr01 = mydata[ c(3,4) ]
cor(corr01)
## total_tax_US total_state
## total_tax_US 1.0000000 0.9621443
## total_state 0.9621443 1.0000000
corr02 = mydata[3:8]
cor(corr02)
## total_tax_US total_state personal_income_tax
## total_tax_US 1.0000000 0.9621443 0.89216270
## total_state 0.9621443 1.0000000 0.96521760
## personal_income_tax 0.8921627 0.9652176 1.00000000
## corporate_income_tax 0.8378959 0.8547191 0.87948231
## general_sales_tax 0.8811287 0.8632035 0.73636583
## motor_fuel_tax 0.2108519 0.1397864 -0.02182754
## corporate_income_tax general_sales_tax motor_fuel_tax
## total_tax_US 0.83789591 0.8811287 0.21085187
## total_state 0.85471906 0.8632035 0.13978644
## personal_income_tax 0.87948231 0.7363658 -0.02182754
## corporate_income_tax 1.00000000 0.5503725 -0.05636253
## general_sales_tax 0.55037249 1.0000000 0.36033009
## motor_fuel_tax -0.05636253 0.3603301 1.00000000
Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.
Download Tablea academic here: https://www.tableau.com/academic/students
DOUBLE click on the ‘Department.Name’, then DOUBLE click on ‘Number of Records’ variable located on the bottom left of the screen. Note the bar like view and breakdowns.
What department ‘bin’ has the highest count? What is the count?
*The Department of Public Works has the highest count of 5,523. This heat map shows the relationship of the count within each department bin. The larger boxes are the department bins with larger counts, and the smaller boxes are the department bins with the lesser counts. The darker colored a box is, the greater its count is.
Then DOUBLE click on ‘Number of Records’ variable then DOUBLE click on ‘Vendor.Zip’, then on the top left select the recommend view.
What are the highlights on the map?
*The highlights on the map are the vendor zip codes along with the number of records in each zip code vendor.
Can you give an explanation of why there is a concentration on a particular region?
*Cities with larger populations would have a greater amount of vendors, while cities with lesser populations would not.
*This is the map depicting vendors in each region depending on their zip code. Each dot on the map represents a vendor zip code along with the number of records in each zip code. This gives us a good idea of the regions with a greater population of vendors.