About

R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

In this lab we will delve deeper into the marketing data set with R visual analytics and use Tableau to visualize the credit risk data set.

Setup

Make sure to download the folder titled ‘bsad_lab04’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.


Task 1

First, read in the creditrisk.csv file and make sure all the columns are captured and check the first couple of rows of the dataset “head(mydata)”.

mydata = read.csv(file="data/states_purchase_orders.csv")
record.type = mydata$RECORD.TYPE
purchase.order.number = mydata$PURCHASE.ORDER.NUMBER
input.date = mydata$INPUT.DATE
total.amount = mydata$TOTAL.AMOUNT
department.number = mydata$DEPARTMENT.NUMBER
department.name = mydata$DEPARTMENT.NAME 
po.category.description = mydata$PO.CATEGORY.DESCRIPTION
vouched.amount = mydata$VOUCHED.AMOUNT
vendor.name.1 = mydata$VENDOR.NAME.1
vendor.address.1 = mydata$VENDOR.ADDRESS.1
vendor.city = mydata$VENDOR.CITY
vendor.state = mydata$VENDOR.STATE 
vendor.zip = mydata$VENDOR.ZIP 
total.items = mydata$TOTAL.ITEMS
item.unit.cost = mydata$ITEM.UNIT.COST
item.total.cost = mydata$ITEM.TOTAL.COST
unique.id = mydata$UNIQUE.ID
head(mydata)
##   RECORD.TYPE PURCHASE.ORDER.NUMBER INPUT.DATE TOTAL.AMOUNT
## 1           D              VO014151    8/13/14       302.51
## 2           D              RQ037437   11/12/14      7088.00
## 3           D              RQ032962   11/26/12    100000.00
## 4           D              RQ029751    8/31/11    172500.00
## 5           D              PO170848    6/28/17    590812.00
## 6           D              PO170847    6/28/17   4200000.00
##   DEPARTMENT.NUMBER                DEPARTMENT.NAME
## 1                70     DEPARTMENT OF PUBLIC WORKS
## 2                70     DEPARTMENT OF PUBLIC WORKS
## 3                72    ROAD REHABILITATION PROGRAM
## 4                61 COMMUNITY DEVELOPMENT PROGRAMS
## 5                72    ROAD REHABILITATION PROGRAM
## 6                75          CONSTRUCTION PROJECTS
##                PO.CATEGORY.DESCRIPTION VOUCHED.AMOUNT
## 1                                              302.51
## 2                  INFORMAL QUOTATIONS        7088.00
## 3 CONSTRUCTION PROJECTS OVER  $25,000.           0.00
## 4                PROFESSIONAL SERVICES           0.00
## 5 CONSTRUCTION PROJECTS OVER  $25,000.           0.00
## 6 CONSTRUCTION PROJECTS OVER  $25,000.           0.00
##                    VENDOR.NAME.1          VENDOR.ADDRESS.1 VENDOR.CITY
## 1 CONTRACTORS SUPPLY & EQUIPMENT 14020 AIRLINE HWY SUITE A BATON ROUGE
## 2 WATSON MECHANICAL SERVICES INC            1743 DALLAS DR BATON ROUGE
## 3                CLEAN SWEEP INC             4300 GROOM RD       BAKER
## 4 EAST BATON ROUGE REDEVELOPMENT  801 NORTH BLVD SUITE 200 BATON ROUGE
## 5 ERNEST P BREAUX ELECTRICAL INC      2812 BROKEN ARROW RD  NEW IBERIA
## 6            ALLEN & LEBLANC LLC              PO BOX 15789 BATON ROUGE
##   VENDOR.STATE VENDOR.ZIP TOTAL.ITEMS ITEM.UNIT.COST ITEM.TOTAL.COST
## 1           LA      70817           1         302.51          302.51
## 2           LA      70806           1        3544.00         7088.00
## 3           LA      70714           1      100000.00       100000.00
## 4           LA      70802           1      172500.00       172500.00
## 5           LA 70560-0516           1      590812.00       590812.00
## 6           LA      70895           3      890000.00       890000.00
##      UNIQUE.ID
## 1 VO0141510001
## 2 RQ0374370001
## 3 RQ0329620001
## 4 RQ0297510001
## 5 PO1708480001
## 6 PO1708470003

For CATEGORICAL variables it is useful to know the frequency of the different categories/levels of the variable. We can do that by using the table() function.

vendor.state.table = table(vendor.state)
department.name.table = table(department.name)
vendor.state.table
## vendor.state
##           ,    AL    AR    AZ    CA    CD    CO    CT    DC    FL    GA 
##     2    62   400   125   189   877     1   114    99    69   418   555 
##    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI    MN 
##  1089    14   330   180    76   100 20178   149    32    12    55   233 
##    MO    MS    NC    ND    NE    NH    NJ    NM    NV    NY    OH    OK 
##    74   361   313     4    56    14  1633     3    12   210   180     9 
##    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT    WA    WI 
##    27   401    15    41     6   152  2561    77   295     3    60   583 
##    WV 
##     3
department.name.table
## department.name
##                       ANIMAL CONTROL CENTER 
##                                          79 
## BATON ROUGE CONVENTION & VISTORS COMMISSION 
##                                           4 
##                    BATON ROUGE RIVER CENTER 
##                                         134 
##                              CITY CONSTABLE 
##                                         115 
##                                  CITY COURT 
##                                        1412 
##              COMMUNITY DEVELOPMENT PROGRAMS 
##                                        1674 
##                       CONSTRUCTION PROJECTS 
##                                        2224 
##                                     CORONER 
##                                          47 
##                       COUNCIL ADMINISTRATOR 
##                                         182 
##                       COUNCIL BUDGET OFFICE 
##                                          84 
##                  DEPARTMENT OF PUBLIC WORKS 
##                                        5523 
##               DEPT OF BUILDINGS AND GROUNDS 
##                                         326 
## DEPT OF BUSINESS OPERATIONS AND CAPITAL PRO 
##                                         107 
##                         DEPT OF DEVELOPMENT 
##                                          69 
##                    DEPT OF FLEET MANAGEMENT 
##                                         325 
##                         DEPT OF MAINTENANCE 
##                                         220 
##         DEPT OF TRANSPORTATION AND DRAINAGE 
##                                         200 
##                              DISTRICT COURT 
##                                          20 
##               DOWNTOWN DEVELOPMENT DISTRICT 
##                                         182 
##                  EMERGENCY MEDICAL SERVICES 
##                                        1640 
##                          FINANCE DEPARTMENT 
##                                         758 
##                             FIRE DEPARTMENT 
##                                         498 
##                    FIRE PROTECTION DISTRICT 
##                                         217 
##        GREATER BATON ROUGE AIRPORT DISTRICT 
##                                        1677 
##              HUMAN DEVELOPMENT AND SERVICES 
##                                        2733 
##                             HUMAN RESOURCES 
##                                         561 
##                        INFORMATION SERVICES 
##                                        1156 
##                              JUVENILE COURT 
##                                           1 
##                           JUVENILE SERVICES 
##                                         376 
##                    LIBRARY BOARD OF CONTROL 
##                                        2620 
##                        METROPOLITAN COUNCIL 
##                                         186 
##                   MOSQUITO & RODENT CONTROL 
##                                         413 
## MUNICIPAL FIRE & POLICE CIVIL SERVICE BOARD 
##                                          11 
## OFC. OF HOMELAND SECURITY & EMERGENCY PREPA 
##                                         472 
##               OFFICE OF THE MAYOR-PRESIDENT 
##                                         386 
##                            OUTSIDE AGENCIES 
##                                         161 
##                             PARISH ATTORNEY 
##                                         259 
##                         PLANNING COMMISSION 
##                                         138 
##                           POLICE DEPARTMENT 
##                                        4066 
##                   PUBLIC INFORMATION OFFICE 
##                                          94 
##                                  PURCHASING 
##                                         253 
##                         REGISTRAR OF VOTERS 
##                                          43 
##                           RETIREMENT OFFICE 
##                                           4 
##                 ROAD REHABILITATION PROGRAM 
##                                         607 
##                            SHERIFF'S OFFICE 
##                                         195

Now, we will visualize some of the veriables to observe some patterns and differences.

First create a bar chart of the loan purpose frequency table and job frequency table Remember to extract the sales variable before calling it.

barplot(vendor.state.table) 

barplot(department.name.table)

Next, create a histogram of loan purpose frequency table and job frequency table.

hist(vendor.state.table) 

hist(department.name.table)

What are some observable differences between the barplot and the histogram? Please record your observations here.

*The histograms of Vendor State and Department Names account for variable frequency, but do not read for the subcategories of Vendor States and Department Name like the bar plot does. The barplot does a better job of depicting these subcategories. The histogram, however, does represent the frequencies without subjecting them to each individual subcategory. This gives us a good look at frequencies alone.

Lastly, create a pie chart of Loan purpose frequency table. Plot only values greather than 10.

pie(vendor.state.table)  

new_table = vendor.state.table[ vendor.state.table > 10 ] 

Why is this visual not a good representation of the data? What makes the barplot a better representation?

*This pie chart is not a good representation of the data compared to the barplot because it does not portray the numerical numbers of the frequencies of the subcategories for Vendor State.There are far too many states to represent well in a small pie chart.


Task 2

Lets explore the Illinois Tax dataset (READ the “state_tax_illinois.csv” file) and create a scatterplot of corporate tax vs. personal income tax.

The dataset start in 1994 but we want to tax revenue since 2000 for the first quarter of the year “Q1”.

VARIABLES: personal_income_tax and corporate_income_tax.

mydata = read.csv(file="data/state_tax_illinois.csv")
head(mydata)
##   year quarter total_tax_US total_state personal_income_tax
## 1 1994      Q1     91075715     3684023             1360884
## 2 1994      Q2    104324554     4299796             1425714
## 3 1994      Q3     90160343     3803301             1200911
## 4 1994      Q4     91149689     3627632             1174752
## 5 1995      Q1     93944974     4028215             1427336
## 6 1995      Q2    109609563     4480566             1509160
##   corporate_income_tax general_sales_tax motor_fuel_tax
## 1               281377           1113806         248334
## 2               532591           1181050         295800
## 3               279680           1239367         284772
## 4               229280           1271945         266208
## 5               357650           1197459         298489
## 6               614760           1250095         283860
options(scipen = 9999)
year = mydata$year
mydata = mydata[ order(year, decreasing = TRUE) , ] 
personal_income_tax = mydata$personal_income_tax
corporate_income_tax = mydata$corporate_income_tax

The previous task focused on visualizing one variable. A good way to visualize two variables would be a scatter plot or a correlation matrix.

plot(personal_income_tax,corporate_income_tax)

Select only entries greater than year 1999. After selecting entries greather than year 1999, select entries for quarter “Q1”.

new_data = mydata[ mydata$NUMERIC_VARIABLE > 1999, ]
new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
head(new_data)
## [1] year                 quarter              total_tax_US        
## [4] total_state          personal_income_tax  corporate_income_tax
## [7] general_sales_tax    motor_fuel_tax      
## <0 rows> (or 0-length row.names)

It might be easier to notice a trend if there was a line that fit through the points. So, lets repeat using the function ‘scatter.smooth’. This functions adds a smooth curve computed by loess, a method to locally smooth a curve.

scatter.smooth(personal_income_tax,corporate_income_tax)

Now, repeat below for total_tax_US vs. total_state, following the example that was just provided. Observe and note any trends in the scatter plots and potential relationships below. Do the same for all numeric variables.

*Both are positively correlated, increasing from left to right. Both plots are greatly dense on the left half of the graphs.

total_tax_US = mydata$total_tax_US
total_state = mydata$total_state
plot(total_tax_US,total_state)

new_data = mydata[ mydata$NUMERIC_VARIABLE > 1999, ]
new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
head(new_data)
## [1] year                 quarter              total_tax_US        
## [4] total_state          personal_income_tax  corporate_income_tax
## [7] general_sales_tax    motor_fuel_tax      
## <0 rows> (or 0-length row.names)
scatter.smooth(total_tax_US,total_state)

To quantify the data, its best to look at the correlation between two variables. What does this correlation explain?

*This correlation tells us that the variables have a strong uphill (positive) linear relationship.

cor(total_state,personal_income_tax)
## [1] 0.9652176

Below, compute the correlation between every pair of variables (total_tax_US,total_state, personal_income_tax, corporate_income_tax, general_sales_tax, motor_fuel_tax ).

Make sure to extract the variables first and then compute the correlation.

total_tax_US = mydata$total_tax_US
total_state = mydata$total_state
personal_income_tax = mydata$personal_income_tax
corporate_income_tax = mydata$corporate_income_tax
general_sales_tax = mydata$general_sales_tax
motor_fuel_tax = mydata$motor_fuel_tax
corr01 = mydata[ c(3,4) ] 
cor(corr01) 
##              total_tax_US total_state
## total_tax_US    1.0000000   0.9621443
## total_state     0.9621443   1.0000000
corr02 = mydata[3:8] 
cor(corr02)
##                      total_tax_US total_state personal_income_tax
## total_tax_US            1.0000000   0.9621443          0.89216270
## total_state             0.9621443   1.0000000          0.96521760
## personal_income_tax     0.8921627   0.9652176          1.00000000
## corporate_income_tax    0.8378959   0.8547191          0.87948231
## general_sales_tax       0.8811287   0.8632035          0.73636583
## motor_fuel_tax          0.2108519   0.1397864         -0.02182754
##                      corporate_income_tax general_sales_tax motor_fuel_tax
## total_tax_US                   0.83789591         0.8811287     0.21085187
## total_state                    0.85471906         0.8632035     0.13978644
## personal_income_tax            0.87948231         0.7363658    -0.02182754
## corporate_income_tax           1.00000000         0.5503725    -0.05636253
## general_sales_tax              0.55037249         1.0000000     0.36033009
## motor_fuel_tax                -0.05636253         0.3603301     1.00000000

Task 3 - TABLEAU

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.

Download Tablea academic here: https://www.tableau.com/academic/students

  1. Refer to file ‘states_purchase_orders.csv’ in the data folder
  2. Start Tableau and enter your LUC email if prompted.
  3. Import the file into Tableau. Choose the Text File option when importing
  4. Under the dimensions tab located on the left side of the screen.

DOUBLE click on the ‘Department.Name’, then DOUBLE click on ‘Number of Records’ variable located on the bottom left of the screen. Note the bar like view and breakdowns.

What department ‘bin’ has the highest count? What is the count?

*The Department of Public Works has the highest count of 5,523. This heat map shows the relationship of the count within each department bin. The larger boxes are the department bins with larger counts, and the smaller boxes are the department bins with the lesser counts. The darker colored a box is, the greater its count is.

  1. Create a new sheet by clicking on the icon next to your current sheet.

Then DOUBLE click on ‘Number of Records’ variable then DOUBLE click on ‘Vendor.Zip’, then on the top left select the recommend view.

What are the highlights on the map?

*The highlights on the map are the vendor zip codes along with the number of records in each zip code vendor.

Can you give an explanation of why there is a concentration on a particular region?

*Cities with larger populations would have a greater amount of vendors, while cities with lesser populations would not.

*This is the map depicting vendors in each region depending on their zip code. Each dot on the map represents a vendor zip code along with the number of records in each zip code. This gives us a good idea of the regions with a greater population of vendors.