R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.
In this lab we will delve deeper into the marketing data set with R visual analytics and use Tableau to visualize the credit risk data set.
Make sure to download the folder titled ‘bsad_lab04’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
First, read in the creditrisk.csv file and make sure all the columns are captured and check the first couple of rows of the dataset “head(mydata)”
mydata = read.csv(file="data/states_purchase_orders.csv")
#loan_purpose = mydata$Loan.Purpose !Do the same for all the columns!
head(mydata)
For CATEGORICAL variables it is useful to know the frequency of the different categories/levels of the variable. We can do that by using the table() function.
# Command to create a frequency table: variable_table = table(variable)
vendorstate = mydata$VENDOR.STATE
vendorstatetable = table(vendorstate)
vendorstatetable
vendorstate
, AL AR AZ CA CD CO CT DC FL GA IA ID IL IN KS KY LA MA MD ME MI MN MO MS
2 62 400 125 189 877 1 114 99 69 418 555 1089 14 330 180 76 100 20178 149 32 12 55 233 74 361
NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV
313 4 56 14 1633 3 12 210 180 9 27 401 15 41 6 152 2561 77 295 3 60 583 3
deptname = mydata$DEPARTMENT.NAME
deptnametable = table(deptname)
deptnametable
deptname
ANIMAL CONTROL CENTER BATON ROUGE CONVENTION & VISTORS COMMISSION BATON ROUGE RIVER CENTER
79 4 134
CITY CONSTABLE CITY COURT COMMUNITY DEVELOPMENT PROGRAMS
115 1412 1674
CONSTRUCTION PROJECTS CORONER COUNCIL ADMINISTRATOR
2224 47 182
COUNCIL BUDGET OFFICE DEPARTMENT OF PUBLIC WORKS DEPT OF BUILDINGS AND GROUNDS
84 5523 326
DEPT OF BUSINESS OPERATIONS AND CAPITAL PRO DEPT OF DEVELOPMENT DEPT OF FLEET MANAGEMENT
107 69 325
DEPT OF MAINTENANCE DEPT OF TRANSPORTATION AND DRAINAGE DISTRICT COURT
220 200 20
DOWNTOWN DEVELOPMENT DISTRICT EMERGENCY MEDICAL SERVICES FINANCE DEPARTMENT
182 1640 758
FIRE DEPARTMENT FIRE PROTECTION DISTRICT GREATER BATON ROUGE AIRPORT DISTRICT
498 217 1677
HUMAN DEVELOPMENT AND SERVICES HUMAN RESOURCES INFORMATION SERVICES
2733 561 1156
JUVENILE COURT JUVENILE SERVICES LIBRARY BOARD OF CONTROL
1 376 2620
METROPOLITAN COUNCIL MOSQUITO & RODENT CONTROL MUNICIPAL FIRE & POLICE CIVIL SERVICE BOARD
186 413 11
OFC. OF HOMELAND SECURITY & EMERGENCY PREPA OFFICE OF THE MAYOR-PRESIDENT OUTSIDE AGENCIES
472 386 161
PARISH ATTORNEY PLANNING COMMISSION POLICE DEPARTMENT
259 138 4066
PUBLIC INFORMATION OFFICE PURCHASING REGISTRAR OF VOTERS
94 253 43
RETIREMENT OFFICE ROAD REHABILITATION PROGRAM SHERIFF'S OFFICE
4 607 195
Now, we will visualize some of the veriables to observe some patterns and differences.
First create a bar chart of the loan purpose frequency table and job frequency table Remember to extract the sales variable before calling it.
VARIABLES: Job and Loan Purpose
# Plot a bar plot using the " barplot(variable) " command where variable is the extracted variable
barplot(vendorstatetable)
barplot(deptnametable)
Next, create a histogram of loan purpose frequency table and job frequency table
# Plot a histogram using the " hist(variable) " command where variable is the extracted variable
hist(vendorstatetable)
hist(deptnametable)
There are some notable differences between the barplots and histograms. The barplots appear to be more comprehansive and reveal more usable information to us as users of the data.The histograms have numbers on both axes and it makes it a little less meaningful. Also, with the vendor state graphs especially, it is hard to see how the two relate.
Lastly, create a pie chart of Loan purpose frequency table. Plot only values greather than 10.
VARIABLE: Loan Purpose
# Plot a pie chart using:
# pie(variable) (command to make a pie chart)
pie(vendorstatetable)
pie(deptnametable)
# To get only values greather than N from a frequency table use:
# new_table = my_table[ my_table > N ] (command to select values greather than N)
newvendor = vendorstatetable[vendorstatetable > 500]
newdept = deptnametable[deptnametable > 500]
pie(newvendor)
pie(newdept)
A pie chart does not appear to be a good representation of the data because it doesn’t really allow us to see how many there really are. Especially once we narrow it down to values greater than 500, it is not very accurate at all. The barplot at least shows us all of them in a clearer way and we are able to see how many there are of each, rather than just how they compare to others in size.
Lets explore the Illinois Tax dataset (READ the “state_tax_illinois.csv” file) and create a scatterplot of corporate tax vs. personal income tax.
The dataset start in 1994 but we want to tax revenue since 2000 for the first quarter of the year “Q1”
VARIABLES: personal_income_tax and corporate_income_tax
# To change the way that R shows very large or small values we can use the following command:
options(scipen = 9999)
# read the dataset extract the year variable and order the dataset by year in decreasing order using the command:
taxdata = read.csv(file = "data/state_tax_illinois.csv")
taxdata
year = taxdata$year
# mydata = mydata[ order(year, decreasing = TRUE) , ] (! Order the data in decreasing order)
taxdatabyyear = taxdata[order(year, decreasing = TRUE) , ]
taxdatabyyear
# From the ordered dataset Extract personal_income_tax variable
personalinctax = taxdatabyyear$personal_income_tax
personalinctax
[1] 3746068 3950197 3982255 4527933 3059415 3048487 4505921 4961971 3680480 3723148 4345335 5353786 3608644 3565618 4197619 4790521 3403778 3435763 3195378
[20] 4096206 3343258 3180912 2515231 2702995 2134865 2291055 2448000 2899039 2104226 2110793 2803540 3427232 2149834 1968785 2522527 3101128 2071646 2017821
[39] 2264920 2817633 1963943 1820838 2130841 2511261 1817907 1734644 1902394 2181615 1675372 1661305 1862628 2169229 1600339 1571274 1945837 2195228 1648066
[58] 1661059 1867375 2405920 1679520 1650800 2272950 2713373 1763852 1631196 1877350 1984974 1644788 1640535 1877350 1984974 1579539 1545132 1689205 1786043
[77] 1579539 1545132 1553374 1642423 1421240 1390281 1427336 1509160 1306955 1278486 1360884 1425714 1200911 1174752
# From the ordered dataset Extract corporate_income_tax variable
corpinctax = taxdatabyyear$corporate_income_tax
corpinctax
[1] 903017 1163900 1055932 1425459 674725 625819 1105062 1508593 787764 785112 1028486 1764708 877978 948199 834075 1365912 778008 891425 632404
[20] 1160954 655115 639437 729180 971001 507167 504933 642923 1403105 459494 527011 790459 1225043 519552 439074 704820 1158924 537828 562274
[39] 591623 901551 556344 516272 557161 735924 438193 468903 431059 470507 324041 354688 362344 421964 222017 395572 362374 432374 261916
[58] 246964 583639 786706 274660 313983 792893 828315 433323 412970 473585 814038 476261 457064 473585 814038 370340 303603 435527 748621
[77] 370340 303603 391428 672820 340579 279205 357650 614760 306094 250934 281377 532591 279680 229280
The previous task focused on visualizing one variable. A good way to visualize two variables would be a scatter plot or a correlation matrix.
# Plot personal_income_tax vs. corporate_income_tax
# personal_income_tax will be on the x-axis
# corporate_income_tax will be on the y-axis
plot(personalinctax,corpinctax)
# To create a new_data filtering for an N NUMERIC value we can do the following
# new_data = mydata[ mydata$NUMERIC_VARIABLE > N, ]
newtaxdata = taxdatabyyear[ taxdatabyyear$NUMERIC_VARIABLE > 1999]
newtaxdata
# If we need to make another filter on a NON-NUMERIC we can do the following
# new_data = new_data[ new_data$CHAR_VARIABLE == "Q1", ]
new = newtaxdata[ newtaxdata$CHAR_VARIABLE == "Q1" , ]
new
It might be easier to notice a trend if there was a line that fit through the points. So, lets repeat using the function ‘scatter.smooth’. This functions adds a smooth curve computed by loess, a method to locally smooth a curve.
scatter.smooth(personalinctax,corpinctax)
Now, repeat below for total_tax_US vs. total_state, following the example that was just provided. Observe and note any trends in the scatter plots and potential relationships below. Do the same for all numeric variables.
totalus = taxdatabyyear$total_tax_US
totalstate = taxdatabyyear$total_state
scatter.smooth(totalus,totalstate)
# total_tax_US vs. total_state
To quantify the data, its best to look at the correlation between two variables. What does this correlation explain?
cor(totalus,totalstate)
[1] 0.9621443
cor(personalinctax,corpinctax)
[1] 0.8794823
Below, compute the correlation between every pair of variables (total_tax_US,total_state, personal_income_tax, corporate_income_tax, general_sales_tax, motor_fuel_tax ).
Make sure to extract the variables first and then compute the correlation.
gentax = taxdatabyyear$general_sales_tax
motortax = taxdatabyyear$motor_fuel_tax
cor(gentax,motortax)
[1] 0.3603301
cor(totalus,gentax)
[1] 0.8811287
cor(personalinctax,motortax)
[1] -0.02182754
# To make a correlation table of all the NUMERIC variables you can select some columns or a range of columns, with the following command:
# corr01 = mydata[ c(3,4) ] (The command c(3,4,...) inside the brackets selects a given column )
corr1 = taxdatabyyear [c(3,4)]
cor(corr1)
total_tax_US total_state
total_tax_US 1.0000000 0.9621443
total_state 0.9621443 1.0000000
# cor(corr01) (will give you the correlation table for variables in columns 3 and 4)
corr2 = taxdatabyyear [c(3:8)]
cor(corr2)
total_tax_US total_state personal_income_tax corporate_income_tax general_sales_tax motor_fuel_tax
total_tax_US 1.0000000 0.9621443 0.89216270 0.83789591 0.8811287 0.21085187
total_state 0.9621443 1.0000000 0.96521760 0.85471906 0.8632035 0.13978644
personal_income_tax 0.8921627 0.9652176 1.00000000 0.87948231 0.7363658 -0.02182754
corporate_income_tax 0.8378959 0.8547191 0.87948231 1.00000000 0.5503725 -0.05636253
general_sales_tax 0.8811287 0.8632035 0.73636583 0.55037249 1.0000000 0.36033009
motor_fuel_tax 0.2108519 0.1397864 -0.02182754 -0.05636253 0.3603301 1.00000000
# corr02 = mydata[3:5] ( the command [3:5] selects a range of columns from 3 to 5)
# cor(corr02)
Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.
Refer to file ‘states_purchase_orders.csv’ in the data folder
Start Tableau and enter your LUC email if prompted.
Import the file into Tableau. Choose the Text File option when importing
Under the dimensions tab located on the left side of the screen.
The Department of Public Works has the highest count which is 5,523.
The bar graph above shows the number of records for each department name. The department with the most records is Public Works, with 5,523 records. The next highest is the Police Department, with 4,066 records. Some departments have so few records that it doesn’t look like they have any when compared to others on the bar graph. Overall, there is a wide array of how many records there are for each department.
The highlights on the map are centered in Southern Louisiana, Southeast Texas, Southeast Iowa, and Northern New Jersey. There are also highlights in some places in California, Wisconsin, Florida, and along the East Coast from Northern Virginia to Massachusetts. Overall, most of the highlights appear to be in the eastern part of the country, with the highest amount in Louisiana and Texas. This may be due to their proximity to water transportation. However, it seems more likely that the vendors are located in these areas because of the lower costs for production.