Basic Visual Analytics (lab06)

About

R can be used for basic visual analytics, which is very helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using the two earlier sets of data for credist risk and marketing.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions provided, before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section. The red color is only apparent when in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (already included and own added), preview, check integrity, and submit final work (\(html\) file) in Sakai.

Task 1: Basic Visual Analytics in R

Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”

mydata = read.csv("marketing.csv")
head(mydata)

##   case_number State sales radio paper  tv pos
## 1           1    IL 11125    65    89 250 1.3
## 2           2    IL 16121    73    55 260 1.6
## 3           3    AZ 16440    74    58 270 1.7
## 4           4    AZ 16876    75    82 270 1.3
## 5           5    IL 13965    69    75 255 1.5
## 6           6    MI 14999    70    71 255 2.1

If we want to see the number of cases per state (state is a categorical variable), a bar chart is a good vizualization tool.

# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
state_table

## state
## AZ CA CO FL IL MI MN 
##  3  3  2  3  3  4  2

# Execute the  command 
barplot(state_table)

The length of the bar is the values (number of cases), and x-axis includes the vlaues (cateogries) of the categorical variable What is the sum of values of all bars?

##### 1A) Use the code chunk below to repeat the above bar chart by adding proper labels to X and Y axis

# Add title and labels to plot by replacing the ?? with the proper wordings
barplot(state_table, main = '??', xlab= 'State', ylab = '??' )

A more elegant representation of the bar plot would be to order the bars by increasing value. This is shown in the code chunk below.

# Order and execute

state_table[order(state_table)]

## state
## CO MN AZ CA FL IL MI 
##  2  2  3  3  3  3  4

barplot(state_table[order(state_table)])

If we want to study the frequency distribution of a numeric variable (e.g. spend on TV ads) Next we will create a histogram

# Extract the TV column from the data and create a histogram by running the command hist(variable) 
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)

As seen, the histogram differs from a bar plot in two distinctive ways: the x-axis is numeric (although not always necessary) broken in continuous intervals called bins, and the y-axis is representative of number of observations or frequency. hist function counts the values in each bin.

By default, bin counts include values less than or equal to the bin’s right break point and strictly greater than the bin’s left break point, except for the leftmost bin, which includes its left break point.

The number of bins can be adjusted to show a finer resolution or breakdown of intervals. This is demonstrated in the below example.

# this is better than before because it shows more detailed distribution, it gives me 15 bins instead of 20, why? 
hist(tv, breaks=20) 
# same result as previous (15 bins), why? Because the number we include after "Break =" is a suggestion for the number of bins in histogram, but the actual number of bins is calculated by a function that uses our suggested number.
hist(tv, breaks = 16)

# another way is to specify the break points
hist(tv, breaks = c(250,253,256,259,262,265,268,271,274,277,280)) # this generated exactly 10 bins (280-250)/3 = 10

#You don't need to think a lot about the number of bins, you can use trial and error until you find the level of details that gives you enough information about distribution of the numeric variable

##### 1B) Create a new histogram plot for Sales. Explain what the x-axis and y-axis represent. Can one derive the total cummulative sales from the histogram? Explain your answer

Another tool that is good to visualize values for one variable is Pie chart

# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example, we will Create a pie chart for state count. Refer to earlier defined variable `state_table` to capture the frequency count
pie(state_table)

We can use the bar plot and pie chart to compare the values between categories (the count between states)

##### 1c) What does each slice of the pie represent? Compare the pie chart to earlier bar charts. Which type of charts is a better representation of the data and why so?

Task 2 Scatter Plots & Correlation

The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot. A scatter plot is a good way to study relationships and trends.

How to create a scatter plot

# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command 
scatter.smooth(radio,sales) # this is not regression line

Compare the scatter plot above to the two plots below. Which one provides a better visual for the relationship between sales and radio?

new_data = mydata[order(sales),]
layout(matrix(1:2,1,2))
plot(new_data$sales, type = 'b')
plot(new_data$radio, type = 'b')

Scatter plot merges the two individual plots into one

##### 2A) Create three separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what variable goes on the x-axis and the y-axis

new_data = mydata[order(sales),]
plot(tv, sales)

scatter.smooth(tv,sales)

paper = mydata$paper
new_data = mydata[order(sales),]
plot(paper, sales)

scatter.smooth(paper,sales)

#Sales vs Pos
pos = mydata$pos
new_data = mydata[order(sales),]
plot(pos, sales)

scatter.smooth(pos,sales)

##### 2B) Share your observations on trends and relationships. How do your observations reconcile with your findings from lab05?

There seems to be a strong correlation between sales and tv ads, but there is a weak negative correlation between sales and paper ads. The relationship between sales and pos is less obvious though, with a shaky line that does not have a clear relationship. Regression would give a much better idea of the relationship here.

As part of any data analytics it is important to consider both qualitative and quantitative analysis. Scatter plots provide us with qualitative insights on possible trends and relationships.

To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.

To compute the correlation coefficient or strength between variables follow the below example:

cor(sales,radio)

## [1] 0.9771381

Let us review correlation slides in lecture notes before we answer the questions below’

##### 2C) Repeat the correlation calculation for the following pair of variables (sales,tv), (sales,paper), and (sales,pos)

cor(sales, tv)

## [1] 0.9579703

cor(sales, paper)

## [1] -0.2830683

cor(sales, pos)

## [1] 0.0126486

##### 2D) Which pair has the highest correlation? How do these results reconcile with your scatter plots observations?

The highest correlation is between sales and tv. This ligns up with our scatter plot observations perfectly, as the line we drew indicates this relationship even before we conducted any regression analysis.

Displaying sales: Now that we have reviewed a couple visualization techniques, take a look at the charts below; Which one do you think is better for displaying sales?

layout(matrix(1:4,2,2))
barplot(sales)
pie(sales)
hist(sales)
plot(sales)

We can easily see that bar plot and pie chart are not suited to display sales because they are only listing the value of sales in each of the 20 cases. The plot / trend line is used to show change overtime, but in our data we don’t have chronological order. Histogram on the other hand provides a good display for sales in our data because it provides some meaningful information (e.g. we can see that sales is slightly skewed to the left)

Also, compare the two charts below

layout(matrix(1:2,1,2))
barplot(state_table)
pie(state_table)

You can see that bar plot is better for visualizing the count in each state because the values are close to each other.

What about scale?

layout(matrix(1:2,1,2))
plot(sales, type = 'b', ylim = c(0,100000))
plot(sales, type = 'b')

Your selection of scale will determine the insights and patterns that you can see in the data.

In summary, there is no one good chart or technique for every scenario. You have to evaluate your data and pick the right visual case by case. It is an art and science

Task 3 - Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers. Make sure to download the academic version and not the free limited trial version.

– Download Tableau academic here: https://www.tableau.com/academic/students

– Refer to file ‘creditrisk.csv’ in the data folder

– Start Tableau and enter your LUC email if prompted.

– Import the file into Tableau. Choose the Text File option when importing

– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

##### 3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

##### 3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

##### 3C) Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable ‘Gender’ into the Marks Color box. Capture a screen shot and include here. Share insights on what the data is telling us

##### 3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share insights on what the data is telling us

Visual

This data is telling us that the grand majority of loans being put out are by single men or by divorced women. But there seems to be no single women taking out loans for some reason.