About

R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two earlier sets of data credistrisk and marketing.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Starting with this worksheet, tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to the particular task section.


Task 1: Basic Visual Analytics in R

Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”

mydata = read.csv("data/marketing.csv")
head(mydata)
##   case_number State sales radio paper  tv pos
## 1           1    IL 11125    65    89 250 1.3
## 2           2    IL 16121    73    55 260 1.6
## 3           3    AZ 16440    74    58 270 1.7
## 4           4    AZ 16876    75    82 270 1.3
## 5           5    IL 13965    69    75 255 1.5
## 6           6    MI 14999    70    71 255 2.1

How to create a bar chart using categorical variable

# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the  command 
barplot(state_table)

1A) Repeat the above bar chart by adding proper labels to X and Y axis. See example below.
# Add labels to plot by replacing the ?? with a proper title
barplot(state_table, xlab= 'State', ylab = 'Frequency' )

How to create a histogram

# Extract the TV column from the data and create a histogram by running the command hist(variable) 
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)

1B) Create a new histogram plot for Sales. Can you find the total cummulative sales from the histogram? Explain your answer
sales=mydata$sales
hist(sales)

One cannot find the total cumulative sales from the histogram because the frequencies are applied to different ranges, instead of specific values. For instance, although we do know that there are three values from the “Sales” column in between the range 12000 - 14000, we do not know the exact numbers. Is the value 12500, 13500, or any number in between the range? There are many possibilities. Therefore, since the values are not specific, we cannot determine an accurate total cumulative sales. In other words, one can find the total number of observations from this observation, but one cannot find the total cumulative sales because the values are not specifically given, but are divided into ranges.

How to create a pie chart

# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example we define a variable called x. 
x = c(2,3,4,5)
pie(x)

1C) Create a pie chart for variable state
x = c(3, 3, 2, 3, 3, 4, 2)
labels = c("AZ","CA","CO", "FL","IL","MI","MN")
pie(x,labels)

1D) Compare the pie chart to the earlier bar chart. Which one you think is a better comparative representation of the data and why?

The earlier bar chart is a better representation of the data because it better displays the distribution of values across the different states. It is more obvious to observe the differences between the frequencies of the states on the bar chart than on the pie chart. This is because the sections of the pie chart are so close in size that it is difficult to differetentiate between the values, while the bar chart is labeled with the specific frequency for each.

Task 2 Scatter Plots & Correlation

The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot.

How to create a scatter plot

# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command 
scatter.smooth(radio,sales)

2A) Create three other separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what goes on the x-axis and the y-axis.
# Plot Sales vs. TV
# TV will be on the x-axis
# Sales will be on the y-axis
 
sales = mydata$sales
tv = mydata$tv
plot(tv,sales)

scatter.smooth(tv,sales)

# Plot Sales vs. Paper
# Paper will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
paper = mydata$paper
plot(paper,sales)

scatter.smooth(paper,sales)

# Plot Sales vs. Pos
# Pos will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
pos = mydata$pos
plot(pos,sales)

scatter.smooth(pos,sales)

2C) Repeat the correlation calculation for the followinig each pair of variables (sales,tv), (sales,paper), and (sales,pos)
cor(sales,tv)
## [1] 0.9579703
cor(sales,paper)
## [1] -0.2830683
cor(sales,pos)
## [1] 0.0126486
2D) Which pair has the highest correlation? How do these results reconcile with the scatter plots observations?

The pair of variables: sales and radio has the highest correlation, with a calculation of approximately 0.98 (closest to 1). These results reconcile with the scatter plot observations which shows a smooth, upward line that fits pretty much through all, or almost all of the data for the Sales vs. Radio scatter plot.


Task 3 - Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.

– Download Tableau academic here: https://www.tableau.com/academic/students

– Refer to file ‘creditrisk.csv’ in the data folder

– Start Tableau and enter your LUC email if prompted.

– Import the file into Tableau. Choose the Text File option when importing


– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

The age bin with the highest age count is age bin 22, with a count of 97.

Histogram of age - Displays age distribution

Histogram of age - Displays age distribution

3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

The age bin with the highest divorce count is age bin 22, with a count of 46.

Marital Status vs. Age - Represents the relationship between marital status and age.

Marital Status vs. Age - Represents the relationship between marital status and age.

3C) Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable ‘Gender’ into the Marks Color box. Capture a screen shot and include here. Share your observations

Based on the scatter plot, one can see that a majority of females (represented by the blue circles) tend to be located in the lower, left-hand corner of the plot. Meaning that most women tend to have worked approximately for 60 months and are below the age of 45. One can then assume that as women are getting older, they tend to work less. Males, on the other hand, seem to show a direct and positive relationship between age and months employed.

Months employed vs. Age vs. Gender - Represents relationship between the number of months an individual is employed, their age, and their gender.

Months employed vs. Age vs. Gender - Represents relationship between the number of months an individual is employed, their age, and their gender.

3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share your observations.

The horizontal bar chart displays the relationship between gender, marital status, and number of records. The first observation that can be made is that single males are attributed with the msot number of records, followed by divorced females. This can be translated to an assumption that most loans are being distributed to single males. Moreover, it is interesting that all females fall into the “Divorced” category, while males asking for loans fall into different categories from “Divorced” to “Married” and “Single.”

Gender vs. Marital Status vs. Number of Records

Gender vs. Marital Status vs. Number of Records