Business Analytics Lab Worksheet 05 (bsad

About

R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two earlier sets of data credistrisk and marketing.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. Starting with this worksheet, tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to the particular task section.

Task 1: Basic Visual Analytics in R

Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”

mydata = read.csv("data/marketing.csv")
head(mydata)

##   case_number State sales radio paper  tv pos
## 1           1    IL 11125    65    89 250 1.3
## 2           2    IL 16121    73    55 260 1.6
## 3           3    AZ 16440    74    58 270 1.7
## 4           4    AZ 16876    75    82 270 1.3
## 5           5    IL 13965    69    75 255 1.5
## 6           6    MI 14999    70    71 255 2.1

How to create a bar chart using categorical variable

# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the  command 
barplot(state_table)

1A) Repeat the above bar chart by adding proper labels to X and Y axis. See example below.

# Add labels to plot by replacing the ?? with a proper title
barplot(state_table, xlab= 'States', ylab = 'Number of Cases in a State' )

How to create a histogram

# Extract the TV column from the data and create a histogram by running the command hist(variable) 
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)

1B) Create a new histogram plot for Sales. Can you find the total cummulative sales from the histogram? Explain your answer

sales=mydata$sales
hist(sales)

You cannot find the total cumulative sales from the histogram, because a histogram shows a frequency of sales occuring over certain intervals of sales. In other words, this histogram shows that 1 sales data point falls between 10,000 and 12,000 dollars. Next, 3 data points fall between 12,000 and 14,000 dollars. From this histogram alone, you couldn’t find the exct amount of total sales that corresponds to the excel spreadsheet, because the data on the histogram is specific over intervals, not a specific amount of sales. You could estimate total sales depending on which intervals have a higher frequency. you could take the average of each interval, multiply it by the frequency, and sum those values to estimate what the total sales are. With this histogram, you would calculate (11,000 x 1) + (13,000 x 3) + (15,000 x 4) + (17,000 x 5) + (19,000 x 5) + (21,000 x 2) = 332,000 while the excel sheet shows a sum of 334,344.

How to create a pie chart

# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example we define a variable called x. 
x = c(2,3,4,5)
pie(x)

1C) Create a pie chart for variable state

state = c(state_table)
pie(state)

1D) Compare the pie chart to the earlier bar chart. Which one you think is a better comparative representation of the data and why?

Based on the data alone, I think the bar chart is easier to understand. Since many of the states has the same frequency of 3 cases, with the others being either 2 or 4, the pie chart slices aren’t very distinct. I think in this case, the pie chart is not as good of a representation because it doesn’t show clear differences between the frequencies, like the bar chart does. If there were greater differences in frequencies between the states, I think the pie chart would be better because the slices would be easier to distinguish and the slices wouldn’t be as equal.

Task 2 Scatter Plots & Correlation

The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot.

How to create a scatter plot

# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command 
scatter.smooth(radio,sales)

2A) Create three other separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what goes on the x-axis and the y-axis.

sales = mydata$sales
tv = mydata$tv
plot(tv,sales)

scatter.smooth(tv,sales)

sales = mydata$sales
paper = mydata$paper
plot(paper,sales)

scatter.smooth(paper,sales)

sales = mydata$sales
pos = mydata$pos
plot(pos,sales)

scatter.smooth(pos,sales)

2B) Share your observations on trends and potential relationships

The scatter plots and the line of best fit show that radio and tv appear to have a positive correlation with sales, where both radio and tv lines of best fit are upward sloping without many outliers. This indicates that higher advertising expenditures in tv and radio ads tends to correspond to higher sales. On the other hand, paper expenditures had a more downward sloping line of best fit, however there were significant outliers clustered around the line. This does not show a clear positive correlation or relaiton to sales, so we may be able to assume that paper is not the most effective approach to marketing because there is no clear relationship. There are likely other variables at play. POS ads also have scattered points, and the line of best fit curves up and down. There is no clear relationship between sales and expenditures of POS ads. The line of best fit is not reliable because the points it is measuring do not have a clear pattern, similar to the paper data. There isn’t a clear correlaiton with POS ads either.

To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.

How to compute correlation

cor(sales,radio)

## [1] 0.9771381

2C) Repeat the correlation calculation for the followinig each pair of variables (sales,tv), (sales,paper), and (sales,pos)

cor(sales,tv)

## [1] 0.9579703

cor(sales,paper)

## [1] -0.2830683

cor(sales,pos)

## [1] 0.0126486

2D) Which pair has the highest correlation? How do these results reconcile with the scatter plots observations?

Tv and radio have the highest correlation (they are the closest to 1, in the .90s). This is the same observation I made in with the scatter plot observations. Paper had negative correlation, and POS had very low correlation, which also agrees with my observations with the scatter plots. The line of best fit helped with determining the correlation.

Task 3 - Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.

– Download Tableau academic here: https://www.tableau.com/academic/students

– Refer to file ‘creditrisk.csv’ in the data folder

– Start Tableau and enter your LUC email if prompted.

– Import the file into Tableau. Choose the Text File option when importing

– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

The 20-25 age bin has the highest age count of abount 96 counts.

3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

The 20-25 age been has the highest divorce count. The divorce count is about 46.

Business Analytics Lab Worksheet 05 (bsad_lab05)

CME Group Foundation Business Analytics Lab

Manuela Bojdo

10/12/17

About

Setup

Note

Task 1: Basic Visual Analytics in R

1A) Repeat the above bar chart by adding proper labels to X and Y axis. See example below.

1B) Create a new histogram plot for Sales. Can you find the total cummulative sales from the histogram? Explain your answer

1C) Create a pie chart for variable state

1D) Compare the pie chart to the earlier bar chart. Which one you think is a better comparative representation of the data and why?

Task 2 Scatter Plots & Correlation

2A) Create three other separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what goes on the x-axis and the y-axis.

2C) Repeat the correlation calculation for the followinig each pair of variables (sales,tv), (sales,paper), and (sales,pos)

2D) Which pair has the highest correlation? How do these results reconcile with the scatter plots observations?

Task 3 - Basic Visual Analytics in Tableau

3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

Business Analytics Lab Worksheet 05 (bsad_lab05)

CME Group Foundation Business Analytics Lab

Manuela Bojdo

10/12/17

About

Setup

Note

Task 1: Basic Visual Analytics in R

1A) Repeat the above bar chart by adding proper labels to X and Y axis. See example below.

1B) Create a new histogram plot for Sales. Can you find the total cummulative sales from the histogram? Explain your answer

1C) Create a pie chart for variable state

1D) Compare the pie chart to the earlier bar chart. Which one you think is a better comparative representation of the data and why?

Task 2 Scatter Plots & Correlation

2A) Create three other separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what goes on the x-axis and the y-axis.

2B) Share your observations on trends and potential relationships

2C) Repeat the correlation calculation for the followinig each pair of variables (sales,tv), (sales,paper), and (sales,pos)

2D) Which pair has the highest correlation? How do these results reconcile with the scatter plots observations?

Task 3 - Basic Visual Analytics in Tableau

3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share your observations.