About
R can be used for basic visual analytics, which is very helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.
Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using the two earlier sets of data for credist risk and marketing.
Setup
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.
Note
For your assignment you may be using different data sets than what is included here. Always read carefully the instructions provided, before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section. The red color is only apparent when in Preview mode. Quite often you will need to add your own code chunk.
Execute all code chunks (already included and own added), preview, check integrity, and submit final work (\(html\) file) in Sakai.
Task 1: Basic Visual Analytics in R
Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”
mydata = read.csv("marketing.csv")
head(mydata)
How to create a bar chart using a categorical variable
# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the command
barplot(state_table)

##### 1A) Use the code chunk below to repeat the above bar chart by adding proper labels to X and Y axis (1pt)
barplot(state_table, main = 'State Frequency Bar Chart', xlab= 'State', ylab = 'Frequency' )

# Add title and labels to plot by replacing the ?? with the proper wordings
barplot(state_table, main = '??', xlab= '??', ylab = '??' )

A more elegant representation of the bar plot would be to order the bars by increasing value. This is shown in the code chunk below.
# Order and execute
barplot(state_table[order(state_table)])

Next we will create a histogram
# Extract the TV column from the data and create a histogram by running the command hist(variable)
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)

As seen, the histogram differs from a barplot in two distinctive ways: the x-axis is numeric (although not always necessary) broken in continuos intervals called bins, and the y-axis is representative of number of observations or frequency. The number of bins can be adjusted to show a finer resolution or breakdown of intervals. This is demonstrated in the below example.
hist(tv, breaks=20)
##### 1B) Create a new histogram plot for Sales. Explain what the x-axis and y-axis represent. Can one derive the total cummulative sales from the histogram? Explain your answer (2pts)
sales=mydata$sales
hist(sales)

x-axis represents diferent ranges for the amount of dollars in sales that were generated and the y-axis represents the frequency for which sales generated a dollar amount in a certain range.No, one cannot derive the total cumulative sales from the histogram because the histogram only graphs a range of dollar amounts that the sale amount falls in, not the exact amount. Thus, the histogram can give a good estimate of the total cumulative sales, but it will not be exact.
How to create a pie chart
# The command to create a pie chart is pie(variable) where variable is in reference to the particular column extracted from the file. In this example we define a variable called x.
x = c(2,3,4,5)
pie(x)

##### 1C) Create a new pie chart for state count. Refer to earlier defined variable state_table to capture the frequency count (1pt)
state=mydata$state
state = c(2,3,4)
pie(state_table)

##### 1D) What does each slice of the pie represent? Compare the pie chart to earlier bar charts. Which type of charts is a better representation of the data and why so? (1pt) Each slice of the pie represents how frequent the state appears in the data set compared to the total number of frequencies. The larger the slice of the pie, the more frequent the state appears in the data set and the larger its proportion to the data set as a whole. I think the bar chart is a better representation of the data because it is able to more accurately depict the values visually. I feel like you compare the differnt pieces of data within the set better and have an easier time knowing the exact amounts with the bar chart. ———-
Task 2 Scatter Plots & Correlation
The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot. A scatter plot is a good way to study relationships and trends.
How to create a scatter plot
# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis
sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command
scatter.smooth(radio,sales)

##### 2A) Create three separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what variable goes on the x-axis and the y-axis (2pts)
sales = mydata$sales
tv = mydata$tv
paper=mydata$paper
pos=mydata$pos
plot(tv,sales)

scatter.smooth(tv,sales)

plot(paper,sales)

scatter.smooth(paper,sales)

plot(pos,sales)

scatter.smooth(pos,sales)

##### 2B) Share your observations on trends and relationships. How do your observations reconcile with your findings from lab05? (1pt) Sales and tv has a very high positive correlation. Sales and paper has a low negative correlation. And sales and pos has a very low positive correlation. The observations resemble those in lab05 because as TV ads increased the more sales were made creating a positive relationship, similar to the one here. And the paper sales had a fluxuating relationship that created a low negative relationship similar to the one here as well.
As part of any data anlytics it is important to consider both qualitative and quantitative analysis. Scatter plots provide us with qualitative insights on possible trends and relationships. To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.
To compute the correlation coefficient or strength between variables follow the below example
cor(sales,radio)
[1] 0.9771381
##### 2C) Repeat the correlation calculation for the following pair of variables (sales,tv), (sales,paper), and (sales,pos) (1pt)
cor(sales,tv)
[1] 0.9579703
cor(sales,paper)
[1] -0.2830683
cor(sales,pos)
[1] 0.0126486
##### 2D) Which pair has the highest correlation? How do these results reconcile with your scatter plots observations? (1pt) Sales and TV have the highest correlation. And they reconcile with the scatter plot observations well because these results are representative of the plots and they give a visual representation of the calculations. ———-
Task 3 - Basic Visual Analytics in Tableau
Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers. Make sure to download the academic version and not the free limited trial version.
– Download Tableau academic here: https://www.tableau.com/academic/students
– Refer to file ‘creditrisk.csv’ in the data folder
– Start Tableau and enter your LUC email if prompted.
– Import the file into Tableau. Choose the Text File option when importing

– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

##### 3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count? (2pts) 
The ‘22’ age bin has the highest age count and the count is 97. ##### 3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count? (2pts) 
The age bin with the highest divorce rate is 22 and the count is 46.

##### 3C) Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable ‘Gender’ into the Marks Color box. Capture a screen shot and include here. Share insights on what the data is telling us (3pts) 
The data here is giving us a visual breakdown of people at different ages and how long they were employed, with the colors blue and orange signifying males and females respectively.In this specific scatter plot, age is on the x-axis, months employed is on the y-axis, and gender is broken down by the different colors.

##### 3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share insights on what the data is telling us (3pts) 
Here the data is breaking down three variables: gender, number of rrecords, and marital status. The gender is on the x-axis which makes 3 columns. Number of records are on the y-axis and the blue and orange colors signify females and males respectively.
---
title: "Basic Visual Analytics (lab06)"
author: "yacheng song"
date: "20202/27"
output:
  html_notebook: default
  html_document: default
subtitle: BSAD 343, Business Analytics, Spring 2020
---

### About

R can be used for basic visual analytics, which is very helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots. 

Tableau is a tool more tailored for visual analytics, while R is a  powerful tool for statistics and other advanced topics in data analytics. In this lab  we will explore both capabilities using the two earlier sets of data for credist risk and marketing. 

### Setup

Remember to always set your working directory to the source file location. Go to 'Session', scroll down to 'Set Working Directory', and click 'To Source File Location'. Read carefully the below and follow the instructions to complete the tasks and answer any questions.  Submit your work in Sakai as detailed in previous notes. 

### Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions provided, before executing any included code chunks and/or adding your own code.  For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section.  The red color is only apparent when in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (already included and own added), preview, check integrity, and submit final work ($html$ file) in Sakai.

--------------


### Task 1: Basic Visual Analytics in R

Read the file `marketing.csv` and make sure all the columns are captured by checking the first couple  rows of the dataset "head(mydata)"

```{r}
mydata = read.csv("marketing.csv")
head(mydata)
```

How to create a bar chart using a categorical variable

```{r}
# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the  command 
barplot(state_table)
```

<span style="color:red">
##### 1A) Use the code chunk below to repeat the above bar chart by adding proper labels to X and Y axis (1pt)
</span>
```{r}
barplot(state_table, main = 'State Frequency Bar Chart', xlab= 'State', ylab = 'Frequency' )
```

```{r}
# Add title and labels to plot by replacing the ?? with the proper wordings
barplot(state_table, main = '??', xlab= '??', ylab = '??' )
```

A more elegant representation of the bar plot would be to order the bars by increasing value.  This is shown in the code chunk below.

```{r}
# Order and execute
barplot(state_table[order(state_table)])
```


Next we will create a histogram 
```{r}
# Extract the TV column from the data and create a histogram by running the command hist(variable) 
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)
```

As seen, the histogram differs from a barplot in two distinctive ways: the x-axis is numeric (although not always necessary) broken in continuos intervals called bins, and the y-axis is representative of number of observations or frequency.  The number of bins can be adjusted to show a finer resolution or breakdown of intervals.  This is demonstrated in the below example.

```{r}
hist(tv, breaks=20)
```

<span style="color:red">
##### 1B) Create a new histogram plot for Sales. Explain what the x-axis and y-axis represent. Can one derive the total cummulative sales from the histogram? Explain your answer (2pts)
</span>
```{r}
sales=mydata$sales
hist(sales)
```
x-axis represents diferent ranges for the amount of dollars in sales that were generated and the y-axis represents the frequency for which sales generated a dollar amount in a certain range.No, one cannot derive the total cumulative sales from the histogram because the histogram only graphs a range of dollar amounts that the sale amount falls in, not the exact amount. Thus, the histogram can give a good estimate of the total cumulative sales, but it will not be exact.


How to create a pie chart
```{r}
# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example we define a variable called x. 
x = c(2,3,4,5)
pie(x)
```

<span style="color:red">
##### 1C) Create a new pie chart for state count. Refer to earlier defined variable `state_table` to capture the frequency count (1pt)
</span>
```{r}
state=mydata$state
state = c(2,3,4)
pie(state_table)
```

<span style="color:red">
##### 1D) What does each slice of the pie represent? Compare the pie chart to earlier bar charts. Which type of charts is a better representation of the data and why so? (1pt)
</span>
Each slice of the pie represents how frequent the state appears in the data set compared to the total number of frequencies. The larger the slice of the pie, the more frequent the state appears in the data set and the larger its proportion to the data set as a whole.
I think the bar chart is a better representation of the data because it is able to more accurately depict the values visually. I feel like you compare the differnt pieces of data within the set better and have an easier time knowing the exact amounts with the bar chart.
----------

### Task 2 Scatter Plots & Correlation 

The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot. A scatter plot is a good way to study relationships and trends.

How to create a scatter plot
```{r}
# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command 
scatter.smooth(radio,sales)
```

<span style="color:red">
##### 2A) Create three separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what variable goes on the x-axis and the y-axis (2pts)
</span>
```{r}
sales = mydata$sales
tv = mydata$tv
paper=mydata$paper
pos=mydata$pos
plot(tv,sales)
```
```{r}
scatter.smooth(tv,sales)
```
```{r}
plot(paper,sales)
```
```{r}
scatter.smooth(paper,sales)
```
```{r}
plot(pos,sales)
```
```{r}
scatter.smooth(pos,sales)
```

<span style="color:red">
##### 2B) Share your observations on trends and relationships. How do your observations reconcile with your findings from lab05? (1pt)
</span>
Sales and tv has a very high positive correlation. Sales and paper has a low negative correlation. And sales and pos has a very low positive correlation. The observations resemble those in lab05 because as TV ads increased the more sales were made creating a positive relationship, similar to the one here. And the paper sales had a fluxuating relationship that created a low negative relationship similar to the one here as well.

As part of any data anlytics it is important to consider both qualitative and quantitative analysis.  Scatter plots provide us with qualitative insights on possible trends and relationships.  To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.

To compute the correlation coefficient or strength between variables follow the below example

```{r}
cor(sales,radio)
```

<span style="color:red">
##### 2C) Repeat the correlation calculation for the following pair of variables  (sales,tv), (sales,paper), and (sales,pos) (1pt)
</span>
```{r}
cor(sales,tv)
```
```{r}
cor(sales,paper)
```
```{r}
cor(sales,pos)
```

<span style="color:red">
##### 2D) Which pair has the highest correlation?  How do these results reconcile with your scatter plots observations? (1pt)
</span>
Sales and TV have the highest correlation. And they reconcile with the scatter plot observations well because these results are representative of the plots and they give a visual representation of the calculations.
----------

### Task 3 - Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers. Make sure to download the academic version and not the free limited trial version.

-- Download Tableau academic here: https://www.tableau.com/academic/students

-- Refer to file 'creditrisk.csv' in the data folder

-- Start Tableau and enter your LUC email if prompted.

-- Import the file into Tableau. Choose the Text File option when importing

![](tableau_importfile.png)

----------------------

-- Under the dimensions tab located on the left side of the screen DOUBLE click on the 'Loan Purpose', then DOUBLE click on 'Number of Records' variable located under Measures on the bottom left of the screen. 

![](tableau_variables.png)



-- From the upper right corner of the screen select the horizontal bars view and note the new chart.  Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.  

![](tableau_showme.png)


-- Create a new sheet by clicking on the icon in the bottom next to your current sheet.

![](tableau_newsheet.png)

<span style="color:red">
##### 3A) Double-click on the 'Age' variable in Measures and select the 'Histogram' view.  Capture a screen shot and include here.  Which age bin has the highest age count and what is the count? (2pts)
</span>
![](screenshot1.png)

The ‘22’ age bin has the highest age count and the count is 97.
<span style="color:red">
##### 3B) Drag-drop the variable 'Marital Status'found under Dimensions into the Marks Color box.  Capture a screen shot and include here.  Which age bin has the highest divorce count and what is the count?  (2pts)
</span>
![](screenshot2.png)

The age bin with the highest divorce rate is 22 and the count is 46.

![](tableau_marks.png)


<span style="color:red">
##### 3C) Create another new sheet. Double-click 'Months Employed' and then double-click 'Age'. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable 'Gender' into the Marks Color box. Capture a screen shot and include here. Share insights on what the data is telling us (3pts)
</span>
![](screenshot3.png)

The data here is giving us a visual breakdown of people at different ages and how long they were employed, with the colors blue and orange signifying males and females respectively.In this specific scatter plot, age is on the x-axis, months employed is on the y-axis, and gender is broken down by the different colors.

![](tableau_dimension.png)




<span style="color:red">
##### 3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope.   Capture a screen shot and include here. Share insights on what the data is telling us (3pts)
</span>
![](screenshot4.png)

Here the data is breaking down three variables: gender, number of rrecords, and marital status. The gender is on the x-axis which makes 3 columns. Number of records are on the y-axis and the blue and orange colors signify females and males respectively.
----------------------------------------------------------------------


