About

R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two earlier sets of data credistrisk and marketing.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions on Sakai. For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section. Quite often you will need to add your own code chunk.

Execute all code chunks, preview, publish, and submit link on Sakai.


Task 1: Basic Visual Analytics in R

Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”

mydata = read.csv("data/marketing.csv")
head(mydata)
##   case_number State sales radio paper  tv pos
## 1           1    IL 11125    65    89 250 1.3
## 2           2    IL 16121    73    55 260 1.6
## 3           3    AZ 16440    74    58 270 1.7
## 4           4    AZ 16876    75    82 270 1.3
## 5           5    IL 13965    69    75 255 1.5
## 6           6    MI 14999    70    71 255 2.1

How to create a bar chart using a categorical variable

# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the  command 
barplot(state_table)

##### 1A) Use the code chunk below to repeat the above bar chart by adding proper labels to X and Y axis.

# Add title and labels to plot by replacing the ?? with the proper wordings
barplot(state_table, main = 'State Table', xlab= 'States', ylab = 'Study cases per State' )

A more elegant representation of the bar plot would be to order the bars by increasing value. This is shown in the code chunk below.

# Order and execute
barplot(state_table[order(state_table)])

How to create a histogram

# Extract the TV column from the data and create a histogram by running the command hist(variable) 
# where variable corresponds to the extracted sales column variable
tv=mydata$tv
hist(tv)

##### 1B) Create a new histogram plot for Sales. Explain what the x-axis and y-axis represent. Can one derive the total cummulative sales from the histogram? Explain your answer.

sales = mydata$sales
hist(sales)

#The x-axis represents the intervals of sales. And the y-axis represents the frequency or number of occurence for each interval. 

#It is not possible to calculate the total cumulative sales from the histogram because it tells you the number of occurence within the intervales but not the amount of sales. It could be estimate if we knew the amount of sales per frequency but, with the data that is shown in the histogram it is not possible. We know that between 16000-18000 interval and 18000-20000 interval there are 5 occurences in each of them, but it is not possible to determine the exact amount of sales. We could only say that those two intervals have the highest number of occurence.

How to create a pie chart

# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example we define a variable called x. 
x = c(2,3,4,5)
pie(x) #size proportional to the number

##### 1C) Create a new pie chart for state count. Refer to variable state_table to capture the frequency count.

pie(state_table)

##### 1D) What does each slice of the pie represent? Compare the pie chart to earlier bar charts. Which type of charts is a better representation of the data and why so?

#Each slice of the pie represents the frequency of case studies per State.Each State is represented by each slice. 

#Comparing pie and bar charts, the slices with the same color represents the frequency of case studies for certain State (that is represented in that color). So, the slices are like the bars in the barchart. 

#For the purpose of better representation of the data, I would say that the bar chart is better because the results are displayed in a more visual way, ie. it is easier to differenciate the area of each bar than differentiate the area of each slice as the frequency do not differ much. Also, it is easier to determine which bar is larger in the bar chart while in the pie chart, the analysis of the slices is harder.

Task 2 Scatter Plots & Correlation

The previous task focused on visualizing one variable. A good way to visualize two variables and also very common is a scatter plot. A scatter plot is a good way to study relationships and trends.

How to create a scatter plot

# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis

sales = mydata$sales
radio = mydata$radio
plot(radio,sales)

# It is easier to see the trend and possible relationship by including a line that fit through the points.
# This is done with the command 
scatter.smooth(radio,sales)

##### 2A) Create three separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what variable goes on the x-axis and the y-axis.

TV = mydata$tv
plot(TV,sales)

scatter.smooth(TV, sales)

Paper = mydata$paper
plot(Paper,sales)

scatter.smooth(Paper, sales)

POS = mydata$pos
plot(POS,sales)

scatter.smooth(POS, sales)

##### 2B) Share your observations on trends and relationships. How do your observations reconcile with your findings from lab04?

#It seems that radio and sales have a very high positive correlation as Y increases when X increases. Moreover, the scatter-plot represents how most of the dots coincide with the line which is helps analyzing how the positive correlation. 

#TV and sales also have a positive correlation but not strong/high as radio and sales. Y also increases when X increases (positive correlation) but there are more deviations. 

#On the other hand, paper-sales has a low negative correlation because when X increases Y decreases in general. However, the correlation is low and there are many deviations so it is difficult to tell the level of correlation.

#Last but not least, There is no relationship between Pos and Sales as the dots are all over the chart. There is no pattern and the Y values sometimes increase when X increases but sometimes decreases (no correlation).

#Comparing to Lab04, the results, observations and findings reconcile with each other. 

As part of any data anlytics it is important to consider both qualitative and quantitative analysis. Scatter plots provide us with qualitative insights on possible trends and relationships. To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.

How to compute correlation

cor(sales,radio)
## [1] 0.9771381

##### 2C) Repeat the correlation calculation for the following pair of variables (sales,tv), (sales,paper), and (sales,pos)

cor(sales,TV)
## [1] 0.9579703
cor(sales,Paper)
## [1] -0.2830683
cor(sales, POS)
## [1] 0.0126486

##### 2D) Which pair has the highest correlation? How do these results reconcile with your scatter plots observations?

#The pair Sales-Radio has the highest correlation (0.977). Comparing to the scatter plots this result makes sense and reconcile with it because in the Sales-Radio scatter plot, the dots are closer to the fitting line (most of them are over the line) which means that the correlation is higher. 

#Furthermore, taking a look to the other 3 scatterplots, we could say that the Sales-TV has also high correlation as the dots increase in a positive slope line and following a pattern. But in the rest two scatter plots (paper and POS), the correlation is very vague as the dots are all over the plot and without following any pattern neither increasing when sales increase. 

Task 3 - Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers. Make sure to download the academic version and not the free limited trial version.

– Download Tableau academic here: https://www.tableau.com/academic/students

– Refer to file ‘creditrisk.csv’ in the data folder

– Start Tableau and enter your LUC email if prompted.

– Import the file into Tableau. Choose the Text File option when importing


– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

##### 3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?

#22 os the bin with the highest age count and the count is 97

##### 3B) Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?

#22 is the age bin with the highest divorce count. The count is 46

##### 3C) Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable ‘Gender’ into the Marks Color box. Capture a screen shot and include here. Share a story on what the data is telling us

#In general, there are more male employed than women (the orange dots overcome the blue ones). Furthermore, there are more people employed between the age of 20 to 40 years as the majority of the data lays on age between 20-40, whose months employed are between 0-40 (not many months employed, probably due to their young age and lack of experience). As the age increases, it seems that the months employed also increases. 

#However, there is not a clear correlation among the age and months employed. The female population is mostly focused on the age 20-40 and not many months employed (0-40), the reason for this could be the change in society and the empower of women and its entrance to the labor market.

##### 3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share a story on what the data is telling us.

#This plot shows that most of the population is male. As I said in the previous lab, all the women of the data are divorced which is very interesting. Personally, I do not think this is a great sample of the population as it does not represent the reality of society (there are women who ask for loans who are not divorced, and if not, there should be).  

#Furthermore, most of the men who ask for loans are single; very few are married; and the men who are divorced are less than 1/4 of the women who are divorced and ask for a loan. In general, it could be said that the two main target market segments are divorced female and single males as they both conformed the majority of the population.