basd_lab05

About

R can be used to make basic visual analytics, which can be helpful in understanding the data holistically. Additionally, R can help find correlations between variables and create scatter plots.

Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two earlier sets of data credistrisk and marketing.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

PART I: Basic Visual Analytics in R

Read the file marketing.csv and make sure all the columns are captured by checking the first couple rows of the dataset “head(mydata)”

mydata = read.csv("data/marketing.csv")
head(mydata)

##   case_number State sales radio paper  tv pos
## 1           1    IL 11125    65    89 250 1.3
## 2           2    IL 16121    73    55 260 1.6
## 3           3    AZ 16440    74    58 270 1.7
## 4           4    AZ 16876    75    82 270 1.3
## 5           5    IL 13965    69    75 255 1.5
## 6           6    MI 14999    70    71 255 2.1

Create a bar chart using categorical variable:

# Extract the State column from mydata.
mystate = mydata$State
# Create a frequency table to extract the count for each state.
state_table = table(mystate)
# Create the bar chart.
barplot(state_table)

# Add titles for x and y axis in the chart.
barplot(state_table, xlab= 'State', ylab = 'Number of Cases' )

Create a pie chart:

# The command to create a pie chart is pie(variable) where  variable is in reference to the particular column extracted from the file. In this example we define a variable called x. 
x = c(2,3,4,5)
pie(x)

TASK 1A. Create a pie chart for variable state_table (which contains the count for each state).

pie(state_table)

TASK 1B. Compare the pie chart to the earlier bar chart of the State data. Which one you think is a better comparative representation of the State data and why?

ANSWER: The bar chart is more efficient to look at then the pie chart. The bar chart provides the viewrs with some numbers while the pie chart is simply based on the size of the piece. In this case the bar chart is easier to read and understand and therefore is a more efficient chart.

Create a histogram for sales data:

# Extract the sales column from the data and create a histogram by running the command hist(variable)  where variable corresponds to the extracted sales column variable.
mysales=mydata$sales
hist(mysales)

TASK 2. Can you find the total cummulative sales from the histogram? Explain your answer

ANSWER: 334344, we inserted an r chunk and used the sum formula to find the total cummulaive sales from the histogram.

sum(mysales)

## [1] 334344

TASK 3. Create four new histograms plot for radio, paper, tv, and pos data.

layout(matrix(1:4,2,2)) #Layout allows you to see all the 4 histograms that you're creating below on one frame. Make sure to run the codes to create all the 4 histograms in the same chunk so they are on the same layout.
myradio=mydata$radio
hist(myradio)
mypaper=mydata$paper
hist(mypaper)
mytv=mydata$tv
hist(mytv)
mypos=mydata$pos
hist(mypos)

PART II: Scatter Plots & Correlation

The previous task focused on visualizing one variable. A good way and also a common way to visualize two variables is using a scatter plot.

Create a scatter plot:

# Plot sales data vs. radio data. Sales will be on the y-axis & Radio will be on the x-axis.
# Note: we already extracted sales data into mysales variable when creating a histogram for sales data. Below we extract radio data and then plot sales and radio data using a scatter plot.
plot(myradio,mysales)

# It is easier to see the trend and possible relationship by including a line that fit through the data points. This is done using the following command.
scatter.smooth(myradio,mysales) #The scatter plot shows a positive relationship (positive trend) between ads spent on radio and the sales; i.e. as more money spent on radio ads more sales is generated.

TASK 4A. Create three other separate scatter plots for Sales vs. tV data, Sales vs. paper data, and Sales vs. pos data. Include the best fitting line in each scatter plot. Sales should be on the y-axis in each scatter plot.

layout(matrix(1:4,2,2)) #Layout allows you to see all the 3 scatter plots that you're creating below on one frame. Make sure to run the codes to create all the 3 scatter plots in the same chunk so they are on the same layout.
plot(mysales,mytv)
scatter.smooth(mysales,mytv)
plot(mysales,mypaper)
scatter.smooth(mysales,mypaper)

plot(mysales,mypos)
scatter.smooth(mysales,mypos)

TASK 4B. Share your observations on trends (potential relationships) between sales-tv, sales-paper, and sales-pos.

ANSWER: There was a positive relationship between the increasing number of tv’s and the number of sales, this graph has experienced a consistent growth gradually. There is a negative relationship between the amount of of paper and sales, the increase in sales leads to a decreas in paper. The relationship between pos and sales was relatively steady at the beginning before experiencing a small decrease before a sharp rise in correlation. there is not a clear relationship between these two and the data seems voliatile.

To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.

Below is the code to compute correlation betwen sales-pos (remember the variables that keep the sales and pos data are mysales and mypos respectively):

cor(mysales,mypos)

## [1] 0.0126486

TASK 5A. Compute the correlation between sales-radio, sales-tv, and sales-paper.

cor(mysales,myradio)

## [1] 0.9771381

cor(mysales,mytv)

## [1] 0.9579703

cor(mysales,mypaper)

## [1] -0.2830683

TASK 5B. Which pair has the highest correlation? How do these results reconcile with the scatter plots observations?

ANSWER: sales-radio has the highest correlation of the graphs. This makes sense and lines up with the graph, as radio increased so did the sales, this increase was steady and consistent throughout the graph. That being saod, sales-tv correlation was very close and showed similar behaviors in the graph.

PART III: Basic Visual Analytics in Tableau

Follow the directions on the worksheet, download tableau academic on your personal computer or use one of the labs computers.

– Download Tableau academic here: https://www.tableau.com/academic/students

– Refer to file ‘creditrisk.csv’ in the data folder

– Start Tableau and enter your LUC email if prompted.

– Import the file into Tableau. Choose the Text File option when importing

Click on Sheet 1 (bottom left). – Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.

– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.

– Create a new sheet by clicking on the icon in the bottom next to your current sheet.

##TASK 6A. Go to new worksheet. Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here.

###TASK 6B. Which age bin has the highest age count and what is the count? ANSWER: Age bin 22-25; 97 is the count

###TASK 7A. Drag-drop the variable ’Marital Status’found under Dimensions into the Marks Color box (make sure to drop in the Marks Color box until you see the small plus mark). Capture a screen shot and include here.

###TASK 7B. Which age bin has the highest divorce count and what is the count? ANSWER: Age bin: 22-25; 46 divorced

###TASK 8A. Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by drag-drop the dimension variable ‘Gender’ into the Marks Color box (until you see the small plus mark). Capture a screen shot and include here.

###TASK 8B. Share your observations ANSWER: The data seems to have more males than females and the overall data is all over the place. there is no clear line of best fit and the data seems rather random. The younger you are and the less months you have been employeed the higher your chances of credit risk. much of our data participants fall int this category.

###TASK 9A. In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here.

###TASK 9B. Share your observations. ANSWER: All women that have been divorced are facing credit risk wile there is a number of single, married, and divorced males that are facing credit risk. More males face credit risk than females based on this data.

###TASK 10A. Create another new sheet. Double-click ‘Housing’ and then double-click ‘Number of Records’. Add another variable to the graph by drag-drop the dimension variable ‘Gender’ into the Marks Color box (until you see the small plus mark). Click on staked bars. Capture a screen shot and include here.

###TASK 10B. Share your observations. ANSWER: There are many more homeowners than renters or other. Male homeowners outnumber the female homeowners. The ratio of male to female renters is relatively close. Few people fall under the category other and the majoity of those members are men.