R can be used to create basic visualizations that help in understanding the data as a whole. R can also compute correlations between variables and draw scatter plots.
Tableau is a tool tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two earlier data sets, creditrisk and marketing.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully, complete the tasks, and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than the ones included here. Always read the instructions on Sakai carefully. Starting with this worksheet, tasks and questions to be completed or answered are highlighted in larger bolded fonts and numbered according to the particular task section.
Read the file marketing.csv and make sure all the columns are captured by checking the first few rows of the dataset with “head(mydata)”.
mydata = read.csv("data/marketing.csv")
head(mydata)
## case_number State sales radio paper tv pos
## 1 1 IL 11125 65 89 250 1.3
## 2 2 IL 16121 73 55 260 1.6
## 3 3 AZ 16440 74 58 270 1.7
## 4 4 AZ 16876 75 82 270 1.3
## 5 5 IL 13965 69 75 255 1.5
## 6 6 MI 14999 70 71 255 2.1
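Besides head(), a quick structural check can confirm that every column was read. The sketch below is only illustrative: it rebuilds the six rows shown above as a small data frame (the name sample_data and the expected_cols vector are assumptions made for this example, not part of the lab files) so that it runs without the CSV.

```r
# Six rows copied from the head() output above; the real file has more rows.
sample_data <- data.frame(
  case_number = 1:6,
  State = c("IL", "IL", "AZ", "AZ", "IL", "MI"),
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

# Column names we expect, taken from the header line of the output above.
expected_cols <- c("case_number", "State", "sales", "radio", "paper", "tv", "pos")

# Any expected column that is absent from the data frame ends up here.
missing_cols <- setdiff(expected_cols, names(sample_data))

dim(sample_data)  # number of rows and columns
str(sample_data)  # type and first values of each column
```

With the real file, the same checks can be applied to mydata after read.csv().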
How to create a bar chart using a categorical variable
# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the command
barplot(state_table)
# Add axis labels to the plot with xlab and ylab
barplot(state_table, xlab= 'State', ylab = 'Frequency')
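A bar chart is often easier to read when the bars are ordered by height. The following is a minimal sketch of that idea, using only the six State values visible in the head() output above, so the counts differ from those of the full file. The variable names and the title text are chosen for illustration.

```r
# State values copied from the first six rows shown earlier (not the full data set).
state <- c("IL", "IL", "AZ", "AZ", "IL", "MI")

# Frequency table of the states, sorted from most to least frequent.
state_table <- table(state)
sorted_table <- sort(state_table, decreasing = TRUE)

# Bar chart with axis labels and a title.
barplot(sorted_table,
        xlab = "State",
        ylab = "Frequency",
        main = "Records per state (first six rows only)")
```

On the full data, replacing the hand-typed vector with mydata$State should give the same kind of chart.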
How to create a histogram
# Extract the tv column from the data and create a histogram by running the command hist(variable),
# where variable is the extracted column; then repeat for the sales column
tv=mydata$tv
hist(tv)
sales=mydata$sales
hist(sales)
sum(sales)
## [1] 334344
You can find the total value of sales from the histogram in two ways. First, you can multiply each value on the x-axis by its corresponding count on the y-axis and sum the products. However, this only gives a range for total sales, because each sales value lies somewhere within its bin rather than exactly at a bin edge. Using the lower edge of every bin gives a minimum of $312,000, and using the upper edge gives a maximum of $352,000. To find the true value, take the shortcut and call ‘sum(sales)’. The result is displayed above ($334,344) and lies between the histogram bounds.
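The bounding argument can be reproduced in R by asking hist() for its bin edges and counts instead of a plot. The sketch below uses only the six sales values visible in the head() output, so its numbers will not match the $312,000 and $352,000 bounds quoted for the full data; the variable names are illustrative.

```r
# Sales values copied from the first six rows shown earlier (not the full data set).
sales <- c(11125, 16121, 16440, 16876, 13965, 14999)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(sales, plot = FALSE)

# Each bin runs from a lower edge to an upper edge.
n_breaks <- length(h$breaks)
lower_edges <- h$breaks[-n_breaks]
upper_edges <- h$breaks[-1]

# Pretend every value sits at the bottom (or top) of its bin.
lower_bound <- sum(lower_edges * h$counts)
upper_bound <- sum(upper_edges * h$counts)

# The exact total, for comparison.
total <- sum(sales)

c(lower = lower_bound, exact = total, upper = upper_bound)
```

Because every value falls inside its bin, the exact total always lands between the two bounds.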
How to create a pie chart
# The command to create a pie chart is pie(variable), where variable refers to the column extracted from the file. In this example we define a variable called x.
x = c(2,3,4,5)
pie(x)
x = c(3, 3, 2, 3, 3, 4, 2)
labels = c("AZ", "CA", "CO", "FL", "IL", "MI", "MN")
pie(x, labels = labels)
COMPARISON
Examining both data visualizations, the bar chart provides a better comparative representation of the data. This is because the bar chart displays the distribution of the values across the different states more clearly. It’s too hard to differentiate between the values in the pie chart because the slices are close in size.
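One way to check this comparison is to draw both charts next to each other from the same counts. The sketch below reuses the counts and state labels from the pie chart example; the percentage labels and panel titles are additions for illustration only.

```r
# Counts and labels taken from the pie chart example above.
counts <- c(3, 3, 2, 3, 3, 4, 2)
states <- c("AZ", "CA", "CO", "FL", "IL", "MI", "MN")
names(counts) <- states

# Share of each state, as a whole-number percentage.
pct <- round(100 * counts / sum(counts))

# Put two plots side by side, remembering the old layout so it can be restored.
old_par <- par(mfrow = c(1, 2))
barplot(counts, xlab = "State", ylab = "Frequency", main = "Bar chart")
pie(counts, labels = paste0(states, " (", pct, "%)"), main = "Pie chart")
par(old_par)
```

Side by side, the small differences between counts of 2, 3, and 4 tend to be easier to judge from bar heights than from slice angles.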
The previous task focused on visualizing one variable. A scatter plot is a common and effective way to visualize two variables together.
How to create a scatter plot
#Plot Sales vs. Radio
#Radio will be on the x-axis
#Sales will be on the y-axis
sales = mydata$sales
radio = mydata$radio
plot(radio,sales)
#It is easier to see the trend and possible relationship by including a line that fits through the points.
#This is done with the command
scatter.smooth(radio,sales)
sales = mydata$sales
tv = mydata$tv
plot(tv,sales)
scatter.smooth(tv,sales)
sales = mydata$sales
paper = mydata$paper
plot(paper, sales)
scatter.smooth(paper, sales)
sales = mydata$sales
pos= mydata$pos
plot(pos, sales)
scatter.smooth(pos,sales)
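The four plot calls above follow the same pattern, so they can also be written as a loop that draws all four panels in one figure. The sketch below is a variation, not the worksheet's method: it uses only the six rows from the head() output, and it draws a straight least-squares line with lm() and abline() instead of the smoothed curve that scatter.smooth() produces. The names ads, predictors, and slopes are illustrative.

```r
# Six rows copied from the head() output shown earlier (not the full data set).
ads <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

predictors <- c("radio", "tv", "paper", "pos")
slopes <- numeric(0)

# 2 x 2 grid of plots; keep the old layout so it can be restored afterwards.
old_par <- par(mfrow = c(2, 2))
for (p in predictors) {
  # Scatter plot with the predictor on the x-axis and sales on the y-axis.
  plot(ads[[p]], ads$sales, xlab = p, ylab = "sales")

  # Straight-line fit of sales on this predictor (a swap for scatter.smooth).
  fit <- lm(reformulate(p, response = "sales"), data = ads)
  abline(fit)
  slopes[p] <- coef(fit)[[2]]
}
par(old_par)

slopes  # slope of each fitted line
```

With the full data set, replacing ads with mydata should produce the same four panels on all rows.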
cor(sales,tv)
## [1] 0.9579703
cor(sales, paper)
## [1] -0.2830683
cor(sales, pos)
## [1] 0.0126486
The highest correlation is actually between sales and radio: of the four predictors, the ‘cor’ function returns its largest value for radio. The scatter plot shows the same pattern. In the (radio, sales) scatter plot, sales increase steadily as radio ads increase. The points also sit very close to the fitted line, with the smallest deviations from it, which indicates an even stronger relationship.
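To compare all four predictors at once, including radio, the correlations can be computed in a single call and ranked by strength. The sketch below uses only the six rows from the head() output, so its values will differ from those printed above for the full data; the names ads, predictors, cors, and ranked are illustrative.

```r
# Six rows copied from the head() output shown earlier (not the full data set).
ads <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

predictors <- c("radio", "tv", "paper", "pos")

# Correlation of sales with each predictor, returned as a named vector.
cors <- sapply(ads[predictors], function(x) cor(ads$sales, x))

# Rank by absolute value so strong negative correlations are not pushed to the end.
ranked <- cors[order(abs(cors), decreasing = TRUE)]

round(ranked, 3)
```

On the full data, the same code with mydata in place of ads would put the four correlations side by side, which makes the comparison described above easier to verify.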
Follow the directions on the worksheet. Download Tableau academic on your personal computer or use one of the lab computers.
– Download Tableau academic here: https://www.tableau.com/academic/students
– Refer to file ‘creditrisk.csv’ in the data folder
– Start Tableau and enter your LUC email if prompted.
– Import the file into Tableau. Choose the Text File option when importing
– Under the Dimensions tab located on the left side of the screen, DOUBLE click on ‘Loan Purpose’, then DOUBLE click on the ‘Number of Records’ variable located under Measures on the bottom left of the screen.
– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.
– Create a new sheet by clicking on the icon at the bottom, next to your current sheet.
Histogram of Age
The image above is a histogram of the age distribution. The age bin with the highest count is 22, with a count of 97.
Marital Status vs. Age
The image above shows the relationship between age and marital status. The age bin with the highest divorce count is 22, with a divorce count of 46.