R can be used to create basic visualizations, which help in understanding the data as a whole. R can also compute correlations between variables and create scatter plots.
Tableau is a tool more tailored for visual analytics, while R is a powerful tool for statistics and other advanced topics in data analytics. In this lab we will explore both capabilities using two data sets from earlier labs, creditrisk and marketing.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. For clarity, the tasks and questions to be completed are highlighted in red and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.
Execute all code chunks, preview, publish, and submit link on Sakai.
Read the file marketing.csv and make sure all the columns are captured by checking the first few rows of the data set with “head(mydata)”.
mydata = read.csv("data/marketing.csv")
head(mydata)
## case_number State sales radio paper tv pos
## 1 1 IL 11125 65 89 250 1.3
## 2 2 IL 16121 73 55 260 1.6
## 3 3 AZ 16440 74 58 270 1.7
## 4 4 AZ 16876 75 82 270 1.3
## 5 5 IL 13965 69 75 255 1.5
## 6 6 MI 14999 70 71 255 2.1
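Beyond head(), a few base R calls give a quick check that every column was read with the expected name and type. The sketch below is only an illustration: it assumes nothing more than the six rows printed above, typed inline as text so that it runs without the data folder, and the name `sample_data` is hypothetical.

```r
# Assumption: only the six rows shown by head(mydata), typed inline as CSV text
csv_text <- "case_number,State,sales,radio,paper,tv,pos
1,IL,11125,65,89,250,1.3
2,IL,16121,73,55,260,1.6
3,AZ,16440,74,58,270,1.7
4,AZ,16876,75,82,270,1.3
5,IL,13965,69,75,255,1.5
6,MI,14999,70,71,255,2.1"

# read.csv accepts a text argument in place of a file path
sample_data <- read.csv(text = csv_text)

str(sample_data)             # column names and types
dim(sample_data)             # number of rows and columns
colSums(is.na(sample_data))  # missing values per column
```

With the real file, the same three calls can be run on mydata directly.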
How to create a bar chart using a categorical variable
# Extract the State column from mydata
state = mydata$State
# Create a frequency table to extract the count for each state
state_table = table(state)
# Execute the command
barplot(state_table)
##### 1A) Use the code chunk below to repeat the above bar chart by adding proper labels to X and Y axis.
# Add a title and axis labels to the plot
barplot(state_table, main = 'Frequency of Cases per State', xlab= 'States', ylab = 'Frequency' )
A more elegant representation of the bar plot would be to order the bars by increasing value. This is shown in the code chunk below.
# Order and execute
barplot(state_table[order(state_table)])
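The same ordering can also be written with sort(), which additionally makes it easy to reverse the order. This is a minimal sketch, assuming only the six State values visible in the head() output; `state_sample` and the other names are hypothetical stand-ins for the lab's variables.

```r
# Assumption: the six State values from the head(mydata) output, not the full file
state_sample <- c("IL", "IL", "AZ", "AZ", "IL", "MI")
state_counts <- table(state_sample)

# sort() orders the counts directly; decreasing = TRUE puts the tallest bar first
increasing <- sort(state_counts)
decreasing <- sort(state_counts, decreasing = TRUE)

barplot(decreasing,
        main = "Frequency of Cases per State",
        xlab = "States",
        ylab = "Frequency")
```

Whether bars should run from smallest to largest or the reverse is a presentation choice; both make ranking the states easier than the default alphabetical order.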
How to create a histogram
# Extract the TV column from the data and create a histogram by running the command hist(variable)
# where variable corresponds to the extracted tv column variable
tv=mydata$tv
hist(tv)
##### 1B) Create a new histogram plot for Sales. Explain what the x-axis and y-axis represent. Can one derive the total cumulative sales from the histogram? Explain your answer.
sales = mydata$sales
hist(sales)
The x-axis represents the variable of interest (sales), divided into bins, while the y-axis shows the frequency, that is, the number of observations that fall into each bin. One cannot derive the total cumulative sales from the histogram, because it only shows how many data points fall within each range of sales values, not the exact values themselves. At best, the total can be approximated by multiplying each bin's midpoint by its count.
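The answer above can be checked numerically. hist() returns the bin breaks, midpoints, and counts, which is all the information the plot contains. The sketch below assumes only the six sales values from the head() output and compares a total estimated from the bins with the exact sum; `sales_sample` is a hypothetical name.

```r
# Assumption: the six sales values from the head(mydata) output, not the full file
sales_sample <- c(11125, 16121, 16440, 16876, 13965, 14999)

# plot = FALSE returns the histogram object without drawing it
h <- hist(sales_sample, plot = FALSE)
h$breaks  # bin edges
h$counts  # number of observations per bin

# Best estimate available from the histogram alone: midpoint times count per bin
approx_total <- sum(h$mids * h$counts)

# Exact total, which needs the raw values
exact_total <- sum(sales_sample)

c(approximate = approx_total, exact = exact_total)
```

The two numbers generally differ, because each value is replaced by the midpoint of its bin.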
How to create a pie chart
# The command to create a pie chart is pie(variable) where variable is in reference to the particular column extracted from the file. In this example we define a variable called x.
x = c(2,3,4,5)
pie(x)
##### 1C) Create a new pie chart for state count. Refer to variable state_table to capture the frequency count.
pie(state_table)
##### 1D) What does each slice of the pie represent? Compare the pie chart to earlier bar charts. Which type of charts is a better representation of the data and why so?
Each slice of the pie represents one state's share of the total case count; together the slices add up to 100%. The bar chart is the better representation of this data because differences between similar-sized slices are often hard to judge in a pie chart, whereas the bar chart lets the viewer read the frequency count for each state directly from the y-axis.
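If a pie chart is used anyway, printing each slice's percentage in its label makes it easier to read. The following is a small sketch, assuming only the six State values from the head() output; `state_sample` and `slice_labels` are hypothetical names.

```r
# Assumption: the six State values from the head(mydata) output, not the full file
state_sample <- c("IL", "IL", "AZ", "AZ", "IL", "MI")
state_counts <- table(state_sample)

# prop.table converts counts to shares of the whole; the shares sum to 1
state_pct <- round(100 * prop.table(state_counts), 1)

# Build labels such as "IL (50%)" and pass them to pie()
slice_labels <- paste0(names(state_counts), " (", state_pct, "%)")
pie(state_counts, labels = slice_labels)
```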
The previous task focused on visualizing one variable. A common and effective way to visualize two variables together is a scatter plot, which helps in studying relationships and trends.
How to create a scatter plot
# Plot Sales vs. Radio
# Radio will be on the x-axis
# Sales will be on the y-axis
sales = mydata$sales
radio = mydata$radio
plot(radio,sales)
# It is easier to see the trend and possible relationship by including a line that fits through the points.
# This is done with the command
scatter.smooth(radio,sales)
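scatter.smooth() draws a smooth loess curve rather than a straight line. If a straight best-fit line is preferred, one option is to fit a linear model with lm() and add it with abline(). This is a sketch under the assumption that only the six radio and sales values from the head() output are used; the `_sample` names are hypothetical.

```r
# Assumption: the six radio and sales values from the head(mydata) output
radio_sample <- c(65, 73, 74, 75, 69, 70)
sales_sample <- c(11125, 16121, 16440, 16876, 13965, 14999)

# Radio on the x-axis, sales on the y-axis
plot(radio_sample, sales_sample, xlab = "Radio", ylab = "Sales")

# Fit sales as a linear function of radio and overlay the fitted line
fit <- lm(sales_sample ~ radio_sample)
abline(fit, col = "red")

coef(fit)  # intercept and slope of the fitted line
```

A positive slope means that sales tend to rise as radio spending rises.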
##### 2A) Create three separate scatter plots for Sales vs TV, Sales vs Paper, and Sales vs Pos. Include the best fitting line in each plot. Pay attention to what variable goes on the x-axis and the y-axis.
# Sales stays on the y-axis; the advertising variable goes on the x-axis
plot(tv,sales)
scatter.smooth(tv,sales)
paper = mydata$paper
plot(paper,sales)
scatter.smooth(paper,sales)
pos = mydata$pos
plot(pos,sales)
scatter.smooth(pos,sales)
##### 2B) Share your observations on trends and relationships. How do your observations reconcile with your findings from lab04?
To begin, the sales and tv plot shows a positive correlation, which is easier to see with the “scatter.smooth” function in R. This agrees with our findings from lab04: after we sorted the data by sales from lowest to highest, a trend emerged in which advertising dollars spent on tv rose together with sales. The sales and paper plot also supports our observations from lab04, because it shows a negative correlation between paper advertising dollars and sales. Even though the trend line slopes downward, high sales occur alongside both low and high paper advertising amounts, which means the relationship is weak. Lastly, the sales and pos plot supports our previous findings in that the relationship between the two is weak, even though the trend line slopes upward.
As part of any data analytics it is important to consider both qualitative and quantitative analysis. Scatter plots provide us with qualitative insights on possible trends and relationships. To quantify the strength of any relationships in the data, we need to look at the correlation between two variables.
How to compute correlation
cor(sales,radio)
## [1] 0.9771381
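By default, cor() computes the Pearson correlation coefficient. To show what that number measures, the sketch below computes it by hand and compares the result with cor(). It assumes only the six radio and sales values from the head() output, so the value will not match the full-data result above; `pearson_r` is a hypothetical helper.

```r
# Assumption: the six radio and sales values from the head(mydata) output
radio_sample <- c(65, 73, 74, 75, 69, 70)
sales_sample <- c(11125, 16121, 16440, 16876, 13965, 14999)

# Pearson correlation: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

manual_r  <- pearson_r(sales_sample, radio_sample)
builtin_r <- cor(sales_sample, radio_sample)

all.equal(manual_r, builtin_r)  # the two approaches agree
```

The result always lies between -1 and 1, with values near either end indicating a strong linear relationship.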
##### 2C) Repeat the correlation calculation for the following pair of variables (sales,tv), (sales,paper), and (sales,pos)
cor(sales,tv)
## [1] 0.9579703
cor(sales,paper)
## [1] -0.2830683
cor(sales,pos)
## [1] 0.0126486
##### 2D) Which pair has the highest correlation? How do these results reconcile with your scatter plots observations?
Of the three pairs computed in 2C, sales and tv has the highest correlation (0.9579703), close to a perfect positive correlation of 1. Only the sales and radio pair computed earlier is higher (0.9771381). This agrees with our scatter plot observations, where sales clearly rise as more tv advertising dollars are spent.
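Rather than calling cor() once per pair, the correlations of sales with every advertising variable can be computed in one step and sorted. This is a minimal sketch, assuming only the six rows from the head() output, so the values differ from the full-data results above; `sample_data` and `ad_columns` are hypothetical names.

```r
# Assumption: the six rows from the head(mydata) output, numeric columns only
sample_data <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70),
  paper = c(89, 55, 58, 82, 75, 71),
  tv    = c(250, 260, 270, 270, 255, 255),
  pos   = c(1.3, 1.6, 1.7, 1.3, 1.5, 2.1)
)

ad_columns <- c("radio", "paper", "tv", "pos")

# Correlate sales with each advertising column in turn
cor_with_sales <- sapply(sample_data[ad_columns],
                         function(column) cor(sample_data$sales, column))

# Strongest positive correlation first
sort(cor_with_sales, decreasing = TRUE)
```

Calling cor(sample_data) on the whole data frame would instead return the full correlation matrix of every pair.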
Follow the directions on the worksheet. Download Tableau academic on your personal computer or use one of the lab computers. Make sure to download the academic version and not the free limited trial version.
– Download Tableau academic here: https://www.tableau.com/academic/students
– Refer to file ‘creditrisk.csv’ in the data folder
– Start Tableau and enter your LUC email if prompted.
– Import the file into Tableau. Choose the Text File option when importing
– Under the dimensions tab located on the left side of the screen DOUBLE click on the ‘Loan Purpose’, then DOUBLE click on ‘Number of Records’ variable located under Measures on the bottom left of the screen.
– From the upper right corner of the screen select the horizontal bars view and note the new chart. Familiarize yourself with the tool by trying other views. Only valid views will be highlighted in the menu.
– Create a new sheet by clicking on the icon at the bottom, next to your current sheet.
##### 3A) Double-click on the ‘Age’ variable in Measures and select the ‘Histogram’ view. Capture a screen shot and include here. Which age bin has the highest age count and what is the count?
Above is the screenshot for the Histogram of Age.
The age bin that has the highest age count is the 22-26 age bin, with a count of 97.
##### 3B) Drag-drop the variable ‘Marital Status’ found under Dimensions into the Marks Color box. Capture a screen shot and include here. Which age bin has the highest divorce count and what is the count?
Above is the Histogram of Age with marital status shown in color.
The age bin that has the highest divorce count is also the 22-26 age bin, with 46 divorced records.
##### 3C) Create another new sheet. Double-click ‘Months Employed’ and then double-click ‘Age’. Make sure Age appears in the columns field as shown in the image below. From the Sum(Age) drop down menu select Dimension. Repeat for Sum(Months Employed). Add another variable to the scatter plot by dragging and dropping the dimension variable ‘Gender’ into the Marks Color box. Capture a screen shot and include here. Share a story on what the data is telling us.
Above is a scatter plot of months employed against age. Using Tableau, we can also show the gender of each data point with color.
The scatter plot shows that, in this data set, men often have more months employed at a given age than women. Tableau can also add trend lines that report the R-squared value. Although both trends are positive and weak, since the data points are widely scattered, the relationship between age and months employed is slightly stronger for men.
##### 3D) In a new sheet generate a view of Gender, Number of Records, and Marital Status. Choose the best fitting view of your choice for the intended scope. Capture a screen shot and include here. Share a story on what the data is telling us.
In visualizing the number of records by gender and marital status, the best fitting view was the side by side bars that Tableau suggested.
This visual shows that, surprisingly, every female in the data is recorded as divorced. The men, on the other hand, are spread across all marital statuses (single, married, and divorced). Most men in the data are single, followed by married, with divorced the least common.