Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.
In this lab project, you will explore both types of analytics using the data set provided.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
First, calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must first read the file, then extract the column, and finally compute the requested values.
#Read File
myData = read.csv(file="data/creditrisk.csv")
#Name the extracted variable
age = myData$Age
#Calculate the average, standard deviation, maximum and minimum age below.
meanAge = mean(age)
meanAge
## [1] 34.39765
spreadAge = sd(age)
spreadAge
## [1] 11.04513
maxAge = max(age)
maxAge
## [1] 73
minAge = min(age)
minAge
## [1] 18
An outlier is a value that “lies outside” most of the other values in a set of data. Next, use the formula from class to find the upper and lower limits for age to decide on outliers.
#Use the common formula to calculate the upper and lower thresholds
UpperOutlier = meanAge + 3*spreadAge
UpperOutlier
## [1] 67.53302
LowerOutlier = meanAge - 3*spreadAge
LowerOutlier
## [1] 1.262269
Are there any outliers? How can you check the data to find out if there are potential outliers? Use the chunk below to make a decision about possible outliers.
THE MAXIMUM AGE IS 73 AND THE MINIMUM AGE IS 18. COMPARING THESE WITH UpperOutlier AND LowerOutlier, IT CAN BE CONCLUDED THERE IS AT LEAST ONE OUTLIER BECAUSE 73 (maxAge) IS GREATER THAN 67.53 (UpperOutlier). YOU CAN COMPARE THE THRESHOLDS TO THE MAX AND MIN VALUES TO DETERMINE IF THERE ARE POTENTIAL OUTLIERS.
# Insert here your work to find if the data contains potential outliers.
maxAge = max(age)
maxAge
## [1] 73
minAge = min(age)
minAge
## [1] 18
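Comparing only the maximum and minimum shows that at least one outlier exists, but not how many. One possible way to get an exact count is to test every value against the thresholds. The sketch below uses a small made-up vector, `ages_demo` (an assumption for illustration, not the creditrisk data), together with the same mean ± 3 standard deviations rule.

```r
# Hypothetical ages for illustration only; not taken from creditrisk.csv
ages_demo = c(rep(30, 19), 90)

# Thresholds from the mean +/- 3 standard deviations rule
upper_demo = mean(ages_demo) + 3 * sd(ages_demo)
lower_demo = mean(ages_demo) - 3 * sd(ages_demo)

# Logical vector: TRUE where a value falls outside either threshold
isOutlier = ages_demo > upper_demo | ages_demo < lower_demo

# Count the flagged values, then list them
sum(isOutlier)
ages_demo[isOutlier]
```

Replacing `ages_demo` with `age` and the demo thresholds with `UpperOutlier` and `LowerOutlier` should give the count for the lab data.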
Another similar method to find the upper and lower thresholds, discussed in introductory statistics courses, involves finding the interquartile range. Use the chunk below to first calculate the interquartile range.
#interquartile range
quantile(age) #1st quantile 0%, 2nd quantile 25%, etc...
## 0% 25% 50% 75% 100%
## 18 26 32 41 73
lowerQ = quantile(age)[2] #2nd quantile value 26
upperQ = quantile(age)[4] #4th quantile value 41
iqr = upperQ - lowerQ #interquartile range value 15
iqr
## 75%
## 15
The thresholds are the boundaries that determine if a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it can be identified as a potential outlier.
Below is the upper threshold:
upperThreshold = (iqr * 1.5) + upperQ
upperThreshold
## 75%
## 63.5
Below is the lower threshold:
lowerThreshold = lowerQ - (iqr * 1.5)
lowerThreshold
## 25%
## 3.5
Are there any outliers? How many?
VALUES GREATER THAN 63.5 ARE CONSIDERED OUTLIERS. BECAUSE THE MAX AGE IS 73, THERE IS AT LEAST 1 OUTLIER IN THE DATA.
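To answer "how many?" exactly, the same 1.5 × IQR rule can be applied to every value at once. The following is a minimal sketch on a made-up vector, `values_demo` (hypothetical, not the creditrisk data). It requests only the two quartiles through the `probs` argument of `quantile()` and uses `unname()` to drop the "25%" and "75%" labels that otherwise carry over into the results.

```r
# Hypothetical values for illustration only; not taken from creditrisk.csv
values_demo = c(20, 22, 25, 27, 30, 31, 33, 35, 38, 80)

# First and third quartiles without the percentage labels
q_demo = unname(quantile(values_demo, probs = c(0.25, 0.75)))
iqr_demo = q_demo[2] - q_demo[1]   # same value as IQR(values_demo)

# Thresholds from the 1.5 * IQR rule
lower_demo = q_demo[1] - 1.5 * iqr_demo
upper_demo = q_demo[2] + 1.5 * iqr_demo

# Values outside the thresholds are the potential outliers
outliers_demo = values_demo[values_demo < lower_demo | values_demo > upper_demo]
outliers_demo
length(outliers_demo)
```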
It can also be useful to visualize the data using a box and whisker plot. Use the boxplot() command to visualize your data.
boxplot(age)
Can you identify the outliers from the boxplot? If so, how many outliers are there?
BOXPLOTS PROVIDE A VISUAL SUMMARY OF THE DATA. THE OBSERVER CAN MAKE CONCLUSIONS BASED ON THE SUMMARY DATA TO QUICKLY IDENTIFY OUTLIERS AND THE DISTRIBUTION OF DATA IN A DATASET. THERE ARE AT LEAST 5 OUTLIERS IN THIS DATA.
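Counting points by eye on a plot can be imprecise, especially when several observations share the same value. As a sketch, `boxplot.stats()` returns the values that a boxplot would draw as individual points. The vector `values_demo` below is made up for illustration. Note that `boxplot.stats()` builds its box from hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical values for illustration only; not taken from creditrisk.csv
values_demo = c(20, 22, 25, 27, 30, 31, 33, 35, 38, 80)

# Statistics behind the boxplot; $out holds the points drawn beyond the whiskers
stats_demo = boxplot.stats(values_demo)
stats_demo$out
length(stats_demo$out)
```

Running `length(boxplot.stats(age)$out)` on the lab data should give the exact number of points shown beyond the whiskers.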
Begin by reading in the data from the ‘marketing.csv’ file, and viewing it to make sure it is read in correctly.
#read the marketing file and view it to make sure it is read correctly
markData = read.csv(file="data/marketing.csv")
salesData = markData$sales
head(salesData)
## [1] 11125 16121 16440 16876 13965 14999
Now calculate the Range, Min, Max, Mean, STDEV, and Variance for the variable ‘sales’.
Sales
#Max Sales
maxSales = max(salesData)
maxSales
## [1] 20450
#Min Sales
minSales = min(salesData)
minSales
## [1] 11125
#Range (range() returns the minimum and maximum)
rangeSales = range(salesData)
rangeSales
## [1] 11125 20450
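If the range is wanted as a single number (maximum minus minimum), one option is to wrap `range()` in `diff()`. The sketch below uses made-up figures in `sales_demo`, which is a hypothetical vector and not the marketing data.

```r
# Hypothetical sales figures for illustration only; not from marketing.csv
sales_demo = c(100, 250, 175, 400)

# range() returns two numbers: the minimum and the maximum
range(sales_demo)

# diff() subtracts the first from the second, giving the range as one number
rangeWidth_demo = diff(range(sales_demo))
rangeWidth_demo

# Equivalent calculation written out explicitly
max(sales_demo) - min(sales_demo)
```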
#Mean
meanSales = mean(salesData)
meanSales
## [1] 16717.2
#Standard Deviation
spreadSales = sd(salesData)
spreadSales
## [1] 2617.052
#Variance
varSales = var(salesData)
varSales
## [1] 6848961
An easy way to calculate many of these statistics at once is with the summary() function. Run the summary command to view the statistics for all variables in the dataset.
# Summary statistics. Note: salesData holds only the sales column; summary(markData) would cover all variables.
summary(salesData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11125 15175 16658 16717 18874 20450
You can also use the summary() command to find the statistics for the sales variable.
# Summary statistics for the sales variable
summary(markData$sales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11125 15175 16658 16717 18874 20450
There are some statistics not calculated with the summary() function. Specify which.
THE SUMMARY FUNCTION DOES NOT CALCULATE THE STANDARD DEVIATION AND VARIANCE OF A PARTICULAR VARIABLE.
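The statistics missing from summary() can be computed for every column in one step with `sapply()`. This is a minimal sketch on a made-up data frame, `df_demo`, whose column names and values are assumptions for illustration. It relies on every column being numeric.

```r
# Hypothetical data frame for illustration only; names and values are made up
df_demo = data.frame(sales = c(10, 20, 30), radio = c(1, 2, 6))

# summary() reports min, quartiles, median, mean and max for each column
summary(df_demo)

# sapply() applies a function to each column in turn
sdByColumn = sapply(df_demo, sd)    # standard deviation of each column
varByColumn = sapply(df_demo, var)  # variance of each column
sdByColumn
varByColumn
```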
Given a sales value of $25000, calculate the corresponding z-value or z-score.
# Calculate the z-value and display it. Recall that z-score = (x - mean)/sd
zScore = (25000-meanSales)/spreadSales
zScore
## [1] 3.164935
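When several values need to be scored, the formula can be wrapped in a small function. The sketch below is one possible way to do this. The helper `zScoreOf` and the vector `ref_demo` are hypothetical names introduced here for illustration. The built-in `scale()` function computes the z-score of every element of a vector against that vector's own mean and standard deviation.

```r
# Hypothetical helper: z-score of x relative to a reference vector
zScoreOf = function(x, reference) {
  (x - mean(reference)) / sd(reference)
}

# Hypothetical reference values with mean 20 and standard deviation 10
ref_demo = c(10, 20, 30)

# z-score of a single new value: (45 - 20) / 10
zScoreOf(45, ref_demo)

# z-scores of every element; scale() returns a matrix, so convert to a vector
zAll_demo = as.numeric(scale(ref_demo))
zAll_demo
```

With the lab data, `zScoreOf(25000, salesData)` should reproduce the value computed above.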
Based on the z-value, how would you rate a $25000 sales value: poor, average, good, or very good performance? Explain your logic.
BECAUSE THE Z-VALUE IS GREATER THAN 3, A SALES VALUE OF 25000 IS CONSIDERED AN OUTLIER. WITH A Z-VALUE OF 3.165, IT CAN BE CONCLUDED THAT 25000 IS VERY GOOD PERFORMANCE. IF THE Z-VALUE WERE -3 IT WOULD BE CONSIDERED POOR PERFORMANCE, AND IF THE Z-VALUE WERE CLOSE TO 0 IT WOULD BE CONSIDERED AVERAGE.