Business Analytics Lab Project 1

About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this lab project, you will explore both types of analytics using the data set provided.

Setup

Always set your working directory to the source file location: open the ‘Session’ menu, choose ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully, complete each task, and answer the questions. Submit your work to RPubs as detailed in previous notes.

Task 1: Testing for Outliers (3 points)

First, calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we first read the file, then extract the column, and then compute the requested values.

#Read File
mydata = read.csv(file="data/creditrisk.csv")

#Name the extracted variable
age = mydata$Age

#Calculate the average, standard deviation, maximum and minimum age below. 
MeanAge = mean(age)
MeanAge

## [1] 34.39765

SpreadAge = sd(age)
SpreadAge

## [1] 11.04513

MaxAge = max(age)
MaxAge

## [1] 73

MinAge = min(age)
MinAge

## [1] 18

An outlier is a value that “lies outside” most of the other values in a set of data. Next, use the formula from class to find the upper and lower limits for age and decide which values are outliers.

#Use the common formula to calculate the upper and lower thresholds
UpperOutlier = MeanAge + 3*SpreadAge
UpperOutlier

## [1] 67.53302

LowerOutlier = MeanAge - 3*SpreadAge
LowerOutlier

## [1] 1.262269

Are there any outliers? How can you check the data to find out if there are potential outliers? Use the chunk below to make a decision about possible outliers.

Yes, there appears to be at least one outlier, since the MaxAge value (73) is higher than the upper threshold (67.53) calculated above. The chunk below shows how far the maximum lies above that threshold.

# Insert here your work to find if the data contains potential outliers.
MaxAge - UpperOutlier

## [1] 5.466975
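The difference above only shows that the maximum exceeds the upper limit; it does not say how many values do. A minimal sketch of counting them with a logical filter follows. The `ages` vector is made up for illustration and is not taken from creditrisk.csv.

```r
# Illustrative ages only (assumption); not the Age column from creditrisk.csv
ages <- c(rep(c(25, 30, 35, 40), times = 5), 120)

# Three-sigma limits around the mean, as in the chunk above
upper <- mean(ages) + 3 * sd(ages)
lower <- mean(ages) - 3 * sd(ages)

# Logical vector marking values outside the limits
is_outlier <- ages > upper | ages < lower

sum(is_outlier)     # how many values are flagged
ages[is_outlier]    # which values they are
```

Applied to the lab data, the same filter would use `age`, `UpperOutlier`, and the lower limit in place of the illustrative names.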

A similar method for finding the upper and lower thresholds, discussed in introductory statistics courses, uses the interquartile range. Use the chunk below to first calculate the interquartile range.

# Interquartile range
quantile(age)

##   0%  25%  50%  75% 100% 
##   18   26   32   41   73

lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq
iqr

## 75% 
##  15
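The `75%` label on the result above is inherited from the named element returned by `quantile()`. As a hedged alternative, the sketch below drops the names with `unname()` and compares the manual difference with base R's `IQR()` function. The `ages` vector is illustrative, not the lab data.

```r
# Illustrative ages only (assumption); not the Age column from creditrisk.csv
ages <- c(18, 22, 26, 29, 32, 35, 41, 47, 73)

# Request the quartiles by probability; unname() drops the "25%"/"75%" labels
q1 <- unname(quantile(ages, 0.25))
q3 <- unname(quantile(ages, 0.75))

iqr_manual  <- q3 - q1
iqr_builtin <- IQR(ages)   # uses the same default quantile type as quantile()

all.equal(iqr_manual, iqr_builtin)
```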

The thresholds are the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it can be identified as a potential outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold

##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold

## 25% 
## 3.5

Are there any outliers? How many?

Yes, there are 5 outliers.
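One way to arrive at such a count is to filter the vector against the two thresholds. The sketch below assumes a made-up `ages` vector and hypothetical names `fence_low` and `fence_high`; it is not the lab data.

```r
# Illustrative ages only (assumption); not the Age column from creditrisk.csv
ages <- c(18, 22, 26, 29, 32, 35, 41, 47, 73)

# First and third quartiles without name labels
q <- quantile(ages, c(0.25, 0.75), names = FALSE)
iqr_ages <- q[2] - q[1]

# 1.5 * IQR fences below Q1 and above Q3
fence_low  <- q[1] - 1.5 * iqr_ages
fence_high <- q[2] + 1.5 * iqr_ages

# Keep only the values outside the fences, then count them
outliers <- ages[ages < fence_low | ages > fence_high]
length(outliers)
outliers
```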

It can also be useful to visualize the data using a box and whisker plot. Use the boxplot() command to visualize your data.

boxplot(age)

Can you identify the outliers from the boxplot? If so how many outliers?

Yes, there are 5 outliers.
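Reading points off a plot can be error-prone, so it may help to list the values a boxplot would draw as individual points. The sketch below uses `boxplot.stats()` from base R on a made-up vector. It works from hinges rather than `quantile()` quartiles, so its fences can differ slightly from hand-computed ones on some data.

```r
# Illustrative ages only (assumption); not the Age column from creditrisk.csv
ages <- c(18, 22, 26, 29, 32, 35, 41, 47, 73)

# boxplot.stats() returns the five-number summary and the flagged points
bp <- boxplot.stats(ages, coef = 1.5)

bp$stats          # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out            # values drawn beyond the whiskers
length(bp$out)    # number of plotted outliers
```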

Task 2: Quantitative Analysis - Marketing (2 points)

Begin by reading in the data from the ‘marketing.csv’ file, and viewing it to make sure it is read in correctly.

#read the marketing file and view it to make sure it is read correctly
mydata = read.csv(file="data/marketing.csv")
head(mydata)

##   case_number sales radio paper  tv pos
## 1           1 11125    65    89 250 1.3
## 2           2 16121    73    55 260 1.6
## 3           3 16440    74    58 270 1.7
## 4           4 16876    75    82 270 1.3
## 5           5 13965    69    75 255 1.5
## 6           6 14999    70    71 255 2.1

Now calculate the Range, Min, Max, Mean, STDEV, and Variance for the variable ‘sales’.

Sales

sales = mydata$sales
#Max Sales
maxsales = max(sales)
maxsales

## [1] 20450

#Min Sales
minsales = min(sales)
minsales

## [1] 11125

#Range
rangesales = maxsales-minsales
rangesales

## [1] 9325

#Mean
meansales = mean(sales)
meansales

## [1] 16717.2

#Standard Deviation
spreadsales = sd(sales)
spreadsales

## [1] 2617.052

#Variance
variancesales = var(sales)
variancesales

## [1] 6848961

An easy way to calculate statistics for all of these variables at once is the summary() function. Run the summary command to view the statistics for all variables in the dataset.

# Summary statistics for all variables. 
summary(mydata)

##   case_number        sales           radio           paper      
##  Min.   : 1.00   Min.   :11125   Min.   :65.00   Min.   :35.00  
##  1st Qu.: 5.75   1st Qu.:15175   1st Qu.:70.00   1st Qu.:53.75  
##  Median :10.50   Median :16658   Median :74.50   Median :62.50  
##  Mean   :10.50   Mean   :16717   Mean   :76.10   Mean   :62.30  
##  3rd Qu.:15.25   3rd Qu.:18874   3rd Qu.:81.75   3rd Qu.:75.50  
##  Max.   :20.00   Max.   :20450   Max.   :89.00   Max.   :89.00  
##        tv             pos       
##  Min.   :250.0   Min.   :0.000  
##  1st Qu.:255.0   1st Qu.:1.200  
##  Median :270.0   Median :1.500  
##  Mean   :266.6   Mean   :1.535  
##  3rd Qu.:276.2   3rd Qu.:1.800  
##  Max.   :280.0   Max.   :3.000

You can also use the summary() command to find the statistics for the sales variable.

# Summary statistics for the sales variable
summary(sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11125   15175   16658   16717   18874   20450

There are some statistics not calculated with the summary() function. Specify which.

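Standard deviation, variance, and range were not calculated by the summary function. The range can still be derived from its output by subtracting the minimum from the maximum, but the standard deviation and variance have to be calculated separately with sd() and var().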
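The statistics missing from summary() can be computed for several columns in one call. The sketch below is one possible approach using `sapply()`. The data frame `df` holds only the six `sales` and `radio` rows printed by head() earlier, so its results will not match the full-data values.

```r
# First six rows of sales and radio as printed by head(mydata) above
df <- data.frame(
  sales = c(11125, 16121, 16440, 16876, 13965, 14999),
  radio = c(65, 73, 74, 75, 69, 70)
)

# For each column: standard deviation, variance, and range (max - min)
extra <- sapply(df, function(x) {
  c(sd = sd(x), var = var(x), range = diff(range(x)))
})

extra   # one column per variable, one row per statistic
```

The variance row is the square of the standard deviation row, which is a quick consistency check on the output.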

Task 3: Calculating Z-Value (2 points)

Given a sales value of $25000, calculate the corresponding z-value or z-score.

#  Calculate the z-value and display it
zscore = (25000-meansales)/spreadsales
zscore

## [1] 3.164935

Based on the z-value, how would you rate a $25000 sales value: poor, average, good, or very good performance? Explain your logic.

A sales value of $25,000 would be very good performance. Its z-score is greater than 3, meaning it lies more than three standard deviations above the mean, so it qualifies as an outlier under the three-standard-deviation rule used in Task 1. Assuming a normal distribution, it would sit in the far right tail of the curve.
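To make the "far right tail" statement concrete, the sketch below recomputes the z-score from the mean and standard deviation printed earlier and then asks how much of a normal curve lies above it. The helper `z_score()` is a hypothetical name, and the normal model is an assumption that this report does not test.

```r
# Mean and standard deviation of sales as printed earlier in this report
mean_sales <- 16717.2
sd_sales   <- 2617.052

# Hypothetical helper: distance from the mean in standard deviations
z_score <- function(x, mu, sigma) (x - mu) / sigma

z <- z_score(25000, mean_sales, sd_sales)
z

# Share of a normal distribution lying above z (normality is assumed)
p_above <- pnorm(z, lower.tail = FALSE)
p_above
```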

Working with Data & Descriptive Analytics

Grace Klein

10/12/21