Descriptive Analytics

Notebook Instructions

For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks and questions to be completed are highlighted in a larger bold font and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science.

Info: https://www.tidyverse.org/
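The start-up messages that follow are what `require()` prints when it attaches a package. The chunk that produced them is not shown in this notebook, so the sketch below is only a guess at its shape; it assumes both packages are already installed.

```r
# Assumed package-loading chunk; the notebook's original chunk is not shown.
# require() attaches a package and returns TRUE on success, FALSE otherwise.
packages <- c("tidyverse", "gridExtra")
loaded <- sapply(packages, function(pkg) {
  suppressWarnings(require(pkg, character.only = TRUE))
})

# Report anything that still needs install.packages()
missing_packages <- packages[!loaded]
if (length(missing_packages) > 0) {
  message("Not installed: ", paste(missing_packages, collapse = ", "))
}
```

Attaching gridExtra after tidyverse is what causes the message that `combine` is masked from dplyr.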

## Loading required package: tidyverse

## -- Attaching packages -------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts ----------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

Task 1: Quantitative Analysis

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head() max() min() var() sd()

mydata <- read.csv(file = "data/Advertising.csv")

head(mydata)

Extract the assigned features (columns) to perform some analytics.

sales = mydata$sales
TV = mydata$TV
Radio = mydata$radio
Newspaper = mydata$newspaper

Change the variable name "X" to case_number using the function rename(). (read.csv() names the unnamed first column "X"; read_csv() from readr 1.1.1 would name it "X1".)

mydata <- rename(mydata, "NEW_VAR_NAME" = "OLD_VAR_NAME")

mydata <- rename(mydata, "case_number" = "X")
head(mydata)

1B) Find the range (difference between min and max), min, max, standard deviation, and variance for each assigned feature (use separate chunks for each feature). Compare the features and note any significant differences.

Sales

#variable_max
sales_max = max(sales)
sales_max

## [1] 27

#variable_min
sales_min = min(sales)
sales_min

## [1] 1.6

#variable_Range max-min
sales_range = sales_max - sales_min
sales_range

## [1] 25.4

#variable_mean 
sales_mean = mean(sales)
sales_mean

## [1] 14.0225

#variable_sd Standard Deviation
sales_sd = sd(sales)
sales_sd

## [1] 5.217457

#variable_variance
sales_variance = var(sales)
sales_variance

## [1] 27.22185

TV

#variable_max
TV_max = max(TV)
TV_max

## [1] 296.4

#variable_min
TV_min = min(TV)
TV_min

## [1] 0.7

#variable_Range max-min
TV_range = TV_max - TV_min
TV_range

## [1] 295.7

#variable_mean 
TV_mean = mean(TV)
TV_mean

## [1] 147.0425

#variable_sd Standard Deviation
TV_sd = sd(TV)
TV_sd

## [1] 85.85424

#variable_variance
TV_variance = var(TV)
TV_variance

## [1] 7370.95

RADIO

#variable_max
Radio_max = max(Radio)
Radio_max

## [1] 49.6

#variable_min
Radio_min = min(Radio)
Radio_min

## [1] 0

#variable_Range max-min
Radio_range = Radio_max - Radio_min
Radio_range

## [1] 49.6

#variable_mean 
Radio_mean = mean(Radio)
Radio_mean

## [1] 23.264

#variable_sd Standard Deviation
Radio_sd = sd(Radio)
Radio_sd

## [1] 14.84681

#variable_variance
Radio_variance = var(Radio)
Radio_variance

## [1] 220.4277

NEWSPAPER

#variable_max
Newspaper_max = max(Newspaper)
Newspaper_max

## [1] 114

#variable_min
Newspaper_min = min(Newspaper)
Newspaper_min

## [1] 0.3

#variable_Range max-min
Newspaper_range = Newspaper_max - Newspaper_min
Newspaper_range

## [1] 113.7

#variable_mean 
Newspaper_mean = mean(Newspaper)
Newspaper_mean

## [1] 30.554

#variable_sd Standard Deviation
Newspaper_sd = sd(Newspaper)
Newspaper_sd

## [1] 21.77862

#variable_variance
Newspaper_variance = var(Newspaper)
Newspaper_variance

## [1] 474.3083

TV has a far higher variance (7370.95) and standard deviation (85.85) than newspaper and radio, along with the highest max and range. Newspaper comes next (variance 474.31, SD 21.78), which is elevated but not extreme. Radio has the smallest spread of the three advertising channels (variance 220.43, SD 14.85). Sales has the lowest SD and variance of all four features. This may indicate that sales stay fairly even despite the wide range of spending, or that advertising spending does not affect sales much.
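The per-feature chunks above repeat the same six calls. As a compact cross-check, the same statistics can be computed for every column in one pass. This is only a sketch in base R: `describe_feature` is a hypothetical helper name, and the small `toy` data frame holds made-up values, not the Advertising data.

```r
# Hypothetical helper: the statistics computed chunk by chunk above
describe_feature <- function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), sd = sd(x), variance = var(x))
}

# Made-up values for illustration only (not the Advertising data)
toy <- data.frame(
  TV = c(10, 200, 150, 80),
  radio = c(5, 40, 20, 15),
  sales = c(4, 22, 15, 9)
)

# One column of results per feature, one row per statistic
stats_table <- sapply(toy, describe_feature)
round(stats_table, 2)
```

With the notebook's data, the equivalent call would be `sapply(mydata[, c("TV", "radio", "newspaper", "sales")], describe_feature)`, which puts all four features side by side for comparison.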

1C) Use the summary() function on all the dataset to give you a general description of the data. Note any differences between features.

summary(mydata)

##   case_number           TV             radio          newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75  
##  Median :100.50   Median :149.75   Median :22.900   Median : 25.75  
##  Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##      sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

TV is much higher than radio and newspaper from the first quartile upward, suggesting that the most spending went to TV advertising. It also has a wider range of spending than radio and newspaper. While TV is higher on average, newspaper has a large jump between the third quartile (45.10) and the max (114.00), which may suggest an outlier. Radio is consistently the smallest in terms of spending and has the smallest spread based on the range. With the exception of TV, the first quartiles appear similar.

Are there any outliers? If not, explain the lack of outliers. If so, explain what the outliers represent and how many records are outliers. (Use code from notebook-03 to find outliers.)

SALES OUTLIERS

#ranges
quantile(sales)

##     0%    25%    50%    75%   100% 
##  1.600 10.375 12.900 17.400 27.000

lowerq = quantile(sales)[2]
upperq = quantile(sales)[4]
iqr = upperq-lowerq

lowerq

##    25% 
## 10.375

upperq

##  75% 
## 17.4

iqr

##   75% 
## 7.025

#thresholds
upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##     75% 
## 27.9375

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##     25% 
## -0.1625

#outliers above threshold
sales[ sales > upper_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#outliers below threshold
sales[ sales < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#finding outlier records
mydata[ sales > upper_threshold, ]

mydata[ sales < lower_threshold, ]

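There are no sales outliers: no sales value exceeds the upper threshold (27.9375) or falls below the lower threshold (-0.1625). The NA values printed above appear only because an empty result is indexed with [1:10]. The sales data is most likely fairly uniform, as it would be if sales were steadily increasing.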
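The threshold logic above is the same for every feature, so it can be wrapped in a function. A minimal sketch, assuming the 1.5 × IQR rule used in this notebook; `find_outliers` is a hypothetical name and the example vector is made up. Returning the matching values directly, rather than indexing with `[1:10]`, avoids the `NA` padding in the printed output.

```r
# Hypothetical helper implementing the 1.5 * IQR rule used in this notebook
find_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower_threshold <- q[1] - k * iqr
  upper_threshold <- q[2] + k * iqr
  is_outlier <- x < lower_threshold | x > upper_threshold
  list(
    lower_threshold = lower_threshold,
    upper_threshold = upper_threshold,
    positions = which(is_outlier),  # row positions of the outliers
    values = x[is_outlier]          # empty vector when there are none
  )
}

# Made-up values for illustration only; 100 sits far above the rest
example_values <- c(10, 12, 13, 15, 16, 18, 100)
result <- find_outliers(example_values)
result$values
length(result$values)   # number of outlier records
```

With the notebook's data, `mydata[find_outliers(mydata$newspaper)$positions, ]` would list the outlier records for newspaper.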

TV OUTLIERS

#ranges
quantile(TV)

##      0%     25%     50%     75%    100% 
##   0.700  74.375 149.750 218.825 296.400

lowerq = quantile(TV)[2]
upperq = quantile(TV)[4]
iqr = upperq-lowerq

lowerq

##    25% 
## 74.375

upperq

##     75% 
## 218.825

iqr

##    75% 
## 144.45

#thresholds
upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##   75% 
## 435.5

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##    25% 
## -142.3

#outliers above threshold
TV[ TV > upper_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#outliers below threshold
TV[ TV < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#finding outlier records
mydata[ TV > upper_threshold, ]

mydata[ TV < lower_threshold, ]

There are no TV outliers: no TV spending value exceeds the upper threshold (435.5) or falls below the lower threshold (-142.3). The TV spending data is most likely fairly uniform because advertisers avoid drastic increases or decreases in spending while sales are steadily growing.

RADIO OUTLIERS

#ranges
quantile(Radio)

##     0%    25%    50%    75%   100% 
##  0.000  9.975 22.900 36.525 49.600

lowerq = quantile(Radio)[2]
upperq = quantile(Radio)[4]
iqr = upperq-lowerq

lowerq

##   25% 
## 9.975

upperq

##    75% 
## 36.525

iqr

##   75% 
## 26.55

#thresholds
upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##   75% 
## 76.35

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##    25% 
## -29.85

#outliers above threshold
Radio[ Radio > upper_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#outliers below threshold
Radio[ Radio < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#finding outlier records
mydata[ Radio > upper_threshold, ]

mydata[ Radio < lower_threshold, ]

There are no radio outliers: no radio spending value exceeds the upper threshold (76.35) or falls below the lower threshold (-29.85). Radio spending probably has no outliers because year-over-year or month-to-month spending would follow similar trends without spiking in either direction.

NEWSPAPER OUTLIERS

#ranges
quantile(Newspaper)

##     0%    25%    50%    75%   100% 
##   0.30  12.75  25.75  45.10 114.00

lowerq = quantile(Newspaper)[2]
upperq = quantile(Newspaper)[4]
iqr = upperq-lowerq

lowerq

##   25% 
## 12.75

upperq

##  75% 
## 45.1

iqr

##   75% 
## 32.35

#thresholds
upper_threshold = (iqr * 1.5) + upperq 
upper_threshold

##    75% 
## 93.625

lower_threshold = lowerq - (iqr * 1.5)
lower_threshold

##     25% 
## -35.775

#outliers above threshold
Newspaper [ Newspaper > upper_threshold][1:10]

##  [1] 114.0 100.9    NA    NA    NA    NA    NA    NA    NA    NA

#outliers below threshold
Newspaper [ Newspaper < lower_threshold][1:10]

##  [1] NA NA NA NA NA NA NA NA NA NA

#finding outlier records
mydata[ Newspaper > upper_threshold, ]

mydata[ Newspaper < lower_threshold, ]

There are two newspaper outliers above the upper threshold: case numbers 17 and 102. They are outliers because their newspaper spending (114.0 and 100.9) exceeds 93.625. This could have been caused by success found in newspaper sales for a certain time period or by increased distribution of coupons in newspapers. Sales do not appear to have increased greatly for either outlier, so the spending strategy was most likely not continued. In these instances, there was decreased spending on TV for one and higher spending on radio in both.

1D) Write a general description of the dataset using the statistics found in the steps above. Use the min,max range to compare the features, note any significant differences.

The most advertising spending regularly goes to TV and the second most to newspaper, although newspaper and radio have similar spending amounts. TV is significantly higher. The larger variance and SD in TV spending suggest that it fluctuates more, but the lack of outliers indicates no extreme spending or cutbacks. Sales has the lowest variance and standard deviation. This may be because the sales numbers are scaled down from thousands or millions to tens or hundreds. Another possibility is that differences in advertising spending are not a strong indicator of sales, since sales do not change as drastically as spending does.

Task 2: Qualitative Analysis

2A) Plot all the assigned features on the y-axis, using case_number for the x-axis. Use the given commands to create each plot and create a grid to plot all features. Note any trends/patterns in the data.

Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)

#grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)

sales_plot <- ggplot(data = mydata, aes(x = case_number, y = sales)) + geom_point()
TV_plot <- ggplot(data = mydata, aes(x = case_number, y = TV)) + geom_point()
# Use the column names as they appear in mydata (radio, newspaper)
Radio_plot <- ggplot(data = mydata, aes(x = case_number, y = radio)) + geom_point()
Newspaper_plot <- ggplot(data = mydata, aes(x = case_number, y = newspaper)) + geom_point()

grid.arrange(sales_plot, TV_plot, Radio_plot, Newspaper_plot, ncol=2)

Newspaper is much sparser at the top of the plot than the other features, which are spread evenly across their plots. Sales has only a few points at the low end and a few at the high end, so it looks more concentrated in the middle of the plot.

When looking at these plots it is hard to see a particular trend.
One way to observe any possible trend in the sales data would be to re-order the data from low to high.
The 200 monthly observations are in no particular chronological order.
The case numbers are independent, sequentially generated numbers. Since each case is independent, we can reorder them.

2B) Re-order sales from low to high, and save the re-ordered data in a new set. As the sales data is re-ordered, the other columns in each row follow.

Commands: newdata <- mydata[ order(mydata$VARIABLE), ]

newdata <- mydata[ order(mydata$sales), ]

# Extract case_number from the newdata
case_number <- newdata$case_number

head(newdata)

Extract the variables from the new data

# new_VARIABLE = newdata$VARIABLE

new_tv = newdata$TV
new_radio = newdata$radio
new_news = newdata$newspaper
new_sales = newdata$sales
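Because `order()` returns row positions rather than sorted values, indexing the whole data frame with it moves each row as a unit, so every advertising value stays with its sales value. A small sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the Advertising file):

```r
# Made-up rows for illustration only (not the Advertising data)
toy <- data.frame(
  case_number = 1:4,
  TV = c(120, 30, 60, 200),
  sales = c(15, 8, 6, 20)
)

# order() gives the row positions that sort sales from low to high
row_positions <- order(toy$sales)
row_positions

# Indexing rows with those positions keeps each row intact
toy_sorted <- toy[row_positions, ]
toy_sorted
```

The case_number column of `toy_sorted` is no longer in sequence, which is why the plots in 2C need an x value that runs from 1 upward rather than the re-ordered case numbers.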

2C) Repeat the 4 graphs with the newdata to spot any trends. Note your observations on what the new plots are revealing in terms of trending relationship.

Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
Commands: For x variable in the plot use: aes(x = case_number[order(case_number)])
Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)

newsales_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_sales)) + geom_point()
newtv_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_tv)) + geom_point()
newradio_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_radio)) + geom_point()
newnews_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_news)) + geom_point()

grid.arrange(newsales_plot, newtv_plot, newradio_plot, newnews_plot, ncol=2)

Sales appears much more orderly after arranging the data. The lower curve shown in new_tv and new_radio seems to mirror the upper part of the new_sales curve. Newspaper looks about the same as before; however, there appear to be more outliers now.

Task 3: Standardized Z-Value

3A) Create a histogram of the assigned feature z-scores. Describe the output and note any relevant values.

Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)
Commands: qplot( x = VARIABLE, geom = "histogram", binwidth = 0.3)

Sales Histogram

sales_zscore = (sales - mean(sales)) / sd(sales)
qplot ( x = sales_zscore,geom="histogram", binwidth = 0.3)

TV Histogram

TV_zscore = (TV - mean(TV)) / sd(TV)
qplot ( x = TV_zscore,geom="histogram", binwidth = 0.3)

Radio Histogram

Radio_zscore = (Radio - mean(Radio)) / sd(Radio)
qplot ( x = Radio_zscore,geom="histogram", binwidth = 0.3)

Newspaper Histogram

News_zscore = (Newspaper - mean(Newspaper)) / sd(Newspaper)
qplot ( x = News_zscore,geom="histogram", binwidth = 0.3)

The sales z-scores come closest to a normal distribution, meaning that most values are close to the mean. Most notably, newspaper has a strong positive skew: the outliers pull the mean up, so a large share of the data falls below the average. Radio and TV are closer to a uniform distribution, with somewhat more values at the extremes than in the center, so spending falls above and below the average about equally often.

3B) Given a sales value of $26700, calculate the corresponding z-value or z-score.

Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)

z_score = (26.7 - mean(sales) )/ sd(sales)
z_score

## [1] 2.429824

3C) Based on the z-value, how would you rate a $26700 sales value: poor, average, good, or very good performance? Explain your logic.

The z-score of 2.43 indicates good performance: the value is well above the mean (about 2.4 standard deviations), but 26.7 is still below the upper outlier threshold of 27.9375, so it is not high enough to be considered an outlier.
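The arithmetic behind this z-score can be reproduced from the mean and standard deviation reported in Task 1B. The sketch below also converts the z-score to an upper-tail proportion with `pnorm()`. That step assumes sales are roughly normally distributed, which the histogram only loosely supports, so the proportion is a rough guide rather than an exact figure.

```r
# Values reported in Task 1B (the notebook enters $26,700 as 26.7)
sales_mean <- 14.0225
sales_sd <- 5.217457
sales_value <- 26.7

z_score <- (sales_value - sales_mean) / sales_sd
round(z_score, 2)            # 2.43

# Assumption: approximately normal sales.
# Share of values expected to lie above this z-score.
upper_tail <- pnorm(z_score, lower.tail = FALSE)
upper_tail
```

Under that normality assumption, fewer than 1 in 100 sales values would be expected to exceed this one, which is consistent with rating it well above average.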