Descriptive Analytics

Notebook Instructions

For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.

Load Packages in R/RStudio

We are going to use tidyverse, a collection of R packages designed for data science.

Info: https://www.tidyverse.org/

## Loading required package: tidyverse

## -- Attaching packages ----------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## -- Conflicts -------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

Task 1: Quantitative Analysis

1A) Read the csv file into R Studio and display the dataset.

Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head() max() min() var() sd()

Extract the assigned features (columns) to perform some analytics.

mydata = read.csv("Advertising.csv")
head(mydata)

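Change the variable name "X1" to case_number using the function rename(). When the file is read with read.csv(), as in the chunk above, the unnamed first column is called "X" instead.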

mydata <- rename(mydata, "NEW_VAR_NAME" = "OLD_VAR_NAME")

mydata <- rename(mydata, "case_number" = "X")
head(mydata)
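The name given to a blank header depends on the reader function. Base R's read.csv() turns it into X, while the name read_csv() assigns varies with the readr version, so it is worth checking names(mydata) before renaming. A minimal sketch on a small inline sample (the object names `csv_text` and `demo` and the sample rows are illustrative only):

```r
# Inline sample with a blank first header, standing in for a CSV file
csv_text <- ",TV,sales\n1,230.1,22.1\n2,44.5,10.4"
demo <- read.csv(text = csv_text)
names(demo)   # inspect the column names before renaming

# Base R rename that does not depend on dplyr
names(demo)[names(demo) == "X"] <- "case_number"
names(demo)
```

The same check applies to the assignment file: look at names(mydata) first, then pass whichever old name appears there to rename().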

1B) Find the range (difference between min and max), min, max, standard deviation and variance for each assigned feature. (Use separate chunks for each feature.) Compare each feature and note any significant differences.

case_number

case_number <- mydata$case_number
#variable_max
case_number_max <- max(mydata$case_number)
case_number_max

## [1] 200

#variable_min
case_number_min <-min(mydata$case_number)
case_number_min

## [1] 1

#variable_Range max-min
case_number_range <- case_number_max - case_number_min
case_number_range

## [1] 199
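A note on range(): called on a vector, it returns both the minimum and the maximum rather than their difference, so the span is usually computed as max(x) - min(x) or diff(range(x)). A minimal sketch on a made-up vector (the names `x` and `span` are illustrative only):

```r
# Illustrative vector, not taken from Advertising.csv
x <- c(1, 50, 200)

range(x)                 # two values: the minimum and the maximum
span <- diff(range(x))   # a single number: max minus min
span
max(x) - min(x)          # same result as diff(range(x))
```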

#variable_mean 
case_number_mean <- mean(mydata$case_number)
case_number_mean

## [1] 100.5

#variable_sd Standard Deviation
case_number_sd <- sd(mydata$case_number)
case_number_sd

## [1] 57.87918

#variable_variance
case_number_variance <- var(mydata$case_number)
case_number_variance

## [1] 3350

The range of the data is 199, which is comparatively wide.

TV

TV <- mydata$TV
#variable_max
TV_max <- max(mydata$TV)
TV_max

## [1] 296.4

#variable_min
TV_min <-min(mydata$TV)
TV_min

## [1] 0.7

#variable_Range max-min
TV_range <- TV_max - TV_min
TV_range

## [1] 295.7

#variable_mean 
TV_mean <- mean(mydata$TV)
TV_mean

## [1] 147.0425

#variable_sd Standard Deviation
TV_sd <- sd(mydata$TV)
TV_sd

## [1] 85.85424

#variable_variance
TV_variance <- var(mydata$TV)
TV_variance

## [1] 7370.95

The variance for TV (7370.95) is more than double the variance of the case number data (3350).

radio <- mydata$radio
#variable_max
Radio_max <- max(mydata$radio)
Radio_max

## [1] 49.6

#variable_min
Radio_min <-min(mydata$radio)
Radio_min

## [1] 0

#variable_Range max-min
Radio_range <- Radio_max - Radio_min
Radio_range

## [1] 49.6

#variable_mean 
Radio_mean <- mean(mydata$radio)
Radio_mean

## [1] 23.264

#variable_sd Standard Deviation
Radio_sd <- sd(mydata$radio)
Radio_sd

## [1] 14.84681

#variable_variance
Radio_variance <- var(mydata$radio)
Radio_variance

## [1] 220.4277

The variance of radio (220.43) is far smaller than the variances of TV (7370.95) and case number (3350).

newspaper <- mydata$newspaper
#variable_max
newspaper_max <- max(mydata$newspaper)
newspaper_max

## [1] 114

#variable_min
newspaper_min <- min(mydata$newspaper)
newspaper_min

## [1] 0.3

#variable_Range max-min
newspaper_range <- newspaper_max - newspaper_min
newspaper_range

## [1] 113.7

#variable_mean 
newspaper_mean <- mean(mydata$newspaper)
newspaper_mean

## [1] 30.554

#variable_sd Standard Deviation
newspaper_sd <- sd(mydata$newspaper)
newspaper_sd

## [1] 21.77862

#variable_variance
newspaper_variance <- var(mydata$newspaper)
newspaper_variance

## [1] 474.3083

The newspaper variance (474.31) falls between those of radio (220.43) and TV (7370.95); nothing else stands out in the newspaper data.

sales <- mydata$sales
#variable_max
sales_max <- max(mydata$sales)
sales_max

## [1] 27

#variable_min
sales_min <-min(mydata$sales)
sales_min

## [1] 1.6

#variable_Range max-min
sales_range <- sales_max - sales_min
sales_range

## [1] 25.4

#variable_mean 
sales_mean <- mean(mydata$sales)
sales_mean

## [1] 14.0225

#variable_sd Standard Deviation
sales_sd <- sd(mydata$sales)
sales_sd

## [1] 5.217457

#variable_variance
sales_variance <- var(mydata$sales)
sales_variance

## [1] 27.22185

Sales has the lowest variance of any of the data.
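The chunks in 1B repeat the same six statistics for every feature. As a minimal sketch, the repetition can be folded into one helper and applied to each column; `describe_feature()` is a hypothetical name introduced here, and the toy data frame only stands in for mydata:

```r
# Hypothetical helper: the statistics used in 1B for one numeric vector
describe_feature <- function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), sd = sd(x), variance = var(x))
}

# Toy data frame standing in for mydata (values are illustrative)
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))

# One column of statistics per feature
stats_table <- sapply(toy, describe_feature)
round(stats_table, 2)
```

On the assignment data the equivalent call would be sapply(mydata[, c("TV", "radio", "newspaper", "sales")], describe_feature), which places the features side by side for comparison.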

1C) Use the summary() function on the whole dataset to give you a general description of the data. Note any differences between features.

summary(mydata)

##   case_number           TV             radio          newspaper     
##  Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30  
##  1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75  
##  Median :100.50   Median :149.75   Median :22.900   Median : 25.75  
##  Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55  
##  3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10  
##  Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00  
##      sales      
##  Min.   : 1.60  
##  1st Qu.:10.38  
##  Median :12.90  
##  Mean   :14.02  
##  3rd Qu.:17.40  
##  Max.   :27.00

In terms of minimums, radio (0) and newspaper (0.3) have the lowest, which may reflect the vast amount of new technology available. Among the advertising and sales features, the largest span between min and max comes from TV and the smallest from sales. The largest gap between median and mean comes from newspaper, at just under 5 points. The maximums span about 269 (27.0 to 296.4), far wider than the minimums, which span only 1.6.

Are there any outliers? If not, explain the lack of outliers. If so, explain what the outliers represent and how many records are outliers. (Use code from notebook-03 to find outliers.)

lowerquantileTV = quantile(mydata$TV)[2]
upperquantileTV = quantile(mydata$TV)[4]
lowerquantileradio = quantile(mydata$radio)[2]
upperquantileradio = quantile(mydata$radio)[4]
lowerquantilenewspaper = quantile(mydata$newspaper)[2]
upperquantilenewspaper = quantile(mydata$newspaper)[4]
lowerquantilesales = quantile(mydata$sales)[2]
upperquantilesales = quantile(mydata$sales)[4]

iqrTV = upperquantileTV - lowerquantileTV
iqrradio = upperquantileradio - lowerquantileradio
iqrnewspaper = upperquantilenewspaper - lowerquantilenewspaper
iqrsales = upperquantilesales - lowerquantilesales

UTTV = (iqrTV * 1.5) + upperquantileTV
UTradio = (iqrradio * 1.5) + upperquantileradio
UTnewspaper = (iqrnewspaper * 1.5) + upperquantilenewspaper
UTsales = (iqrsales * 1.5) + upperquantilesales

LTTV = lowerquantileTV - (iqrTV * 1.5)
LTradio = lowerquantileradio - (iqrradio * 1.5)
LTnewspaper = lowerquantilenewspaper - (iqrnewspaper * 1.5)
LTsales = lowerquantilesales - (iqrsales * 1.5)

count(mydata[mydata$TV > UTTV, ])

count(mydata[mydata$TV < LTTV, ])

count(mydata[mydata$radio > UTradio, ])

count(mydata[mydata$radio < LTradio, ])

count(mydata[mydata$newspaper > UTnewspaper, ])

count(mydata[mydata$newspaper < LTnewspaper, ])

count(mydata[mydata$sales > UTsales, ])

count(mydata[mydata$sales < LTsales, ])

There are two outliers in newspaper that exceed the upper threshold, meaning these two values lie well above the rest of the newspaper data. There are no outliers for TV, radio or sales.
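The threshold calculation above follows the usual 1.5 × IQR rule, repeated once per feature. A minimal sketch of the same rule as a reusable function; `iqr_outliers()` is a hypothetical helper, not code from notebook-03, and the toy vector is illustrative:

```r
# Hypothetical helper: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  x[x < lower | x > upper]
}

# Toy vector with one obviously extreme value (illustrative only)
toy <- c(1, 2, 3, 4, 5, 100)
toy_outliers <- iqr_outliers(toy)
toy_outliers
length(toy_outliers)   # number of outlying records
```

Applied to the assignment data, iqr_outliers(mydata$newspaper) would return the outlying values themselves, which makes it easier to describe what they represent.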

1D) Write a general description of the dataset using the statistics found in the steps above. Use the min, max and range to compare the features, and note any significant differences.

TV advertisements had the highest max value (296.4) and the largest range (295.7), whereas radio had the lowest minimum value (0). On average, the most money spent on advertising went to TV (mean 147.04) and the least went to radio (mean 23.26). Sales, which is the outcome rather than an advertising channel, had the smallest range (25.4).

Task 2: Qualitative Analysis

2A) Plot all the assigned features on the y-axis; for the x-axis use case_number. Use the given commands to create each plot and create a grid to plot all features. Note any trends/patterns in the data.

Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)

#grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
TV_plot <- ggplot(data = mydata, aes(x = case_number, y = TV)) + geom_point()
#TV_plot
radio_plot <- ggplot(data = mydata, aes(x = case_number, y = radio)) + geom_point()
#radio_plot
newspaper_plot <-ggplot(data = mydata, aes(x = case_number, y = newspaper)) + geom_point()
#newspaper_plot
sales_plot <- ggplot(data = mydata, aes(x = case_number, y = sales)) + geom_point()
#sales_plot
grid.arrange(TV_plot, radio_plot, newspaper_plot, sales_plot, ncol=2)

When looking at these plots it is hard to see a particular trend.
One way to observe any possible trend in the sales data would be to re-order the data from low to high.
The 200 monthly observations are in no particular chronological order.
The case numbers are independent sequentially generated numbers. Since each case is independent, we can reorder them.

2B) Re-order sales from low to high, and save the re-ordered data in a new set. As the sales data is re-ordered, the associated values in the other columns follow.

Commands: newdata <- mydata[ order(mydata$VARIABLE), ]

newdata <- mydata[ order(mydata$sales), ]
# Extract case_number from the newdata
case_number <- newdata$case_number
head(newdata)
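The re-ordering works because order() returns row positions rather than sorted values, so indexing the data frame by those positions moves whole rows. A minimal sketch on a toy data frame (the values and the names `toy`, `idx` and `toy_sorted` are illustrative only):

```r
# Toy data frame standing in for mydata (values are illustrative)
toy <- data.frame(case_number = 1:4,
                  TV = c(200, 20, 120, 60),
                  sales = c(22, 5, 15, 9))

# order() returns the row positions that sort sales from low to high
idx <- order(toy$sales)
idx

# Indexing rows by idx keeps each case's TV and sales values together
toy_sorted <- toy[idx, ]
toy_sorted
```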

Extract the variables from the new data

# new_VARIABLE = newdata$VARIABLE
new_TV = newdata$TV
new_radio = newdata$radio
new_newspaper = newdata$newspaper
new_sales = newdata$sales

2C) Repeat the 4 graphs with the newdata to spot any trends. Note your observations on what the new plots reveal about trends in each feature.

Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
Commands: For x variable in the plot use: aes(x = case_number[order(case_number)])
Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)

#grid.arrange(newsales_plot, newtv_plot, newradio_plot, newnews_plot, ncol=2)
new_TV_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_TV)) + geom_point()
new_radio_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_radio)) + geom_point()
new_newspaper_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_newspaper)) + geom_point()
new_sales_plot <- ggplot(data = mydata, aes(x = case_number[order(case_number)], y = new_sales)) + geom_point()
grid.arrange(new_sales_plot, new_TV_plot, new_radio_plot, new_newspaper_plot, ncol=2)

With the cases ordered by sales, the x-axis is the rank of each case by sales, so sales rises by construction. TV shows a clear positive trend along that ordering. Radio shows a somewhat positive trend, but it is weak and difficult to distinguish. Newspaper shows no apparent trend.
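These visual impressions can be checked numerically. A minimal sketch, assuming Pearson correlation from base R's cor() is an acceptable summary of a linear trend; the toy values below are illustrative and are not results from the assignment data:

```r
# Toy data frame standing in for newdata (values are illustrative)
toy <- data.frame(sales = c(5, 9, 15, 22),
                  TV = c(20, 60, 120, 200),
                  newspaper = c(30, 10, 40, 20))

# Correlation of each feature with sales: values near 1 or -1 indicate a
# strong linear trend, values near 0 a weak one
trend <- sapply(toy[, c("TV", "newspaper")], cor, y = toy$sales)
trend
```

On the assignment data the equivalent call would be sapply(mydata[, c("TV", "radio", "newspaper")], cor, y = mydata$sales).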

Task 3: Standardized Z-Value

3A) Create a histogram of the assigned feature z-scores. Describe the output and note any relevant values.

Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)
Commands: qplot( x = VARIABLE, geom = "histogram", binwidth = 0.3)

z_scores = ( sales - mean(sales) ) / sd(sales)
qplot( x = z_scores ,geom="histogram", binwidth = 0.3)

z_scores = (radio - mean(radio)) / sd(radio)
qplot(x = z_scores, geom = "histogram", binwidth = 0.3)

z_scores = (newspaper - mean(newspaper)) / sd(newspaper)
qplot(x = z_scores, geom = "histogram", binwidth = 0.3)

The sales histogram more closely resembles a normal distribution, while the newspaper histogram is more positively skewed.
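Standardizing changes only the location and scale of a feature, not the shape of its histogram: the z-scores always have mean 0 and standard deviation 1. A minimal sketch that checks this on a made-up vector and compares the manual formula with base R's scale() (the names `x`, `z_manual` and `z_scale` are illustrative):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # illustrative values

# Manual z-scores, as in the command given for 3A
z_manual <- (x - mean(x)) / sd(x)

# scale() centers on the mean and divides by the standard deviation by default
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
mean(z_manual)   # approximately 0
sd(z_manual)     # 1 up to floating-point error
```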

3B) Given a sales value of $26700, calculate the corresponding z-value or z-score.

Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)

z_score = ( 26.7 - mean(sales) ) / sd(sales)
z_score

## [1] 2.429824

3C) Based on the z-value, how would you rate a $26700 sales value: poor, average, good, or very good performance? Explain your logic.

The z-value is 2.429824, meaning sales of $26,700 are about 2.4 standard deviations above the mean. This indicates very good performance, well above average.
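To put the z-value in context, it can be converted to an approximate percentile. A minimal sketch, assuming sales is close enough to a normal distribution for pnorm() to be meaningful (an assumption, not something established above) and using the mean and standard deviation reported in 1B:

```r
sales_mean <- 14.0225   # mean of sales (in thousands), as reported in 1B
sales_sd <- 5.217457    # standard deviation of sales, as reported in 1B

# z-score for a sales value of 26.7 (i.e. $26,700)
z <- (26.7 - sales_mean) / sales_sd
z

# Share of a normal distribution lying below z (normality is an assumption)
pnorm(z)
```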