For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
We are going to use tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## -- Attaching packages -------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.7.2 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.2.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: gridExtra
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head() max() min() var() sd()
mydata <- read.csv(file = "data/Advertising.csv")
head(mydata)
sales = mydata$sales
TV = mydata$TV
Radio = mydata$radio
Newspaper = mydata$newspaper
mydata <- rename(mydata, "case_number" = "X")
head(mydata)
Sales
#variable_max
sales_max = max(sales)
sales_max
## [1] 27
#variable_min
sales_min = min(sales)
sales_min
## [1] 1.6
#variable_Range max-min
sales_range = sales_max - sales_min
sales_range
## [1] 25.4
#variable_mean
sales_mean = mean(sales)
sales_mean
## [1] 14.0225
#variable_sd Standard Deviation
sales_sd = sd(sales)
sales_sd
## [1] 5.217457
#variable_variance
sales_variance = var(sales)
sales_variance
## [1] 27.22185
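The six statistics above are computed the same way for every variable, so they can be gathered in one helper. The sketch below is one possible way to do that; `describe_variable()` and the `example_values` vector are hypothetical names and data used only for illustration, not part of the assignment.

```r
# Hypothetical helper: collect the statistics used in this section
# for any numeric vector.
describe_variable <- function(x) {
  c(
    max      = max(x),
    min      = min(x),
    range    = max(x) - min(x),
    mean     = mean(x),
    sd       = sd(x),   # sample standard deviation (n - 1 denominator)
    variance = var(x)   # sample variance, equal to sd(x)^2
  )
}

# Made-up values, only to show the call
example_values <- c(2, 4, 4, 4, 5, 5, 7, 9)
example_stats <- describe_variable(example_values)
print(round(example_stats, 3))
```

With the data loaded, `sapply(mydata[c("sales", "TV", "radio", "newspaper")], describe_variable)` would return the same statistics for all four variables as one table.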
TV
#variable_max
TV_max = max(TV)
TV_max
## [1] 296.4
#variable_min
TV_min = min(TV)
TV_min
## [1] 0.7
#variable_Range max-min
TV_range = TV_max - TV_min
TV_range
## [1] 295.7
#variable_mean
TV_mean = mean(TV)
TV_mean
## [1] 147.0425
#variable_sd Standard Deviation
TV_sd = sd(TV)
TV_sd
## [1] 85.85424
#variable_variance
TV_variance = var(TV)
TV_variance
## [1] 7370.95
RADIO
#variable_max
Radio_max = max(Radio)
Radio_max
## [1] 49.6
#variable_min
Radio_min = min(Radio)
Radio_min
## [1] 0
#variable_Range max-min
Radio_range = Radio_max - Radio_min
Radio_range
## [1] 49.6
#variable_mean
Radio_mean = mean(Radio)
Radio_mean
## [1] 23.264
#variable_sd Standard Deviation
Radio_sd = sd(Radio)
Radio_sd
## [1] 14.84681
#variable_variance
Radio_variance = var(Radio)
Radio_variance
## [1] 220.4277
NEWSPAPER
#variable_max
Newspaper_max = max(Newspaper)
Newspaper_max
## [1] 114
#variable_min
Newspaper_min = min(Newspaper)
Newspaper_min
## [1] 0.3
#variable_Range max-min
Newspaper_range = Newspaper_max - Newspaper_min
Newspaper_range
## [1] 113.7
#variable_mean
Newspaper_mean = mean(Newspaper)
Newspaper_mean
## [1] 30.554
#variable_sd Standard Deviation
Newspaper_sd = sd(Newspaper)
Newspaper_sd
## [1] 21.77862
#variable_variance
Newspaper_variance = var(Newspaper)
Newspaper_variance
## [1] 474.3083
TV has by far the largest variance (7370.95) and standard deviation (85.85) of the three advertising channels, along with the highest maximum and the widest range. Newspaper comes second (standard deviation 21.78), and radio has the smallest spread of the three (standard deviation 14.85), which is consistent with its narrower range of 49.6. Sales has the lowest standard deviation and variance of all four variables. This may indicate that sales stay fairly even despite the wide range of advertising spending, or that advertising spending does not affect sales much.
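The variables are measured on different scales, so one common way to compare their spread is the coefficient of variation (standard deviation divided by mean). The sketch below is not part of the assignment; it simply reuses the means and standard deviations reported above, and the vector names are illustrative.

```r
# Means and standard deviations copied from the output above
means <- c(sales = 14.0225, TV = 147.0425, radio = 23.264, newspaper = 30.554)
sds   <- c(sales = 5.217457, TV = 85.85424, radio = 14.84681, newspaper = 21.77862)

# Coefficient of variation: spread relative to the size of the mean
cv <- sds / means
print(round(cv, 3))

# Variable with the largest relative spread
print(names(which.max(cv)))
```

On this relative scale newspaper, not TV, varies the most, and sales still varies the least.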
summary(mydata)
## case_number TV radio newspaper
## Min. : 1.00 Min. : 0.70 Min. : 0.000 Min. : 0.30
## 1st Qu.: 50.75 1st Qu.: 74.38 1st Qu.: 9.975 1st Qu.: 12.75
## Median :100.50 Median :149.75 Median :22.900 Median : 25.75
## Mean :100.50 Mean :147.04 Mean :23.264 Mean : 30.55
## 3rd Qu.:150.25 3rd Qu.:218.82 3rd Qu.:36.525 3rd Qu.: 45.10
## Max. :200.00 Max. :296.40 Max. :49.600 Max. :114.00
## sales
## Min. : 1.60
## 1st Qu.:10.38
## Median :12.90
## Mean :14.02
## 3rd Qu.:17.40
## Max. :27.00
TV is much higher than radio and newspaper from the first quartile upward, suggesting that the most spending went to TV advertising. It also has a wider range of spending than radio and newspaper. While TV is higher on average, newspaper has a large jump between its third quartile (45.10) and its maximum (114.00), which may suggest an outlier. Radio is consistently the lowest in terms of spending and has the smallest spread of the three channels based on its range. With the exception of TV, the first quartiles are similar (roughly 10 to 13).
SALES OUTLIERS
#ranges
quantile(sales)
## 0% 25% 50% 75% 100%
## 1.600 10.375 12.900 17.400 27.000
lowerq = quantile(sales)[2]
upperq = quantile(sales)[4]
iqr = upperq-lowerq
lowerq
## 25%
## 10.375
upperq
## 75%
## 17.4
iqr
## 75%
## 7.025
#thresholds
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 27.9375
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -0.1625
#outliers above threshold
sales[ sales > upper_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#outliers below threshold
sales[ sales < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#finding outlier records
mydata[ sales > upper_threshold, ]
mydata[ sales < lower_threshold, ]
There are no sales outliers: every sales value lies between the lower threshold (-0.1625) and the upper threshold (27.9375), with a minimum of 1.6 and a maximum of 27. The sales data is most likely fairly evenly spread, as it would be if sales were increasing steadily.
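The threshold steps above follow the 1.5 × IQR rule. As a rough sketch of the same steps in one place, the hypothetical helper `iqr_outliers()` below returns both thresholds and the values outside them; the toy vector is made up for illustration. The `NA` values printed by the `[1:10]` indexing above come from indexing past the end of an empty result, not from missing data, and the last line of the sketch reproduces that behavior.

```r
# Hypothetical helper implementing the 1.5 * IQR rule used in this section
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(
    lower_threshold = lower,
    upper_threshold = upper,
    outliers = x[x < lower | x > upper]  # empty when nothing is outside
  )
}

# Made-up values with one obvious outlier
toy <- c(1, 2, 3, 4, 5, 100)
toy_result <- iqr_outliers(toy)
print(toy_result)

# Indexing an empty vector with 1:10 pads the result with NA
print(numeric(0)[1:10])
```

Returning the matching values directly gives an empty vector when nothing is flagged, which is easier to read than ten `NA` entries.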
TV OUTLIERS
#ranges
quantile(TV)
## 0% 25% 50% 75% 100%
## 0.700 74.375 149.750 218.825 296.400
lowerq = quantile(TV)[2]
upperq = quantile(TV)[4]
iqr = upperq-lowerq
lowerq
## 25%
## 74.375
upperq
## 75%
## 218.825
iqr
## 75%
## 144.45
#thresholds
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 435.5
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -142.3
#outliers above threshold
TV[ TV > upper_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#outliers below threshold
TV[ TV < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#finding outlier records
mydata[ TV > upper_threshold, ]
mydata[ TV < lower_threshold, ]
There are no TV outliers: no TV spending values exceed the upper threshold (435.5) or fall below the lower threshold (-142.3). The TV spending data is most likely fairly evenly spread, as advertisers seek to avoid drastic increases or decreases in spending if sales are growing steadily.
RADIO OUTLIERS
#ranges
quantile(Radio)
## 0% 25% 50% 75% 100%
## 0.000 9.975 22.900 36.525 49.600
lowerq = quantile(Radio)[2]
upperq = quantile(Radio)[4]
iqr = upperq-lowerq
lowerq
## 25%
## 9.975
upperq
## 75%
## 36.525
iqr
## 75%
## 26.55
#thresholds
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 76.35
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -29.85
#outliers above threshold
Radio[ Radio > upper_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#outliers below threshold
Radio[ Radio < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#finding outlier records
mydata[ Radio > upper_threshold, ]
mydata[ Radio < lower_threshold, ]
There are no radio outliers: no radio spending values exceed the upper threshold (76.35) or fall below the lower threshold (-29.85). Radio spending probably has no outliers because year-over-year or month-to-month radio spending would follow similar trends without spiking in either direction.
NEWSPAPER OUTLIERS
#ranges
quantile(Newspaper)
## 0% 25% 50% 75% 100%
## 0.30 12.75 25.75 45.10 114.00
lowerq = quantile(Newspaper)[2]
upperq = quantile(Newspaper)[4]
iqr = upperq-lowerq
lowerq
## 25%
## 12.75
upperq
## 75%
## 45.1
iqr
## 75%
## 32.35
#thresholds
upper_threshold = (iqr * 1.5) + upperq
upper_threshold
## 75%
## 93.625
lower_threshold = lowerq - (iqr * 1.5)
lower_threshold
## 25%
## -35.775
#outliers above threshold
Newspaper [ Newspaper > upper_threshold][1:10]
## [1] 114.0 100.9 NA NA NA NA NA NA NA NA
#outliers below threshold
Newspaper [ Newspaper < lower_threshold][1:10]
## [1] NA NA NA NA NA NA NA NA NA NA
#finding outlier records
mydata[ Newspaper > upper_threshold, ]
mydata[ Newspaper < lower_threshold, ]
There are two newspaper outliers above the upper threshold, case numbers 17 and 102, with newspaper spending of 114.0 and 100.9 respectively. These are outliers because their newspaper spending exceeded 93.625. This could have been caused by success found in newspaper sales for a certain time period or by increased distribution of coupons in newspapers. Sales do not appear to have increased greatly for either of these outliers, so the spending strategy was most likely not continued. In these instances, TV spending was lower for one case and radio spending was higher in both.
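To report which cases are outliers, a logical condition can be turned into row positions with `which()`. The sketch below uses a small made-up data frame (`toy_data`), not the Advertising data, together with the threshold computed above; only the pattern carries over.

```r
# Made-up data frame standing in for mydata
toy_data <- data.frame(
  case_number = 1:6,
  newspaper   = c(12.5, 30.1, 114.0, 45.2, 100.9, 8.7)
)

# Upper threshold taken from the newspaper calculation in this section
upper_threshold <- 93.625

# Row positions where the condition holds
outlier_rows <- which(toy_data$newspaper > upper_threshold)
print(outlier_rows)

# Full records for those rows, including the case numbers
outlier_records <- toy_data[outlier_rows, ]
print(outlier_records)
```

Using `which()` also drops any `NA` comparisons, which plain logical indexing would keep as `NA` rows.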
The most advertising spending regularly goes to TV and the second most to newspaper; however, newspaper and radio spending are close, while TV is significantly higher. The larger variance and standard deviation in TV spending suggest that it fluctuates more, but the lack of outliers indicates no extreme spending or cutbacks. Sales has the smallest variance and standard deviation. This may be because the sales numbers are scaled down from thousands or millions to tens or hundreds. Another possibility is that differences in advertising spending are not a strong indicator of sales, since sales do not change drastically either.
#grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
sales_plot <- ggplot(data = mydata, aes(x = case_number, y = sales)) + geom_point()
TV_plot <- ggplot(data = mydata, aes(x = case_number, y = TV)) + geom_point()
Radio_plot <- ggplot(data = mydata, aes(x = case_number, y = radio)) + geom_point()
Newspaper_plot <- ggplot(data = mydata, aes(x = case_number, y = newspaper)) + geom_point()
grid.arrange(sales_plot, TV_plot, Radio_plot, Newspaper_plot, ncol=2)
Newspaper is much sparser at the top of its plot than the others, which are spread evenly across their plots. Sales has a few points at the low end and a few at the high end, so it looks more concentrated in the middle of the plot.
newdata <- mydata[ order(mydata$sales), ]
# Extract case_number from the newdata
case_number <- newdata$case_number
head(newdata)
# new_VARIABLE = newdata$VARIABLE
new_tv = newdata$TV
new_radio = newdata$radio
new_news = newdata$newspaper
new_sales = newdata$sales
# x is the position of each case after sorting by sales (1 = lowest sales)
newsales_plot <- ggplot(data = newdata, aes(x = seq_along(sales), y = sales)) + geom_point()
newtv_plot <- ggplot(data = newdata, aes(x = seq_along(TV), y = TV)) + geom_point()
newradio_plot <- ggplot(data = newdata, aes(x = seq_along(radio), y = radio)) + geom_point()
newnews_plot <- ggplot(data = newdata, aes(x = seq_along(newspaper), y = newspaper)) + geom_point()
grid.arrange(newsales_plot, newtv_plot, newradio_plot, newnews_plot, ncol=2)
Sales Histogram
sales_zscore = (sales - mean(sales)) / sd(sales)
qplot ( x = sales_zscore,geom="histogram", binwidth = 0.3)
TV Histogram
TV_zscore = (TV - mean(TV)) / sd(TV)
qplot ( x = TV_zscore,geom="histogram", binwidth = 0.3)
Radio Histogram
Radio_zscore = (Radio - mean(Radio)) / sd(Radio)
qplot ( x = Radio_zscore,geom="histogram", binwidth = 0.3)
Newspaper Histogram
News_zscore = (Newspaper - mean(Newspaper)) / sd(Newspaper)
qplot ( x = News_zscore,geom="histogram", binwidth = 0.3)
The sales z-scores look the closest to a normal distribution, with most values near the mean. Most notably, newspaper has a strong positive skew: the outliers pull the mean up, so a large share of the data falls below the average. Radio and TV are closer to a uniform distribution, though with somewhat more values toward the extremes than in the center; spending is above or below the average about equally often.
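The skew described above can also be checked numerically. The sketch below standardizes a vector the same way as the z-score code in this section and computes a simple moment-based skewness; `z_scores()`, `skewness_simple()`, and both example vectors are hypothetical, chosen only to show a symmetric and a right-skewed case.

```r
# z-scores: distance from the mean in units of standard deviation
z_scores <- function(x) (x - mean(x)) / sd(x)

# Hypothetical helper: moment-based skewness
# (positive = long right tail, near zero = roughly symmetric)
skewness_simple <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric_values    <- c(1, 2, 3, 4, 5, 6, 7)
right_skewed_values <- c(1, 1, 2, 2, 3, 4, 20)

# z-scores always have mean 0 and standard deviation 1
z <- z_scores(right_skewed_values)
print(round(c(mean = mean(z), sd = sd(z)), 3))

print(skewness_simple(symmetric_values))     # symmetric data
print(skewness_simple(right_skewed_values))  # right-skewed data
```

Applied to the newspaper column, a clearly positive value would support the positive skew seen in the histogram.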
z_score = (26.7 - mean(sales) )/ sd(sales)
z_score
## [1] 2.429824
The z-score of 2.43 indicates very good performance: the value is about 2.4 standard deviations above the mean, but at 26.7 it is still below the upper outlier threshold of 27.9375 for sales, so it is not high enough to be considered an outlier.
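As a check on the calculation above, the same z-score can be reproduced from the reported mean and standard deviation of sales. The percentile line assumes sales are approximately normal, which is only an approximation for this data.

```r
# Values reported earlier in this section
sales_mean <- 14.0225
sales_sd   <- 5.217457

# z-score for a sales value of 26.7
z_score <- (26.7 - sales_mean) / sales_sd
print(round(z_score, 2))

# Assumption: if sales were normally distributed, this is the share of
# values expected to fall below a z-score of this size
share_below <- pnorm(z_score)
print(round(share_below, 3))
```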