For your assignment, you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in a larger bold font and numbered according to their section.
We are going to use tidyverse, a collection of R packages designed for data science.
Loading required package: tidyverse
── Attaching packages ── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: gridExtra
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head() max() min() var() sd()
mydata = read_csv(file="data/Advertising.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
TV = col_double(),
radio = col_double(),
newspaper = col_double(),
sales = col_double()
)
head(mydata)
mydata = rename(mydata, "case_number" = "X1")
mydata
TV
#variable_max
maxTV = max(mydata$TV)
maxTV
[1] 296.4
#variable_min
minTV = min(mydata$TV)
minTV
[1] 0.7
#variable_Range max-min
rangeTV = maxTV - minTV
rangeTV
[1] 295.7
#variable_mean
meanTV = mean(mydata$TV)
meanTV
[1] 147.0425
#variable_sd Standard Deviation
sdTV = sd(mydata$TV)
sdTV
[1] 85.85424
#variable_variance
varianceTV = var(mydata$TV)
varianceTV
[1] 7370.95
RADIO
#variable_max
maxradio = max(mydata$radio)
maxradio
[1] 49.6
#variable_min
minradio = min(mydata$radio)
minradio
[1] 0
#variable_Range max-min
rangeradio = maxradio - minradio
rangeradio
[1] 49.6
#variable_mean
meanradio = mean(mydata$radio)
meanradio
[1] 23.264
#variable_sd Standard Deviation
sdradio = sd(mydata$radio)
sdradio
[1] 14.84681
#variable_variance
varianceradio = var(mydata$radio)
varianceradio
[1] 220.4277
NEWSPAPER
#variable_max
maxnewspaper = max(mydata$newspaper)
maxnewspaper
[1] 114
#variable_min
minnewspaper = min(mydata$newspaper)
minnewspaper
[1] 0.3
#variable_Range max-min
rangenewspaper = maxnewspaper - minnewspaper
rangenewspaper
[1] 113.7
#variable_mean
meannewspaper = mean(mydata$newspaper)
meannewspaper
[1] 30.554
#variable_sd Standard Deviation
sdnewspaper = sd(mydata$newspaper)
sdnewspaper
[1] 21.77862
#variable_variance
variancenewspaper = var(mydata$newspaper)
variancenewspaper
[1] 474.3083
SALES
#variable_max
maxsales = max(mydata$sales)
maxsales
[1] 27
#variable_min
minsales = min(mydata$sales)
minsales
[1] 1.6
#variable_Range max-min
rangesales = maxsales - minsales
rangesales
[1] 25.4
#variable_mean
meansales = mean(mydata$sales)
meansales
[1] 14.0225
#variable_sd Standard Deviation
sdsales = sd(mydata$sales)
sdsales
[1] 5.217457
#variable_variance
variancesales = var(mydata$sales)
variancesales
[1] 27.22185
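The per-variable statistics above repeat the same six calls for each column. As a rough sketch only, the same numbers can be collected in one table with `sapply()`. The data frame `toy` and the helper name `describe` are made up for illustration and are not part of the assignment or of any package; with the real data one would pass the numeric columns of `mydata` instead.

```r
# Hypothetical toy data, NOT the Advertising dataset
toy <- data.frame(
  TV    = c(10, 20, 30, 40),
  radio = c(1, 3, 5, 7)
)

# describe() is an illustrative helper name, not part of any package
describe <- function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), sd = sd(x), variance = var(x))
}

# sapply() applies describe() to every column and binds the results
# into a matrix: one row per statistic, one column per variable
stats_table <- sapply(toy, describe)
stats_table
```

Computing the range inside one shared helper also avoids copy-and-paste slips, such as subtracting one variable's minimum from another variable's maximum.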
summary(mydata)
  case_number           TV             radio          newspaper          sales
 Min.   :  1.00   Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60
 1st Qu.: 50.75   1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38
 Median :100.50   Median :149.75   Median :22.900   Median : 25.75   Median :12.90
 Mean   :100.50   Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02
 3rd Qu.:150.25   3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40
 Max.   :200.00   Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
lowerquantileTV = quantile(mydata$TV)[2]
upperquantileTV = quantile(mydata$TV)[4]
lowerquantileradio = quantile(mydata$radio)[2]
upperquantileradio = quantile(mydata$radio)[4]
lowerquantilenewspaper = quantile(mydata$newspaper)[2]
upperquantilenewspaper = quantile(mydata$newspaper)[4]
lowerquantilesales = quantile(mydata$sales)[2]
upperquantilesales = quantile(mydata$sales)[4]
IQR calculations
iqrTV = upperquantileTV - lowerquantileTV
iqrradio = upperquantileradio - lowerquantileradio
iqrnewspaper = upperquantilenewspaper - lowerquantilenewspaper
iqrsales = upperquantilesales - lowerquantilesales
Upper Threshold
UTTV = (iqrTV * 1.5) + upperquantileTV
UTradio = (iqrradio * 1.5) + upperquantileradio
UTnewspaper = (iqrnewspaper * 1.5) + upperquantilenewspaper
UTsales = (iqrsales * 1.5) + upperquantilesales
Lower Threshold
LTTV = lowerquantileTV - (iqrTV * 1.5)
LTradio = lowerquantileradio - (iqrradio * 1.5)
LTnewspaper = lowerquantilenewspaper - (iqrnewspaper * 1.5)
LTsales = lowerquantilesales - (iqrsales * 1.5)
count(mydata[mydata$TV > UTTV, ])
count(mydata[mydata$TV < LTTV, ])
count(mydata[mydata$radio > UTradio, ])
count(mydata[mydata$radio < LTradio, ])
count(mydata[mydata$newspaper > UTnewspaper, ])
count(mydata[mydata$newspaper < LTnewspaper, ])
count(mydata[mydata$sales > UTsales, ])
count(mydata[mydata$sales < LTsales, ])
There are no outliers for TV, radio, or sales under the 1.5 × IQR rule. Newspaper, however, has two values above its upper threshold (about 93.6, from Q3 = 45.10 and IQR = 32.35). These two observations lie unusually far above the bulk of the newspaper data.
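The threshold calculation used for each variable follows the usual 1.5 × IQR rule. A minimal sketch of that rule as a reusable function is shown here; the function name `iqr_outliers`, its argument `k`, and the vector `toy` are hypothetical and chosen only for illustration.

```r
# Illustrative helper (hypothetical name): flags values outside
# [Q1 - k * IQR, Q3 + k * IQR] using R's default quantile method
iqr_outliers <- function(x, k = 1.5) {
  q     <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr   <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(lower = lower, upper = upper,
       low = x[x < lower], high = x[x > upper])
}

# Made-up values, NOT taken from the Advertising data
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
res <- iqr_outliers(toy)
res$upper   # upper threshold
res$high    # values above the upper threshold
```

With the assignment data, a call such as `iqr_outliers(mydata$newspaper)$high` would be expected to list the values counted above, assuming `mydata` has been loaded as in the earlier steps.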
The mean for TV is 147.04 and the standard deviation is 85.85. The max value for TV is 296.4 and the minimum value is 0.7. The mean for radio is 23.26, the standard deviation is 14.85, the max is 49.6 and the min is 0. For newspaper, the mean is 30.55, the standard deviation is 21.78, the max is 114 and the min is 0.3. Lastly, the mean for sales is 14.02, the standard deviation is 5.22, the max is 27 and the min is 1.6. Among the three advertising channels, the most money on average went to TV and the least went to radio (mean 23.26, compared with 30.55 for newspaper). Sales is the outcome variable, not an advertising channel. The high TV figure makes sense in most applications of advertising, because TV ads tend to be one of the most expensive forms of advertising.
#grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
TV_plot <- ggplot(data = mydata, aes(x = case_number, y = TV)) + geom_point()
radio_plot <- ggplot(data = mydata, aes(x = case_number, y = radio)) + geom_point()
newspaper_plot <- ggplot(data = mydata, aes(x = case_number, y = newspaper)) + geom_point()
sales_plot <- ggplot(data = mydata, aes(x = case_number, y = sales)) + geom_point()
grid.arrange(sales_plot, TV_plot, radio_plot, newspaper_plot, ncol=2)
There does not seem to be an identifiable trend in any of these scatterplots.
# Sort the rows by sales (ascending)
newdata = mydata[ order(mydata$sales), ]
# Extract case_number from newdata
case_number <- newdata$case_number
head(newdata)
# new_VARIABLE = newdata$VARIABLE
new_TV = newdata$TV
new_radio = newdata$radio
new_newspaper = newdata$newspaper
new_sales = newdata$sales
#grid.arrange(newsales_plot, newtv_plot, newradio_plot, newnews_plot, ncol=2)
# x = sort(case_number) gives the values 1 to 200 in increasing order, i.e. each row's position after sorting by sales
newTV_plot <- ggplot(data = newdata, aes(x = sort(case_number), y = new_TV)) + geom_point()
newradio_plot <- ggplot(data = newdata, aes(x = sort(case_number), y = new_radio)) + geom_point()
newnewspaper_plot <- ggplot(data = newdata, aes(x = sort(case_number), y = new_newspaper)) + geom_point()
newsales_plot <- ggplot(data = newdata, aes(x = sort(case_number), y = new_sales)) + geom_point()
grid.arrange(newsales_plot, newTV_plot, newradio_plot, newnewspaper_plot, ncol=2)
After re-ordering the data by sales, it is much easier to see identifiable patterns in these scatterplots. The sales points fall along a positively sloped curve, which is expected because the rows are now sorted by sales. The TV points start off heavily concentrated at low values and then become more dispersed, but still trend upward. Similarly, the new radio plot shows a mostly positive relationship with sales. However, there still does not seem to be any identifiable pattern for the newspaper data.
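The re-ordering step can be sketched on a small data frame. Everything here is illustrative: `toy`, `sorted`, and the `sales_rank` column are hypothetical names with made-up values, and the Spearman correlation is added only as one possible numeric check on the visual impression, not as part of the original analysis.

```r
# Made-up values, NOT the Advertising dataset
toy <- data.frame(
  case_number = 1:5,
  TV          = c(50, 200, 20, 150, 120),
  sales       = c(10, 22, 9, 18, 13)
)

# Sort rows by sales and record each row's position in the sorted order
sorted <- toy[order(toy$sales), ]
sorted$sales_rank <- seq_len(nrow(sorted))
sorted

# Rank correlation between position-by-sales and TV:
# values near 1 mean TV tends to rise as sales rise
rho <- cor(sorted$sales_rank, sorted$TV, method = "spearman")
rho
```

Plotting `sales_rank` on the x axis makes it explicit that the horizontal position is a rank by sales rather than the original case number.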
z_scoreTV = (mydata$TV - meanTV) / sdTV
z_scoreradio = (mydata$radio - meanradio) / sdradio
z_scorenewspaper = (mydata$newspaper - meannewspaper) / sdnewspaper
z_scoresales = (mydata$sales - meansales) / sdsales
qplot( x = z_scoreTV ,geom="histogram", binwidth = 0.3)
qplot( x = z_scoreradio ,geom="histogram", binwidth = 0.3)
qplot( x = z_scorenewspaper ,geom="histogram", binwidth = 0.3)
qplot( x = z_scoresales ,geom="histogram", binwidth = 0.3)
The sales histogram most closely resembles a normal distribution, while the newspaper histogram is positively (right) skewed. This is likely because of the high outliers in the newspaper data.
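Standardizing only shifts and rescales the data, so a z-score histogram has the same shape as the histogram of the raw values. The short sketch below, using a made-up vector `x`, checks that the manual z-score formula agrees with base R's `scale()` and that the resulting scores have mean 0 and standard deviation 1.

```r
# Made-up, right-skewed values, NOT the Advertising dataset
x <- c(2, 4, 4, 5, 7, 9, 30)

# Manual z-scores, same formula as used for the four variables
z_manual <- (x - mean(x)) / sd(x)

# scale() centers and divides by the sample standard deviation by default;
# as.numeric() drops the matrix shape and attributes it returns
z_scaled <- as.numeric(scale(x))

all.equal(z_manual, z_scaled)  # the two approaches agree
mean(z_manual)                 # numerically 0
sd(z_manual)                   # numerically 1
```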
z_scoresalescalculation = ( 26.7 - mean(mydata$sales) ) / sd(mydata$sales)
z_scoresalescalculation
[1] 2.429824
I would say that a z-score of 2.43 indicates good performance. It is positive and about 2.4 standard deviations above the mean, which shows that sales of 26.7 are well above the average of 14.02.
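As a worked check of this interpretation, the sketch below recomputes the z-score from the rounded mean and standard deviation printed earlier. The final step rests on an assumption that is not established in the analysis: it treats sales as approximately normal, so the tail share is only a rough guide.

```r
x      <- 26.7       # sales value being evaluated (from the text)
mean_s <- 14.0225    # meansales as printed above
sd_s   <- 5.217457   # sdsales as printed above (rounded)

# z-score: distance from the mean in units of standard deviations
z <- (x - mean_s) / sd_s
round(z, 2)

# Share of a standard normal distribution lying above z.
# Assumption: sales are roughly normal; this is an approximation only.
upper_tail <- pnorm(z, lower.tail = FALSE)
upper_tail
```

Under that normality assumption, fewer than 1% of observations would be expected to have higher sales, which is consistent with calling this a strong result.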