Notebook Instructions
For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks and questions to be completed are highlighted in a larger bold font and numbered according to their section.
Load Packages in R/RStudio
We are going to use tidyverse, a collection of R packages designed for data science.
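The functions called later in this notebook suggest which packages need to be attached: rename(), count(), ggplot() and qplot() come with tidyverse, while grid.arrange() comes from gridExtra, which is not part of tidyverse. A minimal sketch of a loading chunk, assuming both packages are already installed (the check for missing packages is an addition for illustration, not part of the assignment):

```r
# Packages this notebook appears to rely on (inferred from the functions it calls).
pkgs <- c("tidyverse", "gridExtra")

# Find any that are not installed, so the problem is reported up front
# instead of halfway through the notebook.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```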
Task 1: Quantitative Analysis
1A) Read the csv file into R Studio and display the dataset.
Change the variable name "X1" to case_number using the function rename(). (In this notebook the column was read in as "X", so that is the name used in the code below.)
- mydata <- rename(mydata, "NEW_VAR_NAME" = "OLD_VAR_NAME")
mydata <- rename(mydata, "case_number" = "X")
head(mydata)
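The read step itself is not shown above, so here is a hedged sketch of where a name like "X" comes from. It writes a tiny made-up CSV (the values and the temporary file are hypothetical, not the assignment data) whose first header field is empty, reads it with base R's read.csv(), and renames the auto-generated column. readr's read_csv() uses a different placeholder name, which may be why the instructions mention "X1".

```r
# Write a tiny made-up CSV whose first column has an empty header,
# the way row numbers are often exported.
tmp <- tempfile(fileext = ".csv")
writeLines(c('"","TV","sales"',
             '"1",200,22',
             '"2",50,10'), tmp)

demo <- read.csv(tmp)
names(demo)   # read.csv() fills in the empty header as "X"

# Base-R equivalent of rename(demo, "case_number" = "X")
names(demo)[names(demo) == "X"] <- "case_number"
names(demo)
```

Checking names() right after reading the file is a quick way to see which placeholder name your reader produced before calling rename().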
1B) Find the range (difference between min and max), min, max, standard deviation and variance for each assigned feature (use a separate chunk for each feature). Compare the features and note any significant differences.
TV
TV <- mydata$TV
TV_max <- max(mydata$TV)
TV_max
[1] 296.4
TV_min <- min(mydata$TV)
TV_min
[1] 0.7
# The range asked for is max - min; range() on a single number just repeats it
TV_range <- TV_max - TV_min
TV_range
[1] 295.7
TV_mean <- mean(mydata$TV)
TV_mean
[1] 147.0425
TV_sd <- sd(mydata$TV)
TV_sd
[1] 85.85424
TV_variance <- var(mydata$TV)
TV_variance
[1] 7370.95
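The variance looks very large, but it is simply the standard deviation squared (85.85^2 is about 7371) and is measured in squared units, so the standard deviation is the easier number to compare across features.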
Radio
radio <- mydata$radio
Radio_max <- max(mydata$radio)
Radio_max
[1] 49.6
Radio_min <- min(mydata$radio)
Radio_min
[1] 0
# The range asked for is max - min; range() on a single number just repeats it
Radio_range <- Radio_max - Radio_min
Radio_range
[1] 49.6
Radio_mean <- mean(mydata$radio)
Radio_mean
[1] 23.264
Radio_sd <- sd(mydata$radio)
Radio_sd
[1] 14.84681
Radio_variance <- var(mydata$radio)
Radio_variance
[1] 220.4277
Radio is the only feature with a minimum of exactly 0.
Newspaper
newspaper_max <- max(mydata$newspaper)
newspaper_max
[1] 114
newspaper_min <- min(mydata$newspaper)
newspaper_min
[1] 0.3
# The range asked for is max - min; range() on a single number just repeats it
newspaper_range <- newspaper_max - newspaper_min
newspaper_range
[1] 113.7
newspaper_mean <- mean(mydata$newspaper)
newspaper_mean
[1] 30.554
newspaper_sd <- sd(mydata$newspaper)
newspaper_sd
[1] 21.77862
newspaper_variance <- var(mydata$newspaper)
newspaper_variance
[1] 474.3083
Newspaper falls between radio and TV: its range (113.7) and standard deviation (21.78) are larger than radio's but much smaller than TV's.
Sales
sales <- mydata$sales
sales_max <- max(mydata$sales)
sales_max
[1] 27
sales_min <- min(mydata$sales)
sales_min
[1] 1.6
# The range asked for is max - min; range() on a single number just repeats it
sales_range <- sales_max - sales_min
sales_range
[1] 25.4
sales_mean <- mean(mydata$sales)
sales_mean
[1] 14.0225
sales_sd <- sd(mydata$sales)
sales_sd
[1] 5.217457
sales_variance <- var(mydata$sales)
sales_variance
[1] 27.22185
Sales has the lowest standard deviation of the four features (5.22).
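The same six statistics are computed four times above, so a small helper can keep the chunks consistent. This is only a sketch: describe_feature() and the toy vector are hypothetical names and values introduced for illustration. It also shows that range() returns the pair c(min, max), so the single-number spread is diff(range(x)), and that the variance is the standard deviation squared.

```r
# Hypothetical helper: the statistics requested in 1B for one numeric vector.
describe_feature <- function(x) {
  c(min      = min(x),
    max      = max(x),
    range    = diff(range(x)),  # range() gives c(min, max); diff() gives max - min
    mean     = mean(x),
    sd       = sd(x),
    variance = var(x))
}

# Made-up values, not the assignment data.
toy <- c(2, 4, 4, 4, 5, 5, 7, 9)
stats <- describe_feature(toy)
stats

# The variance is the standard deviation squared.
all.equal(stats[["sd"]]^2, stats[["variance"]])
```

With the real data this would be called once per feature, for example describe_feature(mydata$TV).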
1C) Use the summary() function on the whole dataset to get a general description of the data. Note any differences between features.
summary(mydata)
  case_number           TV
 Min.   :  1.00   Min.   :  0.70
 1st Qu.: 50.75   1st Qu.: 74.38
 Median :100.50   Median :149.75
 Mean   :100.50   Mean   :147.04
 3rd Qu.:150.25   3rd Qu.:218.82
 Max.   :200.00   Max.   :296.40
     radio          newspaper          sales
 Min.   : 0.000   Min.   :  0.30   Min.   : 1.60
 1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38
 Median :22.900   Median : 25.75   Median :12.90
 Mean   :23.264   Mean   : 30.55   Mean   :14.02
 3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40
 Max.   :49.600   Max.   :114.00   Max.   :27.00
Are there any outliers? If not, explain the lack of outliers. If so, explain what the outliers represent and how many records are outliers. (Use code from notebook-03 to find outliers.)
iqrTV = quantile(mydata$TV)[4] - quantile(mydata$TV)[2]
iqrRadio = quantile(mydata$radio)[4] - quantile(mydata$radio)[2]
iqrNewspaper = quantile(mydata$newspaper)[4] - quantile(mydata$newspaper)[2]
iqrSales = quantile(mydata$sales)[4] - quantile(mydata$sales)[2]
UpperTV = (iqrTV * 1.5) + quantile(mydata$TV)[4]
UpperRadio = (iqrRadio * 1.5) + quantile(mydata$radio)[4]
UpperNewspaper = (iqrNewspaper * 1.5) + quantile(mydata$newspaper)[4]
UpperSales = (iqrSales * 1.5) + quantile(mydata$sales)[4]
LowerTV = quantile(mydata$TV)[2] - (iqrTV * 1.5)
LowerRadio = quantile(mydata$radio)[2] - (iqrRadio * 1.5)
LowerNewspaper = quantile(mydata$newspaper)[2] - (iqrNewspaper * 1.5)
LowerSales = quantile(mydata$sales)[2] - (iqrSales * 1.5)
count(mydata[mydata$TV < LowerTV, ])
count(mydata[mydata$TV > UpperTV, ])
count(mydata[mydata$radio > UpperRadio, ])
count(mydata[mydata$radio < LowerRadio, ])
count(mydata[mydata$newspaper > UpperNewspaper, ])
count(mydata[mydata$newspaper < LowerNewspaper, ])
count(mydata[mydata$sales > UpperSales, ])
count(mydata[mydata$sales < LowerSales, ])
There are two outliers in newspaper, both above the upper threshold (Q3 + 1.5 * IQR, about 93.6 using the quartiles from summary()). They represent records whose newspaper spending is unusually high compared with the rest of the data.
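The 1.5 * IQR rule used above can be wrapped in one function so that every feature goes through the same steps. A minimal sketch, assuming the same fences as the code above; iqr_outliers() and the toy vector are hypothetical and the values are made up:

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences.
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])        # 1.5 times the interquartile range
  x < q[1] - fence | x > q[2] + fence
}

# Made-up values with one obviously extreme point.
toy <- c(10, 12, 13, 15, 16, 18, 20, 95)
flags <- iqr_outliers(toy)

sum(flags)    # how many values are flagged
toy[flags]    # which values they are
```

On the real data, sum(iqr_outliers(mydata$newspaper)) would give the outlier count for newspaper in a single call.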
1D) Write a general description of the dataset using the statistics found in the steps above. Use the min, max and range to compare the features, and note any significant differences.
On average, the most money was spent on TV: its mean (147.04) is far higher than newspaper's (30.55) or radio's (23.26). TV also has the widest range (295.7, versus 113.7 for newspaper and 49.6 for radio). TV being the largest makes sense, but I am not sure it makes sense for newspaper to come ahead of radio. It's possible this company simply focuses on print over radio.
Task 2: Qualitative Analysis
2A) Plot all the assigned features on the y-axis, using case_number for the x-axis. Use the given commands to create each plot and arrange all the plots in a grid. Note any trends/patterns in the data.
- Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
- Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
#grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
TVplot = ggplot(data = mydata, aes(x = case_number, y = TV)) + geom_point()
Radioplot = ggplot(data = mydata, aes(x = case_number, y = radio)) + geom_point()
newsplot = ggplot(data = mydata, aes(x = case_number, y = newspaper)) + geom_point()
salesplot = ggplot(data = mydata, aes(x = case_number, y = sales)) + geom_point()
grid.arrange(TVplot, Radioplot, newsplot,salesplot, ncol=2)

- When looking at these plots it is hard to see a particular trend.
- One way to observe any possible trend in the sales data would be to re-order the data from low to high.
- The 200 monthly observations are in no particular chronological order.
- The case numbers are independent sequentially generated numbers. Since each case is independent, we can reorder them.
2B) Re-order sales from low to high, and save the re-ordered data in a new set. As the sales data is re-ordered, the associated values in the other columns follow.
- Commands: newdata <- mydata[ order(mydata$VARIABLE), ]
# Re-order the rows by sales, lowest first
newdata = mydata[ order(mydata$sales), ]
# Extract case_number from the newdata
case_number <- newdata$case_number
head(newdata)
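To see why the other columns "follow" when sales is re-ordered, it helps to try order() on a few rows. This sketch uses a made-up four-row data frame (the values are hypothetical): order() returns row positions, and indexing the whole data frame with them moves complete rows together.

```r
# Made-up rows, not the assignment data.
toy <- data.frame(case_number = 1:4,
                  TV          = c(200, 50, 20, 150),
                  sales       = c(22, 10, 9, 18))

order(toy$sales)                    # row positions from lowest to highest sales: 3 2 4 1
sorted <- toy[order(toy$sales), ]   # whole rows move, so TV stays with its sales value

sorted$case_number                  # 3 2 4 1
sorted$TV                           # 20 50 150 200

# A fresh 1..n position for plotting the re-ordered rows from left to right.
sorted$position <- seq_len(nrow(sorted))
```

The position column plays the same role as case_number[order(case_number)] in the plotting commands of 2C: it gives each re-ordered row an increasing x value.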
2C) Repeat the 4 graphs with the newdata to spot any trends. Note what the new plots reveal about trends and relationships between the features.
- Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
- Commands: For x variable in the plot use: aes(x = case_number[order(case_number)])
- Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
newTVplot = ggplot(data = newdata, aes(x = case_number[order(case_number)], y = TV)) + geom_point()
newRadioplot = ggplot(data = newdata, aes(x = case_number[order(case_number)], y = radio)) + geom_point()
newnewsplot = ggplot(data = newdata, aes(x = case_number[order(case_number)], y = newspaper)) + geom_point()
newsalesplot = ggplot(data = newdata, aes(x = case_number[order(case_number)], y = sales)) + geom_point()
grid.arrange(newTVplot, newRadioplot, newnewsplot, newsalesplot, ncol=2)

# After ordering the data by sales, the relationships are much easier to see: TV clearly rises with sales and radio shows a weaker upward trend, while newspaper is still scattered with no clear pattern.
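A visual impression like this can be backed up with a number. One option, which is an addition here and not something the assignment asks for, is the correlation coefficient from cor(). The sketch below uses made-up values in which TV rises with sales and newspaper does not:

```r
# Made-up values, not the assignment data.
toy <- data.frame(sales     = c(9, 10, 18, 22),
                  TV        = c(20, 50, 150, 200),
                  newspaper = c(40, 5, 60, 10))

tv_cor   <- cor(toy$sales, toy$TV)         # close to 1: strong upward relationship
news_cor <- cor(toy$sales, toy$newspaper)  # close to 0: no clear relationship

tv_cor
news_cor
```

On the real data the equivalent calls would be cor(mydata$sales, mydata$TV) and so on; the ordering of the rows does not change the result.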
Task 3: Standardized Z-Value
3A) Create a histogram of the z-scores for each assigned feature. Describe the output and note any relevant values.
- Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)
- Commands: qplot( x = VARIABLE ,geom="histogram", binwidth = 0.3)
z_scoreTV = (mydata$TV - TV_mean) / TV_sd
z_scoreradio = (mydata$radio - Radio_mean) / Radio_sd
z_scorenewspaper = (mydata$newspaper - newspaper_mean) / newspaper_sd
z_scoresales = (mydata$sales - sales_mean) / sales_sd
qplot( x = z_scoreTV ,geom="histogram", binwidth = 0.3)

qplot( x = z_scoreradio ,geom="histogram", binwidth = 0.3)

qplot( x = z_scorenewspaper ,geom="histogram", binwidth = 0.3)

qplot( x = z_scoresales ,geom="histogram", binwidth = 0.3)
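A quick way to check that the z-scores were computed correctly is to confirm that they have mean 0 and standard deviation 1, which is true for any feature by construction. A minimal sketch on a made-up vector, also compared against base R's scale(), which performs the same centering and scaling:

```r
# Made-up values, not the assignment data.
toy <- c(2, 4, 4, 4, 5, 5, 7, 9)

z <- (toy - mean(toy)) / sd(toy)

mean(z)   # 0, up to floating-point rounding
sd(z)     # 1

# scale() returns a one-column matrix, so convert it before comparing.
all.equal(z, as.numeric(scale(toy)))
```

Because standardizing only shifts and rescales the values, each z-score histogram keeps the shape of the original feature; only the axis changes.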

3B) Given a sales value of $26700, calculate the corresponding z-value or z-score.
- Command: z_score = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)
# sales appears to be recorded in thousands, so $26700 is entered as 26.7
z_score = ( 26.7 - mean(mydata$sales) ) / sd(mydata$sales)
z_score
[1] 2.429824
3C) Based on the z-value, how would you rate a $26700 sales value: poor, average, good, or very good performance? Explain your logic.
The z-score shows that this sales value is more than two standard deviations above the mean (z is about 2.43). That is well above average, so it should be considered a good performance.
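To put the z-score on a more familiar scale, it can be converted into an approximate share of cases expected to do better. The sketch below reuses the mean and standard deviation printed in 1B and assumes sales is roughly normally distributed, which is an assumption made for illustration and not something checked in this notebook:

```r
# Mean and standard deviation of sales as printed in 1B.
sales_mean <- 14.0225
sales_sd   <- 5.217457

z <- (26.7 - sales_mean) / sales_sd
z

# Share of values expected to be even higher, IF sales were normally distributed.
upper_tail <- 1 - pnorm(z)
upper_tail
```

Under that assumption fewer than 1 in 100 cases would be expected to exceed this value, which supports rating it at the top end of the scale.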
