Notebook Instructions
For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks/questions to be completed/answered are highlighted in larger bolded fonts and numbered according to their section.
Load Packages in R/RStudio
We are going to use tidyverse, a collection of R packages designed for data science.
Loading required package: tidyverse
── Attaching packages ── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.1     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: gridExtra
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
Task 1: Quantitative Analysis
1A) Read the CSV file into RStudio and display the dataset.
Change the variable name "X1" to case_number using the function rename().
- mydata <- rename(mydata, "NEW_VAR_NAME" = "OLD_VAR_NAME")
1B) Find the range (difference between max and min), min, max, mean, standard deviation, and variance for each assigned feature (use a separate chunk for each feature). Compare the features and note any significant differences.
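As a minimal sketch of the statistics listed above, the chunk below computes each one for a single feature using base R only. The vector `sales` holds made-up values and is an assumption for illustration; in the notebook you would substitute a column such as mydata$VARIABLE.

```r
# Hypothetical values for illustration only; replace with mydata$VARIABLE
sales <- c(12.5, 18.0, 7.2, 25.4, 15.1, 9.8)

sales_max   <- max(sales)
sales_min   <- min(sales)
sales_range <- sales_max - sales_min   # range as a single number (max - min)
sales_mean  <- mean(sales)
sales_sd    <- sd(sales)               # sample standard deviation (n - 1 denominator)
sales_var   <- var(sales)              # sample variance, equal to sd^2

# Collect the results in one named vector for easy comparison across features
sales_stats <- c(min = sales_min, max = sales_max, range = sales_range,
                 mean = sales_mean, sd = sales_sd, variance = sales_var)
print(round(sales_stats, 2))
```

Note that base R's range() returns the pair c(min, max) rather than a single number, so diff(range(sales)) is an equivalent way to get the max - min difference.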
SALES
#variable_max
ilmax <- max(mydata$Total_State_Tax_Illinois)
#variable_min
#variable_Range max-min
#variable_mean
#variable_sd Standard Deviation
#variable_variance
RADIO
TV
NEWSPAPER
1C) Use the summary() function on the entire dataset to get a general description of the data. Note any differences between features.
Are there any outliers? If not, explain the lack of outliers. If there are, explain what the outliers represent and how many records are outliers. (Use the code from notebook-03 to find outliers.)
# lowerq = quantile(VARIABLE)[2]
# upperq = quantile(VARIABLE)[4]
# iqr = upperq - lowerq
# upper_threshold = (iqr * 1.5) + upperq
# lower_threshold = lowerq - (iqr * 1.5)
# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE < lower_threshold][1:10]
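The commented template above follows the usual 1.5 × IQR rule. A minimal runnable sketch of the same steps is shown below; the vector `x` is made-up data used only for illustration, and the helper names `is_outlier`, `outliers`, and `n_outliers` are assumptions, not part of notebook-03.

```r
# Hypothetical values for illustration only; replace with mydata$VARIABLE
x <- c(10, 12, 11, 13, 12, 14, 11, 95, 13, 12)

lowerq <- quantile(x)[2]   # 25th percentile (Q1)
upperq <- quantile(x)[4]   # 75th percentile (Q3)
iqr    <- upperq - lowerq

upper_threshold <- upperq + 1.5 * iqr
lower_threshold <- lowerq - 1.5 * iqr

# Logical mask of outliers, then the outlying values and how many there are
is_outlier <- x > upper_threshold | x < lower_threshold
outliers   <- x[is_outlier]
n_outliers <- sum(is_outlier)

print(outliers)
print(n_outliers)
```

Counting with sum() answers the "how many records" question directly and avoids the NA padding that indexing with [1:10] produces when fewer than ten values qualify.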
1D) Write a general description of the dataset using the statistics found in the steps above. Use the min, max, and range to compare the features, and note any significant differences.
Task 2: Qualitative Analysis
2A) Plot each assigned feature on the y-axis, with case_number on the x-axis. Use the given commands to create each plot, then arrange all the plots in a grid. Note any trends/patterns in the data.
- Commands: VARIABLE_plot <- ggplot(data = mydata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
- Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
corp_plot <- ggplot(data = mydata, aes(x = year, y = `Corporate_Income_Tax _Illinois`)) + geom_point()
personal_plot <- ggplot(data = mydata, aes(x = year, y = Personal_Income_Tax_Illinois)) + geom_point()
general_plot <- ggplot(data = mydata, aes(x = year, y = General_Sales_Tax_Illinois)) + geom_point()
motor_plot <- ggplot(data = mydata, aes(x = year, y = Motor_Fuel_Sales_Illinois)) + geom_point()
grid.arrange(corp_plot, personal_plot, general_plot, motor_plot, ncol=2)

- Looking at these plots, it is hard to see any particular trend.
- One way to reveal a possible trend in the sales data is to re-order the data from low to high.
- The 200 monthly observations are in no particular chronological sequence.
- The case numbers are independent, sequentially generated numbers. Since each case is independent, we can re-order them.
2B) Re-order sales from low to high, and save the re-ordered data in a new set. As the sales data is re-ordered, the associated values in the other columns move with it.
- Commands: newdata <- mydata[ order(mydata$VARIABLE), ]
newdata <- mydata[ order(mydata$Total_Tax_US), ]
# Extract the x variable from newdata (here the sorted Total_Tax_US values, stored as case_number)
case_number <- newdata$Total_Tax_US
head(newdata)
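To illustrate how indexing with order() keeps each row intact, here is a small sketch on a made-up data frame. The name `toy` and its values are assumptions for illustration only and are not part of the assignment data.

```r
# Hypothetical data for illustration only
toy <- data.frame(
  case_number = 1:5,
  sales = c(14.2, 9.1, 22.7, 11.5, 18.3),
  tv    = c(150, 60, 280, 90, 210)
)

# order() returns the row positions that sort sales from low to high;
# indexing rows with it moves every column together
toy_sorted <- toy[order(toy$sales), ]
print(toy_sorted)
```

After sorting, the case_number column is no longer in sequence, which shows that whole records were moved rather than only the sales column.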
2C) Repeat the 4 graphs with newdata to spot any trends. Note your observations on what the new plots reveal about trends and relationships between the features.
- Commands: VARIABLE_plot <- ggplot(data = newdata, aes(x = VARIABLE, y = VARIABLE)) + geom_point()
- Commands: For x variable in the plot use: aes(x = case_number[order(case_number)])
- Commands: grid.arrange(VARIABLE_plot1, VARIABLE_plot2, VARIABLE_plot3, VARIABLE_plot4, ncol=2)
corp_plot <- ggplot(data = newdata, aes(x = case_number[order(case_number)], y = `Corporate_Income_Tax _Illinois`)) + geom_point()
personal_plot <- ggplot(data = newdata, aes(x = case_number[order(case_number)], y = Personal_Income_Tax_Illinois)) + geom_point()
general_plot <- ggplot(data = newdata, aes(x = case_number[order(case_number)], y = General_Sales_Tax_Illinois)) + geom_point()
motor_plot <- ggplot(data = newdata, aes(x = case_number[order(case_number)], y = Motor_Fuel_Sales_Illinois)) + geom_point()
grid.arrange(corp_plot, personal_plot, general_plot, motor_plot, ncol=2)

Task 3: Standardized Z-Value
3A) Create a histogram of the assigned feature's z-scores. Describe the output and note any relevant values.
- Command: z_scores = ( VARIABLE - mean(VARIABLE) ) / sd(VARIABLE)
- Commands: qplot( x = VARIABLE , geom="histogram", binwidth = 0.3)
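As a minimal sketch of the z-score formula above, the chunk below standardizes a made-up vector and checks the result against base R's scale(). The vector `x` and the name `z_scaled` are assumptions for illustration only.

```r
# Hypothetical values for illustration only; replace with mydata$VARIABLE
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Subtract the mean, then divide by the sample standard deviation
z_scores <- (x - mean(x)) / sd(x)

# Standardized values have mean 0 and standard deviation 1
print(round(mean(z_scores), 10))
print(sd(z_scores))

# scale() performs the same centering and scaling but returns a one-column matrix,
# so as.numeric() is used to get a plain vector back
z_scaled <- as.numeric(scale(x))
print(all.equal(z_scores, z_scaled))
```

Either form can be passed to qplot(); the manual formula simply makes the two steps explicit.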
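# z-scores of Illinois personal income tax; as.numeric() drops the matrix attributes that scale() adds
il_personal <- as.numeric(scale(mydata$Personal_Income_Tax_Illinois))
qplot(x = il_personal, binwidth = 0.3)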

3B) Given a sales value of $26700, calculate the corresponding z-value or z-score.
- Command: z_score = ( x - mean(VARIABLE) ) / sd(VARIABLE)
personal <- mydata$Personal_Income_Tax_Illinois
z_scores = ( personal - mean(personal) ) / sd(personal)
qplot( x = z_scores, binwidth = 0.3)

x = 4009160
z_score = ( x - mean(personal) ) / sd(personal)
z_score
[1] 1.607937
3C) Based on the z-value, how would you rate a $26700 sales value: poor, average, good, or very good performance? Explain your logic.
