About

This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. For clarity, tasks/questions to be completed/answered are highlighted in red (visible in preview) and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.

Execute all code chunks, preview, publish, and submit the link on Sakai.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file="data/creditrisk.csv") 
#Name the extracted variable
age = mydata$Age 

##### 1A) Fill in the code chunk below to calculate and display each result. Refer to previous worksheets and on-line help for some commands.

#Calculate the average age below. 
mean(age)
[1] 34.39765
#Calculate the standard deviation of age below. 
sd(age)
[1] 11.04513
#Calculate the maximum of age below. Look in Help to find the right command for the maximum
max(age)
[1] 73
#Calculate the minimum of age below. Look in Help to find the right command for the minimum
min(age)
[1] 18

Next, we will use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the mean (+ or -) 3 * standard deviation.

##### 1B) Write and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Display your results. Based on the limits do you think there are outliers? Explain your answer.

lowerthreshold = mean(age) - (3) * sd(age)
upperthreshold = mean(age) + (3) * sd(age)
lowerthreshold
[1] 1.262269
upperthreshold
[1] 67.53302

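Yes, there are outliers: the maximum age is 73, which is above the upper limit of about 67.5. No values fall below the lower limit of about 1.26, since the minimum age is 18.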
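The limits from Task 1B can also be used to count and list the flagged values directly. The sketch below is only an illustration: `age_demo` is a made-up vector, not the creditrisk.csv data, and the variable names are assumptions.

```r
# Hypothetical ages, invented for illustration (not the creditrisk.csv data)
age_demo <- c(22, 25, 25, 27, 28, 29, 30, 30, 31, 32,
              33, 34, 35, 36, 38, 40, 41, 45, 50, 95)

# Limits from the mean plus or minus 3 standard deviations
lower_limit <- mean(age_demo) - 3 * sd(age_demo)
upper_limit <- mean(age_demo) + 3 * sd(age_demo)

# Logical vector marking the values that fall outside the limits
is_outlier <- age_demo < lower_limit | age_demo > upper_limit

sum(is_outlier)        # how many values are flagged
age_demo[is_outlier]   # which values are flagged
```

Replacing `age_demo` with the `age` column would apply the same check to the worksheet data.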

Another similar method to find the upper and lower thresholds, discussed in many introductory statistics courses, involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range.

quantile(age) 
  0%  25%  50%  75% 100% 
  18   26   32   41   73 
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

To calculate the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
 75% 
63.5 

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25% 
3.5 
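The two threshold calculations can be wrapped in one small function. This is a sketch, not part of the lab: the function name `iqr_thresholds` and its argument `k` are hypothetical, and the example vector is simply the five quantile values printed earlier, chosen because its quartiles are also 26 and 41.

```r
# Return the lower and upper outlier thresholds based on the interquartile range.
# k is the multiplier applied to the IQR (1.5 in this worksheet).
iqr_thresholds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Small example vector whose quartiles are 26 and 41
example_values <- c(18, 26, 32, 41, 73)
thresholds <- iqr_thresholds(example_values)
thresholds

# Values outside the thresholds are potential outliers
example_values[example_values < thresholds[["lower"]] |
               example_values > thresholds[["upper"]]]
```

For this example the thresholds come out to 3.5 and 63.5, the same numbers as in the step-by-step calculation.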

A good way to understand the above calculations is to visualize the results using a box and whisker plot. The top and bottom ends of the box correspond to the upper and lower quartiles. The median is marked by a bold line. The whiskers are the lines connecting the upper and lower quartiles to the upper and lower thresholds. Any point beyond the thresholds is a potential outlier.

boxplot(age) 
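Rather than counting points by eye, the values a box plot would draw beyond the whiskers can be listed with `boxplot.stats()`. The sketch below uses a made-up vector, not the worksheet data. Note that `boxplot.stats()` builds the box from hinges (see `?fivenum`), which can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical ages, invented for illustration (not the creditrisk.csv data)
age_demo <- c(22, 25, 25, 27, 28, 29, 30, 30, 31, 32,
              33, 34, 35, 36, 38, 40, 41, 45, 50, 95)

# Compute the numbers behind the box plot without drawing it
stats <- boxplot.stats(age_demo)

stats$stats         # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out           # points lying beyond the whiskers
length(stats$out)   # number of such points
```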

##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B?

The box plot shows 5 outliers. This agrees with Task 1B, because all of them lie above the upper limit.

Task 2: Data Preparation

Next, we will read the file creditriskorg.csv into R as provided in its original form. Unlike the cleaned file creditrisk.csv, the original dataset will require some data preparation.

newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)

We observe a new line is inserted with the header labels X, X.1, ... and that the true column headers are shifted down. This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.

newdata = read.csv(file="data/creditriskorg.csv",skip=1) 
head(newdata)
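The effect of `skip` can be reproduced on a tiny file. This sketch writes a hypothetical three-column CSV to a temporary location; the column names and values are invented, and the first line is assumed to consist of empty cells, mimicking the layout described above.

```r
# Write a small hypothetical CSV whose first line contains only empty cells
tmp <- tempfile(fileext = ".csv")
writeLines(c(",,",
             "Id,Checking,Savings",
             '1,"$1,234",$50',
             "2,$-,$700"),
           tmp)

# Without skip, the empty first line is taken as the header row
raw <- read.csv(file = tmp)
names(raw)

# With skip = 1, the true header row is used
fixed <- read.csv(file = tmp, skip = 1)
names(fixed)
```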

Next we want to extract the Checking column and then find the average of checking, similar to what we did in the previous lab. When we try to execute the code chunk below, notice that R issues a warning and returns NA instead of a number.

checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)
argument is not numeric or logical: returning NA
[1] NA

To resolve the problem, we must first understand its source. There are missing values in the csv file represented by the symbol $-. Missing data is quite common, as most datasets are not perfect. Additionally, there are commas within the Excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, and these are not numeric characters either.

To correct for this we need to do some data cleanup. For this we will use the sub() function in R to replace unwanted symbols with something else. For example, to remove the comma in the number “1,234”, we can replace it with an empty string. Below is a sequence of commands to help with the cleanup of the data in the Checking column and the eventual calculation of the mean.

#substitute comma with blank in all of checking  
checking= sub(",","",checking)
#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking = sub("\\$","",checking)
#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Available)
checking = as.numeric(checking)
NAs introduced by coercion
#Calculate mean of checking with all NAs removed 
mean(checking,na.rm=TRUE)
[1] 2559.805
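The cleanup steps can be collected into a reusable helper. This is a sketch under assumptions: the function name `clean_currency` and the sample strings are hypothetical. It uses `gsub()`, which replaces every match, whereas `sub()` replaces only the first match in each value, so a value with two commas would keep its second comma.

```r
# Hypothetical helper: strip "$" and "," and convert the result to numeric
clean_currency <- function(x) {
  x <- gsub("[$,]", "", x)          # drop all dollar signs and commas
  x <- trimws(x)                    # drop surrounding spaces
  suppressWarnings(as.numeric(x))   # anything non-numeric (such as "-") becomes NA
}

# Invented sample values, including a missing-value marker
raw_values <- c("$1,234", "$1,234,567", " $-   ", "$50")
cleaned <- clean_currency(raw_values)
cleaned

# Mean with the NAs removed
mean(cleaned, na.rm = TRUE)

# For comparison, sub() leaves the second comma in place
sub(",", "", "$1,234,567")
```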

##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use different variable names.

Savings = newdata$Savings
Savings= sub(",","",Savings)
Savings = sub("\\$","",Savings)
Savings = as.numeric(Savings)
NAs introduced by coercion
mean(Savings,na.rm=TRUE)
[1] 2122.146

##### 2B) Calculate now the mean of the Checking column in Excel using the Excel function Average. Compare the two results, from Excel and R, and share your observation.

They are the same; R just displays the result with more rounding.

##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol $-? Are they included in the calculation of the mean or excluded? Explain your answer.

Excel corrected for the missing values. Because the Excel average matches the R result computed with na.rm=TRUE, the missing values appear to have been excluded from the calculation.

Task 3: Data Modeling

Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:

Chicago Divvy Data: https://www.divvybikes.com/data

###### 3A) Open in RStudio or Excel the file Divvy_Trips_2017_Q4.csv located in the data folder. What is the size of the file (measured in bytes), the number of columns and of rows? Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values).

The file is 85.9 MB and has 12 columns and 669,240 rows.

Divydata = read.csv(file="data/Divvy_Trips_2017_Q4.csv") 
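File size, dimensions, and candidate unique identifiers can all be checked from R. The sketch below runs on a hypothetical miniature table written to a temporary file; the column names `trip_id`, `bikeid`, and `tripduration` are assumptions for illustration and should be checked against README.txt.

```r
# Hypothetical miniature trips table written to a temporary file
tmp <- tempfile(fileext = ".csv")
trips_demo <- data.frame(
  trip_id      = c(101, 102, 103, 104),
  bikeid       = c(7, 7, 9, 12),
  tripduration = c(300, 1250, 300, 4000)
)
write.csv(trips_demo, tmp, row.names = FALSE)

file.size(tmp)           # size of the file in bytes
trips <- read.csv(file = tmp)
dim(trips)               # number of rows, then number of columns

# A column can be a unique identifier only if it has no repeated values
is_unique <- sapply(trips, function(column) anyDuplicated(column) == 0)
is_unique
```

A column with no duplicates in the data is only a candidate; the README description decides whether it is meant to be an identifier.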

Read the file README.txt, located in the same data folder, carefully for the description of the data.

###### 3B) Define a relational business logic integrity rule for the column field Trip Duration.

Trip duration cannot be negative or zero. According to the README, it must be a positive number less than 24 hours.
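A rule of this kind can be tested against the data. The sketch below is illustrative only: the durations are invented, and it assumes they are recorded in seconds, which should be confirmed in README.txt.

```r
# Hypothetical trip durations, assumed to be in seconds
tripduration <- c(300, 0, 1250, -20, 90000, 4000)

# 24 hours expressed in seconds
max_seconds <- 24 * 60 * 60

# Rule: duration must be positive and less than 24 hours
valid <- tripduration > 0 & tripduration < max_seconds

which(!valid)          # positions of the values that break the rule
tripduration[!valid]   # the offending values themselves
```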

###### 3C) Using https://erdplus.com/#/standalone draw a star-like schema using the three tables below. Include an image capture of your schema here.

