About

This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully and follow them to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.

Note

For your assignment you may be using different data sets than those included here. Always read the instructions on Sakai carefully. For clarity, tasks and questions to be completed or answered are highlighted in red (visible in preview) and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.

Execute all code chunks, preview, publish, and submit the link on Sakai.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file=/creditrisk.csv)
#Name the extracted variable
age = mydata$Age

##### 1A) Fill in the code chunk below to calculate and display each result. Refer to previous worksheets and on-line help for some commands.

#Calculate the average age below.
mean(age)

[1] 34.39765

#Calculate the standard deviation of age below.
sd(age)

[1] 11.04513

#Calculate the maxima of age below. Look in Help to find the right command for maxima
max(age)

[1] 73

#Calculate the minima of age below. Look in Help to find the right command for minima
min(age)

[1] 18

Next, we will use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the mean plus or minus 3 times the standard deviation: upper limit = mean + 3 * standard deviation, lower limit = mean - 3 * standard deviation.

##### 1B) Write and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Display your results. Based on the limits do you think there are outliers? Explain your answer.

UL=mean(age) + 3 * sd(age)
UL

[1] 67.53302

LL=mean(age) - 3 * sd(age)
LL

[1] 1.262269

Comparing the results of 1A and 1B suggests that there are outliers. The minimum, 18, is larger than the lower limit of 1.262269, so it lies within the boundaries. The maximum, 73, however, is larger than the upper limit of 67.53302, so it lies outside the boundaries, which means outliers exist.
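The same comparison can be made on every value instead of only on the minimum and maximum. A minimal sketch, assuming a small made-up vector `ages` (not the Age column from creditrisk.csv), that lists each value falling outside the 3-standard-deviation limits:

```r
# Illustrative ages only; these are NOT the values from creditrisk.csv
ages <- c(22, 25, 27, 28, 29, 30, 30, 31, 31, 32,
          32, 33, 33, 34, 35, 36, 37, 38, 40, 95)

# Limits from the formula: mean plus or minus 3 standard deviations
upper_limit <- mean(ages) + 3 * sd(ages)
lower_limit <- mean(ages) - 3 * sd(ages)

# Keep only the values that fall outside the limits
outliers_3sd <- ages[ages > upper_limit | ages < lower_limit]
outliers_3sd
length(outliers_3sd)
```

Replacing `ages` with the real `age` variable would show how many values exceed the upper limit, not just that the maximum does.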

Another similar method to find the upper and lower thresholds, discussed in many introductory statistics courses, involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range.

quantile(age)

  0%  25%  50%  75% 100% 
  18   26   32   41   73 

lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

To calculate the upper threshold:

upperthreshold = (iqr * 1.5) + upperq
upperthreshold

 75% 
63.5 

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold

25% 
3.5 

A good way to understand the above calculations is to visualize the results using a box and whisker plot. The top and bottom ends of the box correspond to the upper and lower quartiles. The median is marked by a bold line. The whiskers are the lines extending from the upper and lower quartiles toward the upper and lower thresholds; in R's boxplot(), each whisker stops at the most extreme data point that still lies within its threshold. Any point beyond the thresholds is a potential outlier.

boxplot(age)
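The points drawn beyond the whiskers can also be listed instead of counted by eye. A minimal sketch on a made-up vector (not the real Age column), using `boxplot.stats()` from base R. Note that it computes the quartiles as hinges, so its thresholds may differ slightly from those obtained with `quantile()`:

```r
# Illustrative ages only; these are NOT the values from creditrisk.csv
ages <- c(22, 25, 27, 28, 29, 30, 30, 31, 31, 32,
          32, 33, 33, 34, 35, 36, 37, 38, 40, 95)

# boxplot.stats() returns the numbers behind the box plot;
# $out holds the points lying beyond 1.5 * IQR from the box
stats <- boxplot.stats(ages)
stats$out
length(stats$out)
```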

##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B?

#### From the box plot above, there appear to be five outliers at the upper end of the data and none at the lower end. This is consistent with the conclusion in Task 1B that outliers exist above the upper limit.

-----

Task 2: Data Preparation

Next, we will read the file creditriskorg.csv into R as provided in its original form. Unlike the cleaned file creditrisk.csv, the original dataset will require some data preparation.

newdata = read.csv(file=/creditriskorg.csv)
head(newdata)

We observe a new line is inserted with the header labels X, X.1, ... and that the true column headers are shifted down. This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.

newdata = read.csv(file=/creditriskorg.csv,skip=1)
head(newdata)
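The effect of `skip` can be reproduced without the data file. A minimal sketch, assuming made-up file contents passed through the `text` argument of `read.csv()`; the column names and values are illustrative and not taken from creditriskorg.csv:

```r
# Made-up contents: an empty first row, then the true header, then two data rows
raw_text <- paste(c(",,",
                    "Checking,Savings,Age",
                    "100,200,30",
                    "150,50,41"),
                  collapse = "\n")

# Without skip, the empty first row is taken as the header
shifted <- read.csv(text = raw_text)
names(shifted)

# With skip = 1, the true header row supplies the column names
fixed <- read.csv(text = raw_text, skip = 1)
names(fixed)
```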

Next we want to extract the Checking column and then find the average of checking, similar to what we did in the previous lab. When we try to execute the code chunk below, notice that R returns NA together with a warning message.

checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)

argument is not numeric or logical: returning NA
[1] NA

To resolve the problem, we must first understand its source. There are missing values in the csv file represented by the symbol $-. Missing data is quite common, as most datasets are not perfect. Additionally, there are commas within the Excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, which are not numeric characters either. As a result, R does not treat the column as numeric, and mean() returns NA.

To correct this we need to do some data cleanup. For this we will use the sub() function in R to replace unwanted symbols with something else. For example, to remove the comma in the number “1,234”, we can substitute it with a blank (an empty string). Below is a sequence of commands to help with the cleanup of the data in the Checking column and the eventual calculation of the mean.
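One possible shape for such a cleanup is sketched here on a small made-up vector, not the real file. The variable names and the use of `fixed = TRUE` are assumptions for illustration and not necessarily the worksheet's own commands:

```r
# Illustrative values only; NOT taken from creditriskorg.csv
raw_checking <- c("$1,234", "$-", "$56", "$7,890")

# Substitute the comma with an empty string (fixed = TRUE matches it literally)
no_comma <- sub(",", "", raw_checking, fixed = TRUE)

# Substitute the dollar sign with an empty string
no_dollar <- sub("$", "", no_comma, fixed = TRUE)

# Convert to numeric; the leftover "-" cannot be converted and becomes NA
# (as.numeric() would warn "NAs introduced by coercion" here)
checking_num <- suppressWarnings(as.numeric(no_dollar))

# Mean with the NAs removed
mean(checking_num, na.rm = TRUE)
```

Note that `sub()` replaces only the first match in each string, while `gsub()` replaces every match, so values with more than one comma (for example "1,234,567") would need `gsub()`.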

#substitute comma with blank in all of checking
checking= sub(
```

NAs introduced by coercion

#Calculate mean of checking with all NAs removed
mean(checking,na.rm=TRUE)

[1] 2559.805

##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use a different variable name.

Savings = newdata$Savings # command to extract the Savings column from the data file newdata
mean(Savings)

argument is not numeric or logical: returning NA
[1] NA

#substitute comma with blank in all of Savings
Savings= sub(
```

[1] 2122.146

##### 2B) Calculate now the mean of the Checking column in Excel using the Excel function Average. Compare the two results, from Excel and R and share your observation.

The mean of the Checking column calculated in Excel is 2122.15, which is lower than the 2559.805 obtained in R with the NAs removed. The two tools agree only when the missing values are handled in the same way.

##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol $-? Are they included in the calculation of the mean or excluded? Explain your answer.

Based on my observation, Excel treated the missing values represented by $- as 0s and included them in the calculation, which is why the mean of Checking in Excel is so much lower than the mean of Checking in R, where the NAs were removed.
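The difference between the two treatments can be illustrated in R. A minimal sketch with made-up balances, assuming that the $- entries are counted as 0 in one case and dropped as NA in the other:

```r
# Illustrative balances only; NOT taken from the data file
balances_na   <- c(100, 200, NA)  # "$-" read as a missing value
balances_zero <- c(100, 200, 0)   # "$-" counted as zero

# Missing value excluded: the mean is taken over two values
mean_excluding <- mean(balances_na, na.rm = TRUE)

# Zero included: the mean is taken over three values and is lower
mean_including <- mean(balances_zero)

mean_excluding
mean_including
```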

Task 3: Data Modeling

Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:

Chicago Divvy Data: https://www.divvybikes.com/data

###### 3A) Open in RStudio or Excel the file Divvy_Trips_2017_Q4.csv located in the data folder. What is the size of the file (measured in bytes), and how many columns and rows does it have? Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values).

The file is 81.9 MB (megabytes). There are 12 columns and 669,240 rows of data. Trip_id is the only field whose values are unique in this data set.
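These figures can also be obtained in R. A minimal sketch, assuming a tiny made-up table written to a temporary file in place of Divvy_Trips_2017_Q4.csv; the column names are illustrative:

```r
# Made-up stand-in for the trips file; NOT the real Divvy data
trips <- data.frame(trip_id = c(101, 102, 103),
                    from_station_id = c(5, 5, 9))
path <- tempfile(fileext = ".csv")
write.csv(trips, path, row.names = FALSE)

# File size in bytes
size_bytes <- file.size(path)

# Number of rows and columns
loaded <- read.csv(path)
dim(loaded)

# Columns with no repeated values are candidate unique identifiers
unique_cols <- names(loaded)[sapply(loaded, function(col) anyDuplicated(col) == 0)]
unique_cols
```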

Read the file README.txt, located in the same data folder, carefully for the description of the data.

###### 3B) Define a relational business logic integrity rule for the column field Trip Duration.

#### The trip duration cannot be 0, because a duration of 0 means that no trip took place. In addition, the time between the start and the end of a trip must be less than 24 hours.
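Such a rule can be checked against the data. A minimal sketch on a made-up table, assuming the duration is stored in seconds in a column named `tripduration` (both are assumptions to verify against README.txt):

```r
# Illustrative trips only; NOT the real Divvy data
trips <- data.frame(trip_id = 1:4,
                    tripduration = c(300, 0, 90000, 1200))

# 24 hours expressed in seconds
max_seconds <- 24 * 60 * 60

# A row violates the rule if the duration is not positive
# or is 24 hours or longer
violates <- trips$tripduration <= 0 | trips$tripduration >= max_seconds
trips[violates, ]
```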

###### 3C) Using https://erdplus.com/#/standalone draw a star-like schema using the three tables below. Include an image capture of your schema here.

  • A Fact table for Trips
  • A Dimension table for Stations
  • A Dimension table for Users
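The relationship between the three tables can be sketched with small data frames. This is an illustration only: every column name here, including `user_id`, is hypothetical and should be replaced with the fields described in README.txt. The fact table holds one row per trip plus keys that point into each dimension table:

```r
# Dimension table for Stations (hypothetical columns)
stations <- data.frame(station_id = c(1, 2),
                       station_name = c("A St", "B St"))

# Dimension table for Users (hypothetical columns)
users <- data.frame(user_id = c(10, 11),
                    user_type = c("Subscriber", "Customer"))

# Fact table for Trips: one row per trip, with keys into both dimensions
trips <- data.frame(trip_id = c(100, 101, 102),
                    from_station_id = c(1, 2, 1),
                    user_id = c(10, 10, 11),
                    tripduration = c(300, 450, 600))

# Joining the fact table to its dimensions through the keys
joined <- merge(trips, stations, by.x = "from_station_id", by.y = "station_id")
joined <- merge(joined, users, by = "user_id")
joined
```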
