About

The focus of this lab is on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the instructions below carefully, complete the tasks, and answer any questions. Submit your work to RPubs as detailed in previous notes.

Important Note

For your assignment you may be using different data sets than the ones included in this worksheet demo. Make sure to read the instructions on Sakai carefully.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the Age column, and compute the values that are asked for.

#Read File
mydata = read.csv(file="data/Scoring.csv") 

#Name the extracted variable
age = mydata$Age 
#Calculate the average age below. Refer to Worksheet 2 for the correct command.
agemean = mean(age)
agemean
## [1] 37.08412
#Calculate standard deviation of age below. Refer to Worksheet 2 for the correct command. 
agesd = sd(age)
agesd
## [1] 10.98637
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable. 
agemax = max(age)
agemax
## [1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable. 
agemin = min(age)
agemin
## [1] 18

Next, use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower thresholds is mean ± 3 × standard deviation. Try using this formula to find the upper and lower limits for age.

#Use the formula above to calculate the upper and lower threshold
upper = agemean + 3*agesd
lower = agemean - 3*agesd
upper
## [1] 70.04322
lower
## [1] 4.125023

Another method to find the upper and lower thresholds, discussed in introductory statistics courses, involves the interquartile range (IQR). Follow along below to see how it is calculated.

quantile(age) 
##   0%  25%  50%  75% 100% 
##   18   28   36   45   68
lowerq = quantile(age)[2] #25th percentile (Q1)
upperq = quantile(age)[4] #75th percentile (Q3)
iqr = upperq - lowerq #interquartile range; the "75%" name carries over from quantile()
iqr
## 75% 
##  17

The formulas below calculate the thresholds, the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##  75% 
## 70.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
## 25% 
## 2.5

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below is consistent with the IQR of 17 we found and with the upper and lower thresholds.

There are no outliers. Under the IQR method, every value in the data set lies between the thresholds of 2.5 and 70.5. Under the mean ± 3 SD method, the maximum of 68 is below the upper limit of 70.04 and the minimum of 18 is above the lower limit of 4.13, so both methods agree. The boxplot below likewise shows no points beyond the whiskers.
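As a quick programmatic check (a minimal sketch using the variables computed above), we can count how many values fall outside each pair of thresholds:

#Count values outside the IQR-based thresholds; given the discussion above this should be 0
sum(age < lowerthreshold | age > upperthreshold)
#Count values outside the mean +/- 3 SD thresholds; this should also be 0
sum(age < lower | age > upper)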

boxplot(age) 


Task 2: Data Preparation

Next, we must read the ‘scoring_original.csv’ file into R. This is the original dataset and it contains missing values.

newdata = read.csv(file="data/scoring_original.csv")
head(newdata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769

In some files the column names are shifted down because of an empty line above the header; in that case we must use the skip argument and set header to TRUE. Here that is not necessary, since all the column names are already on the first line, but the explicit form of the command is shown below.

newdata = read.csv(file="data/scoring_original.csv",skip=0,header=TRUE,sep=",") 
head(newdata)
##   Status Seniority  Home Time Age Marital Records       Job Expenses
## 1   good         9  rent   60  30 married  no_rec freelance     $73K
## 2   good        17  rent   60  58   widow  no_rec     fixed     $48K
## 3    bad        10 owner   36  46 married yes_rec freelance     $90K
## 4   good         0  rent   60  24  single  no_rec     fixed     $63K
## 5   good         0  rent   36  26  single  no_rec     fixed     $46K
## 6   good         1 owner   60  36 married  no_rec     fixed     $75K
##   Income Assets Debt     Amount      Price   Finrat   Savings
## 1  $129K      0    0   $800.00    $846.00  94.56265  4.200000
## 2  $131K      0    0 $1,000.00  $1,658.00  60.31363  4.980000
## 3  $200K   3000    0 $2,000.00  $2,985.00  67.00168  1.980000
## 4  $182K   2500    0   $900.00  $1,325.00  67.92453  7.933333
## 5  $107K      0    0   $310.00    $910.00  34.06593  7.083871
## 6  $214K   3500    0   $650.00  $1,645.00  39.51368 12.830769

To calculate the mean for Price in R, follow Worksheet 2: extract the Price column first and then find the average using R’s built-in function. What happens when we try to use it? Running the code produces a warning saying the argument is not numeric or logical, and the result is NA, signaling that we have to clean the data.

price = newdata$Price
pricemean = mean(price)
## Warning in mean.default(price): argument is not numeric or logical:
## returning NA
pricemean
## [1] NA

To resolve the problem, we must understand where it comes from and correct for it. There are missing values in the csv file, which is quite common since most datasets are not perfect. Additionally, the values contain commas, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, which are not numeric characters either.

The sub function replaces the first match of a pattern in each value with something else. So, in order to remove the comma in the number “1,234”, we substitute it with an empty string.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

#Substitute comma with blank in all of price. Below are examples using a hypothetical variable name 'new'.
# Example new = sub(",","",new)
price = sub(",","",price)

#Substitute dollar sign with blank in all of price (the $ must be escaped because it is a regex metacharacter)
# Example new = sub("\\$","",new)
price = sub("\\$","",price)

#Convert values to numeric; entries that are still not numbers (such as blanks) become NA
# Example new = as.numeric(new)
price = as.numeric(price)
## Warning: NAs introduced by coercion
#Calculate mean of price with the NA values removed
# Example mean(new,na.rm=TRUE)
mean(price,na.rm=TRUE)
## [1] 1462.48
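One caveat: sub() replaces only the first match in each value, which works here because each price contains at most one comma. For amounts with multiple commas (e.g. “1,234,567”), gsub() replaces every match. As a compact sketch of the same cleaning in one step:

#Remove all dollar signs and commas in one pass, then convert to numeric
price = as.numeric(gsub("[$,]","",newdata$Price))
mean(price,na.rm=TRUE)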

What are some other ways to clean this data? How about Excel? How does Excel treat the missing values and the “$” symbols? Another way to clean the data is to open the file in Excel and change everything manually. Excel’s AVERAGE function ignores blank cells, so calculating the average in Excel gives 1462.48, equal to the average calculated in R after cleaning the data and removing the NA values. Excel treats the dollar sign simply as currency formatting, whereas in R the dollar sign is the operator for extracting a column from a data frame, so it cannot appear inside numeric values.


Task 3: Data Modeling

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

– Open the taxi trips sample csv file located in the data folder in RStudio or Excel. Note the size of the file and the number of columns and rows here. Identify the unique entities and fields in the data.

This file contains 24 columns and 100,000 rows. The unique entities are Trip ID and Taxi ID; the fields are the remaining columns, which describe those entities and the taxi ride in general.
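To confirm these counts in R, a quick sketch (the file name below is hypothetical; use the actual name of the sample file in your data folder):

#Read the taxi sample and report its dimensions and field names
taxi = read.csv(file="data/taxi_trips_sample.csv") #hypothetical file name
dim(taxi) #number of rows and columns
names(taxi) #field names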

– Define a relational business logic integrity check for the field ‘Trip Seconds’. Conducting a business logic integrity check on the ‘Trip Seconds’ column, one notices that entries in this field must be numeric (a duration in seconds). Entering text into one of the cells in this column would violate field integrity. Another integrity check for Trip Seconds could enforce the formula Trip Seconds (duration) = Trip End Time - Trip Start Time, as sketched below.
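These checks could be sketched in R as follows, assuming the taxi data frame from above; the column names and timestamp format follow the published dataset but should be treated as assumptions:

#Field integrity: Trip Seconds must be numeric; non-numeric entries become NA
secs = as.numeric(taxi$Trip.Seconds)
sum(is.na(secs)) #count of entries failing the numeric check

#Business logic: duration should match end time minus start time
#(assumed timestamp format; the public data rounds times, so allow a tolerance)
start = as.POSIXct(taxi$Trip.Start.Timestamp, format="%m/%d/%Y %I:%M:%S %p")
end = as.POSIXct(taxi$Trip.End.Timestamp, format="%m/%d/%Y %I:%M:%S %p")
violations = abs(as.numeric(difftime(end, start, units="secs")) - secs) > 15*60
sum(violations, na.rm=TRUE) #trips whose duration disagrees with the timestamps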

– Using https://erdplus.com/#/standalone draw a star-like schema using at least the following tables:

The star schema above shows the data we are keeping track of and the relationships involved. The Trip fact table records a specific ride (Trip ID). For each ride, we track the duration, miles, start time, end time, pickup and dropoff locations, and the census tract (traffic flow). It is important to note that the census tract changes with every particular ride and over time. Payment and Taxi hold categorical values, so we put them in dimension tables to make the data easier to work with. The Fare dimension tracks everything that goes along with the total fare, with Payment ID added as a foreign key so the payment type corresponding to a specific Fare ID can be tracked. Location is a dimension containing the latitude and longitude coordinates as well as the community area number. Each unique key in a dimension becomes a foreign key in the fact table, signaling a relationship. In particular, the Location dimension allows the Trip fact table to track both the pickup and the dropoff location.
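To make the fact/dimension relationship concrete, here is a toy sketch in R; the table contents and key names are hypothetical illustrations, not the actual assignment schema:

#Hypothetical dimension tables
location = data.frame(LocationID=c(1,2), Latitude=c(41.88,41.97),
                      Longitude=c(-87.63,-87.90), CommunityArea=c(32,76))
payment = data.frame(PaymentID=c(1,2), PaymentType=c("Cash","Credit Card"))

#Hypothetical fact table: the dimension keys appear here as foreign keys
trip = data.frame(TripID="t1", TripSeconds=900, TripMiles=5.2,
                  PickupLocationID=1, DropoffLocationID=2, PaymentID=2)

#Joining the fact table to a dimension resolves a foreign key,
#for example looking up the pickup coordinates of each trip
merge(trip, location, by.x="PickupLocationID", by.y="LocationID")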