### About

This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.

Setup

Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read carefully the below and follow the instructions to complete the tasks and answer any questions. Submit your work in Sakai as detailed in previous notes.

Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions provided, before executing any included code chunks and/or adding your own code. For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section. The red color is only apparent when in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (already included and own added), preview, check integrity, and submit final work (\(html\) file) in Sakai.


Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file="creditrisk.csv") 

#Name the extracted variable
age = mydata$Age 

##### 1A) Fill in the code chunk below to calculate and display each result. Refer to previous worksheets and on-line help for some commands (2pts)


#Calculate the average age 
mean(age)
[1] 34.39765
#Calculate the standard deviation of age 
sd(age)
[1] 11.04513
#Calculate the maxima of age.  Search in Help to find the right command for maxima
max(age)
[1] 73
#Calculate the minima of age.  Search in Help to find the right command for minima
min(age)
[1] 18

Next, we will use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the mean (+ or -) 3 * standard deviation.

##### 1B) Insert and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Show your results. Based on the limits do you think there are outliers? Explain your logic (2pts)

Yes, in this case, an outlier is described as a value + or - 3 standard deviations from the mean. This means the upper limit is 67.5 years old, and the lower limit is 1.3 years old.There are outliers because there are people aged more thean 67.5 years old, which is greater than the upper limit. There is someone aged 73, and this is more than 3 standard deviations away from the mean, and this defines an outlier.

#Upper limit
mean(age)+sd(age)*3
[1] 67.53302
#Lower Limit
mean(age)-sd(age)*3
[1] 1.262269

Another method to find the upper and lower thresholds, discussed in many introductory statistics courses, requires calculating the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age) 
  0%  25%  50%  75% 100% 
  18   26   32   41   73 
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq

Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

To calculate the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
 75% 
63.5 

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25% 
3.5 

A good way to undertsand the above calculations is to visualize the results using a box and whisker plot. The top and lower ends of the box correspond to the upper and lower quartiles. The median is marked by a bolded line. The whiskers are the lines connecting the upper and lower quartiles to upper and lower thresholds. Any points beyond the thresholds is a potential outlier.

boxplot(age) 

##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B? (2pts)

The Box plot shows 5 different ages Outliers. Looking at the excel sheet, there are 9 outliers total (73,67,67, 66, 65, 65, 65, 64, 64). In all, there are 5 differnt ages as outliers, (73, 67, 66, 65, and 64). This reconciles the ages that are considered outliers in excel and in the Box Plot.

The following is an optional challenge question. To answer, you will need to consider the difference between a normal distribution and a corresponding histogram. It will count as 2 extra points instead of the second pop quiz.

##### 1C’) Looking at the two methods of calculations to identify outliers in 1A) and 1B) , explain how they differ and also how they reconcile (2pts extra)

We used two different methods to calculate outliers.The first, was using a normal distribution curve. Outliers are defined as anything outside of the upper and lower limits found (mean + 3 standard deviations or mean - 3 standard deviations). In this calculation, about 99.7% of the data is within 3 standard deviations, and about .3% is outside of this range, meaning it is an outlier. The second method is using histogram. This method uses the interquartile range (IQR) to calculate outliers. The IQR accounts for the middle 50% of the data, and it takes the difference between the top 75th percentile data (Q3) and the 25th percentile (Q1) data. This is then multiplied by 1.5, and either added or subtracted from the IQR to find the upper or lower limits for outliers.

Reconcile of histogram and normal distribution outliers: a z-score of -.67 is the 25th percentile of data (Q1) and a z-score of .67 is the 75th percentile of data (Q3)

#Read File
mydata = read.csv(file="creditrisk.csv") 

#Name the extracted variable
age = mydata$Age
mean(age)
[1] 34.39765
sd(age)
[1] 11.04513
#Q1
# z =  (x - xbar )/ spread
# change equation to easily solve z*s + xbar = x
# if z= -.67, this is Q1
z = -.67
y = mean(age)
s = sd(age)
z*s + y
[1] 26.99741

In the histogram, Q1 is 26. This reconciles to the Q1 found using z-scores. Obviously there is a rounding error.

#Q3
# z =  (x - xbar )/ spread
# change equation to easily solve z*s + xbar = x
# if z= -.67, this is Q1
z = .67
y = mean(age)
s = sd(age)
z*s + y
[1] 41.79788

In the histogram, Q3 is 41, which is very close to the Q3 found using z-scores. This is a rounding error away.

#Solve for IQR
41.797888-26.99741
[1] 14.80048
#solve for lower limit
26.99741-(14.80048*1.5)
[1] 4.79669
#solve for upper limit
41.797888+(14.80048*1.5)
[1] 63.99861

Under these calculations, anything over 63.99861 is an outlier, compared to the 63.5 outlier upper range calculated from the histogram. These are very close, but they vary from rounding errors. Anything under 4.79669 is an outlier, compared to the 3.5 from the histogram. Again, this number varies slightly from the rounding errors. This is the reconciling of the histogram and standard deviation calculations of the outliers. —————

Task 2: Data Preparation

Next, we will read the file creditriskorg.csv into R as provided in its original form. Unlike the cleaned file creditrisk.csv, the original dataset will require some data preparation.

newdata = read.csv(file="creditriskorg1.csv")
head(newdata)

We observe a new line is inserted with the header labels X, X.1, ... and that the true column headers are shifted down. This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.

newdata = read.csv(file="creditriskorg1.csv",skip=1) 
head(newdata)

Next we want to extract the Checking column and then find the average of checking, smilar to what we did in the previous lab. When we try to execute the code chunk below notice that we get an error.

checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)
argument is not numeric or logical: returning NA
[1] NA

To resolve the error, we must understand first where it’s source. There are missing values in the csv file represented by the symbol $-. Missing data is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerical symbol either.

To correct for the error we need to do some data cleanup. For this we will use the sub function sub() in R to replace unwanted symbols with something else. For example, in order to remove the comma in the number “1,234”, we can substitute it with blank. Below is a sequence of commands to help with the cleanup of the data in the Checking column and eventual calculation of the mean.

#substitute comma with blank in all of checking  
checking= sub(",","",checking)

#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking = sub("\\$","",checking)

#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Applicable)
checking = as.numeric(checking)

#Calculate mean of checking with all NAs removed 
mean(checking,na.rm=TRUE)
[1] 2559.805

##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use a different variable naming (2pts)





Savings = newdata$Savings
#substitute comma with blank in all of Savings  
Savings = sub(",","",Savings)

#substitute dollar sign with blank in all of Savings 
# Example new = sub("\\$","",new)
Savings = sub("\\$","",Savings)

#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Applicable)
Savings = as.numeric(Savings)
NAs introduced by coercion
#Calculate mean of Savings with all NAs removed 
mean(Savings,na.rm = TRUE)
[1] 2122.146
mean(Savings,na.rm = TRUE)
[1] 2122.146

##### 2B) Calculate now the mean of the Checking column in Excel using the Excel function Average. Compare the two results, from Excel and R and share your observation (2pts)

The mean of the Excel file is $2559.805. The R mean of checking and the excel mean are the same number. R was unfamiliar with the formatting of the data, with commas and dollar signs. This may be because the data was imported, and R may not be familiar with the formatting styles. Therefore, we had to clean the data to be used. However, excel can manipulate the numbers to be easily analyzed. This may be because it is designed to do this, and can work around the formatting.

##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol $-? Are they included in the calculation of the mean or excluded? Explain your answer (2pts)

Excel excluded the missing values. This conclusion can be drawn because R and excel calculated the same mean. Since R excluded the missing data, excel must have too. I am concluding this because they must have calculated the same mean by using the same procedures. Also, if excel included the missing values, there would be an error, or a lower mean if excel deemed them to have a value of zero.


Task 3: Data Modeling

```

Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:

Chicago Divvy Data: https://www.divvybikes.com/data

First you need to download the archived (zipped) ‘Divvy_Trips_2019_Q3.zip' file. Once downlaoded you should unarchive (unzip) the file to extract its content. You will then obtain the corresponding file `Divvy_Trips_2019_Q3.csv’ file. The file is relatively big in size, and you will quickly come to appreciate some of the challenges working with big data files using Excel.

###### 3A) Open in RStudio or Excel the file Divvy_Trips_2019_Q3.csv. What is the size of the file (measured in bytes), the number of columns and of rows?. Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values) (2pts) The size of the file is 215MB (215,000,000 bytes). The number of columns is 12 The number of rows is 1,048,576 Unique Identifiers: trip_id

###### 3B) Looking at the Divvy bike data define two business rules: one for field integrity, and one for relationship integrity. (2pts)

field integrity: gender cannot be numerical relationship integrity: end_time cannot be before start time

###### 3C) Using https://erdplus.com/#/standalone draw a star like schema using the below three tables. Include an image capture of your schema here (4pts) #####

  • A Fact table for Trips
  • A Dimension table for Stations
  • A Dimension table for Users SEE THE TOP OF THE LAB
---
title: "Data Preparation & Flow (lab04)"
author: "Lauren Kroll"
date: "2/13/2020"
output:
  html_notebook: default
  html_document: default
subtitle: BSAD 343H, Business Analytics, Spring 2020
---
![](divvy.png)
### About

This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.

### Setup

Remember to always set your working directory to the source file location. Go to 'Session', scroll down to 'Set Working Directory', and click 'To Source File Location'. Read carefully the below and follow the instructions to complete the tasks and answer any questions.  Submit your work in Sakai as detailed in previous notes. 

### Note

For your assignment you may be using different data sets than what is included here. Always read carefully the instructions provided, before executing any included code chunks and/or adding your own code.  For clarity, tasks/questions to be completed/answered are highlighted in red color and numbered according to their particular placement in the task section.  The red color is only apparent when in Preview mode. Quite often you will need to add your own code chunk.

Execute all code chunks (already included and own added), preview, check integrity, and submit final work ($html$ file) in Sakai.

--------------

### Task 1: Data Outliers

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

In R, we must read in the file again, extract the column and find the values that are asked for.

```{r }
#Read File
mydata = read.csv(file="creditrisk.csv") 

#Name the extracted variable
age = mydata$Age 
```

<span style="color:red">
##### 1A) Fill in the code chunk below to calculate and display each result.  Refer to previous worksheets and on-line help for some commands (2pts)
</span>

```{r}

#Calculate the average age 
mean(age)

#Calculate the standard deviation of age 
sd(age)


#Calculate the maxima of age.  Search in Help to find the right command for maxima
max(age)

#Calculate the minima of age.  Search in Help to find the right command for minima
min(age)

```

Next, we will use the formula from class to detect any outliers. An outlier is a value that "lies outside" most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the ```mean (+ or -) 3 * standard deviation```.  


<span style="color:red">
##### 1B) Insert and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Show your results. Based on the limits do you think there are outliers? Explain your logic (2pts)

Yes, in this case, an outlier is described as a value + or - 3 standard deviations from the mean. This means the upper limit is 67.5 years old, and the lower limit is 1.3 years old.There are outliers because there are people aged more thean 67.5 years old, which is greater than the upper limit.  There is someone aged 73, and this is more than 3 standard deviations away from the mean, and this defines an outlier.

```{r}
#Upper limit
mean(age)+sd(age)*3

#Lower Limit
mean(age)-sd(age)*3

```

</span>

Another method to find the upper and lower thresholds, discussed in many introductory statistics courses, requires calculating the interquartile range. Follow along below to see how we first calculate the interquartile range.. 

```{r} 
quantile(age) 
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq
```

Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier. 

To calculate the upper threshold:
```{r} 
upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
```

Below is the lower threshold:
```{r}
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
```

A good way to undertsand the above calculations is to visualize the results using a box and whisker plot. The top and lower ends of the box correspond to the upper and lower quartiles. The median is marked by a bolded line. The whiskers are the lines connecting the upper and lower quartiles to upper and lower thresholds. Any points beyond the thresholds is a potential outlier.

```{r} 
boxplot(age) 
```

<span style="color:red">
##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B?
(2pts)

The Box plot shows 5 different ages Outliers.
Looking at the excel sheet, there are 9 outliers total (73,67,67, 66, 65, 65, 65, 64, 64). In all, there are 5 differnt ages as outliers, (73, 67, 66, 65, and 64). This reconciles the ages that are considered outliers in excel and in the Box Plot.
</span>


##### The following is an optional challenge question. To answer, you will need to consider the difference between a normal distribution and a corresponding histogram. It will count as 2 extra points instead of the second pop quiz.  

<span style="color:red">
##### 1C') Looking at the two methods of calculations to identify outliers in 1A) and 1B) , explain how they differ and also how they reconcile (2pts extra)


We used two different methods to calculate outliers.The first, was using a normal distribution curve.  Outliers are defined as anything outside of the upper and lower limits found (mean + 3 standard deviations or mean - 3 standard deviations). In this calculation, about 99.7% of the data is within 3 standard deviations, and about .3% is outside of this range, meaning it is an outlier. 
The second method is using histogram. This method uses the interquartile range (IQR) to calculate outliers. The IQR accounts for the middle 50% of the data, and it takes the difference between the top 75th percentile data (Q3) and the 25th percentile (Q1) data.  This is then multiplied by 1.5, and either added or subtracted from the IQR to find the upper or lower limits for outliers.

Reconcile of histogram and normal distribution outliers:
a z-score of -.67 is the 25th percentile of data (Q1) and a z-score of .67 is the 75th percentile of data (Q3)
```{r}
#Read File
mydata = read.csv(file="creditrisk.csv") 

#Name the extracted variable
age = mydata$Age
```

```{r}
mean(age)
sd(age)
```
```{r}
#Q1 (z=-.67)
# z =  (x - xbar )/ spread
# change equation to easily solve z*s + xbar = x
# if z= -.67, this is Q1
z = -.67
y = mean(age)
s = sd(age)
z*s + y

```
In the histogram, Q1 is 26. This reconciles to the Q1 found using z-scores. Obviously there is a rounding error.
</span>

```{r}
#Q3 (z=.67)
# z =  (x - xbar )/ spread
# change equation to easily solve z*s + xbar = x
# if z= -.67, this is Q1
z = .67
y = mean(age)
s = sd(age)
z*s + y
```
In the histogram, Q3 is 41, which is very close to the Q3 found using z-scores.  This is a rounding error away.

```{r}
#Solve for IQR
41.797888-26.99741


```
```{r}
#solve for lower limit
26.99741-(14.80048*1.5)
```
```{r}
#solve for upper limit
41.797888+(14.80048*1.5)
```
Under these calculations, anything over 63.99861 is an outlier, compared to the 63.5 outlier upper range calculated from the histogram. These are very close, but they vary from rounding errors.
Anything under 4.79669 is an outlier, compared to the 3.5 from the histogram. Again, this number varies slightly from the rounding errors. 
This is the reconciling of the histogram and standard deviation calculations of the outliers.
---------------

### Task 2: Data Preparation

Next, we will read the file `creditriskorg.csv` into R as provided in its original form. Unlike the cleaned file `creditrisk.csv`, the original dataset will require some data preparation.  

```{r}
newdata = read.csv(file="creditriskorg1.csv")
head(newdata)
```

We observe a new line is inserted with the header labels `X, X.1, ...` and that the true column headers are shifted down.  This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.

```{r} 
newdata = read.csv(file="creditriskorg1.csv",skip=1) 
head(newdata)
```

Next we want to extract the Checking column and then find the average of checking, smilar to what we did in the previous lab.  When we try to execute the code chunk below notice that we get an error.

```{r} 
checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)
```

To resolve the error, we must understand first where it's source. There are missing values in the csv file represented by the symbol `$-`. Missing data is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that '1,234' is equivalent to '1234'. Lastly, there are '$' symbols throughout the file which is not a numerical symbol either.

To correct for the error we need to do some data cleanup. For this we will use the sub function `sub()` in R to replace unwanted symbols with something else. For example, in order to remove the comma in the number "1,234", we can substitute it with blank. Below is a sequence of commands to help with the cleanup of the data in the Checking column and eventual calculation of the mean.

```{r} 
#substitute comma with blank in all of checking  
checking= sub(",","",checking)

#substitute dollar sign with blank in all of checking 
# Example new = sub("\\$","",new)
checking = sub("\\$","",checking)

#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Applicable)
checking = as.numeric(checking)

#Calculate mean of checking with all NAs removed 
mean(checking,na.rm=TRUE)
```

<span style="color:red">
##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use a different variable naming (2pts)

```{r}




Savings = newdata$Savings
#substitute comma with blank in all of Savings  
Savings = sub(",","",Savings)

#substitute dollar sign with blank in all of Savings 
# Example new = sub("\\$","",new)
Savings = sub("\\$","",Savings)

#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Applicable)
Savings = as.numeric(Savings)

#Calculate mean of Savings with all NAs removed 
mean(Savings,na.rm = TRUE)
```
```{r}
mean(Savings,na.rm = TRUE)
```


</span>

<span style="color:red">
##### 2B) Calculate now the mean of the Checking column in Excel using the Excel function `Average`. Compare the two results, from Excel and R and share your observation (2pts)


The mean of the Excel file is $2559.805.  The R mean of checking and the excel mean are the same number.  R was unfamiliar with the formatting of the data, with commas and dollar signs. This may be because the data was imported, and R may not be familiar with the formatting styles.  Therefore, we had to clean the data to be used. However, excel can manipulate the numbers to be easily analyzed. This may be because it is designed to do this, and can work around the formatting.
</span>

<span style="color:red">
##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol `$-`? Are they included in the calculation of the mean or excluded?  Explain your answer (2pts)


Excel excluded the missing values. This conclusion can be drawn because R and excel calculated the same mean. Since R excluded the missing data, excel must  have too.  I am  concluding this because they must have calculated the same mean by using the same procedures. Also, if excel included the missing values, there would be an error, or a lower mean if excel deemed them to have a value of zero.

-------------

### Task 3: Data Modeling


```


Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:  

Chicago Divvy Data: [https://www.divvybikes.com/data](https://www.divvybikes.com/system-data)

First you need to download the archived (zipped) '`Divvy_Trips_2019_Q3.zip' file`.  Once downlaoded you should unarchive (unzip) the file to extract its content.  You will then obtain the corresponding file `Divvy_Trips_2019_Q3.csv' file. The file is relatively big in size, and you will quickly come to appreciate some of the challenges working with big data files using Excel.

<span style="color:red">
###### 3A) Open in RStudio or Excel the file `Divvy_Trips_2019_Q3.csv`. What is the size of the file (measured in bytes), the number of columns and of rows?. Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values) (2pts)
</span> 
The size of the file is 215MB (215,000,000 bytes).
The number of columns is 12
The number of rows is 1,048,576
Unique Identifiers: trip_id



<span style="color:red">
###### 3B) Looking at the Divvy bike data define two business rules: one for field integrity, and one for relationship integrity. (2pts)



field integrity: gender cannot be numerical
relationship integrity: end_time cannot be before start time
</span>

<span style="color:red">
###### 3C) Using [https://erdplus.com/#/standalone](https://erdplus.com/#/standalone) draw a star like schema using the below three tables. Include an image capture of your schema here (4pts)
#####

</span>

######
 



- A Fact table for Trips
- A Dimension table for Stations
- A Dimension table for Users
SEE THE TOP OF THE LAB


