About

This lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.


Task 1

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R. In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file = "data/Scoring.csv")
#Name the extracted variable
age = mydata$Age
age
   [1] 30 58 46 24 26 36 44 27 32 41 34 29 30 37 21 68 52 68 36 31 25 22 45 41 51 54 43 43 23 29 23 51 39 35
  [35] 62 56 32 36 64 31 42 61 32 43 26 49 47 27 52 28 31 32 51 37 29 26 27 63 55 28 31 27 40 35 43 34 37 53
  [69] 46 36 46 43 31 43 38 37 46 55 56 55 52 29 39 38 24 28 53 25 62 29 32 21 38 56 41 46 54 51 52 57 33 33
 [103] 29 23 33 24 28 27 33 32 39 51 39 29 32 40 47 66 39 50 25 27 54 42 35 23 52 22 48 25 26 28 62 33 41 29
 [137] 37 31 50 42 33 54 58 40 31 35 64 29 33 35 28 59 25 23 30 36 25 22 45 47 36 36 60 59 58 37 32 26 40 55
 [171] 30 43 28 47 29 48 34 53 44 49 51 34 42 34 46 31 31 33 30 27 48 33 44 48 57 34 38 35 52 28 46 56 52 35
 [205] 45 29 22 33 41 37 46 19 22 31 44 48 31 53 36 46 37 34 36 64 27 26 53 50 21 40 49 40 37 25 40 22 49 34
 [239] 49 43 42 28 30 25 22 27 26 28 35 51 24 34 34 49 28 57 63 37 27 39 64 49 49 45 27 30 33 40 30 30 25 40
 [273] 50 23 28 35 27 19 39 39 55 32 54 58 37 36 28 38 26 41 42 46 52 30 47 66 54 65 25 31 23 35 47 28 50 25
 [307] 39 46 36 39 25 49 52 34 24 38 24 33 31 63 23 61 28 58 41 38 27 55 40 59 47 38 25 28 49 38 52 37 42 54
 [341] 30 26 49 36 40 61 59 49 52 35 25 30 28 42 29 32 45 52 26 33 21 34 27 44 51 33 35 38 28 26 34 27 32 43
 [375] 30 42 26 20 24 27 35 18 35 33 39 36 29 23 34 57 38 52 42 44 37 34 43 42 24 36 52 34 34 31 36 50 33 54
 [409] 40 29 31 32 51 31 46 35 33 25 31 43 30 40 37 22 24 27 32 43 57 59 60 25 28 40 36 27 33 44 34 28 39 47
 [443] 28 37 20 36 28 20 32 19 21 23 47 37 51 31 49 27 34 28 29 51 34 42 46 40 32 26 31 47 45 26 41 31 23 34
 [477] 34 38 55 54 24 37 43 26 59 54 27 28 53 48 30 28 34 33 55 40 28 31 56 38 25 35 41 31 52 33 30 37 54 32
 [511] 23 39 28 43 50 41 27 25 35 31 43 28 22 42 41 29 27 32 27 37 59 52 25 24 51 39 33 31 30 24 31 39 42 37
 [545] 38 42 23 40 36 27 60 45 65 40 40 33 28 33 51 31 36 23 33 51 31 32 50 52 34 25 28 36 27 26 24 47 40 42
 [579] 52 26 45 47 25 50 44 48 19 48 46 50 30 30 39 27 52 42 22 43 36 23 40 50 52 38 35 34 34 49 37 33 34 42
 [613] 32 19 45 41 25 19 59 56 42 38 22 44 31 49 55 55 38 60 58 39 34 21 37 25 39 22 35 50 49 30 32 37 24 63
 [647] 25 23 22 52 39 29 43 35 27 34 39 51 35 33 30 25 53 26 61 63 30 43 35 49 34 26 41 30 51 51 40 53 33 24
 [681] 25 44 33 27 24 32 32 36 27 55 39 36 38 37 36 49 27 37 49 28 27 47 44 29 35 47 31 23 46 50 26 36 28 36
 [715] 61 46 25 22 38 57 26 24 51 26 46 37 55 38 24 27 43 52 57 36 62 19 32 43 34 34 34 30 40 42 49 40 24 35
 [749] 36 62 24 44 57 25 30 65 32 43 48 25 36 47 22 47 42 40 40 43 38 41 55 32 42 23 41 50 49 62 43 23 33 25
 [783] 50 37 27 47 22 39 46 19 24 43 42 32 27 34 43 21 36 23 40 25 38 58 41 39 19 31 28 49 47 42 25 48 52 37
 [817] 45 30 38 50 29 52 42 37 36 33 41 23 30 27 33 23 41 52 58 46 28 38 39 24 41 51 56 24 49 21 34 38 58 39
 [851] 60 23 37 48 26 23 34 46 51 45 55 39 33 52 26 47 31 50 39 36 47 45 40 36 43 44 55 32 26 33 27 24 29 20
 [885] 46 62 49 27 52 26 43 34 37 37 39 26 34 30 26 58 36 40 49 55 25 33 38 38 33 42 47 25 42 52 36 28 33 46
 [919] 22 22 23 29 29 37 30 52 46 51 28 33 39 33 40 59 32 45 38 26 29 21 31 32 40 29 33 29 57 23 36 38 45 51
 [953] 42 22 26 51 33 43 32 52 65 32 29 25 22 53 47 34 23 54 22 46 42 45 49 28 22 27 33 52 34 43 40 31 31 27
 [987] 42 35 42 24 31 53 59 38 49 42 54 24 39 48
 [ reached getOption("max.print") -- omitted 3446 entries ]
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
age_mean = mean(age)
age_mean
[1] 37.08412
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command. 
age_sd = sd(age)
age_sd
[1] 10.98637
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable. 
max(age)
[1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.  
min(age)
[1] 18

Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation.

Below is an estimation of the upper and lower threshold taken by using the mean (+ or -)3 x standard deviation

#Use the formula above to calculate the upper and lower threshold
age_lower = age_mean - (3) * age_sd
age_upper = age_mean + (3) * age_sd
age_lower
[1] 4.125023
age_upper
[1] 70.04322

A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Below you will see a the steps to calculating the interquartile. We begin by specifying the quantile, followed by the upper and lower quantile to ultimately form our equation.

quantile(age) 
  0%  25%  50%  75% 100% 
  18   28   36   45   68 
lowerq = quantile(age)[2]
upperq = quantile(age)[3]
iqr = upperq - lowerq
iqr
50% 
  8 

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is our calculation for our the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
50% 
 48 

Below is the calculation for our lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25% 
 16 

The calculations below demonstrate our upper and lower threshold for the scoring data shown initially

age[age>upperthreshold]
  [1] 58 68 52 68 51 54 51 62 56 64 61 49 52 51 63 55 53 55 56 55 52 53 62 56 54 51 52 57 51 66 50 54 52 62
 [35] 50 54 58 64 59 60 59 58 55 53 49 51 57 52 56 52 53 64 53 50 49 49 49 51 49 57 63 64 49 49 50 55 54 58
 [69] 52 66 54 65 50 49 52 63 61 58 55 59 49 52 54 49 61 59 49 52 52 51 57 52 52 50 54 51 57 59 60 51 49 51
[103] 55 54 59 54 53 55 56 52 54 50 59 52 51 60 65 51 51 50 52 52 50 50 52 50 52 49 59 56 49 55 55 60 58 50
[137] 49 63 52 51 53 61 63 49 51 51 53 55 49 49 50 61 57 51 55 52 57 62 49 62 57 65 55 50 49 62 50 58 49 52
[171] 50 52 52 58 51 56 49 58 60 51 55 52 50 55 62 49 52 58 49 55 52 52 51 59 57 51 51 52 65 53 54 49 52 53
[205] 59 49 54 51 54 57 54 56 57 53 51 53 49 56 62 60 57 57 59 55 51 55 65 56 62 49 64 49 51 50 51 53 52 52
[239] 52 53 50 52 55 49 55 55 61 49 55 52 56 51 49 64 58 49 62 59 49 57 52 55 56 62 54 49 60 55 49 49 57 54
[273] 60 52 53 50 58 52 63 61 63 54 61 52 49 49 55 60 59 53 50 56 62 49 49 50 59 59 50 63 60 54 57 52 63 52
[307] 52 57 58 49 61 50 65 60 59 65 59 61 55 56 49 56 59 56 62 61 50 51 63 52 51 50 51 53 60 50 50 50 58 52
[341] 61 49 52 57 57 49 49 50 51 54 49 50 58 50 54 52 54 49 51 51 51 50 53 66 51 50 49 54 55 58 54 53 52 55
[375] 51 49 51 52 63 53 52 54 50 50 60 55 50 60 60 51 49 55 63 58 51 53 55 52 52 53 55 55 56 56 50 58 54 56
[409] 50 61 53 64 57 60 52 58 59 57 52 60 65 56 50 51 62 50 50 60 53 55 59 55 53 54 51 49 49 65 60 50 51 51
[443] 54 51 50 66 50 52 58 56 57 54 53 53 53 55 63 49 56 54 57 49 58 56 53 54 49 49 49 52 57 57 60 50 62 54
[477] 54 55 55 51 51 57 54 53 58 50 49 53 51 51 51 49 59 56 54 53 51 61 56 54 51 63 53 50 55 64 61 49 61 54
[511] 52 52 55 53 50 61 66 58 55 59 50 50 58 56 53 52 53 57 52 54 53 56 59 56 49 53 53 50 54 49 51 51 49 50
[545] 50 63 51 58 50 52 54 50 61 57 54 60 49 50 49 53 49 49 50 63 53 50 49 51 55 64 60 53 59 58 55 54 53 58
[579] 49 53 55 51 62 50 63 57 51 57 60 52 54 51 53 49 56 59 53 53 59 58 53 51 58 59 56 50 59 53 54 52 61 52
[613] 54 56 56 54 51 51 51 53 56 61 55 52 49 54 58 59 64 66 59 58 54 53 51 54 55 53 62 55 56 60 59 66 51 66
[647] 49 53 65 62 52 61 61 50 51 55 55 53 58 56 52 54 60 53 49 53 49 53 56 55 54 51 60 56 54 53 49 49 62 63
[681] 51 63 52 55 51 55 50 56 52 54 60 56 50 49 54 56 62 49 51 50 51 63 51 65 55 58 49 61 60 53 50 55 51 60
[715] 54 52 50 52 49 56 55 53 53 58 59 50 55 64 56 56 59 58 49 49 56 64 52 61 51 52 50 55 50 53 58 50 61 55
[749] 61 50 53 51 56 55 52 52 49 64 49 52 51 50 63 53 59 53 66 55 51 60 60 58 50 53 63 54 51 50 53 52 60 51
[783] 55 53 53 49 50 56 60 60 58 50 57 56 50 56 55 51 55 52 55 58 54 53 61 51 49 52 49 52 55 52 58 50 65 54
[817] 50 64 51 50 52 49
age[age<lowerthreshold]
integer(0)
mydata[age>upperthreshold,]
boxplot(age,horizontal = TRUE)


Task 2

Below we will read the ‘scoring_original.csv’ file into R. This is the original dataset and contains missing values

mydata = read.csv(file = "data/scoring_original.csv")
head(mydata)
#tail(data)

Given that the column names are shifted down below we must ensure to use the command skip and set the header to true.

mydata = read.csv("data/scoring_original.csv")
head(mydata)

Below is a summary of our originial scoring data.

summary(mydata)
  Status       Seniority           Home           Time            Age                Marital    
     :  10   Min.   : 0.000          :  10   Min.   : 6.00   Min.   :-9999.00            :  10  
 bad :1249   1st Qu.: 2.000   ignore :  20   1st Qu.:36.00   1st Qu.:   28.00   divorced :  38  
 good:3197   Median : 5.000   other  : 319   Median :48.00   Median :   36.00   married  :3240  
 NA's:   3   Mean   : 7.987   owner  :2107   Mean   :46.45   Mean   :   14.57   separated: 130  
             3rd Qu.:12.000   parents: 782   3rd Qu.:60.00   3rd Qu.:   45.00   single   : 974  
             Max.   :48.000   priv   : 246   Max.   :72.00   Max.   :   68.00   widow    :  67  
             NA's   :10       rent   : 975   NA's   :10                                         
    Records            Job          Expenses        Income         Assets            Debt      
        :  10            :  10   $35K   :1207   $100K  : 155   Min.   :     0   Min.   :    0  
 no_rec :3680   fixed    :2806   $45K   : 819   $150K  : 115   1st Qu.:     0   1st Qu.:    0  
 yes_rec: 769   freelance:1021   $60K   : 792   $120K  : 107   Median :  3000   Median :    0  
                others   : 171   $75K   : 570   $80K   :  87   Mean   :  5352   Mean   :  342  
                partime  : 451   $90K   : 192   $110K  :  86   3rd Qu.:  6000   3rd Qu.:    0  
                                 $105K  :  40   $90K   :  84   Max.   :300000   Max.   :30000  
                                 (Other): 839   (Other):3825   NA's   :10       NA's   :10     
        Amount            Price          Finrat           Savings      
 $1,000.00 : 541   $1,500.00 :  46   Min.   :  6.702   Min.   :-8.160  
 $1,200.00 : 221   $1,200.00 :  45   1st Qu.: 60.020   1st Qu.: 1.615  
 $800.00   : 219   $1,300.00 :  45   Median : 77.093   Median : 3.120  
 $1,100.00 : 210   $1,600.00 :  43   Mean   : 72.614   Mean   : 3.858  
 $1,300.00 : 198   $1,100.00 :  41   3rd Qu.: 88.462   3rd Qu.: 5.195  
 $1,500.00 : 198   $1,700.00 :  39   Max.   :100.000   Max.   :33.250  
 (Other)   :2872   (Other)   :4200   NA's   :10        NA's   :10      

To calculate the mean for price in R, follow Worksheet 1. Extract the price column first and then find the average using the function built in R. What happens when we try to use the function? when extracting the price column and then calculating the mean it gives us an N/A result

mydata = read.csv(file = "data/scoring_original.csv")
mydata
#Extracting the Price Column
Price = mydata$Price
#Calling the Price Column
Price 
   [1] $846.00    $1,658.00  $2,985.00  $1,325.00  $910.00    $1,645.00  $1,800.00  $1,093.00  $1,957.00 
  [10] $1,468.00  $1,577.00  $915.00    $1,650.00  $940.00    $500.00    $1,186.00  $2,201.00  $1,350.00 
  [19] $1,511.00  $1,253.00  $2,189.00  $-         $1,159.00  $1,332.00  $1,497.00  $1,357.00  $2,100.00 
  [28] $1,070.00  $2,557.00  $1,600.00  $-         $1,312.00  $400.00    $650.00    $1,394.00  $1,542.00 
  [37] $1,200.00  $1,560.00  $1,200.00  $950.00    $-         $1,300.00  $1,700.00  $1,167.00  $1,150.00 
  [46] $1,566.00  $1,552.00  $1,300.00  $2,104.00  $2,154.00  $545.00    $1,778.00  $1,718.00  $1,500.00 
  [55] $1,274.00  $1,015.00  $1,701.00  $1,345.00  $1,238.00  $1,048.00  $1,048.00  $1,324.00  $2,201.00 
  [64] $1,430.00  $926.00    $1,500.00  $1,542.00  $1,360.00  $1,000.00  $1,564.00  $1,205.00  $2,200.00 
  [73] $2,214.00  $1,137.00  $1,025.00  $1,593.00  $1,735.00  $1,132.00  $1,338.00  $1,100.00  $2,276.00 
  [82] $1,360.00  $1,436.00  $612.00    $1,100.00  $1,300.00  $2,089.00  $2,262.00  $1,054.00  $1,426.00 
  [91] $1,048.00  $1,292.00  $1,177.00  $1,490.00  $850.00    $1,011.00  $1,571.00  $1,000.00  $270.00   
 [100] $1,382.00  $1,713.00  $1,086.00  $3,262.00  $1,610.00  $1,030.00  $650.00    $2,175.00  $1,780.00 
 [109] $950.00    $1,211.00  $1,650.00  $350.00    $1,117.00  $2,468.00  $1,740.00  $1,210.00  $963.00   
 [118] $1,646.00  $800.00    $470.00    $1,370.00  $839.00    $1,346.00  $1,350.00  $2,100.00  $1,307.00 
 [127] $1,126.00  $4,786.00  $1,478.00  $1,568.00  $1,056.00  $1,608.00  $1,179.00  $1,524.00  $1,730.00 
 [136] $823.00    $800.00    $1,482.00  $1,110.00  $1,555.00  $1,170.00  $886.00    $1,395.00  $2,084.00 
 [145] $8,800.00  $1,160.00  $1,211.00  $1,300.00  $850.00    $1,204.00  $1,528.00  $1,298.00  $600.00   
 [154] $1,320.00  $1,300.00  $1,380.00  $1,426.00  $5,200.00  $1,503.00  $1,449.00  $1,422.00  $1,430.00 
 [163] $1,578.00  $1,250.00  $2,260.00  $280.00    $1,298.00  $945.00    $1,694.00  $700.00    $1,290.00 
 [172] $1,581.00  $1,383.00  $1,637.00  $1,275.00  $1,400.00  $832.00    $2,886.00  $1,257.00  $1,346.00 
 [181] $1,138.00  $1,801.00  $1,200.00  $950.00    $1,589.00  $1,617.00  $1,200.00  $2,194.00  $1,642.00 
 [190] $1,423.00  $1,600.00  $700.00    $300.00    $1,165.00  $1,365.00  $1,706.00  $2,053.00  $1,700.00 
 [199] $1,462.00  $1,300.00  $1,122.00  $1,257.00  $1,387.00  $1,266.00  $2,195.00  $2,004.00  $1,203.00 
 [208] $1,138.00  $1,374.00  $1,514.00  $750.00    $1,894.00  $1,048.00  $1,369.00  $1,035.00  $1,218.00 
 [217] $1,637.00  $953.00    $1,212.00  $1,218.00  $1,419.00  $1,100.00  $987.00    $2,400.00  $1,375.00 
 [226] $1,595.00  $1,054.00  $1,651.00  $1,542.00  $2,014.00  $2,624.00  $1,135.00  $1,105.00  $1,500.00 
 [235] $1,200.00  $1,555.00  $2,251.00  $1,542.00  $2,186.00  $1,700.00  $1,180.00  $800.00    $1,668.00 
 [244] $1,750.00  $875.00    $1,514.00  $1,950.00  $1,462.00  $1,098.00  $840.00    $1,634.00  $1,603.00 
 [253] $1,200.00  $1,360.00  $1,236.00  $1,395.00  $1,390.00  $1,355.00  $1,604.00  $1,241.00  $375.00   
 [262] $1,045.00  $1,781.00  $900.00    $1,698.00  $1,048.00  $2,200.00  $1,119.00  $1,090.00  $4,100.00 
 [271] $1,400.00  $2,010.00  $2,014.00  $1,342.00  $1,132.00  $1,048.00  $1,218.00  $1,114.00  $1,206.00 
 [280] $2,132.00  $570.00    $1,094.00  $550.00    $1,255.00  $4,138.00  $1,714.00  $1,150.00  $2,673.00 
 [289] $1,630.00  $1,544.00  $1,195.00  $1,206.00  $2,050.00  $1,513.00  $3,300.00  $1,250.00  $1,339.00 
 [298] $875.00    $1,101.00  $1,568.00  $1,194.00  $1,378.00  $1,139.00  $350.00    $1,406.00  $1,497.00 
 [307] $1,600.00  $1,350.00  $2,032.00  $2,610.00  $1,193.00  $1,275.00  $3,750.00  $2,500.00  $1,790.00 
 [316] $1,419.00  $1,480.00  $1,542.00  $1,850.00  $1,380.00  $1,888.00  $1,672.00  $808.00    $1,107.00 
 [325] $1,335.00  $1,390.00  $1,363.00  $1,482.00  $2,357.00  $850.00    $1,180.00  $1,617.00  $1,114.00 
 [334] $1,976.00  $1,095.00  $1,380.00  $1,529.00  $1,263.00  $1,600.00  $1,400.00  $1,520.00  $1,617.00 
 [343] $1,500.00  $1,705.00  $1,292.00  $1,294.00  $1,936.00  $2,143.00  $1,394.00  $425.00    $1,600.00 
 [352] $1,563.00  $1,284.00  $990.00    $1,125.00  $1,267.00  $1,780.00  $1,298.00  $1,600.00  $1,340.00 
 [361] $1,244.00  $1,403.00  $1,980.00  $1,290.00  $1,339.00  $1,559.00  $1,078.00  $1,113.00  $2,290.00 
 [370] $1,578.00  $1,039.00  $1,218.00  $1,329.00  $500.00    $1,384.00  $1,553.00  $1,626.00  $1,674.00 
 [379] $1,500.00  $1,421.00  $480.00    $1,274.00  $1,580.00  $1,300.00  $1,314.00  $957.00    $1,500.00 
 [388] $1,504.00  $935.00    $1,028.00  $1,440.00  $1,030.00  $1,366.00  $1,768.00  $2,340.00  $1,700.00 
 [397] $1,450.00  $1,155.00  $1,191.00  $841.00    $2,441.00  $1,516.00  $1,591.00  $1,999.00  $1,336.00 
 [406] $1,600.00  $1,850.00  $1,775.00  $1,224.00  $1,663.00  $1,500.00  $2,225.00  $1,930.00  $1,607.00 
 [415] $1,230.00  $1,581.00  $1,072.00  $1,471.00  $1,300.00  $1,192.00  $2,700.00  $500.00    $1,720.00 
 [424] $1,770.00  $1,197.00  $950.00    $450.00    $1,707.00  $1,201.00  $1,355.00  $1,196.00  $1,297.00 
 [433] $1,420.00  $1,036.00  $1,201.00  $1,527.00  $1,500.00  $1,057.00  $2,127.00  $2,218.00  $1,570.00 
 [442] $650.00    $1,175.00  $825.00    $2,152.00  $978.00    $1,650.00  $2,150.00  $1,771.00  $1,615.00 
 [451] $994.00    $1,147.00  $1,194.00  $825.00    $1,128.00  $360.00    $1,190.00  $1,580.00  $1,600.00 
 [460] $1,735.00  $2,178.00  $1,589.00  $1,828.00  $1,566.00  $1,250.00  $1,626.00  $1,414.00  $1,687.00 
 [469] $1,638.00  $1,296.00  $1,460.00  $1,400.00  $911.00    $1,559.00  $1,315.00  $2,600.00  $1,208.00 
 [478] $975.00    $1,274.00  $1,197.00  $1,092.00  $1,100.00  $1,202.00  $2,646.00  $1,816.00  $1,700.00 
 [487] $1,700.00  $984.00    $1,387.00  $1,636.00  $1,135.00  $1,453.00  $1,500.00  $1,350.00  $2,247.00 
 [496] $1,367.00  $950.00    $2,154.00  $1,100.00  $1,492.00  $1,351.00  $2,040.00  $1,037.00  $1,202.00 
 [505] $1,683.00  $1,000.00  $1,100.00  $1,734.00  $1,604.00  $1,330.00  $877.00    $1,759.00  $1,564.00 
 [514] $1,306.00  $1,108.00  $1,500.00  $1,645.00  $1,500.00  $2,625.00  $1,456.00  $2,178.00  $1,642.00 
 [523] $1,265.00  $325.00    $1,471.00  $1,440.00  $2,140.00  $1,377.00  $1,780.00  $818.00    $1,017.00 
 [532] $1,788.00  $1,212.00  $1,800.00  $1,394.00  $400.00    $940.00    $1,433.00  $1,439.00  $2,414.00 
 [541] $1,571.00  $1,500.00  $1,258.00  $1,403.00  $1,346.00  $2,195.00  $1,637.00  $2,375.00  $873.00   
 [550] $1,450.00  $1,607.00  $1,376.00  $1,092.00  $1,307.00  $1,128.00  $1,698.00  $1,075.00  $1,920.00 
 [559] $1,298.00  $1,362.00  $1,770.00  $2,022.00  $1,225.00  $724.00    $1,346.00  $1,415.00  $1,380.00 
 [568] $1,500.00  $900.00    $1,639.00  $2,220.00  $1,600.00  $1,270.00  $1,556.00  $1,878.00  $1,532.00 
 [577] $900.00    $1,100.00  $1,556.00  $1,054.00  $1,285.00  $1,338.00  $1,571.00  $1,556.00  $1,194.00 
 [586] $1,556.00  $1,989.00  $2,624.00  $925.00    $1,128.00  $2,220.00  $1,395.00  $1,137.00  $875.00   
 [595] $1,417.00  $700.00    $1,207.00  $1,268.00  $1,326.00  $1,480.00  $1,228.00  $2,168.00  $2,100.00 
 [604] $1,197.00  $510.00    $325.00    $1,000.00  $1,748.00  $1,392.00  $820.00    $1,275.00  $1,758.00 
 [613] $1,250.00  $1,560.00  $1,570.00  $1,045.00  $500.00    $1,800.00  $1,600.00  $1,177.00  $1,371.00 
 [622] $1,100.00  $1,256.00  $1,469.00  $1,682.00  $1,150.00  $675.00    $1,236.00  $1,265.00  $1,204.00 
 [631] $1,353.00  $1,330.00  $1,200.00  $1,488.00  $1,695.00  $1,547.00  $1,112.00  $1,062.00  $1,567.00 
 [640] $900.00    $1,280.00  $1,340.00  $1,490.00  $2,640.00  $900.00    $1,569.00  $1,937.00  $1,403.00 
 [649] $1,501.00  $1,255.00  $1,345.00  $2,134.00  $1,590.00  $400.00    $1,406.00  $1,872.00  $1,123.00 
 [658] $1,067.00  $2,253.00  $580.00    $1,992.00  $1,400.00  $1,284.00  $1,749.00  $2,125.00  $1,505.00 
 [667] $1,376.00  $2,154.00  $1,138.00  $1,931.00  $1,698.00  $1,279.00  $2,755.00  $1,600.00  $2,200.00 
 [676] $2,260.00  $1,904.00  $1,580.00  $2,810.00  $700.00    $1,100.00  $1,910.00  $1,360.00  $1,030.00 
 [685] $460.00    $1,339.00  $525.00    $1,315.00  $1,517.00  $1,330.00  $1,290.00  $1,900.00  $1,366.00 
 [694] $4,063.00  $1,064.00  $1,606.00  $1,570.00  $1,381.00  $816.00    $1,150.00  $2,215.00  $1,586.00 
 [703] $1,469.00  $1,556.00  $1,720.00  $1,650.00  $1,191.00  $1,127.00  $1,100.00  $1,480.00  $1,258.00 
 [712] $1,544.00  $1,404.00  $2,053.00  $1,575.00  $1,430.00  $1,409.00  $962.00    $1,400.00  $1,571.00 
 [721] $1,200.00  $1,200.00  $1,842.00  $841.00    $1,585.00  $957.00    $950.00    $1,375.00  $1,800.00 
 [730] $2,212.00  $1,133.00  $1,570.00  $1,569.00  $1,590.00  $1,734.00  $450.00    $2,008.00  $1,150.00 
 [739] $630.00    $1,571.00  $1,770.00  $1,600.00  $1,627.00  $1,094.00  $1,570.00  $935.00    $2,259.00 
 [748] $1,419.00  $820.00    $1,060.00  $600.00    $1,372.00  $758.00    $1,164.00  $1,450.00  $2,125.00 
 [757] $1,557.00  $1,700.00  $1,571.00  $1,654.00  $350.00    $420.00    $2,173.00  $6,802.00  $1,061.00 
 [766] $1,425.00  $546.00    $3,400.00  $1,776.00  $1,200.00  $1,403.00  $1,086.00  $1,200.00  $2,200.00 
 [775] $2,360.00  $1,758.00  $1,783.00  $1,108.00  $1,136.00  $1,557.00  $886.00    $1,490.00  $400.00   
 [784] $1,105.00  $1,131.00  $1,256.00  $1,571.00  $1,443.00  $1,160.00  $896.00    $1,639.00  $1,571.00 
 [793] $1,075.00  $1,590.00  $2,200.00  $450.00    $1,536.00  $1,375.00  $1,005.00  $1,264.00  $1,313.00 
 [802] $1,365.00  $1,105.00  $1,100.00  $1,480.00  $1,310.00  $1,265.00  $1,474.00  $980.00    $600.00   
 [811] $1,635.00  $1,376.00  $675.00    $2,000.00  $978.00    $1,130.00  $2,600.00  $950.00    $1,493.00 
 [820] $2,240.00  $1,224.00  $1,000.00  $2,500.00  $1,120.00  $1,242.00  $725.00    $720.00    $2,470.00 
 [829] $1,150.00  $1,655.00  $1,389.00  $1,310.00  $758.00    $1,770.00  $425.00    $1,035.00  $1,226.00 
 [838] $1,750.00  $1,800.00  $2,150.00  $1,713.00  $975.00    $650.00    $916.00    $625.00    $1,318.00 
 [847] $750.00    $325.00    $1,014.00  $1,525.00  $1,144.00  $550.00    $1,556.00  $1,396.00  $2,470.00 
 [856] $1,890.00  $575.00    $1,294.00  $710.00    $2,600.00  $1,849.00  $450.00    $1,369.00  $2,281.00 
 [865] $4,575.00  $1,332.00  $1,240.00  $931.00    $1,330.00  $1,192.00  $950.00    $1,900.00  $1,810.00 
 [874] $1,700.00  $1,318.00  $750.00    $1,168.00  $1,300.00  $1,111.00  $1,264.00  $1,692.00  $1,123.00 
 [883] $1,386.00  $1,395.00  $1,127.00  $960.00    $2,179.00  $886.00    $1,314.00  $2,800.00  $960.00   
 [892] $873.00    $1,111.00  $1,040.00  $1,995.00  $1,333.00  $1,358.00  $1,613.00  $1,855.00  $2,271.00 
 [901] $925.00    $1,950.00  $2,058.00  $1,600.00  $860.00    $875.00    $1,710.00  $1,260.00  $1,705.00 
 [910] $1,307.00  $1,982.00  $1,479.00  $1,308.00  $2,201.00  $1,350.00  $1,386.00  $2,276.00  $1,608.00 
 [919] $1,585.00  $1,320.00  $3,190.00  $1,500.00  $1,571.00  $1,386.00  $1,680.00  $450.00    $1,488.00 
 [928] $1,396.00  $1,255.00  $1,883.00  $375.00    $3,070.00  $2,189.00  $1,692.00  $2,216.00  $1,298.00 
 [937] $750.00    $1,830.00  $1,242.00  $882.00    $1,137.00  $275.00    $580.00    $1,403.00  $957.00   
 [946] $1,300.00  $1,097.00  $1,033.00  $913.00    $962.00    $1,123.00  $1,400.00  $1,800.00  $1,980.00 
 [955] $989.00    $1,250.00  $1,646.00  $1,114.00  $1,365.00  $1,600.00  $1,133.00  $550.00    $350.00   
 [964] $1,321.00  $1,225.00  $950.00    $550.00    $1,277.00  $1,172.00  $1,318.00  $2,185.00  $2,610.00 
 [973] $1,257.00  $600.00    $1,408.00  $2,413.00  $1,388.00  $1,636.00  $850.00    $1,757.00  $1,600.00 
 [982] $2,022.00  $1,200.00  $1,550.00  $710.00    $1,286.00  $1,400.00  $1,638.00  $1,700.00  $1,850.00 
 [991] $1,677.00  $1,900.00  $1,601.00  $1,200.00  $1,456.00  $1,456.00  $960.00    $1,337.00  $1,359.00 
[1000] $801.00   
 [ reached getOption("max.print") -- omitted 3459 entries ]
1419 Levels: $- $1,000.00  $1,001.00  $1,003.00  $1,005.00  $1,007.00  $1,008.00  $1,011.00  ... $999.00 

To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

Price [1:6]
[1] $846.00    $1,658.00  $2,985.00  $1,325.00  $910.00    $1,645.00 
1419 Levels: $- $1,000.00  $1,001.00  $1,003.00  $1,005.00  $1,007.00  $1,008.00  $1,011.00  ... $999.00 
clean = Price[1:6]
#substitute comma with blank in all of Price 
clean = sub(",","",clean)
#substitute dollar sign with blank in all of Price
clean = sub("\\$","",clean)
class(clean)
[1] "character"
#numeric convert
clean = as.numeric(clean)
class(clean)
[1] "numeric"
#mean with NA removed 
clean
[1]  846 1658 2985 1325  910 1645
#substitute comma with blank in all of Price 
mydata$Price = sub(",","", mydata$Price)
#substitute dollar sign with blank in all of price
mydata$Price = sub("\\$","", mydata$Price)
#numeric convert
mydata$Price = as.numeric(mydata$Price)
#mean with NA removed
mydata <- mydata[complete.cases(mydata$Price), ]
which(is.na(mydata))
integer(0)
mean = mean(mydata$Price)
mean
[1] 1462.48

What are some other ways to clean this data in R? How about in Excel?

mean(mydata$Price,na.rm = TRUE)
[1] 1462.48
sum(mydata$Price,na.rm = TRUE)/length(Price)
[1] 1458.217

Task 3

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.

mydata = read.csv(file="data/Taxi_Trips_Sample.csv")
mydata
mydata$Fare[1:500]
  [1] $7.05   $6.05   $7.05   $31.25  $5.50   $9.25   $9.05   $30.45  $18.25  $17.25  $8.05   $21.25 
 [13] $6.85   $10.45  $7.45   $6.25   $7.45   $7.25   $10.05  $13.25  $35.25  $11.65  $3.25   $9.05  
 [25] $14.25  $15.85  $14.85  $6.85   $38.85  $5.65   $6.45   $15.45  $3.25   $9.75   $14.85  $16.25 
 [37] $3.45   $6.50   $5.45   $9.65   $107.45 $13.45  $35.04  $6.25   $13.65  $4.85   $5.05   $9.00  
 [49] $13.05  $5.45   $10.25  $6.45   $16.25  $8.65   $5.65   $7.05   $8.25   $9.65   $5.25   $7.00  
 [61] $6.65   $5.85   $9.44   $5.50   $9.25   $16.85  $4.65   $9.65   $39.25  $10.65  $4.84   $18.65 
 [73] $5.85   $6.45   $18.85  $7.05   $7.25   $9.25   $7.45   $4.25   $17.45  $4.50   $35.25  $7.45  
 [85] $9.45   $11.45  $9.65   $4.25   $4.25   $6.65   $13.45  $7.50   $5.65   $4.84   $7.45   $4.25  
 [97] $38.00  $11.50  $8.75   $8.00   $4.45   $14.25  $6.65   $7.05   $12.65  $4.45   $9.85   $5.05  
[109] $9.25   $31.45  $6.65   $10.65  $10.85  $5.25   $11.25  $5.45   $6.85   $6.45   $7.85   $17.25 
[121] $8.65   $8.85   $5.05   $7.65   $6.45   $11.05  $8.65   $8.05   $37.45  $8.50   $16.85  $4.84  
[133] $6.65   $9.45   $7.05   $7.25   $13.85  $15.45  $4.25   $12.45  $23.85  $6.85   $24.25  $38.25 
[145] $7.05   $17.85  $7.25   $37.05  $5.65   $6.85   $10.75  $5.25   $5.25   $13.45  $36.65  $9.25  
[157] $7.45   $8.25   $10.25  $35.05  $7.25   $14.00  $4.85   $8.25   $7.05   $8.85   $5.85   $7.45  
[169] $6.00   $16.25  $3.45   $5.50   $9.25   $6.85   $7.05   $8.65   $8.05   $5.65   $5.25   $3.25  
[181] $36.25  $6.65   $8.05   $4.45   $8.45   $5.45   $23.85  $7.05   $15.05  $6.75   $5.75   $9.05  
[193] $12.45  $9.65   $8.45   $7.45   $31.45  $6.65   $10.25  $7.65   $7.65   $7.25   $6.45   $90.95 
[205] $20.45  $5.75   $3.25   $3.45   $9.44   $9.75   $7.00   $8.85   $44.85  $7.65   $10.25  $10.25 
[217] $5.25   $8.25   $8.85   $6.05   $4.05   $9.25   $13.05  $32.25  $6.05   $3.25   $6.25   $22.50 
[229] $6.25   $13.45  $5.85   $5.45   $4.85   $11.65  $4.85   $25.65  $5.05   $6.25   $7.25   $4.05  
[241] $4.25   $5.00   $17.25  $16.00  $12.05  $44.75  $12.65  $5.45   $48.65  $5.45   $4.25   $4.65  
[253] $6.25   $7.45   $5.85   $10.45  $7.45   $4.65   $58.45  $22.85  $11.00  $14.45  $6.25   $7.05  
[265] $19.45  $11.45  $11.85  $12.25  $7.05   $4.65   $4.50   $6.45   $8.25   $15.25  $35.04  $3.45  
[277] $36.45  $20.45  $4.65   $6.25   $36.05  $5.05   $16.85  $8.25   $5.75   $12.45  $7.65   $10.45 
[289] $7.25   $4.25   $8.25   $7.05   $6.05   $11.85  $21.05  $14.00  $6.75   $20.05  $8.44   $46.50 
[301] $4.25   $38.85  $7.85   $4.85   $7.05   $5.85   $4.45   $13.65  $8.45   $33.05  $5.05   $10.85 
[313] $5.05   $11.65  $5.75   $4.65   $5.25   $7.65   $3.85   $4.45   $36.05  $6.65   $6.25   $13.05 
[325] $4.75   $4.45   $37.05  $10.45  $30.25  $24.45  $8.25   $17.05  $8.65   $13.25  $20.75  $6.00  
[337] $28.65  $7.05   $5.45   $15.25  $15.00  $6.00   $36.85  $7.75   $8.85   $7.25   $12.65  $14.05 
[349] $15.45  $52.50  $6.45   $16.65  $8.25   $5.65   $27.05  $12.25  $10.25  $5.05   $32.05  $8.25  
[361] $7.05   $19.65  $29.45  $3.25   $7.85   $4.65   $6.25   $8.65   $5.85   $9.85   $8.05   $16.75 
[373] $3.25   $4.65   $3.25   $9.25   $7.25   $10.85  $7.85   $9.85   $9.65   $40.25  $8.05   $10.25 
[385] $11.75  $23.85  $30.05  $7.05   $5.25   $8.85   $6.65   $7.05   $4.65   $7.50   $9.25   $17.25 
[397] $9.85   $5.45   $32.05  $34.25  $9.45   $13.05  $5.65   $3.25   $34.85  $13.05  $4.85   $3.25  
[409] $6.05   $10.25  $14.65  $16.25  $13.25  $14.65  $7.45   $67.00  $7.45   $5.45   $6.25   $10.05 
[421] $6.85   $4.65   $12.50  $6.45   $11.25  $5.25   $9.75   $30.25  $7.85   $6.50   $5.45   $10.25 
[433] $6.25   $7.45   $6.05   $5.45   $8.50   $5.45   $5.25   $8.44   $10.85  $3.25   $7.05   $9.25  
[445] $8.50   $17.85  $4.85   $0.00   $17.64  $7.85   $4.05   $7.25   $35.65  $15.05  $6.25   $7.25  
[457] $39.45  $37.25  $5.85   $9.85   $26.85  $3.25   $7.85   $6.15   $5.25   $64.25  $6.05   $8.25  
[469] $35.85  $10.45  $12.75  $4.45   $7.00   $7.05   $5.85   $37.45  $7.45   $5.85   $6.65   $3.25  
[481] $16.00  $6.45   $8.25   $0.00   $10.05  $9.65   $26.85  $5.25   $10.75  $5.05   $5.65   $8.65  
[493] $9.25   $7.85   $5.05   $7.05   $10.50  $6.45   $38.45  $32.85 
892 Levels:  $0.00 $0.01 $0.03 $0.05 $0.10 $0.11 $0.28 $0.30 $0.32 $0.34 $0.42 $0.60 $1.00 $1.11 ... $99.99

Define a relational business logic for the column field ‘Trip Seconds’.

Using https://erdplus.com/#/standalone a star schema was compiled using the following:

The follwing relational schema includes variables that are subject to change (under the fact table) and items which tend to be constant, such as those of a form of payment or the location from which an individual gets picked up)

---
title: "Business Analytics Lab Worksheet 02"
author: "Marlen Sotelo"
date: "July 18, 2017"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
subtitle: CME Group Foundation Business Analytics Lab
---

### About

This lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.

---------

### Task 1

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.

```{r}
#Read File
mydata = read.csv(file = "data/Scoring.csv")
#Name the extracted variable
age = mydata$Age
age
```

```{r}
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
age_mean = mean(age)
age_mean
```

```{r}
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command. 
age_sd = sd(age)
age_sd
```

```{r}
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable. 
max(age)
```

```{r}
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.  
min(age)
```

Next, use the formula from class to detect any outliers. An outlier is value that "lies outside" most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the ```mean (+ or -) 3 * standard deviation```. 

*Below is an estimation of the upper and lower threshold taken by using the mean (+ or -)3 x standard deviation*

```{r}
#Use the formula above to calculate the upper and lower threshold
age_lower = age_mean - (3) * age_sd
age_upper = age_mean + (3) * age_sd
age_lower
age_upper
```

*A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Below you will see a the steps to calculating the interquartile. We begin by specifying the quantile, followed by the upper and lower quantile to ultimately form our equation.*

```{r} 
quantile(age) 
lowerq = quantile(age)[2]
upperq = quantile(age)[3]
iqr = upperq - lowerq
iqr
```

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier. 

Below is our calculation for our the upper threshold:
```{r} 
upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
```

Below is the calculation for our lower threshold:
```{r}
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
```

*The calculations below demonstrate our upper and lower threshold for the scoring data shown initially*

```{r}
age[age>upperthreshold]
age[age<lowerthreshold]
mydata[age>upperthreshold,]
```
```{r} 
boxplot(age,horizontal = TRUE)
```

---------------

### Task 2

Below we will read the 'scoring_original.csv' file into R. This is the original dataset and contains missing values

```{r}
mydata = read.csv(file = "data/scoring_original.csv")
head(mydata)
#tail(data)
```

*Given that the column names are shifted down below we must ensure to use the command skip and set the header to true.*

```{r}
mydata = read.csv("data/scoring_original.csv")
head(mydata)
```

*Below is a summary of our originial scoring data.*
```{r}
summary(mydata)
```

To calculate the mean for price in R, follow Worksheet 1. Extract the price column first and then find the average using the function built in R.
What happens when we try to use the function?
*when extracting the price column and then calculating the mean it gives us an N/A result*
```{r}
mydata = read.csv(file = "data/scoring_original.csv")
mydata
```
```{r}
#Extracting the Price Column
Price = mydata$Price

#Calling the Price Column
Price 
```
To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that '1,234' is equivalent to '1234'. Lastly, there are '$' symbols throughout the file which is not a numerica symbol either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number "1,234", we must substitute it with just an empty space.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

```{r} 
Price [1:6]
```

```{r}
clean = Price[1:6]
#substitute comma with blank in all of Price 
clean = sub(",","",clean)
#substitute dollar sign with blank in all of Price
clean = sub("\\$","",clean)
class(clean)
#numeric convert
clean = as.numeric(clean)

class(clean)
#mean with NA removed 
clean
```

```{r} 
#substitute comma with blank in all of Price 
mydata$Price = sub(",","", mydata$Price)

#substitute dollar sign with blank in all of price
mydata$Price = sub("\\$","", mydata$Price)

#numeric convert
mydata$Price = as.numeric(mydata$Price)
#mean with NA removed
mydata <- mydata[complete.cases(mydata$Price), ]
which(is.na(mydata))
mean = mean(mydata$Price)
mean
```

What are some other ways to clean this data in R? How about in Excel? 

```{r}
mean(mydata$Price,na.rm = TRUE)
sum(mydata$Price,na.rm = TRUE)/length(Price)
```

-------------

### Task 3

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: [https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq](https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq)

Chicago Taxi Data Description:
[http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/](http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/) 

Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.
```{r}
mydata = read.csv(file="data/Taxi_Trips_Sample.csv")
mydata
```

```{r}
mydata$Fare[1:500]
```

Define a relational business logic for the column field 'Trip Seconds'.

Using [https://erdplus.com/#/standalone](https://erdplus.com/#/standalone) a star schema was compiled using the following:

- A Fact table for Trip
- A Dimension table for Community Area
- A Dimension table for Rider

*The follwing relational schema includes variables that are subject to change (under the fact table) and items which tend to be constant, such as those of a form of payment or the location from which an individual gets picked up)*

![](imgs/erdplus-diagram (1).png)

