Task 1

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.

#Read File
mydata= read.csv(file = "data/scoring.csv")
mydata


#Name the extracted variable
age= mydata$Age
age
#The average age is calculated below.
age_mean=mean(age)
age_mean
[1] 37.08412
#The standard deviation of age is calculated below.
age_sd= sd(age)
age_sd
[1] 10.98637
#The maximum of age is calculated below.  
agemax=max(age)
agemax
[1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
agemin=min(age)
agemin
[1] 18

An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation.

#Use the formula above to calculate the upper and lower threshold
age_lower= age_mean - (3) * age_sd
age_upper= age_mean + (3) * age_sd 
age_upper
[1] 70.04322
age_lower
[1] 4.125023

A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range.

quantile(age) 
  0%  25%  50%  75% 100% 
  18   28   36   45   68 
lowerq= quantile(age) [2]
upperq= quantile(age) [4]
iqr = upperq - lowerq
iqr
75% 
 17 

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
 75% 
70.5 

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25% 
2.5 

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.

age[age>upperthreshold]
integer(0)
mydata[age>upperthreshold,]
age[age<lowerthreshold]
integer(0)
boxplot(age,horizontal =TRUE)


Task 2

mydata=read.csv(file="data/scoring_original.csv")
head(mydata)
tail(mydata)
mydata=read.csv("data/scoring_original.csv")
head(mydata)
#str(mydata)
summary(mydata)
  Status       Seniority           Home           Time            Age                Marital        Records    
     :  10   Min.   : 0.000          :  10   Min.   : 6.00   Min.   :-9999.00            :  10          :  10  
 bad :1249   1st Qu.: 2.000   ignore :  20   1st Qu.:36.00   1st Qu.:   28.00   divorced :  38   no_rec :3680  
 good:3197   Median : 5.000   other  : 319   Median :48.00   Median :   36.00   married  :3240   yes_rec: 769  
 NA's:   3   Mean   : 7.987   owner  :2107   Mean   :46.45   Mean   :   14.57   separated: 130                 
             3rd Qu.:12.000   parents: 782   3rd Qu.:60.00   3rd Qu.:   45.00   single   : 974                 
             Max.   :48.000   priv   : 246   Max.   :72.00   Max.   :   68.00   widow    :  67                 
             NA's   :10       rent   : 975   NA's   :10                                                        
        Job          Expenses        Income         Assets            Debt              Amount            Price     
          :  10   $35K   :1207   $100K  : 155   Min.   :     0   Min.   :    0   $1,000.00 : 541   $1,500.00 :  46  
 fixed    :2806   $45K   : 819   $150K  : 115   1st Qu.:     0   1st Qu.:    0   $1,200.00 : 221   $1,200.00 :  45  
 freelance:1021   $60K   : 792   $120K  : 107   Median :  3000   Median :    0   $800.00   : 219   $1,300.00 :  45  
 others   : 171   $75K   : 570   $80K   :  87   Mean   :  5352   Mean   :  342   $1,100.00 : 210   $1,600.00 :  43  
 partime  : 451   $90K   : 192   $110K  :  86   3rd Qu.:  6000   3rd Qu.:    0   $1,300.00 : 198   $1,100.00 :  41  
                  $105K  :  40   $90K   :  84   Max.   :300000   Max.   :30000   $1,500.00 : 198   $1,700.00 :  39  
                  (Other): 839   (Other):3825   NA's   :10       NA's   :10      (Other)   :2872   (Other)   :4200  
     Finrat           Savings      
 Min.   :  6.702   Min.   :-8.160  
 1st Qu.: 60.020   1st Qu.: 1.615  
 Median : 77.093   Median : 3.120  
 Mean   : 72.614   Mean   : 3.858  
 3rd Qu.: 88.462   3rd Qu.: 5.195  
 Max.   :100.000   Max.   :33.250  
 NA's   :10        NA's   :10      

To calculate the mean for price we must extract the price column first and then find the average using the function built in R. What happens when we try to use the function?

#Extracting the variable price
Variable_price = mydata$Price
Variable_price
   [1] $846.00    $1,658.00  $2,985.00  $1,325.00  $910.00    $1,645.00  $1,800.00  $1,093.00  $1,957.00  $1,468.00 
  [11] $1,577.00  $915.00    $1,650.00  $940.00    $500.00    $1,186.00  $2,201.00  $1,350.00  $1,511.00  $1,253.00 
  [21] $2,189.00  $-         $1,159.00  $1,332.00  $1,497.00  $1,357.00  $2,100.00  $1,070.00  $2,557.00  $1,600.00 
  [31] $-         $1,312.00  $400.00    $650.00    $1,394.00  $1,542.00  $1,200.00  $1,560.00  $1,200.00  $950.00   
  [41] $-         $1,300.00  $1,700.00  $1,167.00  $1,150.00  $1,566.00  $1,552.00  $1,300.00  $2,104.00  $2,154.00 
  [51] $545.00    $1,778.00  $1,718.00  $1,500.00  $1,274.00  $1,015.00  $1,701.00  $1,345.00  $1,238.00  $1,048.00 
  [61] $1,048.00  $1,324.00  $2,201.00  $1,430.00  $926.00    $1,500.00  $1,542.00  $1,360.00  $1,000.00  $1,564.00 
  [71] $1,205.00  $2,200.00  $2,214.00  $1,137.00  $1,025.00  $1,593.00  $1,735.00  $1,132.00  $1,338.00  $1,100.00 
  [81] $2,276.00  $1,360.00  $1,436.00  $612.00    $1,100.00  $1,300.00  $2,089.00  $2,262.00  $1,054.00  $1,426.00 
  [91] $1,048.00  $1,292.00  $1,177.00  $1,490.00  $850.00    $1,011.00  $1,571.00  $1,000.00  $270.00    $1,382.00 
 [101] $1,713.00  $1,086.00  $3,262.00  $1,610.00  $1,030.00  $650.00    $2,175.00  $1,780.00  $950.00    $1,211.00 
 [111] $1,650.00  $350.00    $1,117.00  $2,468.00  $1,740.00  $1,210.00  $963.00    $1,646.00  $800.00    $470.00   
 [121] $1,370.00  $839.00    $1,346.00  $1,350.00  $2,100.00  $1,307.00  $1,126.00  $4,786.00  $1,478.00  $1,568.00 
 [131] $1,056.00  $1,608.00  $1,179.00  $1,524.00  $1,730.00  $823.00    $800.00    $1,482.00  $1,110.00  $1,555.00 
 [141] $1,170.00  $886.00    $1,395.00  $2,084.00  $8,800.00  $1,160.00  $1,211.00  $1,300.00  $850.00    $1,204.00 
 [151] $1,528.00  $1,298.00  $600.00    $1,320.00  $1,300.00  $1,380.00  $1,426.00  $5,200.00  $1,503.00  $1,449.00 
 [161] $1,422.00  $1,430.00  $1,578.00  $1,250.00  $2,260.00  $280.00    $1,298.00  $945.00    $1,694.00  $700.00   
 [171] $1,290.00  $1,581.00  $1,383.00  $1,637.00  $1,275.00  $1,400.00  $832.00    $2,886.00  $1,257.00  $1,346.00 
 [181] $1,138.00  $1,801.00  $1,200.00  $950.00    $1,589.00  $1,617.00  $1,200.00  $2,194.00  $1,642.00  $1,423.00 
 [191] $1,600.00  $700.00    $300.00    $1,165.00  $1,365.00  $1,706.00  $2,053.00  $1,700.00  $1,462.00  $1,300.00 
 [201] $1,122.00  $1,257.00  $1,387.00  $1,266.00  $2,195.00  $2,004.00  $1,203.00  $1,138.00  $1,374.00  $1,514.00 
 [211] $750.00    $1,894.00  $1,048.00  $1,369.00  $1,035.00  $1,218.00  $1,637.00  $953.00    $1,212.00  $1,218.00 
 [221] $1,419.00  $1,100.00  $987.00    $2,400.00  $1,375.00  $1,595.00  $1,054.00  $1,651.00  $1,542.00  $2,014.00 
 [231] $2,624.00  $1,135.00  $1,105.00  $1,500.00  $1,200.00  $1,555.00  $2,251.00  $1,542.00  $2,186.00  $1,700.00 
 [241] $1,180.00  $800.00    $1,668.00  $1,750.00  $875.00    $1,514.00  $1,950.00  $1,462.00  $1,098.00  $840.00   
 [251] $1,634.00  $1,603.00  $1,200.00  $1,360.00  $1,236.00  $1,395.00  $1,390.00  $1,355.00  $1,604.00  $1,241.00 
 [261] $375.00    $1,045.00  $1,781.00  $900.00    $1,698.00  $1,048.00  $2,200.00  $1,119.00  $1,090.00  $4,100.00 
 [271] $1,400.00  $2,010.00  $2,014.00  $1,342.00  $1,132.00  $1,048.00  $1,218.00  $1,114.00  $1,206.00  $2,132.00 
 [281] $570.00    $1,094.00  $550.00    $1,255.00  $4,138.00  $1,714.00  $1,150.00  $2,673.00  $1,630.00  $1,544.00 
 [291] $1,195.00  $1,206.00  $2,050.00  $1,513.00  $3,300.00  $1,250.00  $1,339.00  $875.00    $1,101.00  $1,568.00 
 [301] $1,194.00  $1,378.00  $1,139.00  $350.00    $1,406.00  $1,497.00  $1,600.00  $1,350.00  $2,032.00  $2,610.00 
 [311] $1,193.00  $1,275.00  $3,750.00  $2,500.00  $1,790.00  $1,419.00  $1,480.00  $1,542.00  $1,850.00  $1,380.00 
 [321] $1,888.00  $1,672.00  $808.00    $1,107.00  $1,335.00  $1,390.00  $1,363.00  $1,482.00  $2,357.00  $850.00   
 [331] $1,180.00  $1,617.00  $1,114.00  $1,976.00  $1,095.00  $1,380.00  $1,529.00  $1,263.00  $1,600.00  $1,400.00 
 [341] $1,520.00  $1,617.00  $1,500.00  $1,705.00  $1,292.00  $1,294.00  $1,936.00  $2,143.00  $1,394.00  $425.00   
 [351] $1,600.00  $1,563.00  $1,284.00  $990.00    $1,125.00  $1,267.00  $1,780.00  $1,298.00  $1,600.00  $1,340.00 
 [361] $1,244.00  $1,403.00  $1,980.00  $1,290.00  $1,339.00  $1,559.00  $1,078.00  $1,113.00  $2,290.00  $1,578.00 
 [371] $1,039.00  $1,218.00  $1,329.00  $500.00    $1,384.00  $1,553.00  $1,626.00  $1,674.00  $1,500.00  $1,421.00 
 [381] $480.00    $1,274.00  $1,580.00  $1,300.00  $1,314.00  $957.00    $1,500.00  $1,504.00  $935.00    $1,028.00 
 [391] $1,440.00  $1,030.00  $1,366.00  $1,768.00  $2,340.00  $1,700.00  $1,450.00  $1,155.00  $1,191.00  $841.00   
 [401] $2,441.00  $1,516.00  $1,591.00  $1,999.00  $1,336.00  $1,600.00  $1,850.00  $1,775.00  $1,224.00  $1,663.00 
 [411] $1,500.00  $2,225.00  $1,930.00  $1,607.00  $1,230.00  $1,581.00  $1,072.00  $1,471.00  $1,300.00  $1,192.00 
 [421] $2,700.00  $500.00    $1,720.00  $1,770.00  $1,197.00  $950.00    $450.00    $1,707.00  $1,201.00  $1,355.00 
 [431] $1,196.00  $1,297.00  $1,420.00  $1,036.00  $1,201.00  $1,527.00  $1,500.00  $1,057.00  $2,127.00  $2,218.00 
 [441] $1,570.00  $650.00    $1,175.00  $825.00    $2,152.00  $978.00    $1,650.00  $2,150.00  $1,771.00  $1,615.00 
 [451] $994.00    $1,147.00  $1,194.00  $825.00    $1,128.00  $360.00    $1,190.00  $1,580.00  $1,600.00  $1,735.00 
 [461] $2,178.00  $1,589.00  $1,828.00  $1,566.00  $1,250.00  $1,626.00  $1,414.00  $1,687.00  $1,638.00  $1,296.00 
 [471] $1,460.00  $1,400.00  $911.00    $1,559.00  $1,315.00  $2,600.00  $1,208.00  $975.00    $1,274.00  $1,197.00 
 [481] $1,092.00  $1,100.00  $1,202.00  $2,646.00  $1,816.00  $1,700.00  $1,700.00  $984.00    $1,387.00  $1,636.00 
 [491] $1,135.00  $1,453.00  $1,500.00  $1,350.00  $2,247.00  $1,367.00  $950.00    $2,154.00  $1,100.00  $1,492.00 
 [501] $1,351.00  $2,040.00  $1,037.00  $1,202.00  $1,683.00  $1,000.00  $1,100.00  $1,734.00  $1,604.00  $1,330.00 
 [511] $877.00    $1,759.00  $1,564.00  $1,306.00  $1,108.00  $1,500.00  $1,645.00  $1,500.00  $2,625.00  $1,456.00 
 [521] $2,178.00  $1,642.00  $1,265.00  $325.00    $1,471.00  $1,440.00  $2,140.00  $1,377.00  $1,780.00  $818.00   
 [531] $1,017.00  $1,788.00  $1,212.00  $1,800.00  $1,394.00  $400.00    $940.00    $1,433.00  $1,439.00  $2,414.00 
 [541] $1,571.00  $1,500.00  $1,258.00  $1,403.00  $1,346.00  $2,195.00  $1,637.00  $2,375.00  $873.00    $1,450.00 
 [551] $1,607.00  $1,376.00  $1,092.00  $1,307.00  $1,128.00  $1,698.00  $1,075.00  $1,920.00  $1,298.00  $1,362.00 
 [561] $1,770.00  $2,022.00  $1,225.00  $724.00    $1,346.00  $1,415.00  $1,380.00  $1,500.00  $900.00    $1,639.00 
 [571] $2,220.00  $1,600.00  $1,270.00  $1,556.00  $1,878.00  $1,532.00  $900.00    $1,100.00  $1,556.00  $1,054.00 
 [581] $1,285.00  $1,338.00  $1,571.00  $1,556.00  $1,194.00  $1,556.00  $1,989.00  $2,624.00  $925.00    $1,128.00 
 [591] $2,220.00  $1,395.00  $1,137.00  $875.00    $1,417.00  $700.00    $1,207.00  $1,268.00  $1,326.00  $1,480.00 
 [601] $1,228.00  $2,168.00  $2,100.00  $1,197.00  $510.00    $325.00    $1,000.00  $1,748.00  $1,392.00  $820.00   
 [611] $1,275.00  $1,758.00  $1,250.00  $1,560.00  $1,570.00  $1,045.00  $500.00    $1,800.00  $1,600.00  $1,177.00 
 [621] $1,371.00  $1,100.00  $1,256.00  $1,469.00  $1,682.00  $1,150.00  $675.00    $1,236.00  $1,265.00  $1,204.00 
 [631] $1,353.00  $1,330.00  $1,200.00  $1,488.00  $1,695.00  $1,547.00  $1,112.00  $1,062.00  $1,567.00  $900.00   
 [641] $1,280.00  $1,340.00  $1,490.00  $2,640.00  $900.00    $1,569.00  $1,937.00  $1,403.00  $1,501.00  $1,255.00 
 [651] $1,345.00  $2,134.00  $1,590.00  $400.00    $1,406.00  $1,872.00  $1,123.00  $1,067.00  $2,253.00  $580.00   
 [661] $1,992.00  $1,400.00  $1,284.00  $1,749.00  $2,125.00  $1,505.00  $1,376.00  $2,154.00  $1,138.00  $1,931.00 
 [671] $1,698.00  $1,279.00  $2,755.00  $1,600.00  $2,200.00  $2,260.00  $1,904.00  $1,580.00  $2,810.00  $700.00   
 [681] $1,100.00  $1,910.00  $1,360.00  $1,030.00  $460.00    $1,339.00  $525.00    $1,315.00  $1,517.00  $1,330.00 
 [691] $1,290.00  $1,900.00  $1,366.00  $4,063.00  $1,064.00  $1,606.00  $1,570.00  $1,381.00  $816.00    $1,150.00 
 [701] $2,215.00  $1,586.00  $1,469.00  $1,556.00  $1,720.00  $1,650.00  $1,191.00  $1,127.00  $1,100.00  $1,480.00 
 [711] $1,258.00  $1,544.00  $1,404.00  $2,053.00  $1,575.00  $1,430.00  $1,409.00  $962.00    $1,400.00  $1,571.00 
 [721] $1,200.00  $1,200.00  $1,842.00  $841.00    $1,585.00  $957.00    $950.00    $1,375.00  $1,800.00  $2,212.00 
 [731] $1,133.00  $1,570.00  $1,569.00  $1,590.00  $1,734.00  $450.00    $2,008.00  $1,150.00  $630.00    $1,571.00 
 [741] $1,770.00  $1,600.00  $1,627.00  $1,094.00  $1,570.00  $935.00    $2,259.00  $1,419.00  $820.00    $1,060.00 
 [751] $600.00    $1,372.00  $758.00    $1,164.00  $1,450.00  $2,125.00  $1,557.00  $1,700.00  $1,571.00  $1,654.00 
 [761] $350.00    $420.00    $2,173.00  $6,802.00  $1,061.00  $1,425.00  $546.00    $3,400.00  $1,776.00  $1,200.00 
 [771] $1,403.00  $1,086.00  $1,200.00  $2,200.00  $2,360.00  $1,758.00  $1,783.00  $1,108.00  $1,136.00  $1,557.00 
 [781] $886.00    $1,490.00  $400.00    $1,105.00  $1,131.00  $1,256.00  $1,571.00  $1,443.00  $1,160.00  $896.00   
 [791] $1,639.00  $1,571.00  $1,075.00  $1,590.00  $2,200.00  $450.00    $1,536.00  $1,375.00  $1,005.00  $1,264.00 
 [801] $1,313.00  $1,365.00  $1,105.00  $1,100.00  $1,480.00  $1,310.00  $1,265.00  $1,474.00  $980.00    $600.00   
 [811] $1,635.00  $1,376.00  $675.00    $2,000.00  $978.00    $1,130.00  $2,600.00  $950.00    $1,493.00  $2,240.00 
 [821] $1,224.00  $1,000.00  $2,500.00  $1,120.00  $1,242.00  $725.00    $720.00    $2,470.00  $1,150.00  $1,655.00 
 [831] $1,389.00  $1,310.00  $758.00    $1,770.00  $425.00    $1,035.00  $1,226.00  $1,750.00  $1,800.00  $2,150.00 
 [841] $1,713.00  $975.00    $650.00    $916.00    $625.00    $1,318.00  $750.00    $325.00    $1,014.00  $1,525.00 
 [851] $1,144.00  $550.00    $1,556.00  $1,396.00  $2,470.00  $1,890.00  $575.00    $1,294.00  $710.00    $2,600.00 
 [861] $1,849.00  $450.00    $1,369.00  $2,281.00  $4,575.00  $1,332.00  $1,240.00  $931.00    $1,330.00  $1,192.00 
 [871] $950.00    $1,900.00  $1,810.00  $1,700.00  $1,318.00  $750.00    $1,168.00  $1,300.00  $1,111.00  $1,264.00 
 [881] $1,692.00  $1,123.00  $1,386.00  $1,395.00  $1,127.00  $960.00    $2,179.00  $886.00    $1,314.00  $2,800.00 
 [891] $960.00    $873.00    $1,111.00  $1,040.00  $1,995.00  $1,333.00  $1,358.00  $1,613.00  $1,855.00  $2,271.00 
 [901] $925.00    $1,950.00  $2,058.00  $1,600.00  $860.00    $875.00    $1,710.00  $1,260.00  $1,705.00  $1,307.00 
 [911] $1,982.00  $1,479.00  $1,308.00  $2,201.00  $1,350.00  $1,386.00  $2,276.00  $1,608.00  $1,585.00  $1,320.00 
 [921] $3,190.00  $1,500.00  $1,571.00  $1,386.00  $1,680.00  $450.00    $1,488.00  $1,396.00  $1,255.00  $1,883.00 
 [931] $375.00    $3,070.00  $2,189.00  $1,692.00  $2,216.00  $1,298.00  $750.00    $1,830.00  $1,242.00  $882.00   
 [941] $1,137.00  $275.00    $580.00    $1,403.00  $957.00    $1,300.00  $1,097.00  $1,033.00  $913.00    $962.00   
 [951] $1,123.00  $1,400.00  $1,800.00  $1,980.00  $989.00    $1,250.00  $1,646.00  $1,114.00  $1,365.00  $1,600.00 
 [961] $1,133.00  $550.00    $350.00    $1,321.00  $1,225.00  $950.00    $550.00    $1,277.00  $1,172.00  $1,318.00 
 [971] $2,185.00  $2,610.00  $1,257.00  $600.00    $1,408.00  $2,413.00  $1,388.00  $1,636.00  $850.00    $1,757.00 
 [981] $1,600.00  $2,022.00  $1,200.00  $1,550.00  $710.00    $1,286.00  $1,400.00  $1,638.00  $1,700.00  $1,850.00 
 [991] $1,677.00  $1,900.00  $1,601.00  $1,200.00  $1,456.00  $1,456.00  $960.00    $1,337.00  $1,359.00  $801.00   
 [ reached getOption("max.print") -- omitted 3459 entries ]
1419 Levels: $- $1,000.00  $1,001.00  $1,003.00  $1,005.00  $1,007.00  $1,008.00  $1,011.00  $1,012.00  ... $999.00 
mean_price= mean(Variable_price)
argument is not numeric or logical: returning NA
mean_price
[1] NA
Variable_price [1:6]
[1] $846.00    $1,658.00  $2,985.00  $1,325.00  $910.00    $1,645.00 
1419 Levels: $- $1,000.00  $1,001.00  $1,003.00  $1,005.00  $1,007.00  $1,008.00  $1,011.00  $1,012.00  ... $999.00 
mydata$Price <- gsub("\\$", " ", mydata$Price)
mydata$Price <- gsub(",", "", mydata$Price)
mydata$Price <- sub("$", "", mydata$Price)
clean=Variable_price[1:10]
#substitute comma with blank in all of checking 
clean= sub(",","",clean)
#substitute dollar sign with blank in all of checking 
clean=sub("\\$","",clean)
class(clean)
[1] "character"
#numeric convert
clean=as.numeric(clean)
class(clean)
[1] "numeric"
#mean with NA removed 
clean
 [1]  846 1658 2985 1325  910 1645 1800 1093 1957 1468
#substitute comma with blank in all variable price
Variable_price= sub(",","",Variable_price)
#substitute dollar sign with blank in all variable price
Variable_price= sub("\\$","",Variable_price)
#numeric convert 
Variable_price= as.numeric(Variable_price)
#mean with NA removed
Variable_price2 = mean(Variable_price)
Variable_price2
[1] NA
#What are some other ways to clean this data in R? How about in Excel? 
mean(Variable_price,na.rm= TRUE)
[1] 1462.48
sum(Variable_price,na.rm = TRUE)
[1] 6502188
sum(Variable_price,na.rm = TRUE)/ length(Variable_price)
[1] 1458.217

Task 3

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.

Define a relational business logic for the column field ‘Trip Seconds’.

Using https://erdplus.com/#/standalone draw a star schema using the following three tables:

mydata=read.csv(file="data/Taxi_Trips_Sample.csv")
head(mydata)
tail(mydata)

After looking at the data provided by Chicago Taxi Data.I was able to make this star schema.

In the relationship schema I made four categories.The four relationship schema categories are trip fact table, distance, community areas and rider.The four categories are related to one another.The schema starts off from the trip fact table, inside the trip fact table the trip_miles connects with the distance dimension.Under the distance dimension are the pickup,dropoff latitude and longitude properties. From the distance dimension it connects to the pickup centroid and dropoff centroid locations.From the community areas dimension it connects with the rider dimension.lastly, in the trip fact table I put trip duration under the trip fact table.The properties of start time and stop time along with taxi ID are important factors to the duration. The relationship between the community areas and the distance are important for the trip.

---
title: "Business Analytics Lab Worksheet 02"
author: "Stefania Tolari"
date: "July,18 2017"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
subtitle: CME Group Foundation Business Analytics Lab
---

---------

### Task 1

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.


```{r}
#Read File
mydata= read.csv(file = "data/scoring.csv")
mydata


#Name the extracted variable
age= mydata$Age
age

```


```{r}
#The average age is calculated below.

age_mean=mean(age)
age_mean
```

```{r}
#The standard deviation of age is calculated below.

age_sd= sd(age)
age_sd
```


```{r}


#The maximum of age is calculated below.  

agemax=max(age)
agemax
```




```{r}
#The minimum of age is calculated below. The command to find the minimum is min(variable) where variable is the extracted variable.

agemin=min(age)
agemin
```


An outlier is value that "lies outside" most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the ```mean (+ or -) 3 * standard deviation```.
```{r}
#Use the formula above to calculate the upper and lower threshold

age_lower= age_mean - (3) * age_sd
age_upper= age_mean + (3) * age_sd 
age_upper
age_lower
```

A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range.

```{r} 
quantile(age) 

lowerq= quantile(age) [2]
upperq= quantile(age) [4]
iqr = upperq - lowerq
iqr
```

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier. 

Below is the upper threshold:
```{r} 
upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
```

Below is the lower threshold:
```{r}
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
```

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.

```{r} 

age[age>upperthreshold]
mydata[age>upperthreshold,]
age[age<lowerthreshold]

boxplot(age,horizontal =TRUE)
```

---------------

### Task 2

```{r}
mydata=read.csv(file="data/scoring_original.csv")
head(mydata)
tail(mydata)
```



```{r} 
mydata=read.csv("data/scoring_original.csv")
head(mydata)

#str(mydata)
summary(mydata)
```

To calculate the mean for price we must extract the price column first and then find the average using the function built in R.
What happens when we try to use the function?

```{r} 
#Extracting the variable price

Variable_price = mydata$Price
Variable_price

mean_price= mean(Variable_price)
mean_price
```


```{r}
Variable_price [1:6]
```



```{r}
mydata$Price <- gsub("\\$", " ", mydata$Price)
mydata$Price <- gsub(",", "", mydata$Price)

```

```{r}

mydata$Price <- sub("$", "", mydata$Price)



clean=Variable_price[1:10]

#substitute comma with blank in all of checking 
clean= sub(",","",clean)

#substitute dollar sign with blank in all of checking 
clean=sub("\\$","",clean)
class(clean)


#numeric convert
clean=as.numeric(clean)
class(clean)

#mean with NA removed 
clean
```
```{r}
#substitute comma with blank in all variable price
Variable_price= sub(",","",Variable_price)

#substitute dollar sign with blank in all variable price
Variable_price= sub("\\$","",Variable_price)

#numeric convert 
Variable_price= as.numeric(Variable_price)

#mean with NA removed
Variable_price2 = mean(Variable_price)
Variable_price2
```


```{r}
#What are some other ways to clean this data in R? How about in Excel? 
mean(Variable_price,na.rm= TRUE)

sum(Variable_price,na.rm = TRUE)

sum(Variable_price,na.rm = TRUE)/ length(Variable_price)
```

-------------

### Task 3

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: [https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq](https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq)

Chicago Taxi Data Description:
[http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/](http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/) 

Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.

Define a relational business logic for the column field 'Trip Seconds'.

Using [https://erdplus.com/#/standalone](https://erdplus.com/#/standalone) draw a star schema using the following three tables:

- A Fact table for Trip
- A Dimension table for Community Area
- A Dimension table for Rider


```{r}
mydata=read.csv(file="data/Taxi_Trips_Sample.csv")
head(mydata)
tail(mydata)
```


###After looking at the data provided by Chicago Taxi Data.I was able to make this star schema. 

![](imgs/img03.png)
In the relationship schema I made four categories.The four relationship schema categories are trip fact table, distance, community areas and rider.The four categories are related to one another.The schema starts off from the trip fact table, inside the trip fact table the trip_miles connects with the distance dimension.Under the distance dimension are the pickup,dropoff latitude and longitude properties. From the distance dimension it connects to the pickup centroid and dropoff centroid locations.From the community areas dimension it connects with the rider dimension.lastly, in the trip fact table I put trip duration under the trip fact table.The properties of start time and stop time along with taxi ID are important factors to the duration. The relationship between the community areas and the distance are important for the trip.






