First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
#Read File
mydata= read.csv(file = "data/scoring.csv")
mydata
#Name the extracted variable
age= mydata$Age
age
#The average age is calculated below.
age_mean=mean(age)
age_mean
[1] 37.08412
#The standard deviation of age is calculated below.
age_sd= sd(age)
age_sd
[1] 10.98637
#The maximum of age is calculated below.
agemax=max(age)
agemax
[1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
agemin=min(age)
agemin
[1] 18
An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation.
#Use the formula above to calculate the upper and lower threshold
age_lower= age_mean - (3) * age_sd
age_upper= age_mean + (3) * age_sd
age_upper
[1] 70.04322
age_lower
[1] 4.125023
A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range.
quantile(age)
0% 25% 50% 75% 100%
18 28 36 45 68
lowerq= quantile(age) [2]
upperq= quantile(age) [4]
iqr = upperq - lowerq
iqr
75%
17
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
75%
70.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25%
2.5
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
age[age>upperthreshold]
integer(0)
mydata[age>upperthreshold,]
age[age<lowerthreshold]
integer(0)
boxplot(age,horizontal =TRUE)
mydata=read.csv(file="data/scoring_original.csv")
head(mydata)
tail(mydata)
mydata=read.csv("data/scoring_original.csv")
head(mydata)
#str(mydata)
summary(mydata)
Status Seniority Home Time Age Marital Records
: 10 Min. : 0.000 : 10 Min. : 6.00 Min. :-9999.00 : 10 : 10
bad :1249 1st Qu.: 2.000 ignore : 20 1st Qu.:36.00 1st Qu.: 28.00 divorced : 38 no_rec :3680
good:3197 Median : 5.000 other : 319 Median :48.00 Median : 36.00 married :3240 yes_rec: 769
NA's: 3 Mean : 7.987 owner :2107 Mean :46.45 Mean : 14.57 separated: 130
3rd Qu.:12.000 parents: 782 3rd Qu.:60.00 3rd Qu.: 45.00 single : 974
Max. :48.000 priv : 246 Max. :72.00 Max. : 68.00 widow : 67
NA's :10 rent : 975 NA's :10
Job Expenses Income Assets Debt Amount Price
: 10 $35K :1207 $100K : 155 Min. : 0 Min. : 0 $1,000.00 : 541 $1,500.00 : 46
fixed :2806 $45K : 819 $150K : 115 1st Qu.: 0 1st Qu.: 0 $1,200.00 : 221 $1,200.00 : 45
freelance:1021 $60K : 792 $120K : 107 Median : 3000 Median : 0 $800.00 : 219 $1,300.00 : 45
others : 171 $75K : 570 $80K : 87 Mean : 5352 Mean : 342 $1,100.00 : 210 $1,600.00 : 43
partime : 451 $90K : 192 $110K : 86 3rd Qu.: 6000 3rd Qu.: 0 $1,300.00 : 198 $1,100.00 : 41
$105K : 40 $90K : 84 Max. :300000 Max. :30000 $1,500.00 : 198 $1,700.00 : 39
(Other): 839 (Other):3825 NA's :10 NA's :10 (Other) :2872 (Other) :4200
Finrat Savings
Min. : 6.702 Min. :-8.160
1st Qu.: 60.020 1st Qu.: 1.615
Median : 77.093 Median : 3.120
Mean : 72.614 Mean : 3.858
3rd Qu.: 88.462 3rd Qu.: 5.195
Max. :100.000 Max. :33.250
NA's :10 NA's :10
To calculate the mean for price we must extract the price column first and then find the average using the function built in R. What happens when we try to use the function?
#Extracting the variable price
Variable_price = mydata$Price
Variable_price
[1] $846.00 $1,658.00 $2,985.00 $1,325.00 $910.00 $1,645.00 $1,800.00 $1,093.00 $1,957.00 $1,468.00
[11] $1,577.00 $915.00 $1,650.00 $940.00 $500.00 $1,186.00 $2,201.00 $1,350.00 $1,511.00 $1,253.00
[21] $2,189.00 $- $1,159.00 $1,332.00 $1,497.00 $1,357.00 $2,100.00 $1,070.00 $2,557.00 $1,600.00
[31] $- $1,312.00 $400.00 $650.00 $1,394.00 $1,542.00 $1,200.00 $1,560.00 $1,200.00 $950.00
[41] $- $1,300.00 $1,700.00 $1,167.00 $1,150.00 $1,566.00 $1,552.00 $1,300.00 $2,104.00 $2,154.00
[51] $545.00 $1,778.00 $1,718.00 $1,500.00 $1,274.00 $1,015.00 $1,701.00 $1,345.00 $1,238.00 $1,048.00
[61] $1,048.00 $1,324.00 $2,201.00 $1,430.00 $926.00 $1,500.00 $1,542.00 $1,360.00 $1,000.00 $1,564.00
[71] $1,205.00 $2,200.00 $2,214.00 $1,137.00 $1,025.00 $1,593.00 $1,735.00 $1,132.00 $1,338.00 $1,100.00
[81] $2,276.00 $1,360.00 $1,436.00 $612.00 $1,100.00 $1,300.00 $2,089.00 $2,262.00 $1,054.00 $1,426.00
[91] $1,048.00 $1,292.00 $1,177.00 $1,490.00 $850.00 $1,011.00 $1,571.00 $1,000.00 $270.00 $1,382.00
[101] $1,713.00 $1,086.00 $3,262.00 $1,610.00 $1,030.00 $650.00 $2,175.00 $1,780.00 $950.00 $1,211.00
[111] $1,650.00 $350.00 $1,117.00 $2,468.00 $1,740.00 $1,210.00 $963.00 $1,646.00 $800.00 $470.00
[121] $1,370.00 $839.00 $1,346.00 $1,350.00 $2,100.00 $1,307.00 $1,126.00 $4,786.00 $1,478.00 $1,568.00
[131] $1,056.00 $1,608.00 $1,179.00 $1,524.00 $1,730.00 $823.00 $800.00 $1,482.00 $1,110.00 $1,555.00
[141] $1,170.00 $886.00 $1,395.00 $2,084.00 $8,800.00 $1,160.00 $1,211.00 $1,300.00 $850.00 $1,204.00
[151] $1,528.00 $1,298.00 $600.00 $1,320.00 $1,300.00 $1,380.00 $1,426.00 $5,200.00 $1,503.00 $1,449.00
[161] $1,422.00 $1,430.00 $1,578.00 $1,250.00 $2,260.00 $280.00 $1,298.00 $945.00 $1,694.00 $700.00
[171] $1,290.00 $1,581.00 $1,383.00 $1,637.00 $1,275.00 $1,400.00 $832.00 $2,886.00 $1,257.00 $1,346.00
[181] $1,138.00 $1,801.00 $1,200.00 $950.00 $1,589.00 $1,617.00 $1,200.00 $2,194.00 $1,642.00 $1,423.00
[191] $1,600.00 $700.00 $300.00 $1,165.00 $1,365.00 $1,706.00 $2,053.00 $1,700.00 $1,462.00 $1,300.00
[201] $1,122.00 $1,257.00 $1,387.00 $1,266.00 $2,195.00 $2,004.00 $1,203.00 $1,138.00 $1,374.00 $1,514.00
[211] $750.00 $1,894.00 $1,048.00 $1,369.00 $1,035.00 $1,218.00 $1,637.00 $953.00 $1,212.00 $1,218.00
[221] $1,419.00 $1,100.00 $987.00 $2,400.00 $1,375.00 $1,595.00 $1,054.00 $1,651.00 $1,542.00 $2,014.00
[231] $2,624.00 $1,135.00 $1,105.00 $1,500.00 $1,200.00 $1,555.00 $2,251.00 $1,542.00 $2,186.00 $1,700.00
[241] $1,180.00 $800.00 $1,668.00 $1,750.00 $875.00 $1,514.00 $1,950.00 $1,462.00 $1,098.00 $840.00
[251] $1,634.00 $1,603.00 $1,200.00 $1,360.00 $1,236.00 $1,395.00 $1,390.00 $1,355.00 $1,604.00 $1,241.00
[261] $375.00 $1,045.00 $1,781.00 $900.00 $1,698.00 $1,048.00 $2,200.00 $1,119.00 $1,090.00 $4,100.00
[271] $1,400.00 $2,010.00 $2,014.00 $1,342.00 $1,132.00 $1,048.00 $1,218.00 $1,114.00 $1,206.00 $2,132.00
[281] $570.00 $1,094.00 $550.00 $1,255.00 $4,138.00 $1,714.00 $1,150.00 $2,673.00 $1,630.00 $1,544.00
[291] $1,195.00 $1,206.00 $2,050.00 $1,513.00 $3,300.00 $1,250.00 $1,339.00 $875.00 $1,101.00 $1,568.00
[301] $1,194.00 $1,378.00 $1,139.00 $350.00 $1,406.00 $1,497.00 $1,600.00 $1,350.00 $2,032.00 $2,610.00
[311] $1,193.00 $1,275.00 $3,750.00 $2,500.00 $1,790.00 $1,419.00 $1,480.00 $1,542.00 $1,850.00 $1,380.00
[321] $1,888.00 $1,672.00 $808.00 $1,107.00 $1,335.00 $1,390.00 $1,363.00 $1,482.00 $2,357.00 $850.00
[331] $1,180.00 $1,617.00 $1,114.00 $1,976.00 $1,095.00 $1,380.00 $1,529.00 $1,263.00 $1,600.00 $1,400.00
[341] $1,520.00 $1,617.00 $1,500.00 $1,705.00 $1,292.00 $1,294.00 $1,936.00 $2,143.00 $1,394.00 $425.00
[351] $1,600.00 $1,563.00 $1,284.00 $990.00 $1,125.00 $1,267.00 $1,780.00 $1,298.00 $1,600.00 $1,340.00
[361] $1,244.00 $1,403.00 $1,980.00 $1,290.00 $1,339.00 $1,559.00 $1,078.00 $1,113.00 $2,290.00 $1,578.00
[371] $1,039.00 $1,218.00 $1,329.00 $500.00 $1,384.00 $1,553.00 $1,626.00 $1,674.00 $1,500.00 $1,421.00
[381] $480.00 $1,274.00 $1,580.00 $1,300.00 $1,314.00 $957.00 $1,500.00 $1,504.00 $935.00 $1,028.00
[391] $1,440.00 $1,030.00 $1,366.00 $1,768.00 $2,340.00 $1,700.00 $1,450.00 $1,155.00 $1,191.00 $841.00
[401] $2,441.00 $1,516.00 $1,591.00 $1,999.00 $1,336.00 $1,600.00 $1,850.00 $1,775.00 $1,224.00 $1,663.00
[411] $1,500.00 $2,225.00 $1,930.00 $1,607.00 $1,230.00 $1,581.00 $1,072.00 $1,471.00 $1,300.00 $1,192.00
[421] $2,700.00 $500.00 $1,720.00 $1,770.00 $1,197.00 $950.00 $450.00 $1,707.00 $1,201.00 $1,355.00
[431] $1,196.00 $1,297.00 $1,420.00 $1,036.00 $1,201.00 $1,527.00 $1,500.00 $1,057.00 $2,127.00 $2,218.00
[441] $1,570.00 $650.00 $1,175.00 $825.00 $2,152.00 $978.00 $1,650.00 $2,150.00 $1,771.00 $1,615.00
[451] $994.00 $1,147.00 $1,194.00 $825.00 $1,128.00 $360.00 $1,190.00 $1,580.00 $1,600.00 $1,735.00
[461] $2,178.00 $1,589.00 $1,828.00 $1,566.00 $1,250.00 $1,626.00 $1,414.00 $1,687.00 $1,638.00 $1,296.00
[471] $1,460.00 $1,400.00 $911.00 $1,559.00 $1,315.00 $2,600.00 $1,208.00 $975.00 $1,274.00 $1,197.00
[481] $1,092.00 $1,100.00 $1,202.00 $2,646.00 $1,816.00 $1,700.00 $1,700.00 $984.00 $1,387.00 $1,636.00
[491] $1,135.00 $1,453.00 $1,500.00 $1,350.00 $2,247.00 $1,367.00 $950.00 $2,154.00 $1,100.00 $1,492.00
[501] $1,351.00 $2,040.00 $1,037.00 $1,202.00 $1,683.00 $1,000.00 $1,100.00 $1,734.00 $1,604.00 $1,330.00
[511] $877.00 $1,759.00 $1,564.00 $1,306.00 $1,108.00 $1,500.00 $1,645.00 $1,500.00 $2,625.00 $1,456.00
[521] $2,178.00 $1,642.00 $1,265.00 $325.00 $1,471.00 $1,440.00 $2,140.00 $1,377.00 $1,780.00 $818.00
[531] $1,017.00 $1,788.00 $1,212.00 $1,800.00 $1,394.00 $400.00 $940.00 $1,433.00 $1,439.00 $2,414.00
[541] $1,571.00 $1,500.00 $1,258.00 $1,403.00 $1,346.00 $2,195.00 $1,637.00 $2,375.00 $873.00 $1,450.00
[551] $1,607.00 $1,376.00 $1,092.00 $1,307.00 $1,128.00 $1,698.00 $1,075.00 $1,920.00 $1,298.00 $1,362.00
[561] $1,770.00 $2,022.00 $1,225.00 $724.00 $1,346.00 $1,415.00 $1,380.00 $1,500.00 $900.00 $1,639.00
[571] $2,220.00 $1,600.00 $1,270.00 $1,556.00 $1,878.00 $1,532.00 $900.00 $1,100.00 $1,556.00 $1,054.00
[581] $1,285.00 $1,338.00 $1,571.00 $1,556.00 $1,194.00 $1,556.00 $1,989.00 $2,624.00 $925.00 $1,128.00
[591] $2,220.00 $1,395.00 $1,137.00 $875.00 $1,417.00 $700.00 $1,207.00 $1,268.00 $1,326.00 $1,480.00
[601] $1,228.00 $2,168.00 $2,100.00 $1,197.00 $510.00 $325.00 $1,000.00 $1,748.00 $1,392.00 $820.00
[611] $1,275.00 $1,758.00 $1,250.00 $1,560.00 $1,570.00 $1,045.00 $500.00 $1,800.00 $1,600.00 $1,177.00
[621] $1,371.00 $1,100.00 $1,256.00 $1,469.00 $1,682.00 $1,150.00 $675.00 $1,236.00 $1,265.00 $1,204.00
[631] $1,353.00 $1,330.00 $1,200.00 $1,488.00 $1,695.00 $1,547.00 $1,112.00 $1,062.00 $1,567.00 $900.00
[641] $1,280.00 $1,340.00 $1,490.00 $2,640.00 $900.00 $1,569.00 $1,937.00 $1,403.00 $1,501.00 $1,255.00
[651] $1,345.00 $2,134.00 $1,590.00 $400.00 $1,406.00 $1,872.00 $1,123.00 $1,067.00 $2,253.00 $580.00
[661] $1,992.00 $1,400.00 $1,284.00 $1,749.00 $2,125.00 $1,505.00 $1,376.00 $2,154.00 $1,138.00 $1,931.00
[671] $1,698.00 $1,279.00 $2,755.00 $1,600.00 $2,200.00 $2,260.00 $1,904.00 $1,580.00 $2,810.00 $700.00
[681] $1,100.00 $1,910.00 $1,360.00 $1,030.00 $460.00 $1,339.00 $525.00 $1,315.00 $1,517.00 $1,330.00
[691] $1,290.00 $1,900.00 $1,366.00 $4,063.00 $1,064.00 $1,606.00 $1,570.00 $1,381.00 $816.00 $1,150.00
[701] $2,215.00 $1,586.00 $1,469.00 $1,556.00 $1,720.00 $1,650.00 $1,191.00 $1,127.00 $1,100.00 $1,480.00
[711] $1,258.00 $1,544.00 $1,404.00 $2,053.00 $1,575.00 $1,430.00 $1,409.00 $962.00 $1,400.00 $1,571.00
[721] $1,200.00 $1,200.00 $1,842.00 $841.00 $1,585.00 $957.00 $950.00 $1,375.00 $1,800.00 $2,212.00
[731] $1,133.00 $1,570.00 $1,569.00 $1,590.00 $1,734.00 $450.00 $2,008.00 $1,150.00 $630.00 $1,571.00
[741] $1,770.00 $1,600.00 $1,627.00 $1,094.00 $1,570.00 $935.00 $2,259.00 $1,419.00 $820.00 $1,060.00
[751] $600.00 $1,372.00 $758.00 $1,164.00 $1,450.00 $2,125.00 $1,557.00 $1,700.00 $1,571.00 $1,654.00
[761] $350.00 $420.00 $2,173.00 $6,802.00 $1,061.00 $1,425.00 $546.00 $3,400.00 $1,776.00 $1,200.00
[771] $1,403.00 $1,086.00 $1,200.00 $2,200.00 $2,360.00 $1,758.00 $1,783.00 $1,108.00 $1,136.00 $1,557.00
[781] $886.00 $1,490.00 $400.00 $1,105.00 $1,131.00 $1,256.00 $1,571.00 $1,443.00 $1,160.00 $896.00
[791] $1,639.00 $1,571.00 $1,075.00 $1,590.00 $2,200.00 $450.00 $1,536.00 $1,375.00 $1,005.00 $1,264.00
[801] $1,313.00 $1,365.00 $1,105.00 $1,100.00 $1,480.00 $1,310.00 $1,265.00 $1,474.00 $980.00 $600.00
[811] $1,635.00 $1,376.00 $675.00 $2,000.00 $978.00 $1,130.00 $2,600.00 $950.00 $1,493.00 $2,240.00
[821] $1,224.00 $1,000.00 $2,500.00 $1,120.00 $1,242.00 $725.00 $720.00 $2,470.00 $1,150.00 $1,655.00
[831] $1,389.00 $1,310.00 $758.00 $1,770.00 $425.00 $1,035.00 $1,226.00 $1,750.00 $1,800.00 $2,150.00
[841] $1,713.00 $975.00 $650.00 $916.00 $625.00 $1,318.00 $750.00 $325.00 $1,014.00 $1,525.00
[851] $1,144.00 $550.00 $1,556.00 $1,396.00 $2,470.00 $1,890.00 $575.00 $1,294.00 $710.00 $2,600.00
[861] $1,849.00 $450.00 $1,369.00 $2,281.00 $4,575.00 $1,332.00 $1,240.00 $931.00 $1,330.00 $1,192.00
[871] $950.00 $1,900.00 $1,810.00 $1,700.00 $1,318.00 $750.00 $1,168.00 $1,300.00 $1,111.00 $1,264.00
[881] $1,692.00 $1,123.00 $1,386.00 $1,395.00 $1,127.00 $960.00 $2,179.00 $886.00 $1,314.00 $2,800.00
[891] $960.00 $873.00 $1,111.00 $1,040.00 $1,995.00 $1,333.00 $1,358.00 $1,613.00 $1,855.00 $2,271.00
[901] $925.00 $1,950.00 $2,058.00 $1,600.00 $860.00 $875.00 $1,710.00 $1,260.00 $1,705.00 $1,307.00
[911] $1,982.00 $1,479.00 $1,308.00 $2,201.00 $1,350.00 $1,386.00 $2,276.00 $1,608.00 $1,585.00 $1,320.00
[921] $3,190.00 $1,500.00 $1,571.00 $1,386.00 $1,680.00 $450.00 $1,488.00 $1,396.00 $1,255.00 $1,883.00
[931] $375.00 $3,070.00 $2,189.00 $1,692.00 $2,216.00 $1,298.00 $750.00 $1,830.00 $1,242.00 $882.00
[941] $1,137.00 $275.00 $580.00 $1,403.00 $957.00 $1,300.00 $1,097.00 $1,033.00 $913.00 $962.00
[951] $1,123.00 $1,400.00 $1,800.00 $1,980.00 $989.00 $1,250.00 $1,646.00 $1,114.00 $1,365.00 $1,600.00
[961] $1,133.00 $550.00 $350.00 $1,321.00 $1,225.00 $950.00 $550.00 $1,277.00 $1,172.00 $1,318.00
[971] $2,185.00 $2,610.00 $1,257.00 $600.00 $1,408.00 $2,413.00 $1,388.00 $1,636.00 $850.00 $1,757.00
[981] $1,600.00 $2,022.00 $1,200.00 $1,550.00 $710.00 $1,286.00 $1,400.00 $1,638.00 $1,700.00 $1,850.00
[991] $1,677.00 $1,900.00 $1,601.00 $1,200.00 $1,456.00 $1,456.00 $960.00 $1,337.00 $1,359.00 $801.00
[ reached getOption("max.print") -- omitted 3459 entries ]
1419 Levels: $- $1,000.00 $1,001.00 $1,003.00 $1,005.00 $1,007.00 $1,008.00 $1,011.00 $1,012.00 ... $999.00
mean_price= mean(Variable_price)
argument is not numeric or logical: returning NA
mean_price
[1] NA
Variable_price [1:6]
[1] $846.00 $1,658.00 $2,985.00 $1,325.00 $910.00 $1,645.00
1419 Levels: $- $1,000.00 $1,001.00 $1,003.00 $1,005.00 $1,007.00 $1,008.00 $1,011.00 $1,012.00 ... $999.00
mydata$Price <- gsub("\\$", " ", mydata$Price)
mydata$Price <- gsub(",", "", mydata$Price)
mydata$Price <- sub("$", "", mydata$Price)
clean=Variable_price[1:10]
#substitute comma with blank in all of checking
clean= sub(",","",clean)
#substitute dollar sign with blank in all of checking
clean=sub("\\$","",clean)
class(clean)
[1] "character"
#numeric convert
clean=as.numeric(clean)
class(clean)
[1] "numeric"
#mean with NA removed
clean
[1] 846 1658 2985 1325 910 1645 1800 1093 1957 1468
#substitute comma with blank in all variable price
Variable_price= sub(",","",Variable_price)
#substitute dollar sign with blank in all variable price
Variable_price= sub("\\$","",Variable_price)
#numeric convert
Variable_price= as.numeric(Variable_price)
#mean with NA removed
Variable_price2 = mean(Variable_price)
Variable_price2
[1] NA
#What are some other ways to clean this data in R? How about in Excel?
mean(Variable_price,na.rm= TRUE)
[1] 1462.48
sum(Variable_price,na.rm = TRUE)
[1] 6502188
sum(Variable_price,na.rm = TRUE)/ length(Variable_price)
[1] 1458.217
Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/
Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.
Define a relational business logic for the column field ‘Trip Seconds’.
Using https://erdplus.com/#/standalone draw a star schema using the following three tables:
mydata=read.csv(file="data/Taxi_Trips_Sample.csv")
head(mydata)
tail(mydata)