Today the lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Make sure to download the folder titled ‘bsad_lab02’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the directions to complete the lab.
#Read File
mydata = read.csv(file = "data/Scoring.csv")
#Name the extracted variable
age = mydata$Age
age
[1] 30 58 46 24 26 36 44 27 32 41 34 29 30 37 21 68 52 68 36 31 25 22 45 41 51 54 43 43 23
[30] 29 23 51 39 35 62 56 32 36 64 31 42 61 32 43 26 49 47 27 52 28 31 32 51 37 29 26 27 63
[59] 55 28 31 27 40 35 43 34 37 53 46 36 46 43 31 43 38 37 46 55 56 55 52 29 39 38 24 28 53
[88] 25 62 29 32 21 38 56 41 46 54 51 52 57 33 33 29 23 33 24 28 27 33 32 39 51 39 29 32 40
[117] 47 66 39 50 25 27 54 42 35 23 52 22 48 25 26 28 62 33 41 29 37 31 50 42 33 54 58 40 31
[146] 35 64 29 33 35 28 59 25 23 30 36 25 22 45 47 36 36 60 59 58 37 32 26 40 55 30 43 28 47
[175] 29 48 34 53 44 49 51 34 42 34 46 31 31 33 30 27 48 33 44 48 57 34 38 35 52 28 46 56 52
[204] 35 45 29 22 33 41 37 46 19 22 31 44 48 31 53 36 46 37 34 36 64 27 26 53 50 21 40 49 40
[233] 37 25 40 22 49 34 49 43 42 28 30 25 22 27 26 28 35 51 24 34 34 49 28 57 63 37 27 39 64
[262] 49 49 45 27 30 33 40 30 30 25 40 50 23 28 35 27 19 39 39 55 32 54 58 37 36 28 38 26 41
[291] 42 46 52 30 47 66 54 65 25 31 23 35 47 28 50 25 39 46 36 39 25 49 52 34 24 38 24 33 31
[320] 63 23 61 28 58 41 38 27 55 40 59 47 38 25 28 49 38 52 37 42 54 30 26 49 36 40 61 59 49
[349] 52 35 25 30 28 42 29 32 45 52 26 33 21 34 27 44 51 33 35 38 28 26 34 27 32 43 30 42 26
[378] 20 24 27 35 18 35 33 39 36 29 23 34 57 38 52 42 44 37 34 43 42 24 36 52 34 34 31 36 50
[407] 33 54 40 29 31 32 51 31 46 35 33 25 31 43 30 40 37 22 24 27 32 43 57 59 60 25 28 40 36
[436] 27 33 44 34 28 39 47 28 37 20 36 28 20 32 19 21 23 47 37 51 31 49 27 34 28 29 51 34 42
[465] 46 40 32 26 31 47 45 26 41 31 23 34 34 38 55 54 24 37 43 26 59 54 27 28 53 48 30 28 34
[494] 33 55 40 28 31 56 38 25 35 41 31 52 33 30 37 54 32 23 39 28 43 50 41 27 25 35 31 43 28
[523] 22 42 41 29 27 32 27 37 59 52 25 24 51 39 33 31 30 24 31 39 42 37 38 42 23 40 36 27 60
[552] 45 65 40 40 33 28 33 51 31 36 23 33 51 31 32 50 52 34 25 28 36 27 26 24 47 40 42 52 26
[581] 45 47 25 50 44 48 19 48 46 50 30 30 39 27 52 42 22 43 36 23 40 50 52 38 35 34 34 49 37
[610] 33 34 42 32 19 45 41 25 19 59 56 42 38 22 44 31 49 55 55 38 60 58 39 34 21 37 25 39 22
[639] 35 50 49 30 32 37 24 63 25 23 22 52 39 29 43 35 27 34 39 51 35 33 30 25 53 26 61 63 30
[668] 43 35 49 34 26 41 30 51 51 40 53 33 24 25 44 33 27 24 32 32 36 27 55 39 36 38 37 36 49
[697] 27 37 49 28 27 47 44 29 35 47 31 23 46 50 26 36 28 36 61 46 25 22 38 57 26 24 51 26 46
[726] 37 55 38 24 27 43 52 57 36 62 19 32 43 34 34 34 30 40 42 49 40 24 35 36 62 24 44 57 25
[755] 30 65 32 43 48 25 36 47 22 47 42 40 40 43 38 41 55 32 42 23 41 50 49 62 43 23 33 25 50
[784] 37 27 47 22 39 46 19 24 43 42 32 27 34 43 21 36 23 40 25 38 58 41 39 19 31 28 49 47 42
[813] 25 48 52 37 45 30 38 50 29 52 42 37 36 33 41 23 30 27 33 23 41 52 58 46 28 38 39 24 41
[842] 51 56 24 49 21 34 38 58 39 60 23 37 48 26 23 34 46 51 45 55 39 33 52 26 47 31 50 39 36
[871] 47 45 40 36 43 44 55 32 26 33 27 24 29 20 46 62 49 27 52 26 43 34 37 37 39 26 34 30 26
[900] 58 36 40 49 55 25 33 38 38 33 42 47 25 42 52 36 28 33 46 22 22 23 29 29 37 30 52 46 51
[929] 28 33 39 33 40 59 32 45 38 26 29 21 31 32 40 29 33 29 57 23 36 38 45 51 42 22 26 51 33
[958] 43 32 52 65 32 29 25 22 53 47 34 23 54 22 46 42 45 49 28 22 27 33 52 34 43 40 31 31 27
[987] 42 35 42 24 31 53 59 38 49 42 54 24 39 48
[ reached getOption("max.print") -- omitted 3446 entries ]
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
age_mean = mean(age)
age_mean
[1] 37.08412
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command.
age_sd = sd(age)
age_sd
[1] 10.98637
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
age_max = max(age)
age_max
[1] 68
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
age_min = min(age)
age_min
[1] 18
#Use the formula above to calculate the upper and lower threshold
age_lower = age_mean -(3 * age_sd)
age_upper = age_mean + (3 * age_sd)
age_lower
[1] 4.125023
age_upper
[1] 70.04322
quantile(age)
0% 25% 50% 75% 100%
18 28 36 45 68
iqr = age_upper - age_lower
iqr
[1] 65.91819
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
upperthreshold = (iqr * 1.5) + age_upper
upperthreshold
[1] 168.9205
lowerthreshold = age_lower - (iqr * 1.5)
lowerthreshold
[1] -94.75227
boxplot(age)
#Read File
mydata = read.csv(file = "data/scoring_original.csv")
head(mydata)
#tail(mydata)
We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.
mydata = read.csv(file = "data/scoring_original.csv",skip = 1)
head(mydata)
Price = mydata$Price
#class(price)
#str(price)
#summary(price)
#mean(price)
Price[1:6]
[1] $846.00 $1,658.00 $2,985.00 $1,325.00 $910.00 $1,645.00
1419 Levels: $- $1,000.00 $1,001.00 $1,003.00 $1,005.00 $1,007.00 $1,008.00 ... $999.00
clean=price[1:6]
clean
[1] $846.00 $1,658.00 $2,985.00 $1,325.00 $910.00 $1,645.00
1418 Levels: $1,000.00 $1,001.00 $1,003.00 $1,005.00 $1,007.00 $1,008.00 $1,011.00 ... $999.00
#substitute comma with blank in all of checking
price=sub(",",'',price)
#substitute dollar sign with blank in all of checking
price=sub("\\$","",price)
#numeric convert
price = as.numeric(price)
#mean with NA removed
price
[1] 846 1658 2985 1325 910 1645 1800 1093 1957 1468 1577 915 1650 940 500 1186 2201
[18] 1350 1511 1253 2189 1159 1332 1497 1357 2100 1070 2557 1600 1312 400 650 1394 1542
[35] 1200 1560 1200 950 1300 1700 1167 1150 1566 1552 1300 2104 2154 545 1778 1718 1500
[52] 1274 1015 1701 1345 1238 1048 1048 1324 2201 1430 926 1500 1542 1360 1000 1564 1205
[69] 2200 2214 1137 1025 1593 1735 1132 1338 1100 2276 1360 1436 612 1100 1300 2089 2262
[86] 1054 1426 1048 1292 1177 1490 850 1011 1571 1000 270 1382 1713 1086 3262 1610 1030
[103] 650 2175 1780 950 1211 1650 350 1117 2468 1740 1210 963 1646 800 470 1370 839
[120] 1346 1350 2100 1307 1126 4786 1478 1568 1056 1608 1179 1524 1730 823 800 1482 1110
[137] 1555 1170 886 1395 2084 8800 1160 1211 1300 850 1204 1528 1298 600 1320 1300 1380
[154] 1426 5200 1503 1449 1422 1430 1578 1250 2260 280 1298 945 1694 700 1290 1581 1383
[171] 1637 1275 1400 832 2886 1257 1346 1138 1801 1200 950 1589 1617 1200 2194 1642 1423
[188] 1600 700 300 1165 1365 1706 2053 1700 1462 1300 1122 1257 1387 1266 2195 2004 1203
[205] 1138 1374 1514 750 1894 1048 1369 1035 1218 1637 953 1212 1218 1419 1100 987 2400
[222] 1375 1595 1054 1651 1542 2014 2624 1135 1105 1500 1200 1555 2251 1542 2186 1700 1180
[239] 800 1668 1750 875 1514 1950 1462 1098 840 1634 1603 1200 1360 1236 1395 1390 1355
[256] 1604 1241 375 1045 1781 900 1698 1048 2200 1119 1090 4100 1400 2010 2014 1342 1132
[273] 1048 1218 1114 1206 2132 570 1094 550 1255 4138 1714 1150 2673 1630 1544 1195 1206
[290] 2050 1513 3300 1250 1339 875 1101 1568 1194 1378 1139 350 1406 1497 1600 1350 2032
[307] 2610 1193 1275 3750 2500 1790 1419 1480 1542 1850 1380 1888 1672 808 1107 1335 1390
[324] 1363 1482 2357 850 1180 1617 1114 1976 1095 1380 1529 1263 1600 1400 1520 1617 1500
[341] 1705 1292 1294 1936 2143 1394 425 1600 1563 1284 990 1125 1267 1780 1298 1600 1340
[358] 1244 1403 1980 1290 1339 1559 1078 1113 2290 1578 1039 1218 1329 500 1384 1553 1626
[375] 1674 1500 1421 480 1274 1580 1300 1314 957 1500 1504 935 1028 1440 1030 1366 1768
[392] 2340 1700 1450 1155 1191 841 2441 1516 1591 1999 1336 1600 1850 1775 1224 1663 1500
[409] 2225 1930 1607 1230 1581 1072 1471 1300 1192 2700 500 1720 1770 1197 950 450 1707
[426] 1201 1355 1196 1297 1420 1036 1201 1527 1500 1057 2127 2218 1570 650 1175 825 2152
[443] 978 1650 2150 1771 1615 994 1147 1194 825 1128 360 1190 1580 1600 1735 2178 1589
[460] 1828 1566 1250 1626 1414 1687 1638 1296 1460 1400 911 1559 1315 2600 1208 975 1274
[477] 1197 1092 1100 1202 2646 1816 1700 1700 984 1387 1636 1135 1453 1500 1350 2247 1367
[494] 950 2154 1100 1492 1351 2040 1037 1202 1683 1000 1100 1734 1604 1330 877 1759 1564
[511] 1306 1108 1500 1645 1500 2625 1456 2178 1642 1265 325 1471 1440 2140 1377 1780 818
[528] 1017 1788 1212 1800 1394 400 940 1433 1439 2414 1571 1500 1258 1403 1346 2195 1637
[545] 2375 873 1450 1607 1376 1092 1307 1128 1698 1075 1920 1298 1362 1770 2022 1225 724
[562] 1346 1415 1380 1500 900 1639 2220 1600 1270 1556 1878 1532 900 1100 1556 1054 1285
[579] 1338 1571 1556 1194 1556 1989 2624 925 1128 2220 1395 1137 875 1417 700 1207 1268
[596] 1326 1480 1228 2168 2100 1197 510 325 1000 1748 1392 820 1275 1758 1250 1560 1570
[613] 1045 500 1800 1600 1177 1371 1100 1256 1469 1682 1150 675 1236 1265 1204 1353 1330
[630] 1200 1488 1695 1547 1112 1062 1567 900 1280 1340 1490 2640 900 1569 1937 1403 1501
[647] 1255 1345 2134 1590 400 1406 1872 1123 1067 2253 580 1992 1400 1284 1749 2125 1505
[664] 1376 2154 1138 1931 1698 1279 2755 1600 2200 2260 1904 1580 2810 700 1100 1910 1360
[681] 1030 460 1339 525 1315 1517 1330 1290 1900 1366 4063 1064 1606 1570 1381 816 1150
[698] 2215 1586 1469 1556 1720 1650 1191 1127 1100 1480 1258 1544 1404 2053 1575 1430 1409
[715] 962 1400 1571 1200 1200 1842 841 1585 957 950 1375 1800 2212 1133 1570 1569 1590
[732] 1734 450 2008 1150 630 1571 1770 1600 1627 1094 1570 935 2259 1419 820 1060 600
[749] 1372 758 1164 1450 2125 1557 1700 1571 1654 350 420 2173 6802 1061 1425 546 3400
[766] 1776 1200 1403 1086 1200 2200 2360 1758 1783 1108 1136 1557 886 1490 400 1105 1131
[783] 1256 1571 1443 1160 896 1639 1571 1075 1590 2200 450 1536 1375 1005 1264 1313 1365
[800] 1105 1100 1480 1310 1265 1474 980 600 1635 1376 675 2000 978 1130 2600 950 1493
[817] 2240 1224 1000 2500 1120 1242 725 720 2470 1150 1655 1389 1310 758 1770 425 1035
[834] 1226 1750 1800 2150 1713 975 650 916 625 1318 750 325 1014 1525 1144 550 1556
[851] 1396 2470 1890 575 1294 710 2600 1849 450 1369 2281 4575 1332 1240 931 1330 1192
[868] 950 1900 1810 1700 1318 750 1168 1300 1111 1264 1692 1123 1386 1395 1127 960 2179
[885] 886 1314 2800 960 873 1111 1040 1995 1333 1358 1613 1855 2271 925 1950 2058 1600
[902] 860 875 1710 1260 1705 1307 1982 1479 1308 2201 1350 1386 2276 1608 1585 1320 3190
[919] 1500 1571 1386 1680 450 1488 1396 1255 1883 375 3070 2189 1692 2216 1298 750 1830
[936] 1242 882 1137 275 580 1403 957 1300 1097 1033 913 962 1123 1400 1800 1980 989
[953] 1250 1646 1114 1365 1600 1133 550 350 1321 1225 950 550 1277 1172 1318 2185 2610
[970] 1257 600 1408 2413 1388 1636 850 1757 1600 2022 1200 1550 710 1286 1400 1638 1700
[987] 1850 1677 1900 1601 1200 1456 1456 960 1337 1359 801 989 2536 1129
[ reached getOption("max.print") -- omitted 3446 entries ]
What are some other ways to clean this data in R? How about in Excel? Use accounting format instead of regular format in excel. ————-
mydata = read.csv(file = "data/Taxi_Trips_Sample.csv")
head(mydata)
Define a relational business logic for the column field ‘Trip Seconds’.
Using https://erdplus.com/#/standalone draw a star schema using the following three tables: