About
Today the lab focuses on data outliers, data preparation, and data modeling. This lab requires the use of Microsoft Excel, R, and ERDplus.
Setup
Make sure to download the folder titled ‘bsad_lab02’ zip folder and extract the folder to unzip it. Next, we must set this folder as the working directory. The way to do this is to open R Studio, go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Now, follow the worksheet directions to complete the lab.
Task 1
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using both R and Excel.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
mydata = read.csv(file="data/creditrisk.csv")
#Name the extracted variable
age = mydata$Age
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command.
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.
Next, compute the same values using Excel and compare results. Post a screenshot of this using the directions found in Lab 0. To screenshot, you can use the snipping tool and save the PNG image to the same folder all of the other data is.
Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation
. Try using this formula to find the upper and lower limit for age.
#Use the formula above to calculate the upper and lower threshold
A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..
quantile(age)
0% 25% 50% 75% 100%
18 26 32 41 73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq
The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
Below is the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
75%
63.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25%
3.5
Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.
boxplot(age)

Task 2
Next, we must calculate the mean for checking in Excel. Take a screenshot of this process and paste it here. How are the missing values are treated in the Excel calculation?
Now, read the file into R.
newdata = read.csv(file="data/creditriskorg.csv")
newdata
We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.
newdata = read.csv(file="data/creditriskorg.csv",
skip=1,header=TRUE,sep=",")
newdata
To calculate the mean for Checking in R, follow Worksheet 1. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?
checking = newdata$Checking
To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.
The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.
As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.
#substitute comma with blank in all of checking
#substitute dollar sign with blank in all of checking
#numeric convert
#mean with NA removed
What are some other ways to clean this data in R? How about in Excel?
Task 3
Now, we will look at divvybike data. Go to https://www.divvybikes.com/system-data Download latest data (zip file) and go over the README file.
Open the file in Excel. Per the Data License Agreement who owns the data, and can the data be correlated with other sources?
Note the size of the file, the number of columns and of rows here.??? Identify the unique entities and fields.
Define a relational business logic for the column field ‘tripduration’.
Using www.erdplus.com draw a star schema using the following three tables:- A Fact table for Trip- A Dimension table for Station- A Dimension table for User
LS0tDQp0aXRsZTogIkJ1c2luZXNzIEFuYWx5dGljcyBMYWIgV29ya3NoZWV0IDAyIg0KYXV0aG9yOiAiWW91ciBOYW1lIEhlcmUiDQpkYXRlOiAiU3VtbWVyIDIwMTciDQpvdXRwdXQ6DQogIGh0bWxfbm90ZWJvb2s6IGRlZmF1bHQNCiAgaHRtbF9kb2N1bWVudDogZGVmYXVsdA0KICBwZGZfZG9jdW1lbnQ6IGRlZmF1bHQNCnN1YnRpdGxlOiBDTUUgR3JvdXAgRm91bmRhdGlvbiBCdXNpbmVzcyBBbmFseXRpY3MgTGFiDQotLS0NCg0KIyMjIEFib3V0DQoNClRvZGF5IHRoZSBsYWIgZm9jdXNlcyBvbiBkYXRhIG91dGxpZXJzLCBkYXRhIHByZXBhcmF0aW9uLCBhbmQgZGF0YSBtb2RlbGluZy4gVGhpcyBsYWIgcmVxdWlyZXMgdGhlIHVzZSBvZiBNaWNyb3NvZnQgRXhjZWwsIFIsIGFuZCBFUkRwbHVzLg0KDQojIyMgU2V0dXANCg0KTWFrZSBzdXJlIHRvIGRvd25sb2FkIHRoZSBmb2xkZXIgdGl0bGVkICdic2FkX2xhYjAyJyB6aXAgZm9sZGVyIGFuZCBleHRyYWN0IHRoZSBmb2xkZXIgdG8gdW56aXAgaXQuIE5leHQsIHdlIG11c3Qgc2V0IHRoaXMgZm9sZGVyIGFzIHRoZSB3b3JraW5nIGRpcmVjdG9yeS4gVGhlDQp3YXkgdG8gZG8gdGhpcyBpcyB0byBvcGVuIFIgU3R1ZGlvLCBnbyB0byAnU2Vzc2lvbicsIHNjcm9sbCBkb3duIHRvICdTZXQgV29ya2luZyBEaXJlY3RvcnknLCBhbmQgY2xpY2sgJ1RvIFNvdXJjZSBGaWxlIExvY2F0aW9uJy4gTm93LCBmb2xsb3cgdGhlIHdvcmtzaGVldCBkaXJlY3Rpb25zIHRvIGNvbXBsZXRlIHRoZSBsYWIuDQoNCi0tLS0tLS0tLQ0KDQojIyMgVGFzayAxDQoNCkZpcnN0LCB3ZSBtdXN0IGNhbGN1bGF0ZSB0aGUgbWVhbiwgc3RhbmRhcmQgZGV2aWF0aW9uLCBtYXhpbXVtLCBhbmQgbWluaW11bSBmb3IgdGhlIEFnZSBjb2x1bW4gdXNpbmcgYm90aCBSIGFuZCBFeGNlbC4NCg0KSW4gUiwgd2UgbXVzdCByZWFkIGluIHRoZSBmaWxlIGFnYWluLCBleHRyYWN0IHRoZSBjb2x1bW4gYW5kIGZpbmQgdGhlIHZhbHVlcyB0aGF0IGFyZSBhc2tlZCBmb3IuDQoNCmBgYHtyfQ0KI1JlYWQgRmlsZQ0KbXlkYXRhID0gcmVhZC5jc3YoZmlsZT0iZGF0YS9jcmVkaXRyaXNrLmNzdiIpIA0KDQojTmFtZSB0aGUgZXh0cmFjdGVkIHZhcmlhYmxlDQphZ2UgPSBteWRhdGEkQWdlIA0KYGBgDQoNCmBgYHtyfQ0KI0NhbGN1bGF0ZSB0aGUgYXZlcmFnZSBhZ2UgYmVsb3cuIFJlZmVyIHRvIFdvcmtzaGVldCAxIGZvciB0aGUgY29ycmVjdCBjb21tYW5kLg0KYGBgDQoNCmBgYHtyfQ0KI0NhbGN1bGF0ZSBzdGFuZGFyZCBkZXZpYXRpb24gb2YgYWdlIGJlbG93LiBSZWZlciB0byBXb3Jrc2hlZXQgMSBmb3IgdGhlIGNvcnJlY3QgY29tbWFuZC4gDQpgYGANCg0KYGBge3J9DQojQ2FsY3VsYXRlIHRoZSBtYXhpbXVtIG9mIGFnZSBiZWxvdy4gVGhlIGNvbW1hbmQgdG8gZmluZCB0aGUgbWF4aW11bSBpcyBtYXgodmFyaWFibGUpIHdoZXJlIHZhcmlhYmxlIGlzIHRoZSBleHRyYWN0ZWQgdmFyaWFibGUuICANCmBgYA0KDQpgYGB7cn0NCiNDYWxjdWxhdGUgdGhlIG1pbmltdW0gb2YgYWdlIGJlbG93LiBUaGUgY29tbWFuZCB0byBmaW5kIHRoZSBtaW5pbXVtIGlzIG1pbih2YXJpYWJsZSkgd2hlcmUgdmFyaWFibGUgaXMgdGhlIGV4dHJhY3RlZCB2YXJpYWJsZS4gIA0KYGBgDQoNCk5leHQsIGNvbXB1dGUgdGhlIHNhbWUgdmFsdWVzIHVzaW5nIEV4Y2VsIGFuZCBjb21wYXJlIHJlc3VsdHMuIFBvc3QgYSBzY3JlZW5zaG90IG9mIHRoaXMgdXNpbmcgdGhlIGRpcmVjdGlvbnMgZm91bmQgaW4gTGFiIDAuIFRvIHNjcmVlbnNob3QsIHlvdSBjYW4gdXNlDQp0aGUgc25pcHBpbmcgdG9vbCBhbmQgc2F2ZSB0aGUgUE5HIGltYWdlIHRvIHRoZSBzYW1lIGZvbGRlciBhbGwgb2YgdGhlIG90aGVyIGRhdGEgaXMuDQoNCk5leHQsIHVzZSB0aGUgZm9ybXVsYSBmcm9tIGNsYXNzIHRvIGRldGVjdCBhbnkgb3V0bGllcnMuIEFuIG91dGxpZXIgaXMgdmFsdWUgdGhhdCAibGllcyBvdXRzaWRlIiBtb3N0IG9mIHRoZSBvdGhlciB2YWx1ZXMgaW4gYSBzZXQgb2YgZGF0YS4gQSBjb21tb24gd2F5IHRvIGVzdGltYXRlIHRoZSB1cHBlciBhbmQgbG93ZXIgdGhyZXNob2xkIGlzIHRvIHRha2UgdGhlIGBgYG1lYW4gKCsgb3IgLSkgMyAqIHN0YW5kYXJkIGRldmlhdGlvbmBgYC4gVHJ5IHVzaW5nIHRoaXMgZm9ybXVsYSB0byBmaW5kIHRoZSB1cHBlciBhbmQgbG93ZXIgbGltaXQgZm9yIGFnZS4gDQoNCmBgYHtyfQ0KI1VzZSB0aGUgZm9ybXVsYSBhYm92ZSB0byBjYWxjdWxhdGUgdGhlIHVwcGVyIGFuZCBsb3dlciB0aHJlc2hvbGQNCmBgYA0KDQpBIG1ldGhvZCB0byBmaW5kIHRoZSB1cHBlciBhbmQgbG93ZXIgdGhyZXNob2xkcyBkaXNjdXNzZWQgaW4gaW50cm9kdWN0b3J5IHN0YXRpc3RpY3MgY291cnNlcyBpbnZvbHZlcyBmaW5kaW5nIHRoZSBpbnRlcnF1YXJ0aWxlIHJhbmdlLiBGb2xsb3cgYWxvbmcgYmVsb3cgdG8gc2VlIGhvdyB3ZSBmaXJzdCBjYWxjdWxhdGUgdGhlIGludGVycXVhcnRpbGUgcmFuZ2UuLiANCg0KYGBge3J9IA0KcXVhbnRpbGUoYWdlKSANCmxvd2VycSA9IHF1YW50aWxlKGFnZSlbMl0gDQp1cHBlcnEgPSBxdWFudGlsZShhZ2UpWzRdIA0KaXFyID0gdXBwZXJxIC0gbG93ZXJxDQpgYGANCg0KVGhlIGZvcm11bGEgYmVsb3cgY2FsY3VsYXRlcyB0aGUgdGhyZXNob2xkLiBUaGUgdGhyZXNob2xkIGlzIHRoZSBib3VuZGFyaWVzIHRoYXQgZGV0ZXJtaW5lIGlmIGEgdmFsdWUgaXMgYW4gb3V0bGllci4gSWYgdGhlIHZhbHVlIGZhbGxzIGFib3ZlIHRoZSB1cHBlciB0aHJlc2hvbGQgb3IgYmVsb3cgdGhlIGxvd2VyIHRocmVzaG9sZCwgaXQgaXMgYW4gb3V0bGllci4gDQoNCkJlbG93IGlzIHRoZSB1cHBlciB0aHJlc2hvbGQ6DQpgYGB7cn0gDQp1cHBlcnRocmVzaG9sZCA9IChpcXIgKiAxLjUpICsgdXBwZXJxIA0KdXBwZXJ0aHJlc2hvbGQNCmBgYA0KDQpCZWxvdyBpcyB0aGUgbG93ZXIgdGhyZXNob2xkOg0KYGBge3J9DQpsb3dlcnRocmVzaG9sZCA9IGxvd2VycSAtIChpcXIgKiAxLjUpDQpsb3dlcnRocmVzaG9sZA0KYGBgDQoNCkFyZSB0aGVyZSBhbnkgb3V0bGllcnM/IEhvdyBtYW55PyBJdCBjYW4gYWxzbyBiZSB1c2VmdWwgdG8gdmlzdWFsaXplIHRoZSBkYXRhIHVzaW5nIGEgYm94IGFuZCB3aGlza2VyIHBsb3QuIFRoZSBib3hwbG90IGJlbG93IHN1cHBvcnRzIHRoZSBJUVIgd2UgZm91bmQgb2YgMTUgYW5kIHVwcGVyIGFuZCBsb3dlciB0aHJlc2hvbGQuDQoNCmBgYHtyfSANCmJveHBsb3QoYWdlKSANCmBgYA0KDQotLS0tLS0tLS0tLS0tLS0NCg0KIyMjIFRhc2sgMg0KDQpOZXh0LCB3ZSBtdXN0IGNhbGN1bGF0ZSB0aGUgbWVhbiBmb3IgY2hlY2tpbmcgaW4gRXhjZWwuIFRha2UgYSBzY3JlZW5zaG90IG9mIHRoaXMgcHJvY2VzcyBhbmQgcGFzdGUgaXQgaGVyZS4gSG93IGFyZSB0aGUgbWlzc2luZyB2YWx1ZXMgYXJlIHRyZWF0ZWQgaW4gdGhlDQpFeGNlbCBjYWxjdWxhdGlvbj8NCg0KTm93LCByZWFkIHRoZSBmaWxlIGludG8gUi4gDQoNCmBgYHtyfQ0KbmV3ZGF0YSA9IHJlYWQuY3N2KGZpbGU9ImRhdGEvY3JlZGl0cmlza29yZy5jc3YiKQ0KbmV3ZGF0YSANCmBgYA0KDQpXZSBvYnNlcnZlIHRoYXQgdGhlIGNvbHVtbiBuYW1lcyBhcmUgc2hpZnRlZCBkb3duIGJlbG93LiBTbywgd2UgbXVzdCBtYWtlIHN1cmUNCnRvIHVzZSB0aGUgY29tbWFuZCBza2lwIGFuZCBzZXQgdGhlIGhlYWRlciB0byB0cnVlLg0KDQpgYGB7cn0gDQpuZXdkYXRhID0gcmVhZC5jc3YoZmlsZT0iZGF0YS9jcmVkaXRyaXNrb3JnLmNzdiIsDQpza2lwPTEsaGVhZGVyPVRSVUUsc2VwPSIsIikgDQpuZXdkYXRhIA0KYGBgDQoNClRvIGNhbGN1bGF0ZSB0aGUgbWVhbiBmb3IgQ2hlY2tpbmcgaW4gUiwgZm9sbG93IFdvcmtzaGVldCAxLiBFeHRyYWN0IHRoZSBDaGVja2luZyBjb2x1bW4gZmlyc3QgYW5kIHRoZW4gZmluZCB0aGUgYXZlcmFnZSB1c2luZyB0aGUgZnVuY3Rpb24gYnVpbHQgaW4gUi4NCldoYXQgaGFwcGVucyB3aGVuIHdlIHRyeSB0byB1c2UgdGhlIGZ1bmN0aW9uPw0KDQpgYGB7cn0gDQpjaGVja2luZyA9IG5ld2RhdGEkQ2hlY2tpbmcgDQpgYGANCg0KVG8gcmVzb2x2ZSB0aGUgZXJyb3IsIHdlIG11c3QgcmVtb3ZlIHVuZGVyc3RhbmQgd2hlcmUgaXQgaXMgY29taW5nIGZyb20uIFRoZXJlDQphcmUgbWlzc2luZyB2YWx1ZXMgaW4gdGhlIGNzdiBmaWxlLCB3aGljaCBpcyBxdWl0ZSBjb21tb24gYXMgbW9zdCBkYXRhc2V0cyBhcmUNCm5vdCBwZXJmZWN0LiBBZGRpdGlvbmFsbHksIHRoZXJlIGFyZSBjb21tYXMgd2l0aGluIHRoZSBleGNlbCBzcHJlYWRzaGVldCwgYW5kIFINCmRvZXMgbm90IHJlY29nbml6ZSB0aGF0ICcxLDIzNCcgaXMgZXF1aXZhbGVudCB0byAnMTIzNCcuIExhc3RseSwgdGhlcmUgYXJlICckJw0Kc3ltYm9scyB0aHJvdWdob3V0IHRoZSBmaWxlIHdoaWNoIGlzIG5vdCBhIG51bWVyaWNhIHN5bWJvbCBlaXRoZXIuDQoNClRoZSBzdWIgZnVuY3Rpb24gcmVwbGFjZXMgdGhlc2Ugc3ltYm9scyB3aXRoIHNvbWV0aGluZyBlbHNlLiBTbywgaW4gb3JkZXIgdG8NCnJlbW92ZSB0aGUgY29tbWEgaW4gdGhlIG51bWJlciAiMSwyMzQiLCB3ZSBtdXN0IHN1YnN0aXR1dGUgaXQgd2l0aCBqdXN0IGFuIGVtcHR5DQpzcGFjZS4NCg0KQXMgc2hvd24gb24gdGhlIHdvcmtzaGVldCwgdHlwZSBhbmQgY29weSB0aGUgZXhhY3QgY29tbWFuZHMgdG8gZmluZCB0aGUgbWVhbiB3aXRoIHRoZSBOQSB2YWx1ZXMgcmVtb3ZlZC4NCmBgYHtyfSANCiNzdWJzdGl0dXRlIGNvbW1hIHdpdGggYmxhbmsgaW4gYWxsIG9mIGNoZWNraW5nIA0KDQojc3Vic3RpdHV0ZSBkb2xsYXIgc2lnbiB3aXRoIGJsYW5rIGluIGFsbCBvZiBjaGVja2luZyANCg0KI251bWVyaWMgY29udmVydA0KDQojbWVhbiB3aXRoIE5BIHJlbW92ZWQgDQoNCmBgYA0KDQpXaGF0IGFyZSBzb21lIG90aGVyIHdheXMgdG8gY2xlYW4gdGhpcyBkYXRhIGluIFI/IEhvdyBhYm91dCBpbiBFeGNlbD8gDQoNCi0tLS0tLS0tLS0tLS0NCg0KIyMjIFRhc2sgMw0KDQpOb3csIHdlIHdpbGwgbG9vayBhdCBkaXZ2eWJpa2UgZGF0YS4gR28gdG8gW2h0dHBzOi8vd3d3LmRpdnZ5YmlrZXMuY29tL3N5c3RlbS1kYXRhXShodHRwczovL3d3dy5kaXZ2eWJpa2VzLmNvbS9zeXN0ZW0tZGF0YSkgRG93bmxvYWQgbGF0ZXN0IGRhdGEgKHppcCBmaWxlKSBhbmQgZ28gb3ZlciB0aGUgUkVBRE1FIGZpbGUuIA0KDQpPcGVuIHRoZSBmaWxlIGluIEV4Y2VsLiBQZXIgdGhlIERhdGEgTGljZW5zZSBBZ3JlZW1lbnQgd2hvIG93bnMgdGhlIGRhdGEsIGFuZCBjYW4gdGhlIGRhdGEgYmUgY29ycmVsYXRlZCB3aXRoIG90aGVyIHNvdXJjZXM/DQoNCk5vdGUgdGhlIHNpemUgb2YgdGhlIGZpbGUsIHRoZSBudW1iZXIgb2YgY29sdW1ucyBhbmQgb2Ygcm93cyBoZXJlLj8/PyBJZGVudGlmeSB0aGUgdW5pcXVlIGVudGl0aWVzIGFuZCBmaWVsZHMuDQoNCkRlZmluZSBhIHJlbGF0aW9uYWwgYnVzaW5lc3MgbG9naWMgZm9yIHRoZSBjb2x1bW4gZmllbGQgJ3RyaXBkdXJhdGlvbicuDQoNClVzaW5nIHd3dy5lcmRwbHVzLmNvbSBkcmF3IGEgc3RhciBzY2hlbWEgdXNpbmcgdGhlIGZvbGxvd2luZyB0aHJlZSB0YWJsZXM6LSBBIEZhY3QgdGFibGUgZm9yIFRyaXAtIEEgRGltZW5zaW9uIHRhYmxlIGZvciBTdGF0aW9uLSBBIERpbWVuc2lvbiB0YWJsZSBmb3IgVXNlcg0KDQo=