This worksheet includes three main tasks on data outliers, data preparation, and data modeling. The lab requires the use of Microsoft Excel, R, and ERDplus.
Remember to always set your working directory to the source file location. Go to ‘Session’, scroll down to ‘Set Working Directory’, and click ‘To Source File Location’. Read the material below carefully and follow the instructions to complete the tasks and answer any questions. Submit your work to RPubs as detailed in previous notes.
For your assignment you may be using different data sets from those included here. Always read the instructions on Sakai carefully. For clarity, tasks/questions to be completed/answered are highlighted in red (visible in preview) and numbered according to their placement in the task section. Quite often you will need to add your own code chunk.
Execute all code chunks, preview, publish, and submit link on Sakai.
First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R.
In R, we must read in the file again, extract the column and find the values that are asked for.
#Read File
mydata = read.csv(file="data/creditrisk.csv")
#Name the extracted variable
age = mydata$Age
##### 1A) Fill in the code chunk below to calculate and display each result. Refer to previous worksheets and on-line help for some commands.
#Calculate the average age below.
MeanAge = mean(age)
MeanAge
[1] 34.39765
#Calculate the standard deviation of age below.
StdDevAge = sd(age)
StdDevAge
[1] 11.04513
#Calculate the maximum of age below. Look in Help to find the right command for the maximum
AgeMaxima = max(age)
AgeMaxima
[1] 73
#Calculate the minimum of age below. Look in Help to find the right command for the minimum
AgeMinima = min(age)
AgeMinima
[1] 18
Next, we will use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower limits is to take the mean (+ or -) 3 * standard deviation.
##### 1B) Write and execute the code chunk corresponding to the above formula to calculate the upper and lower limits for age. Display your results. Based on the limits do you think there are outliers? Explain your answer.
UpperLimitAge = MeanAge + 3*StdDevAge
UpperLimitAge
[1] 67.53302
LowerLimitAge = MeanAge - 3*StdDevAge
LowerLimitAge
[1] 1.262269
Yes, there are outliers. The maximum calculated previously (73) is above the upper limit (67.53), so there is at least one outlier on the upper side of the data. The minimum (18) is above the lower limit (1.26), so there are no outliers on the lower side of the data.
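The mean plus or minus 3 standard deviations rule can also be used to list the offending values directly, rather than comparing only the maximum and minimum. The sketch below uses a small made-up vector named `ages` (an assumption for illustration, not the worksheet data); with the real data the `age` vector would take its place.

```r
# Hypothetical sample: 19 values of 30 and one extreme value.
# With the worksheet data, the `age` vector would be used instead.
ages <- c(rep(30, 19), 95)

# Lower and upper limits: mean -/+ 3 * standard deviation
limits <- mean(ages) + c(-3, 3) * sd(ages)

# Keep only the values that fall outside the limits
outliers <- ages[ages < limits[1] | ages > limits[2]]
outliers
```

Counting the result with `length(outliers)` gives the number of flagged values, which is easier to compare with a box plot than the maximum alone.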
Another similar method to find the upper and lower thresholds, discussed in many introductory statistics courses, involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range.
quantile(age)
0% 25% 50% 75% 100%
18 26 32 41 73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
iqr = upperq - lowerq
Next we calculate the limiting thresholds. A threshold here is the boundary that determines if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.
To calculate the upper threshold:
upperthreshold = (iqr * 1.5) + upperq
upperthreshold
75%
63.5
Below is the lower threshold:
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
25%
3.5
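The printed thresholds carry the labels 75% and 25% because `quantile()` returns a named vector. A minimal sketch of the same calculation without the labels is shown here, using a made-up vector `ages` (an assumption, chosen so that its quartiles happen to match the ones above) and the `names = FALSE` argument of `quantile()`.

```r
# Hypothetical sample; not the worksheet data.
ages <- c(18, 22, 26, 30, 32, 35, 41, 45, 73)

# Lower and upper quartiles without the "25%" / "75%" labels
q <- quantile(ages, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]          # same value as IQR(ages)

# Thresholds at 1.5 * IQR beyond the quartiles
fences <- c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)

# Values beyond either threshold are flagged as potential outliers
flagged <- ages[ages < fences[["lower"]] | ages > fences[["upper"]]]
flagged
```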
A good way to understand the above calculations is to visualize the results using a box and whisker plot. The top and bottom ends of the box correspond to the upper and lower quartiles. The median is marked by a bold line. The whiskers are the lines extending from the upper and lower quartiles toward the upper and lower thresholds; in R they end at the most extreme data values that still lie within the thresholds. Any points beyond the thresholds are potential outliers.
boxplot(age)
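The points drawn individually on the plot can also be retrieved as numbers. The sketch below assumes a made-up vector `ages` (not the worksheet data) and uses `boxplot.stats()` from base R, which computes the numbers behind a box plot without drawing it. Note that it works from hinges rather than `quantile()`, so its quartiles can differ slightly from the ones computed earlier on some data sets.

```r
# Hypothetical sample; with the worksheet data, `age` would be used.
ages <- c(18, 22, 26, 30, 32, 35, 41, 45, 73)

stats <- boxplot.stats(ages)

# Whisker ends, hinges, and median: lower whisker, lower hinge,
# median, upper hinge, upper whisker
stats$stats

# Values plotted as individual points beyond the whiskers
stats$out
length(stats$out)
```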
##### 1C) From the box plot representation are there any outliers? How many can you count? How does your answer reconcile with the result from Task 1B?
Next, we will read the file creditriskorg.csv into R as provided in its original form. Unlike the cleaned file creditrisk.csv, the original dataset will require some data preparation.
newdata = read.csv(file="data/creditriskorg.csv")
head(newdata)
We observe a new line is inserted with the header labels X, X.1, ... and that the true column headers are shifted down. This is because of the empty line in the original dataset. To account for this detail we must skip one line when reading the file.
newdata = read.csv(file="data/creditriskorg.csv",skip=1)
head(newdata)
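The effect of `skip=1` can be reproduced on a tiny in-memory example. The CSV text below is made up for illustration (it is an assumption about what the first lines of such a file might look like, with a leading line that contains only separators), and it is read through the `text` argument of `read.csv()` instead of from a file.

```r
# Hypothetical CSV content: a line of empty fields, then the real header.
csv_text <- paste(
  ",,",
  "Checking,Savings,Age",
  '"$1,234",$-,25',
  '$-,"$2,500",40',
  sep = "\n"
)

# Without skip, the true header row is read as the first row of data
no_skip <- read.csv(text = csv_text)
as.character(no_skip[1, 1])

# With skip = 1, the first line is ignored and the real header is used
with_skip <- read.csv(text = csv_text, skip = 1)
names(with_skip)
nrow(with_skip)
```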
Next we want to extract the Checking column and then find the average of checking, similar to what we did in the previous lab. When we try to execute the code chunk below, notice that R returns NA together with a warning message instead of a number.
checking = newdata$Checking # command to extract the Checking column from the data file newdata
mean(checking)
argument is not numeric or logical: returning NA
[1] NA
To resolve the problem, we must first understand its source. There are missing values in the csv file represented by the symbol $-. Missing data is quite common, as most datasets are not perfect. Additionally, there are commas within the Excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file, which are not numeric characters either.
To correct the problem we need to do some data cleanup. For this we will use the function sub() in R to replace unwanted symbols with something else. For example, in order to remove the comma in the number “1,234”, we can substitute it with an empty string. Note that sub() replaces only the first match in each value, which is sufficient when a value contains at most one comma. Below is a sequence of commands to help with the cleanup of the data in the Checking column and the eventual calculation of the mean.
#substitute comma with blank in all of checking
checking= sub(",","",checking)
#substitute dollar sign with blank in all of checking
# Example new = sub("\\$","",new)
checking = sub("\\$","",checking)
#Convert values to numeric. Any value that cannot be converted to numeric will be designated as NA (Not Available)
checking = as.numeric(checking)
NAs introduced by coercion
#Calculate mean of checking with all NAs removed
mean(checking,na.rm=TRUE)
[1] 2559.805
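As a variation on the cleanup steps above, both symbols can be removed in one pass with `gsub()`, which replaces every match rather than only the first. This is a sketch on made-up values (the vector `raw` is an assumption, not taken from the data file), and it shows the case where `sub()` would leave a second comma behind.

```r
# Hypothetical raw values, including one with two commas.
raw <- c("$1,234", "$-", "$1,234,567", "250")

# sub() removes only the first comma in each value
once <- sub(",", "", raw)
once[3]                      # a comma is still present here

# gsub() with a character class removes every "$" and "," at once
cleaned <- gsub("[$,]", "", raw)

# "-" cannot be converted, so it becomes NA (the warning is suppressed)
values <- suppressWarnings(as.numeric(cleaned))

# Mean over the values that could be converted
mean(values, na.rm = TRUE)
```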
##### 2A) Repeat the above commands to calculate the mean of the Savings column instead. Use different variable names.
Savings = newdata$Savings
Savings = sub(",","",Savings)
Savings = sub("\\$","",Savings)
Savings = as.numeric(Savings)
NAs introduced by coercion
mean(Savings,na.rm = TRUE)
[1] 2122.146
##### 2B) Now calculate the mean of the Checking column in Excel using the Excel function AVERAGE. Compare the two results, from Excel and R, and share your observation.
After opening the file in Excel and applying the AVERAGE function, the result matches the mean computed in R after cleaning the data. Because two independent calculations give the same average, this value is most likely the correct average of the column.
##### 2C) Based on your observation how did Excel treat the missing values represented by the symbol $-? Are they included in the calculation of the mean or excluded? Explain your answer.
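One way to reason about this question is to compare, in R, the two possible treatments of a missing value: leaving it out of the mean, or counting it as zero. The sketch below uses a made-up vector `balances` (an assumption for illustration); whichever of the two results matches the Excel average indicates how the missing values were treated.

```r
# Hypothetical balances with one missing value.
balances <- c(100, NA, 300)

# Missing value excluded: the mean is taken over two values
mean_excluded <- mean(balances, na.rm = TRUE)

# Missing value counted as zero: the mean is taken over three values
as_zero <- ifelse(is.na(balances), 0, balances)
mean_as_zero <- mean(as_zero)

c(excluded = mean_excluded, as_zero = mean_as_zero)
```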
Here, we will look at Chicago Divvy bike data. The historical data sets with description of fields can be found at:
Chicago Divvy Data: https://www.divvybikes.com/data
###### 3A) Open in RStudio or Excel the file Divvy_Trips_2017_Q4.csv located in the data folder. What is the size of the file (measured in bytes), and how many columns and rows does it have? Identify the column field(s) in the data that is/are unique identifier(s) (cannot have duplicate/repeated values).
The size of the file is 82 MB and is too big to open in R. After opening the file in Excel, there are 669240 rows and 12 columns. The uniquely identifying field in the data is trip_id. Although other columns identify a particular subject, they do not uniquely identify a specific row of data, because their values can show up more than once within the data.
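For files that can be loaded, the same questions can be answered in R. The sketch below is a hedged illustration on a tiny made-up data frame: `trip_id` is the field named above, while `bikeid` and all values are assumptions. It writes the data to a temporary file only so that `file.size()`, which reports a size in bytes, has something to measure.

```r
# Hypothetical trips table; values are invented for illustration.
trips <- data.frame(
  trip_id = c(101, 102, 103, 104),
  bikeid  = c(7, 7, 9, 7)
)

# Number of rows and columns
dim(trips)

# A column is a candidate unique identifier if it has no repeated values
unique_cols <- sapply(trips, function(col) anyDuplicated(col) == 0)
unique_cols

# File size in bytes, measured on a temporary copy of the data
tmp <- tempfile(fileext = ".csv")
write.csv(trips, tmp, row.names = FALSE)
size_bytes <- file.size(tmp)
unlink(tmp)
size_bytes
```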
Read carefully the file README.txt, located in same data folder, for the description of the data.
###### 3B) Define a relational business logic integrity rule for the column field Trip Duration.
The Trip Duration column must equal the end time minus the start time, within a tolerance of +/- 60 seconds, because the start and end times are recorded only to the closest minute while the trip duration is in seconds.
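A rule of this kind can be checked row by row. The following is a minimal sketch under stated assumptions: the column names `start_time`, `end_time`, and `tripduration`, the timestamp format, and all values are hypothetical, and the 60-second tolerance is the one proposed in the rule above.

```r
# Hypothetical trips; the third row deliberately breaks the rule.
trips <- data.frame(
  start_time   = c("10/1/2017 08:00", "10/1/2017 09:15", "10/1/2017 10:00"),
  end_time     = c("10/1/2017 08:10", "10/1/2017 09:45", "10/1/2017 10:05"),
  tripduration = c(630, 1790, 1200),
  stringsAsFactors = FALSE
)

# Parse the timestamps (assumed format: month/day/year hour:minute)
fmt   <- "%m/%d/%Y %H:%M"
start <- as.POSIXct(trips$start_time, format = fmt, tz = "UTC")
end   <- as.POSIXct(trips$end_time, format = fmt, tz = "UTC")

# Elapsed time in seconds between start and end
elapsed <- as.numeric(difftime(end, start, units = "secs"))

# A row violates the rule when the recorded duration differs from the
# elapsed time by more than the 60-second tolerance
violates <- abs(trips$tripduration - elapsed) > 60
trips[violates, ]
```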
###### 3C) Using https://erdplus.com/#/standalone draw a star-like schema using the three tables below. Include an image capture of your schema here.