Task 1

First, we must calculate the mean, standard deviation, maximum, and minimum for the Age column using R. In R, we must read in the file again, extract the column and find the values that are asked for.

#Read File
mydata = read.csv(file = "data/creditrisk.csv")
#Name the extracted variable
age = mydata$Age
#Calculate the average age below. Refer to Worksheet 1 for the correct command.
age_mean = mean(age)
age_mean
## [1] 34.39765
#Calculate standard deviation of age below. Refer to Worksheet 1 for the correct command. 
age_sd = sd(age)
age_sd
## [1] 11.04513
#Calculate the maximum of age below. The command to find the maximum is max(variable) where variable is the extracted variable.
age_max = max(age)
age_max
## [1] 73
#Calculate the minimum of age below. The command to find the minimum is min(variable) where variable is the extracted variable.  
age_min = min(age)
age_min
## [1] 18

Above are all of the values that I was requested to find. The values listed come from searching through the attached data documents in order to allow for proper processing of the information.


Next, use the formula from class to detect any outliers. An outlier is value that “lies outside” most of the other values in a set of data. A common way to estimate the upper and lower threshold is to take the mean (+ or -) 3 * standard deviation. Try using this formula to find the upper and lower limit for age.

#Use the formula above to calculate the upper and lower threshold
age_lower = age_mean - (3) * age_sd
age_upper = age_mean + (3) * age_sd
age_upper
## [1] 67.53302
age_lower
## [1] 1.262269

A method to find the upper and lower thresholds discussed in introductory statistics courses involves finding the interquartile range. Follow along below to see how we first calculate the interquartile range..

quantile(age)
##   0%  25%  50%  75% 100% 
##   18   26   32   41   73
lowerq = quantile(age)[2]
upperq = quantile(age)[4]
lowerq
## 25% 
##  26
upperq
## 75% 
##  41
iqr = upperq - lowerq

The formula below calculates the threshold. The threshold is the boundaries that determine if a value is an outlier. If the value falls above the upper threshold or below the lower threshold, it is an outlier.

Below is the upper threshold:

upperthreshold = (iqr * 1.5) + upperq
upperthreshold[1]
##  75% 
## 63.5

Below is the lower threshold:

lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold [1]
## 25% 
## 3.5

Are there any outliers? How many? It can also be useful to visualize the data using a box and whisker plot. The boxplot below supports the IQR we found of 15 and upper and lower threshold.

#age>upperthreshold
age[age>upperthreshold]
## [1] 64 65 67 64 73 65 66 67 65
boxplot(age,horizontal = TRUE)

The values above are the outliers that exist among the information within the attached file.


Task 2

Next, we must read the ‘creditriskorg.csv’ file into R. This is the original dataset and contains missing values.

mydata = read.csv(file = "data/creditriskorg.csv")
head(mydata)
##                 X       X.1         X.2             X.3             X.4
## 1    Loan Purpose Checking      Savings Months Customer Months Employed
## 2 Small Appliance     $-       $739.00               13              12
## 3       Furniture     $-     $1,230.00               25               0
## 4         New Car     $-       $389.00               19             119
## 5       Furniture  $638.00     $347.00               13              14
## 6       Education  $963.00   $4,754.00               40              45
##      X.5            X.6 X.7     X.8   X.9       X.10        X.11
## 1 Gender Marital Status Age Housing Years        Job Credit Risk
## 2      M         Single  23     Own     3  Unskilled         Low
## 3      M       Divorced  32     Own     1    Skilled        High
## 4      M         Single  38     Own     4 Management        High
## 5      M         Single  36     Own     2  Unskilled        High
## 6      M         Single  31    Rent     3    Skilled         Low
#tail(mydata)

We observe that the column names are shifted down below. So, we must make sure to use the command skip and set the header to true.

mydata = read.csv("data/creditriskorg.csv",skip = 1)
head(mydata)
##      Loan.Purpose    Checking     Savings Months.Customer Months.Employed
## 1 Small Appliance       $-       $739.00               13              12
## 2       Furniture       $-     $1,230.00               25               0
## 3         New Car       $-       $389.00               19             119
## 4       Furniture    $638.00     $347.00               13              14
## 5       Education    $963.00   $4,754.00               40              45
## 6       Furniture  $2,827.00        $-                 11              13
##   Gender Marital.Status Age Housing Years        Job Credit.Risk
## 1      M         Single  23     Own     3  Unskilled         Low
## 2      M       Divorced  32     Own     1    Skilled        High
## 3      M         Single  38     Own     4 Management        High
## 4      M         Single  36     Own     2  Unskilled        High
## 5      M         Single  31    Rent     3    Skilled         Low
## 6      M        Married  25     Own     1    Skilled         Low
#str(mydata)
summary(mydata)
##           Loan.Purpose      Checking        Savings    Months.Customer
##  Small Appliance:105    $-      :251    $-      : 62   Min.   : 5.0   
##  New Car        :104    $216.00 :  2    $127.00 :  3   1st Qu.:13.0   
##  Furniture      : 85    $271.00 :  2    $836.00 :  3   Median :19.0   
##  Business       : 44    $296.00 :  2    $904.00 :  3   Mean   :22.9   
##  Used Car       : 40    $305.00 :  2    $922.00 :  3   3rd Qu.:28.0   
##  Education      : 23    $497.00 :  2    $104.00 :  2   Max.   :73.0   
##  (Other)        : 24   (Other)  :164   (Other)  :349                  
##  Months.Employed Gender   Marital.Status      Age        Housing   
##  Min.   :  0.0   F:135   Divorced:156    Min.   :18.0   Other: 52  
##  1st Qu.:  6.0   M:290   Married : 36    1st Qu.:26.0   Own  :292  
##  Median : 20.0           Single  :233    Median :32.0   Rent : 81  
##  Mean   : 31.9                           Mean   :34.4              
##  3rd Qu.: 47.0                           3rd Qu.:41.0              
##  Max.   :119.0                           Max.   :73.0              
##                                                                    
##      Years              Job      Credit.Risk
##  Min.   :1.00   Management: 54   High:211   
##  1st Qu.:2.00   Skilled   :271   Low :214   
##  Median :3.00   Unemployed: 11              
##  Mean   :2.84   Unskilled : 89              
##  3rd Qu.:4.00                               
##  Max.   :4.00                               
## 

To calculate the mean for Checking in R, follow Worksheet 1. Extract the Checking column first and then find the average using the function built in R. What happens when we try to use the function?

checkings = mydata$Checking

class(checkings)
## [1] "factor"
#str(checkings)
#summary(checkings)
#mean(checkings)
checkings[1:6]
## [1]  $-          $-          $-          $638.00     $963.00     $2,827.00 
## 168 Levels:  $-     $1,257.00   $1,336.00   $1,613.00  ...  $996.00

To resolve the error, we must remove understand where it is coming from. There are missing values in the csv file, which is quite common as most datasets are not perfect. Additionally, there are commas within the excel spreadsheet, and R does not recognize that ‘1,234’ is equivalent to ‘1234’. Lastly, there are ‘$’ symbols throughout the file which is not a numerica symbol either.

The sub function replaces these symbols with something else. So, in order to remove the comma in the number “1,234”, we must substitute it with just an empty space.

As shown on the worksheet, type and copy the exact commands to find the mean with the NA values removed.

clean = checkings[1:10]
clean
##  [1]  $-          $-          $-          $638.00     $963.00   
##  [6]  $2,827.00   $-          $-          $6,509.00   $966.00   
## 168 Levels:  $-     $1,257.00   $1,336.00   $1,613.00  ...  $996.00
#substitute comma with blank in all of checking 

clean = sub(",","",clean)
clean
##  [1] " $-   "     " $-   "     " $-   "     " $638.00 "  " $963.00 " 
##  [6] " $2827.00 " " $-   "     " $-   "     " $6509.00 " " $966.00 "
#substitute dollar sign with blank in all of checking 
clean = sub("\\$","",clean)
clean
##  [1] " -   "     " -   "     " -   "     " 638.00 "  " 963.00 " 
##  [6] " 2827.00 " " -   "     " -   "     " 6509.00 " " 966.00 "
#numeric convert
clean = as.numeric(clean)
## Warning: NAs introduced by coercion
clean
##  [1]   NA   NA   NA  638  963 2827   NA   NA 6509  966
#mean with NA removed 
mean_checking = mean(clean,na.rm = TRUE)
mean_checking
## [1] 2380.6
clean = checkings[1:10]
clean
##  [1]  $-          $-          $-          $638.00     $963.00   
##  [6]  $2,827.00   $-          $-          $6,509.00   $966.00   
## 168 Levels:  $-     $1,257.00   $1,336.00   $1,613.00  ...  $996.00
#substitute comma with blank in all of checking 

checkings = sub(",","",checkings)

#substitute dollar sign with blank in all of checking 
checkings = sub("\\$","",checkings)

#numeric convert
checkings = as.numeric(checkings)
## Warning: NAs introduced by coercion
#mean with NA removed 
mean_checking = mean(checkings,na.rm = TRUE)
mean_checking
## [1] 2559.805

What are some other ways to clean this data in R? How about in Excel?

mean(checkings,na.rm = TRUE)
## [1] 2559.805
sum(checkings,na.rm = TRUE)/length(checkings)
## [1] 1048.014

Task 3

Now, we will look at Chicago taxi data. Go and explore the interactive dashboard and read the description of the data.

Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq

Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released/

Open in RStudio the csv file is located in the data folder, note the size of the file, the number of columns and of rows here. Use the functions learned in lab00 and lab01 to describe the data, identify unique entities, fields and summarize.

Define a relational business logic for the column field ‘Trip Seconds’.

Using https://erdplus.com/#/standalone draw a star schema using the following three tables: