For your assignment you may be using a different dataset than the one included here.
Always read the instructions on Sakai carefully.
Tasks and questions to complete are highlighted in a larger bold font and numbered according to their section.
We are going to use tidyverse, a collection of R packages designed for data science.
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.1 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Name your dataset ‘mydata’ so it is easy to work with.
Commands: read_csv() head() mean() sub() as.numeric()
** What happens when we try to use the mean() function? **
To resolve the error, check whether the data type of the feature is correct.
Notice the comma in some values, such as ‘1,234’.
Also note the data type and the dollar sign ‘$’.
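A minimal sketch of this check, using a made-up character vector `fare` as a hypothetical stand-in for your own column. `class()` shows that the values are stored as text, which is why `mean()` cannot average them. In this situation R reports the problem as a warning and returns `NA`.

```r
# Hypothetical values, shaped like a currency column that was read in as text
fare <- c("$1,234.50", "$12.00", "$7.25")

# The column is character, not numeric
fare_class <- class(fare)
print(fare_class)

# mean() on text warns ("argument is not numeric or logical") and returns NA;
# suppressWarnings() only keeps the console quiet for this demonstration
fare_mean <- suppressWarnings(mean(fare))
print(fare_mean)
```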
To remove a character, in this case the comma from “1,234”, we must substitute it with an empty string.
** substitute comma with “” **
# VARIABLE <- sub("," , "", VARIABLE)
** substitute dollar sign with “” **
** character to numeric - as.numeric()**
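The three cleaning steps can be sketched together as follows. The vector `fare_raw` and its values are invented for illustration; replace them with your own column. One detail worth knowing: `sub()` treats its pattern as a regular expression by default, and in a regular expression `$` means "end of string", so this sketch passes `fixed = TRUE` to match the literal characters.

```r
# Hypothetical raw values, as they might look straight after read_csv()
fare_raw <- c("$1,234.50", "$12.00", NA, "$7.25")

# fixed = TRUE treats the pattern as literal text rather than a regular
# expression; without it, "$" matches the end of the string and nothing
# visible is removed
fare_clean <- sub(",", "", fare_raw, fixed = TRUE)
fare_clean <- sub("$", "", fare_clean, fixed = TRUE)

# Convert the cleaned text to numbers; missing values stay NA
fare_num <- as.numeric(fare_clean)
print(fare_num)
```

Note that `sub()` replaces only the first match in each value. If a value can contain more than one comma (for example “1,234,567”), `gsub()` with the same arguments replaces all of them.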
** mean with NA removed **
# VARIABLE_MEAN <- mean( YOUR_VARIABLE , na.rm = TRUE)
** Add clean variables to dataset **
# mydata$VARIABLE <- VARIABLE
** Save clean data **
# write_csv(mydata, path = "data/mydata_clean.csv")
In this task we must calculate the mean, standard deviation, maximum, and minimum for the given feature.
** calculate the average **
** calculate the standard deviation **
** calculate the min **
** calculate the max **
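As a rough sketch, the four statistics can be computed with base R functions. The vector `fare` below is a hypothetical cleaned numeric column with one missing value. Each call uses `na.rm = TRUE`, because otherwise a single `NA` makes the result `NA`.

```r
# Hypothetical cleaned numeric values, including one missing value
fare <- c(10, 12, NA, 14, 16)

fare_mean <- mean(fare, na.rm = TRUE)  # average of the non-missing values
fare_sd   <- sd(fare, na.rm = TRUE)    # sample standard deviation (n - 1)
fare_min  <- min(fare, na.rm = TRUE)   # smallest non-missing value
fare_max  <- max(fare, na.rm = TRUE)   # largest non-missing value

print(c(mean = fare_mean, sd = fare_sd, min = fare_min, max = fare_max))
```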
To find the outliers we are going to look at the upper and lower limits.
An outlier is a value that “lies outside” most of the other values in a set of data.
A method to find upper and lower thresholds involves finding the interquartile range.
** quantile calculation for the given feature **
** Lower and upper quartile calculation **
# lowerq = quantile(VARIABLE, na.rm = TRUE)[2]
# upperq = quantile(VARIABLE, na.rm = TRUE)[4]
** Interquartile range calculation **
# iqr = upperq - lowerq
The thresholds are the boundaries that determine whether a value is an outlier.
If the value falls above the upper threshold or below the lower threshold, it is an outlier.
** Calculate the upper threshold **
# upper_threshold = (iqr * 1.5) + upperq
** Calculate the lower threshold **
# lower_threshold = lowerq - (iqr * 1.5)
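A small worked example of the threshold calculation, using a made-up vector `x` (hypothetical data, chosen so the numbers are easy to follow). It asks `quantile()` for the 25% and 75% points directly, which gives the same values as elements `[2]` and `[4]` of the default output, and uses `names = FALSE` so the results are plain numbers.

```r
# Hypothetical data: nine ordinary values and one very large one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Lower and upper quartiles (25% and 75%), using R's default quantile method
lowerq <- quantile(x, probs = 0.25, na.rm = TRUE, names = FALSE)
upperq <- quantile(x, probs = 0.75, na.rm = TRUE, names = FALSE)

# Interquartile range: the width of the middle half of the data
iqr <- upperq - lowerq

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
upper_threshold <- upperq + (iqr * 1.5)
lower_threshold <- lowerq - (iqr * 1.5)

print(c(lowerq = lowerq, upperq = upperq, iqr = iqr,
        lower_threshold = lower_threshold, upper_threshold = upper_threshold))
```

For this vector the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the thresholds are -3.5 and 14.5. Base R also provides `IQR(x, na.rm = TRUE)`, which returns the same interquartile range in one call.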
** Identify outliers **
# VARIABLE[ VARIABLE > upper_threshold][1:10]
# VARIABLE[ VARIABLE < lower_threshold][1:10]
** Finding outliers records **
# mydata[ VARIABLE > upper_threshold, ][1:10, ]
# mydata[ VARIABLE < lower_threshold, ][1:10, ]
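The sketch below shows one way to pull out the outlier records from a data frame. The data frame `trips` and its columns `trip_id` and `fare` are hypothetical names invented for this example. Wrapping the condition in `which()` is a design choice: it drops comparisons that evaluate to `NA`, so missing values do not appear as all-`NA` rows in the result.

```r
# Hypothetical dataset: ten trips, one with a missing fare and one very large fare
trips <- data.frame(
  trip_id = 1:10,
  fare    = c(5, 6, 7, NA, 8, 9, 10, 11, 12, 250)
)

# Thresholds from the quartiles and the interquartile range
q <- quantile(trips$fare, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- q[2] - q[1]
upper_threshold <- q[2] + (iqr * 1.5)
lower_threshold <- q[1] - (iqr * 1.5)

# TRUE where the fare lies above the upper or below the lower threshold
is_outlier <- trips$fare > upper_threshold | trips$fare < lower_threshold

# which() keeps only the positions that are TRUE and ignores NA
outlier_rows <- trips[which(is_outlier), ]

# Show at most the first 10 outlier records
print(head(outlier_rows, 10))
```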
It can also be useful to visualize the data using a box and whisker plot.
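The box spans the IQR, and the whiskers extend to the most extreme values that fall inside the upper and lower thresholds. Points beyond the whiskers are drawn individually as outliers.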
# p <- ggplot(data = mydata, aes(x = "", y = VARIABLE)) + geom_boxplot() + coord_flip()
# p
Chicago Taxi Dashboard: https://data.cityofchicago.org/Transportation/Taxi-Trips-Dashboard/spcw-brbq
Chicago Taxi Data Description: http://digital.cityofchicago.org/index.php/chicago-taxi-data-released