HW6 phase 1 is all about cleaning up a data set about irises. The first task is to read the .txt file into R. First, the comma seperated value file was opened with Excel and saved as a .xlsx file. This can be read into R with the following commands.The data set is named HW6.
library(readxl)
HW6 <- read_excel("~/Spring 2017/Data Analytics/HW6.xlsx")
The type of variables in the data set must be found to ensure we know what we are working with using the class() function.
class(HW6$Sepal.Length)
## [1] "character"
class(HW6$Sepal.Width)
## [1] "character"
class(HW6$Petal.Length)
## [1] "character"
class(HW6$Petal.Width)
## [1] "character"
class(HW6$Species)
## [1] "character"
It appears we are working with “characters”, which should not cause any problems when cleaning the data.The next step is to replace any special values with “NA”. The result should be a data set with only a number or “NA” as an entry.
First, load the dplyr and ggplot2 libraries. Next, numeric vectors are created for the columns that should have numbers as entries using the as.numeric command. If an entry in these vectors are not in the reals, it is replaced with “NA”, using the is.finite() command to identify entries in the reals. The columns with numeric entries - Sepal.Length, Sepal.Width, Petal.Length, Petal.Width- in HW6 are replaced with the improved vectors.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
SL <- as.numeric(HW6$Sepal.Length)
## Warning: NAs introduced by coercion
SW <- as.numeric(HW6$Sepal.Width)
## Warning: NAs introduced by coercion
PL <- as.numeric(HW6$Petal.Length)
## Warning: NAs introduced by coercion
PW <- as.numeric(HW6$Petal.Width)
## Warning: NAs introduced by coercion
SL [is.finite(SL)==FALSE] <- "NA"
SW [is.finite(SW)==FALSE] <- "NA"
PL [is.finite(PL)==FALSE] <- "NA"
PW [is.finite(PW)==FALSE] <- "NA"
HW6$Sepal.Length = SL
HW6$Sepal.Width = SW
HW6$Petal.Length = PL
HW6$Petal.Width = PW
View(HW6)
The following code finds the amount of observations that have at least one “NA” entry. The number and percentage of complete oberservations are recorded.
NotComplete <- filter(HW6, Sepal.Length=="NA" | Sepal.Width == "NA" | Petal.Length == "NA" | Petal.Width == "NA" | Species == "NA")
View(NotComplete)
NumberComplete=nrow(HW6)-nrow(NotComplete)
NumberComplete
## [1] 95
PercentComplete=((nrow(HW6)-nrow(NotComplete))/nrow(HW6))*100
paste(round(PercentComplete,digits=2),"%")
## [1] "63.33 %"
It may appear that the data is now all squared away and ready to be analyzed, but there are some rules that must be followed!
Let’s figure out how many times each rule is broken.
filter() and nrow() can be used to find how many observations are listed with an unallowed species values. This number is recorded.
a <- filter(HW6, Species != "setosa")
b <- filter(a, Species != "versicolor")
c <- filter(b, Species != "virginica")
One = nrow(c)
One
## [1] 14
The same commands as above are used. The number of just non-positive entries is found. The result still includes “NA” entries. It seemed more beneficial to calculate only numeric entries that were non-positive. This number is recorded.
r <- filter(HW6, Sepal.Length <= "0")
R <- filter(HW6, Sepal.Length <= "0" | Sepal.Length == "NA")
s <- filter(HW6, Sepal.Width <= "0")
S <- filter(HW6, Sepal.Width <= "0" | Sepal.Width == "NA")
v <- filter(HW6, Petal.Length <= "0")
V <- filter(HW6, Petal.Length <= "0" | Petal.Length == "NA")
z <- filter(HW6, Petal.Width <= "0")
Z <- filter(HW6, Petal.Width <= "0" | Petal.Width == "NA")
two=nrow(r)+nrow(s)+nrow(v)+nrow(z)
Two = nrow(R)+nrow(S)+nrow(V)+nrow(Z)
Two
## [1] 70
To do any kind of calculation, the entries must be viewed as numeric, so as.numeric is used again, as well as filter(). The number of times this rule is broken is recorded.
PTLW <- as.numeric(HW6$Petal.Width)
## Warning: NAs introduced by coercion
PTLL <- as.numeric(HW6$Petal.Length)
## Warning: NAs introduced by coercion
LvsW <- filter(HW6, PTLL >= 2*PTLW)
Three = nrow(HW6)-nrow(LvsW)
Three
## [1] 36
Similar to before, as.numeric and filter are used. The number of times this rule is broken is recorded.
SPLL <- as.numeric(HW6$Sepal.Length)
## Warning: NAs introduced by coercion
SPL30 <- filter(HW6, SPLL <= 30)
Four = nrow(HW6)-nrow(SPL30)
Four
## [1] 12
Finally, only the filter() command is necessary, and the number of times the final rule is broken is recorded.
LvsL <- filter(HW6, SPLL > PTLL)
Five = nrow(HW6)-nrow(LvsL)
Five
## [1] 32
Total=One+Two+Three+Four+Five
Total
## [1] 164
This step is mostly complete repetition for the code above. Each Species is checked for breaking each of the five rules, and then the error-free observations that pass the test for each Species are combined into one data frame using the rbind() command.
How many “setosa” had no errors?
A <- filter(HW6, Species == "setosa")
ASL <- filter(A, Sepal.Length >= "0" , Sepal.Length != "NA")
ASW <- filter(ASL, Sepal.Width >= "0", Sepal.Width != "NA")
APL <- filter(ASW, Petal.Length >= "0", Petal.Length != "NA")
APW <- filter(APL, Petal.Width >= "0", Petal.Width != "NA")
APTLW <- as.numeric(APW$Petal.Width)
APTLL <- as.numeric(APW$Petal.Length)
ALvsW <- filter(APW, APTLL >= 2*APTLW)
ASPLL <- as.numeric(ALvsW$Sepal.Length)
ASPL30 <- filter(ALvsW, ASPLL <= 30)
APTLL <- as.numeric(ASPL30$Petal.Length)
ASPLL <- as.numeric(ASPL30$Sepal.Length)
ALvsL <- filter(ASPL30, ASPLL > APTLL)
nrow(ALvsL)
## [1] 27
How many “versicolor” had no errors?
B <- filter(HW6, Species == "versicolor")
BSL <- filter(B, Sepal.Length >= "0" , Sepal.Length != "NA")
BSW <- filter(BSL, Sepal.Width >= "0", Sepal.Width != "NA")
BPL <- filter(BSW, Petal.Length >= "0", Petal.Length != "NA")
BPW <- filter(BPL, Petal.Width >= "0", Petal.Width != "NA")
BPTLW <- as.numeric(BPW$Petal.Width)
BPTLL <- as.numeric(BPW$Petal.Length)
BLvsW <- filter(BPW, BPTLL >= 2*BPTLW)
BSPLL <- as.numeric(BLvsW$Sepal.Length)
BSPL30 <- filter(BLvsW, BSPLL <= 30)
BPTLL <- as.numeric(BSPL30$Petal.Length)
BSPLL <- as.numeric(BSPL30$Sepal.Length)
BLvsL <- filter(BSPL30, BSPLL > BPTLL)
nrow(BLvsL)
## [1] 20
How many “virginica” had no errors?
C <- filter(HW6, Species == "virginica")
CSL <- filter(C, Sepal.Length >= "0" , Sepal.Length != "NA")
CSW <- filter(CSL, Sepal.Width >= "0", Sepal.Width != "NA")
CPL <- filter(CSW, Petal.Length >= "0", Petal.Length != "NA")
CPW <- filter(CPL, Petal.Width >= "0", Petal.Width != "NA")
CPTLW <- as.numeric(CPW$Petal.Width)
CPTLL <- as.numeric(CPW$Petal.Length)
CLvsW <- filter(CPW, CPTLL >= 2*CPTLW)
CSPLL <- as.numeric(CLvsW$Sepal.Length)
CSPL30 <- filter(CLvsW, CSPLL <= 30)
CPTLL <- as.numeric(CSPL30$Petal.Length)
CSPLL <- as.numeric(CSPL30$Sepal.Length)
CLvsL <- filter(CSPL30, CSPLL > CPTLL)
nrow(CLvsL)
## [1] 29
Overall: What data had no errors?
NoErrors = rbind(ALvsL,BLvsL,CLvsL)
PercentNoError = (nrow(NoErrors)/nrow(HW6))*100
paste(round(PercentNoError,digits=2),"%")
## [1] "50.67 %"
NoErrors
## # A tibble: 76 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## * <chr> <chr> <chr> <chr> <chr>
## 1 5 3.4 1.6 0.4 setosa
## 2 5 3.5 1.6 0.6 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.2 1.4 0.2 setosa
## 5 4.9 3.1 1.5 0.1 setosa
## 6 4.8 3 1.4 0.1 setosa
## 7 5 3 1.6 0.2 setosa
## 8 5.5 4.2 1.4 0.2 setosa
## 9 4.8 3.4 1.6 0.2 setosa
## 10 5.1 3.8 1.5 0.3 setosa
## # ... with 66 more rows