R Markdown

HW6 phase 1 is all about cleaning up a data set about irises. The first task is to read the .txt file into R. First, the comma-separated value file was opened with Excel and saved as a .xlsx file. The data set HW6 can then be read into R with the following commands.

library(readxl)
HW6 <- read_excel("~/Spring 2017/Data Analytics/HW6.xlsx")
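Since the source file is comma separated, the Excel round trip can probably be skipped. The following is a minimal sketch of reading such a file directly with base R, assuming it has a header row. The temporary file and its contents are invented for illustration and are not the HW6 data.

```r
# Hypothetical stand-in for the HW6 text file (not the real data).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("Sepal.Length,Sepal.Width,Species",
             "5.1,3.5,setosa",
             "Inf,3.0,versicolor",
             "n/a,2.9,virginica"), demo_path)

# Read every column as character so that odd entries survive the import.
demo <- read.csv(demo_path, colClasses = "character", stringsAsFactors = FALSE)
str(demo)
```

Reading everything as character mirrors what read_excel() produced above and leaves the decision about special values to the cleaning step.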

The class() function tells us what variable types we are working with for each column.

class(HW6$Sepal.Length)
## [1] "character"
class(HW6$Sepal.Width)
## [1] "character"
class(HW6$Petal.Length)
## [1] "character"
class(HW6$Petal.Width)
## [1] "character"
class(HW6$Species)
## [1] "character"
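The five separate calls can be collapsed into one. A small sketch, using a made-up two-column data frame in place of HW6:

```r
# Toy data frame standing in for HW6 (assumption: columns stored as text).
demo <- data.frame(Sepal.Length = c("5.1", "NaN"),
                   Species = c("setosa", "virginica"),
                   stringsAsFactors = FALSE)

# sapply() applies class() to each column and returns a named vector.
col_classes <- sapply(demo, class)
print(col_classes)
```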

It appears that every column is stored as “character”. This is workable for cleaning, as long as the measurement columns are converted with as.numeric() before any numeric comparison. The next step is to replace any special values with “NA”. The result should be a data set in which every measurement entry is either a number or “NA”.

First, load the dplyr and ggplot2 libraries. Next, numeric vectors are created for the columns that should have numbers as entries using the as.numeric() command. Using the is.finite() command, any entry that is not a finite real number is replaced with “NA”. The columns with numeric entries (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in HW6 are then replaced with the improved vectors. Because the replacement value is the string “NA”, each vector is coerced back to character, which is why later steps compare against “NA” and call as.numeric() again.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
SL <- as.numeric(HW6$Sepal.Length)
## Warning: NAs introduced by coercion
SW <- as.numeric(HW6$Sepal.Width)
## Warning: NAs introduced by coercion
PL <- as.numeric(HW6$Petal.Length)
## Warning: NAs introduced by coercion
PW <- as.numeric(HW6$Petal.Width)
## Warning: NAs introduced by coercion
SL[!is.finite(SL)] <- "NA"
SW[!is.finite(SW)] <- "NA"
PL[!is.finite(PL)] <- "NA"
PW[!is.finite(PW)] <- "NA"
HW6$Sepal.Length = SL
HW6$Sepal.Width = SW
HW6$Petal.Length = PL
HW6$Petal.Width = PW
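The assignments above store the text "NA", which turns each vector back into character. If a true missing value is wanted instead, the `NA` constant keeps the columns numeric. The sketch below shows that alternative; `clean_numeric` is a hypothetical helper name and the input vector is made up.

```r
# Hypothetical helper: convert to numeric and blank out anything non-finite.
clean_numeric <- function(x) {
  x <- suppressWarnings(as.numeric(x))  # unparseable text becomes NA
  x[!is.finite(x)] <- NA                # Inf, -Inf and NaN become NA too
  x
}

raw <- c("5.1", "Inf", "abc", "NaN", "-3")
cleaned <- clean_numeric(raw)
print(cleaned)
class(cleaned)
```

With real missing values, later checks would use is.na() rather than a comparison against the string "NA", so the rest of this report would need matching changes before adopting it.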

The following code finds the number of observations that have at least one “NA” entry. The number and percentage of complete observations are then recorded.

NotComplete <- filter(HW6, Sepal.Length=="NA" | Sepal.Width == "NA" | Petal.Length == "NA" | Petal.Width == "NA" | Species == "NA")

NumberComplete=nrow(HW6)-nrow(NotComplete)
NumberComplete
## [1] 95
PercentComplete=((nrow(HW6)-nrow(NotComplete))/nrow(HW6))*100
paste(round(PercentComplete,digits=2),"%")
## [1] "63.33 %"
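Base R also offers complete.cases() for this count. A brief sketch on invented data, assuming the missing entries are true `NA` values rather than the string "NA" used in this report:

```r
# Made-up data with real NA values (not the HW6 data).
demo <- data.frame(Sepal.Length = c(5.1, NA, 4.7),
                   Petal.Width  = c(0.2, 0.4, NA),
                   Species      = c("setosa", "setosa", "virginica"))

# complete.cases() is TRUE for rows with no missing value in any column.
n_complete   <- sum(complete.cases(demo))
pct_complete <- 100 * n_complete / nrow(demo)
paste(round(pct_complete, digits = 2), "%")
```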

It may appear that the data is now all squared away and ready to be analyzed, but there are some rules that must be followed!

Rules

  1. Species should be precisely one of the following values: setosa, versicolor, or virginica.
  2. All measured numerical properties of an iris are in centimeters and should be positive.
  3. The petal length of an iris is always at least two times its petal width.
  4. The sepal length of an iris cannot exceed 30 cm.
  5. The sepals of an iris are always longer than its petals.

Let’s figure out how many times each rule is broken.

First, the breaking of a rule should be clearly defined. Does an “NA” entry count as a rule break? For now, we do not need to make this decision. Instead, each rule will have two output values for “How many times the rule is broken”. The first value does not count an “NA” entry as breaking the rule, and the second value does count “NA” as breaking the rule.

Rule 1: Species should be precisely one of the following values: setosa, versicolor, or virginica.

filter() and nrow() can be used to count the observations listed with disallowed species values. The number of “NA” entries in Species is zero, so counting “NA” as a rule break does not change the number of times Rule 1 is broken.

R1NA <- filter(HW6, Species == "NA")
nrow(R1NA)
## [1] 0
a <- filter(HW6, Species != "setosa")
b <- filter(a, Species != "versicolor")
c <- filter(b, Species != "virginica")
d <- filter(c, Species != "NA")


R1NA= nrow(d)
R1 = nrow(c)
R1NA
## [1] 14
R1
## [1] 14
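The chain of filters for Rule 1 can also be written as a single membership test. A sketch using %in% on an invented vector, assuming missing species would be stored as a true `NA`:

```r
allowed <- c("setosa", "versicolor", "virginica")

# Made-up species entries, including a capitalised and a padded value.
species <- c("setosa", "Setosa", "versicolor", NA, "virginica ", "virginica")

# Missing values are tallied separately from genuine violations.
is_missing  <- is.na(species)
breaks_rule <- !is_missing & !(species %in% allowed)

sum(breaks_rule)               # NA not counted as a rule break
sum(breaks_rule | is_missing)  # NA counted as a rule break
```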
Rule 2: All measured numerical properties of an iris are in centimeters and should be positive.

Using commands similar to those in the previous calculation, the entries that fail the check are counted for each of the four measurement columns. Because these columns hold character data, the comparison against “0” is a string comparison rather than a numeric one, and an “NA” entry passes it. The first output therefore does not count “NA” as a rule break. The second output excludes “NA” entries from the passing rows and so counts each “NA” as a rule break.

slNA <- filter(HW6, Sepal.Length >= "0")
sl <- filter(HW6, Sepal.Length >="0", Sepal.Length!= "NA")


swNA <- filter(HW6, Sepal.Width >= "0")
sw <- filter(HW6, Sepal.Width >= "0" , Sepal.Width != "NA")

plNA <- filter(HW6, Petal.Length >= "0")
pl <- filter(HW6, Petal.Length >= "0" , Petal.Length != "NA")

pwNA <- filter(HW6, Petal.Width >= "0")
pw <- filter(HW6, Petal.Width >= "0" , Petal.Width != "NA")

R2NA=4*nrow(HW6)-(nrow(slNA)+nrow(swNA)+nrow(plNA)+nrow(pwNA))
R2=4*nrow(HW6)-(nrow(sl)+nrow(sw)+nrow(pl)+nrow(pw))

R2NA
## [1] 8
R2
## [1] 67
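For comparison, here is a sketch of the same per-entry tally done numerically. It assumes the four measurement columns are already numeric with true `NA` values, and it uses a strict `> 0` test because Rule 2 asks for positive values. The data are invented.

```r
# Made-up measurements (not the HW6 data).
demo <- data.frame(Sepal.Length = c(5.1, -3, NA, 0),
                   Sepal.Width  = c(3.5, 3.0, 2.9, NA),
                   Petal.Length = c(1.4, 1.3, 0, 1.5),
                   Petal.Width  = c(0.2, NA, 0.2, -1))

m <- as.matrix(demo)                      # all four columns at once
n_nonpositive <- sum(m <= 0, na.rm = TRUE)  # entries that are zero or negative
n_missing     <- sum(is.na(m))              # entries that are missing

n_nonpositive               # NA not counted as a rule break
n_nonpositive + n_missing   # NA counted as a rule break
```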
Rule 3: The petal length of an iris is always at least two times its petal width.

To do any kind of calculation, the entries must be viewed as numeric, so as.numeric() is used again, as well as filter(). The number of times this rule is broken is recorded, first for “NA” not as a rule break, and second for “NA” as a rule break.

PTLW <- as.numeric(HW6$Petal.Width)
## Warning: NAs introduced by coercion
PTLL <- as.numeric(HW6$Petal.Length)
## Warning: NAs introduced by coercion
LvsW <- filter(HW6, PTLL >= 2*PTLW)
PTLLNA <- filter(HW6, Petal.Length == "NA" | Petal.Width == "NA")
LvsWNA <- rbind(LvsW,PTLLNA)

R3NA = nrow(HW6)-nrow(LvsWNA)
R3 = nrow(HW6)-nrow(LvsW)
R3NA
## [1] 5
R3
## [1] 36
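The Rule 3 comparison can be sketched on plain numeric vectors, where a missing value on either side makes the result `NA`. The values below are invented and assume true `NA` entries.

```r
petal_length <- c(1.4, 4.5, NA, 0.3, 5.0)
petal_width  <- c(0.2, 2.5, 0.2, 0.2, NA)

# TRUE when the rule holds, FALSE when it is broken, NA when undecidable.
ok <- petal_length >= 2 * petal_width

n_break   <- sum(!ok, na.rm = TRUE)  # broken among complete pairs
n_missing <- sum(is.na(ok))          # pairs with a missing value

n_break               # NA not counted as a rule break
n_break + n_missing   # NA counted as a rule break
```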
Rule 4: The sepal length of an iris cannot exceed 30 cm.

As before, as.numeric() and filter() are used. The number of times this rule is broken is recorded for both treatments of an “NA” entry.

SPLL <- as.numeric(HW6$Sepal.Length)
## Warning: NAs introduced by coercion
SPL30 <- filter(HW6, SPLL <= 30)
SPLNA <- filter(HW6, Sepal.Length == "NA")
SPL30NA <-rbind(SPL30,SPLNA)

R4NA = nrow(HW6)-nrow(SPL30NA)
R4 = nrow(HW6)-nrow(SPL30)

R4NA
## [1] 2
R4
## [1] 12
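When the offending rows themselves are of interest, which() returns their positions and silently skips missing values. A sketch for Rule 4 with an invented vector, assuming numeric data with true `NA` entries:

```r
sepal_length <- c(5.1, 49, NA, 30, 73)

too_long <- which(sepal_length > 30)     # positions that break the rule
missing  <- which(is.na(sepal_length))   # positions with no value

length(too_long)                    # NA not counted as a rule break
length(too_long) + length(missing)  # NA counted as a rule break
```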
Rule 5: The sepals of an iris are always longer than its petals.

Finally, only the filter() command is necessary, since the numeric vectors SPLL and PTLL from the previous steps are reused. The number of times the final rule is broken is recorded for both treatments of an “NA” entry.

LvsL <- filter(HW6, SPLL > PTLL)
SPLLNA <- filter(HW6, Sepal.Length == "NA" | Petal.Length == "NA")
LvsLNA <- rbind(LvsL, SPLLNA)

R5NA = nrow(HW6)-nrow(LvsLNA)
R5 = nrow(HW6)-nrow(LvsL)
R5NA
## [1] 4
R5
## [1] 32
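The Rule 5 check reuses vectors created in earlier chunks. A sketch that computes the comparison from the data frame itself, so it does not depend on earlier objects, might look like this. The data are invented and assume numeric columns with true `NA` values.

```r
demo <- data.frame(Sepal.Length = c(5.1, 4.0, NA, 6.3),
                   Petal.Length = c(1.4, 4.5, 1.3, 6.3))

# with() evaluates the comparison using the data frame's own columns.
sepal_longer <- with(demo, Sepal.Length > Petal.Length)

n_break            <- sum(!sepal_longer, na.rm = TRUE)
n_break_or_missing <- sum(is.na(sepal_longer) | !sepal_longer)

n_break             # NA not counted as a rule break
n_break_or_missing  # NA counted as a rule break
```

An equal sepal and petal length counts as a break here, because the rule requires the sepal to be strictly longer.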
How many times was a rule broken?

Where “NA” is not a rule break:

TotalNA=R1NA+R2NA+R3NA+R4NA+R5NA
TotalNA
## [1] 33

Where “NA” is a rule break:

Total= R1+R2+R3+R4+R5
Total
## [1] 161
What percentage of the observations has no errors?

This step mostly repeats the code above. The observations of each Species are checked against each of the five rules, and the error-free observations that pass every test are then combined across Species into one data frame using the rbind() command.

How many “setosa” had no errors?

A <- filter(HW6, Species == "setosa")

ASL <- filter(A, Sepal.Length >= "0" , Sepal.Length != "NA")
ASW <- filter(ASL, Sepal.Width >= "0", Sepal.Width != "NA")
APL <- filter(ASW, Petal.Length >= "0", Petal.Length != "NA")
APW <- filter(APL, Petal.Width >= "0", Petal.Width != "NA")

APTLW <- as.numeric(APW$Petal.Width)
APTLL <- as.numeric(APW$Petal.Length)
ALvsW <- filter(APW, APTLL >= 2*APTLW)

ASPLL <- as.numeric(ALvsW$Sepal.Length)
ASPL30 <- filter(ALvsW, ASPLL <= 30)

APTLL <- as.numeric(ASPL30$Petal.Length)
ASPLL <- as.numeric(ASPL30$Sepal.Length)
ALvsL <- filter(ASPL30, ASPLL > APTLL)
nrow(ALvsL)
## [1] 27

How many “versicolor” had no errors?

B <- filter(HW6, Species == "versicolor")

BSL <- filter(B, Sepal.Length >= "0" , Sepal.Length != "NA")
BSW <- filter(BSL, Sepal.Width >= "0", Sepal.Width != "NA")
BPL <- filter(BSW, Petal.Length >= "0", Petal.Length != "NA")
BPW <- filter(BPL, Petal.Width >= "0", Petal.Width != "NA")

BPTLW <- as.numeric(BPW$Petal.Width)
BPTLL <- as.numeric(BPW$Petal.Length)
BLvsW <- filter(BPW, BPTLL >= 2*BPTLW)

BSPLL <- as.numeric(BLvsW$Sepal.Length)
BSPL30 <- filter(BLvsW, BSPLL <= 30)

BPTLL <- as.numeric(BSPL30$Petal.Length)
BSPLL <- as.numeric(BSPL30$Sepal.Length)
BLvsL <- filter(BSPL30, BSPLL > BPTLL)
nrow(BLvsL)
## [1] 20

How many “virginica” had no errors?

C <- filter(HW6, Species == "virginica")

CSL <- filter(C, Sepal.Length >= "0" , Sepal.Length != "NA")
CSW <- filter(CSL, Sepal.Width >= "0", Sepal.Width != "NA")
CPL <- filter(CSW, Petal.Length >= "0", Petal.Length != "NA")
CPW <- filter(CPL, Petal.Width >= "0", Petal.Width != "NA")

CPTLW <- as.numeric(CPW$Petal.Width)
CPTLL <- as.numeric(CPW$Petal.Length)
CLvsW <- filter(CPW, CPTLL >= 2*CPTLW)

CSPLL <- as.numeric(CLvsW$Sepal.Length)
CSPL30 <- filter(CLvsW, CSPLL <= 30)

CPTLL <- as.numeric(CSPL30$Petal.Length)
CSPLL <- as.numeric(CSPL30$Sepal.Length)
CLvsL <- filter(CSPL30, CSPLL > CPTLL)
nrow(CLvsL)
## [1] 29

Overall: What data had no errors?

NoErrors = rbind(ALvsL,BLvsL,CLvsL)
PercentNoError = (nrow(NoErrors)/nrow(HW6))*100
paste(round(PercentNoError,digits=2),"%")
## [1] "50.67 %"
NoErrors
## # A tibble: 76 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## *         <chr>       <chr>        <chr>       <chr>   <chr>
## 1             5         3.4          1.6         0.4  setosa
## 2             5         3.5          1.6         0.6  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.2          1.4         0.2  setosa
## 5           4.9         3.1          1.5         0.1  setosa
## 6           4.8           3          1.4         0.1  setosa
## 7             5           3          1.6         0.2  setosa
## 8           5.5         4.2          1.4         0.2  setosa
## 9           4.8         3.4          1.6         0.2  setosa
## 10          5.1         3.8          1.5         0.3  setosa
## # ... with 66 more rows
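The three per-species blocks apply the same checks, so they could be folded into one function. The sketch below is one possible way to do that, not the method used above. It assumes numeric measurement columns with true `NA` values, uses a strict `> 0` test for Rule 2, and treats any missing value as a failure. `passes_all_rules` is a hypothetical helper and the data are invented.

```r
allowed <- c("setosa", "versicolor", "virginica")

# Hypothetical helper: TRUE for rows that satisfy all five rules.
# A missing value in any checked column makes the row fail.
passes_all_rules <- function(df) {
  ok <- df$Species %in% allowed &                       # Rule 1
    df$Sepal.Length > 0 & df$Sepal.Width > 0 &          # Rule 2
    df$Petal.Length > 0 & df$Petal.Width > 0 &
    df$Petal.Length >= 2 * df$Petal.Width &             # Rule 3
    df$Sepal.Length <= 30 &                             # Rule 4
    df$Sepal.Length > df$Petal.Length                   # Rule 5
  !is.na(ok) & ok
}

# Made-up observations (not the HW6 data).
demo <- data.frame(
  Sepal.Length = c(5.0, 6.4, 49, NA),
  Sepal.Width  = c(3.4, 3.2, 3.0, 2.9),
  Petal.Length = c(1.6, 4.5, 1.4, 1.3),
  Petal.Width  = c(0.4, 1.5, 0.2, 0.2),
  Species      = c("setosa", "versicolor", "setosa", "virginica")
)

keep           <- passes_all_rules(demo)
error_free     <- demo[keep, ]
pct_error_free <- 100 * mean(keep)
paste(round(pct_error_free, digits = 2), "%")
```

Because Rule 1 is part of the same test, rows with a disallowed species are dropped without splitting the data by Species first.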