We were given a data set containing multiple measurements of irises, along with a set of rules and specifications that each observation should meet. The following analysis documents the process we used to identify the observations that break those rules.

Type of Variables

We were first asked to find the class of each variable. We did this with the class() function; the result for each variable appears below the corresponding code.

Sepal.Length:

class(iris1$Sepal.Length)
## [1] "numeric"

Sepal.Width:

class(iris1$Sepal.Width)
## [1] "numeric"

Petal.Length:

class(iris1$Petal.Length)
## [1] "numeric"

Petal.Width:

class(iris1$Petal.Width)
## [1] "numeric"

Species:

class(iris1$Species)
## [1] "character"
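
The five calls above can also be collapsed into a single call. The following is a minimal sketch, assuming a small made-up data frame named demo stands in for iris1 (the real data is not reproduced here):

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
demo <- data.frame(
  Sepal.Length = c(5.1, 4.9, NA),
  Sepal.Width  = c(3.5, NA, 3.2),
  Petal.Length = c(1.4, 1.4, 1.3),
  Petal.Width  = c(0.2, 0.2, Inf),
  Species      = c("setosa", "setosa", "settosa"),
  stringsAsFactors = FALSE
)

# sapply() applies class() to every column and returns a named character vector.
col_classes <- sapply(demo, class)
print(col_classes)
```

On a data frame with the same column types as iris1, this should report "numeric" for the four measurements and "character" for Species.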

Complete Observations

An observation is complete when none of its variables is NA, where NA signifies that no data was recorded for that variable. The code below filters for the observations that have no NA in any variable; the nrow() function then counts the rows that remain.

Number of observations that are complete:

nrow(filter(iris1, iris1$Sepal.Length != "NA" & iris1$Sepal.Width != "NA" & iris1$Petal.Length != "NA" & iris1$Petal.Width != "NA" & iris1$Species != "NA"))
## [1] 96

Next we divide the number of complete observations by the total number of observations, again using the nrow() function.

Proportion of observations that are complete:

(nrow(filter(iris1, iris1$Sepal.Length != "NA" & iris1$Sepal.Width != "NA" & iris1$Petal.Length != "NA" & iris1$Petal.Width != "NA" & iris1$Species != "NA"))/ (nrow(iris1)))
## [1] 0.64
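
A more compact way to express the same check is base R's complete.cases(), which returns TRUE for every row with no NA in any column. This is a minimal sketch on a made-up data frame named demo, not the code used to produce the figures above:

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
demo <- data.frame(
  Sepal.Length = c(5.1, NA, 4.7, 4.6),
  Petal.Width  = c(0.2, 0.2, NA, 0.2),
  Species      = c("setosa", "setosa", NA, "setosa"),
  stringsAsFactors = FALSE
)

# TRUE for rows in which no column is NA.
is_complete <- complete.cases(demo)

# Summing a logical vector counts the TRUE values.
n_complete <- sum(is_complete)

# The mean of a logical vector is the proportion of TRUE values.
prop_complete <- mean(is_complete)

print(n_complete)
print(prop_complete)
```

Note that complete.cases() only looks for NA, so a special value such as Inf still counts as present.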

Special Values

After sorting each variable in both ascending and descending order, we found one special value, Inf. Filtering each of the variables for this value showed that the only variable containing it is Petal.Width.

The code below displays the observation with this special value, replaces the value with NA, and then reruns the filter to confirm that no Inf remains.

filter(iris1, is.infinite(Petal.Width))
## # A tibble: 1 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
## 1          5.1         3.8          1.9         Inf  setosa
iris1$Petal.Width[is.infinite(iris1$Petal.Width)] <- NA
filter(iris1, is.infinite(Petal.Width))
## # A tibble: 0 × 5
## # ... with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## #   Petal.Length <dbl>, Petal.Width <dbl>, Species <chr>
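
Instead of sorting and filtering each variable by hand, the search for special values can be run over every numeric column at once. The sketch below assumes a made-up data frame named demo in place of iris1 and uses base R's is.infinite(), which flags both Inf and -Inf:

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
demo <- data.frame(
  Sepal.Width = c(3.5, 3.0, NA),
  Petal.Width = c(0.2, Inf, 0.4),
  Species     = c("setosa", "setosa", "virginica"),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric.
num_cols <- sapply(demo, is.numeric)

# Count the infinite values in each numeric column (NA is not infinite).
inf_counts <- colSums(sapply(demo[num_cols], is.infinite))
print(inf_counts)

# Replace every infinite value in the numeric columns with NA.
demo[num_cols] <- lapply(demo[num_cols], function(x) {
  x[is.infinite(x)] <- NA
  x
})
print(demo)
```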

Species Outliers

The following observations contain entries in the Species column that do not meet the specifications. The code below shows how we searched for values that are not spelled exactly as one of the three allowed names: setosa, virginica, and versicolor.

filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor")
## # A tibble: 11 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
##           <dbl>       <dbl>        <dbl>       <dbl>       <chr>
## 1           4.8         3.1          1.6         0.2     setosta
## 2           5.0          NA          1.2         0.2     settosa
## 3           5.0         3.5          1.3         0.3     settosa
## 4           6.5          NA          4.6         1.5  varsicolor
## 5           6.4         2.9          4.3         1.3 versi color
## 6           5.7         2.9          4.2         1.3 versi-color
## 7           6.7         3.1          4.7         1.5  wersicolor
## 8           4.8          NA          1.9         0.2     setossa
## 9           5.7         2.5          5.0         2.0  virginicca
## 10          5.2         3.5          1.5         0.2     settosa
## 11          5.8         2.6          4.0          NA versicolora

This code counts the number of observations that break the Species rule.

nrow(filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor"))
## [1] 11
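
The chained != comparisons can be written more compactly with %in%. One difference is worth knowing: a comparison with NA yields NA, and filter() drops rows whose condition is NA, whereas %in% never returns NA, so a missing label is reported as invalid unless it is excluded explicitly. The following is a minimal sketch on a made-up vector of labels; the names species and allowed are illustrative only:

```r
# Hypothetical labels, made up for illustration.
species <- c("setosa", "settosa", "versi color", "virginica", NA, "versicolor")

# The three spellings permitted by the specification.
allowed <- c("setosa", "versicolor", "virginica")

# TRUE for every label that is not one of the allowed spellings (NA included).
invalid <- !(species %in% allowed)

# Restrict to labels that are present but misspelled.
misspelled <- invalid & !is.na(species)

print(sum(invalid))
print(sum(misspelled))

# Tabulate the misspellings to see which variants occur and how often.
print(table(species[misspelled]))
```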

Iris Measurements

All of the measurements are supposed to be positive, so the code below filters for the cases where any measurement is less than 0, that is, negative. The nrow() function is then used to count these observations.

filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0)
## # A tibble: 8 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>
## 1          5.0        -3.0          3.5         1.0 versicolor
## 2         -6.1         3.0          4.6         1.4 versicolor
## 3          6.1         2.9          4.7        -1.4 versicolor
## 4          6.7         3.0         -5.2         2.3  virginica
## 5          5.7         3.0          4.2        -1.2 versicolor
## 6         -5.6         3.0          4.1         1.3 versicolor
## 7          5.1         3.3         -1.7         0.5     setosa
## 8          6.6        -3.0          4.4         1.4 versicolor
nrow(filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0))
## [1] 8
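
The same check can be written without naming each column. This is a minimal base R sketch, assuming a made-up data frame named demo that holds only the four measurement columns; comparing a numeric data frame with 0 gives a logical matrix, and rowSums() then counts the negative entries per row:

```r
# Hypothetical measurements, made up for illustration.
demo <- data.frame(
  Sepal.Length = c(5.0, -6.1, 6.1, NA),
  Sepal.Width  = c(-3.0, 3.0, 2.9, 3.1),
  Petal.Length = c(3.5, 4.6, 4.7, 1.5),
  Petal.Width  = c(1.0, 1.4, -1.4, 0.2)
)

# A row is flagged when at least one measurement is below 0.
# na.rm = TRUE means a missing measurement is not treated as negative.
has_negative <- rowSums(demo < 0, na.rm = TRUE) > 0

# Rows with at least one negative measurement, and their count.
print(demo[has_negative, ])
print(sum(has_negative))
```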

Petal Length

Another rule of the dataset is that Petal Length should be at least twice the Petal Width. The following code filters for the cases where this rule is broken and then counts them.

filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width))
## # A tibble: 5 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>
## 1           NA         2.8        0.820         1.3 versicolor
## 2          5.1         3.8        0.000         0.2     setosa
## 3          6.7         3.0       -5.200         2.3  virginica
## 4          5.5          NA        0.925         1.0 versicolor
## 5          5.1         3.3       -1.700         0.5     setosa
nrow(filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width)))
## [1] 5

Sepal Length

An additional rule of the dataset is that Sepal Length cannot exceed 30 cm. The following code finds the observations that break this rule and counts them.

filter(iris1, iris1$Sepal.Length > 30)
## # A tibble: 2 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##          <dbl>       <dbl>        <dbl>       <dbl>     <chr>
## 1           73          29           63          NA virginica
## 2           49          30           14           2    setosa
nrow(filter(iris1, iris1$Sepal.Length > 30))
## [1] 2

Sepal Length vs. Petal Length

The last rule of the dataset is that Sepal Length should always be longer than Petal Length. As above, the code finds and then counts the observations where this rule is broken.

filter(iris1, iris1$Sepal.Length < iris1$Petal.Length)
## # A tibble: 4 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##          <dbl>       <dbl>        <dbl>       <dbl>      <chr>
## 1          6.6         2.9         23.0         1.3 versicolor
## 2          0.0          NA          1.3         0.4     setosa
## 3         -6.1         3.0          4.6         1.4 versicolor
## 4         -5.6         3.0          4.1         1.3 versicolor
nrow(filter(iris1, iris1$Sepal.Length < iris1$Petal.Length))
## [1] 4

Observations Without Errors

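The following code adds up the counts from the nrow() commands above to find the total number of errors and stores it in the errors variable. An observation that breaks more than one rule is counted once for each rule it breaks; for example, the observations with Sepal Lengths of -6.1 and -5.6 appear in both the negative-measurement results and the Sepal Length vs. Petal Length results. The total is therefore a count of rule violations, which can be larger than the number of observations that contain an error.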

errors <- nrow(filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor")) +
  nrow(filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0)) +
  nrow(filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width))) +
  nrow(filter(iris1, iris1$Sepal.Length > 30)) +
  nrow(filter(iris1, iris1$Sepal.Length < iris1$Petal.Length))

errors
## [1] 30

This code places the total observations into the variable total.

total <- nrow(iris1)

total
## [1] 150

This then calculates the proportion of observations that do not have errors.

(total - errors)/total
## [1] 0.8
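
Summing the per-rule counts tallies an observation once for every rule it breaks. The following is a minimal sketch of an alternative that counts each observation at most once. It assumes a made-up data frame named demo in place of iris1, and the column names of the violations matrix are illustrative labels rather than anything from the original analysis:

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
demo <- data.frame(
  Sepal.Length = c(5.1, -6.1, 49, 6.6, 5.0),
  Sepal.Width  = c(3.5, 3.0, 30, 2.9, NA),
  Petal.Length = c(1.4, 4.6, 14, 23, 1.2),
  Petal.Width  = c(0.2, 1.4, 2, 1.3, 0.2),
  Species      = c("setosa", "versicolor", "setosa", "versicolor", "settosa"),
  stringsAsFactors = FALSE
)

allowed <- c("setosa", "versicolor", "virginica")
measurements <- demo[c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# One logical column per rule; TRUE means the row breaks that rule.
violations <- cbind(
  species        = !is.na(demo$Species) & !(demo$Species %in% allowed),
  negative       = rowSums(measurements < 0, na.rm = TRUE) > 0,
  petal_ratio    = demo$Petal.Length < 2 * demo$Petal.Width,
  sepal_too_long = demo$Sepal.Length > 30,
  sepal_vs_petal = demo$Sepal.Length < demo$Petal.Length
)

# A rule that cannot be evaluated because of NA is treated as not broken,
# which mirrors how filter() drops rows whose condition is NA.
violations[is.na(violations)] <- FALSE

n_violations <- sum(violations)               # every broken rule counts
n_bad_rows   <- sum(rowSums(violations) > 0)  # every row counts at most once
prop_clean   <- (nrow(demo) - n_bad_rows) / nrow(demo)

print(n_violations)
print(n_bad_rows)
print(prop_clean)
```

In this toy data the second row breaks two rules, so the number of violations exceeds the number of affected rows; colSums(violations) gives the per-rule counts if those are needed as well.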