We were given a data set with multiple measurements of irises, along with rules that every observation should satisfy. The following analysis documents the process we used to identify the observations that break those rules.
We were first asked to find the class of each variable. We did this with the class() function; the result for each variable is shown below the corresponding code.
Sepal.Length:
class(iris1$Sepal.Length)
## [1] "numeric"
Sepal.Width:
class(iris1$Sepal.Width)
## [1] "numeric"
Petal.Length:
class(iris1$Petal.Length)
## [1] "numeric"
Petal.Width:
class(iris1$Petal.Width)
## [1] "numeric"
Species:
class(iris1$Species)
## [1] "character"
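The five separate calls can also be collapsed into one. The following is a minimal sketch in base R; `toy` is a small made-up data frame standing in for `iris1`, and its values are illustrative rather than taken from the data set.

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
toy <- data.frame(
  Sepal.Length = c(5.1, 4.9, NA),
  Sepal.Width  = c(3.5, 3.0, 3.2),
  Petal.Length = c(1.4, 1.4, 1.3),
  Petal.Width  = c(0.2, 0.2, 0.2),
  Species      = c("setosa", "setosa", "setosa"),
  stringsAsFactors = FALSE
)

# sapply() applies class() to every column and returns a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)
```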
The code below keeps only the observations that contain no NA. An NA signifies that the value of a variable is missing for that observation, so an observation that contains an NA is not complete. The nrow() function then counts the rows that remain after filtering.
Number of observations that are complete:
nrow(filter(iris1, !is.na(Sepal.Length) & !is.na(Sepal.Width) & !is.na(Petal.Length) & !is.na(Petal.Width) & !is.na(Species)))
## [1] 96
Next we divide the number of complete observations by the total number of observations. The nrow() function is used again for both counts.
Proportion of observations that are complete:
nrow(filter(iris1, !is.na(Sepal.Length) & !is.na(Sepal.Width) & !is.na(Petal.Length) & !is.na(Petal.Width) & !is.na(Species))) / nrow(iris1)
## [1] 0.64
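The same count and proportion can be obtained without naming every column. This is a sketch in base R using `complete.cases()`; `toy` is a hypothetical data frame with made-up values, not the real `iris1`. Note that `complete.cases()` only looks for NA, so a value such as Inf still counts as present.

```r
# Hypothetical stand-in for iris1; the values are made up for illustration.
toy <- data.frame(
  Sepal.Length = c(5.1, NA, 6.3, 5.8),
  Petal.Width  = c(0.2, 0.2, NA, 1.9),
  Species      = c("setosa", "setosa", "virginica", NA),
  stringsAsFactors = FALSE
)

# TRUE for rows that have no NA in any column.
complete <- complete.cases(toy)

n_complete    <- sum(complete)   # number of complete observations
prop_complete <- mean(complete)  # proportion of complete observations

print(n_complete)
print(prop_complete)
```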
After sorting each variable in both ascending and descending order, we found one special value, Inf. Filtering each variable for this value showed that the only variable containing it is Petal.Width.
The code below shows the observation with this special value, replaces the value with NA, and then confirms that no Inf remains.
filter(iris1, is.infinite(Petal.Width))
## # A tibble: 1 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.8 1.9 Inf setosa
iris1$Petal.Width[is.infinite(iris1$Petal.Width)] <- NA
filter(iris1, is.infinite(Petal.Width))
## # A tibble: 0 × 5
## # ... with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>,
## # Petal.Length <dbl>, Petal.Width <dbl>, Species <chr>
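The replacement step can be illustrated on a plain vector. This is a small sketch with made-up values (`petal_width` is hypothetical, not the column from `iris1`); it relies on `is.infinite()`, which is TRUE for both Inf and -Inf and FALSE for NA, so existing missing values do not interfere with the subsetting.

```r
# Made-up values containing ordinary numbers, NA, Inf and -Inf.
petal_width <- c(0.2, Inf, NA, 1.3, -Inf)

# Flag the special values; NA is not flagged.
is_special <- is.infinite(petal_width)

# Replace the flagged values with NA.
petal_width[is_special] <- NA

print(sum(is_special))
print(petal_width)
```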
The following observations contain entries in the Species column that do not meet the specifications. The code below searches for entries that are not spelled exactly as one of the three allowed values: setosa, virginica, and versicolor.
filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor")
## # A tibble: 11 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 4.8 3.1 1.6 0.2 setosta
## 2 5.0 NA 1.2 0.2 settosa
## 3 5.0 3.5 1.3 0.3 settosa
## 4 6.5 NA 4.6 1.5 varsicolor
## 5 6.4 2.9 4.3 1.3 versi color
## 6 5.7 2.9 4.2 1.3 versi-color
## 7 6.7 3.1 4.7 1.5 wersicolor
## 8 4.8 NA 1.9 0.2 setossa
## 9 5.7 2.5 5.0 2.0 virginicca
## 10 5.2 3.5 1.5 0.2 settosa
## 11 5.8 2.6 4.0 NA versicolora
This code counts the number of observations that break the Species rule.
nrow(filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor"))
## [1] 11
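The chain of != comparisons can also be written as a membership test. The sketch below uses base R on a made-up `species` vector (hypothetical values, loosely modelled on the misspellings above). One difference worth knowing: `%in%` never returns NA, so a missing species would be flagged as invalid unless it is excluded explicitly, whereas a != comparison inside filter() drops it.

```r
valid_species <- c("setosa", "versicolor", "virginica")

# Made-up entries: two valid, two misspelled, one missing.
species <- c("setosa", "settosa", "versi color", "virginica", NA)

# TRUE for anything that is not an exact match, including NA.
is_invalid <- !(species %in% valid_species)

# Restrict to entries that are present but misspelled.
is_misspelled <- is_invalid & !is.na(species)

print(sum(is_invalid))
print(sum(is_misspelled))
```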
All of the measurements are supposed to be positive, so the code below finds the observations in which any measurement is less than 0 (a value of exactly 0 is not flagged by this check). The nrow() function is then used to count these observations.
filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0)
## # A tibble: 8 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.0 -3.0 3.5 1.0 versicolor
## 2 -6.1 3.0 4.6 1.4 versicolor
## 3 6.1 2.9 4.7 -1.4 versicolor
## 4 6.7 3.0 -5.2 2.3 virginica
## 5 5.7 3.0 4.2 -1.2 versicolor
## 6 -5.6 3.0 4.1 1.3 versicolor
## 7 5.1 3.3 -1.7 0.5 setosa
## 8 6.6 -3.0 4.4 1.4 versicolor
nrow(filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0))
## [1] 8
Another rule of the dataset is that Petal.Length must be at least twice the Petal.Width. The following code finds the observations that break this rule and then counts them.
filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width))
## # A tibble: 5 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 NA 2.8 0.820 1.3 versicolor
## 2 5.1 3.8 0.000 0.2 setosa
## 3 6.7 3.0 -5.200 2.3 virginica
## 4 5.5 NA 0.925 1.0 versicolor
## 5 5.1 3.3 -1.700 0.5 setosa
nrow(filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width)))
## [1] 5
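A comparison that involves a missing value evaluates to NA rather than TRUE or FALSE, and filter() keeps only the rows where the condition is TRUE. Observations with a missing Petal.Length or Petal.Width are therefore neither shown nor counted above. The following base R sketch, on made-up vectors, separates the rows that break the rule from the rows where the rule cannot be evaluated.

```r
# Made-up measurements; the third width is missing.
petal_length <- c(1.4, 0.82, 4.0, -1.7)
petal_width  <- c(0.2, 1.3,  NA,  0.5)

# TRUE = rule broken, FALSE = rule satisfied, NA = cannot be evaluated.
breaks_rule <- petal_length < 2 * petal_width

n_broken  <- sum(breaks_rule, na.rm = TRUE)  # rows that clearly break the rule
n_unknown <- sum(is.na(breaks_rule))         # rows where a value is missing

print(n_broken)
print(n_unknown)
```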
An additional rule of the dataset is that Sepal.Length cannot exceed 30 cm. The following code again finds the observations that break this rule and counts them.
filter(iris1, iris1$Sepal.Length > 30)
## # A tibble: 2 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 73 29 63 NA virginica
## 2 49 30 14 2 setosa
nrow(filter(iris1, iris1$Sepal.Length > 30))
## [1] 2
The last rule of the dataset is that Sepal.Length should always be longer than Petal.Length. As above, the code finds and then counts the observations that break this rule.
filter(iris1, iris1$Sepal.Length < iris1$Petal.Length)
## # A tibble: 4 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 6.6 2.9 23.0 1.3 versicolor
## 2 0.0 NA 1.3 0.4 setosa
## 3 -6.1 3.0 4.6 1.4 versicolor
## 4 -5.6 3.0 4.1 1.3 versicolor
nrow(filter(iris1, iris1$Sepal.Length < iris1$Petal.Length))
## [1] 4
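The following code adds up the counts from the nrow() commands above and stores the sum in the errors variable. An observation that breaks more than one rule is counted once per rule (for example, the observation with Petal.Length -5.2 appears in both the negative-value output and the petal-rule output above), so errors is the total number of rule violations.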
errors <- nrow(filter(iris1, iris1$Species != "setosa" & iris1$Species != "virginica" & iris1$Species != "versicolor")) +
  nrow(filter(iris1, Sepal.Length<0 | Sepal.Width<0 | Petal.Length<0 | Petal.Width<0)) +
  nrow(filter(iris1, iris1$Petal.Length < (2*iris1$Petal.Width))) +
  nrow(filter(iris1, iris1$Sepal.Length > 30)) +
  nrow(filter(iris1, iris1$Sepal.Length < iris1$Petal.Length))
errors
## [1] 30
This code places the total observations into the variable total.
total <- nrow(iris1)
total
## [1] 150
This then calculates the proportion of observations that do not have errors, with each counted violation treated as a separate observation.
(total - errors)/total
## [1] 0.8
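Summing per-rule counts and counting distinct observations can give different answers when one observation breaks several rules. The sketch below shows the difference in base R on a hypothetical four-row data frame `toy`; the values are illustrative and the variable names (`negative`, `petal_rule`, `too_long`, `sepal_shorter`) are introduced here only for the example.

```r
# Hypothetical stand-in for iris1 with no missing values.
toy <- data.frame(
  Sepal.Length = c(5.1, -6.1, 6.7, 49),
  Petal.Length = c(1.4,  4.6, -5.2, 14),
  Petal.Width  = c(0.2,  1.4,  2.3,  2)
)

# One logical vector per rule: TRUE where the rule is broken.
negative      <- with(toy, Sepal.Length < 0 | Petal.Length < 0 | Petal.Width < 0)
petal_rule    <- with(toy, Petal.Length < 2 * Petal.Width)
too_long      <- with(toy, Sepal.Length > 30)
sepal_shorter <- with(toy, Sepal.Length < Petal.Length)

# Total violations: a row breaking two rules is counted twice.
n_violations <- sum(negative) + sum(petal_rule) + sum(too_long) + sum(sepal_shorter)

# Distinct observations that break at least one rule.
n_bad_rows <- sum(negative | petal_rule | too_long | sepal_shorter)

# Proportion of observations that break no rule.
prop_clean <- (nrow(toy) - n_bad_rows) / nrow(toy)

print(n_violations)
print(n_bad_rows)
print(prop_clean)
```

In this toy example the second and third rows each break two rules, so the violation count exceeds the number of affected observations.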