Cleaning Up The Flowerbed

The dataset we were given was not clean. To clean it up, we applied a series of filters, which are shown below. First, however, we had to make sure that our data imported properly.

library(tidyverse)  # provides read_csv() (readr) and filter()/count() (dplyr)
iris = read_csv("C:/Users/keppmatt/SkyDrive/Documents/Documents/Documents/Data Analytics/R/HW6-irisDataset.csv", trim_ws = FALSE)

Parsed with column specification:
cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_character()
)

class(iris$Sepal.Length)

[1] "numeric"

class(iris$Sepal.Width)

[1] "numeric"

class(iris$Petal.Length)

[1] "numeric"

class(iris$Petal.Width)

[1] "numeric"

class(iris$Species)

[1] "character"

As you can see, the length and width variables are all numeric, and Species is stored as a character vector rather than as a factor (R's type for categorical variables).
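The same check can be done in one call, and the species column can be converted to a categorical type if later steps need one. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the assignment data) and base R only.

```r
# Hypothetical two-row stand-in for the imported data
toy <- data.frame(
  Sepal.Length = c(5.1, 4.9),
  Species      = c("setosa", "virginica"),
  stringsAsFactors = FALSE
)

# sapply() applies class() to every column and returns a named vector
col_classes <- sapply(toy, class)
print(col_classes)

# factor() turns the character column into a categorical variable
toy$Species <- factor(toy$Species)
print(class(toy$Species))
```

We kept Species as character in our own analysis, since the spelling rule below compares raw strings.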

After this, we looked for special values and replaced any Inf with NA, because an infinite measurement is assumed to be an error in this dataset.

is.na(iris) <- do.call(cbind,lapply(iris, is.infinite))
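To see what this line does, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical). `is.infinite()` returns FALSE for every element of a character column, so only numeric cells holding Inf or -Inf are flagged.

```r
# Hypothetical toy data, not the assignment dataset
toy <- data.frame(
  Sepal.Length = c(5.1, Inf, 4.7),
  Petal.Length = c(1.4, 1.3, -Inf),
  Species      = c("setosa", "setosa", "virginica"),
  stringsAsFactors = FALSE
)

# Logical matrix with one column per variable, TRUE where a cell is infinite
inf_mask <- do.call(cbind, lapply(toy, is.infinite))

# Assigning the mask to is.na() sets every flagged cell to NA
is.na(toy) <- inf_mask

toy_na_count <- sum(is.na(toy))
print(toy)
```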

The next step of our analysis was to determine what proportion of the entries were complete. To do this, we used complete.cases(), which returns TRUE for each observation that has a value in every column and FALSE otherwise; summary() then tallies the TRUE and FALSE results.

x=complete.cases(iris)
summary(x)

   Mode   FALSE    TRUE    NA's 
logical      55      95       0

y=table(x)
w=y[names(y)=="TRUE"]

Dividing the number of TRUEs by the total number of rows gives the proportion of complete observations (about 63%):

z=length(x)
(w/z)

     TRUE 
0.6333333
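Because complete.cases() returns a logical vector and R treats TRUE as 1 in arithmetic, the same proportion can also be computed with mean(). This is an alternative sketch on hypothetical data, not the code we ran.

```r
# Hypothetical toy data with two incomplete rows out of four
toy <- data.frame(
  Sepal.Length = c(5.1, NA, 4.7, 6.0),
  Petal.Length = c(1.4, 1.3, NA, 4.5)
)

complete <- complete.cases(toy)   # TRUE where a row has no NA

# The mean of a logical vector is the share of TRUE values
prop_complete <- mean(complete)
print(prop_complete)
```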

The next step was to check the rules that the flower measurements must satisfy and determine how often each rule was broken.

Rule 1:

The first rule dealt with the Species variable and the spelling of each name. To check it, we filtered for the entries that were not equal to any of the correct spellings. At first we found 11 entries and spent a considerable amount of time trying to match the answer we knew was correct from Excel. We eventually found that when importing our data, we had to uncheck a box that was trimming the spaces off of all data points (this is the trim_ws = FALSE argument in the import call). After fixing this problem, we found the following results.

mi=filter(iris, Species!="setosa" & Species!="virginica" & Species!="versicolor")
count(mi)

# A tibble: 1 × 1
      n
  <int>
1    14
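The effect of whitespace trimming on this count can be illustrated with a small sketch. The `species` vector below is hypothetical, and `%in%` with `trimws()` is used here only for illustration; it is not the filter we ran. Note that `%in%` returns FALSE for NA, so unlike the `!=` filter this form would count a missing species as misspelled.

```r
# Hypothetical species entries: two have stray spaces, one has a capital letter
species <- c("setosa", " setosa", "virginica ", "versicolor", "Setosa")
valid   <- c("setosa", "virginica", "versicolor")

# Without trimming, entries with stray spaces do not match a valid spelling
bad_untrimmed <- sum(!(species %in% valid))

# After trimming, only the capitalised entry is still flagged
bad_trimmed <- sum(!(trimws(species) %in% valid))

print(c(untrimmed = bad_untrimmed, trimmed = bad_trimmed))
```

This is why trimming on import hid some of the misspellings: entries that differed only by leading or trailing spaces looked correct once the spaces were removed.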

Rule 2:

The second rule was that all numerical entries should be positive and measured in centimeters, so we filtered for any measurement less than or equal to zero and found the following results.

me=filter(iris,Sepal.Length<=0|Sepal.Width<=0|Petal.Length<=0|Petal.Width<=0)
count(me)

# A tibble: 1 × 1
      n
  <int>
1    11

Rule 3:

The third rule relates petal length to petal width: the petal length should be at least twice the petal width. We ran another filter to count the observations that break this rule.

mn=filter(iris,Petal.Length<2*Petal.Width)
count(mn)

# A tibble: 1 × 1
      n
  <int>
1     5

Rule 4:

The fourth rule required a maximum sepal length of 30 centimeters, so we filtered for sepal lengths above 30 and found the following results.

de=filter(iris,Sepal.Length>30)
count(de)

# A tibble: 1 × 1
      n
  <int>
1     2

Rule 5:

The final rule was that the sepals of an iris should always be longer than its petals, so we filtered for observations where the sepal length was less than or equal to the petal length and found the following number of errors.

nc=filter(iris,Sepal.Length<=Petal.Length)
count(nc)

# A tibble: 1 × 1
      n
  <int>
1     4

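The final step of our analysis was to determine the percentage of observations that had no errors at all. A single observation can break more than one rule, so the per-rule counts above (14 + 11 + 5 + 2 + 4 = 36) cannot simply be added together. Instead, we combined our filters into one long filter to count the observations that break at least one rule.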

# & binds more tightly than |, so the parentheses only make the grouping explicit
mz=filter(iris,Sepal.Length<=Petal.Length|Sepal.Length>30|
                 Petal.Length<2*Petal.Width|Sepal.Length<=0|
                 Sepal.Width<=0|Petal.Length<=0|Petal.Width<=0|
                 (Species!="setosa"& Species!="virginica"& Species!="versicolor"))
ia=count(mz)
ia

# A tibble: 1 × 1
      n
  <int>
1    30
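The combined count is smaller than the sum of the individual rule counts because some rows break several rules at once. A minimal base R sketch of that overlap, using hypothetical measurements and only three of the rules:

```r
# Hypothetical toy measurements in cm; not the assignment data
toy <- data.frame(
  Sepal.Length = c(5.1, 35.0, 1.0, 6.3),
  Petal.Length = c(1.4,  1.5, 1.2, 4.9),
  Petal.Width  = c(0.2,  0.3, 0.9, 1.5)
)

# One logical vector per rule: TRUE means the row breaks that rule
petal_rule <- toy$Petal.Length < 2 * toy$Petal.Width   # Rule 3
too_long   <- toy$Sepal.Length > 30                    # Rule 4
sepal_rule <- toy$Sepal.Length <= toy$Petal.Length     # Rule 5

# Adding the per-rule counts counts row 3 twice
per_rule_total <- sum(petal_rule) + sum(too_long) + sum(sepal_rule)

# Combining with | counts each offending row once
rows_flagged <- sum(petal_rule | too_long | sepal_rule)

print(c(per_rule_total = per_rule_total, rows_flagged = rows_flagged))
```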

From here, we divided by the total number of entries once again to find the proportion of error-free observations (0.8, or 80%), which is shown below.

1-(ia/nrow(iris))

    n
1 0.8

One final consideration matters when reading our results. All of them assume that a missing value cannot count as a rule violation: a comparison involving NA evaluates to NA rather than TRUE, and filter() keeps only rows where the condition is TRUE. If NAs were instead treated as violations, the code would change significantly, because each filter would need an additional "or is NA" condition so that missing values are counted as well.
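A minimal sketch of that difference, using the sepal length rule on hypothetical data. It is written in base R with is.na() as the "or NA" condition; this is one possible way to count missing values as violations, not something we ran on the assignment data.

```r
# Hypothetical sepal lengths in cm: one missing, one above the 30 cm limit
toy <- data.frame(Sepal.Length = c(5.1, NA, 35.0, 4.7))

# The comparison yields NA for the missing value
flag <- toy$Sepal.Length > 30

# Ignoring NA, which mirrors how filter() drops rows whose condition is NA
ignoring_na <- sum(flag, na.rm = TRUE)

# Counting NA as a violation: TRUE | NA evaluates to TRUE, so the row is kept
counting_na <- sum(is.na(toy$Sepal.Length) | flag)

print(c(ignoring_na = ignoring_na, counting_na = counting_na))
```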