Summary

  1. Most of the outliers from the Service_Number variable have Sow_Date_BreedingHerd_Available values from 1951.

  2. Those farms with few services have either a very high or very low Service_Count_Matings mean and a low mean for the SowParity_Number variable.

  3. It appears that the farms that are outliers (Sow Parity) either have low or high means for the Service_Count_Matings variable.

  4. The SowParity_Number variable has 1 observation with a value of 100 and 11 observations with a value of 20 or more. I do not know if that is possible considering that pigs live between 15 and 20 years.

  5. There are 5 observations of the Sow_Date_Mating_First variable from the year 1961 and a few others from the early 2000s.

  6. For the variable Sow_Date_BreedingHerd_Available there are 20,869 observations from the year 1951 and a small number from the 1960s, 1970s and 1990s.

  7. The FarrowingRate variable has many outliers. However, most of them are from the year 2020.

  8. Most of the farms with few services have a very low farrowing rate mean. All outliers for the farrowing rate are also outliers for the number of services.

Observations

Service Count Mating

1 and 2 are by far the most common values. A bar graph makes the differences easier to see.

Let’s look for outliers.

Let’s look at the percentage of outliers per farm.

The percentages of outliers per farm are too small to find any kind of pattern.
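The notes above do not say how the outliers were defined. As a minimal sketch, assuming the usual boxplot rule (values beyond 1.5 times the interquartile range from the quartiles), the percentage of outliers per farm could be computed like this. The `services` data frame and its values are made up for illustration.

```r
# Toy data standing in for the real service records (values are made up).
services <- data.frame(
  Site_Code = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  Service_Count_Matings = c(1, 2, 2, 9, 1, 1, 2, 2, 2, 3)
)

# Flag values outside the 1.5 * IQR fences (assumed outlier definition).
is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# The fences are computed on the whole variable, not per farm.
services$outlier <- is_iqr_outlier(services$Service_Count_Matings)

# Percentage of flagged services within each farm.
pct_outliers <- tapply(services$outlier, services$Site_Code, mean) * 100
print(round(pct_outliers, 1))
```

Computing the fences on the whole variable keeps one definition of "outlier" for every farm, so the per-farm percentages are comparable.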

Service_Date_Mating_First

There are no unusual trends that could indicate outliers. In many years the numbers drop in the last week of December or the first week of January. There is a clear upward trend from the summer of 2018 to date. There is a small negative trend from June 2020 to January 2021, perhaps attributable to COVID-19.

Service_IsRepeat

As you can see, most of the services are not repeats. Non-repeat services outnumber repeat services by a ratio of 10.2 to 1.
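For reference, here is a small sketch of how such a ratio can be obtained, assuming it is the count of non-repeat services divided by the count of repeat services. The `is_repeat` vector is made up and only chosen so that the arithmetic is easy to follow.

```r
# Made-up flags; TRUE marks a repeat service.
is_repeat <- c(rep(FALSE, 51), rep(TRUE, 5))

# Frequency of each value.
print(table(is_repeat))

# Ratio of non-repeat to repeat services.
ratio <- sum(!is_repeat) / sum(is_repeat)
print(ratio)
```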

Service_Number

Looking at these values it seems that perhaps there are some observations that are outliers. Let’s look at those observations that have more than 10 services. Specifically, let’s look at the farms where these come from.

What is striking is that for the variable Sow_Date_BreedingHerd_Available most of these observations are from the year 1951, with the exception of the 2 observations from farms BN1S and BP1S.
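A check of this kind can be sketched as follows: keep the rows with more than 10 services and cross-tabulate farm against the year of Sow_Date_BreedingHerd_Available. The data frame, the farm codes and the dates are hypothetical, and the date column is assumed to be stored as a `Date`.

```r
# Hypothetical records; farm codes and dates are invented for illustration.
services <- data.frame(
  Site_Code = c("AA1S", "AA1S", "BB1S", "CC1S", "CC1S"),
  Service_Number = c(12, 3, 15, 11, 1),
  Sow_Date_BreedingHerd_Available = as.Date(
    c("1951-01-01", "2018-05-14", "1951-01-01", "2019-03-02", "2020-07-21")
  )
)

# Keep only the observations with more than 10 services.
high <- services[services$Service_Number > 10, ]

# Extract the year in which the sow became available to the breeding herd.
high$available_year <- as.integer(
  format(high$Sow_Date_BreedingHerd_Available, "%Y")
)

# Cross-tabulate farm against availability year for the high-service rows.
print(table(high$Site_Code, high$available_year))
```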

Site_Code

There are 8 farms that can be considered outliers. Let’s look at their frequencies.

I want to see if there are substantial differences between farms. With that goal, I will start by analyzing the variable Service_Count_Matings by Site_Code.

It is interesting to note that the outliers are usually at the extremes. In other words, those farms with few observations have either a very high or very low Service_Count_Matings mean.

Now, I will do the same for the mean Service_Number

There is no clear pattern. The only thing that stands out is that those farms with a Service_Number mean of 1 are all outliers.

Now with SowParity_Number

There is a clear pattern. Most outliers have a low mean for the SowParity_Number variable.
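The three farm-level comparisons above can be put side by side in one table. This is only a sketch with made-up data. The cutoff of fewer than 3 services used to mark a farm as having few observations is an arbitrary assumption for the toy data, not the rule used in this analysis.

```r
# Made-up records for three farms; farm C has a single service.
services <- data.frame(
  Site_Code = c("A", "A", "A", "A", "B", "B", "B", "B", "C"),
  Service_Count_Matings = c(2, 2, 1, 3, 2, 1, 2, 3, 5),
  Service_Number = c(1, 2, 1, 3, 1, 1, 2, 2, 1),
  SowParity_Number = c(3, 4, 2, 5, 1, 6, 3, 4, 0)
)

# Mean of each variable by farm.
farm_means <- aggregate(
  cbind(Service_Count_Matings, Service_Number, SowParity_Number) ~ Site_Code,
  data = services, FUN = mean
)

# Number of services per farm, aligned with the rows of farm_means.
n_services <- table(services$Site_Code)
farm_means$n_services <- as.vector(n_services[as.character(farm_means$Site_Code)])

# Mark farms with few observations (assumed cutoff, chosen for the toy data).
farm_means$few_services <- farm_means$n_services < 3

print(farm_means)
```

Keeping the number of services next to the means makes it easier to tell whether an extreme mean comes from a farm with very few observations.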

SowParity_Number

There’s one observation with a SowParity_Number value of 100. There are others with 20 or more. I don’t know if that is possible considering how many years pigs usually live.
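A quick way to count these suspicious values is to bin the parity numbers. The sketch below uses a made-up `parity` vector and assumes parity is recorded as whole numbers. The bands (below 20, 20 to 99, 100 or more) simply mirror the values mentioned above.

```r
# Made-up parity values, including a few implausibly high ones.
parity <- c(0, 1, 3, 7, 12, 20, 24, 100)

# Bin the values; with whole numbers the breaks give: <= 19, 20-99, >= 100.
parity_band <- cut(
  parity,
  breaks = c(-Inf, 19, 99, Inf),
  labels = c("below 20", "20 to 99", "100 or more")
)
band_counts <- table(parity_band)
print(band_counts)

# The observations to review by hand.
suspicious <- parity[parity >= 20]
print(suspicious)
```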

Let’s take a look at the mean of the sow parity by farm.

Most of the farms that are outliers correspond to farms with very few services.

Now the same graph but using Service_Count_Matings

It appears that the farms that are outliers (Sow Parity) either have low or high means for the Service_Count_Matings variable.

Sow_Date_Mating_First

  1. There are very few observations before 2012.
  2. There is a peak in December 2016 - January 2017.
  3. Since 2019 there is a downward trend.
  4. This type of distribution perhaps makes sense if we consider how long pigs live.

There are 5 observations from 1961 and a few from the early 2000s. Most likely, they will need to be eliminated.

Sow_Date_BreedingHerd_Available

The big difference between this graph and the previous one is that there is a peak in the 1950s.

There are many old observations.

Response Data

Site_Code

Let’s look at the number of services per farm.

There are several farms with few services. Let’s look at the farrowing rate mean by location.

Let’s look at a scatterplot with the mean of the farrowing rate versus the number of services per farm.

Although there is no linear relationship between the variables, it can be seen that most of the farms with few services have a very low farrowing rate mean. All outliers for the farrowing rate are also outliers for the number of services.
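A farm-level summary like the one behind this scatterplot can be sketched as follows. The data are invented, and I am assuming one row per service with a FarrowingRate value on each row, which may not match how the real data are stored.

```r
# Made-up records: farms A and B have several services, C and D very few.
services <- data.frame(
  Site_Code = c(rep("A", 6), rep("B", 5), rep("C", 2), "D"),
  FarrowingRate = c(0.82, 0.85, 0.80, 0.88, 0.79, 0.84,
                    0.78, 0.81, 0.83, 0.86, 0.80,
                    0.30, 0.40,
                    0.00)
)

# Number of services and mean farrowing rate per farm.
n_services <- tapply(services$FarrowingRate, services$Site_Code, length)
mean_rate <- tapply(services$FarrowingRate, services$Site_Code, mean)

farm_summary <- data.frame(
  Site_Code = names(n_services),
  n_services = as.vector(n_services),
  mean_farrowing_rate = as.vector(mean_rate)
)
print(farm_summary)

# Scatterplot of mean farrowing rate against number of services per farm.
plot(farm_summary$n_services, farm_summary$mean_farrowing_rate,
     xlab = "Number of services", ylab = "Mean farrowing rate")
text(farm_summary$n_services, farm_summary$mean_farrowing_rate,
     labels = farm_summary$Site_Code, pos = 3)
```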

FarrowingRate

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.7636  0.8246  0.7889  0.8655  0.9863 

Let’s look at the proportions of outliers per farm

The proportion of outliers is much lower in those farms with fewer observations. However, there are several farms with many services that still have many outliers (MLSS, WHSS, CLXS, NVNS).

Let’s take a look at the number of outliers per year.

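library(dplyr)
library(tidyr)

# Count the services with a farrowing rate below 0.60 (outliers) per year
x <- data1 %>%
  separate(ServiceYearWeek, into = c("year", "week"), sep = "-") %>%
  filter(FarrowingRate < 0.60) %>%
  count(year, sort = TRUE) %>%
  rename(Outliers = n)

# Show the counts in an interactive table with centered columns
DT::datatable(x, rownames = FALSE,
          options = list(
            columnDefs = list(list(className = 'dt-center', targets = 0:1))))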

Most of these outliers are from 2020, so they may be a product of the pandemic. Maybe it’s not good to delete them. Let’s do a chi-square test to be sure.

Given the p-value, it is possible to say that the number of outliers and the years are related. Let’s look at the residuals to have a better idea about what’s going on.

   
           2016        2017        2018        2019        2020        2021
  0 -0.35116697 -0.13679451  0.78403993  0.95485363 -1.17885258 -0.07825466
  1  1.32259468  0.51520703 -2.95291739 -3.59625035  4.43989410  0.29472929

As a rule of thumb, a residual above 2 means the cell has more observations than expected, and a residual below -2 means it has fewer than expected. Looking at the outlier row (1) of this table, we can say that for 2018 and 2019 we have fewer outliers than expected. As for the year 2020, we have more outliers than expected.
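For completeness, here is a sketch of how a test and residual table of this kind can be produced with `chisq.test()`. The counts are made up, so the numbers will not match the table above. The two rows of the table above differ in magnitude within each year, which is what Pearson residuals (`$residuals`) look like. In a table with two rows, the standardized residuals (`$stdres`) are equal in size and opposite in sign within each column.

```r
# Made-up counts: rows are the outlier flag (0 = no, 1 = yes), columns are years.
counts <- matrix(
  c(900, 950, 980, 990, 850, 400,
     60,  60,  40,  35, 130,  28),
  nrow = 2, byrow = TRUE,
  dimnames = list(outlier = c("0", "1"), year = as.character(2016:2021))
)

# Chi-square test of independence between outlier flag and year.
test <- chisq.test(counts)
print(test$p.value)

# Pearson residuals: (observed - expected) / sqrt(expected).
print(round(test$residuals, 2))

# Standardized residuals, which also adjust for the row and column totals.
print(round(test$stdres, 2))
```

The rule of thumb of 2 in absolute value is more accurate for the standardized residuals, since those are approximately standard normal when the two variables are independent.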

