Most of the outliers from the Service_Number variable have Sow_Date_BreedingHerd_Available values from 1951.
Farms with few services have either a very high or a very low Service_Count_Matings mean, and a low mean for the SowParity_Number variable.
The farms that are outliers for sow parity appear to have either low or high means for the Service_Count_Matings variable.
The variable SowParity_Number has 1 observation with a value of 100 and 11 observations with a value of 20 or more. I do not know if that is possible, considering that pigs live between 15 and 20 years.
There are 5 observations of the Sow_Date_Mating_First variable from the year 1961 and a few others from the early 2000s.
For the variable Sow_Date_BreedingHerd_Available, there are 20,869 observations from the year 1951 and a small number from the 1960s, 1970s, and 1990s.
The FarrowingRate variable has many outliers. However, most of them are from the year 2020.
Most of the farms with few services have a very low farrowing rate mean. All outliers for the farrowing rate are also outliers for the number of services.
1 and 2 are by far the most common values. The differences are easier to see in a bar graph.
Let’s look for outliers.
Let’s look at the percentage of outliers per farm.
The percentages of outliers per farm are too small to find any kind of pattern.
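For reference, the percentage of outliers per farm can be computed along these lines. This is a minimal sketch, not the code behind the numbers above: the data frame toy, the farm codes, and the values are made up, and the outlier rule (more than 1.5 times the IQR beyond the quartiles, computed over the whole data set) is an assumption, since the rule used is not stated here.

```r
library(dplyr)

# Hypothetical toy data: Site_Code and Service_Number are column names from
# this report, but the farm codes and values are invented for illustration.
toy <- data.frame(
  Site_Code      = rep(c("F1", "F2"), each = 10),
  Service_Number = c(rep(1, 9), 12, rep(c(1, 2), 5))
)

# Assumed rule: a value is an outlier if it lies more than 1.5 * IQR
# below the first quartile or above the third quartile.
q     <- quantile(toy$Service_Number, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])

pct_by_farm <- toy %>%
  mutate(is_outlier = Service_Number < q[1] - fence |
                      Service_Number > q[2] + fence) %>%
  group_by(Site_Code) %>%
  summarise(n = n(), pct_outliers = 100 * mean(is_outlier))

print(pct_by_farm)
```

Keeping the number of observations next to the percentage helps when judging farms with very few services, where a single observation moves the percentage a lot.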
No unusual trends that could indicate outliers are apparent. In many years, the numbers drop in the last week of December or the first week of January. There is a clear upward trend from the summer of 2018 to date. There is a small negative trend from June 2020 to January 2021, perhaps attributable to COVID-19.
As you can see, most of the services are not repetitions. They have a ratio of 10.2.
Looking at these values it seems that perhaps there are some observations that are outliers. Let’s look at those observations that have more than 10 services. Specifically, let’s look at the farms where these come from.
What is striking is that, for the variable Sow_Date_BreedingHerd_Available, most of the observations are from the year 1951, with the exception of the 2 observations from farms BN1S and BP1S.
There are 8 farms that can be considered outliers. Let’s look at their frequencies.
I want to see if there are substantial differences between farms. With that goal, I will start by analyzing the variable Service_Count_Matings by Site_Code.
It is interesting to note that the outliers are usually at the extremes. In other words, those farms with few observations have either a very high or very low Service_Count_Matings mean.
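A per-farm summary like the one described can be sketched as follows. The data frame toy, the farm codes, the values, and the cut-off of 5 observations for "few observations" are all hypothetical; only the column names Site_Code and Service_Count_Matings come from this report.

```r
library(dplyr)

# Made-up data: two farms with several services and two with a single one.
toy <- data.frame(
  Site_Code             = c(rep("A", 6), rep("B", 6), "C", "D"),
  Service_Count_Matings = c(2, 2, 3, 2, 3, 2,  2, 3, 3, 2, 2, 3,  1,  4)
)

farm_means <- toy %>%
  group_by(Site_Code) %>%
  summarise(n_obs        = n(),
            mean_matings = mean(Service_Count_Matings)) %>%
  mutate(few_obs = n_obs < 5) %>%   # hypothetical cut-off, not from the report
  arrange(mean_matings)             # extremes end up in the first and last rows

print(farm_means)
```

In this toy example the two single-observation farms land at the two ends of the ordering, which is the kind of pattern noted above: a mean based on very few observations is far more variable than one based on many.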
Now, I will do the same for the mean of Service_Number.
There is no clear pattern. The only thing that stands out is that the farms with a Service_Number mean of 1 are all outliers.
Now the same with SowParity_Number.
There is a clear pattern. Most outliers have a low mean for the SowParity_Number variable.
There’s one observation with a SowParity_Number value of 100. There are others with 20 or more. I don’t know if that is possible, considering how many years pigs usually live.
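A quick way to pull out these suspicious rows is sketched below. The data frame toy and its values are invented for illustration; the threshold of 20 is the one mentioned in the text, and whether such parities are biologically possible is still an open question.

```r
# Toy data (made up); only the column name SowParity_Number comes from the report.
toy <- data.frame(SowParity_Number = c(1, 2, 3, 5, 8, 20, 23, 100))

# Keep the rows with a parity of 20 or more for manual inspection.
suspicious <- toy[toy$SowParity_Number >= 20, , drop = FALSE]

n_suspicious <- nrow(suspicious)            # number of rows to review
max_parity   <- max(toy$SowParity_Number)   # most extreme value in the column

print(suspicious)
```

Listing the rows rather than dropping them right away keeps the decision about deleting or correcting them separate from the detection step.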
Let’s take a look at the mean of the sow parity by farm.
Most of the farms that are outliers correspond to farms with very few services.
Now the same graph, but using Service_Count_Matings.
The farms that are outliers for sow parity appear to have either low or high means for the Service_Count_Matings variable.
There are 5 observations from 1961 and a few from the early 2000s. Most likely, they will need to be eliminated.
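Counting observations by year and isolating the old ones can be sketched like this. The dates in toy are made up, the column is assumed to be stored as (or convertible to) a Date, and the cut-off year 2010 is a hypothetical choice rather than one taken from this analysis.

```r
# Made-up dates; only the column name Sow_Date_Mating_First comes from the report.
toy <- data.frame(
  Sow_Date_Mating_First = as.Date(c("1961-03-01", "1961-07-15", "2003-05-20",
                                    "2019-11-02", "2020-01-10", "2020-06-30"))
)

# Extract the calendar year and count observations per year.
toy$year    <- as.integer(format(toy$Sow_Date_Mating_First, "%Y"))
year_counts <- table(toy$year)

# Hypothetical cut-off: anything before this year is treated as suspect.
cutoff_year <- 2010
suspect     <- toy[toy$year < cutoff_year, , drop = FALSE]

print(year_counts)
print(suspect)
```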
The big difference between this graph and the previous one is that there is a peak in the 1950s.
There are many old observations.
Looking at the number of services per farm.
There are several farms with few services. Let’s look at the farrowing rate mean by location.
Let’s look at a scatterplot with the mean of the farrowing rate versus the number of services per farm.
Although there is no linear relationship between the variables, it can be seen that most of the farms with few services have a very low farrowing rate mean. All outliers for the farrowing rate are also outliers for the number of services.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.7636  0.8246  0.7889  0.8655  0.9863
Let’s look at the proportions of outliers per farm.
The proportion of outliers is much lower in farms with fewer observations. However, many farms with many services still have many outliers (MLSS, WHSS, CLXS, NVNS).
Let’s take a look at the number of services per year.
library(dplyr); library(tidyr)
# Count observations with FarrowingRate < 0.60 per service year
x <- data1 %>%
  separate(ServiceYearWeek, into = c("year", "week"), sep = "-") %>%
  filter(FarrowingRate < 0.60) %>%
  count(year, sort = TRUE, name = "Outliers")
DT::datatable(x, rownames = FALSE,
  options = list(columnDefs = list(list(className = 'dt-center', targets = 0:1))))
Most of the observations are from 2020, so these outliers may be a product of the pandemic, and deleting them may not be a good idea. Let’s do a chi-square test to be sure.
Given the p-value, we can say that the number of outliers and the year are related. Let’s look at the residuals to get a better idea of what is going on.
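A chi-square test of independence between outlier status and year can be sketched as follows. The counts are made up purely to show the mechanics and are not the counts from this data set; the row labels 0 and 1 follow the coding used in this report (0 = not an outlier, 1 = outlier).

```r
# Invented contingency table: rows = outlier flag, columns = year.
counts <- matrix(
  c(900, 950, 800,
    100,  50, 200),
  nrow = 2, byrow = TRUE,
  dimnames = list(outlier = c("0", "1"), year = c("2018", "2019", "2020"))
)

test <- chisq.test(counts)

test$p.value     # a small p-value suggests outlier status depends on the year
test$expected    # counts expected if outlier status and year were independent
test$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
```

chisq.test() also returns standardized residuals in test$stdres. Which of the two this analysis reports is not stated, so the choice here is an assumption.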
         2016        2017        2018        2019        2020        2021
0 -0.35116697 -0.13679451  0.78403993  0.95485363 -1.17885258 -0.07825466
1  1.32259468  0.51520703 -2.95291739 -3.59625035  4.43989410  0.29472929
As a rule of thumb, a residual above 2 means the cell has more observations than expected, and a residual below -2 means it has fewer than expected. By that rule, 2018 and 2019 have fewer outliers than expected, and 2020 has more outliers than expected.
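The rule of thumb can be applied mechanically to the residuals shown above. The sketch below copies those values into a matrix (the object names res and flag_tbl are only illustrative) and lists the cells whose absolute residual exceeds 2.

```r
# Residuals copied from the table above (rows: 0 = not an outlier, 1 = outlier).
res <- matrix(
  c(-0.35116697, -0.13679451,  0.78403993,  0.95485363, -1.17885258, -0.07825466,
     1.32259468,  0.51520703, -2.95291739, -3.59625035,  4.43989410,  0.29472929),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), as.character(2016:2021))
)

# Row and column positions of the cells beyond the +/- 2 threshold.
flagged <- which(abs(res) > 2, arr.ind = TRUE)

flag_tbl <- data.frame(
  outlier   = rownames(res)[flagged[, 1]],
  year      = colnames(res)[flagged[, 2]],
  residual  = res[flagged],
  direction = ifelse(res[flagged] > 0, "more than expected", "fewer than expected")
)

print(flag_tbl)
```

Only the outlier row (1) is flagged, for 2018, 2019, and 2020, which matches the reading given above.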