Task 1: Data Mapping and Understanding

Shows the structure of each dataframe

## tibble [5,289 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Calf_ID                : num [1:5289] 31570 31754 32338 27926 31395 ...
##  $ Pen                    : num [1:5289] 2 8 8 4 4 2 4 1 4 4 ...
##  $ Milk_Consumption_Liters: num [1:5289] 2 1.4 0.1 13.9 1.7 3.1 5.6 13.9 8 8.5 ...
##  $ Days_consuming_milk    : num [1:5289] 1 1 1 2 2 2 3 3 3 3 ...
## tibble [4,890 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Calf_ID     : num [1:4890] 18714 18715 18716 18718 18719 ...
##  $ Birthdate   : chr [1:4890] "44863" "44864" "44869" "44873" ...
##  $ BW_date     : chr [1:4890] "44863" "44864" "44869" "44873" ...
##  $ Birth_Weight: num [1:4890] 87 84 86 81 283 322 463 92 90 223 ...
## tibble [4,895 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Calf_ID            : num [1:4895] 18714 18715 18718 18719 18720 ...
##  $ Birthdate          : chr [1:4895] "44863" "44864" "44873" "44877" ...
##  $ Days_of_age        : num [1:4895] 6 5 12 8 7 28 28 13 12 8 ...
##  $ Serum_Total_Protein: num [1:4895] 6.4 5 6 6 6 6 6.4 6.2 6.4 6 ...

Task 2: Outlier Evaluation Graphs

Task 3: Outlier Analysis

1. Explain your choice of graph type for exploring outliers:

I used a histogram to check each variable for outliers. A histogram shows the full distribution of the data, so any values that fall far outside the main body of the distribution stand out immediately.
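As a rough illustration of that approach, here is a minimal sketch using base R's `hist()`. The data are made up for the example (they are not the assignment data); only the large values 283, 322, and 463 are borrowed from the Birth_Weight structure output in Task 1.

```r
# Toy birth weights (made-up values, not the assignment data):
# mostly plausible weights plus a few implausibly large entries.
set.seed(1)
birth_weight <- c(round(rnorm(200, mean = 88, sd = 8)), 283, 322, 463)

# Draw the histogram; isolated bars far to the right are outlier candidates.
h <- hist(birth_weight, breaks = 30,
          main = "Birth_Weight (toy data)", xlab = "Birth weight")

# The same information numerically: midpoints of the bins that hold
# at least one value. Gaps between midpoints show where the data break off.
occupied <- h$mids[h$counts > 0]
print(occupied)
```

The gap between the last "ordinary" bin and the few occupied bins far to the right is what makes the outliers visible in the plot.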

2. Do you consider any of the data points outliers?

Yes, I consider some of the data points to be outliers. The histogram for Milk_Consumption_Liters showed a roughly normal distribution, which suggests that there are few if any outliers. Birth_Weight and Serum_Total_Protein, however, showed clear outliers. Birth_Weight included values that would be biologically impossible and that lie far from the bulk of the data. Serum_Total_Protein produced an odd-looking histogram: one or more extreme values appear to stretch the x-axis so much that the rest of the data collapses into what looks like a single straight line.

3. How did you come to that conclusion?

I concluded that some of the Birth_Weight data points were outliers because they fell far outside the main body of the distribution. They were also very high, 200 lbs and above (the structure output in Task 1 shows values such as 283, 322, and 463), which seems biologically impossible for a birth weight.
I concluded that some of the Serum_Total_Protein data points were outliers because the histogram showed what looks like a single straight line near 0, while the x-axis runs from 0 to 30000. This suggests that almost all of the values sit at the low end of the axis and that one or two data points around 30000 are stretching the range and dominating the shape of the histogram.
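The effect of a single extreme value can also be checked numerically. The sketch below uses made-up serum total protein values plus one extreme entry at 30000, and flags outliers with Tukey's 1.5 × IQR rule. This rule is a common convention brought in for illustration, not the method used in the answers above.

```r
# Toy serum total protein values (made up): typical readings plus one
# extreme entry, mimicking an x-axis that stretches to about 30000.
set.seed(2)
stp <- c(round(rnorm(300, mean = 6, sd = 0.5), 1), 30000)

# The median barely notices the extreme value, while the max and mean do.
print(summary(stp))

# Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles.
q <- quantile(stp, c(0.25, 0.75))
fence <- 1.5 * IQR(stp)
flagged <- stp[stp < q[1] - fence | stp > q[2] + fence]
print(flagged)
```

Because the median and quartiles are insensitive to a single extreme point, this check still isolates the 30000 entry even though it dominates the histogram's x-axis.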

4. What would you do next if there were outliers?

For the birth weights, I would filter the data to include only data points within the range of biologically possible birth weights. I would then check whether this changes the distribution of the data.
For the serum total proteins, I would filter the data to include only the data points at the low end of the axis, relatively close to 0. The histogram suggests that most of the data points are there, while the 0 to 30000 range of the x-axis suggests a highly influential data point around 30000 that is almost certainly an outlier. I would then redraw the histogram to see whether removing that point gives a more normal distribution.
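A minimal sketch of that filtering step with `dplyr::filter()` follows. The data frames are toy data, and the cut-offs `bw_min`, `bw_max`, and `stp_max` are hypothetical values picked only for the example; the real thresholds would need to come from domain knowledge about calves.

```r
library(dplyr)

# Toy data (made up); column names follow the structure output in Task 1.
birth <- data.frame(
  Calf_ID      = 1:5,
  Birth_Weight = c(87, 84, 283, 322, 463)
)
serum <- data.frame(
  Calf_ID             = 1:4,
  Serum_Total_Protein = c(6.4, 5, 6, 30000)
)

# Hypothetical cut-offs, chosen only for illustration.
bw_min  <- 40
bw_max  <- 150
stp_max <- 15

# Keep only rows inside the assumed plausible ranges.
birth_filtered <- birth %>% filter(between(Birth_Weight, bw_min, bw_max))
serum_filtered <- serum %>% filter(Serum_Total_Protein <= stp_max)

print(birth_filtered)
print(serum_filtered)
```

Keeping the thresholds in named variables makes it easy to adjust them and rerun the histograms to see how sensitive the distribution is to the chosen range.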

Task 4: Data Filtering using dplyr

Shows the first few rows of each filtered dataframe

## # A tibble: 6 × 4
##   Calf_ID   Pen Milk_Consumption_Liters Days_consuming_milk
##     <dbl> <dbl>                   <dbl>               <dbl>
## 1   19721     1                    56.4                   9
## 2   30470     8                    52.5                  10
## 3   32284     1                    52.8                  11
## 4   28507     1                    52.1                  12
## 5   29426     3                    85.5                  15
## 6   30634     7                    80.5                  16
## # A tibble: 6 × 4
##   Calf_ID Birthdate BW_date Birth_Weight
##     <dbl> <chr>     <chr>          <dbl>
## 1   18714 44863     44863             87
## 2   18715 44864     44864             84
## 3   18716 44869     44869             86
## 4   18718 44873     44873             81
## 5   18752 44940     44940             92
## 6   18753 44946     44946             90
## # A tibble: 6 × 4
##   Calf_ID Birthdate Days_of_age Serum_Total_Protein
##     <dbl> <chr>           <dbl>               <dbl>
## 1   18714 44863               6                 6.4
## 2   18715 44864               5                 5  
## 3   18719 44877               8                 6  
## 4   18720 44878               7                 6  
## 5   18727 44911               8                 6  
## 6   18728 44913               6                 6.2

Task 5: Data Combination and Relationship Analysis

Describe the graph and explain the relationship:

The graph shows a generally positive linear relationship between Birth Weight and Milk Consumption: calves with higher birth weights tend to consume more milk. This makes sense biologically, since a bigger calf would need to consume more milk.
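One way to combine the tables and put a number on such a relationship is sketched below. It assumes the milk table can hold several rows per calf (it has more rows than the other two tables), so it totals consumption per calf before joining on `Calf_ID`. The data are made up, and the column `Total_Milk_Liters` is a name introduced here for illustration; how the data were actually combined for the graph is not shown in this report.

```r
library(dplyr)

# Toy data (made up), using the column names from the structure output.
milk <- data.frame(
  Calf_ID                 = c(1, 1, 2, 2, 3, 3, 4, 4),
  Milk_Consumption_Liters = c(4, 5, 5, 6, 6, 8, 7, 9)
)
birth <- data.frame(
  Calf_ID      = c(1, 2, 3, 4),
  Birth_Weight = c(80, 85, 90, 95)
)

# Collapse the milk records to one row per calf before joining.
milk_total <- milk %>%
  group_by(Calf_ID) %>%
  summarise(Total_Milk_Liters = sum(Milk_Consumption_Liters), .groups = "drop")

# Keep only calves present in both tables.
combined <- inner_join(birth, milk_total, by = "Calf_ID")

# Pearson correlation: values near +1 or -1 indicate a strong linear
# relationship, values near 0 a weak one.
r <- cor(combined$Birth_Weight, combined$Total_Milk_Liters)
print(r)
```

A correlation coefficient alongside the scatterplot gives a more concrete basis for calling a relationship strong or weak than the visual impression alone.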

Describe the graph and explain the relationship:

The graph shows that Milk Consumption is centered around 500 liters, but there isn’t a strong linear relationship between Serum Total Protein and Milk Consumption. This suggests that Milk Consumption has little linear association with Serum Total Protein.