Lab 3.2 Revisions

Fernando Gomez

Instructions

On your own, complete, at minimum, the following:

Load the datasetsICR package and open your specific dataset…
Look at first/last 10 rows, and the structure…
Get summary statistics…
Create tables for categorical variables…
Develop different versions of charts…
Select your final chart with all required features…

Objective

In this lab, my goal is to get familiar with exploratory data analysis and visualization using R and ggplot2, applying the principles from Chapter 3 of the assigned reading. I load and examine the assigned dataset from the datasetsICR package, develop an understanding of its variables, and run basic data inspection and summary statistics for selected numeric and categorical variables. I then create several different types of charts to decide which is the most effective visual representation of the data, following the practice outlined in Chapter 3. The expected result is a polished final visualization that includes appropriate titles, labels, color coding, a legend, a data source citation, and a concise interpretation.

    area perimeter compactness length_of_kernel width_of_kernel
1  15.26     14.84      0.8710            5.763           3.312
2  14.88     14.57      0.8811            5.554           3.333
3  14.29     14.09      0.9050            5.291           3.337
4  13.84     13.94      0.8955            5.324           3.379
5  16.14     14.99      0.9034            5.658           3.562
6  14.38     14.21      0.8951            5.386           3.312
7  14.69     14.49      0.8799            5.563           3.259
8  14.11     14.10      0.8911            5.420           3.302
9  16.63     15.46      0.8747            6.053           3.465
10 16.44     15.25      0.8880            5.884           3.505
   asymmetry_coefficient length_of_kernel_groove variety
1                  2.221                   5.220    Kama
2                  1.018                   4.956    Kama
3                  2.699                   4.825    Kama
4                  2.259                   4.805    Kama
5                  1.355                   5.175    Kama
6                  2.462                   4.956    Kama
7                  3.586                   5.219    Kama
8                  2.700                   5.000    Kama
9                  2.040                   5.877    Kama
10                 1.969                   5.533    Kama

     area perimeter compactness length_of_kernel width_of_kernel
201 12.38     13.44      0.8609            5.219           2.989
202 12.67     13.32      0.8977            4.984           3.135
203 11.18     12.72      0.8680            5.009           2.810
204 12.70     13.41      0.8874            5.183           3.091
205 12.37     13.47      0.8567            5.204           2.960
206 12.19     13.20      0.8783            5.137           2.981
207 11.23     12.88      0.8511            5.140           2.795
208 13.20     13.66      0.8883            5.236           3.232
209 11.84     13.21      0.8521            5.175           2.836
210 12.30     13.34      0.8684            5.243           2.974
    asymmetry_coefficient length_of_kernel_groove  variety
201                 5.472                   5.045 Canadian
202                 2.300                   4.745 Canadian
203                 4.051                   4.828 Canadian
204                 8.456                   5.000 Canadian
205                 3.919                   5.001 Canadian
206                 3.631                   4.870 Canadian
207                 4.325                   5.003 Canadian
208                 8.315                   5.056 Canadian
209                 3.598                   5.044 Canadian
210                 5.637                   5.063 Canadian

'data.frame':   210 obs. of  8 variables:
 $ area                   : num  15.3 14.9 14.3 13.8 16.1 ...
 $ perimeter              : num  14.8 14.6 14.1 13.9 15 ...
 $ compactness            : num  0.871 0.881 0.905 0.895 0.903 ...
 $ length_of_kernel       : num  5.76 5.55 5.29 5.32 5.66 ...
 $ width_of_kernel        : num  3.31 3.33 3.34 3.38 3.56 ...
 $ asymmetry_coefficient  : num  2.22 1.02 2.7 2.26 1.35 ...
 $ length_of_kernel_groove: num  5.22 4.96 4.83 4.8 5.17 ...
 $ variety                : Factor w/ 3 levels "Kama","Rosa",..: 1 1 1 1 1 1 1 1 1 1 ...
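The inspection output above can be reproduced with a few base R calls. This is a minimal sketch, assuming the datasetsICR package is installed and that the dataset is the one named `seeds`; the column-renaming step is my own assumption, added only so the names match the snake_case names shown in the output.

```r
# Assumes datasetsICR is installed: install.packages("datasetsICR")
library(datasetsICR)

data("seeds")  # wheat kernel measurements for three varieties

# Assumption: normalize column names to the snake_case form shown in the
# output above. This changes nothing if the names already use underscores.
names(seeds) <- gsub("[ .]+", "_", names(seeds))

head(seeds, 10)  # first 10 rows
tail(seeds, 10)  # last 10 rows
str(seeds)       # dimensions and variable types
```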

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.899   5.262   5.524   5.629   5.980   6.675

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.630   2.944   3.237   3.259   3.562   4.033
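The two summaries above are unlabeled. Judging by their ranges and the rows printed earlier, they appear to be kernel length and kernel width, but that is an inference on my part. A sketch of the calls that would produce summaries of this kind, under the same assumptions about the package and column names:

```r
# Assumes datasetsICR is installed and the dataset is named `seeds`
library(datasetsICR)
data("seeds")

# Assumption: snake_case column names, as in the str() output
names(seeds) <- gsub("[ .]+", "_", names(seeds))

# Five-number summary plus mean for each numeric variable of interest
length_summary <- summary(seeds$length_of_kernel)
width_summary  <- summary(seeds$width_of_kernel)

length_summary
width_summary
```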


    Kama     Rosa Canadian 
      70       70       70


     Kama      Rosa  Canadian 
0.3333333 0.3333333 0.3333333
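The count and proportion tables above can be built with `table()` and `prop.table()`. A minimal sketch, again assuming the `seeds` dataset from datasetsICR with a factor column called `variety`:

```r
# Assumes datasetsICR is installed and the dataset is named `seeds`
library(datasetsICR)
data("seeds")

# Frequency of each wheat variety
variety_counts <- table(seeds$variety)
variety_counts

# Same table expressed as proportions of all rows
variety_props <- prop.table(variety_counts)
variety_props
```

Equal counts in each group mean the dataset is balanced, so differences between varieties in the charts are not an artifact of group size.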

Chosen Chart

[Figure: scatterplot of kernel length and kernel width, colored by wheat variety]
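A sketch of how a chart like this could be built with ggplot2. The axis assignment (length on x, width on y), the exact color values, and the caption wording are my assumptions, not the original code; the title text is taken from the revised title quoted in the Improvements Made section.

```r
# Assumes ggplot2 and datasetsICR are installed
library(ggplot2)
library(datasetsICR)
data("seeds")

# Assumption: snake_case column names, as in the str() output
names(seeds) <- gsub("[ .]+", "_", names(seeds))

p <- ggplot(seeds, aes(x = length_of_kernel, y = width_of_kernel,
                       color = variety)) +
  geom_point(size = 2, alpha = 0.8) +
  # Illustrative colors only; chosen to match the colors named in Insights
  scale_color_manual(values = c(Kama = "forestgreen",
                                Rosa = "steelblue",
                                Canadian = "firebrick")) +
  labs(
    title = "Wheat Kernel Dimensions by Variety",
    x = "Kernel length",
    y = "Kernel width",
    color = "Variety",
    caption = "Source: seeds dataset, datasetsICR package"
  ) +
  theme_minimal()

p
```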

Insights

Three clusters are visible, indicating that the varieties differ in kernel dimensions

Canadian variety (red) has the smallest kernels: shortest length and narrowest width

Rosa variety (blue) has the largest kernels: longest length and widest width

Kama variety (green) falls in between with medium-sized kernels

Positive correlation exists: longer kernels tend to be wider

Limited overlap between varieties suggests kernel dimensions are useful variety identifiers

Kernel dimensions could likely help classify wheat variety, although this lab did not fit or test a classifier
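The size ordering of the varieties and the length-width relationship can be checked numerically rather than read off the chart alone. This is a minimal sketch under the same assumptions as before (datasetsICR installed, dataset named `seeds`, snake_case column names); the object names are my own.

```r
# Assumes datasetsICR is installed and the dataset is named `seeds`
library(datasetsICR)
data("seeds")

# Assumption: snake_case column names, as in the str() output
names(seeds) <- gsub("[ .]+", "_", names(seeds))

# Mean kernel length and width for each variety
kernel_means <- aggregate(
  cbind(length_of_kernel, width_of_kernel) ~ variety,
  data = seeds,
  FUN = mean
)
kernel_means

# Pearson correlation between length and width across all kernels
length_width_cor <- cor(seeds$length_of_kernel, seeds$width_of_kernel)
length_width_cor
```

Comparing the rows of `kernel_means` shows which variety is smallest, largest, and in between, and a correlation above zero supports the claim that longer kernels tend to be wider.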

Conclusion

This lab demonstrated how exploratory data analysis and visualization techniques in R can be used to better understand relationships within a dataset. By examining the seeds dataset from the datasetsICR package, summary statistics and multiple chart types were used to explore kernel dimensions across wheat varieties. The final scatterplot effectively revealed clear differences between varieties, showing a strong positive relationship between kernel length and width, as well as distinct clustering by variety.

References

Healy, K. (2018). Data visualization: A practical introduction. Princeton University Press. https://socviz.co/makeplot.html

Anthropic. (2024). Claude Sonnet 4.5 [Large language model]. https://www.anthropic.com

Peer Feedback Summary

I received the following feedback on my original chart:

1. Title: Tighten up the title. The subtitle already explains what the title is saying, so shorten it.
2. Grid Lines: Lighten the grid lines.

Original Chart (Before Feedback)

Revised Chart (After Feedback)

Improvements Made

Title: Shortened from “Seed Kernel Dimensions Vary Significantly by Wheat Variety” to “Wheat Kernel Dimensions by Variety” - more concise since subtitle provides the detail
Grid Lines: Lightened using theme() to make them less prominent and improve readability
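A sketch of how these two revisions could be applied with `labs()` and `theme()`. The base plot, the grey shades, and the object names are assumptions for illustration, not the exact code used for the revised chart.

```r
# Assumes ggplot2 and datasetsICR are installed
library(ggplot2)
library(datasetsICR)
data("seeds")

# Assumption: snake_case column names, as in the str() output
names(seeds) <- gsub("[ .]+", "_", names(seeds))

# Hypothetical base plot standing in for the original chart
p_original <- ggplot(seeds, aes(x = length_of_kernel, y = width_of_kernel,
                                color = variety)) +
  geom_point() +
  labs(title = "Seed Kernel Dimensions Vary Significantly by Wheat Variety") +
  theme_minimal()

p_revised <- p_original +
  # Revision 1: shorter title
  labs(title = "Wheat Kernel Dimensions by Variety") +
  # Revision 2: lighter grid lines (grey shades are illustrative)
  theme(
    panel.grid.major = element_line(color = "grey90"),
    panel.grid.minor = element_line(color = "grey95")
  )

p_revised
```

Adding `theme()` after `theme_minimal()` matters here: a complete theme added later would overwrite the grid line settings.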