Question 1

Create a data frame made up of the following vectors. Make sure to use comments to explain each coding step you took to accomplish this.

sprint.speed.feet_s lizard.id
2.62 L1
3.93 L2
4.26 L3
1.31 L4
#Creating my vectors 

sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)
#First I use the c function to supply my sequence of numbers. Note that for a numerical variable I do not use quotation marks ""

lizard.id = c("L1", "L2", "L3", "L4")
#Again I use c() to supply more than one object to a vector, this time using "" to denote that these are text (character) values, which is how labels, IDs and grouping names are entered.

sprint.dat = data.frame(sprint.speed.f.s, lizard.id)
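#A quick way to check the result, sketched here with the same vectors as above: class() reports what kind of data each vector holds, and dim(), names() and str() describe the finished data frame.

```r
# Same vectors as in the answer above
sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)
lizard.id = c("L1", "L2", "L3", "L4")

class(sprint.speed.f.s)  # "numeric"
class(lizard.id)         # "character"

sprint.dat = data.frame(sprint.speed.f.s, lizard.id)
dim(sprint.dat)    # number of rows, then number of columns
names(sprint.dat)  # column names are taken from the vector names
str(sprint.dat)    # one line per column with its type and first values
```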

Question 2

Create a new vector called sprint.speed.m_s made up of the conversion of the vector sprint.speed.feet_s into meters. Update your data frame to include this vector.

#Many ways to answer this question 


sprint.speed.m_s = sprint.speed.f.s * 0.3048
sprint.speed.m_s
## [1] 0.798576 1.197864 1.298448 0.399288
# 1 foot = 0.3048 meters (exact)
# meters = feet * 0.3048

#Also 
sprint.speed.m_s = sprint.speed.f.s / 3.28
sprint.speed.m_s
## [1] 0.7987805 1.1981707 1.2987805 0.3993902
# 1 meter is about 3.28 feet (3.28 is rounded, so this result differs slightly from the one above)
# meters = feet / 3.28
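
#The two conversions above do not give identical numbers because 3.28 is a rounded value, while 1 foot is exactly 0.3048 m. A small sketch comparing them (feet_to_m is a helper name made up for this example):

```r
# Hypothetical helper: convert feet to meters with the exact factor
feet_to_m = function(feet) {
  feet * 0.3048
}

sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)

exact = feet_to_m(sprint.speed.f.s)
rounded = sprint.speed.f.s / 3.28   # 3.28 is 3.28084 rounded to two decimals

# Largest difference between the two answers, in meters per second
max(abs(exact - rounded))
```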

sprint.dat = data.frame(sprint.dat, sprint.speed.m_s)
#I've assigned to sprint.dat a data frame made up of itself and the new vector sprint.speed.m_s

#Also (this gives the same result as the line above, so only one of the two is needed)
sprint.dat$sprint.speed.m_s = sprint.speed.m_s
sprint.dat
##   sprint.speed.f.s lizard.id sprint.speed.m_s
## 1             2.62        L1        0.7987805
## 2             3.93        L2        1.1981707
## 3             4.26        L3        1.2987805
## 4             1.31        L4        0.3993902

Question 3

Explain what the function library does

#You can always use the help function to find out what a function does. If you run help(library) in your console it reads "library and require load and attach add-on packages". The Monday TA had a brilliant analogy: "library is like the librarian who finds the book (package) and brings it to you". So library is a function that looks for a package installed on your computer and, if it finds it, loads it into your current R session (if the package is not installed, library stops with an error). Packages need to be loaded every time you open R.
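
#A minimal sketch of the difference between having a package installed and loading it. It uses the stats package as a stand-in because it ships with R, so swap in the package you actually need. requireNamespace() only checks that a package is available, while library() attaches it to the current session.

```r
pkg = "stats"  # stand-in package that comes with every R installation

# TRUE if the package is installed, FALSE otherwise (no error is raised)
installed = requireNamespace(pkg, quietly = TRUE)

if (installed) {
  # Attach the package so its functions can be called by name
  library(pkg, character.only = TRUE)
} else {
  # Installing is a one-time step; loading is needed in every new session
  message("Run install.packages(\"", pkg, "\") first")
}

# Attached packages appear on the search path as "package:<name>"
attached = paste0("package:", pkg) %in% search()
attached
```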

Question 4

Use the package ggplot2 to create a column plot of the sprint speed data in m_s as the y axis and lizard.id as the x axis.

library(ggplot2)
ggplot(sprint.dat, 
       aes(x=lizard.id, 
           y=sprint.speed.m_s)) +
  geom_col()

#First I use the library function to load the package ggplot2. Then I supplied my data frame name (sprint.dat) and my variables (x=lizard.id, y=sprint.speed.m_s). Because I wanted to make a column graph I used the geometry geom_col().
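
#The same plot can be stored in an object and given readable axis titles with labs(). This is only a sketch: the data frame is rebuilt from the values printed in Question 2, and the axis titles are my own wording.

```r
library(ggplot2)

# Data frame rebuilt from the values printed in Question 2
sprint.dat = data.frame(
  lizard.id = c("L1", "L2", "L3", "L4"),
  sprint.speed.m_s = c(0.7987805, 1.1981707, 1.2987805, 0.3993902)
)

p = ggplot(sprint.dat,
           aes(x = lizard.id,
               y = sprint.speed.m_s)) +
  geom_col() +
  labs(x = "Lizard ID", y = "Sprint speed (m/s)")  # readable axis titles

p  # printing the object draws the plot
```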

Question 5

Load the citrus dataset into your R session and use summary to explore the dataset. Describe the variables in the dataset. Do you detect any potential errors in the data?

library(readxl)
citrus.dat = read_xlsx("D:/Dropbox/All about Umass/Spring 18/Animal Physiology Lab/R challenges/Citrus.data.r.challenge.xlsx")
#First I used the Import Dataset button in RStudio to navigate to the dataset on my computer, then I copied and pasted the file path into the function read_xlsx("/filepath/filename.xlsx").

summary(citrus.dat)
##    Sample_No     Common_Name          Species          total_mass_gm    
##  Min.   : 1.00   Length:28          Length:28          Min.   :   7.20  
##  1st Qu.: 7.75   Class :character   Class :character   1st Qu.:  76.62  
##  Median :15.50   Mode  :character   Mode  :character   Median :  91.35  
##  Mean   :15.18                                         Mean   : 191.55  
##  3rd Qu.:22.25                                         3rd Qu.: 149.70  
##  Max.   :29.00                                         Max.   :1027.00  
##  computed mass of rind + mass of fruit axial_diameter_mm
##  Min.   :   6.20                       Min.   : 26.95   
##  1st Qu.:  77.33                       1st Qu.: 39.29   
##  Median :  87.60                       Median : 51.30   
##  Mean   : 207.35                       Mean   : 60.77   
##  3rd Qu.: 171.95                       3rd Qu.: 85.03   
##  Max.   :1025.50                       Max.   :135.00   
##  equatorial_diameter_mm Shape (>1 taller than wider) Calculated volume
##  Min.   : 17.75         Min.   :0.6000               Min.   :   4446  
##  1st Qu.: 53.40         1st Qu.:0.8000               1st Qu.:  62102  
##  Median : 59.10         Median :0.9000               Median :  96284  
##  Mean   : 64.87         Mean   :0.9857               Mean   : 234801  
##  3rd Qu.: 67.40         3rd Qu.:1.1750               3rd Qu.: 183446  
##  Max.   :160.00         Max.   :1.5000               Max.   :1809557  
##  diameter_stem_attachment_mm   volume_mm3      Paper_mass_gm  
##  Min.   : 0.340              Min.   :   5000   Min.   :0.200  
##  1st Qu.: 2.550              1st Qu.:  83750   1st Qu.:0.803  
##  Median : 3.950              Median : 100000   Median :2.600  
##  Mean   : 3.883              Mean   : 273736   Mean   :2.081  
##  3rd Qu.: 5.050              3rd Qu.: 220000   3rd Qu.:2.944  
##  Max.   :10.000              Max.   :2047610   Max.   :4.700  
##  surface_area_mm2   rind_thickness_mm mass_of_rind_gm  mass_of_fruit_gm
##  Length:28          Min.   : 0.250    Min.   :  1.40   Min.   :  3.00  
##  Class :character   1st Qu.: 1.575    1st Qu.: 13.93   1st Qu.: 56.85  
##  Mode  :character   Median : 2.050    Median : 19.50   Median : 71.45  
##                     Mean   : 3.165    Mean   : 51.28   Mean   :156.07  
##                     3rd Qu.: 3.125    3rd Qu.: 50.00   3rd Qu.:120.15  
##                     Max.   :16.720    Max.   :325.00   Max.   :700.50  
##  Calculated mass of fruit (expected less than total fruit due to water loss)
##  Min.   :  3.00                                                             
##  1st Qu.: 54.12                                                             
##  Median : 72.87                                                             
##  Mean   :154.71                                                             
##  3rd Qu.:120.17                                                             
##  Max.   :730.80                                                             
##   No_sections    Mass_section_gm     No_Seeds      total_seed_mass_gm
##  Min.   : 4.00   Min.   : 0.750   Min.   :  0.00   Length:28         
##  1st Qu.: 9.00   1st Qu.: 5.700   1st Qu.:  0.00   Class :character  
##  Median :10.00   Median : 7.275   Median :  0.00   Mode  :character  
##  Mean   :10.11   Mean   :12.612   Mean   : 16.64                     
##  3rd Qu.:12.00   3rd Qu.:11.845   3rd Qu.:  4.00                     
##  Max.   :16.00   Max.   :60.900   Max.   :224.00                     
##  seed_volume_mm3    juice_cell_length_mm juice_cell_diameter_mm
##  Length:28          Min.   : 2.200       Min.   :0.150         
##  Class :character   1st Qu.: 4.850       1st Qu.:1.400         
##  Mode  :character   Median : 8.600       Median :2.150         
##                     Mean   : 9.229       Mean   :2.340         
##                     3rd Qu.:11.225       3rd Qu.:3.025         
##                     Max.   :20.000       Max.   :6.300         
##  juice_cell_mass_gm
##  Min.   :0.00050   
##  1st Qu.:0.01375   
##  Median :0.02615   
##  Mean   :0.09893   
##  3rd Qu.:0.08500   
##  Max.   :0.57500
#The dataset has many variables related to citrus fruits, including three grouping variables (Sample_No, Common_Name, Species) and 22 variables related to the shape, size and mass of the fruits and their components.
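
#In the summary above, surface_area_mm2, total_seed_mass_gm and seed_volume_mm3 are listed as character even though their names suggest measurements. One possible way to find the entries that stop a column from being numeric, sketched on a made-up vector (these are not values from the citrus file):

```r
# Made-up example column: numbers stored as text, with one non-numeric entry
seed_mass = c("0.52", "1.10", "n/a", "0.87")

# as.numeric() turns anything it cannot parse into NA (with a warning)
seed_mass_num = suppressWarnings(as.numeric(seed_mass))

# Positions and original text of the entries that failed to convert
bad = which(is.na(seed_mass_num) & !is.na(seed_mass))
bad
seed_mass[bad]
```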

#One of the most common errors in group-collected data is misspelled or inconsistent labels, which create artificial groupings in R.

citrus.dat$Common_Name= as.factor(citrus.dat$Common_Name)
#Here I replaced the vector with a copy of itself coerced into a factor using the as.factor() function. Using the $ I am able to call a vector within a data frame, in this case Common_Name. Printing it shows that many Common_Name labels are misspelled or inconsistently capitalized. The simplest way to fix this is to correct the dataset itself and reload it into R.

citrus.dat$Common_Name
##  [1] kumquat      Kumquat      Kumquat      Kumquat      Tangerine   
##  [6] mandarin     Orange       Lime         Tangerine    Tangerine   
## [11] lime         mandarin     lime         Tangerine    Lime        
## [16] Mandarin     Tangerine    mandarin     Lemon        lemon       
## [21] Lemon        Orange       Sweet Orange Grapefruit   Grapefruit  
## [26] Grapefruit   Pomelo       Red Pummelo 
## 14 Levels: Grapefruit kumquat Kumquat lemon Lemon lime Lime ... Tangerine
#Notice that now there are levels listed.

levels(citrus.dat$Common_Name)
##  [1] "Grapefruit"   "kumquat"      "Kumquat"      "lemon"       
##  [5] "Lemon"        "lime"         "Lime"         "mandarin"    
##  [9] "Mandarin"     "Orange"       "Pomelo"       "Red Pummelo" 
## [13] "Sweet Orange" "Tangerine"
#The function levels() returns the groupings in a factor. Note that there are 14 levels due to misspellings and inconsistent capitalization.



levels(citrus.dat$Common_Name)=c("Grapefruit",
                                  "Kumquat",
                                  "Kumquat", 
                                  "Lemon", 
                                  "Lemon", 
                                  "Lime", 
                                  "Lime",
                                  "Mandarin",
                                  "Mandarin",
                                  "Orange",
                                  "Pomelo", 
                                  "Red Pomelo",
                                  "Sweet Orange",
                                  "Tangerine")

#By using the function levels() and the function c() I can assign what the levels should be by providing the corrected spelling for each existing level, in the same order as the old levels.

#In R, assigning exactly the same spelling to levels that belong to the same group merges them, which corrects the error. We can check it by asking for the levels again
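
#Assigning levels by position works only if the new names are typed in exactly the same order as the old levels. An alternative that is harder to get wrong is to give levels() a named list, where each name is the corrected label and its value lists the old spellings to merge. A sketch on a short made-up factor using a few of the labels above:

```r
# Small example factor with inconsistent labels
fruit = factor(c("kumquat", "Kumquat", "lime", "Lime", "Red Pummelo"))
levels(fruit)

# name = corrected label, value = old labels that should be merged into it
levels(fruit) = list(Kumquat = c("kumquat", "Kumquat"),
                     Lime = c("lime", "Lime"),
                     "Red Pomelo" = "Red Pummelo")

levels(fruit)
table(fruit)  # counts per corrected label
```

#Any old level that is left out of the list becomes NA, so every level has to be listed.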

levels(citrus.dat$Common_Name)
##  [1] "Grapefruit"   "Kumquat"      "Lemon"        "Lime"        
##  [5] "Mandarin"     "Orange"       "Pomelo"       "Red Pomelo"  
##  [9] "Sweet Orange" "Tangerine"
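
#Most of the duplicate levels differ only in capitalization, so another option is to standardize the case before converting to a factor. A sketch on a few of the labels above. cap_words is a helper name made up for this example, and true misspellings such as Red Pummelo still need a manual fix:

```r
# Hypothetical helper: lower-case everything, then capitalize each word
cap_words = function(x) {
  gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)
}

common_name = c("kumquat", "Kumquat", "lime", "Lime",
                "Sweet Orange", "Red Pummelo")

common_name = cap_words(common_name)

# Spelling errors are not fixed by changing case, so correct them by hand
common_name[common_name == "Red Pummelo"] = "Red Pomelo"

common_name = as.factor(common_name)
levels(common_name)
```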

Question 6

Use ggplot to plot any two of the variables in the dataset

summary(citrus.dat$total_mass_gm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.20   76.62   91.35  191.55  149.70 1027.00
#I want to plot total_mass_gm because I noticed that the data is quite spread out, with a mean of about 192, a 3rd quartile of about 150 and a maximum of 1027. Because the mean is above the 3rd quartile, it's likely that only a few observations are very large. These could be a few genuinely heavy fruits, or an error such as 1027 actually being 102.7.
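
#To illustrate the reasoning above, here is a sketch with made-up masses (not the citrus data) showing how a single very heavy fruit pulls the mean above the 3rd quartile while the median barely moves:

```r
# Made-up masses in grams: mostly small fruit plus one very heavy one
mass_g = c(7.2, 60, 80, 90, 100, 150, 1027)

mean(mass_g)            # pulled upward by the single large value
median(mass_g)          # barely affected by it
quantile(mass_g, 0.75)  # 3rd quartile

# A mean above the 3rd quartile suggests a few unusually large values
mean(mass_g) > quantile(mass_g, 0.75)

# List the observations responsible so they can be checked by eye
mass_g[mass_g > 750]
```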

ggplot(citrus.dat, 
       aes(x=total_mass_gm, fill=Common_Name)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#I use ggplot to plot total mass from the citrus data as my x variable, filling each column by the common name of the fruit and creating the histogram with geom_histogram. Notice that most of our fruits have a mass under 250 g, with only Grapefruit, Pomelo and Red Pomelo heavier than 750 g.
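
#The message printed under the plot means geom_histogram() fell back to its default of 30 bins. A sketch of choosing the bin width explicitly, using a small made-up data frame in place of citrus.dat (the masses, the names assigned to them and the 50 g width are all illustrative assumptions):

```r
library(ggplot2)

# Made-up stand-in for citrus.dat
toy.dat = data.frame(
  total_mass_gm = c(7.2, 60, 80, 90, 100, 150, 1027),
  Common_Name = c("Kumquat", "Lime", "Lemon", "Tangerine",
                  "Orange", "Orange", "Pomelo")
)

p = ggplot(toy.dat,
           aes(x = total_mass_gm, fill = Common_Name)) +
  geom_histogram(binwidth = 50)  # each bar covers 50 g

p
```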

Can a pomelo weigh 1000 g?

Pomelo can weigh between 1 and 5 kg, which means that our suspected error is probably real data. It’s just very rare in our dataset, which is why the distribution of the mass is so spread out.

Pomelo fruit, photo by Emilian Robert Vicol from Wikimedia Commons
