Create a data frame made up of the following vectors. Make sure to use comments to explain each coding step you took to accomplish this.
| sprint.speed.feet_s | lizard.id |
|---|---|
| 2.62 | L1 |
| 3.93 | L2 |
| 4.26 | L3 |
| 1.31 | L4 |
#Creating my vectors
sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)
#First I use the c() function to supply my sequence of numbers. Note that for a numerical variable I do not use quotation marks "".
lizard.id = c("L1", "L2", "L3", "L4")
#Again I use c() to combine several values into one vector, this time using "" to mark them as text (character) values, which R can treat as categories (factors, groupings, labels, etc.).
sprint.dat = data.frame(sprint.speed.f.s, lizard.id)
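A quick way to confirm that the data frame came out as intended is to inspect its structure. The sketch below rebuilds the same two vectors so that it runs on its own; the checks with nrow(), names() and str() are an addition, not part of the original answer.

```r
# Rebuild the two vectors from the table
sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)
lizard.id = c("L1", "L2", "L3", "L4")

# Combine them into a data frame; each vector becomes a column
sprint.dat = data.frame(sprint.speed.f.s, lizard.id)

# Inspect the result: number of rows, column names, and column types
nrow(sprint.dat)
names(sprint.dat)
str(sprint.dat)
```

In R 4.0 and later, lizard.id stays a character column; older versions of R convert it to a factor by default.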
Create a new vector called sprint.speed.m_s made up of the conversion of the vector sprint.speed.feet_s into meters. Update your data frame to include this vector.
#Many ways to answer this question
sprint.speed.m_s = sprint.speed.f.s * 0.3048
sprint.speed.m_s
## [1] 0.798576 1.197864 1.298448 0.399288
# 1 foot = 0.3048 meters
# meters = feet * 0.3048
#Also
sprint.speed.m_s = sprint.speed.f.s / 3.28
sprint.speed.m_s
## [1] 0.7987805 1.1981707 1.2987805 0.3993902
# 1 meter is about 3.28 feet (rounded)
# meters = feet / 3.28 (approximately)
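The two answers differ in the fourth decimal place because 3.28 is a rounded version of the exact factor (1 foot = 0.3048 meters, so 1 meter = 3.28084 feet). A small sketch comparing the two methods; the names exact.m_s and approx.m_s are only for illustration.

```r
sprint.speed.f.s = c(2.62, 3.93, 4.26, 1.31)

# Exact conversion: 1 foot is defined as 0.3048 meters
exact.m_s = sprint.speed.f.s * 0.3048

# Approximate conversion: 1 meter is roughly 3.28 feet
approx.m_s = sprint.speed.f.s / 3.28

# Difference between the two methods for each lizard
approx.m_s - exact.m_s

# Largest relative difference, as a fraction of the exact value
max(abs(approx.m_s - exact.m_s) / exact.m_s)
```

The relative difference is well under a tenth of a percent, so either answer is reasonable for this exercise, but multiplying by 0.3048 is the exact one.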
sprint.dat = data.frame(sprint.dat, sprint.speed.m_s)
#I've assigned to sprint.dat a data frame made up of itself and the new vector sprint.speed.m_s
#Alternatively, the $ operator adds (or overwrites) a column directly
sprint.dat$sprint.speed.m_s = sprint.speed.m_s
sprint.dat
##   sprint.speed.f.s lizard.id sprint.speed.m_s
## 1             2.62        L1        0.7987805
## 2             3.93        L2        1.1981707
## 3             4.26        L3        1.2987805
## 4             1.31        L4        0.3993902
Explain what the function library does.
#You can always use the help function to learn what a function does. If you run help(library) in your console, it reads "library and require load and attach add-on packages". The Monday TA had a brilliant analogy: "library is like the librarian who finds the book (package) and brings it to you". So library is a function that looks for a package installed on your computer and, if it finds it, loads it into your current R session; if the package is not installed, library stops with an error. Packages need to be loaded every time you open R.
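Because library() stops with an error when a package is not installed, it can help to check first. A minimal sketch using requireNamespace() from base R; the wording of the message is my own.

```r
# requireNamespace() returns TRUE if the package can be loaded
# (that is, it is installed) and FALSE otherwise
if (requireNamespace("ggplot2", quietly = TRUE)) {
  # Installed: load and attach it for this session
  library(ggplot2)
} else {
  # Not installed: install it once with install.packages("ggplot2"),
  # then call library(ggplot2) in every new session
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```

Installing a package is done once per computer, while library() is needed in every new R session.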
Use the package ggplot2 to create a column plot of the sprint speed data in m_s as the y axis and lizard.id as the x axis.
library(ggplot2)
ggplot(sprint.dat,
aes(x=lizard.id,
y=sprint.speed.m_s)) +
geom_col()
#First I use the library function to load the package ggplot2. Then I supplied my data frame name (sprint.dat) and my variables (x=lizard.id, y=sprint.speed.m_s). Because I wanted to make a column graph I used the geometry geom_col().
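The same plot can be given readable axis titles with labs(). This sketch rebuilds the data frame so that it runs on its own; the label text and the object name sprint.plot are my own choices, not part of the original answer.

```r
library(ggplot2)

# Rebuild the data frame with the speeds already converted to m/s
sprint.dat = data.frame(
  lizard.id = c("L1", "L2", "L3", "L4"),
  sprint.speed.m_s = c(2.62, 3.93, 4.26, 1.31) * 0.3048
)

# Same column plot as before, stored in an object and given axis titles
sprint.plot = ggplot(sprint.dat,
                     aes(x = lizard.id,
                         y = sprint.speed.m_s)) +
  geom_col() +
  labs(x = "Lizard ID", y = "Sprint speed (m/s)")

# Printing the object draws the plot
sprint.plot
```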
Load the citrus dataset into your R session and use summary to explore the dataset. Describe the variables in the dataset. Do you detect any potential errors in the data?
library(readxl)
citrus.dat = read_xlsx("D:/Dropbox/All about Umass/Spring 18/Animal Physiology Lab/R challenges/Citrus.data.r.challenge.xlsx")
#First I used the Import Dataset button in RStudio to navigate to my dataset on my computer, then I copied and pasted the file path into the function read_xlsx("/filepath/filename.xlsx").
summary(citrus.dat)
## Sample_No Common_Name Species total_mass_gm
## Min. : 1.00 Length:28 Length:28 Min. : 7.20
## 1st Qu.: 7.75 Class :character Class :character 1st Qu.: 76.62
## Median :15.50 Mode :character Mode :character Median : 91.35
## Mean :15.18 Mean : 191.55
## 3rd Qu.:22.25 3rd Qu.: 149.70
## Max. :29.00 Max. :1027.00
## computed mass of rind + mass of fruit axial_diameter_mm
## Min. : 6.20 Min. : 26.95
## 1st Qu.: 77.33 1st Qu.: 39.29
## Median : 87.60 Median : 51.30
## Mean : 207.35 Mean : 60.77
## 3rd Qu.: 171.95 3rd Qu.: 85.03
## Max. :1025.50 Max. :135.00
## equatorial_diameter_mm Shape (>1 taller than wider) Calculated volume
## Min. : 17.75 Min. :0.6000 Min. : 4446
## 1st Qu.: 53.40 1st Qu.:0.8000 1st Qu.: 62102
## Median : 59.10 Median :0.9000 Median : 96284
## Mean : 64.87 Mean :0.9857 Mean : 234801
## 3rd Qu.: 67.40 3rd Qu.:1.1750 3rd Qu.: 183446
## Max. :160.00 Max. :1.5000 Max. :1809557
## diameter_stem_attachment_mm volume_mm3 Paper_mass_gm
## Min. : 0.340 Min. : 5000 Min. :0.200
## 1st Qu.: 2.550 1st Qu.: 83750 1st Qu.:0.803
## Median : 3.950 Median : 100000 Median :2.600
## Mean : 3.883 Mean : 273736 Mean :2.081
## 3rd Qu.: 5.050 3rd Qu.: 220000 3rd Qu.:2.944
## Max. :10.000 Max. :2047610 Max. :4.700
## surface_area_mm2 rind_thickness_mm mass_of_rind_gm mass_of_fruit_gm
## Length:28 Min. : 0.250 Min. : 1.40 Min. : 3.00
## Class :character 1st Qu.: 1.575 1st Qu.: 13.93 1st Qu.: 56.85
## Mode :character Median : 2.050 Median : 19.50 Median : 71.45
## Mean : 3.165 Mean : 51.28 Mean :156.07
## 3rd Qu.: 3.125 3rd Qu.: 50.00 3rd Qu.:120.15
## Max. :16.720 Max. :325.00 Max. :700.50
## Calculated mass of fruit (expected less than total fruit due to water loss)
## Min. : 3.00
## 1st Qu.: 54.12
## Median : 72.87
## Mean :154.71
## 3rd Qu.:120.17
## Max. :730.80
## No_sections Mass_section_gm No_Seeds total_seed_mass_gm
## Min. : 4.00 Min. : 0.750 Min. : 0.00 Length:28
## 1st Qu.: 9.00 1st Qu.: 5.700 1st Qu.: 0.00 Class :character
## Median :10.00 Median : 7.275 Median : 0.00 Mode :character
## Mean :10.11 Mean :12.612 Mean : 16.64
## 3rd Qu.:12.00 3rd Qu.:11.845 3rd Qu.: 4.00
## Max. :16.00 Max. :60.900 Max. :224.00
## seed_volume_mm3 juice_cell_length_mm juice_cell_diameter_mm
## Length:28 Min. : 2.200 Min. :0.150
## Class :character 1st Qu.: 4.850 1st Qu.:1.400
## Mode :character Median : 8.600 Median :2.150
## Mean : 9.229 Mean :2.340
## 3rd Qu.:11.225 3rd Qu.:3.025
## Max. :20.000 Max. :6.300
## juice_cell_mass_gm
## Min. :0.00050
## 1st Qu.:0.01375
## Median :0.02615
## Mean :0.09893
## 3rd Qu.:0.08500
## Max. :0.57500
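The summary lists surface_area_mm2, total_seed_mass_gm and seed_volume_mm3 as class character even though their names suggest numbers. One likely explanation is that at least one cell in each column holds something that is not a number. A sketch of how such cells could be located, using a made-up column (the values below are invented for illustration and are not taken from the citrus file):

```r
# Hypothetical column read in as text because of one non-numeric entry
toy = data.frame(surface_area_mm2 = c("5026.5", "n/a", "7853.9"),
                 stringsAsFactors = FALSE)

# Find which columns were read as character
sapply(toy, class)

# Try converting to numeric; entries that are not numbers become NA
converted = suppressWarnings(as.numeric(toy$surface_area_mm2))

# Show the original text of the entries that failed to convert
toy$surface_area_mm2[is.na(converted)]
```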
#The dataset has many variables related to citrus fruits, including three grouping variables (Sample_No, Common_Name, Species) and 22 variables related to the shape, size and mass of the fruits and their components.
#One of the most common errors in group-collected data is misspelling or inconsistency in labels, which creates artificial groupings in R.
citrus.dat$Common_Name= as.factor(citrus.dat$Common_Name)
#Here I assigned back to the same column that column coerced into a factor using the as.factor() function. Using $ I can call a vector within a data frame, in this case Common_Name. I see that there are many misspellings in Common_Name. The simplest way to fix this is to correct it in the dataset file and reload it into R.
citrus.dat$Common_Name
## [1] kumquat Kumquat Kumquat Kumquat Tangerine
## [6] mandarin Orange Lime Tangerine Tangerine
## [11] lime mandarin lime Tangerine Lime
## [16] Mandarin Tangerine mandarin Lemon lemon
## [21] Lemon Orange Sweet Orange Grapefruit Grapefruit
## [26] Grapefruit Pomelo Red Pummelo
## 14 Levels: Grapefruit kumquat Kumquat lemon Lemon lime Lime ... Tangerine
#Notice that now there are levels listed.
levels(citrus.dat$Common_Name)
## [1] "Grapefruit" "kumquat" "Kumquat" "lemon"
## [5] "Lemon" "lime" "Lime" "mandarin"
## [9] "Mandarin" "Orange" "Pomelo" "Red Pummelo"
## [13] "Sweet Orange" "Tangerine"
#The function levels() returns the groupings in a factor. Note that there are 14 levels due to misspellings and inconsistent capitalization.
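Labels that differ only in capitalization can be found without reading the list by eye, by comparing the levels after converting them to lower case. A sketch on a short made-up vector (fruit is a hypothetical stand-in for citrus.dat$Common_Name):

```r
# Small stand-in for the Common_Name column
fruit = factor(c("kumquat", "Kumquat", "Lime", "lime", "Lemon", "Orange"))

# Lower-case version of every level
lower.levels = tolower(levels(fruit))

# Levels whose lower-case form appears more than once are case duplicates
dup = lower.levels %in% lower.levels[duplicated(lower.levels)]
levels(fruit)[dup]
```

This catches capitalization differences only; spelling variants such as "Pomelo" versus "Pummelo" still need to be spotted by reading the levels.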
levels(citrus.dat$Common_Name)=c("Grapefruit",
"Kumquat",
"Kumquat",
"Lemon",
"Lemon",
"Lime",
"Lime",
"Mandarin",
"Mandarin",
"Orange",
"Pomelo",
"Red Pomelo",
"Sweet Orange",
"Tangerine")
#By using the function levels() together with c() I can assign what the levels should be, providing the corrected spelling for each of the 14 original levels in the same order.
#In R, assigning the same exactly spelled label to levels that belong to the same group merges them, which corrects the error. We can check by asking for the levels again.
levels(citrus.dat$Common_Name)
## [1] "Grapefruit" "Kumquat" "Lemon" "Lime"
## [5] "Mandarin" "Orange" "Pomelo" "Red Pomelo"
## [9] "Sweet Orange" "Tangerine"
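An alternative that avoids typing every level by hand is to standardize capitalization in code before making the factor. This is a sketch, not the method used in the answer; the helper name cap.first is hypothetical, and it capitalizes only the first letter, so a multi-word name such as "Sweet Orange" would become "Sweet orange".

```r
# Hypothetical helper: lower-case everything, then capitalize the first letter
cap.first = function(x) {
  x = tolower(x)
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

# Stand-in for a few Common_Name entries
fruit = c("kumquat", "Kumquat", "lime", "Lime", "mandarin")

# Clean the text first, then convert to a factor
fruit.clean = as.factor(cap.first(fruit))
levels(fruit.clean)
```

Five inconsistent labels collapse into three levels here. True misspellings would still have to be corrected by hand.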
Use ggplot to plot any two of the variables in the dataset.
summary(citrus.dat$total_mass_gm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.20 76.62 91.35 191.55 149.70 1027.00
#I want to plot total_mass_gm because I noticed that the data are quite spread out, with a mean of 191.55, a 3rd quartile of 149.70 and a maximum of 1027. Because the mean is higher than the 3rd quartile, it's likely that only a few observations are near 1000 and are pulling the mean up. These could be a few genuinely heavy fruits, or an error in which 1027 should actually be 102.7.
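One common rule of thumb flags values above the 3rd quartile plus 1.5 times the interquartile range as potential outliers. The sketch below applies it to the quartiles printed by summary(); the rule is my addition, and the numbers are copied from that output.

```r
# Quartiles and maximum of total_mass_gm, copied from the summary() output
q1 = 76.62
q3 = 149.70
max.mass = 1027.00

# Interquartile range and the upper fence of the 1.5 * IQR rule
iqr = q3 - q1
upper.fence = q3 + 1.5 * iqr
upper.fence

# Is the maximum above the fence?
max.mass > upper.fence
```

The fence comes out near 259 g, so 1027 g is flagged. A flag only says the value is unusual relative to the rest of the sample, not that it is wrong.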
ggplot(citrus.dat,
aes(x=total_mass_gm, fill=Common_Name)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#I use ggplot to plot, from the citrus data, total mass as my x variable, filling each bar by the common name of the fruit and creating the histogram with geom_histogram(). Notice that most of our fruits are under 250 g. Only Grapefruit, Pomelo and Red Pomelo are heavier than 750 g.
Pomelos can weigh between 1 and 5 kg, which means that our suspected error is probably real data. It’s just very rare in our dataset, which is why the distribution of mass is so spread out.
Pomelo fruit, photo by Emilian Robert Vicol from Wikimedia Commons