Homework #2

Chapter 2
Chapter 3

Chapter 2

Chapter 2: #20

What is the mode of frequency distribution?

The mode is the 12-13 mm interval.

Estimate by eye the fraction of birds whose measurements are in the interval representing the mode

550/1017

There is a hint of a second peak in the frequency distribution between 15 and 16 mm. What strategy would you recommend be used to explore more fully the possibility of a second peak?

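I would recommend narrowing the interval (bin) width of the histogram, particularly between 14 and 17 mm, so that a possible second peak is easier to see. If the peak persists across several bin widths, it is less likely to be an artifact of how the intervals were chosen.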
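
As a rough sketch of this strategy, the code below compares a coarse and a fine binning of the same values. The measurements are simulated (hypothetical), not the bird data from the exercise, and the bin widths of 1 mm and 0.25 mm are assumptions chosen only for illustration.

```r
# Simulated measurements in mm (hypothetical, NOT the exercise data):
# a large main group plus a small second group near 15.5 mm.
set.seed(1)
measurements <- c(rnorm(300, mean = 12.5, sd = 0.8),
                  rnorm(40, mean = 15.5, sd = 0.3))

# Break points that cover the whole range of the data.
lo <- floor(min(measurements))
hi <- ceiling(max(measurements))

# Same data, two bin widths; plot = FALSE returns the counts without drawing.
coarse <- hist(measurements, breaks = seq(lo, hi, by = 1), plot = FALSE)
fine <- hist(measurements, breaks = seq(lo, hi, by = 0.25), plot = FALSE)

# Counts per bin for each choice of width.
coarse$counts
fine$counts
```

Setting `plot = TRUE` (the default) draws the two histograms so the shapes can be compared directly.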

What name is given to a frequency distribution having two distinct peaks?

Bimodal frequency distribution

Chapter 2: #23

What types of variables are displayed?

The two variables displayed are both numerical and continuous.

What type of graph is this?

This is a scatter plot.

Describe the association between the two variables. Is the relationship between flicker fusion frequency and temperature positive or negative? Is the relationship linear or nonlinear?

The relationship between flicker fusion frequency and temperature is nonlinear and positive.

The 20 points in the graph were obtained from measurements of six swordfish. Can we treat the 20 measurements as a random sample? Why or why not?

The 20 measurements cannot be treated as a random sample because they come from only six fish, so some fish were measured more than once. Measurements taken from the same fish are not independent of one another, which violates a requirement of random sampling.

Chapter 2: #29

What kind of graph is this?

This is a histogram.

Describe the shape of the distribution. Is it symmetric or skewed? If it is skewed, describe the type of skew.

The distribution is skewed left.

Is this distribution bimodal? Where are the mode or modes?

This distribution is not bimodal. The mode is between the ages of 80 and 85.

Chapter 2: #32

Toxo_data <- read_csv(here::here("Data", "chapter02", "chap02q32ToxoplasmaAccidents.csv"))

## Rows: 308 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): driverType, infectionStatus
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Toxo_data)

## # A tibble: 6 × 2
##   driverType infectionStatus
##   <chr>      <chr>          
## 1 accidents  infected       
## 2 accidents  infected       
## 3 accidents  infected       
## 4 accidents  infected       
## 5 accidents  infected       
## 6 accidents  infected

str(Toxo_data)

## spec_tbl_df [308 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ driverType     : chr [1:308] "accidents" "accidents" "accidents" "accidents" ...
##  $ infectionStatus: chr [1:308] "infected" "infected" "infected" "infected" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   driverType = col_character(),
##   ..   infectionStatus = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Toxo_table <- table(Toxo_data$driverType, Toxo_data$infectionStatus)

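# t() puts infection status on the x-axis and driver type on the y-axis,
# so the axis labels must follow that order.
mosaicplot(t(Toxo_table),
           xlab = "Infection Status",
           ylab = "Driver Type",
           main = "",
           col = c("forestgreen", "goldenrod1"))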

What type of table is this?

This is a two-way contingency table.

What are the two variables being compared? Which is the explanatory variable and which is the response?

Driver type and infection status are being compared. Because the study asks whether infection affects the likelihood of an accident, infection status is the explanatory variable and driver type is the response variable.

Depict the data in a graph. Use the results to answer the question: are the two variables associated in this data set?

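The two variables appear to be associated in this data set. If there were no association, the divisions within the bars of the mosaic plot would line up at the same height and form a "plus sign"; here they do not, so the relative frequencies differ between the groups.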
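
The visual check can be backed up by comparing conditional proportions. The sketch below uses made-up counts (hypothetical, not the counts in the Toxoplasma file), and the level names `controls` and `uninfected` are assumptions, since only `accidents` and `infected` appear in the output above.

```r
# Hypothetical 2 x 2 counts, NOT the real Toxoplasma data.
toxo_counts <- matrix(c(30, 70,
                        20, 180),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(driverType = c("accidents", "controls"),
                                      infectionStatus = c("infected", "uninfected")))

# Proportion infected within each driver type (each row sums to 1).
row_props <- prop.table(toxo_counts, margin = 1)
row_props
```

If the two variables were not associated, the proportion infected would be about the same in every row.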

Chapter 2: #33

ADHD_data <- read_csv(here::here("Data", "chapter02", "chap02q33BirthMonthADHD.csv"))

## Rows: 4 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): birthMonth, diagnosis
## dbl (1): frequencies
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(ADHD_data)

## # A tibble: 4 × 3
##   birthMonth diagnosis frequencies
##   <chr>      <chr>           <dbl>
## 1 January    ADHD             2219
## 2 January    no ADHD         36917
## 3 December   ADHD             2870
## 4 December   no ADHD         36107

str(ADHD_data)

## spec_tbl_df [4 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ birthMonth : chr [1:4] "January" "January" "December" "December"
##  $ diagnosis  : chr [1:4] "ADHD" "no ADHD" "ADHD" "no ADHD"
##  $ frequencies: num [1:4] 2219 36917 2870 36107
##  - attr(*, "spec")=
##   .. cols(
##   ..   birthMonth = col_character(),
##   ..   diagnosis = col_character(),
##   ..   frequencies = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

ADHD_matrix <- matrix(ADHD_data$frequencies,
                      byrow = FALSE,
                      ncol = 2,
                      dimnames = list(c("Diagnosed ADHD", "Not diagnosed"),
                                     c("Jan", "Dec")))
barplot(ADHD_matrix, beside = TRUE,
        xlab = "birth month", ylab = "frequency",
        legend.text = rownames(ADHD_matrix))

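There is an association: a larger proportion of December-born children were diagnosed with ADHD (2870 of 38977, about 7.4%) than of January-born children (2219 of 39136, about 5.7%), and the calculated value of χ² is greater than the critical value.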
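
The calculation behind this answer is not shown, so the following is one way it might be reproduced from the four frequencies printed above. The object names and the significance level of 0.05 are assumptions. By default `chisq.test()` applies a continuity correction to 2 x 2 tables.

```r
# Counts copied from the ADHD_data output above.
adhd_counts <- matrix(c(2219, 36917, 2870, 36107),
                      ncol = 2,
                      dimnames = list(diagnosis = c("ADHD", "no ADHD"),
                                      birthMonth = c("January", "December")))

# Proportion diagnosed within each birth month (each column sums to 1).
col_props <- prop.table(adhd_counts, margin = 2)
round(col_props, 3)

# Chi-square contingency test and the critical value for 1 degree of freedom,
# assuming a significance level of 0.05.
adhd_test <- chisq.test(adhd_counts)
critical_value <- qchisq(0.95, df = 1)
unname(adhd_test$statistic) > critical_value
```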

Chapter 2: #34

What type of graph is this?

This is a violin plot.

In which of the groups is the frequency distribution of measurements approximately symmetric?

Group A

Which of the frequency distributions show negative skew?

Group B

Which of the frequency distributions show positive skew?

Group C

Chapter 2: #35

FRL_data <- read_csv(here::here("Data", "chapter02", "chap02q35FoodReductionLifespan.csv"))

## Rows: 34 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, foodTreatment
## dbl (1): lifespan
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(FRL_data)

## # A tibble: 6 × 3
##   sex    foodTreatment lifespan
##   <chr>  <chr>            <dbl>
## 1 female reduced           16.5
## 2 female reduced           18.9
## 3 female reduced           22.6
## 4 female reduced           27.8
## 5 female reduced           30.2
## 6 female reduced           30.7

str(FRL_data)

## spec_tbl_df [34 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ sex          : chr [1:34] "female" "female" "female" "female" ...
##  $ foodTreatment: chr [1:34] "reduced" "reduced" "reduced" "reduced" ...
##  $ lifespan     : num [1:34] 16.5 18.9 22.6 27.8 30.2 30.7 35.9 23.7 24.5 24.7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   sex = col_character(),
##   ..   foodTreatment = col_character(),
##   ..   lifespan = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

stripchart(lifespan ~ sex * foodTreatment, data = FRL_data, vertical = TRUE)

(b) According to your graph, which difference in life span is greater: that between the sexes or that between diet groups?

The difference in life span between diet groups is greater.
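
A numerical check of this reading could compare the gap between the sex means with the gap between the diet means. The sketch below uses invented lifespans (hypothetical, not the values in the data file), and the level name `control` is an assumption.

```r
# Hypothetical lifespans, NOT the values from the data file.
flies <- data.frame(
  sex = rep(c("female", "male"), each = 4),
  foodTreatment = rep(c("control", "reduced"), times = 4),
  lifespan = c(40, 25, 42, 27, 38, 24, 36, 22)
)

# Mean lifespan for each level of each variable.
sex_means <- tapply(flies$lifespan, flies$sex, mean)
diet_means <- tapply(flies$lifespan, flies$foodTreatment, mean)

# Absolute difference between the two group means for each variable.
sex_gap <- abs(diff(sex_means))
diet_gap <- abs(diff(diet_means))
c(sex = unname(sex_gap), diet = unname(diet_gap))
```

The same calls can be applied to `FRL_data` by replacing `flies` with that data frame.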

Chapter 2: #36

What kind of graph is this?

This is a scatter plot.

Since the data describe a temporal sequence, what other type of plot would be suitable to display them?

Line plot

One individual in this graph is an outlier relative to the others. What is the age of death for the outlier?

122 years old.

Chapter 3

Chapter 3: #15

Vaso_data <- read_csv(here::here("Data", "chapter03", "chap03q15VasopressinVoles.csv"))

## Rows: 31 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): treatment
## dbl (1): percent
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Vaso_data)

## # A tibble: 6 × 2
##   treatment percent
##   <chr>       <dbl>
## 1 control        98
## 2 control        96
## 3 control        94
## 4 control        88
## 5 control        86
## 6 control        82

str(Vaso_data)

## spec_tbl_df [31 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ treatment: chr [1:31] "control" "control" "control" "control" ...
##  $ percent  : num [1:31] 98 96 94 88 86 82 77 74 70 60 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   treatment = col_character(),
##   ..   percent = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

boxplot(percent ~ treatment, data = Vaso_data)

Display these data in a graph. Explain your choice of graph.

I chose a box plot because the data contain one categorical variable (treatment: Control or Enhanced) and one numerical variable (percent), and placing the two box plots side by side makes the groups easy to compare.

Which group has the higher mean percentage of time spent huddling with females?

The Enhanced group has the higher mean percentage (mean ~ 85%) compared to the Control group (mean ~ 60%).

Which group has the higher standard deviation in percentage of time spent huddling with females?

The Control group has the higher standard deviation (s ~ 23) compared to the Enhanced group (s ~ 10).
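
A box plot marks medians rather than means, so the means and standard deviations are best checked by direct calculation. The sketch below shows one possible approach on invented values (hypothetical, not the vole data), and the level name `enhanced` is an assumption.

```r
# Hypothetical percentages, NOT the values from the data file.
voles <- data.frame(
  treatment = rep(c("control", "enhanced"), each = 5),
  percent = c(98, 80, 60, 40, 20, 95, 90, 85, 80, 75)
)

# Mean and standard deviation of percent within each treatment group.
group_mean <- tapply(voles$percent, voles$treatment, mean)
group_sd <- tapply(voles$percent, voles$treatment, sd)
round(rbind(mean = group_mean, sd = group_sd), 1)
```

Replacing `voles` with `Vaso_data` would give the values for the actual sample.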

Chapter 3: #16

Diet_data <- read_csv(here::here("Data", "chapter03", "chap03q16DietBreadthElVerde.csv"))

## Rows: 127 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): breadth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Diet_data)

## # A tibble: 6 × 1
##   breadth
##     <dbl>
## 1       1
## 2       1
## 3       1
## 4       1
## 5       1
## 6       1

str(Diet_data)

## spec_tbl_df [127 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ breadth: num [1:127] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   breadth = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Calculate the median number of prey types consumed by animal species in the community.

The median is 8.

What is the interquartile range in the number of prey types? Use the method outlined in Section 3.2 to calculate the quartiles.

First quartile = 3 Third quartile = 17 Interquartile range = 14
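
For reference, here is a sketch of how the quartiles could be obtained in R. It assumes that the Section 3.2 method corresponds to `type = 2` in `quantile()` (averaging the two neighbouring observations when 0.25n or 0.75n is a whole number); that correspondence is an assumption, not something stated in the exercise. The vector is invented for illustration.

```r
# Hypothetical counts of prey types, NOT the El Verde data.
prey_types <- c(1, 1, 2, 3, 5, 8, 13, 21)

# type = 2 averages at discontinuities; R's default is type = 7.
quartiles <- quantile(prey_types, probs = c(0.25, 0.75), type = 2)
iqr_value <- IQR(prey_types, type = 2)

quartiles
iqr_value
```

With the default `type = 7` the quartiles can differ slightly, so the `type` argument matters when matching hand calculations.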

Can you calculate the mean number of prey types in the diet? Explain.

You cannot calculate the mean number of prey types in the diet because there is no way to know how many prey types are represented by a diet breadth of more than 20.

Chapter 3: #17

What type of graph is this?

This is a histogram.

Examine the graph and visually determine the approximate value of the mean (to the nearest 100 yards per minute). Explain how you obtained your estimate.

Approximately 1000 yards per minute. The histogram is roughly bell-shaped and symmetric, so the mean lies near the center of the distribution, which is at about 1000 yards per minute.

Examine the graph and visually determine the approximate value of the median (to the nearest 100 yards per minute). Explain how you obtained your estimate.

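The median is approximately 1000 yards per minute. The median is the speed that divides the area of the histogram into two equal halves, and because the distribution is roughly symmetric, that value lies close to the mean.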

Examine the graph and visually determine the approximate value of the mode (to the nearest 100 yards per minute). Explain how you obtained your estimate.

1100 yards per minute is the value with the highest frequency, thus being the approximate value of the mode.

Examine the graph and visually determine the approximate value of the standard deviation (to the nearest 50 yards per minute). Explain how you obtained your estimate.

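The standard deviation is approximately 250 yards per minute. The speeds span a range of about 1500 yards per minute, and in a bell-shaped distribution nearly all observations fall within three standard deviations on either side of the mean, so the range is roughly six standard deviations: 1500 / 6 = 250.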

Chapter 3: #22

Yeast_data <- read_csv(here::here("Data", "chapter03", "chap03q22YeastMutantGrowth.csv"))

## Rows: 11 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): mutantGrowthRate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Yeast_data)

## # A tibble: 6 × 1
##   mutantGrowthRate
##              <dbl>
## 1             0.86
## 2             1.02
## 3             1.02
## 4             1.01
## 5             1.02
## 6             1

str(Yeast_data)

## spec_tbl_df [11 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mutantGrowthRate: num [1:11] 0.86 1.02 1.02 1.01 1.02 1 0.99 1.01 0.91 0.83 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mutantGrowthRate = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

What is the mean growth rate of this sample of yeast lines?

0.971

When the mean of these numbers is reported, how many digits after the decimal should be used? Why?

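Three digits after the decimal should be used. The measurements are given to two decimal places, and a descriptive statistic such as the mean should be reported with one more decimal place than the measurements themselves, which is why the mean is reported as 0.971.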

What is the median growth rate of this sample?

1.01

What is the variance of growth rate of the sample?

0.00488909

What is the standard deviation of growth rate of the sample?

0.06992203
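
The calls used to obtain these values are not shown, so the following is a sketch of how they might be computed. It uses only the first ten growth rates printed by `str()` above; the eleventh value is truncated in that output, so the results from this snippet will not match the reported values for the full sample.

```r
# First ten growth rates as printed above; the eleventh is not visible,
# so this is an incomplete version of the sample.
growth_visible <- c(0.86, 1.02, 1.02, 1.01, 1.02, 1, 0.99, 1.01, 0.91, 0.83)

summary_stats <- c(
  mean = mean(growth_visible),
  median = median(growth_visible),
  variance = var(growth_visible),  # sample variance (divides by n - 1)
  sd = sd(growth_visible)          # square root of the sample variance
)

# One more decimal place than the measurements themselves.
round(summary_stats, 3)
```

Applying the same four functions to `Yeast_data$mutantGrowthRate` would give the values for all eleven lines.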

Chapter 3: #23

Zebra_data <- read_csv(here::here("Data", "chapter03", "chap03q23ZebraFishBoldness.csv"))

## Rows: 21 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): genotype
## dbl (1): secondsAggressiveActivity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Zebra_data)

## # A tibble: 6 × 2
##   genotype  secondsAggressiveActivity
##   <chr>                         <dbl>
## 1 wild type                         0
## 2 wild type                        21
## 3 wild type                        22
## 4 wild type                        28
## 5 wild type                        60
## 6 wild type                        80

str(Zebra_data)

## spec_tbl_df [21 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ genotype                 : chr [1:21] "wild type" "wild type" "wild type" "wild type" ...
##  $ secondsAggressiveActivity: num [1:21] 0 21 22 28 60 80 99 101 106 129 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   genotype = col_character(),
##   ..   secondsAggressiveActivity = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

boxplot(secondsAggressiveActivity ~ genotype, data = Zebra_data)

Draw a boxplot to compare the frequency distributions of aggression score in the two groups of zebrafish. According to the box plot, which genotype has the higher aggression scores?

The Spd genotype has the higher aggression scores.

According to the box plot, which sample spans the higher range of values for aggression scores?

The wild type sample spans the wider range of values for aggression scores.

Which sample has the larger interquartile range?

The wild type sample has the larger interquartile range.

What are the vertical lines projecting outward above and below each box?

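The vertical lines are the whiskers. In R's default box plot they extend to the smallest and largest values that lie within 1.5 times the interquartile range of the box; any values beyond that are drawn as separate points. When there are no such extreme values, the whiskers mark the minimum and maximum of the sample.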
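
To see where the whiskers end, `boxplot.stats()` returns the numbers that `boxplot()` draws. The sketch below uses an invented vector (hypothetical, not the zebrafish data) with one extreme value.

```r
# Hypothetical scores with one extreme value, NOT the zebrafish data.
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

box_numbers <- boxplot.stats(scores)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
box_numbers$stats

# out: values beyond the whiskers, drawn as individual points.
box_numbers$out
```

Here the upper whisker stops at 9 and the value 50 is reported separately, which shows that a whisker is not always the maximum.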

Chapter 3: #25

What type of graph is this?

This is a cumulative frequency distribution graph.

Which sex, males or females, has the earliest median emergence date? Explain how you obtained your answer.

Females have the earliest median emergence date. The median is the date at which the cumulative relative frequency reaches 0.5, and the female curve reaches 0.5 before the male curve does.

Which sex, male or female, has the greater interquartile range in emergence date? Explain how you obtained your answer.

Females have the greater interquartile range. Reading from each curve the dates at which the cumulative relative frequency reaches 0.25 and 0.75 and taking the difference, the interquartile range is approximately 9 for females and about 6 for males.

Chapter 3: #28

SeaUrchin_data <- read_csv(here::here("Data", "chapter03", "chap03q28SeaUrchinBindin.csv"))

## Rows: 19 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): populationOfFemale
## dbl (1): percentAAfertilization
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(SeaUrchin_data)

## # A tibble: 6 × 2
##   populationOfFemale percentAAfertilization
##   <chr>                               <dbl>
## 1 AA                                   0.58
## 2 AA                                   0.59
## 3 AA                                   0.69
## 4 AA                                   0.72
## 5 AA                                   0.78
## 6 AA                                   0.78

str(SeaUrchin_data)

## spec_tbl_df [19 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ populationOfFemale    : chr [1:19] "AA" "AA" "AA" "AA" ...
##  $ percentAAfertilization: num [1:19] 0.58 0.59 0.69 0.72 0.78 0.78 0.81 0.85 0.85 0.92 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   populationOfFemale = col_character(),
##   ..   percentAAfertilization = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

stripchart(percentAAfertilization ~ populationOfFemale, data = SeaUrchin_data, vertical = TRUE, method = "jitter")

Plot the data using a method other than the box plot. Is there an association in these data between female type and fertilization by AA sperm?

There is an association in these data.

Inspect the plot. On this basis, which method from this chapter (mean or median) would be best to compare the locations of the frequency distributions for the two groups? Explain your reasoning. Calculate and compare locations using this method.

The median would be best for comparing the locations of the frequency distributions because there is an outlier in the data, and the median is not sensitive to extreme values. The median for AA females is 0.795, which is much higher than the median of 0.37 for BB females. The outlier does not affect the medians enough to change this comparison.

Which method would be best to compare the spread of the frequency distributions for the two groups? Explain your reasoning. Calculate and compare spread using this method.

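Because the median was used to compare location and the data contain an outlier, the interquartile range is the natural partner for comparing spread, since it is far less sensitive to extreme values than the standard deviation. The standard deviation is the partner of the mean and is inflated by outliers. For reference, the standard deviation for AA females is 0.1239, which is lower than the standard deviation of 0.2639 for BB females, so the BB group is the more variable of the two by that measure. The difference in sample size does not explain this: a larger sample gives a more precise estimate, but it does not by itself make the standard deviation smaller.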
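
As a sketch of how the two measures of spread respond to an outlier, the code below computes the interquartile range and the standard deviation for each group, then repeats the calculation for one group with its extreme value removed. The values are invented (hypothetical, not the sea urchin data), and `IQR()` is used with R's default quantile type.

```r
# Hypothetical fertilization proportions, NOT the values from the data file.
# The last BB value (0.95) plays the role of an outlier.
urchins <- data.frame(
  populationOfFemale = rep(c("AA", "BB"), each = 6),
  percentAAfertilization = c(0.60, 0.70, 0.78, 0.80, 0.85, 0.90,
                             0.15, 0.25, 0.35, 0.40, 0.45, 0.95)
)

# Spread within each group, measured two ways.
group_iqr <- tapply(urchins$percentAAfertilization,
                    urchins$populationOfFemale, IQR)
group_sd <- tapply(urchins$percentAAfertilization,
                   urchins$populationOfFemale, sd)

# How much each measure grows when the outlier is included in the BB group.
bb <- urchins$percentAAfertilization[urchins$populationOfFemale == "BB"]
bb_no_outlier <- bb[bb < 0.9]
sd_ratio <- sd(bb) / sd(bb_no_outlier)
iqr_ratio <- IQR(bb) / IQR(bb_no_outlier)

round(c(sd_ratio = sd_ratio, iqr_ratio = iqr_ratio), 2)
```

In this invented example the standard deviation changes much more than the interquartile range when the extreme value is included.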