Chapter 9 Homework

This homework gives you practice transforming and visualizing data and fitting a distribution to a data set. Note that much of the code needed to complete this homework can be adapted from the Coursebook Exercises in Chapter 9.

When a question asks you to make a plot, remember to set a theme, title, subtitle, labels, colors, etc. It is up to you how to personalize your plots, but put in some effort and make the plotting approach consistent throughout the document. For example, you could use the same theme for all plots.

Question 1

Recreate Figure 9.8 (the three EDA plots based on salary_ps2$salary), but show the plots on a log-scale x-axis. Plot the histogram with 30 bins and move the legends so that they don’t block the data. Does the data in these plots appear more symmetric about the median? Why or why not?

## # A tibble: 6 × 3
##   salary sex   major   
##    <dbl> <chr> <chr>   
## 1  80000 M     Poly Sci
## 2  82000 F     Poly Sci
## 3  67000 F     Poly Sci
## 4  83000 F     Poly Sci
## 5  60000 M     Poly Sci
## 6 200000 M     Poly Sci

## # A tibble: 2 × 6
##   sex   median    mean   min     max   IQR
##   <chr>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 F      69003  84432.     0 1027653 57375
## 2 M      80000 114331.     0 1027653 78000
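
A grouped summary like the one above could be produced along the following lines. This is a minimal sketch, not the code used for the homework: `salary_demo` is a hypothetical stand-in built from only the six rows printed above, so its numbers will not match the full-data summary.

```r
library(dplyr)

# Hypothetical stand-in for salary_ps2: only the six rows printed above.
salary_demo <- tibble(
  salary = c(80000, 82000, 67000, 83000, 60000, 200000),
  sex    = c("M", "F", "F", "F", "M", "M"),
  major  = "Poly Sci"
)

# One row per sex, with the same summary columns as the table above.
salary_summary <- salary_demo %>%
  group_by(sex) %>%
  summarise(
    median = median(salary),
    mean   = mean(salary),
    min    = min(salary),
    max    = max(salary),
    IQR    = IQR(salary),
    .groups = "drop"
  )

salary_summary
```

A mean well above the median within a group is a quick numerical sign of right skew, which is what the log-scale plots are meant to address.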

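On a log10 x-axis, the salary distributions for men and women appear more symmetric about the median than they do on a linear axis. Salary data are right-skewed: most values cluster near the median while a small number of very high salaries stretch the upper tail, and the log transformation compresses that tail so the two sides of the distribution look more balanced. The log10 scale therefore suits these data better than the linear scale. The log-scaled plots can still mislead a reader who overlooks the axis, because equal distances along a log axis represent equal ratios rather than equal dollar differences, which visually understates the range of the data.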
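
The log-scale histogram and a simple numerical symmetry check might be sketched as follows. Everything here is an assumption for illustration: `salary_demo` is simulated lognormal data, not `salary_ps2`, and the titles, theme, and legend position are arbitrary choices rather than the Coursebook's Figure 9.8 code.

```r
library(ggplot2)

# Simulated stand-in data (NOT the real salary_ps2 data set).
set.seed(481)
salary_demo <- data.frame(
  salary = rlnorm(n = 500, meanlog = log(80000), sdlog = 0.6),
  sex    = rep(c("F", "M"), times = 250)
)

# Histogram with 30 bins on a log10 x-axis; legend moved below the panel
# so that it does not cover the data.
hist_log <- ggplot(salary_demo, aes(x = salary, fill = sex)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.6,
                 position = "identity") +
  scale_x_log10() +
  labs(
    title    = "Salary distribution",
    subtitle = "Simulated stand-in data, log10 x-axis",
    x        = "Salary (USD, log scale)",
    y        = "Count",
    fill     = "Sex"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")

hist_log

# Symmetry check: ratio of (Q3 - median) to (median - Q1) on each scale.
# A ratio near 1 indicates symmetry about the median.
q_raw <- quantile(salary_demo$salary, probs = c(0.25, 0.50, 0.75))
q_log <- quantile(log10(salary_demo$salary), probs = c(0.25, 0.50, 0.75))
ratio_raw <- unname((q_raw[3] - q_raw[2]) / (q_raw[2] - q_raw[1]))
ratio_log <- unname((q_log[3] - q_log[2]) / (q_log[2] - q_log[1]))
c(raw = ratio_raw, log10 = ratio_log)
```

With real data, any salaries of 0 have no finite position on a log axis, so they would need to be filtered or handled before plotting.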

Question 2

Modify the code that created the sal_simulate data frame to create a variable that simulates quantiles from a cumulative distribution. Plot these data (instead of a histogram). Hint: instead of rlnorm() you will need to use a different lognormal function that takes a vector of quantiles as input (you will need to specify the quantile vector). Type ?Lognormal into the Console for help.
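
One way to approach this is with `plnorm()`, the lognormal cumulative distribution function, which takes a vector of quantiles and returns cumulative probabilities. The sketch below is an assumption-laden illustration: the column names `q` and `cum_prob`, the quantile range, and the `meanlog` and `sdlog` values (borrowed from the Question 3 output purely as placeholders) are not taken from the Coursebook's `sal_simulate` code.

```r
library(ggplot2)

# Placeholder lognormal parameters (assumed for illustration only).
meanlog_assumed <- 4.32
sdlog_assumed   <- 0.67

# Quantile vector: the values at which the cumulative distribution is evaluated.
sal_simulate <- data.frame(q = seq(from = 1, to = 500, by = 1))

# plnorm() returns P(X <= q) for a lognormal random variable X.
sal_simulate$cum_prob <- plnorm(sal_simulate$q,
                                meanlog = meanlog_assumed,
                                sdlog   = sdlog_assumed)

# Plot the cumulative distribution as a line instead of a histogram.
cdf_plot <- ggplot(sal_simulate, aes(x = q, y = cum_prob)) +
  geom_line(color = "steelblue", linewidth = 1) +
  labs(
    title    = "Simulated lognormal cumulative distribution",
    subtitle = "Evaluated with plnorm() over an assumed quantile vector",
    x        = "Quantile",
    y        = "Cumulative probability"
  ) +
  theme_bw()

cdf_plot
```

The curve rises from 0 toward 1 and crosses 0.5 at the median of the distribution, which for a lognormal is `exp(meanlog)`.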

Question 3

Mutate the salary_ps2 data frame to create a new column variable that takes the log of the salary data (call that variable log.salary). Then use fitdistr() to fit a normal distribution to log.salary. What are the resultant parameter estimates for the mean and sd? Hint: the output of fitdistr() is a list; look in the estimate entry for these parameters. How close are these estimates to those calculated in section 9.6.4 of the Coursebook?

## mean   sd 
## 4.32 0.67

These estimates (mean = 4.32, sd = 0.67) match the values reported in section 9.6.4 of the Coursebook.
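
The fitting step might look like the sketch below. It uses a hypothetical six-row stand-in (the rows shown under Question 1) rather than the full `salary_ps2` data, so its estimates will differ from 4.32 and 0.67. The natural log and the `salary > 0` filter are assumptions; an estimated mean of 4.32 would be consistent with natural-log salaries expressed in thousands of dollars, but the source does not state the units.

```r
library(dplyr)

# Hypothetical stand-in for salary_ps2 (six rows only, not the full data).
salary_demo <- tibble(
  salary = c(80000, 82000, 67000, 83000, 60000, 200000)
)

salary_demo <- salary_demo %>%
  filter(salary > 0) %>%             # log(0) is -Inf, which cannot be fit
  mutate(log.salary = log(salary))   # natural log (assumed)

# Fit a normal distribution to the log-transformed salaries.
# MASS::fitdistr() is called with its namespace so that MASS::select()
# does not mask dplyr::select().
fit <- MASS::fitdistr(salary_demo$log.salary, densfun = "normal")

# The estimate entry is a named numeric vector with elements "mean" and "sd".
round(fit$estimate, 2)
```

For the normal case the fitted mean equals the sample mean of `log.salary`, and the fitted sd is the maximum-likelihood value, which divides by n rather than n - 1 and so is slightly smaller than `sd(log.salary)`.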

MECH481A6: Engineering Data Analysis in R