This homework gives you practice in transforming and visualizing data and in fitting a distribution to a data set. Much of the code needed to complete it can be adapted from the Coursebook Exercises in Chapter 9.
When a question asks you to make a plot, remember to set a theme, title, subtitle, labels, colors, etc. It is up to you how to personalize your plots, but put in some effort and make the plotting approach consistent throughout the document. For example, you could use the same theme for all plots.
Recreate Figure 9.8 (the three EDA plots based on
salary_ps2$salary), but show the plots on a log-scale
x-axis. Plot the histogram with 30 bins and move the legends so that
they don’t block the data. Does the data in these plots appear more
symmetric about the median? Why or why not?
## # A tibble: 6 × 3
##   salary sex   major
##    <dbl> <chr> <chr>
## 1  80000 M     Poly Sci
## 2  82000 F     Poly Sci
## 3  67000 F     Poly Sci
## 4  83000 F     Poly Sci
## 5  60000 M     Poly Sci
## 6 200000 M     Poly Sci
## # A tibble: 2 × 6
##   sex   median    mean   min     max   IQR
##   <chr>  <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1 F      69003  84432.     0 1027653 57375
## 2 M      80000 114331.     0 1027653 78000
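On the log10 scale, the salary distributions for men and women appear more symmetric about the median. Salaries are strongly right-skewed: in the summary table the mean exceeds the median for both sexes (84432 vs. 69003 for women, 114331 vs. 80000 for men). A log scale compresses the long upper tail, so equal distances on the axis represent equal ratios rather than equal differences. The log-scaled plots should still be read with care. They understate how wide the range of salaries is in absolute terms, and salaries of 0 (the minimum in the summary table) have no finite logarithm and cannot be drawn on a log axis.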
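The log-scale histogram and the symmetry claim can be sketched as follows. This is a minimal illustration, not the code behind Figure 9.8: `salary_ps2` is not reproduced here, so `sal_demo` is a hypothetical simulated stand-in with arbitrary lognormal parameters, and `bowley()` is a helper introduced only for this example. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

set.seed(42)

# Hypothetical stand-in for salary_ps2 (the real data are not reproduced here):
# right-skewed simulated salaries plus two zeros, with a sex label.
n <- 5000
sal_demo <- data.frame(
  salary = c(rlnorm(n, meanlog = 11.2, sdlog = 0.6), 0, 0),
  sex    = sample(c("F", "M"), n + 2, replace = TRUE)
)

# log10(0) is -Inf, so drop zero salaries explicitly before using a log axis.
sal_pos <- subset(sal_demo, salary > 0)

p_hist <- ggplot(sal_pos, aes(x = salary, fill = sex)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  scale_x_log10(labels = scales::label_comma()) +
  labs(
    title    = "Simulated salaries by sex",
    subtitle = "Histogram with 30 bins, log10 x-axis",
    x = "Salary (log10 scale)", y = "Count", fill = "Sex"
  ) +
  theme_bw() +
  theme(legend.position = "bottom")  # moves the legend below the panel, off the data
print(p_hist)

# Quartile (Bowley) skewness: values near 0 mean the quartiles sit
# symmetrically about the median.
bowley <- function(x) {
  q <- quantile(x, c(0.25, 0.50, 0.75), names = FALSE)
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}
skew_raw <- bowley(sal_pos$salary)         # raw scale
skew_log <- bowley(log10(sal_pos$salary))  # log10 scale
```

For right-skewed data like this simulated sample, `skew_log` should sit much closer to 0 than `skew_raw`, which is a numerical counterpart to the visual impression of symmetry. The same `scale_x_log10()` and `theme(legend.position = ...)` calls could be added to the density and box plots.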
Modify the code that created the sal_simulate data frame
to create a variable that simulates quantiles from a cumulative
distribution. Plot these data (instead of a histogram). Hint:
instead of rlnorm() you will need to use a different function
from the Lognormal family, one that takes a vector of quantiles as
input (you will need to specify the quantile vector). Type ?Lognormal
into the Console for help.
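One possible reading of the hint is `plnorm()`, the lognormal cumulative distribution function, which takes a vector of quantiles `q`. The sketch below rests on that assumption. The `meanlog` and `sdlog` values are illustrative placeholders and not necessarily the ones used to build sal_simulate, and the names `sal_cdf` and `cum_prob` are hypothetical. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Illustrative lognormal parameters (assumed; substitute the values
# used when sal_simulate was created).
meanlog <- 4.32
sdlog   <- 0.67

# Quantile vector: the salary values at which the CDF is evaluated.
q <- seq(from = 1, to = 500, by = 1)

# plnorm() returns P(X <= q) for a lognormal X.
sal_cdf <- data.frame(
  salary   = q,
  cum_prob = plnorm(q, meanlog = meanlog, sdlog = sdlog)
)

p_cdf <- ggplot(sal_cdf, aes(x = salary, y = cum_prob)) +
  geom_line(color = "steelblue") +
  labs(
    title    = "Lognormal cumulative distribution",
    subtitle = "Evaluated with plnorm() over a user-specified quantile vector",
    x = "Salary", y = "Cumulative probability"
  ) +
  theme_bw()
print(p_cdf)
```

Because the output is a smooth function of `q` rather than a random sample, a line plot is a natural replacement for the histogram. The curve crosses 0.5 at `exp(meanlog)`, the median of the lognormal distribution.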
Mutate the salary_ps2 data frame to create a new column
variable that takes the log of the salary data (call that variable
log.salary). Then use fitdistr() to fit a
normal distribution to log.salary. What are the
resultant parameter estimates for the mean and sd? Hint: the output of
fitdistr() is a list; look in the estimate
entry for these parameters. How close are these estimates to those
calculated in section
9.6.4 of the Coursebook?
## mean   sd
## 4.32 0.67
These estimates (mean 4.32, sd 0.67) match the values reported in the Coursebook.
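The fitting step might look like the sketch below. It uses simulated data, since `salary_ps2` is not reproduced here: `sal_demo` and its lognormal parameters are hypothetical, so the printed estimates will not equal the 4.32 and 0.67 reported above. It assumes the dplyr and MASS packages are installed. `fitdistr()` is called as `MASS::fitdistr()` so that loading MASS does not mask `dplyr::select()`.

```r
set.seed(1)

# Hypothetical stand-in for salary_ps2: strictly positive simulated salaries.
sal_demo <- data.frame(salary = rlnorm(1000, meanlog = 4.3, sdlog = 0.7))

# Add the natural log of salary as a new column.
sal_demo <- dplyr::mutate(sal_demo, log.salary = log(salary))

# Fit a normal distribution to the logged values by maximum likelihood.
fit <- MASS::fitdistr(sal_demo$log.salary, densfun = "normal")

# The output is a list; the parameter estimates live in the `estimate` entry.
round(fit$estimate, 2)

# Cross-check: for a normal fit, the maximum-likelihood estimates are the
# sample mean and the standard deviation with divisor n (not n - 1).
x       <- sal_demo$log.salary
mle_sd  <- sqrt(mean((x - mean(x))^2))
```

With real data, any salaries of 0 would need to be removed before taking logs, because `log(0)` is `-Inf` and `fitdistr()` rejects non-finite values.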