Question 1: OkCupid Profile Data

We investigate a data set of about 60,000 San Francisco OkCupid users in 2012.

a)

Use a geom_histogram() to compare male and female heights, using:

  • A reasonable range of heights. For example, a self-reported height of 30 inches is certainly someone joking around.
  • An appropriate binning structure given the units of the data.
# Write your code below:
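# Keep only plausible heights (50 to 80 inches); note this overwrites `profiles`
profiles <- filter(profiles, between(height, 50, 80))

# Histogram of height, one panel per sex
ggplot(data = profiles, aes(x = height)) +
  geom_histogram(bins = 20) +
  facet_wrap(~sex)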

profiles %>%
  group_by(sex) %>%
  summarise(mean_height=mean(height, na.rm=TRUE))
## # A tibble: 2 × 2
##     sex mean_height
##   <chr>       <dbl>
## 1     f    65.10567
## 2     m    70.43220

Men’s heights seem to range from around 55 inches to 80 inches, and women’s heights from around 50 to 77 inches. Using about 20 bins seems appropriate.
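Because the heights appear to be recorded in whole inches (the height summaries print as integers), another option is to set the bin width rather than the number of bins. The following is a minimal sketch, not the assignment's required answer; `heights_demo` is a hypothetical stand-in for `profiles`, and its values are made up purely so the block runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for `profiles`; these heights are made up for illustration.
heights_demo <- data.frame(
  sex = rep(c("f", "m"), each = 6),
  height = c(62, 64, 65, 65, 66, 68, 67, 69, 70, 70, 72, 73)
)

# binwidth = 1 gives one bar per whole inch;
# ncol = 1 stacks the panels so both share the same horizontal axis.
p <- ggplot(heights_demo, aes(x = height)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~sex, ncol = 1)
p
```

With the real data, replacing `heights_demo` by `profiles` would keep the rest of the call unchanged.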

b)

  1. How tall is a typical male?
    The average height for a male is 70.4 inches.

  2. How tall is a typical female?
    The average height for a female is 65.1 inches.

  3. While the centers of both distributions might be different, what about their spread?
    The height spread of females on OkCupid is narrower than the height spread of males.

c)

Within the male histogram there should be two large spikes. At what height is the second spike occurring? What could be a sociological explanation for this phenomenon?

The second spike is around 73 inches (6 ft 1 in), possibly because it is a common height for men to report when they want to present themselves as just over 6 feet tall.

d)

Create a boxplot comparing ages for males and females. What can you say about the age distribution of females in San Francisco when compared to the males?

# Write your code below:
ggplot(data = profiles, aes(x = sex , y = age)) +
  geom_boxplot()

summary(profiles$age) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   26.00   30.00   32.34   37.00  110.00
profiles %>%
  group_by(sex) %>%
  summarise(max_height=max(height, na.rm=TRUE))
## # A tibble: 2 × 2
##     sex max_height
##   <chr>      <int>
## 1     f         77
## 2     m         80

The spread for female heights is wider than the spread of male heights.
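Since this part asks about age, a per-sex summary of age (rather than height) would support the comparison more directly. Below is a minimal sketch of one way to do that with the median and interquartile range; `profiles_demo` is a hypothetical stand-in for `profiles`, and its ages are made up so the block runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for `profiles`; these ages are made up for illustration.
profiles_demo <- tibble(
  sex = c("f", "f", "f", "f", "m", "m", "m", "m"),
  age = c(22, 28, 35, 51, 24, 27, 31, 38)
)

# Median describes the center and IQR the spread of age within each sex.
age_summary <- profiles_demo %>%
  group_by(sex) %>%
  summarise(
    median_age = median(age, na.rm = TRUE),
    iqr_age = IQR(age, na.rm = TRUE)
  )
age_summary
```

The median and IQR correspond to the middle line and box height of each boxplot, so the numbers can be read alongside the plot.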

Question 2:

(Preview of chapter on data manipulation/wrangling) Using the same San Francisco OkCupid data from Question 1, a research scientist produces the table cross-classifying a user’s self-identified sexual orientation and their sex below. The following code will:

  1. Take all users’ profiles data then
  2. First group users by their self-identified sexual orientation, second group users by their sex then
  3. Within each group, summarise that group with a count of the n()umber of people then
  4. mutate existing variables to create new ones:
    • the proportion for each group: count/sum(count)
    • then round() the proportion variable to two digits.
# Do not modify any code in this block.
output_table <- profiles %>% 
  group_by(orientation, sex) %>% 
  summarise(count = n()) %>% 
  mutate(
    proportion = count/sum(count),
    proportion = round(proportion, digits=2)
    )
# Print table in clean format:
kable(output_table)
|orientation |sex | count| proportion|
|:-----------|:---|-----:|----------:|
|bisexual    |f   |  1994|       0.72|
|bisexual    |m   |   769|       0.28|
|gay         |f   |  1587|       0.28|
|gay         |m   |  3982|       0.72|
|straight    |f   | 20514|       0.40|
|straight    |m   | 30992|       0.60|
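The proportions in the table add up to 1 within each orientation rather than across all users. That is consistent with summarise() dropping only the last grouping variable (sex), so that mutate() still works within orientation. The following base R sketch reproduces the proportion column from the printed counts under that reading; it uses ave() instead of the dplyr pipeline only so that it runs without extra packages.

```r
# Counts copied from the table above.
counts <- data.frame(
  orientation = c("bisexual", "bisexual", "gay", "gay", "straight", "straight"),
  sex = c("f", "m", "f", "m", "f", "m"),
  count = c(1994, 769, 1587, 3982, 20514, 30992)
)

# ave() returns, for every row, the total count within that row's orientation.
orientation_total <- ave(counts$count, counts$orientation, FUN = sum)

# Dividing by the per-orientation total gives a within-orientation proportion.
counts$proportion <- round(counts$count / orientation_total, digits = 2)
counts
```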

a)

Read the questions here:

  1. What proportion of the 59799 SF OkCupid users are female? What are some explanations for this?
  2. What proportion of bisexual SF OkCupid users are female?
  3. What proportion of female SF OkCupid users are gay?
  4. If I randomly choose someone who self-identifies as straight, are they more likely to be male or female?

Write your answers here (Do not insert blank lines between your answers, kind of like how I wrote the questions above):

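  1. 24,095/59799 = about 40.3 percent of SF OkCupid users are female.
  2. 72 percent of bisexual users are female.
  3. 1,587/24,095 = about 6.6 percent of female users are gay; the 0.28 in the table is the share of gay users who are female.
  4. They would statistically more likely be male, since 60 percent of straight users are male.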
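Questions 2 and 3 condition in opposite directions: the table's proportion column divides by the total for an orientation, while "what proportion of female users are gay" divides by the total number of female users. A minimal base R sketch of the two denominators, using only counts printed in the table:

```r
# Female counts by orientation, and gay counts by sex, copied from the table.
female_counts <- c(bisexual = 1994, gay = 1587, straight = 20514)
gay_counts <- c(f = 1587, m = 3982)

# Share of female users who are gay: denominator is all female users.
gay_among_females <- female_counts[["gay"]] / sum(female_counts)

# Share of gay users who are female: denominator is all gay users.
female_among_gay <- gay_counts[["f"]] / sum(gay_counts)

round(c(gay_among_females = gay_among_females,
        female_among_gay = female_among_gay), 3)
```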

Question 3: Gapminder

Watch the following 20 minute TED Talk by Hans Rosling on “The best stats you’ve ever seen.” The human and international development data seen in the video is accessible in the gapminder data set within the gapminder package.

a)

Recreate the scatterplot of “Child Survival (%)” over “GDP per capita ($)” for 1980 seen in the video, but

  • Making a comparison between 1952 and 2007
  • Displaying “life expectancy” instead of “Child Survival”

Copy the template code below:

# Note this code will not work in your console, it merely serves as a template
# that you will modify
ggplot(data=DATASETNAME, aes(AES1=VAR1, AES2=VAR2, AES3=VAR3, AES4=VAR4)) +
  geom_point() + 
  facet_wrap(~VAR5) +
  scale_x_log10() + 
  labs(x="WRITE INFORMATIVE LABEL HERE", y="WRITE INFORMATIVE LABEL HERE", title="WRITE INFORMATIVE TITLE HERE")

then paste it in the code block below, then replace anything in CAPS with the appropriate terms:

# Paste your code below and modify:
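# Keep only the two years being compared; the population column in gapminder is `pop`
ggplot(data=filter(gapminder, year %in% c(1952, 2007)), aes(x=gdpPercap, y=lifeExp, color=continent, size=pop)) +
  geom_point() +
  facet_wrap(~year) +
  scale_x_log10() +
  labs(x="GDP per capita (log scale)", y="Life expectancy (years)", title="Life expectancy vs. GDP per capita, 1952 and 2007")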

b)

Describe two facts that would be of interest to international development organizations.

Africa has lower life expectancy than other continents in both 1952 and 2007. Meanwhile, Asia made significant progress over the 55-year period, possibly because of advances in technology and changes in diet.
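One way to back up statements like these with numbers is to average life expectancy by continent for the two years. The sketch below assumes the gapminder package is installed and uses a simple unweighted mean across countries, which is an illustrative choice rather than the only reasonable one.

```r
library(gapminder)

# Keep only the two years being compared.
two_years <- subset(gapminder, year %in% c(1952, 2007))

# Unweighted mean life expectancy across countries, per continent and year.
mean_life_exp <- aggregate(lifeExp ~ continent + year, data = two_years, FUN = mean)
mean_life_exp
```

A population-weighted mean would give large countries more influence and could tell a somewhat different story.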