View() any data sets and read the help file first. For example, the help file for the profiles data set below is accessible by typing ?profiles in the console after loading the data.
We investigate a data set of about 60,000 San Francisco OkCupid users in 2012.
Use a geom_histogram() to compare male and female heights, for
# Write your code below:
# Keep only plausible heights (in inches)
profiles <- filter(profiles, between(height, 50, 80))
# Histogram of height, one panel per sex
ggplot(data = profiles, aes(x = height)) +
  geom_histogram(bins = 20) +
  facet_wrap(~sex)
# Mean height by sex
profiles %>%
  group_by(sex) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
## # A tibble: 2 × 2
## sex mean_height
## <chr> <dbl>
## 1 f 65.10567
## 2 m 70.43220
The range for men seems to be around 55 inches to 80 inches. Women’s heights seem to range from 50 to 77 inches. An appropriate number of bins is around 20.
How tall is a typical male?
The average height for a male is about 70.4 inches.
How tall is a typical female?
The average height for a female is about 65.1 inches.
While the centers of both distributions might be different, what about their spread?
The height spread of females on OkCupid is narrower than the height spread of males.
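The spread comparison is read off the histograms by eye. A minimal sketch of one way to put numbers on it is shown here, using a small made-up data frame (toy_profiles and spread_by_sex are illustrative names, and the toy heights are not taken from the OkCupid data). The same function could be applied to profiles once it is loaded.

```r
library(dplyr)

# Made-up stand-in for profiles: only the two columns used here.
toy_profiles <- data.frame(
  sex    = c("f", "f", "f", "f", "m", "m", "m", "m"),
  height = c(62, 64, 65, 67, 66, 70, 71, 75)
)

# Three common measures of spread, computed separately for each sex.
spread_by_sex <- function(df) {
  df %>%
    group_by(sex) %>%
    summarise(
      sd_height    = sd(height, na.rm = TRUE),
      iqr_height   = IQR(height, na.rm = TRUE),
      range_height = max(height, na.rm = TRUE) - min(height, na.rm = TRUE)
    )
}

toy_spread <- spread_by_sex(toy_profiles)
print(toy_spread)
```

The range depends only on the two most extreme values, so the standard deviation and IQR are usually the safer basis for a statement about spread.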
Within the male histogram there should be two large spikes. At what height is the second spike occurring? What could be a sociological explanation for this phenomenon?
The second spike is around 73 inches, possibly because it is a common tall height.
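With bins = 20 each bar covers more than one inch of height, so the exact position of a spike is hard to read from the histogram. One way to check it is to count how often each exact height occurs. The sketch below assumes a data frame with sex and height columns and uses made-up values (toy_profiles and most_common_heights are illustrative names). Calling the function on profiles would give the real counts.

```r
library(dplyr)

# Made-up stand-in for profiles.
toy_profiles <- data.frame(
  sex    = c("m", "m", "m", "m", "m", "m", "m", "f", "f"),
  height = c(70, 70, 70, 72, 72, 71, 69, 64, 65)
)

# Most frequent exact heights for one sex, most common first.
most_common_heights <- function(df, which_sex, top = 3) {
  df %>%
    filter(sex == which_sex, !is.na(height)) %>%
    count(height, sort = TRUE) %>%
    head(top)
}

male_peaks <- most_common_heights(toy_profiles, "m")
print(male_peaks)
```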
Create a boxplot comparing ages for males and females. What can you say about the age distribution of females in San Francisco when compared to the males?
# Write your code below:
ggplot(data = profiles, aes(x = sex, y = age)) +
  geom_boxplot()
summary(profiles$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 26.00 30.00 32.34 37.00 110.00
profiles %>%
  group_by(sex) %>%
  summarise(max_height = max(height, na.rm = TRUE))
## # A tibble: 2 × 2
## sex max_height
## <chr> <int>
## 1 f 77
## 2 m 80
Judging by the range alone, female heights (about 50 to 77 inches) span slightly more than male heights (about 55 to 80 inches).
(Preview of chapter on data manipulation/wrangling) Using the same San Francisco OkCupid data from Question 1, a research scientist produces the table below, cross-classifying a user’s self-identified sexual orientation and their sex. The following code will:
1. Take the profiles data, then
2. group_by() users first by their self-identified sexual orientation and second by their sex, then
3. summarise() each group with a count of the n()umber of people, then
4. mutate() existing variables to create new ones:
   - proportion for each group: count/sum(count)
   - round() the proportion variable to two digits.
# Do not modify any code in this block.
output_table <- profiles %>%
  group_by(orientation, sex) %>%
  summarise(count = n()) %>%
  mutate(
    proportion = count/sum(count),
    proportion = round(proportion, digits=2)
  )
# Print table in clean format:
kable(output_table)
| orientation | sex | count | proportion |
|---|---|---|---|
| bisexual | f | 1994 | 0.72 |
| bisexual | m | 769 | 0.28 |
| gay | f | 1587 | 0.28 |
| gay | m | 3982 | 0.72 |
| straight | f | 20514 | 0.40 |
| straight | m | 30992 | 0.60 |
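The proportions in the table sum to 1 within each orientation, not across the whole table. After summarise(), dplyr drops the last grouping variable (sex) by default, so the result is still grouped by orientation and sum(count) is computed per orientation. The sketch below rebuilds the counts from the table as a plain data frame and makes that grouping explicit. The names counts, within_orientation and overall are illustrative and not part of the assignment.

```r
library(dplyr)

# Counts copied from the table above.
counts <- data.frame(
  orientation = rep(c("bisexual", "gay", "straight"), each = 2),
  sex         = rep(c("f", "m"), times = 3),
  count       = c(1994, 769, 1587, 3982, 20514, 30992)
)

# Grouped by orientation: each count is divided by its orientation's total,
# which reproduces the proportion column in the table.
within_orientation <- counts %>%
  group_by(orientation) %>%
  mutate(proportion = round(count / sum(count), digits = 2)) %>%
  ungroup()

# No grouping: each count is divided by the grand total instead.
overall <- counts %>%
  mutate(proportion = round(count / sum(count), digits = 2))

print(within_orientation)
print(overall)
```

Comparing the two results shows that the question being answered changes with the grouping: the first gives the share of each sex within an orientation, the second the share of each orientation and sex combination among all users.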
Read the questions here:
Write your answers here (Do not insert blank lines between your answers, kind of like how I wrote the questions above):
Watch the following 20 minute TED Talk by Hans Rosling on “The best stats you’ve ever seen.” The human and international development data seen in the video is accessible in the gapminder data set within the gapminder package.
Recreate the scatterplot of “Child Survival (%)” over “GDP per capita ($)” for 1980 seen in the video, but
Copy the template code below:
# Note this code will not work in your console, it merely serves as a template
# that you will modify
ggplot(data=DATASETNAME, aes(AES1=VAR1, AES2=VAR2, AES3=VAR3, AES4=VAR4)) +
  geom_point() +
  facet_wrap(~VAR5) +
  scale_x_log10() +
  labs(x="WRITE INFORMATIVE LABEL HERE", y="WRITE INFORMATIVE LABEL HERE", title="WRITE INFORMATIVE TITLE HERE")
then paste it in the code block below, then replace anything in CAPS with the appropriate terms:
# Paste your code below and modify:
ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp, color=continent, size=pop)) +
  geom_point() +
  facet_wrap(~year) +
  scale_x_log10() +
  labs(x="GDP per capita ($, log scale)", y="Life expectancy (years)", title="Life expectancy versus GDP per capita, by year")
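Two caveats apply when comparing this plot with the video. As far as I can tell, the gapminder package records observations in five-year steps, so 1980 itself is not among the available years, and the package has life expectancy (lifeExp) rather than child survival. The sketch below prints the available years so this can be checked, then summarises life expectancy by continent for the first and last years as a numeric companion to the facets. The names available_years and life_by_continent are illustrative.

```r
library(dplyr)
library(gapminder)

# Which years does the package actually contain?
available_years <- sort(unique(gapminder$year))
print(available_years)

# Median life expectancy per continent in the earliest and latest years.
life_by_continent <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_lifeExp = median(lifeExp), .groups = "drop")

print(life_by_continent)
```

Adding the same filter() step before ggplot() would restrict the scatterplot to two panels, which makes a before-and-after comparison easier to read than twelve facets.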
Describe two facts that would be of interest to international development organizations.
Africa tends to have lower life expectancy than the other continents in both 1952 and 2007. Meanwhile, Asia made significant progress over that roughly 50-year period, possibly because of advances in technology and changes in diet.