Chapter 2 Problem Set

Preparation

First, load in the data set using the “load” command, as well as any relevant packages. In other cases, commands like “read_dta” would be appropriate, but given that the file format here is .RData, it is easiest to use load().
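
As a minimal sketch of this step, the snippet below saves a small stand-in data frame to a temporary .RData file and restores it with load(). The object name `toy` and the file name are placeholders, since the actual file names for this problem set are not given here. The plots later in this document call ggplot(), so ggplot2 is presumably among the packages loaded.

```r
# Hypothetical stand-in for the course data; the real file name is not given here.
toy <- data.frame(country = c("A", "B"), medals = c(0, 3))
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)

rm(toy)                      # remove the object so load() has something to restore
loaded_names <- load(path)   # load() returns the names of the objects it restored
print(loaded_names)
str(toy)                     # the data frame is back under its original name
```

Unlike read_dta(), load() does not return the data itself: it recreates the saved objects under the names they had when they were saved.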

1) What determines success at the Winter Olympics? Does population matter? Income? Or is it simply a matter of being in a cold place with lots of mountains?

(a) Summarize the medals, athletes, and GDP data.

The summary statistics for medals are as follows:
- The minimum is 0.
- The 1st quartile is 0.
- The median is 0.
- The 3rd quartile is 0.
- The maximum is 37.
- The mean is 1.8.
The summary statistics for athletes are as follows:
- The minimum is 0.
- The 1st quartile is 0.
- The median is 2.
- The 3rd quartile is 13.
- The maximum is 230.
- The mean is 18.2.
The summary statistics for the GDP data are as follows:
- The minimum is 0.
- The 1st quartile is 0.1.
- The median is 0.4.
- The 3rd quartile is 1.5.
- The maximum is 14.5.
- The mean is 1.2.
- 156 observations are missing, and are recorded in this data set as NA.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     1.8     0.0    37.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     2.0    18.2    13.0   230.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.1     0.4     1.2     1.5    14.5     156
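
Output of this shape can be produced with summary() applied to each column. The sketch below uses a made-up five-row data frame (not the Olympics data) with the column names that appear in the plotting code later in this document; it also shows that summary() only adds an NA's entry when a variable has missing values.

```r
# Toy values for illustration only; these are not the real observations.
toy <- data.frame(
  medals   = c(0, 0, 0, 2, 37),
  athletes = c(0, 1, 2, 13, 230),
  GDP      = c(NA, 0.1, 0.4, 1.5, 14.5)
)

medals_summary <- summary(toy$medals)
gdp_summary    <- summary(toy$GDP)
print(medals_summary)
print(gdp_summary)               # includes an "NA's" entry for the missing value

n_missing_gdp <- sum(is.na(toy$GDP))   # direct count of missing GDP values
print(n_missing_gdp)
```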

(b) List the first five observations for the country, year, medals, athletes, and GDP data.

The first five observations for the requested variables are:
- Country:
  1. Albania
  2. Albania
  3. Albania
  4. Albania
  5. Albania
- Year:
  1. 1980
  2. 1984
  3. 1988
  4. 1992
  5. 1994
- Medals:
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
- Athletes:
  1. 0
  2. 0
  3. 0
  4. 0
  5. 0
- GDP:
  1. NA
  2. 0.0641
  3. 0.0637
  4. 0.0206
  5. 0.0587
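
A listing like this can be obtained by selecting the columns of interest and passing them to head(). In the sketch below, the first five rows copy the values listed above, while the sixth row and the `temp` values are made-up placeholders; the column names `country` and `year` are assumptions, since they do not appear in the code shown in this document.

```r
# Toy data: rows 1-5 copy the listing above; row 6 and temp are placeholders.
toy <- data.frame(
  country  = c(rep("Albania", 5), "Placeholder"),
  year     = c(1980, 1984, 1988, 1992, 1994, 1998),
  medals   = c(0, 0, 0, 0, 0, 1),
  athletes = c(0, 0, 0, 0, 0, 4),
  GDP      = c(NA, 0.0641, 0.0637, 0.0206, 0.0587, 0.1),
  temp     = c(40, 40, 40, 40, 40, 35)
)

wanted <- c("country", "year", "medals", "athletes", "GDP")
first_five <- head(toy[, wanted], 5)   # keep the requested columns, first 5 rows
print(first_five)
```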

(c) How many observations are there for each year?

Here is the number of observations for each year in the data set:

1980: 117
1984: 117
1988: 117
1992: 113
1994: 110
1998: 110
2002: 110
2006: 110
2010: 109
2014: 109

## 
## 1980 1984 1988 1992 1994 1998 2002 2006 2010 2014 
##  117  117  117  113  110  110  110  110  109  109
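
Counts of this kind come from tabulating the year variable. The sketch below shows the idea with table() on a made-up vector of years; the column name `year` is an assumption.

```r
# Toy data for illustration only.
toy <- data.frame(year = c(1980, 1980, 1984, 1984, 1984, 1988))

obs_per_year <- table(toy$year)   # one count per distinct year
print(obs_per_year)

n_1984 <- as.integer(obs_per_year["1984"])   # look up a single year by name
print(n_1984)
```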

(d) Produce a scatterplot of medals and the number of athletes. Describe the relationship depicted.

Running the code below creates and displays a scatterplot called “med_ath” (since it displays the relationship between medals awarded and number of athletes). The scatterplot shows a positive relationship between the size of an Olympic delegation and the number of medals awarded to that delegation in a given year. Because the number of medals available is limited, many of the delegations that have participated in the Olympics since 1980 earned 0 medals. The relationship appears roughly linear, but this cannot be claimed definitively without further analysis of the data.

med_ath <- ggplot(olympics, aes(x = athletes, y = medals))+
  geom_jitter()+
  ylab("Number of Medals Awarded")+
  xlab("Number of Athletes Competing")+
  ggtitle("Number of Medals Awarded for Different Sizes of Olympic Delegations, 1980-2014")+
  theme_bw()
med_ath

(e) Explain any suspicion you might have that other factors might explain the observed relationship between the number of athletes and medals.

It’s possible that countries which field small numbers of athletes do so because they lack the resources to field larger groups. This could also mean that those same countries lack the resources to effectively train and coach their athletic delegations, or even just that they lack the resources to train at a level comparable to larger, wealthier countries. Investigating GDP could help clarify this relationship.

(f) Create a scatterplot of medals and GDP. Briefly describe any clear patterns.

For GDP below roughly 5, a higher GDP appears to be associated with a higher number of medals awarded. For GDP above roughly 5, this relationship is not present, and these data points all indicate 0 medals awarded. Based solely on visual inspection, GDP seems to be a less effective predictor of the number of medals awarded than athletic delegation size, owing mainly to the outliers with high GDP and 0 medals awarded.

med_GDP <- ggplot(olympics, aes(x = GDP, y = medals))+
  geom_jitter()+
  xlab("Per Capita GDP in $10,000 U.S.")+
  ylab("Number of Medals Awarded")+
  ggtitle("Relationship Between Number of Olympic Medals Awarded and National GDP, 1980-2014")+
  theme_bw()
med_GDP

## Warning: Removed 156 rows containing missing values (geom_point).

(g) Create a scatterplot of medals and population. Briefly describe any clear patterns.

The scatterplot of medals and population depicts a positive relationship between the population of a country and the number of medals awarded to its athletic delegation for countries with a population below approximately 30,000,000. There are very few data points for populations larger than 30,000,000. Given that one would expect larger, more populous countries (as well as wealthier countries) to field more highly trained athletes at the Olympics, this data set merits further investigation to determine whether the data are of sufficient quality.

med_pop <- ggplot(olympics, aes(x = population, y = medals))+
  geom_jitter()+
  xlab("Population (in 100,000)")+
  ylab("Number of Medals Awarded")+
  ggtitle("Relationship Between Population and Number of Olympic Medals Awarded, 1980-2014")+
  theme_bw()
med_pop

## Warning: Removed 9 rows containing missing values (geom_point).

(h) Create a scatterplot of medals and temperature. Briefly describe any clear patterns.

The points in this scatterplot trace a roughly bell-shaped pattern centered on approximately 37 degrees Fahrenheit. Countries whose winter high temperatures fall near the middle of the plot may have more temperate weather, and thus may naturally have better training conditions than countries with more extreme temperatures. The peak of this pattern could indicate a “Goldilocks zone” of sorts: a temperature at which it is cold enough for athletes to train easily for winter sports, but not so cold that training becomes dangerous.

med_temp <- ggplot(olympics, aes(x = temp, y = medals))+
  geom_jitter()+
  xlab("Average High Temperature in Fahrenheit")+
  ylab("Number of Medals Awarded")+
  ggtitle("Relationship Between Temperature and Number of Olympic Medals Awarded, 1980-2014")+
  theme_bw()
med_temp

## Warning: Removed 22 rows containing missing values (geom_point).

2) Persico, Postlewaite, and Silverman (2004) analyzed data from the National Longitudinal Survey of Youth (NLSY) 1979 cohort to assess the relationship between height and wages for white men who were between 14 and 22 years old in 1979. This data set consists of answers from individuals who were asked questions in various years between 1979 and 1996. Here we explore the relationship between height and wages for the full sample that includes men and women and all races.

(a) Summarize the wage, height (both height85 and height81), and sibling variables. Discuss briefly.

The summary statistics for wages are as follows:
- The minimum is 0.
- The 1st quartile is 7.
- The median is 11.
- The 3rd quartile is 16.
- The maximum is 1533.
- The mean is 14.
- There are 5756 missing observations for this variable, recorded in this data set as NA.
The summary statistics for observed height in inches in 1981 are as follows:
- The minimum is 48.
- The 1st quartile is 64.
- The median is 67.
- The 3rd quartile is 70.
- The maximum is 83.
- The mean is 67.
- There are 543 missing observations for this variable, recorded in this data set as NA.
The summary statistics for observed height in inches in 1985 are as follows:
- The minimum is 48.
- The 1st quartile is 64.
- The median is 67.
- The 3rd quartile is 70.
- The maximum is 81.
- The mean is 67.
- There are 1823 missing observations for this variable, recorded in this data set as NA.
The summary statistics for the number of siblings are as follows:
- The minimum is 0. While -3 is the output generated by the summary function, it does not make sense for this variable, since it is not possible to have a negative number of siblings. This suggests that negative numbers (since -1 and -2 also appear as observations of this variable) were used as coding tools, rather than as actual observations. It is also possible that these negative numbers were entered incorrectly, and should actually be positive. The lack of a complete code book for this data set means that it is not possible to fully confirm either one of these hypotheses.
  - Additionally, all other summary statistics for this variable will be recalculated without the negative observations.
- The 1st quartile is 2.
- The median is 3.
- The 3rd quartile is 5.
- The maximum is 29.
- The mean is 3.85.
These summary statistics for the two height measurements do not seem abnormal. A minimum height of 4 feet and a maximum height of 6 feet 11 inches for adolescents could easily reflect varied access to nutrition as well as genetic variation, and this range is nearly identical for the values observed four years later.

The majority of the variation in the reported number of siblings also does not seem abnormal; 50% of respondents having between 2 and 5 siblings, resulting in family sizes of 3 to 6 children, does not seem out of the ordinary. Cases where a respondent reports large numbers of siblings could also be explained by situations such as foster families and adopted children, polygamous families, communal living environments, blended families including one or more step- or half-siblings, or pregnancies resulting in multiple births, to give a non-exhaustive list.

The summary statistics for wages are interesting because almost half of the observations are missing for this variable. Of those that are recorded in the data set, at least 75% are a wage of $16 per hour or less. It could be revealing to compare this distribution to national labor statistics from 1996 to see if this sample accurately reflects the wage distribution of American workers in that year. Additionally, subsequent analysis on this data set could benefit from adjusting for inflation, to make it easier to draw comparisons between these observations and American wages in the present day.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       7      11      14      16    1533    5756

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      48      64      67      67      70      83     543

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      48      64      67      67      70      81    1823

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.00    2.00    3.00    3.84    5.00   29.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    3.00    3.85    5.00   29.00

## sibs_pos
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  362 1611 2530 2412 1783 1198  933  634  421  289  178  116   67   64   27   11 
##   16   17   18   19   22   29 
##   18    8    3    1    1    1
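
The recalculation without negative values could look like the sketch below. The name `sibs_pos` appears in the output above, but how it was constructed is not shown, so the filtering step and the toy vector `sibs` are assumptions.

```r
# Toy sibling counts for illustration; negative values stand in for the
# apparent codes (-1, -2, -3) discussed above.
sibs <- c(-3, -2, -1, 0, 1, 2, 3, 5, 29)

# Keep only non-negative, non-missing values.
sibs_pos <- sibs[!is.na(sibs) & sibs >= 0]

print(summary(sibs))       # minimum is -3 before filtering
print(summary(sibs_pos))   # minimum is 0 after filtering
print(table(sibs_pos))     # frequency of each remaining value
```

Filtering into a separate object, rather than overwriting the original variable, keeps the coded values available in case a code book later explains them.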

(b) Create a scatterplot of wages and adult height (height85). Discuss any distinctive observations.

Three observations in this scatterplot have wages greater than 500. Apart from these outliers, the points form a roughly bell-shaped cloud across the range of heights, and the overwhelming majority of them represent wages below approximately 25.

wage_h85 <- ggplot(NLSY, aes(x = height85, y = wage96))+
  geom_jitter()+
  xlab("Height in Inches, Measured in 1985")+
  ylab("Hourly Wages Earned in U.S. Dollars in 1996")+
  ggtitle("Relationship Between Adult Height in 1985 and Hourly Wage Earned")+
  theme_bw()
wage_h85

## Warning: Removed 5973 rows containing missing values (geom_point).

(c) Create a scatterplot of wages and adult height that excludes the observations with wages above $500 per hour.

The code below produces a scatter plot titled “Relationship Between Adult Height in 1985 and Hourly Wage Earned in 1996.” The plot object is named wage_less500_h85.

NLSY3c <- subset(NLSY, wage96 < 500) ##Copy of the data frame keeping only rows where wage96 is below 500; rows with a missing wage96 are dropped as well.
wage_less500_h85<-ggplot(NLSY3c, aes(x = height85, y = wage96))+
  geom_jitter()+
  xlab("Height in Inches, Measured in 1985")+
  ylab("Hourly Wages Earned in U.S. Dollars in 1996")+
  ggtitle("Relationship Between Adult Height in 1985 and Hourly Wage Earned in 1996")+
  theme_bw()
wage_less500_h85

## Warning: Removed 217 rows containing missing values (geom_point).
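
The drop in the warning count from 5973 rows in (b) to 217 here is consistent with how subset() handles missing values: a condition that evaluates to NA is treated as FALSE, so rows with a missing wage are removed along with the outliers, and only rows with a missing height remain for ggplot to drop (5756 + 217 = 5973). A minimal sketch on made-up values, reusing the column names from the code above:

```r
# Toy data for illustration only.
toy <- data.frame(
  wage96   = c(10, NA, 1533, 25, NA),
  height85 = c(67, 70, 72, NA, 64)
)

# Rows where wage96 is NA make the condition NA, and subset() drops them.
kept <- subset(toy, wage96 < 500)

n_kept <- nrow(kept)                            # rows that survive the filter
n_missing_height <- sum(is.na(kept$height85))   # rows a plot would still drop
print(kept)
print(c(n_kept, n_missing_height))
```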

(d) Create a scatterplot of adult height against adolescent height. Identify the set of observations where people’s adolescent height is more than their adult height. Do you think we should use these observations in any future analysis we conduct with this data? Why or why not?

In this scatter plot, the points colored dark blue are the observations where adolescent height is greater than adult height. Future analyses using this data set should consider a threshold at which to discard an observation. When two measurements are very close and the adolescent one is slightly higher, it is possible that some sort of error was made when recording the data (e.g., the respondent wearing shoes for one measurement and being barefoot for the other, the examiner misreading the measuring instrument, or the data being entered incorrectly). However, since the expected pattern is for height to either remain constant or increase from adolescence to adulthood, observations which show a large decrease (like the lowest point on the scatter plot, which has an adolescent height of approximately 61 inches and an adult height of approximately 48 inches) should be discarded for quality control purposes.

##Keep only rows where both height variables are observed, and set the color to be conditional on whether adolescent height is greater than adult height
height_comp <- ggplot(data = subset(NLSY, !is.na(height81) & !is.na(height85)), aes(x = height81, y = height85, color = height81 > height85))+
  geom_jitter()+
  xlab("Height in Inches, Measured in 1981")+
  ylab("Height in Inches, Measured in 1985")+
  ggtitle("Comparison Between Adolescent and Adult Height", subtitle = "Data from the National Longitudinal Survey of Youth, 1981-1985")+
  scale_color_manual(name = "Adolescent Height > Adult Height", values = setNames(c("#041E42", "#782F40"), c("TRUE", "FALSE")))+ ##Set the title for the legend as well as the colors to be used (GTown Blue and FSU Garnet in this case). I would have also used GTown Gray but the color contrast wasn't obvious enough
  theme_bw()+
  theme(legend.box.background = element_rect(color = "#000000")) ##Put a border around the legend. Not sure why it only puts a border on 2 sides. NOTE: Look into this for future reference.
height_comp
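
The threshold idea discussed above can be sketched as follows. The data are made up, and the 1-inch tolerance is a hypothetical cutoff chosen only for illustration, not a value taken from the data or the assignment.

```r
# Toy heights in inches, for illustration only.
toy <- data.frame(
  height81 = c(61, 66, 70, 64, NA),
  height85 = c(48, 67, 69, 64, 70)
)

# Compare only rows where both measurements are observed.
complete <- toy[!is.na(toy$height81) & !is.na(toy$height85), ]
complete$shrink <- complete$height81 - complete$height85   # positive = got shorter

tolerance <- 1   # hypothetical cutoff: decreases up to 1 inch count as measurement noise

shrank  <- complete[complete$shrink > 0, ]           # adolescent height exceeds adult height
discard <- complete[complete$shrink > tolerance, ]   # decrease too large to be plausible

print(shrank)
print(discard)
```

Separating the two sets makes the trade-off explicit: small decreases are flagged but kept, while only large decreases are dropped.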