First, load the data set with the load() command, along with any relevant packages. In other cases, a command like read_dta() would be appropriate, but because the file here is in .RData format, load() is the simplest choice.
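As a minimal sketch of how load() behaves, the example below saves a small made-up data frame to a temporary .RData file and loads it back. The object name `demo_df` and the temporary path are placeholders for illustration, not the names used in this assignment.

```r
# Build a tiny stand-in data frame (made-up values, not the assignment data).
demo_df <- data.frame(country = c("A", "B"), medals = c(0, 3))

# Save it to a temporary .RData file, then remove it from the workspace.
rdata_path <- tempfile(fileext = ".RData")
save(demo_df, file = rdata_path)
rm(demo_df)

# load() restores objects under their saved names and
# invisibly returns a character vector of those names.
loaded_names <- load(rdata_path)
print(loaded_names)
str(demo_df)
```

Capturing the return value of load() is a convenient way to find out which objects an unfamiliar .RData file contains.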
The summary statistics for medals are as follows:
The minimum is 0.
The 1st quartile is 0.
The median is 0.
The 3rd quartile is 0.
The maximum is 37.
The mean is 1.8.
The summary statistics for athletes are as follows:
The minimum is 0.
The 1st quartile is 0.
The median is 2.
The 3rd quartile is 13.
The maximum is 230.
The mean is 18.2.
The summary statistics for the GDP data are as follows:
The minimum is 0.
The 1st quartile is 0.1.
The median is 0.4.
The 3rd quartile is 1.5.
The maximum is 14.5.
The mean is 1.2.
156 observations are missing, and are recorded in this data set as NA.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 1.8 0.0 37.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 2.0 18.2 13.0 230.0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.1 0.4 1.2 1.5 14.5 156
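The output above has the shape produced by summary() on a numeric vector, which adds an NA's entry only when values are missing. A minimal sketch with made-up numbers (the vector `x` is hypothetical, not a column from the data set):

```r
# Made-up values for illustration only.
x <- c(0, 0.1, 0.4, NA, 1.5, 14.5, NA)

# summary() reports the five-number summary plus the mean,
# and appends a count of NA's when any values are missing.
print(summary(x))

# Direct count of missing values.
n_missing <- sum(is.na(x))
print(n_missing)

# mean() returns NA unless missing values are dropped explicitly.
mean_with_na <- mean(x)
mean_without_na <- mean(x, na.rm = TRUE)
print(mean_without_na)
```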
The first five observations for the requested variables are:
Country:
Albania
Albania
Albania
Albania
Albania
Year:
1980
1984
1988
1992
1994
Medals:
0
0
0
0
0
Athletes:
0
0
0
0
0
GDP:
NA
0.0641
0.0637
0.0206
0.0587
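One way to pull the first five rows for a chosen set of variables is head() on a column subset. The sketch below uses a toy data frame; the column names `country`, `year`, and `extra` are assumptions, since the real column names are not shown here.

```r
# Toy data frame standing in for the real data (values are illustrative).
toy <- data.frame(
  country = rep("Albania", 6),
  year    = c(1980, 1984, 1988, 1992, 1994, 1998),
  medals  = rep(0, 6),
  extra   = 1:6  # a column we do not want to display
)

# Select the columns of interest by name, then keep the first five rows.
first_five <- head(toy[, c("country", "year", "medals")], 5)
print(first_five)
```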
Here is the number of observations for each year in the data set:
1980: 117
1984: 117
1988: 117
1992: 113
1994: 110
1998: 110
2002: 110
2006: 110
2010: 109
2014: 109
##
## 1980 1984 1988 1992 1994 1998 2002 2006 2010 2014
## 117 117 117 113 110 110 110 110 109 109
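Counts like these are what table() returns when applied to a single variable. A minimal sketch on a made-up vector of years (`years` is hypothetical):

```r
# Made-up year values for illustration only.
years <- c(1980, 1980, 1984, 1984, 1984, 1988)

# table() counts how many times each distinct value occurs.
counts <- table(years)
print(counts)

# Individual counts can be looked up by the value's label.
n_1984 <- as.integer(counts["1984"])
print(n_1984)
```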
Running the code below creates and displays a scatterplot stored as “med_ath” (since it shows the relationship between medals awarded and number of athletes). The scatterplot depicts a positive relationship between the size of an Olympic delegation and the number of medals awarded to that delegation in a given year. Because the number of medals available is limited, many of the delegations that have participated in the Olympics since 1980 earned 0 medals. The relationship also appears roughly linear; however, without further analysis of the data, this cannot be claimed definitively.
med_ath <- ggplot(olympics, aes(x = athletes, y = medals))+
geom_jitter()+
ylab("Number of Medals Awarded")+
xlab("Number of Athletes Competing")+
ggtitle("Number of Medals Awarded for Different Sizes of Olympic Delegations, 1980-2014")+
theme_bw()
med_ath
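One possible form of the further analysis mentioned above is to fit a simple linear model and look at its residuals. This is only a sketch on made-up numbers; the toy values below are not taken from the data set, and a linear fit is an assumption to be checked rather than a conclusion.

```r
# Made-up delegation sizes and medal counts for illustration only.
toy <- data.frame(
  athletes = c(0, 2, 5, 13, 40, 90, 150, 230),
  medals   = c(0, 0, 0, 1, 3, 9, 20, 37)
)

# Fit medals as a linear function of athletes.
fit <- lm(medals ~ athletes, data = toy)
print(coef(fit))

# R-squared gives a rough sense of how much variation the line explains.
r_squared <- summary(fit)$r.squared
print(r_squared)

# Residuals that curve or fan out against fitted values would suggest
# that a straight line is not an adequate description.
resid_table <- data.frame(fitted = fitted(fit), residual = resid(fit))
print(resid_table)
```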
It’s possible that countries which field small numbers of athletes do so because they lack the resources to field larger groups. This could also mean that those same countries lack the resources to effectively train and coach their athletic delegations, or even just that they lack the resources to train at a level comparable to larger, wealthier countries. Investigating GDP could help clarify this relationship.
For GDP below approximately 5, a higher GDP appears to be correlated with a higher number of medals awarded. For GDP above approximately 5, this relationship is not present, and these data points all indicate 0 medals awarded. Based solely on visual inspection, GDP seems to be a less effective predictor of the number of medals awarded than athletic delegation size, owing mainly to the outliers with high GDP and 0 medals awarded.
med_GDP <- ggplot(olympics, aes(x = GDP, y = medals))+
geom_jitter()+
xlab("Per Capita GDP in $10,000 U.S.")+
ylab("Number of Medals Awarded")+
ggtitle("Relationship Between Number of Olympic Medals Awarded and National GDP, 1980-2014")+
theme_bw()
med_GDP
## Warning: Removed 156 rows containing missing values (geom_point).
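The visual comparison of the two predictors could be supplemented with correlation coefficients. The sketch below uses made-up values, so the numbers it prints say nothing about the real data; it only shows that cor() needs `use = "complete.obs"` when a variable contains NA, as GDP does here.

```r
# Made-up values for illustration only; GDP has one missing entry.
toy <- data.frame(
  athletes = c(0, 2, 5, 13, 40, 90, 150, 230),
  medals   = c(0, 0, 0, 1, 3, 9, 20, 37),
  GDP      = c(NA, 0.1, 0.4, 1.5, 1.2, 2.0, 14.5, 3.0)
)

# No missing values, so the default settings work.
r_athletes <- cor(toy$athletes, toy$medals)

# With the default use = "everything", a single NA makes the result NA.
r_gdp_default <- cor(toy$GDP, toy$medals)

# Dropping incomplete pairs gives a usable coefficient.
r_gdp <- cor(toy$GDP, toy$medals, use = "complete.obs")

print(c(athletes = r_athletes, GDP = r_gdp))
```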
The scatterplot of medals and population depicts a positive relationship between the population of a country and the number of medals awarded to its athletic delegation for countries with a population less than approximately 30,000,000. There are very few data points for populations larger than 30,000,000. Given that one would expect that larger, more populous countries (as well as wealthier countries) could field more highly-trained athletes at the Olympics, this data set merits further investigation to determine if the data is of sufficient quality.
med_pop <- ggplot(olympics, aes(x = population, y = medals))+
geom_jitter()+
xlab("Population (in 100,000)")+
ylab("Number of Medals Awarded")+
ggtitle("Relationship Between Population and Number of Olympic Medals Awarded, 1980-2014")+
theme_bw()
med_pop
## Warning: Removed 9 rows containing missing values (geom_point).
The points in this scatterplot form a roughly bell-shaped pattern centered on approximately 37 degrees Fahrenheit. Countries whose winter high temperatures fall near the middle of the plot may have more temperate weather, and thus may naturally have better training conditions than countries with more extreme temperatures. The peak of this pattern could indicate a “Goldilocks zone” of sorts: a temperature at which it is cold enough that athletes can easily train for winter sports, but not so cold that training becomes dangerous.
med_temp <- ggplot(olympics, aes(x = temp, y = medals))+
geom_jitter()+
xlab("Average High Temperature in Fahrenheit")+
ylab("Number of Medals Awarded")+
ggtitle("Relationship Between Temperature and Number of Olympic Medals Awarded, 1980-2014")+
theme_bw()
med_temp
## Warning: Removed 22 rows containing missing values (geom_point).
The summary statistics for wages are as follows:
The minimum is 0.
The 1st quartile is 7.
The median is 11.
The 3rd quartile is 16.
The maximum is 1533.
The mean is 14.
There are 5756 missing observations for this variable, recorded in this data set as NA.
The summary statistics for observed height in inches in 1981 are as follows:
The minimum is 48.
The 1st quartile is 64.
The median is 67.
The 3rd quartile is 70.
The maximum is 83.
The mean is 67.
There are 543 missing observations for this variable, recorded in this data set as NA.
The summary statistics for observed height in inches in 1985 are as follows:
The minimum is 48.
The 1st quartile is 64.
The median is 67.
The 3rd quartile is 70.
The maximum is 81.
The mean is 67.
There are 1823 missing observations for this variable, recorded in this data set as NA.
The summary statistics for number of siblings are as follows:
The minimum is 0. Although the summary function reports -3 as the minimum, that value does not make sense for this variable, since it is not possible to have a negative number of siblings. This suggests that negative numbers (-1 and -2 also appear as observations of this variable) were used as codes rather than as actual observations. It is also possible that these negative numbers were entered incorrectly and should actually be positive. Without a complete code book for this data set, it is not possible to confirm either hypothesis.
The 1st quartile is 2.
The median is 3.
The 3rd quartile is 5.
The maximum is 29.
The mean is 3.85.
These summary statistics for the two height measurements do not seem abnormal. A minimum height of 4 feet and a maximum height of 6 feet 11 inches for adolescents could easily reflect varied access to nutrition as well as genetic variation, and this range is nearly identical for the values observed four years later.
The majority of the variation in the reported number of siblings also does not seem abnormal; 50% of respondents having between 2 and 5 siblings, resulting in family sizes of 3 to 6 children, does not seem out of the ordinary. Cases where a respondent reports large numbers of siblings could also be explained by situations such as foster families and adopted children, polygamous families, communal living environments, blended families including one or more step- or half-siblings, or pregnancies resulting in multiple births, to give a non-exhaustive list.
The summary statistics for wages are interesting because almost half of the observations are missing for this variable. Of those that are recorded in the data set, at least 75% are a wage of $16 per hour or less. It could be revealing to compare this distribution to national labor statistics from 1996 to see if this sample accurately reflects the wage distribution of American workers in that year. Additionally, subsequent analysis on this data set could benefit from adjusting for inflation, to make it easier to draw comparisons between these observations and American wages in the present day.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 7 11 14 16 1533 5756
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 48 64 67 67 70 83 543
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 48 64 67 67 70 81 1823
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.00 2.00 3.00 3.84 5.00 29.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 3.00 3.85 5.00 29.00
## sibs_pos
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 362 1611 2530 2412 1783 1198 933 634 421 289 178 116 67 64 27 11
## 16 17 18 19 22 29
## 18 8 3 1 1 1
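The second summary and the table above are consistent with dropping the negative codes before summarizing. A minimal sketch with made-up values follows; `sibs` stands in for the siblings column, whose real name is not shown here, and treating negatives as non-responses is an assumption in the absence of a code book.

```r
# Made-up sibling counts, including negative codes, for illustration only.
sibs <- c(-3, -1, 0, 2, 3, 5, 29)

# Option 1: keep only the non-negative values.
sibs_pos <- sibs[sibs >= 0]
print(summary(sibs_pos))
print(table(sibs_pos))

# Option 2: recode negatives as NA so the vector keeps its original length
# and stays aligned with the other columns of the data frame.
sibs_na <- ifelse(sibs < 0, NA, sibs)
print(summary(sibs_na))
```

Option 2 is usually safer inside a data frame, because dropping elements from one column would break its row alignment with the others.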
Three observations in this scatterplot have wages greater than 500. Apart from these observations, the points form a roughly bell-shaped pattern across height. The overwhelming majority of points represent wages less than approximately 25.
wage_h85 <- ggplot(NLSY, aes(x = height85, y = wage96))+
geom_jitter()+
xlab("Height in Inches, Measured in 1985")+
ylab("Hourly Wages Earned in U.S. Dollars in 1996")+
ggtitle("Relationship Between Adult Height in 1985 and Hourly Wage Earned")+
theme_bw()
wage_h85
## Warning: Removed 5973 rows containing missing values (geom_point).
#### (c) Create a scatterplot of wages and adult height that excludes the observations with wages above $500 per hour.
The code below produces a scatterplot titled “Relationship Between Adult Height in 1985 and Hourly Wage Earned in 1996.” The plot is stored as wage_less500_h85.
NLSY3c <- subset(NLSY, wage96 < 500) ##Copy of the data frame without the rows where wages are 500 or more. subset() also drops the rows where wage96 is NA.
wage_less500_h85<-ggplot(NLSY3c, aes(x = height85, y = wage96))+
geom_jitter()+
xlab("Height in Inches, Measured in 1985")+
ylab("Hourly Wages Earned in U.S. Dollars in 1996")+
ggtitle("Relationship Between Adult Height in 1985 and Hourly Wage Earned in 1996")+
theme_bw()
wage_less500_h85
## Warning: Removed 217 rows containing missing values (geom_point).
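The warning count falls from 5973 to 217, a difference of 5756, which matches the number of missing wage observations. This is what one would expect from subset(), which drops rows where the condition evaluates to NA as well as rows where it is FALSE. A minimal sketch on made-up rows (the toy values are not from the data set):

```r
# Made-up rows: one valid pair, one missing wage, one wage over 500,
# one missing height, and one row missing both.
toy <- data.frame(
  wage96   = c(10, NA, 600, 12, NA),
  height85 = c(67, 70, 64, NA, NA)
)

# The comparison is NA wherever wage96 is NA, and subset() drops those rows.
kept <- subset(toy, wage96 < 500)
print(kept)

# No missing wages remain, but missing heights can still be present,
# which is why ggplot2 still warns about some removed rows.
n_na_wage   <- sum(is.na(kept$wage96))
n_na_height <- sum(is.na(kept$height85))
print(c(missing_wage = n_na_wage, missing_height = n_na_height))
```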
In this scatterplot, the points colored dark blue are the observations where adolescent height is greater than adult height. Future analyses using this data set should consider a threshold at which to discard an observation. If the two measurements are very close, with the adolescent one slightly higher, it is possible that some sort of error was made when recording the data (e.g. the respondent wearing shoes versus being barefoot, the examiner misreading the measuring instrument, or the data being entered incorrectly). However, since the expected pattern is for height to either remain constant or increase from adolescence to adulthood, observations which show a large decrease (like the lowest point on the scatterplot, which has an adolescent height of approximately 61 inches and an adult height of approximately 48 inches) should be discarded for quality control purposes.
##Keep only the rows where both heights are recorded, and set the color to be conditional on whether adolescent height is greater than adult height
height_comp <- ggplot(data = subset(NLSY, !is.na(height81) & !is.na(height85)), aes(x = height81, y = height85, color = height81 > height85))+
geom_jitter()+
xlab("Height in Inches, Measured in 1981")+
ylab("Height in Inches, Measured in 1985")+
ggtitle("Comparison Between Adolescent and Adult Height", subtitle = "Data from the National Longitudinal Survey of Youth, 1981-1985")+
scale_color_manual(name = "Adolescent Height > Adult Height", values = c("TRUE" = "#041E42", "FALSE" = "#782F40"))+ ##Set the title for the legend as well as the colors to be used (GTown Blue and FSU Garnet in this case). I would have also used GTown Gray but the color contrast wasn't obvious enough
theme_bw()+
theme(legend.box.background = element_rect(color = "#000000")) ##Put a border around the legend. Not sure why it only puts a border on 2 sides. NOTE: Look into this for future reference.
height_comp
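The threshold idea discussed above could be applied along the following lines. This is a sketch on made-up heights; the 2-inch tolerance `max_drop` is a hypothetical value chosen for illustration, not a recommendation derived from the data.

```r
# Made-up height pairs in inches for illustration only.
toy <- data.frame(
  height81 = c(61, 66, 70, 64, NA),
  height85 = c(48, 65, 72, 64, 68)
)

# Hypothetical tolerance: decreases larger than this are treated as errors.
max_drop <- 2

# Positive values mean the adolescent height exceeds the adult height.
toy$drop <- toy$height81 - toy$height85

# Flag only rows with a known decrease beyond the tolerance;
# rows with a missing height are left unflagged rather than discarded.
toy$discard <- !is.na(toy$drop) & toy$drop > max_drop

cleaned <- toy[!toy$discard, ]
print(cleaned)
```

Small decreases, such as the 1-inch drop in the second row, are kept because they fall within what measurement error could plausibly explain.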