Data Science Project 1

MAUNA LOA CO₂ MODELING AND VISUALIZATION

The Question

In 1966, the World Meteorological Organization (WMO) put forth the term “climatic change” to refer to climatic variability on time-scales longer than ten years, regardless of the cause for such change. During the next decade, scientists began to suspect that human activities had the potential to drastically alter the global climate in ways that would have negative impacts on our environment. The term evolved into “climate change” and is now used to describe both the process of change and the perceived problem. Sometimes the term “global warming” is used, though in many ways this fails to adequately describe the variability in impact, since climate change can cause both hot and cold extremes in weather. Anthropogenic climate change is change that is caused by human activity, as opposed to the Earth’s natural processes. However, in the context of environmental policy, the term “climate change” is often used to mean anthropogenic climate change.

Mauna Loa Observatory is a world-renowned atmospheric research facility. It has been continuously monitoring and collecting data since the 1950s, and its remote location makes it well suited for monitoring atmospheric components that can contribute to climate change, including the heat-trapping greenhouse gas carbon dioxide (CO₂). Proponents of anthropogenic climate change cite carbon overload from burning fossil fuels and deforestation as its primary cause, while opponents assert that natural processes (such as photosynthesis) contribute more to atmospheric CO₂ than humans do and that observed changes are simply part of Earth’s natural cycle.

Monthly Mean CO₂: The Last Five Years

Create your own version of the plot found here. Do not replicate it, but rather design your own. Use one of the themes found in the ggplot2 or ggthemes packages. You are encouraged to make style adjustments to help you informatively display the data.

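# Assumes the tidyverse is installed and co2_monthly is already loaded
library(tidyverse)

# Keep 2015 onward for the five-year plot
CO2_monthly2015 <- co2_monthly %>% filter(year >= 2015)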

ggplot(CO2_monthly2015) +
  geom_line(aes(x = date, y = mean_co2), col = "yellow", linetype = 7) +
  geom_point(aes(x = date, y = mean_co2), col = "blue", shape = 7) +
  geom_line(aes(x = date, y = trend_mean_co2), col = "black") +
  geom_point(aes(x = date, y = trend_mean_co2), col = "black", shape = 8) +
  scale_x_continuous(breaks = seq(2015, 2020, .25), 
                     labels = c("2015", rep("", 3), 
                                "2016", rep("", 3),
                                "2017", rep("", 3),
                                "2018", rep("", 3),
                                "2019", rep("", 3),
                                "2020"),
                     limits = c(2015, 2020)) +
  scale_y_continuous(breaks = 395:415, 
                     labels = c("395", rep("", 4),
                                "400", rep("", 4),
                                "405", rep("", 4),
                                "410", rep("", 4),
                                "415"),
                     limits = c(395, 415)) +
  ggtitle(expression("RECENT AVERAGE MONTHLY CO"[2]*" LEVELS AT MAUNA LOA")) +
  ylab("PARTS PER MILLION") +
  xlab("YEAR") +
  theme_classic()
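The blank-padded label vectors in the plot above can also be generated rather than typed out. The following is a minimal sketch in base R; the helper name `sparse_labels` is hypothetical and not part of ggplot2. It labels only the breaks that are whole multiples of a chosen step and leaves the others empty.

```r
# Hypothetical helper (not part of ggplot2): label only the breaks that are
# whole multiples of `every`, leaving the others blank.
sparse_labels <- function(breaks, every = 1) {
  ifelse(breaks %% every == 0, as.character(breaks), "")
}

x_breaks <- seq(2015, 2020, 0.25)   # quarterly tick marks
y_breaks <- 395:415                 # one tick per ppm

x_labels <- sparse_labels(x_breaks)             # label whole years only
y_labels <- sparse_labels(y_breaks, every = 5)  # label every 5 ppm
```

The resulting vectors can be passed as the `breaks` and `labels` arguments of `scale_x_continuous()` and `scale_y_continuous()`.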

Monthly Mean CO₂: A Major Milestone

An atmospheric CO₂ level of 400 ppm is considered by many to be a symbolic threshold with regard to climate change. “In the centuries to come, history books will likely look back on September 2016 as a major milestone for the world’s climate. At a time when atmospheric carbon dioxide is usually at its minimum, the monthly value failed to drop below 400 parts per million.” (source)

Adapt your plot above to include a red dashed line at 400 ppm and a large red dot on September 2016, with appropriate annotations to indicate what these additions represent.

ggplot(CO2_monthly2015) +
  geom_line(aes(x = date, y = mean_co2), col = "yellow", linetype = 7) +
  geom_point(aes(x = date, y = mean_co2), col = "blue", shape = 7) +
  geom_line(aes(x = date, y = trend_mean_co2), col = "black") +
  geom_point(aes(x = date, y = trend_mean_co2), col = "black", shape = 8) +
  scale_x_continuous(breaks = seq(2015, 2020, .25), 
                     labels = c("2015", rep("", 3), 
                                "2016", rep("", 3),
                                "2017", rep("", 3),
                                "2018", rep("", 3),
                                "2019", rep("", 3),
                                "2020"),
                     limits = c(2015, 2020)) +
  scale_y_continuous(breaks = 395:415, 
                     labels = c("395", rep("", 4),
                                "400", rep("", 4),
                                "405", rep("", 4),
                                "410", rep("", 4),
                                "415"),
                     limits = c(395, 415)) +
  ggtitle(expression("RECENT AVERAGE MONTHLY CO"[2]*" LEVELS AT MAUNA LOA")) +
  ylab("PARTS PER MILLION") +
  xlab("YEAR") +
  theme_classic() +
  geom_hline(yintercept = 400, color = "red", linetype = "dashed") +
  geom_point(data = filter(co2_monthly, year == 2016 & month == 9),
             aes(x = date, y = mean_co2), colour = "red", size = 4.5) +
  # annotate() draws each label once; geom_label() with fixed x and y redraws it for every row
  annotate("label", x = 2019.8, y = 401, label = "400 ppm") +
  annotate("label", x = 2018.2, y = 397,
           label = "September 2016 - Yearly minimum surpasses 400 ppm")
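The quoted claim that CO₂ is usually at its annual minimum in early autumn can be checked roughly against R's built-in `co2` series. This is a sketch under a stated assumption: the built-in series covers only 1959 to 1997, so it is a stand-in for the course's `co2_monthly` data and cannot show the 400 ppm crossing itself.

```r
# Built-in Mauna Loa series: monthly CO2 in ppm, January 1959 to December 1997
# (an assumption for illustration; this is not the course's co2_monthly data)
ppm   <- as.numeric(co2)
year  <- rep(1959:1997, each = 12)
month <- rep(1:12, times = length(1959:1997))

# For each year, find the calendar month with the lowest monthly mean
min_month <- tapply(seq_along(ppm), year,
                    function(i) month[i][which.min(ppm[i])])

# Count how often each calendar month holds the annual minimum
table(min_month)
```

If the annual minimum clusters in September and October, a September value above 400 ppm implies that nearly every month of that year stayed above the threshold.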

Trends Over Time in CO₂ Growth

Consider the full Mauna Loa CO₂ record found here. The overall trend is not linear, but segments of it may be piecewise linear. Filter to remove the incomplete decades 1950s and 2010s and create a scatterplot that shows the interpolated CO₂ values with a fitted linear model for each remaining decade. Do not include standard error bands.

# Drop the incomplete decades before fitting one line per decade
co2_nod <- co2_monthly %>%
  filter(decade != "1950s" & decade != "2010s")

ggplot(co2_nod, aes(x = date, y = int_mean_co2)) +
  geom_point(size = .1) +
  geom_smooth(aes(color = decade), method = "lm", se = FALSE) +
  labs(title = expression("Atmospheric CO"[2]*" at Mauna Loa Observatory (1960-2010)"),
       y = "PARTS PER MILLION", x = "YEAR")
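The idea that the record is piecewise linear by decade can be made concrete by fitting one line per decade and comparing the slopes. The sketch below uses R's built-in `co2` series (1959 to 1997) as a stand-in for the course data, so only the 1960s, 1970s, and 1980s are complete; the mid-month decimal date and the names `co2_df` and `decade_slopes` are assumptions made for illustration.

```r
# Built-in Mauna Loa series (monthly, 1959-1997), used here as a stand-in
# for the course's co2_monthly data
co2_df <- data.frame(
  year  = rep(1959:1997, each = 12),
  month = rep(1:12, times = 39),
  ppm   = as.numeric(co2)
)
# Mid-month decimal year (an assumed convention for this sketch)
co2_df$date   <- co2_df$year + (co2_df$month - 0.5) / 12
co2_df$decade <- paste0(co2_df$year %/% 10 * 10, "s")

# Keep only the decades fully covered by this series
full <- subset(co2_df, decade %in% c("1960s", "1970s", "1980s"))

# Fit ppm ~ date within each decade and extract the slope (ppm per year)
decade_slopes <- sapply(split(full, full$decade), function(d) {
  coef(lm(ppm ~ date, data = d))[["date"]]
})
decade_slopes
```

Slopes that grow from one decade to the next are what make a single straight line a poor fit to the whole record.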

Annual Mean CO₂ Since 1959

Replicate as closely as possible the annual mean plot found here. Hint: It uses a ggplot theme for some of the formatting.

ggplot(co2_annual) +
  # geom_col() is the idiomatic form of geom_bar(stat = "identity")
  geom_col(aes(x = year, y = mean_co2), fill = "lightblue", width = .7) +
  geom_smooth(aes(x = year, y = mean_co2), method = "loess") +
  geom_hline(yintercept = 400, color = "red") +
  annotate("label", x = 1988, y = 400, label = "crisis threshold") +
  geom_hline(yintercept = 280, color = "black") +
  annotate("label", x = 1988, y = 280, label = "pre-industrial mean") +
  geom_hline(yintercept = 200, color = "black") +
  annotate("label", x = 1988, y = 200, label = "ice age mean") +
  labs(title = expression("Annual Mean Atmospheric CO"[2]*" at Mauna Loa Observatory"),
       subtitle = "with loess smoothed trend curve and estimated historical reference values",
       y = expression("CO"[2]*" (ppm)"), x = "measurement year") +
  scale_y_continuous(breaks = seq(0, 400, 50)) +
  scale_x_continuous(breaks = seq(1960, 2020, 5)) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank(),
        axis.line = element_line(colour = "black"))

Discussion

In what way could these visualizations be used to support the theory of anthropogenic climate change?

ANSWER: These visualizations could be used to support the theory of anthropogenic climate change because they show that CO₂ levels have been rising throughout the record and have continued to rise in recent years.

Why are data such as these considered evidence rather than proof of anthropogenic climate change?

ANSWER: The data show that CO₂ is increasing but do not show that humans are the cause of the increase. The measurements also come from a single location rather than from sites across the entire Earth, which makes for a small sample. The data therefore count as evidence, not proof.

ANTHROPOMETRIC MODELING AND VISUALIZATION

The Question

Are people generally happy with their heights? If not, how tall do they want to be? Dr. Thomley’s anthropometric dataset contains measurements of students’ heights and their self-selected ideal heights. You will fit a parallel slopes model to predict ideal height using measured height and gender, then interpret the results of your model.

Exploratory Data Analysis

Filter the dataset to include only students who self-identified as male or female (there are not enough data points in the other categories to create a model for them). Perform EDA to determine whether you need to perform any transformations or remove any data points before you fit your model. Create your modeling dataset and call it anthro_mod.

summary(anthro)

    gender              ideal            height         armspan     
 Length:547         Min.   : 45.00   Min.   :59.75   Min.   :50.00  
 Class :character   1st Qu.: 66.00   1st Qu.:65.00   1st Qu.:64.00  
 Mode  :character   Median : 70.00   Median :68.00   Median :67.88  
                    Mean   : 70.03   Mean   :68.09   Mean   :67.91  
                    3rd Qu.: 74.00   3rd Qu.:71.50   3rd Qu.:72.00  
                    Max.   :100.00   Max.   :78.00   Max.   :81.00  
                    NA's   :5        NA's   :1       NA's   :5      
    forearm           hand            leg             foot      
 Min.   : 9.00   Min.   :4.000   Min.   :12.00   Min.   : 6.50  
 1st Qu.:16.43   1st Qu.:7.000   1st Qu.:17.21   1st Qu.: 9.00  
 Median :17.50   Median :7.250   Median :18.50   Median :10.00  
 Mean   :17.47   Mean   :7.365   Mean   :18.70   Mean   :10.07  
 3rd Qu.:18.50   3rd Qu.:8.000   3rd Qu.:20.00   3rd Qu.:11.00  
 Max.   :24.50   Max.   :9.000   Max.   :27.00   Max.   :15.00  
 NA's   :1       NA's   :1       NA's   :3       NA's   :4      
   semester        
 Length:547        
 Class :character  
 Mode  :character

anthro_mod <- anthro %>%
  filter(gender == "female" | gender == "male") %>%
  # Drop rows with a missing value in any measurement
  filter(!is.na(armspan) & !is.na(height) & !is.na(ideal) & !is.na(forearm) &
           !is.na(hand) & !is.na(leg) & !is.na(foot)) %>%
  # Remove outlying ideal heights (the summary shows a minimum of 45 and a maximum of 100)
  filter(ideal < 90 & ideal > 55)

ggplot(anthro_mod, aes(x = height, y = ideal)) +
  geom_point()
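One common way to sanity-check outlier cutoffs is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles of `ideal` reported by `summary(anthro)` above (computed before any filtering). It is offered as a cross-check under that convention, not as the way the cutoffs of 55 and 90 were chosen.

```r
# Quartiles of `ideal` as reported by summary(anthro) above
q1 <- 66
q3 <- 74
iqr <- q3 - q1

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
fences

# Check whether the reported extremes of `ideal` (45 and 100) fall outside the fences
c(min_flagged = 45 < fences[["lower"]], max_flagged = 100 > fences[["upper"]])
```

The fences come out at 54 and 86, close to the hand-picked cutoffs, and both reported extremes fall outside them.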

Fitting the Model

Create a scatterplot of ideal height versus measured height showing separate fitted linear models for males and females. Then fit a parallel slopes model with measured height and gender as predictors and save it as ideal_model. Display its summary.

ggplot(anthro_mod, aes(x = height, y = ideal, color = gender)) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point() +
  labs(title = "Ideal Height vs Measured Height for Males and Females",
       x = "Measured Height", y = "Ideal Height")

ideal_model <- lm(ideal ~ height + gender, data = anthro_mod)
summary(ideal_model)


Call:
lm(formula = ideal ~ height + gender, data = anthro_mod)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5876 -1.3481 -0.1123  1.1468 11.1271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.14842    2.38834    14.3   <2e-16 ***
height       0.49175    0.03669    13.4   <2e-16 ***
gendermale   4.79353    0.29954    16.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.257 on 524 degrees of freedom
Multiple R-squared:  0.7708,    Adjusted R-squared:  0.7699 
F-statistic: 881.1 on 2 and 524 DF,  p-value: < 2.2e-16

library(moderndive)  # provides get_regression_points()
rp <- get_regression_points(ideal_model)
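To make the parallel slopes structure explicit, the fitted equation can be written out by hand. The sketch below copies the coefficients from the summary output above; the function name `predict_ideal` is hypothetical and only mirrors what `predict()` would compute from `ideal_model`.

```r
# Coefficients copied from summary(ideal_model) above
b_intercept <- 34.14842
b_height    <- 0.49175
b_male      <- 4.79353

# Parallel slopes: both genders share the height slope,
# and males get an intercept shifted by b_male
predict_ideal <- function(height, gender) {
  b_intercept + b_height * height + b_male * (gender == "male")
}

# Predicted ideal heights for a 60-inch female and a 60-inch male
predict_ideal(c(60, 60), c("female", "male"))
```

At any given measured height, the two predictions differ by exactly the `gendermale` coefficient, which is what makes the two fitted lines parallel.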

Examine Residuals

Create a residual scatterplot and histogram for your model.

ggplot(rp, aes(x = height, y = residual)) +
  geom_point(color = "blue") + geom_hline(yintercept = 0, col = "yellow", size = .8) +
  labs(title = "Ideal Model Residual Points", x = "Height", y = "Residual")

ggplot(rp, aes(x = residual)) +
  geom_histogram(binwidth = .5, color = "blue") +
  labs(title = "Distribution of Residuals for Ideal Model", x = "Residual", y = "Count")
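For readers without moderndive, the same residual quantities can be computed in base R. The sketch below uses the built-in `women` dataset purely as a stand-in for `anthro_mod`, so the numbers it produces say nothing about the height model above.

```r
# Built-in `women` data (height in inches, weight in pounds),
# used only as a stand-in for anthro_mod
fit <- lm(weight ~ height, data = women)

# A residual is the observed response minus the fitted value
res <- women$weight - fitted(fit)

# With an intercept in the model, least-squares residuals average to zero
mean(res)

# Base-graphics counterparts of the two ggplot2 displays above
plot(women$height, res, xlab = "Height", ylab = "Residual")
abline(h = 0, lty = 2)
hist(res, xlab = "Residual", main = "Distribution of residuals")
```

A scatterplot with no visible pattern around the zero line and a roughly symmetric histogram are the usual signs that a linear model is adequate.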

Predicting Ideal Height

Create a dataset containing heights at one-inch intervals from 60 to 80 for each gender. Use your parallel slopes model to predict the ideal heights for these values. Use mutate to create a new variable in your results tibble that shows whether the ideal height is less than, equal to, or greater than height. The case_when function may be useful here. Display the results.

# Heights 60 to 80 inches for each gender (males first, then females)
height2 <- tibble(height = rep(60:80, times = 2),
                  gender = rep(c("male", "female"), each = 21))

height2_model <- get_regression_points(ideal_model, newdata = height2)

# Compare predicted ideal height with measured height;
# strict inequalities keep the "equal to" branch reachable
height2_mut <- height2_model %>%
  mutate(comparison = case_when(
    ideal_hat < height  ~ "less than",
    ideal_hat == height ~ "equal to",
    ideal_hat > height  ~ "greater than"
  ))
height2_mut

# A tibble: 42 x 5
      ID height gender ideal_hat comparison  
   <int>  <dbl> <chr>      <dbl> <chr>       
 1     1     60 male        68.4 greater than
 2     2     61 male        68.9 greater than
 3     3     62 male        69.4 greater than
 4     4     63 male        69.9 greater than
 5     5     64 male        70.4 greater than
 6     6     65 male        70.9 greater than
 7     7     66 male        71.4 greater than
 8     8     67 male        71.9 greater than
 9     9     68 male        72.4 greater than
10    10     69 male        72.9 greater than
# … with 32 more rows

Additional Visualization

Create a plot that shows the same fitted lines for males and females as your scatterplot (but without points), as well as an annotated line indicating the relationship ideal height = measured height. Format this line in some way other than the default (e.g., color, style).

ggplot(anthro_mod, aes(x = height, y = ideal, color = gender)) +
  geom_smooth(method = "lm", se = FALSE) +
  # Draw ideal = measured directly rather than through the rows where the two happen to match
  geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed") +
  annotate("text", x = 70, y = 71, angle = 41.5, label = "Ideal Height = Measured Height") +
  labs(title = "Linear Models of Ideal Height vs. Measured Height (Male and Female)",
       subtitle = "Line where ideal height equals measured height for reference",
       x = "Measured Height", y = "Ideal Height")

Discussion

Explain your rationale for any transformations or deletions you chose to make in the dataset.

ANSWER: I kept only students who identified as male or female, as the instructions required. I also removed rows with missing values (NAs) because they can interfere with some function calls. Finally, I noticed two outliers in ideal height, so I kept only ideal heights greater than 55 and less than 90.

Does the model seem appropriate for the data? Be sure to include discussion of the residuals.

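ANSWER: Yes. The R-squared value is about 0.77, so measured height and gender together explain roughly 77% of the variation in ideal height, and the p-values for both predictors and for the overall F-test are very small. The gendermale coefficient indicates that, at the same measured height, males report an ideal height about 4.8 inches greater than females. The residuals are roughly centered on zero (median -0.11, quartiles -1.35 and 1.15), although the largest residual, 11.13, points to at least one student whose ideal height is far above the model's prediction. On balance, the model seems appropriate.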

Do the people in this sample generally seem to be happy with their heights or do their ideal heights differ? Do males and females seem to have the same attitudes regarding what is an ideal height? What group patterns do you notice? Discuss.

ANSWER: No. If everyone were content with their height, the fitted lines would follow ideal height = measured height, with a slope of 1 and an intercept of 0. Instead, the model shows that most shorter females want to be a few inches taller and most taller females want to be shorter. Most shorter males would prefer to be much taller (about 6 inches), while the tallest males are fairly content with their height.
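The group patterns described above can be quantified by solving for the height at which each fitted line crosses ideal = measured. This is a small worked sketch using the coefficients from the model summary; the name `crossover` is introduced here for illustration.

```r
# Coefficients copied from summary(ideal_model) above
b_intercept <- 34.14842
b_height    <- 0.49175
b_male      <- 4.79353

# Setting ideal = height in  ideal = a + b * height  gives  height = a / (1 - b)
crossover <- c(
  female = b_intercept / (1 - b_height),
  male   = (b_intercept + b_male) / (1 - b_height)
)
round(crossover, 1)
```

Under this model, predicted ideal height exceeds measured height below the crossover and falls short of it above. The crossovers are about 67.2 inches for females and 76.6 inches for males, which is why almost all males in the 60 to 80 inch range are predicted to want to be taller.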

Data Science Project 1

Joshua Arford

Updated: Sunday, March 03, 2019 @ 10:40:40 PM
