Data Science Project 1

MAUNA LOA CO₂ MODELING AND VISUALIZATION
ANTHROPOMETRIC MODELING AND VISUALIZATION

MAUNA LOA CO₂ MODELING AND VISUALIZATION

The Question

In 1966, the World Meteorological Organization (WMO) put forth the term “climatic change” to refer to climatic variability on time-scales longer than ten years, regardless of the cause for such change. During the next decade, scientists began to suspect that human activities had the potential to drastically alter the global climate in ways that would have negative impacts on our environment. The term evolved into “climate change” and is now used to describe both the process of change and the perceived problem. Sometimes the term “global warming” is used, though in many ways this fails to adequately describe the variability in impact, since climate change can cause both hot and cold extremes in weather. Anthropogenic climate change is change that is caused by human activity, as opposed to the Earth’s natural processes. However, in the context of environmental policy, the term “climate change” is often used to mean anthropogenic climate change.

Mauna Loa Observatory is a world-renowned atmospheric research facility. It has been continuously monitoring and collecting data since the 1950’s and its remote location makes it very well-suited for monitoring atmospheric components that can contribute to climate change, including the heat-trapping greenhouse gas carbon dioxide (CO₂). Carbon overload from burning fossil fuels and deforestation is cited as the primary cause of anthropogenic climate change by proponents of such theories, while opponents assert that natural process (such as photosynthesis) contribute more to atmospheric CO₂ than humans and observed changes are simply Earth’s cycle.

Monthly Mean CO₂: The Last Five Years

Create your own version of the plot found here. Do not replicate it, but rather design your own. Use one of the themes found in the ggplot2 or ggthemes packages. You are encouraged to make style adjustments to help you informatively display the data.

CO2_2015 <-co2_monthly %>% filter(year == 2015)
CO2_2016 <-co2_monthly %>% filter(year == 2016)
CO2_2017 <-co2_monthly %>% filter(year == 2017)
CO2_2018 <-co2_monthly %>% filter(year == 2018)
CO2_2019 <-co2_monthly %>% filter(year == 2019)
CO2_2020 <-co2_monthly %>% filter(year == 2020)
COdot <- co2_monthly %>% filter(year == 2016, month ==9)

CO2_monthly2015 <- co2_monthly %>% filter(year >= 2015)

ggplot(CO2_monthly2015, aes(x = year, y = int_mean_co2))+
  geom_line(aes(x = date, y = int_mean_co2), col = "blue", alpha =0.75) + 
  geom_point(aes(x = date, y = int_mean_co2), col = "blue", shape = 20, size =0.25) + 
  geom_point(size = 1, alpha = 0.75) +
  geom_segment(x= 2015, xend = 2016, y = (mean(CO2_2015$int_mean_co2)), yend= (mean(CO2_2015$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2016, xend = 2017, y = (mean(CO2_2016$int_mean_co2)), yend= (mean(CO2_2016$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2017, xend = 2018, y = (mean(CO2_2017$int_mean_co2)), yend= (mean(CO2_2017$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2018, xend = 2019, y = (mean(CO2_2018$int_mean_co2)), yend= (mean(CO2_2018$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2019, xend = 2020, y = (mean(CO2_2019$int_mean_co2)), yend= (mean(CO2_2019$int_mean_co2)), col="gray", size =1.5)+
  #geom_line(aes(x = date, y = trend), col = "black", alpha =0.25) + not adding much
  scale_x_continuous(breaks = seq(2015, 2020, .25), 
                     labels = c("2015", rep("", 3), 
                                "2016", rep("", 3), 
                                "2017", rep("", 3),
                                "2018", rep("", 3),
                                "2019", rep("", 3),
                                "2020"),
                     limits = c(2015, 2020),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  scale_y_continuous(breaks = 395:415, 
                     labels = c("395", rep("", 4),
                                "400", rep("", 4),
                                "405", rep("", 4),
                                "410", rep("", 4),
                                "415"),
                     limits = c(395, 415),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  ggtitle(expression("Monthly CO"[2]*" and Yearly Average CO"[2]*" AT MAUNA LOA")) +
  ylab(expression("CO"[2]*"(ppm)")) +
  xlab("Year") +
  theme_minimal()+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Monthly Mean CO₂: A Major Milestone

An atmospheric CO₂ level of 400 ppm is considered by many to be a symbolic threshold with regard to climate change. “In the centuries to come, history books will likely look back on September 2016 as a major milestone for the world’s climate. At a time when atmospheric carbon dioxide is usually at its minimum, the monthly value failed to drop below 400 parts per million.” (source)

Adapt your plot above to include a red dashed line at 400 ppm and a large red dot on September 2016, with appropriate annotations to indicate what these additions represent.

CO2_monthly2015 <- co2_monthly %>% filter(year >= 2015)

ggplot(CO2_monthly2015, aes(x = year, y = int_mean_co2))+
  geom_line(aes(x = date, y = int_mean_co2), col = "blue", alpha =0.5) + 
  geom_point(aes(x = date, y = int_mean_co2), col = "blue", shape = 20, size =0.25) + 
  geom_point(size = 1, alpha =0.75) +
  geom_segment(x= 2015, xend = 2016, y = (mean(CO2_2015$int_mean_co2)), yend= (mean(CO2_2015$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2016, xend = 2017, y = (mean(CO2_2016$int_mean_co2)), yend= (mean(CO2_2016$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2017, xend = 2018, y = (mean(CO2_2017$int_mean_co2)), yend= (mean(CO2_2017$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2018, xend = 2019, y = (mean(CO2_2018$int_mean_co2)), yend= (mean(CO2_2018$int_mean_co2)), col="gray", size =1.5)+
  geom_segment(x= 2019, xend = 2020, y = (mean(CO2_2019$int_mean_co2)), yend= (mean(CO2_2019$int_mean_co2)), col="gray", size =1.5)+
 # geom_line(aes(x = date, y = trend), col = "black", alpha =0.5) + 
  scale_x_continuous(breaks = seq(2015, 2020, .25), 
                     labels = c("2015", rep("", 3), 
                                "2016", rep("", 3), 
                                "2017", rep("", 3),
                                "2018", rep("", 3),
                                "2019", rep("", 3),
                                "2020"),
                     limits = c(2015, 2020),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  scale_y_continuous(breaks = 395:415, 
                     labels = c("395", rep("", 4),
                                "400", rep("", 4),
                                "405", rep("", 4),
                                "410", rep("", 4),
                                "415"),
                     limits = c(395, 415),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  geom_hline(yintercept = 400, col="red", linetype = 5)+
  geom_label(x= 2019.5, y = 401, label = "400ppm Line")+
  geom_label(x= 2017.5, y = 399, label = "Yearly minimum above 400ppm since 2016")+
  geom_point(data=COdot, mapping = aes(x = date, y = int_mean_co2, size =50, col ="red"), show.legend=FALSE)+
  ggtitle(expression("Monthly CO"[2]*" and Yearly Average CO"[2]*" AT MAUNA LOA")) +
  ylab(expression("CO"[2]*"(ppm)")) +
  xlab("Year") +
  theme_minimal()+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Trends Over Time in CO₂ Growth

Consider the full Mauna Loa CO₂ record found here. The overall trend is not linear, but segments of it may be piecewise linear. Filter to remove the incomplete decades 1950s and 2010s and create a scatterplot that shows the interpolated CO₂ values with a fitted linear model for each remaining decade. Do not include standard error bands.

ggplot(co2_monthly_linear, aes(x = year, y = int_mean_co2))+
  geom_point(size = 0.25) +
geom_smooth(data=co2_1960s, method = "lm", se=F, size = 1.25, col = "yellow1")+
geom_smooth(data=co2_1970s, method = "lm", se=F, size = 1.25, col = "gold1")+
geom_smooth(data=co2_1980s, method = "lm", se=F, size = 1.25, col = "orange")+  
geom_smooth(data=co2_1990s, method = "lm", se=F, size = 1.25, col = "darkorange1")+
geom_smooth(data=co2_2000s, method = "lm", se=F, size = 1.25, col = "red")+
ggtitle(expression("Monthly CO"[2]*" Linear Models by Decade At Mauna Loa")) +
ylab("Parts Per Million") +
xlab("Year")+
theme_minimal()+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Annual Mean CO₂ Since 1959

Replicate as closely as possible the annual mean plot found here. Hint: It uses a ggplot theme for some of the formatting.

ggplot(co2_annual, aes(x = year, y = mean_co2))+
  geom_bar(stat="identity", col = "sky blue", fill = "gray", width = .5)+
  geom_smooth(method="loess", se=F, col = "blue")+
  geom_hline(yintercept = 280, col = "black")+
  geom_hline(yintercept = 200, col = "black")+
  geom_hline(yintercept = 400, col = "red")+
 labs(title = expression("Annual Mean Atmospheric CO"[2]*" at Mauna Loa Observatory"), 
       subtitle = "with loess smoothed trend curve and estimated historical reference values",
      y = expression("CO"[2]*"(ppm)"),
      x = "measurement year",
       caption  = "data soure: https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html")+
  geom_label(x= 1987, y = 400, label = "crisis threshold")+
   geom_label(x= 1987, y = 280, label = "pre-industrial mean")+
  geom_label(x= 1987, y = 200, label = "ice age mean")+
  scale_x_continuous(breaks = seq(1960, 2020, 5), 
                     limits = c(1958, 2020),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
scale_y_continuous(breaks = seq(0, 400, 50), 
                     limits = c(0, 410),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  theme_few(base_size = 10)+
  theme(panel.border = element_blank())+ 
  theme(axis.ticks.x.top = element_blank())+
  theme(axis.ticks.y.right = element_blank())+
  theme(axis.ticks.length = (unit(0.05, "cm")))+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Discussion

In what way could these visualizations be used to support the theory of anthropogenic climate change?

ANSWER: These visualizations could be used to support the theory of anthropogenic climate change because they show a growing difference between current \(CO_2\) levels and those of the past. The increasing rate of change as shown in the “Monthly \(CO_2\) Linear Models by Decade At Mauna Loa” graph suggests that something is continually happening to make \(CO_2\) accumulate faster than it has in the past and the simplest explanation could be anthropogenic climate change.

Why are data such as these considered evidence rather than proof of anthropogenic climate change?

ANSWER: Data such as these are evidence, not proof, of anthropogenic climate change because they cannot establish causality. It seems logical to many people to think that the changes shown above are a direct result of human activity, although the data does not provide any way to establish whether or not that is the case. Even the increasing rate of change as shown in the “Monthly \(CO_2\) Linear Models by Decade At Mauna Loa” graph does not necessarily mean that the changes are due to human activity since natural variations in climate do not have to take place with a set rate.

ANTHROPOMETRIC MODELING AND VISUALIZATION

The Question

Are people generally happy with their heights? If not, how tall do they want to be? Dr. Thomley’s anthropometric dataset contains measurements of students’ heights and their self-selected ideal heights. You will fit a parallel slopes model to predict ideal height using measured height and gender, then interpret the results of your model.

Exploratory Data Analysis

Filter the dataset to include only students who self-identified as male or female (there are not enough data points in the other categories to create a model for them). Perform EDA to determine whether you need to perform any transformations or remove any data points before you fit your model. Create your modeling dataset and call it anthro_mod.

glimpse(anthro)

Observations: 547
Variables: 9
$ gender   <chr> "female", "female", "female", "female", "female", "fema…
$ ideal    <dbl> 64, 64, 65, 63, 68, 64, NA, 65, 66, 63, 62, 60, 65, 68,…
$ height   <dbl> 59.75, 59.75, 60.00, 60.00, 60.00, 60.00, 60.50, 60.50,…
$ armspan  <dbl> 58.5, 58.5, 61.0, 58.0, 62.0, 60.0, 60.0, 61.5, 63.0, 5…
$ forearm  <dbl> 15.00, 15.00, 16.00, 15.00, 15.00, 15.00, 15.50, 15.00,…
$ hand     <dbl> 6.50, 6.50, 7.00, 6.25, 6.00, 6.00, 5.50, 7.00, 7.00, 6…
$ leg      <dbl> 16.00, 16.00, 16.00, 16.25, 18.00, 16.00, 17.75, 16.50,…
$ foot     <dbl> 8.50, 8.50, 8.50, 9.50, 9.00, 8.00, 8.50, 8.50, 9.50, 8…
$ semester <chr> "F15", "F15", "F13", "F15", "S13", "Sum13", "S14", "Sum…

ggplot(anthro, aes(x=height, y = ideal)) +
geom_point() +
  theme_minimal()+
    labs(title = expression("Height vs Ideal Height 1"), 
        subtitle = "with two outliers",
      y = ("Height"),
      x = "Height")

anthro_mod <- anthro%>%
  filter(gender =="female" | gender =="male") %>%
  filter(ideal>=55 & ideal <= 90)

ggplot(anthro_mod, aes(x=height, y = ideal)) +
geom_point() +theme_minimal()+
    labs(title = expression("Height vs Ideal Height 2"), 
       subtitle = "rescaled to match after removal of outliers",
      y = expression("Ideal Height"),
      x = "Height")+
  scale_x_continuous(breaks = seq(60,75,5), 
                     limits = c(59, 78),
                     sec.axis = dup_axis(labels = NULL, name = "")) +
  scale_y_continuous(breaks = seq(50,100,10), 
                     limits = c(50, 100),
                     sec.axis = dup_axis(labels = NULL, name = ""))

Fitting the Model

Create a scatterplot of ideal height versus measured height showing separate fitted linear models for males and females. Then fit a parallel slopes model with measured height and gender as predictors and save it as ideal_model. Display its summary.

ggplot(anthro_mod, aes(x= ideal, y = height, col = gender))+
  geom_point()+
  geom_smooth(method="lm", se=F)+
  theme_minimal()+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())+
  labs(title = ("Height vs Ideal Height"), 
       subtitle = "divided by gender",
      y = ("Measured Height"),
      x = "Desired Height")

ideal_model <- lm(ideal ~ height + gender, data = anthro_mod)
summary(ideal_model)


Call:
lm(formula = ideal ~ height + gender, data = anthro_mod)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5953 -1.3357 -0.1049  1.1675 11.1611 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.81289    2.37120   14.26   <2e-16 ***
height       0.49680    0.03643   13.64   <2e-16 ***
gendermale   4.74680    0.29744   15.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.259 on 534 degrees of freedom
Multiple R-squared:  0.7703,    Adjusted R-squared:  0.7694 
F-statistic: 895.1 on 2 and 534 DF,  p-value: < 2.2e-16

Examine Residuals

Create a residual scatterplot and histogram for your model.

anthro_ideal_points <- get_regression_points(ideal_model)

summary(anthro_ideal_points$residual)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5.595000 -1.336000 -0.105000  0.000091  1.168000 11.161000

ggplot(anthro_ideal_points, aes(x=residual)) + 
  geom_histogram(fill="gray", bins=15)+
  theme_minimal()+
  labs(title = expression("Histogram of Residuals"), 
      y = "Count",
      x = "Residual")+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

ggplot(anthro_ideal_points, aes(x=ideal, y = residual, col=gender)) + 
  geom_jitter()+
  theme_minimal()+
  labs(title = expression("Residuals vs Ideal Height"), 
       subtitle = "divided by gender",
      y = expression("Residual"),
      x = "Desired Height")+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Predicting Ideal Height

Create a dataset containing heights at one-inch intervals from 60 to 80 for each gender. Use your parallel slopes model to predict the ideal heights for these values. Use mutate to create a new variable in your results tibble that shows whether the ideal height is less than, equal to, or greater than height. The case_when function may be useful here. Display the results.

nh <- c(60:80)
hn <- c(60:80)
gen <- c(rep.int("male",21))
genf <- c(rep.int("female",21))

new_heights<- tibble(height  = c(nh,hn),
                    gender = c(gen,genf))

new_heights_model <- get_regression_points(ideal_model, newdata = new_heights)

new_heights_model_eq <- new_heights_model %>%
mutate(equality = case_when(height >= ideal_hat ~ "less than", 
                            height <= ideal_hat ~ "greater than", 
                            height == ideal_hat ~ "equal to"))
new_heights_model_eq

# A tibble: 42 x 5
      ID height gender ideal_hat equality    
   <int>  <dbl> <chr>      <dbl> <chr>       
 1     1     60 male        68.4 greater than
 2     2     61 male        68.9 greater than
 3     3     62 male        69.4 greater than
 4     4     63 male        69.9 greater than
 5     5     64 male        70.4 greater than
 6     6     65 male        70.9 greater than
 7     7     66 male        71.3 greater than
 8     8     67 male        71.8 greater than
 9     9     68 male        72.3 greater than
10    10     69 male        72.8 greater than
# … with 32 more rows

Additional Visualization

Create a plot that shows the same fitted lines for males and females as your scatterplot (but without points), as well as an annotated line indicating the relationship ideal height = measured height. Format this line in some way other than the default (e.g., color, style).

ggplot(anthro_mod, aes(x= ideal, y = height))+
  geom_smooth(data=anthro_mod, aes(col = gender), method="lm", se=F)+
  geom_smooth(method = "lm", se =F, linetype = 3)+
  annotate("text", x = 71.5, y = 77, label = "relationship of ideal height and measured height across genders")+
  theme_minimal()+
  labs(title = expression("Regression Lines Predicting Ideal Height"), 
       subtitle = "divided by gender",
      y = expression("Measured Height"),
      x = "Desired Height")+
  theme(axis.line.y.left = element_line(),
        axis.line.x.bottom = element_line())

Discussion

Explain your rationale for any transformations or deletions you chose to make in the dataset.

ANSWER: I decided to drop armspans less than 55 inches because there were two outliers which would have made undue changes to the linear models of the data. Similarly, I narrowed ideal heights to those between 55 and 90 inches because without that transformations a few outliers would have made large changes to otherwise fairly uniform data.

Does the model seem appropriate for the data? Be sure to include discussion of the residuals.

ANSWER: The model seems overall appropriate for the data when separated by gender. The residual plot shows two distinct groups which are well separated by gender. The \(R^2\) value is relatively high as well, suggesting that these are reasonable variables to draw connections between.

Do the people in this sample generally seem to be happy with their heights or do their ideal heights differ? Do males and females seem to have the same attitudes regarding what is an ideal height? What group patterns do you notice? Discuss.

ANSWER: Most people appear to be pretty happy with their heights. The mean and median of the residuals are both less than 0.15 inches, with 50% of the residuals being between -1.5 and +1.5. Males and females have slightly different ideal height trends, with males showing a higher tendency to want to be taller as their measured height increases, but the difference is not very large. The most obvious group pattern between genders is that males are almost 5 inches taller than females on average.

END !!!

Data Science Project 1

Josh Sumner

Updated: Wednesday, February 27, 2019 @ 07:33:28 PM

MAUNA LOA CO₂ MODELING AND VISUALIZATION

The Question

Monthly Mean CO₂: The Last Five Years

Monthly Mean CO₂: A Major Milestone

Trends Over Time in CO₂ Growth

Annual Mean CO₂ Since 1959

Discussion

ANTHROPOMETRIC MODELING AND VISUALIZATION

The Question

Exploratory Data Analysis

Fitting the Model

Examine Residuals

Predicting Ideal Height

Additional Visualization

Discussion

Data Science Project 1

Josh Sumner

Updated: Wednesday, February 27, 2019 @ 07:33:28 PM

MAUNA LOA CO2 MODELING AND VISUALIZATION

The Question

Monthly Mean CO2: The Last Five Years

Monthly Mean CO2: A Major Milestone

Trends Over Time in CO2 Growth

Annual Mean CO2 Since 1959

Discussion

ANTHROPOMETRIC MODELING AND VISUALIZATION

The Question

Exploratory Data Analysis

Fitting the Model

Examine Residuals

Predicting Ideal Height

Additional Visualization

Discussion

MAUNA LOA CO₂ MODELING AND VISUALIZATION

Monthly Mean CO₂: The Last Five Years

Monthly Mean CO₂: A Major Milestone

Trends Over Time in CO₂ Growth

Annual Mean CO₂ Since 1959