Isaac Rosenthal ENV710.002 Applied Data Analysis Midterm Exam

Upload this Rmd with your coding and discussions and the knitted html to Sakai. You do not need to hide your R code.

You may seek clarification from Professor Albright (not the TAs). You may NOT speak to other students or humans about this exam. Any communication about the exam with another human besides Professor Albright will be considered a violation of the Duke Community Standard.

Duke Community Standard

Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to protect and promote a culture of integrity.

The Pledge

Students affirm their commitment to uphold the values of the Duke University community by signing a pledge that states:

To uphold the Duke Community Standard:

I will not lie, cheat, or steal in my academic endeavors;
I will conduct myself responsibly in all my endeavors; and
I will act if the Standard is compromised.

I have adhered to the Duke Community Standard in completing this assignment.

Type your name below to state that you adhered to the Duke Community Standard while completing this exam.

Isaac Rosenthal

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Question 1 (20 points)

Lead from paint, including lead-contaminated dust, is one of the most common causes of lead poisoning. In 1986, the Chinese government started regulating household uses of lead-containing paint. However, if a home was built before 1986, there is a greater chance it has lead-containing paint on the interior walls as compared to newer homes (post-1986).

Yichang, a city in China near the Three Gorges Dam, has a mix of old and new houses and apartments. Assume that, overall, 30% of the dwellings in Yichang have lead-containing paint. If you randomly select 20 houses from the city for inspection:

(10 points) Using the binomial distribution, what is the probability that two or more houses in the sample (n=20) have lead-containing paint? Show your calculations using R code below. Discuss your solution in a sentence or two.

Notes: 30% of dwellings have lead paint; 20 houses are selected from the city; find the probability that 2 or more of the 20 have lead paint.

pbinom(1, size=20, prob=0.3, lower.tail = FALSE)

## [1] 0.9923627

The probability that two or more of the 20 sampled houses (that is, more than one house) have lead-containing paint is about 99.24%. This is for a sample size of 20 and a city-wide probability of 30% that a dwelling has lead-containing paint.
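As a quick cross-check of the calculation above, the same upper-tail probability can be reached in three equivalent ways: the complement rule, the `pbinom` upper tail, and a direct sum of the probability mass from 2 to 20. The variable names here are only illustrative.

```r
n <- 20   # houses sampled
p <- 0.3  # assumed city-wide share of dwellings with lead-containing paint

# Complement rule: P(X >= 2) = 1 - P(X = 0) - P(X = 1)
p.complement <- 1 - dbinom(0, n, p) - dbinom(1, n, p)

# Upper tail: P(X > 1)
p.tail <- pbinom(1, size = n, prob = p, lower.tail = FALSE)

# Direct sum of P(X = k) for k = 2, ..., 20
p.sum <- sum(dbinom(2:n, n, p))

c(complement = p.complement, tail = p.tail, sum = p.sum)
```

All three agree, which confirms that `pbinom(1, ..., lower.tail = FALSE)` is the right call for "two or more".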

(10 points) What did you need to assume in (a) in order for your calculations to be valid? State the assumptions as they relate to the problem. Is it valid to use the binomial in this context? Explain why or why not.

The binomial distribution rests on assumptions that allow us to determine the likelihood of a given event. I check each assumption below and assess whether the binomial is appropriate in this context.

1. There is a fixed number of trials (n). Yes: 20 houses are selected.
2. Each trial has two possible outcomes, success or failure. Yes: each house either has lead-containing paint or it does not.
3. The probability of success (call it p) is the same for each trial. Yes: with a random sample, every selected dwelling has the same probability (p = 0.30) of having lead-containing paint.
4. The trials are independent, meaning the outcome of one trial doesn’t influence the outcome of any other trial. Yes: finding lead-containing paint in one selected house doesn’t influence the outcome for any other house, provided the city has far more than 20 dwellings so that sampling without replacement barely changes p.
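The independence assumption can be probed with a small sketch. Houses are sampled without replacement, so the exact model is hypergeometric; the binomial is a good approximation when the city is much larger than the sample. The city size `N` below is a hypothetical value chosen for illustration, since the exam does not state the number of dwellings in Yichang.

```r
# Hypothetical number of dwellings in the city (NOT given in the exam)
N <- 100000
lead <- 0.3 * N  # dwellings with lead-containing paint under the 30% assumption

# Binomial: treats the 20 draws as independent with constant p
p.binom <- pbinom(1, size = 20, prob = 0.3, lower.tail = FALSE)

# Hypergeometric: exact for sampling 20 dwellings without replacement
p.hyper <- phyper(1, m = lead, n = N - lead, k = 20, lower.tail = FALSE)

c(binomial = p.binom, hypergeometric = p.hyper)
```

Under this assumed city size the two probabilities are nearly identical, which is why the binomial is treated as valid here.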

It is valid to use the binomial in this context because the assumptions are satisfied. As an aside, it is important to ensure the sample is truly random, because poorer neighborhoods may disproportionately have lead-containing paint.

Question 2 (40 points)

(30 points) Sustainable Land Management (SLM) intervention programs aim to promote agricultural conditions that can increase local food security. A study was developed to test whether or not the SLM intervention programs in rural Tanzania actually improved on the ground agricultural conditions. More specifically we ask whether farms that have had SLM interventions in the past experience less soil erosion than farms with no SLM intervention. The researcher collected data on both SLM and non-SLM farms using a meshed bag methodology that measures soil erosion in tons per hectare (tons/ha).

(5 points) Establish appropriate hypotheses for the analysis. What kind of comparison test should you conduct (selecting from tests we have covered in class)?

Test: one-sided, two-independent-sample Welch’s t-test.

Hypothesis: Farm plots in Tanzania with sustainable land management (SLM) intervention experience less soil erosion than farm plots without SLM intervention.

Ho: Mean soil erosion (tons/ha) on farm plots with SLM intervention - mean soil erosion on farm plots without SLM intervention >= 0

Ha: Mean soil erosion (tons/ha) on farm plots with SLM intervention - mean soil erosion on farm plots without SLM intervention < 0

(5 points) Discuss the assumptions underlying the test you selected in (a).

The populations of soil erosion (tons/ha) on farms with and without SLM intervention are both assumed to be normally distributed. Observations are independent within and between the two groups: soil is randomly sampled from each of the farm plots, and there is no spatial autocorrelation among soil erosion observations. Both samples are large (n >= 30), so the sampling distribution of the difference in means is approximately normal even if the populations are not exactly normal. The population standard deviations are unknown and are estimated by the sample standard deviations, and Welch’s test does not require the two groups to have equal variances. Lastly, the test is one-sided because the question asks specifically whether SLM reduces soil erosion.

(10 points) Calculate and interpret your results of your test of significance with respect to the question posed in the prompt. (Do farms that have had SLM interventions in the past experience less soil erosion than plots with no SLM intervention, and if so, by how much?). Summary statistics of the natural logged data are shown in the table.

#standard deviation of sample with SLM
s.SLM<-.5
#standard deviation of sample without SLM
s.NoSLM<-.8

#mean soil erosion in tons/ha
xbar.SLM<-2.87
xbar.NoSLM<-3.6

#sample sizes 
n.SLM<-100
n.NoSLM<-300

#Calculating the Welch's t-statistic (non-SLM minus SLM, so a large positive value supports Ha)
t.stat<-(xbar.NoSLM-xbar.SLM)/
   sqrt(s.NoSLM^2/n.NoSLM+s.SLM^2/n.SLM)

t.stat

## [1] 10.72448

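Interpretation: The t-statistic of 10.72 means the observed difference in mean log erosion (3.60 - 2.87 = 0.73) is about 10.7 standard errors away from zero. A statistic this large leads us to reject the null hypothesis, which says mean soil erosion with SLM is greater than or equal to mean soil erosion without SLM. In other words, mean soil erosion without SLM is significantly greater than with SLM. Because the data are natural-logged, the difference of 0.73 corresponds to typical (geometric mean) erosion on non-SLM farms being about exp(0.73) = 2.08 times that on SLM farms on the original tons/ha scale.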
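The test statistic, degrees of freedom, and one-sided p-value can be bundled into one small function that works from summary statistics alone. This is a minimal sketch; `welch.from.summary` is a hypothetical helper name, not a function from any package.

```r
# Welch's t-test from summary statistics (group 1 minus group 2).
welch.from.summary <- function(xbar1, s1, n1, xbar2, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  t.stat <- (xbar1 - xbar2) / sqrt(v1 + v2)
  # Welch-Satterthwaite degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  # One-sided p-value for Ha: mean of group 1 > mean of group 2
  p.value <- pt(t.stat, df, lower.tail = FALSE)
  list(t.stat = t.stat, df = df, p.value = p.value)
}

# Group 1 = non-SLM farms, group 2 = SLM farms (log tons/ha)
res <- welch.from.summary(3.60, 0.8, 300, 2.87, 0.5, 100)
res
```

Putting non-SLM first makes the alternative "non-SLM mean is larger", which is the same claim as Ha stated in terms of SLM minus non-SLM being negative.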

Soil erosion from SLM and Non-SLM Farms
Erosion presented in log(tons/hectare)
Treatment	Observations	Mean	SD
SLM	100	2.87	0.5
Non-SLM	300	3.60	0.8

(10 points) Calculate and interpret an appropriate 95% confidence interval (either one or two-sided is okay) to address the question in the prompt.

A<-s.NoSLM^2/n.NoSLM
B<-s.SLM^2/n.SLM

df<-(A+B)^2/
     (A^2/(n.NoSLM-1)+B^2/(n.SLM-1))
df

## [1] 273.99

qt(0.975, 273.99)

## [1] 1.96866

p.value<-pt(t.stat, df, lower.tail = FALSE) 
p.value

## [1] 6.154976e-23

(3.6-2.87)-1.96*sqrt(.8^2/300 +.5^2/100)

## [1] 0.5965856

(3.6-2.87)+1.96*sqrt(.8^2/300 +.5^2/100)

## [1] 0.8634144

exp(.5965)

## [1] 1.815753

exp(.8634)

## [1] 2.371209

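The 95% confidence interval for the difference in mean log erosion (non-SLM minus SLM) is (0.597, 0.863). Exponentiating the endpoints gives a 95% confidence interval of (1.82, 2.37) for the multiplicative factor. We are 95% confident that soil erosion on farms without sustainable land management is between 1.82 and 2.37 times the soil erosion on farms with sustainable land management. Because the whole interval lies above 1, it supports the conclusion that SLM farms experience less erosion. The interval uses the normal critical value 1.96, which is nearly identical to the t critical value of 1.969 computed above.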
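For comparison, here is a sketch of the same interval built with the t critical value instead of 1.96 and then back-transformed from the log scale. The variable names are illustrative, and the degrees of freedom are copied from the Welch-Satterthwaite result reported earlier in this answer.

```r
xbar.diff <- 3.60 - 2.87                      # non-SLM minus SLM, log(tons/ha)
se.diff <- sqrt(0.8^2 / 300 + 0.5^2 / 100)    # Welch standard error
df <- 273.99                                  # Welch-Satterthwaite df from above

# Two-sided 95% interval on the log scale
t.crit <- qt(0.975, df)
ci.log <- xbar.diff + c(-1, 1) * t.crit * se.diff

# Back-transform to a ratio (non-SLM erosion relative to SLM erosion)
ci.ratio <- exp(ci.log)

ci.log
ci.ratio
```

With this many degrees of freedom the t-based endpoints differ from the 1.96-based ones only in the third decimal place, so the interpretation is unchanged.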

(10 points) Discuss/Explain potential limitations to the sampling design and test you selected to answer the question posed (“Do farm plots that have had SLM interventions experience less soil erosion than plots with no SLM intervention?”). How might you improve the sampling approach?

Every sampling design has limitations, because no sample represents the actual population perfectly. The choice of sampling design is partly subjective and can introduce bias. Any improper selection of farmers, plots, climate conditions, etc. can result in a poor study. I would improve the sampling approach by including a time scale (e.g., how long SLM has been implemented) and by increasing the number of farms with SLM intervention.

Question 3 (40 points)

Please read about the Yale Climate Opinion Surveys, which can be found at this link: https://climatecommunication.yale.edu/visualizations-data/ycom-us/. The Yale county data can be found in the Sakai/Resources/Midterm folder. All variables were measured at the county level. The variables in the dataset include the following:

Yale Climate Opinion (2020) and US Census Data
Variables are estimated at the county level
Variables	Definition
County	County in the US
TotalPop	Population of County, US Census Bureau
popdensity	Number of Residents per square mile
gwvoteimp	Estimated % of adults who say global warming should be a high priority for the next President and Congress (2020)
drilloffshore	Estimated % of adults who support expanding offshore drilling for oil and natural gas off the US coast
CO2limits	Estimated % of adults who support setting strict CO2 limits on coal-fired plants
Data source: Yale Climate Communication (2020). Yale Climate Opinion Maps, last accessed October 2020, https://climatecommunication.yale.edu/visualizations-data/ycom-us/

(10 points) Develop a table showing key summary statistics of the variable gwimpvote grouped by state. Include the following states in your table: Arizona, Florida, Georgia, North Carolina, Ohio and Wisconsin. Use your favorite table-making package. The table should be clearly labeled, titled, and captioned. Use appropriate fonts and justification (including font size). The table should be professional enough to publish in the Economist.

getwd()

## [1] "C:/Users/isaac/OneDrive/Desktop/Duke, Sem. 3/Statistics"

yale.df<-read.csv("yale.csv")
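A grouped summary of this kind could be sketched with dplyr as below. The toy data frame stands in for `yale.df` and its values are invented purely for illustration. The column names `State` and `GWVote` follow the plotting code later in this answer, and full state names in `State` are an assumption about how the file is coded.

```r
library(dplyr)

# Toy stand-in for yale.df; these values are INVENTED for illustration only
toy.df <- data.frame(
  State  = c("Arizona", "Arizona", "Ohio", "Ohio", "Ohio", "Texas"),
  GWVote = c(50, 54, 46, 48, 53, 45)
)

# States requested in the prompt (assumes State holds full state names)
six.states <- c("Arizona", "Florida", "Georgia",
                "North Carolina", "Ohio", "Wisconsin")

summary.tbl <- toy.df %>%
  filter(State %in% six.states) %>%
  group_by(State) %>%
  summarise(
    Counties = n(),
    Mean     = mean(GWVote),
    Median   = median(GWVote),
    SD       = sd(GWVote),
    Min      = min(GWVote),
    Max      = max(GWVote)
  )

summary.tbl
```

The resulting data frame could then be passed to a table-making function such as `knitr::kable()` with a caption to produce the formatted table.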

(5 points) Develop ONE plot that compares the distributions of the variable gwimpvote across the six states in (a). Discuss the visualization and summary statistics in a paragraph. You do NOT need to use in-line R coding. You do NOT need to report all statistics in sentence form, but rather highlight important aspects of the distributions.

library(ggplot2)

State<-yale.df$State
GWVote<-yale.df$GWVote

# Horizontal boxplots of GWVote by state.
# A single fill colour is used: GWVote is continuous, so it cannot drive
# a two-colour manual fill scale.
ggplot(yale.df, aes(x=State, y=GWVote))+
  geom_boxplot(col="black", fill="mediumpurple")+
  labs(title="Estimated percent of adults who think global warming\nshould be a high priority, by state",
       x="State", y="Percent of adults")+
  coord_flip()

A few key takeaways: the mean and median within each of the six states are very close in value, which implies the outliers seen in this graph have little influence on the center of each distribution. Roughly 50% of adults across all six states think global warming should be a high priority. That share is not high, but it does not mean the remaining adults do not believe in global warming, only that they do not consider it a high priority.

(5 points) Develop a scatterplot using three of the variables in the table above using all states and counties. Use color or size to display the third variable. Use appropriate scales on the x and y axis to clearly illustrate the data and relationships between the variables. In a short paragraph, discuss the scatterplot and the relationships you are attempting to illustrate. The scatterplot should be appropriately labeled and titled and include a legend. The figure should be professional and understandable to the general public.

AllYale.df<-read.csv("AllYale.csv")
CO2Limits<-AllYale.df$CO2limits
TotalPop<-AllYale.df$TotalPop
PopDensity<-AllYale.df$PopDensity
logTotalPop<-log(TotalPop)
ggplot() + geom_point(data = AllYale.df, aes(x = CO2Limits, y = logTotalPop, color = State))+
  labs(x="Percent of adults who support setting\nlimits on CO2 from power plants",
       y="Log of county population")

Although this scatterplot looks crowded at first glance, it describes the relationship I am trying to illustrate. The data show that, regardless of population size, counties across the 50 states cluster at similar levels of support for CO2 limits on power plants, which is why the points overlap so heavily. Support for strict CO2 limits mostly falls in the 50-70% range for counties with a log population between 8 and 12.

(20 points) What is the relationship between population density (explanatory variable) and the percentage of adults that support CO2 limits for coal-fired plants (response variable) across all states and counties? Be sure to investigate whether transformations are needed. In answering this question, address the following:

1. Develop a simple linear regression model and interpret the coefficient (β1), its significance and overall goodness of fit of the model. (10 points)

logPopDensity<-log(PopDensity)
model <- lm(logPopDensity~ CO2Limits , data = AllYale.df)
model

## 
## Call:
## lm(formula = logPopDensity ~ CO2Limits, data = AllYale.df)
## 
## Coefficients:
## (Intercept)    CO2Limits  
##    -1.26152      0.08304

summary(model)

## 
## Call:
## lm(formula = logPopDensity ~ CO2Limits, data = AllYale.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8629 -0.9185  0.1265  1.1553  5.4152 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.261516   0.266961  -4.725  2.4e-06 ***
## CO2Limits    0.083040   0.004369  19.007  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.687 on 3140 degrees of freedom
## Multiple R-squared:  0.1032, Adjusted R-squared:  0.1029 
## F-statistic: 361.3 on 1 and 3140 DF,  p-value: < 2.2e-16

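The fitted log-linear equation is log(Population Density) = -1.26 + 0.083 * CO2Limits. Note that this model treats log population density as the response and CO2 limit support as the predictor, the reverse of the roles stated in the question, so the coefficient is interpreted in that direction. A one percentage point increase in the share of adults supporting strict CO2 limits is associated with a 0.083 increase in log population density, or roughly 100β1 = 8% more residents per square mile (more precisely, exp(0.083) - 1 = 8.7%). The coefficient is statistically significant (t = 19.0, p < 2e-16). Goodness of fit is modest: R-squared is 0.103, so the model explains about 10% of the variation in log population density, with a residual standard error of 1.687 on 3140 degrees of freedom.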
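Because the fitted model has the log of population density on the left-hand side, the slope translates into a percent change in density per one-point change in CO2 limit support. The sketch below works only from the coefficient, standard error, and degrees of freedom printed in the summary output above; the variable names are illustrative.

```r
b1 <- 0.083040    # slope on CO2Limits from the summary output
se.b1 <- 0.004369 # its standard error
df.resid <- 3140  # residual degrees of freedom

# Percent change in population density per one-point increase in support
pct.change <- (exp(b1) - 1) * 100

# 95% confidence interval for the slope, then on the percent scale
ci.b1 <- b1 + c(-1, 1) * qt(0.975, df.resid) * se.b1
ci.pct <- (exp(ci.b1) - 1) * 100

# t value, which should match the summary table
t.value <- b1 / se.b1

pct.change
ci.pct
t.value
```

The approximation 100β1 works here because the slope is small; the exact back-transformation gives a slightly larger percentage.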

2. Discuss the underlying assumptions of the simple linear regression and whether they are met. Use assumption diagnostic plots to support your discussion. (10 points)

plot(lm(logPopDensity~CO2Limits,data=AllYale.df))

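The classical assumptions of OLS must hold, and I judge them to be reasonably met in this problem. Linearity: the model is linear in its parameters, and population density is log-transformed to bring the relationship closer to a straight line. Sampling and independence: CO2 limit support is estimated for each county and population density comes from the Census, so the data cover counties broadly, although neighboring counties may not be fully independent of one another. Exogeneity: there is assumed to be no relationship between the predictor and the error term. Multicollinearity is not a concern because the model has only one predictor. Spherical errors: the errors are assumed to have no autocorrelation and constant variance (homoscedasticity, meaning no heteroscedasticity), which is checked with the residuals vs. fitted and scale-location plots. Normality: the error terms are assumed to be normally distributed, which is checked with the normal Q-Q plot. The residual summary (a minimum of -7.86 against a residual standard error of 1.69) suggests a few large negative residuals that are worth examining in that plot.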