You may seek clarification from Professor Albright (not the TAs). You may NOT speak to other students or humans about this exam. Any communication about the exam with another human besides Professor Albright will be considered a violation of the Duke Community Standard.
Duke University is a community dedicated to scholarship, leadership, and service and to the principles of honesty, fairness, respect, and accountability. Citizens of this community commit to reflect upon and uphold these principles in all academic and nonacademic endeavors, and to protect and promote a culture of integrity.
Isaac Rosenthal
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Lead from paint, including lead-contaminated dust, is one of the most common causes of lead poisoning. In 1986, the Chinese government started regulating household uses of lead-containing paint. However, if a home was built before 1986, there is a greater chance it has lead-containing paint on the interior walls as compared to newer homes (post-1986).
Yichang, a city in China near the Three Gorges Dam, has a mix of old and new houses and apartments. Assume that, overall, 30% of the dwellings in Yichang have lead-containing paint. If you randomly select 20 houses from the city for inspection:
Notes: 30% of dwellings have lead paint; 20 houses are selected at random; find the probability that 2 or more of the 20 have lead paint.
pbinom(1, size=20, prob=0.3, lower.tail = FALSE)
## [1] 0.9923627
The probability that 2 or more of the 20 selected houses (that is, more than one house) have lead-containing paint is 0.9924, or about 99.24%. This assumes a sample of 20 houses and a city-wide probability of 30% that a dwelling has lead-containing paint.
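The same tail probability can be cross-checked with the complement rule, P(X >= 2) = 1 - P(X = 0) - P(X = 1). The sketch below is only a sanity check under the stated setup (n = 20, p = 0.3); the variable names are illustrative and not part of the original answer.

```r
# Cross-check P(X >= 2) for X ~ Binomial(n = 20, p = 0.3)
n <- 20
p <- 0.3

# Complement rule: subtract the probabilities of 0 and 1 successes
p_at_least_2 <- 1 - dbinom(0, size = n, prob = p) - dbinom(1, size = n, prob = p)

# Upper-tail form: P(X > 1), which is the same event as P(X >= 2)
p_upper_tail <- pbinom(1, size = n, prob = p, lower.tail = FALSE)

print(c(complement = p_at_least_2, upper_tail = p_upper_tail))
```

Both routes should agree, since "more than one" and "two or more" describe the same event for a count.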
A binomial distribution rests on assumptions that allow us to determine the likelihood of a given number of successes. I will check whether each assumption is met and whether the binomial is the proper model in this context.
1. There is a fixed number of trials (n). 1a. Yes, there are 20 houses selected.
2. Each trial has two possible outcomes: success or failure. 2a. Yes, each house either has lead-containing paint or it does not.
3. The probability of success (call it p) is the same for each trial. 3a. Yes, because the probability (p=0.30) of finding lead-containing paint is taken to be the same across the city.
4. The trials are independent, meaning the outcome of one trial does not influence the outcome of any other trial. 4a. Yes, selecting a house with lead-containing paint does not influence the outcome for any other house.
It is valid to use the binomial distribution in this context because the assumptions are satisfied. As an aside, it is important to ensure this is a random sample, because poorer neighborhoods may disproportionately have lead-containing paint.
#### Question 2 (40 points)
(30 points) Sustainable Land Management (SLM) intervention programs aim to promote agricultural conditions that can increase local food security. A study was developed to test whether or not the SLM intervention programs in rural Tanzania actually improved on-the-ground agricultural conditions. More specifically, we ask whether farms that have had SLM interventions in the past experience less soil erosion than farms with no SLM intervention. The researcher collected data on both SLM and non-SLM farms using a meshed bag methodology that measures soil erosion in tons per hectare (tons/ha).
One-sided, two-independent-sample Welch's t-test.
Hypothesis: Tanzanian farm plots with sustainable land management intervention experience less soil erosion than farm plots without sustainable land management intervention.
Ho: Mean soil erosion in tons per hectare in farm plots with SLM intervention - mean soil erosion in farm plots without SLM intervention >= 0
Ha: Mean soil erosion in tons per hectare in farm plots with SLM intervention - mean soil erosion in farm plots without SLM intervention < 0
(5 points) Discuss the assumptions underlying the test you selected in (a).
The populations of soil erosion in tons per hectare (tons/ha), with and without sustainable land management intervention, are both normally distributed. Observations are independent within and between the two groups: soil is randomly sampled from each of the farm plots, and there is no spatial autocorrelation among the soil erosion observations. Only summary statistics (sample means and standard deviations) are provided; both samples are large (n >= 30), so the sample means are approximately normal even if the populations are not. The population standard deviations are unknown and are not assumed to be equal, which is why Welch's test is used rather than a pooled t-test. Lastly, the test is one-sided because the question asks specifically whether SLM reduces soil erosion.
(10 points) Calculate and interpret your results of your test of significance with respect to the question posed in the prompt. (Do farms that have had SLM interventions in the past experience less soil erosion than plots with no SLM intervention, and if so, by how much?). Summary statistics of the natural logged data are shown in the table.
#standard deviation of sample with SLM
s.SLM<-.5
#standard deviation of sample without SLM
s.NoSLM<-.8
#mean soil erosion in log(tons/ha), from the summary table
xbar.SLM<-2.87
xbar.NoSLM<-3.6
#sample sizes
n.SLM<-100
n.NoSLM<-300
#Calculating the Welch's t-statistic
t.stat<-(xbar.NoSLM-xbar.SLM)/
sqrt(s.NoSLM^2/n.NoSLM+s.SLM^2/n.SLM)
t.stat
## [1] 10.72448
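Interpretation: The t-statistic of 10.72 is large: the observed difference in mean log erosion between non-SLM and SLM farms (3.60 - 2.87 = 0.73) is more than ten standard errors from zero. Note that the statistic is computed as non-SLM minus SLM, so a large positive value is evidence for the alternative. This suggests we reject the null hypothesis, which says mean soil erosion with SLM is at least as great as without SLM. In other words, mean soil erosion without SLM is greater than with SLM.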
Soil erosion from SLM and Non-SLM Farms (erosion presented in log(tons/hectare))

| Treatment | Observations | Mean | SD |
|---|---|---|---|
| SLM | 100 | 2.87 | 0.5 |
| Non-SLM | 300 | 3.60 | 0.8 |
A<-s.NoSLM^2/n.NoSLM
B<-s.SLM^2/n.SLM
df<-(A+B)^2/
(A^2/(n.NoSLM-1)+B^2/(n.SLM-1))
df
## [1] 273.99
qt(0.975, 273.99)
## [1] 1.96866
p.value<-pt(t.stat, df, lower.tail = FALSE)
p.value
## [1] 6.154976e-23
(3.6-2.87)-1.96*sqrt(.8^2/300 +.5^2/100)
## [1] 0.5965856
(3.6-2.87)+1.96*sqrt(.8^2/300 +.5^2/100)
## [1] 0.8634144
exp(.5965)
## [1] 1.815753
exp(.8634)
## [1] 2.371209
The 95% confidence interval for the multiplicative factor is (1.82, 2.37). Because the data are natural-logged, exponentiating the interval for the difference in means (non-SLM minus SLM) gives a ratio: we are 95% confident that typical soil erosion in tons/ha on farms without sustainable land management is between about 1.82 and 2.37 times that on farms using sustainable land management practices. Strictly, this is a ratio of geometric means on the original scale.
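The steps above (t-statistic, Welch degrees of freedom, one-sided p-value, and back-transformed interval) can be collected into one function. This is a minimal sketch, not the method required by the exam: `welch_from_summary` is a hypothetical helper name, and it uses the exact t critical value from `qt()` instead of the rounded 1.96, so its interval should come out marginally wider than the one computed by hand.

```r
# Hypothetical helper: Welch's t-test from summary statistics.
# Group 1 is the group expected to have the larger mean (here, non-SLM).
welch_from_summary <- function(mean1, sd1, n1, mean2, sd2, n2, conf = 0.95) {
  v1 <- sd1^2 / n1                      # squared standard error, group 1
  v2 <- sd2^2 / n2                      # squared standard error, group 2
  se <- sqrt(v1 + v2)                   # standard error of the difference
  diff <- mean1 - mean2
  t_stat <- diff / se
  # Welch-Satterthwaite degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  # One-sided p-value for Ha: mean1 - mean2 > 0
  p_one_sided <- pt(t_stat, df, lower.tail = FALSE)
  # Two-sided confidence interval on the log scale
  t_crit <- qt(1 - (1 - conf) / 2, df)
  ci <- diff + c(-1, 1) * t_crit * se
  list(t = t_stat, df = df, p = p_one_sided,
       ci_log = ci, ci_ratio = exp(ci))  # exp() turns the log difference into a ratio
}

# Summary statistics from the table (log(tons/ha)); non-SLM first, SLM second
res <- welch_from_summary(3.60, 0.8, 300, 2.87, 0.5, 100)
print(res)
```

Putting non-SLM first keeps the sign of the difference the same as in the hand calculation, so the ratio reads as non-SLM erosion relative to SLM erosion.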
Please read about the Yale Climate Opinion Surveys, which can be found at this link: https://climatecommunication.yale.edu/visualizations-data/ycom-us/. The Yale county data can be found in the Sakai/Resources/Midterm folder. All variables were measured at the county level. The variables in the dataset include the following:
Yale Climate Opinion (2020) and US Census Data (variables are estimated at the county level)

| Variables | Definition |
|---|---|
| County | County in the US |
| TotalPop | Population of County, US Census Bureau |
| popdensity | Number of Residents per square mile |
| gwvoteimp | Estimated % of adults who say global warming should be a high priority for the next President and Congress (2020) |
| drilloffshore | Estimated % of adults who support expanding offshore drilling for oil and natural gas off the US coast |
| CO2limits | Estimated % of adults who support setting strict CO2 limits on coal-fired plants |

Data source: Yale Climate Communication (2020). Yale Climate Opinion Maps, last accessed October 2020, https://climatecommunication.yale.edu/visualizations-data/ycom-us/
getwd()
## [1] "C:/Users/isaac/OneDrive/Desktop/Duke, Sem. 3/Statistics"
yale.df<-read.csv("yale.csv")
State<-yale.df$State
GWVote<-yale.df$GWVote
library(ggplot2) # provides ggplot(); harmless if already loaded
# One boxplot of GWVote per state, flipped to horizontal;
# fill is a fixed color rather than a mapped variable
ggplot(yale.df, aes(x=State, y=GWVote))+
geom_boxplot(col="black", fill="mediumpurple")+
labs(title="Estimated percent of adults who think global warming should be a high\npriority, by state", x="State", y="Percent of adults")+
coord_flip()
A few key takeaways: the mean and median within each of the six states are very close in value, which suggests the outliers seen in this graph have little influence on the center of each distribution. Roughly 50% of adults across all six states think global warming should be a high priority. That is not a large share. It does not mean the rest do not believe in global warming, only that they do not rank it as a high priority.
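The mean-versus-median comparison can be made explicit by tabulating both per state. The sketch below uses a small made-up data frame (`toy`), not the Yale data, purely to show the mechanics in base R; the column names `State` and `GWVote` mirror the ones used in the plot.

```r
# Made-up illustrative data, NOT the Yale county data
toy <- data.frame(
  State  = rep(c("A", "B"), each = 5),
  GWVote = c(48, 50, 51, 52, 49,   # state A: symmetric, no outlier
             45, 47, 46, 48, 70)   # state B: the last value is an outlier
)

# Mean and median of GWVote within each state
center <- aggregate(GWVote ~ State, data = toy,
                    FUN = function(x) c(mean = mean(x), median = median(x)))
center <- do.call(data.frame, center)  # flatten the matrix column into two columns

# A gap near zero means outliers are not pulling the mean away from the median
center$gap <- center$GWVote.mean - center$GWVote.median
print(center)
```

In the toy data, state B shows a visible gap because of its single extreme value, while state A shows none; a small gap in the real data would support the takeaway above.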
AllYale.df<-read.csv("AllYale.csv")
CO2Limits<-AllYale.df$CO2limits
TotalPop<-AllYale.df$TotalPop
PopDensity<-AllYale.df$PopDensity
logTotalPop<-log(TotalPop)
# Scatterplot of log county population against CO2 limit support, colored by state
ggplot(AllYale.df, aes(x = CO2Limits, y = logTotalPop, color = State)) +
geom_point() +
labs(x="Percent of adults who support setting\nlimits on CO2 from power plants", y="Log of county population")
This scatterplot looks crowded at first, but the crowding itself illustrates the relationship I am trying to show. Regardless of population size, support for CO2 limits on power plants is grouped similarly across the 50 states, which is why the points overlap so heavily. Support for strict CO2 limits falls mostly in the 50-70% range for counties with a log population between 8 and 12.
logPopDensity<-log(PopDensity)
model <- lm(logPopDensity~ CO2Limits , data = AllYale.df)
model
##
## Call:
## lm(formula = logPopDensity ~ CO2Limits, data = AllYale.df)
##
## Coefficients:
## (Intercept) CO2Limits
## -1.26152 0.08304
summary(model)
##
## Call:
## lm(formula = logPopDensity ~ CO2Limits, data = AllYale.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8629 -0.9185 0.1265 1.1553 5.4152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.261516 0.266961 -4.725 2.4e-06 ***
## CO2Limits 0.083040 0.004369 19.007 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.687 on 3140 degrees of freedom
## Multiple R-squared: 0.1032, Adjusted R-squared: 0.1029
## F-statistic: 361.3 on 1 and 3140 DF, p-value: < 2.2e-16
The log-linear equation is log(Population Density) = -1.26 + 0.083 × CO2Limits. Because the response is logged and the predictor is not, the slope β is read as a percent change in population density: a one-percentage-point increase in the share of adults supporting strict CO2 limits is associated with roughly a 100β% ≈ 8.3% increase in residents per square mile (more exactly, 100(e^β - 1) ≈ 8.7%). This is an association, not evidence that support for CO2 limits changes population density.
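As a small arithmetic check on the slope, the percent change implied by a logged response can be computed directly. This sketch hard-codes the coefficient from the `summary(model)` output; with the fitted object available, `coef(model)["CO2Limits"]` would give the same number.

```r
# Slope on CO2Limits, copied from the summary(model) output
beta <- 0.083040

# Quick approximation: percent change in population density
# per one-percentage-point increase in CO2 limit support
approx_pct <- 100 * beta

# Exact percent change implied by a log-transformed response
exact_pct <- 100 * (exp(beta) - 1)

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

The two figures are close because the slope is small; the approximation degrades as the coefficient grows.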
plot(lm(logPopDensity~CO2Limits,data=AllYale.df))
The classical assumptions of OLS must be met, and I judge that they are in this problem. The regression model is linear in its parameters. Random sampling has occurred: CO2 limit support is estimated for each county, and population density comes from the census, whose size and coverage make it an accurate sample. There is no relationship between CO2 limit support and the error term. There is no multicollinearity, since the model has a single predictor. The errors are spherical: there is no autocorrelation and no heteroscedasticity. Lastly, the error terms are normally distributed.
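Some of these assumptions can be probed numerically as well as from the diagnostic plots. The sketch below runs on simulated stand-in data (not the Yale data), with coefficients loosely borrowed from the fitted model; the object names are illustrative, and the auxiliary regression is only a rough Breusch-Pagan-style check, not a full test.

```r
set.seed(1)

# Simulated stand-in data, NOT the Yale county data
x <- runif(200, min = 40, max = 80)             # plays the role of CO2Limits
y <- -1.26 + 0.083 * x + rnorm(200, sd = 1.7)   # plays the role of logPopDensity
fit <- lm(y ~ x)

res <- residuals(fit)

# Normality of errors: Shapiro-Wilk test on the residuals
sw <- shapiro.test(res)

# Rough homoscedasticity check: do squared residuals trend with the predictor?
aux <- lm(I(res^2) ~ x)
aux_p <- summary(aux)$coefficients["x", "Pr(>|t|)"]

# Residuals are uncorrelated with the predictor by construction in OLS,
# so this value cannot confirm or refute exogeneity
res_x_cor <- cor(res, x)

print(c(shapiro_p = sw$p.value, aux_slope_p = aux_p, res_x_cor = res_x_cor))
```

Small p-values in the first two checks would point to non-normal or heteroscedastic errors. The last line is included as a caution: the assumption that the predictor is unrelated to the error term has to be argued from the study design, because the fitted residuals cannot reveal a violation.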