Problem Set 2: Simple Regression Model. Due Friday, Oct 30th by 10 am

Instructions: Answer each question as thoroughly and efficiently as possible. Use an RMD file, and have it include your code as well as question answers. You may work with a partner/group, but everyone must submit their own answers. (Please indicate who you worked with if relevant). Please upload your solutions to Blackboard.

load("/cloud/project/bwght2.Rdata")

1.) Using data from 1988 for houses sold in Andover, MA, from Kiel & McClain (1995), the following equation relates housing price (price) to the distance from a recently built garbage incinerator (dist):

$$\widehat{\log(price)} = 9.40 + 0.312\,\log(dist), \qquad n = 135,\; R^2 = 0.162$$

Interpret the coefficient on log(dist).  Is the sign what you expected it to be?  Explain. 

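*Because both variables are in logs, the coefficient is an elasticity: a 1% increase in distance from the incinerator is associated with about a 0.312% increase in the predicted sale price. The positive sign is what I expected, since houses farther from a garbage incinerator should sell for more.*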
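
As a rough check on the elasticity reading of this coefficient, the sketch below applies the reported value of 0.312 to a 10% increase in distance. The 10% figure and the variable names are illustrative assumptions, not part of the problem set.

```r
# Elasticity from the fitted log-log equation given in the problem
b1 <- 0.312

# Hypothetical change in distance: 10% farther from the incinerator
pct_change_dist <- 10

# Usual approximation: % change in price = elasticity * % change in distance
approx_pct <- b1 * pct_change_dist

# Exact change implied by the log-log form: (1.10)^b1 - 1, in percent
exact_pct <- 100 * ((1 + pct_change_dist / 100)^b1 - 1)

round(c(approx = approx_pct, exact = exact_pct), 2)
```

The two numbers are close because the approximation works well for small percentage changes; the gap widens as the change in distance gets larger.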

Do you think simple regression provides an unbiased estimator of the ceteris paribus elasticity of price with respect to dist?  (Hint: Think about the city’s decision on where to put the incinerator). 

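*Probably not. A city is unlikely to put a garbage incinerator next to an established, expensive neighborhood, because it knows the surrounding houses would lose value; incinerators tend to be sited where property values are already low. Distance is therefore correlated with other factors that affect price and are left in the error term, so the simple regression estimate of the elasticity is likely biased.*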

What other factors about a house affect its price?  Would these be correlated with distance from the incinerator? Explain (briefly). 

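*Other factors that affect the price of a house include its age, whether it has been recently updated, and its square footage. These could be correlated with distance if, for example, the incinerator was built near older or smaller homes.*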

2.) One of your friends in the School of Nursing wants to be a neonatal nurse. She’s curious as to what affects a baby’s birth weight. Use the data in bwght2.Rdata to answer these questions. Note: Parity means birth order. Parity = 3 is the 3rd born, parity =1 is the first born.

bwghtv2<- bwght2[complete.cases(bwght2[,c("cigs")]),]

my_data[complete.cases(my_data[,c("x2","x3")]),]

What is the average, minimum, maximum, and standard deviation of bwght and cigs?

summary(bwghtv2$bwght)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     360    3081    3430    3409    3771    5204
summary(bwghtv2$cigs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.089   0.000  40.000
sd(bwghtv2$bwght)
## [1] 570.2629
sd(bwghtv2$cigs)
## [1] 4.222476
**What is the average level of family income? What percent of the babies are male?**  
describe(bwghtv2$male)
## vars        n  mean  sd median trimmed mad min max range    skew kurtosis    se
##    1 1.72e+03 0.516 0.5      1    0.52   0   0   1     1 -0.0627       -2 0.012
*About 52% of the babies are male: the mean of the 0/1 indicator male is 0.516.*

**If you were to run a regression of bwght on cigs, that is, $bwght = \beta_0 + \beta_1 cigs + u$, what sign would you expect $\beta_1$ to have? Why?**

I would expect $\beta_1$ to be negative: the more cigarettes a mother smokes, the less the baby should weigh.

**Determine the correlation between bwght and cigs. What kind of relationship do you see (positive/negative? weak/strong?)? Is it what you would expect? Why or why not?**
cor(bwghtv2$bwght, bwghtv2$cigs)
## [1] -0.08499059

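There is a weak negative correlation between birth weight and the number of cigarettes the mother smokes: the correlation is -0.085, which is close to zero. The negative sign is what I expected, but the linear relationship is weak.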

**Create a histogram of bwght.  Comment on the shape of the distribution.**  
bwghthist<-hist(bwghtv2$bwght)

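*The histogram is skewed to the left: the middle half of birth weights lie between 3081 and 3771, with a long lower tail reaching down to 360, and the mean (3409) is below the median (3430).*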

**Construct a scatter plot of bwght and cigs. Place cigs on the horizontal axis and bwght on the vertical axis. Include a “best fit” line. Does the scatter plot confirm your answers to d and e above? Explain.**
plot(bwghtv2$cigs, bwghtv2$bwght)         # scatter plot: cigs on x, bwght on y
abline(lm(bwght ~ cigs, data = bwghtv2))  # add the OLS best-fit line

**Regress bwght on cigs to determine the equation of the fitted line. Show your results.**

**Interpret the meaning of the intercept and slope parameters. Does the intercept have meaning?** 

**Given the regression equation, estimate the birthweight of a baby whose mom smoked 2 cigarettes per day.**
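
A minimal sketch of this prediction, assuming the fitted line reported elsewhere in this write-up (weight = 3421.71 - 11.48 cigs). The helper name `predict_bwght` is hypothetical.

```r
# Coefficients taken from the fitted equation in this write-up
b0 <- 3421.71   # intercept: predicted birth weight when cigs = 0
b1 <- -11.48    # slope: change in predicted birth weight per extra cigarette per day

# Hypothetical helper that evaluates the fitted line at a given value of cigs
predict_bwght <- function(cigs) b0 + b1 * cigs

predict_bwght(2)
```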

**How much of the variation in birthweight is explained by cigarettes?**
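
In a simple regression, R-squared equals the squared sample correlation between the two variables, so the share of variation explained can be checked from the correlation reported above (-0.08499059). A short sketch, with variable names chosen only for illustration:

```r
# Sample correlation between bwght and cigs, copied from the cor() output
r <- -0.08499059

# In a one-regressor model, R-squared is the squared correlation
r_squared <- r^2

# Percent of the variation in bwght explained by cigs
round(100 * r_squared, 2)
```

The result is well under 1%, which matches the weak correlation.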

Generate a variable that is equal to the estimated value of birthweight given the predicted regression equation and name this variable bwghthat. Summarize bwghthat and the cigs variables. 
cigreg<-lm(bwghtv2$cigs~bwghtv2$bwght, data=bwghtv2)
summary(cigreg)
## 
## Call:
## lm(formula = bwghtv2$cigs ~ bwghtv2$bwght, data = bwghtv2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.008 -1.264 -1.032 -0.773 38.977 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.2348697  0.6148875   5.261 1.61e-07 ***
## bwghtv2$bwght -0.0006293  0.0001779  -3.538 0.000414 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.208 on 1720 degrees of freedom
## Multiple R-squared:  0.007223,   Adjusted R-squared:  0.006646 
## F-statistic: 12.51 on 1 and 1720 DF,  p-value: 0.0004145
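*The output above comes from a regression of cigs on bwght, which reverses the variables. Regressing bwght on cigs, as the question asks, gives the fitted line bwght = 3421.71 - 11.48 cigs, with the same R-squared of 0.0072.*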
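
The slope of -11.48 can be cross-checked without rerunning anything: in a simple regression of y on x, the OLS slope equals the sample correlation times sd(y)/sd(x). The sketch below plugs in the correlation and standard deviations reported earlier in this write-up; the variable names are only for illustration.

```r
# Summary statistics copied from the earlier output
r    <- -0.08499059  # cor(bwght, cigs)
sd_y <- 570.2629     # sd(bwght)
sd_x <- 4.222476     # sd(cigs)

# OLS slope in a regression of bwght on cigs
slope <- r * sd_y / sd_x

round(slope, 2)
```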
bwghtv2$bwghthat <- predict(lm(bwght ~ cigs, data = bwghtv2))
summary(bwghtv2$bwghthat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2963    3422    3422    3409    3422    3422
summary(bwghtv2$cigs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.089   0.000  40.000
Construct a variable which is the estimated value of the residual term, given the predicted regression equation and label it “error”. 
bwghtv2$error<-(bwghtv2$bwght-bwghtv2$bwghthat)
summary(bwghtv2$error)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3061.711  -323.711     8.289     0.000   365.539  1782.289
Summarize the error term and verify that its mean is equal to zero.  Why does it matter what the mean is? (Paste your summary here). 

*Since the mean of the residuals is zero, the fitted line does not systematically overestimate or underestimate birth weight: the positive and negative errors cancel out on average.*
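
A zero mean residual is a mechanical property of OLS whenever the regression includes an intercept. A small sketch on simulated data illustrates this; the data frame `toy`, the seed, and the coefficients are made up for illustration and are not the course data.

```r
set.seed(42)

# Simulated regressor and outcome (hypothetical values)
toy <- data.frame(x = rpois(100, lambda = 2))
toy$y <- 3400 - 10 * toy$x + rnorm(100, sd = 500)

# Fit a simple regression with an intercept
fit <- lm(y ~ x, data = toy)

# The residuals average to zero up to floating-point error
mean_resid <- mean(resid(fit))
abs(mean_resid) < 1e-8
```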

n)  Create a scatter plot of bwghthat and cigs, overlaid with the actual observation points in the same graph.  Put cigs on the horizontal axis.  How do you think the regression line did in fitting the data? 
library(ggplot2)
ggplot(bwghtv2, aes(x = cigs)) +
  geom_point(aes(y = bwght, color = "bwght")) +       # actual observations
  geom_point(aes(y = bwghthat, color = "bwghthat"))   # fitted values

3.) Let rd represent annual expenditures (in millions of dollars) on research and development for the population of firms in the chemical industry. Let sales represent the annual sales (in millions of dollars).

a.) Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales (Hint: CES is interpreted as % change, % change). Which parameter is the elasticity?

$\log(rd) = \beta_0 + \beta_1 \log(sales) + u$, where $\beta_1$ is the elasticity of rd with respect to sales.
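
A constant elasticity model is usually written in log-log form. To see why the slope is then the elasticity, differentiate with respect to sales while holding u fixed. The symbols $\beta_0$, $\beta_1$, and $u$ follow the usual simple regression notation and are an assumption about the intended model:

$$
\log(rd) = \beta_0 + \beta_1 \log(sales) + u
\quad\Longrightarrow\quad
\beta_1 = \frac{\partial \log(rd)}{\partial \log(sales)}
        = \frac{\partial rd / rd}{\partial sales / sales}
        \approx \frac{\%\Delta rd}{\%\Delta sales}
$$

Because $\beta_1$ does not depend on the level of sales, the elasticity is the same at every point, which is what "constant elasticity" means.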

b.) Now, estimate the model using the data in RDCHEM.DTA. Write out the estimated equation, including the number of observations and the R-squared.

load("/cloud/project/rdchem.RData")
simplereg1<-lm(RDCHEM$sales~RDCHEM$rd)
summary(simplereg1)
## 
## Call:
## lm(formula = RDCHEM$sales ~ RDCHEM$rd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5827.2  -351.6   -89.7   695.8  7627.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  385.957    474.327   0.814    0.422    
## RDCHEM$rd     22.196      1.338  16.591   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2418 on 30 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8985 
## F-statistic: 275.3 on 1 and 30 DF,  p-value: < 2.2e-16

sales = 385.96 + 22.20 rd, n = 32, R-squared = 0.9017
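
A constant elasticity model is estimated by regressing the log of rd on the log of sales, not by a regression in levels. The sketch below shows the call on simulated data, since the course data file is not reproduced here; the data frame `toy`, the seed, and the true elasticity of 1.2 are made-up assumptions. With the real data, the same formula would be used with `data = RDCHEM`, assuming that is the name of the loaded data frame.

```r
set.seed(1)
n <- 200

# Simulated firms: sales and rd in levels (hypothetical values)
toy <- data.frame(sales = exp(rnorm(n, mean = 7, sd = 1.5)))
toy$rd <- exp(-3 + 1.2 * log(toy$sales) + rnorm(n, sd = 0.3))

# Log-log regression: the slope on log(sales) is the estimated elasticity
loglog <- lm(log(rd) ~ log(sales), data = toy)
elasticity <- unname(coef(loglog)["log(sales)"])

elasticity
```

Using the formula interface with `data =` lets the log transformations sit inside the formula, so no extra columns have to be created.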

c.) What is the estimated elasticity of rd with respect to sales? Explain in words what it means.

summary(RDCHEM$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    42.0   507.8  1332.5  3797.0  2856.6 39709.0
summary(RDCHEM$rd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.70   10.85   42.60  153.68   78.85 1428.00

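**E: 22.20 × (153.68/3797) = .90** Evaluated at the sample means, the level regression above implies an elasticity of about 0.90. This is the elasticity of sales with respect to rd, because sales is the dependent variable: a 1% increase in rd is associated with roughly a 0.90% increase in sales. The elasticity of rd with respect to sales is the slope $\beta_1$ in a regression of log(rd) on log(sales), which is the constant elasticity model that part (a) asks for.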

d.) What percent of the variation in research and development expenditures is explained by sales?

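R^2 = 0.9017, so about 90% of the variation in research and development expenditures is explained by sales. (In a simple regression R^2 is the squared sample correlation, so it is the same whether rd or sales is the dependent variable.)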