Part I:
- Questions
  - Q1.
  - Q2.
  - Q3.
  - Q4.
Part II: Outliers
Submission Instructions

library( dplyr )
library( pander )
library( stargazer )

Part I:

URL <- "https://raw.githubusercontent.com/DS4PS/cpp-523-fall-2019/master/labs/data/IncomeHappiness.csv"
dat <- read.csv( URL )

Questions

Q1.

Read the study below, and then use the dataset called “IncomeHappiness.csv” to estimate the following model:

##      income           happiness     
##  Min.   :    38.9   Min.   : 21.34  
##  1st Qu.: 39970.8   1st Qu.: 62.83  
##  Median : 79994.1   Median : 78.36  
##  Mean   : 89662.9   Mean   : 72.72  
##  3rd Qu.:138908.2   3rd Qu.: 85.57  
##  Max.   :199952.2   Max.   :102.19

## [1] "income"    "happiness"

## [1] "x" "y"

##        x                  y                w         
##  Min.   : 0.00389   Min.   : 21.34   Min.   :  0.00  
##  1st Qu.: 3.99708   1st Qu.: 62.83   1st Qu.: 15.98  
##  Median : 7.99941   Median : 78.36   Median : 63.99  
##  Mean   : 8.96628   Mean   : 72.72   Mean   :113.78  
##  3rd Qu.:13.89082   3rd Qu.: 85.57   3rd Qu.:192.96  
##  Max.   :19.99522   Max.   :102.19   Max.   :399.81

$Happiness = b_0+b_1 Income+ b_2 (Income)^2+e$

You will need to create a new variable x-squared. Report your results in a regression table.
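The summaries above suggest that the columns were renamed to x and y, that income was rescaled to units of $10,000, and that a squared term w was added, but the code for that step is not shown. A minimal sketch of one way to do it, using a small made-up data frame (`toy`, hypothetical) as a stand-in for `dat`; the exact steps used in the original are an assumption:

```r
# toy stand-in for the lab data; the values are illustrative only
toy <- data.frame( income    = c( 15000, 75000, 100000, 150000 ),
                   happiness = c( 45, 75, 82, 88 ) )

names( toy ) <- c( "x", "y" )   # shorter names for the model formula
toy$x <- toy$x / 10000          # income in units of $10,000 (assumed from the summary above)
toy$w <- toy$x^2                # squared term for the quadratic model

# quadratic specification: y = b0 + b1*x + b2*w
m.toy <- lm( y ~ x + w, data=toy )
coef( m.toy )
```

Writing the formula as `y ~ x + I(x^2)` should give the same fit without creating a separate column.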

m <- lm( y ~ x, data=dat )
stargazer( m, type="html",
           omit.stat = c("rsq","f","ser"),
           notes.label = "Standard errors in parentheses" )


	Dependent variable:

	y

x	2.437***
	(0.037)

Constant	50.871***
	(0.390)


Observations	2,000
Adjusted R²	0.690

Standard errors in parentheses	*p<0.1; **p<0.05; ***p<0.01

m1 <- lm( y ~ x + w, data=dat )
stargazer( m1, type="html",
           omit.stat = c("rsq","f","ser"),
           notes.label = "Standard errors in parentheses" )


	Dependent variable:

	y

x	7.361***
	(0.089)

w	-0.252***
	(0.004)

Constant	35.348***
	(0.361)


Observations	2,000
Adjusted R²	0.883

Standard errors in parentheses	*p<0.1; **p<0.05; ***p<0.01

# predicted happiness over the income range (x is income in $10k units, w is x squared)
x_seq <- seq( 0, 20, by=0.1 )
y_hat <- predict( m1, data.frame( x=x_seq, w=x_seq^2 ) )
plot( dat$x, dat$y, 
      xlab="Income (Thousands of Dollars)", ylab="Happiness Scale",
      main="Does Money Make You Happy?",
      pch=19, col="darkorange", bty="n",
      xaxt="n" )
# tick positions are in $10k units to match the rescaled x
axis( side=1, at=c(0,5,10,15,20), labels=c("$0","$50k","$100k","$150k","$200k") )

lines( x_seq, y_hat, col=gray(0.3,0.5), lwd=6 )

summary(m1)

Call:
lm(formula = y ~ x + w, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.1420  -3.9703  -0.0493   3.9720  20.4357 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 35.348269   0.361399   97.81   <2e-16 ***
x            7.361023   0.088702   82.99   <2e-16 ***
w           -0.251607   0.004385  -57.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.806 on 1997 degrees of freedom
Multiple R-squared: 0.8829, Adjusted R-squared: 0.8828
F-statistic: 7529 on 2 and 1997 DF, p-value: < 2.2e-16

Q2.

How much happiness do you gain making an extra $10k when your initial income is $15k? –About 6.35 points

# coefficients from summary(m1): y = b0 + b1*x + b2*w, where w = x^2
b0 <- 35.348269
b1 <- 7.361023
b2 <- -0.251607

x <- 1.5  # $15k; x is income in units of $10,000 (use 15000 if you did not rescale above)
happy.15k <- b0 + b1*x + b2*x*x

x <- 2.5  # $25k (use 25000 if you did not rescale above)
happy.25k <- b0 + b1*x + b2*x*x

happy.25k - happy.15k  # marginal effect of $10k increase at $15k starting salary

## [1] 6.354595

Q3.

How much happiness do you gain making an extra $10k when your initial income is $75k? –About 3.34 points

## [1] 3.335311

Q4.

How much happiness do you gain making an extra $10k when your initial income is $100k? –About 2.08 points

## [1] 2.077276
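
Because the model is quadratic, the gain from an extra $10k depends on the starting income. The sketch below uses the coefficients reported by summary(m1) above and assumes x is measured in units of $10,000, as the summary of x suggests; `gain` and `x.peak` are names introduced here for illustration only.

```r
# coefficients from summary(m1); x is income in units of $10,000
b1 <- 7.361023
b2 <- -0.251607

# gain in predicted happiness from moving income from x0 to x0 + step
# (the intercept cancels out of the difference)
gain <- function( x0, step=1 )
{
  b1*step + b2*( (x0+step)^2 - x0^2 )
}

gain( c( 1.5, 7.5, 10 ) )   # starting incomes of $15k, $75k, $100k

# income at which predicted happiness peaks: slope b1 + 2*b2*x equals zero
x.peak <- -b1 / ( 2*b2 )
x.peak * 10000              # back to dollars
```

The gains shrink as the starting income rises, and under these coefficients the fitted curve peaks at roughly $146k.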

Part II: Outliers

For this part of the final assignment you will be using a dataset that examines compensation of nonprofit executive directors from the years 2012-2013. The data is extracted from the IRS E-Filer database available on AWS.

URL <- "https://github.com/DS4PS/cpp-523-fall-2019/blob/master/labs/data/np-comp-data.rds?raw=true"
dat1 <- readRDS(gzcon(url( URL )))
summary(dat1)

##     FILEREIN          FILERNAME1          NTMAJ12              NPAGE       
##  Min.   : 10024645   Length:65144       Length:65144       Min.   : -1.00  
##  1st Qu.:232997254   Class :character   Class :character   1st Qu.: 15.00  
##  Median :391318616   Mode  :character   Mode  :character   Median : 26.00  
##  Mean   :436908169                                         Mean   : 28.98  
##  3rd Qu.:593725701                                         3rd Qu.: 39.00  
##  Max.   :943151580                                         Max.   :110.00  
##      TAXYR         STATE              RULEDATE         REVENUE         
##  Min.   :2012   Length:65144       Min.   :190401   Min.   :6.000e+00  
##  1st Qu.:2012   Class :character   1st Qu.:197408   1st Qu.:4.986e+05  
##  Median :2012   Mode  :character   Median :198711   Median :1.437e+06  
##  Mean   :2012                      Mean   :198458   Mean   :1.278e+07  
##  3rd Qu.:2013                      3rd Qu.:199905   3rd Qu.:5.612e+06  
##  Max.   :2013                      Max.   :201404   Max.   :5.840e+09  
##      ASSETS             PERSONNM           TITLETXT             AVGHRS      
##  Min.   :-6.296e+06   Length:65144       Length:65144       Min.   :  1.15  
##  1st Qu.: 3.851e+05   Class :character   Class :character   1st Qu.: 40.00  
##  Median : 1.556e+06   Mode  :character   Mode  :character   Median : 40.00  
##  Mean   : 2.433e+07                                         Mean   : 39.33  
##  3rd Qu.: 7.029e+06                                         3rd Qu.: 40.00  
##  Max.   : 7.276e+10                                         Max.   :168.00  
##      SALARY            GENDER          PROPORTION_FEMALE    M2012CEO     
##  Min.   :       2   Length:65144       Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:   54000   Class :character   1st Qu.:0.0041    1st Qu.:0.0000  
##  Median :   83690   Mode  :character   Median :0.4366    Median :1.0000  
##  Mean   :  117191                      Mean   :0.4969    Mean   :0.5019  
##  3rd Qu.:  133547                      3rd Qu.:0.9972    3rd Qu.:1.0000  
##  Max.   :13573496                      Max.   :1.0000    Max.   :1.0000  
##      TREAT              POST       
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000  
##  Mean   :0.01822   Mean   :0.4988  
##  3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000

set.seed( 1234 )
d2 <- sample_n( dat1, 2000 ) # smaller sample for data viz purposes

plot( log(d2$REVENUE), log(d2$SALARY), bty="n", pch=19, col="darkorange",
      xlab="Nonprofit Revenue (logged)", ylab="Executive Director Salary (logged)",
      xlim=c(5,25), ylim=c(5,16))
abline( h=seq( 1, 20, 0.5 ), col=gray(0.5,0.2), lwd=1 )
abline( v=seq( 1, 25, 0.5 ), col=gray(0.5,0.2), lwd=1 )
abline( lm( log(d2$SALARY) ~ log(d2$REVENUE) ), col=gray(0.5,0.5), lwd=3 )

Codebook:

FILEREIN – Tax ID of the nonprofit
TAXYR – Year of the tax record (this data is from 2012 and 2013)
FILERNAME1 – Name of the nonprofit
STATE – Location of the nonprofit
RULEDATE – Year and month the nonprofit was granted status
NPAGE – Nonprofit age
REVENUE – Total annual revenue for the nonprofit
ASSETS – Total assets of the nonprofit
PERSONNM – Name of the Executive Director
TITLETXT – Title of the Executive Director
AVGHRS – Average hours worked each week
SALARY – Annual salary for the Executive Director
GENDER – Typical gender for someone with that first name
PROPORTION_FEMALE – The proportion of babies born with that first name that are female
M2012CEO – Was there a male executive director in 2012? 1=yes, 0=no
TREAT – Did the organization hire a new CEO in 2013 with a different gender?
POST – Dummy variable for the second year: 1=2013, 0=2012
NTMAJ12 – Subsector of the nonprofit
- AR Arts, culture, and humanities
- BH Education, higher
- ED Education
- EH Hospitals
- EN Environment
- HE Health
- HU Human services
- IN International
- MU Mutual benefit
- PU Public and societal benefit
- RE Religion
- UN Unknown

plot( log(d2$REVENUE), log(d2$SALARY), bty="n", pch=19, col=gray(0.5,0.2), cex=1.2,
      xlab="Nonprofit Revenue (logged)", ylab="Executive Director Salary (logged)",
      xlim=c(5,25), ylim=c(5,16))

abline( lm( log(d2$SALARY) ~ log(d2$REVENUE) ), col="darkorange", lwd=3 )

points( mean(log(d2$REVENUE)), mean(log(d2$SALARY)), pch=19, col="darkorange", cex=2 )
points( log(d2$REVENUE[c(1446,1681)]), log(d2$SALARY[c(1446,1681)]),
         cex=3, col="steelblue", lwd=2 )
points( log(d2$REVENUE[c(1446,1681)]), log(d2$SALARY[c(1446,1681)]),
         cex=1.5, col="steelblue", pch=19 )
text( log(d2$REVENUE[c(1446,1681)]), log(d2$SALARY[c(1446,1681)]), c("A","B"), 
      pos=4, offset=1.2, col="steelblue", cex=2  )

Q1. What is the likely impact of the outlier “A” on the regression line?

Will it make the slope larger or smaller? –larger
Would it contribute to a Type I or Type II error? –Type I

Q2. What is the likely impact of the outlier “B” on the regression line?

Will it make the slope larger or smaller? –smaller
Would it contribute to a Type I or Type II error? –Type II

Note: on the graph I saw, point B was located just right of the mean, under the regression line. When I knit the file, point B was located directly above the mean. If that is the case, the slope would not change at all, but the intercept would shift up.
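
The point about location can be checked with a small simulation. The sketch below uses simulated data, not the lab data, and all object names are introduced here for illustration. It adds one high point directly above the mean of x, then one high point far to the right of the mean, and compares the fitted coefficients.

```r
set.seed( 42 )
x <- rnorm( 100, mean=14, sd=2 )
y <- 6 + 0.35*x + rnorm( 100, sd=0.5 )
m.base <- lm( y ~ x )

# one extra point directly above the mean of x
x.mid <- c( x, mean(x) )
y.mid <- c( y, max(y) + 5 )
m.mid <- lm( y.mid ~ x.mid )

# one extra point far to the right of the mean and well above the line
x.far <- c( x, max(x) + 5 )
y.far <- c( y, max(y) + 5 )
m.far <- lm( y.far ~ x.far )

# rows: baseline, outlier at the mean, outlier far from the mean
rbind( base=coef(m.base), mid=coef(m.mid), far=coef(m.far) )
```

An outlier at the mean of x leaves the slope unchanged and only shifts the intercept, while the same vertical distance far from the mean tilts the line.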

Q3. The average logged revenue of a nonprofit in this data is 14.44879. What does that translate to in normal dollars? Use the `exp()` function.

exp(14.44879)

## [1] 1883778
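
Back-transforming an average of logged values gives the geometric mean, which for skewed data like revenue sits well below the arithmetic mean. A short sketch with made-up revenue figures (illustrative values, not drawn from the lab data):

```r
# illustrative revenues in dollars (made-up values)
revenue <- c( 50000, 400000, 1500000, 6000000, 90000000 )

mean.log <- mean( log(revenue) )   # average of the logged values
exp( mean.log )                    # back-transformed: the geometric mean
mean( revenue )                    # arithmetic mean, pulled up by the largest value
```

This is why the dollar figure from `exp()` is better read as the revenue of a typical nonprofit than as the average revenue.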

Q4. What would be the typical salary for a director of a nonprofit of this size?

$log(Salary) = 6.367 + 0.343 \cdot log(Revenue)$

m <- lm( log(SALARY) ~ log(REVENUE), data=d2 )
stargazer( m, type="html",
           omit.stat = c("rsq","f","ser"),
           notes.label = "Standard errors in parentheses" )


	Dependent variable:

	log(SALARY)

log(REVENUE)	0.354***
	(0.009)

Constant	6.193***
	(0.128)


Observations	2,000
Adjusted R²	0.445

Standard errors in parentheses	*p<0.1; **p<0.05; ***p<0.01

summary(m)

Call:
lm(formula = log(SALARY) ~ log(REVENUE), data = d2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.5050 -0.2418  0.0664  0.3376  2.2819 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.192936   0.128105   48.34   <2e-16 ***
log(REVENUE) 0.353668   0.008825   40.08   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6652 on 1998 degrees of freedom
Multiple R-squared: 0.4456, Adjusted R-squared: 0.4454
F-statistic: 1606 on 1 and 1998 DF, p-value: < 2.2e-16

log.sal <- 6.367 + 0.343*(14.44879)
sal <- exp(log.sal)
sal

## [1] 82696.7
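
The equation given in the question and the model fitted on the 2,000-row sample have slightly different coefficients, so it may be worth checking how much the prediction moves. A sketch, where `salary.hat` is a helper introduced here for illustration:

```r
log.rev <- 14.44879   # average logged revenue from Q3

# predicted salary in dollars from a log-log model
salary.hat <- function( b0, b1, log.rev )
{
  exp( b0 + b1*log.rev )
}

salary.hat( 6.367, 0.343, log.rev )         # equation given in the question
salary.hat( 6.192936, 0.353668, log.rev )   # coefficients from summary(m) on the sample
```

The two predictions differ by less than two thousand dollars, so the answer is not very sensitive to which set of estimates is used.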

Q5. Interpret the coefficient b1 (slope for log of revenue) in the model above (i.e. “a one-unit change in X corresponds with a b1-unit change in Y”, but adjusted for the log-log context). See the hand-out for guidance.

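–A one-percent increase in revenue is associated with roughly a b1-percent increase in salary, about 0.34 to 0.35 percent depending on which estimate above is used. In a log-log model the slope is an elasticity, not a dollar amount.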
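
In a log-log model the slope is usually read as an elasticity: the percent change in Y associated with a one-percent change in X. That reading is an approximation that works well for small changes. The sketch below computes the exact percent change implied by the slope from summary(m); `pct.salary` is a name introduced here for illustration.

```r
b1 <- 0.353668   # slope on log(REVENUE) from summary(m)

# exact percent change in predicted salary for a given percent change in revenue
pct.salary <- function( pct.revenue, b1 )
{
  100 * ( (1 + pct.revenue/100)^b1 - 1 )
}

pct.salary( 1, b1 )     # close to b1 itself
pct.salary( 10, b1 )    # a bit less than 10 * b1
pct.salary( 100, b1 )   # doubling revenue; the simple rule overstates this case
```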

Submission Instructions

After you have completed your lab, submit it via Canvas. Log in to the ASU portal at http://canvas.asu.edu and navigate to the assignments tab in the course repository. Upload your RMD and your HTML files to the appropriate lab submission link, or use the link from the Lab-02 tab on the Schedule page.

Remember to name your files according to the convention: Lab-##-LastName.xxx

Lab-06 Specification

Jacqui Anderson

04 October, 2021
