Basic Rules

Tips:

  • READ EVERYTHING CAREFULLY
  • To solve each task, refer to html files posted on Moodle
    • Copy, paste, and modify code to solve each task. (You rarely have to come up with code from scratch)
  • If you encounter an error, Google it
  • Don’t forget to set your working directory by hand
    • add the following code to your Rmd file to avoid wd problems (don’t forget to change your working directory):

Preliminaries:

  • Don’t forget to knit the Rmd file as you go. Knitting is different than running the R code, as it generates the html file right away.

Libraries

Make sure you install them in your console (never in the Rmd file!)

library(ggplot2)
library(MASS)
library(stargazer)
library(AER)
## Warning: package 'AER' was built under R version 3.5.2
library(plm)
## Warning: package 'plm' was built under R version 3.5.2
library(tidyverse)
## Warning: package 'stringr' was built under R version 3.5.2
library(lmtest)
library(sandwich)
library(dplyr)

1) Before you inspect the data, you must load it. Each team received an .Rda file through Slack. Save the .Rda file within your working directory and load the data by modifying the following code (Suppose the file is named: jimena.Rda):

load("Matt_Paul.Rda")
names(Matt.Paul)
##  [1] "age"             "gender"          "income"         
##  [4] "college"         "private.school"  "race"           
##  [7] "marital"         "kids"            "region"         
## [10] "urban"           "green.donations"

Suppose you are hired by an Environmental NGO that wants to understand its donors for climate change initiatives. You are given a dataset with donors’ information that includes the amount each donor contributed per year as well as various demographic and socio-economic characteristics. The variables are described below:

  • age = the age of an individual
  • gender = gender of individual (either female or male)
  • income = annual income of each individual
  • college = is a dummy variable that takes value of 1 if the individual has at least a college degree and 0 otherwise
  • private.school = is a dummy variable that takes value of 1 if the individual studied at a private college and 0 otherwise
  • race = individual’s race/ethnicity (white, black, latino, asian, or other)
  • marital = is the marital status of the individual (single, married, separated, or widow)
  • kids = number of children the individual has
  • region = region where individual lives (midwest, mountain, north east, south, or west coast)
  • urban = a dummy variable that takes the value of 1 if individual lives in an urban area with population above 500,000
  • green.donations = individual’s annual donation to the environmental NGO towards climate change initiatives.

Learning Objectives

Learning Objective 1

Students will inspect and summarize the data

2) Rename your data as data and convert it to a tibble data by using as_tibble()

data<-as_tibble(Matt.Paul)

3) Write code to find if there are NAs in the dataset. You may use tidyverse or a for loop or some other strategy. Hint: find examples online and modify them.

cat("Are there any NA's?  (T/F).",any(is.na(data)))
## Are there any NA's?  (T/F). FALSE

4) Display the top 10 observations within your data

head(data,10)
## # A tibble: 10 x 11
##      age gender income college private.school race  marital  kids region
##    <int> <fct>   <dbl>   <dbl>          <dbl> <fct> <fct>   <dbl> <fct> 
##  1    42 male    79186       1              1 black married     0 west.…
##  2    55 male    69082       0              1 white single      0 north…
##  3    20 female  63642       0              1 white married     0 west.…
##  4    28 male    68250       0              1 lati… single      0 north…
##  5    38 male    69375       0              0 white married     0 west.…
##  6    42 male    73457       0              1 white single      0 north…
##  7    51 male    69405       0              0 black married     0 west.…
##  8    35 male    67153       0              0 white single      0 north…
##  9    39 male    74032       1              0 white married     0 west.…
## 10    43 female  66327       1              0 white single      0 north…
## # ... with 2 more variables: urban <dbl>, green.donations <I(dbl)>

5) Write code to display the number observations in your data

obs<-nrow(data)
obs
## [1] 1800

6) Summarize the variable green.donations

attach(data)
summary(green.donations)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0   540.0   710.0   696.8   860.0  1260.0

7) Summarize the variable age

summary(age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0    27.0    37.0    36.5    46.0    55.0

8) Write code to find the share (proportion) of each race within the dataset:

race.table<-table(race)
race.table
## race
##  asian  black latino  other  white 
##    115    111    367     65   1142
cat(race.table[1]/obs*100,"% of people are Asian")
## 6.388889 % of people are Asian
cat(race.table[2]/obs*100,"% of people are Black" )
## 6.166667 % of people are Black
cat(race.table[3]/obs*100,"% of people are Latino" )
## 20.38889 % of people are Latino
cat(race.table[4]/obs*100,"% of people are 'Other'" )
## 3.611111 % of people are 'Other'
cat(race.table[5]/obs*100,"% of people are White" )
## 63.44444 % of people are White

Learning Objective 2

Students will create new variables using tidyverse

9) [5 points] Use tidyverse and the function mutate to create a dummy variable for each race named white, latino, black, asian, and other, respectively:

data.1 <- data %>% #rename dataset to lower case
  mutate(white = ifelse(race == "white",1,0)) %>% 
  mutate(black = ifelse(race == "black",1,0)) %>% # creating dummy for deny.y
  mutate(asian = ifelse(race == "asian",1,0)) %>% 
  mutate(latino = ifelse(race == "latino",1,0)) %>% 
  mutate(other = ifelse(race == "other",1,0))

To check (Remove the # once you get to this part)

attach(data.1)
## The following objects are masked from data:
## 
##     age, college, gender, green.donations, income, kids, marital,
##     private.school, race, region, urban
names(data.1)
##  [1] "age"             "gender"          "income"         
##  [4] "college"         "private.school"  "race"           
##  [7] "marital"         "kids"            "region"         
## [10] "urban"           "green.donations" "white"          
## [13] "black"           "asian"           "latino"         
## [16] "other"
race.table
## race
##  asian  black latino  other  white 
##    115    111    367     65   1142
white.tab<-table(white)
latino.tab<-table(latino)
black.tab<-table(black)
other.tab<-table(other)
asian.tab<-table(asian)
cat(ifelse(white.tab[2] == race.table[5],"YES","NO"), ",the variable 'white' was created correctly")
## YES ,the variable 'white' was created correctly
cat(ifelse(latino.tab[2] == race.table[3],"YES","NO"), ",the variable 'latino' was created correctly")
## YES ,the variable 'latino' was created correctly
cat(ifelse(black.tab[2] == race.table[2],"YES","NO"), ",the variable 'black' was created correctly")
## YES ,the variable 'black' was created correctly
cat(ifelse(other.tab[2] == race.table[4],"YES","NO"), ",the variable 'other' was created correctly")
## YES ,the variable 'other' was created correctly
cat(ifelse(asian.tab[2] == race.table[1],"YES","NO"), ",the variable 'asian' was created correctly")
## YES ,the variable 'asian' was created correctly

The number of 1’s per dummy variable should match the number of observations per race within each category.

10) Write code to find the share (proportion) of each gender within the dataset:

gender.table <- table(gender)
gender.table
## gender
## female   male 
##    622   1178
cat(gender.table[1]/obs*100,"% of people are female")
## 34.55556 % of people are female
cat(gender.table[2]/obs*100,"% of people are male")
## 65.44444 % of people are male

11) [2 points] Use tidyverse and the function mutate to create a dummy variable for each gender named female and male:

data.2<- data.1 %>% 
  mutate(gender.m = ifelse(gender == "male",1,0)) %>% 
  mutate(gender.f = ifelse(gender == "female",1,0))

12) Write code to check if you created the dummy variable correctly

detach(data.1)
attach(data.2)
## The following objects are masked from data:
## 
##     age, college, gender, green.donations, income, kids, marital,
##     private.school, race, region, urban
gender.table
## gender
## female   male 
##    622   1178
gender.tab.m<-table(gender.m)
gender.tab.f<-table(gender.f)
cat(ifelse(gender.tab.m[2] == gender.table[2],"YES","NO"), "the variable 'gender.m' was created correctly")
## YES the variable 'gender.m' was created correctly
cat(ifelse(gender.tab.f[2] == gender.table[1],"YES","NO"), "the variable 'gender.f' was created correctly")
## YES the variable 'gender.f' was created correctly

16) Write code to find the share (proportion) of each region within the dataset:

table.region<-table(region)
table.region
## region
##    midwest   mountain north.east      south west.coast 
##         23         40        846         35        856
cat(table.region[1]/obs*100,"% of people are from the Midwest")
## 1.277778 % of people are from the Midwest
cat(table.region[2]/obs*100,"% of people are from the Mountain")
## 2.222222 % of people are from the Mountain
cat(table.region[3]/obs*100,"% of people are from the Northeast")
## 47 % of people are from the Northeast
cat(table.region[4]/obs*100,"% of people are from the South")
## 1.944444 % of people are from the South
cat(table.region[5]/obs*100,"% of people are from the West Coast")
## 47.55556 % of people are from the West Coast

17) [5 points] Use tidyverse and the function mutate to create a dummy variable for each region named north.east, west.coast, south, mountain, and midwest:

data.3 <- data.2 %>% #rename dataset to lower case
  mutate(midwest.m = ifelse(region == "midwest",1,0)) %>% 
  mutate(mountain.m = ifelse(region == "mountain",1,0)) %>% # creating dummy for deny.y
  mutate(north.east.m = ifelse(region == "north.east",1,0)) %>% 
  mutate(south.m = ifelse(region == "south",1,0)) %>% 
  mutate(west.coast.m = ifelse(region == "west.coast",1,0))

18) Write code to check if you created the dummy variable correctly

detach(data.2)
attach(data.3)
## The following objects are masked from data:
## 
##     age, college, gender, green.donations, income, kids, marital,
##     private.school, race, region, urban
table.region
## region
##    midwest   mountain north.east      south west.coast 
##         23         40        846         35        856
midwest.tab<-table(midwest.m)
mountain.tab<-table(mountain.m)
NE.tab<-table(north.east.m)
south.tab<-table(south.m)
westcoast.tab<-table(west.coast.m)
cat(ifelse(midwest.tab[2] == table.region[1],"YES","NO"), "the variable 'midwest.m' was created correctly")
## YES the variable 'midwest.m' was created correctly
cat(ifelse(mountain.tab[2] == table.region[2],"YES","NO"), "the variable 'mountain.m' was created correctly")
## YES the variable 'mountain.m' was created correctly
cat(ifelse(NE.tab[2] == table.region[3],"YES","NO"), "the variable 'north.east.m' was created correctly")
## YES the variable 'north.east.m' was created correctly
cat(ifelse(south.tab[2] == table.region[4],"YES","NO"), "the variable 'south' was created correctly")
## YES the variable 'south' was created correctly
cat(ifelse(westcoast.tab[2] == table.region[5],"YES","NO"), "the variable 'west.coast.m' was created correctly")
## YES the variable 'west.coast.m' was created correctly

19) Write code to find the share (proportion) of each marital status within the dataset:

table.marital<-table(marital)
table.marital
## marital
##   married separated    single     widow 
##       846        31       869        54
cat(table.marital[1]/obs*100,"% of people are married")
## 47 % of people are married
cat(table.marital[2]/obs*100,"% of people seperated are ")
## 1.722222 % of people seperated are
cat(table.marital[3]/obs*100,"% of people are single")
## 48.27778 % of people are single
cat(table.marital[4]/obs*100,"% of people are widows")
## 3 % of people are widows

20) Use tidyverse and the function mutate to create a dummy variable for single individuals named single:

data.4<-data.3 %>% 
  mutate(single =ifelse(marital == "single",1,0))

21) Write code to check if you created the dummy variable correctly

detach(data.3)
attach(data.4)
## The following objects are masked from data:
## 
##     age, college, gender, green.donations, income, kids, marital,
##     private.school, race, region, urban
table.marital
## marital
##   married separated    single     widow 
##       846        31       869        54
single.tab<-table(single)
cat(ifelse(single.tab[2] == table.marital[3],"YES","NO"), "the variable 'single' was created correctly")
## YES the variable 'single' was created correctly

Learning Objective 3

Students will visualize the data using ggplot2

22) [4 points] Generate a histogram of green.donations using ggplot2, which should include titles and appropriate horizontal and vertical scales.

data.4 %>% ggplot(aes(x = green.donations,)) + 
  geom_histogram(size = 1, colour = "darkgreen", na.rm = TRUE, binwidth = 50)+
  ggtitle(paste("Green Donations")) +
  theme_bw() + #to have black and white background
  scale_x_continuous(breaks = seq(0,1200,100))+
  scale_y_continuous(breaks=seq(0,200,10)) +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x=element_text(size=12)) +
  theme(axis.text.y=element_text(size=16)) +
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16, angle=90)) +
  theme(plot.title=element_text(size=14)) +
  theme(legend.text=element_text(size=18)) +
  xlab("Annual Donations ($)")+ # to label the axis
  ylab("Frequency") # to label axis

23) [4 points] Generate a barplot of race using ggplot2, which should include titles and appropriate horizontal and vertical scales.

data.4 %>% ggplot(aes(x = race, fill= race)) + 
  geom_bar(size = 1, colour = "black", na.rm = TRUE)+
  ggtitle(paste("Count of Race")) +
  theme_bw() + #to have black and white background
  scale_y_continuous(breaks=seq(0,2000,100)) +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x=element_text(size=12)) +
  theme(axis.text.y=element_text(size=16)) +
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16, angle=90)) +
  theme(plot.title=element_text(size=14)) +
  theme(legend.text=element_text(size=18)) +
  xlab("Race")+ # to label the axis
  ylab("Count") # to label axis

24) [4 points] Visualize the relationship green.donations and gender using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.

data.4 %>% ggplot(aes(x = gender, y= green.donations,fill = gender)) + 
  geom_boxplot(size = 1, colour = "black", na.rm = TRUE)+
  ggtitle(paste("Donations by Gender")) +
  theme_bw() + #to have black and white background
  scale_y_continuous(breaks=seq(0,2000,100)) +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x=element_text(size=12)) +
  theme(axis.text.y=element_text(size=16)) +
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16, angle=90)) +
  theme(plot.title=element_text(size=14)) +
  theme(legend.text=element_text(size=18)) +
  xlab("Gender")+ # to label the axis
  ylab("Annual Donation ($)") # to label axis

25) [4 points] Visualize the relationship green.donations and income using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.

data.4 %>% ggplot(aes(y= green.donations, x=income)) + 
  geom_point(size = 1, colour = "blue", na.rm = TRUE)+
  ggtitle(paste("Donations VS Kids ")) +
  theme_bw() + #to have black and white background
  scale_x_continuous(breaks = seq(50000,100000,10000))+
  scale_y_continuous(breaks=seq(0,2000,100)) +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x=element_text(size=12)) +
  theme(axis.text.y=element_text(size=16)) +
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16, angle=90)) +
  theme(plot.title=element_text(size=14)) +
  theme(legend.text=element_text(size=18)) +
  xlab("Income")+ # to label the axis
  ylab("Annual Donation ($)") # to label axis

26) [4 points] Visualize the relationship green.donations and kids using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.

data.4 %>% ggplot(aes(y= green.donations, x=kids, group = kids)) + 
  geom_boxplot(size = 1,  na.rm = TRUE)+
  ggtitle(paste("Donations VS Kids ")) +
  theme_bw() + #to have black and white background
  scale_x_continuous(breaks = seq(0,5,1))+
  scale_y_continuous(breaks=seq(0,2000,100)) +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x=element_text(size=12)) +
  theme(axis.text.y=element_text(size=16)) +
  theme(axis.title.x=element_text(size=16)) +
  theme(axis.title.y=element_text(size=16, angle=90)) +
  theme(plot.title=element_text(size=14)) +
  theme(legend.text=element_text(size=18)) +
  xlab("# of Kids")+ # to label the axis
  ylab("Annual Donation ($)") # to label axis

Learning Objective 4

Students will run several OLS models to find the best fit for their data using robust standard errors

Estimate the following OLS models and named them accordingly:

ols1:

\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3college_i +\beta_7white_i+\beta_8female_i +u_{i} \end{equation}\)

ols2:

\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3private.school_i +\beta_7white_i+\beta_8female_i +u_{i} \end{equation}\)

ols3:

\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3college_i +\beta_7latino_i+\beta_8female_i +u_{i} \end{equation}\)

ols4:

\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i}+\beta_2age_{i}^2 + \beta_3college_i + \beta_4ln(income_i) +\beta_5single_i + \beta_6kids_i +\beta_7white_i+\beta_8female_i+\beta_9urban_i +u_{i} \end{equation}\)

ols5:

\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i}+\beta_2age_{i}^2 + \beta_3private.school_i + \beta_4ln(income_i) +\beta_5single_i + \beta_6kids_i +\beta_7white_i+\beta_8female_i+\beta_9(north.east_i*west.coast_i) +u_{i} \end{equation}\)

27) [5 points] Write code to estimate the models above.

attach(data.4)
## The following objects are masked from data.4 (pos = 3):
## 
##     age, asian, black, college, gender, gender.f, gender.m,
##     green.donations, income, kids, latino, marital, midwest.m,
##     mountain.m, north.east.m, other, private.school, race, region,
##     single, south.m, urban, west.coast.m, white
## The following objects are masked from data:
## 
##     age, college, gender, green.donations, income, kids, marital,
##     private.school, race, region, urban
mod1<- lm(green.donations~ age +college +white +gender.f)
mod2<-lm(green.donations~age+private.school+white+gender.f)
mod3<-lm(green.donations ~ age + college + latino + gender.f)
mod4<-lm(green.donations ~ age + age^2 + private.school + log(income) + single + kids+ white +gender.f + urban)
mod5<-lm(green.donations ~age +age^2 + private.school + log(income) + single + kids + white + gender.f + north.east.m + west.coast.m)

28) [5 points] Write code to gather the robust standard errors in a list.

rob_se <- list(sqrt(diag(vcovHC(mod1, type = "HC1"))),
               sqrt(diag(vcovHC(mod2, type = "HC1"))),
               sqrt(diag(vcovHC(mod3, type = "HC1"))),
               sqrt(diag(vcovHC(mod4, type = "HC1"))),
               sqrt(diag(vcovHC(mod5, type = "HC1"))))

Learning Objective 5

Students will use stargazer package to generate tables summarizing their regression results

29) [10 points] Use stargazer (only works in Rmd file) to generate a table with the results (Hint: make sure you include type = “html” within stargazer()):

stargazer(mod1, mod2, mod3, mod4, mod5,
         digits = 3,
         header = FALSE,
         type = "html",
         se = rob_se,
         title = "Linear Regression of Green Donations",
         model.numbers = FALSE,
         column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)"))
Linear Regression of Green Donations
Dependent variable:
green.donations
(1) (2) (3) (4) (5)
age -15.574*** -15.483*** -15.576*** -14.928*** -14.929***
(0.248) (0.327) (0.248) (0.064) (0.066)
college 197.928*** 198.054***
(5.389) (5.387)
private.school 2.648 2.266* 2.648*
(7.010) (1.344) (1.394)
log(income) 1,711.266*** 1,708.884***
(8.154) (8.516)
single 13.696*** 15.601***
(1.352) (3.703)
kids -23.687*** -15.528***
(1.421) (2.972)
white -7.519 -10.885 2.728** 2.379*
(5.465) (7.250) (1.383) (1.435)
latino 0.072
(6.566)
gender.f -28.124*** -33.092*** -28.275*** -27.282*** -27.925***
(5.712) (7.453) (5.718) (1.426) (1.471)
urban 19.028***
(1.579)
north.east.m 7.406
(7.416)
west.coast.m 9.437
(7.827)
Constant 1,157.989*** 1,278.980*** 1,153.264*** -17,847.060*** -17,825.750***
(10.843) (13.957) (10.326) (91.368) (96.112)
Observations 1,800 1,800 1,800 1,800 1,800
R2 0.742 0.554 0.742 0.984 0.983
Adjusted R2 0.741 0.553 0.741 0.984 0.982
Residual Std. Error 113.073 (df = 1795) 148.579 (df = 1795) 113.131 (df = 1795) 28.441 (df = 1791) 29.422 (df = 1790)
F Statistic 1,289.101*** (df = 4; 1795) 557.742*** (df = 4; 1795) 1,287.317*** (df = 4; 1795) 13,510.470*** (df = 8; 1791) 11,208.830*** (df = 9; 1790)
Note: p<0.1; p<0.05; p<0.01

Learning Objective 6

Students will assess the goodness of fit of the regressions

30) [6 points] Complegte bullet points below describing the goodness of fit of the regression and state which regression is better

  • (I): Adj. R^2 = .741
  • (II): Adj. R^2 = .553
  • (III): Adj. R^2 = .741
  • (IV): Adj. R^2 = .984
  • (V): Adj. R^2 = .982

  • The best regression is: Model 4 because it has the highest Adj. R^2 value.

31) [1 points] UselinearHypothesis() to test simultaneously whether the following variables should be included in ols5 model.

linearHypothesis(mod5, c("white=0", "private.school=0"), white.adjust = "hc1")
## Linear hypothesis test
## 
## Hypothesis:
## white = 0
## private.school = 0
## 
## Model 1: restricted model
## Model 2: green.donations ~ age + age^2 + private.school + log(income) + 
##     single + kids + white + gender.f + north.east.m + west.coast.m
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F  Pr(>F)  
## 1   1792                    
## 2   1790  2 3.0802 0.04619 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

32) [2 points] Interpret the results from the test above. Should you keep both variables?

With the F-Statistic being 3.08 and the pvalue of .04619 which is significant at the 99% level, we should keep both variables in the model.

Learning Objective 7

Students will interpret the coefficients from the models

33) [10 points] Interpret the coefficients of the following variables from the best ols model. Write a sentence explaining the intuition behind the result. Complete the following bullet points with your answers.

  • age:
    • interpretation: Every year older the person gets, they donate $14.93 less dollars
    • intuition: I believed that the relationship would be opposite, a possible reason for it being this way is that younger people may be more concerned with the environment than older people.
  • college:
    • interpretation: N/A
    • intuition: N/A
  • ln(income):
    • interpretation: As income increases by 1% green donations increase by $17.11
    • intuition: As people are making more money they feel more willing to donate to help the environment
  • single:
    • interpretation: If someone is single, they donate $13.7 more dollars
    • intuition: More income to spend on other things
  • kids:
    • interpretation: As the amount of kids go up by 1 the amount donated goes down by $23.69
    • intuition: having more kids forces the parents to spend money on them rather than donating it

Learning Objective 8

Students will comment about potential bias in their estimation strategy

34) [5 points] Write about potential sources of bias in your regression. Hint: Review Chapter 10 HAGS.

Ommited variable bias. If we added a variable for political orientation, i believe that it would have been a great addition to the model.

Learning Objective 9

Students will communicate findings and provide recommentaions based on analysis

35) [5 points] Write a paragraph reporting your findings to the NGO. In this paragraph, include recommendations for increasing green donations for climate change initiatives.

Based on my findings, the best demographic to collect donations from would be young wealthy individuals. This is because according to my best Model, younger individuals are more likely to donate, as well as more wealthy individuals. It would also be beneficial to target people whith few to zero children, as they are more likely to donate as well.