html file and an Rmd file.knit the Rmd file as you go. Knitting is different than running the R code, as it generates the html file right away.Make sure you install them in your console (never in the Rmd file!)
library(ggplot2)
library(MASS)
library(stargazer)
library(AER)
## Warning: package 'AER' was built under R version 3.5.2
library(plm)
## Warning: package 'plm' was built under R version 3.5.2
library(tidyverse)
## Warning: package 'stringr' was built under R version 3.5.2
library(lmtest)
library(sandwich)
library(dplyr)
1) Before you inspect the data, you must load it. Each team received an .Rda file through Slack. Save the .Rda file within your working directory and load the data by modifying the following code (Suppose the file is named: jimena.Rda):
load("Matt_Paul.Rda")
names(Matt.Paul)
## [1] "age" "gender" "income"
## [4] "college" "private.school" "race"
## [7] "marital" "kids" "region"
## [10] "urban" "green.donations"
Suppose you are hired by an Environmental NGO that wants to understand its donors for climate change initiatives. You are given a dataset with donors’ information that includes the amount each donor contributed per year as well as various demographic and socio-economic characteristics. The variables are described below:
tidyverseggplot2stargazer package to generate tables summarizing the regression resultsStudents will inspect and summarize the data
2) Rename your data as data and convert it to a tibble data by using as_tibble()
data<-as_tibble(Matt.Paul)
3) Write code to find if there are NAs in the dataset. You may use tidyverse or a for loop or some other strategy. Hint: find examples online and modify them.
cat("Are there any NA's? (T/F).",any(is.na(data)))
## Are there any NA's? (T/F). FALSE
4) Display the top 10 observations within your data
head(data,10)
## # A tibble: 10 x 11
## age gender income college private.school race marital kids region
## <int> <fct> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <fct>
## 1 42 male 79186 1 1 black married 0 west.…
## 2 55 male 69082 0 1 white single 0 north…
## 3 20 female 63642 0 1 white married 0 west.…
## 4 28 male 68250 0 1 lati… single 0 north…
## 5 38 male 69375 0 0 white married 0 west.…
## 6 42 male 73457 0 1 white single 0 north…
## 7 51 male 69405 0 0 black married 0 west.…
## 8 35 male 67153 0 0 white single 0 north…
## 9 39 male 74032 1 0 white married 0 west.…
## 10 43 female 66327 1 0 white single 0 north…
## # ... with 2 more variables: urban <dbl>, green.donations <I(dbl)>
5) Write code to display the number observations in your data
obs<-nrow(data)
obs
## [1] 1800
6) Summarize the variable green.donations
attach(data)
summary(green.donations)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 540.0 710.0 696.8 860.0 1260.0
7) Summarize the variable age
summary(age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 27.0 37.0 36.5 46.0 55.0
8) Write code to find the share (proportion) of each race within the dataset:
race.table<-table(race)
race.table
## race
## asian black latino other white
## 115 111 367 65 1142
cat(race.table[1]/obs*100,"% of people are Asian")
## 6.388889 % of people are Asian
cat(race.table[2]/obs*100,"% of people are Black" )
## 6.166667 % of people are Black
cat(race.table[3]/obs*100,"% of people are Latino" )
## 20.38889 % of people are Latino
cat(race.table[4]/obs*100,"% of people are 'Other'" )
## 3.611111 % of people are 'Other'
cat(race.table[5]/obs*100,"% of people are White" )
## 63.44444 % of people are White
Students will create new variables using tidyverse
9) [5 points] Use tidyverse and the function mutate to create a dummy variable for each race named white, latino, black, asian, and other, respectively:
data.1 <- data %>% #rename dataset to lower case
mutate(white = ifelse(race == "white",1,0)) %>%
mutate(black = ifelse(race == "black",1,0)) %>% # creating dummy for deny.y
mutate(asian = ifelse(race == "asian",1,0)) %>%
mutate(latino = ifelse(race == "latino",1,0)) %>%
mutate(other = ifelse(race == "other",1,0))
To check (Remove the # once you get to this part)
attach(data.1)
## The following objects are masked from data:
##
## age, college, gender, green.donations, income, kids, marital,
## private.school, race, region, urban
names(data.1)
## [1] "age" "gender" "income"
## [4] "college" "private.school" "race"
## [7] "marital" "kids" "region"
## [10] "urban" "green.donations" "white"
## [13] "black" "asian" "latino"
## [16] "other"
race.table
## race
## asian black latino other white
## 115 111 367 65 1142
white.tab<-table(white)
latino.tab<-table(latino)
black.tab<-table(black)
other.tab<-table(other)
asian.tab<-table(asian)
cat(ifelse(white.tab[2] == race.table[5],"YES","NO"), ",the variable 'white' was created correctly")
## YES ,the variable 'white' was created correctly
cat(ifelse(latino.tab[2] == race.table[3],"YES","NO"), ",the variable 'latino' was created correctly")
## YES ,the variable 'latino' was created correctly
cat(ifelse(black.tab[2] == race.table[2],"YES","NO"), ",the variable 'black' was created correctly")
## YES ,the variable 'black' was created correctly
cat(ifelse(other.tab[2] == race.table[4],"YES","NO"), ",the variable 'other' was created correctly")
## YES ,the variable 'other' was created correctly
cat(ifelse(asian.tab[2] == race.table[1],"YES","NO"), ",the variable 'asian' was created correctly")
## YES ,the variable 'asian' was created correctly
The number of 1’s per dummy variable should match the number of observations per race within each category.
10) Write code to find the share (proportion) of each gender within the dataset:
gender.table <- table(gender)
gender.table
## gender
## female male
## 622 1178
cat(gender.table[1]/obs*100,"% of people are female")
## 34.55556 % of people are female
cat(gender.table[2]/obs*100,"% of people are male")
## 65.44444 % of people are male
11) [2 points] Use tidyverse and the function mutate to create a dummy variable for each gender named female and male:
data.2<- data.1 %>%
mutate(gender.m = ifelse(gender == "male",1,0)) %>%
mutate(gender.f = ifelse(gender == "female",1,0))
12) Write code to check if you created the dummy variable correctly
detach(data.1)
attach(data.2)
## The following objects are masked from data:
##
## age, college, gender, green.donations, income, kids, marital,
## private.school, race, region, urban
gender.table
## gender
## female male
## 622 1178
gender.tab.m<-table(gender.m)
gender.tab.f<-table(gender.f)
cat(ifelse(gender.tab.m[2] == gender.table[2],"YES","NO"), "the variable 'gender.m' was created correctly")
## YES the variable 'gender.m' was created correctly
cat(ifelse(gender.tab.f[2] == gender.table[1],"YES","NO"), "the variable 'gender.f' was created correctly")
## YES the variable 'gender.f' was created correctly
16) Write code to find the share (proportion) of each region within the dataset:
table.region<-table(region)
table.region
## region
## midwest mountain north.east south west.coast
## 23 40 846 35 856
cat(table.region[1]/obs*100,"% of people are from the Midwest")
## 1.277778 % of people are from the Midwest
cat(table.region[2]/obs*100,"% of people are from the Mountain")
## 2.222222 % of people are from the Mountain
cat(table.region[3]/obs*100,"% of people are from the Northeast")
## 47 % of people are from the Northeast
cat(table.region[4]/obs*100,"% of people are from the South")
## 1.944444 % of people are from the South
cat(table.region[5]/obs*100,"% of people are from the West Coast")
## 47.55556 % of people are from the West Coast
17) [5 points] Use tidyverse and the function mutate to create a dummy variable for each region named north.east, west.coast, south, mountain, and midwest:
data.3 <- data.2 %>% #rename dataset to lower case
mutate(midwest.m = ifelse(region == "midwest",1,0)) %>%
mutate(mountain.m = ifelse(region == "mountain",1,0)) %>% # creating dummy for deny.y
mutate(north.east.m = ifelse(region == "north.east",1,0)) %>%
mutate(south.m = ifelse(region == "south",1,0)) %>%
mutate(west.coast.m = ifelse(region == "west.coast",1,0))
18) Write code to check if you created the dummy variable correctly
detach(data.2)
attach(data.3)
## The following objects are masked from data:
##
## age, college, gender, green.donations, income, kids, marital,
## private.school, race, region, urban
table.region
## region
## midwest mountain north.east south west.coast
## 23 40 846 35 856
midwest.tab<-table(midwest.m)
mountain.tab<-table(mountain.m)
NE.tab<-table(north.east.m)
south.tab<-table(south.m)
westcoast.tab<-table(west.coast.m)
cat(ifelse(midwest.tab[2] == table.region[1],"YES","NO"), "the variable 'midwest.m' was created correctly")
## YES the variable 'midwest.m' was created correctly
cat(ifelse(mountain.tab[2] == table.region[2],"YES","NO"), "the variable 'mountain.m' was created correctly")
## YES the variable 'mountain.m' was created correctly
cat(ifelse(NE.tab[2] == table.region[3],"YES","NO"), "the variable 'north.east.m' was created correctly")
## YES the variable 'north.east.m' was created correctly
cat(ifelse(south.tab[2] == table.region[4],"YES","NO"), "the variable 'south' was created correctly")
## YES the variable 'south' was created correctly
cat(ifelse(westcoast.tab[2] == table.region[5],"YES","NO"), "the variable 'west.coast.m' was created correctly")
## YES the variable 'west.coast.m' was created correctly
19) Write code to find the share (proportion) of each marital status within the dataset:
table.marital<-table(marital)
table.marital
## marital
## married separated single widow
## 846 31 869 54
cat(table.marital[1]/obs*100,"% of people are married")
## 47 % of people are married
cat(table.marital[2]/obs*100,"% of people seperated are ")
## 1.722222 % of people seperated are
cat(table.marital[3]/obs*100,"% of people are single")
## 48.27778 % of people are single
cat(table.marital[4]/obs*100,"% of people are widows")
## 3 % of people are widows
20) Use tidyverse and the function mutate to create a dummy variable for single individuals named single:
data.4<-data.3 %>%
mutate(single =ifelse(marital == "single",1,0))
21) Write code to check if you created the dummy variable correctly
detach(data.3)
attach(data.4)
## The following objects are masked from data:
##
## age, college, gender, green.donations, income, kids, marital,
## private.school, race, region, urban
table.marital
## marital
## married separated single widow
## 846 31 869 54
single.tab<-table(single)
cat(ifelse(single.tab[2] == table.marital[3],"YES","NO"), "the variable 'single' was created correctly")
## YES the variable 'single' was created correctly
Students will visualize the data using ggplot2
22) [4 points] Generate a histogram of green.donations using ggplot2, which should include titles and appropriate horizontal and vertical scales.
data.4 %>% ggplot(aes(x = green.donations,)) +
geom_histogram(size = 1, colour = "darkgreen", na.rm = TRUE, binwidth = 50)+
ggtitle(paste("Green Donations")) +
theme_bw() + #to have black and white background
scale_x_continuous(breaks = seq(0,1200,100))+
scale_y_continuous(breaks=seq(0,200,10)) +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x=element_text(size=12)) +
theme(axis.text.y=element_text(size=16)) +
theme(axis.title.x=element_text(size=16)) +
theme(axis.title.y=element_text(size=16, angle=90)) +
theme(plot.title=element_text(size=14)) +
theme(legend.text=element_text(size=18)) +
xlab("Annual Donations ($)")+ # to label the axis
ylab("Frequency") # to label axis
23) [4 points] Generate a barplot of race using ggplot2, which should include titles and appropriate horizontal and vertical scales.
data.4 %>% ggplot(aes(x = race, fill= race)) +
geom_bar(size = 1, colour = "black", na.rm = TRUE)+
ggtitle(paste("Count of Race")) +
theme_bw() + #to have black and white background
scale_y_continuous(breaks=seq(0,2000,100)) +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x=element_text(size=12)) +
theme(axis.text.y=element_text(size=16)) +
theme(axis.title.x=element_text(size=16)) +
theme(axis.title.y=element_text(size=16, angle=90)) +
theme(plot.title=element_text(size=14)) +
theme(legend.text=element_text(size=18)) +
xlab("Race")+ # to label the axis
ylab("Count") # to label axis
24) [4 points] Visualize the relationship green.donations and gender using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.
data.4 %>% ggplot(aes(x = gender, y= green.donations,fill = gender)) +
geom_boxplot(size = 1, colour = "black", na.rm = TRUE)+
ggtitle(paste("Donations by Gender")) +
theme_bw() + #to have black and white background
scale_y_continuous(breaks=seq(0,2000,100)) +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x=element_text(size=12)) +
theme(axis.text.y=element_text(size=16)) +
theme(axis.title.x=element_text(size=16)) +
theme(axis.title.y=element_text(size=16, angle=90)) +
theme(plot.title=element_text(size=14)) +
theme(legend.text=element_text(size=18)) +
xlab("Gender")+ # to label the axis
ylab("Annual Donation ($)") # to label axis
25) [4 points] Visualize the relationship green.donations and income using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.
data.4 %>% ggplot(aes(y= green.donations, x=income)) +
geom_point(size = 1, colour = "blue", na.rm = TRUE)+
ggtitle(paste("Donations VS Kids ")) +
theme_bw() + #to have black and white background
scale_x_continuous(breaks = seq(50000,100000,10000))+
scale_y_continuous(breaks=seq(0,2000,100)) +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x=element_text(size=12)) +
theme(axis.text.y=element_text(size=16)) +
theme(axis.title.x=element_text(size=16)) +
theme(axis.title.y=element_text(size=16, angle=90)) +
theme(plot.title=element_text(size=14)) +
theme(legend.text=element_text(size=18)) +
xlab("Income")+ # to label the axis
ylab("Annual Donation ($)") # to label axis
26) [4 points] Visualize the relationship green.donations and kids using ggplot2. Your plot should include titles and appropriate horizontal and vertical scales.
data.4 %>% ggplot(aes(y= green.donations, x=kids, group = kids)) +
geom_boxplot(size = 1, na.rm = TRUE)+
ggtitle(paste("Donations VS Kids ")) +
theme_bw() + #to have black and white background
scale_x_continuous(breaks = seq(0,5,1))+
scale_y_continuous(breaks=seq(0,2000,100)) +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x=element_text(size=12)) +
theme(axis.text.y=element_text(size=16)) +
theme(axis.title.x=element_text(size=16)) +
theme(axis.title.y=element_text(size=16, angle=90)) +
theme(plot.title=element_text(size=14)) +
theme(legend.text=element_text(size=18)) +
xlab("# of Kids")+ # to label the axis
ylab("Annual Donation ($)") # to label axis
Students will run several OLS models to find the best fit for their data using robust standard errors
Estimate the following OLS models and named them accordingly:
ols1:
\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3college_i +\beta_7white_i+\beta_8female_i +u_{i} \end{equation}\)
ols2:
\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3private.school_i +\beta_7white_i+\beta_8female_i +u_{i} \end{equation}\)
ols3:
\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i} + \beta_3college_i +\beta_7latino_i+\beta_8female_i +u_{i} \end{equation}\)
ols4:
\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i}+\beta_2age_{i}^2 + \beta_3college_i + \beta_4ln(income_i) +\beta_5single_i + \beta_6kids_i +\beta_7white_i+\beta_8female_i+\beta_9urban_i +u_{i} \end{equation}\)
ols5:
\(\begin{equation} green.donations_{i}=\beta_0 + \beta_1age_{i}+\beta_2age_{i}^2 + \beta_3private.school_i + \beta_4ln(income_i) +\beta_5single_i + \beta_6kids_i +\beta_7white_i+\beta_8female_i+\beta_9(north.east_i*west.coast_i) +u_{i} \end{equation}\)
27) [5 points] Write code to estimate the models above.
attach(data.4)
## The following objects are masked from data.4 (pos = 3):
##
## age, asian, black, college, gender, gender.f, gender.m,
## green.donations, income, kids, latino, marital, midwest.m,
## mountain.m, north.east.m, other, private.school, race, region,
## single, south.m, urban, west.coast.m, white
## The following objects are masked from data:
##
## age, college, gender, green.donations, income, kids, marital,
## private.school, race, region, urban
mod1<- lm(green.donations~ age +college +white +gender.f)
mod2<-lm(green.donations~age+private.school+white+gender.f)
mod3<-lm(green.donations ~ age + college + latino + gender.f)
mod4<-lm(green.donations ~ age + age^2 + private.school + log(income) + single + kids+ white +gender.f + urban)
mod5<-lm(green.donations ~age +age^2 + private.school + log(income) + single + kids + white + gender.f + north.east.m + west.coast.m)
28) [5 points] Write code to gather the robust standard errors in a list.
rob_se <- list(sqrt(diag(vcovHC(mod1, type = "HC1"))),
sqrt(diag(vcovHC(mod2, type = "HC1"))),
sqrt(diag(vcovHC(mod3, type = "HC1"))),
sqrt(diag(vcovHC(mod4, type = "HC1"))),
sqrt(diag(vcovHC(mod5, type = "HC1"))))
Students will use stargazer package to generate tables summarizing their regression results
29) [10 points] Use stargazer (only works in Rmd file) to generate a table with the results (Hint: make sure you include type = “html” within stargazer()):
stargazer(mod1, mod2, mod3, mod4, mod5,
digits = 3,
header = FALSE,
type = "html",
se = rob_se,
title = "Linear Regression of Green Donations",
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)"))
| Dependent variable: | |||||
| green.donations | |||||
| (1) | (2) | (3) | (4) | (5) | |
| age | -15.574*** | -15.483*** | -15.576*** | -14.928*** | -14.929*** |
| (0.248) | (0.327) | (0.248) | (0.064) | (0.066) | |
| college | 197.928*** | 198.054*** | |||
| (5.389) | (5.387) | ||||
| private.school | 2.648 | 2.266* | 2.648* | ||
| (7.010) | (1.344) | (1.394) | |||
| log(income) | 1,711.266*** | 1,708.884*** | |||
| (8.154) | (8.516) | ||||
| single | 13.696*** | 15.601*** | |||
| (1.352) | (3.703) | ||||
| kids | -23.687*** | -15.528*** | |||
| (1.421) | (2.972) | ||||
| white | -7.519 | -10.885 | 2.728** | 2.379* | |
| (5.465) | (7.250) | (1.383) | (1.435) | ||
| latino | 0.072 | ||||
| (6.566) | |||||
| gender.f | -28.124*** | -33.092*** | -28.275*** | -27.282*** | -27.925*** |
| (5.712) | (7.453) | (5.718) | (1.426) | (1.471) | |
| urban | 19.028*** | ||||
| (1.579) | |||||
| north.east.m | 7.406 | ||||
| (7.416) | |||||
| west.coast.m | 9.437 | ||||
| (7.827) | |||||
| Constant | 1,157.989*** | 1,278.980*** | 1,153.264*** | -17,847.060*** | -17,825.750*** |
| (10.843) | (13.957) | (10.326) | (91.368) | (96.112) | |
| Observations | 1,800 | 1,800 | 1,800 | 1,800 | 1,800 |
| R2 | 0.742 | 0.554 | 0.742 | 0.984 | 0.983 |
| Adjusted R2 | 0.741 | 0.553 | 0.741 | 0.984 | 0.982 |
| Residual Std. Error | 113.073 (df = 1795) | 148.579 (df = 1795) | 113.131 (df = 1795) | 28.441 (df = 1791) | 29.422 (df = 1790) |
| F Statistic | 1,289.101*** (df = 4; 1795) | 557.742*** (df = 4; 1795) | 1,287.317*** (df = 4; 1795) | 13,510.470*** (df = 8; 1791) | 11,208.830*** (df = 9; 1790) |
| Note: | p<0.1; p<0.05; p<0.01 | ||||
Students will assess the goodness of fit of the regressions
30) [6 points] Complegte bullet points below describing the goodness of fit of the regression and state which regression is better
(V): Adj. R^2 = .982
The best regression is: Model 4 because it has the highest Adj. R^2 value.
31) [1 points] UselinearHypothesis() to test simultaneously whether the following variables should be included in ols5 model.
linearHypothesis(mod5, c("white=0", "private.school=0"), white.adjust = "hc1")
## Linear hypothesis test
##
## Hypothesis:
## white = 0
## private.school = 0
##
## Model 1: restricted model
## Model 2: green.donations ~ age + age^2 + private.school + log(income) +
## single + kids + white + gender.f + north.east.m + west.coast.m
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 1792
## 2 1790 2 3.0802 0.04619 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
32) [2 points] Interpret the results from the test above. Should you keep both variables?
With the F-Statistic being 3.08 and the pvalue of .04619 which is significant at the 99% level, we should keep both variables in the model.
Students will interpret the coefficients from the models
33) [10 points] Interpret the coefficients of the following variables from the best ols model. Write a sentence explaining the intuition behind the result. Complete the following bullet points with your answers.
age:
college:
ln(income):
single:
kids:
Students will comment about potential bias in their estimation strategy
34) [5 points] Write about potential sources of bias in your regression. Hint: Review Chapter 10 HAGS.
Ommited variable bias. If we added a variable for political orientation, i believe that it would have been a great addition to the model.
Students will communicate findings and provide recommentaions based on analysis
35) [5 points] Write a paragraph reporting your findings to the NGO. In this paragraph, include recommendations for increasing green donations for climate change initiatives.
Based on my findings, the best demographic to collect donations from would be young wealthy individuals. This is because according to my best Model, younger individuals are more likely to donate, as well as more wealthy individuals. It would also be beneficial to target people whith few to zero children, as they are more likely to donate as well.