Donorschoose.org is a crowd sourcing portal used to get donations for educational projects from public donors. This research project aims at identifying the factors that are key drivers for getting donation for a project.This research work will look into various explanatory variables and identify key explanatory variables related to project funding response variable.
following are the expected outcomes from the research work-
Create a logistics regression model to identify the explanatory variables linked to the response variable. Validate the model on test data split out from original data set. **Detail analysis of the variables used in the model and understanding significance of the variables used in the model
Load the data file sourced from website http://data.donorschoose.org/open-data/project-data
open_proj<-read.csv("file:///C:/Users/Arindam/Documents/Data Science/Cuny/Data 606/Assignment/Open data donations.csv", header=TRUE,stringsAsFactors = FALSE)Load packages required for the analysis
library(aod)## Warning: package 'aod' was built under R version 3.2.5
library(Rcpp)## Warning: package 'Rcpp' was built under R version 3.2.5
library(ggplot2)## Warning: package 'ggplot2' was built under R version 3.2.3
What are the cases, and how many are there? For the purpose of this project latest 3000 records from available data set(with status completed or expired) has been sourced and analyzed.
nrow(open_proj)## [1] 30080
table(open_proj$funding_status)##
## completed expired reallocated
## 21160 8584 336
Describe the method of data collection.
Data for this project has been sourced from open data provided by donorschoose.org website and given in below location
http://data.donorschoose.org/open-data/project-data/
From this data set latest 30000 records with status completed or expired or reallocated were collected for study
What type of study is this (observational/experiment)? This will be an observational study of existing data and understand the relationships between response and explanatory variables. Then identify the key explanatory variables which will drive success of project funding.
If you collected the data, state self-collected. If not, provide a citation/link.
Following is the data source link http://data.donorschoose.org/open-data/project-data/
What is the response variable, and what type is it (numerical/categorical)?
Response variable for this study is project status which can have following three values- Completed, Expired, Live, or Reallocated. For the purpose of this project Expired and Live, or Reallocated values will be clubbed into one bucket called -“Not Completed”. Also Live projects will be excluded from this research.
#open_proj$funding_status<-as.numeric(open_proj$funding_status)
open_proj$funding_status[open_proj$funding_status=="completed"]=c(1)
open_proj$funding_status[open_proj$funding_status=="expired"| open_proj$funding_status=="reallocated" ]=c(0)
table(open_proj$funding_status)##
## 0 1
## 8920 21160
What is the explanatory variable, and what type is it (numerical/categorival)?
Following are the list of variables which will be studied as explanatory variables and shortlisted as explanatory variables School Location- school_city, school_state,school_district
Teacher attributes- teacher_teach_for_america
Project categories-
primary_focus_subject primary_focus_area resource_type poverty_level
Project pricing total_price_including_opt_donation
Project impact students_reached
proj_data<-subset(open_proj, select= c("school_city","school_state","school_district","teacher_teach_for_america","teacher_ny_teaching_fellow","primary_focus_subject","primary_focus_area","resource_type","poverty_level","grade_level","total_price_including_optional_support","students_reached","total_donations","funding_status","X_projectid"))
proj_data$funding_status[proj_data$funding_status=="completed"]=c(1)
proj_data$funding_status<-as.integer(proj_data$funding_status)Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# Key summary statistics for response variable.
prop.table(table(proj_data$funding_status))##
## 0 1
## 0.2965426 0.7034574
# Analysis- Looking at the response variables it appears that around 70% of projects are getting completed and around 30% are not. So foucs will be to understand which are the variables that drives those 70% of completion of project.# Contengency table of funding status by state
prop.table(table(proj_data$funding_status,proj_data$school_state))##
## AL CA CO CT DC
## 0 9.308511e-04 3.799867e-02 3.324468e-05 3.324468e-05 1.037234e-02
## 1 3.058511e-03 1.168551e-01 0.000000e+00 0.000000e+00 2.280585e-02
##
## IL IN KS La LA
## 0 3.786569e-02 2.360372e-03 3.324468e-05 0.000000e+00 1.462766e-02
## 1 8.095080e-02 6.948138e-03 0.000000e+00 9.973404e-05 2.011303e-02
##
## MA MD MI MN MO
## 0 3.324468e-05 9.973404e-05 3.324468e-05 3.324468e-05 9.973404e-05
## 1 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
##
## MS NC NJ NY OK
## 0 7.313830e-04 8.178191e-02 1.329787e-04 7.263963e-02 3.324468e-05
## 1 4.820479e-03 1.086104e-01 0.000000e+00 2.926197e-01 0.000000e+00
##
## RI SC TN TX VA
## 0 6.648936e-05 2.755984e-02 3.324468e-05 8.909574e-03 6.648936e-05
## 1 0.000000e+00 3.005319e-02 0.000000e+00 1.652261e-02 0.000000e+00
##
## WA
## 0 3.324468e-05
## 1 0.000000e+00
barplot(prop.table(table(proj_data$funding_status,proj_data$school_state)),legend.text = TRUE,main="Funding status by school state") Analysis 1- data does not have representative data from all state and hence using State as variable will be skewed towards few states where majority of the donations were made like NY. This variable can only be used with response variable after excluding the states with less number of cases.
# how teacher involvement influence project funding
prop.table(table(proj_data$funding_status,proj_data$teacher_teach_for_america))##
## f t
## 0 0.27978723 0.01675532
## 1 0.61705452 0.08640293
barplot(table(proj_data$funding_status,proj_data$teacher_teach_for_america),beside = T,legend.text = TRUE,main="Funding status by teacher_teach_for_america")barplot(table(proj_data$funding_status,proj_data$teacher_ny_teaching_fellow),beside = T,legend.text = TRUE,main="Funding status by teacher_ny_teaching_fellow")prop.table(table(proj_data$funding_status,proj_data$teacher_ny_teaching_fellow))##
## f t
## 0 0.28397606 0.01256649
## 1 0.64690824 0.05654920
table(proj_data$teacher_teach_for_america,proj_data$teacher_ny_teaching_fellow)##
## f t
## f 24908 2069
## t 3093 10
Analysis- These two variables are having very less number of yes flags in the data set. These two variables can not be used to derive relationship with response variable.
# Contengency table: primary focus area
prop.table(table (proj_data$funding_status,proj_data$primary_focus_area))##
## Applied Learning Health & Sports History & Civics Literacy & Language
## 0 0.06110372 0.01090426 0.02244016 0.10781250
## 1 0.09780585 0.02476729 0.04956782 0.28238032
##
## Math & Science Music & The Arts Special Needs
## 0 0.05382314 0.02573138 0.01472739
## 1 0.14295213 0.08244681 0.02353723
barplot(prop.table(table (proj_data$funding_status,proj_data$primary_focus_area)),legend.text = TRUE,beside = T,main="Funding status by primary_focus_area")# Analysis- looking at the data there are specific focus areas where most of the projects are and also better chance of funding
# Primary focus subject
barplot(table (proj_data$funding_status,proj_data$primary_focus_subject),legend.text = TRUE,beside = T,main="Funding status by primary focus subject")prop.table(table (proj_data$funding_status,proj_data$primary_focus_subject))##
## Applied Sciences Character Education Civics & Government
## 0 0.0160904255 0.0075465426 0.0012632979
## 1 0.0349401596 0.0116023936 0.0035239362
##
## College & Career Prep Community Service Early Development Economics
## 0 0.0030585106 0.0025265957 0.0104055851 0.0002659574
## 1 0.0039561170 0.0025265957 0.0223071809 0.0010638298
##
## Environmental Science ESL Extracurricular Foreign Languages
## 0 0.0087101064 0.0059507979 0.0034574468 0.0025265957
## 1 0.0244015957 0.0116356383 0.0069481383 0.0033244681
##
## Gym & Fitness Health & Life Science Health & Wellness
## 0 0.0056183511 0.0068816489 0.0027593085
## 1 0.0092087766 0.0212765957 0.0051196809
##
## History & Geography Literacy Literature & Writing Mathematics
## 0 0.0101728723 0.0762300532 0.0231050532 0.0221409574
## 1 0.0247340426 0.2120678191 0.0553523936 0.0623337766
##
## Music Nutrition Other Parent Involvement
## 0 0.0068151596 0.0004321809 0.0324468085 0.0016622340
## 1 0.0250000000 0.0008643617 0.0479388298 0.0025265957
##
## Performing Arts Social Sciences Special Needs Team Sports
## 0 0.0049867021 0.0107380319 0.0147273936 0.0020944149
## 1 0.0127327128 0.0202460106 0.0235372340 0.0095744681
##
## Visual Arts
## 0 0.0139295213
## 1 0.0447140957
# Analysis- this varibales as like the variable above is linked to response variable)
sum(table(proj_data[proj_data$primary_focus_area=="Literacy & Language",]$primary_focus_area,proj_data[proj_data$primary_focus_area=="Literacy & Language",]$primary_focus_subject))## [1] 11737
After analyzing the two variables it appears that we can only use one of the two variables as both are related to response variable same way. Prime focus Area will be used for modeling purpose.
# Variable resource type
prop.table(table(proj_data$funding_status,proj_data$resource_type))##
## Books Other Supplies Technology Trips
## 0 0.048337766 0.024002660 0.105219415 0.102194149 0.013563830
## 1 0.174833777 0.045312500 0.319315160 0.125398936 0.033344415
##
## Visitors
## 0 0.003224734
## 1 0.005252660
barplot(prop.table(table(proj_data$funding_status,proj_data$resource_type)),legend.text = TRUE,beside = T,main="Funding status by resource_type")Analysis- Looking at the data Books, seems to be the one which has very high chance of getting funded.
# Variable poverty level is category variable. to understand relationship with response variable create contingency table
table(proj_data$funding_status,proj_data$poverty_level)##
## high poverty highest poverty low poverty moderate poverty
## 0 1991 5845 143 941
## 1 3133 16027 340 1660
prop.table(table (proj_data$funding_status,proj_data$poverty_level))##
## high poverty highest poverty low poverty moderate poverty
## 0 0.066190160 0.194315160 0.004753989 0.031283245
## 1 0.104155585 0.532812500 0.011303191 0.055186170
barplot(prop.table(table(proj_data$funding_status,proj_data$poverty_level)),legend.text = TRUE,beside = T,ylim = c(0,.6),main="Funding status by poverty_level")# Analysis- from chart it appears that there is a storng relationship between poverty level and receiving donations from the data
# what is the % of project that gets completed as percentage of respective project category
prop.table(table (proj_data [proj_data$poverty_level=="highest poverty",]$funding_status,proj_data [proj_data$poverty_level=="highest poverty",]$poverty_level))##
## highest poverty
## 0 0.2672366
## 1 0.7327634
prop.table(table (proj_data [proj_data$poverty_level=="high poverty",]$funding_status,proj_data [proj_data$poverty_level=="high poverty",]$poverty_level))##
## high poverty
## 0 0.3885636
## 1 0.6114364
prop.table(table (proj_data [proj_data$poverty_level=="low poverty",]$funding_status,proj_data [proj_data$poverty_level=="low poverty",]$poverty_level))##
## low poverty
## 0 0.2960663
## 1 0.7039337
prop.table(table (proj_data [proj_data$poverty_level=="moderate poverty",]$funding_status,proj_data [proj_data$poverty_level=="moderate poverty",]$poverty_level))##
## moderate poverty
## 0 0.3617839
## 1 0.6382161
Analysis- by looking at the data we can find that poverty definitely is a big factor in defining the success of the funding. As the number of cases for low poverty is very small we can exclude that portion to avoid any undue impact of that segment.
# Variable total donations,total price of the project
barplot(table (proj_data$funding_status,proj_data$grade_level),legend.text = TRUE,main="Funding status by grade level")# boxplot(proj_data$total_donations,proj_data$funding_status)
boxplot(total_donations~funding_status,data=proj_data,ylim=c(0,4000),main="Funding status by total donations")# This donation variable will not be used as this variable values are already taken into consideration in response variable
boxplot(total_price_including_optional_support~funding_status,data=proj_data,ylim=c(0,6000),main="Funding status by total price including optional support")# this variable will be used in to uunderstand the relationship with response variableFrom above analysis following three variables have been picked to be evaluated for building logistics regression model
(a)primary focus area (b)resource type (c)poverty level
library(ggplot2)
proj_data$primary_focus_area<-as.factor(proj_data$primary_focus_area)
barplot(table(proj_data$funding_status,proj_data$primary_focus_area),main = "Funding status vs primary focus area")proj_data$resource_type<-as.factor(proj_data$resource_type)
prop.table(table(proj_data$resource_type,proj_data$primary_focus_area))##
## Applied Learning Health & Sports History & Civics
## Books 0.0095079787 0.0006316489 0.0192154255
## Other 0.0254986702 0.0049867021 0.0033577128
## Supplies 0.0740026596 0.0250664894 0.0197473404
## Technology 0.0402593085 0.0032579787 0.0190159574
## Trips 0.0082779255 0.0012300532 0.0101396277
## Visitors 0.0013630319 0.0004986702 0.0005319149
##
## Literacy & Language Math & Science Music & The Arts
## Books 0.1693484043 0.0164561170 0.0047872340
## Other 0.0168882979 0.0091755319 0.0065159574
## Supplies 0.1258643617 0.1063497340 0.0566821809
## Technology 0.0731050532 0.0508643617 0.0274601064
## Trips 0.0036569149 0.0121010638 0.0099401596
## Visitors 0.0013297872 0.0018284574 0.0027925532
##
## Special Needs
## Books 0.0032247340
## Other 0.0028922872
## Supplies 0.0168218085
## Technology 0.0136303191
## Trips 0.0015625000
## Visitors 0.0001329787
ggplot(proj_data, aes(primary_focus_area,fill=resource_type))+geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ggtitle("Primary focus area vs resource type ")ggplot(proj_data, aes(poverty_level,fill=primary_focus_area))+geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Primary focus area vs poverty level ")Model with the use of poverty level:
# Step 1 split the data in training and testing as 75% for training and 25% testing
# Start with a logistics regression model with 1 variable and then add more variables to compare the model output
smp_size <- floor(0.75 * nrow(proj_data))
set.seed(123)
train_ind <- sample(seq_len(nrow(proj_data)), size = smp_size)
train_fund_data <- proj_data[train_ind, ]
test_fund_data <- proj_data[-train_ind, ]
# Building simple model with the use of poverty level: Model-->1
model1<-glm(funding_status~poverty_level, data=train_fund_data,family="binomial")
summary(model1)##
## Call:
## glm(formula = funding_status ~ poverty_level, family = "binomial",
## data = train_fund_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6232 -1.3718 0.7896 0.7896 0.9947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.44626 0.03292 13.555 < 2e-16 ***
## poverty_levelhighest poverty 0.55935 0.03736 14.973 < 2e-16 ***
## poverty_levellow poverty 0.37715 0.11660 3.234 0.00122 **
## poverty_levelmoderate poverty 0.11052 0.05755 1.920 0.05481 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27477 on 22559 degrees of freedom
## Residual deviance: 27215 on 22556 degrees of freedom
## AIC: 27223
##
## Number of Fisher Scoring iterations: 4
predict_test1<-predict.glm(model1, new=test_fund_data,type="response")
table(test_fund_data$funding_status,predict_test1>0.5)##
## TRUE
## 0 2202
## 1 5318
# Success ratio considering above 50% probability is 69.84%
5252/(2268+5252)## [1] 0.6984043
Analysis: (a)from the model summary it appears that AIC score of this model is 27117. Following level of the poverty_level variable is not linked to the model and can be taken out for subsequent model iteration. Model success rate is 69.84% on test data.
# update the test and tran data with input from model 1- exclude poverty_levelmoderate poverty
train_fund_data <- train_fund_data[train_fund_data$poverty_level != "moderate poverty", ]
test_fund_data <- test_fund_data[test_fund_data$poverty_level != "moderate poverty", ]
# building model 2 with poverty level and resource type variables
model2<-glm(funding_status~poverty_level+resource_type, data=train_fund_data,family="binomial")
summary(model2)##
## Call:
## glm(formula = funding_status ~ poverty_level + resource_type,
## family = "binomial", data = train_fund_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8313 -1.3244 0.7093 0.8255 1.2762
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.90146 0.04626 19.486 < 2e-16 ***
## poverty_levelhighest poverty 0.56841 0.03826 14.857 < 2e-16 ***
## poverty_levellow poverty 0.39750 0.11923 3.334 0.000857 ***
## resource_typeOther -0.68237 0.06715 -10.162 < 2e-16 ***
## resource_typeSupplies -0.21820 0.04426 -4.930 8.24e-07 ***
## resource_typeTechnology -1.13067 0.04684 -24.137 < 2e-16 ***
## resource_typeTrips -0.45647 0.07850 -5.815 6.06e-09 ***
## resource_typeVisitors -0.99715 0.16099 -6.194 5.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24892 on 20621 degrees of freedom
## Residual deviance: 23880 on 20614 degrees of freedom
## AIC: 23896
##
## Number of Fisher Scoring iterations: 4
predict_test2<-predict.glm(model2, new=test_fund_data,type="response")
table(test_fund_data$funding_status,predict_test2>0.5)##
## FALSE TRUE
## 0 184 1783
## 1 160 4730
# success ratio considering above 50% probability success rate is 70.99%
(195+4687)/(195+4687+1843+152)## [1] 0.7099026
Analysis: (a)looking at the summary of the output, it appears adding a new variable resource type has helped the model. AIC score has come down from 27117 to 23770. Also success rate in prediction has improve marginally from 69.84 to 70.99%. So it appears that addition of new variable has helped to improve the model.
model3<-glm(funding_status~poverty_level+resource_type+primary_focus_area, data=train_fund_data,family="binomial")
summary(model3)##
## Call:
## glm(formula = funding_status ~ poverty_level + resource_type +
## primary_focus_area, family = "binomial", data = train_fund_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0342 -1.1792 0.6577 0.8260 1.4280
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.57380 0.06312 9.091 < 2e-16
## poverty_levelhighest poverty 0.57629 0.03853 14.956 < 2e-16
## poverty_levellow poverty 0.39685 0.12015 3.303 0.000957
## resource_typeOther -0.60110 0.07093 -8.475 < 2e-16
## resource_typeSupplies -0.24179 0.04778 -5.061 4.17e-07
## resource_typeTechnology -1.14594 0.04964 -23.087 < 2e-16
## resource_typeTrips -0.50076 0.08281 -6.047 1.47e-09
## resource_typeVisitors -1.10340 0.16404 -6.726 1.74e-11
## primary_focus_areaHealth & Sports 0.24137 0.08980 2.688 0.007189
## primary_focus_areaHistory & Civics 0.25049 0.06883 3.639 0.000273
## primary_focus_areaLiteracy & Language 0.32629 0.04774 6.835 8.18e-12
## primary_focus_areaMath & Science 0.51264 0.05238 9.787 < 2e-16
## primary_focus_areaMusic & The Arts 0.78388 0.06431 12.189 < 2e-16
## primary_focus_areaSpecial Needs 0.06772 0.08412 0.805 0.420836
##
## (Intercept) ***
## poverty_levelhighest poverty ***
## poverty_levellow poverty ***
## resource_typeOther ***
## resource_typeSupplies ***
## resource_typeTechnology ***
## resource_typeTrips ***
## resource_typeVisitors ***
## primary_focus_areaHealth & Sports **
## primary_focus_areaHistory & Civics ***
## primary_focus_areaLiteracy & Language ***
## primary_focus_areaMath & Science ***
## primary_focus_areaMusic & The Arts ***
## primary_focus_areaSpecial Needs
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24892 on 20621 degrees of freedom
## Residual deviance: 23683 on 20608 degrees of freedom
## AIC: 23711
##
## Number of Fisher Scoring iterations: 4
predict_test3<-predict.glm(model3, new=test_fund_data,type="response")
table(test_fund_data$funding_status,predict_test3>0.5)##
## FALSE TRUE
## 0 172 1795
## 1 144 4746
# success ratio considering above 50% probability success ratio is 70.75%
(276+4590)/(276+1762+249+4590)## [1] 0.707576
Analysis: (a)with addition of new variable primary_focus_area has not helped to improve the model performance. model prediction success rate has come down little bit from 70.99% t0 70.75%. Hence adding the variable primary focus area is not required in the model3. So model 2 seems to be the best model at this point.
summary(model2)
# considering the intercept of highest poverty level we get following odd value is
exp( 0.56841 )
#That means odds of funding funding success for the project is 1.76 times compare to other level of poverty variable keeping the resource resource type variable.
# confidence intervals for the coefficient estimates
confint(model2)
# Using wald test to assess the impact of variable poverty_level
wald.test(b = coef(model2), Sigma = vcov(model2), Terms = 2:3)
# Using wald test to assess the impact of variable poverty_level
wald.test(b = coef(model2), Sigma = vcov(model2), Terms = 4:8)Analysis:
(a)This shows that men were actually at a significant disadvantage when department and the interaction are controlled. The odds of a male being admitted were only 0.35 times the odds of a female being admitted. The reciprocal of this turns it on its head. All else being equal, the odds of female being admitted were 2.86 times the odds of a male being admitted.
(a)it appears that both poverty level and resource type variables along with their levels given in summary are statistically significant. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
(b)Interpretation for level variables are following , if the resource_type is other then changes in the log odds of funding success by -0.675.
(c)Using confint function confidence interval of the equation has been derived
(d)Doing weld test significance of the two variables along with their levels have been identified. It is evident from the result that both the variables are significant.
library(ggplot2)
newdata <- cbind(test_fund_data, predict(model2, newdata = test_fund_data, type="link", se=TRUE))
newdata <- within(newdata, {
PredictedProb <- plogis(fit)
LL <- plogis(fit - (1.96 * se.fit))
UL <- plogis(fit + (1.96 * se.fit))
})
# Analysis of probability outcome by poverty_level
ggplot(newdata, aes(x = poverty_level, y = PredictedProb)) +
geom_line(aes(colour = poverty_level), size=1) + ggtitle("prediction probability by poverty level variable")# Analysis of probability outcome by resource_type
ggplot(newdata, aes(x = resource_type, y = PredictedProb)) +
geom_line(aes(colour = resource_type), size=1) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("prediction probability by resource type variable")Based on above study and analysis it is clear that significance of poverty_level and resource_type variables are significant in predicting the success of project funding. Other variable Primary Focus Area is also important variable but was not used for model building. Model prediction success rate is around 70.9% which is very good in terms of logistics regression standard.