Research question:
Load data and R packages
- About data used in this study
Variable selection for logistics regression model
Building logistics regression Model–>1
Building simple model with the use of poverty level + resource type : Model–>2
Building simple model with the use of poverty level+ resource type+primary focus area: Model–>3
Analysis on the the best model
Conclusion

Research question:

Donorschoose.org is a crowd sourcing portal used to get donations for educational projects from public donors. This research project aims at identifying the factors that are key drivers for getting donation for a project.This research work will look into various explanatory variables and identify key explanatory variables related to project funding response variable.

following are the expected outcomes from the research work-

Create a logistics regression model to identify the explanatory variables linked to the response variable. Validate the model on test data split out from original data set. **Detail analysis of the variables used in the model and understanding significance of the variables used in the model

Load data and R packages

Load the data file sourced from website http://data.donorschoose.org/open-data/project-data

open_proj<-read.csv("file:///C:/Users/Arindam/Documents/Data Science/Cuny/Data 606/Assignment/Open data donations.csv", header=TRUE,stringsAsFactors = FALSE)

Load packages required for the analysis

library(aod)

## Warning: package 'aod' was built under R version 3.2.5

library(Rcpp)

## Warning: package 'Rcpp' was built under R version 3.2.5

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.3

About data used in this study

What are the cases, and how many are there? For the purpose of this project latest 3000 records from available data set(with status completed or expired) has been sourced and analyzed.

nrow(open_proj)

## [1] 30080

table(open_proj$funding_status)

## 
##   completed     expired reallocated 
##       21160        8584         336

Data collection

Describe the method of data collection.

Data for this project has been sourced from open data provided by donorschoose.org website and given in below location

http://data.donorschoose.org/open-data/project-data/

From this data set latest 30000 records with status completed or expired or reallocated were collected for study

Type of study

What type of study is this (observational/experiment)? This will be an observational study of existing data and understand the relationships between response and explanatory variables. Then identify the key explanatory variables which will drive success of project funding.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Following is the data source link http://data.donorschoose.org/open-data/project-data/

Response

What is the response variable, and what type is it (numerical/categorical)?

Response variable for this study is project status which can have following three values- Completed, Expired, Live, or Reallocated. For the purpose of this project Expired and Live, or Reallocated values will be clubbed into one bucket called -“Not Completed”. Also Live projects will be excluded from this research.

#open_proj$funding_status<-as.numeric(open_proj$funding_status) 

open_proj$funding_status[open_proj$funding_status=="completed"]=c(1)

open_proj$funding_status[open_proj$funding_status=="expired"| open_proj$funding_status=="reallocated" ]=c(0)

table(open_proj$funding_status)

## 
##     0     1 
##  8920 21160

Explanatory

What is the explanatory variable, and what type is it (numerical/categorival)?

Following are the list of variables which will be studied as explanatory variables and shortlisted as explanatory variables School Location- school_city, school_state,school_district

Teacher attributes- teacher_teach_for_america

Project categories-

primary_focus_subject primary_focus_area resource_type poverty_level

Project pricing total_price_including_opt_donation

Project impact students_reached

proj_data<-subset(open_proj, select= c("school_city","school_state","school_district","teacher_teach_for_america","teacher_ny_teaching_fellow","primary_focus_subject","primary_focus_area","resource_type","poverty_level","grade_level","total_price_including_optional_support","students_reached","total_donations","funding_status","X_projectid"))

proj_data$funding_status[proj_data$funding_status=="completed"]=c(1)

proj_data$funding_status<-as.integer(proj_data$funding_status)

Relevant summary statistics for response variable

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

# Key summary statistics for response variable. 

 prop.table(table(proj_data$funding_status))

## 
##         0         1 
## 0.2965426 0.7034574

# Analysis- Looking at the response variables it appears that around 70% of projects are getting completed and around 30% are not. So foucs will be to understand which are the variables that drives those 70% of completion of project.

Relevant summary statistics for explanatory variables

# Contengency table  of funding status by state

prop.table(table(proj_data$funding_status,proj_data$school_state))

##    
##               AL           CA           CO           CT           DC
##   0 9.308511e-04 3.799867e-02 3.324468e-05 3.324468e-05 1.037234e-02
##   1 3.058511e-03 1.168551e-01 0.000000e+00 0.000000e+00 2.280585e-02
##    
##               IL           IN           KS           La           LA
##   0 3.786569e-02 2.360372e-03 3.324468e-05 0.000000e+00 1.462766e-02
##   1 8.095080e-02 6.948138e-03 0.000000e+00 9.973404e-05 2.011303e-02
##    
##               MA           MD           MI           MN           MO
##   0 3.324468e-05 9.973404e-05 3.324468e-05 3.324468e-05 9.973404e-05
##   1 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
##    
##               MS           NC           NJ           NY           OK
##   0 7.313830e-04 8.178191e-02 1.329787e-04 7.263963e-02 3.324468e-05
##   1 4.820479e-03 1.086104e-01 0.000000e+00 2.926197e-01 0.000000e+00
##    
##               RI           SC           TN           TX           VA
##   0 6.648936e-05 2.755984e-02 3.324468e-05 8.909574e-03 6.648936e-05
##   1 0.000000e+00 3.005319e-02 0.000000e+00 1.652261e-02 0.000000e+00
##    
##               WA
##   0 3.324468e-05
##   1 0.000000e+00

barplot(prop.table(table(proj_data$funding_status,proj_data$school_state)),legend.text = TRUE,main="Funding status by school state")

Analysis 1- data does not have representative data from all state and hence using State as variable will be skewed towards few states where majority of the donations were made like NY. This variable can only be used with response variable after excluding the states with less number of cases.

# how teacher involvement influence project funding


prop.table(table(proj_data$funding_status,proj_data$teacher_teach_for_america))

##    
##              f          t
##   0 0.27978723 0.01675532
##   1 0.61705452 0.08640293

barplot(table(proj_data$funding_status,proj_data$teacher_teach_for_america),beside = T,legend.text = TRUE,main="Funding status by teacher_teach_for_america")

barplot(table(proj_data$funding_status,proj_data$teacher_ny_teaching_fellow),beside = T,legend.text = TRUE,main="Funding status by teacher_ny_teaching_fellow")

prop.table(table(proj_data$funding_status,proj_data$teacher_ny_teaching_fellow))

##    
##              f          t
##   0 0.28397606 0.01256649
##   1 0.64690824 0.05654920

table(proj_data$teacher_teach_for_america,proj_data$teacher_ny_teaching_fellow)

##    
##         f     t
##   f 24908  2069
##   t  3093    10

Analysis- These two variables are having very less number of yes flags in the data set. These two variables can not be used to derive relationship with response variable.

# Contengency table: primary focus area

prop.table(table (proj_data$funding_status,proj_data$primary_focus_area))

##    
##     Applied Learning Health & Sports History & Civics Literacy & Language
##   0       0.06110372      0.01090426       0.02244016          0.10781250
##   1       0.09780585      0.02476729       0.04956782          0.28238032
##    
##     Math & Science Music & The Arts Special Needs
##   0     0.05382314       0.02573138    0.01472739
##   1     0.14295213       0.08244681    0.02353723

barplot(prop.table(table (proj_data$funding_status,proj_data$primary_focus_area)),legend.text = TRUE,beside = T,main="Funding status by primary_focus_area")

# Analysis- looking at the data there are specific focus areas where most of the projects are and also better chance of funding


# Primary focus subject

barplot(table (proj_data$funding_status,proj_data$primary_focus_subject),legend.text = TRUE,beside = T,main="Funding status by primary focus subject")

prop.table(table (proj_data$funding_status,proj_data$primary_focus_subject))

##    
##     Applied Sciences Character Education Civics & Government
##   0     0.0160904255        0.0075465426        0.0012632979
##   1     0.0349401596        0.0116023936        0.0035239362
##    
##     College & Career Prep Community Service Early Development    Economics
##   0          0.0030585106      0.0025265957      0.0104055851 0.0002659574
##   1          0.0039561170      0.0025265957      0.0223071809 0.0010638298
##    
##     Environmental Science          ESL Extracurricular Foreign Languages
##   0          0.0087101064 0.0059507979    0.0034574468      0.0025265957
##   1          0.0244015957 0.0116356383    0.0069481383      0.0033244681
##    
##     Gym & Fitness Health & Life Science Health & Wellness
##   0  0.0056183511          0.0068816489      0.0027593085
##   1  0.0092087766          0.0212765957      0.0051196809
##    
##     History & Geography     Literacy Literature & Writing  Mathematics
##   0        0.0101728723 0.0762300532         0.0231050532 0.0221409574
##   1        0.0247340426 0.2120678191         0.0553523936 0.0623337766
##    
##            Music    Nutrition        Other Parent Involvement
##   0 0.0068151596 0.0004321809 0.0324468085       0.0016622340
##   1 0.0250000000 0.0008643617 0.0479388298       0.0025265957
##    
##     Performing Arts Social Sciences Special Needs  Team Sports
##   0    0.0049867021    0.0107380319  0.0147273936 0.0020944149
##   1    0.0127327128    0.0202460106  0.0235372340 0.0095744681
##    
##      Visual Arts
##   0 0.0139295213
##   1 0.0447140957

# Analysis- this varibales as like the variable above is linked to response variable) 

sum(table(proj_data[proj_data$primary_focus_area=="Literacy & Language",]$primary_focus_area,proj_data[proj_data$primary_focus_area=="Literacy & Language",]$primary_focus_subject))

## [1] 11737

After analyzing the two variables it appears that we can only use one of the two variables as both are related to response variable same way. Prime focus Area will be used for modeling purpose.

# Variable resource type

prop.table(table(proj_data$funding_status,proj_data$resource_type))

##    
##           Books       Other    Supplies  Technology       Trips
##   0 0.048337766 0.024002660 0.105219415 0.102194149 0.013563830
##   1 0.174833777 0.045312500 0.319315160 0.125398936 0.033344415
##    
##        Visitors
##   0 0.003224734
##   1 0.005252660

barplot(prop.table(table(proj_data$funding_status,proj_data$resource_type)),legend.text = TRUE,beside = T,main="Funding status by resource_type")

Analysis- Looking at the data Books, seems to be the one which has very high chance of getting funded.

# Variable poverty level is category variable. to understand relationship with response variable create contingency table

table(proj_data$funding_status,proj_data$poverty_level)

##    
##     high poverty highest poverty low poverty moderate poverty
##   0         1991            5845         143              941
##   1         3133           16027         340             1660

prop.table(table (proj_data$funding_status,proj_data$poverty_level))

##    
##     high poverty highest poverty low poverty moderate poverty
##   0  0.066190160     0.194315160 0.004753989      0.031283245
##   1  0.104155585     0.532812500 0.011303191      0.055186170

barplot(prop.table(table(proj_data$funding_status,proj_data$poverty_level)),legend.text = TRUE,beside = T,ylim = c(0,.6),main="Funding status by poverty_level")

# Analysis- from chart it appears that there is a storng relationship between poverty level and receiving donations from the data


# what is the % of project that gets completed  as percentage of respective project category

prop.table(table (proj_data [proj_data$poverty_level=="highest poverty",]$funding_status,proj_data [proj_data$poverty_level=="highest poverty",]$poverty_level))

##    
##     highest poverty
##   0       0.2672366
##   1       0.7327634

prop.table(table (proj_data [proj_data$poverty_level=="high poverty",]$funding_status,proj_data [proj_data$poverty_level=="high poverty",]$poverty_level))

##    
##     high poverty
##   0    0.3885636
##   1    0.6114364

prop.table(table (proj_data [proj_data$poverty_level=="low poverty",]$funding_status,proj_data [proj_data$poverty_level=="low poverty",]$poverty_level))

##    
##     low poverty
##   0   0.2960663
##   1   0.7039337

prop.table(table (proj_data [proj_data$poverty_level=="moderate poverty",]$funding_status,proj_data [proj_data$poverty_level=="moderate poverty",]$poverty_level))

##    
##     moderate poverty
##   0        0.3617839
##   1        0.6382161

Analysis- by looking at the data we can find that poverty definitely is a big factor in defining the success of the funding. As the number of cases for low poverty is very small we can exclude that portion to avoid any undue impact of that segment.

# Variable total donations,total price of the project

barplot(table (proj_data$funding_status,proj_data$grade_level),legend.text = TRUE,main="Funding status by grade level")

# boxplot(proj_data$total_donations,proj_data$funding_status)

boxplot(total_donations~funding_status,data=proj_data,ylim=c(0,4000),main="Funding status by total donations")

# This donation variable will not be used as this variable values are already taken into consideration in response variable

boxplot(total_price_including_optional_support~funding_status,data=proj_data,ylim=c(0,6000),main="Funding status by total price including optional support")

# this variable will be used in to uunderstand the relationship with response variable

Variable selection for logistics regression model

From above analysis following three variables have been picked to be evaluated for building logistics regression model

(a)primary focus area (b)resource type (c)poverty level

library(ggplot2)

proj_data$primary_focus_area<-as.factor(proj_data$primary_focus_area)

barplot(table(proj_data$funding_status,proj_data$primary_focus_area),main = "Funding status vs primary focus area")

proj_data$resource_type<-as.factor(proj_data$resource_type)

prop.table(table(proj_data$resource_type,proj_data$primary_focus_area))

##             
##              Applied Learning Health & Sports History & Civics
##   Books          0.0095079787    0.0006316489     0.0192154255
##   Other          0.0254986702    0.0049867021     0.0033577128
##   Supplies       0.0740026596    0.0250664894     0.0197473404
##   Technology     0.0402593085    0.0032579787     0.0190159574
##   Trips          0.0082779255    0.0012300532     0.0101396277
##   Visitors       0.0013630319    0.0004986702     0.0005319149
##             
##              Literacy & Language Math & Science Music & The Arts
##   Books             0.1693484043   0.0164561170     0.0047872340
##   Other             0.0168882979   0.0091755319     0.0065159574
##   Supplies          0.1258643617   0.1063497340     0.0566821809
##   Technology        0.0731050532   0.0508643617     0.0274601064
##   Trips             0.0036569149   0.0121010638     0.0099401596
##   Visitors          0.0013297872   0.0018284574     0.0027925532
##             
##              Special Needs
##   Books       0.0032247340
##   Other       0.0028922872
##   Supplies    0.0168218085
##   Technology  0.0136303191
##   Trips       0.0015625000
##   Visitors    0.0001329787

ggplot(proj_data, aes(primary_focus_area,fill=resource_type))+geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ggtitle("Primary focus area vs resource type ")

ggplot(proj_data, aes(poverty_level,fill=primary_focus_area))+geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Primary focus area vs poverty level ")

Building logistics regression Model–>1

Model with the use of poverty level:

# Step 1 split the data in training and testing as 75% for training and 25% testing
# Start with a logistics regression model with 1 variable and then add more variables to compare the model output


smp_size <- floor(0.75 * nrow(proj_data))

set.seed(123)

train_ind <- sample(seq_len(nrow(proj_data)), size = smp_size)

train_fund_data <- proj_data[train_ind, ]

test_fund_data <- proj_data[-train_ind, ]


# Building simple model with the use of poverty level: Model-->1

model1<-glm(funding_status~poverty_level, data=train_fund_data,family="binomial")
summary(model1)

## 
## Call:
## glm(formula = funding_status ~ poverty_level, family = "binomial", 
##     data = train_fund_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6232  -1.3718   0.7896   0.7896   0.9947  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    0.44626    0.03292  13.555  < 2e-16 ***
## poverty_levelhighest poverty   0.55935    0.03736  14.973  < 2e-16 ***
## poverty_levellow poverty       0.37715    0.11660   3.234  0.00122 ** 
## poverty_levelmoderate poverty  0.11052    0.05755   1.920  0.05481 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27477  on 22559  degrees of freedom
## Residual deviance: 27215  on 22556  degrees of freedom
## AIC: 27223
## 
## Number of Fisher Scoring iterations: 4

predict_test1<-predict.glm(model1, new=test_fund_data,type="response")

table(test_fund_data$funding_status,predict_test1>0.5)

##    
##     TRUE
##   0 2202
##   1 5318

# Success ratio considering above 50% probability is 69.84%
5252/(2268+5252)

## [1] 0.6984043

Analysis: (a)from the model summary it appears that AIC score of this model is 27117. Following level of the poverty_level variable is not linked to the model and can be taken out for subsequent model iteration. Model success rate is 69.84% on test data.

Building simple model with the use of poverty level + resource type : Model–>2

# update the test and tran data with input from model 1- exclude poverty_levelmoderate poverty 

train_fund_data <- train_fund_data[train_fund_data$poverty_level != "moderate poverty", ]

test_fund_data <- test_fund_data[test_fund_data$poverty_level != "moderate poverty", ]

# building model 2 with poverty level and resource type variables

model2<-glm(funding_status~poverty_level+resource_type, data=train_fund_data,family="binomial")

summary(model2)

## 
## Call:
## glm(formula = funding_status ~ poverty_level + resource_type, 
##     family = "binomial", data = train_fund_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8313  -1.3244   0.7093   0.8255   1.2762  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   0.90146    0.04626  19.486  < 2e-16 ***
## poverty_levelhighest poverty  0.56841    0.03826  14.857  < 2e-16 ***
## poverty_levellow poverty      0.39750    0.11923   3.334 0.000857 ***
## resource_typeOther           -0.68237    0.06715 -10.162  < 2e-16 ***
## resource_typeSupplies        -0.21820    0.04426  -4.930 8.24e-07 ***
## resource_typeTechnology      -1.13067    0.04684 -24.137  < 2e-16 ***
## resource_typeTrips           -0.45647    0.07850  -5.815 6.06e-09 ***
## resource_typeVisitors        -0.99715    0.16099  -6.194 5.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24892  on 20621  degrees of freedom
## Residual deviance: 23880  on 20614  degrees of freedom
## AIC: 23896
## 
## Number of Fisher Scoring iterations: 4

predict_test2<-predict.glm(model2, new=test_fund_data,type="response")

table(test_fund_data$funding_status,predict_test2>0.5)

##    
##     FALSE TRUE
##   0   184 1783
##   1   160 4730

# success ratio considering above 50% probability success rate is 70.99% 

(195+4687)/(195+4687+1843+152)

## [1] 0.7099026

Analysis: (a)looking at the summary of the output, it appears adding a new variable resource type has helped the model. AIC score has come down from 27117 to 23770. Also success rate in prediction has improve marginally from 69.84 to 70.99%. So it appears that addition of new variable has helped to improve the model.

Building simple model with the use of poverty level+ resource type+primary focus area: Model–>3

model3<-glm(funding_status~poverty_level+resource_type+primary_focus_area, data=train_fund_data,family="binomial")

summary(model3)

## 
## Call:
## glm(formula = funding_status ~ poverty_level + resource_type + 
##     primary_focus_area, family = "binomial", data = train_fund_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0342  -1.1792   0.6577   0.8260   1.4280  
## 
## Coefficients:
##                                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)                            0.57380    0.06312   9.091  < 2e-16
## poverty_levelhighest poverty           0.57629    0.03853  14.956  < 2e-16
## poverty_levellow poverty               0.39685    0.12015   3.303 0.000957
## resource_typeOther                    -0.60110    0.07093  -8.475  < 2e-16
## resource_typeSupplies                 -0.24179    0.04778  -5.061 4.17e-07
## resource_typeTechnology               -1.14594    0.04964 -23.087  < 2e-16
## resource_typeTrips                    -0.50076    0.08281  -6.047 1.47e-09
## resource_typeVisitors                 -1.10340    0.16404  -6.726 1.74e-11
## primary_focus_areaHealth & Sports      0.24137    0.08980   2.688 0.007189
## primary_focus_areaHistory & Civics     0.25049    0.06883   3.639 0.000273
## primary_focus_areaLiteracy & Language  0.32629    0.04774   6.835 8.18e-12
## primary_focus_areaMath & Science       0.51264    0.05238   9.787  < 2e-16
## primary_focus_areaMusic & The Arts     0.78388    0.06431  12.189  < 2e-16
## primary_focus_areaSpecial Needs        0.06772    0.08412   0.805 0.420836
##                                          
## (Intercept)                           ***
## poverty_levelhighest poverty          ***
## poverty_levellow poverty              ***
## resource_typeOther                    ***
## resource_typeSupplies                 ***
## resource_typeTechnology               ***
## resource_typeTrips                    ***
## resource_typeVisitors                 ***
## primary_focus_areaHealth & Sports     ** 
## primary_focus_areaHistory & Civics    ***
## primary_focus_areaLiteracy & Language ***
## primary_focus_areaMath & Science      ***
## primary_focus_areaMusic & The Arts    ***
## primary_focus_areaSpecial Needs          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24892  on 20621  degrees of freedom
## Residual deviance: 23683  on 20608  degrees of freedom
## AIC: 23711
## 
## Number of Fisher Scoring iterations: 4

predict_test3<-predict.glm(model3, new=test_fund_data,type="response")

table(test_fund_data$funding_status,predict_test3>0.5)

##    
##     FALSE TRUE
##   0   172 1795
##   1   144 4746

# success ratio considering above 50% probability success ratio is 70.75%
(276+4590)/(276+1762+249+4590)

## [1] 0.707576

Analysis: (a)with addition of new variable primary_focus_area has not helped to improve the model performance. model prediction success rate has come down little bit from 70.99% t0 70.75%. Hence adding the variable primary focus area is not required in the model3. So model 2 seems to be the best model at this point.

Analysis on the the best model

summary(model2)

# considering the intercept of highest poverty level we get following odd value is 

exp( 0.56841 )

#That means odds of funding funding success for the project is 1.76 times compare to other level of poverty variable keeping the resource resource type variable.

# confidence intervals for the coefficient estimates
confint(model2)


# Using wald test to assess the impact of variable poverty_level

wald.test(b = coef(model2), Sigma = vcov(model2), Terms = 2:3)


# Using wald test to assess the impact of variable poverty_level
wald.test(b = coef(model2), Sigma = vcov(model2), Terms = 4:8)

Analysis:

(a)This shows that men were actually at a significant disadvantage when department and the interaction are controlled. The odds of a male being admitted were only 0.35 times the odds of a female being admitted. The reciprocal of this turns it on its head. All else being equal, the odds of female being admitted were 2.86 times the odds of a male being admitted.

(a)it appears that both poverty level and resource type variables along with their levels given in summary are statistically significant. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.

(b)Interpretation for level variables are following , if the resource_type is other then changes in the log odds of funding success by -0.675.

(c)Using confint function confidence interval of the equation has been derived

(d)Doing weld test significance of the two variables along with their levels have been identified. It is evident from the result that both the variables are significant.

library(ggplot2)

newdata <- cbind(test_fund_data, predict(model2, newdata = test_fund_data, type="link", se=TRUE))
newdata <- within(newdata, {
  PredictedProb <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})

# Analysis of probability outcome by poverty_level
ggplot(newdata, aes(x = poverty_level, y = PredictedProb)) +
  geom_line(aes(colour = poverty_level), size=1) + ggtitle("prediction probability by poverty level variable")

# Analysis of probability outcome by resource_type
ggplot(newdata, aes(x = resource_type, y = PredictedProb)) +
  geom_line(aes(colour = resource_type), size=1) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("prediction probability by resource type variable")

Conclusion

Based on above study and analysis it is clear that significance of poverty_level and resource_type variables are significant in predicting the success of project funding. Other variable Primary Focus Area is also important variable but was not used for model building. Model prediction success rate is around 70.9% which is very good in terms of logistics regression standard.

Project to identify factors that determines funding of a crowdsourcing project