5.1.28 - Primary City
No data cleaning necessary.
Alumni fundraising is an integral component of universities providing thorough support for their students. While it may seem that rising tuition costs should cover the bulk of university expenses, state and federal funding has simltaenous decreased in recent years. Thus, alumni fundraising has become a necessary component of university revenue streams.
The objective is this report is to determine which features/characteristics of alumni are associated with high rates of giving (either raw amount or frequency).
The types of methods we employ are exploratory data analysis, decision/classification trees, random forests, logistic regression, and neural networks.
The summary of the results are that most models obtained between 96 and 97% accuracy on validation sets. Furthemore, features that were consistently across all models identified were First Gift Amount, Total Events Attended, Development Region, Age, and Degree Type/Department.
Further work still remains in the identification of high-end donors, i.e., those who give large donation(s) in excess of 10,000. Since the median donation amount for those who gave was approximately $25, individuals who classify as high-end donors can easily make up for the very low overall rate of 3-4% meets frequent giving critera.
Universities at the undergraduate and graduate levels are in a bidding war for the world’s top talent. To attact the best talent, universities must invest in themselves and the packages they are able to advertise to students. In addition to world-leading researchers that teach rigorous and interesting coursework, universities also aim to provide top-quality equipment, facilities, and professional connections to their students.
Alumni fundraising is one source of for these services, and while in an ideal world all alumni would feel inclined to give back to the educational institutions, the reality is that only a small percentage of all alumni donate to their alma maters. Thus, identifying key features of this population of donors will allow university personnel to more efficiently identify those who are more likely to donate. In turn, best practices can be revised and more targeted fundraising campaigns can be implemented.
Analyze the features of the individuals and determine which features, if any, can help the alumni team more effectively target current and future prospects.
Define ‘meets giving criteria’: has given in at least 2 of the last 5 years.
Given a set of features on a current or future prospect, predict the probability of the prospect meeting the giving criteria.
If time permits we will also seek to understand the features that can help us predict large donor individual, i.e. those who are above the 95th percentile of donations. As is common in alumni fundraising and donations, a few individuals will sometimes donate (just once) a large sum, i.e., in the 10,000+ dollar range.
Thus, while it is important to identify features associated with frequent donating, it is just as important, if not more important, to identify features that are associated with large donations, even if they are not themselves frequent.
There is substantial literature and research done in identifying factors that can aid in predicting alumni giving frequency and amount.
Hashimoto (Spring 2016) analyzed data at a public university in California and found that graduate level of degree was strongly impacted alumni giving. The type of graduate level degree was also analyzed and those degrees that focused in public policy or public service tended to give more. Furthermore, student athletes and those involved in alumni events were more likely to give. Other features, such as age, median household income, and proximity to university campus had an effect on the total amount given. One thing to note is that Hashimoto was unable to identify a strong indicator for high-end donors.
Farrokhvar, Ansari, & Behrooz (2018) looked at predictive models for charitable giving with machine learning techniques. While charitable giving is a much broader scope of alumni giving, similar features of importance were found to that of Hashimoto. In particular, they found education level to be a strong factor.They employed Multiple Linear Regression, Artificial Neural Networks, and Support Vector Regression in their analysis.
There are 253,558 rows of data. Each row of the data corresponds to an alumni individual. The following 40 features have been recorded.
CAE Constituency: Undergraduate or Graduate Degree from the University
Age
Gender
Marital Status
Volunteer: Have they volunteered with the university?
Volunteer Jobs Fiscal Year
Volunteer Jobs Lifetime
Total Events Attended: Total Alumni Events Attended. Includes Volunteer Jobs.
Last Event Date
Major Giving Capacity: The ‘wealth capacity’ of an individual.
Engagement Score: Score between 0 - 20 that the university has assigned to them based off of other features
Recognized Giving Lifetime: The recognized amount the individual has donated in their entire lifetime
Recognized Giving Last 5 Years: The recognized amount the individual has donated in the last 5 years
Recognized Giving Last 12 months
Years Giving Last 5: The number of years in the last 5 that a recognized donation was recorded Recognized Giving First Gift Date: The date was recorded of the first recognized giving Recognized Giving First Gift Amount
Recognized Giving Last Gift Date
Recognized Giving Last Gift Amount
Recognized Giving Largest Gift Date
Recognized Giving Lasrgest Gift Amount
Faculty: Are they an academic employee?
Staff: Are they a non-academic employee?
Current Employer
Job Title
Degree Type
College: If they were undergraduate, which undergraduate college did they attend
Year Graduated
Date Graduated
Division
Department
Subdepartment
Alumni Club Region: Geographic Factor in which alumni club region they are engaged with Development Region: Geographic Factor in which development region they are assigned to
Primary City
Primary County
Primary State
Primary Postal Code
Primary Country
Primary Address Type
We are in the fortunate position of having a large data set and can delete many observations (yet small in relative size to the data set) in which we have missing values. we will clean our data feature by feature.
No date cleaning necessary
There are individuals in the data with an Age
value of 0. I assume that this is just a placeholder for an individual with unknown age. We will simply delete all individuals with an Age
value of 0. This reduces our number of individuals from 253,558 to 250,735.
There are individuals in the data with ‘Unknown’ Gender
value. Again, like Age
, this is a small number compared to our overall data set. We will delete all rows with ‘Unknown’ Gender
.
There is an alarming number of individuals with an ‘Unknown’ or Null ``‘Marital.Status’. In fact, of the remaining 250,031 observations, only 250,031 - 195,733 = 54,298 individuals have a known marital status. Deleting all these individuals would be a significant blow to our data set size. For now, we will just merge the two. We will most likely have to delete some later.
We will be merging ‘Widowed’ with ‘Deceased Spouse’ since both of those categories are similar and, in our case, small in size. We will call this merged variable ‘Widowed’ for simplicity.
The Volunteer column as is has a TRUE in those individuals who have volunteered before, and a Null value otherwise. We will convert these to show 1 and 0 and convert them to factors.
There are some individuals with NA in this feature. Assumption is NA is equal to 0 events.
There are some individuals with NA in this feature. Assumption is NA is equal to 0 events.
There are some individuals with NA in this feature. Assumption is NA is equal to 0 events.
Unfortunately the Date columns in this data set seem to have been corrupted via file compression. Some can be assumed to be mmddyyyy, but others simply do not make sense. For example, see the first 3 unique date values from the Recognized Giving First Gift Date.
## [1] NA 34333 34334
For this reason the following Date columns will be deleted from this analysis:
Last Event Date
Recognized Giving First Gift Date
Recognized Giving Last Gift Date
Recognized Giving Largest Gift Date
Date Graduated
This column has values that are difficult to decipher and is correlated enough with other predictors that we can delete it. Furthermore, there are over 18,900 individuals with Null in this column. We do not want to simply wipe these individuals from our data.
Similar to the previously deleted feature Major Giving Capacity, Engagement Score feature has over 134,000 individuals whose scores are either 0.00 or Null. It is unknown whether 0.00 is really just a placeholder for 0.00, nor is it known how this Engagement Score is calculated.
We have two choices. We can delete all individuals whose Engagement Score is 0.00 or Null, or we can delete this feature.
I decide to delete this feature because, like Major Giving Capacity, it is likely correlated with the other features. Therefore, by learning which other features are informative towards likelihood of giving, we will in essence learn some facet of this Engagement Score.
There are some individuals with NA in this feature. Assumption is NA is equal to $0 in lifetime.
There are some individuals with NA in this feature. Assumption is NA is equal to $0 in last 5 years.
There are some individuals with NA in this feature. Assumption is NA is equal to $0 in last 12 months.
No data clean-up necessary.
There are some individuals with NA in these features. $0 woud mean they haven’t gifted, which is mathematically true. We will convert this.
Similar to the Volunteer feature, there are only ‘True’ and Null values. We will fill in the Null values with False and cover them to factors.
Of the approximately 250,000 individuals remaining in our data, there are approximately 29,000 unique employers. I will not delete this feature, since I believe some good exploratory analysis can be done with this feature. However, when it comes to modeling it may be of little use due to the sheer number of factor levels in this feature, and parsing the different employers into industries would, for this project, be too time consuming.
Similar to Current Employer, Job Title returns approximately 25,000 unique job titles. I will include this feature in my exploratory data analysis, but similar to Current Employer, it may be of little use when it comes to modeling.
There are 26 unique degree types. No data cleaning is necessary.
There are approximately 200 individuals with an ‘Unknown’ College entry. We will delete these entries.
There are approximately 100 individuals with a missing Year Graduated entry. We will delete these entries.
There are 3 Division labels that can seeminlgy be grouped together for their vagueness. ‘Unknown’, ‘Other/Unknown’, and ‘Unspecified or multi-divisional’. We will group these under the label ‘Unspecified’.
There are 90 unique Departments listed in the data set. We will combine ‘Unknown’ with Null values.
No data cleaning necessary.
No data cleaning necessary.
No date cleaning necessary.
No data cleaning necessary.
No data cleaning necessary.
These two features will be deleted out of privacy and due to them most likely being uninformative features.
We first note that a strong majority of individuals have not donated in 2 or more of the last 5 years. Approximately 96% of the individuals have not met this criteria.
Var1 | Freq |
---|---|
Less than 2 donations | 239855 |
Greater than 2 donations | 9867 |
We’ll visualize some of our features now to identify any characteristics that may be of use in identifying individuals who have donated 2 or more times in the last 5 years.
We start with Age.
Figure EDA 1.1: Years Donated >= 2 by Age
Figure EDA 1.2: Giving Criteria Percentages by Age Group
According to Figure EDA 1.1 and 1.2, it appears there is a slight uptick in percentage of individuals (by age group) who meet the criteria as they age.
We’ll now examine the percentage who meet the criteria for each gender group. In this case, we have two Gender groups, ‘Female’ and ‘Male’.
Figure EDA 2.1
We do not see any difference in percentage of givers between the two gender groups. One thing to note is that there are slightly more Males in the data set than there are Females.
We’ll now examine the percentage who meet the criteria for each marital status level. In this case, we have 7 levels.
Figure EDA 3.1: Giving Criteria by Marital Level
The scale of our graphs is thrown off by the ‘Unknown’ Marital Status. Let us zoom in on a couple subsets of our Marital Status levels.
Figure EDA 3.2: Giving Criteria by Marital Status Subsets
Looking at Figure EDA 3.2 we can see that ‘Single’ and ‘Unknown’ Marital Status have the lowest percentages of 4% and 3% respectively. ‘Married’, which is the 3rd largest Marital Status level by size, has a 12% meets giving criteria rate. ‘Divorced’, which is the 4th largest Marital Status level by size, has a 16% meets giving criteria. The three other groups (‘Widowed’, ‘Partnered’, and ‘Separated’), whose total aggregate in size is less than 1,000 individuals, have slightly higher percent meets giving criteria rates of (18%, 19%, and 18% respectively).
Let’s now examine the rates of meeting giving criteria for those who have and have not volunteered with the university. We will first just analyze if an individual has volunteered at least once with the university over their lifetime.
Figure EDA 4.1
A couple of things we notice from this Figure EDA 4.1. First, the count of those who have not volunteered greatly outnumbers those who have: an approximate 10:1 ratio of non-volunteers to volunteers. Second, the rate of meets giving criteria is significantly larger in the volunteer group. Almost 17% of Volunteers meet the giving criteria, while only 3% of Non-Volunteers do.
Let’s examine some more of our volunteer features as they relate to giving criteria.
Vol. Jobs this FY | Giving Criteria | Count | Percent within Group |
---|---|---|---|
0 | < 2 | 239813 | 96.072383 |
0 | >= 2 | 9804 | 3.927617 |
1 | < 2 | 35 | 43.750000 |
1 | >= 2 | 45 | 56.250000 |
2 | < 2 | 4 | 26.666667 |
2 | >= 2 | 11 | 73.333333 |
3 | < 2 | 3 | 33.333333 |
3 | >= 2 | 6 | 66.666667 |
4 | < 2 | 0 | 0.000000 |
4 | >= 2 | 1 | 100.000000 |
We can see from Table EDA 4.1 that a strong majority (about 95% of all individuals in our data) have not volunteered this fiscal year. Due to COVID-19 restrictions this is understandable from the standpoint of the University. Nevertheless, we can see numerically that for those who did volunteer, the giving criteria was met at a much higher rate than the non-volunteer 3%. Let us examine this visually.
Figure EDA 4.2
We can see that already at just 1 volunteer job attended this fiscal year, individuals had a 56% rate of meeting the giving criteria. Compare this to the approximate 3.9% of individuals who did not attend any volunteer jobs this fiscal years. The % of individuals meeting the giving criteria did not always increase for each new volunteer job attended this fiscal year (see the rates decrease from 73% to 67% when we increased the Number of Jobs from 2 to 3). That being said, it is promising to see the rates of giving criteria exceed 50% in each group of individuals who volunteered at least once this fiscal year.
Finally, we will examine Number of Volunteer Jobs over the Lifetime and the rate of meets giving criteria.
Vol. Jobs Lifetime | Giving Criteria | Count | Percent within Group |
---|---|---|---|
0 | < 2 | 230468 | 96.680119 |
0 | >= 2 | 7914 | 3.319882 |
1-5 | < 2 | 9112 | 85.015861 |
1-5 | >= 2 | 1606 | 14.984139 |
6-10 | < 2 | 30 | 38.961039 |
6-10 | >= 2 | 47 | 61.038961 |
11-15 | < 2 | 22 | 40.000000 |
11-15 | >= 2 | 33 | 60.000000 |
15-20 | < 2 | 14 | 12.500000 |
15-20 | >= 2 | 98 | 87.500000 |
21+ | < 2 | 209 | 55.291005 |
21+ | >= 2 | 169 | 44.708995 |
As before, over 95% of individuals have not volunteered in their lifetime, and from those, only 3% meet the giving criteria. We will redefine our factor levels of number of volunteer jobs for our visualization for scaling purposes.
Figure EDA 4.3
Similar to Number of Volunteer Jobs this Fiscal Year, we see an upward trend in rate of giving criteria met as the number of jobs increase from group to group. It appears that Volunteer Jobs has a positive association with meeting the giving criteria.
Total # Events | Giving Criteria | Count | Percent within Group |
---|---|---|---|
0 | < 2 | 186420 | 97.733087 |
0 | >= 2 | 4324 | 2.266913 |
1 | < 2 | 28826 | 95.327226 |
1 | >= 2 | 1413 | 4.672774 |
2-5 | < 2 | 20917 | 89.255387 |
2-5 | >= 2 | 2518 | 10.744613 |
6-10 | < 2 | 2576 | 76.371183 |
6-10 | >= 2 | 797 | 23.628817 |
11+ | < 2 | 860 | 54.464851 |
11+ | >= 2 | 719 | 45.535149 |
About 75% of all individuals have not attended any events, and of those, about 2% have met the giving criteria.
Let us visualize the other groups.
Figure EDA 5.1
We can see that the rate of meeting giving criteria increases as we increase the Total Number of Events Attended. In other words, there is a positive association between meeting the giving criteria and Total Events Attended.
Let’s examine the rate of meeting giving criteria as a function of Staff or not.
Figure EDA 6.1
As we can see from Figure EDA 6.1, a strong majority of individuals are not staff, and the rate of meets giving criteria of the not staff group is consistent with the overall data average of about 4%.
Those who are staff meet the giving criteria at a rate of about 12%. From a fundraising standpoint though this seems like a dead-end since one cannot hope to greatly increase the count of staff members year to year.
We now consider the faculty indicator.
Figure EDA 7.1
As we can see from Figure EDA 7.1, a strong majority of individuals are not faculty, and the rate of meets giving criteria of the not faculty group is consistent with the overall data average of about 4%.
Those who are staff meet the giving criteria at a rate of about 8%. From a fundraising standpoint though this seems like a dead-end since one cannot hope to greatly increase the count of faculty members year to year.
We will first compare the rates of meets giving criteria of Undergraduate vs Graduate Degree awarded for the individuals in the data set. Then we will analyze each subgroup with more detail.
Figure EDA 8.1
As we can see from Figure EDA 8.1 the rate of meeting the giving criteria is nearly equal in the two groups. Individuals who earned a Graduate degree from the University meet the giving criteria at a rate of 5%, while those who earned an Undegraduate degree meet the criteria at a rate of 4%.
Let’s examine the Undergraduate population more closely. In particular, we will compare the Bachelor of Science and Bachelor of Arts subgroups.
Figure EDA 8.2
As we can see in Figure EDA 8.2, the rates of meeting the giving criteria are nearly equal in the two Bachelor Degree Type groups of B.S. and B.A. Both are near the overall average of about 3-4%.
Let’s examine the Graduate population more closely. The top 4 subgroups by count of the Graduate population are Master of Science (M.S.), Doctor of Philosophy (Ph.D), Doctor of Medicine (M.D.), and Master of Arts (M.A.). We’ll visualize these groups’ respective rates of meeting giving criteria.
Figure EDA 8.3
As we can see in Figure EDA 8.3, the rates of meeting giving criteria are slightly above the 3-4% overall rate and above the rates seen in Figure EDA 8.2 for those with Bachelor Degrees. Doctor of Medicine has the highest rate seen so far of the graduate degrees at 9%.
Let’s examine the next top 7 graduate degree types by count. Note, we will include the Master of Finance towards the count of M.B.A. individuals, and we will combine the Master of International Affairs individuals towards Master of Pacific and International Affairs.
The next top 6 graduate degrees are Master of Pacific and International Affairs (M.I.A.), Master of Advanced Studies (M.A.S.), Master of Business Administration (M.B.A.), Master of Fine Arts (M.F.A.), Master of Education (M.Ed), Doctor of Pharmacy (PharmD), and Master of Engineering (M.Eng)
Figure EDA 8.4
As we can see from Figure EDA 8.4, there is a bit of variation in the rate of meeting the giving criteria between the different graduate groups. Groups that are above the 3-4% rate are M.I.A., M.B.A., and M.Eng degree types at 8%, 5% and 5% respectively.
This particular university also splits their undergraduate population into smaller colleges. Each college has its own set of unique graduate requirements and culture. We will visually analyze this subset now.
Figure EDA 9.1
As we can see from Figure EDA 9.1, the rates of meeting giving criteria are fairly constant and equal to the overall rate of 3-4%. The one group that exceeds this is Revelle with a meets giving criteria rate of approximately 5%.
We will now examine the rates for graduate colleges. We will visually examine the top 5: Graduate (Grad), School of Medicine (Med), School of Public Policy & Strategy (PP&S), Rady School of Management (Rady), and Scripps Institution of Oceanography (Scripps).
Figure EDA 9.2
As we can see from Figure EDA 9.2, the Scripps Institution of Oceanography has the highest rate of meeting the giving criteria at 12%. This is substantially greater than the overall data average of 3-4%. In second and third are the School of Medicine and the School of Public Policy and Strategy with rates of 9% and 7% respectively.
We will now visually analyze any relationship that exists between meeting the giving criteria and the Alumni Region that the individual has been assigned to. It can be assumed that this is the primary home region for the individual, or at least the home region the University has on file for the individual.
We will first focus on some key regions in the University’s home state.
Figure EDA 10.1
As we can see in Figure EDA 10.1 the SD Region as a whole seems to be at or above average in meeting the giving criteria. The geolocation SD-Central and SD-North are above the average rate, while SD-East and SD-South are equal to the overall data rate of 3-4%.
Let’s now look at the other major metropolitan regions of California.
Figure EDA 10.2
From Figure EDA 10.2 we see that the other regions of California are at the overall average rate of 3-4% for meeting the giving criteria. The Bay Area - Peninsula region is ever so slightly above average of 5%.
The other top 5 by count metropolitan areas of the United States will now be visually analyzed.
Figure EDA 10.3
Interesting to note that the top metropolitan areas/states outside of California have slightly higher than the average rate of 3-4% across the board. All regions listed above in Figure EDA 10.3 are at or above the average rate, and Boston had a relatively high rate of 6%.
We will now see the rates of meets giving criteria for the International regions. We include East & Southeast Asia, South Asia, Europe, North America (excluding USA), Middle East, South America, and Australia & Oceania
Figure EDA 10.4
From Figure EDA 10.4 we see that the International Development Regions are at or below the average of 3-4%. Interesting to note that the largest international subgroup (East & Southeast Asia) are well below the average at 1% meets giving criteria rate.
We finally will analyze the Department of individuals in the data. Due to the number of unique Departments (89), we instead just focused on the Top 25 (by count) and tried to categorize each individual that belonged to one of those Top 25 into one of six broader categories: Physical Sciences, Engineering & Math, Social Sciences, Medicine, Business, and Art.
While this is not a representation of all individuals, we can assume that it is a large enough sample to show us any trends.
Figure EDA 11.1
We can see in Figure EDA 11.1 that Medicine Department is above the overall average of 3-4%, and all other 5 categories are at the average.
It is worth noting that there is a gap larger gap between those who have gifted 2 of the last 5 years and those who gifted 1 out of the last 5 years.
There are 17,360 individuals who gifted 1 time out of the last 5 years, but only 4,910 who gifted 2 times. This represents a 71.7% decrease.
We will investigate whether there is a difference between the first gift amount for these two groups. Maybe the first gift amount will dictate whether the person converts to a more frequent donor.
Figure EDA 12.1
As we can see from Figure EDA 12.1, the large number of outliers in each group makes it difficult to determine the distribution of each group. It is worth exploring these high gift amount donors, regardless of their frequency. We will save that for another section of the EDA and in our modeling. Let’s instead compare the quantiles of each distribution and see where a difference occurs.
quantiles | Group1 | Group2 |
---|---|---|
0.00 | 0.1 | 0.01 |
0.25 | 15.0 | 15.00 |
0.50 | 25.0 | 25.00 |
0.75 | 50.0 | 50.00 |
0.80 | 65.0 | 79.00 |
0.85 | 100.0 | 100.00 |
0.90 | 100.0 | 100.00 |
0.95 | 250.0 | 250.00 |
We can see from the above Table EDA 12.1 that the quantiles of each first gift amount $ distributions for the two groups (those that gave once vs those that gave twice) are nearly identical. Through the 95th percentile, each group has near equal first gift amounts.
Let us examine the top 5 percentile for each group below in Table EDA 12.2
quantiles | Group1 | Group2 |
---|---|---|
0.95 | 250 | 250.00 |
0.96 | 250 | 340.00 |
0.97 | 350 | 400.00 |
0.98 | 450 | 500.00 |
0.99 | 1000 | 1113.75 |
1.00 | 250000 | 300000.00 |
It appears that in the top 5th percentile of Group2 (those who gifted twice) there is a bit of separation between the corresponding quantiles of Group1 (those who gifted once).
In conclusion, it doesn’t appear there is any obvious sign from the first gift amount $ whether an individual will continue to gift and meet the giving criteria.
Let’s try to first understand the distribution of outliers for each group, as outliers in this case or a good thing for the University.
It appears that a good cut-off value for an outlier is a recognized first gift amount of $100,000.
Due to the sheer number of categorical variables and levels within each factor, a considerable amount of reshaping of the data needed to be done.
In particular, we reduced the number of levels in the Department, College, and Degree Type factors. Furthermore, we removed Largest and Last Gift Amounts as that seemed like it would overwhelm any learning algorithm from identifying other features. We kept First Gift Amount.
We will first try to implement a decision tree. We will split our data into a train and test set of approximate 80-20 ratio.
Figure MOD 1.1
We can see the decision tree nodes above in Figure MOD 1.1. First Gift Amount was the first decision, which makes sense since the biggest hurdle into making it to the greater than or equal to 2 times in the last 5 years class is to make the an initial donation.
Secondly, we decide on Total Events Attended, which as seen in our Exploratory Data Analysis, clearly had a positive association with meeting the giving criteria.
Third, we decide on the Development Region, i.e., the individual’s home region. It is difficult to determine from the Figure MOD 1.1 above, but again, we can reexamine our Figure EDA 10.1, 10.2, 10.3. From there we can determine which geographic locations are more positively associated with reaching the meeting criteria.
Finally, we decide on Age. Which we saw in the Exploratory Data Analysis the as age increases so does the likelihood that you will meet the giving criteria.
Figure MOD 2.1
We can see in Figure MOD 2.1 the top variables of importance (depending on the fitting criteria of the model). In either case we can see that the features Age, Total Events Attended, Development Region, and Recognized First Gift Amount were the factors that helped the Random Forest classify individuals.
For this random forest a misclassification error rate of 3.7% was obtained on the test set. Also to note is that the random forest was not as single-tracked as the tree. In other words, the random forest did classify individuals as ‘Yes’ meets giving criteria whereas the tree had 0 such classifications for any individuals.
We now will try logistic regression as seen in Hashimoto 2016.
Our model returned a misclassification error rate of approximately 3.8%.
We will check the deviance residuals of the logistic regression model; a good fit model should have most of the residuals fall within 3 standard deviations.
Figure MOD 3.1
Unfortunately we have some high large residual deviance points. This suggests that we are not including the correct or correct amount of predictors in the model, or there may be a few outliers that have high leverage.
This is a bit discouraging, but overall the deviance residual plot suggests that logistic regression can be promising model with some more tweaking.
We will employ a neural network with one hidden layer. Since we have a high number of variables, many of which are categorical, we will choose to have 10 nodes. We will use the least square method to optimize the objective function.
## # weights: 821
## initial value 59184.490447
## iter 10 value 25922.787861
## iter 20 value 24854.976251
## iter 30 value 23840.263513
## iter 40 value 23221.920517
## iter 50 value 22752.738996
## iter 60 value 22555.262528
## iter 70 value 22427.467166
## iter 80 value 22364.032684
## iter 90 value 22334.630128
## iter 100 value 22284.449173
## final value 22284.449173
## stopped after 100 iterations
The neural network has a misclassication error rate of approximately 3.4%
We provide a summary of our correct classification for each of the models we tried.
Model | Perc.Correct |
---|---|
Tree | 0.9608770 |
Random.Forest | 0.9627791 |
Log.Reg | 0.9617780 |
Neural.Net | 0.9615777 |
Figure MOD 4.1
We can see in Figure MOD 4.1 the ROC curve. We will now compute the area under the ROC curve for each method.
Area Under ROC Curve | |
---|---|
Log Reg | 0.7988 |
NN | 0.9096 |
Tree-CART | 0.5092 |
Random Forest | 0.5161 |
We see from Table MOD 2 that the model with the highest AOC is the neural net. That being said, our other models offer more interpretability of features that may be best suited for the alumni team since it informs them of the decisions the Tree has made to classify.
Through our Exploratory Data Analysis and through various statistical models, we see that some of our more important features in predicting meeting a frequent giving criteria are the follows:
First Gift Amount
Total Events Attended
Development Region
Age
Department/Degree Type
It can’t be said that Total Event Attended causes individuals to give more frequently. They are mereley positively associated, yet it is most likely that there is a confounding variable that is influencing both Total Events Attended and meeting the giving criteria.
Development Region is an important component to keep in mind as the world re-opens from the 2020 COVID-10 pandemic. In the University’s own region, there is considerable difference between the 4 main regions and the alumni team could benefit from more outreach in South and East regions. Perhaps more events in that region would influence those regions to give more frequently.
Lastly, the Department/Degree type is something most likely correlated with income. Individuals with advanced degrees are more likely to have higher incomes, and this higher income is more likely the root reason for the more frequent giving. For example, individuals from the Medical department had an approximate 9% meets frequent giving rate, compared to the overall average of 3-4%.
More exploratory data analysis and modeling needs to be done on the high-end alumni donors, i.e. those who give more than 10,000. There are suprisingly many who give only once, but that one time donation exceeds the median donation of $25 by factors of thousands.
[1] Hashimoto. Analyzing factors that predict alumni giving at a public university in California. 2016.
[2] Farrokhvar, Ansari, & Behrooz. Predictive Models for charitable giving using machine learning techniques. 2018.
library(ggplot2)
library(plyr)
library(gridExtra)
library(kableExtra)
library(dplyr)
library(MASS)
library(caret) ##preProcess
library(nnet)
library(rpart)
library(rpart.plot)
library(randomForest)
library(ROCR)
library(tinytex)
#Read in Data
data <- read.csv('huntercsv.csv')
data <- data
data <- data[data$Age != 0,]
data <- data[which(data$Gender != 'Unknown'), ]
data$Gender <- as.factor(data$Gender)
data$Marital.Status <- ifelse(data$Marital.Status == '', 'Unknown', data$Marital.Status)
data$Marital.Status <- ifelse(data$Marital.Status == 'Deceased Spouse', 'Widowed', data$Marital.Status)
data$Marital.Status <- as.factor(data$Marital.Status)
data$Volunteer <- ifelse(data$Volunteer == 'True', 'True', 'False')
data$Volunteer <- as.factor(data$Volunteer)
data$Volunteer.Jobs..This.FY.[is.na(data$Volunteer.Jobs..This.FY.)] <- 0
data$Volunteer.Jobs..Lifetime.[is.na(data$Volunteer.Jobs..Lifetime.)] <- 0
data$Total.Events.Attended[is.na(data$Total.Events.Attended)] <- 0
data <- subset(data, select = -c(Last.Event.Date, Recognized.Giving.First.Gift.Date, Recognized.Giving.Last.Gift.Date, Recognized.Giving.Largest.Gift.Date, Date.Graduated))
data <- subset(data, select = -c(Major.Giving.Capacity.Value))
data <- subset(data, select = -c(Engagement.Score))
data$Recognized.Giving..Lifetime.[is.na(data$Recognized.Giving..Lifetime.)] <- 0
data$Recognized.Giving..5.Year.[is.na(data$Recognized.Giving..5.Year.)] <- 0
data$Recognition.Giving..12.Months.[is.na(data$Recognition.Giving..12.Months.)] <- 0
data$Recognized.Giving.First.Gift.Amt[is.na(data$Recognized.Giving.First.Gift.Amt)] <- 0
data$Recognized.Giving.Last.Gift.Amt[is.na(data$Recognized.Giving.Last.Gift.Amt)] <- 0
data$Recognized.Giving.Largest.Gift.Amt[is.na(data$Recognized.Giving.Largest.Gift.Amt)] <- 0
data$Faculty <- ifelse(data$Faculty == 'True', TRUE, FALSE)
data$Faculty <- as.factor(data$Faculty)
data$Staff <- ifelse(data$Staff == 'True', TRUE, FALSE)
data$Staff <- as.factor(data$Staff)
data <- data[-c(which(data$College == '')), ]
remove.index <- which(is.na(data$Year.Graduated) == TRUE)
data <- data[-c(remove.index),]
data$Division <- ifelse(data$Division == 'Unknown', 'Unspecified', data$Division)
data$Division <- ifelse(data$Division == 'Other/Unknown', 'Unspecified', data$Division)
data$Division <- ifelse(data$Division == 'Unspecified or multi-divisional', 'Unspecified', data$Division)
data$Department[which(data$Department == '')] <- 'Unknown'
data <- subset(data, select = -c(Primary.Postcode, Primary.Address.Type))
###########Exploratory Data Analysis#########################
#First Table
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
table(df$two.great) %>%
kbl(caption = 'A Table') %>%
kable_styling()
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
p1 <- ggplot(df) + aes(x = as.numeric(Age), group = two.great, fill = two.great) +
geom_histogram(binwidth = 1, color = 'black')
p2 <- ggplot(df[df$two.great == TRUE,]) + aes(x = as.numeric(Age)) +
geom_histogram(binwidth = 1, color = 'black', fill ='turquoise3')
title <- 'Distribution of Age by Years Giving >= 2'
bottom.title <- 'Age'
grid.arrange(p1, p2, nrow = 2, top = title, bottom = bottom.title)
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
group.age <- function(age) {
if(age < 20) {
val <- 'Teens'
} else if(age >= 20 & age < 30) {
val <- '20s'
} else if (age >= 30 & age < 40) {
val <- '30s'
} else if (age >= 40 & age < 50) {
val <- '40s'
} else if (age >= 50 & age < 60) {
val <- '50s'
} else if (age >= 60 & age < 70) {
val <- '60s'
} else {
val <- '70sup'
}
return(val)
}
v_group.age <- Vectorize(group.age, vectorize.args = 'age')
df$age.group <- v_group.age(df$Age)
df$age.group <- as.factor(df$age.group)
count <- table(df[df$age.group == '20s', ]$two.great)['FALSE']
count <- c(count, table(df[df$age.group == '20s', ]$two.great)['TRUE'])
count <- c(count, table(df[df$age.group == '30s', ]$two.great)['FALSE'])
count <- c(count, table(df[df$age.group == '30s', ]$two.great)['TRUE'])
count <- c(count, table(df[df$age.group == '40s', ]$two.great)['FALSE'])
count <- c(count, table(df[df$age.group == '40s', ]$two.great)['TRUE'])
count <- c(count, table(df[df$age.group == '50s', ]$two.great)['FALSE'])
count <- c(count, table(df[df$age.group == '50s', ]$two.great)['TRUE'])
count <- c(count, table(df[df$age.group == '60s', ]$two.great)['FALSE'])
count <- c(count, table(df[df$age.group == '60s', ]$two.great)['TRUE'])
count <- c(count, table(df[df$age.group == '70sup', ]$two.great)['FALSE'])
count <- c(count, table(df[df$age.group == '70sup', ]$two.great)['TRUE'])
#create a dataframe
age.group.levels <- rep(levels(df$age.group)[c(1:6)], each = 2)
times.given <- rep(c('< 2', '>= 2'), 6)
df.age.group <- data.frame(age.group.levels, times.given, count)
#calculate the percentages
df.age.group <- ddply(df.age.group, .(age.group.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.age.group <- ddply(df.age.group, .(age.group.levels), transform, pos = (cumsum(count) - 0.95*count))
reindex <- c(2,1,4,3,6,5,8,7,10,9,12,11)
df.age.group$pos <- df.age.group$pos[reindex]
format.pos <- c(2,4,6,8,10,12)
df.age.group$pos[format.pos] <- 3573.60
df.age.group$pos[12] <- df.age.group$pos[12] - 1000
df.age.group$label <- paste0(sprintf('%.0f', df.age.group$percent), '%')
#barplot of counts by industry withh in group proportions
ggplot(df.age.group, aes(x = age.group.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Giving Criteria by Age Group')
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
count <- table(df[df$Gender == 'Female', ]$two.great)['FALSE']
count <- c(count, table(df[df$Gender == 'Female', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Gender == 'Male', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Gender == 'Male', ]$two.great)['TRUE'])
#create a dataframe
gender.levels <- rep(levels(df$Gender), each = 2)
times.given <- rep(c('< 2', '>= 2'), 2)
df.gender.group <- data.frame(gender.levels, times.given, count)
#calculate the percentages
df.gender.group <- ddply(df.gender.group, .(gender.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.gender.group <- ddply(df.gender.group, .(gender.levels), transform, pos = (cumsum(count) - 0.94*count))
reindex <- c(2,1,4,3)
df.gender.group$pos <- df.gender.group$pos[reindex]
df.gender.group$label <- paste0(sprintf('%.0f', df.gender.group$percent), '%')
#barplot of counts by industry withh in group proportions
ggplot(df.gender.group, aes(x = gender.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Giving Criteria by Gender')
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
count <- table(df[df$Marital.Status == 'Divorced', ]$two.great)['FALSE']
count <- c(count, table(df[df$Marital.Status == 'Divorced', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Married', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Married', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Partnered', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Partnered', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Separated', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Separated', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Single', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Single', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Unknown', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Unknown', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Widowed', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Widowed', ]$two.great)['TRUE'])
marital.levels <- rep(levels(df$Marital.Status), each = 2)
times.given <- rep(c('< 2', '>= 2'), 7)
df.marriage <- data.frame(marital.levels, times.given, count)
#calculate the percentages
df.marriage <- ddply(df.marriage, .(marital.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.marriage <- ddply(df.marriage, .(marital.levels), transform, pos = (cumsum(count) - 0.94*count))
#reindex <- c(2,1,4,3)
#df.marriage$pos <- df.marriage$pos[reindex]
df.marriage$label <- paste0(sprintf('%.0f', df.marriage$percent), '%')
#barplot of counts by industry withh in group proportions
ggplot(df.marriage, aes(x = marital.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
ggtitle('Giving Criteria by Marital Status')
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Married and Single##
count <- table(df[df$Marital.Status == 'Married', ]$two.great)['FALSE']
count <- c(count, table(df[df$Marital.Status == 'Married', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Single', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Single', ]$two.great)['TRUE'])
marital.levels <- rep(c('Married', 'Single'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 2)
df.marriage <- data.frame(marital.levels, times.given, count)
#calculate the percentages
df.marriage <- ddply(df.marriage, .(marital.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.marriage <- ddply(df.marriage, .(marital.levels), transform, pos = (cumsum(count) - 0.75*count))
reindex <- c(2,1,4,3)
df.marriage$pos <- df.marriage$pos[reindex]
df.marriage$label <- paste0(sprintf('%.0f', df.marriage$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.marriage, aes(x = marital.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2)
##Divorced, Partnered, Separated, Widowed'
count <- table(df[df$Marital.Status == 'Divorced', ]$two.great)['FALSE']
count <- c(count, table(df[df$Marital.Status == 'Divorced', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Partnered', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Partnered', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Separated', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Separated', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Marital.Status == 'Widowed', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Marital.Status == 'Widowed', ]$two.great)['TRUE'])
marital.levels <- rep(c('Divorced', 'Partnered', 'Separated', 'Widowed'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 4)
df.marriage <- data.frame(marital.levels, times.given, count)
#calculate the percentages
df.marriage <- ddply(df.marriage, .(marital.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.marriage <- ddply(df.marriage, .(marital.levels), transform, pos = (cumsum(count) - 0.75*count))
reindex <- c(2,1,4,3, 6,5,8,7)
df.marriage$pos <- df.marriage$pos[reindex]
df.marriage$pos[2] <- 150
df.marriage$pos[3] <- 80
df.marriage$pos[5] <- 80
df.marriage$pos[c(4,6)] <- 25
df.marriage$pos[8] <- 80
df.marriage$label <- paste0(sprintf('%.0f', df.marriage$percent), '%')
#barplot of counts by industry withh in group proportions
p2 <- ggplot(df.marriage, aes(x = marital.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2)
title = 'Giving Criteria by Marital Status'
bot.title = 'Marital Status'
grid.arrange(p1, p2, nrow = 2, top = title, bottom = bot.title)
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Volunteer and not Volunteer##
count <- table(df[df$Volunteer == 'False', ]$two.great)['FALSE']
count <- c(count, table(df[df$Volunteer == 'False', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Volunteer == 'True', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Volunteer == 'True', ]$two.great)['TRUE'])
vol.levels <- rep(c('Not Volunteer', 'Yes Volunteer'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 2)
df.vol <- data.frame(vol.levels, times.given, count)
#calculate the percentages
df.vol <- ddply(df.vol, .(vol.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.vol <- ddply(df.vol, .(vol.levels), transform, pos = (cumsum(count) - 0.75*count))
reindex <- c(2,1,4,3)
df.vol$pos <- df.vol$pos[reindex]
df.vol$pos[c(2,3)] <- 13000
df.vol$pos[4] <- 2750
df.vol$label <- paste0(sprintf('%.0f', df.vol$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.vol, aes(x = vol.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Volunteer Status', x = 'Volunteer N/Y', y = 'Count')
p1
## Exploratory Data Analysis: Volunteer Jobs during last Fiscal Year (0 removed)
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df$Volunteer.Jobs..This.FY. <- as.factor(df$Volunteer.Jobs..This.FY.)
##Volunteer and not Volunteer##
count <- table(df[df$Volunteer.Jobs..This.FY. == '1', ]$two.great)['FALSE']
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '1', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '2', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '2', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '3', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '3', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '4', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Volunteer.Jobs..This.FY. == '4', ]$two.great)['TRUE'])
vol.levels <- rep(c('1', '2', '3', '4'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 4)
df.vol <- data.frame(vol.levels, times.given, count)
#calculate the percentages
df.vol <- ddply(df.vol, .(vol.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.vol <- ddply(df.vol, .(vol.levels), transform, pos = (cumsum(count) - 0.50*count))
reindex <- c(2,1,4,3,6,5,8,7)
df.vol$pos <- df.vol$pos[reindex]
df.vol$pos[3] <- df.vol$pos[3] + 5
df.vol$pos[5] <- df.vol$pos[5] + 2
df.vol$label <- paste0(sprintf('%.0f', df.vol$percent), '%')
df.vol$label[7] <- '100%'
df.vol$pos[8] <- df.vol$pos[7]
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.vol, aes(x = vol.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Volunteer Jobs this F.Y.', x = '# of Jobs this F.Y.', y = 'Count')
p1
group.vol.2 <- function(vol) {
if(vol == 0) {
val <- '0'
} else if (vol == 1) {
val <- '1'
} else if (vol == 2) {
val <- '2'
}else if (vol >= 3 & vol < 6) {
val <- '3-5'
} else if (vol >= 6 & vol < 11) {
val <- '6-10'
} else if (vol >= 11) {
val <- '11+'
}
return(val)
}
vec.group.vol.2 <- Vectorize(group.vol.2, vectorize.args = 'vol')
df$vol.group2 <- vec.group.vol.2(df$Volunteer.Jobs..Lifetime.)
df$vol.group2 <- as.factor(df$vol.group2)
count <- table(df[df$vol.group2 == '0', ]$two.great)['FALSE']
count <- c(count, table(df[df$vol.group2 == '0', ]$two.great)['TRUE'])
count <- c(count, table(df[df$vol.group2 == '1', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '1', ]$two.great)['TRUE'])
count <- c(count, table(df[df$vol.group2 == '2', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '2', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '3-5', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '3-5', ]$two.great)['TRUE'])
count <- c(count, table(df[df$vol.group2 == '6-10', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '6-10', ]$two.great)['TRUE'])
count <- c(count, table(df[df$vol.group2 == '11+', ]$two.great)['FALSE'])
count <- c(count, table(df[df$vol.group2 == '11+', ]$two.great)['TRUE'])
vol.levels <- rep(c('0', '1', '2', '3-5', '6-10', '11+'), each = 2)
vol.levels <- factor(vol.levels, levels = c('0', '1', '2', '3-5', '6-10', '11+'))
times.given <- rep(c('< 2', '>= 2'), 6)
df.vol <- data.frame(vol.levels, times.given, count)
#calculate the percentages
df.vol <- ddply(df.vol, .(vol.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.vol <- ddply(df.vol, .(vol.levels), transform, pos = (cumsum(count) - 0.75*count))
df.vol$label <- paste0(sprintf('%.0f', df.vol$percent), '%')
#delete 0 volunteer job group
df.vol <- df.vol[-c(1:2),]
df.vol$pos <- df.vol$pos[c(2,1,4,3,6,5,8,7,10,9)]
df.vol$pos[2] <- 350
df.vol$pos[3] <- 2000
df.vol$pos[7] <- df.vol$pos[7] + 50
df.vol$pos[8] <- df.vol$pos[8] + 20
df.vol$pos[9] <- df.vol$pos[9] + 100
df.vol$pos[10] <- df.vol$pos[10] + 30
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.vol, aes(x = vol.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Volunteer Jobs Lifetime', x = '# of Volunteer Jobs Lifetime', y = 'Count')
p1
group.events <- function(event) {
val <- ''
if(event == 0) {
val <- '0'
} else if (event == 1) {
val <- '1'
} else if (event >= 2 & event < 6) {
val <- '2-5'
} else if (event >= 6 & event < 10) {
val <- '6-10'
} else if (event >= 11) {
val <- '11+'
}
return(val)
}
vec.group.event <- Vectorize(group.events, vectorize.args = 'event')
df$event.group <- vec.group.event(df$Total.Events.Attended)
df$event.group <- as.factor(df$event.group)
#make table
count <- table(df[df$event.group == '0', ]$two.great)['FALSE']
count <- c(count, table(df[df$event.group == '0', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '1', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '1', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '2-5', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '2-5', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '6-10', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '6-10', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '11+', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '11+', ]$two.great)['TRUE'])
event.levels <- rep(c('0', '1', '2-5', '6-10', '11+'), each = 2)
event.levels <- factor(event.levels, levels = c('0', '1', '2-5', '6-10', '11+'))
times.given <- rep(c('< 2', '>= 2'), 5)
df.event <- data.frame(event.levels, times.given, count)
#calculate the percentages
df.event <- ddply(df.event, .(event.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.event <- ddply(df.event, .(event.levels), transform, pos = (cumsum(count) - 0.75*count))
colnames(df.event) <- c('Total # Events', 'Giving Criteria', 'Count', 'Percent within Group')
df.event[,c(1:4)] %>%
kbl(caption = 'Table EDA 5.1: Giving Criteria by Total Events Attended') %>%
kable_styling()
count <- table(df[df$event.group == '0', ]$two.great)['FALSE']
count <- c(count, table(df[df$event.group == '0', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '1', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '1', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '2-5', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '2-5', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '6-10', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '6-10', ]$two.great)['TRUE'])
count <- c(count, table(df[df$event.group == '11+', ]$two.great)['FALSE'])
count <- c(count, table(df[df$event.group == '11+', ]$two.great)['TRUE'])
event.levels <- rep(c('0', '1', '2-5', '6-10', '11+'), each = 2)
event.levels <- factor(event.levels, levels = c('0', '1', '2-5', '6-10', '11+'))
times.given <- rep(c('< 2', '>= 2'), 5)
df.event <- data.frame(event.levels, times.given, count)
#calculate the percentages
df.event <- ddply(df.event, .(event.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.event <- ddply(df.event, .(event.levels), transform, pos = (cumsum(count) - 0.75*count))
df.event$label <- paste0(sprintf('%.0f', df.event$percent), '%')
df.event$pos <- df.event$pos[c(2,1,4,3,6,5,8,7,10,9)]
#delete 0 volunteer job group
df.event <- df.event[-c(1:2),]
df.event$pos[2] <- df.event$pos[6]
df.event$pos[4] <- df.event$pos[6]
df.event$pos[7] <- df.event$pos[7] + 300
df.event$pos[8] <- df.event$pos[8] + 200
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.event, aes(x = event.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Total Events Attended', x = 'Total # of Events Attended', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Staff and Not Staff##
count <- table(df[df$Staff == 'FALSE', ]$two.great)['FALSE']
count <- c(count, table(df[df$Staff == 'FALSE', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Staff == 'TRUE', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Staff == 'TRUE', ]$two.great)['TRUE'])
staff.levels <- rep(c('Not Staff', 'Yes Staff'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 2)
df.staff <- data.frame(staff.levels, times.given, count)
#calculate the percentages
df.staff <- ddply(df.staff, .(staff.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.staff <- ddply(df.staff, .(staff.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3)
df.staff$pos <- df.staff$pos[reindex]
df.staff$pos[3] <- df.staff$pos[3] + 5000
df.staff$pos[4] <- df.staff$pos[4] + 1000
df.staff$label <- paste0(sprintf('%.0f', df.staff$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.staff, aes(x = staff.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Staff Status', x = 'Staff N/Y', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Staff and Not Staff##
count <- table(df[df$Faculty == 'FALSE', ]$two.great)['FALSE']
count <- c(count, table(df[df$Faculty == 'FALSE', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Faculty == 'TRUE', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Faculty == 'TRUE', ]$two.great)['TRUE'])
fac.levels <- rep(c('Not Faculy', 'Yes Faculty'), each = 2)
times.given <- rep(c('< 2', '>= 2'), 2)
df.fac <- data.frame(fac.levels, times.given, count)
#calculate the percentages
df.fac <- ddply(df.fac, .(fac.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.fac <- ddply(df.fac, .(fac.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3)
df.fac$pos <- df.fac$pos[reindex]
df.fac$pos[3] <- df.fac$pos[3] + 6000
df.fac$pos[4] <- df.fac$pos[4] + 2000
df.fac$label <- paste0(sprintf('%.0f', df.fac$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.fac, aes(x = fac.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Faculty Status', x = 'Faculty N/Y', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df$CAE.Constituency <- gsub('^Alumni Undergraduate Degree', 'Undergrad', df$CAE.Constituency)
df$CAE.Constituency <- gsub('^Alumni Graduate Only Degree', 'Grad', df$CAE.Constituency)
df$CAE.Constituency <- as.factor(df$CAE.Constituency)
##Undergrad vs Grad##
count <- table(df[df$CAE.Constituency == 'Undergrad', ]$two.great)['FALSE']
count <- c(count, table(df[df$CAE.Constituency == 'Undergrad', ]$two.great)['TRUE'])
count <- c(count, table(df[df$CAE.Constituency == 'Grad', ]$two.great)['FALSE'])
count <- c(count, table(df[df$CAE.Constituency == 'Grad', ]$two.great)['TRUE'])
deg.levels <- rep(c('Undergrad', 'Grad'), each = 2)
deg.levels <- factor(deg.levels, levels = c('Undergrad', 'Grad'))
times.given <- rep(c('< 2', '>= 2'), 2)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 2500
df.deg$pos[3] <- df.deg$pos[3]
df.deg$pos[4] <- df.deg$pos[4] + 4000
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Undergrad/Grad', x = 'Undergrad/Grad', y = 'Count')
p1
df.bach <- df[which(df$Degree.Type == 'Bachelor of Science' | df$Degree.Type == 'Bachelor of Arts'),]
df <- df.bach
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##B.S. vs B.A.##
count <- table(df[df$Degree.Type == 'Bachelor of Science', ]$two.great)['FALSE']
count <- c(count, table(df[df$Degree.Type == 'Bachelor of Science', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Bachelor of Arts', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Bachelor of Arts', ]$two.great)['TRUE'])
deg.levels <- rep(c('B.S.', 'B.A.'), each = 2)
deg.levels <- factor(deg.levels, levels = c('B.S.', 'B.A.'))
times.given <- rep(c('< 2', '>= 2'), 2)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 2500
df.deg$pos[3] <- df.deg$pos[3]
df.deg$pos[4] <- df.deg$pos[4] + 4000
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by B.S. vs B.A.', x = 'Bachelor Degree Type', y = 'Count')
p1
df <- data
df.grad <- df[which(df$Degree.Type == 'Master of Science' | df$Degree.Type == 'Doctor of Philosophy' | df$Degree.Type == 'Doctor of Medicine' | df$Degree.Type == 'Master of Arts'),]
df <- df.grad
df$Degree.Type <- as.factor(df$Degree.Type)
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Top 4 Grad Levels.##
count <- table(df[df$Degree.Type == 'Master of Science', ]$two.great)['FALSE']
count <- c(count, table(df[df$Degree.Type == 'Master of Science', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Philosophy', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Philosophy', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Medicine', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Medicine', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Arts', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Arts', ]$two.great)['TRUE'])
deg.levels <- rep(c('M.S', 'Ph.D', 'M.D', 'M.A'), each = 2)
deg.levels <- factor(deg.levels, levels = c('M.S', 'Ph.D', 'M.D', 'M.A'))
times.given <- rep(c('< 2', '>= 2'), 4)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 500
df.deg$pos[4] <- df.deg$pos[4] + 750
df.deg$pos[6] <- df.deg$pos[6] + 650
df.deg$pos[8] <- df.deg$pos[8] + 600
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by industry withh in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Grad Degree Type [Top 4]', x = 'Grad Degree Type', y = 'Count')
p1
df <- data
#df$Degree.Type <- as.factor(df$Degree.Type)
#names(sort(summary(df$Degree.Type), decreasing = TRUE)[1:15])
df.grad <- df[which(df$Degree.Type == 'Master of Pacific and International Affairs' | df$Degree.Type == 'Master of Advanced Studies' | df$Degree.Type == 'Master of Business Administration' | df$Degree.Type == 'Master of Fine Arts' | df$Degree.Type == 'Doctor of Pharmacy' | df$Degree.Type == 'Master of Education' | df$Degree.Type == 'Master of Finance' | df$Degree.Type == 'Master of International Affairs' | df$Degree.Type == 'Master of Engineering'),]
df.grad$Degree.Type <- gsub('^Master of Finance', 'Master of Business Administration', df.grad$Degree.Type)
df.grad$Degree.Type <- gsub('^Master of International Affairs', 'Master of Pacific and International Affairs', df.grad$Degree.Type)
df <- df.grad
df$Degree.Type <- as.factor(df$Degree.Type)
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
##Top 5-10 Grad Levels.##
count <- table(df[df$Degree.Type == 'Master of Pacific and International Affairs', ]$two.great)['FALSE']
count <- c(count, table(df[df$Degree.Type == 'Master of Pacific and International Affairs', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Advanced Studies', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Advanced Studies', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Business Administration', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Business Administration', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Fine Arts', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Fine Arts', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Education', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Education', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Pharmacy', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Doctor of Pharmacy', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Engineering', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Degree.Type == 'Master of Engineering', ]$two.great)['TRUE'])
deg.levels <- rep(c('M.I.A.', 'M.A.S.', 'M.B.A.', 'M.F.A.', 'M.Ed', 'PharmD', 'M.Eng'), each = 2)
deg.levels <- factor(deg.levels, levels = c('M.I.A.', 'M.A.S.', 'M.B.A.', 'M.F.A.', 'M.Ed', 'PharmD', 'M.Eng'))
times.given <- rep(c('< 2', '>= 2'), 7)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7,10,9,12,11,14,13)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 175
df.deg$pos[4] <- df.deg$pos[4] + 70
df.deg$pos[6] <- df.deg$pos[6] + 75
df.deg$pos[8] <- df.deg$pos[8] + 70
df.deg$pos[10] <- df.deg$pos[10] + 70
df.deg$pos[14] <- df.deg$pos[14] + 75
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Grad Degree Type [Top 5-11]', x = 'Grad Degree Type', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df.college <- df[which(df$College == 'Roosevelt College' | df$College == 'Revelle College' | df$College == 'Muir College' | df$College == 'Warren College' | df$College == 'Marshall College' | df$College == 'Sixth College'),]
df.college$College <- gsub('^Roosevelt College', 'Roosevelt', df.college$College)
df.college$College <- gsub('^Revelle College', 'Revelle', df.college$College)
df.college$College <- gsub('^Muir College', 'Muir', df.college$College)
df.college$College <- gsub('^Warren College', 'Warren', df.college$College)
df.college$College <- gsub('^Marshall College', 'Marshall', df.college$College)
df.college$College <- gsub('^Sixth College', 'Sixth', df.college$College)
df.deg <- df.college
df <- df.college
##Top 5-10 Grad Levels.##
count <- table(df[df$College == 'Roosevelt', ]$two.great)['FALSE']
count <- c(count, table(df[df$College == 'Roosevelt', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Revelle', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Revelle', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Muir', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Muir', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Warren', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Warren', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Marshall', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Marshall', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Sixth', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Sixth', ]$two.great)['TRUE'])
deg.levels <- rep(c('Roosevelt', 'Revelle', 'Muir', 'Warren', 'Marshall', 'Sixth'), each = 2)
deg.levels <- factor(deg.levels, levels = c('Roosevelt', 'Revelle', 'Muir', 'Warren', 'Marshall', 'Sixth'))
times.given <- rep(c('< 2', '>= 2'), 6)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7,10,9,12,11)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 175
df.deg$pos[4] <- df.deg$pos[4] + 70
df.deg$pos[6] <- df.deg$pos[6] + 75
df.deg$pos[8] <- df.deg$pos[8] + 70
df.deg$pos[10] <- df.deg$pos[10] + 70
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Undergraduate College', x = 'Undergrad College', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
undergrad.index <- which(df$College == 'Roosevelt College' | df$College == 'Revelle College' | df$College == 'Muir College' | df$College == 'Warren College' | df$College == 'Marshall College' | df$College == 'Sixth College')
df.college <- df[-undergrad.index,]
df <- df.college
count <- table(df[df$College == 'Graduate', ]$two.great)['FALSE']
count <- c(count, table(df[df$College == 'Graduate', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'School of Medicine', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'School of Medicine', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'School of Global Policy & Strategy', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'School of Global Policy & Strategy', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Rady School of Management', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Rady School of Management', ]$two.great)['TRUE'])
count <- c(count, table(df[df$College == 'Scripps Institution of Oceanography', ]$two.great)['FALSE'])
count <- c(count, table(df[df$College == 'Scripps Institution of Oceanography', ]$two.great)['TRUE'])
deg.levels <- rep(c('Grad', 'Med', 'PP&S', 'Rady', 'Scripps'), each = 2)
deg.levels <- factor(deg.levels, levels = c('Grad', 'Med', 'PP&S', 'Rady', 'Scripps'))
times.given <- rep(c('< 2', '>= 2'), 5)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7,10,9)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 1000
df.deg$pos[4] <- df.deg$pos[4] + 1000
df.deg$pos[6] <- df.deg$pos[6] + 800
df.deg$pos[8] <- df.deg$pos[8] + 800
df.deg$pos[9] <- df.deg$pos[9] + 1200
df.deg$pos[10] <- df.deg$pos[10] + 800
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Graduate School', x = 'Grad School', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
sd.index <- which(df$Development.Region == 'SD County - Central' | df$Development.Region == 'SD County - North' | df$Development.Region == 'SD County - East' | df$Development.Region == 'SD County - South')
df.sd <- df[sd.index,]
df <- df.sd
count <- table(df[df$Development.Region == 'SD County - Central', ]$two.great)['FALSE']
count <- c(count, table(df[df$Development.Region == 'SD County - Central', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'SD County - North', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'SD County - North', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'SD County - East', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'SD County - East', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'SD County - South', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'SD County - South', ]$two.great)['TRUE'])
deg.levels <- rep(c('SD-Central', 'SD-North', 'SD-East', 'SD-South'), each = 2)
deg.levels <- factor(deg.levels, levels = c('SD-Central', 'SD-North', 'SD-East', 'SD-South'))
times.given <- rep(c('< 2', '>= 2'), 4)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 1850
df.deg$pos[4] <- df.deg$pos[4] + 1000
df.deg$pos[6] <- df.deg$pos[6] + 800
df.deg$pos[8] <- df.deg$pos[8] + 800
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by San Diego County Region', x = 'SD Region', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
sd.index <- which(df$Development.Region == 'LA County' | df$Development.Region == 'Orange County' | df$Development.Region == 'Bay Area - East' | df$Development.Region == 'Bay Area - South' | df$Development.Region == 'Bay Area - Peninsula' | df$Development.Region == 'Other CA')
df.sd <- df[sd.index,]
df <- df.sd
count <- table(df[df$Development.Region == 'LA County', ]$two.great)['FALSE']
count <- c(count, table(df[df$Development.Region == 'LA County', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Orange County', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Orange County', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - East', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - East', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - South', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - South', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - Peninsula', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Bay Area - Peninsula', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Other CA', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Other CA', ]$two.great)['TRUE'])
deg.levels <- rep(c('LA', 'OC', 'Bay-East', 'Bay-South', 'Bay-Peninsula', 'Other CA'), each = 2)
deg.levels <- factor(deg.levels, levels = c('LA', 'OC', 'Bay-East', 'Bay-South', 'Bay-Peninsula', 'Other CA'))
times.given <- rep(c('< 2', '>= 2'), 6)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.97*count))
reindex <- c(2,1,4,3,6,5,8,7, 10, 9, 12, 11)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[2] <- df.deg$pos[2] + 750
df.deg$pos[4] <- df.deg$pos[4] + 1000
df.deg$pos[6] <- df.deg$pos[6] + 800
df.deg$pos[8] <- df.deg$pos[8] + 800
df.deg$pos[10] <- df.deg$pos[10] + 800
df.deg$pos[12] <- df.deg$pos[12] + 800
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by CA Region', x = 'CA Region', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
sd.index <- which(df$Development.Region == 'New York' | df$Development.Region == 'Seattle' | df$Development.Region == 'DC' | df$Development.Region == 'Texas' | df$Development.Region == 'Boston')
df.sd <- df[sd.index,]
df <- df.sd
count <- table(df[df$Development.Region == 'New York', ]$two.great)['FALSE']
count <- c(count, table(df[df$Development.Region == 'New York', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Seattle', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Seattle', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'DC', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'DC', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Texas', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Texas', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Boston', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Boston', ]$two.great)['TRUE'])
deg.levels <- rep(c('NY', 'Seattle', 'DC', 'Texas', 'Boston'), each = 2)
deg.levels <- factor(deg.levels, levels = c('NY', 'Seattle', 'DC', 'Texas', 'Boston'))
times.given <- rep(c('< 2', '>= 2'), 5)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.97*count))
reindex <- c(2,1,4,3,6,5,8,7, 10, 9)
df.deg$pos <- df.deg$pos[reindex]
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by US Region [Top 5 outside of CA]', x = 'US Region', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
sd.index <- which(df$Development.Region == 'East & Southeast Asia' | df$Development.Region == 'South Asia' | df$Development.Region == 'Europe' | df$Development.Region == 'North America (Excluding USA)' | df$Development.Region == 'Middle East' | df$Development.Region == 'South America' | df$Development.Region == 'Australia & Oceania')
df.sd <- df[sd.index,]
df <- df.sd
count <- table(df[df$Development.Region == 'East & Southeast Asia', ]$two.great)['FALSE']
count <- c(count, table(df[df$Development.Region == 'East & Southeast Asia', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'South Asia', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'South Asia', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Europe', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Europe', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'North America (Excluding USA)', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'North America (Excluding USA)', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Middle East', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Middle East', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'South America', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'South America', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Development.Region == 'Australia & Oceania', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Development.Region == 'Australia & Oceania', ]$two.great)['TRUE'])
deg.levels <- rep(c('ESE Asia', 'S Asia', 'Euro', 'NA (exc.USA)', 'Mid E', 'S.Amer', 'Aus'), each = 2)
deg.levels <- factor(deg.levels, levels = c('ESE Asia', 'S Asia', 'Euro', 'NA (exc.USA)', 'Mid E', 'S.Amer', 'Aus'))
times.given <- rep(c('< 2', '>= 2'), 7)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7, 10, 9, 12,11, 14,13)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[4] <- df.deg$pos[4] + 200
df.deg$pos[6] <- df.deg$pos[6] + 200
df.deg$pos[8] <- df.deg$pos[8] + 200
df.deg$pos[9] <- df.deg$pos[9] + 150
df.deg$pos[10] <- df.deg$pos[10] + 100
df.deg$pos[11] <- df.deg$pos[11] + 150
df.deg$pos[12] <- df.deg$pos[12] + 100
df.deg$pos[13] <- df.deg$pos[13] + 150
df.deg$pos[14] <- df.deg$pos[14] + 100
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by International Region', x = 'Region', y = 'Count')
p1
rm(ucsd.data)
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
dep.names.25 <- names(sort(summary(as.factor(data$Department)), decreasing = TRUE)[1:25])
phys.sci <- c(1,9,11,18,22,25)
eng.index <- c(5,6,7,10,17,19)
soc.index <- c(2,3,4,8,13,14,20,21,24)
med.index <- c(16)
bus.index <- c(23)
art.index <- c(15,12)
for(i in phys.sci) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Phys.Science', df$Department)
}
for(i in eng.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Eng.Math', df$Department)
}
for(i in soc.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Soc.Sci.', df$Department)
}
for(i in med.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Med.', df$Department)
}
for(i in bus.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Bus.', df$Department)
}
for(i in art.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Art', df$Department)
}
df$Deparment <- as.factor(df$Department)
sd.index <- which(df$Department == 'Phys.Science' | df$Department == 'Eng.Math' | df$Department == 'Soc.Sci.' | df$Department == 'Med.' | df$Department == 'Bus.' | df$Department == 'Art')
df.sd <- df[sd.index,]
df <- df.sd
count <- table(df[df$Department == 'Phys.Science', ]$two.great)['FALSE']
count <- c(count, table(df[df$Department == 'Phys.Science', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Department == 'Eng.Math', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Department == 'Eng.Math', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Department == 'Soc.Sci.', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Department == 'Soc.Sci.', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Department == 'Med.', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Department == 'Med.', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Department == 'Bus.', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Department == 'Bus.', ]$two.great)['TRUE'])
count <- c(count, table(df[df$Department == 'Art', ]$two.great)['FALSE'])
count <- c(count, table(df[df$Department == 'Art', ]$two.great)['TRUE'])
deg.levels <- rep(c('Phys.Sci', 'Eng.Math', 'Soc.Sci', 'Med', 'Bus', 'Art'), each = 2)
deg.levels <- factor(deg.levels, levels = c('Phys.Sci', 'Eng.Math', 'Soc.Sci', 'Med', 'Bus', 'Art'))
times.given <- rep(c('< 2', '>= 2'), 6)
df.deg <- data.frame(deg.levels, times.given, count)
#calculate the percentages
df.deg <- ddply(df.deg, .(deg.levels), transform, percent = count/sum(count) * 100)
#format the labels and calculate their positions
df.deg <- ddply(df.deg, .(deg.levels), transform, pos = (cumsum(count) - 0.96*count))
reindex <- c(2,1,4,3,6,5,8,7, 10, 9, 12,11)
df.deg$pos <- df.deg$pos[reindex]
df.deg$pos[4] <- df.deg$pos[4] + 200
df.deg$pos[6] <- df.deg$pos[6] + 200
df.deg$pos[8] <- df.deg$pos[8] + 200
df.deg$pos[9] <- df.deg$pos[9] + 150
df.deg$pos[10] <- df.deg$pos[10] + 100
df.deg$pos[11] <- df.deg$pos[11] + 150
df.deg$pos[12] <- df.deg$pos[12] + 100
df.deg$label <- paste0(sprintf('%.0f', df.deg$percent), '%')
#barplot of counts by grad degree type with in group proportions
p1 <- ggplot(df.deg, aes(x = deg.levels, y = count, fill = times.given)) +
geom_bar(stat = 'identity') +
geom_text(aes(y = pos, label = label), size = 2) +
labs(title = 'Giving Criteria by Department', x = 'Department', y = 'Count')
p1
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 == 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df$one.great <- ifelse(df$Years.Giving.Last.5 == 1, TRUE, FALSE)
df$one.great <- as.factor(df$one.great)
df.2 <- df[df$two.great == TRUE,]
df.1 <- df[df$one.great == TRUE,]
dtt <- data.frame(Group = c('Group 1', 'Group 2'), Count = c(nrow(df.1), nrow(df.2)))
dtt %>%
kable(caption = 'Count of Group 1 and Group 2') %>%
kable_styling()
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 == 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df$one.great <- ifelse(df$Years.Giving.Last.5 == 1, TRUE, FALSE)
df$one.great <- as.factor(df$one.great)
df.2 <- df[df$two.great == TRUE,]
df.1 <- df[df$one.great == TRUE,]
df.1g <- data.frame(df.1$Recognized.Giving.First.Gift.Amt, rep(1, length = length(df.1$Recognized.Giving.First.Gift.Amt)))
colnames(df.1g) <- c('y', 'Group')
df.2g <- data.frame(df.2$Recognized.Giving.First.Gift.Amt, rep(2,length = length(df.2$Recognized.Giving.First.Gift.Amt)))
colnames(df.2g) <- c('y', 'Group')
one.quantiles <- as.numeric(quantile(df.1g$y, probs = c(0,0.25, 0.50, 0.75, 0.80, 0.85, 0.90, 0.95)))
two.quantiles <- as.numeric(quantile(df.2g$y, probs = c(0,0.25, 0.50, 0.75, 0.80, 0.85, 0.90, 0.95)))
dt <- data.frame(quantiles = c(0,0.25, 0.50, 0.75, 0.80, 0.85, 0.90, 0.95), Group1 = one.quantiles, Group2 = two.quantiles)
dt %>%
kable(caption = 'First Gift Amount Quantiles') %>%
kable_styling()
##Clean Up dataframe for modeling
df <- data
df$two.great <- ifelse(df$Years.Giving.Last.5 >= 2, TRUE, FALSE)
df$two.great <- as.factor(df$two.great)
df <- subset(df, select = -c(Primary.City, Primary.County, Primary.State, Primary.Country))
df <- subset(df, select = -c(Year.Graduated, Division))
#Grad vs Undergrad
df$CAE.Constituency <- gsub('^Alumni Undergraduate Degree', 'Undergrad', df$CAE.Constituency)
df$CAE.Constituency <- gsub('^Alumni Graduate Only Degree', 'Grad', df$CAE.Constituency)
df$CAE.Constituency <- as.factor(df$CAE.Constituency)
#Development Region to factor
df$Development.Region <- as.factor(df$Development.Region)
#Remove the recognized giving
df <- subset(df, select = -c(Recognized.Giving..Lifetime., Recognized.Giving..5.Year., Recognition.Giving..12.Months.))
df <- subset(df, select = -c(Recognized.Giving.Last.Gift.Amt, Recognized.Giving.Largest.Gift.Amt, Subdepartment, Alumni.Club.Region))
df <- subset(df, select = -c(Current.Employer))
df$College <- as.factor(df$College)
df$Department <- as.factor(df$Department)
df$Degree.Type <- as.factor(df$Degree.Type)
df <- subset(df, select = -c(Years.Giving.Last.5))
df <- subset(df, select = -c(Job.Title))
dep.names.25 <- names(sort(summary(as.factor(data$Department)), decreasing = TRUE)[1:length(levels(as.factor(data$Department)))])
phys.sci <- c(1,9,11,18,22,25,29, 31, 37, 38,47,62, 67, 69,86)
eng.index <- c(5,6,7,10,17,19,40,42,49,50,54,63,68,78,79,82)
soc.index <- c(2,3,4,8,13,14,20,21,24,26,27,30,33,34,39,45,46,48,52,53,55,56,59,66,73,74,75,76,77,80,81,84,85, 87,88)
med.index <- c(16,35,41,57,58,60,61,64,65)
bus.index <- c(23,83)
art.index <- c(15,12,28,32,89)
education.index <- c(36,43,70)
unknown.index <- c(44,51,71,72)
play <- c(phys.sci, eng.index, soc.index, med.index, bus.index, art.index, education.index, unknown.index)
for(i in phys.sci) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Phys.Science', df$Department)
}
for(i in eng.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Eng.Math', df$Department)
}
for(i in soc.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Soc.Sci.', df$Department)
}
for(i in med.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Med.', df$Department)
}
for(i in bus.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Bus.', df$Department)
}
for(i in art.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Art', df$Department)
}
for(i in education.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Edu', df$Department)
}
for(i in unknown.index) {
df$Department <- gsub(paste('^', dep.names.25[i], sep=''), 'Unknown', df$Department)
}
#---------------------------------------
df$Department <- as.factor(df$Department)
degree.types <- names(sort(summary(df$Degree.Type), decreasing = TRUE))
not.top.15 <- c(17:25)
for(i in not.top.15) {
df$Degree.Type <- gsub(paste('^', degree.types[i], sep = ''), 'Other Graduate', df$Degree.Type)
}
df$Degree.Type <- as.factor(df$Degree.Type)
#---------------------------------------
#names(sort(summary(df$College), decreasing = TRUE))
df$College <- gsub('^Herbert', 'Graduate', df$College)
df$College <- gsub('^Extension', 'Graduate', df$College)
df$College <- gsub('^School of Engineering', 'Graduate', df$College)
names(df)[names(df) == 'Total.Events.Attended'] <- 'Total.Events'
names(df)[names(df) == 'Recognized.Giving.First.Gift.Amt'] <- 'First.Gift.Amt'
names(df)[names(df) == 'Volunteer.Jobs..This.FY.'] <- 'Vol.Jobs.FY'
names(df)[names(df) == 'Volunteer.Jobs..Lifetime'] <- 'Vol.Jobs.LT'
names(df)[names(df) == 'Development.Region'] <- 'Region'
names(df)[names(df) == 'CAE.Constituency'] <- 'Undegrad.Grad'
#---------------------------------------
y.response = rep(-1, nrow(df))
y.response[which(df$two.great==FALSE)] = 'No'
y.response[which(df$two.great==TRUE)] = 'Yes'
y.response <- as.factor(y.response)
df <- cbind(df, y.response)
df <- subset(df, select = -c(two.great))
### Trees ###
set.seed(1001)
eighty = floor(.80*nrow(df))
train = sample(1:nrow(df), eighty)
test.size = nrow(df[-train,])
training_set <- df[train,]
testing_set <- df[-train,]
tree.y <- rpart(y.response ~., data = training_set, method = 'class', minsplit = 3, cp = 0.005)
tree.y.pred.prob <- predict(tree.y, newdata = testing_set, type = 'prob')
tree.y.pred <- predict(tree.y, newdata = testing_set, type = 'class')
table.results <- with(df[-train,], table(tree.y.pred, y.response))
num.table.results <- as.numeric(table.results)
test.size <- nrow(df[-train,])
perc.correct.tree <- (num.table.results[1] + num.table.results[4])/test.size
rpart.plot(tree.y, type = 3, clip.right.labs = FALSE, branch = .3, under = TRUE)
rm(tree.y)
######## Random Forest ###########
###tune mtry
test.err = double(13)
for(mtry in 1:6) {
fit = randomForest(y.response ~., data = df, subset = train, mtry = mtry, ntree = 100)
pred = predict(fit, df[-train,])
table.results <- with(df[-train,], table(pred, y.response))
num.table.results <- as.numeric(table.results)
perc.correct <- (num.table.results[1] + num.table.results[4])/test.size
test.err[mtry] <- perc.correct
}
rm(fit)
best.mtry <- which.max(test.err)
#### we have a best mtry now
set.seed(1001)
eighty = floor(.80*nrow(df))
train = sample(1:nrow(df), eighty)
test.size = nrow(df[-train,])
rf.best <- randomForest(y.response ~., data = df, subset = train, mtry = 4, ntree = 100, importance=TRUE)
pred.rm = predict(rf.best, df[-train,])
pred.rm.prob <- predict(rf.best, df[-train,], type = 'prob')
table.results <- with(df[-train,], table(pred, y.response))
num.table.results <- as.numeric(table.results)
perc.correct.rf <- (num.table.results[1] + num.table.results[4])/test.size
varImpPlot(rf.best, main = 'Variable Importance - Random Forest')
rm(rf.best)
set.seed(1002)
eighty = floor(.80*nrow(df))
train = sample(1:nrow(df), eighty)
test.size = nrow(df[-train,])
training_set <- df[train,]
testing_set <- df[-train,]
#### Logistic Regression
#Full model
mod.logreg <- glm(y.response ~., data = training_set, family = binomial('logit'))
#check if our model needs less variables via backward selection
m.full <- mod.logreg
m.null <- glm(y.response ~ 1, data = training_set, family = binomial('logit'))
m.forward <- step(m.null, trace = F, scope = list(lower = formula(m.null), upper = formula(m.full)), direction = 'forward')
###our full model is sufficient.
prob.logreg <- predict(mod.logreg, testing_set, type = 'response')
pred <- rep('No', length(prob))
pred[prob >= 0.5] <- 'Yes'
tb <- table(pred, testing_set$y.response)
tb.numeric <- as.numeric(tb)
perc.correct.logreg <- (tb.numeric[1] + tb.numeric[4])/test.size
index <- 1:dim(training_set)[1]
dev.resid <- residuals(mod.logreg)
criteria <- training_set$y.response
dff <- data.frame(index, dev.resid, criteria)
p <- ggplot(dff, aes(x = index, y = dev.resid, color = criteria)) +
geom_point() +
geom_hline(yintercept = 3, linetype = 'dashed', color = 'blue') +
geom_hline(yintercept = -3, linetype = 'dashed', color = 'blue')
p
rm(mod.logreg)
#####Neural Networks######
model.nn <- nnet(y.response ~., data = training_set, size = 10, maxit = 100)
pre.model.nn <- predict(model.nn, newdata = testing_set, type = 'raw')
pred1 <- rep('No', length(pre.model.nn))
pred1[pre.model.nn >= 0.5] <- 'Yes'
tb <- table(pred1, testing_set$y.response)
tb.numeric <- as.numeric(tb)
perc.correct.nn <- (tb.numeric[1] + tb.numeric[4])/test.size
rm(model.nn)
###### Percent Correct Table ####
correct.class <- c(perc.correct.tree, perc.correct.rf, perc.correct.logreg, perc.correct.nn)
model <- c('Tree', 'Random.Forest', 'Log.Reg', 'Neural.Net')
results.df <- data.frame(Model = model, Perc.Correct = correct.class)
results.df %>%
kable(caption = 'Table MOD 1: Summary of % Correct of Classifiers') %>%
kable_styling()
#logreg
pr <- prediction(prob.logreg, testing_set$y.response)
prf <- performance(pr, measure = 'tpr', x.measure = 'fpr')
#dataframe for TP and FP
dd <- data.frame(FP = prf@x.values[[1]], TP = prf@y.values[[1]])
##NN
pr1 <- prediction(pre.model.nn, testing_set$y.response)
prf1 <- performance(pr1, measure = 'tpr', x.measure = 'fpr')
dd1 <- data.frame(FP = prf1@x.values[[1]], TP = prf1@y.values[[1]])
## Tree
pr2 <- prediction(tree.y.pred.prob[,2], testing_set$y.response)
prf2 <- performance(pr2, measure = 'tpr', x.measure = 'fpr')
dd2 <- data.frame(FP = prf2@x.values[[1]], TP = prf2@y.values[[1]])
##RF
pr3 <- prediction(pred.rm.prob[,2], testing_set$y.response)
prf3 <- performance(pr3, measure = 'tpr', x.measure = 'fpr')
dd3 <- data.frame(FP = prf3@x.values[[1]], TP = prf3@y.values[[1]])
### ROC Curve###
g <- ggplot() +
geom_line(data = dd, aes(x = FP, y= TP, color = 'Log Reg')) +
geom_line(data = dd1, aes(x = FP, y= TP, color = 'NN')) +
geom_line(data = dd2, aes(x = FP, y= TP, color = 'Tree')) +
geom_line(data = dd3, aes(x = FP, y= TP, color = 'RF')) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1)) +
ggtitle('ROC Curve') +
labs(x = 'False Pos Rate', y = 'True Pos. Rate')
g
auc <- rbind(performance(pr, measure = 'auc')@y.values[[1]],
performance(pr1, measure = 'auc')@y.values[[1]],
performance(pr2, measure = 'auc')@y.values[[1]],
performance(pr3, measure = 'auc')@y.values[[1]])
rownames(auc) <- c('Log Reg', 'NN', 'Tree-CART', 'Random Forest')
colnames(auc) <- 'Area Under ROC Curve'
auc <- round(auc,4)
### Area Under ROC Curve###
auc <- as.data.frame(auc)
auc %>%
kable(caption = 'Table MOD 2: Area Under Curve of ROC') %>%
kable_styling()