The Data

The Paralyzed Veterans of America (PVA) is a philanthropic organization sanctioned by the US government to represent the interests of disabled veterans. Since 1946, the PVA has raised money to support a variety of activities, including advocacy for veterans’ health care, research and education on spinal cord injury and disease, and support for veterans’ benefits and rights.

The PVA engages in list rental. Most of the PVA’s money is spent on mailings to people who never respond. If the PVA could avoid mailing to those people, it could potentially save several million dollars a year and waste less paper.

Data Preparation

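# stringsAsFactors = TRUE reproduces the Factor columns shown below on R >= 4.0, where the default is FALSE
VetData <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-606-Class/master/VetData.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)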

library(foreign)
library(tidyr)
library(dplyr)
library(psych)
library(ggplot2)

Viewing the data with 27 variables

str(VetData)
## 'data.frame':    3648 obs. of  27 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 1 ...
##  $ HIT      : int  0 1 0 0 0 25 0 14 12 0 ...
##  $ MALEVET  : int  32 23 43 29 30 26 33 38 22 28 ...
##  $ VIETVETS : int  41 46 18 20 30 26 36 36 57 13 ...
##  $ WWIIVETS : int  20 28 35 52 44 50 31 22 23 12 ...
##  $ LOCALGOV : int  7 8 6 12 10 7 5 2 9 4 ...
##  $ STATEGOV : int  2 2 4 1 2 3 1 4 44 2 ...
##  $ FEDGOV   : int  1 3 2 5 2 1 4 0 3 0 ...
##  $ CARDPROM : int  17 27 16 20 29 27 21 17 5 23 ...
##  $ MAXADATE : int  9702 9702 9702 9702 9702 9702 9702 9702 9702 9702 ...
##  $ NUMPROM  : int  45 63 44 48 63 65 51 41 13 58 ...
##  $ CARDPM12 : int  4 6 6 4 6 5 6 6 3 3 ...
##  $ NUMPRM12 : int  10 13 12 9 13 12 13 13 8 6 ...
##  $ NGIFTALL : int  5 12 5 15 9 21 13 4 1 23 ...
##  $ CARDGIFT : int  3 7 1 10 6 15 8 3 0 13 ...
##  $ MINRAMNT : num  5 5 5 5 5 3 2 5 15 3 ...
##  $ MINRDATE : int  9103 9004 9203 9506 8812 9404 9404 9301 9507 9412 ...
##  $ MAXRAMNT : num  15 20 15 11 20 5 11 30 15 10 ...
##  $ MAXRDATE : int  9509 9409 9507 9407 9502 8910 9505 9601 9507 8805 ...
##  $ LASTGIFT : num  15 20 15 10 20 5 5 30 15 5 ...
##  $ AVGGIFT  : num  10.6 11 10.4 7.13 11.44 ...
##  $ CONTROLN : int  98282 166937 175951 147641 41222 65806 141308 142203 79580 105930 ...
##  $ HPHONE_D : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ CLUSTER2 : Factor w/ 63 levels ".","1","10","11",..: 39 11 17 17 37 63 32 44 17 12 ...
##  $ CHILDREN : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...

Preprocessing

The dataset has 3648 observations of 27 variables. I’m interested in only three of them: AGE, HOMEOWNER and GIFTAMNT. I created a new data frame (Vet) containing only these three variables.

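# Select the three variables by name (columns 1, 2 and 27) rather than by position
Vet <- VetData[, c("AGE", "HOMEOWNER", "GIFTAMNT")]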
str(Vet)
## 'data.frame':    3648 obs. of  3 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 1 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...
head(Vet)
##   AGE HOMEOWNER GIFTAMNT
## 1  63         N       15
## 2  78         Y       10
## 3  53         Y       15
## 4  85         N       10
## 5  32         N       20
## 6  66         Y        5

Recoding the factor variable with “Y” and “N” to binary “1” and “0”

Vet$HOMEOWNER <- ifelse(Vet$HOMEOWNER == "Y", 1, 0)
str(Vet)
## 'data.frame':    3648 obs. of  3 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: num  0 1 1 0 0 1 1 1 1 0 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...
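
A quick cross-tabulation can confirm that the recoding mapped every "Y" to 1 and every "N" to 0. The sketch below is only illustrative: it uses the HOMEOWNER values from the six rows printed by head(Vet) above rather than the full data, and the names toy, HOMEOWNER_BIN and check are hypothetical.

```r
# Toy data: the HOMEOWNER values from the six rows printed by head(Vet)
toy <- data.frame(HOMEOWNER = factor(c("N", "Y", "Y", "N", "N", "Y")))

# Same recoding rule as above: "Y" becomes 1, any other non-missing value becomes 0
toy$HOMEOWNER_BIN <- ifelse(toy$HOMEOWNER == "Y", 1, 0)

# Cross-tabulate original against recoded values;
# every count should fall in the N/0 or Y/1 cell
check <- table(original = toy$HOMEOWNER, recoded = toy$HOMEOWNER_BIN)
print(check)
```

On the full data the same check would compare VetData$HOMEOWNER against Vet$HOMEOWNER. Any count in an off-diagonal cell would point to an unexpected level, such as a blank string.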

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Can past donations be predictive of future donations?

Cases

What are the cases, and how many are there? The PVA solicits donations from past and prospective donors across the United States. Each case is a donor; the dataset contains 3648 donors who gave in response to a recent solicitation.

Data collection

Describe the method of data collection.

Solicitation: the PVA periodically mails greeting cards and address labels together with its requests for donations.

Type of study

What type of study is this (observational/experiment)? This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

The dataset was submitted by the PVA to the annual KDD competition, and I downloaded it from http://www.kdnuggets.com/.

Dependent Variable

What is the response variable? Is it quantitative or qualitative? The response variable is GIFTAMNT, which is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative. I have two independent variables:

Quantitative variable: AGE
Qualitative variable: HOMEOWNER

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(Vet$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   50.00   63.50   62.36   75.00   98.00
describe(Vet$AGE)
##    vars    n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 3648 62.36 15.74   63.5   62.72 18.53  21  98    77 -0.18    -0.84
##      se
## X1 0.26
summary(Vet$GIFTAMNT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   14.00   15.72   20.00  200.00
describe(Vet$GIFTAMNT)
##    vars    n  mean    sd median trimmed mad min max range skew kurtosis
## X1    1 3648 15.72 12.03     14   14.16 8.9   1 200   199 4.76    46.22
##     se
## X1 0.2

The youngest donor is 21 years old and the oldest is 98. The average age of the donors is about 62 years (median 63.5). The smallest gift is $1 and the largest is $200. The average gift is $15.72, slightly above the median of $14.

round(prop.table(table(Vet$HOMEOWNER)) * 100, digits = 1)
## 
##    0    1 
## 32.9 67.1

After recoding (Y = yes = 1, N = no = 0), the table shows that about 67% of the donors are homeowners.

boxplot(Vet$AGE)

boxplot(Vet$GIFTAMNT)

Many gift amounts lie far above the median of $14. The distribution is strongly right-skewed (skew 4.76), with a maximum of $200.
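
To make "far from the median" concrete, one common convention (the default behind boxplot whiskers) flags values that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below works through that arithmetic using the quartiles reported by summary(Vet$GIFTAMNT) above. It is an approximation, since boxplot() computes hinges that can differ slightly from these quartiles, and the variable names are hypothetical.

```r
# Quartiles of GIFTAMNT as reported by summary() above
q1 <- 10
q3 <- 20

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # gifts above this are drawn as individual points
lower_fence <- q1 - 1.5 * iqr   # below the $1 minimum, so no low-end outliers

print(c(lower = lower_fence, upper = upper_fence))
```

Under this convention, gifts above roughly $35 count as outliers. On the full data, sum(Vet$GIFTAMNT > upper_fence) would count them.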

# Refer to columns by name inside aes(); using Vet$... there bypasses the data argument
ggplot(data = Vet, aes(x = AGE)) + geom_histogram(binwidth = 2, position = "identity", alpha = 0.5) +
  labs(title = "Age of Donors", x = "Age", y = "Frequency")

ggplot(data = Vet, aes(x = GIFTAMNT)) + geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
  labs(title = "Gift Amount", x = "Gift", y = "Frequency")
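
Because HOMEOWNER and AGE are the two predictors, it may also help to summarise GIFTAMNT against each of them. The sketch below is illustrative only: it uses the six rows printed by head(Vet) earlier (with HOMEOWNER already recoded to 0/1), and the names toy, group_means and age_gift_cor are hypothetical. Six rows are far too few to support any conclusion.

```r
# Toy data: the six rows printed by head(Vet), with HOMEOWNER recoded (N = 0, Y = 1)
toy <- data.frame(
  AGE       = c(63, 78, 53, 85, 32, 66),
  HOMEOWNER = c(0, 1, 1, 0, 0, 1),
  GIFTAMNT  = c(15, 10, 15, 10, 20, 5)
)

# Mean gift amount within each homeowner group
group_means <- tapply(toy$GIFTAMNT, toy$HOMEOWNER, mean)
print(group_means)

# Pearson correlation between age and gift amount
age_gift_cor <- cor(toy$AGE, toy$GIFTAMNT)
print(age_gift_cor)
```

On the full data the same calls would be applied to Vet. There, boxplot(GIFTAMNT ~ HOMEOWNER, data = Vet) and plot(Vet$AGE, Vet$GIFTAMNT) would show the same comparisons graphically.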