Loading Packages

library(foreign)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(psych)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode

Introduction:

What is your research question? Why do you care? Why should others care? Research Questions:

  1. Can the PVA Gift amount be predicted based on the age and home ownership of the donors?

The Paralyzed Veterans of America (PVA) is a philanthropic organization sanctioned by the US government to represent the interests of disabled veterans. Since 1946, the PVA has raised money to support a variety of activities, including advocacy for veterans’ health care, research and education on spinal cord injury and disease, and support for veterans’ benefits and rights.

Data:

Write about the data from your proposal in text form. Address the following points:

The data set was submitted by the PVA to the annual KDD competition, and I downloaded it from http://www.kdnuggets.com/.

The PVA solicits donations from past and prospective donors across the United States. The dataset contains 3648 donors who gave to a recent solicitation.

I have a quantitative response variable, GIFTAMNT, and two independent variables: AGE (quantitative) and HOMEOWNER (qualitative).

This is an observational study.

The PVA engages in list rental, and most of the PVA’s money is spent on mailings to people who never respond. If the PVA could avoid mailing to those people, it could potentially save several million dollars a year and waste less paper on mailings.

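# stringsAsFactors = TRUE keeps HOMEOWNER as a factor, as in the str() output below (R >= 4.0 defaults to FALSE)
VetData <- read.csv("https://raw.githubusercontent.com/Emahayz/Data-606-Class/master/VetData.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)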

Exploratory data analysis:

Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.

Viewing the data with 27 variables

str(VetData)
## 'data.frame':    3648 obs. of  27 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 1 ...
##  $ HIT      : int  0 1 0 0 0 25 0 14 12 0 ...
##  $ MALEVET  : int  32 23 43 29 30 26 33 38 22 28 ...
##  $ VIETVETS : int  41 46 18 20 30 26 36 36 57 13 ...
##  $ WWIIVETS : int  20 28 35 52 44 50 31 22 23 12 ...
##  $ LOCALGOV : int  7 8 6 12 10 7 5 2 9 4 ...
##  $ STATEGOV : int  2 2 4 1 2 3 1 4 44 2 ...
##  $ FEDGOV   : int  1 3 2 5 2 1 4 0 3 0 ...
##  $ CARDPROM : int  17 27 16 20 29 27 21 17 5 23 ...
##  $ MAXADATE : int  9702 9702 9702 9702 9702 9702 9702 9702 9702 9702 ...
##  $ NUMPROM  : int  45 63 44 48 63 65 51 41 13 58 ...
##  $ CARDPM12 : int  4 6 6 4 6 5 6 6 3 3 ...
##  $ NUMPRM12 : int  10 13 12 9 13 12 13 13 8 6 ...
##  $ NGIFTALL : int  5 12 5 15 9 21 13 4 1 23 ...
##  $ CARDGIFT : int  3 7 1 10 6 15 8 3 0 13 ...
##  $ MINRAMNT : num  5 5 5 5 5 3 2 5 15 3 ...
##  $ MINRDATE : int  9103 9004 9203 9506 8812 9404 9404 9301 9507 9412 ...
##  $ MAXRAMNT : num  15 20 15 11 20 5 11 30 15 10 ...
##  $ MAXRDATE : int  9509 9409 9507 9407 9502 8910 9505 9601 9507 8805 ...
##  $ LASTGIFT : num  15 20 15 10 20 5 5 30 15 5 ...
##  $ AVGGIFT  : num  10.6 11 10.4 7.13 11.44 ...
##  $ CONTROLN : int  98282 166937 175951 147641 41222 65806 141308 142203 79580 105930 ...
##  $ HPHONE_D : int  0 0 0 0 1 1 0 1 1 0 ...
##  $ CLUSTER2 : Factor w/ 63 levels ".","1","10","11",..: 39 11 17 17 37 63 32 44 17 12 ...
##  $ CHILDREN : int  0 0 0 0 0 0 2 0 0 0 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...

Preprocessing

The dataset has 3648 observations of 27 variables, but I’m interested in only three of them: AGE, HOMEOWNER and GIFTAMNT. I created a new data frame (Vet) with only these three variables of interest.

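# Select the three variables of interest by name rather than by column position
Vet <- VetData[, c("AGE", "HOMEOWNER", "GIFTAMNT")]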
str(Vet)
## 'data.frame':    3648 obs. of  3 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: Factor w/ 2 levels "N","Y": 1 2 2 1 1 2 2 2 2 1 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...
head(Vet)
##   AGE HOMEOWNER GIFTAMNT
## 1  63         N       15
## 2  78         Y       10
## 3  53         Y       15
## 4  85         N       10
## 5  32         N       20
## 6  66         Y        5

Relevant summary statistics

summary(Vet$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   50.00   63.50   62.36   75.00   98.00
describe(Vet$AGE)
##    vars    n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 3648 62.36 15.74   63.5   62.72 18.53  21  98    77 -0.18    -0.84
##      se
## X1 0.26
summary(Vet$GIFTAMNT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   14.00   15.72   20.00  200.00
describe(Vet$GIFTAMNT)
##    vars    n  mean    sd median trimmed mad min max range skew kurtosis
## X1    1 3648 15.72 12.03     14   14.16 8.9   1 200   199 4.76    46.22
##     se
## X1 0.2

The youngest donor is 21 years old and the oldest is 98. The average age of the donors is about 62 years. The minimum amount donated is $1, while the maximum is $200. The average amount donated is $15.72, and the median is $14.

round(prop.table(table(Vet$HOMEOWNER)) * 100, digits = 1)
## 
##    N    Y 
## 32.9 67.1

The table shows that about 67% of the donors are homeowners.

boxplot(Vet$AGE)

boxplot(Vet$GIFTAMNT)

Many gift amounts lie far above the median gift amount of $14 and appear as outliers in the boxplot.
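As a rough check on which gifts a boxplot flags, the conventional 1.5 × IQR rule can be applied to the quartiles reported by summary(Vet$GIFTAMNT). This is a minimal sketch: the quartile values are copied from that output, the variable names are illustrative, and boxplot() computes its hinges slightly differently, so the fences are approximate.

```r
# Quartiles copied from the summary(Vet$GIFTAMNT) output
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Conventional boxplot fences: points beyond these are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Under this rule, gifts above roughly $35 are drawn as outliers. Because the lower fence is negative, no small gift can be flagged, which fits the strong right skew reported by describe().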

Visualization using ggplot

ggplot(data = Vet, aes(x = AGE)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5) +
  labs(title = "Age of Donors", x = "Age", y = "Frequency")

ggplot(data = Vet, aes(x = GIFTAMNT)) +
  geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
  labs(title = "Gift Amount", x = "Gift", y = "Frequency")

Inference:

Description of methodology

Inference

H0: There is no association between Gift Amount and the variables of Age or Home ownership.

HA: There is an association between Gift Amount and at least one of the variables of Age or Home ownership.

The Normal Probability Plot

The normal probability plots show that neither the gift amount nor the age of the donors is normally distributed: the data points do not closely follow the reference line.

qqnorm(Vet$AGE)
qqline(Vet$AGE)

qqnorm(Vet$GIFTAMNT)
qqline(Vet$GIFTAMNT)

inference(y = HOMEOWNER, data = Vet, statistic = "proportion", type = "ci", method = "theoretical", success = "Y")
## Single categorical variable, success: Y
## n = 3648, p-hat = 0.6708
## 95% CI: (0.6555 , 0.686)

We are 95% confident that the percentage of donors who own a home is between 65.6% and 68.6%.
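The interval reported by inference() can be reproduced by hand with the usual normal-approximation formula for a proportion, p-hat ± z × sqrt(p-hat(1 − p-hat)/n). The sketch below assumes that this is the formula behind method = "theoretical"; the inputs are copied from the output above and the variable names are illustrative.

```r
# Values copied from the inference() output
p_hat <- 0.6708
n <- 3648

# Standard error of a sample proportion and the 95% normal critical value
se <- sqrt(p_hat * (1 - p_hat) / n)
z <- qnorm(0.975)

ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
round(ci, 4)
```

The result agrees with the reported interval to about three decimal places; small differences come from p-hat being rounded to four digits in the printed output.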

I will use linear regression to address my research question. Because HOMEOWNER is a categorical variable, it must enter the regression as a numeric indicator (dummy) variable so that its coefficient is meaningful. Therefore, I recode the categories Yes “Y” and No “N” to binary “1” and “0”.

Recoding the categorical variable with “Y” and “N” to binary “1” and “0”

Vet$HOMEOWNER <- ifelse(Vet$HOMEOWNER == "Y", 1, 0)
str(Vet)
## 'data.frame':    3648 obs. of  3 variables:
##  $ AGE      : int  63 78 53 85 32 66 42 51 96 40 ...
##  $ HOMEOWNER: num  0 1 1 0 0 1 1 1 1 0 ...
##  $ GIFTAMNT : num  15 10 15 10 20 5 4 40 21 5 ...
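For comparison, lm() applies treatment (dummy) coding to factor predictors by default, so the manual 0/1 recoding is equivalent to leaving HOMEOWNER as a factor. The sketch below illustrates this on a small made-up factor; the toy values and variable names are assumptions and are not taken from the data set.

```r
# Toy factor standing in for Vet$HOMEOWNER (values are illustrative)
homeowner <- factor(c("N", "Y", "Y", "N", "Y"))

# Manual recoding, as done in the analysis
manual <- ifelse(homeowner == "Y", 1, 0)

# Design matrix that lm() would build for a formula containing the factor
X <- model.matrix(~ homeowner)

# The indicator column for level "Y" matches the manual 0/1 coding
all(X[, "homeownerY"] == manual)
```

With the factor left in place, the coefficient would be labelled HOMEOWNERY instead of HOMEOWNER, but it would have the same interpretation: the difference in expected gift between homeowners and non-homeowners at a fixed age.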

The Linear Regression Model

VetModel <- lm(GIFTAMNT~ AGE + HOMEOWNER, data = Vet)
summary(VetModel)
## 
## Call:
## lm(formula = GIFTAMNT ~ AGE + HOMEOWNER, data = Vet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.031  -6.265  -1.851   4.315 184.885 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.89192    0.87633  20.417   <2e-16 ***
## AGE         -0.03051    0.01267  -2.409    0.016 *  
## HOMEOWNER   -0.40634    0.42413  -0.958    0.338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.02 on 3645 degrees of freedom
## Multiple R-squared:  0.001779,   Adjusted R-squared:  0.001232 
## F-statistic: 3.249 on 2 and 3645 DF,  p-value: 0.03894

The Equation for the Regression is thus:

\[ \hat{y} = 17.89192 - 0.03051 \times \text{AGE} - 0.40634 \times \text{HOMEOWNER} \]

The \(R^2\) of 0.001779 shows that only about 0.18% of the variability in GIFTAMNT is explained by AGE and HOMEOWNER. The p-value of the overall F-test is 0.03894, which is less than 0.05, indicating that H0 (no association between Gift Amount and the variables of Age or Home ownership) should be rejected at the 5% level.
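The overall F statistic and its p-value follow directly from \(R^2\), the number of predictors and the sample size, which shows how a very small \(R^2\) can still be statistically significant when n is large. This is a minimal sketch using values copied from the summary(VetModel) output; the variable names are illustrative.

```r
# Values copied from the summary(VetModel) output
r2 <- 0.001779   # Multiple R-squared
n  <- 3648       # observations
k  <- 2          # predictors (AGE, HOMEOWNER)

# Overall F statistic expressed through R-squared
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability under the F(k, n - k - 1) distribution
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The values agree with the reported F-statistic and p-value up to rounding of \(R^2\). Statistical significance here reflects the large sample rather than useful predictive power.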

anova(VetModel)
## Analysis of Variance Table
## 
## Response: GIFTAMNT
##             Df Sum Sq Mean Sq F value  Pr(>F)  
## AGE          1    806  806.34  5.5793 0.01823 *
## HOMEOWNER    1    133  132.65  0.9179 0.33810  
## Residuals 3645 526784  144.52                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA result, the p-value for HOMEOWNER is 0.33810, which is greater than 0.05, so HOMEOWNER on its own gives no evidence against the null hypothesis. However, the p-value for AGE is 0.01823, which is less than 0.05, indicating that at least one predictor is associated with the response variable. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.
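The entries of the ANOVA table can be checked by hand. anova() on a single lm fit reports sequential sums of squares, so the AGE row is not adjusted for HOMEOWNER while the HOMEOWNER row is adjusted for AGE. This is why the AGE p-value here (0.018) differs slightly from the t-test in summary(VetModel) (0.016), while the HOMEOWNER p-values agree. The sketch below uses numbers copied from the table; the variable names are illustrative.

```r
# Sums of squares copied from the anova(VetModel) table (each term has 1 df)
ss_age       <- 806.34
ss_homeowner <- 132.65
ss_resid     <- 526784
df_resid     <- 3645

ms_resid <- ss_resid / df_resid

# Each sequential F statistic is the term's mean square over the residual mean square
f_age       <- (ss_age / 1) / ms_resid
f_homeowner <- (ss_homeowner / 1) / ms_resid

# Share of total variation attributed to the two predictors (equals R-squared)
r2 <- (ss_age + ss_homeowner) / (ss_age + ss_homeowner + ss_resid)

round(c(F_AGE = f_age, F_HOMEOWNER = f_homeowner, R2 = r2), 4)
```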

Check of Conditions:

  1. Test for constant variance: ncvTest
ncvTest(VetModel)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 32.71743, Df = 1, p = 1.0658e-08

The test rejects the null hypothesis of constant error variance (p < 0.001), so the constant-variance condition is not met.

  2. Test for linearity: crPlots
crPlots(VetModel)

The component-plus-residual plots indicate that neither predictor shows a clearly linear relationship with the gift amount.

  3. Test for normality: Shapiro-Wilk normality test
shapiro.test(residuals(VetModel))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(VetModel)
## W = 0.68621, p-value < 2.2e-16

The p-value is less than 0.05, so the null hypothesis of normality is rejected: the residuals are not normally distributed.

  4. Test for independence of errors (autocorrelation): Durbin-Watson test
durbinWatsonTest(VetModel)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.0321946      2.064341   0.038
##  Alternative hypothesis: rho != 0

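The p-value is less than 0.05 (p-value = 0.038), so the null hypothesis of no autocorrelation (rho = 0) is rejected at the 5% level; strictly, this is evidence against independence of the errors. However, the estimated lag-1 autocorrelation is very small (about -0.03) and the D-W statistic (2.06) is close to 2, so any dependence between successive errors is weak.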
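To make the Durbin-Watson output easier to read, the statistic can be computed by hand as the sum of squared successive differences of the residuals divided by their sum of squares, which is approximately 2(1 − r) for lag-1 autocorrelation r. The sketch below uses simulated residuals purely for illustration; the helper names dw_stat and lag1_autocor are hypothetical and are not part of the car package.

```r
# Durbin-Watson statistic: squared successive differences over squared residuals
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals, as used in the approximation
lag1_autocor <- function(e) {
  n <- length(e)
  sum(e[-1] * e[-n]) / sum(e^2)
}

# Simulated independent residuals, used only to illustrate the calculation
set.seed(1)
e <- rnorm(500)
c(DW = dw_stat(e), approx = 2 * (1 - lag1_autocor(e)))

# The same approximation applied to the autocorrelation reported by the test
dw_from_output <- 2 * (1 - (-0.0321946))
dw_from_output
```

Values near 2 correspond to little or no lag-1 autocorrelation, values well below 2 to positive autocorrelation, and values well above 2 to negative autocorrelation.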

Predicting Gift Amount with Age and Home owner variables

plot(Vet$GIFTAMNT ~ Vet$HOMEOWNER + Vet$AGE)

abline(VetModel)
## Warning in abline(VetModel): only using the first two of 3 regression
## coefficients

The plot shows that the relationship between the variables is not well described by a straight line and that predicting gift amount from these variables is not reliable.
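The warning above arises because abline() draws a single line from an intercept and one slope, so it uses only the first two of the three coefficients. One way to show the full model on a plot of gift amount against age is to draw one line per ownership group, each with its own intercept and a common slope. This is a sketch built from the coefficients reported by summary(VetModel); the data frame name lines_by_group is illustrative, and the plotting calls are left as comments because they need the Vet data frame.

```r
# Coefficients copied from the summary(VetModel) output
b <- c(intercept = 17.89192, age = -0.03051, homeowner = -0.40634)

# One fitted line per group: the HOMEOWNER coefficient shifts the intercept
lines_by_group <- data.frame(
  homeowner = c(0, 1),
  intercept = b[["intercept"]] + b[["homeowner"]] * c(0, 1),
  slope     = b[["age"]]
)
lines_by_group

# With Vet loaded, the two lines could be overlaid on the scatterplot:
# plot(GIFTAMNT ~ AGE, data = Vet)
# abline(a = lines_by_group$intercept[1], b = lines_by_group$slope[1], lty = 1)
# abline(a = lines_by_group$intercept[2], b = lines_by_group$slope[2], lty = 2)
```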

Conclusion:

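There is no statistically significant relationship between home ownership and gift amount (p = 0.338), but there is a weak, statistically significant relationship between age and gift amount (p = 0.016). However, the various tests and analyses show that the variables AGE and HOMEOWNER are not sufficient to predict a donor’s gift amount: together they explain less than 1% of its variability, and the constant-variance, linearity and normality checks all indicate problems. This indicates that the gift amount of each donor to the PVA solicitation depends very little on the age and home ownership status of the donor.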

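Therefore, to answer the research question of whether the PVA gift amount can be predicted based on the age and home ownership of the donors: not usefully. Using the regression equation, the predicted gift amount ranges only from about $14.50 (the oldest donor, 98 years, with home ownership Yes = 1) to about $17.25 (the youngest age, 21 years, with home ownership No = 0), while the actual gifts range from $1 to $200.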
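The range of predictions implied by the regression equation can be checked by evaluating it at the extremes of the observed ages (21 and 98) for both ownership statuses. Because both slopes are negative, the oldest homeowner receives the lowest prediction and the youngest non-homeowner the highest. This is a minimal sketch; predict_gift is a hypothetical helper that hard-codes the coefficients reported by summary(VetModel).

```r
# Fitted equation with coefficients copied from the summary(VetModel) output
predict_gift <- function(age, homeowner) {
  17.89192 - 0.03051 * age - 0.40634 * homeowner
}

# Extremes of the observed age range for each ownership status
grid <- expand.grid(age = c(21, 98), homeowner = c(0, 1))
grid$predicted <- predict_gift(grid$age, grid$homeowner)
grid

# Lowest and highest predicted gift over these combinations
range(grid$predicted)
```

The predictions span less than three dollars, while the observed gifts span $1 to $200, which illustrates how little of the variation the model captures.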

Future research should consider feature selection among the other available variables, since the original data set has 27 variables. Some of those variables may be more relevant for predicting future donations or gift amounts for the PVA.