========================================================

Applied Regression Analysis: Project 4 Logistic Regression

Ali Svoobda

RPI

5/14/15

1. Dataset Selection

The data used in this analysis was gathered from the United States Census Bureau via the US Government's open data portal (www.data.gov). From this site, the commodity flow data was selected from the list of manufacturing datasets.

Link to information about all tables in the commodity flow survey: http://www.census.gov/econ/cfs/

Link to the full dataset: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=CFS_2012_00A08&prodType=table

The code below reads in the dataset, which was found on data.gov and reformatted in Excel:

commodities <- read.csv("~/1.RENSSELAER POLYTECHNIC INSTITUTE/a- Senior Spring/Applied Regression Analysis/commoditydata.csv")

View(commodities)

Description of Dataset

The dataset is a geographic area series of shipment characteristics within the U.S., by commodity and by distance shipped. Many commodities are included, such as foods, chemicals, natural resources, building materials, and manufactured products. There are several observations for each commodity, and each observation includes the following variables:

  - geographical identifier code and area name
  - commodity code and meaning
  - distance shipped code and actual distance (miles)
  - year (2012 for all in the selected set)
  - value (in $million)
  - tons (in thousands)
  - ton-miles (millions)
  - coefficient of variation for value, tons, and ton-miles

Only commodity type, value, and tons will be examined for this project.

Dependent Variable

For this analysis, commodity category will be the dependent variable. Since a dichotomous categorical dependent variable is desired, a commodity code (comm) of either 1 or 0 was assigned to two different groups of commodities. Because this was a large dataset, some commodities were excluded from the analysis to create two distinct groups.

Consumable/Food Commodities (1)

comm = 1 represents consumable, food-like commodities. This includes the following: live animals and fish; cereal grains; agricultural products; products of animal origin; meat, poultry, fish, seafood, and their preparations; milled grain products, preparations, and bakery products; other prepared foodstuffs; alcoholic beverages; and tobacco products.

Other Manufactured Commodities (0)

comm = 0 refers to other manufactured products that an American consumer may purchase, including: pharmaceutical products; printed products; textiles, leather, and articles of textiles or leather; machinery; electronics; motorized vehicles; precision instruments; furniture; and other manufactured products.

Altogether the dataset has 175 observations, with roughly equal numbers in each comm group.

As mentioned above, commodities that did not fit into one of these two groups, such as building materials and chemicals, were not included.

These commodities and the unused variables were removed from the Excel file before it was read into R. The binary commodity codes were also assigned in Excel.
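
For reproducibility, the same recode could also be done in R rather than Excel. Below is a minimal sketch, assuming a hypothetical raw data frame raw with a numeric commodity-code column comm_code; the actual column names and code lists in the Excel file may differ:

# illustrative commodity-code groups (assumed for this sketch, not taken from the actual file)
food_codes <- 1:9                     # consumable/food commodities -> comm = 1
manu_codes <- c(21, 29, 30, 34:40)    # other manufactured goods    -> comm = 0
raw$comm <- ifelse(raw$comm_code %in% food_codes, 1,
            ifelse(raw$comm_code %in% manu_codes, 0, NA))
raw <- raw[!is.na(raw$comm), ]        # drop commodities outside both groups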

Since not all of the variables are utilized, we eliminate the unused ones to simplify the analysis. The commodity category comm is kept as a numeric 0/1 indicator so that the correlation matrix and glm() calls below can use it directly:

# keep only commodity category (comm), value, and tons (columns 1, 3, and 4)
data <- commodities[c(1,3,4)]

Independent Variables

Two continuous variables, value of the shipment and tons shipped, were chosen as the independent variables to see if either could explain the type of commodity being shipped.

Value (value) - value of commodities shipped, in millions of US dollars
Tons (tons) - weight of the shipment, in thousands of tons

Initial Plots

In order to learn more about this dataset, some initial boxplots were created:

par(mfrow=c(1,2))
boxplot(data$value, main="Boxplot of Shipment Value")
boxplot(data$tons, main="Boxplot of Tons Shipped")


There are several outliers in both value and tons shipped, indicating that very expensive or very heavy shipments are relatively rare.
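
The skew seen in the boxplots can be quantified with five-number summaries; a minimal sketch:

# five-number summaries show how far the outliers sit above the medians
summary(data$value)
summary(data$tons)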

Now, we separate each of these independent variables by the type of commodity:

par(mfrow=c(1,1))
boxplot(data$value~data$comm, main="1. Boxplot of Shipment Value")


boxplot(data$tons~data$comm, main="2. Boxplot of Tons (in thousands) Shipped")


  1. Looking at the differences between consumable food commodities (1) and other manufactured goods (0), manufactured goods shipments appear to be more expensive. There is also a greater spread for manufactured goods, while consumable/food shipments cost less and their cost does not vary as much.

  2. The weight of manufactured goods shipments is more consistent than that of consumable foods. The average weights seem fairly similar, although foods may have a slightly higher average.

There appears to be a larger difference between the two groups by value than by tons.
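
Group medians make this visual comparison concrete; a minimal sketch using base R's tapply:

# median shipment value and weight, split by commodity group (0 = manufactured, 1 = food)
tapply(data$value, data$comm, median)
tapply(data$tons, data$comm, median)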

2. Building the Regression Model

We will create an explanatory multiple logistic regression model. It is explanatory because we are trying to uncover patterns in shipments, i.e., whether some commodities tend to have a higher value or greater weight than others. We may need to take a random sample to get a more appropriate sample size, but the full set might be acceptable since we already reduced its size.

Null Hypothesis

The null hypothesis is that the variation in the dependent variable, commodity category, cannot be explained by anything other than randomization. In other words, for the model described below, variation in shipment value and tons shipped cannot explain the variation in commodity type.

Examine the scatterplot matrix for relationships between commodity type and the independent variables:

plot(data)


The vertical bands of points appear because comm can take only two values (0 and 1).

Building the Model

The model will be built using the step-wise method. We add the most correlated independent variable first, then continue adding variables as long as they contribute to explaining the variation in the dependent variable (assessed here through the residual deviance and AIC, since R^2 does not directly apply to logistic models).

To determine the order, we build a correlation matrix:

cor(data)
##          comm   value   tons
## comm   1.0000 -0.2488 0.2421
## value -0.2488  1.0000 0.3610
## tons   0.2421  0.3610 1.0000

There is a correlation of about 0.25 in magnitude between commodity type and each explanatory variable. Although similar in magnitude, the relationships are opposite in sign: commodity type and value have an inverse (negative) correlation, while commodity type and tons have a positive correlation.

The two independent variables are also correlated (0.36), which makes sense: intuition suggests that larger shipments weigh more, and a large shipment of an item will cost more than a smaller shipment of the same item. The correlation between the two independent variables means we could see suppression in the model when both are included.
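
As a cross-check on the entry order, each candidate variable can be scored against the intercept-only model with R's add1(); a minimal sketch:

# score each candidate predictor against the null (intercept-only) model
null_model <- glm(comm ~ 1, data = data, family = binomial(logit))
add1(null_model, scope = ~ value + tons, test = "Chisq")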

Model 1: Commodity Category by Value of Shippment

model1<-glm(data$comm~data$value, family = binomial(logit))
summary(model1)
## 
## Call:
## glm(formula = data$comm ~ data$value, family = binomial(logit))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3334  -1.1866  -0.0824   1.0539   2.8225  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  4.02e-01   2.05e-01    1.96   0.0504 . 
## data$value  -7.47e-06   2.68e-06   -2.79   0.0052 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.46  on 174  degrees of freedom
## Residual deviance: 226.34  on 173  degrees of freedom
## AIC: 230.3
## 
## Number of Fisher Scoring iterations: 5

Interpretation

Value of a shipment tested as a significant predictor of commodity type in this model. Value returned a p-value of 0.005, meaning there is a very small chance that the variation in commodity type is due purely to randomization (reject the null hypothesis). The estimate of the intercept is 0.4, and the estimated slope of the model's equation is very small, -7.5E-6. In other words, for every one-unit decrease in shipment value (millions of dollars), the odds of a shipment containing a consumable food commodity increase only slightly. However, since shipment values can range from 0 to millions and even billions of dollars, the cumulative effect over that range can still shift the predicted odds substantially.
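
Because logistic coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read; a minimal sketch using Wald intervals from confint.default:

# odds ratio: multiplicative change in the odds of comm = 1 per one-unit increase in value
exp(coef(model1))
# 95% Wald confidence intervals on the odds-ratio scale
exp(confint.default(model1))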

Now we will see if adding the second explanatory variable, tons shipped, provides a more predictive model.

Model 2: Commodity Category by Value of Shippment and Tons Shipped

The second independent variable is added to the model to see if, together, shipment value and weight in tons can better predict the commodity type:

model2<-glm(data$comm~data$value + data$tons, family = binomial(logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model2)
## 
## Call:
## glm(formula = data$comm ~ data$value + data$tons, family = binomial(logit))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.168  -0.180   0.000   0.349   3.613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  8.57e-01   3.15e-01    2.72   0.0065 ** 
## data$value  -1.28e-04   2.57e-05   -4.99  6.1e-07 ***
## data$tons    4.28e-04   8.66e-05    4.95  7.6e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.459  on 174  degrees of freedom
## Residual deviance:  85.871  on 172  degrees of freedom
## AIC: 91.87
## 
## Number of Fisher Scoring iterations: 10

Interpretation

For model 2, both predictors of commodity type and the intercept had very low p-values (6.1E-7 for value, 7.6E-7 for tons, 0.0065 for the intercept), indicating we should reject the null hypothesis. It is likely that something other than randomization, such as shipment value or tons shipped, accounts for the variation in commodity type. Although still small, the estimates were slightly larger this time, with slopes of -1.28E-4 for value and 4.28E-4 for tons, and an intercept of 0.86. Note the warning that fitted probabilities of numerically 0 or 1 occurred: some observations are classified almost perfectly (near-separation), so the coefficient estimates and standard errors should be read with some caution.
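
Whether tons is worth adding can also be tested formally with a likelihood-ratio test between the two nested models; a minimal sketch:

# likelihood-ratio test: does adding tons significantly reduce the deviance?
anova(model1, model2, test = "Chisq")

The drop in residual deviance reported above (226.34 to 85.87 on one degree of freedom) corresponds to a very small p-value, so model 2 is a clear improvement.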

3. Plots: Testing the Assumptions

A series of plots were created to check the various assumptions of regression models. Model 2 will be used, since it explained more of the variation in commodity type.

LINE: L = Linear, I = Independent, N = Normally distributed, E = Equal variances

3.1 Linear

The first assumption is that the expected value of the response at each set of predictor values Xij is a linear function of the predictors Xij; equivalently, the expected value of the error term is 0.

To check this, we examine the Standardized Residuals vs the fitted values and the mean of the standardized residuals:

standard.resid<-rstandard(model2)

plot(fitted(model2), standard.resid, pch=18,main="Standardized Residuals", xlab = "Fitted Values", ylab = "Residual Values")
abline(0,0, lwd=2, col="dark blue")


mean(standard.resid)
## [1] 0.02489

The residuals show the difference between the actual and fitted values of the model. To fulfill this assumption and show the model is a good fit, they should be spread across the dynamic range and distributed evenly about 0. The calculated mean is close to zero; however, the residuals are not evenly spread, as there is a clear trend in the two curved bands (a typical artifact of a binary response).
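
For a binary response, raw residual plots are hard to read, so a binned-residual plot is often more informative. Below is a hand-rolled sketch of the idea behind arm::binnedplot, averaging residuals within bins of the fitted values:

# binned residuals: average response-scale residual within 10 bins of fitted probability
fit  <- fitted(model2)
res  <- residuals(model2, type = "response")
bins <- cut(fit, breaks = 10)
plot(tapply(fit, bins, mean), tapply(res, bins, mean),
     pch = 18, xlab = "Mean Fitted Probability", ylab = "Mean Residual",
     main = "Binned Residuals")
abline(h = 0, lwd = 2, col = "dark blue")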

3.2 Independence

The second LINE assumption is that the errors are independent.

To test this, we plot just the residuals, which again should be spread out across the dynamic range:

plot(model2$residuals, main="Residuals", ylim=c(-5,5))


This assumption also appears violated, as the initial observations sit above zero while the second half sit below.
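
A quick numerical check for dependence in the ordering of observations is the autocorrelation function of the residuals; a minimal sketch:

# spikes outside the dashed confidence bands suggest correlated errors
acf(residuals(model2, type = "deviance"), main = "ACF of Deviance Residuals")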

3.3 Normality

The third assumption is that the errors at each Xij are normally distributed:

Examine Q-Q Plot:

qqnorm(rstandard(model2))
qqline(rstandard(model2))


The points deviate from the normal reference line, so the errors do not appear normally distributed.

We can also look at a histogram to test this assumption:

hist(rstandard(model2))


There is kurtosis in the data, as the histogram is strongly peaked, and it is also slightly skewed. Between these two plots, the model also fails the normality assumption.
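
A formal complement to the Q-Q plot and histogram is the Shapiro-Wilk test, applicable here since n = 175 falls within its supported range; a minimal sketch:

# null hypothesis of shapiro.test: the standardized residuals are normally distributed
shapiro.test(rstandard(model2))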

3.4 Equal Variances

Breusch-Pagan test

We test for heteroskedasticity to determine whether the residuals have equal variance. Null hypothesis: homoskedasticity (all residuals have equal variance). Alternative hypothesis: heteroskedasticity.

Before the test is run, a package must first be installed:

#install.packages("lmtest")
library("lmtest", lib.loc="~/R/win-library/3.1")
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Run Breusch-Pagan test:

bptest(data$comm~data$value+data$tons)
## 
##  studentized Breusch-Pagan test
## 
## data:  data$comm ~ data$value + data$tons
## BP = 12.86, df = 2, p-value = 0.001616

Since the p-value is very low, we reject the null hypothesis: there is evidence of heteroskedasticity in the model.

3.5 Logistic assumptions

So far, we have violated all four assumptions, which calls into question the quality and fit of the model. However, those assumptions do not necessarily need to hold for the logistic regression performed in this analysis.

Assumptions that do apply are as follows:

  1. The independent variables are linearly related to the log odds.
  2. No important variables are omitted and no extraneous variables are included.
  3. There is little or no multicollinearity; in other words, the independent variables are not linear combinations of each other.
  4. The observations are independent.
  5. Finally, the sample size is large enough to support a logistic model.

Not all of these are fulfilled, but enough may be to give the model some validity. We know from Issue 3 below that there is no serious multicollinearity. The observations are independent of each other, as they are all separate shipments from 2012 and no shipment is counted twice (the original dataset did have a few observations summarizing all shipments, but these were removed before the data were read into R). The sample is large enough, as we used a larger sample size than the one calculated in the section below. To the best of our knowledge, no important variables were excluded: value and weight were the only variables in the dataset that could describe a shipment. Similarly, no extraneous variables were included, as only two variables are in the model and both can be explained.

4. Issues

Four issues that may occur when running a regression analysis are discussed below.

Issue 1: Causality

The independent variables in this analysis, shipment value and weight (tons), are probabilistic and proximal causes of commodity type. They are probabilistic rather than deterministic because neither of the two can determine the commodity type of every shipment. A shipment with a lower value or smaller weight is not necessarily a food product; there can be small and large, or expensive and inexpensive, food shipments. They are probabilistic causes because, in some cases, they may identify a certain type of shipment.

They are proximal rather than ultimate causes, as the weight or value of a shipment does not cause the shipment to be a certain type. Other factors or characteristics can better identify the commodity in a shipment. The goal of this analysis was simply to see whether certain types of commodities tend to have certain weights or monetary values.

Issue 2: Sample Size

G*Power will be run to check the required sample size. This should not be an issue, since the requirement can be compared directly against the 175 observations in the reduced dataset.

Tails = two
Odds ratio = 2.33333333
P(Y=1|X=1) H0 = 0.3
alpha err prob = 0.05
Power (1-beta err prob) = 0.95
R^2 other X = 0
X distribution = normal, X parm mu = 0, X parm sd = 1

A sample size of 104 was calculated, which is below the 175 observations used in this analysis, so the sample is large enough.
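
A simple events-per-variable check complements the G*Power calculation; a common rule of thumb for logistic regression asks for at least 10 events per predictor. A minimal sketch:

# events per variable: size of the smaller outcome class divided by the number of predictors
table(data$comm)
min(table(data$comm)) / 2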

Issue 3: Collinearity

Collinearity becomes a problem when the independent variables in a model are highly correlated.

In addition to the correlation matrix above (r = 0.36 between value and tons), we can examine the scatter plot between the IVs:

plot(data$value, data$tons, xlab="Value", ylab="Tons", ylim=c(0,10000), xlim=c(0,10000))


There is not an obvious linear relationship between the two IVs. There is a small cluster near zero for each variable, but this can be attributed to observations of very small shipments: regardless of commodity type, a shipment of only a few items will not have a large value. So, collinearity should not be an issue.
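
Variance inflation factors give a numerical check on collinearity; a minimal sketch, assuming the car package is available:

#install.packages("car")
library(car)
# VIFs near 1 indicate little collinearity; values above roughly 5-10 are a concern
vif(model2)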

Issue 4: Measurement Error

There is a chance of measurement error when dealing with quantities at such a large scale, but the dataset comes from a reliable source, so this should not be an issue.

5. Conclusions

Overall, the value of a shipment and tons shipped were significant predictors of commodity type. The model was called into question when the standard regression assumptions were first examined, but once the assumptions specific to logistic regression were tested, most were fulfilled and we can consider the model reasonably valid. To extend this work, future analyses could use data from other years or examine whether value and weight can predict commodity type at a finer grain than consumable foods versus manufactured goods.