Part 1

library("readxl")
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library("lmPerm")
library("corrplot")
## corrplot 0.92 loaded
library("effects")
## Loading required package: carData
## Use the command
##     lattice::trellis.par.set(effectsTheme())
##   to customize lattice options for effects plots.
## See ?effectsTheme for details.
library(MASS) 
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
## 
##     Boston
library(car) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ tibble      3.1.6     ✓ tsibbledata 0.4.0
## ✓ tidyr       1.2.0     ✓ feasts      0.2.2
## ✓ lubridate   1.8.0     ✓ fable       0.3.1
## ✓ tsibble     1.1.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x dplyr::recode()      masks car::recode()
## x dplyr::select()      masks MASS::select()
## x tsibble::setdiff()   masks base::setdiff()
## x dplyr::src()         masks Hmisc::src()
## x dplyr::summarize()   masks Hmisc::summarize()
## x tsibble::union()     masks base::union()
library(ggplot2)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
setwd("~/Desktop/Big data and Econometrics")
data <- read_excel("~/Desktop/Big data and Econometrics/HW 1 dataset.xls")
head(data)
## # A tibble: 6 × 6
##   Year  `Cereal yield (kg p…` `Arable land (…` `Employment in…` `Employment in…`
##   <chr>                 <dbl>            <dbl>            <dbl>            <dbl>
## 1 1991                  4508.             20.3            1.05              2.64
## 2 1992                  5358.             20.1            1.04              2.62
## 3 1993                  4299.             20.0            1.02              2.59
## 4 1994                  5560              19.9            1.01              2.56
## 5 1995                  4645.             19.9            0.990             2.52
## 6 1996                  5175.             19.5            0.970             2.48
## # … with 1 more variable: `Land under cereal production (hectares)` <dbl>

1.Questions I’m trying to answer:

I built an agriculture dataset for the United States containing year, cereal yield, employment in agriculture (female and male), arable land and land under cereal production. Rather than a cross-sectional comparison across countries, I chose a single country and studied its cereal yield in depth. I then use this dataset to estimate the relationship between cereal yield and the other predictors through a multiple linear regression.

2.Data(sets) chosen to address the question; outcome variable (y) vs predictors (X)

I combine five variables (cereal yield, female and male employment in agriculture, arable land, and land under cereal production) into one dataset.

Outcome variable (y): cereal yield. Predictors (X): female and male employment in agriculture, arable land, and land under cereal production.

3.Assumptions, challenges and decisions made

Assumption: the response and the predictors have a linear relationship.

Challenges: 1. The relationship between the response and the predictors might be more complicated than a purely linear one. 2. The dataset lacks variables on agricultural mechanization and automation. 3. The dataset is small (28 annual observations), so some judgments on borderline statistics are necessarily subjective, which introduces uncertainty about the accuracy of the results. 4. Interaction terms may need to be considered, which makes the modelling more complicated.

data <- rename(data, y = "Cereal yield (kg per hectare)", x1 = "Arable land (% of land area)"
       , x2 = "Employment in agriculture, female (% of female employment) (modeled ILO estimate)"
       , x3 = "Employment in agriculture, male (% of male employment) (modeled ILO estimate)"
       , x4 = "Land under cereal production (hectares)")
data %>%
  ggplot(aes(x = Year, y = y)) +
  labs(y = "Cereal yield (kg per hectare)",
       x = "Year") +
  geom_point() 

The multiple linear regression model takes the form:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

cereal yield = β0 + β1×arable land + β2×female employment in agriculture + β3×male employment in agriculture + β4×land under cereal production + ε

To test whether there is a relationship between the response and the predictors, I perform a hypothesis test.

Hypothesis:

H0: β1 = β2 = β3 = β4 = 0
H1: at least one βj is non-zero.

Fit a multiple linear regression model using least squares

lm.fit <- lm(y ~ x1 + x2 + x3 + x4 , data = data)
class(lm.fit)
## [1] "lm"
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1517.6  -277.5   -26.6   177.1  1220.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.623e+04  4.002e+03   6.553  1.1e-06 ***
## x1          -9.916e+02  3.010e+02  -3.295  0.00317 ** 
## x2          -4.074e+02  6.357e+03  -0.064  0.94945    
## x3           1.127e+03  3.232e+03   0.349  0.73043    
## x4          -6.365e-05  4.996e-05  -1.274  0.21531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 669.4 on 23 degrees of freedom
## Multiple R-squared:  0.7354, Adjusted R-squared:  0.6894 
## F-statistic: 15.98 on 4 and 23 DF,  p-value: 2.164e-06

4.Results: coefficient estimates, confidence intervals, p-values (…)

Which predictors are associated with the response?

We can examine the p-value associated with each predictor's t-statistic:

tstats <- coef(lm.fit) / sqrt(diag(vcov(lm.fit))) # t-statistics: estimate divided by its standard error
2 * pt(abs(tstats), df = df.residual(lm.fit), lower.tail = FALSE) # two-sided p-values
##  (Intercept)           x1           x2           x3           x4 
## 1.097550e-06 3.171536e-03 9.494547e-01 7.304314e-01 2.153108e-01

From the output above, the p-value for x1 is by far the lowest (about 0.003) and the only one below the 5% level; the p-value for x4 (about 0.22) is the next lowest, while the p-values for x2 (female employment) and x3 (male employment) are close to 1. This suggests that x1, and to a lesser extent x4, are the predictors most closely related to the response.

Decisions made: the F-statistic is highly significant even though most of the individual coefficient p-values are not. A likely reason is the small sample (28 annual observations); with more data, the p-value for x4 might become significant as well. Based on the F-statistic and the R-squared value, which indicate a reasonable overall fit, I cautiously reject the null hypothesis that all slope coefficients are zero.
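
As a quick check on this decision, the p-value of the overall F-test can be extracted directly from the fitted model. This is a small sketch using the lm.fit object above; it should reproduce the p-value reported by summary(lm.fit).

fstat <- summary(lm.fit)$fstatistic # overall F-statistic with its degrees of freedom
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE) # p-value for H0: all slopes are zero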

The model in equation form:

coef(lm.fit)
##   (Intercept)            x1            x2            x3            x4 
##  2.622523e+04 -9.916266e+02 -4.074295e+02  1.127201e+03 -6.365457e-05
lm.fit.coeff <- coef(lm.fit) #save it

y = 26225.23 - 991.63x1 - 407.43x2 + 1127.20x3 - 0.0000637x4

Obtain 95% confidence intervals for the coefficients:

confint(lm.fit)
##                     2.5 %        97.5 %
## (Intercept)  1.794606e+04  3.450440e+04
## x1          -1.614272e+03 -3.689815e+02
## x2          -1.355879e+04  1.274394e+04
## x3          -5.558478e+03  7.812880e+03
## x4          -1.669960e-04  3.968680e-05
lm.fit.confint <- confint(lm.fit)

The confidence intervals are on very different scales (the intervals for x2 and x3 are extremely wide), and only the interval for x1 lies entirely away from zero. The intervals for the other three predictors all include zero, indicating that x2, x3 and x4 may not be statistically significant once the other predictors are accounted for.

Could collinearity be the reason that the confidence intervals associated with x2, x3 and x4 include zero?

Compute variance inflation factors to detect collinearity:

vif(lm.fit)
##        x1        x2        x3        x4 
##  7.153949 36.546123 44.507351  1.826102

We know that the smallest possible value of the VIF is 1, indicating a complete absence of collinearity; values above 5 or 10 indicate a problematic amount of collinearity. In this case the VIFs are 7.15 (x1), 36.55 (x2), 44.51 (x3) and 1.83 (x4), so there is substantial collinearity, mainly between x2 and x3. However, x4 has the lowest VIF, well below the problematic threshold, yet its confidence interval still includes zero. To sum up, collinearity is not the whole reason that the confidence intervals for x2, x3 and x4 include zero.

Possible ways to address the collinearity problem: 1. Drop one of the problematic variables from the regression; here I would prefer to drop x2 or x3. 2. Combine the collinear variables into a single predictor; it may be better to combine x2 and x3 (female and male employment) into one, as sketched below.
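
As an illustration of the second option, here is a minimal sketch that averages the female and male employment shares into a single predictor (called x23 here purely for illustration) and refits the model, so the VIFs can be compared with the original fit. A copy of the data frame is used so the original columns stay untouched.

data.comb <- data # work on a copy so the original data frame is unchanged
data.comb$x23 <- (data.comb$x2 + data.comb$x3) / 2 # combined employment predictor
lm.fit.comb <- lm(y ~ x1 + x23 + x4, data = data.comb)
vif(lm.fit.comb) # compare with the VIFs of the original model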

The correlation matrix with p-values and scatter plot matrix:

mydata <- data[,c(2:6)]
data.cor.matrix <- cor(as.matrix(mydata))
data.cor.matrix
##             y         x1         x2         x3         x4
## y   1.0000000 -0.8455592 -0.7656335 -0.7765894 -0.5319883
## x1 -0.8455592  1.0000000  0.9191753  0.9243053  0.5030934
## x2 -0.7656335  0.9191753  1.0000000  0.9842560  0.5467713
## x3 -0.7765894  0.9243053  0.9842560  1.0000000  0.6046583
## x4 -0.5319883  0.5030934  0.5467713  0.6046583  1.0000000
scatterplotMatrix(mydata, main="Scatter Plot Matrix")
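
Since the Hmisc package is already loaded, the p-values mentioned above can be obtained with rcorr(), which returns both the matrix of correlations and a matrix of p-values (a small sketch):

data.rcorr <- rcorr(as.matrix(mydata)) # pairwise correlations with significance tests
data.rcorr$r # correlation coefficients (same as cor() above)
data.rcorr$P # p-values for each pairwise correlation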

Plot correlation matrix

corrplot(data.cor.matrix, method = "color", type = "upper", order = "hclust",addCoef.col = "black")

We can see that all four predictors are negatively correlated with y, and that x2 and x3 are very highly correlated with each other (r ≈ 0.98), consistent with the collinearity found above.

5.Goodness of fit tests - what you think is appropriate

How strong is the relationship between the response and the predictors?

R^2:

summary(lm.fit)$r.sq
## [1] 0.7354141

RSE:

summary(lm.fit)$sigma
## [1] 669.3927

We know that the RSE estimates the standard deviation of the residuals around the regression line, and the R^2 statistic gives the proportion of variability in the response explained by the predictors. With R^2 of about 0.74, the predictors explain roughly three quarters of the variation in cereal yield, and the RSE is about 669 kg per hectare.
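
To make the RSE easier to interpret, a short sketch expresses it as a fraction of the mean cereal yield, giving a rough relative prediction error:

summary(lm.fit)$sigma / mean(data$y) # RSE relative to the average response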

Is the relationship linear? Also check for outliers and high-leverage observations in the model:

outlierTest(lm.fit)
##     rstudent unadjusted p-value Bonferroni p
## 22 -3.722581          0.0011836     0.033141
par(mfrow = c(2,2))
plot(lm.fit)

Residual plots can be used to identify non-linearity: if the relationship is linear, the residual plot should show no systematic pattern. In the figures above there is no clear pattern in the residuals, so the linearity assumption appears reasonable (the outlier test does flag observation 22, with a Bonferroni p-value of about 0.03). If a pattern were present, we could accommodate non-linear relationships by transforming the predictors in the linear regression model, as sketched below.
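
Purely as an illustration of that last point (not part of the final model), one option would be to add a quadratic term for x1 and test whether it improves on the linear specification:

lm.fit.quad <- lm(y ~ x1 + I(x1^2) + x2 + x3 + x4, data = data) # quadratic term for arable land
anova(lm.fit, lm.fit.quad) # F-test comparing the linear and quadratic specifications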

Q-Q plot

qqPlot(lm.fit, labels = row.names(data), id.method = "identify", 
       simulate = TRUE, main = "Q-Q Plot")

## [1] 22 28

Generate a histogram of the studentized residuals to examine their distribution more clearly.

residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)                           # studentized residuals
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")                  # data rug along the x-axis
  curve(dnorm(x, mean=mean(z), sd=sd(z)),      # normal reference curve
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,            # kernel density estimate
        col="red", lwd=2, lty=2)
  legend("topright",
         legend = c("Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue","red"), cex=.7)
}
residplot(lm.fit)

High leverage observations:

hat.plot <- function(fit) {
  p <- length(coefficients(fit))     # number of model parameters
  n <- length(fitted(fit))           # number of observations
  plot(hatvalues(fit), main="Index Plot of Hat Values")
  abline(h=c(2,3)*p/n, col="red", lty=2)                # reference lines at 2p/n and 3p/n
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))  # interactive labelling (returns integer(0) when knitting)
}
hat.plot(lm.fit)

## integer(0)

There are two high-leverage points in this chart, with hat values above the 2p/n cutoff line; the sketch below lists them explicitly.
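
To list those points explicitly rather than reading them off the plot, a short sketch flags the observations whose hat values exceed the usual 2p/n cutoff:

p <- length(coef(lm.fit)) # number of coefficients, including the intercept
n <- nrow(data) # number of observations
which(hatvalues(lm.fit) > 2 * p / n) # indices of high-leverage observations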

Is there synergy among the predictors? To check, I include interaction terms.

lm.fit1 <- lm(y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + x2:x3 + x2:x4 + x3:x4, data = data)
summary(lm.fit1)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x1:x2 + x1:x3 + x1:x4 + 
##     x2:x3 + x2:x4 + x3:x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -908.34 -229.69   18.21  157.78 1567.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  3.812e+05  1.276e+05   2.987  0.00828 **
## x1          -2.884e+04  1.146e+04  -2.518  0.02213 * 
## x2           3.426e+04  2.059e+05   0.166  0.86982   
## x3           3.438e+04  1.112e+05   0.309  0.76087   
## x4          -5.365e-03  1.525e-03  -3.518  0.00264 **
## x1:x2        1.047e+04  1.033e+04   1.013  0.32511   
## x1:x3       -9.509e+02  4.614e+03  -0.206  0.83918   
## x1:x4        3.809e-04  1.438e-04   2.650  0.01686 * 
## x2:x3       -3.830e+04  2.656e+04  -1.442  0.16743   
## x2:x4       -2.442e-03  2.279e-03  -1.072  0.29889   
## x3:x4        1.739e-04  1.202e-03   0.145  0.88665   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 550.2 on 17 degrees of freedom
## Multiple R-squared:  0.8679, Adjusted R-squared:  0.7901 
## F-statistic: 11.17 on 10 and 17 DF,  p-value: 1.221e-05

The x1:x4 interaction term is significant (p ≈ 0.017), which means that the relationship between the response and one of these predictors depends on the level of the other. In other words, the relationship between y (cereal yield) and x1 (arable land) varies with x4 (land under cereal production).

Plot the interaction term of x1 and x4

plot(effect("x1:x4",lm.fit1,,list(wt=c(20,30,40))),multiline=TRUE)

Part 2

Because I am only interested in United States agricultural production data from 1991 to 2018 and have investigated only this one country, here I will average the cereal yield (kg per hectare) of the United States over the 15-year period from 1991 to 2005.

average.y <- mean(data$y[1:15])
average.y
## [1] 5523.353

The average cereal yield (kg per hectare) of the United States for the 15-year period from 1991 to 2005 is 5523.353.

Create a binary variable “above_average” that takes the value 1 if the outcome is above average.y and 0 otherwise.

data$above_average <- ifelse(data$y > average.y, 1,0)

1. Use a logistic regression, or other classification method, to model the relationship between above_average and the other predictors over the first 15 years of data

glm.fits <- glm(
  above_average ~ x1 + x2 + x3 + x4 , data = data, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = above_average ~ x1 + x2 + x3 + x4, family = binomial, 
##     data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.14876   0.01269   0.05296   0.09100   2.08884  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.076e+01  8.687e+01   0.239    0.811
## x1           1.414e+00  7.254e+00   0.195    0.845
## x2          -3.633e+01  1.569e+02  -0.231    0.817
## x3          -3.607e+00  7.991e+01  -0.045    0.964
## x4          -7.786e-08  4.364e-07  -0.178    0.858
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 29.0965  on 27  degrees of freedom
## Residual deviance:  8.3308  on 23  degrees of freedom
## AIC: 18.331
## 
## Number of Fisher Scoring iterations: 8

We can see that none of the coefficients is individually significant here (all p-values are large and fairly similar). The positive coefficient for x1 suggests that, holding the other predictors fixed, a larger share of arable land is associated with a higher estimated probability that the year's cereal yield is above average.
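
One way to make these coefficients easier to interpret (a small sketch, not part of the original output) is to exponentiate them into odds ratios, i.e. the multiplicative change in the odds of an above-average year for a one-unit increase in each predictor:

exp(coef(glm.fits)) # odds ratios
exp(confint.default(glm.fits)) # Wald confidence intervals on the odds-ratio scale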

2.Interpret the results

Use the coef() function for this fitted model

coef(glm.fits)
##   (Intercept)            x1            x2            x3            x4 
##  2.075538e+01  1.414244e+00 -3.632922e+01 -3.606806e+00 -7.786471e-08
summary(glm.fits)$coef
##                  Estimate   Std. Error     z value  Pr(>|z|)
## (Intercept)  2.075538e+01 8.686885e+01  0.23892781 0.8111616
## x1           1.414244e+00 7.254444e+00  0.19494860 0.8454332
## x2          -3.632922e+01 1.569317e+02 -0.23149701 0.8169287
## x3          -3.606806e+00 7.991002e+01 -0.04513584 0.9639990
## x4          -7.786471e-08 4.363559e-07 -0.17844313 0.8583750

Use the predict() function to predict the probability that the cereal yield will be more than the average cereal yield, given values of the predictors.

glm.probs <- predict(glm.fits, type = "response") 
glm.probs[1:28]
##          1          2          3          4          5          6          7 
## 0.04395345 0.03979500 0.10745929 0.11285784 0.29411761 0.30147234 0.48305674 
##          8          9         10         11         12         13         14 
## 0.77769732 0.89860711 0.97891073 0.99689352 0.99385522 0.99647082 0.99866244 
##         15         16         17         18         19         20         21 
## 0.99953900 0.99968968 0.99997728 0.99986217 0.99989233 0.99848970 0.99463268 
##         22         23         24         25         26         27         28 
## 0.99604820 0.99988695 0.99969391 0.99613439 0.99530320 0.99850796 0.99853311
contrasts(as.factor(data$above_average))
##   1
## 0 0
## 1 1

3. Use the model to “predict” (estimate) the outcome (above_average) for the remaining years (the test dataset).

Make a forecast of whether the cereal yield exceeds the average in a particular year by converting the predicted probabilities into class labels, 1 (above average) or 0 (not above average).

glm.pred <- rep("0", 28) #creates a vector of  28 Down elements
glm.pred[glm.probs > .5] = "1" #transforms to Up all of the elements for which the predicted probability of cereal yield increase exceeds 0.5.

Produce a confusion matrix to determine how many observations were correctly or incorrectly classified.

table(glm.pred, data$above_average)
##         
## glm.pred  0  1
##        0  6  1
##        1  0 21
mean(glm.pred == data$above_average)
## [1] 0.9642857

We can see that, in this case, logistic regression correctly classified whether the cereal yield exceeded the average 96.43% of the time. The training error rate is therefore 100% - 96.43% = 3.57%; because it is computed on the same data used to fit the model, it tends to understate the true error rate.

Create a vector identifying the observations from 1991 through 2004, then use it to create a held-out data set of the observations from 2005 onward (2005–2018).

train <- (data$Year < 2005)
data.2005 <- data[!train, ]
dim(data.2005)
## [1] 14  7
above_average.2005 <- data$above_average[!train]

Fit a logistic regression model using only the subset of the observations that correspond to years before 2005, using the subset argument.

glm.fits <- glm(
  above_average ~ x1 + x2 + x3 + x4, data = data, family = binomial, subset = train)
glm.probs <- predict(glm.fits, data.2005, type = "response")

4.Compare the actual and predicted outcome for the remaining years

Compute the predictions for 2005–2018 and compare them to the actual outcomes over those years.

glm.pred <- rep("0", 14)
glm.pred[glm.probs > .5] <- "1" 
table(glm.pred, above_average.2005)
##         above_average.2005
## glm.pred 1
##        0 7
##        1 7
mean(glm.pred == above_average.2005)
## [1] 0.5
mean(glm.pred != above_average.2005)
## [1] 0.5

We can see that the test error increased to 50%, much larger than the 3.57% training error. Note also that every held-out year is actually above average, so all of the errors come from years the model classified as 0.

Refit the logistic regression using only x1 and x4 (the predictors that appeared most related to the response in Part 1) and predict the held-out years again.

glm.fits <- glm(above_average ~ x1 + x4, data = data, family = binomial, subset = train)
glm.probs <- predict(glm.fits, data.2005, type = "response")
glm.pred <- rep("0", 14)
glm.pred[glm.probs > .5] <- "1" 
table(glm.pred, above_average.2005)
##         above_average.2005
## glm.pred  1
##        1 14
mean(glm.pred == above_average.2005)
## [1] 1

We can see that the model now classifies every held-out year as 1 and, since all of those years are in fact above average, the accuracy is 100%. Because the model never predicts the 0 class, this apparent improvement must be treated with exceptional care.
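
Part of the caution comes from the composition of the test set itself: every held-out year is above the 1991–2005 average, so a trivial rule that always predicts 1 scores exactly as well. A quick sketch makes this explicit:

table(above_average.2005) # all 14 held-out years are above average
mean(above_average.2005 == 1) # accuracy of a naive rule that always predicts 1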

Predict the probability of an above-average year for particular values of x1 and x4.

predict(glm.fits, newdata =
            data.frame(x1 = c(1.1, 0.6), x4 = c(1.2, 0.8)), type = "response")
## 1 2 
## 1 1

Both predicted probabilities are essentially 1, which is consistent with the conclusion above, and whether to use this method in this case needs to be considered carefully.
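
Since the x1 and x4 values used above lie far outside the range observed in the sample (x1, for example, is around 20 in the head() output at the top), those predictions are extrapolations. A more meaningful sketch evaluates the fitted model at values actually seen in the data, for instance the lower and upper quartiles (newpts is a name introduced here purely for illustration):

newpts <- data.frame(x1 = quantile(data$x1, c(0.25, 0.75)), # in-sample quartiles of x1
                     x4 = quantile(data$x4, c(0.25, 0.75))) # in-sample quartiles of x4
predict(glm.fits, newdata = newpts, type = "response") # predicted probabilities at in-range values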