Question 1 (a)

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
library(ggplot2)
require(cowplot)
## Loading required package: cowplot
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
## 
##     ggsave
# read the data, skipping the first 16 lines of the file
wage.df <- read.csv(file = "wage.csv", skip = 16, header = TRUE, sep = ",")
print("The dimensions of the data wage.df are: ")
## [1] "The dimensions of the data wage.df are: "
dim(wage.df)
## [1] 3000   11
print("Top three entries in wage.df are: ")
## [1] "Top three entries in wage.df are: "
head(wage.df, n=3)
##        year age     sex           maritl     race       education
## 231655 2006  18 1. Male 1. Never Married 1. White    1. < HS Grad
## 86582  2004  24 1. Male 1. Never Married 1. White 4. College Grad
## 161300 2003  45 1. Male       2. Married 1. White 3. Some College
##                    region       jobclass         health health_ins
## 231655 2. Middle Atlantic  1. Industrial      1. <=Good      2. No
## 86582  2. Middle Atlantic 2. Information 2. >=Very Good      2. No
## 161300 2. Middle Atlantic  1. Industrial      1. <=Good     1. Yes
##             wage
## 231655  75.04315
## 86582   70.47602
## 161300 130.98218

Question 1 (b)

# select only the factor columns
wage_fact <- wage.df %>% Filter(f = is.factor)

# helper: bar chart of a single factor column, titled with the column name
plot_data_column1 <- function(data, column)
    ggplot(data = data, aes_string(x = column)) +
        geom_bar(fill = "blue") +
        ggtitle(paste("Histogram of factor variable", column))

myplots1 <- lapply(colnames(wage_fact), plot_data_column1, data = wage_fact)
plot_grid(plotlist = myplots1)
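
As an aside, aes_string() has since been deprecated in ggplot2; the same helper can be written with tidy evaluation. A minimal sketch (the name plot_data_column1_tidy is mine, and this version is not used for the plots above):

# equivalent helper using the .data pronoun instead of aes_string()
# (a sketch; requires ggplot2 >= 3.0)
plot_data_column1_tidy <- function(data, column)
    ggplot(data = data, aes(x = .data[[column]])) +
        geom_bar(fill = "blue") +
        ggtitle(paste("Histogram of factor variable", column))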

We can make the following observations from the plots above (points 2 and 4 are checked numerically in the sketch after the list):

  1. All 3000 workers are male and belong to the Middle Atlantic region.

  2. About two thirds of them are married and only a very small proportion are widowed; the rest are divorced, separated or never married.

  3. Most of them (nearly 1000) are HS graduates, and around 250 are qualified below the HS graduate level. The rest have at least some college education.

  4. About two thirds of the workers report very good health, and about the same proportion have health insurance.

  5. The industrial and information job classes have nearly equal numbers of workers (around 1500 each).
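
A minimal sketch of such a check, using dplyr's count():

# sanity-check the proportions read off the plots (a quick sketch)
wage.df %>% count(maritl)                                    # marital status breakdown
wage.df %>% count(health_ins) %>% mutate(prop = n / sum(n))  # insured share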

Question 1 (c)

#print("The numeric variable columns are:" )
 #select the numeric columns only     
numeric_col<- select_if(wage.df, is.numeric)
#print(numeric_col)

# helper: histogram of a single numeric column, labelled with the column name
plot_data_column2 <- function(data, column)
    ggplot(data = data, aes_string(x = column)) +
        geom_histogram(fill = "blue", bins = 40) +
        ggtitle(paste("Histogram of numeric variable", column)) +
        xlab(column)


myplots2 <- lapply(colnames(numeric_col), plot_data_column2, data = numeric_col)

plot_grid(plotlist = myplots2)

We can make the following inferences from the plots above (points 1, 5 and 6 are checked numerically in the sketch after the list):

  1. The largest number of wage records (around 500) comes from the year 2003.

  2. Most workers are aged between 40 and 50 years.

  3. The histogram of wage appears to be unimodal, with a peak at around 100 thousand dollars; about 350 workers earn a wage near this amount.

  4. This may imply that most well-skilled workers that the company wishes to hire are willing to work for a wage of around 100 thousand dollars.

  5. There are almost no records of workers earning a wage between 200 and 250 thousand dollars.

  6. Very few of these workers earn a wage as high as 250 to 350 thousand dollars.
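
A minimal sketch of those checks:

# quick numerical checks of points 1, 5 and 6 above (a sketch)
table(wage.df$year)                            # records per year
sum(wage.df$wage > 200 & wage.df$wage <= 250)  # the near-empty 200-250 band
sum(wage.df$wage > 250)                        # high earners above 250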

Question 2 (a) and (b)

wage.lm <- lm(wage ~ year + age, data = wage.df)
print("The following are the coefficients obtained for the linear regression model")
## [1] "The following are the coefficients obtained for the linear regression model"
coef(wage.lm)
##   (Intercept)          year           age 
## -2318.5309186     1.1968236     0.6992032
print("data summary : ")
## [1] "data summary : "
wage.sum <- summary(wage.lm)
print(wage.sum)
## 
## Call:
## lm(formula = wage ~ year + age, data = wage.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.766 -25.081  -6.108  16.838 209.053 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2318.5309   739.1385  -3.137  0.00172 ** 
## year            1.1968     0.3685   3.247  0.00118 ** 
## age             0.6992     0.0647  10.808  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.86 on 2997 degrees of freedom
## Multiple R-squared:  0.04165,    Adjusted R-squared:  0.04101 
## F-statistic: 65.12 on 2 and 2997 DF,  p-value: < 2.2e-16
# collect the standard errors of the year and age coefficients
wage.se <- c(wage.sum$coefficients[2, 2], wage.sum$coefficients[3, 2])
print("The standard errors associated with the coefficient estimates are: ")
## [1] "The standard errors associated with the coefficient estimates are: "
print(wage.se)
## [1] 0.36855211 0.06469607
print("the p values associated with the cofficient estimates are")
## [1] "the p values associated with the cofficient estimates are"
print(wage.sum$coefficients[,4])
##  (Intercept)         year          age 
## 1.724584e-03 1.177627e-03 9.802281e-27

Both predictors, year and age, have positive coefficients. This makes sense: wage increases with time and with experience. The standard errors of the year and age coefficients are 0.36855211 and 0.06469607 respectively, and the corresponding p-values are 1.177627e-03 and 9.802281e-27. Since both p-values are well below 0.05, both predictors are significant.
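
The standard errors and p-values above were pulled out element by element; the full coefficient table can also be extracted in a single step. A minimal sketch:

# the complete coefficient matrix (estimate, std. error, t value, p value)
coef(summary(wage.lm))
# or, assuming the broom package is installed, as a tidy data frame:
# broom::tidy(wage.lm)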

Question 2 (c)

print("We can observe the diagnostic plots of the linear regression model below: ")
## [1] "We can observe the diagnostic plots of the linear regression model below: "
plot(wage.lm)


Residuals vs Fitted:

This plot shows that the residuals are not evenly distributed about the horizontal line and exhibit a mild nonlinear pattern. For fitted values below about 105 and above about 115, the residuals tend to lie below the horizontal line (i.e. they are negative).

Normal Q-Q:

The residuals lie along the y = x line up to a theoretical quantile of about 1. Beyond that, they deviate significantly from the y = x line.

Scale-Location:

We can observe an approximately horizontal line with most residuals spread evenly about it. However, some points deviate from the main bulk, with exceptionally high standardised residual values.

Residuals vs Leverage:

Some points lie away from the main bulk. However, as the Cook's distance lines are barely visible, all cases lie well within them.

CHALLENGE:

The reason for the observed discrepancies may be inferred from the histogram of workers' wages. There were almost no workers earning between 200 and 250 thousand dollars, and very few earning between 250 and 300 thousand dollars, which may explain the residuals lying away from the bulk.
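
The observations driving those extreme residuals can be inspected directly; a minimal sketch:

# list the observations with the largest (most positive) residuals (a sketch)
wage.df %>%
    mutate(resid = resid(wage.lm)) %>%
    arrange(desc(resid)) %>%
    head(5)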

Question 2 (d)

# keep only workers earning at most 250 (thousand dollars)
wage.df.lt250 <- wage.df %>% filter(wage <= 250)
wage.lm.lt250 <- lm(wage ~ year + age, data = wage.df.lt250)
wage.sum.lt250 <- summary(wage.lm.lt250)
print("summary of the linear regression model created is given below: ")
## [1] "summary of the linear regression model created is given below: "
print(wage.sum.lt250)
## 
## Call:
## lm(formula = wage ~ year + age, data = wage.df.lt250)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -90.001 -21.690  -2.905  18.518 112.226 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.126e+03  5.715e+02  -3.721 0.000203 ***
## year         1.102e+00  2.850e-01   3.866 0.000113 ***
## age          5.656e-01  4.986e-02  11.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.17 on 2918 degrees of freedom
## Multiple R-squared:  0.04812,    Adjusted R-squared:  0.04747 
## F-statistic: 73.75 on 2 and 2918 DF,  p-value: < 2.2e-16
print("The diagnostic plots associated with the linear regression model can be observed below")
## [1] "The diagnostic plots associated with the linear regression model can be observed below"
plot(wage.lm.lt250)

#print("bbbbbb")
#print(as.numeric(wage.sum.lt250$coefficients[1,1]))

The coefficient estimates are still positive but show a very slight decrease in value (they are compared side by side in the sketch below). Since the p-values associated with these estimates are well below 0.05, both predictors remain significant. The Residuals vs Fitted, Scale-Location and Residuals vs Leverage plots show that the few residuals lying away from the bulk have disappeared, and the Normal Q-Q plot shows that almost all residuals now lie along the y = x line.
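
A minimal sketch of that comparison:

# coefficients of the full-data model vs. the wage <= 250 model, side by side
rbind(full.data = coef(wage.lm), wage.le.250 = coef(wage.lm.lt250))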

Question 2 (e)

print("The predicted wage of a 30 year old this year (2019) ")
## [1] "The predicted wage of a 30 year old this year (2019) "
wage.prediction1 <- as.numeric(wage.sum.lt250$coefficients[1,1]) +as.numeric(wage.sum.lt250$coefficients[2,1]*2019)+as.numeric(wage.sum.lt250$coefficients[3,1]*30) 
wage.prediction1
## [1] 114.8179
print("The predicted wage of a President Trump (72 year old) year old this year (2019) ")
## [1] "The predicted wage of a President Trump (72 year old) year old this year (2019) "
wage.prediction2 <- as.numeric(wage.sum.lt250$coefficients[1,1]) +as.numeric(wage.sum.lt250$coefficients[2,1]*2019)+as.numeric(wage.sum.lt250$coefficients[3,1]*72) 
wage.prediction2
## [1] 138.5742
print("The predicted wage of a 25 year old 5 years from now ")
## [1] "The predicted wage of a 25 year old 5 years from now "
wage.prediction3 <- as.numeric(wage.sum.lt250$coefficients[1,1]) +as.numeric(wage.sum.lt250$coefficients[2,1]*2024)+as.numeric(wage.sum.lt250$coefficients[3,1]*25) 
wage.prediction3
## [1] 117.4979
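
The same three predictions can be obtained more idiomatically with predict(), which avoids indexing the coefficient table by hand. A minimal sketch (the name newdata is mine):

# equivalent predictions from the wage <= 250 model via predict()
newdata <- data.frame(year = c(2019, 2019, 2024), age = c(30, 72, 25))
predict(wage.lm.lt250, newdata = newdata)

Note that 2019 and 2024 lie outside the years observed in the data, so all three predictions are extrapolations.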

Question 3 (a)

# flag high earners: wage.high = 1 if wage > 250 (thousand dollars), else 0
wage.df <- mutate(wage.df, wage.high = ifelse(wage > 250, 1, 0))
wage.glm <- glm(wage.high ~ year + age, data = wage.df, family = binomial)
print("Summary of the logistic regression model can be observed below")
## [1] "Summary of the logistic regression model can be observed below"
summary(wage.glm)
## 
## Call:
## glm(formula = wage.high ~ year + age, family = binomial, data = wage.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4039  -0.2542  -0.2198  -0.1899   2.9302  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -48.808462 112.375762  -0.434 0.664047    
## year          0.021806   0.056033   0.389 0.697155    
## age           0.032779   0.009817   3.339 0.000841 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 730.53  on 2999  degrees of freedom
## Residual deviance: 719.08  on 2997  degrees of freedom
## AIC: 725.08
## 
## Number of Fisher Scoring iterations: 6
#print("refit log reg model")

We make the following observations from the summary of the logistic regression model:

  1. The p-value corresponding to year is 0.697155, which is greater than 0.05; hence year is not a significant predictor.

  2. The p-value corresponding to age is 0.000841, which is much smaller than 0.05; hence age is a significant predictor (its effect on the odds scale is sketched after this list).
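
Since this is a logistic model, the age coefficient lives on the log-odds scale; exponentiating it gives an odds ratio. A minimal sketch:

# multiplicative change in the odds of wage > 250 per additional year of age
exp(coef(wage.glm)["age"])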

Question 3 (b)

wage.glm <- glm(wage.high ~ year + age + education, data = wage.df, family = binomial)
print("The summary of the new logistic regression model built using the predictors age, year and education can be observed below")
## [1] "The summary of the new logistic regression model built using the predictors age, year and education can be observed below"
summary(wage.glm)
## 
## Call:
## glm(formula = wage.high ~ year + age + education, family = binomial, 
##     data = wage.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6917  -0.2450  -0.1325  -0.0954   3.3037  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                 -39.113218 662.234193  -0.059   0.9529  
## year                          0.009165   0.057277   0.160   0.8729  
## age                           0.026768   0.010791   2.481   0.0131 *
## education2. HS Grad          14.283263 652.196203   0.022   0.9825  
## education3. Some College     15.068105 652.196160   0.023   0.9816  
## education4. College Grad     16.137982 652.196085   0.025   0.9803  
## education5. Advanced Degree  17.356990 652.196068   0.027   0.9788  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 730.53  on 2999  degrees of freedom
## Residual deviance: 615.76  on 2993  degrees of freedom
## AIC: 629.76
## 
## Number of Fisher Scoring iterations: 18

We can observe that there are four coefficients associated with the predictor education. This is because education is a categorical variable with 5 levels: when it is used as a predictor, one level (the first, "1. < HS Grad") is set as the reference category, while the other four are encoded as dummy variables (treatment coding).
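
This coding can be inspected directly, assuming education was read in as a factor (the <fct> column type in the tibble printed in part (c) below confirms it). A minimal sketch:

# inspect the factor levels and the treatment (dummy) contrasts R constructs
levels(wage.df$education)     # the first level, "1. < HS Grad", is the reference
contrasts(wage.df$education)  # four dummy columns for the remaining levels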

Question 3 (c)

# count the rows for each (education, wage.high) combination
check <- wage.df %>% group_by(education, wage.high) %>% summarise(count = n())

# a level is "complete" only if both wage.high values (0 and 1) occur within it
check <- check %>% group_by(education) %>% summarise(count = n()) %>%
    mutate(completeness = ifelse(count == 1, "incomplete", "complete"))

print(check)
## # A tibble: 5 x 3
##   education          count completeness
##   <fct>              <int> <chr>       
## 1 1. < HS Grad           1 incomplete  
## 2 2. HS Grad             2 complete    
## 3 3. Some College        2 complete    
## 4 4. College Grad        2 complete    
## 5 5. Advanced Degree     2 complete

The "< HS Grad" education level is incomplete: the data contain no workers at this level earning more than 250 thousand dollars (no observations with a wage.high value of 1).
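
That zero cell can be seen directly in a cross-tabulation; a minimal sketch:

# cross-tabulate education against wage.high; the "< HS Grad" row has a zero
# count in the wage.high = 1 column, which is what makes the level incomplete
with(wage.df, table(education, wage.high))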

Question 3 (d)

# drop the rows belonging to the incomplete education level, then refit
wage.df <- merge(wage.df, check, by = "education") %>% filter(count != 1)
wage.glm <- glm(wage.high ~ year + age + education, data = wage.df, family = binomial)
summary(wage.glm)
## 
## Call:
## glm(formula = wage.high ~ year + age + education, family = binomial, 
##     data = wage.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6917  -0.2547  -0.1435  -0.1055   3.3037  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -24.829955 114.867312  -0.216 0.828861    
## year                          0.009165   0.057277   0.160 0.872866    
## age                           0.026768   0.010791   2.481 0.013118 *  
## education3. Some College      0.784841   0.588306   1.334 0.182181    
## education4. College Grad      1.854719   0.498254   3.722 0.000197 ***
## education5. Advanced Degree   3.073726   0.475777   6.460 1.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 715.54  on 2731  degrees of freedom
## Residual deviance: 615.76  on 2726  degrees of freedom
## AIC: 627.76
## 
## Number of Fisher Scoring iterations: 8

We can observe from the summary of the new logistic regression model (fitted after removing the rows for the incomplete education level) that the standard errors of the education coefficients have shrunk from the inflated values seen earlier to sensible magnitudes. The "College Grad" and "Advanced Degree" levels, which appeared insignificant before, are now clearly significant (p-value < 0.05), while "Some College" remains insignificant (p-value ≈ 0.18). Note also that with the "< HS Grad" rows removed, "HS Grad" becomes the reference level.
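
As an aside, the same cleanup can be done without the merge, by filtering the incomplete level out by name and dropping the unused factor level. A minimal sketch (assuming wage.df in its original, pre-merge form; the names wage.complete and wage.glm.alt are mine):

# alternative to the merge-based filtering in part (d)
wage.complete <- wage.df %>%
    filter(education != "1. < HS Grad") %>%
    droplevels()
wage.glm.alt <- glm(wage.high ~ year + age + education,
                    data = wage.complete, family = binomial)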