An Analysis of Variable Correlations within the Mroz Labor Supply Data Set

1.) To begin, I created a new project in R and loaded the packages and the data set I will be working with:

require(Ecdat)
## Loading required package: Ecdat
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
require(corrplot)
## Loading required package: corrplot
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggvis)
## Loading required package: ggvis
require(magrittr)
## Loading required package: magrittr
data(Mroz)
names(Mroz)
##  [1] "work"       "hoursw"     "child6"     "child618"   "agew"      
##  [6] "educw"      "hearnw"     "wagew"      "hoursh"     "ageh"      
## [11] "educh"      "wageh"      "income"     "educwm"     "educwf"    
## [16] "unemprate"  "city"       "experience"
summary(Mroz)
##   work         hoursw           child6          child618    
##  yes:325   Min.   :   0.0   Min.   :0.0000   Min.   :0.000  
##  no :428   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000  
##            Median : 288.0   Median :0.0000   Median :1.000  
##            Mean   : 740.6   Mean   :0.2377   Mean   :1.353  
##            3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000  
##            Max.   :4950.0   Max.   :3.0000   Max.   :8.000  
##       agew           educw           hearnw           wagew     
##  Min.   :30.00   Min.   : 5.00   Min.   : 0.000   Min.   :0.00  
##  1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 0.000   1st Qu.:0.00  
##  Median :43.00   Median :12.00   Median : 1.625   Median :0.00  
##  Mean   :42.54   Mean   :12.29   Mean   : 2.375   Mean   :1.85  
##  3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 3.788   3rd Qu.:3.58  
##  Max.   :60.00   Max.   :17.00   Max.   :25.000   Max.   :9.98  
##      hoursh          ageh           educh           wageh        
##  Min.   : 175   Min.   :30.00   Min.   : 3.00   Min.   : 0.4121  
##  1st Qu.:1928   1st Qu.:38.00   1st Qu.:11.00   1st Qu.: 4.7883  
##  Median :2164   Median :46.00   Median :12.00   Median : 6.9758  
##  Mean   :2267   Mean   :45.12   Mean   :12.49   Mean   : 7.4822  
##  3rd Qu.:2553   3rd Qu.:52.00   3rd Qu.:15.00   3rd Qu.: 9.1667  
##  Max.   :5010   Max.   :60.00   Max.   :17.00   Max.   :40.5090  
##      income          educwm           educwf         unemprate     
##  Min.   : 1500   Min.   : 0.000   Min.   : 0.000   Min.   : 3.000  
##  1st Qu.:15428   1st Qu.: 7.000   1st Qu.: 7.000   1st Qu.: 7.500  
##  Median :20880   Median :10.000   Median : 7.000   Median : 7.500  
##  Mean   :23081   Mean   : 9.251   Mean   : 8.809   Mean   : 8.624  
##  3rd Qu.:28200   3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:11.000  
##  Max.   :96000   Max.   :17.000   Max.   :17.000   Max.   :14.000  
##   city       experience   
##  no :269   Min.   : 0.00  
##  yes:484   1st Qu.: 4.00  
##            Median : 9.00  
##            Mean   :10.63  
##            3rd Qu.:15.00  
##            Max.   :45.00

Also, as directed, I will load the rquery.cormat function:

source("http://www.sthda.com/upload/rquery_cormat.r")

2.) Below I estimate the Pearson product-moment correlations for the pairs formed by four variables which I have selected from the Mroz dataset:

First I subset Mroz to the four variables that I have selected (income, wageh, hoursw and hoursh), which together form six pairs:

variables <- Mroz %>%
  select(income, wageh, hoursw, hoursh)
head(variables)
##   income   wageh hoursw hoursh
## 1  16310  4.0288   1610   2708
## 2  21800  8.4416   1656   2310
## 3  21040  3.5807   1980   3072
## 4   7300  3.5417    456   1920
## 5  27300 10.0000   1568   2000
## 6  19495  6.7106   2032   1040

Next I estimated the Pearson Product Moment correlation of the variables:

rquery.cormat(variables)

## $r
##        income  wageh hoursw hoursh
## income      1                     
## wageh    0.73      1              
## hoursw   0.15 -0.099      1       
## hoursh   0.13  -0.24 -0.056      1
## 
## $p
##         income   wageh hoursw hoursh
## income       0                      
## wageh        0       0              
## hoursw 5.6e-05  0.0068      0       
## hoursh 0.00042 5.4e-11   0.12      0
## 
## $sym
##        income wageh hoursw hoursh
## income 1                         
## wageh  ,      1                  
## hoursw              1            
## hoursh                     1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
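
As a cross-check, the same estimates can be reproduced with base R alone. The sketch below is my own illustration and is not how rquery.cormat works internally; the object names vars, r_mat, var_pairs and p_values are hypothetical. It assumes the Ecdat package is installed and runs one cor.test() per pair of columns.

```r
library(Ecdat)   # provides the Mroz data set
data(Mroz)

# Same four columns as `variables`, selected with base R indexing.
vars <- Mroz[, c("income", "wageh", "hoursw", "hoursh")]

# Pearson correlation matrix, rounded for comparison with $r above.
r_mat <- cor(vars, method = "pearson")
print(round(r_mat, 2))

# One cor.test() per unordered pair of columns (4 variables give 6 pairs).
var_pairs <- combn(names(vars), 2, simplify = FALSE)
p_values <- sapply(var_pairs, function(p) {
  cor.test(vars[[p[1]]], vars[[p[2]]], method = "pearson")$p.value
})
names(p_values) <- sapply(var_pairs, paste, collapse = " ~ ")
print(signif(p_values, 2))
```

The printed matrix and p-values should agree with the $r and $p tables above, up to rounding.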

3.) I have also tested the null hypothesis that the population correlation = 0 for each pair of the selected variables.

Null Hypothesis: The population correlation between the two variables in a pair (drawn from income, wageh, hoursw and hoursh) is zero.

Alternate Hypothesis: The population correlation between the two variables in the pair is not zero.

I will set alpha equal to .05.

As noted in the results below, the p-values for the correlations between income and wageh (displayed as 0 because it is too small to show at the printed precision), income and hoursw (5.6e-05), wageh and hoursw (.0068), income and hoursh (.00042) and wageh and hoursh (5.4e-11) are less than alpha, so I reject the null hypothesis for each of these five pairs.

## $p
##         income   wageh hoursw hoursh
## income       0                      
## wageh        0       0              
## hoursw 5.6e-05  0.0068      0       
## hoursh 0.00042 5.4e-11   0.12      0

I fail to reject the null hypothesis for the correlation between hoursw and hoursh because the p-value (.12) is larger than alpha (.05).
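
To make the decision rule concrete, here is a short worked sketch for the hoursw and hoursh pair. For a Pearson correlation, the test of a zero population correlation uses t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch plugs in the rounded r from the output and takes n from the counts in summary(Mroz), assuming there are no missing values, so the result only approximates the reported p-value. The names t_stat and p_value are mine.

```r
# Reported values from the output above (r is rounded, so p is approximate).
r <- -0.056          # correlation between hoursw and hoursh
n <- 325 + 428       # observations, from the `work` counts in summary(Mroz)
alpha <- 0.05

# t statistic for H0: population correlation = 0, with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_value)
print(p_value < alpha)   # FALSE means we fail to reject H0
```

The p-value from this calculation should land close to the .12 shown in the $p table, which is above alpha.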

4.) Using ggvis, I have created scatterplots containing points and a smooth line for four pairs of the variables I selected.

Income and Wageh

variables %>% ggvis(~income, ~wageh) %>% layer_points() %>% layer_smooths() %>% add_axis("x", title = "income") %>% add_axis("y", title = "wageh")

Income and Hoursw

variables %>% ggvis(~income, ~hoursw) %>% layer_points() %>% layer_smooths() %>% add_axis("x", title = "income") %>% add_axis("y", title = "hoursw")

Income and Hoursh

variables %>% ggvis(~income, ~hoursh) %>% layer_points() %>% layer_smooths() %>% add_axis("x", title = "income") %>% add_axis("y", title = "hoursh")

Hoursw and Wageh

variables %>% ggvis(~hoursw, ~wageh) %>% layer_points() %>% layer_smooths() %>% add_axis("x", title = "hoursw") %>% add_axis("y", title = "wageh")

5.) Finally, I have produced some visual representations of the variable correlations:

First, two correlograms:

rquery.cormat(variables)

## $r
##        income  wageh hoursw hoursh
## income      1                     
## wageh    0.73      1              
## hoursw   0.15 -0.099      1       
## hoursh   0.13  -0.24 -0.056      1
## 
## $p
##         income   wageh hoursw hoursh
## income       0                      
## wageh        0       0              
## hoursw 5.6e-05  0.0068      0       
## hoursh 0.00042 5.4e-11   0.12      0
## 
## $sym
##        income wageh hoursw hoursh
## income 1                         
## wageh  ,      1                  
## hoursw              1            
## hoursh                     1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
rquery.cormat(variables, type="full")

## $r
##        income  wageh hoursw hoursh
## income   1.00  0.730  0.150  0.130
## wageh    0.73  1.000 -0.099 -0.240
## hoursw   0.15 -0.099  1.000 -0.056
## hoursh   0.13 -0.240 -0.056  1.000
## 
## $p
##         income   wageh  hoursw  hoursh
## income 0.0e+00 0.0e+00 5.6e-05 4.2e-04
## wageh  0.0e+00 0.0e+00 6.8e-03 5.4e-11
## hoursw 5.6e-05 6.8e-03 0.0e+00 1.2e-01
## hoursh 4.2e-04 5.4e-11 1.2e-01 0.0e+00
## 
## $sym
##        income wageh hoursw hoursh
## income 1                         
## wageh  ,      1                  
## hoursw              1            
## hoursh                     1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Now a heatmap:

cormat<-rquery.cormat(variables, graphType="heatmap")
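
Since corrplot is already loaded, a similar colour-coded view can also be drawn directly from the correlation matrix. This is only a sketch of an alternative, not a reproduction of what rquery.cormat draws; it assumes the Ecdat and corrplot packages are installed, and the names vars and r_mat are mine.

```r
library(Ecdat)      # provides the Mroz data set
library(corrplot)   # provides corrplot()
data(Mroz)

# Same four columns as `variables`, selected with base R indexing.
vars <- Mroz[, c("income", "wageh", "hoursw", "hoursh")]
r_mat <- cor(vars, method = "pearson")

# Colour-coded upper triangle with the coefficients printed in each cell.
corrplot(r_mat, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black")
```

Printing the coefficients in the cells makes it easier to see that only income and wageh are strongly correlated, while the remaining pairs are weak.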