I first load the packages I will need to complete the assignment.
require(corrplot)
## Loading required package: corrplot
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(Ecdat)
## Loading required package: Ecdat
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
require(ggvis)
## Loading required package: ggvis
The assignment requires us to use the Mroz dataset, so I load and review that data to select four continuous variables for analysis. I also convert Mroz to a table data frame.
summary(Mroz)
## work hoursw child6 child618
## yes:325 Min. : 0.0 Min. :0.0000 Min. :0.000
## no :428 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:0.000
## Median : 288.0 Median :0.0000 Median :1.000
## Mean : 740.6 Mean :0.2377 Mean :1.353
## 3rd Qu.:1516.0 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :4950.0 Max. :3.0000 Max. :8.000
## agew educw hearnw wagew
## Min. :30.00 Min. : 5.00 Min. : 0.000 Min. :0.00
## 1st Qu.:36.00 1st Qu.:12.00 1st Qu.: 0.000 1st Qu.:0.00
## Median :43.00 Median :12.00 Median : 1.625 Median :0.00
## Mean :42.54 Mean :12.29 Mean : 2.375 Mean :1.85
## 3rd Qu.:49.00 3rd Qu.:13.00 3rd Qu.: 3.788 3rd Qu.:3.58
## Max. :60.00 Max. :17.00 Max. :25.000 Max. :9.98
## hoursh ageh educh wageh
## Min. : 175 Min. :30.00 Min. : 3.00 Min. : 0.4121
## 1st Qu.:1928 1st Qu.:38.00 1st Qu.:11.00 1st Qu.: 4.7883
## Median :2164 Median :46.00 Median :12.00 Median : 6.9758
## Mean :2267 Mean :45.12 Mean :12.49 Mean : 7.4822
## 3rd Qu.:2553 3rd Qu.:52.00 3rd Qu.:15.00 3rd Qu.: 9.1667
## Max. :5010 Max. :60.00 Max. :17.00 Max. :40.5090
## income educwm educwf unemprate
## Min. : 1500 Min. : 0.000 Min. : 0.000 Min. : 3.000
## 1st Qu.:15428 1st Qu.: 7.000 1st Qu.: 7.000 1st Qu.: 7.500
## Median :20880 Median :10.000 Median : 7.000 Median : 7.500
## Mean :23081 Mean : 9.251 Mean : 8.809 Mean : 8.624
## 3rd Qu.:28200 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:11.000
## Max. :96000 Max. :17.000 Max. :17.000 Max. :14.000
## city experience
## no :269 Min. : 0.00
## yes:484 1st Qu.: 4.00
## Median : 9.00
## Mean :10.63
## 3rd Qu.:15.00
## Max. :45.00
Mroz <- tbl_df(Mroz)
Mroz
## Source: local data frame [753 x 18]
##
## work hoursw child6 child618 agew educw hearnw wagew hoursh ageh
## (fctr) (int) (int) (int) (int) (int) (dbl) (dbl) (int) (int)
## 1 no 1610 1 0 32 12 3.3540 2.65 2708 34
## 2 no 1656 0 2 30 12 1.3889 2.65 2310 30
## 3 no 1980 1 3 35 12 4.5455 4.04 3072 40
## 4 no 456 0 3 34 12 1.0965 3.25 1920 53
## 5 no 1568 1 2 31 14 4.5918 3.60 2000 32
## 6 no 2032 0 0 54 12 4.7421 4.70 1040 57
## 7 no 1440 0 2 37 16 8.3333 5.95 2670 37
## 8 no 1020 0 0 54 12 7.8431 9.98 4120 53
## 9 no 1458 0 2 48 12 2.1262 0.00 1995 52
## 10 no 1600 0 2 39 12 4.6875 4.15 2100 43
## .. ... ... ... ... ... ... ... ... ... ...
## Variables not shown: educh (int), wageh (dbl), income (int), educwm (int),
## educwf (int), unemprate (dbl), city (fctr), experience (int)
I decide to estimate Pearson product-moment correlations using the variables hoursw, income, hearnw, and experience, so I select those variables from the data and create a new data frame.
wifedata <- Mroz %>%
  select(hoursw, income, hearnw, experience)
wifedata
## Source: local data frame [753 x 4]
##
## hoursw income hearnw experience
## (int) (int) (dbl) (int)
## 1 1610 16310 3.3540 14
## 2 1656 21800 1.3889 5
## 3 1980 21040 4.5455 15
## 4 456 7300 1.0965 6
## 5 1568 27300 4.5918 7
## 6 2032 19495 4.7421 33
## 7 1440 21152 8.3333 11
## 8 1020 18900 7.8431 35
## 9 1458 20405 2.1262 24
## 10 1600 20425 4.6875 21
## .. ... ... ... ...
summary(wifedata)
## hoursw income hearnw experience
## Min. : 0.0 Min. : 1500 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.:15428 1st Qu.: 0.000 1st Qu.: 4.00
## Median : 288.0 Median :20880 Median : 1.625 Median : 9.00
## Mean : 740.6 Mean :23081 Mean : 2.375 Mean :10.63
## 3rd Qu.:1516.0 3rd Qu.:28200 3rd Qu.: 3.788 3rd Qu.:15.00
## Max. :4950.0 Max. :96000 Max. :25.000 Max. :45.00
I then compute Pearson product-moment correlations on the data to measure the strength of the linear association between each pair of variables (r), which is the basis for testing the null hypothesis. The function I use also produces a correlogram of my variables.
That function, rquery.cormat, is not part of an installed package, so I must first load it by sourcing the script with the following code.
source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(wifedata)
## $r
##            income experience hoursw hearnw
## income          1
## experience -0.028          1
## hoursw       0.15        0.4      1
## hearnw       0.23       0.25   0.42      1
##
## $p
##             income experience hoursw hearnw
## income           0
## experience    0.45          0
## hoursw     5.6e-05          0      0
## hearnw     1.4e-10      3e-12      0      0
##
## $sym
##            income experience hoursw hearnw
## income     1
## experience        1
## hoursw            .          1
## hearnw                       .      1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
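The same two tables can be cross-checked without sourcing a remote script. The following is a minimal sketch, assuming the Ecdat package is installed and that base R's cor() and cor.test() are an acceptable stand-in for rquery.cormat; the names vars, dat, r_mat, and p_mat are mine and do not appear elsewhere in this report.

```r
library(Ecdat)  # provides the Mroz data set used throughout this report

# Same four variables as wifedata, selected with base R indexing.
vars <- c("hoursw", "income", "hearnw", "experience")
dat  <- as.data.frame(Mroz)[, vars]

# Pearson correlation matrix (compare with the $r table above).
r_mat <- cor(dat, method = "pearson")

# Matrix of two-sided p-values, one cor.test() call per pair of variables.
p_mat <- matrix(NA_real_, length(vars), length(vars),
                dimnames = list(vars, vars))
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      p_mat[i, j] <- cor.test(dat[[i]], dat[[j]])$p.value
    }
  }
}

round(r_mat, 2)   # correlations to two decimals
signif(p_mat, 2)  # p-values to two significant digits (diagonal left as NA)
```

Unlike the $p table, this version leaves the diagonal as NA rather than 0, since a variable is not tested against itself.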
Using these results, I’m ready to test my null hypothesis for 4 pairs of variables. The null hypothesis is \(H_{0}\): \(\rho\) = 0, and its alternative is \(H_{1}\): \(\rho\) \(\neq\) 0. In other words, under the null hypothesis there is no linear correlation between the two variables in the population, while under the alternative hypothesis there is some linear correlation between them.
I have set my Type I error rate at \(\alpha\) = 0.05.
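To make the link between r and its p-value explicit: under \(H_{0}\), the quantity \(t = r\sqrt{n-2}/\sqrt{1-r^{2}}\) follows a t distribution with \(n-2\) degrees of freedom. The sketch below applies that standard formula; the helper name cor_t_test is my own, and because it starts from the rounded r values in the $r table, its p-values will differ slightly from those in the $p table.

```r
# Convert a sample correlation r (from n paired observations) into the
# t statistic and two-sided p-value used to test H0: rho = 0.
cor_t_test <- function(r, n) {
  df      <- n - 2
  t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
  p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  list(t = t_stat, df = df, p = p_value)
}

alpha <- 0.05

# Rounded r values from the $r table; wifedata has n = 753 rows.
income_hoursw     <- cor_t_test(r = 0.15,   n = 753)
income_experience <- cor_t_test(r = -0.028, n = 753)

income_hoursw$p < alpha      # TRUE means reject H0 at the 0.05 level
income_experience$p < alpha  # FALSE means fail to reject H0
```

The decision rule is the one used in the paragraphs that follow: reject \(H_{0}\) whenever p falls below \(\alpha\).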
My test of correlation between income (family income in 1975 dollars) and hoursw (wife’s hours of work in 1975) shows a small positive correlation between the two variables (r = 0.15, p = 5.6e-05). Since p is less than \(\alpha\), I’m able to reject the null hypothesis and state that there is a small positive correlation between family income and the number of hours a wife worked in 1975. In other words, wives who worked more hours tended to be in families with higher incomes, and vice versa.
My test of correlation between income and hearnw (wife’s average hourly earnings, in 1975 dollars) shows a small positive correlation between the two variables (r = 0.23, p = 1.4e-10). Since p is less than \(\alpha\), I’m able to reject the null hypothesis and state that there is a small positive correlation between family income and the wife’s average hourly earnings. In other words, the more a wife earned per hour, the greater the family income tended to be, and vice versa.
My test of correlation between experience (actual years of wife’s previous labor market experience) and hoursw shows a medium positive correlation between the two variables (r = 0.4, p < 0.001, printed as 0 in the output). Since p is less than \(\alpha\), I’m able to reject the null hypothesis and state that there is a medium positive correlation between a wife’s work experience and the number of hours she worked. In other words, the more experience she had, the more hours she tended to work, and vice versa.
My test of correlation between experience and hearnw shows a small positive correlation between the two variables (r = 0.25, p = 3e-12). Since p is less than \(\alpha\), I’m able to reject the null hypothesis and state that there is a small positive correlation between a wife’s work experience and the amount she earned per hour. In other words, the more experience she had, the more she tended to earn per hour, and vice versa.
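The labels "small" and "medium" used above are consistent with Cohen's common rule of thumb (roughly 0.1 for small, 0.3 for medium, 0.5 for large). The assignment does not name its source for these labels, so the cut-offs in this sketch are an assumption, as is the helper name label_effect. It also reports \(r^{2}\), the proportion of variance the two variables share, which shows how modest these associations are.

```r
# Correlations discussed above, copied from the rounded $r table.
r <- c(income_hoursw     = 0.15,
       income_hearnw     = 0.23,
       experience_hoursw = 0.40,
       experience_hearnw = 0.25)

# Assumed cut-offs (Cohen's rule of thumb); the report does not state its own.
label_effect <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.1, 0.3, 0.5, 1),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE, include.lowest = TRUE)
}

effect_table <- data.frame(
  r         = r,
  r_squared = r^2,             # proportion of variance shared by the pair
  label     = label_effect(r)
)
effect_table
```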
Here is a heat map, which is a visual representation of the correlations between the variables.
cormat <- rquery.cormat(wifedata, graphType = "heatmap")
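The corrplot package loaded at the top of this report can draw a comparable picture directly from a correlation matrix. This is a minimal sketch, assuming Ecdat and corrplot are installed; the styling arguments are my own choices and the plot is not meant to reproduce the heat map above exactly.

```r
library(Ecdat)     # Mroz data
library(corrplot)  # corrplot() draws a correlation matrix as a colored grid

# Same four variables as wifedata.
vars <- c("hoursw", "income", "hearnw", "experience")
M <- cor(as.data.frame(Mroz)[, vars])

# Lower triangle only, shaded by correlation, with coefficients printed.
corrplot(M,
         method      = "color",
         type        = "lower",
         addCoef.col = "black",
         tl.col      = "black")
```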
Here are four scatterplots, each with a smoothed trend line, which also show these relationships.
wifedata %>%
  ggvis(~income, ~hoursw) %>%
  layer_points() %>%
  layer_smooths() %>%
  add_axis("x", title = "1975 family income", title_offset = 50) %>%
  add_axis("y", title = "Wife's hours of work", title_offset = 50)
wifedata %>%
  ggvis(~income, ~hearnw) %>%
  layer_points() %>%
  layer_smooths() %>%
  add_axis("x", title = "1975 family income", title_offset = 50) %>%
  add_axis("y", title = "Wife's average hourly earnings", title_offset = 50)
wifedata %>%
  ggvis(~experience, ~hoursw) %>%
  layer_points() %>%
  layer_smooths() %>%
  add_axis("x", title = "Wife's work experience", title_offset = 50) %>%
  add_axis("y", title = "Wife's hours of work", title_offset = 50)
wifedata %>%
  ggvis(~experience, ~hearnw) %>%
  layer_points() %>%
  layer_smooths() %>%
  add_axis("x", title = "Wife's work experience", title_offset = 50) %>%
  add_axis("y", title = "Wife's average hourly earnings", title_offset = 50)