Data for this exam was provided by the professor and can be accessed here.
To begin, I have created a new project in R and have read the data file into my project folder.
I also loaded a few standard packages that I anticipated needing, along with the ‘cormat’ function:
To assess the correlation between the variables, I needed to retain only the continuous variables, so I created a data frame that excludes ‘stateNames’ (a categorical variable).
## Source: local data frame [50 x 7]
##
## stateNames Population Income Illiteracy LifeExp Murder HSGrad
## (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9
## 5 California 21198 5114 1.1 71.71 10.3 62.6
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6
## .. ... ... ... ... ... ... ...
## Source: local data frame [50 x 6]
##
## Population Income Illiteracy LifeExp Murder HSGrad
## (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 3615 3624 2.1 69.05 15.1 41.3
## 2 365 6315 1.5 69.31 11.3 66.7
## 3 2212 4530 1.8 70.55 7.8 58.1
## 4 2110 3378 1.9 70.66 10.1 39.9
## 5 21198 5114 1.1 71.71 10.3 62.6
## 6 2541 4884 0.7 72.06 6.8 63.9
## 7 3100 5348 1.1 72.48 3.1 56.0
## 8 579 4809 0.9 70.06 6.2 54.6
## 9 8277 4815 1.3 70.66 10.7 52.6
## 10 4931 4091 2.0 68.54 13.9 40.6
## .. ... ... ... ... ... ...
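A minimal sketch of this filtering step in base R. It assumes that R's built-in `state.x77` matrix can stand in for the course data file (the rows printed above appear to match it); the object names `states` and `states_num` are illustrative, not the ones used in the exam.

```r
# Assumption: the built-in state.x77 matrix stands in for the course data file.
states <- data.frame(
  stateNames = factor(state.name),
  Population = state.x77[, "Population"],
  Income     = state.x77[, "Income"],
  Illiteracy = state.x77[, "Illiteracy"],
  LifeExp    = state.x77[, "Life Exp"],
  Murder     = state.x77[, "Murder"],
  HSGrad     = state.x77[, "HS Grad"]
)
rownames(states) <- NULL

# Keep only the numeric (continuous) columns; this drops stateNames.
states_num <- states[, sapply(states, is.numeric)]

dim(states)      # rows and columns before filtering
dim(states_num)  # rows and columns after filtering
```

Selecting by `is.numeric` rather than by column name keeps the step valid if further categorical columns are added later.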
1.) Using ‘cormat’, I created a table of correlation coefficients ($r), a table of p-values giving the significance of each pairwise correlation ($p), and a graphical representation of the correlations in which blue marks negative and red marks positive correlations.
## $r
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp    1
## Income     0.34    1
## HSGrad     0.58    0.62   1
## Population -0.068  0.21   -0.098 1
## Illiteracy -0.59   -0.44  -0.66  0.11       1
## Murder     -0.78   -0.23  -0.49  0.34       0.7        1
##
## $p
##            LifeExp Income  HSGrad  Population Illiteracy Murder
## LifeExp    0
## Income     0.016   0
## HSGrad     9.2e-06 1.6e-06 0
## Population 0.64    0.15    0.5     0
## Illiteracy 7e-06   0.0015  2.2e-07 0.46       0
## Murder     2.3e-11 0.11    0.00032 0.015      1.3e-08    0
##
## $sym
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp    1
## Income     .       1
## HSGrad     .       ,      1
## Population                       1
## Illiteracy .       .      ,                 1
## Murder     ,              .      .          ,          1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
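The $r and $p tables can be reproduced without ‘cormat’. The following is a sketch in base R using `cor()` and `cor.test()`, not the internals of ‘cormat’ itself, and it again assumes that the built-in `state.x77` matrix stands in for the course data.

```r
# Assumption: the built-in state.x77 matrix stands in for the course data.
vars <- c("Life Exp", "Income", "HS Grad", "Population", "Illiteracy", "Murder")
x <- as.data.frame(state.x77[, vars])

# Pearson correlation coefficient for every pair of variables.
r <- cor(x)

# Matching matrix of two-sided p-values, one cor.test() call per pair.
# The diagonal is left at 0, as in the $p table above.
k <- ncol(x)
p <- matrix(0, k, k, dimnames = dimnames(r))
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    p[i, j] <- p[j, i] <- cor.test(x[[i]], x[[j]])$p.value
  }
}

signif(r, 2)
signif(p, 2)
```

Each p-value tests the null hypothesis that the corresponding population correlation is zero, so a large coefficient with a small p-value (for example Murder and LifeExp) indicates a relationship unlikely to be due to chance alone.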
2.) By using ‘ggvis’ I was able to construct plots demonstrating the relationship between:
a.) HSGrad and Income:
a.) Illiteracy and Income:
b.) A scatterplot of Murder by Illiteracy and grouped by HSGrad.
For this requirement, I transformed HSGrad into two categories: states above the 50-state mean graduation percentage and states below it:
## Source: local data frame [1 x 2]
##
## proportion_grad number_valid
## (dbl) (int)
## 1 53.108 50
Having calculated the mean HSGrad rate, I recode HSGrad into two categories: lower than the mean graduation rate (0) and greater than the mean graduation rate (1):
##
## 0 1
## 24 26
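The mean and the 0/1 recode can be sketched in a few lines of base R. This assumes the built-in `state.x77` matrix stands in for the course data; `mean_grad` and `grad_group` are illustrative names.

```r
# Assumption: the built-in state.x77 matrix stands in for the course data.
hs_grad <- state.x77[, "HS Grad"]

# 50-state mean graduation percentage and number of valid observations.
mean_grad <- mean(hs_grad)
n_valid   <- sum(!is.na(hs_grad))

# 1 if the state's rate is greater than the mean, 0 otherwise.
grad_group <- as.integer(hs_grad > mean_grad)

mean_grad
n_valid
table(grad_group)
```

No state sits exactly on the mean in this data, so the choice between `>` and `>=` does not change the group sizes.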
Murder by Illiteracy and grouped by HSGrad:
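A grouped scatterplot of this kind can be sketched with ‘ggvis’ as follows. The data frame `plot_data` and the object `plt` are illustrative names, the built-in `state.x77` matrix is assumed to stand in for the course data, and the ‘ggvis’ calls are guarded so the snippet still runs where the package is not installed.

```r
# Assumption: the built-in state.x77 matrix stands in for the course data.
plot_data <- data.frame(
  Illiteracy = state.x77[, "Illiteracy"],
  Murder     = state.x77[, "Murder"],
  HSGrad     = state.x77[, "HS Grad"]
)

# Grouping factor: 1 = above the mean graduation rate, 0 = below it.
plot_data$grad_group <- factor(as.integer(plot_data$HSGrad > mean(plot_data$HSGrad)))

if (requireNamespace("ggvis", quietly = TRUE)) {
  library(ggvis)
  # Murder against Illiteracy, with point fill mapped to the HSGrad group.
  plt <- plot_data %>%
    ggvis(x = ~Illiteracy, y = ~Murder, fill = ~grad_group) %>%
    layer_points()
  # Printing plt (for example at the end of an R Markdown chunk) renders the plot.
}
```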
3.) I have also tested the null hypothesis that the correlation between Income and whether a state’s HSGrad is above or below the mean equals 0.
Null Hypothesis: There is no correlation between mean income and whether or not a state’s HSGrad rate is above or below the mean HSGrad rate.
Alternate Hypothesis: There is a correlation between mean income and whether or not a state’s HSGrad rate is above or below the mean HSGrad rate.
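I will set alpha equal to .05 and use a two-sample t-test for a difference between the group means, which for a binary grouping variable is equivalent to testing whether the correlation is zero: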
##
## Two Sample t-test
##
## data: Income by task2
## t = -2.0303, df = 48, p-value = 0.04789
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -681.492855 -3.314837
## sample estimates:
## mean in group 0 mean in group 1
## 4257.750 4600.154
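Based on the results of the t-test (p-value = 0.04789, which is below alpha = .05, and a 95 percent confidence interval that excludes 0), I reject the null hypothesis. The margin is narrow: rounded to two decimal places, the p-value equals .05.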
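The test and the decision rule can be sketched as follows, comparing the unrounded p-value with alpha. The sketch assumes the built-in `state.x77` matrix stands in for the course data, and it sets `var.equal = TRUE` because the output above is labelled "Two Sample t-test" rather than "Welch Two Sample t-test"; `grad_group` is an illustrative name.

```r
# Assumption: the built-in state.x77 matrix stands in for the course data.
income  <- state.x77[, "Income"]
hs_grad <- state.x77[, "HS Grad"]

# 0 = below the mean graduation rate, 1 = above it.
grad_group <- factor(as.integer(hs_grad > mean(hs_grad)))

alpha <- 0.05

# Pooled-variance two-sample t-test of Income by group.
res <- t.test(income ~ grad_group, var.equal = TRUE)

res$statistic
res$parameter
res$p.value

# TRUE means the null hypothesis is rejected at the chosen alpha.
res$p.value < alpha
```

Comparing `res$p.value` directly with `alpha` avoids the ambiguity that arises when a p-value just under the threshold is rounded before the comparison.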
b.) Here I test the null hypothesis that there is no difference in Murder rates between two groups of states:
Null Hypothesis: There is no difference in the mean murder rate between two groups of states: (1) Alabama, Alaska, Arkansas, Georgia, Illinois, Kentucky, Louisiana, Mississippi, Michigan; and (2) Arizona, Connecticut, Iowa, Kansas, Maine, Minnesota, Nebraska, New Hampshire, North Dakota.
Alternate Hypothesis: There is a difference in the mean murder rate between the same two groups of states: (1) Alabama, Alaska, Arkansas, Georgia, Illinois, Kentucky, Louisiana, Mississippi, Michigan; and (2) Arizona, Connecticut, Iowa, Kansas, Maine, Minnesota, Nebraska, New Hampshire, North Dakota.
I will set alpha equal to .05 and use a t-test to test for a difference between the means.
I recoded stateNames to include two levels relating to the aforementioned groups of states (with the others being recoded as ‘rest’):
## [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"
## [1] "Alabama" "Alaska" "Arkansas" "Georgia" "Illinois"
## [6] "Kentucky" "Louisiana" "Mississippi" "Michigan"
## [1] "Arizona" "Connecticut" "Iowa" "Kansas"
## [5] "Maine" "Minnesota" "Nebraska" "New Hampshire"
## [9] "North Dakota"
## [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"
I also recombined the recoded stateNames variable for the t.test:
## [1] states1 states2 rest
## Levels: rest states1 states2
## [1] states1 states2
## Levels: states1 states2
## [1] states1 states2
## Levels: states1 states2
## new_groups
## states1 states2
## 1 1
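One way to build the grouping factor and check it is sketched below in base R. It uses the built-in `state.name` vector as a stand-in for the stateNames column (an assumption), and the object names mirror the labels shown above. A check worth making at this step is that each named group contains nine states.

```r
# The two groups of states named in the hypotheses.
states1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois",
             "Kentucky", "Louisiana", "Mississippi", "Michigan")
states2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine",
             "Minnesota", "Nebraska", "New Hampshire", "North Dakota")

# Assumption: the built-in state.name vector stands in for stateNames.
new_groups <- ifelse(state.name %in% states1, "states1",
              ifelse(state.name %in% states2, "states2", "rest"))
new_groups <- factor(new_groups, levels = c("rest", "states1", "states2"))

# Number of states in each level.
table(new_groups)
```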
Now the t.test:
##
## One Sample t-test
##
## data: new_data$Murder
## t = 14.132, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 6.328876 8.427124
## sample estimates:
## mean of x
## 7.378
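The output above is a One Sample t-test on new_data$Murder with df = 49, so it tests whether the mean murder rate across all 50 states differs from 0 rather than comparing the two groups. Its p-value (< 2.2e-16) therefore does not bear on the null hypothesis that there is no difference in the mean murder rate between the two groups of states (Alabama, Alaska, Arkansas, Georgia, Illinois, Kentucky, Louisiana, Mississippi, Michigan versus Arizona, Connecticut, Iowa, Kansas, Maine, Minnesota, Nebraska, New Hampshire, North Dakota). Deciding that hypothesis requires a two-sample t-test of Murder restricted to those 18 states.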
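A two-sample comparison of the two named groups can be sketched as follows. It assumes the built-in `state.x77` matrix stands in for the course data, and it uses the pooled-variance form (`var.equal = TRUE`) to match the earlier Income test; dropping that argument gives the Welch test instead.

```r
# The two groups of states named in the hypotheses.
states1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois",
             "Kentucky", "Louisiana", "Mississippi", "Michigan")
states2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine",
             "Minnesota", "Nebraska", "New Hampshire", "North Dakota")

# Assumption: the built-in state.x77 matrix stands in for the course data.
# Its rows are named by state, so the groups can be selected by name.
murder <- state.x77[, "Murder"]
g1 <- murder[states1]
g2 <- murder[states2]

alpha <- 0.05

# Two-sample t-test for a difference in mean murder rate between the groups.
res <- t.test(g1, g2, var.equal = TRUE)
res

# TRUE means the null hypothesis of equal means is rejected at alpha.
res$p.value < alpha
```

With nine states per group, the degrees of freedom reported by this test should be 16, which is a quick way to confirm that only the 18 named states entered the comparison.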