First, I load the packages and source the helper script that I need to complete the assignment.
require(ggvis)
## Loading required package: ggvis
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(corrplot)
## Loading required package: corrplot
require(Ecdat)
## Loading required package: Ecdat
## Loading required package: Ecfun
##
## Attaching package: 'Ecdat'
##
## The following object is masked from 'package:datasets':
##
## Orange
source("http://www.sthda.com/upload/rquery_cormat.r")
Next, I download the data required by the assignment, inspect its contents, and convert it to a dplyr table data frame (tbl_df). I then select only the numeric variables I need, which makes the data easier to work with.
StateInfo <- read.csv(file="http://www.personal.psu.edu/dlp/w540/StateIndicator.csv",
header=TRUE, sep=",")
glimpse(StateInfo)
## Observations: 50
## Variables: 7
## $ stateNames (fctr) Alabama, Alaska, Arizona, Arkansas, California, Co...
## $ Population (int) 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277...
## $ Income (int) 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 481...
## $ Illiteracy (dbl) 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1...
## $ LifeExp (dbl) 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70...
## $ Murder (dbl) 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 1...
## $ HSGrad (dbl) 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52....
summary(StateInfo)
## stateNames Population Income Illiteracy
## Alabama : 1 Min. : 365 Min. :3098 Min. :0.500
## Alaska : 1 1st Qu.: 1080 1st Qu.:3993 1st Qu.:0.625
## Arizona : 1 Median : 2838 Median :4519 Median :0.950
## Arkansas : 1 Mean : 4246 Mean :4436 Mean :1.170
## California: 1 3rd Qu.: 4968 3rd Qu.:4814 3rd Qu.:1.575
## Colorado : 1 Max. :21198 Max. :6315 Max. :2.800
## (Other) :44
## LifeExp Murder HSGrad
## Min. :67.96 Min. : 1.400 Min. :37.80
## 1st Qu.:70.12 1st Qu.: 4.350 1st Qu.:48.05
## Median :70.67 Median : 6.850 Median :53.25
## Mean :70.88 Mean : 7.378 Mean :53.11
## 3rd Qu.:71.89 3rd Qu.:10.675 3rd Qu.:59.15
## Max. :73.60 Max. :15.100 Max. :67.30
##
StateInfo <- tbl_df(StateInfo)
StateInfo
## Source: local data frame [50 x 7]
##
## stateNames Population Income Illiteracy LifeExp Murder HSGrad
## (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9
## 5 California 21198 5114 1.1 71.71 10.3 62.6
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9
## 7 Connecticut 3100 5348 1.1 72.48 3.1 56.0
## 8 Delaware 579 4809 0.9 70.06 6.2 54.6
## 9 Florida 8277 4815 1.3 70.66 10.7 52.6
## 10 Georgia 4931 4091 2.0 68.54 13.9 40.6
## .. ... ... ... ... ... ... ...
StateData <- StateInfo %>% select(Population, Income, Illiteracy, LifeExp, Murder, HSGrad)
StateData
## Source: local data frame [50 x 6]
##
## Population Income Illiteracy LifeExp Murder HSGrad
## (int) (int) (dbl) (dbl) (dbl) (dbl)
## 1 3615 3624 2.1 69.05 15.1 41.3
## 2 365 6315 1.5 69.31 11.3 66.7
## 3 2212 4530 1.8 70.55 7.8 58.1
## 4 2110 3378 1.9 70.66 10.1 39.9
## 5 21198 5114 1.1 71.71 10.3 62.6
## 6 2541 4884 0.7 72.06 6.8 63.9
## 7 3100 5348 1.1 72.48 3.1 56.0
## 8 579 4809 0.9 70.06 6.2 54.6
## 9 8277 4815 1.3 70.66 10.7 52.6
## 10 4931 4091 2.0 68.54 13.9 40.6
## .. ... ... ... ... ... ...
I then compute Pearson product-moment correlations (r) for every pair of variables to measure the strength of their linear association. The same call also produces a correlogram of the variables.
rquery.cormat(StateData)
## $r
## LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp 1
## Income 0.34 1
## HSGrad 0.58 0.62 1
## Population -0.068 0.21 -0.098 1
## Illiteracy -0.59 -0.44 -0.66 0.11 1
## Murder -0.78 -0.23 -0.49 0.34 0.7 1
##
## $p
## LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp 0
## Income 0.016 0
## HSGrad 9.2e-06 1.6e-06 0
## Population 0.64 0.15 0.5 0
## Illiteracy 7e-06 0.0015 2.2e-07 0.46 0
## Murder 2.3e-11 0.11 0.00032 0.015 1.3e-08 0
##
## $sym
## LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp 1
## Income . 1
## HSGrad . , 1
## Population 1
## Illiteracy . . , 1
## Murder , . . , 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
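These calculations show the following correlations, among others: the strongest are Murder and LifeExp (r = -0.78), Murder and Illiteracy (r = 0.70), Illiteracy and HSGrad (r = -0.66), and HSGrad and Income (r = 0.62), while Population is only weakly correlated with the other variables (|r| <= 0.34).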
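As a cross-check on individual entries of the matrix, a single pair can be recomputed with base R. The sketch below is not part of the original analysis; it assumes that the built-in state.x77 matrix holds the same values as StateIndicator.csv (the rows and summary statistics printed above suggest it does), and the object names are my own.

```r
# Assumption: the built-in state.x77 matrix matches StateIndicator.csv.
states <- as.data.frame(state.x77)

# Pearson r for two of the pairs reported in the matrix above.
r_income_hs   <- cor(states$Income, states$`HS Grad`)
r_murder_life <- cor(states$Murder, states$`Life Exp`)

# cor.test() adds a p-value and confidence interval for a single pair.
ct <- cor.test(states$Income, states$`HS Grad`, method = "pearson")

round(c(Income_HSGrad = r_income_hs, Murder_LifeExp = r_murder_life), 2)
ct$p.value
```

If the assumption about the data holds, the rounded values should agree with the corresponding cells of $r and $p.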
I use the following code to create a plot showing the relationship between HSGrad and Income.
StateData %>% ggvis(~HSGrad, ~Income) %>%
layer_points() %>% layer_model_predictions(model="lm") %>%
add_axis("x", title = "Percentage of High School Graduates in State", title_offset = 35) %>%
add_axis("y", title = "Income per Capita (1974)", title_offset = 50) %>%
add_axis("x", orient = "top", ticks = 0,
title = "Relationship between High School Graduates and Income",
properties = axis_props(
axis = list(stroke = "white"),
labels = list(fontSize = 0)))
## Guessing formula = Income ~ HSGrad
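The fitted line in this plot corresponds to the model named in the message above, Income ~ HSGrad. A minimal sketch of fitting that model directly is shown here so the slope can be read off; it assumes the built-in state.x77 matrix as a stand-in for the downloaded data, and the renamed column and object names are mine.

```r
# Assumption: state.x77 stands in for the downloaded StateIndicator.csv.
states <- as.data.frame(state.x77)
names(states)[names(states) == "HS Grad"] <- "HSGrad"

# Simple linear regression of income on high school graduation percentage.
fit <- lm(Income ~ HSGrad, data = states)

coef(fit)                          # intercept and slope of the fitted line
r_squared <- summary(fit)$r.squared
r_squared                          # equals the squared Pearson correlation
```

For a one-predictor model, R-squared is the square of the Pearson correlation, which ties the plot back to the correlation matrix.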
I create a plot of the relationship between Illiteracy and Income by using the following code.
StateData %>% ggvis(~Illiteracy, ~Income) %>%
layer_points() %>% layer_model_predictions(model="lm") %>%
add_axis("x", title = "Percent Illiterate in State (1970)", title_offset = 35) %>%
add_axis("y", title = "Income per Capita (1974)", title_offset = 50) %>%
add_axis("x", orient = "top", ticks = 0,
title = "Relationship between Illiteracy and Income",
properties = axis_props(
axis = list(stroke = "white"),
labels = list(fontSize = 0)))
## Guessing formula = Income ~ Illiteracy
After reviewing the data and running some test plots, I realized that a scatterplot of Murder by Illiteracy grouped by HSGrad was confusing because HSGrad produces so many distinct groups.
Therefore, to make the scatterplot more readable, I rounded HSGrad to the nearest ten, which yields a manageable number of groups. I also added a linear model for each group to make the trends easier to see.
StateData$HSGradR <- round(StateData$HSGrad, digits=-1)
StateData
## Source: local data frame [50 x 7]
##
## Population Income Illiteracy LifeExp Murder HSGrad HSGradR
## (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 3615 3624 2.1 69.05 15.1 41.3 40
## 2 365 6315 1.5 69.31 11.3 66.7 70
## 3 2212 4530 1.8 70.55 7.8 58.1 60
## 4 2110 3378 1.9 70.66 10.1 39.9 40
## 5 21198 5114 1.1 71.71 10.3 62.6 60
## 6 2541 4884 0.7 72.06 6.8 63.9 60
## 7 3100 5348 1.1 72.48 3.1 56.0 60
## 8 579 4809 0.9 70.06 6.2 54.6 50
## 9 8277 4815 1.3 70.66 10.7 52.6 50
## 10 4931 4091 2.0 68.54 13.9 40.6 40
## .. ... ... ... ... ... ... ...
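The rounding step relies on a negative digits argument, which rounds to the left of the decimal point. A short illustration using the ten HSGrad values printed above (the vector names are mine):

```r
# First ten HSGrad values, copied from the table above.
hs_grad <- c(41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52.6, 40.6)

# digits = -1 rounds to the nearest ten; digits = 0 would round to integers.
hs_grad_r <- round(hs_grad, digits = -1)
hs_grad_r

# Number of states that fall into each rounded group.
table(hs_grad_r)
```

The result reproduces the HSGradR column for these ten rows.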
StateData %>% ggvis(~Murder, ~Illiteracy, fill = ~factor(HSGradR)) %>%
layer_points() %>%
group_by(HSGradR) %>%
layer_model_predictions(model="lm") %>%
add_axis("x", title="Murder/Manslaughter Rate per 100,000 (1976)", title_offset=35) %>%
add_axis("y", title="Percent Illiterate in State (1970)", title_offset=60) %>%
add_axis("x", orient = "top", ticks = 0,
title = "Relationship between Murder and Illiteracy
grouped by High School Grad Percentage (rounded)",
properties = axis_props(
axis = list(stroke = "white"),
labels = list(fontSize = 0)))
## Guessing formula = Illiteracy ~ Murder
Before testing the null hypothesis that there is no difference in Income between states above the median HSGrad and states at or below it, I needed to create a new variable indicating which side of the median each state falls on.
First, I calculated the median and used it to add a new column, HSMedian, coded 1 for states above the median and 2 for states at or below it. I then viewed the data to make sure I had the results I needed.
median(StateData$HSGrad, na.rm=TRUE)
## [1] 53.25
StateData$HSMedian <- 1
StateData$HSMedian <- ifelse(StateData$HSGrad <= 53.25, 2, 1)
StateData
## Source: local data frame [50 x 8]
##
## Population Income Illiteracy LifeExp Murder HSGrad HSGradR HSMedian
## (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 3615 3624 2.1 69.05 15.1 41.3 40 2
## 2 365 6315 1.5 69.31 11.3 66.7 70 1
## 3 2212 4530 1.8 70.55 7.8 58.1 60 1
## 4 2110 3378 1.9 70.66 10.1 39.9 40 2
## 5 21198 5114 1.1 71.71 10.3 62.6 60 1
## 6 2541 4884 0.7 72.06 6.8 63.9 60 1
## 7 3100 5348 1.1 72.48 3.1 56.0 60 1
## 8 579 4809 0.9 70.06 6.2 54.6 50 1
## 9 8277 4815 1.3 70.66 10.7 52.6 50 2
## 10 4931 4091 2.0 68.54 13.9 40.6 40 2
## .. ... ... ... ... ... ... ... ...
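The same split can be written without typing the median by hand, which avoids a transcription error if the data change. This is a sketch under the assumption that the built-in state.x77 matrix matches the downloaded file; cutoff and hs_median are illustrative names.

```r
# Assumption: state.x77 stands in for the downloaded StateIndicator.csv.
states  <- as.data.frame(state.x77)
hs_grad <- states$`HS Grad`

# Compute the cutoff once and reuse it rather than hardcoding 53.25.
cutoff <- median(hs_grad, na.rm = TRUE)

# Same coding as above: 1 = above the median, 2 = at or below it.
hs_median <- ifelse(hs_grad <= cutoff, 2, 1)

cutoff
table(hs_median)
```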
I then used a two-sample t-test of the difference in the means of these two groups to test the null hypothesis stated above. In short, if the difference is significantly different from 0, I reject the null hypothesis in favor of the alternative, which states that there is a difference in income between states above the median percentage of high school graduates and states at or below it. In other words, Income and HSMedian would be related.
I set my acceptable level of Type I error to 0.05 (\(\alpha\) = 0.05).
t.test(StateData$Income ~ StateData$HSMedian, var.equal=TRUE)
##
## Two Sample t-test
##
## data: StateData$Income by StateData$HSMedian
## t = 1.9642, df = 48, p-value = 0.05531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.838476 671.518476
## sample estimates:
## mean in group 1 mean in group 2
## 4601.72 4269.88
The t-test shows that the estimated difference between the means is almost two times its standard error (t = 1.9642). However, the p-value (0.05531) is greater than \(\alpha\), and the 95 percent confidence interval (-7.84 to 671.52) includes 0. Therefore, I fail to reject the null hypothesis.
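To make the meaning of the t statistic concrete, the pooled-variance calculation can be done by hand and compared with t.test(). This is a sketch rather than part of the original analysis; it assumes the built-in state.x77 matrix matches the downloaded data, and all object names are mine.

```r
# Assumption: state.x77 stands in for the downloaded StateIndicator.csv.
states <- as.data.frame(state.x77)
above  <- states$`HS Grad` > median(states$`HS Grad`)

x1 <- states$Income[above]    # states above the median HSGrad
x2 <- states$Income[!above]   # states at or below the median
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance and standard error of the difference in means.
pooled_var <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
std_error  <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t is the difference in means divided by its standard error.
t_manual <- (mean(x1) - mean(x2)) / std_error
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# The built-in equal-variance test should give the same numbers.
res <- t.test(x1, x2, var.equal = TRUE)
c(t_manual = t_manual, p_manual = p_manual)
```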
Before testing the null hypothesis in this task, I must first regroup my data according to the requirements listed. I do this by filtering the states into the two required groups, tagging each group with an ID, and combining them into a new dataframe.
grp1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois", "Kentucky", "Louisiana",
"Mississippi", "Michigan")
grp2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine", "Minnesota", "Nebraska",
"New Hampshire", "North Dakota")
grp1states <- filter(StateInfo, stateNames %in% grp1)
grp1states$ID <- 1
grp1states
## Source: local data frame [9 x 8]
##
## stateNames Population Income Illiteracy LifeExp Murder HSGrad ID
## (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 1
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 1
## 3 Arkansas 2110 3378 1.9 70.66 10.1 39.9 1
## 4 Georgia 4931 4091 2.0 68.54 13.9 40.6 1
## 5 Illinois 11197 5107 0.9 70.14 10.3 52.6 1
## 6 Kentucky 3387 3712 1.6 70.10 10.6 38.5 1
## 7 Louisiana 3806 3545 2.8 68.76 13.2 42.2 1
## 8 Michigan 9111 4751 0.9 70.63 11.1 52.8 1
## 9 Mississippi 2341 3098 2.4 68.09 12.5 41.0 1
grp2states <- filter(StateInfo, stateNames %in% grp2)
grp2states$ID <- 2
grp2states
## Source: local data frame [9 x 8]
##
## stateNames Population Income Illiteracy LifeExp Murder HSGrad ID
## (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 Arizona 2212 4530 1.8 70.55 7.8 58.1 2
## 2 Connecticut 3100 5348 1.1 72.48 3.1 56.0 2
## 3 Iowa 2861 4628 0.5 72.56 2.3 59.0 2
## 4 Kansas 2280 4669 0.6 72.58 4.5 59.9 2
## 5 Maine 1058 3694 0.7 70.39 2.7 54.7 2
## 6 Minnesota 3921 4675 0.6 72.96 2.3 57.6 2
## 7 Nebraska 1544 4508 0.6 72.60 2.9 59.3 2
## 8 New Hampshire 812 4281 0.7 71.23 3.3 57.6 2
## 9 North Dakota 637 5087 0.8 72.78 1.4 50.3 2
NewStateInfo <- rbind(grp1states, grp2states)
NewStateInfo
## Source: local data frame [18 x 8]
##
## stateNames Population Income Illiteracy LifeExp Murder HSGrad ID
## (fctr) (int) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 1
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 1
## 3 Arkansas 2110 3378 1.9 70.66 10.1 39.9 1
## 4 Georgia 4931 4091 2.0 68.54 13.9 40.6 1
## 5 Illinois 11197 5107 0.9 70.14 10.3 52.6 1
## 6 Kentucky 3387 3712 1.6 70.10 10.6 38.5 1
## 7 Louisiana 3806 3545 2.8 68.76 13.2 42.2 1
## 8 Michigan 9111 4751 0.9 70.63 11.1 52.8 1
## 9 Mississippi 2341 3098 2.4 68.09 12.5 41.0 1
## 10 Arizona 2212 4530 1.8 70.55 7.8 58.1 2
## 11 Connecticut 3100 5348 1.1 72.48 3.1 56.0 2
## 12 Iowa 2861 4628 0.5 72.56 2.3 59.0 2
## 13 Kansas 2280 4669 0.6 72.58 4.5 59.9 2
## 14 Maine 1058 3694 0.7 70.39 2.7 54.7 2
## 15 Minnesota 3921 4675 0.6 72.96 2.3 57.6 2
## 16 Nebraska 1544 4508 0.6 72.60 2.9 59.3 2
## 17 New Hampshire 812 4281 0.7 71.23 3.3 57.6 2
## 18 North Dakota 637 5087 0.8 72.78 1.4 50.3 2
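The filter-and-rbind steps above can also be collapsed into a single subset followed by one ifelse() call. The sketch below uses base R and assumes the built-in state.x77 matrix as a stand-in for StateInfo; subset_states is an illustrative name, and the rows come out in alphabetical order rather than grouped by ID.

```r
grp1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois", "Kentucky",
          "Louisiana", "Mississippi", "Michigan")
grp2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine", "Minnesota",
          "Nebraska", "New Hampshire", "North Dakota")

# Assumption: state.x77 stands in for the downloaded StateIndicator.csv.
states <- as.data.frame(state.x77)
states$stateNames <- rownames(states)

# Keep only the 18 states of interest, then label each row by its group.
subset_states <- states[states$stateNames %in% c(grp1, grp2), ]
subset_states$ID <- ifelse(subset_states$stateNames %in% grp1, 1, 2)

table(subset_states$ID)
```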
Using this new dataframe, I can test my null hypothesis with a t-test of the difference in means between these two groups. The null hypothesis states that there is no difference in Murder between the first group of states and the second group of states. The alternative hypothesis states that there is a difference in murder and non-negligent manslaughter rates between these two groups.
I set my acceptable level of Type I error to 0.05 (\(\alpha\) = 0.05).
t.test(NewStateInfo$Murder ~ NewStateInfo$ID, var.equal=TRUE)
##
## Two Sample t-test
##
## data: NewStateInfo$Murder by NewStateInfo$ID
## t = 10.124, df = 16, p-value = 2.312e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.834422 10.454467
## sample estimates:
## mean in group 1 mean in group 2
## 12.011111 3.366667
The t-test shows that the estimated difference between the means is more than 10 times its standard error (t = 10.124), and the p-value (2.312e-08) is far less than \(\alpha\). The 95 percent confidence interval (6.83 to 10.45) also excludes 0. Therefore, I reject the null hypothesis.
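The decision rule can also be applied programmatically by pulling the p-value out of the object that t.test() returns. The sketch below re-enters the 18 Murder values from the table above so that it runs on its own; murder, id, and reject_null are illustrative names.

```r
# Murder rates copied from the 18-row table above (group 1, then group 2).
murder <- c(15.1, 11.3, 10.1, 13.9, 10.3, 10.6, 13.2, 11.1, 12.5,
             7.8,  3.1,  2.3,  4.5,  2.7,  2.3,  2.9,  3.3,  1.4)
id <- rep(c(1, 2), each = 9)

alpha <- 0.05
res   <- t.test(murder ~ id, var.equal = TRUE)

# Reject the null hypothesis when the p-value falls below alpha.
reject_null <- res$p.value < alpha

res$statistic
res$p.value
reject_null
```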