Getting ready

First, I load the packages I need and source the helper script that provides rquery.cormat(), which I use later for the correlation matrix.

require(ggvis)
## Loading required package: ggvis
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(corrplot)
## Loading required package: corrplot
require(Ecdat)
## Loading required package: Ecdat
## Loading required package: Ecfun
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
source("http://www.sthda.com/upload/rquery_cormat.r")

Download data

Next, I read in the data required by the assignment, inspect its contents, and convert it to a dplyr table data frame (tbl_df). I then select only the six numeric variables I need, dropping stateNames, to make the data easier to work with.

StateInfo <- read.csv(file="http://www.personal.psu.edu/dlp/w540/StateIndicator.csv", 
                     header=TRUE, sep=",")
glimpse(StateInfo)
## Observations: 50
## Variables: 7
## $ stateNames (fctr) Alabama, Alaska, Arizona, Arkansas, California, Co...
## $ Population (int) 3615, 365, 2212, 2110, 21198, 2541, 3100, 579, 8277...
## $ Income     (int) 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 481...
## $ Illiteracy (dbl) 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1...
## $ LifeExp    (dbl) 69.05, 69.31, 70.55, 70.66, 71.71, 72.06, 72.48, 70...
## $ Murder     (dbl) 15.1, 11.3, 7.8, 10.1, 10.3, 6.8, 3.1, 6.2, 10.7, 1...
## $ HSGrad     (dbl) 41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52....
summary(StateInfo)
##       stateNames   Population        Income       Illiteracy   
##  Alabama   : 1   Min.   :  365   Min.   :3098   Min.   :0.500  
##  Alaska    : 1   1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625  
##  Arizona   : 1   Median : 2838   Median :4519   Median :0.950  
##  Arkansas  : 1   Mean   : 4246   Mean   :4436   Mean   :1.170  
##  California: 1   3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575  
##  Colorado  : 1   Max.   :21198   Max.   :6315   Max.   :2.800  
##  (Other)   :44                                                 
##     LifeExp          Murder           HSGrad     
##  Min.   :67.96   Min.   : 1.400   Min.   :37.80  
##  1st Qu.:70.12   1st Qu.: 4.350   1st Qu.:48.05  
##  Median :70.67   Median : 6.850   Median :53.25  
##  Mean   :70.88   Mean   : 7.378   Mean   :53.11  
##  3rd Qu.:71.89   3rd Qu.:10.675   3rd Qu.:59.15  
##  Max.   :73.60   Max.   :15.100   Max.   :67.30  
## 
StateInfo <- tbl_df(StateInfo)
StateInfo
## Source: local data frame [50 x 7]
## 
##     stateNames Population Income Illiteracy LifeExp Murder HSGrad
##         (fctr)      (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)
## 1      Alabama       3615   3624        2.1   69.05   15.1   41.3
## 2       Alaska        365   6315        1.5   69.31   11.3   66.7
## 3      Arizona       2212   4530        1.8   70.55    7.8   58.1
## 4     Arkansas       2110   3378        1.9   70.66   10.1   39.9
## 5   California      21198   5114        1.1   71.71   10.3   62.6
## 6     Colorado       2541   4884        0.7   72.06    6.8   63.9
## 7  Connecticut       3100   5348        1.1   72.48    3.1   56.0
## 8     Delaware        579   4809        0.9   70.06    6.2   54.6
## 9      Florida       8277   4815        1.3   70.66   10.7   52.6
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6
## ..         ...        ...    ...        ...     ...    ...    ...
StateData <- StateInfo %>% select(Population, Income, Illiteracy, LifeExp, Murder, HSGrad)
StateData
## Source: local data frame [50 x 6]
## 
##    Population Income Illiteracy LifeExp Murder HSGrad
##         (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)
## 1        3615   3624        2.1   69.05   15.1   41.3
## 2         365   6315        1.5   69.31   11.3   66.7
## 3        2212   4530        1.8   70.55    7.8   58.1
## 4        2110   3378        1.9   70.66   10.1   39.9
## 5       21198   5114        1.1   71.71   10.3   62.6
## 6        2541   4884        0.7   72.06    6.8   63.9
## 7        3100   5348        1.1   72.48    3.1   56.0
## 8         579   4809        0.9   70.06    6.2   54.6
## 9        8277   4815        1.3   70.66   10.7   52.6
## 10       4931   4091        2.0   68.54   13.9   40.6
## ..        ...    ...        ...     ...    ...    ...

Task 1

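I then compute Pearson product-moment correlations (r) to measure the strength of the linear association between each pair of variables. rquery.cormat() returns the coefficients ($r), their p-values ($p), and a symbolic summary ($sym), and it also draws a correlogram of the variables.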

rquery.cormat(StateData)

## $r
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp          1                                           
## Income        0.34      1                                    
## HSGrad        0.58   0.62      1                             
## Population  -0.068   0.21 -0.098          1                  
## Illiteracy   -0.59  -0.44  -0.66       0.11          1       
## Murder       -0.78  -0.23  -0.49       0.34        0.7      1
## 
## $p
##            LifeExp  Income  HSGrad Population Illiteracy Murder
## LifeExp          0                                             
## Income       0.016       0                                     
## HSGrad     9.2e-06 1.6e-06       0                             
## Population    0.64    0.15     0.5          0                  
## Illiteracy   7e-06  0.0015 2.2e-07       0.46          0       
## Murder     2.3e-11    0.11 0.00032      0.015    1.3e-08      0
## 
## $sym
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp    1                                                 
## Income     .       1                                         
## HSGrad     .       ,      1                                  
## Population                       1                           
## Illiteracy .       .      ,                 1                
## Murder     ,              .      .          ,          1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
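
The coefficients and p-values above can be cross-checked with base R alone. The following is a minimal sketch, not the method used in this report. It assumes that R's built-in `state.x77` matrix holds the same figures as StateIndicator.csv (the summary statistics appear to match, but that is an assumption), and `states`, `r_matrix`, and `life_income` are hypothetical names.

```r
# Assumption: the built-in state.x77 matrix stands in for StateIndicator.csv.
states <- as.data.frame(state.x77[, c("Population", "Income", "Illiteracy",
                                      "Life Exp", "Murder", "HS Grad")])
names(states) <- c("Population", "Income", "Illiteracy",
                   "LifeExp", "Murder", "HSGrad")

# Full Pearson correlation matrix, rounded for readability.
r_matrix <- round(cor(states, method = "pearson"), 2)
print(r_matrix)

# Significance test for a single pair (Life Expectancy vs. Income).
life_income <- cor.test(states$LifeExp, states$Income, method = "pearson")
print(life_income$estimate)  # Pearson's r for this pair
print(life_income$p.value)   # two-sided p-value for H0: r = 0
```

`cor()` gives the whole matrix at once, while `cor.test()` handles one pair at a time, so a loop or `sapply()` over pairs would be needed to rebuild the full p-value table.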

The rquery.cormat() output shows the following correlations:

  1. Life Expectancy and Income show a medium positive correlation (r = 0.34).
  2. Life Expectancy and Percent of High School Graduates show a large positive correlation (r = 0.58).
  3. Life Expectancy and Population show a negligible negative correlation (r = -0.068).
  4. Life Expectancy and Illiteracy show a large negative correlation (r = -0.59).
  5. Life Expectancy and Murder show a large negative correlation (r = -0.78).
  6. Income and Percent of High School Graduates show a large positive correlation (r = 0.62).
  7. Income and Population show a small positive correlation (r = 0.21).
  8. Income and Illiteracy show a medium negative correlation (r = -0.44).
  9. Income and Murder show a small negative correlation (r = -0.23).
  10. Percent of High School Graduates and Population show a small negative correlation (r = -0.098, which rounds to -0.1).
  11. Percent of High School Graduates and Illiteracy show a large negative correlation (r = -0.66).
  12. Percent of High School Graduates and Murder show a medium negative correlation (r = -0.49).
  13. Population and Illiteracy show a small positive correlation (r = 0.11).
  14. Population and Murder show a medium positive correlation (r = 0.34).
  15. Illiteracy and Murder show a large positive correlation (r = 0.70).
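
The size labels in the list can be assigned consistently with a small helper. This is only a sketch: the cut-offs (0.1, 0.3, 0.5) follow Cohen's common conventions, which is an assumption about the rule used here, and `describe_r` is a hypothetical function name.

```r
# Assumed thresholds (Cohen's conventions): |r| < 0.1 negligible,
# 0.1 to < 0.3 small, 0.3 to < 0.5 medium, 0.5 and above large.
describe_r <- function(r) {
  size <- cut(abs(r),
              breaks = c(0, 0.1, 0.3, 0.5, 1),
              labels = c("negligible", "small", "medium", "large"),
              right = FALSE, include.lowest = TRUE)
  direction <- ifelse(r < 0, "negative", "positive")
  paste(size, direction)
}

# Coefficients copied from the $r table above.
labels <- describe_r(c(0.34, -0.78, 0.21, -0.068))
print(labels)
```

Binning with `cut()` keeps the rule in one place, so changing a threshold relabels every pair at once.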

Task 2

Question ai

I use the following code to create a plot showing the relationship between HSGrad and Income.

StateData %>% ggvis(~HSGrad, ~Income) %>% 
  layer_points() %>% layer_model_predictions(model="lm") %>%
  add_axis("x", title = "Percentage of High School Graduates in State", title_offset = 35) %>%
  add_axis("y", title = "Income per Capita (1974)", title_offset = 50) %>%
  add_axis("x", orient = "top", ticks = 0, 
           title = "Relationship between High School Graduates and Income",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))
## Guessing formula = Income ~ HSGrad

Question aii

I create a plot of the relationship between Illiteracy and Income by using the following code.

StateData %>% ggvis(~Illiteracy, ~Income) %>% 
  layer_points() %>% layer_model_predictions(model="lm") %>%
  add_axis("x", title = "Percent Illiterate in State (1970)", title_offset = 35) %>%
  add_axis("y", title = "Income per Capita (1974)", title_offset = 50) %>%
  add_axis("x", orient = "top", ticks = 0, 
           title = "Relationship between Illiteracy and Income",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))
## Guessing formula = Income ~ Illiteracy
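
The straight lines that layer_model_predictions(model="lm") draws in the two plots above correspond to ordinary least-squares fits, which can be inspected numerically. A minimal sketch, again assuming the built-in `state.x77` matrix matches StateIndicator.csv; `states`, `fit_hs`, and `fit_ill` are hypothetical names.

```r
# Assumption: the built-in state.x77 matrix stands in for StateIndicator.csv.
states <- as.data.frame(state.x77[, c("Income", "Illiteracy", "HS Grad")])
names(states) <- c("Income", "Illiteracy", "HSGrad")

# The same formulas that ggvis reports guessing for the two plots.
fit_hs  <- lm(Income ~ HSGrad, data = states)
fit_ill <- lm(Income ~ Illiteracy, data = states)

# Intercept and slope of each fitted line.
print(coef(fit_hs))
print(coef(fit_ill))

# For a one-predictor model, R-squared is the square of Pearson's r.
print(summary(fit_hs)$r.squared)
```

The sign of each slope should agree with the sign of the corresponding correlation from Task 1.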

Question b

After reviewing the data and running some test plots, I realized that a scatterplot for Murder by Illiteracy grouped by HSGrad looked confusing because HSGrad takes a different value for almost every state, which produces far too many groups.

Therefore, to make the scatterplot more readable, I rounded HSGrad to the nearest 10 to create a more reasonable number of groups. I also included a linear model for each group to increase the graph’s clarity.

StateData$HSGradR <- round(StateData$HSGrad, digits=-1)
StateData
## Source: local data frame [50 x 7]
## 
##    Population Income Illiteracy LifeExp Murder HSGrad HSGradR
##         (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)   (dbl)
## 1        3615   3624        2.1   69.05   15.1   41.3      40
## 2         365   6315        1.5   69.31   11.3   66.7      70
## 3        2212   4530        1.8   70.55    7.8   58.1      60
## 4        2110   3378        1.9   70.66   10.1   39.9      40
## 5       21198   5114        1.1   71.71   10.3   62.6      60
## 6        2541   4884        0.7   72.06    6.8   63.9      60
## 7        3100   5348        1.1   72.48    3.1   56.0      60
## 8         579   4809        0.9   70.06    6.2   54.6      50
## 9        8277   4815        1.3   70.66   10.7   52.6      50
## 10       4931   4091        2.0   68.54   13.9   40.6      40
## ..        ...    ...        ...     ...    ...    ...     ...
StateData %>% ggvis(~Murder, ~Illiteracy, fill = ~factor(HSGradR)) %>% 
  layer_points() %>%
  group_by(HSGradR) %>%
  layer_model_predictions(model="lm") %>%
  add_axis("x", title="Murder/Manslaughter Rate per 100,000 (1976)", title_offset=35) %>%
  add_axis("y", title="Percent Illiterate in State (1970)", title_offset=60) %>%
  add_axis("x", orient = "top", ticks = 0, 
           title = "Relationship between Murder and Illiteracy 
           grouped by High School Grad Percentage (rounded)",
           properties = axis_props(
             axis = list(stroke = "white"),
             labels = list(fontSize = 0)))
## Guessing formula = Illiteracy ~ Murder
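
Because a per-group line is only informative when the group contains several points, it can help to count how many states land in each rounded bin. A small sketch using just the first ten HSGrad values printed above (not the full data set); `hs_grad` and `hs_bin` are hypothetical names.

```r
# First ten HSGrad values from the printed table above.
hs_grad <- c(41.3, 66.7, 58.1, 39.9, 62.6, 63.9, 56.0, 54.6, 52.6, 40.6)

# digits = -1 rounds to the nearest ten, matching the HSGradR column.
hs_bin <- round(hs_grad, digits = -1)
print(hs_bin)

# Number of states per bin; small bins give unstable group-wise lines.
bin_counts <- table(hs_bin)
print(bin_counts)
```

Running the same `table()` call on the full HSGradR column would show whether any group is too small to support its own fitted line.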

Task 3

Question a

Before testing the null hypothesis that there is no difference in Income between states above the median HSGrad and states at or below the median HSGrad, I needed to create a new variable indicating which side of the median each state falls on.

First, I calculated the median. I then used that figure to code each state as 1 (above the median) or 2 (at or below the median) and viewed the data to make sure I had the results I needed.

median(StateData$HSGrad, na.rm=TRUE)
## [1] 53.25
# Code states at or below the median as 2 and states above it as 1
StateData$HSMedian <- ifelse(StateData$HSGrad <= 53.25, 2, 1)
StateData
## Source: local data frame [50 x 8]
## 
##    Population Income Illiteracy LifeExp Murder HSGrad HSGradR HSMedian
##         (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)   (dbl)    (dbl)
## 1        3615   3624        2.1   69.05   15.1   41.3      40        2
## 2         365   6315        1.5   69.31   11.3   66.7      70        1
## 3        2212   4530        1.8   70.55    7.8   58.1      60        1
## 4        2110   3378        1.9   70.66   10.1   39.9      40        2
## 5       21198   5114        1.1   71.71   10.3   62.6      60        1
## 6        2541   4884        0.7   72.06    6.8   63.9      60        1
## 7        3100   5348        1.1   72.48    3.1   56.0      60        1
## 8         579   4809        0.9   70.06    6.2   54.6      50        1
## 9        8277   4815        1.3   70.66   10.7   52.6      50        2
## 10       4931   4091        2.0   68.54   13.9   40.6      40        2
## ..        ...    ...        ...     ...    ...    ...     ...      ...
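
The same split can be written without hard-coding the median and with readable group labels. This is a sketch rather than the code used in this report. It assumes the built-in `state.x77` matrix matches StateIndicator.csv, and `hs_grad`, `cutoff`, and `hs_median` are hypothetical names.

```r
# Assumption: the built-in state.x77 matrix stands in for StateIndicator.csv.
hs_grad <- state.x77[, "HS Grad"]

# Compute the cut-off instead of typing it in by hand.
cutoff <- median(hs_grad, na.rm = TRUE)

# Same coding as HSMedian (1 = above, 2 = at or below), stored as a factor.
hs_median <- factor(ifelse(hs_grad <= cutoff, 2, 1),
                    levels = c(1, 2),
                    labels = c("above median", "at or below median"))

# Check how many states fall in each group.
group_sizes <- table(hs_median)
print(cutoff)
print(group_sizes)
```

Storing the indicator as a labelled factor makes the t-test output easier to read, since the groups appear by name rather than as 1 and 2.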

I then used a two-sample t-test of the difference in the means of these two groups to test the null hypothesis stated above. In short, if the difference is significantly different from 0, I reject the null hypothesis in favor of the alternative, which states that there is a difference in Income between states above the median percentage of high school graduates and states at or below it. In other words, Income and HSMedian would be related.

I set my acceptable Type I error rate to 0.05 (\(\alpha\) = 0.05).

t.test(StateData$Income ~ StateData$HSMedian, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  StateData$Income by StateData$HSMedian
## t = 1.9642, df = 48, p-value = 0.05531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   -7.838476 671.518476
## sample estimates:
## mean in group 1 mean in group 2 
##         4601.72         4269.88

The t-test shows that the estimated difference between the means is almost twice its standard error (t = 1.9642). However, the p-value (0.05531) is greater than \(\alpha\), and the 95 percent confidence interval (-7.84 to 671.52) includes 0. Therefore, I fail to reject the null hypothesis.
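
To make the meaning of t concrete, the statistic can be rebuilt by hand as the difference in means divided by the pooled standard error. A minimal sketch, assuming the built-in `state.x77` matrix matches StateIndicator.csv; all variable names here are hypothetical. If that assumption holds, the result should agree with the t reported above.

```r
# Assumption: the built-in state.x77 matrix stands in for StateIndicator.csv.
states <- as.data.frame(state.x77[, c("Income", "HS Grad")])
names(states) <- c("Income", "HSGrad")

cutoff <- median(states$HSGrad)
above <- states$Income[states$HSGrad >  cutoff]  # group 1
below <- states$Income[states$HSGrad <= cutoff]  # group 2

# Pooled variance and standard error of the difference in means.
n1 <- length(above)
n2 <- length(below)
pooled_var <- ((n1 - 1) * var(above) + (n2 - 1) * var(below)) / (n1 + n2 - 2)
std_error  <- sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic and two-sided p-value on n1 + n2 - 2 degrees of freedom.
t_manual <- (mean(above) - mean(below)) / std_error
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# Compare with the built-in equal-variance test.
t_builtin <- t.test(above, below, var.equal = TRUE)
print(c(manual = t_manual, builtin = unname(t_builtin$statistic)))
print(c(manual = p_manual, builtin = t_builtin$p.value))
```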

Question b

Before testing the null hypothesis in this task, I must first regroup my data according to the requirements listed. I filter the states into the two required groups, tag each group with an ID, and combine the two groups into a new data frame.

grp1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois", "Kentucky", "Louisiana",
          "Mississippi", "Michigan")
grp2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine", "Minnesota", "Nebraska",
          "New Hampshire", "North Dakota")

grp1states <- filter(StateInfo, stateNames %in% grp1)
grp1states$ID <- 1
grp1states
## Source: local data frame [9 x 8]
## 
##    stateNames Population Income Illiteracy LifeExp Murder HSGrad    ID
##        (fctr)      (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl) (dbl)
## 1     Alabama       3615   3624        2.1   69.05   15.1   41.3     1
## 2      Alaska        365   6315        1.5   69.31   11.3   66.7     1
## 3    Arkansas       2110   3378        1.9   70.66   10.1   39.9     1
## 4     Georgia       4931   4091        2.0   68.54   13.9   40.6     1
## 5    Illinois      11197   5107        0.9   70.14   10.3   52.6     1
## 6    Kentucky       3387   3712        1.6   70.10   10.6   38.5     1
## 7   Louisiana       3806   3545        2.8   68.76   13.2   42.2     1
## 8    Michigan       9111   4751        0.9   70.63   11.1   52.8     1
## 9 Mississippi       2341   3098        2.4   68.09   12.5   41.0     1
grp2states <- filter(StateInfo, stateNames %in% grp2)
grp2states$ID <- 2
grp2states
## Source: local data frame [9 x 8]
## 
##      stateNames Population Income Illiteracy LifeExp Murder HSGrad    ID
##          (fctr)      (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl) (dbl)
## 1       Arizona       2212   4530        1.8   70.55    7.8   58.1     2
## 2   Connecticut       3100   5348        1.1   72.48    3.1   56.0     2
## 3          Iowa       2861   4628        0.5   72.56    2.3   59.0     2
## 4        Kansas       2280   4669        0.6   72.58    4.5   59.9     2
## 5         Maine       1058   3694        0.7   70.39    2.7   54.7     2
## 6     Minnesota       3921   4675        0.6   72.96    2.3   57.6     2
## 7      Nebraska       1544   4508        0.6   72.60    2.9   59.3     2
## 8 New Hampshire        812   4281        0.7   71.23    3.3   57.6     2
## 9  North Dakota        637   5087        0.8   72.78    1.4   50.3     2
NewStateInfo <- rbind(grp1states, grp2states)
NewStateInfo
## Source: local data frame [18 x 8]
## 
##       stateNames Population Income Illiteracy LifeExp Murder HSGrad    ID
##           (fctr)      (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl) (dbl)
## 1        Alabama       3615   3624        2.1   69.05   15.1   41.3     1
## 2         Alaska        365   6315        1.5   69.31   11.3   66.7     1
## 3       Arkansas       2110   3378        1.9   70.66   10.1   39.9     1
## 4        Georgia       4931   4091        2.0   68.54   13.9   40.6     1
## 5       Illinois      11197   5107        0.9   70.14   10.3   52.6     1
## 6       Kentucky       3387   3712        1.6   70.10   10.6   38.5     1
## 7      Louisiana       3806   3545        2.8   68.76   13.2   42.2     1
## 8       Michigan       9111   4751        0.9   70.63   11.1   52.8     1
## 9    Mississippi       2341   3098        2.4   68.09   12.5   41.0     1
## 10       Arizona       2212   4530        1.8   70.55    7.8   58.1     2
## 11   Connecticut       3100   5348        1.1   72.48    3.1   56.0     2
## 12          Iowa       2861   4628        0.5   72.56    2.3   59.0     2
## 13        Kansas       2280   4669        0.6   72.58    4.5   59.9     2
## 14         Maine       1058   3694        0.7   70.39    2.7   54.7     2
## 15     Minnesota       3921   4675        0.6   72.96    2.3   57.6     2
## 16      Nebraska       1544   4508        0.6   72.60    2.9   59.3     2
## 17 New Hampshire        812   4281        0.7   71.23    3.3   57.6     2
## 18  North Dakota        637   5087        0.8   72.78    1.4   50.3     2

Using this new data frame, I’m able to test my null hypothesis with a t-test of the difference in means between these two groups. The null hypothesis states that there is no difference in Murder between the first group of states and the second group of states. The alternative hypothesis states that the murder and non-negligent manslaughter rates differ between these two groups.

I set my acceptable Type I error rate to 0.05 (\(\alpha\) = 0.05).

t.test(NewStateInfo$Murder ~ NewStateInfo$ID, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  NewStateInfo$Murder by NewStateInfo$ID
## t = 10.124, df = 16, p-value = 2.312e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.834422 10.454467
## sample estimates:
## mean in group 1 mean in group 2 
##       12.011111        3.366667

The t-test shows that the estimated difference between the means is more than 10 times its standard error (t = 10.124), and the p-value (2.312e-08) is far below \(\alpha\). Therefore, I reject the null hypothesis and conclude that the murder rates of the two groups differ.
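
The test above assumes equal variances in the two groups (var.equal=TRUE). As a robustness check, the following sketch reruns the comparison with Welch's t-test, which drops that assumption. The Murder values are copied from the two group tables printed above; `murder_grp1`, `murder_grp2`, `pooled`, and `welch` are hypothetical names, and this check is an addition rather than part of the original analysis.

```r
# Murder rates copied from the group 1 and group 2 tables above.
murder_grp1 <- c(15.1, 11.3, 10.1, 13.9, 10.3, 10.6, 13.2, 11.1, 12.5)
murder_grp2 <- c(7.8, 3.1, 2.3, 4.5, 2.7, 2.3, 2.9, 3.3, 1.4)

# Equal-variance (pooled) test, as used in the report.
pooled <- t.test(murder_grp1, murder_grp2, var.equal = TRUE)

# Welch's test: no equal-variance assumption, adjusted degrees of freedom.
welch <- t.test(murder_grp1, murder_grp2, var.equal = FALSE)

print(c(pooled_t = unname(pooled$statistic), welch_t = unname(welch$statistic)))
print(c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter)))
print(c(pooled_p = pooled$p.value, welch_p = welch$p.value))
```

With equal group sizes the two tests share the same t statistic and differ only in their degrees of freedom, so a conclusion this strong is unlikely to depend on the variance assumption.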