Data for this exam was provided by the professor and can be accessed here.

An Analysis of Relationships Between State-level Indicators

To begin, I created a new project in R and read in the data file from my project folder.

I also loaded a few standard packages that I anticipated needing, along with the ‘cormat’ function:

To assess correlations among the variables, I retained only the continuous variables by creating a data frame that excludes ‘stateNames’ (a categorical variable). The first printout below shows the original data (50 x 7); the second shows the reduced data frame (50 x 6).

## Source: local data frame [50 x 7]
## 
##     stateNames Population Income Illiteracy LifeExp Murder HSGrad
##         (fctr)      (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)
## 1      Alabama       3615   3624        2.1   69.05   15.1   41.3
## 2       Alaska        365   6315        1.5   69.31   11.3   66.7
## 3      Arizona       2212   4530        1.8   70.55    7.8   58.1
## 4     Arkansas       2110   3378        1.9   70.66   10.1   39.9
## 5   California      21198   5114        1.1   71.71   10.3   62.6
## 6     Colorado       2541   4884        0.7   72.06    6.8   63.9
## 7  Connecticut       3100   5348        1.1   72.48    3.1   56.0
## 8     Delaware        579   4809        0.9   70.06    6.2   54.6
## 9      Florida       8277   4815        1.3   70.66   10.7   52.6
## 10     Georgia       4931   4091        2.0   68.54   13.9   40.6
## ..         ...        ...    ...        ...     ...    ...    ...
## Source: local data frame [50 x 6]
## 
##    Population Income Illiteracy LifeExp Murder HSGrad
##         (int)  (int)      (dbl)   (dbl)  (dbl)  (dbl)
## 1        3615   3624        2.1   69.05   15.1   41.3
## 2         365   6315        1.5   69.31   11.3   66.7
## 3        2212   4530        1.8   70.55    7.8   58.1
## 4        2110   3378        1.9   70.66   10.1   39.9
## 5       21198   5114        1.1   71.71   10.3   62.6
## 6        2541   4884        0.7   72.06    6.8   63.9
## 7        3100   5348        1.1   72.48    3.1   56.0
## 8         579   4809        0.9   70.06    6.2   54.6
## 9        8277   4815        1.3   70.66   10.7   52.6
## 10       4931   4091        2.0   68.54   13.9   40.6
## ..        ...    ...        ...     ...    ...    ...
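A minimal sketch of this step follows. It assumes that R's built-in `state.x77` matrix and `state.name` vector can stand in for the professor's file (the first ten rows printed above match them), and it uses base R rather than the packages loaded for the exam. The object names `states` and `states_num` are hypothetical.

```r
# Assumption: the built-in state.x77 data stands in for the professor's file.
cols <- c("Population", "Income", "Illiteracy", "Life Exp", "Murder", "HS Grad")
states <- data.frame(stateNames = factor(state.name),
                     state.x77[, cols],
                     row.names = NULL)

# Rename the columns to match the names used in this report.
names(states) <- c("stateNames", "Population", "Income",
                   "Illiteracy", "LifeExp", "Murder", "HSGrad")

# Keep only the continuous variables by dropping the categorical column.
states_num <- states[, names(states) != "stateNames"]

# Inspect the structure: 50 rows, 6 numeric columns.
str(states_num)
```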

1.) Using ‘cormat’, I created a table of correlation coefficients ($r), a table of p-values giving the significance of each correlation ($p), and a symbolic summary of correlation strength ($sym), along with a graphical representation in which negative correlations are shown in blue and positive correlations in red.

## $r
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp          1                                           
## Income        0.34      1                                    
## HSGrad        0.58   0.62      1                             
## Population  -0.068   0.21 -0.098          1                  
## Illiteracy   -0.59  -0.44  -0.66       0.11          1       
## Murder       -0.78  -0.23  -0.49       0.34        0.7      1
## 
## $p
##            LifeExp  Income  HSGrad Population Illiteracy Murder
## LifeExp          0                                             
## Income       0.016       0                                     
## HSGrad     9.2e-06 1.6e-06       0                             
## Population    0.64    0.15     0.5          0                  
## Illiteracy   7e-06  0.0015 2.2e-07       0.46          0       
## Murder     2.3e-11    0.11 0.00032      0.015    1.3e-08      0
## 
## $sym
##            LifeExp Income HSGrad Population Illiteracy Murder
## LifeExp    1                                                 
## Income     .       1                                         
## HSGrad     .       ,      1                                  
## Population                       1                           
## Illiteracy .       .      ,                 1                
## Murder     ,              .      .          ,          1     
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
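The three tables above can be approximated with base R alone. The sketch below is not the ‘cormat’ function itself; it swaps in `cor()`, `cor.test()`, and `symnum()`, and it assumes the built-in `state.x77` matrix stands in for the exam data. The names `x`, `r`, and `p` are hypothetical.

```r
# Assumption: state.x77 stands in for the exam data; columns are ordered
# to match the tables in this report.
x <- state.x77[, c("Life Exp", "Income", "HS Grad",
                   "Population", "Illiteracy", "Murder")]
colnames(x) <- c("LifeExp", "Income", "HSGrad",
                 "Population", "Illiteracy", "Murder")

# Pearson correlation coefficients for every pair of variables.
r <- cor(x)

# Matching matrix of p-values, one cor.test() per pair.
# The diagonal is left at 0, as in the $p table.
p <- matrix(0, ncol(x), ncol(x), dimnames = dimnames(r))
for (i in seq_len(ncol(x) - 1)) {
  for (j in (i + 1):ncol(x)) {
    p[i, j] <- p[j, i] <- cor.test(x[, i], x[, j])$p.value
  }
}

print(round(r, 2))
print(signif(p, 2))

# Symbolic summary of correlation strength, with a cutpoint legend.
print(symnum(r))
```

Reading the two numeric tables together shows which correlations are both large and significant at the .05 level, for example LifeExp and Murder.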

2.) Using ‘ggvis’, I constructed plots showing the relationship between:

a.) HSGrad and Income:

b.) Illiteracy and Income:

c.) Murder and Illiteracy, as a scatterplot grouped by HSGrad.

For this requirement, I split HSGrad into two categories: states above the 50-state mean graduation percentage and states below it. First, the mean:

## Source: local data frame [1 x 2]
## 
##   proportion_grad number_valid
##             (dbl)        (int)
## 1          53.108           50

Having calculated the mean HSGrad rate (53.108), I recoded HSGrad into two categories: below the mean HS graduation rate (0) and above the mean graduation rate (1):

## 
##  0  1 
## 24 26
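The mean and the recode can be sketched in base R as follows, again assuming `state.x77` stands in for the exam data. The name `task2` is taken from the t-test output later in this report; `states` and `mean_grad` are hypothetical names.

```r
# Assumption: state.x77 stands in for the exam data.
states <- data.frame(stateNames = state.name,
                     HSGrad = state.x77[, "HS Grad"],
                     row.names = NULL)

# Mean graduation rate across the 50 states and the number of valid values.
mean_grad <- mean(states$HSGrad)
number_valid <- sum(!is.na(states$HSGrad))

# Recode: 0 = below the mean, 1 = above the mean.
states$task2 <- as.integer(states$HSGrad > mean_grad)

print(mean_grad)
print(number_valid)
print(table(states$task2))
```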

Murder by Illiteracy and grouped by HSGrad:
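The plots in this section could be built along these lines. This is a sketch only: it assumes the ‘ggvis’ package is installed, that `state.x77` stands in for the exam data, and that Income and Murder belong on the vertical axes. The object names are hypothetical.

```r
library(ggvis)

# Assumption: state.x77 stands in for the exam data.
states <- data.frame(Income = state.x77[, "Income"],
                     Illiteracy = state.x77[, "Illiteracy"],
                     Murder = state.x77[, "Murder"],
                     HSGrad = state.x77[, "HS Grad"],
                     row.names = NULL)
states$task2 <- as.integer(states$HSGrad > mean(states$HSGrad))

# Income against HSGrad.
p_income_grad <- states %>% ggvis(~HSGrad, ~Income) %>% layer_points()

# Income against Illiteracy.
p_income_illit <- states %>% ggvis(~Illiteracy, ~Income) %>% layer_points()

# Murder by Illiteracy, with points coloured by the recoded HSGrad group.
p_grouped <- states %>%
  ggvis(~Illiteracy, ~Murder, fill = ~factor(task2)) %>%
  layer_points()
```

Printing any of the three objects (for example `p_grouped`) renders the plot in the viewer.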

3.) a.) I also tested the null hypothesis that the correlation between Income and whether a state’s HSGrad rate is above or below the mean is 0.

Null Hypothesis: There is no correlation between mean income and whether or not a state’s HSGrad rate is above or below the mean HSGrad rate.

Alternate Hypothesis: There is a correlation between mean income and whether or not a state’s HSGrad rate is above or below the mean HSGrad rate.

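I will set alpha equal to .05 and use a two-sample t-test for a difference in mean Income between the two groups. Because the grouping variable has only two levels, a correlation of 0 is equivalent to no difference between the group means: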

## 
##  Two Sample t-test
## 
## data:  Income by task2
## t = -2.0303, df = 48, p-value = 0.04789
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -681.492855   -3.314837
## sample estimates:
## mean in group 0 mean in group 1 
##        4257.750        4600.154

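Based on the results of the t-test (t = -2.03, df = 48, p = .048), I reject the null hypothesis: the p-value is below the alpha of .05, although only narrowly. States with an above-mean HSGrad rate have a higher mean income (4600.15) than states with a below-mean rate (4257.75).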
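The test and the decision rule can be sketched as follows, assuming `state.x77` stands in for the exam data and that the pooled-variance form (`var.equal = TRUE`) was used, as the "Two Sample t-test" header and df = 48 suggest. The second call illustrates that testing a zero correlation between Income and the 0/1 group indicator gives the same p-value. The names `tt`, `ct`, and `alpha` are hypothetical.

```r
# Assumption: state.x77 stands in for the exam data.
states <- data.frame(Income = state.x77[, "Income"],
                     HSGrad = state.x77[, "HS Grad"],
                     row.names = NULL)
states$task2 <- as.integer(states$HSGrad > mean(states$HSGrad))

alpha <- 0.05

# Pooled-variance two-sample t-test of Income by graduation group.
tt <- t.test(Income ~ task2, data = states, var.equal = TRUE)
print(tt)

# Correlation test between Income and the 0/1 indicator;
# its p-value matches the pooled t-test.
ct <- cor.test(states$Income, states$task2)

# Decision rule: reject the null hypothesis when p < alpha.
reject_null <- tt$p.value < alpha
print(reject_null)
```

Comparing the unrounded p-value with alpha avoids the ambiguity that appears when a p-value is rounded to two decimal places first.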

b.) Here I test the null hypothesis that there is no difference in Murder rates between two groups of states:

Null Hypothesis: There is no difference in the mean murder rate between two groups of states: group 1 (Alabama, Alaska, Arkansas, Georgia, Illinois, Kentucky, Louisiana, Mississippi, Michigan) and group 2 (Arizona, Connecticut, Iowa, Kansas, Maine, Minnesota, Nebraska, New Hampshire, North Dakota).

Alternate Hypothesis: There is a difference in the mean murder rate between the same two groups of states: group 1 (Alabama, Alaska, Arkansas, Georgia, Illinois, Kentucky, Louisiana, Mississippi, Michigan) and group 2 (Arizona, Connecticut, Iowa, Kansas, Maine, Minnesota, Nebraska, New Hampshire, North Dakota).

I will set alpha equal to .05 and use a t-test to test for a difference between the means.

I recoded stateNames into two levels corresponding to the groups above, ‘states1’ and ‘states2’ (with all other states recoded as ‘rest’):

## [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"
## [1] "Alabama"     "Alaska"      "Arkansas"    "Georgia"     "Illinois"   
## [6] "Kentucky"    "Louisiana"   "Mississippi" "Michigan"
## [1] "Arizona"       "Connecticut"   "Iowa"          "Kansas"       
## [5] "Maine"         "Minnesota"     "Nebraska"      "New Hampshire"
## [9] "North Dakota"
## [1] "other" "other" "other" "other" "other" "other" "other" "other" "other"

I also recombined the recoded stateNames variable for the t.test:

## [1] states1 states2 rest   
## Levels: rest states1 states2
## [1] states1 states2
## Levels: states1 states2
## [1] states1 states2
## Levels: states1 states2
## new_groups
## states1 states2 
##       1       1

Now the t.test:

## 
##  One Sample t-test
## 
## data:  new_data$Murder
## t = 14.132, df = 49, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  6.328876 8.427124
## sample estimates:
## mean of x 
##     7.378

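The output above is a one-sample t-test: it tests whether the mean murder rate across all 50 states (7.378, df = 49) differs from 0, and the p-value of < 2.2e-16 rejects only that hypothesis. It does not compare the two groups of states, and the new_groups table above shows one observation per level rather than nine, so the recoding did not carry through to the test. On this output I can neither reject nor fail to reject the null hypothesis of no difference in mean murder rate between the two groups; that decision requires a two-sample t-test of Murder by group, restricted to the 18 named states.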
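To compare the two groups of states directly, one option is to keep only the 18 named states and run a two-sample t-test of Murder by group. The sketch below assumes `state.x77` stands in for the exam data and uses the pooled-variance form for consistency with part a.); the names `group1`, `group2`, `murder`, and `res` are hypothetical, while `new_groups` and `new_data` follow the output above.

```r
# The two groups of states named in the hypotheses.
group1 <- c("Alabama", "Alaska", "Arkansas", "Georgia", "Illinois",
            "Kentucky", "Louisiana", "Mississippi", "Michigan")
group2 <- c("Arizona", "Connecticut", "Iowa", "Kansas", "Maine",
            "Minnesota", "Nebraska", "New Hampshire", "North Dakota")

# Assumption: state.x77 stands in for the exam data.
murder <- data.frame(stateNames = state.name,
                     Murder = state.x77[, "Murder"],
                     row.names = NULL)

# Recode every state as states1, states2, or rest.
murder$new_groups <- ifelse(murder$stateNames %in% group1, "states1",
                     ifelse(murder$stateNames %in% group2, "states2", "rest"))

# Drop the remaining states so that only the two groups are compared.
new_data <- subset(murder, new_groups != "rest")
new_data$new_groups <- factor(new_data$new_groups)

# Each level should contain nine states before the test is run.
print(table(new_data$new_groups))

# Two-sample t-test of Murder by group (pooled variance is an assumption).
res <- t.test(Murder ~ new_groups, data = new_data, var.equal = TRUE)
print(res)
```

Checking the group counts before the test guards against the recoding problem visible in the new_groups table, and the printed header should read "Two Sample t-test" with 16 degrees of freedom.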