Packages used: ggpmisc, ggplot2, dplyr, car, bbmle, cowplot
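As a setup sketch (an assumption based on the functions called below, not something stated in the original): leveneTest() comes from car, the pipe and group_by()/summarise() from dplyr, ggplot() from ggplot2, stat_poly_eq() from ggpmisc, AICctab() from bbmle, and plot_grid() from cowplot. The wheat, censur, and cancer data frames are assumed to be already loaded.

```r
# Assumed setup: packages supplying the functions used throughout this notebook.
library(car)      # leveneTest()
library(dplyr)    # %>%, group_by(), summarise()
library(ggplot2)  # ggplot(), geom_*(), theme()
library(ggpmisc)  # stat_poly_eq()
library(bbmle)    # AICctab()
library(cowplot)  # plot_grid()
```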

attach(wheat)
wheat$fsoil <- as.factor(wheat$soil)
wheat$fsoil <- relevel(wheat$fsoil, ref = "sand")
leveneTest(yield ~ wheat$fsoil)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2   0.249 0.7813
      27               
shapiro.test(yield)

    Shapiro-Wilk normality test

data:  yield
W = 0.97214, p-value = 0.5993
modelA <- aov(yield~ wheat$fsoil)
modelB <- lm(yield ~ wheat$fsoil)
summary(modelA)
            Df Sum Sq Mean Sq F value Pr(>F)  
wheat$fsoil  2   99.2   49.60   4.245  0.025 *
Residuals   27  315.5   11.69                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(modelB)

Call:
lm(formula = yield ~ wheat$fsoil)

Residuals:
   Min     1Q Median     3Q    Max 
  -8.5   -1.8    0.3    1.7    7.1 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)        9.900      1.081   9.158 9.04e-10 ***
wheat$fsoilclay    1.600      1.529   1.047  0.30456    
wheat$fsoilloam    4.400      1.529   2.878  0.00773 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.418 on 27 degrees of freedom
Multiple R-squared:  0.2392,    Adjusted R-squared:  0.1829 
F-statistic: 4.245 on 2 and 27 DF,  p-value: 0.02495
posthoc1 <- TukeyHSD(x=modelA, ordered = FALSE, conf.level=0.95)
posthoc1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ wheat$fsoil)

$`wheat$fsoil`
          diff        lwr      upr     p adj
clay-sand  1.6 -2.1903777 5.390378 0.5546301
loam-sand  4.4  0.6096223 8.190378 0.0204414
loam-clay  2.8 -0.9903777 6.590378 0.1785489
WhatDoIDo <- wheat %>%
  group_by(fsoil) %>%
  summarise(across(where(is.numeric), mean))  # funs() is deprecated; averaging only numeric columns avoids the "argument is not numeric or logical" warnings
  1. Fold change computed as (V2 - V1)/V1. The ANOVA shows a significant overall effect of soil type on yield, but Tukey's HSD indicates the only significant pairwise difference is between sand and loam; sand vs. clay and clay vs. loam are not significant. From the model coefficients, loam yields about 44% more than sand, roughly a 1.4-fold increase ((9.9 + 4.4)/9.9 ≈ 1.44).
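As a quick check of the fold-change arithmetic, here is a minimal sketch, assuming the same wheat data frame with yield and the releveled fsoil factor used above:

```r
# Mean yield per soil type, then the relative increase (V2 - V1)/V1 of loam over sand.
grp <- tapply(wheat$yield, wheat$fsoil, mean)
grp
(grp["loam"] - grp["sand"]) / grp["sand"]  # about 0.44 from the coefficients above, i.e. ~1.4-fold
```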
attach(censur)
leveneTest(med_income ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2  25.198 1.391e-11 ***
      3140                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
leveneTest(poverty ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2  52.884 < 2.2e-16 ***
      3140                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
shapiro.test(poverty)

    Shapiro-Wilk normality test

data:  poverty
W = 0.94815, p-value < 2.2e-16
shapiro.test(med_income)

    Shapiro-Wilk normality test

data:  med_income
W = 0.91527, p-value < 2.2e-16
logmed <- log10(med_income)
shapiro.test(logmed)

    Shapiro-Wilk normality test

data:  logmed
W = 0.98768, p-value = 6.702e-16
leveneTest(logmed ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2  8.9905 0.0001278 ***
      3140                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
sqpov <- (sqrt(poverty))
shapiro.test(sqpov)

    Shapiro-Wilk normality test

data:  sqpov
W = 0.99203, p-value = 3.325e-12
leveneTest(sqpov ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2  27.057 2.236e-12 ***
      3140                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
hist(logmed, breaks=c(seq(4,6, .1)), prob = TRUE)
lines(density(logmed))

hist(sqpov, breaks=c(seq(0,8, .50)), prob = TRUE)
lines(density(sqpov))

modelS <-  lm(logmed ~ redblue) 
summary.lm(modelS)

Call:
lm(formula = logmed ~ redblue)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32225 -0.06309 -0.00701  0.05672  0.39167 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.602888   0.002260 2036.53   <2e-16 ***
redblueb    0.080168   0.004061   19.74   <2e-16 ***
redblueu    0.068307   0.005615   12.16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.09888 on 3140 degrees of freedom
Multiple R-squared:  0.126, Adjusted R-squared:  0.1255 
F-statistic: 226.4 on 2 and 3140 DF,  p-value: < 2.2e-16
boxplot(logmed ~ redblue)

modelT <-  lm(sqpov ~ redblue) 
summary.lm(modelT)

boxplot(sqpov ~ redblue)

ModelNS <- lm(logmed ~ 1)
ModelNT <- lm(sqpov ~ 1)
AICctab(modelS, modelT, ModelNT, ModelNS, weights = TRUE, delta = TRUE, base=TRUE, sort = TRUE)
        AICc    dAICc   df weight
modelS  -5620.4     0.0 4  1     
ModelNS -5200.9   419.4 2  <0.001
modelT   7157.7 12778.1 4  <0.001
ModelNT  7474.7 13095.1 2  <0.001
WhatDoIDid <- censur %>%
  group_by(redblue) %>%
  summarise(across(where(is.numeric), mean))  # numeric columns only, so no NA warnings
  2. The main issue here is that the sample is so large (3,143 counties) that the Shapiro-Wilk test flags even tiny departures from normality, and no amount of transforming the data will really change that. With the transformations I did manage to get med_income into reasonable shape: the log10-transformed values look acceptable in a histogram, and Levene's test, while still significant, is MUCH closer to not rejecting the null hypothesis. There really wasn't much I could do about poverty. There's also an imbalance in group sizes (roughly 1,914 Republican counties versus 859 Democratic and 370 unknown), so that might be a confounding problem with this dataset.

Now, setting all of that aside, what we see is that ON AVERAGE Republican counties have about 34% more poverty than Democratic counties, and median incomes in those Republican counties are, on average, about 20% lower than in their Democratic counterparts. Looking at the R2 values, my models only account for roughly 13% of the variance in the dataset. That being said, since the dataset is so large I'm fairly confident that modelS is at least somewhat reliable for answering the question. We can't fully determine whether this is SIGNIFICANT because we could only use parametric tests (we haven't learned nonparametric tests yet), so we can't rely completely on these results. On the other hand, when we ran the AICc comparison, the model it chose was modelS (logmed ~ redblue), with the null model next and no other model anywhere close to being as strong. I would hazard to say that, even though our data and analysis are a little shaky, the adage about economic differences seems to hold: Republican counties have lower median incomes. We can't determine much about higher poverty rates, because the AICc definitely didn't choose that model, so there was probably a LOT of unexplained variation in poverty even though its R2 was around 0.13.
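The 34% and 20% figures quoted above appear to come from comparing group means (the WhatDoIDid table). A minimal sketch of that calculation, assuming censur has numeric poverty and med_income columns and the redblue factor with levels r, b, and u:

```r
# Group means by county type, then the relative (r vs b) differences.
grp <- censur %>%
  group_by(redblue) %>%
  summarise(mean_poverty = mean(poverty, na.rm = TRUE),
            mean_income  = mean(med_income, na.rm = TRUE))
grp
with(grp, (mean_poverty[redblue == "r"] - mean_poverty[redblue == "b"]) / mean_poverty[redblue == "b"])
with(grp, (mean_income[redblue == "r"]  - mean_income[redblue == "b"])  / mean_income[redblue == "b"])
```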

attach(censur)
WhatDoIDid <- censur %>% 
  group_by(redblue) %>%
  summarise(across(where(is.numeric), mean))
hist(pop2010, breaks=c(seq(0,3500, 100)), prob = TRUE)               

leveneTest(pop2010 ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group    2  0.1997  0.819
      3140               
hmm <- aov(pop2010 ~ redblue)
summary.aov(hmm)
              Df    Sum Sq  Mean Sq F value Pr(>F)    
redblue        2 1.958e+08 97904451   133.4 <2e-16 ***
Residuals   3140 2.305e+09   734158                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
posthoc1 <- TukeyHSD(x=hmm, ordered = FALSE, conf.level=0.95)
posthoc1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = pop2010 ~ redblue)

$redblue
         diff       lwr       upr     p adj
b-r  546.3777  463.8669 628.88848 0.0000000
u-r  409.2954  295.1970 523.39382 0.0000000
u-b -137.0823 -262.0166 -12.14793 0.0273597
censur$pop2010 <- as.integer(censur$pop2010)
censur$redblue <- as.factor(censur$redblue)
censur$redblue <- relevel(censur$redblue, ref= "r")
leveneTest((1/(censur$pop2010)) ~ redblue)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value  Pr(>F)  
group    2  2.6335 0.07199 .
      3140                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
shapiro.test((1/(censur$pop2010)))

    Shapiro-Wilk normality test

data:  (1/(censur$pop2010))
W = 0.060176, p-value < 2.2e-16
hm <- aov((1/(censur$pop2010)) ~ redblue)  # same model as hmmm below; kept because it is reused in AICctab()
hmmm <- aov((1/(censur$pop2010)) ~ redblue)
summary.aov(hmmm)
              Df Sum Sq   Mean Sq F value Pr(>F)  
redblue        2 0.0031 0.0015271   2.961 0.0519 .
Residuals   3140 1.6195 0.0005158                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
posthoc3 <- TukeyHSD(x=hmmm, ordered = FALSE, conf.level=0.95)
posthoc3
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = (1/(censur$pop2010)) ~ redblue)

$redblue
             diff          lwr          upr     p adj
b-r -2.038490e-03 -0.004225469 0.0001484904 0.0737902
u-r -1.976389e-03 -0.005000611 0.0010478325 0.2758583
u-b  6.210029e-05 -0.003249332 0.0033735328 0.9989345
censur$redblue <- as.factor(censur$redblue)
censur$redblue <- relevel(censur$redblue, ref = "r")
huh <- lm((1/(censur$pop2010)) ~ redblue)
summary.lm(huh)

Call:
lm(formula = (1/(censur$pop2010)) ~ redblue)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.00326 -0.00293 -0.00127 -0.00094  0.99642 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0035792  0.0005191   6.895 6.49e-12 ***
redblueb    -0.0020385  0.0009327  -2.186   0.0289 *  
redblueu    -0.0019764  0.0012897  -1.532   0.1255    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02271 on 3140 degrees of freedom
Multiple R-squared:  0.001882,  Adjusted R-squared:  0.001247 
F-statistic: 2.961 on 2 and 3140 DF,  p-value: 0.05192
huhum <- lm((1/(censur$pop2010)) ~ 1)
boxplot((1/(censur$pop2010)) ~ censur$redblue)

AICctab(hm, huh, huhum, weights = TRUE, delta = TRUE, base=TRUE, sort = TRUE)
      AICc     dAICc    df weight
hm    -14867.6      0.0 4  0.42  
huh   -14867.6      0.0 4  0.42  
huhum -14865.7      1.9 2  0.16  
#Storing each graph as a named object so they can be combined into a single figure. This also lets us find R2 values for EACH of the population datasets and compare them against each other and all together
P1 <- ggplot(data=censur, aes((pop2010),(poverty), color = factor(redblue))) + geom_point(size = 1) + geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x,  # stat_poly_eq() needs a model formula matching geom_smooth(), not the fitted lm object
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)
Blue <- censur %>%
  filter(redblue == "b") %>%
  na.omit()
he <- lm((1/(Blue$pop2010)) ~ Blue$poverty)
summary.lm(he)

Call:
lm(formula = (1/(Blue$pop2010)) ~ Blue$poverty)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.00302 -0.00151 -0.00090 -0.00019  0.49673 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.0040909  0.0017673   2.315   0.0209 *
Blue$poverty -0.0002007  0.0001312  -1.530   0.1264  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01722 on 857 degrees of freedom
Multiple R-squared:  0.002724,  Adjusted R-squared:  0.001561 
F-statistic: 2.341 on 1 and 857 DF,  p-value: 0.1264
P2 <- ggplot(Blue, aes(pop2010,poverty)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)  
  
Red <- censur %>%
  filter(redblue == "r") %>%
  na.omit()
hee <- lm((1/(Red$pop2010)) ~ Red$poverty)
summary.lm(hee)

Call:
lm(formula = (1/(Red$pop2010)) ~ Red$poverty)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.00894 -0.00422 -0.00239  0.00001  0.98902 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.098e-02  1.647e-03   6.664 3.46e-11 ***
Red$poverty -4.336e-04  8.979e-05  -4.829 1.48e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02645 on 1912 degrees of freedom
Multiple R-squared:  0.01205,   Adjusted R-squared:  0.01153 
F-statistic: 23.32 on 1 and 1912 DF,  p-value: 1.484e-06
P3 <- ggplot(Red, aes(pop2010,poverty)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)   
Unknown <- censur %>%
  filter(redblue == "u") %>%
  na.omit()
He <- lm((1/(Unknown$pop2010)) ~ Unknown$poverty)
summary.lm(He) 

Call:
lm(formula = (1/(Unknown$pop2010)) ~ Unknown$poverty)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.001963 -0.001329 -0.001007 -0.000393  0.060789 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      2.500e-03  7.585e-04   3.295  0.00108 **
Unknown$poverty -6.465e-05  5.068e-05  -1.276  0.20293   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.005477 on 368 degrees of freedom
Multiple R-squared:  0.004402,  Adjusted R-squared:  0.001696 
F-statistic: 1.627 on 1 and 368 DF,  p-value: 0.2029
P4 <- ggplot(Unknown, aes(Unknown$pop2010,Unknown$poverty)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
                parse = TRUE)
#This method ONLY works with cowplot BUT it makes multiplotting extremely manageable. Moving forward we can probably find a way to make this look nicer and rearrange the graphs, but as it is this is AWESOME :D
plot_grid(P1) #Since we couldn't make all of them fit nicely, this one goes by itself so we can see the pretty colors and all of the lines

plot_grid(P2,P3,P4, labels = c("A", "B", "C"), nrow = 2, ncol = 2, align = "h")

huhh <- lm((1/(pop2010)) ~ (log10(med_income)))
P5 <- ggplot(data=censur, aes((pop2010),(log10(med_income)), color = factor(redblue))) + geom_point(size = 1) + geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)
Blue <- censur %>%
  filter(redblue == "b") %>%
  na.omit()
heee <- lm((1/(Blue$pop2010)) ~ (log10(Blue$med_income)))
summary.lm(heee)

Call:
lm(formula = (1/(Blue$pop2010)) ~ (log10(Blue$med_income)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.00289 -0.00135 -0.00090 -0.00030  0.49793 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.041470   0.029041   1.428    0.154
log10(Blue$med_income) -0.008526   0.006200  -1.375    0.169

Residual standard error: 0.01723 on 857 degrees of freedom
Multiple R-squared:  0.002202,  Adjusted R-squared:  0.001038 
F-statistic: 1.891 on 1 and 857 DF,  p-value: 0.1694
P6 <- ggplot(Blue, aes(pop2010,med_income)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)  
  
Red <- censur %>%
  filter(redblue == "r") %>%
  na.omit()
heeee <- lm((1/(Red$pop2010)) ~ (log10(Red$med_income)))
summary.lm(heeee)

Call:
lm(formula = (1/(Red$pop2010)) ~ (log10(Red$med_income)))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.01151 -0.00389 -0.00217 -0.00012  0.98862 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -0.108390   0.028563  -3.795 0.000152 ***
log10(Red$med_income)  0.024326   0.006204   3.921 9.13e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0265 on 1912 degrees of freedom
Multiple R-squared:  0.007976,  Adjusted R-squared:  0.007457 
F-statistic: 15.37 on 1 and 1912 DF,  p-value: 9.133e-05
P7 <- ggplot(Red, aes(pop2010,med_income)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
                parse = TRUE)   
Unknown <- censur %>%
  filter(redblue == "u") %>%
  na.omit()
Heeeee <- lm((1/(Unknown$pop2010)) ~ (log10(Unknown$med_income)))
summary.lm(Heeeee) 

Call:
lm(formula = (1/(Unknown$pop2010)) ~ (log10(Unknown$med_income)))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.001329 -0.001181 -0.001037 -0.000660  0.060870 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                0.005886   0.011785   0.499    0.618
log10(Unknown$med_income) -0.000917   0.002522  -0.364    0.716

Residual standard error: 0.005488 on 368 degrees of freedom
Multiple R-squared:  0.0003591, Adjusted R-squared:  -0.002357 
F-statistic: 0.1322 on 1 and 368 DF,  p-value: 0.7164
P8 <- ggplot(Unknown, aes(Unknown$pop2010,Unknown$med_income)) + 
  geom_point(size = 1) +  
  geom_smooth(method = "lm") +
  stat_poly_eq(formula = y ~ x, 
                aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
                parse = TRUE)
plot_grid(P5)

plot_grid(P6,P7,P8, labels = c("A", "B", "C"), nrow = 2, ncol = 2)

plot_grid(P1,P5, nrow = 2)

plot_grid(P2, P3, P4, P6, P7, P8, nrow = 2, ncol =  3, align = "v")

#I keep trying to figure out how to adjust the spaces BETWEEN the graphs, but I haven't found a way I like yet, so I'm leaving it as is for now.
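On the spacing question in the comment above: one possible approach (an assumption, not the only way) is to shrink each panel's own outer margin with ggplot2's plot.margin theme setting before handing the panels to plot_grid(), for example:

```r
# Tighter gaps between panels by trimming each panel's outer margin (values in points).
tight <- theme(plot.margin = margin(2, 2, 2, 2, unit = "pt"))
plot_grid(P2 + tight, P3 + tight, P4 + tight,
          labels = c("A", "B", "C"), nrow = 2, ncol = 2, align = "h")
```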
B1 <- boxplot((1/(censur$pop2010)) ~ censur$redblue)

  3. So by using the inverse transformation on this dataset I managed to get Levene's test above 0.05, so the variances are acceptable, but the data still fail the Shapiro-Wilk test. Running an lm on the transformed data shows a significant difference between Republican and Democratic counties, but not between Republican and unidentified counties. The residual standard error serves as a rough proxy for the standard deviation here, at 0.02271, which is fairly tight considering how large the dataset is and how low the R2 values of the models are. I also made a boxplot to visualize the spread of the transformed data; on the 1/population scale, values close to 0 correspond to large populations and larger values to smaller populations. That boxplot shows a LOT more variation within the r (Republican) group, which suggests that, on average, Republican counties are smaller than Democratic ones, while the unknown counties show less variation than r but more spread than b. The WhatDoIDid data frame of group means tells the same story: on average, Democratic counties have larger populations than Republican ones.
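To put numbers on the population comparison, a small sketch assuming pop2010 is the 2010 county population; note that back-transforming the mean of 1/pop2010 gives the harmonic mean, which down-weights large counties, so it will not match the raw group means:

```r
# Average county population by group, raw and back-transformed from the inverse scale.
tapply(censur$pop2010, censur$redblue, mean)             # raw group means
inv_means <- tapply(1/censur$pop2010, censur$redblue, mean)
1 / inv_means                                            # harmonic means (reciprocal of the mean inverse)
```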
attach(cancer)
Cancs <- cancer %>% 
  group_by(treatment) %>%
  summarise(across(where(is.numeric), mean))
cancer$death <- as.numeric(cancer$death)
cancer$treatment <- as.factor(cancer$treatment)
hist(cancer$death, breaks=c(seq(0,60, 1)), prob = TRUE)               

leveneTest(death ~ treatment)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value  Pr(>F)  
group   3  3.0257 0.03241 *
      116                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
leveneTest(log10(cancer$death) ~ treatment)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   3  0.9122 0.4374
      116               
leveneTest(sqrt(cancer$death) ~ treatment)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value  Pr(>F)  
group   3  2.3496 0.07608 .
      116                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
leveneTest((1/(cancer$death)) ~ treatment)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   3  0.0816 0.9699
      116               
shapiro.test(cancer$death)

    Shapiro-Wilk normality test

data:  cancer$death
W = 0.78535, p-value = 5.85e-12
shapiro.test(log10(cancer$death))

    Shapiro-Wilk normality test

data:  log10(cancer$death)
W = 0.96269, p-value = 0.002094
shapiro.test(sqrt(cancer$death))

    Shapiro-Wilk normality test

data:  sqrt(cancer$death)
W = 0.94351, p-value = 7.439e-05
shapiro.test((1/(cancer$death)))

    Shapiro-Wilk normality test

data:  (1/(cancer$death))
W = 0.71166, p-value = 4.843e-14
#so what we see here is that log10 gives the best Shapiro-Wilk result while also passing Levene's test, so it's the transformation to use, even though sqrt and the inverse also keep the null hypothesis when run through Levene's test.
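A compact way to run the comparison described in the comment above, as a sketch that assumes cancer$death is strictly positive and treatment is the factor used throughout:

```r
# Shapiro-Wilk and Levene p-values for each candidate transformation, side by side.
transforms <- list(raw = identity, log10 = log10, sqrt = sqrt, inverse = function(x) 1/x)
sapply(transforms, function(f) {
  y <- f(cancer$death)
  c(shapiro_p = shapiro.test(y)$p.value,
    levene_p  = leveneTest(y ~ cancer$treatment)[1, "Pr(>F)"])
})
```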
ggplot(cancer, aes(log10(death))) + geom_histogram(aes(y = after_stat(density)), binwidth = .1, color = "gray", fill = "black") +
  geom_density() 

#this isn't really a PERFECT bell curve but honestly it's decent enough that I'd be okay drawing conclusions from this data.
ADvsT <- aov((log10(cancer$death)) ~ treatment)
summary.aov(ADvsT)
             Df Sum Sq Mean Sq F value Pr(>F)
treatment     3  0.443  0.1478   1.017  0.388
Residuals   116 16.861  0.1454               
posthocADvsT <- TukeyHSD(x=ADvsT, ordered = FALSE, conf.level=0.95)
posthocADvsT
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = (log10(cancer$death)) ~ treatment)

$treatment
                     diff        lwr       upr     p adj
DrugA-placebo  0.10291792 -0.1536795 0.3595153 0.7230782
DrugB-placebo  0.11823973 -0.1383577 0.3748371 0.6273618
DrugC-placebo -0.01946226 -0.2760597 0.2371351 0.9972479
DrugB-DrugA    0.01532181 -0.2412756 0.2719192 0.9986495
DrugC-DrugA   -0.12238019 -0.3789776 0.1342172 0.6007856
DrugC-DrugB   -0.13770200 -0.3942994 0.1188954 0.5026287
plot(posthocADvsT)

cancer$treatment <- as.factor(cancer$treatment)
cancer$treatment <- relevel(cancer$treatment, ref = "placebo")
LDvsT <- lm((log10(cancer$death)) ~ cancer$treatment)
summary.lm(LDvsT)

Call:
lm(formula = (log10(cancer$death)) ~ cancer$treatment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79503 -0.19297  0.03191  0.25413  0.90153 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            0.67679    0.06961   9.723   <2e-16 ***
cancer$treatmentDrugA  0.10292    0.09844   1.045    0.298    
cancer$treatmentDrugB  0.11824    0.09844   1.201    0.232    
cancer$treatmentDrugC -0.01946    0.09844  -0.198    0.844    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3813 on 116 degrees of freedom
Multiple R-squared:  0.02563,   Adjusted R-squared:  0.0004263 
F-statistic: 1.017 on 3 and 116 DF,  p-value: 0.3879
NullA <- aov(log10(cancer$death) ~ 1)
NullL <- lm(log10(cancer$death) ~ 1)
AICctab(ADvsT, NullA, LDvsT, NullL, weights = TRUE, delta = TRUE, base=TRUE, sort = TRUE)
      AICc  dAICc df weight
NullA 112.3   0.0 2  0.42  
NullL 112.3   0.0 2  0.42  
ADvsT 115.6   3.3 5  0.08  
LDvsT 115.6   3.3 5  0.08  
#Here I'm digging into why I'm not seeing any differences between the treatment groups and what is going on.
boxplot((log10(cancer$death)) ~ cancer$treatment)

interaction.plot(cancer$treatment, trace.factor=cancer$treatment, response=cancer$death, 
fun=mean, type="b")

ggplot(cancer, aes((log10(cancer$death)),cancer$treatment)) + 
  geom_point(size = 1) 
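As a numeric companion to the plots above, a small sketch summarizing log10(death) by treatment group (same data and factor as used throughout):

```r
# Group size, mean, and SD of log10(death) per treatment.
cancer %>%
  group_by(treatment) %>%
  summarise(n = n(),
            mean_log_death = mean(log10(death)),
            sd_log_death = sd(log10(death)))
```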

  4. So ultimately what the data are telling me is that there's no significant difference between any of the drugs and the placebo, so the drugs don't appear to be changing how many patients die in these trials. I'm reasonably sold on this conclusion: Levene's test is clearly satisfied after the log10 transform, the Shapiro-Wilk test improves a lot (p = 0.002, still formally rejected but far better than the raw data), and the histogram of the log-transformed deaths looks roughly bell-shaped, not perfect but close enough that I wouldn't expect it to wreck the ANOVA. The model summaries for the lm and the ANOVA (with Tukey's post hoc) both show no significant differences among the treatment groups. The boxplot backs this up: the boxes overlap heavily, which means there's a lot of variation within each drug group relative to the differences between them. The plot of the Tukey post-hoc confidence intervals shows the same thing: every interval crosses zero, so none of the pairwise differences can be called significant. The AICc comparison points the same way: it picked the null models over the lm and ANOVA models I built, which basically means my models explained essentially none of the variation. The last piece of evidence is the ggplot of the individual values per treatment, and it tells the same story as the boxplot and the Tukey plot: too much overlap for the drugs to be significantly affecting the number of deaths. I do want to say that there seems to be an observable trend if you just look at the raw means, and I suspect that with more than 120 patients in the trial a significant difference would have been more likely to show up.
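On the closing sample-size point: a rough power sketch with base R's power.anova.test(), treating the observed group means of log10(death) as if they were the true pattern and using the residual mean square from the ANOVA table above (0.1454) as the within-group variance. Both inputs are assumptions, so the result is a ballpark only.

```r
# Per-group n needed for 80% power, if the observed effect pattern were the truth.
grp_means <- tapply(log10(cancer$death), cancer$treatment, mean)
power.anova.test(groups = length(grp_means),
                 between.var = var(grp_means),
                 within.var = 0.1454,   # residual mean square from summary.aov(ADvsT)
                 power = 0.80)
```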