First, I’ll load the state dataset. One of the datasets it contains is state.x77, a matrix of 50 rows (one per state) with 8 columns: population, income, illiteracy, life expectancy, murder rate, high school graduation rate, mean number of days below freezing, and area in square miles.
library(tidyr)
library(ggplot2)
data("state")
df <- as.data.frame(state.x77)
head(df)
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
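As a quick sanity check on the description above, the shape and column names of the matrix can be inspected directly. This is only a minimal sketch; it uses base R functions and the built-in `state.x77` object, nothing beyond what the post already loads.

```r
# state.x77 ships with base R's datasets package
dim(state.x77)        # number of rows (states) and columns (variables)
colnames(state.x77)   # the variable names, including Murder

# Structure of the data frame version used in the rest of the post
str(as.data.frame(state.x77))
```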
Next, I have to convert a column into a binary variable. In this case, I will split states into ‘cold’ and ‘not cold’, represented by 1 and 0 respectively, using the median number of frost days as the cutoff; a state exactly at the median counts as cold. This creates our ‘dichotomous’ or ‘binary’ variable. I then do the same for the murder rate.
# Frost: 1 if the state is at or above the median number of frost days, else 0
df$Frost <- as.numeric(df$Frost >= median(df$Frost))
# Murder rate: same rule, using the median murder rate as the cutoff
df$Murder <- as.numeric(df$Murder >= median(df$Murder))
df
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 1 41.3 0
## Alaska 365 6315 1.5 69.31 1 66.7 1
## Arizona 2212 4530 1.8 70.55 1 58.1 0
## Arkansas 2110 3378 1.9 70.66 1 39.9 0
## California 21198 5114 1.1 71.71 1 62.6 0
## Colorado 2541 4884 0.7 72.06 0 63.9 1
## Connecticut 3100 5348 1.1 72.48 0 56.0 1
## Delaware 579 4809 0.9 70.06 0 54.6 0
## Florida 8277 4815 1.3 70.66 1 52.6 0
## Georgia 4931 4091 2.0 68.54 1 40.6 0
## Hawaii 868 4963 1.9 73.60 0 61.9 0
## Idaho 813 4119 0.6 71.87 0 59.5 1
## Illinois 11197 5107 0.9 70.14 1 52.6 1
## Indiana 5313 4458 0.7 70.88 1 52.9 1
## Iowa 2861 4628 0.5 72.56 0 59.0 1
## Kansas 2280 4669 0.6 72.58 0 59.9 0
## Kentucky 3387 3712 1.6 70.10 1 38.5 0
## Louisiana 3806 3545 2.8 68.76 1 42.2 0
## Maine 1058 3694 0.7 70.39 0 54.7 1
## Maryland 4122 5299 0.9 70.22 1 52.3 0
## Massachusetts 5814 4755 1.1 71.83 0 58.5 0
## Michigan 9111 4751 0.9 70.63 1 52.8 1
## Minnesota 3921 4675 0.6 72.96 0 57.6 1
## Mississippi 2341 3098 2.4 68.09 1 41.0 0
## Missouri 4767 4254 0.8 70.69 1 48.8 0
## Montana 746 4347 0.6 70.56 0 59.2 1
## Nebraska 1544 4508 0.6 72.60 0 59.3 1
## Nevada 590 5149 0.5 69.03 1 65.2 1
## New Hampshire 812 4281 0.7 71.23 0 57.6 1
## New Jersey 7333 5237 1.1 70.93 0 52.5 1
## New Mexico 1144 3601 2.2 70.32 1 55.2 1
## New York 18076 4903 1.4 70.55 1 52.7 0
## North Carolina 5441 3875 1.8 69.21 1 38.5 0
## North Dakota 637 5087 0.8 72.78 0 50.3 1
## Ohio 10735 4561 0.8 70.82 1 53.2 1
## Oklahoma 2715 3983 1.1 71.42 0 51.6 0
## Oregon 2284 4660 0.6 72.13 0 60.0 0
## Pennsylvania 11860 4449 1.0 70.43 0 50.2 1
## Rhode Island 931 4558 1.3 71.90 0 46.4 1
## South Carolina 2816 3635 2.3 67.96 1 37.8 0
## South Dakota 681 4167 0.5 72.08 0 53.3 1
## Tennessee 4173 3821 1.7 70.11 1 41.8 0
## Texas 12237 4188 2.2 70.90 1 47.4 0
## Utah 1203 4022 0.6 72.90 0 67.3 1
## Vermont 472 3907 0.6 71.64 0 57.1 1
## Virginia 4981 4701 1.4 70.08 1 47.8 0
## Washington 3559 4864 0.6 71.72 0 63.5 0
## West Virginia 1799 3617 1.4 69.48 0 41.6 0
## Wisconsin 4589 4468 0.7 72.48 0 54.5 1
## Wyoming 376 4566 0.6 70.29 1 62.9 1
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
## Connecticut 4862
## Delaware 1982
## Florida 54090
## Georgia 58073
## Hawaii 6425
## Idaho 82677
## Illinois 55748
## Indiana 36097
## Iowa 55941
## Kansas 81787
## Kentucky 39650
## Louisiana 44930
## Maine 30920
## Maryland 9891
## Massachusetts 7826
## Michigan 56817
## Minnesota 79289
## Mississippi 47296
## Missouri 68995
## Montana 145587
## Nebraska 76483
## Nevada 109889
## New Hampshire 9027
## New Jersey 7521
## New Mexico 121412
## New York 47831
## North Carolina 48798
## North Dakota 69273
## Ohio 40975
## Oklahoma 68782
## Oregon 96184
## Pennsylvania 44966
## Rhode Island 1049
## South Carolina 30225
## South Dakota 75955
## Tennessee 41328
## Texas 262134
## Utah 82096
## Vermont 9267
## Virginia 39780
## Washington 66570
## West Virginia 24070
## Wisconsin 54464
## Wyoming 97203
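The recoding above overwrites the original Frost and Murder values. One possible alternative, sketched here under the assumption that the quantitative columns might still be wanted later, is to store the indicators in new columns. The names `df2`, `cold`, and `high_murder` are hypothetical and do not appear in the original analysis.

```r
# df2 is a hypothetical fresh copy, so the df used in the post is untouched
df2 <- as.data.frame(state.x77)

# 1 when the state is at or above the median, 0 otherwise (same rule as above)
df2$cold        <- as.integer(df2$Frost  >= median(df2$Frost))
df2$high_murder <- as.integer(df2$Murder >= median(df2$Murder))

# Cross-tabulate the two indicators to see how the states split
table(cold = df2$cold, high_murder = df2$high_murder)
```

With this layout the original `Frost` and `Murder` columns remain available for plots or quantitative terms.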
I will model income across the states. I will test the murder rate as a squared term, frost as the dichotomous term, and the product of illiteracy and high school graduation rate as the interaction term (both factors are quantitative, so this is not strictly a dichotomous vs. quantitative interaction). I will test life expectancy as a linear term.
plot(df$Murder, df$Income)
plot(df$Frost, df$Income)
plot(df$Illiteracy*df$`HS Grad`, df$Income)
plot(df$`Life Exp`, df$Income)
hist(df$Murder)
hist(df$Frost)
hist(df$Illiteracy*df$`HS Grad`)
hist(df$`Life Exp`)
product <- df$Illiteracy*df$`HS Grad`
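A scatterplot of a 0/1 variable against income only shows two vertical strips of points, so a group comparison may be easier to read. The following is a rough sketch of that idea, not part of the original analysis; `inc`, `cold`, `group_means`, and `group_medians` are hypothetical names, and the cutoff is the same median rule used earlier.

```r
inc  <- state.x77[, "Income"]
cold <- state.x77[, "Frost"] >= median(state.x77[, "Frost"])  # TRUE = cold state

# Mean and median income within each group
group_means   <- tapply(inc, cold, mean)
group_medians <- tapply(inc, cold, median)
group_means
group_medians

# Side-by-side boxplots instead of a two-strip scatterplot
boxplot(inc ~ cold, names = c("not cold", "cold"), ylab = "Income")
```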
Then I created a multiple regression model using the terms prescribed in the assignment.
linear.model <- lm(df$Income~ I(df$Murder ** 2) + df$Frost + product + df$`Life Exp`, df)
linear.model
##
## Call:
## lm(formula = df$Income ~ I(df$Murder^2) + df$Frost + product +
## df$`Life Exp`, data = df)
##
## Coefficients:
## (Intercept) I(df$Murder^2) df$Frost product
## -10116.252 344.292 220.121 -1.639
## df$`Life Exp`
## 202.691
plot(linear.model)
summary(linear.model)
##
## Call:
## lm(formula = df$Income ~ I(df$Murder^2) + df$Frost + product +
## df$`Life Exp`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1048.0 -293.2 -101.2 371.1 1982.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10116.252 6182.117 -1.636 0.1087
## I(df$Murder^2) 344.292 237.101 1.452 0.1534
## df$Frost 220.121 190.373 1.156 0.2537
## product -1.639 4.000 -0.410 0.6840
## df$`Life Exp` 202.691 85.943 2.358 0.0228 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 581.1 on 45 degrees of freedom
## Multiple R-squared: 0.1786, Adjusted R-squared: 0.1056
## F-statistic: 2.447 on 4 and 45 DF, p-value: 0.05994
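For comparison, the same model can be written with bare column names and the `data` argument, which keeps the coefficient labels shorter. This is a sketch under the assumption that the recoding and the product term are rebuilt in a separate data frame; `d`, `fit`, and `s` are hypothetical names.

```r
d <- as.data.frame(state.x77)
d$Frost   <- as.numeric(d$Frost  >= median(d$Frost))    # 1 = at or above median
d$Murder  <- as.numeric(d$Murder >= median(d$Murder))   # 1 = at or above median
d$product <- d$Illiteracy * d$`HS Grad`                 # illiteracy x HS grad rate

# Same terms as the model above, referenced by column name
fit <- lm(Income ~ I(Murder^2) + Frost + product + `Life Exp`, data = d)

s <- summary(fit)
coef(s)           # estimates, standard errors, t values, p-values
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
```

Pulling values out of the `summary()` object this way can be handy when the numbers need to be reused rather than read off the printed output.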
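This is a terrible model: it explains only about 18% of the variance in income (adjusted R-squared 0.11), and the overall F-test is not significant at the 0.05 level (p = 0.06). The only term that remotely indicates average income level is life expectancy (p = 0.023), which makes sense because every marginal year of life adds a marginal year of income. The arbitrarily squared murder rate is about as good an indicator as a coin flip; since the murder rate had already been reduced to 0 or 1, squaring it changes nothing. The nonsense education metric is just that. Whether or not a state has more cold days than the median also has little to do with income.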
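One reason the squared murder term cannot add anything is that a 0/1 variable is equal to its own square. The check below is a small sketch of that point, followed by one possible alternative that keeps the murder rate quantitative. The names `murder_raw`, `murder_bin`, and `alt` are hypothetical, and the alternative model is an illustration rather than the model prescribed in the assignment.

```r
murder_raw <- state.x77[, "Murder"]
murder_bin <- as.numeric(murder_raw >= median(murder_raw))  # 0/1 indicator

# Squaring a 0/1 indicator returns the same values
all(murder_bin^2 == murder_bin)

# Assumed alternative: a quadratic in the original, quantitative murder rate
d   <- as.data.frame(state.x77)
alt <- lm(Income ~ Murder + I(Murder^2), data = d)
summary(alt)
```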