title: “Quiz on Correlation and Regression”
author: “Nicolas Flammia”
output:
html_document:
toc: true
Answer: According to the scatterplot, it show that the relationship has a positive direction since its upward sloping. That means that theres a relationship between the two variables, the magnitude of the relationship is determined by the absolute value of the correlation coefficient, in this case it is greater than 0.6 meaning that there is a strong relationship between income and educatinal levels.
# Load the package
library(openintro)
library(ggplot2)
str(countyComplete)
## 'data.frame': 3143 obs. of 53 variables:
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 151 166 227 237 250 298 320 ...
## $ FIPS : num 1001 1003 1005 1007 1009 ...
## $ pop2010 : num 54571 182265 27457 22915 57322 ...
## $ pop2000 : num 43671 140415 29038 20826 51024 ...
## $ age_under_5 : num 6.6 6.1 6.2 6 6.3 6.8 6.5 6.1 5.7 5.3 ...
## $ age_under_18 : num 26.8 23 21.9 22.7 24.6 22.3 24.1 22.9 22.5 21.4 ...
## $ age_over_65 : num 12 16.8 14.2 12.7 14.7 13.5 16.7 14.3 16.7 17.9 ...
## $ female : num 51.3 51.1 46.9 46.3 50.5 45.8 53 51.8 52.2 50.4 ...
## $ white : num 78.5 85.7 48 75.8 92.6 23 54.4 74.9 58.8 92.7 ...
## $ black : num 17.7 9.4 46.9 22 1.3 70.2 43.4 20.6 38.7 4.6 ...
## $ native : num 0.4 0.7 0.4 0.3 0.5 0.2 0.3 0.5 0.2 0.5 ...
## $ asian : num 0.9 0.7 0.4 0.1 0.2 0.2 0.8 0.7 0.5 0.2 ...
## $ pac_isl : num NA NA NA NA NA NA 0 0.1 0 0 ...
## $ two_plus_races : num 1.6 1.5 0.9 0.9 1.2 0.8 0.8 1.7 1.1 1.5 ...
## $ hispanic : num 2.4 4.4 5.1 1.8 8.1 7.1 0.9 3.3 1.6 1.2 ...
## $ white_not_hispanic : num 77.2 83.5 46.8 75 88.9 21.9 54.1 73.6 58.1 92.1 ...
## $ no_move_in_one_plus_year : num 86.3 83 83 90.5 87.2 88.5 92.8 82.9 86.2 88.1 ...
## $ foreign_born : num 2 3.6 2.8 0.7 4.7 1.1 1.1 2.5 0.9 0.5 ...
## $ foreign_spoken_at_home : num 3.7 5.5 4.7 1.5 7.2 3.8 1.6 4.5 1.6 1.4 ...
## $ hs_grad : num 85.3 87.6 71.9 74.5 74.7 74.7 74.8 78.5 71.8 73.4 ...
## $ bachelors : num 21.7 26.8 13.5 10 12.5 12 11 16.1 10.8 10.5 ...
## $ veterans : num 5817 20396 2327 1883 4072 ...
## $ mean_work_travel : num 25.1 25.8 23.8 28.3 33.2 28.1 25.1 22.1 23.6 26.2 ...
## $ housing_units : num 22135 104061 11829 8981 23887 ...
## $ home_ownership : num 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ housing_multi_unit : num 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ median_val_owner_occupied : num 133900 177200 88200 81200 113700 ...
## $ households : num 19718 69476 9795 7441 20605 ...
## $ persons_per_household : num 2.7 2.5 2.52 3.02 2.73 2.85 2.58 2.46 2.51 2.22 ...
## $ per_capita_income : num 24568 26469 15875 19918 21070 ...
## $ median_household_income : num 53255 50147 33219 41770 45549 ...
## $ poverty : num 10.6 12.2 25 12.6 13.4 25.3 25 19.5 20.3 17.6 ...
## $ private_nonfarm_establishments : num 877 4812 522 318 749 ...
## $ private_nonfarm_employment : num 10628 52233 7990 2927 6968 ...
## $ percent_change_private_nonfarm_employment: num 16.6 17.4 -27 -14 -11.4 -18.5 2.1 -5.6 -45.8 5.4 ...
## $ nonemployment_establishments : num 2971 14175 1527 1192 3501 ...
## $ firms : num 4067 19035 1667 1385 4458 ...
## $ black_owned_firms : num 15.2 2.7 NA 14.9 NA NA NA 7.2 NA NA ...
## $ native_owned_firms : num NA 0.4 NA NA NA NA NA NA NA NA ...
## $ asian_owned_firms : num 1.3 1 NA NA NA NA 3.3 1.6 NA NA ...
## $ pac_isl_owned_firms : num NA NA NA NA NA NA NA NA NA NA ...
## $ hispanic_owned_firms : num 0.7 1.3 NA NA NA NA NA 0.5 NA NA ...
## $ women_owned_firms : num 31.7 27.3 27 NA 23.2 38.8 NA 24.7 29.3 14.5 ...
## $ manufacturer_shipments_2007 : num NA 1410273 NA 0 341544 ...
## $ mercent_whole_sales_2007 : num NA NA NA NA NA ...
## $ sales : num 598175 2966489 188337 124707 319700 ...
## $ sales_per_capita : num 12003 17166 6334 5804 5622 ...
## $ accommodation_food_service : num 88157 436955 NA 10757 20941 ...
## $ building_permits : num 191 696 10 8 18 1 3 107 10 6 ...
## $ fed_spending : num 331142 1119082 240308 163201 294114 ...
## $ area : num 594 1590 885 623 645 ...
## $ density : num 91.8 114.6 31 36.8 88.9 ...
ggplot(data = countyComplete, aes(x = bachelors, y = per_capita_income)) + #cars dataset is from openintro rpackage
geom_point()
cor(countyComplete$bachelors, countyComplete$per_capita_income, use = "pairwise.complete.obs")
## [1] 0.7924464
Run a regression model for income (per_capita_income) with one explanatory variable, educational level (bachelors), and answer Q2 through Q5.
summary(countyComplete)
## state name FIPS
## Texas : 254 Washington County: 30 Min. : 1001
## Georgia : 159 Jefferson County : 25 1st Qu.:18178
## Virginia: 134 Franklin County : 24 Median :29177
## Kentucky: 120 Jackson County : 23 Mean :30390
## Missouri: 115 Lincoln County : 23 3rd Qu.:45082
## Kansas : 105 Madison County : 19 Max. :56045
## (Other) :2256 (Other) :2999
## pop2010 pop2000 age_under_5 age_under_18
## Min. : 82 Min. : 67 Min. : 0.000 Min. : 0.00
## 1st Qu.: 11104 1st Qu.: 11210 1st Qu.: 5.500 1st Qu.:21.40
## Median : 25857 Median : 24608 Median : 6.200 Median :23.30
## Mean : 98233 Mean : 89623 Mean : 6.261 Mean :23.42
## 3rd Qu.: 66699 3rd Qu.: 61766 3rd Qu.: 6.800 3rd Qu.:25.10
## Max. :9818605 Max. :9519338 Max. :12.600 Max. :41.60
## NA's :3
## age_over_65 female white black
## Min. : 3.50 Min. :27.90 Min. : 2.70 Min. : 0.000
## 1st Qu.:13.10 1st Qu.:49.60 1st Qu.:75.25 1st Qu.: 0.500
## Median :15.60 Median :50.50 Median :89.10 Median : 2.000
## Mean :15.88 Mean :50.03 Mean :82.89 Mean : 8.931
## 3rd Qu.:18.20 3rd Qu.:51.10 3rd Qu.:95.50 3rd Qu.:10.200
## Max. :43.40 Max. :56.80 Max. :99.20 Max. :85.700
## NA's :17
## native asian pac_isl two_plus_races
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.100
## 1st Qu.: 0.200 1st Qu.: 0.300 1st Qu.: 0.000 1st Qu.: 1.100
## Median : 0.400 Median : 0.500 Median : 0.000 Median : 1.600
## Mean : 2.026 Mean : 1.166 Mean : 0.154 Mean : 1.976
## 3rd Qu.: 0.800 3rd Qu.: 1.000 3rd Qu.: 0.100 3rd Qu.: 2.300
## Max. :96.000 Max. :43.900 Max. :48.900 Max. :29.500
## NA's :4 NA's :24 NA's :1697
## hispanic white_not_hispanic no_move_in_one_plus_year
## Min. : 0.000 Min. : 2.70 Min. : 51.6
## 1st Qu.: 1.600 1st Qu.:66.95 1st Qu.: 83.2
## Median : 3.300 Median :85.80 Median : 86.3
## Mean : 8.284 Mean :78.29 Mean : 85.8
## 3rd Qu.: 8.200 3rd Qu.:94.20 3rd Qu.: 89.0
## Max. :95.700 Max. :99.20 Max. :100.0
##
## foreign_born foreign_spoken_at_home hs_grad bachelors
## Min. : 0.000 Min. : 0.000 Min. :47.90 Min. : 3.70
## 1st Qu.: 1.200 1st Qu.: 2.800 1st Qu.:78.40 1st Qu.:13.10
## Median : 2.400 Median : 4.800 Median :84.60 Median :16.90
## Mean : 4.372 Mean : 9.057 Mean :83.11 Mean :19.03
## 3rd Qu.: 5.300 3rd Qu.:10.000 3rd Qu.:88.60 3rd Qu.:22.60
## Max. :72.200 Max. :96.000 Max. :99.30 Max. :71.00
##
## veterans mean_work_travel housing_units home_ownership
## Min. : 0 Min. : 4.30 Min. : 50 Min. : 0.00
## 1st Qu.: 958 1st Qu.:19.00 1st Qu.: 5416 1st Qu.:69.50
## Median : 2180 Median :22.40 Median : 12162 Median :74.60
## Mean : 7207 Mean :22.73 Mean : 41904 Mean :73.26
## 3rd Qu.: 5944 3rd Qu.:26.10 3rd Qu.: 30574 3rd Qu.:78.40
## Max. :368128 Max. :44.20 Max. :3445076 Max. :91.30
##
## housing_multi_unit median_val_owner_occupied households
## Min. : 0.00 Min. : 0 Min. : 22
## 1st Qu.: 6.10 1st Qu.: 80200 1st Qu.: 4260
## Median : 9.70 Median : 105900 Median : 9868
## Mean :12.33 Mean : 132545 Mean : 36346
## 3rd Qu.:15.90 3rd Qu.: 152950 3rd Qu.: 25358
## Max. :98.50 Max. :1000001 Max. :3217889
##
## persons_per_household per_capita_income median_household_income
## Min. :1.100 Min. : 7772 Min. : 19351
## 1st Qu.:2.370 1st Qu.:19030 1st Qu.: 36952
## Median :2.490 Median :21773 Median : 42445
## Mean :2.513 Mean :22505 Mean : 44270
## 3rd Qu.:2.630 3rd Qu.:24814 3rd Qu.: 49142
## Max. :4.470 Max. :64381 Max. :115574
##
## poverty private_nonfarm_establishments private_nonfarm_employment
## Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.:11.0 1st Qu.: 229 1st Qu.: 2109
## Median :14.7 Median : 551 Median : 6351
## Mean :15.5 Mean : 2362 Mean : 35656
## 3rd Qu.:19.0 3rd Qu.: 1484 3rd Qu.: 19436
## Max. :53.5 Max. :245523 Max. :3703233
##
## percent_change_private_nonfarm_employment nonemployment_establishments
## Min. :-83.2000 Min. : 21
## 1st Qu.:-12.0000 1st Qu.: 729
## Median : -2.0000 Median : 1594
## Mean : 0.5338 Mean : 6720
## 3rd Qu.: 9.8000 3rd Qu.: 4130
## Max. :386.5000 Max. :821177
## NA's :67 NA's :5
## firms black_owned_firms native_owned_firms asian_owned_firms
## Min. : 27 Min. : 0.200 Min. : 0.200 Min. : 0.300
## 1st Qu.: 1074 1st Qu.: 2.100 1st Qu.: 0.525 1st Qu.: 1.400
## Median : 2350 Median : 5.700 Median : 0.900 Median : 2.200
## Mean : 9301 Mean : 9.806 Mean : 3.785 Mean : 3.422
## 3rd Qu.: 6034 3rd Qu.:13.350 3rd Qu.: 2.300 3rd Qu.: 3.700
## Max. :1046940 Max. :66.700 Max. :71.800 Max. :56.600
## NA's :176 NA's :2376 NA's :2653 NA's :2408
## pac_isl_owned_firms hispanic_owned_firms women_owned_firms
## Min. : 0.0000 Min. : 0.300 Min. : 6.50
## 1st Qu.: 0.1000 1st Qu.: 1.400 1st Qu.:22.70
## Median : 0.1000 Median : 2.800 Median :26.20
## Mean : 0.7171 Mean : 6.811 Mean :25.96
## 3rd Qu.: 0.3000 3rd Qu.: 6.800 3rd Qu.:29.20
## Max. :10.5000 Max. :78.000 Max. :56.20
## NA's :3073 NA's :2363 NA's :970
## manufacturer_shipments_2007 mercent_whole_sales_2007 sales
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 42125 1st Qu.: 79988
## Median : 238180 Median : 138930 Median : 257667
## Mean : 1680613 Mean : 1794262 Mean : 1262270
## 3rd Qu.: 1161878 3rd Qu.: 562056 3rd Qu.: 791435
## Max. :169275136 Max. :205478751 Max. :119111840
## NA's :488 NA's :1022 NA's :42
## sales_per_capita accommodation_food_service building_permits
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 6993 1st Qu.: 9349 1st Qu.: 5
## Median : 9793 Median : 31065 Median : 32
## Mean :10375 Mean : 211181 Mean : 192
## 3rd Qu.:12980 3rd Qu.: 110695 3rd Qu.: 123
## Max. :80800 Max. :24857836 Max. :15039
## NA's :42 NA's :272
## fed_spending area density
## Min. : 0 Min. : 2.0 Min. : 0.0
## 1st Qu.: 102922 1st Qu.: 430.7 1st Qu.: 16.9
## Median : 214994 Median : 615.6 Median : 45.2
## Mean : 944376 Mean : 1123.7 Mean : 259.3
## 3rd Qu.: 522228 3rd Qu.: 924.0 3rd Qu.: 113.8
## Max. :80457156 Max. :145504.8 Max. :69467.5
## NA's :4
library(ggplot2)
str(countyComplete)
## 'data.frame': 3143 obs. of 53 variables:
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 151 166 227 237 250 298 320 ...
## $ FIPS : num 1001 1003 1005 1007 1009 ...
## $ pop2010 : num 54571 182265 27457 22915 57322 ...
## $ pop2000 : num 43671 140415 29038 20826 51024 ...
## $ age_under_5 : num 6.6 6.1 6.2 6 6.3 6.8 6.5 6.1 5.7 5.3 ...
## $ age_under_18 : num 26.8 23 21.9 22.7 24.6 22.3 24.1 22.9 22.5 21.4 ...
## $ age_over_65 : num 12 16.8 14.2 12.7 14.7 13.5 16.7 14.3 16.7 17.9 ...
## $ female : num 51.3 51.1 46.9 46.3 50.5 45.8 53 51.8 52.2 50.4 ...
## $ white : num 78.5 85.7 48 75.8 92.6 23 54.4 74.9 58.8 92.7 ...
## $ black : num 17.7 9.4 46.9 22 1.3 70.2 43.4 20.6 38.7 4.6 ...
## $ native : num 0.4 0.7 0.4 0.3 0.5 0.2 0.3 0.5 0.2 0.5 ...
## $ asian : num 0.9 0.7 0.4 0.1 0.2 0.2 0.8 0.7 0.5 0.2 ...
## $ pac_isl : num NA NA NA NA NA NA 0 0.1 0 0 ...
## $ two_plus_races : num 1.6 1.5 0.9 0.9 1.2 0.8 0.8 1.7 1.1 1.5 ...
## $ hispanic : num 2.4 4.4 5.1 1.8 8.1 7.1 0.9 3.3 1.6 1.2 ...
## $ white_not_hispanic : num 77.2 83.5 46.8 75 88.9 21.9 54.1 73.6 58.1 92.1 ...
## $ no_move_in_one_plus_year : num 86.3 83 83 90.5 87.2 88.5 92.8 82.9 86.2 88.1 ...
## $ foreign_born : num 2 3.6 2.8 0.7 4.7 1.1 1.1 2.5 0.9 0.5 ...
## $ foreign_spoken_at_home : num 3.7 5.5 4.7 1.5 7.2 3.8 1.6 4.5 1.6 1.4 ...
## $ hs_grad : num 85.3 87.6 71.9 74.5 74.7 74.7 74.8 78.5 71.8 73.4 ...
## $ bachelors : num 21.7 26.8 13.5 10 12.5 12 11 16.1 10.8 10.5 ...
## $ veterans : num 5817 20396 2327 1883 4072 ...
## $ mean_work_travel : num 25.1 25.8 23.8 28.3 33.2 28.1 25.1 22.1 23.6 26.2 ...
## $ housing_units : num 22135 104061 11829 8981 23887 ...
## $ home_ownership : num 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ housing_multi_unit : num 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ median_val_owner_occupied : num 133900 177200 88200 81200 113700 ...
## $ households : num 19718 69476 9795 7441 20605 ...
## $ persons_per_household : num 2.7 2.5 2.52 3.02 2.73 2.85 2.58 2.46 2.51 2.22 ...
## $ per_capita_income : num 24568 26469 15875 19918 21070 ...
## $ median_household_income : num 53255 50147 33219 41770 45549 ...
## $ poverty : num 10.6 12.2 25 12.6 13.4 25.3 25 19.5 20.3 17.6 ...
## $ private_nonfarm_establishments : num 877 4812 522 318 749 ...
## $ private_nonfarm_employment : num 10628 52233 7990 2927 6968 ...
## $ percent_change_private_nonfarm_employment: num 16.6 17.4 -27 -14 -11.4 -18.5 2.1 -5.6 -45.8 5.4 ...
## $ nonemployment_establishments : num 2971 14175 1527 1192 3501 ...
## $ firms : num 4067 19035 1667 1385 4458 ...
## $ black_owned_firms : num 15.2 2.7 NA 14.9 NA NA NA 7.2 NA NA ...
## $ native_owned_firms : num NA 0.4 NA NA NA NA NA NA NA NA ...
## $ asian_owned_firms : num 1.3 1 NA NA NA NA 3.3 1.6 NA NA ...
## $ pac_isl_owned_firms : num NA NA NA NA NA NA NA NA NA NA ...
## $ hispanic_owned_firms : num 0.7 1.3 NA NA NA NA NA 0.5 NA NA ...
## $ women_owned_firms : num 31.7 27.3 27 NA 23.2 38.8 NA 24.7 29.3 14.5 ...
## $ manufacturer_shipments_2007 : num NA 1410273 NA 0 341544 ...
## $ mercent_whole_sales_2007 : num NA NA NA NA NA ...
## $ sales : num 598175 2966489 188337 124707 319700 ...
## $ sales_per_capita : num 12003 17166 6334 5804 5622 ...
## $ accommodation_food_service : num 88157 436955 NA 10757 20941 ...
## $ building_permits : num 191 696 10 8 18 1 3 107 10 6 ...
## $ fed_spending : num 331142 1119082 240308 163201 294114 ...
## $ area : num 594 1590 885 623 645 ...
## $ density : num 91.8 114.6 31 36.8 88.9 ...
ggplot(data = countyComplete, aes(x = bachelors, y = per_capita_income)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
lm(per_capita_income ~ bachelors, data = countyComplete)
##
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
##
## Coefficients:
## (Intercept) bachelors
## 13087.7 494.8
lm(log(per_capita_income) ~ log(bachelors), data = countyComplete)
##
## Call:
## lm(formula = log(per_capita_income) ~ log(bachelors), data = countyComplete)
##
## Coefficients:
## (Intercept) log(bachelors)
## 8.8050 0.4167
mod <- lm(per_capita_income ~ bachelors, data = countyComplete)
coef(mod)
## (Intercept) bachelors
## 13087.6795 494.7534
summary(mod)
##
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18032.7 -1708.2 73.8 1748.0 21756.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13087.680 142.091 92.11 <2e-16 ***
## bachelors 494.753 6.795 72.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared: 0.628, Adjusted R-squared: 0.6279
## F-statistic: 5302 on 1 and 3141 DF, p-value: < 2.2e-16
Answer: No, it is not significant. The coefficient of 494.7 shows that for every 494 bachelors there is 13087 non-bachelors. The value of Pr show that the coefficient is not statistically significant at 5% because Pr<0.001
Answer: 3299 on 3141 degrees of freedom means that 3141 of all the observations are free to vary.
Answer: The adjusted R squared of 0.62 tells us that a 62.7% of the variations in the per_capita_income variable are explained by the regression model above.
Run a second regression model for income (per_capita_income) with two explanatory variables: education (bachelors) and density (density), and answer Q6.
Answer: The 2nd model explains the data set better its adjusted R squared of 0.63 is higher than 0.62 is not much greater than model 1 but adding density to the formula doesnt provide more accurate information, since the residual standard error for model 2 is 3140 is not much different from the 3131 from the 1st model.
# Create a linear model 1
mod_1 <- lm(per_capita_income ~ bachelors, data = countyComplete)
# View summary of model 1
summary(mod_1)
##
## Call:
## lm(formula = per_capita_income ~ bachelors, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18032.7 -1708.2 73.8 1748.0 21756.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13087.680 142.091 92.11 <2e-16 ***
## bachelors 494.753 6.795 72.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3299 on 3141 degrees of freedom
## Multiple R-squared: 0.628, Adjusted R-squared: 0.6279
## F-statistic: 5302 on 1 and 3141 DF, p-value: < 2.2e-16
# Create a linear model 2
mod_2 <- lm(per_capita_income ~ bachelors + density, data = countyComplete)
# View summary of model 2
summary(mod_2)
##
## Call:
## lm(formula = per_capita_income ~ bachelors + density, data = countyComplete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18257.9 -1707.7 71.5 1749.6 22101.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.319e+04 1.433e+02 92.042 < 2e-16 ***
## bachelors 4.872e+02 6.963e+00 69.973 < 2e-16 ***
## density 1.623e-01 3.499e-02 4.639 3.65e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3289 on 3140 degrees of freedom
## Multiple R-squared: 0.6305, Adjusted R-squared: 0.6303
## F-statistic: 2679 on 2 and 3140 DF, p-value: < 2.2e-16