The motivation for this final project comes from the article "Education and Socioeconomic Status."
The goal is to evaluate the impact of per_capita_income on educational attainment.
Web scraping is performed to fetch the FIPS code for every US county:
https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county
An API call is made to fetch Census data for all US counties:
https://api.census.gov/data/2019/acs/acs1?get=NAME,B01001_001E&for=county:*
A CSV file is downloaded from the Economic Research Service, U.S. Department of Agriculture:
https://www.ers.usda.gov/data-products/county-level-data-sets/download-data/
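The analysis below uses rvest for the scrape and the tidycensus package for the Census pull. As a point of reference, the raw API endpoint listed above can also be queried directly; the following minimal sketch (not used in the analysis) assumes the jsonlite package is available.
# Sketch only (assumes the jsonlite package; not used in the analysis below):
# query the raw Census API endpoint listed above and inspect the result.
library(jsonlite)
raw <- fromJSON("https://api.census.gov/data/2019/acs/acs1?get=NAME,B01001_001E&for=county:*")
head(raw)  # a character matrix whose first row holds the column names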
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.1.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(stringr)
library(rvest)
## Warning: package 'rvest' was built under R version 4.1.3
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(ggiraph)
## Warning: package 'ggiraph' was built under R version 4.1.3
library(ggiraphExtra)
## Warning: package 'ggiraphExtra' was built under R version 4.1.3
library(plyr)
## Warning: package 'plyr' was built under R version 4.1.3
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
##
## mutate
## The following objects are masked from 'package:plotly':
##
## arrange, mutate, rename, summarise
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
webpage <- read_html("https://en.wikipedia.org/wiki/List_of_United_States_FIPS_codes_by_county")
tbls <- html_nodes(webpage, "table")
head(tbls)
## {xml_nodeset (5)}
## [1] <table class="box-Very_long plainlinks metadata ambox ambox-style ambox-v ...
## [2] <table class="wikitable sortable">\n<caption>\n</caption>\n<tbody>\n<tr>\ ...
## [3] <table class="nowraplinks mw-collapsible expanded navbox-inner" style="bo ...
## [4] <table class="nowraplinks hlist mw-collapsible autocollapse navbox-inner" ...
## [5] <table class="nowraplinks mw-collapsible autocollapse navbox-inner" style ...
fips_df <- webpage %>%
html_nodes("#mw-content-text") %>% html_table() %>% .[[1]]
head(fips_df)
## # A tibble: 6 x 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 "" This article may be too long to read and navigate comf~ <NA>
## 2 "FIPS" County or equivalent State or equi~
## 3 "01001" Autauga County Alabama
## 4 "01003" Baldwin County Alabama
## 5 "01005" Barbour County Alabama
## 6 "01007" Bibb County Alabama
fips_df <- fips_df %>%
dplyr::rename(
fips_code = X1,
county = X2,
state = X3
)
head(fips_df)
## # A tibble: 6 x 3
## fips_code county state
## <chr> <chr> <chr>
## 1 "" This article may be too long to read and navigate co~ <NA>
## 2 "FIPS" County or equivalent State or equi~
## 3 "01001" Autauga County Alabama
## 4 "01003" Baldwin County Alabama
## 5 "01005" Barbour County Alabama
## 6 "01007" Bibb County Alabama
fips_df <- fips_df[-c(1, 2), ]
head(fips_df)
## # A tibble: 6 x 3
## fips_code county state
## <chr> <chr> <chr>
## 1 01001 Autauga County Alabama
## 2 01003 Baldwin County Alabama
## 3 01005 Barbour County Alabama
## 4 01007 Bibb County Alabama
## 5 01009 Blount County Alabama
## 6 01011 Bullock County Alabama
colnames(fips_df)
## [1] "fips_code" "county" "state"
# Drop the columns of the dataframe
#select (fips_df,-c('.'))
fips_df <- na.omit(fips_df) # Method 1 - Remove NA
head(fips_df)
## # A tibble: 6 x 3
## fips_code county state
## <chr> <chr> <chr>
## 1 01001 Autauga County Alabama
## 2 01003 Baldwin County Alabama
## 3 01005 Barbour County Alabama
## 4 01007 Bibb County Alabama
## 5 01009 Blount County Alabama
## 6 01011 Bullock County Alabama
#fips_df$county<-gsub(" County","",as.character(fips_df$county))
#head(fips_df)
names(fips_df) <- tolower(names(fips_df))
head(fips_df)
## # A tibble: 6 x 3
## fips_code county state
## <chr> <chr> <chr>
## 1 01001 Autauga County Alabama
## 2 01003 Baldwin County Alabama
## 3 01005 Barbour County Alabama
## 4 01007 Bibb County Alabama
## 5 01009 Blount County Alabama
## 6 01011 Bullock County Alabama
# Get working directory path
path <- getwd()
path
## [1] "C:/Users/Uzma/CUNY_SPS_PROJECTS/Data_607_Final_Project"
write.csv(fips_df, file.path(path, "/posgresql/fips_data.csv"))
fips_df$NAME = paste(fips_df$county, fips_df$state, sep=", ")
names(fips_df) <- tolower(names(fips_df))
head(fips_df)
## # A tibble: 6 x 4
## fips_code county state name
## <chr> <chr> <chr> <chr>
## 1 01001 Autauga County Alabama Autauga County, Alabama
## 2 01003 Baldwin County Alabama Baldwin County, Alabama
## 3 01005 Barbour County Alabama Barbour County, Alabama
## 4 01007 Bibb County Alabama Bibb County, Alabama
## 5 01009 Blount County Alabama Blount County, Alabama
## 6 01011 Bullock County Alabama Bullock County, Alabama
#str(census_df)
str(fips_df)
## tibble [3,249 x 4] (S3: tbl_df/tbl/data.frame)
## $ fips_code: chr [1:3249] "01001" "01003" "01005" "01007" ...
## $ county : chr [1:3249] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : chr [1:3249] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ name : chr [1:3249] "Autauga County, Alabama" "Baldwin County, Alabama" "Barbour County, Alabama" "Bibb County, Alabama" ...
## - attr(*, "na.action")= 'omit' Named int [1:13] 3243 3244 3245 3246 3247 3248 3249 3250 3251 3252 ...
## ..- attr(*, "names")= chr [1:13] "3243" "3244" "3245" "3246" ...
Sys.setenv(census_api_key = "f7ce5a76ddf2088c73fda9c0a87410995cebbffa")
Sys.getenv("census_api_key")
## [1] "f7ce5a76ddf2088c73fda9c0a87410995cebbffa"
#census_api_key("YOUR KEY GOES HERE", install = TRUE)
Sys.getenv("census_api_key")
## [1] "f7ce5a76ddf2088c73fda9c0a87410995cebbffa"
# https://api.census.gov/data/2019/acs/acs1?get=NAME,B01001_001E&for=county:*
census_df <- get_acs(
geography = "county",
variables = c(population = "B01003_001E",
median_age = "B01002_001E",
household_income = "B19013_001E",
per_capita_income = "B19301_001E",
poverty_count = "B17001_002E",
unemployment_count = "B23025_005E"
),
output = "wide",
year = 2020
)
## Getting data from the 2016-2020 5-year ACS
head(census_df)
## # A tibble: 6 x 14
## GEOID NAME population B01003_001M median_age B01002_001M household_income
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga ~ 55639 NA 38.6 0.6 57982
## 2 01003 Baldwin ~ 218289 NA 43.2 0.4 61756
## 3 01005 Barbour ~ 25026 NA 40.1 0.6 34990
## 4 01007 Bibb Cou~ 22374 NA 39.9 1.2 51721
## 5 01009 Blount C~ 57755 NA 41 0.5 48922
## 6 01011 Bullock ~ 10173 NA 39.7 1.9 33866
## # ... with 7 more variables: B19013_001M <dbl>, per_capita_income <dbl>,
## # B19301_001M <dbl>, poverty_count <dbl>, B17001_002M <dbl>,
## # unemployment_count <dbl>, B23025_005M <dbl>
names(census_df) <- tolower(names(census_df))
head(census_df)
## # A tibble: 6 x 14
## geoid name population b01003_001m median_age b01002_001m household_income
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Autauga ~ 55639 NA 38.6 0.6 57982
## 2 01003 Baldwin ~ 218289 NA 43.2 0.4 61756
## 3 01005 Barbour ~ 25026 NA 40.1 0.6 34990
## 4 01007 Bibb Cou~ 22374 NA 39.9 1.2 51721
## 5 01009 Blount C~ 57755 NA 41 0.5 48922
## 6 01011 Bullock ~ 10173 NA 39.7 1.9 33866
## # ... with 7 more variables: b19013_001m <dbl>, per_capita_income <dbl>,
## # b19301_001m <dbl>, poverty_count <dbl>, b17001_002m <dbl>,
## # unemployment_count <dbl>, b23025_005m <dbl>
census_df <- na.omit(census_df) # Remove NA
head(census_df)
## # A tibble: 6 x 14
## geoid name population b01003_001m median_age b01002_001m household_income
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49009 Daggett ~ 590 157 38.9 6.1 74911
## 2 49031 Piute Co~ 1870 157 53.4 7.7 29125
## 3 50009 Essex Co~ 6179 3 51.4 0.7 47035
## 4 31005 Arthur C~ 439 78 48 5.6 48500
## 5 31017 Brown Co~ 2887 143 48.6 2.4 41979
## 6 31029 Chase Co~ 3707 189 44.2 3.3 56135
## # ... with 7 more variables: b19013_001m <dbl>, per_capita_income <dbl>,
## # b19301_001m <dbl>, poverty_count <dbl>, b17001_002m <dbl>,
## # unemployment_count <dbl>, b23025_005m <dbl>
census_df <- select(census_df, name, population, median_age, household_income, per_capita_income, poverty_count, unemployment_count)
str(census_df)
## tibble [157 x 7] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:157] "Daggett County, Utah" "Piute County, Utah" "Essex County, Vermont" "Arthur County, Nebraska" ...
## $ population : num [1:157] 590 1870 6179 439 2887 ...
## $ median_age : num [1:157] 38.9 53.4 51.4 48 48.6 44.2 42.9 44.5 49.6 37.9 ...
## $ household_income : num [1:157] 74911 29125 47035 48500 41979 ...
## $ per_capita_income : num [1:157] 27568 18148 27583 25277 29420 ...
## $ poverty_count : num [1:157] 17 351 870 55 275 325 209 237 161 115 ...
## $ unemployment_count: num [1:157] 3 18 153 1 11 66 11 66 17 3 ...
## - attr(*, "na.action")= 'omit' Named int [1:3064] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:3064] "1" "2" "3" "4" ...
#sort descending
census_df[order(-census_df$household_income), ]
## # A tibble: 157 x 7
## name population median_age household_income per_capita_inco~ poverty_count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Nassau~ 1355683 41.9 120036 53363 72203
## 2 Suffol~ 1481364 41.7 105362 46466 93999
## 3 Alpine~ 1159 47.6 85750 37690 139
## 4 Maui C~ 166657 41.8 84363 36872 14836
## 5 Wake C~ 1091662 36.4 83567 42721 91083
## 6 Borden~ 653 37.2 83281 33412 24
## 7 Arapah~ 649980 36.8 80291 42184 49932
## 8 Bristo~ 739 44.8 79808 46950 36
## 9 Shelby~ 216350 39.5 78889 39711 14619
## 10 Steele~ 1817 45.9 77167 38907 174
## # ... with 147 more rows, and 1 more variable: unemployment_count <dbl>
census_df <- inner_join(census_df, fips_df, by = "name") # Applying inner_join() function
head(census_df)
## # A tibble: 6 x 10
## name population median_age household_income per_capita_inco~ poverty_count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Daggett~ 590 38.9 74911 27568 17
## 2 Piute C~ 1870 53.4 29125 18148 351
## 3 Essex C~ 6179 51.4 47035 27583 870
## 4 Arthur ~ 439 48 48500 25277 55
## 5 Brown C~ 2887 48.6 41979 29420 275
## 6 Chase C~ 3707 44.2 56135 30850 325
## # ... with 4 more variables: unemployment_count <dbl>, fips_code <chr>,
## # county <chr>, state <chr>
#census_df %>% separate(name, c("county","state"), sep = " County,")
census_df <- select(census_df,-c(name))
colnames(census_df) <- gsub(".x", "", colnames(census_df), fixed = TRUE)  # strip any ".x" join suffixes from column names
head(census_df)
## # A tibble: 6 x 9
## population median_age household_income per_capita_income poverty_count
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 590 38.9 74911 27568 17
## 2 1870 53.4 29125 18148 351
## 3 6179 51.4 47035 27583 870
## 4 439 48 48500 25277 55
## 5 2887 48.6 41979 29420 275
## 6 3707 44.2 56135 30850 325
## # ... with 4 more variables: unemployment_count <dbl>, fips_code <chr>,
## # county <chr>, state <chr>
census_df$county<-gsub(" County","",as.character(census_df$county))
head(census_df)
## # A tibble: 6 x 9
## population median_age household_income per_capita_income poverty_count
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 590 38.9 74911 27568 17
## 2 1870 53.4 29125 18148 351
## 3 6179 51.4 47035 27583 870
## 4 439 48 48500 25277 55
## 5 2887 48.6 41979 29420 275
## 6 3707 44.2 56135 30850 325
## # ... with 4 more variables: unemployment_count <dbl>, fips_code <chr>,
## # county <chr>, state <chr>
# Convert the character fips_code column to numeric
census_df <- transform(census_df,
fips_code = as.numeric(fips_code))
# Get working directory path
path <- getwd()
path
## [1] "C:/Users/Uzma/CUNY_SPS_PROJECTS/Data_607_Final_Project"
write.csv(census_df, file.path(path, "/posgresql/census_data.csv"))
education_df <- read_csv("https://raw.githubusercontent.com/uzmabb182/Data_607_Final_Project/main/ers_usda_education.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## State = col_character(),
## `Area name` = col_character(),
## `Less than a high school diploma, 1970` = col_number(),
## `High school diploma only, 1970` = col_number(),
## `Some college (1-3 years), 1970` = col_number(),
## `Four years of college or higher, 1970` = col_number(),
## `Less than a high school diploma, 1980` = col_number(),
## `High school diploma only, 1980` = col_number(),
## `Some college (1-3 years), 1980` = col_number(),
## `Four years of college or higher, 1980` = col_number(),
## `Less than a high school diploma, 1990` = col_number(),
## `High school diploma only, 1990` = col_number(),
## `Some college or associate's degree, 1990` = col_number(),
## `Bachelor's degree or higher, 1990` = col_number(),
## `Less than a high school diploma, 2000` = col_number(),
## `High school diploma only, 2000` = col_number(),
## `Some college or associate's degree, 2000` = col_number(),
## `Bachelor's degree or higher, 2000` = col_number(),
## `Less than a high school diploma, 2015-19` = col_number(),
## `High school diploma only, 2015-19` = col_number()
## # ... with 2 more columns
## )
## i Use `spec()` for the full column specifications.
#head(education_df)
#colnames(education_df)
# Keep the FIPS code column and drop columns 2-40, leaving only the 2015-19 education columns
education_df <- education_df[-c(2:40)]
# Drop the first row (likely the national total rather than a county)
education_df <- education_df[-c(1),]
#head(education_df)
education_df <- rename_with(education_df, ~ tolower(gsub(",", "", .x, fixed = TRUE)))
education_df <- rename_with(education_df, ~ tolower(gsub("'s", "", .x, fixed = TRUE)))
education_df <- rename_with(education_df, ~ tolower(gsub(" ", "_", .x, fixed = TRUE)))
education_df <- rename_with(education_df, ~ tolower(gsub("-", "_", .x, fixed = TRUE)))
head(education_df)
## # A tibble: 6 x 8
## fips_code high_school_dipl~ some_college_or~ bachelor_degree~ percent_of_adul~
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1000 1022839 993344 845772 13.8
## 2 1001 12551 10596 9929 11.5
## 3 1003 41797 47274 48148 9.2
## 4 1005 6396 4676 2080 26.8
## 5 1007 7256 3848 1678 20.9
## 6 1009 13299 13519 5210 19.5
## # ... with 3 more variables:
## # percent_of_adults_with_a_high_school_diploma_only_2015_19 <dbl>,
## # percent_of_adults_completing_some_college_or_associate_degree_2015_19 <dbl>,
## # percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 <dbl>
education_df <- na.omit(education_df) # Method 1 - Remove NA
#head(education_df)
###str(education_df)
# Get working directory path
path <- getwd()
path
## [1] "C:/Users/Uzma/CUNY_SPS_PROJECTS/Data_607_Final_Project"
write.csv(education_df, file.path(path, "/posgresql/education_data.csv"))
census_df$fips_code <- as.integer(census_df$fips_code)
education_df$fips_code <- as.integer(education_df$fips_code)
str(census_df)
## 'data.frame': 147 obs. of 9 variables:
## $ population : num 590 1870 6179 439 2887 ...
## $ median_age : num 38.9 53.4 51.4 48 48.6 44.2 42.9 44.5 49.6 37.9 ...
## $ household_income : num 74911 29125 47035 48500 41979 ...
## $ per_capita_income : num 27568 18148 27583 25277 29420 ...
## $ poverty_count : num 17 351 870 55 275 325 209 237 161 115 ...
## $ unemployment_count: num 3 18 153 1 11 66 11 66 17 3 ...
## $ fips_code : int 49009 49031 50009 31005 31017 31029 31057 31063 31069 31075 ...
## $ county : chr "Daggett" "Piute" "Essex" "Arthur" ...
## $ state : chr "Utah" "Utah" "Vermont" "Nebraska" ...
#str(education_df)
#dim(census_df)
#dim(education_df)
new_df <- census_df %>% inner_join(education_df, by="fips_code")
# Join by multiple columns
nrow(new_df)
## [1] 147
new_df <- new_df %>%
relocate(fips_code, .before=county)
new_df <-new_df %>%
relocate(state, .after=county)
head(new_df)
## population median_age household_income per_capita_income poverty_count
## 1 590 38.9 74911 27568 17
## 2 1870 53.4 29125 18148 351
## 3 6179 51.4 47035 27583 870
## 4 439 48.0 48500 25277 55
## 5 2887 48.6 41979 29420 275
## 6 3707 44.2 56135 30850 325
## unemployment_count fips_code county state
## 1 3 49009 Daggett Utah
## 2 18 49031 Piute Utah
## 3 153 50009 Essex Vermont
## 4 1 31005 Arthur Nebraska
## 5 11 31017 Brown Nebraska
## 6 66 31029 Chase Nebraska
## high_school_diploma_only_2015_19 some_college_or_associate_degree_2015_19
## 1 161 163
## 2 527 393
## 3 2175 1167
## 4 74 119
## 5 914 752
## 6 689 1105
## bachelor_degree_or_higher_2015_19
## 1 54
## 2 263
## 3 766
## 4 75
## 5 463
## 6 562
## percent_of_adults_with_less_than_a_high_school_diploma_2015_19
## 1 6.0
## 2 9.0
## 3 13.2
## 4 6.3
## 5 4.9
## 6 11.9
## percent_of_adults_with_a_high_school_diploma_only_2015_19
## 1 40.0
## 2 40.5
## 3 46.0
## 4 25.9
## 5 40.8
## 6 25.8
## percent_of_adults_completing_some_college_or_associate_degree_2015_19
## 1 40.5
## 2 30.2
## 3 24.7
## 4 41.6
## 5 33.6
## 6 41.3
## percent_of_adults_with_a_bachelor_degree_or_higher_2015_19
## 1 13.4
## 2 20.2
## 3 16.2
## 4 26.2
## 5 20.7
## 6 21.0
new_df$name = paste(new_df$county, new_df$state, sep=", ")
new_df <- new_df %>%
relocate(name, .before=county)
#head(new_df)
new_df <- mutate(new_df, percent_poverty_rate = (poverty_count/population)*100)
new_df <- mutate(new_df, percent_unemployment_rate = (unemployment_count/population)*100)
#head(new_df)
Before applying the linear regression model, the following four assumptions must be checked, and it is important to visualize the data with respect to each of them.
1. Independence of observations (i.e., no autocorrelation):
Since there is only one independent variable and one dependent variable, we do not need to test for hidden relationships among multiple predictors.
2. Normality:
Normality can be checked with the hist() function, which shows whether a variable's distribution is roughly normal.
hist(new_df$per_capita_income,
main = "Histohram for Per Capita income",
xlab = "Per Capita income",
col = "green",
border="blue"
)
hist(new_df$percent_of_adults_with_a_bachelor_degree_or_higher_2015_19,
main = "Histohram for Bachelor Degree or Higher",
xlab = "Bachelor Degree or Higher",
col = "blue",
border="green")
The histograms suggest that both variables follow an approximately normal distribution.
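As a supplementary check (not part of the original workflow), a formal normality test such as the Shapiro-Wilk test from base R can be run on each variable; with real-world county data it may flag small departures from normality even when the histograms look roughly normal.
# Supplementary sketch: Shapiro-Wilk normality tests for both variables.
# A p-value above 0.05 gives no strong evidence against normality.
shapiro.test(new_df$per_capita_income)
shapiro.test(new_df$percent_of_adults_with_a_bachelor_degree_or_higher_2015_19)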
3. Linearity:
The relationship between the independent and dependent variables must be linear. We can check this visually with a scatter plot to see whether the data points could be described by a straight line.
plot(percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 ~ per_capita_income, data = new_df,
main = "Percent of Adults with a Bachelor's Degree or Higher vs. Per Capita Income",
xlab = "Per Capita Income",
ylab = "Percent of Adults with a Bachelor's Degree or Higher",
col = "red")
The scatter plot shows a roughly linear relationship between the x and y variables.
4. Homoscedasticity (homogeneity of variance):
This means that the prediction error does not change significantly over the range of predictions made by the model.
We check this with the residual plots:
model <- lm(percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 ~ per_capita_income, data = new_df)
par(mfrow=c(2,2))
plot(model, col = "blue")
The red lines representing the mean of the residuals are essentially horizontal and centered around zero, which suggests there are no obvious outliers or biases in the data that would make a linear regression invalid.
In the Normal Q-Q plot in the top right, the residuals from our model fall almost perfectly along the one-to-one line against the theoretical quantiles.
Based on these residual plots, we can say that our model meets the assumption of homoscedasticity, and the appropriate visualizations help us better understand the data.
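For a formal check to complement the residual plots, a Breusch-Pagan test could also be run; this sketch assumes the lmtest package (not loaded above) is installed.
# Sketch (assumes the lmtest package, which is not loaded above):
# Breusch-Pagan test of homoscedasticity; a p-value above 0.05 is
# consistent with constant residual variance.
library(lmtest)
bptest(model)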
Regression is a supervised learning technique that helps determine how one variable influences another.
Simple linear regression models the relationship between a numeric dependent variable (Y) and a single explanatory or independent variable (X).
Here the dependent variable (Y) is percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 and the independent variable (X) is per_capita_income.
percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 = b0 + b1 * per_capita_income
model <- lm(percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 ~ per_capita_income, data = new_df)
summary(model)
##
## Call:
## lm(formula = percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 ~
## per_capita_income, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.9469 -4.0576 -0.5654 3.5357 23.5879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.294e-02 2.886e+00 -0.008 0.994
## per_capita_income 7.420e-04 9.374e-05 7.916 5.83e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.967 on 145 degrees of freedom
## Multiple R-squared: 0.3017, Adjusted R-squared: 0.2969
## F-statistic: 62.66 on 1 and 145 DF, p-value: 5.829e-13
Null hypothesis (H0): the model with no predictor variables (an intercept-only model) fits the data as well as the regression model defined here.
Alternative hypothesis (HA): the regression model fits the data better than the intercept-only model.
A large F-statistic corresponds to a statistically significant p-value (p < 0.05). In our example, the F-statistic equals 62.7 with a p-value of 5.83e-13, which is highly significant. This means the predictor variable is significantly related to the outcome variable.
We therefore reject the null hypothesis and conclude that the regression model fits the data better than the intercept-only model.
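As a quick sanity check (not part of the original analysis), the reported p-value can be recomputed from the F-statistic and its degrees of freedom shown in the summary above.
# Recompute the p-value from the F-statistic (value, numdf, dendf).
f_stat <- summary(model)$fstatistic
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)
# Should match the reported p-value of about 5.83e-13.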
The coefficients table shows the estimate of regression beta coefficients and the associated t-statistics p-values:
summary(model)$coefficient
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0229359887 2.886142e+00 -0.007946937 9.936703e-01
## per_capita_income 0.0007420436 9.374395e-05 7.915642488 5.829052e-13
For a given predictor, the t-statistic evaluates whether there is a significant association between the predictor and the outcome variable, that is, whether the predictor's beta coefficient is significantly different from zero.
Here we can see that changes in per_capita_income are significantly associated with changes in percent_of_adults_with_a_bachelor_degree_or_higher_2015_19.
For a given predictor variable, the coefficient (b) can be interpreted as the average effect on Y of a one-unit increase in that predictor, holding all other predictors fixed.
A one-dollar increase in per_capita_income is therefore associated with an increase of about 0.00074 percentage points in percent_of_adults_with_a_bachelor_degree_or_higher_2015_19, or roughly 0.74 percentage points per $1,000 of per capita income, as illustrated below.
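The sketch below (using hypothetical per_capita_income values, not drawn from the data) uses predict() to make this concrete: two incomes $1,000 apart should differ in predicted attainment by roughly slope * 1000.
# Illustration with hypothetical inputs: predicted outcome at two
# per_capita_income values that differ by $1,000.
newdata <- data.frame(per_capita_income = c(25000, 26000))
predict(model, newdata = newdata)
diff(predict(model, newdata = newdata))  # ~ 0.00074 * 1000 = ~0.74 percentage points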
confint(model)
## 2.5 % 97.5 %
## (Intercept) -5.7272787106 5.6814067332
## per_capita_income 0.0005567625 0.0009273247
R-squared:
The value of R² always lies between zero and one. An R² value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
Here R² = 0.302, meaning that about 30% of the variance in percent_of_adults_with_a_bachelor_degree_or_higher_2015_19 is explained by the predictor (per_capita_income).
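Because this is a simple (one-predictor) regression, R² equals the squared Pearson correlation between the outcome and the predictor, which gives a quick cross-check (the correlation itself, about 0.549, is computed further below).
# Cross-check: in simple linear regression, R-squared equals the squared
# Pearson correlation between outcome and predictor (both ~ 0.302 here).
summary(model)$r.squared
cor(new_df$percent_of_adults_with_a_bachelor_degree_or_higher_2015_19,
    new_df$per_capita_income)^2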
The RSE (residual standard error) gives a measure of prediction error; the lower the RSE, the more accurate the model on the data at hand.
The error rate can be estimated by dividing the RSE by the mean of the outcome variable:
sigma(model)/mean(new_df$percent_of_adults_with_a_bachelor_degree_or_higher_2015_19)
## [1] 0.3115227
cor(new_df$percent_of_adults_with_a_bachelor_degree_or_higher_2015_19, new_df$per_capita_income, method="pearson")
## [1] 0.5493036
ggscatter(new_df, x = "percent_of_adults_with_a_bachelor_degree_or_higher_2015_19", y = "per_capita_income",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "percent_of_adults_with_a_bachelor_degree_or_higher_2015_19", ylab = "per_capita_income", col = "blue")
## `geom_smooth()` using formula 'y ~ x'
ggPredict(model,interactive=TRUE,colorn=10,jitter=FALSE)