library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(readxl)
library(pastecs)
## 
## Attaching package: 'pastecs'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
  1. Load your chosen dataset into Rmarkdown
dataset<-read.csv("state_data_combo.csv")
  1. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable
library_visit_rate<-dataset$Library.Visits.Per.Capita
poverty_rate<-dataset$Poverty.Rate....
unemployment_rate<-dataset$Unemployment.rate.2019
no_computer_ownership<-dataset$Percent.with.no.home.computer..2018.
broadband_rate<-dataset$Percent.with.home.Broadband
  1. Create a linear model using the “lm()” command, save it to some object
linear_model<-lm(library_visit_rate~poverty_rate+unemployment_rate+no_computer_ownership+broadband_rate,data=dataset)
  1. Call a “summary()” on your new model
summary(linear_model)
## 
## Call:
## lm(formula = library_visit_rate ~ poverty_rate + unemployment_rate + 
##     no_computer_ownership + broadband_rate, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0579 -0.5898 -0.0346  0.4654  1.7209 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            7.35273    8.06680   0.911    0.367
## poverty_rate          -0.09598    0.07820  -1.227    0.226
## unemployment_rate     -0.01574    0.11185  -0.141    0.889
## no_computer_ownership -0.04981    0.09756  -0.511    0.612
## broadband_rate        -0.03848    0.08317  -0.463    0.646
## 
## Residual standard error: 0.7014 on 46 degrees of freedom
## Multiple R-squared:  0.1398, Adjusted R-squared:  0.06496 
## F-statistic: 1.868 on 4 and 46 DF,  p-value: 0.1321
  1. interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?

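The adjusted R-squared is .06496, which means that this linear regression model explains only about 6.5% of the variation in library visits per capita after adjusting for the number of predictors (the unadjusted R-squared is .1398).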
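As a quick check on how the adjusted value relates to the multiple R-squared, the standard adjustment can be applied by hand. This is a minimal sketch that uses only numbers printed in the summary above (a multiple R-squared of 0.1398, 4 predictors, and 46 residual degrees of freedom); the variable names are my own.

```r
# Values read off the summary() output above
r_squared    <- 0.1398
n_predictors <- 4
df_residual  <- 46

# Residual df = n - predictors - 1, so the number of observations is:
n_obs <- df_residual + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_residual
print(round(adj_r_squared, 4))
```

The result should land very close to the .06496 that summary() reports, with any small difference coming from rounding in the printed R-squared.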

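The overall p-value for the F-statistic is 0.1321, which is above the conventional .05 threshold. This means I cannot reject the null hypothesis that all of the coefficients are zero: the model does not provide statistically significant evidence that the independent variables I selected, taken together, are related to the dependent variable.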
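The overall p-value comes from the F-distribution, so it can be recomputed from the F-statistic and its degrees of freedom. A minimal sketch, assuming only the values printed in the summary above (F of 1.868 on 4 and 46 degrees of freedom); the variable names are my own.

```r
# Values read off the last line of the summary() output above
f_value     <- 1.868
df_model    <- 4
df_residual <- 46

# Upper-tail probability of seeing an F at least this large
# if every slope coefficient were truly zero
p_overall <- pf(f_value, df_model, df_residual, lower.tail = FALSE)
print(round(p_overall, 4))
```

This should come out close to the 0.1321 shown in the summary, up to rounding of the printed F-statistic.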

None of my independent variables are statistically significant, as all of their p-values are well above .05. Poverty rate comes closest to significance, with a p-value of .226.

The most surprisingly insignificant variable is the unemployment rate. Anecdotally, local libraries in my experience offer programs that help unemployed citizens with job applications, and they provide computer stations for resume editing and browsing online job postings.

  1. Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable?
many_iv_model<-lm(Total.Circulation.Per.Capita~Unemployment.rate.2019+Poverty.Rate....+Percent.with.no.home.computer..2018.+Percent.with.home.Broadband,data=dataset)
summary(many_iv_model)
## 
## Call:
## lm(formula = Total.Circulation.Per.Capita ~ Unemployment.rate.2019 + 
##     Poverty.Rate.... + Percent.with.no.home.computer..2018. + 
##     Percent.with.home.Broadband, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8218 -1.4858 -0.1292  1.5521  6.7124 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                            1.2410    24.3769   0.051    0.960
## Unemployment.rate.2019                -0.2501     0.3380  -0.740    0.463
## Poverty.Rate....                       0.1244     0.2363   0.527    0.601
## Percent.with.no.home.computer..2018.  -0.3378     0.2948  -1.146    0.258
## Percent.with.home.Broadband            0.1009     0.2513   0.402    0.690
## 
## Residual standard error: 2.12 on 46 degrees of freedom
## Multiple R-squared:  0.2532, Adjusted R-squared:  0.1882 
## F-statistic: 3.899 on 4 and 46 DF,  p-value: 0.008279

Two of the independent variables have positive estimates, meaning they are associated with higher circulation per capita when the other variables are held constant. For instance, if a state's poverty rate increased by one percentage point, the model estimates that circulation per capita would increase by .1244. Likewise, if the percentage of people with home broadband increased by one percentage point, the estimated circulation per capita would rise by .1009. Neither estimate is statistically significant (p-values of .601 and .690), so these effects cannot be distinguished from zero.

Additionally, two of the independent variables have negative estimates. A one-percentage-point increase in the unemployment rate or in the percentage of the population that does not own a home computer is estimated to lower circulation per capita by .2501 and .3378 respectively, again holding the other variables constant.
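To gauge how precisely each of these coefficients is estimated, approximate 95% confidence intervals can be built from the estimates and standard errors in the summary table. This is a rough sketch that copies those printed values by hand; the short names (`unemployment`, `poverty`, and so on) are my own labels, not the dataset's column names.

```r
# Estimates and standard errors copied from the many_iv_model summary above
estimates  <- c(unemployment = -0.2501, poverty = 0.1244,
                no_computer = -0.3378, broadband = 0.1009)
std_errors <- c(0.3380, 0.2363, 0.2948, 0.2513)
df_residual <- 46

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df_residual)

# Interval = estimate +/- critical value * standard error
ci <- cbind(lower = estimates - t_crit * std_errors,
            upper = estimates + t_crit * std_errors)
print(round(ci, 3))
```

An interval that contains zero means the sign of that effect is not pinned down by the data. With the fitted model in hand, `confint(many_iv_model)` returns the same kind of intervals directly.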

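The coefficients for broadband internet access and for having no home computer have opposite signs. Because the computer variable measures the share of people without a computer, this is the direction one would expect: more broadband access and fewer households without a computer both go with higher circulation. These two variables are likely to be strongly correlated with each other, which may help explain why the overall model is significant (p-value of .008279) while no individual coefficient is.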
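One way to probe how much the predictors overlap is a variance inflation factor (VIF) for each one. The sketch below is a hand-rolled helper, `vif_manual()`, which is a hypothetical name of my own and not a function from any loaded package. It is demonstrated on simulated stand-in data, not the state dataset, so the printed values say nothing about the real variables.

```r
# VIF for each predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all of the other predictors.
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  }, numeric(1))
}

# Simulated stand-in data: x2 is nearly a copy of x1, x3 is unrelated
set.seed(1)
sim <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
sim$x2 <- sim$x1 + rnorm(100, sd = 0.1)

vifs <- vif_manual(sim, c("x1", "x2", "x3"))
print(round(vifs, 1))
```

On the real data the call would be `vif_manual(dataset, c("Unemployment.rate.2019", "Poverty.Rate....", "Percent.with.no.home.computer..2018.", "Percent.with.home.Broadband"))`. A common rule of thumb treats values above roughly 5 to 10 as a sign that a predictor is largely explained by the others.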

  1. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”
plot(linear_model,which=1)

Looking at this residuals-versus-fitted plot, I would interpret my model as having violated the assumption of linearity. The residuals do not appear to scatter evenly around a flat line at zero, and their spread does not appear to be sufficiently constant (homoscedastic) across the fitted values.
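A visual read of the residual spread can be backed up with a formal test. The sketch below hand-computes the studentized (Koenker) form of the Breusch-Pagan test in base R. The helper name `bp_koenker()` is my own, and the data are simulated with a deliberately growing error variance, so the result illustrates the technique and is not a finding about the library model.

```r
# Studentized Breusch-Pagan test: regress the squared residuals on the
# model's predictors; n * R^2 of that auxiliary regression is compared
# against a chi-squared distribution.
bp_koenker <- function(model) {
  e2  <- residuals(model)^2
  X   <- model.matrix(model)          # includes the intercept column
  aux <- lm.fit(X, e2)
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  statistic <- length(e2) * r2
  df <- ncol(X) - 1                   # number of predictors
  list(statistic = statistic, df = df,
       p.value = pchisq(statistic, df, lower.tail = FALSE))
}

# Simulated data whose noise grows with x (heteroscedastic by design)
set.seed(42)
x <- runif(200, 1, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 0.3 * x)
sim_model <- lm(y ~ x)

bp <- bp_koenker(sim_model)
print(bp$p.value)
```

A small p-value is evidence against constant variance. The lmtest package loaded at the top of this document provides `bptest()` for the same purpose, so `bptest(linear_model)` would be the packaged way to run it on the actual model. This tests the spread of the residuals only; linearity itself is still judged from whether the smoothed line in `plot(linear_model, which=1)` stays close to zero.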