library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Working Directory

First, set your working directory to the folder that contains your data file and your R script.

setwd("/Users/se776257/OneDrive - University of Central Florida/Desktop/Prof. An/02 Teaching/2024 Spring/PAD 7754 Quantitative Methods/Class Materials/R data/")
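As a quick sanity check, you can confirm where R is currently looking and whether it can see the data file. This is only a sketch: it assumes the file is named 2022_HIC.csv, as in the import step of this handout, and it prints FALSE if the working directory is not set correctly.

```r
# Show the folder R is currently using as the working directory
getwd()

# TRUE if the data file is visible from that folder, FALSE otherwise
file_found <- file.exists("2022_HIC.csv")
file_found
```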

Import data from your working folder.

db <- read.csv("2022_HIC.csv", stringsAsFactors = FALSE)
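The sketch below shows what read.csv() does to a file on import, using a tiny made-up CSV written to a temporary file (the values are invented, and the header with spaces is an assumption about how such a file might look). It illustrates two behaviors that matter later: spaces in column names become dots, and empty numeric cells become NA.

```r
# Toy CSV written to a temporary file; the values are made up for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c("CocState,PIT Count,Total Beds",
             "FL,10,12",
             "FL,,8",
             "GA,5,6"), tmp)

toy <- read.csv(tmp, stringsAsFactors = FALSE)

names(toy)  # spaces in the header are converted to dots
str(toy)    # column types; the empty PIT Count cell is read as NA
```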

To keep only the cases in Florida, filter on the state code:

df <- db %>%
  filter(CocState == "FL")
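After filtering, it is worth confirming that only the rows you wanted remain. The example below is a minimal sketch on a made-up data frame (toy and fl are hypothetical names). Because dplyr masks stats::filter, writing dplyr::filter() explicitly is one way to avoid ambiguity.

```r
library(dplyr)

# Made-up data frame standing in for the imported data
toy <- data.frame(CocState = c("FL", "FL", "GA"),
                  Total.Beds = c(12, 8, 6))

# Keep only the Florida rows
fl <- toy %>%
  dplyr::filter(CocState == "FL")

nrow(fl)             # number of rows kept
unique(fl$CocState)  # should list only "FL"
```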

Scatterplots

Suppose you want to examine the relationship between the PIT count (number of sheltered people) and the number of beds.

summary(df$PIT.Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.00   16.00   32.82   36.00  850.00      12
summary(df$Total.Beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   20.00   39.43   44.00 1010.00
ggplot(data = df) + 
  geom_point(mapping = aes(x = PIT.Count, y = Total.Beds))
## Warning: Removed 12 rows containing missing values (`geom_point()`).
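The warning means that some rows could not be plotted because one of the two variables is missing. A sketch of how to count such rows yourself, using short made-up vectors in place of PIT.Count and Total.Beds:

```r
# Made-up vectors standing in for PIT.Count (x) and Total.Beds (y)
x <- c(10, NA, 5, 7)
y <- c(12, 8, 6, 9)

n_missing  <- sum(is.na(x))              # how many x values are missing
n_complete <- sum(complete.cases(x, y))  # rows where both values are present

n_missing
n_complete
```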

Let’s add a regression line.

ggplot(data = df) + 
  geom_point(mapping = aes(x = PIT.Count, y = Total.Beds)) +
  geom_smooth(mapping = aes(x = PIT.Count, y = Total.Beds), method = lm)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
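The same plot can be written more compactly by putting the aesthetic mapping in ggplot() once, so both layers inherit it. This is a sketch on made-up data (toy is a hypothetical data frame). Setting na.rm = TRUE drops missing rows silently, and giving the formula explicitly suppresses the informational message.

```r
library(ggplot2)

# Made-up data with one missing value
toy <- data.frame(PIT.Count  = c(10, NA, 5, 7, 20),
                  Total.Beds = c(12, 8, 6, 9, 25))

# The mapping is declared once and shared by the points and the fitted line
p <- ggplot(data = toy, mapping = aes(x = PIT.Count, y = Total.Beds)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, na.rm = TRUE)

p
```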

Correlation

From the scatterplots, we suspect there is a strong linear relationship between the two variables. Now we want to calculate the correlation.

cor(df$PIT.Count, df$Total.Beds)
## [1] NA

When you run cor() on variables that contain missing values, which is very common, the result is NA. To fix this, add the following option so that only rows with both values present are used.

cor(df$PIT.Count, df$Total.Beds, use = "complete.obs")
## [1] 0.9640077
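To see what use = "complete.obs" does, the sketch below compares it with dropping the incomplete rows by hand. It uses made-up vectors rather than the course data. It also calls cor.test(), which is one way to check whether a correlation is statistically significant.

```r
# Made-up vectors; the second x value is missing
x <- c(10, NA, 5, 7, 20)
y <- c(12, 8, 6, 9, 25)

# Correlation using the built-in option
r_opt <- cor(x, y, use = "complete.obs")

# The same calculation after removing incomplete rows manually
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

all.equal(r_opt, r_manual)  # TRUE: both approaches agree

# p-value for the test that the true correlation is zero
p_value <- cor.test(x, y)$p.value
p_value
```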

Least Squares Regression

Let’s call our fitted model model1. Then we have:

model1 <- lm(formula = Total.Beds ~ PIT.Count, data = df)

summary(model1)
## 
## Call:
## lm(formula = Total.Beds ~ PIT.Count, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.144  -5.604  -4.230   0.312 309.260 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.863633   0.599650   6.443 1.67e-10 ***
## PIT.Count   1.091589   0.008539 127.832  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.7 on 1243 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9293, Adjusted R-squared:  0.9293 
## F-statistic: 1.634e+04 on 1 and 1243 DF,  p-value: < 2.2e-16

“lm” stands for “linear model,” so the first letter is a lowercase L, not the number 1. The first line fits the linear regression and stores it as model1. The second line prints the results.
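The printed coefficients define the fitted line: predicted Total.Beds = 3.863633 + 1.091589 × PIT.Count. The sketch below plugs in a PIT count of 100 (an arbitrary illustrative value) and also checks that, in a regression with a single predictor, R-squared equals the squared correlation. The numbers are copied from the output shown in this handout.

```r
# Coefficients copied from the summary(model1) output
b0 <- 3.863633  # intercept
b1 <- 1.091589  # slope for PIT.Count

# Predicted number of beds for a PIT count of 100 (illustrative value)
predicted_beds <- b0 + b1 * 100
predicted_beds  # 113.0225

# Squared correlation, using the value reported by cor()
r <- 0.9640077
r_squared <- r^2
r_squared  # about 0.9293, matching Multiple R-squared
```

With the fitted model object available, coef(model1) returns the same two coefficients without retyping them.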