library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
First, set your working directory to the folder that contains your data file and your R script.
setwd("/Users/se776257/OneDrive - University of Central Florida/Desktop/Prof. An/02 Teaching/2024 Spring/PAD 7754 Quantitative Methods/Class Materials/R data/")
Import data from your working folder.
db <- read.csv("2022_HIC.csv", stringsAsFactors = FALSE)
To select only the cases in Florida, filter on CocState:
df <- db %>%
  filter(CocState == "FL")
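For readers who want to see what the filter step does without dplyr, the same row selection can be sketched in base R. The data frame `toy` below is made up for illustration; only the column names CocState and Total.Beds mirror the data above.

```r
# Hypothetical toy data; only the column names mirror the HIC file.
toy <- data.frame(
  CocState = c("FL", "GA", "FL", "AL"),
  Total.Beds = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Base R counterpart of filter(CocState == "FL"):
# keep the rows where the condition is TRUE, and keep every column.
toy_fl <- toy[toy$CocState == "FL", ]

nrow(toy_fl)  # number of Florida rows in the toy data
```

One difference to keep in mind: filter() drops rows where the condition evaluates to NA, whereas bracket indexing returns them as all-NA rows. subset(toy, CocState == "FL") behaves like filter() in this respect.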
Next, examine the relationship between the PIT count (the number of sheltered people) and the number of beds.
summary(df$PIT.Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 16.00 32.82 36.00 850.00 12
summary(df$Total.Beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 20.00 39.43 44.00 1010.00
ggplot(data = df) +
  geom_point(mapping = aes(x = PIT.Count, y = Total.Beds))
## Warning: Removed 12 rows containing missing values (`geom_point()`).
Let’s add a regression line:
ggplot(data = df) +
  geom_point(mapping = aes(x = PIT.Count, y = Total.Beds)) +
  geom_smooth(mapping = aes(x = PIT.Count, y = Total.Beds), method = lm)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 12 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
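The warnings above report 12 removed rows, which matches the 12 NA's that summary() shows for PIT.Count. As a minimal sketch of how one might count the missing values and drop the incomplete rows before plotting, here is a base R example on a made-up data frame (`toy`, `n_missing`, and `toy_complete` are illustrative names, not part of the original analysis).

```r
# Hypothetical toy data with one missing PIT count.
toy <- data.frame(
  PIT.Count = c(5, NA, 12, 30),
  Total.Beds = c(8, 10, 15, 33)
)

# Count the missing values in each column.
n_missing <- colSums(is.na(toy))

# Keep only the rows where both plotted variables are present.
toy_complete <- toy[complete.cases(toy[, c("PIT.Count", "Total.Beds")]), ]

n_missing
nrow(toy_complete)
```

A data frame prepared this way can be passed to ggplot() in place of df; since no rows would need to be removed, the warnings should no longer appear.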
From the scatterplots, we suspect there is a strong linear relationship between the two variables. Now we want to calculate the correlation.
cor(df$PIT.Count, df$Total.Beds)
## [1] NA
When a variable passed to cor() contains missing values, which is very common, the result is “NA.” To compute the correlation from only the rows where both variables are observed, add the following option.
cor(df$PIT.Count, df$Total.Beds, use = "complete.obs")
## [1] 0.9640077
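To see the effect of the use argument in isolation, here is a small sketch with two made-up vectors (`x` and `y` are hypothetical and unrelated to the HIC data). One value of `x` is missing, so the default call returns NA, while use = "complete.obs" drops that pair and computes the correlation from the rest.

```r
# Hypothetical vectors; the fourth value of x is missing.
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, 10)

# Default behaviour: any missing value makes the result NA.
r_default <- cor(x, y)

# Drop the incomplete pair (NA, 8) and use the remaining four pairs.
# Those pairs lie exactly on the line y = 2x, so the correlation is 1
# (up to floating-point rounding).
r_complete <- cor(x, y, use = "complete.obs")

r_default
r_complete
```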
Next, fit a linear regression of Total.Beds on PIT.Count. Let’s call our fitted model model1. Then we have:
model1 <- lm(formula = Total.Beds ~ PIT.Count, data = df)
summary(model1)
##
## Call:
## lm(formula = Total.Beds ~ PIT.Count, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.144 -5.604 -4.230 0.312 309.260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.863633 0.599650 6.443 1.67e-10 ***
## PIT.Count 1.091589 0.008539 127.832 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.7 on 1243 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9293, Adjusted R-squared: 0.9293
## F-statistic: 1.634e+04 on 1 and 1243 DF, p-value: < 2.2e-16
“lm” stands for linear model, so the first character is a lowercase letter L, not the digit 1. The first line fits the linear regression and stores the result in model1. The second line prints a summary of the results.
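Two things in the summary are worth checking by hand. First, with a single predictor, the Multiple R-squared equals the squared correlation: 0.9640077^2 is about 0.9293, which matches the output above. Second, a prediction is just the intercept plus the slope times the predictor. The sketch below illustrates both on a made-up data frame (`toy` and `fit` are hypothetical names), and then applies the coefficients reported for model1 to a PIT count of 100, a value chosen only for illustration.

```r
# Hypothetical toy data, not the HIC file.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

fit <- lm(y ~ x, data = toy)

# With one predictor, R-squared equals the squared correlation.
r_squared <- summary(fit)$r.squared
r <- cor(toy$x, toy$y)
all.equal(r_squared, r^2)

# A prediction is intercept + slope * x; predict() does the same thing.
by_hand <- unname(coef(fit)[1] + coef(fit)[2] * 6)
by_predict <- unname(predict(fit, newdata = data.frame(x = 6)))
all.equal(by_hand, by_predict)

# Using the coefficients printed above for model1, the predicted
# number of beds at a PIT count of 100 (about 113 beds):
beds_at_100 <- 3.863633 + 1.091589 * 100
beds_at_100
```

On the real data, predict(model1, newdata = data.frame(PIT.Count = 100)) should give the same value as the last calculation, up to rounding of the printed coefficients.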