For my final project, I want to analyze obesity levels in each of Denver's neighborhoods using the open-source Census data found at: https://www.denvergov.org/opendata/dataset/city-and-county-of-denver-american-community-survey-nbrhd-2016-2020. Note: I have cleaned the data a bit to make it easier to use for this project.
First step: I need to read in my data. My data is in .csv form, so I will use the read.csv function in R. I will call my data set 'adult'.
adult <- read.csv('/Users/meganduff/Desktop/d2pfiles/adultob.csv', header = TRUE)
What does my data look like? The 'head' function shows the first 6 rows of my dataset (its default) for all of the collected variables. Each row in this dataset is an observation: one of Denver's neighborhoods.
head(adult)
## NEIGHBORHOOD_NAME NEIGHBORHOOD_ID TOTALPOP_INREGISTRY
## 1 College View - South Platte 19 2250
## 2 Overland 50 705
## 3 Ruby Hill 54 3826
## 4 Kennedy 40 1003
## 5 Hampden 32 4365
## 6 Baker 3 2075
## PERCENT_OBESE_ COUNT_ADULTS_OBESE CONFIDENCE_INTERVAL95
## 1 34.73 781.4250 (32.61, 36.85)
## 2 31.66 223.2030 (28.03, 35.28)
## 3 37.49 1434.3674 (35.89, 39.09)
## 4 29.70 297.8910 (26.65, 32.75)
## 5 28.72 1253.6280 (27.3, 30.14)
## 6 25.65 532.2375 (23.67, 27.63)
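To try the read-and-inspect step without the original file, here is a minimal sketch that writes three of the rows shown above to a temporary CSV and reads them back. The temporary file and the name 'toy' are assumptions for illustration only, and only four of the columns are included; the real analysis uses adultob.csv.

```r
# Hypothetical stand-in for adultob.csv: three rows copied from the head() output above
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "NEIGHBORHOOD_NAME,NEIGHBORHOOD_ID,TOTALPOP_INREGISTRY,PERCENT_OBESE_",
  "College View - South Platte,19,2250,34.73",
  "Overland,50,705,31.66",
  "Ruby Hill,54,3826,37.49"
), toy_path)

# Read the file back in; header = TRUE uses the first line as column names
toy <- read.csv(toy_path, header = TRUE)

str(toy)           # column names and types
head(toy, n = 2)   # n controls how many rows are shown (the default is 6)
dim(toy)           # number of rows and columns
```

Checking str() and dim() right after reading is a quick way to confirm that numeric columns were not read in as text.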
Now I want to pull out the percentage of obese individuals (PERCENT_OBESE_) variable to make it easier to work with. I will call this variable 'pctobs'. The '$' pulls the 'PERCENT_OBESE_' column from my dataset 'adult'.
pctobs <- adult$PERCENT_OBESE_
Now I want to explore my data through the following statistics/plots:
-- Get the five number summary (plus the mean) of PERCENT_OBESE_
summary(pctobs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 17.84 27.12 26.82 35.17 42.87
-- Get the density plot for PERCENT_OBESE_
plot(density(pctobs), main = "Percentage of Adult Obesity")
-- Get the box plot for PERCENT_OBESE_
boxplot(pctobs, main="Box Plot for Percent Obese")
-- Get the histogram plot for PERCENT_OBESE_
hist(pctobs, main="Histogram for Percent Obese")
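Note that summary() reports six values, because it adds the mean to the five number summary. If only the five numbers are wanted, base R's fivenum() returns them. The sketch below uses the six PERCENT_OBESE_ values from the head() output above as a small illustration; 'x' is a hypothetical name, not part of the analysis.

```r
# Six PERCENT_OBESE_ values copied from the head() output above (illustration only)
x <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)

summary(x)   # min, 1st quartile, median, mean, 3rd quartile, max
fivenum(x)   # min, lower hinge, median, upper hinge, max
```

fivenum() uses Tukey's hinges, so its middle values can differ slightly from the quartiles that summary() reports.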
Let's look at another variable in my set: TOTALPOP_INREGISTRY. If I look at the five number summary, notice that the minimum is 0. Does it make sense that 0 people live in a neighborhood? No! This is most likely a data entry or imputation error, so I should remove that data point from the analysis.
summary(adult$TOTALPOP_INREGISTRY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1247 1968 2359 3198 11887
How do I remove the data entry that had a TOTALPOP_INREGISTRY of 0? I am going to subset the data set, keeping only the data points that did NOT have a TOTALPOP_INREGISTRY of 0.
adult.cleaned <- adult[adult$TOTALPOP_INREGISTRY != 0, ]
Note that the minimum is no longer 0.
summary(adult.cleaned$TOTALPOP_INREGISTRY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 402 1261 2053 2390 3242 11887
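The same filtering pattern can be checked on a small made-up data frame. This is only a sketch: the names 'toy', 'toy.cleaned', and 'pctobs.toy' are hypothetical, and the values are not the Denver data.

```r
# Made-up data frame for illustration; not the Denver data
toy <- data.frame(
  NEIGHBORHOOD_NAME   = c("A", "B", "C", "D"),
  TOTALPOP_INREGISTRY = c(2250, 0, 705, 3826),
  PERCENT_OBESE_      = c(34.73, 0, 31.66, 37.49)
)

# Which rows have a population of 0?
which(toy$TOTALPOP_INREGISTRY == 0)

# Keep only rows with a nonzero population (same pattern as adult.cleaned)
toy.cleaned <- toy[toy$TOTALPOP_INREGISTRY != 0, ]

nrow(toy)           # rows before filtering
nrow(toy.cleaned)   # rows after filtering

# Re-extract the variable so later statistics use only the cleaned rows
pctobs.toy <- toy.cleaned$PERCENT_OBESE_
```

In the real analysis, the analogous last step would be extracting PERCENT_OBESE_ from adult.cleaned rather than from adult.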
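Find the sample mean of the percentage of obese adults in Denver neighborhoods. Note that 'pctobs' was extracted from the original 'adult' data before cleaning, so the statistics below still include the zero-population data point.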
mean(pctobs)
## [1] 26.82397
Find the sample standard deviation of the percentage of obese adults in Denver neighborhoods.
sd(pctobs)
## [1] 9.37425
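R's sd() computes the sample standard deviation, which divides by n - 1 rather than n. As a quick check, the sketch below recomputes the mean and standard deviation by hand on the six PERCENT_OBESE_ values from the head() output above; the names 'x', 'xbar', and 's_by_hand' are hypothetical.

```r
# Six PERCENT_OBESE_ values copied from the head() output above (illustration only)
x <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)
n <- length(x)

# Sample mean: sum of the values divided by n
xbar <- sum(x) / n

# Sample standard deviation: square root of the sum of squared deviations over (n - 1)
s_by_hand <- sqrt(sum((x - xbar)^2) / (n - 1))

all.equal(xbar, mean(x))      # TRUE if the hand calculation matches mean()
all.equal(s_by_hand, sd(x))   # TRUE if the hand calculation matches sd()
```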
What if you want to find a sample proportion with your data set? Let's say I want to find the proportion of Denver neighborhoods that have an obesity percentage over 30%. First, I need to find the number of successes (X), aka the number of neighborhoods that have a percentage over 30.
success <- subset(pctobs, pctobs>30)
length(success)
## [1] 30
Therefore, there are 30 neighborhoods with an obesity percentage > 30. To get the total sample size (n), I take the length of the full variable:
length(pctobs)
## [1] 78
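With X = 30 and n = 78, the sample proportion is X / n, which is about 0.385. The sketch below does that division with the two counts reported above, then shows an equivalent shortcut on the six values from the head() output; the names 'X', 'n', 'p_hat', and 'toy' are chosen here for illustration.

```r
X <- 30   # neighborhoods with PERCENT_OBESE_ > 30, from length(success) above
n <- 78   # total number of neighborhoods, from length(pctobs) above

# Sample proportion: successes divided by sample size
p_hat <- X / n
round(p_hat, 4)

# Shortcut on a vector: the mean of a logical vector is the proportion of TRUE values
toy <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)   # illustration only
mean(toy > 30)
```

On the full data, the same shortcut would be the mean of pctobs > 30, which avoids computing X and n separately.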
If your sample size is under 30, then we need to check whether the population is approximately normal. We can do that using the 'qqPlot' function from the 'car' package:
library("car")
## Loading required package: carData
qqPlot(pctobs)
## [1] 41 77
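Notice that quite a few points lie outside of the confidence envelope. (The two numbers printed above are the indices of the most extreme points, which qqPlot labels on the plot.)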
qqPlot(adult$TOTALPOP_INREGISTRY)
## [1] 15 34
Notice here that only about 2-4 points lie outside, so we can say that the total population for each neighborhood is approximately normal!
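To get a feel for what these plots look like when the answer is known, here is a rough sketch using base R's qqnorm() and qqline() on simulated data. Everything in it is an assumption for illustration: the samples are randomly generated, not the Denver data, and the means, spreads, and names are arbitrary.

```r
set.seed(1)  # simulated data for illustration only; not the Denver data

normal_sample <- rnorm(78, mean = 27, sd = 9)      # drawn from a normal distribution
skewed_sample <- rexp(78, rate = 1 / 2400)         # drawn from a right-skewed distribution

# Side-by-side base-R Q-Q plots
op <- par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Simulated normal")
qqline(normal_sample)
qqnorm(skewed_sample, main = "Simulated right-skewed")
qqline(skewed_sample)
par(op)  # restore the previous plot layout
```

Points from the normal sample should fall close to the reference line, while the skewed sample should bend away from it at the ends. Unlike qqPlot, these base-R plots do not draw a confidence envelope.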