Working with Data Example

Final Project

For my final project, I want to analyze obesity levels in each of Denver's neighborhoods using the open-source census data found at: https://www.denvergov.org/opendata/dataset/city-and-county-of-denver-american-community-survey-nbrhd-2016-2020. Note: I have cleaned the data a bit to make it easier to use for this project.

First step: I need to read in my data. My data is in .csv format, so I will use the read.csv function in R. I will call my data set 'adult'.

adult <- read.csv('/Users/meganduff/Desktop/d2pfiles/adultob.csv', header = TRUE)
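Before moving on, it can help to confirm that the file was read the way you expect. The sketch below is self-contained: instead of the real 'adultob.csv', it writes a tiny temporary CSV and reads it back. The three rows are taken from the head() output in this document; the temporary file and the names 'demo_path' and 'adult_demo' are hypothetical and used only for illustration.

```r
# Toy stand-in for adultob.csv: three rows taken from the head() output
# in this document. The temporary file is an assumption for illustration.
demo_path <- tempfile(fileext = ".csv")
writeLines(c(
  "NEIGHBORHOOD_NAME,NEIGHBORHOOD_ID,TOTALPOP_INREGISTRY,PERCENT_OBESE_",
  "College View - South Platte,19,2250,34.73",
  "Overland,50,705,31.66",
  "Ruby Hill,54,3826,37.49"
), demo_path)

# header = TRUE tells R that the first line holds the column names
adult_demo <- read.csv(demo_path, header = TRUE)

dim(adult_demo)    # number of rows and columns
names(adult_demo)  # column names as R read them
str(adult_demo)    # type of each column
```

Writing out TRUE instead of T is a little safer, because T is an ordinary variable that can be overwritten, while TRUE cannot.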

What does my data look like? The 'head' function shows the first 6 rows of my dataset for all of the collected variables. Each row in this dataset is one observation: one of Denver's neighborhoods.

head(adult)

##             NEIGHBORHOOD_NAME NEIGHBORHOOD_ID TOTALPOP_INREGISTRY
## 1 College View - South Platte              19                2250
## 2                    Overland              50                 705
## 3                   Ruby Hill              54                3826
## 4                     Kennedy              40                1003
## 5                     Hampden              32                4365
## 6                       Baker               3                2075
##   PERCENT_OBESE_ COUNT_ADULTS_OBESE CONFIDENCE_INTERVAL95
## 1          34.73           781.4250        (32.61, 36.85)
## 2          31.66           223.2030        (28.03, 35.28)
## 3          37.49          1434.3674        (35.89, 39.09)
## 4          29.70           297.8910        (26.65, 32.75)
## 5          28.72          1253.6280         (27.3, 30.14)
## 6          25.65           532.2375        (23.67, 27.63)

Now I want to pull out the percentage of obese individuals (PERCENT_OBESE_) variable to make it easier to work with. I will call this variable 'pctobs'. The '$' operator pulls the 'PERCENT_OBESE_' column from my dataset 'adult'.

pctobs <- adult$PERCENT_OBESE_

Now I want to explore my data through the following statistics/plots:

-- Get the five-number summary (plus the mean) of PERCENT_OBESE_

summary(pctobs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   17.84   27.12   26.82   35.17   42.87
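Strictly speaking, summary() prints six values: the five-number summary plus the mean. As a small sketch, the vector 'pct_demo' below (a hypothetical name) holds the six PERCENT_OBESE_ values from the head() output above and compares summary() with base R's fivenum() and quantile(). Note that fivenum() uses Tukey's hinges, so its quartiles can differ slightly from the ones summary() reports.

```r
# Six PERCENT_OBESE_ values copied from the head() output (toy data only)
pct_demo <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)

summary(pct_demo)   # min, 1st quartile, median, mean, 3rd quartile, max
fivenum(pct_demo)   # min, lower hinge, median, upper hinge, max
quantile(pct_demo)  # 0%, 25%, 50%, 75%, 100% quantiles

# The spread of the middle half of the data (interquartile range)
IQR(pct_demo)
```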

-- Get the density plot for PERCENT_OBESE_

plot(density(pctobs), main = "Percentage of Adult Obesity")

-- Get the box plot for PERCENT_OBESE_

boxplot(pctobs, main="Box Plot for Percent Obese")

-- Get the histogram plot for PERCENT_OBESE_

hist(pctobs, main="Histogram for Percent Obese")

Identifying outliers

Let's look at another variable in my data set -- TOTALPOP_INREGISTRY. If I look at the five-number summary, notice the minimum is 0. Does it make sense that 0 people live in a neighborhood? No! This must be a data entry error, so I should remove that data point from the analysis.

summary(adult$TOTALPOP_INREGISTRY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1247    1968    2359    3198   11887

How do I remove the data entry that has a TOTALPOP_INREGISTRY of 0? I am going to subset the data set to keep only the rows that do NOT have a TOTALPOP_INREGISTRY of 0.

adult.cleaned <- adult[adult$TOTALPOP_INREGISTRY != 0, ]

Note that the minimum is no longer 0:

summary(adult.cleaned$TOTALPOP_INREGISTRY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     402    1261    2053    2390    3242   11887
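One thing to watch for: 'pctobs' was extracted from 'adult' before this cleaning step, so it still contains the value from the removed neighborhood. The following sketch shows the pattern on a toy data frame, including re-extracting the column from the cleaned data. The 'demo' data frame and its zero-population 'Placeholder' row are made up for illustration; the other two rows come from the head() output above.

```r
# Toy data: two real rows from the head() output plus one invented
# zero-population row, to mimic the problem described above.
demo <- data.frame(
  NEIGHBORHOOD_NAME   = c("Overland", "Baker", "Placeholder"),
  TOTALPOP_INREGISTRY = c(705, 2075, 0),
  PERCENT_OBESE_      = c(31.66, 25.65, 0)
)

# Keep only rows whose population is not 0 (same idea as adult.cleaned)
demo.cleaned <- demo[demo$TOTALPOP_INREGISTRY != 0, ]

# subset() is an equivalent, slightly more readable alternative
demo.cleaned2 <- subset(demo, TOTALPOP_INREGISTRY != 0)

# Re-extract the column so later statistics use the cleaned rows
pct_before <- demo$PERCENT_OBESE_
pct_after  <- demo.cleaned$PERCENT_OBESE_

nrow(demo)          # rows before cleaning
nrow(demo.cleaned)  # rows after cleaning
mean(pct_before)    # pulled down by the 0 in the invalid row
mean(pct_after)     # based on valid rows only
```

In this document, that would mean working with adult.cleaned$PERCENT_OBESE_ in the later steps. The outputs shown here were produced from the original 'pctobs'.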

Other Useful Functions for This Project

Find the sample mean of the percentage of obese adults in Denver neighborhoods.

mean(pctobs)

## [1] 26.82397

Find the sample standard deviation of the percentage of obese adults in Denver neighborhoods.

sd(pctobs)

## [1] 9.37425

What if you want to find a sample proportion with your data set? Let's say I want to find the proportion of Denver neighborhoods that have an obesity percentage over 30%. First, I need to find the number of successes (X), i.e. the number of neighborhoods that have an obesity percentage over 30.

success <- subset(pctobs, pctobs>30)
length(success)

## [1] 30

Therefore, there are 30 neighborhoods with an obesity percentage over 30. To get the total sample size (n), use the 'length' function on the full variable:

length(pctobs)

## [1] 78
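With X = 30 successes and n = 78 neighborhoods, the sample proportion is X / n. A minimal sketch of that last step is below. The counts come from the outputs above, while 'pct_demo' is a hypothetical toy vector (the six values from the head() output) used only to show a one-line shortcut.

```r
# Counts taken from the length() outputs above
x <- 30   # neighborhoods with PERCENT_OBESE_ over 30
n <- 78   # total number of neighborhoods
p_hat <- x / n
p_hat     # about 0.385

# Shortcut on a toy vector: the mean of a TRUE/FALSE vector is the
# proportion of TRUE values
pct_demo <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)
sum(pct_demo > 30)    # number of successes in the toy data
mean(pct_demo > 30)   # sample proportion in the toy data
```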

How to check for normality

If your sample size is under 30, then we need to check whether the population is approximately normal. We can do that using the 'qqPlot' function from the 'car' package:

library("car")

## Loading required package: carData

qqPlot(pctobs)

## [1] 41 77

Notice that quite a few points lie outside of the confidence band, so the percentage of obese adults does not look approximately normal.

qqPlot(adult$TOTALPOP_INREGISTRY)

## [1] 15 34

Notice here that only about 2-4 points lie outside the confidence band, so we can say that the total population for each neighborhood is approximately normal!
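If the 'car' package is not installed, base R offers a simpler normal Q-Q plot through qqnorm() and qqline(). This is a different tool from qqPlot: it draws a reference line but no confidence band. The sketch below uses the six toy values from the head() output ('pct_demo' is a hypothetical name), not the full data set, so it only illustrates the calls.

```r
# Six PERCENT_OBESE_ values copied from the head() output (toy data only)
pct_demo <- c(34.73, 31.66, 37.49, 29.70, 28.72, 25.65)

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(pct_demo, main = "Normal Q-Q Plot (toy data)")

# Reference line through the first and third quartiles; points close to
# this line are consistent with a normal distribution
qqline(pct_demo)
```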