Behavioral Risk Factor Surveillance System

Reading the data into the R session

iowa <- read.csv("http://www.hofroe.net/data/iowa-brfss-2012.csv")

3)

dim(iowa)
## [1] 7166  359

There are 7166 rows and 359 columns in this data set. This data set contains numeric, factor, logical and integer variables. Most of the variables are integers.
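
One way to check a statement about variable types is to tabulate the class of every column. The sketch below uses a small made-up data frame (`toy` is hypothetical and is not the BRFSS data); running the same two calls on `iowa` should give the actual counts per type.

```r
# Toy data frame standing in for `iowa` (hypothetical values)
toy <- data.frame(
  id    = 1:3,                      # integer
  score = c(1.5, 2.0, 3.25),        # numeric
  flag  = c(TRUE, FALSE, NA),       # logical
  group = factor(c("a", "b", "a"))  # factor
)

# Class of each column, then the number of columns per class
col_classes <- sapply(toy, class)
class_counts <- table(col_classes)
print(class_counts)
```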

library(ggplot2)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v dplyr   0.7.6
## v readr   1.1.1     v stringr 1.3.1
## v tibble  1.4.2     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
iowa$SEX<-factor(iowa$SEX, levels = c(1,2) ,labels  =c("Male", "Female"))
iowa %>% 
  ggplot(aes(x= HEIGHT3, y = WEIGHT2))+
  geom_point()+
  facet_wrap(~SEX, scales = "free")

There appears to be a positive, roughly linear relationship between height and weight for both sexes, but the data contain many nonsensical points that stretch the axes and make the relationship difficult to assess visually.

iowa%>% 
  filter(HEIGHT3< 2500 & WEIGHT2<2500 )%>% 
  ggplot(aes(x= HEIGHT3, y = WEIGHT2))+
  geom_point()+
  facet_grid(~SEX)

This plot describes the data better than the first one; however, some high-value outliers still seem to affect it. Most of the points lie around values of 400-600 for both sexes.

6)

Creating a new variable “feet”, which holds the hundreds and thousands digits of the HEIGHT3 variable (HEIGHT3 integer-divided by 100). Values of 77 or above are replaced by NA.

feet<-iowa$HEIGHT3
feet<-feet %/% 100
feet[feet>=77]<- NA
iowa$feet<-feet
sum(is.na(feet))
## [1] 94

There are 94 missing values at this point in the data.
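
To make the decoding step concrete, here is a minimal sketch on a handful of made-up HEIGHT3-style codes. It assumes the usual coding in which 506 stands for 5 feet 6 inches, while values such as 7777, 9999 and the 9000-9998 range are special or metric codes; the vector `codes` is hypothetical.

```r
# Made-up HEIGHT3-style codes (not taken from the data set)
codes <- c(506, 511, 7777, 9165, 9999)

feet <- codes %/% 100    # integer division keeps the leading digits: 5 5 77 91 99
feet[feet >= 77] <- NA   # 77, 91 and 99 are not plausible values for feet

print(feet)
```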

7)

Introducing a new variable “inch” to the data set.

inch<-iowa$HEIGHT3 %% 100
iowa$inch<-inch

As with feet, we replace values of 77 or above in inch by NA.

inch[inch>=77]<- NA
sum(is.na(inch))
## [1] 75

There are 75 missing values in inch but 94 in feet. The difference comes from HEIGHT3 values that were recorded in centimeters: their last two digits can be below 77, so they are not flagged in inch. To fix this, we set inch to NA wherever feet is NA.

feet.na<-which(is.na(feet))
iowa[feet.na, "inch"]<-NA
sum(is.na(iowa$inch))
## [1] 94
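
The gap between the two NA counts can be reproduced on a few made-up codes. This is only an illustrative sketch (the vector `codes` is hypothetical): a metric entry such as 9165 has the remainder 65, which passes the `>= 77` check on inch even though its feet value is flagged.

```r
# Made-up HEIGHT3-style codes; 9165 plays the role of a metric entry
codes <- c(506, 511, 7777, 9165, 9999)

feet <- codes %/% 100
inch <- codes %% 100
feet[feet >= 77] <- NA
inch[inch >= 77] <- NA   # 9165 %% 100 is 65, so it is not flagged here

n_na_feet <- sum(is.na(feet))   # three flagged values
n_na_inch <- sum(is.na(inch))   # only two flagged values

# Propagate the NAs from feet so that both variables agree
inch[is.na(feet)] <- NA
print(c(feet = n_na_feet, inch_before = n_na_inch, inch_after = sum(is.na(inch))))
```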

8)

Define a variable height that converts feet and inch to meters. For example, a value of 1.57 represents someone who is 1 meter and 57 cm tall.

height<-(iowa$inch*0.0254)+(iowa$feet*0.3048)
head(height)
## [1] 1.5748 1.4986 1.6764 1.4986 1.7780 1.8288
iowa$height<-height
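
The conversion uses 1 foot = 0.3048 m and 1 inch = 0.0254 m. As a small worked check, the sketch below wraps the formula in a helper (`to_meters` is a hypothetical name, not part of the original analysis) and applies it to made-up inputs; 5 feet 2 inches comes out as 1.5748 m.

```r
# Hypothetical helper: feet and inches to meters
to_meters <- function(feet, inch) {
  feet * 0.3048 + inch * 0.0254
}

# Made-up inputs: 5 ft 2 in, 5 ft 10 in, and a missing value
h <- to_meters(feet = c(5, 5, NA), inch = c(2, 10, NA))
print(h)
```

Missing values propagate through the arithmetic, so a respondent with NA in feet or inch also gets NA in the result.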

9)

iowa%>% ggplot(aes(x = height))+
  geom_histogram(binwidth = 0.03)+
  facet_wrap(~SEX, nrow = 2)
## Warning: Removed 94 rows containing non-finite values (stat_bin).

As the histograms show, men are taller than women on average. The average height of men is also higher than the height of the majority of women.
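
A visual impression like this can be backed up with group means. The sketch below uses a made-up data frame (`toy` and its values are hypothetical); applying the same `tapply()` call to `iowa$height` and `iowa$SEX` would give the figures for the actual data.

```r
# Made-up heights in meters, one missing value
toy <- data.frame(
  SEX    = factor(c("Male", "Male", "Female", "Female", "Female")),
  height = c(1.80, 1.76, 1.62, NA, 1.66)
)

# Mean height per sex, ignoring missing values
mean_by_sex <- tapply(toy$height, toy$SEX, mean, na.rm = TRUE)
print(mean_by_sex)
```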

Extra credit: recovering the metric measurements for the height variable.

# HEIGHT3 values from 9000 to 9998 hold a metric height: 9000 plus centimeters
new.metric <- which(iowa$HEIGHT3 >= 9000 & iowa$HEIGHT3 <= 9998)
height[new.metric] <- (iowa$HEIGHT3[new.metric] %% 9000) / 100
sum(is.na(height))
## [1] 73

As we can see, the number of NAs has dropped from 94 to 73.
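
The metric decoding can be checked on a few made-up codes. This is a sketch under the assumption that a value in the 9000-9998 range is 9000 plus the height in centimeters; the vector `codes` is hypothetical.

```r
# Made-up HEIGHT3-style codes: one feet/inches entry, two metric entries, one special code
codes <- c(506, 9165, 9180, 9999)

is_metric <- codes >= 9000 & codes <= 9998

# Start with NA everywhere and fill in only the metric entries, in meters
height_m <- rep(NA_real_, length(codes))
height_m[is_metric] <- (codes[is_metric] %% 9000) / 100

print(height_m)
```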