Introduction

The purpose of this lab is to learn how to manipulate, subset, process, and visualize raw data using the RStudio IDE. R's built-in functions serve as the tools, and the Behavioral Risk Factor Surveillance System (BRFSS) survey serves as the dataset.

Materials

Results

Loading dataset and required libraries

#library(IS606)
#startLab("Lab1")
source("more/cdc.R")

Q1

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire)

The spread of the points shows that the relationship is not consistently linear: the pattern changes as weight increases. For weights between 200 and 500 pounds, the desired weight clearly shifts toward weight loss. The plot may still exhibit a strong overall correlation, because most respondents fall in the dense cluster of points below 200 pounds, where people are largely satisfied with their weight, but that relationship does not hold for overweight respondents. To get a clearer picture of the data, we should fit a curve through these points and compute the correlation coefficient.
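The curve and correlation coefficient mentioned above could be obtained along the following lines. This is only a sketch: the `toy` data frame and its values are made up so the example runs on its own, and `lowess()` is one possible smoother, not necessarily the one the lab intends. With the real data, `cdc$weight` and `cdc$wtdesire` would take the place of the toy columns.

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  weight   = c(120, 150, 175, 200, 250, 300),
  wtdesire = c(120, 145, 165, 180, 200, 220)
)

# Pearson correlation between current and desired weight
r <- cor(toy$weight, toy$wtdesire)

# Scatterplot with a smoothed trend and a reference line
plot(toy$weight, toy$wtdesire, xlab = "weight", ylab = "wtdesire")
lines(lowess(toy$weight, toy$wtdesire), col = "blue")  # smoothed curve through the points
abline(0, 1, lty = 2)                                  # desired weight equals current weight
r
```

Points below the dashed line correspond to people who want to weigh less than they do now, so a smoothed curve that bends away from that line at higher weights would match the pattern described above.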

Q2

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- cdc$wtdesire-cdc$weight

Q3

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

typeof(wdiff)
## [1] "integer"

The data type is integer, so wdiff is a discrete numerical variable.

If the observation is zero, the person is satisfied with his/her current weight.

If the difference is positive, the person wants to gain weight; if the difference is negative, the person wants to lose weight.
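The three cases can be counted directly from the sign of the difference. The sketch below uses a small made-up vector, `wdiff_toy`, so that it runs without the dataset; the category labels are also just illustrative names. With the lab data, `wdiff` would be used instead.

```r
# Hypothetical differences (desired minus current weight), for illustration only
wdiff_toy <- c(-20L, 0L, 15L, -5L, 0L)

# sign() returns -1, 0 or 1; map each value to a readable label
feeling <- factor(sign(wdiff_toy),
                  levels = c(-1, 0, 1),
                  labels = c("wants to lose", "satisfied", "wants to gain"))

# Count how many observations fall in each category
counts <- table(feeling)
counts
```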

Q4

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

hist(wdiff, breaks=25,
     main="Density plot",
     col="red", 
     xlim=c(-300,500),
     prob = TRUE)

The histogram shows that the density is highest at the center, where most participants fall. From this region we can conclude that most participants are in good shape and close to their desired weight, with a tendency toward wanting to lose weight. The distribution is skewed to the left, which means that the participants who want to lose weight outnumber those on the right, who want to gain weight.
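Center and spread can also be stated numerically to back up the reading of the histogram. The helper below, `describe_center_spread`, is a hypothetical name introduced for this sketch; it only wraps standard R functions and is demonstrated on a made-up vector. In the lab it would be called as `describe_center_spread(wdiff)`.

```r
# Hypothetical helper: collect common measures of center and spread
describe_center_spread <- function(x) {
  c(median = median(x),  # robust center
    mean   = mean(x),    # center, pulled toward the longer tail
    IQR    = IQR(x),     # robust spread (upper quartile minus lower quartile)
    sd     = sd(x))      # spread, sensitive to outliers
}

# Made-up values for illustration only
x_toy <- c(-20, -10, 0, 0, 10)
stats_toy <- describe_center_spread(x_toy)
stats_toy
```

A mean noticeably below the median is a numerical hint of left skew, which is what the histogram suggests for wdiff.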

Q5

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

A <- boxplot(wdiff ~ cdc$gender,
        main = "Weight Difference vs. Gender",
        horizontal = TRUE,
        # with horizontal = TRUE the wdiff values run along the x-axis
        xlab = "wdiff",
        ylab = "Gender")

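# A$stats holds, per group: lower whisker, lower hinge, median, upper hinge, upper whisker.
# 'min' and 'max' below are therefore the whisker ends; outliers lie beyond them.
mytable <- A$stats
colnames(mytable) <- A$names
rownames(mytable) <- c('min','lower quartile','median','upper quartile','max')
mytable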
##                  m   f
## min            -50 -67
## lower quartile -20 -27
## median          -5 -10
## upper quartile   0   0
## max             30  38
## attr(,"class")
##         m 
## "integer"

The median for males (-5) is higher than the median for females (-10), which indicates that males tend to desire a higher weight than females; this is also supported by the greater number of positive outliers in the men's boxplot. The distance between the lower and upper quartiles is smaller for men (-20 to 0) than for women (-27 to 0), which shows that men tend to be more satisfied with their own weight than women.

The summary of the plot shows that males have a higher median than females and also a higher lower quartile. Males and females have the same upper quartile of 0. These results match our observations and support the conclusion that males tend to be more satisfied with their weight than females, and that females tend to desire weight loss more than males.
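The per-gender numerical summaries can also be computed directly instead of being read off the boxplot object. The following is a minimal sketch on a made-up data frame (`toy`, with invented values); with the lab data, the equivalent call would be `tapply(wdiff, cdc$gender, summary)`.

```r
# Hypothetical data for illustration only
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff  = c(-10, 0, 5, -30, -10, 0)
)

# Median weight difference per gender
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
med_by_gender

# Full five-number summary plus mean, per gender
tapply(toy$wdiff, toy$gender, summary)
```

Unlike the whisker ends stored in the boxplot object, `summary()` reports the true minimum and maximum, outliers included.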

Q6

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

Get the mean and the standard deviation

meanofweight <- mean(cdc$weight)
sdofweigth <- sd(cdc$weight)
meanofweight
## [1] 169.683
sdofweigth
## [1] 40.08097

Subset the weights within one standard deviation of the mean

weight_sdofmean <- subset(cdc, weight > (meanofweight - sdofweigth) & weight < (meanofweight + sdofweigth))

Find the proportion by dividing the number of entries within one standard deviation of the mean by the total number of entries in the dataset.

nrow(weight_sdofmean)/nrow(cdc)
## [1] 0.7076

The proportion of weights within one standard deviation of the mean is 70.76%.
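The same proportion can be obtained without building a subset, because the mean of a logical vector is the fraction of TRUE values. The function name `prop_within_sd` and the sample vector below are made up for this sketch; in the lab it would be called as `prop_within_sd(cdc$weight)`.

```r
# Hypothetical helper: share of values strictly within k standard deviations of the mean
prop_within_sd <- function(x, k = 1) {
  mean(abs(x - mean(x)) < k * sd(x))
}

# Made-up weights for illustration only
w_toy <- c(100, 150, 160, 170, 180, 190, 250)
p_toy <- prop_within_sd(w_toy)
p_toy
```

The strict inequality mirrors the `>` and `<` comparisons used in the subset step above.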