Lab 1 - Introduction to Data

The following commands were used to load the library, start the lab, and get the “cdc” data frame. This data frame is used for the remainder of the problems.

library(DATA606)
startLab('Lab1')

source("/Library/Frameworks/R.framework/Versions/3.5/Resources/library/DATA606/labs/Lab1/more/cdc.R")

Question 1

I use ggplot2 as my plotting tool for scatterplots. For this problem, weight goes on the y-axis and desired weight (wtdesire) on the x-axis. Below is a scatterplot of weight vs. desired weight with a line of best fit through the data.

library(ggplot2)
gginit <- ggplot(cdc, aes(x=wtdesire,y=weight))
plottype <- geom_point(alpha=0.5)
plotstat <- geom_smooth(method=lm, se=FALSE)
plottheme <- theme_bw()
gginit + plottype + plotstat + plottheme + xlab("Desired Weight") + ylab("Weight")

The plot shows a strong, positive linear association between weight and desired weight. There are also a couple of outliers whose desired weight is much higher than anything else in the data.
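The strength of the association can also be put into a number with cor(). The sketch below is only an illustration: it uses the six rows printed by head(cdc) in this report rather than the full data frame, and the name toy is made up for the example, so the resulting coefficient says nothing reliable about the full data set.

```r
# Six rows copied from the head(cdc) output in this report.
# 'toy' is a hypothetical name used only for this illustration.
toy <- data.frame(
  weight   = c(175, 125, 105, 132, 150, 114),
  wtdesire = c(175, 115, 105, 124, 130, 114)
)

# Pearson correlation between current and desired weight
r <- cor(toy$weight, toy$wtdesire)
r
```

On the full data frame, the equivalent call would be cor(cdc$weight, cdc$wtdesire).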

Question 2

I decided to create the wdiff variable as another column in the data frame using the following command:

cdc$wdiff <- cdc$wtdesire - cdc$weight

This is now represented in the data frame:

head(cdc)

    genhlth exerany hlthplan smoke100 height weight wtdesire age gender
1      good       0        1        0     70    175      175  77      m
2      good       0        1        1     64    125      115  33      f
3      good       1        1        1     60    105      105  49      f
4      good       1        1        0     66    132      124  42      f
5 very good       0        1        0     61    150      130  55      f
6 very good       1        1        0     64    114      114  55      f
  wdiff
1     0
2   -10
3     0
4    -8
5   -20
6     0

Question 3

The wdiff variable is numerical and takes integer values, as shown in the previous table. This makes sense since it is the difference between two integer-valued variables.

The value of wdiff will tell us how a particular person feels about their current weight. This can be split into the following three categories:

wdiff = 0: This particular person is happy with their weight since the desired weight is equal to their current weight.
wdiff > 0: This particular person would like to gain weight since the desired weight is greater than the current weight.
wdiff < 0: This particular person would like to lose weight since the desired weight is less than the current weight.
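The three categories above can be sketched in code by taking the sign of wdiff. This is a minimal illustration using the six wdiff values shown by head(cdc); the category labels and the variable name feeling are my own choices, not part of the lab.

```r
# wdiff values for the six rows shown by head(cdc) in this report
wdiff <- c(0, -10, 0, -8, -20, 0)

# sign() returns -1, 0, or 1; map those to descriptive labels.
# The label text is an assumption made for this illustration.
feeling <- factor(sign(wdiff),
                  levels = c(-1, 0, 1),
                  labels = c("wants to lose", "content", "wants to gain"))

counts <- table(feeling)
counts
prop.table(counts)  # share of each category
```

Applied to the whole column (cdc$wdiff instead of the six values), the same table would give the share of respondents in each group.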

Question 4

A histogram is a good way to observe the distribution of the data. Again, I used ggplot to develop my histogram. Initially, I saw that there were some outliers so I reduced the weight difference range to only show -150 to 150. Below is the R code and resulting histogram for the weight difference:

gginit2 <- ggplot(cdc,aes(x=wdiff))
plottype2 <- geom_histogram(binwidth = 5,color='red',fill='pink',alpha=0.5)
plottheme2 <- theme_bw()
gginit2 + plottype2 + plottheme2 + xlim(-150,150) + xlab('Weight Difference') + ylab('Total')

In the histogram, the tallest bar is located at 0, so the most common response is a desired weight at or very near the current weight. That does not mean most people are happy with their weight, though: far more of the distribution lies to the left of 0 than to the right. In other words, there are many more people who want to lose weight than gain weight.

In addition to the histogram, I created a box plot and a summary table to further analyze the data. In the box plot, a large number of values are flagged as outliers. I think this is due to the very large number of 0 values, which keeps Q1 and Q3 close to the median. The IQR is therefore small, the whiskers are short, and many points fall beyond them. For my analysis, however, I wanted to focus on where the box was located. The R code and plot can be seen below:

gginit3 <- ggplot(cdc,aes(x="",y=wdiff))
plottype3 <- geom_boxplot()
plottheme3 <- theme_bw()
# coord_flip(ylim = ...) zooms without dropping rows, so the box statistics use all of the data
gginit3 + plottype3 + plottheme3 + coord_flip(ylim = c(-150, 150)) + xlab(' ') + ylab('Weight Difference')

The box runs from the first quartile (-21) to the third quartile (0), so it lies entirely at or below zero, and the median is -10. This means that at least half of the respondents would like to lose weight. Below is a summary table of the weight difference, which confirms that on average people want to lose weight rather than gain weight.

summary(cdc$wdiff)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-300.00  -21.00  -10.00  -14.59    0.00  500.00
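The number of points drawn as outliers in the box plot can be checked by hand. This is a minimal sketch, assuming geom_boxplot() keeps its default rule of flagging points that lie more than 1.5 times the IQR beyond the quartiles, and plugging in the quartiles from the summary table:

```r
# Quartiles taken from the summary(cdc$wdiff) output
q1 <- -21
q3 <- 0

iqr <- q3 - q1                    # interquartile range: 21
lower_fence <- q1 - 1.5 * iqr     # -52.5
upper_fence <- q3 + 1.5 * iqr     # 31.5

c(lower = lower_fence, upper = upper_fence)
```

Under that rule, any wdiff below -52.5 or above 31.5 is drawn as an individual point, which is a narrow window compared with the observed range of -300 to 500.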

Question 5

First, I conducted a numerical study to compare how men and women view their weight. I created two additional data frames, one containing all the rows where the gender is male and the other all the rows where the gender is female. I then ran a summary of the weight difference for each data frame and compared the results. The mean for women was more negative than that for men (-18.15 vs. -10.71), which means that on average women want to lose more weight than men. Below is the R code and summary of this data:

cdc.male <- subset(cdc, gender=='m')
cdc.female <- subset(cdc, gender=='f')
summary(cdc.male$wdiff)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-300.00  -20.00   -5.00  -10.71    0.00  500.00

summary(cdc.female$wdiff)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-300.00  -27.00  -10.00  -18.15    0.00   83.00
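The same comparison can be made without creating separate data frames by grouping with tapply(). The sketch below is an alternative, not the approach used above; it runs on the six rows shown by head(cdc), with toy as a made-up name, so its numbers differ from the full-data summaries.

```r
# Six rows copied from the head(cdc) output in this report.
# 'toy' is a hypothetical name used only for this illustration.
toy <- data.frame(
  gender = c("m", "f", "f", "f", "f", "f"),
  wdiff  = c(0, -10, 0, -8, -20, 0)
)

# Mean weight difference per gender, computed in one call
means <- tapply(toy$wdiff, toy$gender, mean)
means

# Min, quartiles, median, mean, and max for each gender
tapply(toy$wdiff, toy$gender, summary)
```

On the full data frame, the equivalent call would be tapply(cdc$wdiff, cdc$gender, summary).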

Below is a boxplot that reiterates what was stated above. The median weight difference is more negative for women, and the box for women is longer and extends further into the negative than the box for men. Again, this shows that on average women feel they need to lose more weight than men.

gginit4 <- ggplot(cdc,aes(x=factor(gender),y=wdiff))
plottype4 <- geom_boxplot(aes(fill=factor(gender)))
plottheme4 <- theme_bw()
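# coord_cartesian(ylim = ...) zooms without dropping rows before the box statistics are computed
gginit4 + plottype4 + plottheme4 + ylab('Weight Difference') + xlab('Gender') + coord_cartesian(ylim = c(-200, 200))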

Question 6

In order to solve this problem, I first needed to calculate the mean and standard deviation of the weights from the data. For the mean, I used the mean() function, and for the standard deviation, I took the square root of the variance. This can be seen below:

mean.weight <- mean(cdc$weight)
mean.weight

[1] 169.683

std.weight <- sqrt(var(cdc$weight))
std.weight

[1] 40.08097

As we can see from the output, the mean weight is 169.683 and the standard deviation is 40.08097. In order to calculate the proportion of people that fall within one standard deviation of the mean, we need to count the number of rows where the weight is between mean - standard deviation and mean + standard deviation. Then we need to divide this by the total number of rows. The R code for this can be seen below:

nrow(subset(cdc, weight > (mean.weight-std.weight) & weight < (mean.weight+std.weight)))/nrow(cdc)

[1] 0.7076

From this we can see that 70.76% of people have a weight within one standard deviation of the mean weight.
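The same proportion can be computed more compactly by taking the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below wraps this in a helper; within_one_sd is a hypothetical name of my own, and the demonstration uses only the six weights shown by head(cdc), so its result is not the 0.7076 reported above.

```r
# Proportion of values strictly within one standard deviation of the mean.
# 'within_one_sd' is a hypothetical helper name, not part of any package.
within_one_sd <- function(x) {
  m <- mean(x)
  s <- sd(x)                      # same value as sqrt(var(x))
  mean(x > m - s & x < m + s)     # mean of TRUE/FALSE = proportion TRUE
}

# Weights from the six rows shown by head(cdc) in this report
toy_weight <- c(175, 125, 105, 132, 150, 114)
prop <- within_one_sd(toy_weight)
prop
```

Calling within_one_sd(cdc$weight) on the full data frame should reproduce the proportion computed with nrow() and subset().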