Lab 1 - Introduction to data

Author: Romerl Elizes

Let’s load the data.

source("more/cdc.R")

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

There are 20,000 cases in this data set. There are 9 variables in this data set.

The data types of each variable are:

genhlth - Categorical - Ordinal

exerany - Categorical (binary, coded 0/1)

hlthplan - Categorical (binary, coded 0/1)

smoke100 - Categorical (binary, coded 0/1)

height - Numerical - Continuous

weight - Numerical - Continuous

wtdesire - Numerical - Continuous

age - Numerical - Continuous

gender - Categorical

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

a. Height summary

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

b. Age summary

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

c. Height interquartile range

70 - 64

## [1] 6

d. Age interquartile range

57 - 31

## [1] 26
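
The two subtractions above can also be done with R's built-in IQR() function, which uses the same default quartile definition as summary(). A minimal sketch on a small made-up vector (heights_demo is hypothetical, not taken from cdc):

```r
# Hypothetical heights in inches; not the cdc data.
heights_demo <- c(60, 64, 67, 70, 75)

# First and third quartiles, computed the same way summary() computes them.
q <- quantile(heights_demo, probs = c(0.25, 0.75))

# Manual interquartile range: 3rd quartile minus 1st quartile.
iqr_manual <- unname(q[2] - q[1])

# IQR() performs the same subtraction in one call.
iqr_builtin <- IQR(heights_demo)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

On the real data the equivalent calls would be IQR(cdc$height) and IQR(cdc$age).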

e. Relative frequency distribution for gender

table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

f. Relative frequency distribution for exerany

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457
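
Dividing by a typed-in 20000 works here, but it breaks silently if the number of rows changes. As a sketch, the same relative frequencies can be obtained by dividing by the computed length or by calling prop.table(). The vector exer_demo below is hypothetical, not the cdc data:

```r
# Hypothetical 0/1 responses; not the cdc data.
exer_demo <- c(1, 1, 0, 1, 0, 1, 1, 1)

counts <- table(exer_demo)

# Divide by the number of observations instead of a hard-coded total.
rel_manual <- counts / length(exer_demo)

# prop.table() divides each count by the table total.
rel_builtin <- prop.table(counts)

print(rel_builtin)
```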

g. Number of males in sample

table(cdc$gender)

## 
##     m     f 
##  9569 10431

There are 9,569 males in the sample.

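h. Proportion of the sample reporting excellent health

The exerany table above does not answer this question: it shows that 74.57% of respondents reported exercising, not how they rated their health. The proportion in excellent health is the "excellent" entry of table(cdc$genhlth)/20000.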
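
A proportion for a single category, such as the share of respondents reporting excellent health, can also be computed directly as the mean of a logical vector. A minimal sketch, assuming a made-up factor genhlth_demo with the same five levels as genhlth (the values are hypothetical):

```r
# Hypothetical general-health responses; not the cdc data.
health_levels <- c("excellent", "very good", "good", "fair", "poor")
genhlth_demo <- factor(
  c("excellent", "good", "very good", "excellent",
    "fair", "good", "poor", "very good"),
  levels = health_levels
)

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUEs.
prop_excellent <- mean(genhlth_demo == "excellent")

print(prop_excellent)
```

On the real data this would be mean(cdc$genhlth == "excellent").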

What does the mosaic plot reveal about smoking habits and gender?

A larger proportion of males smoke compared to females. A larger proportion of females do not smoke compared to males.

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

Respondents who are under 23 and have smoked at least 100 cigarettes in their lifetime:

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
dim(under23_and_smoke)

## [1] 620   9

The new object contains 620 rows, one for each respondent who is under 23 and has smoked at least 100 cigarettes in their lifetime.
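
The same filter can be written with logical indexing instead of subset(). The sketch below uses a small hypothetical data frame, people_demo, with the same two column names as cdc, and checks that both forms select the same rows:

```r
# Hypothetical respondents; not the cdc data.
people_demo <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# subset() evaluates the condition inside the data frame.
by_subset <- subset(people_demo, age < 23 & smoke100 == 1)

# Logical indexing spells out the data frame name for each column.
keep <- people_demo$age < 23 & people_demo$smoke100 == 1
by_index <- people_demo[keep, ]

# nrow() gives the number of matching respondents directly.
print(nrow(by_subset))
```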

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The box plot shows that median BMI rises as self-reported general health declines. The median BMI is lowest for respondents reporting excellent health, higher for very good health, higher again for good health and for fair health, and only slightly higher for poor health than for fair health. The 1st and 3rd quartiles of BMI follow the same pattern across the general health categories.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)

For the next part of the question, I chose the exerany variable, since exercise could plausibly be related to body weight. The median BMI for those who report exercising is lower than for those who report not exercising. The 1st and 3rd quartiles of BMI are also lower for those who exercise.
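
As a worked example of the BMI formula used above (weight in pounds, height in inches, scaled by 703), the sketch below wraps it in a helper function. The name bmi_imperial is hypothetical; the input values are those of the first respondent shown in the head() output later in this lab:

```r
# Hypothetical helper name; the formula matches the one used for the box plot.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# First respondent: 175 lbs. and 70 inches.
bmi_first <- bmi_imperial(weight_lb = 175, height_in = 70)

print(round(bmi_first, 2))
```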

On Your Own

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(
  x = cdc$weight,
  y = cdc$wtdesire,
  xlab = "Weight",
  ylab = "Desired Weight"
)

Desired weight tends to increase with current weight. The large black mass of points in the lower left of the plot suggests that the majority of respondents wanted to lose some weight.

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

cdctmp <- cdc
cdctmp$wdiff <- cdctmp$weight - cdctmp$wtdesire
head(cdctmp,10)

##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 2       good       0        1        1     64    125      115  33      f
## 3       good       1        1        1     60    105      105  49      f
## 4       good       1        1        0     66    132      124  42      f
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 9       good       0        1        1     65    150      130  27      f
## 10      good       1        1        0     70    180      170  44      m
##    wdiff
## 1      0
## 2     10
## 3      0
## 4      8
## 5     20
## 6      0
## 7      9
## 8     10
## 9     20
## 10    10

To preserve the original cdc data frame, I created cdctmp as a copy of cdc. I then added wdiff, defined as current weight minus desired weight (weight - wtdesire), so a positive value means the respondent weighs more than they would like.

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

wdiff is a numerical continuous variable. If wdiff is 0, the person's weight and desired weight are the same, so the person does not want to lose or gain weight. If wdiff is positive, the person's desired weight is lower than their current weight. If wdiff is negative, the person's desired weight is higher than their current weight.
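
The sign convention can be made explicit by labelling each value of wdiff. This is a minimal sketch on three hypothetical respondents (the weights are made up), with wdiff computed as weight minus wtdesire as in this lab:

```r
# Hypothetical current and desired weights in pounds; not the cdc data.
weight   <- c(175, 125, 140)
wtdesire <- c(175, 115, 150)

# Same definition as above: current weight minus desired weight.
wdiff <- weight - wtdesire

# Positive: wants to lose. Negative: wants to gain. Zero: no change wanted.
goal <- ifelse(wdiff > 0, "wants to lose",
               ifelse(wdiff < 0, "wants to gain", "no change"))

print(data.frame(weight, wtdesire, wdiff, goal))
```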

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

hist(cdctmp$wdiff)

hist(cdctmp$wdiff, 
     border = "blue",
     col = "green",
     xlim = c(-100, 200),
     breaks = 25)

Histogram 1 above: Most people in the table want to lose up to 50 lbs., based on the tall bar to the right of 0, which holds close to 12,000 respondents. The bar just to the left of 0 holds approximately 7,500 respondents, but hist() uses right-closed bins by default, so that bar also includes everyone whose wdiff is exactly 0; only part of it represents people who want to gain up to 50 lbs. With bins this wide, the histogram shows little beyond a single peak around 0.

Histogram 2 above: I added color to the histogram, limited the x range to -100 to 200 lbs., and requested more breaks. This shows the shape of wdiff in more detail: a single peak near 0 with a longer tail to the right. The findings are similar to histogram 1, but the people who want to lose even more weight are easier to see. Around 1,500 people want to lose between 50 and 100 lbs. A little fewer than 500 people want to lose between 100 and 150 lbs. A much smaller number of people want to lose between 150 and 200 lbs. At the opposite end of the spectrum, a few people want to gain between 50 and 100 lbs. The majority of people are not satisfied with their weight.

ref: https://www.datacamp.com/community/tutorials/make-histogram-basic-r
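
When reading counts off a histogram, it matters which bar holds the values that are exactly 0. By default hist() uses right-closed bins, so a value of 0 is counted in the bar ending at 0, not the bar starting at 0. The sketch below shows this on a hypothetical vector, wdiff_demo, and compares it with right = FALSE:

```r
# Hypothetical weight differences; not the cdc data.
wdiff_demo <- c(-10, 0, 0, 0, 10, 20)
bin_edges  <- c(-50, 0, 50)

# Default (right = TRUE): bins are [-50, 0] and (0, 50].
h_right <- hist(wdiff_demo, breaks = bin_edges, plot = FALSE)

# right = FALSE: bins are [-50, 0) and [0, 50].
h_left <- hist(wdiff_demo, breaks = bin_edges, right = FALSE, plot = FALSE)

print(h_right$counts)  # zeros land in the first bin
print(h_left$counts)   # zeros land in the second bin
```

Counting the zeros separately, for example with sum(cdctmp$wdiff == 0), avoids the ambiguity altogether.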

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

library(FSA)

## ## FSA v0.8.17. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.

Summarize(wdiff ~ gender, data = cdctmp)

##   gender     n     mean       sd  min Q1 median Q3 max percZero
## 1      m  9569 10.70613 23.49262 -500  0      5 20 300 32.91880
## 2      f 10431 18.15118 23.99713  -83  0     10 27 300 23.64107

library(psych)

## 
## Attaching package: 'psych'

## The following object is masked from 'package:FSA':
## 
##     headtail

describeBy(cdctmp$wdiff, group = cdctmp$gender)

## 
##  Descriptive statistics by group 
## group: m
##    vars    n  mean    sd median trimmed  mad  min max range skew kurtosis
## X1    1 9569 10.71 23.49      5    8.41 7.41 -500 300   800  0.6    37.44
##      se
## X1 0.24
## -------------------------------------------------------- 
## group: f
##    vars     n  mean sd median trimmed   mad min max range skew kurtosis
## X1    1 10431 18.15 24     10    14.2 14.83 -83 300   383 2.26     9.22
##      se
## X1 0.23

boxplot(cdctmp$wdiff ~ cdctmp$gender)

Using the Summarize function from package FSA and the describeBy function from package psych, I summarized wdiff by gender. The mean weight loss goal (wdiff) is 10.71 lbs. for men and 18.15 lbs. for women, and the median is 5 lbs. for men and 10 lbs. for women. In addition, 32.9% of men report no difference between their current and desired weight, compared with 23.6% of women. These summaries suggest that women tend to want to lose more weight than men. The Summarize function was extremely helpful because it lists the Q1, median, and Q3 values for both men and women, which were very difficult to read from the box plot because of the many points plotted on it.

Ref: http://rcompanion.org/handbook/C_02.html
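
Grouped summaries like these can also be produced in base R with tapply(), without loading extra packages. A minimal sketch on a hypothetical data frame, wdiff_by_gender_demo (the values are made up):

```r
# Hypothetical data; not the cdc data.
wdiff_by_gender_demo <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff  = c(0, 5, 10, 0, 10, 20)
)

# Median of wdiff within each gender.
medians <- tapply(wdiff_by_gender_demo$wdiff,
                  wdiff_by_gender_demo$gender, median)

# Percentage of respondents in each gender whose wdiff is exactly 0.
pct_zero <- tapply(wdiff_by_gender_demo$wdiff == 0,
                   wdiff_by_gender_demo$gender, mean) * 100

print(medians)
print(round(pct_zero, 1))
```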

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

mean(cdc$weight)

## [1] 169.683

Using the mean function, we find the mean for weight as 169.68.

sd(cdc$weight)

## [1] 40.08097

Using the sd function, we find the standard deviation for weight as 40.08.

total = length(cdc$weight)
total

## [1] 20000

Using the length function, we find total observations to be 20,000.

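# Keep weights strictly between mean - sd and mean + sd.
weight_within_one_sd_of_mean = subset(cdc, weight > mean(weight) - sd(weight) & weight < mean(weight) + sd(weight))
subtotal = length(weight_within_one_sd_of_mean$weight)
proportion_weight = subtotal / total * 100
proportion_weight

Using the subset and length functions and a percentage calculation, we find the proportion of weights within one standard deviation of the mean, that is, between about 129.60 and 209.76 lbs. Note that bounds of mean(weight) - 1 and mean(weight) + 1 give 4.75%, which is the share of weights within one pound of the mean rather than within one standard deviation.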

ref: https://stat.ethz.ch/pipermail/r-help/2012-February/302515.html
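
A compact way to express "within one standard deviation of the mean" is abs(x - mean(x)) < sd(x), and the mean of that logical vector is the proportion. A minimal sketch on a hypothetical vector, weights_demo (the values are made up, not taken from cdc):

```r
# Hypothetical weights in pounds; not the cdc data.
weights_demo <- c(100, 150, 160, 170, 180, 190, 250)

m <- mean(weights_demo)
s <- sd(weights_demo)

# TRUE for weights whose distance from the mean is less than one sd.
within_one_sd <- abs(weights_demo - m) < s

# Proportion of TRUE values, as a fraction of all observations.
prop_within <- mean(within_one_sd)

print(prop_within)
```

On the real data this would be mean(abs(cdc$weight - mean(cdc$weight)) < sd(cdc$weight)).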