Introduction to Data Lab

source("C:/Users/Georgia/Documents/Lab1/more/cdc.R")
head(cdc)

##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

Exercise 1

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

dim(cdc)

## [1] 20000     9

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

There are 20,000 cases and 9 variables. Four are discrete (height, weight, wtdesire and age) and the remaining five are categorical. Of the categorical variables, genhlth is ordinal.
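As a rough check on the variable types, the storage class and the number of distinct values of each column can be inspected. The sketch below uses a small toy data frame with made-up values (not the real cdc data), and the names toy, col_classes and n_distinct are illustrative only.

```r
# Toy data frame (hypothetical values) mimicking three cdc columns
toy <- data.frame(
  genhlth = factor(c("good", "very good", "excellent"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60)
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column; a 0/1 column is numeric in storage
# but categorical in meaning
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)
```

The storage class alone does not settle the question: an indicator such as exerany is stored as a number but is still a categorical variable.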

Exercise 2

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

# Numerical summaries for Height and Age
summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

# Table of numerical summaries
summ_table = matrix(c(mean(cdc$height),
                      var(cdc$height),
                      median(cdc$height),
                      IQR(cdc$height),   # 3rd quartile minus 1st quartile
                      mean(cdc$age),
                      var(cdc$age),
                      median(cdc$age),
                      IQR(cdc$age)
                      ), ncol = 2)

colnames(summ_table) = c("Height", "Age")
rownames(summ_table) = c( "Mean", "Var", "Median", "IQ range")
as.table(summ_table)

##             Height       Age
## Mean      67.18290  45.06825
## Var       17.02350 295.58857
## Median    67.00000  43.00000
## IQ range   6.00000  26.00000

# Relative frequencies for Gender and Exerany
table(cdc$gender)/20000

## 
##       m       f 
## 0.47845 0.52155

table(cdc$exerany)/20000

## 
##      0      1 
## 0.2543 0.7457

# Number of males in the sample
table(cdc$gender)

## 
##     m     f 
##  9569 10431

# Proportion of sample claiming excellent health
table(cdc$genhlth)/20000

## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385

There are 9569 males in the sample, and a proportion of 0.23285 (about 23.3%) report being in excellent health.
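Relative frequencies can also be computed without typing the sample size by hand, which avoids mistakes in the denominator. A minimal sketch with prop.table(), using a toy vector of made-up values in place of cdc$gender:

```r
# Toy vector (hypothetical values) standing in for cdc$gender
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts   <- table(gender)        # absolute frequencies
rel_freq <- prop.table(counts)   # divides each count by sum(counts)
print(rel_freq)
```

Because prop.table() divides by the total of the table itself, the proportions always sum to 1.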

Exercise 3

What does the mosaic plot reveal about smoking habits and gender?

mosaicplot(table(cdc$gender,cdc$smoke100))

The mosaic plot shows that a larger proportion of males than females report having smoked at least 100 cigarettes in their lifetime.
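The comparison in a mosaic plot can be backed up with row proportions, i.e. the share of smokers within each gender. The sketch below uses made-up toy vectors (not the cdc data), and tab and row_props are illustrative names.

```r
# Toy data (hypothetical values): 4 males, 6 females
gender   <- c("m", "m", "m", "m", "f", "f", "f", "f", "f", "f")
smoke100 <- c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0)

tab <- table(gender, smoke100)

# margin = 1 makes each row (each gender) sum to 1
row_props <- prop.table(tab, margin = 1)
print(row_props)
```

Comparing the "1" column across rows gives the proportion of each gender that has smoked at least 100 cigarettes, which is what the relative tile heights in the mosaic plot display.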

Exercise 4

Create a new object (under23_and_smoke) that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime.

under23_and_smoke = subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)
head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
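Inside subset() the column names can be used directly, and logical indexing is an equivalent alternative. A small sketch on a toy data frame with made-up values (toy, by_subset and by_index are illustrative names):

```r
# Toy data frame (hypothetical values)
toy <- data.frame(age      = c(21, 35, 18, 22, 40),
                  smoke100 = c(1, 1, 0, 1, 0))

# subset() evaluates the condition inside the data frame
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent logical indexing; note the trailing comma to keep all columns
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(by_subset)
```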

Exercise 5

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

bmi = (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$hlthplan)

The box plot of BMI by general health (genhlth) referred to in the question shows that the poorer one’s general health, the higher one’s BMI tends to be. The categorical variable I chose to compare with BMI was hlthplan, because it would be interesting to see whether lacking health coverage motivates people to maintain a lower BMI in order to avoid illness. Interestingly, the distribution of BMI does not appear to differ according to whether an individual has some form of health coverage. The box plot for subjects with coverage has more outliers, but the medians and interquartile ranges look very similar.
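To make the BMI formula concrete, the sketch below applies it to the six rows printed by head(cdc) above and computes the median BMI per genhlth level with tapply(). Six rows are far too few to draw conclusions; the point is only to show the arithmetic and the grouping step.

```r
# Values copied from the six rows shown by head(cdc)
height  <- c(70, 64, 60, 66, 61, 64)        # inches
weight  <- c(175, 125, 105, 132, 150, 114)  # pounds
genhlth <- c("good", "good", "good", "good", "very good", "very good")

# BMI from pounds and inches: weight / height^2 * 703
bmi <- (weight / height^2) * 703
print(round(bmi, 2))

# Median BMI within each general-health category
med_by_health <- tapply(bmi, genhlth, median)
print(med_by_health)
```

The medians computed this way correspond to the thick centre lines of a side-by-side box plot.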

ON YOUR OWN

1. Scatterplot of weight vs desired weight. Describe the relationship between these two variables.

plot(cdc$weight ~ cdc$wtdesire)

There appears to be a positive (potentially linear) relationship between the two variables.
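The strength of a linear relationship can be quantified with the correlation coefficient. A minimal sketch using only the six rows printed by head(cdc) above (so the value is illustrative, not the correlation for the full data set):

```r
# Values copied from the six rows shown by head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

# Pearson correlation; values near +1 indicate a strong positive
# linear relationship
r <- cor(wtdesire, weight)
print(r)
```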

2. New variable: the difference between desired weight (wtdesire) and current weight (weight).

Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff = cdc$wtdesire - cdc$weight

3. What type of data is wdiff?

If an observation of wdiff is 0, what does this mean about the person’s weight and desired weight?

What if wdiff is positive or negative?

wdiff is a discrete variable. If wdiff is 0, the subject is already at their desired weight. If wdiff is negative, the individual must lose weight to reach their desired weight; if wdiff is positive, the individual must gain weight instead.
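The sign interpretation can be turned into a label for each respondent. The sketch below uses the six rows printed by head(cdc) above; the label names ("lose", "gain", "at goal") and the variable goal are my own choices for illustration.

```r
# Values copied from the six rows shown by head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)

wdiff <- wtdesire - weight

# Negative: wants to lose weight; positive: wants to gain; zero: at goal
goal <- ifelse(wdiff < 0, "lose",
               ifelse(wdiff > 0, "gain", "at goal"))
print(table(goal))
```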

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use.

What does this tell us about how people feel about their current weight?

summary(wdiff)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00

hist(wdiff, breaks = 100, xlim = c(-200, 200))

boxplot(wdiff)

boxplot(wdiff, outline = FALSE)

The summary statistics show that, on average, subjects wanted to lose approximately 15 pounds (mean -14.59, median -10). The histogram is mostly left-skewed, showing a stronger tendency for subjects to want to lose weight than to gain it. In the boxplot that includes outliers (first boxplot), the extreme values compress the box, so the typical subject appears only slightly more likely to want to lose weight. With outliers hidden (second boxplot), it is clear that the typical subject is far more likely to want to lose weight than to gain it: the median is below zero and the interquartile range runs from -21 to 0.

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

wdiff_gen = data.frame(cdc$gender,wdiff)
# summaries
diff_fem = subset(wdiff_gen, cdc$gender == "f")
diff_male = subset(wdiff_gen, cdc$gender == "m")
summary(diff_fem)

##  cdc.gender     wdiff        
##  m:    0    Min.   :-300.00  
##  f:10431    1st Qu.: -27.00  
##             Median : -10.00  
##             Mean   : -18.15  
##             3rd Qu.:   0.00  
##             Max.   :  83.00

summary(diff_male)

##  cdc.gender     wdiff        
##  m:9569     Min.   :-300.00  
##  f:   0     1st Qu.: -20.00  
##             Median :  -5.00  
##             Mean   : -10.71  
##             3rd Qu.:   0.00  
##             Max.   : 500.00

# boxplot
boxplot(wdiff_gen$wdiff ~ cdc$gender, outline = FALSE)

It appears that women want to lose more weight than men: their wdiff is more negative on average (mean -18.15 vs. -10.71, median -10 vs. -5). This could mean that women feel more strongly about their weight goals, or set goals that are harder to achieve, than men.
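Group-wise summaries can also be obtained in one step with tapply(), without building a separate data frame per gender. A sketch on the six rows printed by head(cdc) above (one male and five females, so the numbers are only illustrative):

```r
# Values copied from the six rows shown by head(cdc)
weight   <- c(175, 125, 105, 132, 150, 114)
wtdesire <- c(175, 115, 105, 124, 130, 114)
gender   <- c("m", "f", "f", "f", "f", "f")

wdiff <- wtdesire - weight

# Apply a summary function to wdiff within each gender
med_by_gender  <- tapply(wdiff, gender, median)
mean_by_gender <- tapply(wdiff, gender, mean)
print(med_by_gender)
print(mean_by_gender)
```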

6. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

# mean and stand. dev.
mean_weight = mean(cdc$weight)
mean_weight

## [1] 169.683

sd_weight = sd(cdc$weight)
sd_weight

## [1] 40.08097

# one standard deviation from mean
prop_stdev = subset(cdc, cdc$weight<(mean_weight+sd_weight) & cdc$weight>(mean_weight-sd_weight))
nrow(prop_stdev)/nrow(cdc)

## [1] 0.7076

Approximately 70.76% of subjects’ weights were within one standard deviation (40.08) of the mean (169.683).
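The same proportion can be computed without creating a subset, because the mean of a logical vector is the share of TRUE values. A minimal sketch on the six weights printed by head(cdc) above (w, m, s and prop_within are illustrative names, and the result applies to these six rows only):

```r
# Weights copied from the six rows shown by head(cdc)
w <- c(175, 125, 105, 132, 150, 114)

m <- mean(w)
s <- sd(w)

# TRUE where a weight lies strictly within one standard deviation of the mean
within_1sd <- abs(w - m) < s

# Mean of a logical vector = proportion of TRUE values
prop_within <- mean(within_1sd)
print(prop_within)
```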