Lab1 - Introduction to data

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

source("http://www.openintro.org/stat/data/cdc.R")

nrow(cdc)

## [1] 20000

How many variables?

length(cdc)

## [1] 9

For each variable, identify its data type (e.g. categorical, discrete)

names(cdc)

## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

“genhlth”, “exerany”, “hlthplan”, “smoke100”, and “gender” are categorical variables (“genhlth” is ordinal). “height”, “weight”, “wtdesire”, and “age” are numerical and recorded as whole numbers, so they are discrete.
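One way to check this classification is to look at how R stores each column. The sketch below uses a tiny made-up data frame (`toy` is hypothetical and only mimics a few cdc columns), so the values are illustrative rather than taken from the survey.

```r
# Hypothetical stand-in for a few cdc columns (values are made up)
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair")),
  exerany = c(1, 0, 1),
  height  = c(70L, 64L, 66L)
)

# Storage class of each column
col_classes <- sapply(toy, class)
print(col_classes)

# Number of distinct values per column; very few distinct values
# often hints at a categorical variable
n_distinct <- sapply(toy, function(x) length(unique(x)))
print(n_distinct)
```

Storage class alone does not settle the question: a 0/1 column such as exerany is stored as a number but is categorical in meaning.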

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

Interquartile Range = Upper Quartile - Lower Quartile
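As a quick check of this definition, the sketch below computes the two quartiles with `quantile()` and compares their difference with `IQR()`. The vector `x` is a made-up sample, not the cdc data.

```r
# Hypothetical sample of heights in inches (made-up values)
x <- c(60, 62, 64, 66, 67, 68, 70, 72, 75)

# Lower and upper quartiles
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# IQR() uses the same default quantile definition as quantile()
iqr_builtin <- IQR(x)

print(iqr_manual)
print(iqr_builtin)
```

Both values agree because `IQR()` is defined as the difference between the 0.75 and 0.25 quantiles.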

summary(cdc$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

summary(cdc$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00

Compute the interquartile range for each

IQR(cdc$height)

## [1] 6

IQR(cdc$age)

## [1] 26

Compute the relative frequency distribution for gender and exerany

Relative Frequency Distribution of Qualitative Data:

The relative frequency distribution of a data variable is a summary of the frequency proportion in a collection of non-overlapping categories.

The relationship of frequency and relative frequency is:

Relative Frequency = Frequency / Sample Size
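The formula can be applied by hand or through `prop.table()`, which divides a table by its total. The sketch below uses a short made-up factor `g` (hypothetical, not the cdc gender column) to show that both routes give the same result.

```r
# Hypothetical categorical sample (made-up values)
g <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

# Frequency of each category
freq <- table(g)

# Relative frequency by the formula: frequency / sample size
relfreq_manual <- freq / length(g)

# Same result using the built-in helper
relfreq_builtin <- prop.table(freq)

print(relfreq_manual)
print(relfreq_builtin)
```

Relative frequencies always sum to 1, which is a convenient sanity check.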

gend<- cdc$gender
gend.freq <- table(gend)
gend.relfreq <- gend.freq/nrow(cdc)

#Relative Frequency Distribution of gender
gend.relfreq

## gend
##       m       f 
## 0.47845 0.52155

v_exerany<-cdc$exerany
v_exerany.freq <- table(v_exerany)
v_exerany.relfreq<-v_exerany.freq/nrow(cdc)

#Relative Frequency Distribution of exerany
v_exerany.relfreq

## v_exerany
##      0      1 
## 0.2543 0.7457

How many males are in the sample?

table(cdc$gender)[1]

##    m 
## 9569

What proportion of the sample reports being in excellent health?

cat((table(cdc$genhlth)[1]/NROW(cdc))*100,"%")

## 23.285 %

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

table(cdc$gender,cdc$smoke100)

##    
##        0    1
##   m 4547 5022
##   f 6012 4419

The mosaic plot reveals that men in the sample are more likely than women to have smoked at least 100 cigarettes: 5022 of 9569 men (about 52%) compared with 4419 of 10431 women (about 42%).
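The comparison between genders can be made explicit by converting the counts to proportions within each row. The sketch below re-enters the counts from the table above as a matrix (the object name `tab` is an assumption for illustration) and uses `prop.table()` with `margin = 1`.

```r
# Counts copied from the gender-by-smoke100 table above
tab <- matrix(c(4547, 5022,
                6012, 4419),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("m", "f"),
                              smoke100 = c("0", "1")))

# Proportion within each gender (each row sums to 1)
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))
```

Row proportions are the relevant quantity here because the two gender groups differ in size, so raw counts are not directly comparable.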

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)

head(under23_and_smoke)

##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
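The same kind of filter can also be written with logical indexing. The sketch below compares the two approaches on a made-up data frame (`toy` is hypothetical and stands in for cdc).

```r
# Hypothetical mini data set (made-up values)
toy <- data.frame(smoke100 = c(1, 0, 1, 1),
                  age      = c(21, 19, 45, 22))

# Filter with subset(): column names are looked up inside the data frame
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Equivalent filter with a logical row index
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
print(isTRUE(all.equal(by_subset, by_index)))
```

One difference to keep in mind: `subset()` drops rows where the condition evaluates to NA, while logical indexing returns those rows filled with NA.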

Exercise 5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Ans: A box plot gives a compact summary of a variable's distribution (median, quartiles, and outliers), which makes it easy to compare the variable across several categories.

bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$gender)

The box plot of BMI versus gender suggests that BMI tends to be lower for females.
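A box plot comparison can be backed up with the group medians, which are the center lines of the boxes. The sketch below wraps the BMI formula used above in a helper (`bmi_from_lb_in` is a hypothetical name) and applies it to a made-up data frame, so the numbers say nothing about the real cdc sample.

```r
# BMI from weight in pounds and height in inches, same formula as above
bmi_from_lb_in <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Hypothetical respondents (made-up values)
toy <- data.frame(weight = c(185, 160, 175, 125),
                  height = c(66, 70, 74, 64),
                  gender = c("m", "f", "m", "f"))
toy$bmi <- bmi_from_lb_in(toy$weight, toy$height)

# Median BMI per group
median_by_gender <- tapply(toy$bmi, toy$gender, median)
print(round(median_by_gender, 2))
```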

On Your Own Tasks

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight,cdc$wtdesire, xlab="Weight", ylab= "Desired Weight")

As weight increases, desired weight also tends to increase, so the two variables are positively associated.

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff <- (cdc$weight-cdc$wtdesire)

3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

typeof(wdiff)

## [1] "integer"

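wdiff is a numerical variable stored as whole numbers, so it is discrete. Here it is computed as weight minus desired weight. If wdiff is zero, the person's current weight equals their desired weight. If wdiff is negative, the person weighs less than they would like and wants to gain weight; if wdiff is positive, the person weighs more than they would like and wants to lose weight. Taking the absolute value, abs(wdiff), would give only the size of the gap, which is always greater than or equal to zero, and would discard its direction.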
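The sign convention can be illustrated with a few made-up respondents. In the sketch below, the vectors and the labels in `status` are hypothetical; only the direction of the subtraction (weight minus desired weight) follows the code above.

```r
# Hypothetical current and desired weights in pounds (made-up values)
weight   <- c(150, 180, 120)
wtdesire <- c(150, 160, 130)

# Same direction as above: current weight minus desired weight
wdiff <- weight - wtdesire

# Translate the sign into a plain-language label
status <- ifelse(wdiff == 0, "at desired weight",
                 ifelse(wdiff > 0, "wants to lose", "wants to gain"))

print(data.frame(weight, wtdesire, wdiff, status))

# The absolute value keeps the size of the gap but loses the direction
print(abs(wdiff))
```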

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

mean(wdiff)

## [1] 14.5891

sd(wdiff)

## [1] 24.04586

plot(density(wdiff))

Many people are reasonably close to their desired weight, but the mean difference of about 14.6 pounds, with a standard deviation of about 24 pounds, shows that on average respondents weigh more than they would like.
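Because the mean and standard deviation are sensitive to extreme values, it can help to report the median and interquartile range alongside them. The sketch below does this on a made-up `wdiff` vector (hypothetical values, not the cdc data) and also computes the share of people who weigh more than they would like.

```r
# Hypothetical weight differences in pounds (made-up values)
wdiff <- c(0, 0, 5, 10, 10, 20, 30, 50, -10, 100)

# Center: mean and median; a mean above the median suggests right skew
center_mean   <- mean(wdiff)
center_median <- median(wdiff)

# Spread: standard deviation and interquartile range
spread_sd  <- sd(wdiff)
spread_iqr <- IQR(wdiff)

# Share of respondents whose current weight exceeds their desired weight
prop_over <- mean(wdiff > 0)

print(c(mean = center_mean, median = center_median,
        sd = spread_sd, iqr = spread_iqr, prop_over = prop_over))
```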

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

boxplot(cdc$weight-cdc$wtdesire ~ cdc$gender)

No. For both men and women, the difference between current weight and desired weight appears to be centered close to zero.
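The task also asks for numerical summaries, which can be computed per group with `aggregate()`. The sketch below is a minimal example on a made-up data frame (`toy` is hypothetical), so its numbers do not describe the cdc respondents.

```r
# Hypothetical weight differences by gender (made-up values)
toy <- data.frame(
  wdiff  = c(20, 5, 0, 30, 10, 15),
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f"))
)

# Median difference per gender
med <- aggregate(wdiff ~ gender, data = toy, FUN = median)
print(med)

# Interquartile range per gender, as a measure of spread
spread <- aggregate(wdiff ~ gender, data = toy, FUN = IQR)
print(spread)
```

Comparing the group medians and spreads gives a numerical counterpart to what the side-by-side box plot shows visually.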

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean

mean(cdc$weight)

## [1] 169.683

sd(cdc$weight)

## [1] 40.08097

plot(density(cdc$weight))

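The shape of the density curve is close to a normal distribution. About 68% of values drawn from a normal distribution lie within one standard deviation of the mean, so roughly that proportion of the weights is expected to fall between about 129.6 and 209.8 pounds.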
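The proportion can also be computed directly rather than read off the normal rule of thumb. The sketch below does this on simulated weights (`w` is generated with `rnorm()` as a stand-in, since the cdc data are not reproduced here), so the printed value is for the simulation only.

```r
# Simulated stand-in for the weight column (not the real cdc data)
set.seed(1)
w <- rnorm(10000, mean = 170, sd = 40)

m <- mean(w)
s <- sd(w)

# Logical vector: TRUE when a weight lies within one sd of the mean
within_one_sd <- w > (m - s) & w < (m + s)

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)
print(prop_within)
```

Applied to the real data, the same expression with `cdc$weight` in place of `w` would give the sample proportion to compare against 68%.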

James Kuruvilla

February 3, 2017
