setwd("C:/Users/Exped/Desktop/Textbooks/606 Homeworks/Lab material/DATA606-master/inst/labs/Lab1")
source("more/cdc.R")
We use plot to make the scatterplot. As a crude way to keep outliers out of view, I limit the range of both axes to 0-400, and I add a line of best fit. Let's include a summary of our two variables as well.
plot(cdc$weight ~ cdc$wtdesire, xlim= c(0,400), ylim=c(0,400))
fit = lm(cdc$weight ~ cdc$wtdesire)
abline(fit, col='purple')
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
summary(cdc$wtdesire)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 130.0 150.0 155.1 175.0 680.0
The scatterplot shows us that actual weight tends to be higher than desired weight (could the desire to lose weight be a trend?). The relationship is strong, positive, and linear. The means and medians agree with my assessment: both are higher for weight than for wtdesire. It is also interesting that the maximum desired weight (680) is larger than the maximum actual weight (500), which suggests at least one person wants to gain a significant amount of weight.
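One caveat with limiting the axes: xlim and ylim only hide points from view, while lm still fits the line to every observation. A minimal sketch of the alternative, subsetting before fitting, on made-up data. The `toy` data frame and the 400 lb cutoff are assumptions for illustration, not part of the cdc data.

```r
# Hypothetical toy data standing in for cdc; values are invented for illustration.
toy <- data.frame(
  weight   = c(150, 180, 200, 165, 140, 500),
  wtdesire = c(140, 170, 180, 160, 140, 680)
)

# Fit on all rows: the extreme point (500, 680) still influences the line,
# regardless of any xlim/ylim passed to plot().
fit_all <- lm(weight ~ wtdesire, data = toy)

# Fit after dropping rows outside the plotted window (assumed cutoff: 400).
trimmed <- subset(toy, weight <= 400 & wtdesire <= 400)
fit_trim <- lm(weight ~ wtdesire, data = trimmed)

# Compare the two slopes and the strength of the linear association.
slopes <- c(all     = unname(coef(fit_all)["wtdesire"]),
            trimmed = unname(coef(fit_trim)["wtdesire"]))
r_trim <- cor(trimmed$weight, trimmed$wtdesire)
print(slopes)
print(r_trim)
```

On the real data the same idea would be `subset(cdc, weight <= 400 & wtdesire <= 400)`, though whether 400 is a sensible cutoff is a judgment call.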
Let's then compute the difference between desired and actual weight and put that variable into our data frame for ease of use.
wdiff = cdc$wtdesire - cdc$weight
cdc$wdiff = wdiff
Let's use the str function to ascertain the type of data.
str(cdc$wdiff)
## int [1:20000] 0 -10 0 -8 -20 0 -9 -10 -20 -10 ...
wdiff is an integer vector with 20,000 values, one per respondent. If wdiff shows up as 0, there is no desired change in weight for a specific person.
If wdiff is positive, it tells us this specific person wants to gain this much weight.
If wdiff is negative, it tells us this specific person wants to lose this much weight.
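The three cases above can be turned into a labeled category and counted. A rough sketch on invented numbers; the vectors and the `goal` labels below are my own assumptions, not part of the lab data.

```r
# Hypothetical weights, invented for illustration (not the cdc data).
weight   <- c(175, 125, 105, 132, 150)
wtdesire <- c(175, 115, 105, 124, 160)
wdiff <- wtdesire - weight

# sign() returns -1, 0, or 1; map those to readable labels.
goal <- factor(sign(wdiff), levels = c(-1, 0, 1),
               labels = c("lose", "maintain", "gain"))

# Count how many people fall in each category.
goal_counts <- table(goal)
print(goal_counts)
```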
Let's use histograms and a stem-and-leaf plot, complete with a summary.
summary(cdc$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
hist(cdc$wdiff,breaks=200,col='purple', xlim=c(-200,100), main = 'Exhibit A')
hist(cdc$wdiff,breaks=1100,col='red', xlim=c(-20,5), density = 45, angle = 70, main = 'Exhibit B')
stem(cdc$wdiff, scale=1)
##
## The decimal point is 2 digit(s) to the right of the |
##
## -3 | 00
## -2 | 5
## -2 | 421000000
## -1 | 9887777776666555555555555555555555555555
## -1 | 44444444444444433333333333333333333333333333332222222222222222222222+182
## -0 | 99999999999999999999999999999999999999999999999999999999999999999999+1442
## -0 | 44444444444444444444444444444444444444444444444444444444444444444444+10848
## 0 | 00000000000000000000000000000000000000000000000000000000000000000000+7102
## 0 | 555555555555555555566666666666677777777777888899999
## 1 | 1
## 1 |
## 2 |
## 2 |
## 3 | 1
## 3 |
## 4 |
## 4 |
## 5 | 0
Exhibit A is unimodal with a left skew. Looking at the distribution in terms of frequency, far more people want to lose weight than gain it. The mean of the distribution is -14.59 and the median is -10, as found in summary; the mode is 0.
First, the box plot.
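R has no built-in function for the statistical mode, so here is one way it could be checked, along with a quick mean-versus-median comparison as a rough indicator of skew. This is a sketch on invented values, not the cdc data.

```r
# Hypothetical differences, invented for illustration (not the cdc data).
wdiff <- c(0, 0, 0, -5, -10, -10, -20, -40, -100, 10)

# Most frequent value: tabulate, then take the name of the largest count.
counts <- table(wdiff)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# A mean below the median is a rough sign of left skew.
left_skewed <- mean(wdiff) < median(wdiff)
print(mode_value)
print(left_skewed)
```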
boxplot(cdc$wdiff~cdc$gender,outline=FALSE)
I can see from the boxplot that men have less spread in their weight goals (in that they tend not to want to gain or lose as much weight as women).
From here, we call the summary function on wdiff for both sexes (women first, then men) to confirm or refute my boxplot conclusions.
summary(cdc$wdiff[cdc$gender=='f']);summary(cdc$wdiff[cdc$gender=='m'])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
This tells us the average man wants to lose 7.44 fewer pounds than the average woman (10.71 versus 18.15). Also, let's find which gender had more wdiffs of 0 (a wdiff of 0 indicating satisfaction with current body weight).
length(cdc$gender[cdc$gender=='f'])/length(cdc$gender[cdc$gender=='m'])
## [1] 1.090083
Women outnumber men by a ratio of 1.09, so we scale the male count accordingly.
length(cdc$gender[cdc$gender=='m' & cdc$wdiff==0])*1.09 ; length(cdc$gender[cdc$gender=='f' & cdc$wdiff==0])
## [1] 3433.5
## [1] 2466
So, adjusting for the number of each sex surveyed and defining weight satisfaction as a wdiff of 0: for every 10,431 men surveyed we would expect about 3,434 to be satisfied with their weight (32.92%), and of the 10,431 women surveyed, 2,466 were satisfied with their weight (23.64%).
Start by finding the mean and SD, then do a simple check to count how many weights are within 1 SD of the mean, then divide by n.
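The same percentages can be had without rescaling the counts, by taking the proportion of zeros within each group. A small sketch on an invented sample; the `gender` and `wdiff` vectors below are assumptions for illustration only.

```r
# Hypothetical sample, invented for illustration (not the cdc data).
gender <- factor(c("m", "m", "m", "m", "f", "f", "f", "f", "f"))
wdiff  <- c(0, 0, -10, 5, 0, -20, -15, -5, -30)

# mean() of a logical vector gives the share of TRUE values,
# so this is the within-group proportion with wdiff equal to 0.
satisfied <- tapply(wdiff == 0, gender, mean)
print(round(satisfied, 4))
```

On the lab data this would read `tapply(cdc$wdiff == 0, cdc$gender, mean)`.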
SD=sd(cdc$weight)
SD
## [1] 40.08097
MEAN = mean(cdc$weight)
MEAN
## [1] 169.683
within1sd = length(cdc$weight[cdc$weight <= SD+MEAN & cdc$weight >= MEAN-SD ])
within1sd
## [1] 14152
within1sd/20000
## [1] 0.7076
70.76% of people surveyed by the CDC in this study have a weight within 1 standard deviation of the mean.
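A slightly more compact way to get the same kind of proportion, without hard-coding the sample size, is sketched below on invented weights (not the cdc data). For reference, a normal distribution would put roughly 68% of values within 1 SD.

```r
# Hypothetical weights, invented for illustration (not the cdc data).
weight <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 300)

# Distance from the mean, compared against one standard deviation.
within_1sd <- abs(weight - mean(weight)) <= sd(weight)

# mean() of a logical vector is the proportion of TRUE values,
# so there is no need to divide by a hard-coded n.
prop_within <- mean(within_1sd)
print(prop_within)
```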