First, we need to install and load the UsingR package, which contains the datasets of the Simple R book.
### First install the package with the command below
### install.packages("UsingR")
### and then load it with the command below
library(UsingR)
The Simple dataset pi2000 contains the first 2000 digits of pi. Make a histogram. Is it surprising? Next, find the proportion of 1’s, 2’s and 3’s. Can you do it for all 10 digits 0-9?
### First I make a histogram
hist(pi2000, main="Histogram of the first 2000 digits of pi", xlab="Pi digits", col="red", xlim=c(0,9), freq = FALSE, ylab="Proportion")
The histogram looks strange, since we expected all the digits to have approximately the same frequency. Looking more carefully, hist() has created 9 bins instead of 10: its breaks fall on 0, 1, ..., 9, so the first bin covers the whole interval from 0 to 1 and counts the digits 0 and 1 together. To see the proportion of each digit, we run the following command:
table(pi2000)/length(pi2000)
## pi2000
##      0      1      2      3      4      5      6      7      8      9
## 0.0905 0.1065 0.1035 0.0945 0.0975 0.1025 0.1000 0.0985 0.1010 0.1055
As expected, all the proportions are close to 10%. The bar plot below shows the proportion of each digit:
barplot(table(pi2000)/length(pi2000), col="red", main="Proportion of Each Digit", ylab="Proportion", xlab="Digits")
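The default histogram earlier put the digits 0 and 1 into the same bin. A minimal sketch of one way around this is to pass explicit breaks halfway between the digits, so that every digit gets its own unit-wide bin. The vector `x` below is a small made-up stand-in (an assumption, so the sketch runs without UsingR); in practice `pi2000` would take its place.

```r
# Made-up stand-in digits so the sketch runs without UsingR; use pi2000 in practice.
x <- c(0, 1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9, 9)

# Breaks at -0.5, 0.5, ..., 9.5 centre one unit-wide bin on each digit 0-9.
digit_breaks <- seq(-0.5, 9.5, by = 1)

# plot = FALSE returns the binning without drawing, which is handy for checking it.
h <- hist(x, breaks = digit_breaks, plot = FALSE)
print(h$counts)

# Same binning, drawn in the style of the original call. With unit-wide bins
# the density of a bin equals the proportion of that digit.
hist(x, breaks = digit_breaks, freq = FALSE, col = "red",
     main = "Histogram with one bin per digit",
     xlab = "Digits", ylab = "Proportion")
```

With these breaks the histogram has 10 bars and should agree with the bar plot of `table(x)/length(x)`.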
The “babies” data set contains much information about babies and their mothers for 1236 observations. Find the correlation coefficient (both Pearson and Spearman) between age and weight. Repeat for the relationship between height and weight. Make scatter plots of each pair and see if your answer makes sense.
### Correlation Pearson between age and weight
cor(babies$wt, babies$age, method = c("pearson"))
## [1] 0.02904064
### Correlation Spearman between age and weight
cor(babies$wt, babies$age, method = c("spearman"))
## [1] 0.04170028
### Correlation Pearson between height and weight
cor(babies$wt, babies$ht, method = c("pearson"))
## [1] 0.1255413
### Correlation Spearman between height and weight
cor(babies$wt, babies$ht, method = c("spearman"))
## [1] 0.214745
And now we can represent the plots:
plot(babies$wt, babies$age, col="red", main="Scatter Plot Weight vs Age", xlab="Weight", ylab = "Age")
plot(babies$wt, babies$ht, col="red", main="Scatter Plot Weight vs Height", xlab="Weight", ylab = "Height")
Both correlations are very low, and the scatter plots agree with this. There are also some values equal to 99 or 999: these are codes for unknown values, as we can see from the description in the link below. Since they were left in the data, they also enter the correlations computed above. https://cran.r-project.org/web/packages/UsingR/UsingR.pdf
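The 99 and 999 codes for unknown values are treated as ordinary numbers by cor(). A minimal sketch of recoding them to NA before computing the correlations is given below. The data frame `demo` is made up, the helper `code_to_na` is hypothetical, and which code belongs to which column (999 for weight, 99 for age and height) is an assumption that should be checked against the package documentation.

```r
# Made-up stand-in rows so the sketch runs without UsingR; use babies in practice.
demo <- data.frame(
  wt  = c(120, 113, 128, 999, 108, 136),
  age = c(27, 33, 28, 23, 99, 25),
  ht  = c(62, 64, 64, 67, 62, 99)
)

# Hypothetical helper: replace a sentinel code with NA.
# which() skips values that are already NA.
code_to_na <- function(x, code) {
  x[which(x == code)] <- NA
  x
}

clean <- demo
clean$wt  <- code_to_na(clean$wt, 999)  # assumed code for unknown weight
clean$age <- code_to_na(clean$age, 99)  # assumed code for unknown age
clean$ht  <- code_to_na(clean$ht, 99)   # assumed code for unknown height

# use = "complete.obs" drops every pair in which either value is NA.
cor(clean$wt, clean$age, use = "complete.obs", method = "pearson")
cor(clean$wt, clean$ht,  use = "complete.obs", method = "spearman")
```

Comparing these coefficients with the ones computed on the raw columns gives an idea of how much the coded values influence the result.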
The Simple data set iq contains simulated scores on a hypothetical IQ test. What analysis is appropriate for measuring the center of the distribution? Why? (Note: the data reads in as a list.)
simple.eda(iq)
As we can see, the simulated data appear to come from a normal distribution. The minimum, the 1st and 3rd quartiles, the median, the mean and the maximum of the iq data are the following:
summary(iq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    72.0    93.0   101.0   101.4   109.2   130.0
### As we can see, the median and the mean are very close, which is consistent with a
### roughly symmetric distribution, so the mean is an appropriate measure of the center,
### which is around 101
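One way to back up the choice of a center measure is to compare the mean with more robust alternatives: if the mean, the median and a trimmed mean nearly agree, skewness and outliers are not pulling the mean away. The sketch below assumes a made-up list `scores` as a stand-in for `iq` (the exercise notes that the data reads in as a list, hence the unlist() call), and `center_summary` is a hypothetical helper, not part of UsingR.

```r
# Made-up stand-in scores, stored as a list; use iq in practice.
scores <- list(c(72, 85, 93, 97, 101, 101, 105, 109, 118, 130))

# Hypothetical helper: flatten list input, then compare three measures of center.
center_summary <- function(x) {
  x <- unlist(x)  # mean() and median() need an atomic vector, not a list
  c(mean    = mean(x),
    median  = median(x),
    trimmed = mean(x, trim = 0.1))  # drops the lowest and highest 10%
}

res <- center_summary(scores)
print(res)
```

If the three numbers were far apart, the median would be the safer summary of the center.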