DATS 6101: Assignment 2

1. Use the summary function to get a summary of all the variables in the Pima.te dataset that is in R.

pima_te <- MASS::Pima.te  # Pima.te ships with the MASS package
summary(pima_te)

##      npreg             glu              bp              skin      
##  Min.   : 0.000   Min.   : 65.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 96.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 2.000   Median :112.0   Median : 72.00   Median :29.00  
##  Mean   : 3.485   Mean   :119.3   Mean   : 71.65   Mean   :29.16  
##  3rd Qu.: 5.000   3rd Qu.:136.2   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :197.0   Max.   :110.00   Max.   :63.00  
##       bmi             ped              age         type    
##  Min.   :19.40   Min.   :0.0850   Min.   :21.00   No :223  
##  1st Qu.:28.18   1st Qu.:0.2660   1st Qu.:23.00   Yes:109  
##  Median :32.90   Median :0.4400   Median :27.00            
##  Mean   :33.24   Mean   :0.5284   Mean   :31.32            
##  3rd Qu.:37.20   3rd Qu.:0.6793   3rd Qu.:37.00            
##  Max.   :67.10   Max.   :2.4200   Max.   :81.00

2. Get the structure of the Pima.te dataset:

str(pima_te)

## 'data.frame':    332 obs. of  8 variables:
##  $ npreg: int  6 1 1 3 2 5 0 1 3 9 ...
##  $ glu  : int  148 85 89 78 197 166 118 103 126 119 ...
##  $ bp   : int  72 66 66 50 70 72 84 30 88 80 ...
##  $ skin : int  35 29 23 32 45 19 47 38 41 35 ...
##  $ bmi  : num  33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
##  $ ped  : num  0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
##  $ age  : int  50 31 21 26 53 51 31 33 27 29 ...
##  $ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2 ...

3. Get the variables from the dataset:

names(pima_te)

## [1] "npreg" "glu"   "bp"    "skin"  "bmi"   "ped"   "age"   "type"

4. For the bmi and age variables, find the mean, median, max, min, range, and number of observations.

four_header<-c("Mean", "Median", "Max", "Min", "Range", "# of Observations")
# range() returns c(min, max), so diff() gives the spread as one number;
# length() counts the observations, since nrow() returns NULL for a vector.
bmi_results<-c(mean(pima_te$bmi), median(pima_te$bmi), max(pima_te$bmi), min(pima_te$bmi), diff(range(pima_te$bmi)), length(pima_te$bmi))
age_results<-c(mean(pima_te$age), median(pima_te$age), max(pima_te$age), min(pima_te$age), diff(range(pima_te$age)), length(pima_te$age))
names(bmi_results)<-four_header
names(age_results)<-four_header

BMI Results:

print(bmi_results)

##              Mean            Median               Max               Min 
##          33.23976          32.90000          67.10000          19.40000 
##             Range # of Observations 
##          47.70000         332.00000

Age Results:

print(age_results)

##              Mean            Median               Max               Min 
##          31.31627          27.00000          81.00000          21.00000 
##             Range # of Observations 
##          60.00000         332.00000
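As a cross-check on question 4, the same six statistics can be computed for both columns at once. This is only a sketch: it assumes the data are the Pima.te data frame from the MASS package, and the helper name describe_var is made up for illustration. It takes the spread with diff(range(x)) because range() returns the minimum and maximum as a pair, and it counts observations with length() because nrow() returns NULL for a plain vector.

```r
library(MASS)  # assumption: pima_te in this report is MASS::Pima.te

# Hypothetical helper: six summary statistics for one numeric vector.
describe_var <- function(x) {
  c(Mean   = mean(x),
    Median = median(x),
    Max    = max(x),
    Min    = min(x),
    Range  = diff(range(x)),  # max minus min, a single number
    N      = length(x))       # nrow() would return NULL for a vector
}

# One column of results per variable.
stats_table <- sapply(Pima.te[c("bmi", "age")], describe_var)
print(stats_table)
```

Keeping the statistics in one helper means any further variable (for example glu) can be added by extending the column selection.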

5. How many women are in this dataset?

totalWomen<-nrow(pima_te)
print(totalWomen)

## [1] 332

6. Select the first 5 observations and first 4 columns/variables from the dataset:

pima_te[1:5, 1:4]

##   npreg glu bp skin
## 1     6 148 72   35
## 2     1  85 66   29
## 3     1  89 66   23
## 4     3  78 50   32
## 5     2 197 70   45

7. Select the records where bmi is greater than or equal to 50:

highBMI <-pima_te[which(pima_te$bmi>=50),]
print(highBMI)

##     npreg glu  bp skin  bmi   ped age type
## 55      0 162  76   56 53.2 0.759  25  Yes
## 57      1  88  30   42 55.0 0.496  26  Yes
## 70      7 152  88   44 50.0 0.337  36  Yes
## 79      0 129 110   46 67.1 0.319  26  Yes
## 107     0 165  90   33 52.3 0.427  23   No
## 198     0 180  78   63 59.4 2.420  25  Yes
## 292     3 123 100   35 57.3 0.880  22   No

8. What percentage of the women have diabetes by WHO criteria:

library(scales)  # provides percent()
numOfDiabetics<-sum(pima_te$type=='Yes')  # count of women with type "Yes"
percentOfDiabetics<-(numOfDiabetics/totalWomen)
percent(percentOfDiabetics)

## [1] "32.8%"
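The same share can be read off a proportion table without any formatting package. The following is a minimal sketch, assuming the data are MASS::Pima.te; the names type_share and pct_yes are illustrative only.

```r
library(MASS)  # assumption: pima_te in this report is MASS::Pima.te

# Proportion of each level of type ("No" / "Yes").
type_share <- prop.table(table(Pima.te$type))

# Percentage of women classified as diabetic, rounded to one decimal place.
pct_yes <- round(100 * type_share[["Yes"]], 1)
print(pct_yes)
```

With 109 of the 332 women in the "Yes" group, this comes to 32.8, matching the result above.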

9. Obtain a histogram for body mass index:

hist(pima_te$bmi, col="blue")

10. What are the mean and median for bmi? How far apart are they?

bmi_names<-c("Mean", "Median")
bmi_DoubleMs<-c(mean(pima_te$bmi),median(pima_te$bmi))
names(bmi_DoubleMs)<-bmi_names
print(bmi_DoubleMs)

##     Mean   Median 
## 33.23976 32.90000

The mean and median are only 0.33976 apart, with the mean slightly higher.

11. Read the data from the “vlbw.csv” file (read_csv).
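The code for this step is not shown in the report. A minimal sketch of what it might look like, assuming the readr package is installed and that vlbw.csv sits in the working directory (the path variable csv_path is illustrative):

```r
library(readr)  # provides read_csv()

csv_path <- "vlbw.csv"  # assumption: the file is in the working directory

if (file.exists(csv_path)) {
  vlbw <- read_csv(csv_path)
  # Quick sanity check that the columns used later are present.
  print(c("hospstay", "lowph") %in% names(vlbw))
} else {
  message("vlbw.csv not found; adjust csv_path to where the file lives")
}
```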

12. Obtain a histogram of the length of stay, i.e. the number of days the infants stay in the neonatal intensive care unit (variable hospstay).

hist(vlbw$hospstay, col="red")

13. Do you see some data problems?

Answer: Yes. A number of infants have negative values for length of stay, which is not possible. Also, the range of the histogram is so poorly distributed that no useful analysis can be drawn from it.
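The negative values can be counted directly rather than read off the plot. The sketch below uses a hypothetical helper, count_negative, and a made-up vector so that it runs on its own; the toy values are not taken from vlbw.csv.

```r
# Hypothetical helper: count impossible (negative) lengths of stay,
# ignoring missing values.
count_negative <- function(x) sum(x < 0, na.rm = TRUE)

# Made-up values for illustration only.
toy_stay <- c(12, -3, 45, NA, 0, -1)
print(count_negative(toy_stay))
```

With the real data loaded, the equivalent call would be count_negative(vlbw$hospstay).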

14. Draw a boxplot for variable lowph. Visually estimate the median and quartiles from the plot.

boxplot(vlbw$lowph)

Based on the boxplot above, the median appears to be approximately 7.2. The first and third quartiles (the edges of the box) look to be around 7.1 and 7.3, and the whiskers extend from roughly 6.8 to 7.5.

15. Use the summary function to check your answers to the previous question. Is it about the same?

boxplot.stats(vlbw$lowph)

## $stats
## [1] 6.859997 7.129997 7.209999 7.309998 7.549999
## 
## $n
## [1] 609
## 
## $conf
## [1] 7.198475 7.221524
## 
## $out
##  [1] 6.829998 6.849998 6.529999 6.809998 6.779999 6.849998 6.759998
##  [8] 6.699997 6.699997 6.820000 6.809998 6.719997 6.809998 6.739998

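My estimates for question 14 are about the same as the values in $stats above, which lists the lower whisker, lower hinge, median, upper hinge, and upper whisker (about 6.86, 7.13, 7.21, 7.31, and 7.55).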
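Question 15 asks for summary(), while the check above uses boxplot.stats(). The two usually agree closely but need not be identical: boxplot.stats() reports Tukey's hinges (via fivenum()), whereas summary() reports quartiles from quantile(). A small sketch on a made-up vector shows the difference; the toy values are not taken from vlbw.csv.

```r
# Made-up values for illustration only.
toy <- c(1, 2, 3, 4)

hinges    <- fivenum(toy)[c(2, 4)]                 # what boxplot.stats() reports
quartiles <- unname(quantile(toy, c(0.25, 0.75)))  # what summary() reports

print(hinges)     # 1.5 3.5
print(quartiles)  # 1.75 3.25
```

On a sample as large as lowph (609 non-missing values) the gap between the two definitions is typically small, so either function is a reasonable check on a visual estimate.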

16. The variable lowph contains the lowest pH in the first 4 days of life. Obtain a histogram of this variable.

library(ggplot2)  # provides qplot()
qplot(vlbw$lowph, geom="histogram",
      main = "Histogram for lowph", binwidth=0.02, xlab = "lowest pH", fill=I("blue"), col=I("white"))

## Warning: Removed 62 rows containing non-finite values (stat_bin).