About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this project, we will explore quantitative analytics using the babies data set provided. The data was collected to understand the different factors affecting baby weights at birth. The variables are explained below:

Bwt: Baby weights in ounces
Gestation: Length of pregnancy in days
Height: Mother’s height in inches
Smoke: = 1 if mother is a smoker, = 0 if nonsmoker
Age: Mother’s age in years
Weight: Mother’s pregnancy weight
Parity: = 0 if the baby is first born, = 1 otherwise
babies: number of babies before this birth

Read this worksheet carefully and follow the instructions to complete the tasks and answer any questions. Submit your work as an HTML, PDF, or Word document to Sakai. If you are submitting as a team project, one submission from any team member is sufficient, but you must list the names of all team members in the Title section above.

Task 1: Testing for missing data (1 point)

The babies_old data set will be checked for missing data. First, read the babies_old file and make sure it is read correctly with the variables mentioned above.

#Read babies_old file and check the variables
#read.csv() is part of base R, so no extra package (such as readxl) is needed
myData = read.csv(file="babies_old.csv")
View(myData)

Find NA’s in babies_old file

#Check your file for missing data
which(is.na(myData))
##  [1] 5872 5883 5923 5961 6106 6336 6419 6585 6741 6943 7025
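
The indices returned above are linear positions in the whole data frame, counted column by column, so they do not directly say which rows or variables are affected. A minimal sketch of how to translate them follows; the `toy` data frame and its values are hypothetical and only stand in for babies_old. With 1174 rows, the smallest and largest reported indices (5872 and 7025) both map to column 6, which would be `weight` if babies_old has the same column order as the babies.csv output shown later (an assumption).

```r
# Hypothetical toy data, not the babies_old file
toy <- data.frame(
  bwt    = c(120, 113, 128),
  weight = c(100, NA, 115),
  age    = c(27, 33, NA)
)

# Linear indices, the same kind of result as which(is.na(myData))
linear_idx <- which(is.na(toy))
linear_idx

# Row and column position of every missing value
which(is.na(toy), arr.ind = TRUE)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Column number for a linear index: ceiling(index / number of rows)
reported_cols <- ceiling(c(5872, 7025) / 1174)
reported_cols
```

`colSums(is.na(...))` is often the quickest summary, since it shows at a glance which variables need an omission or imputation strategy.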

Implement omission strategy to remove observations with missing values. (0.5 point)

#Implement omission strategy: dim() shows the size before omission, and na.omit() drops every row that contains an NA
dim(myData)
## [1] 1174    8
ommisionData <- na.omit(myData)
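
To confirm that omission worked, the result of `na.omit()` can be compared with the original data. The sketch below uses a hypothetical `toy` data frame rather than babies_old, so the numbers are illustrative only.

```r
# Hypothetical toy data with two incomplete rows
toy <- data.frame(
  bwt    = c(120, 113, 128, 108),
  weight = c(100, NA, 115, NA)
)

toy_complete <- na.omit(toy)

# Size after omission
dim(toy_complete)

# FALSE means no missing values remain
anyNA(toy_complete)

# Number of rows that were dropped
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
```

Applied to the worksheet data, the same three checks on the omitted data frame would show whether all rows with missing values were removed.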

Implement mean imputation strategy (0.5 point)

#Implement imputation strategy
Weightmean = mean(myData$weight, na.rm = TRUE)
Weightmean
## [1] 128.4996
round(Weightmean, 0)
## [1] 128
myData$weight[is.na(myData$weight)]= Weightmean
dim(myData)
## [1] 1174    8
length(which(myData$weight == Weightmean))
## [1] 11
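
Counting imputed values by testing equality with the mean works here, but it can overcount when an observed value happens to equal the mean. A slightly more defensive sketch records the missing positions before imputing; the `toy` data frame and the names `missing_rows` and `weight_mean` are illustrative assumptions, not part of the worksheet.

```r
# Hypothetical toy data: one observed value (115) equals the mean
toy <- data.frame(weight = c(100, NA, 115, 130, NA))

# Remember which rows are missing before changing anything
missing_rows <- is.na(toy$weight)

# Mean of the observed values only
weight_mean <- mean(toy$weight, na.rm = TRUE)

# Replace the missing values with the mean
toy$weight[missing_rows] <- weight_mean

# Number of imputed values
sum(missing_rows)

# Equality with the mean also counts the observed 115
sum(toy$weight == weight_mean)

# FALSE once every NA has been filled
anyNA(toy$weight)
```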

Task 2: Testing for Outliers (1 point)

Read the babies.csv file and calculate the statistics required to find out possible outliers for baby weights.

First, read the babies.csv file and calculate the mean, standard deviation, maximum, and minimum for the bwt column using R.

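#Read babies file
#read.csv() is part of base R, so no extra package (such as readxl) is needed
myData2 = read.csv(file="babies.csv")
View(myData2)
#Calculate and display the mean, standard deviation, maximum, and minimum.
#Note: the code below uses the weight column (mother's weight); the baby's birth weight is stored in bwt.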
meanweight = mean(myData2$weight)
meanweight
## [1] 128.4787
spreadweight=sd(myData2$weight)
spreadweight
## [1] 20.73428
Maxweight = max(myData2$weight)
Maxweight
## [1] 250
Minweight = min(myData2$weight)
Minweight
## [1] 87

Use the standard approach to find out possible outliers (0.5 points)

Use the formula from class to detect any outliers: here, values more than three standard deviations above or below the mean are flagged. An outlier is a value that “lies outside” most of the other values in a set of data.

#calculate the upper and lower thresholds
UpperThreshold = meanweight + 3*spreadweight
UpperThreshold
## [1] 190.6816
LowerThreshold = meanweight - 3*spreadweight
LowerThreshold
## [1] 66.27586

Do you think there are possible outliers? Why?

In this case I believe there are possible outliers: some weights lie far above the mean (the maximum of 250 exceeds the upper threshold of 190.6816), while none fall below the lower threshold (the minimum of 87 is above 66.27586).

If you decided that you have possible outliers, can you identify the observations with possible outliers?

#Observations with possible outliers at the upper end
upoutlier = myData2$weight > UpperThreshold
which(upoutlier)
##  [1]   85  108  140  168  170  399  487  527  571  585  678  800  809  816  871
## [16] 1091
#observations with possible outliers at the lower end
subset(myData2, weight < LowerThreshold)
## [1] Count     bwt       gestation parity    height    weight    smoke    
## [8] age      
## <0 rows> (or 0-length row.names)
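
The threshold calculation and the outlier counts can be wrapped in a small helper. This is only a sketch: `sd_thresholds` is a hypothetical function name, `k = 3` mirrors the three-standard-deviation rule used above, and the vector `x` is made-up data rather than the babies file.

```r
# Lower and upper thresholds at k standard deviations from the mean
sd_thresholds <- function(x, k = 3) {
  c(lower = mean(x) - k * sd(x), upper = mean(x) + k * sd(x))
}

# Hypothetical data: 19 typical values and one extreme value
x <- c(rep(c(98, 100, 102), times = 6), 100, 200)

limits <- sd_thresholds(x)
limits

# Number of possible outliers at each end
n_upper <- sum(x > limits[["upper"]])
n_lower <- sum(x < limits[["lower"]])
n_upper
n_lower

# Positions of the flagged observations
which(x > limits[["upper"]] | x < limits[["lower"]])
```

Using `sum()` on the logical comparison gives the counts directly, without reading them off the printed rows.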

What is the number of outliers at the lower end and upper end of the data?

There are 16 outliers at the upper end of the data and 0 outliers at the lower end.

Use the IQR approach to identify outliers (0.5 point)

Another method discussed in class for finding the upper and lower thresholds uses the interquartile range (IQR). Find the interquartile range using the following chunk.

#Find the interquartile range
iqrange=IQR(myData2$weight)
iqrange
## [1] 24.75

The thresholds are the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is a possible outlier.

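#Observations above the upper threshold (UpperThreshold here is still the mean + 3 SD value calculated earlier, not an IQR-based threshold)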
subset(myData2, weight>UpperThreshold)
##      Count bwt gestation parity height weight smoke age
## 85      85 125       305      0     70    196     1  22
## 108    108 131       283      0     67    215     0  25
## 140    140 126       282      0     66    250     0  38
## 168    168 113       277      0     65    192     1  23
## 170    170 124       277      0     63    220     0  29
## 399    399 140       251      0     63    210     0  28
## 487    487 132       282      0     67    200     1  28
## 527    527 105       260      0     64    197     0  23
## 571    571 115       273      1     67    215     1  23
## 585    585  91       248      0     63    202     0  33
## 678    678 116       280      0     68    198     0  34
## 800    800 160       271      0     67    215     0  32
## 809    809 136       291      0     66    191     0  41
## 816    816 122       273      0     66    210     0  26
## 871    871 136       288      0     62    217     0  23
## 1091  1091  72       266      1     66    200     1  25
#Observations below the lower threshold (LowerThreshold is likewise the mean - 3 SD value calculated earlier)
subset(myData2, weight<LowerThreshold)
## [1] Count     bwt       gestation parity    height    weight    smoke    
## [8] age      
## <0 rows> (or 0-length row.names)
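
For reference, IQR-based thresholds are usually computed from the quartiles themselves. The sketch below assumes the common 1.5 × IQR rule (the formula from class may use a different multiplier) and uses a made-up vector `x`, so its results say nothing about the babies data.

```r
# Hypothetical data with one extreme value
x <- c(100, 105, 110, 115, 120, 125, 130, 135, 200)

# First and third quartiles and the interquartile range
q1  <- quantile(x, 0.25)[[1]]
q3  <- quantile(x, 0.75)[[1]]
iqr <- IQR(x)

# Thresholds under the assumed 1.5 * IQR rule
lower_iqr <- q1 - 1.5 * iqr
upper_iqr <- q3 + 1.5 * iqr
c(lower = lower_iqr, upper = upper_iqr)

# Values outside the thresholds
iqr_outliers <- x[x < lower_iqr | x > upper_iqr]
iqr_outliers   # 200
```

On a data frame, the same thresholds could be passed to `subset()` in place of the mean-based ones.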

Do you think there are possible outliers and why?

Yes. The output shows 16 observations above the upper threshold and none below the lower threshold.

If you think there are possible outliers, can you identify them? If so, how many?

Yes, there are 16 outliers in total, at rows 85, 108, 140, 168, 170, 399, 487, 527, 571, 585, 678, 800, 809, 816, 871, and 1091.

subset(myData2, weight>UpperThreshold)
##      Count bwt gestation parity height weight smoke age
## 85      85 125       305      0     70    196     1  22
## 108    108 131       283      0     67    215     0  25
## 140    140 126       282      0     66    250     0  38
## 168    168 113       277      0     65    192     1  23
## 170    170 124       277      0     63    220     0  29
## 399    399 140       251      0     63    210     0  28
## 487    487 132       282      0     67    200     1  28
## 527    527 105       260      0     64    197     0  23
## 571    571 115       273      1     67    215     1  23
## 585    585  91       248      0     63    202     0  33
## 678    678 116       280      0     68    198     0  34
## 800    800 160       271      0     67    215     0  32
## 809    809 136       291      0     66    191     0  41
## 816    816 122       273      0     66    210     0  26
## 871    871 136       288      0     62    217     0  23
## 1091  1091  72       266      1     66    200     1  25

It can also be useful to visualize the data using a box plot.

boxplot(myData2$weight)

Can you identify the outliers from the boxplot? If so, how many?

It is hard to tell exactly how many outliers there are from the box plot, although it is visually apparent that outliers exist in the data set.
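
The points a box plot draws as outliers can also be retrieved numerically with `boxplot.stats()`, which avoids counting dots by eye. A minimal sketch on a made-up vector `x` (not the babies data) follows; note that the box plot's hinges can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical data with one extreme value
x <- c(100, 105, 110, 115, 120, 125, 130, 135, 200)

# Values that boxplot() would draw as individual points
box_outliers <- boxplot.stats(x)$out
box_outliers

# How many there are, and where they sit in the vector
length(box_outliers)
which(x %in% box_outliers)
```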

Task 3: Calculating Z-Value (0.25 points)

Given a birth weight of 182, calculate the corresponding z-value or z-score.

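#  Calculate the z-value and display it
#Names other than mean and std are used so the built-in mean() is not masked
weight_mean = mean(myData2$weight)
weight_mean
## [1] 128.4787
weight_sd = sd(myData2$weight)
weight_sd
## [1] 20.73428
zscore = (182 - weight_mean)/weight_sd
zscore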
## [1] 2.581295

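Based on the z-value, how would you evaluate the weight of this baby? Explain your logic.

The z-score of about 2.58 places this weight more than two standard deviations above the mean, so it is unusually high and a possible outlier. It is still below the three-standard-deviation upper threshold of 190.6816 calculated in Task 2.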
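
The z-score calculation can be written as a small reusable function. In this sketch `z_score` is a hypothetical helper, and the mean and standard deviation are the rounded values printed above, so the result matches the worksheet only to a few decimal places.

```r
# z-score: distance from the center, measured in units of spread
z_score <- function(value, center, spread) {
  (value - center) / spread
}

# Rounded mean and standard deviation as printed in the output above
z <- z_score(182, center = 128.4787, spread = 20.73428)
round(z, 2)   # 2.58

# Compare against two- and three-standard-deviation cutoffs
abs(z) > 2
abs(z) > 3
```

The two comparisons make the cutoff explicit: the value exceeds a two-standard-deviation rule but not a three-standard-deviation one.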

Task 4: Relationships with the variables (0.75 points)

How can you describe the relationship between baby weights and other variables?

#Correlation between the weight column and the other variables
cat("The correlation coefficient of bwt and weight is:", cor(myData2$bwt,myData2$weight),"\n")
## The correlation coefficient of bwt and weight is: 0.1559233
cat("The correlation coefficient of gestation and weight is:", cor(myData2$gestation,myData2$weight), "\n")
## The correlation coefficient of gestation and weight is: 0.02365494
cat("The correlation coefficient of parity and weight is:", cor(myData2$parity,myData2$weight), "\n")
## The correlation coefficient of parity and weight is: -0.09636209
cat("The correlation coefficient of smoke and weight is:", cor(myData2$smoke,myData2$weight), "\n")
## The correlation coefficient of smoke and weight is: -0.0602814
cat("The correlation coefficient of age and weight is:", cor(myData2$age,myData2$weight), "\n")
## The correlation coefficient of age and weight is: 0.1473221

Your comments on these relationships: All of the correlations computed above are weak (absolute values below 0.16), so weight has little linear relationship with bwt, gestation, parity, smoke, or age. Height is the exception: the correlation between height and weight is 0.4352874, a moderate positive relationship and the strongest of the set.
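
Rather than writing one `cor()` call per variable, the correlations of a single column with all the others can be computed in one pass. The sketch below uses a small hypothetical `toy` data frame and takes `bwt` as the target column; both are assumptions for illustration, and the resulting values are not results from the babies data.

```r
# Hypothetical toy data, not the babies file
toy <- data.frame(
  bwt       = c(120, 113, 128, 108, 136, 125),
  gestation = c(284, 282, 279, 282, 286, 280),
  weight    = c(100, 135, 115, 125, 93, 140)
)

# Every column except the target
others <- setdiff(names(toy), "bwt")

# Correlation of bwt with each remaining column, as a named vector
cors <- sapply(toy[others], function(col) cor(toy$bwt, col))
round(cors, 3)

# The same values appear in the bwt row of the full correlation matrix
cor(toy)["bwt", others]
```

A named vector like `cors` is easy to sort or filter, for example to list the variables with the largest absolute correlation first.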