About
Qualitative Descriptive Analytics aims to gather an in-depth
understanding of the underlying reasons and motivations for an event or
observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a
phenomenon via statistical, mathematical, and computational techniques.
It aims to quantify an event with metrics and numbers.
In this project, we will explore quantitative analytics using the
babies data set provided. These data were collected to understand the
factors affecting baby weights at birth. Below are the explanations of
the variables:
Bwt: Baby weight in ounces
Gestation: Length of pregnancy in days
Height: Mother’s height in inches
Smoke: = 1 if the mother is a smoker, = 0 if nonsmoker
Age: Mother’s age in years
Weight: Mother’s pregnancy weight
Parity: = 0 if the baby is first born, = 1 otherwise
babies: Number of babies before this birth
Read this worksheet carefully and follow the instructions to complete
the tasks and answer any questions. Submit your work as an HTML or PDF
or Word document to Sakai. If you are submitting as a team project only
one submission from one of the team members will be sufficient but in
this case you need to input the names of the team members in the above
Title section.
Task 1: Testing for missing data (1 point)
The babies_old data set will be checked for missing data. First read
the babies_old file and make sure it loads correctly with the variables
listed above.
#Read babies_old file and check the variables
library(readxl)
babies_old <- read_excel("C:/Users/jhyne/OneDrive/INFS 343/babies_old.xlsx")
View(babies_old)
Find NA’s in babies_old file
#Check your file for missing data
which(is.na(babies_old))
[1] 5872 5883 5923 5961 6106 6336 6419 6585 6741 6943
[11] 7025
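Called on a data frame, `which(is.na(...))` returns positions counted column by column, so with 1174 rows the values 5872 to 7025 all fall in the sixth column. A minimal sketch of two easier-to-read base R alternatives, using a small made-up data frame `toy` (hypothetical, not the babies data):

```r
# Toy data frame, made up for illustration (not the babies data)
toy <- data.frame(
  bwt    = c(120, 113, 128, 108),
  weight = c(100, NA, 115, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Row and column index of every missing value
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions
```

`colSums(is.na(...))` shows at a glance which variables need attention, and `arr.ind = TRUE` gives the affected row numbers directly.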
Implement omission strategy to remove observations with missing
values. (0.5 point)
#implement omission strategy and check if all the observations with NA's were omitted.
dim(babies_old)
[1] 1174 8
omissionData <- na.omit(babies_old)
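The chunk above creates `omissionData` but does not show the check its comment mentions. One way to confirm that the rows with NA values were dropped is sketched below on a made-up data frame `toy` (hypothetical); the same calls should apply to `babies_old` and `omissionData`.

```r
# Toy data frame, made up for illustration (not the babies data)
toy <- data.frame(
  bwt    = c(120, 113, 128, 108),
  weight = c(100, NA, 115, NA)
)

# Omission strategy: drop every row that contains at least one NA
toy_complete <- na.omit(toy)

# Compare dimensions before and after
dim(toy)
dim(toy_complete)

# How many rows were removed, and are any NA values left?
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
sum(is.na(toy_complete))
```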
Implement mean imputation strategy (0.5 point)
#Implement imputation strategy
Weightmean = mean(babies_old$weight, na.rm = TRUE)
Weightmean
[1] 128.4996
round(Weightmean,0)
[1] 128
babies_old$weight[is.na(babies_old$weight)]= Weightmean
dim(babies_old)
[1] 1174 8
#View(babies_old)
length(which(babies_old$weight == Weightmean))
[1] 11
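Counting imputed values with `== Weightmean` works here, but it would also count any observed value that happened to equal the mean. A minimal sketch, on a made-up vector (hypothetical data), that records which rows were missing before filling them in:

```r
# Toy data, made up for illustration (not the babies data)
toy <- data.frame(weight = c(100, NA, 115, NA, 130))

# Remember which rows are missing before imputing
missing_rows <- is.na(toy$weight)

# Mean of the observed values only
weight_mean <- mean(toy$weight, na.rm = TRUE)

# Mean imputation: replace each NA with the observed mean
toy$weight[missing_rows] <- weight_mean

# Number of imputed rows, and a check that no NA remains
sum(missing_rows)
anyNA(toy$weight)
```

Note that mean imputation leaves the column mean unchanged but shrinks its standard deviation, because the filled-in values sit exactly at the center.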
Task 2: Testing for Outliers (1 point)
Read the babies.csv file and calculate the statistics required to
find out possible outliers for baby weights.
First, read the babies.csv file and calculate the mean, standard
deviation, maximum, and minimum for the bwt column using R.
#Read babies file
library(readxl)
babies <- read_excel("C:/Users/jhyne/OneDrive/INFS 343/babies.xlsx")
View(babies)
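#Calculate and display the average, standard deviation, maximum and minimum of the weight column.
#Note: weight is the mother's weight; the task asks for bwt, so use babies$bwt to analyze baby weights.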
meanweight = mean(babies$weight)
meanweight
[1] 128.4787
spreadweight=sd(babies$weight)
spreadweight
[1] 20.73428
MaxWeight = max(babies$weight)
MaxWeight
[1] 250
MinWeight = min(babies$weight)
MinWeight
[1] 87
Use the standard approach to find out possible outliers (0.5
points)
Use the formula from class to detect any outliers. An outlier is a
value that “lies outside” most of the other values in a set of data.
#calculate the upper and lower thresholds
UpperThreshold = mean(babies$weight) + 3*sd(babies$weight)
UpperThreshold
[1] 190.6816
LowerThreshold = meanweight - 3*spreadweight
LowerThreshold
[1] 66.27586
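The thresholds above follow the mean ± 3 standard deviations rule. A small reusable version might look like the sketch below; the helper name `sd_thresholds` and the vector `x` are hypothetical and made up for illustration.

```r
# Hypothetical helper: lower and upper thresholds at k standard deviations
sd_thresholds <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(lower = m - k * s, upper = m + k * s)
}

# Made-up data: twenty ordinary values and one extreme value
x <- c(rep(100, 10), rep(110, 10), 300)

th <- sd_thresholds(x)
th

# Positions of the values outside the thresholds
outlier_idx <- which(x < th[["lower"]] | x > th[["upper"]])
outlier_idx
```

Comparing against the stored thresholds rather than a typed-in number such as 190.6816 avoids rounding mistakes if the data change.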
Do you think there are possible outliers? Why?
Yes, I think there are possible outliers: some of the weights are far
above the mean and could be more than 3*SD away from it.
If you decided that you have possible outliers, can you identify the
observations with possible outliers?
#Observations with possible outliers at the upper end
upoutlier = babies$weight > UpperThreshold
which(upoutlier)
[1] 85 108 140 168 170 399 487 527 571 585 678 800 809 816 871 1091
#or
subset(babies, weight > UpperThreshold)
#observations with possible outliers at the lower end
subset(babies, weight < LowerThreshold)
What is the number of outliers at the lower end and upper end of the
data?
Lower end: 0
Upper end: 16
Use the IQR approach to identify outliers (0.5 point)
Another similar method to find the upper and lower thresholds
discussed in class involves finding the interquartile range. Find the
interquartile range using the following chunk.
#Find the interquartile range
iqrange=IQR(babies$weight)
iqrange
[1] 24.75
The thresholds are the boundaries that determine whether a value is an
outlier. If a value falls above the upper threshold or below the lower
threshold, it is a possible outlier.
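#rows above the upper threshold (UpperThreshold is still the mean + 3*SD value from the previous section; no IQR-based threshold is computed here)
subset(babies, weight>UpperThreshold)
#rows below the lower threshold (LowerThreshold is likewise the mean - 3*SD value)
subset(babies, weight<LowerThreshold)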
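The class formula for IQR-based thresholds is not reproduced in this worksheet. A common convention, assumed in this sketch, places the thresholds 1.5 × IQR below the first quartile and above the third quartile. The helper name `iqr_thresholds` and the data are hypothetical.

```r
# Hypothetical helper: thresholds at k * IQR beyond the quartiles
# (k = 1.5 is an assumed convention, not taken from the worksheet)
iqr_thresholds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]            # the interquartile range
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up data with one unusually large value
x <- c(100, 105, 110, 115, 120, 125, 130, 135, 200)

th <- iqr_thresholds(x)
th

# Values outside the IQR-based thresholds
iqr_outliers <- x[x < th[["lower"]] | x > th[["upper"]]]
iqr_outliers
```

Because quartiles are less affected by extreme values than the mean and standard deviation, this rule typically flags more observations than the 3*SD approach on skewed data.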
Do you think there are possible outliers and why?
Yes, because many observations lie above the upper threshold.
If you think there are possible outliers can you identify them? If so
how many?
There are 16.
subset(babies, weight>UpperThreshold)
subset(babies, weight<LowerThreshold)
It can also be useful to visualize the data using a box plot.
boxplot(babies$weight)

Can you identify the outliers from the boxplot? If so how many?
It is difficult to tell, but there are at least 16.
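Counting points by eye on a box plot is unreliable when they overlap. As a hedged alternative, `boxplot.stats()` returns the values a default box plot would draw as individual points; the sketch below uses made-up data rather than the babies file.

```r
# Made-up data with one unusually large value
x <- c(100, 105, 110, 115, 120, 125, 130, 135, 200)

# $out holds the values lying beyond the whiskers (coef = 1.5 by default)
out_values <- boxplot.stats(x)$out
out_values

# Number of points flagged
length(out_values)
```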
Task 3: Calculating Z-Value (0.25 points)
Given a birth weight of 182, calculate the corresponding z-value or
z-score.
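# Calculate the z-value and display it
# (the mean is stored as mean_weight so the base function mean() is not masked)
mean_weight = mean(babies$weight)
mean_weight
[1] 128.4787
std = sd(babies$weight)
std
[1] 20.73428
zscore = (182 - mean_weight)/std
zscore
[1] 2.581295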
Based on the z-value, how would you evaluate the weight of this baby?
Explain your logic.
The baby is a mild outlier: the weight is more than 2 standard
deviations above the mean (z = 2.58), although it is below the
3-standard-deviation threshold used earlier.
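The calculation is z = (value - mean) / standard deviation. A minimal sketch that wraps it in a hypothetical helper `z_score` and plugs in the rounded mean and standard deviation printed above:

```r
# Hypothetical helper: number of standard deviations a value lies from the center
z_score <- function(value, center, spread) {
  (value - center) / spread
}

# Rounded mean and standard deviation as printed in the output above
z <- z_score(182, center = 128.4787, spread = 20.73428)
round(z, 2)

# Compare with the usual 2-SD and 3-SD cutoffs
abs(z) > 2
abs(z) > 3
```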
Task 4: Relationships with the variables (0.75 points)
How can you describe the relationship between baby weights and other
variables?
#correlation between baby weight and other variables
cat("The correlation coefficient for bwt and Weight is:", cor(babies$bwt,babies$weight), "\n")
The correlation coefficient for bwt and Weight is: 0.1559233
cat("The correlation coefficient for gestation and Weight is:", cor(babies$gestation,babies$weight), "\n")
The correlation coefficient for gestation and Weight is: 0.02365494
cat("The correlation coefficient for parity and Weight is:", cor(babies$parity,babies$weight), "\n")
The correlation coefficient for parity and Weight is: -0.09636209
cat("The correlation coefficient for height and Weight is:", cor(babies$height,babies$weight), "\n")
The correlation coefficient for height and Weight is: 0.4352874
cat("The correlation coefficient for smoke and Weight is:", cor(babies$smoke,babies$weight), "\n")
The correlation coefficient for smoke and Weight is: -0.0602814
cat("The correlation coefficient for age and Weight is:", cor(babies$age,babies$weight), "\n")
The correlation coefficient for age and Weight is: 0.1473221
Your comments on these relationships: Most of these are not strongly
correlated. The only pair I would call somewhat correlated is height
and weight, but even that is only loosely correlated at 0.435.
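Each `cor()` call above pairs one variable with `weight`. To relate every variable to baby weight in one step, one option is to compute the full correlation matrix and keep the `bwt` column. The sketch below uses a made-up data frame `toy` (hypothetical values) with a few of the same column names.

```r
# Toy data frame, made up for illustration (not the babies data)
toy <- data.frame(
  bwt       = c(100, 110, 120, 130, 140),
  gestation = c(270, 275, 280, 285, 290),
  smoke     = c(1, 1, 0, 1, 0)
)

# Correlation of every column with bwt, taken from the full matrix
cor_with_bwt <- cor(toy)[, "bwt"]

# Drop the trivial correlation of bwt with itself and round for display
round(cor_with_bwt[names(cor_with_bwt) != "bwt"], 3)
```

If the real data contain missing values, `cor()` accepts `use = "complete.obs"` to skip incomplete rows.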
