About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this project, we will explore quantitative analytics using the babies data set provided. These data were collected to understand the different factors affecting baby weights at birth. Below are the explanations of the variables:

Bwt: Baby weights in ounces
Gestation: Length of pregnancy in days
Height: Mother’s height in inches
Smoke: = 1 if mother is smoker, = 0 if nonsmoker
Age: Mother’s age in years
Weight: Mother’s pregnancy weight
Parity: = 0 if the baby is first born, = 1 otherwise
babies: Number of babies before this birth

Read this worksheet carefully and follow the instructions to complete the tasks and answer any questions. Submit your work as an HTML, PDF, or Word document to Sakai. If you are submitting as a team project, one submission from one of the team members is sufficient, but in that case you need to enter the names of the team members in the Title section above.

Task 1: Testing for missing data (1 point)

We will check whether the babies_old data set contains any missing data. First read the babies_old file and make sure it was read correctly with the above-mentioned variables.

#Read babies_old file and check the variables
library(readxl)
babies_old <- read_excel("C:/Users/jhyne/OneDrive/INFS 343/babies_old.xlsx")
View(babies_old)

Find NA’s in babies_old file

#Check your file for missing data
which(is.na(babies_old))
 [1] 5872 5883 5923 5961 6106 6336 6419 6585 6741 6943
[11] 7025
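
Because is.na() on a data frame yields a logical matrix, which() reports positions counted down the columns. With 1174 rows, index 5872 is therefore row 2 of column 6, and all eleven indices above fall in that same column. A minimal sketch of reading off row and column positions directly, using a made-up data frame named toy (hypothetical values, not the worksheet data):

```r
# Toy data frame (made-up values) standing in for babies_old
toy <- data.frame(
  bwt    = c(120, 113, 128, 108),
  weight = c(100, NA, 115, NA),
  smoke  = c(0, 1, NA, 0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Row and column position of every missing value
na_positions <- which(is.na(toy), arr.ind = TRUE)
na_positions
```

The per-column counts show which variable needs attention before an omission or imputation strategy is chosen.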

Implement omission strategy to remove observations with missing values. (0.5 point)

#implement omission strategy and check if all the observations with NA's were omitted.
dim(babies_old)
[1] 1174    8
omissionData <- na.omit(babies_old)
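
The chunk above creates omissionData but does not display a check on it. One way such a check could look, sketched on a made-up data frame (toy and toy_complete are hypothetical names, not objects from the worksheet):

```r
# Toy data frame (made-up values) with two incomplete rows
toy <- data.frame(
  bwt    = c(120, 113, 128, 108),
  weight = c(100, NA, 115, NA)
)

# Omission strategy: drop every row that contains at least one NA
toy_complete <- na.omit(toy)

# Compare dimensions before and after, and confirm no NA remains
dim(toy)
dim(toy_complete)
anyNA(toy_complete)

# Number of rows removed by the omission
rows_dropped <- nrow(toy) - nrow(toy_complete)
rows_dropped
```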

Implement mean imputation strategy (0.5 point)

#Implement imputation strategy
Weightmean = mean(babies_old$weight, na.rm = TRUE)
Weightmean
[1] 128.4996
round(Weightmean,0)
[1] 128
babies_old$weight[is.na(babies_old$weight)]= Weightmean
dim(babies_old)
[1] 1174    8
#View(babies_old)
length(which(babies_old$weight == Weightmean))
[1] 11
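
Counting imputed cells by comparing against the mean works only when no observed value happens to equal the mean. A hedged alternative is to record which cells were missing before filling them in. The sketch below uses made-up values, and the names toy, was_missing, and weight_mean are hypothetical:

```r
# Toy column (made-up values) with two missing entries
toy <- data.frame(weight = c(100, NA, 115, NA, 130))

# Remember which cells were missing before anything is changed
was_missing <- is.na(toy$weight)

# Mean of the observed values only
weight_mean <- mean(toy$weight, na.rm = TRUE)

# Mean imputation: fill only the cells that were missing
toy$weight[was_missing] <- weight_mean

# Number of imputed cells, taken from the mask
n_imputed <- sum(was_missing)
n_imputed

# Number of cells equal to the mean, for comparison
n_equal_to_mean <- sum(toy$weight == weight_mean)
n_equal_to_mean
```

In this toy column the observed mean is exactly 115, which also appears as a real observation, so the equality count is 3 while the mask correctly reports 2 imputed cells.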

Task 2: Testing for Outliers (1 point)

Read the babies.csv file and calculate the statistics required to find out possible outliers for baby weights.

First, read the babies.csv file and calculate the mean, standard deviation, maximum, and minimum for the bwt column using R.

#Read babies file
library(readxl)
babies <- read_excel("C:/Users/jhyne/OneDrive/INFS 343/babies.xlsx")
View(babies)
#Calculate and display the average, standard deviation, maximum and minimum of the weight column.
meanweight = mean(babies$weight)
meanweight
[1] 128.4787
spreadweight=sd(babies$weight)
spreadweight
[1] 20.73428
MaxWeight = max(babies$weight)
MaxWeight
[1] 250
MinWeight = min(babies$weight)
MinWeight
[1] 87
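
The task text names the bwt column, while the chunk above summarizes the weight column. A minimal sketch of computing the same four statistics for any chosen column, on made-up values (toy_babies and column_summary are hypothetical names, and the numbers are not from the babies file):

```r
# Toy data frame (made-up values) with both a bwt and a weight column
toy_babies <- data.frame(
  bwt    = c(120, 113, 128, 108, 136),
  weight = c(100, 135, 115, 125, 93)
)

# Mean, standard deviation, maximum and minimum of one numeric vector
column_summary <- function(x) {
  c(mean = mean(x), sd = sd(x), max = max(x), min = min(x))
}

# Summary of baby weights (bwt) rather than mother's weight
bwt_summary <- column_summary(toy_babies$bwt)
bwt_summary
```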

Use the standard approach to find out possible outliers (0.5 points)

Use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data.

#calculate the upper and lower thresholds
UpperThreshold = mean(babies$weight) + 3*sd(babies$weight)
UpperThreshold
[1] 190.6816
LowerThreshold = meanweight - 3*spreadweight
LowerThreshold
[1] 66.27586
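
The threshold calculation above (mean plus or minus 3 standard deviations) can be wrapped in a small helper so the same rule is applied to any column. This is a sketch on made-up values; sd_thresholds and the vector x are hypothetical:

```r
# Lower and upper thresholds at k standard deviations from the mean
sd_thresholds <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(lower = m - k * s, upper = m + k * s)
}

# Made-up values: 21 ordinary observations and one extreme value
x <- c(rep(c(98, 100, 102), times = 7), 200)

th <- sd_thresholds(x)
th

# Values outside the thresholds are the possible outliers
flagged <- x[x < th[["lower"]] | x > th[["upper"]]]
flagged

# Counts at each end
n_lower <- sum(x < th[["lower"]])
n_upper <- sum(x > th[["upper"]])
```

Using sum() on the logical comparison gives the number of possible outliers at each end without listing the rows.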

Do you think there are possible outliers? Why?

Yes. The maximum value (250) is above the upper threshold of 190.68, so at least one observation lies more than 3 standard deviations above the mean. The minimum value (87) is above the lower threshold of 66.28, so there are no possible outliers at the lower end.

If you decided that you have possible outliers can you identify the observations with possible outliers?

#Observations with possible outliers at the upper end
upoutlier = babies$weight > UpperThreshold
which(upoutlier)
 [1]   85  108  140  168  170  399  487  527  571  585  678  800  809  816  871 1091
#or
subset(babies, weight > UpperThreshold)
#observations with possible outliers at the lower end
subset(babies, weight < LowerThreshold)

What is the number of outliers at the lower end and upper end of the data?

Lower end: 0
Upper end: 16

Use the IQR approach to identify outliers (0.5 point)

Another similar method to find the upper and lower thresholds discussed in class involves finding the interquartile range. Find the interquartile range using the following chunk.

#Find the interquartile range
iqrange=IQR(babies$weight)
iqrange
[1] 24.75

The thresholds are the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is a possible outlier.

#observations above the upper threshold
subset(babies, weight>UpperThreshold)
#observations below the lower threshold
subset(babies, weight<LowerThreshold)
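
The chunk above reuses UpperThreshold and LowerThreshold as they were assigned earlier. A sketch of deriving thresholds from the interquartile range itself follows, assuming the common 1.5 x IQR rule (the multiplier used in class is not stated in this worksheet, so treat it as an assumption). The helper iqr_thresholds and the vector x are hypothetical, and the values are made up:

```r
# Lower and upper thresholds at k IQRs beyond the first and third quartiles
iqr_thresholds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]  # same value as IQR(x) with default settings
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up values with one extreme observation
x <- c(98, 100, 102, 104, 106, 108, 110, 112, 200)

th <- iqr_thresholds(x)
th

# Possible outliers under the IQR rule
flagged <- x[x < th[["lower"]] | x > th[["upper"]]]
flagged
```

For this toy vector the quartiles are 102 and 110, so the thresholds are 90 and 122 and only the value 200 is flagged.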

Do you think there are possible outliers and why?

Yes, because 16 observations lie above the upper threshold.

If you think there are possible outliers can you identify them? If so how many?

There are 16.

subset(babies, weight>UpperThreshold)
subset(babies, weight<LowerThreshold)

It can also be useful to visualize the data using a box plot.

boxplot(babies$weight)

Can you identify the outliers from the boxplot? If so how many?

It is difficult to tell from the plot, but there are at least 16.
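
When points overlap in the plot, the values drawn beyond the whiskers can be counted numerically instead of by eye. A minimal sketch using boxplot.stats() from base R, on made-up values (the vector x is hypothetical, not the babies data):

```r
# Made-up values with one extreme observation
x <- c(98, 100, 102, 104, 106, 108, 110, 112, 200)

# boxplot.stats() returns the same statistics boxplot() draws;
# $out holds the points plotted individually beyond the whiskers
stats <- boxplot.stats(x, coef = 1.5)
outside <- stats$out
outside

# Number of points a box plot would show as possible outliers
n_outside <- length(outside)
n_outside
```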

Task 3: Calculating Z-Value (0.25 points)

Given a birth weight of 182, calculate the corresponding z-value or z-score.

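#  Calculate the z-value and display it
#  (names chosen so the built-in mean() function is not masked)
weight_mean = mean(babies$weight)
weight_mean
[1] 128.4787
weight_sd = sd(babies$weight)
weight_sd
[1] 20.73428
zscore=(182-weight_mean)/weight_sd
zscore
[1] 2.581295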

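Based on the z-value, how would you evaluate the weight of this baby? Explain your logic.

The z-value of about 2.58 means this weight is roughly 2.58 standard deviations above the mean, which is unusually high. It is still below the 3-standard-deviation upper threshold (190.68) from Task 2, so it is at most a borderline outlier.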
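
The z-value is the distance from the mean measured in standard deviations, z = (x - mean) / sd. A small sketch that plugs in the rounded mean and standard deviation printed above (z_score is a hypothetical helper name):

```r
# z-value of a single observation given a mean and a standard deviation
z_score <- function(x, m, s) {
  (x - m) / s
}

# Rounded mean and standard deviation as printed in the output above
z <- z_score(182, m = 128.4787, s = 20.73428)
round(z, 3)

# TRUE when the value lies beyond 3 standard deviations from the mean
beyond_three_sd <- abs(z) > 3
beyond_three_sd
```

Because the inputs here are the rounded figures, the result agrees with the value computed from the full data only to about five decimal places.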

Task 4: Relationships with the variables (0.75 points)

How can you describe the relationship between baby weights and other variables?

#correlation between baby weight and other variables
cat("The correlation coefficient for bwt and Weight is:", cor(babies$bwt,babies$weight), "\n")
The correlation coefficient for bwt and Weight is: 0.1559233 
cat("The correlation coefficient for gestation and Weight is:", cor(babies$gestation,babies$weight), "\n")
The correlation coefficient for gestation and Weight is: 0.02365494 
cat("The correlation coefficient for parity and Weight is:", cor(babies$parity,babies$weight), "\n")
The correlation coefficient for parity and Weight is: -0.09636209 
cat("The correlation coefficient for height and Weight is:", cor(babies$height,babies$weight), "\n")
The correlation coefficient for height and Weight is: 0.4352874 
cat("The correlation coefficient for smoke and Weight is:", cor(babies$smoke,babies$weight), "\n")
The correlation coefficient for smoke and Weight is: -0.0602814 
cat("The correlation coefficient for age and Weight is:", cor(babies$age,babies$weight), "\n")
The correlation coefficient for age and Weight is: 0.1473221 

Your comments on these relationships: Most of these variables are not strongly correlated with weight. The only one that shows a moderate correlation is height, and even that relationship is fairly weak at 0.435.
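
The repeated cor() calls can also be written as one loop over the columns, which makes it easy to switch the reference variable (for example to bwt, the baby weight). A sketch on a made-up data frame, where toy, target, and cors are hypothetical names and the values are not from the babies file:

```r
# Toy data frame (made-up values) with a few of the worksheet's variables
toy <- data.frame(
  bwt       = c(110, 120, 115, 130, 125, 118),
  gestation = c(270, 284, 275, 290, 282, 279),
  height    = c(62, 64, 63, 66, 65, 61),
  smoke     = c(1, 0, 1, 0, 0, 1)
)

# Variable whose relationship with the others is of interest
target <- "bwt"
others <- setdiff(names(toy), target)

# Pearson correlation of the target with every other column
cors <- sapply(toy[others], function(col) cor(toy[[target]], col))
round(cors, 3)
```

The result is a named vector with one coefficient per variable, so the strongest and weakest relationships can be compared at a glance.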



