Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.
Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.
In this project, we will explore quantitative analytics using the babies data set provided. The data were collected to understand the different factors affecting baby weights at birth. Below are the explanations of the variables:
Bwt: Baby weight in ounces
Gestation: Length of pregnancy in days
Height: Mother’s height in inches
Smoke: 1 if the mother is a smoker, 0 if a nonsmoker
Age: Mother’s age in years
Weight: Mother’s pregnancy weight
Parity: 0 if the baby is first born, 1 otherwise
babies: Number of babies before this birth
Read this worksheet carefully and follow the instructions to complete the tasks and answer any questions. Submit your work as an HTML, PDF, or Word document to Sakai. If you are submitting as a team project, one submission from one of the team members is sufficient, but in that case you need to enter the names of all team members in the Title section above.
The babies_old data set will be checked for missing data. First, read the babies_old file and make sure it was read correctly, with the variables listed above.
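The worksheet does not show the read step itself, so the following is only a minimal sketch of how it might look. It assumes the data ships as a comma-separated file; because the actual file name and location are not given here, a temporary file holding the first three rows shown below stands in for it. The names `csv_path`, `babies_demo`, and `expected_cols` are hypothetical.

```r
# Assumption: the data is a CSV file. A temporary file stands in for the
# real one so that this sketch runs on its own.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Count,bwt,gestation,parity,height,weight,smoke,age",
  "1,120,284,0,62,100,0,27",
  "2,113,282,0,64,NA,0,33",
  "3,128,279,0,64,115,1,28"
), csv_path)

# Treat both "NA" and empty cells as missing values
babies_demo <- read.csv(csv_path, na.strings = c("NA", ""))

# Check that the expected variables were read
expected_cols <- c("Count", "bwt", "gestation", "parity",
                   "height", "weight", "smoke", "age")
all(expected_cols %in% names(babies_demo))

# Count missing values in the weight column
sum(is.na(babies_demo$weight))
```

For the real file, replacing `csv_path` with the path to the babies_old file should be enough, provided it is in CSV format.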
# Inspect structure and first rows
str(babies_old)
## 'data.frame': 1174 obs. of 8 variables:
## $ Count : int 1 2 3 4 5 6 7 8 9 10 ...
## $ bwt : int 120 113 128 108 136 138 132 120 143 140 ...
## $ gestation: int 284 282 279 282 286 244 245 289 299 351 ...
## $ parity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ height : int 62 64 64 67 62 62 65 62 66 68 ...
## $ weight : int 100 NA 115 125 93 178 140 125 136 120 ...
## $ smoke : int 0 0 1 1 0 0 0 0 1 0 ...
## $ age : int 27 33 28 23 25 33 23 25 30 27 ...
head(babies_old)
## Count bwt gestation parity height weight smoke age
## 1 1 120 284 0 62 100 0 27
## 2 2 113 282 0 64 NA 0 33
## 3 3 128 279 0 64 115 1 28
## 4 4 108 282 0 67 125 1 23
## 5 5 136 286 0 62 93 0 25
## 6 6 138 244 0 62 178 0 33
# Check your file for missing data
sum(is.na(babies_old)) # total number of NA values in the dataset
## [1] 11
colSums(is.na(babies_old)) # number of NAs per column
## Count bwt gestation parity height weight smoke age
## 0 0 0 0 0 11 0 0
In this dataset, there are 11 missing values in total, all in the weight column. Because each missing value sits in a different row, omission removes 11 rows and leaves 1163 of the original 1174 observations.
# Implement omission strategy and check if all the observations with NA's were omitted.
babies_old_omit <- na.omit(babies_old)
# confirm no missing values remain
sum(is.na(babies_old_omit))
## [1] 0
Mean imputation keeps all 1174 observations: the 11 missing weight values are replaced with the mean of the observed weights, and the other columns, which have no missing values, are unchanged.
# Implement mean imputation strategy (numeric columns)
babies_old_impute <- babies_old
# Replace missing values in each numeric column with the column mean
for (v in names(babies_old_impute)) {
if (is.numeric(babies_old_impute[[v]])) {
babies_old_impute[[v]][is.na(babies_old_impute[[v]])] <- mean(babies_old_impute[[v]], na.rm = TRUE)
}
}
# confirm no missing values remain
sum(is.na(babies_old_impute))
## [1] 0
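To make the difference between the two strategies concrete, here is a small sketch that applies both to the bwt and weight values of the first five rows shown above. The object names (`toy`, `toy_omit`, `toy_impute`) are hypothetical and used for illustration only.

```r
# First five rows of bwt and weight from head(babies_old); one weight is NA
toy <- data.frame(bwt    = c(120, 113, 128, 108, 136),
                  weight = c(100, NA, 115, 125, 93))

# Omission drops every row that contains an NA
toy_omit <- na.omit(toy)

# Mean imputation keeps every row and fills the gap with the observed mean
toy_impute <- toy
toy_impute$weight[is.na(toy_impute$weight)] <- mean(toy$weight, na.rm = TRUE)

nrow(toy) - nrow(toy_omit)  # rows lost to omission
nrow(toy_impute)            # rows kept by imputation
toy_impute$weight[2]        # the imputed value
```

Omission loses one row here, while imputation keeps all five. Mean imputation leaves the column mean unchanged but reduces its spread, which is worth keeping in mind when the imputed column is used later.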
First, read the babies.csv file and calculate the mean, standard deviation, maximum, and minimum for the bwt column using R.
# Read babies file (assumes babies.csv is in the working directory)
babies <- read.csv("babies.csv")
str(babies)
## 'data.frame': 1174 obs. of 8 variables:
## $ Count : int 1 2 3 4 5 6 7 8 9 10 ...
## $ bwt : int 120 113 128 108 136 138 132 120 143 140 ...
## $ gestation: int 284 282 279 282 286 244 245 289 299 351 ...
## $ parity : int 0 0 0 0 0 0 0 0 0 0 ...
## $ height : int 62 64 64 67 62 62 65 62 66 68 ...
## $ weight : int 100 135 115 125 93 178 140 125 136 120 ...
## $ smoke : int 0 0 1 1 0 0 0 0 1 0 ...
## $ age : int 27 33 28 23 25 33 23 25 30 27 ...
head(babies)
## Count bwt gestation parity height weight smoke age
## 1 1 120 284 0 62 100 0 27
## 2 2 113 282 0 64 135 0 33
## 3 3 128 279 0 64 115 1 28
## 4 4 108 282 0 67 125 1 23
## 5 5 136 286 0 62 93 0 25
## 6 6 138 244 0 62 178 0 33
# Calculate and display the average, standard deviation, maximum and minimum babies weight (bwt)
mean_bwt <- mean(babies$bwt)
sd_bwt <- sd(babies$bwt)
max_bwt <- max(babies$bwt)
min_bwt <- min(babies$bwt)
mean_bwt
## [1] 119.4625
sd_bwt
## [1] 18.32867
max_bwt
## [1] 176
min_bwt
## [1] 55
Use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data.
# Calculate the upper and lower thresholds for the babies weight using the standard approach (± 2 SD)
upper_sd <- mean_bwt + 2 * sd_bwt
lower_sd <- mean_bwt - 2 * sd_bwt
upper_sd
## [1] 156.1199
lower_sd
## [1] 82.80518
Do you think there are possible outliers? Why?
Yes. Using the standard ± 2 SD rule, any baby weight below 82.81 oz or above 156.12 oz is flagged as a possible outlier. Since there are observations outside these thresholds (see the upper and lower outlier tables), there are possible outliers in this dataset.
If you decided that you have possible outliers can you identify the observations with possible outliers?
# Observations with possible outliers at the upper end (standard approach)
upper_outliers_sd <- babies[babies$bwt > upper_sd, ]
upper_outliers_sd
## Count bwt gestation parity height weight smoke age
## 138 138 160 300 0 71 175 1 29
## 221 221 173 293 0 63 110 0 30
## 382 382 163 280 0 69 139 0 35
## 489 489 158 285 0 62 130 0 28
## 520 520 174 281 0 67 155 0 37
## 526 526 161 302 1 70 170 1 22
## 558 558 170 303 1 64 129 0 21
## 595 595 176 293 1 68 180 0 19
## 664 664 166 299 0 68 140 0 26
## 693 693 167 288 1 63 117 0 19
## 701 701 174 288 0 61 182 0 25
## 766 766 165 282 0 66 145 0 29
## 794 794 162 284 0 64 126 0 27
## 800 800 160 271 0 67 215 0 32
## 823 823 164 286 1 66 143 0 32
## 872 872 163 289 1 64 126 1 25
## 879 879 158 295 1 70 137 0 37
## 931 931 159 296 1 64 112 0 27
## 965 965 169 296 0 67 185 0 33
## 978 978 160 292 0 64 120 0 28
## 1035 1035 157 291 0 65 121 0 33
## 1042 1042 174 284 0 65 163 0 39
## 1058 1058 160 297 0 68 136 0 20
## 1063 1063 158 267 0 64 125 0 35
## 1065 1065 158 289 0 66 140 0 30
## 1067 1067 163 298 0 61 98 0 37
## 1105 1105 160 291 0 64 110 1 34
nrow(upper_outliers_sd)
## [1] 27
# Observations with possible outliers at the lower end (standard approach)
lower_outliers_sd <- babies[babies$bwt < lower_sd, ]
lower_outliers_sd
## Count bwt gestation parity height weight smoke age
## 57 57 75 232 0 61 110 0 33
## 178 178 75 239 0 63 124 1 26
## 240 240 81 256 0 64 148 1 30
## 279 279 80 266 1 62 125 0 25
## 288 288 71 281 0 60 117 1 32
## 336 336 71 234 0 64 110 1 32
## 431 431 68 223 0 66 149 1 32
## 436 436 78 256 1 65 123 0 29
## 466 466 69 232 0 59 103 1 31
## 493 493 71 277 0 69 135 0 40
## 588 588 79 268 0 61 108 0 36
## 655 655 78 237 1 63 144 0 23
## 683 683 81 254 0 62 157 0 23
## 780 780 77 238 1 63 103 1 23
## 781 781 62 228 0 61 107 0 24
## 852 852 72 271 0 61 136 0 39
## 860 860 58 245 0 64 156 1 34
## 874 874 77 238 0 67 135 1 38
## 923 923 55 204 0 65 140 0 35
## 959 959 78 258 1 66 115 1 24
## 964 964 75 247 0 64 120 1 36
## 979 979 65 237 0 67 130 0 31
## 1002 1002 80 262 1 61 100 1 31
## 1006 1006 73 277 0 65 145 0 29
## 1008 1008 65 232 0 66 125 1 24
## 1082 1082 63 236 1 58 99 0 24
## 1091 1091 72 266 1 66 200 1 25
## 1092 1092 75 266 0 61 113 1 37
## 1112 1112 71 254 0 61 145 1 19
## 1113 1113 82 270 0 65 150 1 21
## 1121 1121 82 274 0 64 101 1 31
## 1151 1151 75 265 0 65 103 1 21
## 1157 1157 81 285 0 63 150 1 19
nrow(lower_outliers_sd)
## [1] 33
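The two subsetting steps above can also be wrapped in one helper that flags both ends at once. This is only a sketch: `flag_sd_outliers` is a hypothetical function, and the toy vector is made up for illustration and is not the babies data.

```r
# Hypothetical helper: flag values more than k standard deviations from the mean
flag_sd_outliers <- function(x, k = 2) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  data.frame(value = x,
             low   = x < m - k * s,   # below the lower threshold
             high  = x > m + k * s)   # above the upper threshold
}

# Toy vector for illustration only (not the babies data)
toy_bwt <- c(118, 120, 122, 119, 121, 117, 123, 120, 60, 180)
flags <- flag_sd_outliers(toy_bwt)

# Number of flagged values at each end
c(lower = sum(flags$low), upper = sum(flags$high))
```

Calling `flag_sd_outliers(babies$bwt)` should give the same counts as the two `nrow()` calls above, since it applies the same ± 2 SD thresholds.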
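What is the number of outliers at the lower end and upper end of the data?
Under the ± 2 SD rule there are 33 possible outliers at the lower end and 27 at the upper end, for 60 in total.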
Another similar method to find the upper and lower thresholds discussed in class involves finding the interquartile range. Find the interquartile range using the following chunk.
# Find the interquartile range (IQR) for baby weights
Q1 <- quantile(babies$bwt, 0.25)
Q3 <- quantile(babies$bwt, 0.75)
IQR_bwt <- IQR(babies$bwt)
Q1
## 25%
## 108
Q3
## 75%
## 131
IQR_bwt
## [1] 23
The thresholds are the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is a possible outlier.
# calculate the upper threshold (IQR method)
upper_iqr <- Q3 + 1.5 * IQR_bwt
upper_iqr
## 75%
## 165.5
# calculate the lower threshold (IQR method)
lower_iqr <- Q1 - 1.5 * IQR_bwt
lower_iqr
## 25%
## 73.5
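The `75%` and `25%` labels in the output above are carried over from `quantile()`. As a sketch, both thresholds can be computed in one hypothetical helper, `iqr_fences`, that drops those labels with `names = FALSE`. The toy vector is made up for illustration only.

```r
# Hypothetical helper: thresholds at Q1 - k * IQR and Q3 + k * IQR
iqr_fences <- function(x, k = 1.5) {
  # names = FALSE drops the "25%" / "75%" labels from the result
  q <- quantile(x, c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Toy vector for illustration only (not the babies data)
toy_bwt <- c(100, 105, 110, 115, 120, 125, 130, 135, 200)
fences <- iqr_fences(toy_bwt)
fences

# Values outside the thresholds are possible outliers
toy_bwt[toy_bwt < fences["lower"] | toy_bwt > fences["upper"]]
```

For the toy vector the quartiles are 110 and 130, so the thresholds are 80 and 160 and only the value 200 is flagged.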
Do you think there are possible outliers and why?
Yes. With the IQR method, any value below 73.5 oz or above 165.5 oz is considered a possible outlier. Since the code identifies observations outside these IQR thresholds, there are possible outliers under this approach as well.
If you think there are possible outliers can you identify them? If so how many?
# Identify outliers using the IQR method
outliers_iqr <- babies[babies$bwt > upper_iqr | babies$bwt < lower_iqr, ]
outliers_iqr
## Count bwt gestation parity height weight smoke age
## 221 221 173 293 0 63 110 0 30
## 288 288 71 281 0 60 117 1 32
## 336 336 71 234 0 64 110 1 32
## 431 431 68 223 0 66 149 1 32
## 466 466 69 232 0 59 103 1 31
## 493 493 71 277 0 69 135 0 40
## 520 520 174 281 0 67 155 0 37
## 558 558 170 303 1 64 129 0 21
## 595 595 176 293 1 68 180 0 19
## 664 664 166 299 0 68 140 0 26
## 693 693 167 288 1 63 117 0 19
## 701 701 174 288 0 61 182 0 25
## 781 781 62 228 0 61 107 0 24
## 852 852 72 271 0 61 136 0 39
## 860 860 58 245 0 64 156 1 34
## 923 923 55 204 0 65 140 0 35
## 965 965 169 296 0 67 185 0 33
## 979 979 65 237 0 67 130 0 31
## 1006 1006 73 277 0 65 145 0 29
## 1008 1008 65 232 0 66 125 1 24
## 1042 1042 174 284 0 65 163 0 39
## 1082 1082 63 236 1 58 99 0 24
## 1091 1091 72 266 1 66 200 1 25
## 1112 1112 71 254 0 61 145 1 19
nrow(outliers_iqr)
## [1] 24
It can also be useful to visualize the data using a box plot.
boxplot(babies$bwt,
main = "Boxplot of Baby Weights (bwt)",
ylab = "Baby weight (ounces)")
Can you identify the outliers from the boxplot? If so how many?
Yes. The boxplot marks outliers as individual points beyond the whiskers. The number of IQR-based outliers is 24.
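Counting points by eye on a plot is error-prone, so one option is to ask R which points the boxplot would draw as outliers. The sketch below uses `boxplot.stats()` on a made-up toy vector; the variable names are hypothetical.

```r
# Toy vector for illustration only (not the babies data)
toy_bwt <- c(10, 11, 12, 13, 14, 15, 16, 100)

# $out holds the values that boxplot() would draw beyond the whiskers
out <- boxplot.stats(toy_bwt)$out
out
length(out)
```

Note that `boxplot.stats()` builds the box from hinges rather than from `quantile()`, so on some data sets its thresholds can differ slightly from the ones calculated with the IQR method above.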
How can you describe the relationship between baby weights and other variables?
# Correlation between baby weight and other variables
# (the columns are selected explicitly; all of them are numeric)
cor_matrix <- cor(babies[, c("bwt","gestation","height","age","weight","parity","smoke")])
cor_matrix
## bwt gestation height age weight
## bwt 1.00000000 0.40754279 0.203704177 0.026982911 0.15592327
## gestation 0.40754279 1.00000000 0.070469902 -0.053424774 0.02365494
## height 0.20370418 0.07046990 1.000000000 -0.006452846 0.43528743
## age 0.02698291 -0.05342477 -0.006452846 1.000000000 0.14732211
## weight 0.15592327 0.02365494 0.435287428 0.147322111 1.00000000
## parity -0.04390817 0.08091603 0.043543487 -0.351040648 -0.09636209
## smoke -0.24679951 -0.06026684 0.017506595 -0.067771942 -0.06028140
## parity smoke
## bwt -0.043908173 -0.246799515
## gestation 0.080916029 -0.060266842
## height 0.043543487 0.017506595
## age -0.351040648 -0.067771942
## weight -0.096362092 -0.060281396
## parity 1.000000000 -0.009598971
## smoke -0.009598971 1.000000000
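A full correlation matrix can be hard to read. As a sketch, the row for bwt can be pulled out and ordered by strength, which may make the comments easier to write. The toy data frame below reuses three columns from the first six rows shown earlier, so its correlations will not match the full-data values above; the names `toy`, `cor_toy`, `bwt_cor`, and `ranked` are hypothetical.

```r
# Three columns from the first six rows of babies (illustration only)
toy <- data.frame(
  bwt       = c(120, 113, 128, 108, 136, 138),
  gestation = c(284, 282, 279, 282, 286, 244),
  smoke     = c(0, 0, 1, 1, 0, 0)
)

cor_toy <- cor(toy)

# Keep only the correlations of bwt with the other variables
bwt_cor <- cor_toy["bwt", colnames(cor_toy) != "bwt"]

# Order from strongest to weakest, ignoring the sign
ranked <- bwt_cor[order(abs(bwt_cor), decreasing = TRUE)]
ranked
```

Applying the same two steps to `cor_matrix` would list the variables from the strongest to the weakest linear relationship with bwt.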
Your comments on these relationships: