About

Qualitative Descriptive Analytics aims to gather an in-depth understanding of the underlying reasons and motivations for an event or observation. It is typically represented with visuals or charts.

Quantitative Descriptive Analytics focuses on investigating a phenomenon via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers.

In this project, we will explore quantitative analytics using the babies data set provided. These data were collected to understand the different factors affecting baby weights at birth. Below are the explanations of the variables:

  • Bwt: Baby weight in ounces
  • Gestation: Length of pregnancy in days
  • Height: Mother’s height in inches
  • Smoke: = 1 if the mother is a smoker, = 0 if nonsmoker
  • Age: Mother’s age in years
  • Weight: Mother’s pregnancy weight
  • Parity: = 0 if the baby is first born, = 1 otherwise
  • babies: number of babies before this birth

Read this worksheet carefully and follow the instructions to complete the tasks and answer any questions. Submit your work as an HTML, PDF, or Word document to Sakai. If you are submitting as a team project, one submission from one of the team members is sufficient, but in that case you need to enter the names of the team members in the Title section above.

Task 1: Testing for missing data (1 point)

We will check whether the babies_old data set contains any missing data. First, read the babies_old file and make sure it was read correctly with the variables listed above.

# Inspect structure and first rows
str(babies_old)
## 'data.frame':    1174 obs. of  8 variables:
##  $ Count    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ bwt      : int  120 113 128 108 136 138 132 120 143 140 ...
##  $ gestation: int  284 282 279 282 286 244 245 289 299 351 ...
##  $ parity   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ height   : int  62 64 64 67 62 62 65 62 66 68 ...
##  $ weight   : int  100 NA 115 125 93 178 140 125 136 120 ...
##  $ smoke    : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ age      : int  27 33 28 23 25 33 23 25 30 27 ...
head(babies_old)
##   Count bwt gestation parity height weight smoke age
## 1     1 120       284      0     62    100     0  27
## 2     2 113       282      0     64     NA     0  33
## 3     3 128       279      0     64    115     1  28
## 4     4 108       282      0     67    125     1  23
## 5     5 136       286      0     62     93     0  25
## 6     6 138       244      0     62    178     0  33

Find the NAs in the babies_old file

# Check your file for missing data
sum(is.na(babies_old))          # total number of NA values in the dataset
## [1] 11
colSums(is.na(babies_old))      # number of NAs per column
##     Count       bwt gestation    parity    height    weight     smoke       age 
##         0         0         0         0         0        11         0         0

Implement omission strategy to remove observations with missing values.

In this dataset, there are 11 missing values in total, all in the weight column. Each one sits in a different row, so omission removes 11 observations and leaves 1163 complete rows.

# Implement omission strategy and check if all the observations with NA's were omitted.
babies_old_omit <- na.omit(babies_old)

# confirm no missing values remain
sum(is.na(babies_old_omit))
## [1] 0
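
To see how many rows the omission strategy drops, the row counts before and after can be compared. The following is a minimal sketch on a small made-up data frame (`toy` and its values are illustrative assumptions, not taken from babies_old); the same comparison should carry over to babies_old and babies_old_omit.

```r
# Hypothetical toy data: two of the five rows have a missing weight
toy <- data.frame(
  bwt    = c(120, 113, 128, 108, 136),
  weight = c(100, NA, 115, NA, 93)
)

toy_omit <- na.omit(toy)                    # drop every row that contains an NA

rows_removed  <- nrow(toy) - nrow(toy_omit) # rows dropped by omission
complete_rows <- sum(complete.cases(toy))   # rows without any NA

rows_removed
complete_rows
```

Comparing `nrow()` before and after is a quick check that omission removed exactly the rows with missing values and nothing else.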

Implement mean imputation strategy

All 11 missing values are in the weight column, so mean imputation replaces each of them with the mean of the observed weight values. The other columns have no missing values and are left unchanged.

# Implement mean imputation strategy (numeric columns)
babies_old_impute <- babies_old

# Replace missing values in each numeric column with the column mean
for (v in names(babies_old_impute)) {
  if (is.numeric(babies_old_impute[[v]])) {
    babies_old_impute[[v]][is.na(babies_old_impute[[v]])] <- mean(babies_old_impute[[v]], na.rm = TRUE)
  }
}

# confirm no missing values remain
sum(is.na(babies_old_impute))
## [1] 0
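
One way to confirm that the imputation did what was intended is to check that the filled-in entries equal the mean of the observed values. The sketch below uses a made-up vector (`weight_toy` is a hypothetical name with illustrative values), so it shows the mechanics only.

```r
# Hypothetical toy vector with two missing entries
weight_toy <- c(100, NA, 115, 125, NA)

observed_mean <- mean(weight_toy, na.rm = TRUE)  # mean of the observed values only
missing_idx   <- is.na(weight_toy)               # remember which entries were missing

weight_imputed <- weight_toy
weight_imputed[missing_idx] <- observed_mean     # fill the gaps with the observed mean

weight_imputed[missing_idx]                      # the imputed entries
mean(weight_imputed)                             # same as observed_mean
```

Mean imputation leaves the column mean unchanged, but it tends to shrink the standard deviation because the imputed entries add no spread.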

Task 2: Testing for Outliers (1 point)

Read the babies.csv file and calculate the statistics required to find out possible outliers for baby weights.

First, read the babies.csv file and calculate the mean, standard deviation, maximum, and minimum for the bwt column using R.

# Read babies file (assumes babies.csv is in the working directory)
babies <- read.csv("babies.csv")
str(babies)
## 'data.frame':    1174 obs. of  8 variables:
##  $ Count    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ bwt      : int  120 113 128 108 136 138 132 120 143 140 ...
##  $ gestation: int  284 282 279 282 286 244 245 289 299 351 ...
##  $ parity   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ height   : int  62 64 64 67 62 62 65 62 66 68 ...
##  $ weight   : int  100 135 115 125 93 178 140 125 136 120 ...
##  $ smoke    : int  0 0 1 1 0 0 0 0 1 0 ...
##  $ age      : int  27 33 28 23 25 33 23 25 30 27 ...
head(babies)
##   Count bwt gestation parity height weight smoke age
## 1     1 120       284      0     62    100     0  27
## 2     2 113       282      0     64    135     0  33
## 3     3 128       279      0     64    115     1  28
## 4     4 108       282      0     67    125     1  23
## 5     5 136       286      0     62     93     0  25
## 6     6 138       244      0     62    178     0  33
# Calculate and display the average, standard deviation, maximum and minimum babies weight (bwt)
mean_bwt <- mean(babies$bwt)
sd_bwt <- sd(babies$bwt)
max_bwt <- max(babies$bwt)
min_bwt <- min(babies$bwt)

mean_bwt
## [1] 119.4625
sd_bwt
## [1] 18.32867
max_bwt
## [1] 176
min_bwt
## [1] 55

Use the standard approach to find out possible outliers

Use the formula from class to detect any outliers. An outlier is a value that “lies outside” most of the other values in a set of data.

# Calculate the upper and lower thresholds for the babies weight using the standard approach (± 2 SD)
upper_sd <- mean_bwt + 2 * sd_bwt
lower_sd <- mean_bwt - 2 * sd_bwt

upper_sd
## [1] 156.1199
lower_sd
## [1] 82.80518

Do you think there are possible outliers? Why?

Yes. Using the standard ± 2 SD rule, any baby weight below 82.81 oz or above 156.12 oz is flagged as a possible outlier. Since there are observations outside these thresholds (see the upper and lower outlier tables), there are possible outliers in this dataset.
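
The ± 2 SD rule can be wrapped in a small helper so that the thresholds and the flagged values come from one place. This is a sketch under stated assumptions: `flag_sd_outliers` is a hypothetical function name and `toy_bwt` is made-up data, not part of the babies file.

```r
# Hypothetical helper: flag values more than k standard deviations from the mean
flag_sd_outliers <- function(x, k = 2) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  lower <- m - k * s
  upper <- m + k * s
  list(lower = lower,
       upper = upper,
       is_outlier = !is.na(x) & (x < lower | x > upper))
}

# Made-up weights in ounces; the last value is deliberately extreme
toy_bwt <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 180)

res <- flag_sd_outliers(toy_bwt)
toy_bwt[res$is_outlier]   # values outside mean ± 2 SD
sum(res$is_outlier)       # how many values were flagged
```

Calling the helper on babies$bwt should give the same thresholds as the chunk above, since it applies the same formula.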

If you decided that you have possible outliers can you identify the observations with possible outliers?

# Observations with possible outliers at the upper end (standard approach)
upper_outliers_sd <- babies[babies$bwt > upper_sd, ]
upper_outliers_sd
##      Count bwt gestation parity height weight smoke age
## 138    138 160       300      0     71    175     1  29
## 221    221 173       293      0     63    110     0  30
## 382    382 163       280      0     69    139     0  35
## 489    489 158       285      0     62    130     0  28
## 520    520 174       281      0     67    155     0  37
## 526    526 161       302      1     70    170     1  22
## 558    558 170       303      1     64    129     0  21
## 595    595 176       293      1     68    180     0  19
## 664    664 166       299      0     68    140     0  26
## 693    693 167       288      1     63    117     0  19
## 701    701 174       288      0     61    182     0  25
## 766    766 165       282      0     66    145     0  29
## 794    794 162       284      0     64    126     0  27
## 800    800 160       271      0     67    215     0  32
## 823    823 164       286      1     66    143     0  32
## 872    872 163       289      1     64    126     1  25
## 879    879 158       295      1     70    137     0  37
## 931    931 159       296      1     64    112     0  27
## 965    965 169       296      0     67    185     0  33
## 978    978 160       292      0     64    120     0  28
## 1035  1035 157       291      0     65    121     0  33
## 1042  1042 174       284      0     65    163     0  39
## 1058  1058 160       297      0     68    136     0  20
## 1063  1063 158       267      0     64    125     0  35
## 1065  1065 158       289      0     66    140     0  30
## 1067  1067 163       298      0     61     98     0  37
## 1105  1105 160       291      0     64    110     1  34
nrow(upper_outliers_sd)
## [1] 27
# Observations with possible outliers at the lower end (standard approach)
lower_outliers_sd <- babies[babies$bwt < lower_sd, ]
lower_outliers_sd
##      Count bwt gestation parity height weight smoke age
## 57      57  75       232      0     61    110     0  33
## 178    178  75       239      0     63    124     1  26
## 240    240  81       256      0     64    148     1  30
## 279    279  80       266      1     62    125     0  25
## 288    288  71       281      0     60    117     1  32
## 336    336  71       234      0     64    110     1  32
## 431    431  68       223      0     66    149     1  32
## 436    436  78       256      1     65    123     0  29
## 466    466  69       232      0     59    103     1  31
## 493    493  71       277      0     69    135     0  40
## 588    588  79       268      0     61    108     0  36
## 655    655  78       237      1     63    144     0  23
## 683    683  81       254      0     62    157     0  23
## 780    780  77       238      1     63    103     1  23
## 781    781  62       228      0     61    107     0  24
## 852    852  72       271      0     61    136     0  39
## 860    860  58       245      0     64    156     1  34
## 874    874  77       238      0     67    135     1  38
## 923    923  55       204      0     65    140     0  35
## 959    959  78       258      1     66    115     1  24
## 964    964  75       247      0     64    120     1  36
## 979    979  65       237      0     67    130     0  31
## 1002  1002  80       262      1     61    100     1  31
## 1006  1006  73       277      0     65    145     0  29
## 1008  1008  65       232      0     66    125     1  24
## 1082  1082  63       236      1     58     99     0  24
## 1091  1091  72       266      1     66    200     1  25
## 1092  1092  75       266      0     61    113     1  37
## 1112  1112  71       254      0     61    145     1  19
## 1113  1113  82       270      0     65    150     1  21
## 1121  1121  82       274      0     64    101     1  31
## 1151  1151  75       265      0     65    103     1  21
## 1157  1157  81       285      0     63    150     1  19
nrow(lower_outliers_sd)
## [1] 33

What is the number of outliers at the lower end and upper end of the data?

  • Upper-end outliers (standard ± 2 SD): 27
  • Lower-end outliers (standard ± 2 SD): 33

Use the IQR approach to identify outliers (1 point)

Another method discussed in class for finding the upper and lower thresholds uses the interquartile range. Find the interquartile range using the following chunk.

# Find the interquartile range (IQR) for baby weights
Q1 <- quantile(babies$bwt, 0.25)
Q3 <- quantile(babies$bwt, 0.75)
IQR_bwt <- IQR(babies$bwt)

Q1
## 25% 
## 108
Q3
## 75% 
## 131
IQR_bwt
## [1] 23

The thresholds are the boundaries that determine whether a value is an outlier. If a value falls above the upper threshold or below the lower threshold, it is a possible outlier.

# calculate the upper threshold (IQR method)
upper_iqr <- Q3 + 1.5 * IQR_bwt
upper_iqr
##   75% 
## 165.5
# calculate the lower threshold (IQR method)
lower_iqr <- Q1 - 1.5 * IQR_bwt
lower_iqr
##  25% 
## 73.5

Do you think there are possible outliers and why?

Yes. With the IQR method, any value below 73.5 oz or above 165.5 oz is considered a possible outlier. Since the code identifies observations outside these IQR thresholds, there are possible outliers under this approach as well.
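
The IQR rule can be sketched as a helper in the same way. `flag_iqr_outliers` is a hypothetical function name and `toy_bwt` is made-up data. Passing `names = FALSE` to `quantile()` drops the "25%" and "75%" labels that otherwise carry over into the thresholds, as they do in the output above.

```r
# Hypothetical helper: flag values beyond k * IQR from the quartiles
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                  # interquartile range
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  list(lower = lower,
       upper = upper,
       is_outlier = !is.na(x) & (x < lower | x > upper))
}

# Made-up weights in ounces; the last value is deliberately extreme
toy_bwt <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 180)

res_iqr <- flag_iqr_outliers(toy_bwt)
res_iqr$lower                    # lower threshold
res_iqr$upper                    # upper threshold
toy_bwt[res_iqr$is_outlier]      # values outside the thresholds
```

Because quartiles are not pulled around by extreme values the way the mean and standard deviation are, the IQR thresholds are usually less sensitive to the outliers themselves.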

If you think there are possible outliers can you identify them? If so how many?

# Identify outliers using the IQR method
outliers_iqr <- babies[babies$bwt > upper_iqr | babies$bwt < lower_iqr, ]
outliers_iqr
##      Count bwt gestation parity height weight smoke age
## 221    221 173       293      0     63    110     0  30
## 288    288  71       281      0     60    117     1  32
## 336    336  71       234      0     64    110     1  32
## 431    431  68       223      0     66    149     1  32
## 466    466  69       232      0     59    103     1  31
## 493    493  71       277      0     69    135     0  40
## 520    520 174       281      0     67    155     0  37
## 558    558 170       303      1     64    129     0  21
## 595    595 176       293      1     68    180     0  19
## 664    664 166       299      0     68    140     0  26
## 693    693 167       288      1     63    117     0  19
## 701    701 174       288      0     61    182     0  25
## 781    781  62       228      0     61    107     0  24
## 852    852  72       271      0     61    136     0  39
## 860    860  58       245      0     64    156     1  34
## 923    923  55       204      0     65    140     0  35
## 965    965 169       296      0     67    185     0  33
## 979    979  65       237      0     67    130     0  31
## 1006  1006  73       277      0     65    145     0  29
## 1008  1008  65       232      0     66    125     1  24
## 1042  1042 174       284      0     65    163     0  39
## 1082  1082  63       236      1     58     99     0  24
## 1091  1091  72       266      1     66    200     1  25
## 1112  1112  71       254      0     61    145     1  19
nrow(outliers_iqr)
## [1] 24

It can also be useful to visualize the data using a box plot.

boxplot(babies$bwt,
        main = "Boxplot of Baby Weights (bwt)",
        ylab = "Baby weight (ounces)")

Can you identify the outliers from the boxplot? If so how many?

Yes. The boxplot marks outliers as individual points beyond the whiskers. The number of IQR-based outliers is 24.
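
Counting points on a plot by eye is error-prone, so it may help to ask R for the values it draws beyond the whiskers. The sketch below uses `boxplot.stats()` on made-up data (`toy_bwt` is a hypothetical vector); its `out` component holds the points plotted individually.

```r
# Made-up weights in ounces; the last value is deliberately extreme
toy_bwt <- c(118, 120, 121, 119, 122, 117, 120, 123, 119, 180)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default
stats <- boxplot.stats(toy_bwt)

stats$out           # values drawn as individual points
length(stats$out)   # number of boxplot outliers
```

Note that `boxplot.stats()` works from hinges rather than the quartiles returned by `quantile()`, so on some data sets its count can differ slightly from the IQR calculation.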

Task 4: Relationships with the variables (1 point)

How can you describe the relationship between baby weights and other variables?

# Correlation between baby weight and other variables
# (the columns are selected explicitly by name; all of them are numeric)
cor_matrix <- cor(babies[, c("bwt","gestation","height","age","weight","parity","smoke")])
cor_matrix
##                   bwt   gestation       height          age      weight
## bwt        1.00000000  0.40754279  0.203704177  0.026982911  0.15592327
## gestation  0.40754279  1.00000000  0.070469902 -0.053424774  0.02365494
## height     0.20370418  0.07046990  1.000000000 -0.006452846  0.43528743
## age        0.02698291 -0.05342477 -0.006452846  1.000000000  0.14732211
## weight     0.15592327  0.02365494  0.435287428  0.147322111  1.00000000
## parity    -0.04390817  0.08091603  0.043543487 -0.351040648 -0.09636209
## smoke     -0.24679951 -0.06026684  0.017506595 -0.067771942 -0.06028140
##                 parity        smoke
## bwt       -0.043908173 -0.246799515
## gestation  0.080916029 -0.060266842
## height     0.043543487  0.017506595
## age       -0.351040648 -0.067771942
## weight    -0.096362092 -0.060281396
## parity     1.000000000 -0.009598971
## smoke     -0.009598971  1.000000000
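
When commenting on the relationships, it can be easier to read only the bwt row of the matrix, ordered by strength. The sketch below shows the mechanics on the six rows printed by head(babies) above; `toy`, `bwt_cor`, and `ranked` are hypothetical names, parity is left out because it is constant in those six rows, and six observations are far too few to draw conclusions from.

```r
# The six rows shown by head(babies), without Count and parity
toy <- data.frame(
  bwt       = c(120, 113, 128, 108, 136, 138),
  gestation = c(284, 282, 279, 282, 286, 244),
  height    = c(62, 64, 64, 67, 62, 62),
  weight    = c(100, 135, 115, 125, 93, 178),
  smoke     = c(0, 0, 1, 1, 0, 0),
  age       = c(27, 33, 28, 23, 25, 33)
)

cor_toy <- cor(toy)

# Keep only the correlations of bwt with the other variables
bwt_cor <- cor_toy["bwt", colnames(cor_toy) != "bwt"]

# Order by absolute strength, strongest first
ranked <- bwt_cor[order(abs(bwt_cor), decreasing = TRUE)]
ranked
```

Applied to cor_matrix, the same two lines would list the variables from the strongest to the weakest linear association with baby weight, with the sign showing the direction.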

Your comments on these relationships: