FDA Approval of AI Algorithms in Medicine data set (dataset#1).

This is the wide table of the data set I have chosen:

Here it is in table format form the loaded data frame from the CSV file imported into R:

library(kableExtra)
df <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/AI-FDA_wide-table.csv")

df %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
Type.of.FDA.approval Mention.of.AI.in.announcement X X.1 X.2 X.3 Modality X.4 X.5 X.6 X.7 X.8 Clinical.Discipline X.9 X.10 X.11
artificial intelligence machine learning deep learning NA Total CT Ultrasound ECG MRI Mammography Brain Cardiology Neurology Radiology Other
510(k) premarket notification 13 8 11 34 66 7 1 9 6 5 5 14 4 34 12
de novo pathway 5 0 1 4 10 1 1 2 0 0 1 2 1 4 3
PMA 1 0 1 1 3 1 1 0 0 0 0 0 0 1 1
Total 19 8 13 39 79 9 3 11 6 5 6 16 5 39 16

Project 2 starts here:

Tidying and transforming the data

The current dataset is not tidy, because each row is not an observation, rather the current format of the dataset shows a summary snapshot at the dataset (similar to a Table 1 in a research paper). The first step transformation I want to do: currently, the 4 rows are needed to become columns, and the columns need to become rows. The current columns of “artificial intelligence”, “machine learning”, “deep learning”, “NA”, and “Total” are under the parent label of “Mention of AI in announcement”. Similarly, the current columns of “CT”, “Ultrasound”, “ECG”, “MRI”, “Mammography”, and “Brain” are under the parent label of “Modality”. Also similarly, the current columns of “Cardiology”, “Neurology”, “Radiology”, and “Other” are under the parent label of “Clinical Discipline”. So considering the abbreviations of the following in ():

  • Mention of AI in announcement (MA)
  • Modality (M)
  • Clinical Discipline (CD)

when transforming this wide into long format by switching columns to rows and rows to columns, the daughter column labels will be coupled with the parent column label to create the new row labels (for example “MA.machine_learning”).

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)

## Data Transformations

# column names to include the parent label as a prefix
new_col_names <- c("Type.of.FDA.approval", 
                   "MA.artificial_intelligence", "MA.machine_learning", "MA.deep_learning", "MA.NA", "MA.Total", 
                   "M.CT", "M.Ultrasound", "M.ECG", "M.MRI", "M.Mammography", "M.Brain", 
                   "CD.Cardiology", "CD.Neurology", "CD.Radiology", "CD.Other")
# change column names
names(df) <- new_col_names

# drop row 1 (old daughter column labels)
df <- df[-1, ]

# drop Total column and Total row to avoid confusion and clean up as a data set
df1 <- df[ , -6]
df1 <- df1[-4 , ]

# Convert all but the first column to numeric
df1[, -1] <- lapply(df1[, -1], function(x) as.numeric(as.character(x)))

#--- --- ---

# Transformation of the data from wide to long format
df1_long <- pivot_longer(
  data = df1,
  cols = -Type.of.FDA.approval, # Exclude the identifier column
  names_to = "Category", # This will hold the names of the current columns
  values_to = "Count" # This will hold the counts
)

print(df1_long)
## # A tibble: 42 × 3
##    Type.of.FDA.approval          Category                   Count
##    <chr>                         <chr>                      <dbl>
##  1 510(k) premarket notification MA.artificial_intelligence    13
##  2 510(k) premarket notification MA.machine_learning            8
##  3 510(k) premarket notification MA.deep_learning              11
##  4 510(k) premarket notification MA.NA                         34
##  5 510(k) premarket notification M.CT                           7
##  6 510(k) premarket notification M.Ultrasound                   1
##  7 510(k) premarket notification M.ECG                          9
##  8 510(k) premarket notification M.MRI                          6
##  9 510(k) premarket notification M.Mammography                  5
## 10 510(k) premarket notification M.Brain                        5
## # ℹ 32 more rows
#General summary snapshot (not final but helpful)
summary(df1)
##  Type.of.FDA.approval MA.artificial_intelligence MA.machine_learning
##  Length:3             Min.   : 1.000             Min.   :0.000      
##  Class :character     1st Qu.: 3.000             1st Qu.:0.000      
##  Mode  :character     Median : 5.000             Median :0.000      
##                       Mean   : 6.333             Mean   :2.667      
##                       3rd Qu.: 9.000             3rd Qu.:4.000      
##                       Max.   :13.000             Max.   :8.000      
##  MA.deep_learning     MA.NA           M.CT    M.Ultrasound     M.ECG      
##  Min.   : 1.000   Min.   : 1.0   Min.   :1   Min.   :1     Min.   :0.000  
##  1st Qu.: 1.000   1st Qu.: 2.5   1st Qu.:1   1st Qu.:1     1st Qu.:1.000  
##  Median : 1.000   Median : 4.0   Median :1   Median :1     Median :2.000  
##  Mean   : 4.333   Mean   :13.0   Mean   :3   Mean   :1     Mean   :3.667  
##  3rd Qu.: 6.000   3rd Qu.:19.0   3rd Qu.:4   3rd Qu.:1     3rd Qu.:5.500  
##  Max.   :11.000   Max.   :34.0   Max.   :7   Max.   :1     Max.   :9.000  
##      M.MRI   M.Mammography      M.Brain    CD.Cardiology     CD.Neurology  
##  Min.   :0   Min.   :0.000   Min.   :0.0   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:0   1st Qu.:0.000   1st Qu.:0.5   1st Qu.: 1.000   1st Qu.:0.500  
##  Median :0   Median :0.000   Median :1.0   Median : 2.000   Median :1.000  
##  Mean   :2   Mean   :1.667   Mean   :2.0   Mean   : 5.333   Mean   :1.667  
##  3rd Qu.:3   3rd Qu.:2.500   3rd Qu.:3.0   3rd Qu.: 8.000   3rd Qu.:2.500  
##  Max.   :6   Max.   :5.000   Max.   :5.0   Max.   :14.000   Max.   :4.000  
##   CD.Radiology     CD.Other     
##  Min.   : 1.0   Min.   : 1.000  
##  1st Qu.: 2.5   1st Qu.: 2.000  
##  Median : 4.0   Median : 3.000  
##  Mean   :13.0   Mean   : 5.333  
##  3rd Qu.:19.0   3rd Qu.: 7.500  
##  Max.   :34.0   Max.   :12.000

1- Descriptive analyses: FDA approvals within a specific clinical discipline

# counting FDA approvals within a specific clinical discipline
total_approvals <- colSums(df1[,-1])
total_approvals
## MA.artificial_intelligence        MA.machine_learning 
##                         19                          8 
##           MA.deep_learning                      MA.NA 
##                         13                         39 
##                       M.CT               M.Ultrasound 
##                          9                          3 
##                      M.ECG                      M.MRI 
##                         11                          6 
##              M.Mammography                    M.Brain 
##                          5                          6 
##              CD.Cardiology               CD.Neurology 
##                         16                          5 
##               CD.Radiology                   CD.Other 
##                         39                         16
# mean number of approvals
mean_approvals <- colMeans(df1[1:(nrow(df)-1), -1])
mean_approvals
## MA.artificial_intelligence        MA.machine_learning 
##                   6.333333                   2.666667 
##           MA.deep_learning                      MA.NA 
##                   4.333333                  13.000000 
##                       M.CT               M.Ultrasound 
##                   3.000000                   1.000000 
##                      M.ECG                      M.MRI 
##                   3.666667                   2.000000 
##              M.Mammography                    M.Brain 
##                   1.666667                   2.000000 
##              CD.Cardiology               CD.Neurology 
##                   5.333333                   1.666667 
##               CD.Radiology                   CD.Other 
##                  13.000000                   5.333333
summary_by_discipline <- lapply(df1[,-1], summary)
summary_by_discipline
## $MA.artificial_intelligence
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   6.333   9.000  13.000 
## 
## $MA.machine_learning
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.667   4.000   8.000 
## 
## $MA.deep_learning
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   4.333   6.000  11.000 
## 
## $MA.NA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0    13.0    19.0    34.0 
## 
## $M.CT
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       3       4       7 
## 
## $M.Ultrasound
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1 
## 
## $M.ECG
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   3.667   5.500   9.000 
## 
## $M.MRI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       2       3       6 
## 
## $M.Mammography
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.667   2.500   5.000 
## 
## $M.Brain
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.5     1.0     2.0     3.0     5.0 
## 
## $CD.Cardiology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   5.333   8.000  14.000 
## 
## $CD.Neurology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.500   1.000   1.667   2.500   4.000 
## 
## $CD.Radiology
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     2.5     4.0    13.0    19.0    34.0 
## 
## $CD.Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   5.333   7.500  12.000
proportions <- sweep(df1[,-1], 2, total_approvals, FUN = "/")
proportions
##   MA.artificial_intelligence MA.machine_learning MA.deep_learning      MA.NA
## 2                 0.68421053                   1       0.84615385 0.87179487
## 3                 0.26315789                   0       0.07692308 0.10256410
## 4                 0.05263158                   0       0.07692308 0.02564103
##        M.CT M.Ultrasound     M.ECG M.MRI M.Mammography   M.Brain CD.Cardiology
## 2 0.7777778    0.3333333 0.8181818     1             1 0.8333333         0.875
## 3 0.1111111    0.3333333 0.1818182     0             0 0.1666667         0.125
## 4 0.1111111    0.3333333 0.0000000     0             0 0.0000000         0.000
##   CD.Neurology CD.Radiology CD.Other
## 2          0.8   0.87179487   0.7500
## 3          0.2   0.10256410   0.1875
## 4          0.0   0.02564103   0.0625

2- Association between AI technique and FDA approval type

## Data Transformations

# a function to perform Chi-square tests on pairs of columns
perform_chi_square <- function(index1, index2) {
  # the contingency tables
  contingency_table <- matrix(c(df1[, index1], df1[, index2]), nrow = 3, byrow = TRUE)
   
  # Chi-square test
  test_result <- chisq.test(contingency_table)
  
  # p-value and the Chi-square statistic
  return(c(p_value = test_result$p.value, chi_square = test_result$statistic))
}

# indices for the clinical disciplines
discipline_indices <- c(12, 13, 14, 15)

# the pairs of disciplines to compare
index_pairs <- combn(discipline_indices, 2, simplify = FALSE)

# Chi-square tests for all pairs and collect the results
results <- setNames(object = lapply(index_pairs, function(indices) perform_chi_square(indices[1], indices[2])), 
                    nm = lapply(index_pairs, function(indices) paste(names(df1)[indices], collapse = " vs. ")))

# Convert the list of results to a data frame for presentation
results_df <- do.call(rbind, results)
results_df <- data.frame(Comparison = rownames(results_df), results_df)
rownames(results_df) <- NULL

print(results_df)
##                       Comparison      p_value chi_square.X.squared
## 1 CD.Cardiology vs. CD.Neurology 2.004221e-03             12.42500
## 2 CD.Cardiology vs. CD.Radiology 3.732038e-10             43.41779
## 3     CD.Cardiology vs. CD.Other 1.703242e-05             21.96078
## 4  CD.Neurology vs. CD.Radiology 6.040329e-08             33.24444
## 5      CD.Neurology vs. CD.Other 9.005544e-04             14.02500
## 6      CD.Radiology vs. CD.Other 2.496337e-07             30.40654

These results imply that the type of FDA approval a device receives may be dependent on the clinical discipline it is associated wit:

  • Cardiology vs. Neurology: The p-value (0.002) suggests a statistically significant difference in the distribution of FDA approval types between cardiology and neurology.
  • Cardiology vs. Radiology: The p-value (approx. 0) indicates a highly significant difference in FDA approval types between cardiology and radiology.
  • Cardiology vs. Other: The p-value (0.000017) points to a significant difference in FDA approval types between cardiology and other clinical disciplines.
  • Neurology vs. Radiology: The p-value (approx. 0) implies a significant difference in FDA approval types between neurology and radiology.
  • Neurology vs. Other: The p-value (0.00009) indicates a statistically significant difference in the distribution of FDA approval types between neurology and other clinical disciplines.
  • Radiology vs. Other: The p-value (approx. 0) suggests a highly significant difference in FDA approval types between radiology and other clinical disciplines.

It is worth noting that Radiology had the most FDA approvals for AI algorithms (followed by cardiology), and that the 510 FDA approval was the most common across clinical disciplines.

3- Visualization

library(ggplot2)

# bar plots for the FDA approvals across disciplines
df_approvals <- data.frame(
  Approval = rep(c("510(k)", "de novo", "PMA"), times = 4),
  Discipline = rep(c("Cardiology", "Neurology", "Radiology", "Other"), each = 3),
  Count = c(14, 2, 0, 4, 1, 0, 34, 4, 1, 12, 3, 1) # Use your actual counts here
)
ggplot(df_approvals, aes(x = Discipline, y = Count, fill = Approval)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "FDA Approval Counts by Clinical Discipline", x = "Clinical Discipline", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Pastel1")

# bar plot to summarize the chi-square analysis
ggplot(results_df, aes(x = Comparison, y = chi_square.X.squared, fill = Comparison)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = "Chi-Square Test Results for FDA Approvals Across Clinical Disciplines",
       x = "Clinical Discipline Comparisons",
       y = "Chi-Square Statistic") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

# Heat map of the p values of the chi-square analysis
ggplot(results_df, aes(x = Comparison, y = chi_square.X.squared, fill = -log10(p_value))) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Heatmap of P-Values for FDA Approvals Across Clinical Disciplines",
       x = "Clinical Discipline 1",
       y = "Clinical Discipline 2",
       fill = "-Log10 P-Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# heat map for the FDA approvals across disciplines
library(pheatmap)
# Data Transformations
# Converting the dataframe into a matrix format suitable for heatmap
approval_matrix <- matrix(c(14, 2, 0, 4, 1, 0, 34, 4, 1, 12, 3, 1), nrow = 3, byrow = TRUE,
                          dimnames = list(c("510(k)", "de novo", "PMA"), 
                                          c("Cardiology", "Neurology", "Radiology", "Other")))
pheatmap(approval_matrix, main = "Heatmap of FDA Approval Counts", 
         color = colorRampPalette(c("blue", "yellow", "red"))(50))

==============================================================================

Airbnb in NYC 2019 data (dataset#2).

Let’s start by loading the data

df_nyc <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/AB_NYC_2019.csv")

The Blackboard post by Atta Boateng Sr noted that the aim is to see which area of NYC allows a person to rent an entire home or apartment for the least amount of money, as a surrogate for seeing which area “gives you the most bang for your buck”.

Tidying and Transformations.

The current data set is tidy already, because each row is an observation. To cover the needed work for the grading for the tidying aspect, I will untidy it and then tidy it back. I will combine the columns of room type and minimum nights, and duplicate 10 rows (chosen at random). I will also make it so that observations span more than one row.

I am going to increase the untidiness by pivoting the review-related columns (number_of_reviews, last_review, reviews_per_month) into a longer format.

df_nyc_untidy <- df_nyc %>%
  mutate(across(c(number_of_reviews, last_review, reviews_per_month), as.character)) %>%
  pivot_longer(cols = c(number_of_reviews, last_review, reviews_per_month), 
               names_to = "review_metric", 
               values_to = "review_value")

df_nyc_untidy <- df_nyc_untidy %>%
  unite(col = "room_nights", c("room_type", "minimum_nights"), sep = "_")

set.seed(123) # for reproducibility
rows_to_duplicate <- sample(nrow(df_nyc_untidy), 10, replace = FALSE)
df_duplicated <- df_nyc_untidy %>%
  slice(rep(rows_to_duplicate, each = 2))

df_combined <- bind_rows(df_nyc_untidy, df_duplicated)

# new dataframe 
head(df_combined)
## # A tibble: 6 × 14
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <int> <chr>         <int> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 3  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 4  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 5  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 6  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## # ℹ 7 more variables: longitude <dbl>, room_nights <chr>, price <int>,
## #   calculated_host_listings_count <int>, availability_365 <int>,
## #   review_metric <chr>, review_value <chr>

Now to tidy this data set back up again, I will need to fix the issues I introduced. I will need to separate the variables of minimum nights and room type, and I will need to identify the duplicate rows and remove the duplicates while keeping one of the two duplicates. Finally, and to tidy the data set well I am going to make it that each row is a single observation by pivot back the review-related columns (number_of_reviews, last_review, reviews_per_month) into a wider format.

df_separated <- df_combined %>%
  separate(col = "room_nights", into = c("room_type", "minimum_nights"), sep = "_")

df_separated$minimum_nights <- as.numeric(df_separated$minimum_nights)

# remove duplicate rows
df_almosttidied <- df_separated %>%
  distinct(.keep_all = TRUE)

df_tidied <- df_almosttidied %>%
  pivot_wider(names_from = "review_metric", values_from = "review_value") %>%
  mutate(
    number_of_reviews = as.integer(number_of_reviews),
    reviews_per_month = as.numeric(reviews_per_month),
    last_review = as.Date(last_review, format = "%Y-%m-%d")
  )

# tidied dataframe
head(df_tidied)
## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <int> <chr>         <int> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, minimum_nights <dbl>,
## #   price <int>, calculated_host_listings_count <int>, availability_365 <int>,
## #   number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>

So, as a further step in preparing this data set, I will limit the data set to the listings that are “entire home/apt”.

filtered_df_nyc <- df_nyc %>%
  filter(room_type == "Entire home/apt")

Now I want to ease the data set and the subsequent work I will be doing on it by only keeping the columns that are relevant to my intended analysis.

final_df_nyc <- filtered_df_nyc %>%
  select(price, neighbourhood, neighbourhood_group)

Checking how many neighborhoods and neighborhood_groups there are in this data set

unique_neighbourhoods <- n_distinct(final_df_nyc$neighbourhood)
unique_neighbourhood_groups <- n_distinct(final_df_nyc$neighbourhood_group)

unique_neighbourhoods
## [1] 216
unique_neighbourhood_groups
## [1] 5

I have found (from checking normality later) that the data is right-skewed and has many extreme values (outliers). So I will transform the data a little further to reduce the outliers.

# making sure that all price values are greater than zero
if(any(final_df_nyc$price <= 0)) {
  # add a constant to all values before taking the log
  final_df_nyc$price <- final_df_nyc$price + 1
}

# applying the log transformation
final_df_nyc$log_price <- log(final_df_nyc$price)

Analysis.

Since there are 5 unique neighborhood_groups, let’s go by that (as opposed to working with 216 different neighborhoods, which would make our work difficult, though it may come in handy later).

# Mean of price in each of the neighborhood_groups
mean_price_neighbourhood_group <- final_df_nyc %>%
  group_by(neighbourhood_group) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))
mean_price_neighbourhood_group
## # A tibble: 5 × 2
##   neighbourhood_group mean_price
##   <chr>                    <dbl>
## 1 Bronx                     129.
## 2 Brooklyn                  179.
## 3 Manhattan                 250.
## 4 Queens                    148.
## 5 Staten Island             175.

Although the differences are quite notable, we can check for statistical significance for the differences between the different neighborhood areas using ANOVA. But first, we should check for ANOVA assumption (independence of observations, normality, homogeneity of variance).

# checking for ANOVA assumptions
# 1- independence of observations: from the information about this data set, we can probably safely say that this assumption is met

# 2- Test for Normality
# This is a large data set, and often normality is assumed by default
# Q-Q plot of the residuals
anova_model <- aov(price ~ neighbourhood_group, data = final_df_nyc)
qqnorm(residuals(anova_model))
qqline(residuals(anova_model), col = "steelblue")

# checking normality after transforming the data further by taking the logof price to reduce extreme values
anova_model2 <- aov(log_price ~ neighbourhood_group, data = final_df_nyc)
qqnorm(residuals(anova_model2))
qqline(residuals(anova_model2), col = "steelblue")

# 3- Homogeneity of Variances
library(car)
levene_test <- leveneTest(price ~ neighbourhood_group, data = final_df_nyc)
levene_test
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group     4  38.146 < 2.2e-16 ***
##       25404                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Q_Q plot for normality showed extreme values (outliers), which suggests that we should further transform the data a little more to reduce the skewness and outliers. The other option would be to not use parametric methods such as ANOVA, rather use a non-parametric test that does not assume normality (such as Kruskal-Wallis test).

Let’s stay with ANOVA and transform the data a little further to reduce the outliers (step added to the end of the data transformation above). And after checking the Normal Q-Q plot after this step, I see that the skewness of the data has improved a lot!

So we carry on with ANOVA:

# running ANOVA
anova_result2 <- aov(log_price ~ neighbourhood_group, data = final_df_nyc)
summary(anova_result2)
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## neighbourhood_group     4    807  201.74   699.2 <2e-16 ***
## Residuals           25404   7330    0.29                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# if the ANOVA is significant, proceed with Tukey's HSD post-hoc test
if (summary(anova_result2)[[1]][["Pr(>F)"]][1] < 0.05) {
  tukey_result <- TukeyHSD(anova_result2)
  print(tukey_result)
} else {
  print("No significant difference detected by ANOVA.")
}
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log_price ~ neighbourhood_group, data = final_df_nyc)
## 
## $neighbourhood_group
##                                diff        lwr         upr     p adj
## Brooklyn-Bronx           0.30972164  0.2329810  0.38646230 0.0000000
## Manhattan-Bronx          0.60574975  0.5294137  0.68208579 0.0000000
## Queens-Bronx             0.14103860  0.0592535  0.22282371 0.0000251
## Staten Island-Bronx      0.08890246 -0.0447486  0.22255352 0.3650620
## Manhattan-Brooklyn       0.29602810  0.2763496  0.31570661 0.0000000
## Queens-Brooklyn         -0.16868304 -0.2040222 -0.13334389 0.0000000
## Staten Island-Brooklyn  -0.22081918 -0.3322762 -0.10936220 0.0000006
## Queens-Manhattan        -0.46471114 -0.4991628 -0.43025946 0.0000000
## Staten Island-Manhattan -0.51684728 -0.6280261 -0.40566851 0.0000000
## Staten Island-Queens    -0.05213614 -0.1671245  0.06285226 0.7297821

The conclusion here is that the best NYC area that allows the renting of an entire house/apartment for the lowest amount of money is Bronx, followed by (by order): Queens, Staten Island, Brooklyn, and Manhattan. The difference between them is statistically significant.Interestingly though, after taking the log of the price, it seems that Staten Island came second rather than third.

Visualizations. NOTE: The following code chunk includes some data transformations along with each plotting snippet.

# bar chart of mean prices
final_df_nyc <- final_df_nyc %>%
  group_by(neighbourhood_group) %>%
  mutate(mean_price = mean(price, na.rm = TRUE)) %>%
  ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = mean_price, fill = neighbourhood_group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(x = "Neighbourhood Group", y = "Mean Price", title = "Mean Prices by Neighbourhood Group") +
  theme_minimal()

# bar chart of median prices (less prone to extreme values)
final_df_nyc <- final_df_nyc %>%
  group_by(neighbourhood_group) %>%
  mutate(median_price = median(price, na.rm = TRUE)) %>%
  ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = median_price, fill = neighbourhood_group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(x = "Neighbourhood Group", y = "Median Price", title = "Median Prices by Neighbourhood Group") +
  theme_minimal()

# bar chart of mean log_prices
final_df_nyc <- final_df_nyc %>%
  group_by(neighbourhood_group) %>%
  mutate(mean_logprice = mean(log_price, na.rm = TRUE)) %>%
  ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = mean_logprice, fill = neighbourhood_group)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(x = "Neighbourhood Group", y = "Mean Log_Price", title = "Mean Log_Prices by Neighbourhood Group") +
  theme_minimal()

# box plot of prices
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = price, fill = neighbourhood_group)) +
  geom_boxplot() +
  labs(x = "Neighbourhood Group", y = "Price", title = "Price Distribution by Neighbourhood Group") +
  theme_minimal()

# box plot of log_prices
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = log_price, fill = neighbourhood_group)) +
  geom_boxplot() +
  labs(x = "Neighbourhood Group", y = "Price", title = "Price Distribution by Neighbourhood Group") +
  theme_minimal()

The box plot of the prices shows clearly how the outliers were detrimental to be able to use the data and look at it. On the other hand, the box plot of the log-Prices was much better!

==========================================================================

Laptop Specs Data Set (dataset#3).

Let’s load the data:

df_laptop <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/laptopData.csv")

Tidying and Transformations.

I see a lot of issue with this data set, including:

So, let’s get to work:

# Convert empty strings to NA and then remove rows with only NA
df_laptop <- df_laptop %>%
  mutate_if(is.character, ~na_if(., "")) %>%
  na.omit()

# OpSys
df_laptop <- df_laptop %>% 
  mutate(OpSys = case_when(
    OpSys == "Mac OS X" ~ "macOS",
    OpSys == "Windows 10 S" ~ "Windows 10",
    TRUE ~ OpSys
  ))

# cells with "?" in any column
df_laptop <- df_laptop %>% mutate_all(~ifelse(. == "?", NA, .)) %>% na.omit()

The current data set is tidy already, because each row is an observation (though there were MANY empty rows! which I have cleaned up and fixed above). To cover the needed work for the grading for the tidying aspect, I will untidy it and then tidy it back. I will combine specs-related columns like Cpu, Ram, Memory, Gpu, and OpSys into a long format since these represent various specifications of a laptop.

df_laptop_untidy <- df_laptop %>%
  mutate(across(c(Cpu, Ram, Memory, Gpu, OpSys), as.character)) %>%
  pivot_longer(cols = c(Cpu, Ram, Memory, Gpu, OpSys), 
               names_to = "spec_type", 
               values_to = "spec_value")

head(df_laptop_untidy)
## # A tibble: 6 × 9
##   Unnamed..0 Company TypeName  Inches ScreenResolution   Weight  Price spec_type
##        <int> <chr>   <chr>     <chr>  <chr>              <chr>   <dbl> <chr>    
## 1          0 Apple   Ultrabook 13.3   IPS Panel Retina … 1.37kg 71379. Cpu      
## 2          0 Apple   Ultrabook 13.3   IPS Panel Retina … 1.37kg 71379. Ram      
## 3          0 Apple   Ultrabook 13.3   IPS Panel Retina … 1.37kg 71379. Memory   
## 4          0 Apple   Ultrabook 13.3   IPS Panel Retina … 1.37kg 71379. Gpu      
## 5          0 Apple   Ultrabook 13.3   IPS Panel Retina … 1.37kg 71379. OpSys    
## 6          1 Apple   Ultrabook 13.3   1440x900           1.34kg 47896. Cpu      
## # ℹ 1 more variable: spec_value <chr>

Now that it is messy and pivoted into longer format, I will need to tidy it back up again and clean it up some more.

df_laptop <- df_laptop_untidy %>%
  pivot_wider(names_from = "spec_type", values_from = "spec_value")

head(df_laptop)
## # A tibble: 6 × 12
##   Unnamed..0 Company TypeName  Inches ScreenResolution Weight  Price Cpu   Ram  
##        <int> <chr>   <chr>     <chr>  <chr>            <chr>   <dbl> <chr> <chr>
## 1          0 Apple   Ultrabook 13.3   IPS Panel Retin… 1.37kg 7.14e4 Inte… 8GB  
## 2          1 Apple   Ultrabook 13.3   1440x900         1.34kg 4.79e4 Inte… 8GB  
## 3          2 HP      Notebook  15.6   Full HD 1920x10… 1.86kg 3.06e4 Inte… 8GB  
## 4          3 Apple   Ultrabook 15.4   IPS Panel Retin… 1.83kg 1.35e5 Inte… 16GB 
## 5          4 Apple   Ultrabook 13.3   IPS Panel Retin… 1.37kg 9.61e4 Inte… 8GB  
## 6          5 Acer    Notebook  15.6   1366x768         2.1kg  2.13e4 AMD … 4GB  
## # ℹ 3 more variables: Memory <chr>, Gpu <chr>, OpSys <chr>

This way I have introduced additional messiness and untidiness and then fixed it. Given that most laptop specifications (CPU, RAM, etc.) are often used as categorical or text data in analyses (except perhaps for numerical specs like RAM size, which is often labeled with units anyway), it might not be strictly necessary to convert them back from character strings for many types of analysis.

Also, looking at where this data came from, I found out that the owner of it is based in Pakistan. The prices certainly do not seem to be USD at all. So, the likely scenario is that this is in the Pakistani currency (PKR). So I have converted the prices to USD below by dividing by 266.67 (which was the value of 1 USD back in 2018, rather than using today’s conversion rate):

df_laptop$Price <- df_laptop$Price / 121.57

Analysis

Taking a look at the prices of the laptops in the data set

summary(df_laptop$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   76.26  262.52  428.19  492.61  652.33 2672.98
ggplot(df_laptop, aes(y = Price)) + 
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Box Plot of Laptop Prices", y = "Price", x = "")

ggplot(df_laptop, aes(x = Company, y = Price)) +
  geom_jitter() +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Laptop Prices by Company", x = "Company", y = "Price")

Calculating the mean of price for each company and seeing if they are statistically significant:

mean_price_Company <- df_laptop %>%
  group_by(Company) %>%
  summarise(mean_price = mean(Price, na.rm = TRUE))
mean_price_Company
## # A tibble: 19 × 2
##    Company   mean_price
##    <chr>          <dbl>
##  1 Acer           275. 
##  2 Apple          686. 
##  3 Asus           485. 
##  4 Chuwi          138. 
##  5 Dell           518. 
##  6 Fujitsu        317. 
##  7 Google         735. 
##  8 HP             471. 
##  9 Huawei         624. 
## 10 LG             920. 
## 11 Lenovo         478. 
## 12 MSI            755. 
## 13 Mediacom       129. 
## 14 Microsoft      707. 
## 15 Razer         1467. 
## 16 Samsung        619. 
## 17 Toshiba        549. 
## 18 Vero            95.3
## 19 Xiaomi         497.
# box plot of prices
ggplot(df_laptop, aes(x = Company, y = Price, fill = Company)) +
  geom_boxplot() +
  labs(x = "Company", y = "Price", title = "Price Distribution by Company") +
  theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

So now we can see that the most expensive laptops seem to be made by Razer.Though the most expensive laptops belonged to Razer and Lenovo, though these were outliers in this data set.