This is the wide table of the data set I have chosen:
Here it is in table format form the loaded data frame from the CSV file imported into R:
library(kableExtra)
df <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/AI-FDA_wide-table.csv")
df %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Type.of.FDA.approval | Mention.of.AI.in.announcement | X | X.1 | X.2 | X.3 | Modality | X.4 | X.5 | X.6 | X.7 | X.8 | Clinical.Discipline | X.9 | X.10 | X.11 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
artificial intelligence | machine learning | deep learning | NA | Total | CT | Ultrasound | ECG | MRI | Mammography | Brain | Cardiology | Neurology | Radiology | Other | |
510(k) premarket notification | 13 | 8 | 11 | 34 | 66 | 7 | 1 | 9 | 6 | 5 | 5 | 14 | 4 | 34 | 12 |
de novo pathway | 5 | 0 | 1 | 4 | 10 | 1 | 1 | 2 | 0 | 0 | 1 | 2 | 1 | 4 | 3 |
PMA | 1 | 0 | 1 | 1 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
Total | 19 | 8 | 13 | 39 | 79 | 9 | 3 | 11 | 6 | 5 | 6 | 16 | 5 | 39 | 16 |
Tidying and transforming the data
The current dataset is not tidy, because each row is not an observation, rather the current format of the dataset shows a summary snapshot at the dataset (similar to a Table 1 in a research paper). The first step transformation I want to do: currently, the 4 rows are needed to become columns, and the columns need to become rows. The current columns of “artificial intelligence”, “machine learning”, “deep learning”, “NA”, and “Total” are under the parent label of “Mention of AI in announcement”. Similarly, the current columns of “CT”, “Ultrasound”, “ECG”, “MRI”, “Mammography”, and “Brain” are under the parent label of “Modality”. Also similarly, the current columns of “Cardiology”, “Neurology”, “Radiology”, and “Other” are under the parent label of “Clinical Discipline”. So considering the abbreviations of the following in ():
when transforming this wide into long format by switching columns to rows and rows to columns, the daughter column labels will be coupled with the parent column label to create the new row labels (for example “MA.machine_learning”).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
## Data Transformations
# column names to include the parent label as a prefix
new_col_names <- c("Type.of.FDA.approval",
"MA.artificial_intelligence", "MA.machine_learning", "MA.deep_learning", "MA.NA", "MA.Total",
"M.CT", "M.Ultrasound", "M.ECG", "M.MRI", "M.Mammography", "M.Brain",
"CD.Cardiology", "CD.Neurology", "CD.Radiology", "CD.Other")
# change column names
names(df) <- new_col_names
# drop row 1 (old daughter column labels)
df <- df[-1, ]
# drop Total column and Total row to avoid confusion and clean up as a data set
df1 <- df[ , -6]
df1 <- df1[-4 , ]
# Convert all but the first column to numeric
df1[, -1] <- lapply(df1[, -1], function(x) as.numeric(as.character(x)))
#--- --- ---
# Transformation of the data from wide to long format
df1_long <- pivot_longer(
data = df1,
cols = -Type.of.FDA.approval, # Exclude the identifier column
names_to = "Category", # This will hold the names of the current columns
values_to = "Count" # This will hold the counts
)
print(df1_long)
## # A tibble: 42 × 3
## Type.of.FDA.approval Category Count
## <chr> <chr> <dbl>
## 1 510(k) premarket notification MA.artificial_intelligence 13
## 2 510(k) premarket notification MA.machine_learning 8
## 3 510(k) premarket notification MA.deep_learning 11
## 4 510(k) premarket notification MA.NA 34
## 5 510(k) premarket notification M.CT 7
## 6 510(k) premarket notification M.Ultrasound 1
## 7 510(k) premarket notification M.ECG 9
## 8 510(k) premarket notification M.MRI 6
## 9 510(k) premarket notification M.Mammography 5
## 10 510(k) premarket notification M.Brain 5
## # ℹ 32 more rows
#General summary snapshot (not final but helpful)
summary(df1)
## Type.of.FDA.approval MA.artificial_intelligence MA.machine_learning
## Length:3 Min. : 1.000 Min. :0.000
## Class :character 1st Qu.: 3.000 1st Qu.:0.000
## Mode :character Median : 5.000 Median :0.000
## Mean : 6.333 Mean :2.667
## 3rd Qu.: 9.000 3rd Qu.:4.000
## Max. :13.000 Max. :8.000
## MA.deep_learning MA.NA M.CT M.Ultrasound M.ECG
## Min. : 1.000 Min. : 1.0 Min. :1 Min. :1 Min. :0.000
## 1st Qu.: 1.000 1st Qu.: 2.5 1st Qu.:1 1st Qu.:1 1st Qu.:1.000
## Median : 1.000 Median : 4.0 Median :1 Median :1 Median :2.000
## Mean : 4.333 Mean :13.0 Mean :3 Mean :1 Mean :3.667
## 3rd Qu.: 6.000 3rd Qu.:19.0 3rd Qu.:4 3rd Qu.:1 3rd Qu.:5.500
## Max. :11.000 Max. :34.0 Max. :7 Max. :1 Max. :9.000
## M.MRI M.Mammography M.Brain CD.Cardiology CD.Neurology
## Min. :0 Min. :0.000 Min. :0.0 Min. : 0.000 Min. :0.000
## 1st Qu.:0 1st Qu.:0.000 1st Qu.:0.5 1st Qu.: 1.000 1st Qu.:0.500
## Median :0 Median :0.000 Median :1.0 Median : 2.000 Median :1.000
## Mean :2 Mean :1.667 Mean :2.0 Mean : 5.333 Mean :1.667
## 3rd Qu.:3 3rd Qu.:2.500 3rd Qu.:3.0 3rd Qu.: 8.000 3rd Qu.:2.500
## Max. :6 Max. :5.000 Max. :5.0 Max. :14.000 Max. :4.000
## CD.Radiology CD.Other
## Min. : 1.0 Min. : 1.000
## 1st Qu.: 2.5 1st Qu.: 2.000
## Median : 4.0 Median : 3.000
## Mean :13.0 Mean : 5.333
## 3rd Qu.:19.0 3rd Qu.: 7.500
## Max. :34.0 Max. :12.000
1- Descriptive analyses: FDA approvals within a specific clinical discipline
# counting FDA approvals within a specific clinical discipline
total_approvals <- colSums(df1[,-1])
total_approvals
## MA.artificial_intelligence MA.machine_learning
## 19 8
## MA.deep_learning MA.NA
## 13 39
## M.CT M.Ultrasound
## 9 3
## M.ECG M.MRI
## 11 6
## M.Mammography M.Brain
## 5 6
## CD.Cardiology CD.Neurology
## 16 5
## CD.Radiology CD.Other
## 39 16
# mean number of approvals
mean_approvals <- colMeans(df1[1:(nrow(df)-1), -1])
mean_approvals
## MA.artificial_intelligence MA.machine_learning
## 6.333333 2.666667
## MA.deep_learning MA.NA
## 4.333333 13.000000
## M.CT M.Ultrasound
## 3.000000 1.000000
## M.ECG M.MRI
## 3.666667 2.000000
## M.Mammography M.Brain
## 1.666667 2.000000
## CD.Cardiology CD.Neurology
## 5.333333 1.666667
## CD.Radiology CD.Other
## 13.000000 5.333333
summary_by_discipline <- lapply(df1[,-1], summary)
summary_by_discipline
## $MA.artificial_intelligence
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 6.333 9.000 13.000
##
## $MA.machine_learning
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.667 4.000 8.000
##
## $MA.deep_learning
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 4.333 6.000 11.000
##
## $MA.NA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.5 4.0 13.0 19.0 34.0
##
## $M.CT
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 3 4 7
##
## $M.Ultrasound
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
##
## $M.ECG
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 3.667 5.500 9.000
##
## $M.MRI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 2 3 6
##
## $M.Mammography
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.667 2.500 5.000
##
## $M.Brain
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.5 1.0 2.0 3.0 5.0
##
## $CD.Cardiology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 5.333 8.000 14.000
##
## $CD.Neurology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.500 1.000 1.667 2.500 4.000
##
## $CD.Radiology
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 2.5 4.0 13.0 19.0 34.0
##
## $CD.Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 5.333 7.500 12.000
proportions <- sweep(df1[,-1], 2, total_approvals, FUN = "/")
proportions
## MA.artificial_intelligence MA.machine_learning MA.deep_learning MA.NA
## 2 0.68421053 1 0.84615385 0.87179487
## 3 0.26315789 0 0.07692308 0.10256410
## 4 0.05263158 0 0.07692308 0.02564103
## M.CT M.Ultrasound M.ECG M.MRI M.Mammography M.Brain CD.Cardiology
## 2 0.7777778 0.3333333 0.8181818 1 1 0.8333333 0.875
## 3 0.1111111 0.3333333 0.1818182 0 0 0.1666667 0.125
## 4 0.1111111 0.3333333 0.0000000 0 0 0.0000000 0.000
## CD.Neurology CD.Radiology CD.Other
## 2 0.8 0.87179487 0.7500
## 3 0.2 0.10256410 0.1875
## 4 0.0 0.02564103 0.0625
2- Association between AI technique and FDA approval type
## Data Transformations
# a function to perform Chi-square tests on pairs of columns
perform_chi_square <- function(index1, index2) {
# the contingency tables
contingency_table <- matrix(c(df1[, index1], df1[, index2]), nrow = 3, byrow = TRUE)
# Chi-square test
test_result <- chisq.test(contingency_table)
# p-value and the Chi-square statistic
return(c(p_value = test_result$p.value, chi_square = test_result$statistic))
}
# indices for the clinical disciplines
discipline_indices <- c(12, 13, 14, 15)
# the pairs of disciplines to compare
index_pairs <- combn(discipline_indices, 2, simplify = FALSE)
# Chi-square tests for all pairs and collect the results
results <- setNames(object = lapply(index_pairs, function(indices) perform_chi_square(indices[1], indices[2])),
nm = lapply(index_pairs, function(indices) paste(names(df1)[indices], collapse = " vs. ")))
# Convert the list of results to a data frame for presentation
results_df <- do.call(rbind, results)
results_df <- data.frame(Comparison = rownames(results_df), results_df)
rownames(results_df) <- NULL
print(results_df)
## Comparison p_value chi_square.X.squared
## 1 CD.Cardiology vs. CD.Neurology 2.004221e-03 12.42500
## 2 CD.Cardiology vs. CD.Radiology 3.732038e-10 43.41779
## 3 CD.Cardiology vs. CD.Other 1.703242e-05 21.96078
## 4 CD.Neurology vs. CD.Radiology 6.040329e-08 33.24444
## 5 CD.Neurology vs. CD.Other 9.005544e-04 14.02500
## 6 CD.Radiology vs. CD.Other 2.496337e-07 30.40654
These results imply that the type of FDA approval a device receives may be dependent on the clinical discipline it is associated wit:
It is worth noting that Radiology had the most FDA approvals for AI algorithms (followed by cardiology), and that the 510 FDA approval was the most common across clinical disciplines.
3- Visualization
library(ggplot2)
# bar plots for the FDA approvals across disciplines
df_approvals <- data.frame(
Approval = rep(c("510(k)", "de novo", "PMA"), times = 4),
Discipline = rep(c("Cardiology", "Neurology", "Radiology", "Other"), each = 3),
Count = c(14, 2, 0, 4, 1, 0, 34, 4, 1, 12, 3, 1) # Use your actual counts here
)
ggplot(df_approvals, aes(x = Discipline, y = Count, fill = Approval)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "FDA Approval Counts by Clinical Discipline", x = "Clinical Discipline", y = "Count") +
theme_minimal() +
scale_fill_brewer(palette = "Pastel1")
# bar plot to summarize the chi-square analysis
ggplot(results_df, aes(x = Comparison, y = chi_square.X.squared, fill = Comparison)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = "Chi-Square Test Results for FDA Approvals Across Clinical Disciplines",
x = "Clinical Discipline Comparisons",
y = "Chi-Square Statistic") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
# Heat map of the p values of the chi-square analysis
ggplot(results_df, aes(x = Comparison, y = chi_square.X.squared, fill = -log10(p_value))) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Heatmap of P-Values for FDA Approvals Across Clinical Disciplines",
x = "Clinical Discipline 1",
y = "Clinical Discipline 2",
fill = "-Log10 P-Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# heat map for the FDA approvals across disciplines
library(pheatmap)
# Data Transformations
# Converting the dataframe into a matrix format suitable for heatmap
approval_matrix <- matrix(c(14, 2, 0, 4, 1, 0, 34, 4, 1, 12, 3, 1), nrow = 3, byrow = TRUE,
dimnames = list(c("510(k)", "de novo", "PMA"),
c("Cardiology", "Neurology", "Radiology", "Other")))
pheatmap(approval_matrix, main = "Heatmap of FDA Approval Counts",
color = colorRampPalette(c("blue", "yellow", "red"))(50))
==============================================================================
Let’s start by loading the data
df_nyc <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/AB_NYC_2019.csv")
The Blackboard post by Atta Boateng Sr noted that the aim is to see which area of NYC allows a person to rent an entire home or apartment for the least amount of money, as a surrogate for seeing which area “gives you the most bang for your buck”.
Tidying and Transformations.
The current data set is tidy already, because each row is an observation. To cover the needed work for the grading for the tidying aspect, I will untidy it and then tidy it back. I will combine the columns of room type and minimum nights, and duplicate 10 rows (chosen at random). I will also make it so that observations span more than one row.
I am going to increase the untidiness by pivoting the review-related columns (number_of_reviews, last_review, reviews_per_month) into a longer format.
df_nyc_untidy <- df_nyc %>%
mutate(across(c(number_of_reviews, last_review, reviews_per_month), as.character)) %>%
pivot_longer(cols = c(number_of_reviews, last_review, reviews_per_month),
names_to = "review_metric",
values_to = "review_value")
df_nyc_untidy <- df_nyc_untidy %>%
unite(col = "room_nights", c("room_type", "minimum_nights"), sep = "_")
set.seed(123) # for reproducibility
rows_to_duplicate <- sample(nrow(df_nyc_untidy), 10, replace = FALSE)
df_duplicated <- df_nyc_untidy %>%
slice(rep(rows_to_duplicate, each = 2))
df_combined <- bind_rows(df_nyc_untidy, df_duplicated)
# new dataframe
head(df_combined)
## # A tibble: 6 × 14
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <int> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 2 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 3 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 4 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## 5 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## 6 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## # ℹ 7 more variables: longitude <dbl>, room_nights <chr>, price <int>,
## # calculated_host_listings_count <int>, availability_365 <int>,
## # review_metric <chr>, review_value <chr>
Now to tidy this data set back up again, I will need to fix the issues I introduced. I will need to separate the variables of minimum nights and room type, and I will need to identify the duplicate rows and remove the duplicates while keeping one of the two duplicates. Finally, and to tidy the data set well I am going to make it that each row is a single observation by pivot back the review-related columns (number_of_reviews, last_review, reviews_per_month) into a wider format.
df_separated <- df_combined %>%
separate(col = "room_nights", into = c("room_type", "minimum_nights"), sep = "_")
df_separated$minimum_nights <- as.numeric(df_separated$minimum_nights)
# remove duplicate rows
df_almosttidied <- df_separated %>%
distinct(.keep_all = TRUE)
df_tidied <- df_almosttidied %>%
pivot_wider(names_from = "review_metric", values_from = "review_value") %>%
mutate(
number_of_reviews = as.integer(number_of_reviews),
reviews_per_month = as.numeric(reviews_per_month),
last_review = as.Date(last_review, format = "%Y-%m-%d")
)
# tidied dataframe
head(df_tidied)
## # A tibble: 6 × 16
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <int> <chr> <int> <chr> <chr> <chr> <dbl>
## 1 2539 Clean & qu… 2787 John Brooklyn Kensington 40.6
## 2 2595 Skylit Mid… 2845 Jennifer Manhattan Midtown 40.8
## 3 3647 THE VILLAG… 4632 Elisabeth Manhattan Harlem 40.8
## 4 3831 Cozy Entir… 4869 LisaRoxa… Brooklyn Clinton Hill 40.7
## 5 5022 Entire Apt… 7192 Laura Manhattan East Harlem 40.8
## 6 5099 Large Cozy… 7322 Chris Manhattan Murray Hill 40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, minimum_nights <dbl>,
## # price <int>, calculated_host_listings_count <int>, availability_365 <int>,
## # number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>
So, as a further step in preparing this data set, I will limit the data set to the listings that are “entire home/apt”.
filtered_df_nyc <- df_nyc %>%
filter(room_type == "Entire home/apt")
Now I want to ease the data set and the subsequent work I will be doing on it by only keeping the columns that are relevant to my intended analysis.
final_df_nyc <- filtered_df_nyc %>%
select(price, neighbourhood, neighbourhood_group)
Checking how many neighborhoods and neighborhood_groups there are in this data set
unique_neighbourhoods <- n_distinct(final_df_nyc$neighbourhood)
unique_neighbourhood_groups <- n_distinct(final_df_nyc$neighbourhood_group)
unique_neighbourhoods
## [1] 216
unique_neighbourhood_groups
## [1] 5
I have found (from checking normality later) that the data is right-skewed and has many extreme values (outliers). So I will transform the data a little further to reduce the outliers.
# making sure that all price values are greater than zero
if(any(final_df_nyc$price <= 0)) {
# add a constant to all values before taking the log
final_df_nyc$price <- final_df_nyc$price + 1
}
# applying the log transformation
final_df_nyc$log_price <- log(final_df_nyc$price)
Analysis.
Since there are 5 unique neighborhood_groups, let’s go by that (as opposed to working with 216 different neighborhoods, which would make our work difficult, though it may come in handy later).
# Mean of price in each of the neighborhood_groups
mean_price_neighbourhood_group <- final_df_nyc %>%
group_by(neighbourhood_group) %>%
summarise(mean_price = mean(price, na.rm = TRUE))
mean_price_neighbourhood_group
## # A tibble: 5 × 2
## neighbourhood_group mean_price
## <chr> <dbl>
## 1 Bronx 129.
## 2 Brooklyn 179.
## 3 Manhattan 250.
## 4 Queens 148.
## 5 Staten Island 175.
Although the differences are quite notable, we can check for statistical significance for the differences between the different neighborhood areas using ANOVA. But first, we should check for ANOVA assumption (independence of observations, normality, homogeneity of variance).
# checking for ANOVA assumptions
# 1- independence of observations: from the information about this data set, we can probably safely say that this assumption is met
# 2- Test for Normality
# This is a large data set, and often normality is assumed by default
# Q-Q plot of the residuals
anova_model <- aov(price ~ neighbourhood_group, data = final_df_nyc)
qqnorm(residuals(anova_model))
qqline(residuals(anova_model), col = "steelblue")
# checking normality after transforming the data further by taking the logof price to reduce extreme values
anova_model2 <- aov(log_price ~ neighbourhood_group, data = final_df_nyc)
qqnorm(residuals(anova_model2))
qqline(residuals(anova_model2), col = "steelblue")
# 3- Homogeneity of Variances
library(car)
levene_test <- leveneTest(price ~ neighbourhood_group, data = final_df_nyc)
levene_test
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 38.146 < 2.2e-16 ***
## 25404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Q_Q plot for normality showed extreme values (outliers), which suggests that we should further transform the data a little more to reduce the skewness and outliers. The other option would be to not use parametric methods such as ANOVA, rather use a non-parametric test that does not assume normality (such as Kruskal-Wallis test).
Let’s stay with ANOVA and transform the data a little further to reduce the outliers (step added to the end of the data transformation above). And after checking the Normal Q-Q plot after this step, I see that the skewness of the data has improved a lot!
So we carry on with ANOVA:
# running ANOVA
anova_result2 <- aov(log_price ~ neighbourhood_group, data = final_df_nyc)
summary(anova_result2)
## Df Sum Sq Mean Sq F value Pr(>F)
## neighbourhood_group 4 807 201.74 699.2 <2e-16 ***
## Residuals 25404 7330 0.29
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# if the ANOVA is significant, proceed with Tukey's HSD post-hoc test
if (summary(anova_result2)[[1]][["Pr(>F)"]][1] < 0.05) {
tukey_result <- TukeyHSD(anova_result2)
print(tukey_result)
} else {
print("No significant difference detected by ANOVA.")
}
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log_price ~ neighbourhood_group, data = final_df_nyc)
##
## $neighbourhood_group
## diff lwr upr p adj
## Brooklyn-Bronx 0.30972164 0.2329810 0.38646230 0.0000000
## Manhattan-Bronx 0.60574975 0.5294137 0.68208579 0.0000000
## Queens-Bronx 0.14103860 0.0592535 0.22282371 0.0000251
## Staten Island-Bronx 0.08890246 -0.0447486 0.22255352 0.3650620
## Manhattan-Brooklyn 0.29602810 0.2763496 0.31570661 0.0000000
## Queens-Brooklyn -0.16868304 -0.2040222 -0.13334389 0.0000000
## Staten Island-Brooklyn -0.22081918 -0.3322762 -0.10936220 0.0000006
## Queens-Manhattan -0.46471114 -0.4991628 -0.43025946 0.0000000
## Staten Island-Manhattan -0.51684728 -0.6280261 -0.40566851 0.0000000
## Staten Island-Queens -0.05213614 -0.1671245 0.06285226 0.7297821
The conclusion here is that the best NYC area that allows the renting of an entire house/apartment for the lowest amount of money is Bronx, followed by (by order): Queens, Staten Island, Brooklyn, and Manhattan. The difference between them is statistically significant.Interestingly though, after taking the log of the price, it seems that Staten Island came second rather than third.
Visualizations. NOTE: The following code chunk includes some data transformations along with each plotting snippet.
# bar chart of mean prices
final_df_nyc <- final_df_nyc %>%
group_by(neighbourhood_group) %>%
mutate(mean_price = mean(price, na.rm = TRUE)) %>%
ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = mean_price, fill = neighbourhood_group)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(x = "Neighbourhood Group", y = "Mean Price", title = "Mean Prices by Neighbourhood Group") +
theme_minimal()
# bar chart of median prices (less prone to extreme values)
final_df_nyc <- final_df_nyc %>%
group_by(neighbourhood_group) %>%
mutate(median_price = median(price, na.rm = TRUE)) %>%
ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = median_price, fill = neighbourhood_group)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(x = "Neighbourhood Group", y = "Median Price", title = "Median Prices by Neighbourhood Group") +
theme_minimal()
# bar chart of mean log_prices
final_df_nyc <- final_df_nyc %>%
group_by(neighbourhood_group) %>%
mutate(mean_logprice = mean(log_price, na.rm = TRUE)) %>%
ungroup()
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = mean_logprice, fill = neighbourhood_group)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(x = "Neighbourhood Group", y = "Mean Log_Price", title = "Mean Log_Prices by Neighbourhood Group") +
theme_minimal()
# box plot of prices
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = price, fill = neighbourhood_group)) +
geom_boxplot() +
labs(x = "Neighbourhood Group", y = "Price", title = "Price Distribution by Neighbourhood Group") +
theme_minimal()
# box plot of log_prices
ggplot(final_df_nyc, aes(x = neighbourhood_group, y = log_price, fill = neighbourhood_group)) +
geom_boxplot() +
labs(x = "Neighbourhood Group", y = "Price", title = "Price Distribution by Neighbourhood Group") +
theme_minimal()
The box plot of the prices shows clearly how the outliers were detrimental to be able to use the data and look at it. On the other hand, the box plot of the log-Prices was much better!
==========================================================================
Let’s load the data:
df_laptop <- read.csv("https://raw.githubusercontent.com/unsecuredAMRAP/607/main/laptopData.csv")
Tidying and Transformations.
I see a lot of issue with this data set, including:
So, let’s get to work:
# Convert empty strings to NA and then remove rows with only NA
df_laptop <- df_laptop %>%
mutate_if(is.character, ~na_if(., "")) %>%
na.omit()
# OpSys
df_laptop <- df_laptop %>%
mutate(OpSys = case_when(
OpSys == "Mac OS X" ~ "macOS",
OpSys == "Windows 10 S" ~ "Windows 10",
TRUE ~ OpSys
))
# cells with "?" in any column
df_laptop <- df_laptop %>% mutate_all(~ifelse(. == "?", NA, .)) %>% na.omit()
The current data set is tidy already, because each row is an observation (though there were MANY empty rows! which I have cleaned up and fixed above). To cover the needed work for the grading for the tidying aspect, I will untidy it and then tidy it back. I will combine specs-related columns like Cpu, Ram, Memory, Gpu, and OpSys into a long format since these represent various specifications of a laptop.
df_laptop_untidy <- df_laptop %>%
mutate(across(c(Cpu, Ram, Memory, Gpu, OpSys), as.character)) %>%
pivot_longer(cols = c(Cpu, Ram, Memory, Gpu, OpSys),
names_to = "spec_type",
values_to = "spec_value")
head(df_laptop_untidy)
## # A tibble: 6 × 9
## Unnamed..0 Company TypeName Inches ScreenResolution Weight Price spec_type
## <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 0 Apple Ultrabook 13.3 IPS Panel Retina … 1.37kg 71379. Cpu
## 2 0 Apple Ultrabook 13.3 IPS Panel Retina … 1.37kg 71379. Ram
## 3 0 Apple Ultrabook 13.3 IPS Panel Retina … 1.37kg 71379. Memory
## 4 0 Apple Ultrabook 13.3 IPS Panel Retina … 1.37kg 71379. Gpu
## 5 0 Apple Ultrabook 13.3 IPS Panel Retina … 1.37kg 71379. OpSys
## 6 1 Apple Ultrabook 13.3 1440x900 1.34kg 47896. Cpu
## # ℹ 1 more variable: spec_value <chr>
Now that it is messy and pivoted into longer format, I will need to tidy it back up again and clean it up some more.
df_laptop <- df_laptop_untidy %>%
pivot_wider(names_from = "spec_type", values_from = "spec_value")
head(df_laptop)
## # A tibble: 6 × 12
## Unnamed..0 Company TypeName Inches ScreenResolution Weight Price Cpu Ram
## <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 0 Apple Ultrabook 13.3 IPS Panel Retin… 1.37kg 7.14e4 Inte… 8GB
## 2 1 Apple Ultrabook 13.3 1440x900 1.34kg 4.79e4 Inte… 8GB
## 3 2 HP Notebook 15.6 Full HD 1920x10… 1.86kg 3.06e4 Inte… 8GB
## 4 3 Apple Ultrabook 15.4 IPS Panel Retin… 1.83kg 1.35e5 Inte… 16GB
## 5 4 Apple Ultrabook 13.3 IPS Panel Retin… 1.37kg 9.61e4 Inte… 8GB
## 6 5 Acer Notebook 15.6 1366x768 2.1kg 2.13e4 AMD … 4GB
## # ℹ 3 more variables: Memory <chr>, Gpu <chr>, OpSys <chr>
This way I have introduced additional messiness and untidiness and then fixed it. Given that most laptop specifications (CPU, RAM, etc.) are often used as categorical or text data in analyses (except perhaps for numerical specs like RAM size, which is often labeled with units anyway), it might not be strictly necessary to convert them back from character strings for many types of analysis.
Also, looking at where this data came from, I found out that the owner of it is based in Pakistan. The prices certainly do not seem to be USD at all. So, the likely scenario is that this is in the Pakistani currency (PKR). So I have converted the prices to USD below by dividing by 266.67 (which was the value of 1 USD back in 2018, rather than using today’s conversion rate):
df_laptop$Price <- df_laptop$Price / 121.57
Analysis
Taking a look at the prices of the laptops in the data set
summary(df_laptop$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 76.26 262.52 428.19 492.61 652.33 2672.98
ggplot(df_laptop, aes(y = Price)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Box Plot of Laptop Prices", y = "Price", x = "")
ggplot(df_laptop, aes(x = Company, y = Price)) +
geom_jitter() +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Laptop Prices by Company", x = "Company", y = "Price")
Calculating the mean of price for each company and seeing if they are statistically significant:
mean_price_Company <- df_laptop %>%
group_by(Company) %>%
summarise(mean_price = mean(Price, na.rm = TRUE))
mean_price_Company
## # A tibble: 19 × 2
## Company mean_price
## <chr> <dbl>
## 1 Acer 275.
## 2 Apple 686.
## 3 Asus 485.
## 4 Chuwi 138.
## 5 Dell 518.
## 6 Fujitsu 317.
## 7 Google 735.
## 8 HP 471.
## 9 Huawei 624.
## 10 LG 920.
## 11 Lenovo 478.
## 12 MSI 755.
## 13 Mediacom 129.
## 14 Microsoft 707.
## 15 Razer 1467.
## 16 Samsung 619.
## 17 Toshiba 549.
## 18 Vero 95.3
## 19 Xiaomi 497.
# box plot of prices
ggplot(df_laptop, aes(x = Company, y = Price, fill = Company)) +
geom_boxplot() +
labs(x = "Company", y = "Price", title = "Price Distribution by Company") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
So now we can see that the most expensive laptops seem to be made by Razer.Though the most expensive laptops belonged to Razer and Lenovo, though these were outliers in this data set.