knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
The goal of this project was to apply our knowledge of unsupervised machine learning by testing synthetic respondents generated by AI models. Before we dive into the code, visualizations, and analysis, a few prerequisites about the assignment. We were tasked with writing a 5-item survey using either a semantic differential scale or a Likert scale, on the topic of short-term economic conditions in Connecticut. I elected to use the Likert scale because, in my experience, it appears more frequently in public surveys. For the 5 items, I chose economic outlook, job market strength, unemployment rate stability, inflation rate stability, and business climate for companies. The assignment also asked us to use demographics of location and business size (urban or rural, small business or not). As my own twist, I also included education level (no education, high school graduate, college student, college graduate). The two AI models I chose were Copilot and ChatGPT. I gave both the same prompt and had each generate an Excel data set of 150 respondents for my survey.
library(readxl)
library(dplyr)
library(ggplot2)
library(janitor)
library(psych)
library(tidyr)
library(factoextra)
This step is essential because the data set as a whole contains both text columns and numeric measures. Using the clean_names() function, I made all column names consistent lowercase names. For the text columns I manually defined factor variables, and I created binary dummy variables for each demographic, which completes the cleaning and preparation process. In this chunk I also define the Likert-scale structure, which isolates the main variables we will use for the midpoint analysis and descriptive statistics later on, making it easier to summarize, reshape, and compute midpoint usage. With these steps completed, the ChatGPT data set is ready for analysis and we can begin checking for midpoint bias.
Final_CHATGPT_Response_ <- read_excel("Final CHATGPT Response .xlsx")
ChatGPT <- Final_CHATGPT_Response_
ChatGPT <- ChatGPT %>% clean_names()
names(ChatGPT)
## [1] "id" "overall_outlook" "job_market"
## [4] "unemployment_stability" "inflation_stability" "business_climate"
## [7] "location" "business_size" "education"
ChatGPT <- ChatGPT %>%
mutate(
location = factor(location, levels = c("Rural", "Urban")),
business_size = factor(business_size, levels = c("Small", "Not Small")),
education = factor(
education,
levels = c("High School Graduate",
"College Student",
"College Graduate",
"No Education Provided")))
ChatGPT <- ChatGPT %>%
mutate(
location_urban = if_else(location == "Urban", 1, 0),
business_small = if_else(business_size == "Small", 1, 0),
edu_hs_grad = if_else(education == "High School Graduate", 1, 0),
edu_college_student = if_else(education == "College Student", 1, 0),
edu_college_grad = if_else(education == "College Graduate", 1, 0),
edu_no_education = if_else(education == "No Education Provided", 1, 0))
glimpse(ChatGPT)
## Rows: 150
## Columns: 15
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ overall_outlook <dbl> 5, 5, 4, 2, 3, 2, 4, 3, 3, 5, 4, 5, 4, 4, 4, 3,…
## $ job_market <dbl> 2, 5, 4, 5, 4, 2, 3, 4, 3, 4, 5, 4, 2, 5, 4, 5,…
## $ unemployment_stability <dbl> 1, 4, 4, 1, 4, 3, 4, 1, 3, 2, 2, 3, 4, 1, 2, 3,…
## $ inflation_stability <dbl> 4, 1, 4, 1, 5, 2, 2, 2, 3, 4, 4, 1, 3, 3, 4, 5,…
## $ business_climate <dbl> 3, 3, 4, 3, 3, 5, 5, 4, 4, 4, 5, 2, 2, 5, 3, 5,…
## $ location <fct> Urban, Rural, Rural, Urban, Urban, Rural, Urban…
## $ business_size <fct> Small, Small, Small, Not Small, Not Small, Not …
## $ education <fct> High School Graduate, No Education Provided, Hi…
## $ location_urban <dbl> 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1,…
## $ business_small <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ edu_hs_grad <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,…
## $ edu_college_student <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ edu_college_grad <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ edu_no_education <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,…
likert_points <- c("overall_outlook", "job_market", "unemployment_stability", "inflation_stability", "business_climate")
likert_df <- ChatGPT %>%
select(id, all_of(likert_points))
summary(likert_df)
## id overall_outlook job_market unemployment_stability
## Min. : 1.00 Min. :2.000 Min. :2.000 Min. :1.0
## 1st Qu.: 38.25 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:1.0
## Median : 75.50 Median :3.500 Median :4.000 Median :3.0
## Mean : 75.50 Mean :3.533 Mean :3.593 Mean :2.7
## 3rd Qu.:112.75 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:4.0
## Max. :150.00 Max. :5.000 Max. :5.000 Max. :5.0
## inflation_stability business_climate
## Min. :1.000 Min. :2.00
## 1st Qu.:2.000 1st Qu.:3.00
## Median :3.000 Median :3.00
## Mean :2.833 Mean :3.58
## 3rd Qu.:4.000 3rd Qu.:5.00
## Max. :5.000 Max. :5.00
For the midpoint bias analysis we are looking at how often a respondent selects the neutral option (3). This matters because heavy use of the midpoint can indicate uncertainty and indecision, as well as a poor synthetic response from the AI models that provided this data. First we reshape the data from wide to long format with pivot_longer() on the Likert data frame built in the previous section, keeping just three columns: id, item, and response. This lets us treat all 5 questions the same way and obtain summaries. Next we calculate item_freq_chatgpt by grouping the long data by item and response, counting how many times each response appears, and converting each count to a proportion by dividing n by the total number of responses for that item. With ggplot we can plot this item frequency table using geom_col() with facet_wrap() to make a bar chart for each question. The next step is to create a midpoint category with case_when() in order to identify heavy, moderate, and low midpoint users. Finally we visualize this with a histogram, which shows how respondents are distributed across different levels of midpoint use.
distributions_chatgpt <- likert_df %>%
pivot_longer(cols = all_of(likert_points),
names_to = "item",
values_to = "response")
item_freq_chatgpt <- distributions_chatgpt %>%
group_by(item, response) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(item) %>%
mutate(prop = n / sum(n))
item_freq_chatgpt
## # A tibble: 22 × 4
## # Groups: item [5]
## item response n prop
## <chr> <dbl> <int> <dbl>
## 1 business_climate 2 30 0.2
## 2 business_climate 3 46 0.307
## 3 business_climate 4 31 0.207
## 4 business_climate 5 43 0.287
## 5 inflation_stability 1 29 0.193
## 6 inflation_stability 2 40 0.267
## 7 inflation_stability 3 34 0.227
## 8 inflation_stability 4 21 0.14
## 9 inflation_stability 5 26 0.173
## 10 job_market 2 36 0.24
## # ℹ 12 more rows
ggplot(item_freq_chatgpt, aes(x = factor(response), y = prop)) +
geom_col(fill = "darkorange") +
facet_wrap(~ item, ncol = 2) +
labs(
title = "Response Distributions for Economic Expectation Items",
x = "Response (1 = Very Weak ... 5 = Very Strong)",
y = "Proportion of Responses") +theme_minimal()
midpoint_item <- distributions_chatgpt %>%
group_by(item) %>%
summarise(
midpoint_count = sum(response == 3),
total = n(),
midpoint_prop = midpoint_count / total)
midpoint_item
## # A tibble: 5 × 4
## item midpoint_count total midpoint_prop
## <chr> <int> <int> <dbl>
## 1 business_climate 46 150 0.307
## 2 inflation_stability 34 150 0.227
## 3 job_market 34 150 0.227
## 4 overall_outlook 39 150 0.26
## 5 unemployment_stability 34 150 0.227
ChatGPT <- ChatGPT %>%
rowwise() %>%
mutate(
midpoint_count = sum(c_across(all_of(likert_points)) == 3),
midpoint_prop = midpoint_count / length(likert_points)
) %>%
ungroup()
glimpse(ChatGPT[, c("midpoint_count", "midpoint_prop")])
## Rows: 150
## Columns: 2
## $ midpoint_count <int> 1, 1, 0, 1, 2, 1, 1, 1, 4, 0, 0, 1, 1, 1, 1, 2, 1, 1, 0…
## $ midpoint_prop <dbl> 0.2, 0.2, 0.0, 0.2, 0.4, 0.2, 0.2, 0.2, 0.8, 0.0, 0.0, …
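As a side note, the same per-respondent counts could also be computed without rowwise(). The short sketch below is only an illustrative alternative (not part of the original workflow) and assumes the five Likert columns are numeric, as the glimpse above shows; it should reproduce midpoint_count and midpoint_prop exactly.
# Vectorized alternative: compare each Likert item to 3 and count TRUEs per row
midpoint_check <- ChatGPT %>%
mutate(
midpoint_count_v = rowSums(across(all_of(likert_points), ~ .x == 3)),
midpoint_prop_v = midpoint_count_v / length(likert_points))
# Sanity check: both approaches should agree for every respondent
all(midpoint_check$midpoint_count_v == midpoint_check$midpoint_count)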
ChatGPT <- ChatGPT %>%
mutate(
midpoint_category = case_when(
midpoint_prop >= 0.8 ~ "Heavy midpoint user",
midpoint_prop >= 0.4 ~ "Moderate midpoint user",
TRUE ~ "Low midpoint user"
)
)
table(ChatGPT$midpoint_category)
##
## Heavy midpoint user Low midpoint user Moderate midpoint user
## 1 95 54
ggplot(ChatGPT, aes(x = midpoint_prop)) +
geom_histogram(binwidth = 0.2, boundary = 0, fill = "steelblue") +
labs(
title = "Distribution of Midpoint (3) Usage Across Respondents",
x = "Proportion of Items Answered with Midpoint (3)",
y = "Count of Respondents"
) +
theme_minimal()
In this final section I use the midpoint variables to compare respondent groups and connect midpoint behavior with overall economic sentiment. First, group_by(location) and summarise() compute the mean midpoint proportion, standard deviation, and group size, which answers whether urban and rural respondents differ in how often they choose the neutral option. I then used geom_boxplot() to visualize the comparison alongside a summary table of concrete numbers, and repeated the same steps for business size and education. Next, to describe the economic ratings themselves, I used psych::describe() on the 5 Likert items. This function outputs the mean, standard deviation, skewness, and kurtosis for each question, which captures overall use of the scale and shows whether answers were skewed in a specific direction or, again, whether there was midpoint bias. Lastly, I built an economic index as the rowwise mean of the 5 economic items. This compresses the 5 questions into a single summary score per respondent, and the summary() function then shows the distribution of this index across respondents. This will be useful for seeing the difference between models later on.
ChatGPT %>%
group_by(location) %>%
summarise(
mean_midpoint = mean(midpoint_prop),
sd_midpoint = sd(midpoint_prop),
n = n()
)
## # A tibble: 2 × 4
## location mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 Rural 0.252 0.188 80
## 2 Urban 0.246 0.181 70
ggplot(ChatGPT, aes(x = location, y = midpoint_prop)) +
geom_boxplot(fill = "purple") +
labs(title = "Midpoint Usage by Location") +
theme_minimal()
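As an optional check that goes one step beyond the summary table (a hypothetical follow-up, not part of the original assignment), a simple two-sample t-test could confirm whether the small urban/rural gap in midpoint usage is statistically meaningful.
# Hypothetical follow-up: does mean midpoint usage differ by location?
# With group means of roughly 0.25 on both sides, a large p-value is expected.
t.test(midpoint_prop ~ location, data = ChatGPT)
# The same call could be repeated for business_size, or for the Copilot
# data set later on using midpoint_prop2.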
# Business Size
ChatGPT %>%
group_by(business_size) %>%
summarise(
mean_midpoint = mean(midpoint_prop),
sd_midpoint = sd(midpoint_prop),
n = n()
)
## # A tibble: 2 × 4
## business_size mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 Small 0.233 0.196 79
## 2 Not Small 0.268 0.169 71
ggplot(ChatGPT, aes(x = business_size, y = midpoint_prop)) +
geom_boxplot(fill = "beige") +
labs(title = "Midpoint Usage by Business Size") +
theme_minimal()
# Education
ChatGPT %>%
group_by(education) %>%
summarise(
mean_midpoint = mean(midpoint_prop),
sd_midpoint = sd(midpoint_prop),
n = n()
)
## # A tibble: 4 × 4
## education mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 High School Graduate 0.244 0.166 36
## 2 College Student 0.275 0.188 32
## 3 College Graduate 0.233 0.211 42
## 4 No Education Provided 0.25 0.168 40
ggplot(ChatGPT, aes(x = education, y = midpoint_prop)) +
geom_boxplot(fill = "red") +
labs(title = "Midpoint Usage by Education Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
psych::describe(likert_df %>% select(-id))
## vars n mean sd median trimmed mad min max range
## overall_outlook 1 150 3.53 1.13 3.5 3.54 2.22 2 5 3
## job_market 2 150 3.59 1.15 4.0 3.62 1.48 2 5 3
## unemployment_stability 3 150 2.70 1.35 3.0 2.62 1.48 1 5 4
## inflation_stability 4 150 2.83 1.36 3.0 2.79 1.48 1 5 4
## business_climate 5 150 3.58 1.11 3.0 3.60 1.48 2 5 3
## skew kurtosis se
## overall_outlook -0.01 -1.41 0.09
## job_market -0.11 -1.44 0.09
## unemployment_stability 0.21 -1.19 0.11
## inflation_stability 0.25 -1.15 0.11
## business_climate -0.01 -1.37 0.09
ChatGPT <- ChatGPT %>%
mutate(
econ_index = rowMeans(select(., all_of(likert_points)), na.rm = TRUE)
)
summary(ChatGPT$econ_index)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.200 3.248 3.600 4.600
ggplot(ChatGPT, aes(x = econ_index)) +
geom_histogram(binwidth = 0.5, boundary = 0, fill = "darkgreen") +
labs(title = "Distribution of Overall Economic Outlook Index") +
theme_minimal()
The process was more or less the same for Copilot in each section. There were some differences because of the different column names and labels, but overall I performed the same steps that I did for the ChatGPT model.
Final_COPILOT_Response_ <- read_excel("Final COPILOT Response .xlsx")
Copilot <- Final_COPILOT_Response_
Copilot <- Copilot %>%
clean_names() %>%
mutate(
location = factor(
geographic_location,
levels = c("Rural", "Urban")
),
business_size = case_when(
business_size == "Small business" ~ "Small",
business_size == "Not a small business" ~ "Not Small",
TRUE ~ NA_character_
),
business_size = factor(
business_size,
levels = c("Small", "Not Small")
),
# Recode education levels to match the labels used in the ChatGPT data set
education = case_when(
education_level == "High school graduate" ~ "High School Graduate",
education_level == "College student" ~ "College Student",
education_level == "College graduate" ~ "College Graduate",
education_level == "No education" ~ "No Education Provided",
TRUE ~ NA_character_
),
education = factor(
education,
levels = c(
"High School Graduate",
"College Student",
"College Graduate",
"No Education Provided"
)
)
)
likert_points2 <- c(
"overall_economic_outlook",
"job_market_strength",
"unemployment_rate_stability",
"inflation_stability",
"business_climate_for_companies"
)
names(Copilot)
## [1] "respondent_id" "overall_economic_outlook"
## [3] "job_market_strength" "unemployment_rate_stability"
## [5] "inflation_stability" "business_climate_for_companies"
## [7] "geographic_location" "business_size"
## [9] "education_level" "location"
## [11] "education"
likert_df2 <- Copilot %>%
select(respondent_id, all_of(likert_points2))
summary(likert_df2)
## respondent_id overall_economic_outlook job_market_strength
## Min. : 1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:2.000 1st Qu.:2.000
## Median : 75.50 Median :3.000 Median :3.000
## Mean : 75.50 Mean :2.987 Mean :3.193
## 3rd Qu.:112.75 3rd Qu.:3.750 3rd Qu.:4.000
## Max. :150.00 Max. :5.000 Max. :5.000
## unemployment_rate_stability inflation_stability business_climate_for_companies
## Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :3.000
## Mean :3.16 Mean :2.833 Mean :3.147
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.00 Max. :5.000 Max. :5.000
Copilot <- Copilot %>%
rowwise() %>%
mutate(
midpoint_count2 = sum(c_across(all_of(likert_points2)) == 3),
midpoint_prop2 = midpoint_count2 / length(likert_points2)
) %>%
ungroup()
glimpse(Copilot[, c("midpoint_count2", "midpoint_prop2")])
## Rows: 150
## Columns: 2
## $ midpoint_count2 <int> 2, 0, 2, 3, 1, 2, 2, 2, 4, 2, 4, 1, 5, 1, 4, 0, 2, 0, …
## $ midpoint_prop2 <dbl> 0.4, 0.0, 0.4, 0.6, 0.2, 0.4, 0.4, 0.4, 0.8, 0.4, 0.8,…
Copilot <- Copilot %>%
mutate(
midpoint_category2 = case_when(
midpoint_prop2 >= 0.8 ~ "Heavy midpoint user",
midpoint_prop2 >= 0.4 ~ "Moderate midpoint user",
TRUE ~ "Low midpoint user"))
table(Copilot$midpoint_category2)
##
## Heavy midpoint user Low midpoint user Moderate midpoint user
## 8 55 87
ggplot(Copilot, aes(x = midpoint_prop2)) +
geom_histogram(binwidth = 0.2, boundary = 0, fill = "steelblue") +
labs(
title = "Distribution of Midpoint (3) Usage Across Respondents",
x = "Proportion of Midpoint Responses",
y = "Count of Respondents"
) +
theme_minimal()
distributions_copilot <- likert_df2 %>%
pivot_longer(cols = all_of(likert_points2),
names_to = "item",
values_to = "response")
item_freq_copilot <- distributions_copilot %>%
group_by(item, response) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(item) %>%
mutate(prop = n / sum(n))
ggplot(item_freq_copilot, aes(x = factor(response), y = prop)) +
geom_col(fill = "darkorange") +
facet_wrap(~ item, ncol = 2) +
labs(
title = "Response Distributions for Economic Expectation Items",
x = "Response (1 = Very Weak ... 5 = Very Strong)",
y = "Proportion"
) +
theme_minimal()
Copilot %>%
group_by(location) %>%
summarise(
mean_midpoint = mean(midpoint_prop2),
sd_midpoint = sd(midpoint_prop2),
n = n()
)
## # A tibble: 2 × 4
## location mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 Rural 0.330 0.209 43
## 2 Urban 0.389 0.207 107
ggplot(Copilot, aes(x = location, y = midpoint_prop2)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Midpoint Usage by Location",
x = "Location",
y = "Midpoint Proportion"
) +
theme_minimal()
Copilot %>%
group_by(business_size) %>%
summarise(
mean_midpoint = mean(midpoint_prop2),
sd_midpoint = sd(midpoint_prop2),
n = n()
)
## # A tibble: 2 × 4
## business_size mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 Small 0.379 0.192 105
## 2 Not Small 0.356 0.245 45
ggplot(Copilot, aes(x = business_size, y = midpoint_prop2)) +
geom_boxplot(fill = "lightgreen") +
labs(
title = "Midpoint Usage by Business Size",
x = "Business Size",
y = "Midpoint Proportion"
) +
theme_minimal()
Copilot %>%
group_by(education) %>%
summarise(
mean_midpoint = mean(midpoint_prop2),
sd_midpoint = sd(midpoint_prop2),
n = n()
)
## # A tibble: 4 × 4
## education mean_midpoint sd_midpoint n
## <fct> <dbl> <dbl> <int>
## 1 High School Graduate 0.357 0.208 47
## 2 College Student 0.371 0.251 41
## 3 College Graduate 0.404 0.157 45
## 4 No Education Provided 0.329 0.223 17
ggplot(Copilot, aes(x = education, y = midpoint_prop2)) +
geom_boxplot(fill = "plum") +
labs(
title = "Midpoint Usage by Education",
x = "Education Level",
y = "Midpoint Proportion"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
psych::describe(likert_df2 %>% select(-respondent_id))
## vars n mean sd median trimmed mad min max
## overall_economic_outlook 1 150 2.99 0.98 3 2.98 1.48 1 5
## job_market_strength 2 150 3.19 1.12 3 3.22 1.48 1 5
## unemployment_rate_stability 3 150 3.16 1.12 3 3.18 1.48 1 5
## inflation_stability 4 150 2.83 1.19 3 2.80 1.48 1 5
## business_climate_for_companies 5 150 3.15 1.18 3 3.17 1.48 1 5
## range skew kurtosis se
## overall_economic_outlook 4 0.03 -0.09 0.08
## job_market_strength 4 -0.21 -0.70 0.09
## unemployment_rate_stability 4 -0.11 -0.55 0.09
## inflation_stability 4 0.08 -0.84 0.10
## business_climate_for_companies 4 -0.04 -0.80 0.10
Copilot <- Copilot %>%
mutate(
econ_index = rowMeans(select(., all_of(likert_points2)), na.rm = TRUE))
summary(Copilot$econ_index)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800 2.800 3.000 3.064 3.400 4.200
ggplot(Copilot, aes(x = econ_index)) +
geom_histogram(binwidth = 0.5, fill = "skyblue") +
labs(
title = "Distribution of Overall Economic Outlook Index",
x = "Economic Index (Mean of 5 Items)",
y = "Count of Respondents"
) +
theme_minimal()
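Before drawing conclusions, it helps to put the two synthetic data sets side by side. The sketch below assumes the ChatGPT and Copilot objects built in the chunks above are still in memory, and simply stacks the respondent-level midpoint proportions and economic index values so both models can be compared on the same quantities discussed next.
# Stack per-respondent summaries from both models into one comparison frame
model_comparison <- bind_rows(
ChatGPT %>% transmute(model = "ChatGPT", midpoint_prop, econ_index),
Copilot %>% transmute(model = "Copilot", midpoint_prop = midpoint_prop2, econ_index))
# Mean midpoint usage and mean economic index for each model
model_comparison %>%
group_by(model) %>%
summarise(
mean_midpoint = mean(midpoint_prop),
mean_econ_index = mean(econ_index),
n = n())
# Visual comparison of midpoint usage across the two models
ggplot(model_comparison, aes(x = model, y = midpoint_prop)) +
geom_boxplot(fill = "grey80") +
labs(title = "Midpoint Usage by Model", y = "Proportion of Midpoint Responses") +
theme_minimal()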
Through our analysis of both data sets, we revealed differences in how respondents engaged with our economic expectation survey. Another interesting aspect of this analysis is that we also got to see the differences in sentiment expressed across the demographics we chose.
With regard to the midpoint analysis for both data sets, in the ChatGPT data most respondents fell into the low midpoint usage category, indicating clearer opinions and more decisive responses. The drawback, however, is that the most pessimistic rating of 1 never appears on several items (overall outlook, job market, and business climate), which could skew the data in a certain direction. In the Copilot data, by comparison, midpoint usage was noticeably higher across respondents, which suggests uncertainty as well as possibly a faulty data set biased toward the midpoint. When we break the data down by demographics like location, business size, and education, the ChatGPT data shows a more even split, whereas in the Copilot data the answers more often swing one way depending on the demographic. Depending on the user this may not matter, but it signals that demographics influence the answers, which is very often true in real-world scenarios.
When evaluating overall economic sentiment through the economic index, both data sets suggested moderate optimism regarding short-term economic conditions in Connecticut. So what does this tell us about our project, and what should we take away?
Taken together, the findings show that surveys generated by the two models produced different levels of respondent decisiveness. In addition, the demographic differences can affect how a reader interprets the results. The final clear difference was the linguistic and stylistic approach each model used to create the data. Ultimately, in terms of midpoint bias, Copilot shows it far more frequently than ChatGPT. The flip side, however, is that ChatGPT failed to illustrate any real differences between demographics or any effect of demographics on how the questions were answered. Neither set of synthetic responses was perfect, and it is up to a user's discretion and project goals to decide which model better supports their needs.
Understanding how respondents use survey scales is just as important to researchers as what respondents actually say. In economic sentiment research, the goal is to find meaningful signals of uncertainty, optimism, or pessimism. This project shows that a data set with heavy midpoint usage may suggest hesitation or a lack of confidence about current conditions, while low midpoint usage reflects clearer expectations and predictions from respondents. Anyone working with these surveys will use these patterns to understand not only current public sentiment but also the level of certainty behind the answers. This is something we studied extensively throughout this unsupervised machine learning course, and it is a big reason I felt the need to include it in addition to the obvious implications of using different language models.
The other aspect of this assignment, as I briefly mentioned, was understanding data quality. Consistent demographic tagging, midpoint bias, skewness, and more all play a huge role in accuracy and in the ability to draw true insights from the data. Each generated data set had its strong points and its weak points. This is important to understand because these weak points limit our ability to draw data-driven conclusions, which undermines the whole point of the survey in the first place. In addition, we saw that synthetic responses may not be a viable option at the moment due to the limitations of each language model.
To conclude, through this project we did the job that many data analysts face when asked to make sense of survey responses and public sentiment: knowing what midpoint bias is and what it implies when it appears in the data. Our other takeaway is that the AI-generated synthetic responses were actionable and each had their own unique characteristics, although I would not call either viable until they are tested against real respondents. It would definitely be an interesting follow-up to add real data, compare accuracy, and see the differences from our synthetic answers; that would tell us how viable the data we received actually was.