── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the midwest dataset
data("midwest")

# Summarizing population statistics by state
population_summary <- midwest %>%
  group_by(state) %>%  # Group the data by state
  summarise(
    poptotalmean = mean(poptotal),          # Calculate the average total population for each state
    poptotalmed  = median(poptotal),        # Calculate the median total population for each state
    popmax       = max(poptotal),           # Find the maximum total population for each state
    popmin       = min(poptotal),           # Find the minimum total population for each state
    popdistinct  = n_distinct(poptotal),    # Count the number of distinct total population values
    popfirst     = first(poptotal),         # Get the first total population value for each state
    popany       = any(poptotal < 5000),    # Check if any total population values are less than 5000
    popany2      = any(poptotal > 2000000)  # Check if any total population values are greater than 2,000,000
  ) %>%
  ungroup()  # Remove grouping structure

# Display the summarized population data
print(population_summary)
# Load the tidyverse package
library(tidyverse)

# Load the midwest dataset
data("midwest")

# Counting counties based on population thresholds
population_count_summary <- midwest %>%
  group_by(state) %>%  # Group the data by state
  summarise(
    num5k   = sum(poptotal < 5000),     # Count counties with a total population less than 5000
    num2mil = sum(poptotal > 2000000),  # Count counties with a total population greater than 2,000,000
    numrows = n()                       # Count the total number of counties in each state
  ) %>%
  ungroup()  # Remove grouping structure

# Display the summarized population counts
print(population_count_summary)
# A tibble: 5 × 4
state num5k num2mil numrows
<chr> <int> <int> <int>
1 IL 1 1 102
2 IN 0 0 92
3 MI 1 1 83
4 OH 0 0 88
5 WI 2 0 72
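The counts in `num5k` and `num2mil` work because a comparison such as `poptotal < 5000` returns a logical vector, and `sum()` treats `TRUE` as 1 and `FALSE` as 0. A minimal sketch with a made-up vector (`pop` and `below_5k` are hypothetical names and values, not taken from `midwest`):

```r
# Hypothetical populations for five counties (illustrative values only)
pop <- c(1200, 48000, 3100, 2500000, 76000)

# Logical vector: TRUE where the population is below 5000
below_5k <- pop < 5000
print(below_5k)

# sum() counts the TRUE values: how many counties are below 5000
print(sum(below_5k))

# any() only reports whether at least one value is TRUE
print(any(below_5k))

# mean() gives the proportion of TRUE values
print(mean(below_5k))
```

This is the difference between `popany` in the earlier summary, which answers "is there any such county?", and `num5k`, which answers "how many such counties are there?".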
Problem C
# Counting distinct states per county
distinct_states_count <- midwest %>%
  group_by(county) %>%                  # Group by county
  summarize(x = n_distinct(state)) %>%  # Count distinct states in each county
  arrange(desc(x)) %>%                  # Arrange by count in descending order
  ungroup()                             # Remove grouping

# Display the results for Part I
print(distinct_states_count)
# A tibble: 320 × 2
county x
<chr> <int>
1 CRAWFORD 5
2 JACKSON 5
3 MONROE 5
4 ADAMS 4
5 BROWN 4
6 CLARK 4
7 CLINTON 4
8 JEFFERSON 4
9 LAKE 4
10 WASHINGTON 4
# ℹ 310 more rows
# Counting total rows per county
total_count_per_county <- midwest %>%
  group_by(county) %>%      # Group by county
  summarize(x = n()) %>%    # Count total rows in each county
  ungroup()                 # Remove grouping

# Display the results for Part II
print(total_count_per_county)
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 4
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 2
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 2
# ℹ 310 more rows
# Counting distinct counties in each county (should always be 1)
distinct_counties_count <- midwest %>%
  group_by(county) %>%                   # Group by county
  summarize(x = n_distinct(county)) %>%  # Count distinct counties in each county
  ungroup()                              # Remove grouping

# Display the results for Part III
print(distinct_counties_count)
# A tibble: 320 × 2
county x
<chr> <int>
1 ADAMS 1
2 ALCONA 1
3 ALEXANDER 1
4 ALGER 1
5 ALLEGAN 1
6 ALLEN 1
7 ALPENA 1
8 ANTRIM 1
9 ARENAC 1
10 ASHLAND 1
# ℹ 310 more rows
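The three parts of Problem C differ only in what is counted inside each county group. A small sketch on made-up data may make the difference easier to see; `toy` and its values are hypothetical and only mimic the shape of `midwest`, where the same county name can appear in several states.

```r
library(tidyverse)

# Hypothetical toy data: one row per county-state pair
toy <- tibble(
  county = c("ADAMS", "ADAMS", "ADAMS", "ALCONA", "ALLEN", "ALLEN"),
  state  = c("IL", "IN", "OH", "MI", "IN", "OH")
)

toy_counts <- toy %>%
  group_by(county) %>%
  summarise(
    n_rows            = n(),                # Part II: rows in each group
    distinct_states   = n_distinct(state),  # Part I: different states sharing the county name
    distinct_counties = n_distinct(county)  # Part III: always 1, since each group holds one name
  )

print(toy_counts)
```

In this toy data every row is a unique county-state pair, so `n()` and `n_distinct(state)` agree. The outputs above show the same pattern for ADAMS (4 in both Part I and Part II), which suggests that a value above 1 simply means the county name is shared by several states.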
Notes: The code runs, but I still don't really understand what the results mean. Am I doing this right, or am I missing something? I am going to finish these questions here, since I know how to write the code even though I don't yet understand the output. I will then go back, do some reading, and try to work out what it all means. Feedback would be appreciated!
Good and Bad Questions About the Diamonds Dataset
In this section, I will explore the principles of formulating effective questions by writing one good and one bad question about the diamonds dataset.
Good Question
Question: What is the average price of diamonds for each cut, and how does this vary by clarity?
Why This is a Good Question:
Specific and Focused: It clearly defines the variables of interest (price, cut, and clarity).
Quantitative Analysis: It invites a quantitative analysis that can be explored using summary statistics, making it actionable.
Comparative Aspect: It allows for comparisons between different cuts and clarities, leading to more insightful conclusions.
# Example code to answer the good question
diamonds_summary <- diamonds %>%
  group_by(cut, clarity) %>%
  summarize(average_price = mean(price, na.rm = TRUE)) %>%
  arrange(cut, clarity)
`summarise()` has grouped output by 'cut'. You can override using the `.groups`
argument.
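The message above is informational rather than an error: `summarise()` drops the innermost grouping (`clarity`), so the result is still grouped by `cut`. One way to make this explicit, assuming an ungrouped result is what is wanted, is to pass `.groups = "drop"`. The `pivot_wider()` step is an optional addition for reading the comparison as a cut-by-clarity table, and `diamonds_wide` is an illustrative name.

```r
library(tidyverse)

# Same summary as above, but returning an ungrouped tibble (no message)
diamonds_summary <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(average_price = mean(price, na.rm = TRUE), .groups = "drop")

# Optional: one row per cut and one column per clarity, for side-by-side comparison
diamonds_wide <- diamonds_summary %>%
  pivot_wider(names_from = clarity, values_from = average_price)

print(diamonds_wide)
```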
Bad Question
Question: Why are diamonds expensive?
Why This is a Bad Question:
Vague and Subjective: The question is too broad and does not specify which factors might influence price.
Not Quantifiable: It does not provide a clear path for analysis.
Lacks Context: Without a defined scope (size, cut, colour), it can lead to confusion.
Instead of asking why diamonds are expensive, a more effective question might be:
What factors are significantly associated with the price of diamonds? This question directs the analysis towards specific, measurable variables and allows for a more focused investigation.
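One possible way to start on the improved question is a linear regression. The sketch below is only an illustration: the choice of `lm()`, the log transformations, and the four predictors are assumptions made here, not part of the original exercise, and `price_model` is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset

# Assumed model: log price explained by log carat plus the three quality grades
price_model <- lm(log(price) ~ log(carat) + cut + color + clarity, data = diamonds)

# F-test for each variable as a whole, given the other variables in the model
print(drop1(price_model, test = "F"))
```

`drop1()` tests each variable by comparing the full model with a model that leaves that variable out, which avoids having to interpret the individual contrast coefficients that `lm()` produces for ordered factors such as `cut`.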