This is assignment 1. Answer the following questions below. You must turn in this file as an .html file (uploaded to Canvas) by the assignment deadline. Please note the following:
setwd("C:/Users/heige/OneDrive/Desktop/2075/Data")
load("PSCI_2075_v2.1.RData")
head(states)
## state st raperate murderrate abort density ineq region gunfree
## 1 Alabama AL 31.9 6.9 16 86.17970 46.01423 South 0.444921
## 2 Alaska AK 73.3 3.1 15 1.08848 34.18490 West 0.932850
## 3 Arizona AZ 32.0 5.4 20 42.12770 49.00724 West 0.969988
## 4 Arkansas AR 47.3 6.2 11 49.10960 46.83973 South 0.520867
## 5 California CA 23.6 5.3 33 211.97700 51.47233 West -3.270630
## 6 Colorado CO 44.6 3.5 21 39.08800 44.06749 West 0.503762
## alcfree mjfree marrfree freedom knowgov evangel poptotal stuspend
## 1 -0.024292 -0.054232 -0.009304 21.6063 63.76 42.8 4700000 5273
## 2 -0.003965 0.076145 -0.010426 28.5846 68.44 18.7 686293 8599
## 3 0.016018 0.015321 -0.009304 32.4745 52.10 18.1 6500000 4785
## 4 0.005779 -0.012482 -0.009304 -5.7844 67.08 39.9 2900000 5140
## 5 0.015981 0.043902 0.041186 -85.7562 88.31 11.5 37000000 5685
## 6 0.018588 0.025245 0.015941 20.4615 50.54 14.8 4900000 5061
## ptratio hsdiploma democrat pid house senate inc minwage year
## 1 15.77 77.5 30.54249 -0.033898 -1 0.976 34650 7.25 2016
## 2 16.29 90.4 26.38180 -0.350000 0 0.638 45529 7.75 2016
## 3 20.75 85.1 28.86411 -0.102564 0 0.576 35875 7.90 2016
## 4 12.90 81.7 27.68069 0.084507 0 0.816 34014 6.25 2016
## 5 19.80 81.2 37.08743 0.179825 -1 -1.395 44481 9.00 2016
## 6 16.97 89.7 26.85935 0.125000 0 -0.528 44088 8.00 2016
## polscore newimmig popover65 percwom medinc turnout margin co2 femleg
## 1 -0.073118 4063 657792 75.49 42590 0.59 59.5446 130 11.4
## 2 0.036343 1799 54938 79.02 57431 0.59 52.4528 43 18.3
## 3 -0.058591 20333 881831 84.00 48621 0.53 37.1173 88 33.3
## 4 -0.022314 2874 419981 88.52 41302 0.51 78.2980 62 17.0
## 5 -0.088499 210591 4200000 89.94 53367 0.55 35.9504 368 30.8
## 6 -0.002005 13547 549625 79.57 58629 0.70 32.5394 88 32.0
## corrupt infant trumpwin weed death stand obamawin
## 1 23 9.53 1 0 1 1 0
## 2 NA 5.93 1 1 0 1 1
## 3 18 6.85 1 1 1 1 0
## 4 8 7.84 1 1 1 0 0
## 5 52 5.32 0 1 1 0 1
## 6 6 6.44 0 1 0 0 1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.2 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
This section is designed to build familiarity with the data we’re going to use this semester.
Using either the states or world dataset,
pick one variable. Describe this variable. Is it continuous?
Categorical? Are there missing values or other interesting features?
[AI RESTRICTED, nocode]
I picked the density variable. This variable is continuous because it is a measured quantity that can take any non-negative value rather than a fixed set of categories. There are no missing values: every state has a density.
Calculate the means and standard deviations by some grouping variable. Why did you pick this grouping variable, and what is it telling you about your variable? [AI PERMITTED]
mean_density <- mean(states$density, na.rm = TRUE)
print(mean_density)
## [1] 176.7743
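I picked this variable because I found it interesting in a different political science class I am taking. The code above computes the overall mean without a grouping variable: the average state-level density is about 176.77. Because this is an unweighted mean across states, it is not the same as the density of the U.S. as a whole.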
sd_density <- sd(states$density, na.rm = TRUE)
print(sd_density)
## [1] 238.5726
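The standard deviation is about 238.57, which is larger than the mean (176.77). This means state densities vary widely, and it suggests the distribution is right-skewed, with a few very dense states pulling the mean upward.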
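The question asks for the mean and standard deviation by a grouping variable, while the code above computes them for all states at once. Below is a minimal sketch of a grouped version, assuming `region` is the grouping variable. The `toy_states` data frame is hypothetical: it copies only the six rows printed by `head(states)` so the example runs on its own, and the object names `toy_states` and `density_by_region` are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for the states data: the six rows shown by head(states)
toy_states <- data.frame(
  state   = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region  = c("South", "West", "West", "South", "West", "West"),
  density = c(86.17970, 1.08848, 42.12770, 49.10960, 211.97700, 39.08800)
)

# Mean and standard deviation of density within each region
density_by_region <- toy_states %>%
  group_by(region) %>%
  summarise(
    mean_density = mean(density, na.rm = TRUE),
    sd_density   = sd(density, na.rm = TRUE),
    n_states     = n()
  )

print(density_by_region)
```

On the full data, replacing `toy_states` with `states` should give one row per region, which makes it possible to compare both the typical density and its spread across regions.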
Now visualize the variable, by group, using an appropriate plot. Why did you pick the visualization method you did? What quantity(ies) did you show, and why? [AI PERMITTED]
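ggplot(states, aes(x = density)) +
  # bins = 30 adapts to the range of density; binwidth = 5 is very narrow for values that reach the hundreds
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Density",
       x = "Density",
       y = "Count") +
  theme_minimal()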
I picked a histogram because it shows the shape of a continuous variable's distribution. The x-axis shows density and the y-axis shows the number of states in each bin, which gives a broad view of the data. This plot shows all states together rather than by group.
Ask your AI to produce a different visualization of this variable. What did it select? Do you understand what this plot is showing you, and how is it different from the plot you made above? [AI PERMITTED]
ggplot(states, aes(x = density, fill = region)) +
  geom_density(alpha = 0.6) +
  labs(title = "Density Distribution by Region",
       x = "Population Density",
       y = "Density") +
  theme_minimal()
It selected a kernel density plot broken down by region. Each curve is a smoothed estimate of how the density values are distributed within one region. It is different from the plot I made above because it compares the regions using smooth curves instead of counting all the states together in bins.
Coding style is tough to learn but important. Ask your AI to list 5 good practices to use when coding in R. What did it say? Were there any that surprised you? [AI PERMITTED]
You said: list 5 good practices to use when coding in R
ChatGPT said:
Absolutely! Good coding practices in R help make your code clean, readable, reproducible, and efficient. Here are five essential ones:
1️⃣ Use clear and consistent variable names
Avoid vague names like x or df1.
Prefer descriptive names: murder_rate, population_total, state_data.
Use snake_case (like_this) or camelCase (likeThis) consistently.
Example:
mean_murderrate <- mean(states$murderrate, na.rm = TRUE)
2️⃣ Comment your code
Explain why you’re doing something, not just what you’re doing.
Helps others (and future you!) understand your logic.
Example:
# Calculate average murder rate for each region
region_avg <- states %>%
  group_by(region) %>%
  summarise(mean_murderrate = mean(murderrate, na.rm = TRUE))
3️⃣ Use packages thoughtfully and load them at the top
Load all libraries at the start of your script (library(dplyr), library(ggplot2)).
This avoids “function not found” errors later.
Example:
library(dplyr)
library(ggplot2)
4️⃣ Avoid hardcoding values
Use variables instead of repeating numbers or column names multiple times.
Makes your code easier to maintain and modify.
Example:
group_col <- "region"
value_col <- "murderrate"
states %>%
  group_by(.data[[group_col]]) %>%
  summarise(mean_value = mean(.data[[value_col]], na.rm = TRUE))
5️⃣ Keep your code modular and readable
Break long code into small, logical steps or functions.
Avoid writing one huge line that does everything.
Use pipes (%>%) for readability in data manipulation.
Example:
region_stats <- states %>%
  group_by(region) %>%
  summarise(
    mean_murderrate = mean(murderrate, na.rm = TRUE),
    sd_murderrate = sd(murderrate, na.rm = TRUE)
  )
None of these surprised me; they all seem clear and sensible.
The rest of this document is [AI LIMITED]: you must first ask the following prompt before answering these questions if you choose to use AI: “You are a tutoring AI helping students work through R code and statistics. For the rest of this conversation, I will provide you a problem or question, and you will ask me leading questions, provide examples, or analogies to help me on my own, but you will not provide an immediate answer. You will ask me to explain my thinking. You can give hints, but it is better to try to help me work through the solution using leading questions and examples. You are supposed to help me learn through example, not simply provide a solution. Do you understand?”
This code is returning an error. Diagnose the problem. What was the issue? [AI LIMITED]
states %>%
group_by(region) %>%
summarize(mean(murderrate)) %>%
arrange(desc)
The error comes from arrange(desc). desc is a function, so writing it with nothing in its parentheses passes the function itself to arrange() instead of a column to sort by. The summary also needs a name so it can be referenced afterward. I would write summarize(mean_murder = mean(murderrate)) and then arrange(desc(mean_murder)). The column inside desc() has to be the new summary column, because murderrate no longer exists after summarize().
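A minimal sketch of one way to repair the pipeline, assuming the goal is to sort regions from highest to lowest mean murder rate. The `toy_states` data frame is hypothetical (it reuses the `region` and `murderrate` values printed by `head(states)`), and the column name `mean_murder` is an illustrative choice.

```r
library(dplyr)

# Hypothetical toy data using region and murderrate from head(states)
toy_states <- data.frame(
  region     = c("South", "West", "West", "South", "West", "West"),
  murderrate = c(6.9, 3.1, 5.4, 6.2, 5.3, 3.5)
)

region_murder <- toy_states %>%
  group_by(region) %>%
  # Name the summary column so it can be referenced in arrange()
  summarize(mean_murder = mean(murderrate, na.rm = TRUE)) %>%
  # desc() is a function and needs the column to sort by
  arrange(desc(mean_murder))

print(region_murder)
```

With the full data, swapping `toy_states` for `states` should run without the original error.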
This code doesn’t return an error, but the plot looks ‘weird’. How might you fix this? [AI LIMITED]
ggplot(nes, aes(pid7, ftmuslim)) + geom_point() +
theme_minimal() + theme(axis.text.x = element_text(size = 8, angle = 45, vjust = .7))
labs(title = paste("Feelings Towards Muslims")) + ylab("Muslim Thermometer") + xlab("Party Identification")
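pid7 is a categorical variable, so a scatter plot stacks all the points in vertical strips at each party identification category and hides how the thermometer scores are distributed. To fix this I would make sure pid7 is treated as categorical in the code, and I would change the graph from a scatter plot to a box plot, which summarizes ftmuslim within each category. Also, as written, the line ending in vjust = .7)) has no + after it, so the labs(), ylab(), and xlab() calls on the next line are not attached to the plot.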
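The fix described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_nes` is a hypothetical stand-in for the `nes` data, and its category labels and thermometer values are invented so the example runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for the nes data; labels and values are made up
toy_nes <- data.frame(
  pid7     = rep(c("Strong Democrat", "Independent", "Strong Republican"), each = 4),
  ftmuslim = c(70, 85, 60, 90, 50, 55, 40, 65, 30, 45, 20, 50)
)

# Treat pid7 as a factor with an explicit order so the x-axis is categorical
toy_nes$pid7 <- factor(
  toy_nes$pid7,
  levels = c("Strong Democrat", "Independent", "Strong Republican")
)

p <- ggplot(toy_nes, aes(x = pid7, y = ftmuslim)) +
  geom_boxplot() +
  theme_minimal() +
  # The trailing + keeps the labels attached to the plot
  theme(axis.text.x = element_text(size = 8, angle = 45, vjust = .7)) +
  labs(
    title = "Feelings Towards Muslims",
    x = "Party Identification",
    y = "Muslim Thermometer"
  )

print(p)
```

A box plot shows the median and spread of the thermometer scores within each category, which a scatter plot hides when many points overlap.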
Data rarely come in the format we like. Either the data do not align with the theory we are interested in exploring or the variables are so skewed that it’s extremely hard to discover any underlying patterns. This exercise takes one common problem we face in data analysis and asks you to make the proper transformation in order to answer a specific question or to discover an important underlying pattern you may not have observed otherwise.
Using the world dataset, summarize GDP per capita
(gdppc) and create a histogram. What do you notice? [AI
LIMITED]
GDP per capita is the value of every good and service produced in a country divided by population.
ggplot(world, aes(x = gdppc)) +
  # bins = 30 adapts to the range of gdppc; a fixed binwidth of 5 is very narrow for this variable
  geom_histogram(bins = 30, fill = "lightgreen", color = "blue") +
  labs(title = "Histogram of GDP Per Capita",
       x = "GDPPC",
       y = "Count") +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).
I notice that most of the values are concentrated on the left side of the graph, with a long, thin tail to the right. The distribution is strongly right-skewed: many countries have low GDP per capita and only a few have very high values, which reflects the wealth disparity across countries.
Take the log of gdppc and redo the above. What does
taking the log do? Would taking the log of health be a good
idea? [AI LIMITED]
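# Keep only rows with positive gdppc, since the log is undefined for zero or negative values.
# Storing the result in a new object leaves the original world data unchanged.
world_pos <- world %>% filter(gdppc > 0)
ggplot(world_pos, aes(x = log10(gdppc))) +
  geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "blue") +
  labs(title = "Histogram of Log10 GDP Per Capita",
       x = "Log10 GDP Per Capita",
       y = "Count") +
  theme_minimal()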
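Taking the log of a variable compresses large values much more than small ones, which is helpful for variables like GDPPC that have a few very large values. It pulls in the long right tail so the distribution is easier to read. I don't think taking the log of health would be a good idea because its values do not span a wide enough range to warrant a log transformation.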
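A small worked example of the compression described above. The three values are hypothetical, chosen so that each is ten times the previous one; they are not taken from the world dataset.

```r
# Hypothetical GDP per capita values, each ten times the previous one
gdppc_example <- c(500, 5000, 50000)

# On the raw scale the gaps between neighbors grow very quickly
raw_gaps <- diff(gdppc_example)

# On the log10 scale each tenfold increase becomes a step of 1
log_gaps <- diff(log10(gdppc_example))

print(raw_gaps)
print(log_gaps)
```

This is why a right-skewed variable that spans several orders of magnitude tends to look more symmetric after logging, and why a variable with a narrow range gains little from the transformation.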
If I wanted to subset the data to only calculate summary statistics
for gdppc for Latin America (see aclpregion),
what would be the best way to do this? [AI
LIMITED]
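The best way to do this would be to use the filter() command from dplyr. Inside the parentheses you write a logical condition on aclpregion that matches the Latin America category, such as aclpregion == "Latin America" (the exact label depends on how the region is coded in the data), so that only the countries from that region are kept. The summary statistics for gdppc can then be calculated on the filtered data.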
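A minimal sketch of that subsetting step. Everything in `toy_world` is hypothetical: the country names are placeholders, the `gdppc` values are invented, and the region label "Latin America" is an assumption about how `aclpregion` is coded.

```r
library(dplyr)

# Hypothetical stand-in for the world data; labels and values are invented
toy_world <- data.frame(
  country    = c("A", "B", "C", "D", "E"),
  aclpregion = c("Latin America", "Latin America", "Western Europe",
                 "Latin America", "Sub-Saharan Africa"),
  gdppc      = c(8000, 12000, 40000, NA, 1500)
)

latin_america_gdppc <- toy_world %>%
  # Keep only the rows whose region matches the assumed label
  filter(aclpregion == "Latin America") %>%
  summarise(
    n_countries = n(),
    mean_gdppc  = mean(gdppc, na.rm = TRUE),
    sd_gdppc    = sd(gdppc, na.rm = TRUE)
  )

print(latin_america_gdppc)
```

Before running this on the real data, `unique(world$aclpregion)` shows the actual labels so the condition inside `filter()` can be matched to them.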
You must include links to share all your chat history for any AIs used for this assignment. Use the format example below to link your chat (your AI should have a “Share Chat” button):