[Heigen Young]

Introduction

This is assignment 1. Answer the questions below. You must turn in this file as an .html file (uploaded to Canvas) by the assignment deadline. Please note the following:

  1. Some questions require you to use AI [AI PERMITTED], others limited or conditional AI [AI LIMITED], other questions you cannot use AI [AI RESTRICTED].
  2. Do not simply copy-paste your answers from an AI. You should be reading what the output is doing and making sure you know what the code’s purpose is (if you’re not sure you should ask the AI/instructor/TA).
  3. Sometimes you might need to embed code that does not work. To show code but not run it, include “eval = FALSE” in the code chunk options.
  4. You must include the chat history of any AIs you use in the last section of this document
  5. You must include embedded R code for anything but “nocode” questions using R markdown’s code chunks
  6. All datasets have a codebook on Canvas
  7. Often it’s easier to write your code in a simple .R script and then transfer the code over to this markdown file once you’re ready (compiling can be slow)
setwd("C:/Users/heige/OneDrive/Desktop/2075/Data") 
load("PSCI_2075_v2.1.RData")
head(states)
##        state st raperate murderrate abort   density     ineq region   gunfree
## 1    Alabama AL     31.9        6.9    16  86.17970 46.01423  South  0.444921
## 2     Alaska AK     73.3        3.1    15   1.08848 34.18490   West  0.932850
## 3    Arizona AZ     32.0        5.4    20  42.12770 49.00724   West  0.969988
## 4   Arkansas AR     47.3        6.2    11  49.10960 46.83973  South  0.520867
## 5 California CA     23.6        5.3    33 211.97700 51.47233   West -3.270630
## 6   Colorado CO     44.6        3.5    21  39.08800 44.06749   West  0.503762
##     alcfree    mjfree  marrfree  freedom knowgov evangel poptotal stuspend
## 1 -0.024292 -0.054232 -0.009304  21.6063   63.76    42.8  4700000     5273
## 2 -0.003965  0.076145 -0.010426  28.5846   68.44    18.7   686293     8599
## 3  0.016018  0.015321 -0.009304  32.4745   52.10    18.1  6500000     4785
## 4  0.005779 -0.012482 -0.009304  -5.7844   67.08    39.9  2900000     5140
## 5  0.015981  0.043902  0.041186 -85.7562   88.31    11.5 37000000     5685
## 6  0.018588  0.025245  0.015941  20.4615   50.54    14.8  4900000     5061
##   ptratio hsdiploma democrat       pid house senate   inc minwage year
## 1   15.77      77.5 30.54249 -0.033898    -1  0.976 34650    7.25 2016
## 2   16.29      90.4 26.38180 -0.350000     0  0.638 45529    7.75 2016
## 3   20.75      85.1 28.86411 -0.102564     0  0.576 35875    7.90 2016
## 4   12.90      81.7 27.68069  0.084507     0  0.816 34014    6.25 2016
## 5   19.80      81.2 37.08743  0.179825    -1 -1.395 44481    9.00 2016
## 6   16.97      89.7 26.85935  0.125000     0 -0.528 44088    8.00 2016
##    polscore newimmig popover65 percwom medinc turnout  margin co2 femleg
## 1 -0.073118     4063    657792   75.49  42590    0.59 59.5446 130   11.4
## 2  0.036343     1799     54938   79.02  57431    0.59 52.4528  43   18.3
## 3 -0.058591    20333    881831   84.00  48621    0.53 37.1173  88   33.3
## 4 -0.022314     2874    419981   88.52  41302    0.51 78.2980  62   17.0
## 5 -0.088499   210591   4200000   89.94  53367    0.55 35.9504 368   30.8
## 6 -0.002005    13547    549625   79.57  58629    0.70 32.5394  88   32.0
##   corrupt infant trumpwin weed death stand obamawin
## 1      23   9.53        1    0     1     1        0
## 2      NA   5.93        1    1     0     1        1
## 3      18   6.85        1    1     1     1        0
## 4       8   7.84        1    1     1     0        0
## 5      52   5.32        0    1     1     0        1
## 6       6   6.44        0    1     0     0        1
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

SECTION 1

This section is designed to build familiarity with the data we’re going to use this semester.

Question 1.1

Using either the states or world dataset, pick one variable. Describe this variable: Is it continuous? Categorical? Are there missing values or other interesting features? [AI RESTRICTED, nocode]

I picked the density variable. It is continuous because it is measured on a numeric scale and can take any value in a range, not just a fixed set of categories. There are no missing values: every state has a density.

Question 1.2

Calculate the means and standard deviations by some grouping variable. Why did you pick this grouping variable, and what is it telling you about your variable? [AI PERMITTED]

mean_density <- mean(states$density, na.rm = TRUE)
print(mean_density)
## [1] 176.7743

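I picked this variable because I found it interesting in a different political science class I am in. The value means that the average density across states is 176.7743. This is an unweighted mean of the state values, so it is not the same as the density of the U.S. as a whole.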

sd_density <- sd(states$density, na.rm = TRUE)
print(sd_density)
## [1] 238.5726

The standard deviation is 238.5726, which is larger than the mean, so the states have a wide variety of densities.
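The question asks for the statistics by group, so a grouped version could look like the sketch below. It assumes region as the grouping variable and uses only the six rows printed by head(states) above, so the numbers are illustrative and are not the full-data results. The name states_head is introduced here for the example.

```r
library(dplyr)

# Six rows copied from the head(states) output above (not the full dataset).
states_head <- data.frame(
  state   = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  region  = c("South", "West", "West", "South", "West", "West"),
  density = c(86.17970, 1.08848, 42.12770, 49.10960, 211.97700, 39.08800)
)

# Mean and standard deviation of density within each region.
density_by_region <- states_head %>%
  group_by(region) %>%
  summarise(
    mean_density = mean(density, na.rm = TRUE),
    sd_density   = sd(density, na.rm = TRUE),
    n_states     = n()
  )

print(density_by_region)
```

Replacing states_head with states should give one row per region for the full data.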

Question 1.3

Now visualize the variable, by group, using an appropriate plot. Why did you pick the visualization method you did? What quantity(ies) did you show, and why? [AI PERMITTED]

ggplot(states, aes(x = density)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black") +
  labs(title = "Histogram of Density",
       x = "Density",
       y = "Count") +
  theme_minimal()

I picked a histogram because it shows the shape of the distribution of a single continuous variable. The x-axis shows density and the y-axis shows the number of states in each bin, which gives the widest view of the data possible.

Question 1.4

Ask your AI to produce a different visualization of this variable. What did it select? Do you understand what this plot is showing you, and how is it different from the plot you made above? [AI PERMITTED]

ggplot(states, aes(x = density, fill = region)) +
  geom_density(alpha = 0.6) +
  labs(title = "Density Distribution by Region",
       x = "Population Density",
       y = "Density") +
  theme_minimal()

It chose to break the data down by region and draw a density plot, which shows a smoothed curve for each region. It is different from the plot I made above because it compares the distribution of density across regions instead of counting all of the states together in bins.

SECTION 2

Question 2.1

Coding style is tough to learn but important. Ask your AI to list 5 good practices to use when coding in R. What did it say? Were there any that surprised you? [AI PERMITTED]

You said: list 5 good practices to use when coding in R

ChatGPT said:

Absolutely! Good coding practices in R help make your code clean, readable, reproducible, and efficient. Here are five essential ones:

1️⃣ Use clear and consistent variable names

Avoid vague names like x or df1.

Prefer descriptive names: murder_rate, population_total, state_data.

Use snake_case (like_this) or camelCase (likeThis) consistently.

Example:

mean_murderrate <- mean(states$murderrate, na.rm = TRUE)

2️⃣ Comment your code

Explain why you’re doing something, not just what you’re doing.

Helps others (and future you!) understand your logic.

Example:

# Calculate average murder rate for each region
region_avg <- states %>%
  group_by(region) %>%
  summarise(mean_murderrate = mean(murderrate, na.rm = TRUE))

3️⃣ Use packages thoughtfully and load them at the top

Load all libraries at the start of your script (library(dplyr), library(ggplot2)).

This avoids “function not found” errors later.

Example:

library(dplyr)
library(ggplot2)

4️⃣ Avoid hardcoding values

Use variables instead of repeating numbers or column names multiple times.

Makes your code easier to maintain and modify.

Example:

group_col <- "region"
value_col <- "murderrate"

states %>%
  group_by(.data[[group_col]]) %>%
  summarise(mean_value = mean(.data[[value_col]], na.rm = TRUE))

5️⃣ Keep your code modular and readable

Break long code into small, logical steps or functions.

Avoid writing one huge line that does everything.

Use pipes (%>%) for readability in data manipulation.

Example:

region_stats <- states %>%
  group_by(region) %>%
  summarise(
    mean_murderrate = mean(murderrate, na.rm = TRUE),
    sd_murderrate = sd(murderrate, na.rm = TRUE)
  )

None of these surprised me; they all seem clear and concise.


SECTION 3

The rest of this document is [AI LIMITED]: you must first ask the following prompt before answering these questions if you choose to use AI: “You are a tutoring AI helping students work through R code and statistics. For the rest of this conversation, I will provide you a problem or question, and you will ask me leading questions, provide examples, or analogies to help me on my own, but you will not provide an immediate answer. You will ask me to explain my thinking. You can give hints, but it is better to try to help me work through the solution using leading questions and examples. You are supposed to help me learn through example, not simply provide a solution. Do you understand?”

Question 3.1

This code is returning an error. Diagnose the problem. What was the issue? [AI LIMITED]

states %>%
group_by(region) %>%
     summarize(mean(murderrate)) %>%
     arrange(desc)

The error comes from arrange(desc). desc is a function that needs a column inside its parentheses, so as written arrange() has nothing to sort by. The summary column also has no name, which makes it hard to refer to later. Giving it a name inside summarize(), for example mean_murder = mean(murderrate), and then calling arrange(desc(mean_murder)) lets the code run and sorts the regions from the highest to the lowest mean murder rate.
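A corrected version of the pipeline could look like the sketch below. It uses only the six rows printed by head(states) earlier in this document, so the result is illustrative; the names states_head and mean_murder are introduced here for the example.

```r
library(dplyr)

# Region and murder rate for the six states shown in head(states).
states_head <- data.frame(
  region     = c("South", "West", "West", "South", "West", "West"),
  murderrate = c(6.9, 3.1, 5.4, 6.2, 5.3, 3.5)
)

murder_by_region <- states_head %>%
  group_by(region) %>%
  # Name the summary column so arrange() can refer to it.
  summarize(mean_murder = mean(murderrate, na.rm = TRUE)) %>%
  # desc() wraps the column to sort from highest to lowest.
  arrange(desc(mean_murder))

print(murder_by_region)
```

On the full data the same pipeline would start from states instead of states_head.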

Question 3.2

This code doesn’t return an error, but the plot looks ‘weird’. How might you fix this? [AI LIMITED]

ggplot(nes, aes(pid7, ftmuslim)) + geom_point() +
     theme_minimal() + theme(axis.text.x = element_text(size = 8, angle = 45, vjust = .7)) 
     labs(title = paste("Feelings Towards Muslims")) + ylab("Muslim Thermometer") + xlab("Party Identification")

pid7 is a categorical variable, but the code treats it as a numeric variable. To fix this, I would specify in the code that pid7 is categorical, for example by wrapping it in factor(). I would also change the graph from a scatter plot to a box plot, because a box plot summarizes the spread of thermometer scores within each party category instead of stacking the points on top of each other.
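One way this fix might look is sketched below. The nes data are not reproduced in this document, so nes_toy is a hypothetical stand-in with made-up values, and treating pid7 as a 1 to 7 code is an assumption to check against the codebook.

```r
library(ggplot2)

# Hypothetical stand-in for nes; these values are made up for illustration.
nes_toy <- data.frame(
  pid7     = c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7),
  ftmuslim = c(80, 65, 70, 60, 50, 55, 45, 40, 30, 20)
)

# factor() tells ggplot to treat pid7 as categories, giving one box per value.
p <- ggplot(nes_toy, aes(x = factor(pid7), y = ftmuslim)) +
  geom_boxplot() +
  labs(title = "Feelings Towards Muslims",
       x = "Party Identification",
       y = "Muslim Thermometer") +
  theme_minimal()

print(p)
```

If a scatter plot is preferred, swapping geom_boxplot() for geom_jitter() is another option that spreads out overlapping points.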

SECTION 4

Data rarely come in the format we like. Either the data do not align with the theory we are interested in exploring or the variables are so skewed that it’s extremely hard to discover any underlying patterns. This exercise takes one common problem we face in data analysis and asks you to make the proper transformation in order to answer a specific question or to discover an important underlying pattern you may not have observed otherwise.

Question 4.1

Using the world dataset, summarize GDP per capita (gdppc) and create a histogram. What do you notice? [AI LIMITED]

GDP per capita is the total value of all goods and services produced in a country divided by its population.

ggplot(world, aes(x = gdppc)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "blue") +
  labs(title = "Histogram of GDP Per Capita",
       x = "GDPPC",
       y = "Count") +
  theme_minimal()
## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_bin()`).

I notice that most of the values are packed on the left side of the graph, with only a few on the right side. This means that there are a lot of countries with low GDPPC values and very few with high values, so the distribution is right-skewed, which shows the wealth disparity in the world.
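The numeric summary that the question asks for could be sketched as follows. The values in gdppc_toy are hypothetical and made up for illustration; on the real data the call would be summary(world$gdppc).

```r
# Hypothetical GDP per capita values, made up for illustration only.
gdppc_toy <- c(500, 800, 1200, 2000, 3500, 6000, 45000, NA)

# summary() reports the quartiles, the mean, and the number of NAs.
print(summary(gdppc_toy))

gdppc_mean   <- mean(gdppc_toy, na.rm = TRUE)
gdppc_median <- median(gdppc_toy, na.rm = TRUE)

# A mean well above the median is a sign of a long right tail.
print(gdppc_mean > gdppc_median)
```

Comparing the mean with the median is a quick numeric check that matches what the histogram shows visually.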

Question 4.2

Take the log of gdppc and redo the above. What does taking the log do? Would taking the log of health be a good idea? [AI LIMITED]

# Keep only positive values so the log is defined, without overwriting world
world %>%
  filter(gdppc > 0) %>%
  ggplot(aes(x = log10(gdppc))) +
  geom_histogram(binwidth = 0.1, fill = "lightgreen", color = "blue") +
  labs(title = "Histogram of Log10 GDP Per Capita",
       x = "Log10 GDP Per Capita",
       y = "Count") +
  theme_minimal()

Taking the log of a variable compresses it, and it compresses large values more than small ones, which is helpful for a variable with very large values like GDPPC. I don’t think taking the log of health would be a good idea because its values are not big enough to warrant using the log.
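A small sketch of what this compression looks like is shown below. The numbers are chosen for illustration and are not taken from the world data.

```r
# Values that each differ from the previous one by a factor of 10.
gdppc_values <- c(100, 1000, 10000, 100000)

log_values <- log10(gdppc_values)
print(log_values)

# Gaps on the raw scale keep growing (900, 9000, 90000),
# while gaps on the log10 scale are all equal to 1.
print(diff(gdppc_values))
print(diff(log_values))
```

Each tenfold increase becomes a single step on the log10 scale, which is why the long right tail is pulled in.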

Question 4.3

If I wanted to subset the data to only calculate summary statistics for gdppc for Latin America (see aclpregion), what would be the best way to do this? [AI LIMITED]

The best way to do this would be to use the filter command: filter(). Inside the parentheses you would put a condition on aclpregion that keeps only Latin America, so that only the countries from that region are included. The summary statistics for gdppc can then be calculated on the filtered data.
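A minimal sketch of this approach is shown below. The world data are not reproduced in this document, so world_toy is a hypothetical stand-in with made-up values, and the label "Latin America" is an assumption about how aclpregion is coded; the codebook would confirm the exact value.

```r
library(dplyr)

# Hypothetical stand-in for world; values and region labels are made up.
world_toy <- data.frame(
  country    = c("A", "B", "C", "D", "E"),
  aclpregion = c("Latin America", "Latin America", "Europe", "Asia", "Latin America"),
  gdppc      = c(4000, 9000, 30000, 2500, NA)
)

latin_america_stats <- world_toy %>%
  # Keep only the rows for the region of interest.
  filter(aclpregion == "Latin America") %>%
  # Summarise gdppc for those rows, ignoring missing values.
  summarise(
    mean_gdppc = mean(gdppc, na.rm = TRUE),
    sd_gdppc   = sd(gdppc, na.rm = TRUE),
    n_missing  = sum(is.na(gdppc))
  )

print(latin_america_stats)
```

Because the result is not assigned back to world_toy, the full dataset stays available for later questions.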