Assignment #3

Part 1

The dataset I want to use for this class was collected by the New York City Department of Education (DOE) and is publicly available through NYC Open Data in CSV format. My primary research questions are: “How do dropout rates vary across the boroughs of New York City?” and “How have dropout rates changed over time in New York City?” The key variables I am interested in are borough, cohort year, and dropout percentage, which will allow me to analyze trends over time and compare the boroughs. For the analysis, I plan to start with descriptive statistics to gain an initial understanding of the dataset. I will then perform a time series analysis to examine how dropout rates have changed over time, both citywide and within each borough. I will also fit a linear regression to model the relationship between dropout rate and cohort year, and run an ANOVA to determine whether dropout rates differ significantly across the boroughs. For data visualization, I plan to use pie charts, box plots, and bar graphs to showcase trends and comparisons in the data.

Link to data: https://data.cityofnewyork.us/Education/2005-2019-Graduation-Rates-Borough-All/ynqa-y42e/about_data
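
A minimal sketch of how the planned descriptive statistics, regression, and ANOVA could be set up in R is shown below. The column names (borough, cohort_year, dropout_pct) and every value in the data frame are placeholders invented for illustration; they are not taken from the DOE file, whose actual column names may differ.

```r
# Placeholder data: values are randomly generated for illustration only,
# NOT real DOE figures. Column names are assumptions.
set.seed(1)
dropout <- expand.grid(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  cohort_year = 2005:2010,
  stringsAsFactors = TRUE
)
dropout$dropout_pct <- round(runif(nrow(dropout), min = 5, max = 20), 1)

# Descriptive statistics: mean dropout percentage per borough
borough_means <- aggregate(dropout_pct ~ borough, data = dropout, FUN = mean)
print(borough_means)

# Linear regression: dropout percentage as a function of cohort year
trend_model <- lm(dropout_pct ~ cohort_year, data = dropout)
print(summary(trend_model))

# One-way ANOVA: do mean dropout percentages differ across boroughs?
borough_anova <- aov(dropout_pct ~ borough, data = dropout)
print(summary(borough_anova))
```

With the real data, the placeholder data frame would be replaced by the imported CSV, and the formulas would use whatever the file's actual column names turn out to be.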

Part 2

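# Load packages: tidyr provides pivot_longer() and pivot_wider(),
# dplyr provides the %>% pipe
library(tidyr)
library(dplyr)

# Import the data
data("airquality")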

# Convert numeric months to month names
airquality$Month <- factor(airquality$Month, labels = month.name[5:9])

# View the data
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67   May   1
## 2    36     118  8.0   72   May   2
## 3    12     149 12.6   74   May   3
## 4    18     313 11.5   62   May   4
## 5    NA      NA 14.3   56   May   5
## 6    28      NA 14.9   66   May   6

# Moving from wide to long form
long_airquality <- airquality %>%
  pivot_longer(cols = Ozone:Temp, 
               names_to = "variable", 
               values_to = "value")

# View the long form data
head(long_airquality)

## # A tibble: 6 × 4
##   Month   Day variable value
##   <fct> <int> <chr>    <dbl>
## 1 May       1 Ozone     41  
## 2 May       1 Solar.R  190  
## 3 May       1 Wind       7.4
## 4 May       1 Temp      67  
## 5 May       2 Ozone     36  
## 6 May       2 Solar.R  118

# Moving back from long to wide form
wide_airquality <- long_airquality %>%
  pivot_wider(names_from = "variable", 
              values_from = "value")

# View the wide form data
head(wide_airquality)

## # A tibble: 6 × 6
##   Month   Day Ozone Solar.R  Wind  Temp
##   <fct> <int> <dbl>   <dbl> <dbl> <dbl>
## 1 May       1    41     190   7.4    67
## 2 May       2    36     118   8      72
## 3 May       3    12     149  12.6    74
## 4 May       4    18     313  11.5    62
## 5 May       5    NA      NA  14.3    56
## 6 May       6    28      NA  14.9    66

In this example, I started by loading the built-in “airquality” dataset. Next, I converted the numeric Month column to a factor so that the values 5 through 9 display as month names (May through September). I then used pivot_longer() to reshape the dataset from wide format, with a separate column for each measurement (Ozone, Solar.R, Wind, and Temp), to long format, where the four measurements are stacked into two columns: one holding the variable names and the other holding the corresponding values. Each original row therefore becomes four rows, identified by Month and Day. Finally, I used pivot_wider() to reshape the data back to wide format, where each variable again has its own column. The result matches the original data, except that it is now a tibble, Month and Day come first, and the integer columns are stored as doubles.
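
One way to confirm that the round trip preserves the data is sketched below. It repeats the two reshaping steps on a fresh copy of airquality (skipping the Month conversion for brevity) and compares the values after restoring the original column order. The object names long_aq, wide_aq, and same_values are introduced here for illustration only.

```r
library(tidyr)

data("airquality")

# Wide to long: the four measurement columns are stacked into name/value pairs
long_aq <- pivot_longer(airquality,
                        cols = Ozone:Temp,
                        names_to = "variable",
                        values_to = "value")

# Long back to wide: one column per measurement again
wide_aq <- pivot_wider(long_aq,
                       names_from = "variable",
                       values_from = "value")

# Each original row contributes four rows to the long form
nrow(long_aq) == 4 * nrow(airquality)

# Restore the original column order, then compare values only
# (attributes such as the tibble class and row names are ignored)
same_values <- isTRUE(all.equal(
  data.matrix(as.data.frame(wide_aq)[, names(airquality)]),
  data.matrix(airquality),
  check.attributes = FALSE
))
same_values
```

Comparing numeric matrices sidesteps the integer-to-double change that pivot_longer() introduces when it combines columns of different types into a single value column.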

Dijana K.

2025-03-03
