Introduction

When working with data, you should expect to spend a good amount of time in the clean-up process, but it is not always ‘messy’ or unreadable. For example, data can still be organized in a data-frame in a way that is readable, but at the same time it may not be useful in such presented formats. In these cases, we may have to transpose the data-frame and re-organize to fit our needs.

Loading Data: Creating the Initial Data-Frame

The data is loaded into a data-frame called ‘majors_df’.

rm(list = ls())

majors_df <- data.frame(
  c('NYU',NA,NA,'CUNY',NA, NA, 'NYIT', NA),
  c('Declared','Undeclared',NA,'Declared','Undeclared',NA,'Declared','Undeclared'),
  c(342,291,NA,807,227,NA,552,89),
  c(57,122,NA,58,1206,NA,192,319),
  c(915,582,NA,867,451,NA,1027,625),
  c(377,151,NA,109,650,NA,21,235),
  c(1569,257,NA,89,215,NA,1007,519),
  c(601,95,NA,712,77,NA,811,62)
)

colnames(majors_df) <- c('','','Applied Mathematics','Physics','Computer Science','Data Science','Materials Science', 'Systems Engineering')

#view current state of data-frame
majors_df

##                   Applied Mathematics Physics Computer Science Data Science
## 1  NYU   Declared                 342      57              915          377
## 2 <NA> Undeclared                 291     122              582          151
## 3 <NA>       <NA>                  NA      NA               NA           NA
## 4 CUNY   Declared                 807      58              867          109
## 5 <NA> Undeclared                 227    1206              451          650
## 6 <NA>       <NA>                  NA      NA               NA           NA
## 7 NYIT   Declared                 552     192             1027           21
## 8 <NA> Undeclared                  89     319              625          235
##   Materials Science Systems Engineering
## 1              1569                 601
## 2               257                  95
## 3                NA                  NA
## 4                89                 712
## 5               215                  77
## 6                NA                  NA
## 7              1007                 811
## 8               519                  62

Loading Data: Writing the Data-Frame to a CSV File

We can now write the ‘majors_df’ data-frame to a CSV file. Within the ‘write.csv()’ function, I specified the ‘row.names’ parameter to ‘FALSE’ to avoid creating non-blank row names. I also set the ‘na’ parameter to “” to replace all ‘NA’ values with blanks. The ‘stringsAsFactors’ parameter is set to ‘FALSE’ to avoid changing the character entries. At first glance it looks correct, but the data-frame still shows up with ‘NA’ values in some of the cells.

#write to csv file
write.csv(majors_df, "majors.csv", row.names = FALSE, na = "")

#read from csv file into R
majors_csv <- read.csv('majors.csv', stringsAsFactors = FALSE, check.names = FALSE)

majors_csv

##                   Applied Mathematics Physics Computer Science Data Science
## 1  NYU   Declared                 342      57              915          377
## 2      Undeclared                 291     122              582          151
## 3                                  NA      NA               NA           NA
## 4 CUNY   Declared                 807      58              867          109
## 5      Undeclared                 227    1206              451          650
## 6                                  NA      NA               NA           NA
## 7 NYIT   Declared                 552     192             1027           21
## 8      Undeclared                  89     319              625          235
##   Materials Science Systems Engineering
## 1              1569                 601
## 2               257                  95
## 3                NA                  NA
## 4                89                 712
## 5               215                  77
## 6                NA                  NA
## 7              1007                 811
## 8               519                  62

Loading Data: Changing “NA” Values to Blanks

To remove the remaining ‘NA’ values I subset the data-frame and set any value that ‘is.na’ to ““. When calling the updated data, it seemed to have removed the blank row entirely. To confirm that this was not the case, I presented the data nicely using the ‘datatable()’ function.

#change NA values to blank
majors_csv[is.na(majors_csv)] <- ""

#check updated data-frame
majors_csv

##                   Applied Mathematics Physics Computer Science Data Science
## 1  NYU   Declared                 342      57              915          377
## 2      Undeclared                 291     122              582          151
## 3                                                                          
## 4 CUNY   Declared                 807      58              867          109
## 5      Undeclared                 227    1206              451          650
## 6                                                                          
## 7 NYIT   Declared                 552     192             1027           21
## 8      Undeclared                  89     319              625          235
##   Materials Science Systems Engineering
## 1              1569                 601
## 2               257                  95
## 3                                      
## 4                89                 712
## 5               215                  77
## 6                                      
## 7              1007                 811
## 8               519                  62

# Display the data frame nicely
datatable(majors_csv)

Loading Data: Changing the Altered Column Names

The previous output looks almost perfect, but the first two columns were given default names, namely “Var.2” and “Var.3”. To avoid this, I set the ‘options’ parameter accordingly. We can now see that our data-frame is formatted exactly as we want it to be. As a double-check, I compared it to a screen shot of the actual table that was created in the CSV file. Here, I used Excel to view it. The tables look identical.

#specify additional options
datatable(majors_csv, options = list(columnDefs = list(list(targets = c(1, 2), title = ""))))

Using Valid Column Names

Before we can pivot, we need to change the column/variable names to a valid format. This means no white-spaces or special characters.

#check data frame again
majors_csv

##                   Applied Mathematics Physics Computer Science Data Science
## 1  NYU   Declared                 342      57              915          377
## 2      Undeclared                 291     122              582          151
## 3                                                                          
## 4 CUNY   Declared                 807      58              867          109
## 5      Undeclared                 227    1206              451          650
## 6                                                                          
## 7 NYIT   Declared                 552     192             1027           21
## 8      Undeclared                  89     319              625          235
##   Materials Science Systems Engineering
## 1              1569                 601
## 2               257                  95
## 3                                      
## 4                89                 712
## 5               215                  77
## 6                                      
## 7              1007                 811
## 8               519                  62

Here, I replace white-spaces with a period and labeled the first two columns that originally had blank titles as ‘College’ and ‘Major Status’. For the blank row entries in the ‘College’ column, I labeled them according to the proper college.

#
colnames(majors_csv) <- c('College','Major.Status','Applied.Mathematics','Physics','Computer.Science','Data.Science','Materials.Science', 'Systems.Engineering')

majors_csv$College[2] <- 'NYU'
majors_csv$College[5] <- 'CUNY'
majors_csv$College[8] <- 'NYIT'

majors_csv

##   College Major.Status Applied.Mathematics Physics Computer.Science
## 1     NYU     Declared                 342      57              915
## 2     NYU   Undeclared                 291     122              582
## 3                                                                  
## 4    CUNY     Declared                 807      58              867
## 5    CUNY   Undeclared                 227    1206              451
## 6                                                                  
## 7    NYIT     Declared                 552     192             1027
## 8    NYIT   Undeclared                  89     319              625
##   Data.Science Materials.Science Systems.Engineering
## 1          377              1569                 601
## 2          151               257                  95
## 3                                                   
## 4          109                89                 712
## 5          650               215                  77
## 6                                                   
## 7           21              1007                 811
## 8          235               519                  62

Pivoting

Now that all of the required data is loaded, we need to transform what we have so that it is readable to our liking. In this case, this means pivoting the matrix and aggregating values and columns. To avoid gaps in our pivoted data, I remove the blank row before applying the ‘pivot_longer()’ function.

majors_csv <- majors_csv[c(-3,-6), ]

majors_csv

##   College Major.Status Applied.Mathematics Physics Computer.Science
## 1     NYU     Declared                 342      57              915
## 2     NYU   Undeclared                 291     122              582
## 4    CUNY     Declared                 807      58              867
## 5    CUNY   Undeclared                 227    1206              451
## 7    NYIT     Declared                 552     192             1027
## 8    NYIT   Undeclared                  89     319              625
##   Data.Science Materials.Science Systems.Engineering
## 1          377              1569                 601
## 2          151               257                  95
## 4          109                89                 712
## 5          650               215                  77
## 7           21              1007                 811
## 8          235               519                  62

p_majors <- pivot_longer(majors_csv,
                           cols = c('Applied.Mathematics','Physics','Computer.Science','Data.Science','Materials.Science', 'Systems.Engineering'),
                           names_to = "Major",
                           values_to = "Student.Count")

p_majors

## # A tibble: 36 × 4
##    College Major.Status Major               Student.Count
##    <chr>   <chr>        <chr>               <chr>        
##  1 NYU     Declared     Applied.Mathematics 342          
##  2 NYU     Declared     Physics             57           
##  3 NYU     Declared     Computer.Science    915          
##  4 NYU     Declared     Data.Science        377          
##  5 NYU     Declared     Materials.Science   1569         
##  6 NYU     Declared     Systems.Engineering 601          
##  7 NYU     Undeclared   Applied.Mathematics 291          
##  8 NYU     Undeclared   Physics             122          
##  9 NYU     Undeclared   Computer.Science    582          
## 10 NYU     Undeclared   Data.Science        151          
## # ℹ 26 more rows

Changing Column Type

We need to change the ‘Student.Count’ column to numeric values to perform mathematical operations. After doing so, we can group the data according to College, Major, and Major.Status, in that order.

p_majors$Student.Count <- as.numeric(p_majors$Student.Count)

p_majors

## # A tibble: 36 × 4
##    College Major.Status Major               Student.Count
##    <chr>   <chr>        <chr>                       <dbl>
##  1 NYU     Declared     Applied.Mathematics           342
##  2 NYU     Declared     Physics                        57
##  3 NYU     Declared     Computer.Science              915
##  4 NYU     Declared     Data.Science                  377
##  5 NYU     Declared     Materials.Science            1569
##  6 NYU     Declared     Systems.Engineering           601
##  7 NYU     Undeclared   Applied.Mathematics           291
##  8 NYU     Undeclared   Physics                       122
##  9 NYU     Undeclared   Computer.Science              582
## 10 NYU     Undeclared   Data.Science                  151
## # ℹ 26 more rows

#Aggregate data by major and college
grouped_majors <- p_majors %>%
  group_by(College, Major, Major.Status) %>%
  summarize(Student.Count, .groups = 'drop')

grouped_majors

## # A tibble: 36 × 4
##    College Major               Major.Status Student.Count
##    <chr>   <chr>               <chr>                <dbl>
##  1 CUNY    Applied.Mathematics Declared               807
##  2 CUNY    Applied.Mathematics Undeclared             227
##  3 CUNY    Computer.Science    Declared               867
##  4 CUNY    Computer.Science    Undeclared             451
##  5 CUNY    Data.Science        Declared               109
##  6 CUNY    Data.Science        Undeclared             650
##  7 CUNY    Materials.Science   Declared                89
##  8 CUNY    Materials.Science   Undeclared             215
##  9 CUNY    Physics             Declared                58
## 10 CUNY    Physics             Undeclared            1206
## # ℹ 26 more rows

Condensing the Data

As shown above, for each college there are duplicate entries for each major in order to match to either the ‘Declared’ or ‘Undeclared’ status. This gives us a data-frame with 36 rows. We can condense this down to just 18 rows by including the major/declaration status for each city on the same row by creating a ‘Declared’ column and an ‘Undeclared’ column.

#present corresponding values for (declared) and (undeclared) side by side
wide_majors <- grouped_majors %>%
  pivot_wider(names_from = Major.Status,
              values_from = Student.Count)

wide_majors

## # A tibble: 18 × 4
##    College Major               Declared Undeclared
##    <chr>   <chr>                  <dbl>      <dbl>
##  1 CUNY    Applied.Mathematics      807        227
##  2 CUNY    Computer.Science         867        451
##  3 CUNY    Data.Science             109        650
##  4 CUNY    Materials.Science         89        215
##  5 CUNY    Physics                   58       1206
##  6 CUNY    Systems.Engineering      712         77
##  7 NYIT    Applied.Mathematics      552         89
##  8 NYIT    Computer.Science        1027        625
##  9 NYIT    Data.Science              21        235
## 10 NYIT    Materials.Science       1007        519
## 11 NYIT    Physics                  192        319
## 12 NYIT    Systems.Engineering      811         62
## 13 NYU     Applied.Mathematics      342        291
## 14 NYU     Computer.Science         915        582
## 15 NYU     Data.Science             377        151
## 16 NYU     Materials.Science       1569        257
## 17 NYU     Physics                   57        122
## 18 NYU     Systems.Engineering      601         95

Creating New Columns (1)

Now that we have data that is readable to our liking, we can create new columns to use for our analysis. Here, I created a column for the percentage of declared majors relative to all, declared + undeclared, for the given majors and corresponding to a specific college.

#create new column to show percentage of declared majors for each row
wide_majors$pct_declared <- wide_majors$Declared / (wide_majors$Declared + wide_majors$Undeclared)

wide_majors

## # A tibble: 18 × 5
##    College Major               Declared Undeclared pct_declared
##    <chr>   <chr>                  <dbl>      <dbl>        <dbl>
##  1 CUNY    Applied.Mathematics      807        227       0.780 
##  2 CUNY    Computer.Science         867        451       0.658 
##  3 CUNY    Data.Science             109        650       0.144 
##  4 CUNY    Materials.Science         89        215       0.293 
##  5 CUNY    Physics                   58       1206       0.0459
##  6 CUNY    Systems.Engineering      712         77       0.902 
##  7 NYIT    Applied.Mathematics      552         89       0.861 
##  8 NYIT    Computer.Science        1027        625       0.622 
##  9 NYIT    Data.Science              21        235       0.0820
## 10 NYIT    Materials.Science       1007        519       0.660 
## 11 NYIT    Physics                  192        319       0.376 
## 12 NYIT    Systems.Engineering      811         62       0.929 
## 13 NYU     Applied.Mathematics      342        291       0.540 
## 14 NYU     Computer.Science         915        582       0.611 
## 15 NYU     Data.Science             377        151       0.714 
## 16 NYU     Materials.Science       1569        257       0.859 
## 17 NYU     Physics                   57        122       0.318 
## 18 NYU     Systems.Engineering      601         95       0.864

Creating New Columns (2)

I decided to create two more columns for the mean of declared majors as well as the total number of declared + undeclared for each college, regardless of major. This is to compute a percentage for the mean in the same way that I computed a percentage for the individual majors.

mColl_dec <- wide_majors %>%
  group_by(College) %>%
  summarize(mean_declared = mean(Declared))

#mColl_dec

mDec_CUNY <- mColl_dec$mean_declared[1]
mDec_NYIT <- mColl_dec$mean_declared[2]
mDec_NYU <- mColl_dec$mean_declared[3]

wide_majors$mean_declared <- NA

wide_majors$mean_declared <- ifelse(wide_majors$College == "CUNY",mDec_CUNY, wide_majors$mean_declared)
wide_majors$mean_declared <- ifelse(wide_majors$College == "NYIT", mDec_NYIT, wide_majors$mean_declared)
wide_majors$mean_declared <- ifelse(wide_majors$College == "NYU", mDec_NYU, wide_majors$mean_declared)

wide_majors

## # A tibble: 18 × 6
##    College Major               Declared Undeclared pct_declared mean_declared
##    <chr>   <chr>                  <dbl>      <dbl>        <dbl>         <dbl>
##  1 CUNY    Applied.Mathematics      807        227       0.780           440.
##  2 CUNY    Computer.Science         867        451       0.658           440.
##  3 CUNY    Data.Science             109        650       0.144           440.
##  4 CUNY    Materials.Science         89        215       0.293           440.
##  5 CUNY    Physics                   58       1206       0.0459          440.
##  6 CUNY    Systems.Engineering      712         77       0.902           440.
##  7 NYIT    Applied.Mathematics      552         89       0.861           602.
##  8 NYIT    Computer.Science        1027        625       0.622           602.
##  9 NYIT    Data.Science              21        235       0.0820          602.
## 10 NYIT    Materials.Science       1007        519       0.660           602.
## 11 NYIT    Physics                  192        319       0.376           602.
## 12 NYIT    Systems.Engineering      811         62       0.929           602.
## 13 NYU     Applied.Mathematics      342        291       0.540           644.
## 14 NYU     Computer.Science         915        582       0.611           644.
## 15 NYU     Data.Science             377        151       0.714           644.
## 16 NYU     Materials.Science       1569        257       0.859           644.
## 17 NYU     Physics                   57        122       0.318           644.
## 18 NYU     Systems.Engineering      601         95       0.864           644.

Creating New Columns (3)

Here I include columns for the total number of declared majors, total number of declared + undeclared, and declared percentages for the overall colleges. Then added a final column that calculates the mean across all majors of the percentages of declared majors for each college.

mDec_total1 <- wide_majors %>%
   group_by(College) %>%
   summarize(declared_total = sum(Declared))

mColl_total <- wide_majors %>%
   group_by(College) %>%
   summarize(total = sum(Declared, Undeclared))

# mDec_total1
# mColl_total

mDec_CUNY <- mDec_total1$declared_total[1]
mDec_NYIT <- mDec_total1$declared_total[2]
mDec_NYU <- mDec_total1$declared_total[3]

mTot_CUNY <- mColl_total$total[1]
mTot_NYIT <- mColl_total$total[2]
mTot_NYU <- mColl_total$total[3]

wide_majors$declared_total <- NA
wide_majors$total <- NA

wide_majors$declared_total <- ifelse(wide_majors$College == "CUNY", mDec_CUNY, wide_majors$declared_total)
wide_majors$declared_total <- ifelse(wide_majors$College == "NYIT", mDec_NYIT, wide_majors$declared_total)
wide_majors$declared_total <- ifelse(wide_majors$College == "NYU", mDec_NYU, wide_majors$declared_total)

wide_majors$total <- ifelse(wide_majors$College == "CUNY", mTot_CUNY, wide_majors$total)
wide_majors$total <- ifelse(wide_majors$College == "NYIT", mTot_NYIT, wide_majors$total)
wide_majors$total <- ifelse(wide_majors$College == "NYU", mTot_NYU, wide_majors$total)

wide_majors$college_declared_pct <- wide_majors$declared_total / wide_majors$total

wide_majors$pct_mean_declared <- NA

wide_majors$pct_mean_declared <- wide_majors$mean_declared / wide_majors$total

wide_majors

## # A tibble: 18 × 10
##    College Major   Declared Undeclared pct_declared mean_declared declared_total
##    <chr>   <chr>      <dbl>      <dbl>        <dbl>         <dbl>          <dbl>
##  1 CUNY    Applie…      807        227       0.780           440.           2642
##  2 CUNY    Comput…      867        451       0.658           440.           2642
##  3 CUNY    Data.S…      109        650       0.144           440.           2642
##  4 CUNY    Materi…       89        215       0.293           440.           2642
##  5 CUNY    Physics       58       1206       0.0459          440.           2642
##  6 CUNY    System…      712         77       0.902           440.           2642
##  7 NYIT    Applie…      552         89       0.861           602.           3610
##  8 NYIT    Comput…     1027        625       0.622           602.           3610
##  9 NYIT    Data.S…       21        235       0.0820          602.           3610
## 10 NYIT    Materi…     1007        519       0.660           602.           3610
## 11 NYIT    Physics      192        319       0.376           602.           3610
## 12 NYIT    System…      811         62       0.929           602.           3610
## 13 NYU     Applie…      342        291       0.540           644.           3861
## 14 NYU     Comput…      915        582       0.611           644.           3861
## 15 NYU     Data.S…      377        151       0.714           644.           3861
## 16 NYU     Materi…     1569        257       0.859           644.           3861
## 17 NYU     Physics       57        122       0.318           644.           3861
## 18 NYU     System…      601         95       0.864           644.           3861
## # ℹ 3 more variables: total <dbl>, college_declared_pct <dbl>,
## #   pct_mean_declared <dbl>

Analysis

In my analysis, I decided to find the sample standard deviation based on the percentage points that I computed for the individual majors and the percentages of the means. I did this to compare the declared majors according to variation in the data set.

sample_pct_sd <- wide_majors %>%
   group_by(College) %>%
   reframe(pct_declared, pct_mean_declared)

sample_pct_sd$sq_diff <- (sample_pct_sd$pct_declared - sample_pct_sd$pct_mean_declared)^2


sample_pct_sd1 <- sample_pct_sd %>%
  group_by(College) %>%
  summarize(pct_sd = sqrt(sum(sq_diff)/(6 - 1)))

#sample_pct_sd1

#standard deviation of delay percentages for ALASKA
cuny_pct_sd <- sample_pct_sd1$pct_sd[1]

#standard deviation of delay percentages for AM WEST
nyit_pct_sd <- sample_pct_sd1$pct_sd[2]

#standard deviation of delay percentages for AM WEST
nyu_pct_sd <- sample_pct_sd1$pct_sd[3]

#wide_majors

declared_df <- data.frame(College = wide_majors$College,
                 Major = wide_majors$Major,
                 Declared = wide_majors$Declared,
                 Pct_Declared = wide_majors$pct_declared,
                 #Mean_Pct_Delay = wide_flights$pct_mean_delayed,
                 #Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
                 #Avg_Pct_SD = NA,
                 Pct_SD = NA,
                 College_Declared_Pct = wide_majors$college_declared_pct
                 )

declared_df$Pct_SD <- ifelse(wide_majors$College == "CUNY", cuny_pct_sd, declared_df$Pct_SD)
declared_df$Pct_SD <- ifelse(wide_majors$College == "NYIT", nyit_pct_sd, declared_df$Pct_SD)
declared_df$Pct_SD <- ifelse(wide_majors$College == "NYU", nyu_pct_sd, declared_df$Pct_SD)

#write to csv
write.csv(declared_df, 'major_analysis.csv')

majors_analysis <- read.csv('major_analysis.csv')

majors_analysis

##     X College               Major Declared Pct_Declared    Pct_SD
## 1   1    CUNY Applied.Mathematics      807   0.78046422 0.5565744
## 2   2    CUNY    Computer.Science      867   0.65781487 0.5565744
## 3   3    CUNY        Data.Science      109   0.14361001 0.5565744
## 4   4    CUNY   Materials.Science       89   0.29276316 0.5565744
## 5   5    CUNY             Physics       58   0.04588608 0.5565744
## 6   6    CUNY Systems.Engineering      712   0.90240811 0.5565744
## 7   7    NYIT Applied.Mathematics      552   0.86115445 0.6114432
## 8   8    NYIT    Computer.Science     1027   0.62167070 0.6114432
## 9   9    NYIT        Data.Science       21   0.08203125 0.6114432
## 10 10    NYIT   Materials.Science     1007   0.65989515 0.6114432
## 11 11    NYIT             Physics      192   0.37573386 0.6114432
## 12 12    NYIT Systems.Engineering      811   0.92898053 0.6114432
## 13 13     NYU Applied.Mathematics      342   0.54028436 0.6178946
## 14 14     NYU    Computer.Science      915   0.61122244 0.6178946
## 15 15     NYU        Data.Science      377   0.71401515 0.6178946
## 16 16     NYU   Materials.Science     1569   0.85925520 0.6178946
## 17 17     NYU             Physics       57   0.31843575 0.6178946
## 18 18     NYU Systems.Engineering      601   0.86350575 0.6178946
##    College_Declared_Pct
## 1             0.4831748
## 2             0.4831748
## 3             0.4831748
## 4             0.4831748
## 5             0.4831748
## 6             0.4831748
## 7             0.6612933
## 8             0.6612933
## 9             0.6612933
## 10            0.6612933
## 11            0.6612933
## 12            0.6612933
## 13            0.7204702
## 14            0.7204702
## 15            0.7204702
## 16            0.7204702
## 17            0.7204702
## 18            0.7204702

Grouping by Major

The following code mimics the same process as above that was done for each college, but for majors. To avoid confusion, I gave these data-frames to new assignments.

mMaj_dec <- wide_majors %>%
  group_by(Major) %>%
  summarize(mean_declared = mean(Declared))

wide_majors2 <- wide_majors
#wide_majors2
#mMaj_dec

mDec_math <- mMaj_dec$mean_declared[1]
mDec_cs <- mMaj_dec$mean_declared[2]
mDec_ds <- mMaj_dec$mean_declared[3]
mDec_ms <- mMaj_dec$mean_declared[4]
mDec_p <- mMaj_dec$mean_declared[5]
mDec_se <- mMaj_dec$mean_declared[6]

wide_majors2$mean_declared <- NA

wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Applied.Mathematics",mDec_math, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Computer.Science", mDec_cs, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Data.Science", mDec_ds, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Materials.Science",mDec_ms, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Physics", mDec_p, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Systems.Engineering", mDec_se, wide_majors2$mean_declared)

wide_majors2

## # A tibble: 18 × 10
##    College Major   Declared Undeclared pct_declared mean_declared declared_total
##    <chr>   <chr>      <dbl>      <dbl>        <dbl>         <dbl>          <dbl>
##  1 CUNY    Applie…      807        227       0.780           567            2642
##  2 CUNY    Comput…      867        451       0.658           936.           2642
##  3 CUNY    Data.S…      109        650       0.144           169            2642
##  4 CUNY    Materi…       89        215       0.293           888.           2642
##  5 CUNY    Physics       58       1206       0.0459          102.           2642
##  6 CUNY    System…      712         77       0.902           708            2642
##  7 NYIT    Applie…      552         89       0.861           567            3610
##  8 NYIT    Comput…     1027        625       0.622           936.           3610
##  9 NYIT    Data.S…       21        235       0.0820          169            3610
## 10 NYIT    Materi…     1007        519       0.660           888.           3610
## 11 NYIT    Physics      192        319       0.376           102.           3610
## 12 NYIT    System…      811         62       0.929           708            3610
## 13 NYU     Applie…      342        291       0.540           567            3861
## 14 NYU     Comput…      915        582       0.611           936.           3861
## 15 NYU     Data.S…      377        151       0.714           169            3861
## 16 NYU     Materi…     1569        257       0.859           888.           3861
## 17 NYU     Physics       57        122       0.318           102.           3861
## 18 NYU     System…      601         95       0.864           708            3861
## # ℹ 3 more variables: total <dbl>, college_declared_pct <dbl>,
## #   pct_mean_declared <dbl>

##############

mDec_total2 <- wide_majors %>%
   group_by(Major) %>%
   summarize(declared_total = sum(Declared))

mColl_total1 <- wide_majors %>%
   group_by(Major) %>%
   summarize(total = sum(Declared, Undeclared))

# mDec_total2
# mColl_total1

mDec_math <- mDec_total2$declared_total[1]
mDec_cs <- mDec_total2$declared_total[2]
mDec_ds <- mDec_total2$declared_total[3]
mDec_ms <- mDec_total2$declared_total[4]
mDec_p <- mDec_total2$declared_total[5]
mDec_se <- mDec_total2$declared_total[6]

mTot_math <- mColl_total1$total[1]
mTot_cs <- mColl_total1$total[2]
mTot_ds <- mColl_total1$total[3]
mTot_ms <- mColl_total1$total[4]
mTot_p <- mColl_total1$total[5]
mTot_se <- mColl_total1$total[6]

wide_majors2$declared_total <- NA
wide_majors2$total <- NA

wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Applied.Mathematics", mDec_math, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Computer.Science", mDec_cs, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Data.Science", mDec_ds, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Materials.Science", mDec_ms, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Physics", mDec_p, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Systems.Engineering", mDec_se, wide_majors2$declared_total)

wide_majors2$total <- ifelse(wide_majors2$Major == "Applied.Mathematics", mTot_math, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Computer.Science", mTot_cs, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Data.Science", mTot_ds, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Materials.Science", mTot_ms, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Physics", mTot_p, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Systems.Engineering", mTot_se, wide_majors2$total)

wide_majors2$major_declared_pct <- wide_majors2$declared_total / wide_majors2$total

wide_majors2$pct_mean_declared <- NA

wide_majors2$pct_mean_declared <- wide_majors2$mean_declared / wide_majors2$total

#wide_majors2

##############

sample_pct_sd3 <- wide_majors2 %>%
   group_by(Major) %>%
   reframe(pct_declared, pct_mean_declared)

sample_pct_sd3$sq_diff2 <- (sample_pct_sd3$pct_declared - sample_pct_sd3$pct_mean_declared)^2

sample_pct_sd4 <- sample_pct_sd3 %>%
  group_by(Major) %>%
  summarize(pct_sd = sqrt(sum(sq_diff2)/(6 - 1)))

# sample_pct_sd3
# sample_pct_sd4

#standard deviation of delay percentages for ALASKA
math_pct_sd <- sample_pct_sd4$pct_sd[1]

#standard deviation of delay percentages for AM WEST
cs_pct_sd <- sample_pct_sd4$pct_sd[2]

#standard deviation of delay percentages for AM WEST
ds_pct_sd <- sample_pct_sd4$pct_sd[3]

#standard deviation of delay percentages for ALASKA
ms_pct_sd <- sample_pct_sd4$pct_sd[4]

#standard deviation of delay percentages for AM WEST
p_pct_sd <- sample_pct_sd4$pct_sd[5]

#standard deviation of delay percentages for AM WEST
se_pct_sd <- sample_pct_sd4$pct_sd[6]

#wide_majors2

declared_df2 <- data.frame(College = wide_majors2$College,
                 Major = wide_majors2$Major,
                 Declared = wide_majors2$Declared,
                 Pct_Declared = wide_majors2$pct_declared,
                 #Mean_Pct_Delay = wide_flights$pct_mean_delayed,
                 #Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
                 #Avg_Pct_SD = NA,
                 Pct_SD = NA,
                 Major_Declared_Pct = wide_majors2$major_declared_pct
                 )

declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Applied.Mathematics", math_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Computer.Science", cs_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Data.Science", ds_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Materials.Science", ms_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Physics", p_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Systems.Engineering", se_pct_sd, declared_df2$Pct_SD)

#write to csv
write.csv(declared_df2, 'major_analysis2.csv')

majors_analysis2 <- read.csv('major_analysis2.csv')

majors_analysis2

##     X College               Major Declared Pct_Declared    Pct_SD
## 1   1    CUNY Applied.Mathematics      807   0.78046422 0.3877194
## 2   2    CUNY    Computer.Science      867   0.65781487 0.3261813
## 3   3    CUNY        Data.Science      109   0.14361001 0.2710438
## 4   4    CUNY   Materials.Science       89   0.29276316 0.3334944
## 5   5    CUNY             Physics       58   0.04588608 0.1872943
## 6   6    CUNY Systems.Engineering      712   0.90240811 0.4637106
## 7   7    NYIT Applied.Mathematics      552   0.86115445 0.3877194
## 8   8    NYIT    Computer.Science     1027   0.62167070 0.3261813
## 9   9    NYIT        Data.Science       21   0.08203125 0.2710438
## 10 10    NYIT   Materials.Science     1007   0.65989515 0.3334944
## 11 11    NYIT             Physics      192   0.37573386 0.1872943
## 12 12    NYIT Systems.Engineering      811   0.92898053 0.4637106
## 13 13     NYU Applied.Mathematics      342   0.54028436 0.3877194
## 14 14     NYU    Computer.Science      915   0.61122244 0.3261813
## 15 15     NYU        Data.Science      377   0.71401515 0.2710438
## 16 16     NYU   Materials.Science     1569   0.85925520 0.3334944
## 17 17     NYU             Physics       57   0.31843575 0.1872943
## 18 18     NYU Systems.Engineering      601   0.86350575 0.4637106
##    Major_Declared_Pct
## 1           0.7370017
## 2           0.6288337
## 3           0.3285807
## 4           0.7289387
## 5           0.1571136
## 6           0.9007634
## 7           0.7370017
## 8           0.6288337
## 9           0.3285807
## 10          0.7289387
## 11          0.1571136
## 12          0.9007634
## 13          0.7370017
## 14          0.6288337
## 15          0.3285807
## 16          0.7289387
## 17          0.1571136
## 18          0.9007634

Visualizing Declared Percentages with Error Bars

The graphs below show the percentages of declared majors in three different ways. All visuals include error bars that show the variability of the data.

(1) Grouped by College

Here we see the colleges grouped in three graphs, each with the declared percentages of the individual majors.

ggplot(majors_analysis2, aes(x = Pct_Declared, y = Major, fill = College)) +
  facet_wrap(~ College) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(xmin = Pct_Declared - Pct_SD, xmax = Pct_Declared + Pct_SD), width = 0.2) +
  labs(
    title = "Declared Percentage by Major",
    x = "Percentage",
    y = "Major"
  )

(2) Grouped by Major

Here we see the cities grouped in five graphs. Each shows the declared percentages of the individual colleges.

ggplot(majors_analysis, aes(x = Pct_Declared, y = College, fill = Major)) +
  facet_wrap(~ Major) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(xmin = Pct_Declared - Pct_SD, xmax = Pct_Declared + Pct_SD), width = 0.2) +
  labs(
    title = "Declared Percentage by College",
    x = "Percentage",
    y = "College"
  )

Overall Declared Percentage vs College

Below I compared the overall declared majors for each college and used a bar graph to compare. It is clear that Airlines has CUNY has the smallest major declaration rate relative to their total number of students across all majors.

dec_group <- majors_analysis %>%
   group_by(College) %>%
   summarize(coll_declared_p = mean(College_Declared_Pct))

ggplot(dec_group, aes(x = College, y = coll_declared_p, fill = College)) + 
  geom_bar(stat = "identity") +
  labs(
    title = "Overall Declared Percentage by College",
    x = "College",
    y = "Percentage"
  )

Individual Declared Percentage vs College (color filered by major)

When breaking down the percentage comparison by major, I visualized it as (1) a single stacked graph with the information for all majors and (2) as separate graphs for each major. In either case, we see a significant difference in declared percentages across colleges.

(1) Stacked (One Graph)

Here the graph is visually appealing, but may be harder to compare as the bars representing the declared percentage for each major are not properly aligned side by side.

ggplot(majors_analysis, aes(x = College, y = Pct_Declared, fill = Major)) + 
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Individual Declared Percentage vs College (Stacked)",
    x = "College",
    y = "Percentage"
  )

(2) Facet Wrap (Separate Graphs)

Here the separate graphs represent each and within each graph is a comparison of declared percentages by . The individual graphs allow easier comparison.

ggplot(majors_analysis, aes(x = College, y = Pct_Declared, fill = Major)) + 
  facet_wrap(~ Major) +
  geom_bar(stat = "identity") +
  labs(
    title = "Individual Declared Percentage vs College (Separated)",
    x = "College",
    y = "Percentage"
  )

Declared Percentage vs Major (color filtered by College)

I likewise visualized it by comparing the declared percentages of each major and filled by college. I also swapped the axes for better readability.

(1) Stacked (One Graph)

ggplot(majors_analysis, aes(x = Pct_Declared, y = Major, fill = College)) +
  geom_bar(stat = 'identity') +
  labs(
    title = "Declared Percentage vs Major (Stacked)",
    x = "Percentage",
    y = "Major"
  )

(2) Facet Wrap (Separate Graphs)

Here the separate graphs represent each and within each graph is a comparison of their declared percentages by .

ggplot(majors_analysis, aes(x = Pct_Declared, y = Major, fill = College)) +
  facet_wrap(~ College) +
  geom_bar(stat = 'identity') +
  labs(
    title = "Declared Percentage vs Major (Separated)",
    x = "Percentage",
    y = "Major"
  )

Conclusion

My analysis focused on comparing the spread of the data according to airline as well as comparing percentages of major declarations. In comparing the standard deviations of overall declared major percentages for each school, we observe that CUNY has a standard deviation of 0.5565744, NYIT has standard deviation of 0.6114432, and NYU has a standard deviation of 0.6178946. This indicates that CUNY has more consistency in the percentage of declared majors compared to NYIT and NYU. When comparing the actual percentages of declared majors, whether by college overall or individual majors, it seems that CUNY has a smaller declaration rate across the board. Based on this analysis, CUNY students seem to be less decisive in terms of deciding on a major and with a smaller standard deviation, it shows more consistency in this regard.

DATA 607 - Project 2: Data Set 2

Julian Adames-Ng

2024-10-06

Introduction

Loading Data: Creating the Initial Data-Frame

Loading Data: Writing the Data-Frame to a CSV File

Loading Data: Changing “NA” Values to Blanks

Loading Data: Changing the Altered Column Names

Using Valid Column Names

Pivoting

Changing Column Type

Condensing the Data

Creating New Columns (1)

Creating New Columns (2)

Creating New Columns (3)

Analysis

Grouping by Major

Visualizing Declared Percentages with Error Bars

(1) Grouped by College

(2) Grouped by Major

Overall Declared Percentage vs College

Individual Declared Percentage vs College (color filered by major)

(1) Stacked (One Graph)

(2) Facet Wrap (Separate Graphs)

Declared Percentage vs Major (color filtered by College)

(1) Stacked (One Graph)

(2) Facet Wrap (Separate Graphs)

Conclusion