When working with data, you should expect to spend a good amount of time in the clean-up process, but it is not always ‘messy’ or unreadable. For example, data can still be organized in a data-frame in a way that is readable, but at the same time it may not be useful in such presented formats. In these cases, we may have to transpose the data-frame and re-organize to fit our needs.
The data is loaded into a data-frame called ‘majors_df’.
rm(list = ls())
majors_df <- data.frame(
c('NYU',NA,NA,'CUNY',NA, NA, 'NYIT', NA),
c('Declared','Undeclared',NA,'Declared','Undeclared',NA,'Declared','Undeclared'),
c(342,291,NA,807,227,NA,552,89),
c(57,122,NA,58,1206,NA,192,319),
c(915,582,NA,867,451,NA,1027,625),
c(377,151,NA,109,650,NA,21,235),
c(1569,257,NA,89,215,NA,1007,519),
c(601,95,NA,712,77,NA,811,62)
)
colnames(majors_df) <- c('','','Applied Mathematics','Physics','Computer Science','Data Science','Materials Science', 'Systems Engineering')
#view current state of data-frame
majors_df
## Applied Mathematics Physics Computer Science Data Science
## 1 NYU Declared 342 57 915 377
## 2 <NA> Undeclared 291 122 582 151
## 3 <NA> <NA> NA NA NA NA
## 4 CUNY Declared 807 58 867 109
## 5 <NA> Undeclared 227 1206 451 650
## 6 <NA> <NA> NA NA NA NA
## 7 NYIT Declared 552 192 1027 21
## 8 <NA> Undeclared 89 319 625 235
## Materials Science Systems Engineering
## 1 1569 601
## 2 257 95
## 3 NA NA
## 4 89 712
## 5 215 77
## 6 NA NA
## 7 1007 811
## 8 519 62
We can now write the ‘majors_df’ data-frame to a CSV file. Within the ‘write.csv()’ function, I specified the ‘row.names’ parameter to ‘FALSE’ to avoid creating non-blank row names. I also set the ‘na’ parameter to “” to replace all ‘NA’ values with blanks. The ‘stringsAsFactors’ parameter is set to ‘FALSE’ to avoid changing the character entries. At first glance it looks correct, but the data-frame still shows up with ‘NA’ values in some of the cells.
#write to csv file
write.csv(majors_df, "majors.csv", row.names = FALSE, na = "")
#read from csv file into R
majors_csv <- read.csv('majors.csv', stringsAsFactors = FALSE, check.names = FALSE)
majors_csv
## Applied Mathematics Physics Computer Science Data Science
## 1 NYU Declared 342 57 915 377
## 2 Undeclared 291 122 582 151
## 3 NA NA NA NA
## 4 CUNY Declared 807 58 867 109
## 5 Undeclared 227 1206 451 650
## 6 NA NA NA NA
## 7 NYIT Declared 552 192 1027 21
## 8 Undeclared 89 319 625 235
## Materials Science Systems Engineering
## 1 1569 601
## 2 257 95
## 3 NA NA
## 4 89 712
## 5 215 77
## 6 NA NA
## 7 1007 811
## 8 519 62
To remove the remaining ‘NA’ values I subset the data-frame and set any value that ‘is.na’ to ““. When calling the updated data, it seemed to have removed the blank row entirely. To confirm that this was not the case, I presented the data nicely using the ‘datatable()’ function.
#change NA values to blank
majors_csv[is.na(majors_csv)] <- ""
#check updated data-frame
majors_csv
## Applied Mathematics Physics Computer Science Data Science
## 1 NYU Declared 342 57 915 377
## 2 Undeclared 291 122 582 151
## 3
## 4 CUNY Declared 807 58 867 109
## 5 Undeclared 227 1206 451 650
## 6
## 7 NYIT Declared 552 192 1027 21
## 8 Undeclared 89 319 625 235
## Materials Science Systems Engineering
## 1 1569 601
## 2 257 95
## 3
## 4 89 712
## 5 215 77
## 6
## 7 1007 811
## 8 519 62
# Display the data frame nicely
datatable(majors_csv)
The previous output looks almost perfect, but the first two columns were given default names, namely “Var.2” and “Var.3”. To avoid this, I set the ‘options’ parameter accordingly. We can now see that our data-frame is formatted exactly as we want it to be. As a double-check, I compared it to a screen shot of the actual table that was created in the CSV file. Here, I used Excel to view it. The tables look identical.
#specify additional options
datatable(majors_csv, options = list(columnDefs = list(list(targets = c(1, 2), title = ""))))
Before we can pivot, we need to change the column/variable names to a valid format. This means no white-spaces or special characters.
#check data frame again
majors_csv
## Applied Mathematics Physics Computer Science Data Science
## 1 NYU Declared 342 57 915 377
## 2 Undeclared 291 122 582 151
## 3
## 4 CUNY Declared 807 58 867 109
## 5 Undeclared 227 1206 451 650
## 6
## 7 NYIT Declared 552 192 1027 21
## 8 Undeclared 89 319 625 235
## Materials Science Systems Engineering
## 1 1569 601
## 2 257 95
## 3
## 4 89 712
## 5 215 77
## 6
## 7 1007 811
## 8 519 62
Here, I replace white-spaces with a period and labeled the first two columns that originally had blank titles as ‘College’ and ‘Major Status’. For the blank row entries in the ‘College’ column, I labeled them according to the proper college.
#
colnames(majors_csv) <- c('College','Major.Status','Applied.Mathematics','Physics','Computer.Science','Data.Science','Materials.Science', 'Systems.Engineering')
majors_csv$College[2] <- 'NYU'
majors_csv$College[5] <- 'CUNY'
majors_csv$College[8] <- 'NYIT'
majors_csv
## College Major.Status Applied.Mathematics Physics Computer.Science
## 1 NYU Declared 342 57 915
## 2 NYU Undeclared 291 122 582
## 3
## 4 CUNY Declared 807 58 867
## 5 CUNY Undeclared 227 1206 451
## 6
## 7 NYIT Declared 552 192 1027
## 8 NYIT Undeclared 89 319 625
## Data.Science Materials.Science Systems.Engineering
## 1 377 1569 601
## 2 151 257 95
## 3
## 4 109 89 712
## 5 650 215 77
## 6
## 7 21 1007 811
## 8 235 519 62
Now that all of the required data is loaded, we need to transform what we have so that it is readable to our liking. In this case, this means pivoting the matrix and aggregating values and columns. To avoid gaps in our pivoted data, I remove the blank row before applying the ‘pivot_longer()’ function.
majors_csv <- majors_csv[c(-3,-6), ]
majors_csv
## College Major.Status Applied.Mathematics Physics Computer.Science
## 1 NYU Declared 342 57 915
## 2 NYU Undeclared 291 122 582
## 4 CUNY Declared 807 58 867
## 5 CUNY Undeclared 227 1206 451
## 7 NYIT Declared 552 192 1027
## 8 NYIT Undeclared 89 319 625
## Data.Science Materials.Science Systems.Engineering
## 1 377 1569 601
## 2 151 257 95
## 4 109 89 712
## 5 650 215 77
## 7 21 1007 811
## 8 235 519 62
p_majors <- pivot_longer(majors_csv,
cols = c('Applied.Mathematics','Physics','Computer.Science','Data.Science','Materials.Science', 'Systems.Engineering'),
names_to = "Major",
values_to = "Student.Count")
p_majors
## # A tibble: 36 × 4
## College Major.Status Major Student.Count
## <chr> <chr> <chr> <chr>
## 1 NYU Declared Applied.Mathematics 342
## 2 NYU Declared Physics 57
## 3 NYU Declared Computer.Science 915
## 4 NYU Declared Data.Science 377
## 5 NYU Declared Materials.Science 1569
## 6 NYU Declared Systems.Engineering 601
## 7 NYU Undeclared Applied.Mathematics 291
## 8 NYU Undeclared Physics 122
## 9 NYU Undeclared Computer.Science 582
## 10 NYU Undeclared Data.Science 151
## # ℹ 26 more rows
We need to change the ‘Student.Count’ column to numeric values to perform mathematical operations. After doing so, we can group the data according to College, Major, and Major.Status, in that order.
p_majors$Student.Count <- as.numeric(p_majors$Student.Count)
p_majors
## # A tibble: 36 × 4
## College Major.Status Major Student.Count
## <chr> <chr> <chr> <dbl>
## 1 NYU Declared Applied.Mathematics 342
## 2 NYU Declared Physics 57
## 3 NYU Declared Computer.Science 915
## 4 NYU Declared Data.Science 377
## 5 NYU Declared Materials.Science 1569
## 6 NYU Declared Systems.Engineering 601
## 7 NYU Undeclared Applied.Mathematics 291
## 8 NYU Undeclared Physics 122
## 9 NYU Undeclared Computer.Science 582
## 10 NYU Undeclared Data.Science 151
## # ℹ 26 more rows
#Aggregate data by major and college
grouped_majors <- p_majors %>%
group_by(College, Major, Major.Status) %>%
summarize(Student.Count, .groups = 'drop')
grouped_majors
## # A tibble: 36 × 4
## College Major Major.Status Student.Count
## <chr> <chr> <chr> <dbl>
## 1 CUNY Applied.Mathematics Declared 807
## 2 CUNY Applied.Mathematics Undeclared 227
## 3 CUNY Computer.Science Declared 867
## 4 CUNY Computer.Science Undeclared 451
## 5 CUNY Data.Science Declared 109
## 6 CUNY Data.Science Undeclared 650
## 7 CUNY Materials.Science Declared 89
## 8 CUNY Materials.Science Undeclared 215
## 9 CUNY Physics Declared 58
## 10 CUNY Physics Undeclared 1206
## # ℹ 26 more rows
As shown above, for each college there are duplicate entries for each major in order to match to either the ‘Declared’ or ‘Undeclared’ status. This gives us a data-frame with 36 rows. We can condense this down to just 18 rows by including the major/declaration status for each city on the same row by creating a ‘Declared’ column and an ‘Undeclared’ column.
#present corresponding values for (declared) and (undeclared) side by side
wide_majors <- grouped_majors %>%
pivot_wider(names_from = Major.Status,
values_from = Student.Count)
wide_majors
## # A tibble: 18 × 4
## College Major Declared Undeclared
## <chr> <chr> <dbl> <dbl>
## 1 CUNY Applied.Mathematics 807 227
## 2 CUNY Computer.Science 867 451
## 3 CUNY Data.Science 109 650
## 4 CUNY Materials.Science 89 215
## 5 CUNY Physics 58 1206
## 6 CUNY Systems.Engineering 712 77
## 7 NYIT Applied.Mathematics 552 89
## 8 NYIT Computer.Science 1027 625
## 9 NYIT Data.Science 21 235
## 10 NYIT Materials.Science 1007 519
## 11 NYIT Physics 192 319
## 12 NYIT Systems.Engineering 811 62
## 13 NYU Applied.Mathematics 342 291
## 14 NYU Computer.Science 915 582
## 15 NYU Data.Science 377 151
## 16 NYU Materials.Science 1569 257
## 17 NYU Physics 57 122
## 18 NYU Systems.Engineering 601 95
Now that we have data that is readable to our liking, we can create new columns to use for our analysis. Here, I created a column for the percentage of declared majors relative to all, declared + undeclared, for the given majors and corresponding to a specific college.
#create new column to show percentage of declared majors for each row
wide_majors$pct_declared <- wide_majors$Declared / (wide_majors$Declared + wide_majors$Undeclared)
wide_majors
## # A tibble: 18 × 5
## College Major Declared Undeclared pct_declared
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 CUNY Applied.Mathematics 807 227 0.780
## 2 CUNY Computer.Science 867 451 0.658
## 3 CUNY Data.Science 109 650 0.144
## 4 CUNY Materials.Science 89 215 0.293
## 5 CUNY Physics 58 1206 0.0459
## 6 CUNY Systems.Engineering 712 77 0.902
## 7 NYIT Applied.Mathematics 552 89 0.861
## 8 NYIT Computer.Science 1027 625 0.622
## 9 NYIT Data.Science 21 235 0.0820
## 10 NYIT Materials.Science 1007 519 0.660
## 11 NYIT Physics 192 319 0.376
## 12 NYIT Systems.Engineering 811 62 0.929
## 13 NYU Applied.Mathematics 342 291 0.540
## 14 NYU Computer.Science 915 582 0.611
## 15 NYU Data.Science 377 151 0.714
## 16 NYU Materials.Science 1569 257 0.859
## 17 NYU Physics 57 122 0.318
## 18 NYU Systems.Engineering 601 95 0.864
I decided to create two more columns for the mean of declared majors as well as the total number of declared + undeclared for each college, regardless of major. This is to compute a percentage for the mean in the same way that I computed a percentage for the individual majors.
mColl_dec <- wide_majors %>%
group_by(College) %>%
summarize(mean_declared = mean(Declared))
#mColl_dec
mDec_CUNY <- mColl_dec$mean_declared[1]
mDec_NYIT <- mColl_dec$mean_declared[2]
mDec_NYU <- mColl_dec$mean_declared[3]
wide_majors$mean_declared <- NA
wide_majors$mean_declared <- ifelse(wide_majors$College == "CUNY",mDec_CUNY, wide_majors$mean_declared)
wide_majors$mean_declared <- ifelse(wide_majors$College == "NYIT", mDec_NYIT, wide_majors$mean_declared)
wide_majors$mean_declared <- ifelse(wide_majors$College == "NYU", mDec_NYU, wide_majors$mean_declared)
wide_majors
## # A tibble: 18 × 6
## College Major Declared Undeclared pct_declared mean_declared
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 CUNY Applied.Mathematics 807 227 0.780 440.
## 2 CUNY Computer.Science 867 451 0.658 440.
## 3 CUNY Data.Science 109 650 0.144 440.
## 4 CUNY Materials.Science 89 215 0.293 440.
## 5 CUNY Physics 58 1206 0.0459 440.
## 6 CUNY Systems.Engineering 712 77 0.902 440.
## 7 NYIT Applied.Mathematics 552 89 0.861 602.
## 8 NYIT Computer.Science 1027 625 0.622 602.
## 9 NYIT Data.Science 21 235 0.0820 602.
## 10 NYIT Materials.Science 1007 519 0.660 602.
## 11 NYIT Physics 192 319 0.376 602.
## 12 NYIT Systems.Engineering 811 62 0.929 602.
## 13 NYU Applied.Mathematics 342 291 0.540 644.
## 14 NYU Computer.Science 915 582 0.611 644.
## 15 NYU Data.Science 377 151 0.714 644.
## 16 NYU Materials.Science 1569 257 0.859 644.
## 17 NYU Physics 57 122 0.318 644.
## 18 NYU Systems.Engineering 601 95 0.864 644.
Here I include columns for the total number of declared majors, total number of declared + undeclared, and declared percentages for the overall colleges. Then added a final column that calculates the mean across all majors of the percentages of declared majors for each college.
mDec_total1 <- wide_majors %>%
group_by(College) %>%
summarize(declared_total = sum(Declared))
mColl_total <- wide_majors %>%
group_by(College) %>%
summarize(total = sum(Declared, Undeclared))
# mDec_total1
# mColl_total
mDec_CUNY <- mDec_total1$declared_total[1]
mDec_NYIT <- mDec_total1$declared_total[2]
mDec_NYU <- mDec_total1$declared_total[3]
mTot_CUNY <- mColl_total$total[1]
mTot_NYIT <- mColl_total$total[2]
mTot_NYU <- mColl_total$total[3]
wide_majors$declared_total <- NA
wide_majors$total <- NA
wide_majors$declared_total <- ifelse(wide_majors$College == "CUNY", mDec_CUNY, wide_majors$declared_total)
wide_majors$declared_total <- ifelse(wide_majors$College == "NYIT", mDec_NYIT, wide_majors$declared_total)
wide_majors$declared_total <- ifelse(wide_majors$College == "NYU", mDec_NYU, wide_majors$declared_total)
wide_majors$total <- ifelse(wide_majors$College == "CUNY", mTot_CUNY, wide_majors$total)
wide_majors$total <- ifelse(wide_majors$College == "NYIT", mTot_NYIT, wide_majors$total)
wide_majors$total <- ifelse(wide_majors$College == "NYU", mTot_NYU, wide_majors$total)
wide_majors$college_declared_pct <- wide_majors$declared_total / wide_majors$total
wide_majors$pct_mean_declared <- NA
wide_majors$pct_mean_declared <- wide_majors$mean_declared / wide_majors$total
wide_majors
## # A tibble: 18 × 10
## College Major Declared Undeclared pct_declared mean_declared declared_total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CUNY Applie… 807 227 0.780 440. 2642
## 2 CUNY Comput… 867 451 0.658 440. 2642
## 3 CUNY Data.S… 109 650 0.144 440. 2642
## 4 CUNY Materi… 89 215 0.293 440. 2642
## 5 CUNY Physics 58 1206 0.0459 440. 2642
## 6 CUNY System… 712 77 0.902 440. 2642
## 7 NYIT Applie… 552 89 0.861 602. 3610
## 8 NYIT Comput… 1027 625 0.622 602. 3610
## 9 NYIT Data.S… 21 235 0.0820 602. 3610
## 10 NYIT Materi… 1007 519 0.660 602. 3610
## 11 NYIT Physics 192 319 0.376 602. 3610
## 12 NYIT System… 811 62 0.929 602. 3610
## 13 NYU Applie… 342 291 0.540 644. 3861
## 14 NYU Comput… 915 582 0.611 644. 3861
## 15 NYU Data.S… 377 151 0.714 644. 3861
## 16 NYU Materi… 1569 257 0.859 644. 3861
## 17 NYU Physics 57 122 0.318 644. 3861
## 18 NYU System… 601 95 0.864 644. 3861
## # ℹ 3 more variables: total <dbl>, college_declared_pct <dbl>,
## # pct_mean_declared <dbl>
In my analysis, I decided to find the sample standard deviation based on the percentage points that I computed for the individual majors and the percentages of the means. I did this to compare the declared majors according to variation in the data set.
sample_pct_sd <- wide_majors %>%
group_by(College) %>%
reframe(pct_declared, pct_mean_declared)
sample_pct_sd$sq_diff <- (sample_pct_sd$pct_declared - sample_pct_sd$pct_mean_declared)^2
sample_pct_sd1 <- sample_pct_sd %>%
group_by(College) %>%
summarize(pct_sd = sqrt(sum(sq_diff)/(6 - 1)))
#sample_pct_sd1
#standard deviation of delay percentages for ALASKA
cuny_pct_sd <- sample_pct_sd1$pct_sd[1]
#standard deviation of delay percentages for AM WEST
nyit_pct_sd <- sample_pct_sd1$pct_sd[2]
#standard deviation of delay percentages for AM WEST
nyu_pct_sd <- sample_pct_sd1$pct_sd[3]
#wide_majors
declared_df <- data.frame(College = wide_majors$College,
Major = wide_majors$Major,
Declared = wide_majors$Declared,
Pct_Declared = wide_majors$pct_declared,
#Mean_Pct_Delay = wide_flights$pct_mean_delayed,
#Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
#Avg_Pct_SD = NA,
Pct_SD = NA,
College_Declared_Pct = wide_majors$college_declared_pct
)
declared_df$Pct_SD <- ifelse(wide_majors$College == "CUNY", cuny_pct_sd, declared_df$Pct_SD)
declared_df$Pct_SD <- ifelse(wide_majors$College == "NYIT", nyit_pct_sd, declared_df$Pct_SD)
declared_df$Pct_SD <- ifelse(wide_majors$College == "NYU", nyu_pct_sd, declared_df$Pct_SD)
#write to csv
write.csv(declared_df, 'major_analysis.csv')
majors_analysis <- read.csv('major_analysis.csv')
majors_analysis
## X College Major Declared Pct_Declared Pct_SD
## 1 1 CUNY Applied.Mathematics 807 0.78046422 0.5565744
## 2 2 CUNY Computer.Science 867 0.65781487 0.5565744
## 3 3 CUNY Data.Science 109 0.14361001 0.5565744
## 4 4 CUNY Materials.Science 89 0.29276316 0.5565744
## 5 5 CUNY Physics 58 0.04588608 0.5565744
## 6 6 CUNY Systems.Engineering 712 0.90240811 0.5565744
## 7 7 NYIT Applied.Mathematics 552 0.86115445 0.6114432
## 8 8 NYIT Computer.Science 1027 0.62167070 0.6114432
## 9 9 NYIT Data.Science 21 0.08203125 0.6114432
## 10 10 NYIT Materials.Science 1007 0.65989515 0.6114432
## 11 11 NYIT Physics 192 0.37573386 0.6114432
## 12 12 NYIT Systems.Engineering 811 0.92898053 0.6114432
## 13 13 NYU Applied.Mathematics 342 0.54028436 0.6178946
## 14 14 NYU Computer.Science 915 0.61122244 0.6178946
## 15 15 NYU Data.Science 377 0.71401515 0.6178946
## 16 16 NYU Materials.Science 1569 0.85925520 0.6178946
## 17 17 NYU Physics 57 0.31843575 0.6178946
## 18 18 NYU Systems.Engineering 601 0.86350575 0.6178946
## College_Declared_Pct
## 1 0.4831748
## 2 0.4831748
## 3 0.4831748
## 4 0.4831748
## 5 0.4831748
## 6 0.4831748
## 7 0.6612933
## 8 0.6612933
## 9 0.6612933
## 10 0.6612933
## 11 0.6612933
## 12 0.6612933
## 13 0.7204702
## 14 0.7204702
## 15 0.7204702
## 16 0.7204702
## 17 0.7204702
## 18 0.7204702
The following code mimics the same process as above that was done for each college, but for majors. To avoid confusion, I gave these data-frames to new assignments.
mMaj_dec <- wide_majors %>%
group_by(Major) %>%
summarize(mean_declared = mean(Declared))
wide_majors2 <- wide_majors
#wide_majors2
#mMaj_dec
mDec_math <- mMaj_dec$mean_declared[1]
mDec_cs <- mMaj_dec$mean_declared[2]
mDec_ds <- mMaj_dec$mean_declared[3]
mDec_ms <- mMaj_dec$mean_declared[4]
mDec_p <- mMaj_dec$mean_declared[5]
mDec_se <- mMaj_dec$mean_declared[6]
wide_majors2$mean_declared <- NA
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Applied.Mathematics",mDec_math, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Computer.Science", mDec_cs, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Data.Science", mDec_ds, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Materials.Science",mDec_ms, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Physics", mDec_p, wide_majors2$mean_declared)
wide_majors2$mean_declared <- ifelse(wide_majors2$Major == "Systems.Engineering", mDec_se, wide_majors2$mean_declared)
wide_majors2
## # A tibble: 18 × 10
## College Major Declared Undeclared pct_declared mean_declared declared_total
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CUNY Applie… 807 227 0.780 567 2642
## 2 CUNY Comput… 867 451 0.658 936. 2642
## 3 CUNY Data.S… 109 650 0.144 169 2642
## 4 CUNY Materi… 89 215 0.293 888. 2642
## 5 CUNY Physics 58 1206 0.0459 102. 2642
## 6 CUNY System… 712 77 0.902 708 2642
## 7 NYIT Applie… 552 89 0.861 567 3610
## 8 NYIT Comput… 1027 625 0.622 936. 3610
## 9 NYIT Data.S… 21 235 0.0820 169 3610
## 10 NYIT Materi… 1007 519 0.660 888. 3610
## 11 NYIT Physics 192 319 0.376 102. 3610
## 12 NYIT System… 811 62 0.929 708 3610
## 13 NYU Applie… 342 291 0.540 567 3861
## 14 NYU Comput… 915 582 0.611 936. 3861
## 15 NYU Data.S… 377 151 0.714 169 3861
## 16 NYU Materi… 1569 257 0.859 888. 3861
## 17 NYU Physics 57 122 0.318 102. 3861
## 18 NYU System… 601 95 0.864 708 3861
## # ℹ 3 more variables: total <dbl>, college_declared_pct <dbl>,
## # pct_mean_declared <dbl>
##############
mDec_total2 <- wide_majors %>%
group_by(Major) %>%
summarize(declared_total = sum(Declared))
mColl_total1 <- wide_majors %>%
group_by(Major) %>%
summarize(total = sum(Declared, Undeclared))
# mDec_total2
# mColl_total1
mDec_math <- mDec_total2$declared_total[1]
mDec_cs <- mDec_total2$declared_total[2]
mDec_ds <- mDec_total2$declared_total[3]
mDec_ms <- mDec_total2$declared_total[4]
mDec_p <- mDec_total2$declared_total[5]
mDec_se <- mDec_total2$declared_total[6]
mTot_math <- mColl_total1$total[1]
mTot_cs <- mColl_total1$total[2]
mTot_ds <- mColl_total1$total[3]
mTot_ms <- mColl_total1$total[4]
mTot_p <- mColl_total1$total[5]
mTot_se <- mColl_total1$total[6]
wide_majors2$declared_total <- NA
wide_majors2$total <- NA
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Applied.Mathematics", mDec_math, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Computer.Science", mDec_cs, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Data.Science", mDec_ds, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Materials.Science", mDec_ms, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Physics", mDec_p, wide_majors2$declared_total)
wide_majors2$declared_total <- ifelse(wide_majors2$Major == "Systems.Engineering", mDec_se, wide_majors2$declared_total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Applied.Mathematics", mTot_math, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Computer.Science", mTot_cs, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Data.Science", mTot_ds, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Materials.Science", mTot_ms, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Physics", mTot_p, wide_majors2$total)
wide_majors2$total <- ifelse(wide_majors2$Major == "Systems.Engineering", mTot_se, wide_majors2$total)
wide_majors2$major_declared_pct <- wide_majors2$declared_total / wide_majors2$total
wide_majors2$pct_mean_declared <- NA
wide_majors2$pct_mean_declared <- wide_majors2$mean_declared / wide_majors2$total
#wide_majors2
##############
sample_pct_sd3 <- wide_majors2 %>%
group_by(Major) %>%
reframe(pct_declared, pct_mean_declared)
sample_pct_sd3$sq_diff2 <- (sample_pct_sd3$pct_declared - sample_pct_sd3$pct_mean_declared)^2
sample_pct_sd4 <- sample_pct_sd3 %>%
group_by(Major) %>%
summarize(pct_sd = sqrt(sum(sq_diff2)/(6 - 1)))
# sample_pct_sd3
# sample_pct_sd4
#standard deviation of delay percentages for ALASKA
math_pct_sd <- sample_pct_sd4$pct_sd[1]
#standard deviation of delay percentages for AM WEST
cs_pct_sd <- sample_pct_sd4$pct_sd[2]
#standard deviation of delay percentages for AM WEST
ds_pct_sd <- sample_pct_sd4$pct_sd[3]
#standard deviation of delay percentages for ALASKA
ms_pct_sd <- sample_pct_sd4$pct_sd[4]
#standard deviation of delay percentages for AM WEST
p_pct_sd <- sample_pct_sd4$pct_sd[5]
#standard deviation of delay percentages for AM WEST
se_pct_sd <- sample_pct_sd4$pct_sd[6]
#wide_majors2
declared_df2 <- data.frame(College = wide_majors2$College,
Major = wide_majors2$Major,
Declared = wide_majors2$Declared,
Pct_Declared = wide_majors2$pct_declared,
#Mean_Pct_Delay = wide_flights$pct_mean_delayed,
#Pct_SD = round(sqrt(((wide_flights$pct_delayed - wide_flights$pct_mean_delayed)^2) / (wide_flights$total - 1)), 5),
#Avg_Pct_SD = NA,
Pct_SD = NA,
Major_Declared_Pct = wide_majors2$major_declared_pct
)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Applied.Mathematics", math_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Computer.Science", cs_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Data.Science", ds_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Materials.Science", ms_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Physics", p_pct_sd, declared_df2$Pct_SD)
declared_df2$Pct_SD <- ifelse(wide_majors2$Major == "Systems.Engineering", se_pct_sd, declared_df2$Pct_SD)
#write to csv
write.csv(declared_df2, 'major_analysis2.csv')
majors_analysis2 <- read.csv('major_analysis2.csv')
majors_analysis2
## X College Major Declared Pct_Declared Pct_SD
## 1 1 CUNY Applied.Mathematics 807 0.78046422 0.3877194
## 2 2 CUNY Computer.Science 867 0.65781487 0.3261813
## 3 3 CUNY Data.Science 109 0.14361001 0.2710438
## 4 4 CUNY Materials.Science 89 0.29276316 0.3334944
## 5 5 CUNY Physics 58 0.04588608 0.1872943
## 6 6 CUNY Systems.Engineering 712 0.90240811 0.4637106
## 7 7 NYIT Applied.Mathematics 552 0.86115445 0.3877194
## 8 8 NYIT Computer.Science 1027 0.62167070 0.3261813
## 9 9 NYIT Data.Science 21 0.08203125 0.2710438
## 10 10 NYIT Materials.Science 1007 0.65989515 0.3334944
## 11 11 NYIT Physics 192 0.37573386 0.1872943
## 12 12 NYIT Systems.Engineering 811 0.92898053 0.4637106
## 13 13 NYU Applied.Mathematics 342 0.54028436 0.3877194
## 14 14 NYU Computer.Science 915 0.61122244 0.3261813
## 15 15 NYU Data.Science 377 0.71401515 0.2710438
## 16 16 NYU Materials.Science 1569 0.85925520 0.3334944
## 17 17 NYU Physics 57 0.31843575 0.1872943
## 18 18 NYU Systems.Engineering 601 0.86350575 0.4637106
## Major_Declared_Pct
## 1 0.7370017
## 2 0.6288337
## 3 0.3285807
## 4 0.7289387
## 5 0.1571136
## 6 0.9007634
## 7 0.7370017
## 8 0.6288337
## 9 0.3285807
## 10 0.7289387
## 11 0.1571136
## 12 0.9007634
## 13 0.7370017
## 14 0.6288337
## 15 0.3285807
## 16 0.7289387
## 17 0.1571136
## 18 0.9007634
The graphs below show the percentages of declared majors in three different ways. All visuals include error bars that show the variability of the data.
Here we see the colleges grouped in three graphs, each with the declared percentages of the individual majors.
ggplot(majors_analysis2, aes(x = Pct_Declared, y = Major, fill = College)) +
facet_wrap(~ College) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Declared - Pct_SD, xmax = Pct_Declared + Pct_SD), width = 0.2) +
labs(
title = "Declared Percentage by Major",
x = "Percentage",
y = "Major"
)
Here we see the cities grouped in five graphs. Each shows the declared percentages of the individual colleges.
ggplot(majors_analysis, aes(x = Pct_Declared, y = College, fill = Major)) +
facet_wrap(~ Major) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = Pct_Declared - Pct_SD, xmax = Pct_Declared + Pct_SD), width = 0.2) +
labs(
title = "Declared Percentage by College",
x = "Percentage",
y = "College"
)
Below I compared the overall declared majors for each college and used a bar graph to compare. It is clear that Airlines has CUNY has the smallest major declaration rate relative to their total number of students across all majors.
dec_group <- majors_analysis %>%
group_by(College) %>%
summarize(coll_declared_p = mean(College_Declared_Pct))
ggplot(dec_group, aes(x = College, y = coll_declared_p, fill = College)) +
geom_bar(stat = "identity") +
labs(
title = "Overall Declared Percentage by College",
x = "College",
y = "Percentage"
)
When breaking down the percentage comparison by major, I visualized it as (1) a single stacked graph with the information for all majors and (2) as separate graphs for each major. In either case, we see a significant difference in declared percentages across colleges.
Here the graph is visually appealing, but may be harder to compare as the bars representing the declared percentage for each major are not properly aligned side by side.
ggplot(majors_analysis, aes(x = College, y = Pct_Declared, fill = Major)) +
geom_bar(stat = "identity", position = "stack") +
labs(
title = "Individual Declared Percentage vs College (Stacked)",
x = "College",
y = "Percentage"
)
Here the separate graphs represent each and within each graph is a comparison of declared percentages by . The individual graphs allow easier comparison.
ggplot(majors_analysis, aes(x = College, y = Pct_Declared, fill = Major)) +
facet_wrap(~ Major) +
geom_bar(stat = "identity") +
labs(
title = "Individual Declared Percentage vs College (Separated)",
x = "College",
y = "Percentage"
)
I likewise visualized it by comparing the declared percentages of each major and filled by college. I also swapped the axes for better readability.
ggplot(majors_analysis, aes(x = Pct_Declared, y = Major, fill = College)) +
geom_bar(stat = 'identity') +
labs(
title = "Declared Percentage vs Major (Stacked)",
x = "Percentage",
y = "Major"
)
Here the separate graphs represent each and within each graph is a comparison of their declared percentages by .
ggplot(majors_analysis, aes(x = Pct_Declared, y = Major, fill = College)) +
facet_wrap(~ College) +
geom_bar(stat = 'identity') +
labs(
title = "Declared Percentage vs Major (Separated)",
x = "Percentage",
y = "Major"
)
My analysis focused on comparing the spread of the data according to airline as well as comparing percentages of major declarations. In comparing the standard deviations of overall declared major percentages for each school, we observe that CUNY has a standard deviation of 0.5565744, NYIT has standard deviation of 0.6114432, and NYU has a standard deviation of 0.6178946. This indicates that CUNY has more consistency in the percentage of declared majors compared to NYIT and NYU. When comparing the actual percentages of declared majors, whether by college overall or individual majors, it seems that CUNY has a smaller declaration rate across the board. Based on this analysis, CUNY students seem to be less decisive in terms of deciding on a major and with a smaller standard deviation, it shows more consistency in this regard.