Brazilian higher education data at the course level was obtained from the 2023 Higher Education Census, from INEP, via the Base dos Dados organization’s datalake. The analysis focuses only on public institutions and undergraduate courses.
Show Code
# Set the billing IDbasedosdados::set_billing_id('vinicius-projetos-r')# Query to download dictionarydic_query ="SELECT * FROM `basedosdados.br_inep_censo_educacao_superior.dicionario`"census_dic = basedosdados::read_sql(dic_query)# Query to download higher education datacensus_query =paste("SELECT sigla_uf, id_municipio, tipo_grau_academico, tipo_organizacao_administrativa, nome_area_detalhada, nome_area_geral, quantidade_vagas, quantidade_ingressantesFROM `basedosdados.br_inep_censo_educacao_superior.curso`WHERE ano = 2023 AND tipo_nivel_academico = '1' AND rede = '1' AND sigla_uf IS NOT NULL AND nome_area_detalhada IS NOT NULL", sep="")census_base =as.data.frame(basedosdados::read_sql(query =gsub("\\s+", " ", census_query)))for (col inc(7:8)){ census_base[,col] =as.numeric(census_base[,col])}
Data Processing and Geolocation
The vacancy rate was calculated by (number of vacancies - number of admitted students) / number of vacancies.
The first processing step was to define the maximum number of admitted students in a course at an institution as the total number of vacancies for that course. This is necessary because it is common to have courses with a higher number of admitted students than the total number of vacancies offered in the database, perhaps due to filling errors by institutions caused by confusion in adopted concepts, which would hinder our analysis.
The second processing step was to remove the ‘Federal Institute of Education, Science and Technology of Mato Grosso’ from our database, as it represented an outlier in the data.
To locate the mesoregion where the institution is installed, a spreadsheet from the NEREUS group was used to match municipalities with mesoregions. The boundaries of Brazilian mesoregions were obtained through the geobr package.
# Get mesoregions's boundaries meso_boundary =read_meso_region(year=2020) %>%select(name_meso, geom) %>%mutate(name_meso =str_to_lower(stri_trans_general(name_meso, "latin-ascii")))# Get match between municipatilies and mesoregionsjoin =read_xlsx('regioes_geo.xlsx') %>%select(8,12) %>%rename(codigo =`Código Município Completo`,meso =`Nome_Mesorregião`) %>%mutate(meso =str_to_lower(stri_trans_general(meso, "latin-ascii")))# Join tablesfinal_base =left_join(census_base, join, by =c('id_municipio'='codigo'))final_base =left_join(final_base, meso_boundary, by =c('meso'='name_meso')) %>%st_as_sf()
From the map below, it is possible to perceive that the undergraduate vacancy rate has a spatial heterogeneity. More specifically, the North and Northeast regions show a higher occupation of offered vacancies, while the Center-South of Brazil presents a higher vacancy rate, with the exception of São Paulo and Paraná, where they apparently also show low vacancy rates.
Show Code
max_val <-max(t$taxa, na.rm =TRUE)min_val <-min(t$taxa, na.rm =TRUE)ggplot() +geom_sf(data = meso_boundary, fill ='grey80', color ="white", linewidth =0.5) +geom_sf(data = t, aes(fill=taxa), color ="white", linewidth =0.2) +scale_fill_distiller(name ="",palette ="RdBu",direction =-1,na.value ="grey",labels = percent,limits =c(min_val, max_val),breaks =c(min_val, max_val)) +labs(title ="Undergraduate Vacancy Rate",subtitle ="Distribution across mesoregions",caption ="Fonte: Censo da Educação Superior 2023; Base dos Dados." ) +theme_void() +theme(plot.title =element_text(family ="pf", size =100, face ="bold", color ="#222222", hjust =0.5),plot.subtitle =element_text(family ="pf", size =50, color ="#555555", hjust =0.5),legend.text =element_text(family ="pf", size =45),plot.caption =element_text(family ="pf", size =40, hjust =0.5),legend.position ='bottom',legend.key.width =unit(1.8, "cm"),legend.key.size=unit(0.7, "cm") )
Seeking to deepen the work, a study was conducted on which bachelor’s and licentiate degree courses presented the highest and lowest vacancy rates in Brazil. Courses with few classes and vacancies offered in Brazil were disregarded, such as, for example, Education Science, limiting the analysis to larger courses. A greater presence of Data and Software courses is observed among those with lower vacancy rates, and Engineering courses among those with higher vacancy rates.
Show Code
ordered_data <-as.data.frame(final_base) %>%filter(!is.na(nome_area_detalhada), tipo_grau_academico %in%c('Bacharelado', 'Licenciatura'),!nome_area_detalhada %in%c('Medicina e terapia tradicional e complementar','Finanças, bancos e seguros', 'Ciência da educação','Têxteis (vestuário, calçados e couro)','Programas interdisciplinares abrangendo engenharia, produção e construção','Formação de professores de educação infantil','Programas interdisciplinares abrangendo ciências sociais, comunicação e informação','Veículos a motor, construção naval, aeronáutica, ferroviária e metroviária')) %>%group_by(nome_area_detalhada) %>%summarise(vagas =sum(qtd_vagas_tratada, na.rm =TRUE),ingressos =sum(qtd_ingressos_tratado, na.rm =TRUE),taxa = (vagas - ingressos)/vagas,.groups ='drop') %>%mutate(taxa_z =round((taxa -mean(taxa, na.rm =TRUE)) /sd(taxa, na.rm =TRUE),2)) %>%arrange(desc(taxa_z))bottom_10 <-tail(ordered_data, 10)top_10 <-head(ordered_data, 10)top_bottom_20 <-bind_rows(top_10, bottom_10) %>%mutate(nome_area_detalhada =factor(nome_area_detalhada, levels = ordered_data$nome_area_detalhada)) %>%mutate(nome_area_curto =recode(nome_area_detalhada,"Programas interdisciplinares abrangendo computação e Tecnologias da Informação e Comunicação (TIC)"='Desenvolvimento de Softw.',"Soluções computacionais para domínios específicos"="Ciência de dados","Setor militar e de defesa"="Ciências militares","Medicina"="Medicina","Odontologia"="Odontologia","Veterinária"="Veterinária","Ciência da computação"="Ciência da computação","Programas interdisciplinares abrangendo educação"="Interdisciplina em educação","Silvicultura"="Engenharia florestal","Física"="Física","Matemática"="Matemática","Programas interdisciplinares abrangendo artes e humanidades"="Interdisciplina em humanidades","Processamento de alimentos"="Engenharia de alimentos","Materiais"="Engenharia de materiais","Pesca"="Engenharia de pesca","Serviços de alimentação"="Gastronomia","Produção de software"="Engenharia de Softw.","Mineração e extração"="Eng. de Minas e Petróleo"))
Show Code
ggplot(top_bottom_20, aes(x=nome_area_curto, y=taxa_z, label=taxa_z)) +geom_point(stat='identity', fill="#26735B", size=6) +geom_segment(aes(y =0, x = nome_area_curto, yend = taxa_z, xend = nome_area_curto), color ="#0D261E") +labs(title="Undergraduate Normalized Vacancy Rate", subtitle="10 Lowest and Highest rates for bachelor's and licentiate degrees",x='',y='',caption ="Fonte: Censo da Educação Superior 2023; Base dos Dados." ) +ylim(-2.5, 2.5) +coord_flip() +theme(panel.background =element_rect(fill ="white", colour =NA), plot.background =element_rect(fill ="white", colour =NA), panel.grid.major =element_blank(), panel.grid.minor =element_blank(), text =element_text(family ="pf"),plot.title =element_text(size =40, face ="bold", color ="#222222", hjust =0.5),plot.margin =margin(r =2, l =2, unit ="cm"),plot.subtitle =element_text( size =25, color ="#555555", hjust =0.5),plot.caption =element_text(size =20, hjust =0.5),legend.position ='none',axis.text.x =element_text(size =25),axis.text.y =element_text(size =20))
Finally, seeking to expand the analysis, a visualization was developed to better understand the misalignment between offered courses and regional demands.
In the graph below, it is noticeable that ICT, Health, and Business & Law courses have a high occupancy rate in practically all states. Courses classified as Exact Sciences, Education, and Agro & Veterinary are those with lower occupancy rates in several states.
The states with a higher predominance of green dots are São Paulo, Paraná, and Pará, while those with more red dots are Goiás, Rio Grande do Norte, and Santa Catarina. While Exact Sciences courses have high occupancy in the Northeast region, there is not as much demand in the rest of Brazil. The North region, in turn, shows more occupancy in engineering courses than the rest of Brazil. In the Northeast, although Law and Business courses have a medium-high occupancy, they have lower levels compared to the rest of Brazil. In the Center-West, there is greater heterogeneity than in other regions regarding course occupancy.
This analysis is useful for contributing to the discussion of reviewing the distribution and relevance of courses offered in Brazil.
Show Code
# Griup by major area and stateregional_data <-as.data.frame(final_base) %>%filter(!is.na(nome_area_geral)) %>%group_by(nome_area_geral, sigla_uf) %>%summarise(vagas =sum(qtd_vagas_tratada, na.rm =TRUE),ingressos =sum(qtd_ingressos_tratado, na.rm =TRUE),taxa = (vagas - ingressos)/vagas)# Get the unique valuesunicos =unique(regional_data$nome_area_geral)# Rename major areasregional_data = regional_data %>%mutate(nome_curto =case_when( nome_area_geral =='Agricultura, silvicultura, pesca e veterinária'~'Agro e Veterinária', nome_area_geral =='Ciências naturais, matemática e estatística'~'Ciências exatas', nome_area_geral =='Ciências sociais, comunicação e informação'~'Sociais e comunicação', nome_area_geral =='Computação e Tecnologias da Informação e Comunicação (TIC)'~'TIC', nome_area_geral =='Engenharia, produção e construção'~'Engenharias', nome_area_geral =='Negócios, administração e direito'~'Negócios e Direito', nome_area_geral =='Saúde e bem-estar'~'Saúde', nome_area_geral =='Artes e humanidades'~'Artes e humanidades', nome_area_geral =='Educação'~'Educação', nome_area_geral =='Serviços'~'Serviços', nome_area_geral =='Programas básicos'~'Programas básicos' ) )
ggplot(regional_data, aes(nome_curto, sigla_uf, color = taxa)) +geom_point(size =6) +scale_color_gradient2(low ="darkgreen", mid ="lightgrey", high ="darkred", midpoint =0.5)+labs(title="Undergraduate Vacancy Rate", subtitle="Rate by major area across states",x='',y='',caption ="Fonte: Censo da Educação Superior 2023; Base dos Dados." ) +coord_flip() +theme(panel.background =element_rect(fill ="white", colour =NA),plot.background =element_rect(fill ="white", colour =NA), panel.grid.major =element_blank(),panel.grid.minor =element_blank(), text =element_text(family ="pf"),plot.title =element_text(size =40, face ="bold", color ="#222222", hjust =0.5),plot.subtitle =element_text( size =25, color ="#555555", hjust =0.5),plot.caption =element_text(size =20, hjust =0.5),legend.position ='none',axis.text.x =element_text(size =16),axis.text.y =element_text(size =20))