In 2017 and 2018, Kaggle surveyed its community about their experience
with machine learning and data science. Respondents came from several
continents, but this analysis focuses on users in Africa, since the
goal is to compare their statistics with those of the other
continents.
The research questions cover topics such as country of residence,
age and gender distribution, educational background, and work and
coding experience, among others. Based on these data, several charts
were created to help understand the influence of machine learning and
data science.
library(tidyverse)
library(grid)
library(gridExtra)
library(ggforce)
I will use three datasets for this kernel: the 2017 and 2018 Kaggle survey responses, and a custom dataset listing the countries present in the 2018 survey together with their continents.
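The code that reads the files is not shown in this kernel. A minimal sketch of how one of the tables might be loaded with `readr::read_csv()` follows. The file path and the `continents_demo` name are hypothetical, and the sketch writes a tiny stand-in file so that it runs on its own.

```r
library(readr)

# Hypothetical stand-in for the custom country/continent table;
# the real file name and location are not given in this kernel.
path <- tempfile(fileext = ".csv")
write_csv(
  data.frame(Country = c("Nigeria", "Kenya", "France"),
             Continent = c("Africa", "Africa", "Europe")),
  path
)

# Read every column as character to avoid type-guessing surprises.
# show_col_types = FALSE suppresses the column-specification message.
continents_demo <- read_csv(path,
                            col_types = cols(.default = col_character()),
                            show_col_types = FALSE)
continents_demo
```

Declaring column types up front is one way to avoid the parsing warnings that type guessing can raise on wide survey files.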
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 16716 Columns: 228
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (212): GenderSelect, Country, EmploymentStatus, StudentStatus, LearningD...
## dbl (13): Age, LearningCategorySelftTaught, LearningCategoryOnlineCourses, ...
## num (1): CompensationAmount
## lgl (2): WorkToolsFrequencyAngoss, WorkToolsFrequencyKNIMECommercial
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 23860 Columns: 395
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (395): Time from Start to Finish (seconds), Q1, Q1_OTHER_TEXT, Q2, Q3, Q...
## Rows: 56 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Country, Continent
# STRING PROCESSING
# countries: shorten long labels. fixed() matches each pattern literally,
# so the "..." in the Iran label is not read as regex wildcards.
multipleChoice18$Q3 <- str_replace(multipleChoice18$Q3, fixed("Iran, Islamic Republic of..."), "Iran")
multipleChoice18$Q3 <- str_replace(multipleChoice18$Q3, fixed("I do not wish to disclose my location"), "Won't disclose")
multipleChoice18$Q3 <- str_replace(multipleChoice18$Q3, fixed("United Kingdom of Great Britain and Northern Ireland"), "UK and NI")
multipleChoice18$Q3 <- str_replace(multipleChoice18$Q3, fixed("United States of America"), "USA")
continents$Country <- str_replace(continents$Country, fixed("Iran, Islamic Republic of..."), "Iran")
continents$Country <- str_replace(continents$Country, fixed("I do not wish to disclose my location"), "Won't disclose")
continents$Country <- str_replace(continents$Country, fixed("United Kingdom of Great Britain and Northern Ireland"), "UK and NI")
continents$Country <- str_replace(continents$Country, fixed("United States of America"), "USA")
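A short illustration of why the pattern type matters when recoding these labels: `str_replace()` treats a plain string as a regular expression, where `.` matches any character, while `fixed()` compares it literally. The test string ending in `123` is made up for the demonstration.

```r
library(stringr)

label <- "Iran, Islamic Republic of..."

# As a regex, each "." matches any character, so the pattern also
# matches text that does not end in literal dots.
regex_hit <- str_detect("Iran, Islamic Republic of123", label)

# fixed() compares the pattern character by character.
literal_hit <- str_detect("Iran, Islamic Republic of123", fixed(label))

# The intended replacement works with the literal pattern.
shortened <- str_replace(label, fixed(label), "Iran")

regex_hit    # TRUE
literal_hit  # FALSE
shortened    # "Iran"
```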
# CONVERT CATEGORICAL DATA TO FACTOR
# age groups
multipleChoice18$Q2 <- factor(multipleChoice18$Q2,
levels = c("18-21","22-24","25-29",
"30-34","35-39","40-44",
"45-49","50-54","55-59",
"60-69","70-79","80+"),
labels = c("18-21","22-24","25-29",
"30-34","35-39","40-44",
"45-49","50-54","55-59",
"60-69","70-79","80+"))
# degree
multipleChoice18$Q4 <- factor(multipleChoice18$Q4,
levels = c("Doctoral degree","Master’s degree","Bachelor’s degree","Professional degree",
"No formal education past high school",
"Some college/university study without earning a bachelor’s degree",
"I prefer not to answer"),
labels = c("PhD","Master","Bachelor","Professional",
"High school","No degree","Won't disclose"))
# undergraduate major
multipleChoice18$Q5 <- factor(multipleChoice18$Q5,
levels = c("Medical or life sciences (biology, chemistry, medicine, etc.)",
"Computer science (software engineering, etc.)",
"Engineering (non-computer focused)",
"Mathematics or statistics",
"A business discipline (accounting, economics, finance, etc.)",
"Environmental science or geology",
"Social sciences (anthropology, psychology, sociology, etc.)",
"Physics or astronomy",
"Information technology, networking, or system administration",
"I never declared a major",
"Other",
"Humanities (history, literature, philosophy, etc.)") ,
# labels are matched to levels by position, so they follow the same order
labels = c("Medical/life sciences", "Computer science",
"Engineering", "Mathematics/statistics",
"A business discipline", "Env. science",
"Social sciences", "Physics/astronomy",
"IT/Network/Sys. admin", "No major declared",
"Other", "Humanities"))
# In what industry is your current employer?
multipleChoice18$Q7 <- factor(multipleChoice18$Q7,
levels = c("Retail/Sales", "I am a student",
"Computers/Technology", "Accounting/Finance",
"Academics/Education",
"Insurance/Risk Assessment","Other",
"Energy/Mining", "Non-profit/Service",
"Marketing/CRM", "Government/Public Service",
"Manufacturing/Fabrication",
"Online Service/Internet-based Services",
"Broadcasting/Communications",
"Medical/Pharmaceutical",
"Online Business/Internet-based Sales",
"Military/Security/Defense",
"Shipping/Transportation",
"Hospitality/Entertainment/Sports"),
labels = c("Retail / Sales", "Student",
"Computers / Technology", "Accounting / Finance",
"Academics / Education",
"Insurance / Risk Assessment","Other",
"Energy / Mining", "Non-profit / Service",
"Marketing / CRM", "Government / Public Service",
"Manufacturing / Fabrication",
"Online Service / Internet-based Services",
"Broadcasting / Communications",
"Medical / Pharmaceutical",
"Online Business / Internet-based Sales",
"Military / Security / Defense",
"Shipping / Transportation",
"Hospitality / Entertainment / Sports"))
# experience in current role
multipleChoice18$Q8 <- factor(multipleChoice18$Q8, levels = c("0-1","1-2","2-3",
"3-4","4-5","5-10",
"10-15","15-20","20-25",
"25-30","30+"))
# yearly compensation
multipleChoice18$Q9 <- factor(multipleChoice18$Q9,
levels = c("I do not wish to disclose my approximate yearly compensation",
"0-10,000","10-20,000","20-30,000","30-40,000",
"40-50,000","50-60,000","60-70,000","70-80,000",
"80-90,000","90-100,000","100-125,000",
"125-150,000","150-200,000","200-250,000",
"250-300,000","300-400,000", "400-500,000","500,000+"),
labels = c("Won't disclose",
"0-10,000","10-20,000","20-30,000","30-40,000",
"40-50,000","50-60,000","60-70,000","70-80,000",
"80-90,000","90-100,000","100-125,000",
"125-150,000","150-200,000","200-250,000",
"250-300,000","300-400,000", "400-500,000","500,000+"))
# time spent coding
multipleChoice18$Q23 <- factor(multipleChoice18$Q23, levels = c("0% of my time",
"1% to 25% of my time",
"25% to 49% of my time",
"50% to 74% of my time",
"75% to 99% of my time",
"100% of my time"),
labels = c("0%","1% to 25%","25% to 49%",
"50% to 74%","75% to 99%","100%"))
# coding experience
multipleChoice18$Q24 <- factor(multipleChoice18$Q24,
levels = c("I have never written code and I do not want to learn",
"I have never written code but I want to learn",
"< 1 year","1-2 years","3-5 years","5-10 years",
"10-20 years","20-30 years","30-40 years", "40+ years") ,
labels = c("I don't write code and don't want to learn",
"I don't write code but want to learn",
"< 1 year", "1-2 years", "3-5 years",
"5-10 years", "10-20 years","20-30 years","30-40 years", "40+ years")
)
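The `factor()` calls in this section pair `levels` and `labels` purely by position, and any answer missing from `levels` silently becomes `NA`. A small sketch of both behaviours, using made-up answers and a hypothetical named lookup vector that keeps each raw answer next to its short label:

```r
answers <- c("Physics or astronomy", "Other",
             "Physics or astronomy", "Unlisted answer")

major <- factor(answers,
                levels = c("Physics or astronomy", "Other"),
                labels = c("Physics/astronomy", "Other"))

# Values that are not in `levels` become NA without a warning.
table(major, useNA = "ifany")

# A named lookup (raw answer = short label) makes the pairing explicit.
lookup <- c("Physics or astronomy" = "Physics/astronomy",
            "Other" = "Other")
major2 <- factor(answers, levels = names(lookup), labels = unname(lookup))

identical(major, major2)
```

Checking `sum(is.na(...))` before and after the conversion is a cheap way to spot a level that was misspelled or left out.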
# For how many years have you used machine learning methods
multipleChoice18$Q25 <- factor(multipleChoice18$Q25,
levels = c("I have never studied machine learning and I do not plan to",
"I have never studied machine learning but plan to learn in the future",
"< 1 year", "1-2 years", "2-3 years", "3-4 years", "4-5 years",
"5-10 years", "10-15 years", "20+ years"),
labels = c("Never studied, do not plan to",
"Never studied, plan to learn",
"< 1 year", "1-2 years", "2-3 years", "3-4 years", "4-5 years",
"5-10 years", "10-15 years", "20+ years"))
# use of machine learning in industries
multipleChoice18$Q10 <- factor(multipleChoice18$Q10,
levels = c("I do not know",
"No (we do not use ML methods)",
"We are exploring ML methods (and may one day put a model into production)",
"We recently started using ML methods (i.e., models in production for less than 2 years)",
"We have well established ML methods (i.e., models in production for more than 2 years)",
"We use ML methods for generating insights (but do not put working models into production)"),
labels = c("I do not know", "No", "Exploring ML methods",
"Recently started", "Well established ML methods",
"For generating insights"))
# expertise in data science
multipleChoice18$Q40 <- factor(multipleChoice18$Q40,
levels = c("Independent projects are equally important as academic achievements",
"Independent projects are much more important than academic achievements",
"Independent projects are slightly more important than academic achievements",
"Independent projects are slightly less important than academic achievements",
"Independent projects are much less important than academic achievements",
"No opinion; I do not know"),
labels = c("Equally important",
"Much more important",
"Slightly more important",
"Slightly less important",
"Much less important",
"No opinion/Don't know"))
# are you a data scientist?
multipleChoice18$Q26 <- factor(multipleChoice18$Q26,
levels = c("Definitely yes", "Probably yes", "Maybe",
"Probably not", "Definitely not"),
labels = c("Definitely yes", "Probably yes", "Maybe",
"Probably not", "Definitely not"))
The survey covered 57 countries in total. The first chart is a bar chart of the number of respondents per continent: Asia has the most respondents and Oceania the fewest, while Africa ranks third.
newMultipleChoice %>%
group_by(Continent) %>%
summarise(Count = length(Continent)) %>%
mutate(highlight_flag = ifelse((Continent == "Africa"), T, F)) %>%
ggplot(aes(x = reorder(Continent,-Count), y = Count, fill = Continent)) +
geom_bar(aes(fill = highlight_flag), stat = "identity", color = "grey") +
geom_text(aes(label =as.character(Count)),
position = position_dodge(width = 1),
hjust = 0.5, vjust = -0.25, size = 3) +
scale_fill_brewer(palette = "PuBu") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Number of respondents",
x = "", y = "Count", fill = "",
caption = "Africa and the world")
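`newMultipleChoice` is created outside the code shown here. A minimal sketch, assuming it is the 2018 responses joined to the country/continent table on the country name; the `_demo` tables and their rows are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for multipleChoice18 and continents (values are made up).
survey_demo <- tibble(Q3 = c("Nigeria", "Kenya", "India", "India", "France"))
continents_demo <- tibble(
  Country   = c("Nigeria", "Kenya", "India", "France"),
  Continent = c("Africa", "Africa", "Asia", "Europe")
)

# Attach the continent to each response, then count and flag Africa.
by_continent <- survey_demo %>%
  left_join(continents_demo, by = c("Q3" = "Country")) %>%
  count(Continent, name = "Count") %>%
  mutate(highlight_flag = Continent == "Africa")
by_continent
```

A left join keeps every response, so countries missing from the lookup table would show up with an `NA` continent instead of being dropped.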
The second chart compares the number of respondents per year for the main African countries; the increase for Nigeria stands out. Finally, the third chart shows country of residence by gender for the African countries, with men outnumbering women.
p1 <- df %>%
  group_by(Country, Year) %>%
  summarise(Count = n()) %>%
  ggplot(aes(x = Year, y = Count, group = Country)) +
  # linewidth replaces the `size` argument for lines (ggplot2 >= 3.4.0)
  geom_line(aes(color = Country), linewidth = 0.5) +
  geom_point(aes(color = Country), size = 4) +
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        axis.text = element_text(size = 12),
        legend.position = "bottom",
        legend.title = element_blank(),
        legend.text = element_text(size = 10)) +
  labs(title = "Number of respondents",
       x = "", y = "Count", fill = "", caption = "")
## `summarise()` has grouped output by 'Country'. You can override using the
## `.groups` argument.
p2 <- afroCountries %>%
group_by(Q1,Q3) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
summarise(Count = length(Q3)) %>%
ggplot(aes(x = reorder(Q3,-Count), y = Count, fill = Q1)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
legend.position = "top",
legend.text = element_text(size = 10)) +
labs(title = "Country of residence", x = "", y = "", fill = "",
caption = "About us")
grid.arrange(p1,p2, ncol = 2)
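The data frame `df` used in the line chart is built outside the code shown here. One way it could be assembled, assuming it stacks the African rows of both survey years with a `Year` column; the `_demo` tables, their rows and the 2017 column name `Country` are stand-ins for illustration.

```r
library(dplyr)

# Toy stand-ins for the African subsets of each survey year (made-up rows).
afro17_demo <- tibble(Country = c("Nigeria", "Kenya"))
afro18_demo <- tibble(Q3 = c("Nigeria", "Nigeria", "Kenya"))

# Harmonise the country column name, tag each row with its year, and stack.
df_demo <- bind_rows(
  afro17_demo %>% transmute(Country, Year = "2017"),
  afro18_demo %>% transmute(Country = Q3, Year = "2018")
)

counts <- df_demo %>% count(Country, Year, name = "Count")
counts
```

Storing `Year` as text keeps it a discrete axis in the plot, so only the two survey years appear as tick marks.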
Although fewer women than men responded, Tunisia has the highest female-to-male ratio of all countries. Nigeria, by contrast, has the lowest ratio in Africa, although it has the largest number of respondents. (4) Across continents, North America has the highest female-to-male ratio, with the African countries in second place. (5)
multipleChoice18 %>%
group_by(Q1,Q3) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q3)) %>%
summarise(Count = n()) %>%
spread(Q1,Count) %>%
mutate(ratio = Female/Male) %>%
mutate(highlight_flag = ifelse((Q3 == "Egypt" | Q3 == "Kenya" | Q3 == "Morocco" |
Q3 == "Nigeria" | Q3 == "Tunisia" | Q3 == "South Africa"), T, F)) %>%
ggplot(aes(x = reorder(Q3,-ratio), y = ratio, fill = ratio)) +
geom_bar(aes(fill = highlight_flag), stat = "identity") +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 9.5, angle = -90,
hjust = 0 , vjust = 0.5)) +
labs(title = "Female to Male ratio",
x = "", y = "Ratio", fill = "",
caption = "About us")
newMultipleChoice %>%
group_by(Continent,Q1) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
summarise(Count = n()) %>%
spread(Q1,Count) %>%
mutate(ratio = Female/Male) %>%
mutate(highlight_flag = ifelse((Continent == "Africa"), T, F)) %>%
ggplot(aes(x = reorder(Continent,-ratio), y = ratio, fill = ratio)) +
geom_bar(aes(fill = highlight_flag), stat = "identity", color = "grey") +
scale_fill_brewer(palette = "PuBu") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12)) +
labs(title = "Female to Male ratio",
x = "", y = "Ratio", fill = "",
caption = "Africa and the world")
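The ratio computation in the two chunks above can be sketched on toy data. This version swaps in `tidyr::pivot_wider()`, the current replacement for `spread()`; the rows below are made up and are not survey results.

```r
library(dplyr)
library(tidyr)

demo <- tibble(
  Q3 = c("Tunisia", "Tunisia", "Tunisia",
         "Nigeria", "Nigeria", "Nigeria", "Nigeria"),
  Q1 = c("Female", "Male", "Male",
         "Female", "Male", "Male", "Male")
)

ratios <- demo %>%
  filter(Q1 %in% c("Female", "Male")) %>%
  count(Q3, Q1, name = "Count") %>%
  # One column per gender; a country with no answers for a gender gets 0.
  pivot_wider(names_from = Q1, values_from = Count, values_fill = 0L) %>%
  mutate(ratio = Female / Male)
ratios
```

With counts this small a ratio swings a lot from one extra respondent, which is worth keeping in mind for countries with few answers.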
This chart shows the age distribution of respondents by gender; male respondents are clearly more numerous. Most respondents are between 22 and 29 years old: the most common age group is 22-24 for men and 25-29 for women.
ggplot(data = temp,
aes(x = Q2, fill = Q1)) +
geom_bar(data = filter(temp, Q1 == "Male"), aes(y = Count), stat = "identity") +
geom_bar(data = filter(temp, Q1 == "Female"), aes(y = -1*Count), stat = "identity") +
scale_y_continuous(breaks = seq(-50,150,50),
labels = as.character(c(seq(50,0,-50), seq(50,150,50)))) +
scale_fill_brewer(palette = "Paired") +
coord_flip() +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Age distribution in Africa",
x = "Age group (years)", y = "Count", fill = "",
caption = "")
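The `temp` data frame plotted above is created in code that is not shown. A minimal sketch of how such a table could be built, assuming it holds one count per gender and age group; the rows are made up, and the `signed` column is an illustrative addition that mirrors the `-1*Count` used for the female bars.

```r
library(dplyr)

demo <- tibble(
  Q1 = c("Male", "Male", "Male", "Female", "Female", "Other"),
  Q2 = c("22-24", "22-24", "25-29", "25-29", "25-29", "22-24")
)

temp_demo <- demo %>%
  filter(Q1 %in% c("Female", "Male"), !is.na(Q2)) %>%
  count(Q1, Q2, name = "Count") %>%
  # Signed counts put women to the left of zero and men to the right.
  mutate(signed = if_else(Q1 == "Female", -Count, Count))
temp_demo
```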
Looking at educational background by continent, the bachelor's degree is the most common qualification in Africa, whereas in the other continents the master's degree is more common. (7) By gender, the percentage of women with master's and doctoral degrees is higher than that of men. (8)
p1 <- newMultipleChoice %>%
  group_by(Continent, Q4) %>%
  filter(!is.na(Q4)) %>%
  summarise(Count = n()) %>%
  # summarise() keeps the Continent grouping, so pct is computed within each continent
  mutate(pct = round(prop.table(Count) * 100, 2)) %>%
  ggplot(aes(x = Q4, y = pct, group = Continent)) +
  geom_line(aes(color = Continent), linewidth = 0.5) +
  geom_point(aes(color = Continent), size = 2) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
  # pct is numeric, so the default continuous y scale is used (no scale_y_discrete)
  theme(plot.title = element_text(size = 15, hjust = 0.5),
        axis.text = element_text(size = 12),
        axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
        legend.position = "top",
        legend.title = element_blank(),
        legend.text = element_text(size = 11)) +
  labs(title = "Educational background",
       x = "", y = "%", fill = "",
       caption = "")
p2 <- afroCountries %>%
group_by(Q1,Q4) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q4)) %>%
summarise(Count = length(Q4)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = "", y = pct, fill = Q4)) +
geom_col(width = 1) +
scale_fill_brewer(palette = "Set3") +
facet_grid(Q1~.) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
legend.text = element_text(size = 11)) +
labs(title = "Degree",
x = "", y = "", fill = "Degree",
caption = "About us")
grid.arrange(p1,p2,ncol = 2)
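The percentages in these charts depend on a grouping detail: after `group_by(A, B)`, `summarise()` drops only the last grouping variable, so the following `prop.table()` is evaluated within each value of `A`. A small sketch on made-up rows, with the default behaviour spelled out through `.groups = "drop_last"`:

```r
library(dplyr)

demo <- tibble(
  Continent = c("Africa", "Africa", "Africa", "Europe", "Europe"),
  Q4        = c("Bachelor", "Bachelor", "Master", "Master", "PhD")
)

pct_demo <- demo %>%
  group_by(Continent, Q4) %>%
  # The result is still grouped by Continent after this step.
  summarise(Count = n(), .groups = "drop_last") %>%
  # Shares therefore add up to 100 within each continent.
  mutate(pct = round(prop.table(Count) * 100, 2))
pct_demo
```

Using `.groups = "drop"` here instead would make the shares add up to 100 across all continents combined, which would change what the charts show.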
Across all surveyed continents, the most common undergraduate major is computer science, followed by engineering and mathematics/statistics; the least common is environmental science.
newMultipleChoice %>%
group_by(Continent,Q5) %>%
filter(!is.na(Q5)) %>%
summarise(Count = length(Q5)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot() +
geom_point(mapping = aes(x = Continent, y = reorder(Q5,pct),
size = pct, color = Q5)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 5)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 12),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Undergraduate major",
x = "", y = "", fill = "",
caption = "Africa and the world")
By gender, computer science is also the most common major, at close
to 50% for women and more than 40% for men.
afroCountries %>%
group_by(Q1,Q5) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q5)) %>%
summarise(Count = length(Q5)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = reorder(Q5,-pct), y = pct, group = Q1)) +
geom_point(aes(color = Q1), size = 2) + geom_line(aes(color = Q1), linewidth = 1) +
scale_fill_brewer(palette = "Set3") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 12, angle = -90,hjust = 0,vjust = 0.5),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Undergraduate major",
x = "", y = "%", fill = "",
caption = "About us")
In Asia and Africa, students make up the largest share, whereas in Europe, North America and South America most users work in data science. The most common jobs are data scientist, data analyst and software engineer. (11)
newMultipleChoice %>%
group_by(Continent,Q6) %>%
filter(!is.na(Q6)) %>%
summarise(Count = length(Continent)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot() +
geom_point(mapping = aes(x = Continent, y = reorder(Q6,pct),
size = 5*pct, color = Q6)) +
scale_x_discrete(labels = function(x) str_wrap(x,width = 8)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
axis.text.y = element_text(size = 10),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Current role",
x = "", y = "", fill = "",
caption = "Africa and the world")
By gender, a larger share of men are students (more than 20%), whereas unemployment is lower among women. (12)
afroCountries %>%
group_by(Q1,Q6) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q6)) %>%
summarise(Count = length(Q6)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = reorder(Q6,-pct), y = pct, group = Q1)) +
geom_point(aes(color = Q1), size = 2) + geom_line(aes(color = Q1), linewidth = 1) +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Current role", x = "", y = "%", fill = "",
caption = "About us")
The color of each tile shows how frequent each job is in each
country. Most data journalists come from Tunisia and Nigeria, and
most data scientists are in South Africa. (13)
afroCountries %>%
group_by(Q6,Q3) %>%
filter(!is.na(Q6)) %>%
summarise(Count = length(Q6)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q6, y = Q3, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 11)) +
labs(title = "Current role by country",
x = "", y = "", fill = "",
caption = "About us")
The pie charts of student versus non-student respondents show that the share of students fell markedly between 2017 and 2018. (14)
propStud17 <- afroCountries17 %>%
group_by(GenderSelect,StudentStatus) %>%
filter(GenderSelect == "Female" | GenderSelect == "Male") %>%
filter(!is.na(StudentStatus)) %>%
summarise(Count = length(StudentStatus)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = "", y = pct, fill = StudentStatus)) +
geom_col(width = 1) +
coord_polar("y", start = pi / 3) +
scale_fill_brewer(palette = "Paired") +
facet_wrap(GenderSelect~.) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "top") +
labs(title = "Proportion of students", subtitle = "2017", x = "", y = "", fill = "",
caption = "")
propStud <- propStud18 %>%
group_by(Q1,Q6) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q6)) %>%
summarise(Count = length(Q6)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = "", y = pct, fill = Q6)) +
geom_col(width = 1) +
coord_polar("y", start = pi / 3) +
scale_fill_brewer(palette = "Paired") +
facet_wrap(Q1~.) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "top") +
labs(title = "Proportion of students", subtitle = "2018", x = "", y = "", fill = "",
caption = "About us")
grid.arrange(propStud17,propStud, ncol = 2)
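`propStud18` is defined outside the code shown here. A minimal sketch, assuming it is the 2018 African subset with the role in `Q6` collapsed into a student / non-student indicator so that it can be compared with `StudentStatus` from 2017; the `"Yes"`/`"No"` coding and the rows are made up for illustration.

```r
library(dplyr)

afro18_demo <- tibble(
  Q1 = c("Male", "Male", "Female", "Female"),
  Q6 = c("Student", "Data Analyst", "Student", NA)
)

# Collapse the role into a student indicator; a missing role stays NA.
propStud18_demo <- afro18_demo %>%
  mutate(Q6 = if_else(Q6 == "Student", "Yes", "No"))
propStud18_demo
```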
In both years, the most common current roles are data scientist,
data analyst and software engineer. (15)
# 2017
p1 <- afroCountries17 %>%
group_by(CurrentJobTitleSelect) %>%
filter(!is.na(CurrentJobTitleSelect)) %>%
summarise(Count = length(CurrentJobTitleSelect)) %>%
ggplot(aes(x = reorder(CurrentJobTitleSelect, Count), y = Count, fill = CurrentJobTitleSelect)) +
geom_col() +
coord_flip() +
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 11),
legend.position = "none") +
labs(title = "Current role", subtitle = "2017", x = "", y = "Count", fill = "",
caption = "")
# 2018
p2 <- afroCountries %>%
group_by(Q6) %>%
filter(Q6 != "Student") %>%
filter(!is.na(Q6)) %>%
summarise(Count = length(Q6)) %>%
ggplot(aes(x = reorder(Q6, Count), y = Count, fill = Q6)) +
geom_col() +
coord_flip()+
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 11),
legend.position = "none") +
labs(title = "Current role", subtitle = "2018", x = "", y = "Count", fill = "",
caption = "About us")
grid.arrange(p1,p2, ncol = 2)
Students, software engineers, employees, data scientists, data engineers and data analysts are the roles with the least experience: less than one year. The exceptions are consultants, managers, project managers, research scientists and salespeople, most of whom have between 5 and 10 years of experience.
afroCountries %>%
group_by(Q6,Q8) %>%
filter(!is.na(Q6)) %>%
filter(!is.na(Q8)) %>%
summarise(Count = length(Q8)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q8, y = Q6, fill = Count)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Experience in current role",
x = "Years", y = "", fill = "",
caption = "About us")
Machine learning is not widely used in most professions. Students are not taught it at school, and roles such as software engineer, salesperson, researcher and research assistant, product manager, principal investigator, marketing analyst, manager, data engineer, data analyst, consultant, chief officer and business analyst do not use its methods either.
afroCountries %>%
group_by(Q6,Q10) %>%
filter(!is.na(Q10)) %>%
summarise(Count = length(Q10)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q10, y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 10),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Use of ML in industries",
x = "", y = "", fill = "",
caption = "Machine learning usage")
Respondents in most roles said that they analyze and understand data to influence product or business decisions, and that they carry out these tasks at some point.
- Research assistants and research scientists do more research than the other roles.
- Data engineers and database engineers build and run data infrastructure.
- Salespeople and data journalists do none of these tasks.
afroCountries %>%
select(Q6,Q11_Part_1,Q11_Part_2, Q11_Part_3,Q11_Part_4,Q11_Part_5,Q11_Part_6,Q11_Part_7)%>%
gather(2:8, key = "questions", value = "Function")%>%
group_by(Q6,Function)%>%
filter(!is.na(Function))%>%
summarise(Count = length(Function))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Function, y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 9),
legend.position = "none") +
labs(title = "Day to day function",
x = "", y = "", fill = "",
caption = "Machine learning use")
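The reshaping step in the chunk above turns the seven `Q11_Part_*` columns into one row per selected task. A sketch of the same idea on made-up rows, swapping in `tidyr::pivot_longer()`, the current replacement for `gather()`:

```r
library(dplyr)
library(tidyr)

demo <- tibble(
  Q6         = c("Data Analyst", "Data Analyst", "Data Engineer"),
  Q11_Part_1 = c("Analyze data", "Analyze data", NA),
  Q11_Part_2 = c(NA, "Build infrastructure", "Build infrastructure")
)

long <- demo %>%
  # One row per (respondent, selected task); unselected parts are dropped.
  pivot_longer(starts_with("Q11_Part_"),
               names_to = "questions", values_to = "Function",
               values_drop_na = TRUE) %>%
  count(Q6, Function, name = "Count") %>%
  group_by(Q6) %>%
  # Share of each task among all tasks selected by that role.
  mutate(pct = Count / sum(Count) * 100)
long
```

Because a respondent can tick several tasks, these percentages are shares of selections per role, not shares of respondents.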
Most respondents have less than 5 years of coding experience, and the largest group has between 1 and 2 years. Considerably more men than women report coding experience. (18)
By country, respondents in South Africa span the widest range of years of experience, while Morocco has the most respondents with one year of experience, followed by Nigeria and Egypt. (19)
p1 <- afroCountries %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
group_by(Q1,Q24) %>%
filter(!is.na(Q24)) %>%
summarise(Count = length(Q24)) %>%
ggplot(aes(x = Q24, y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
axis.text.x = element_text(size = 10, angle = -90,hjust = 0,vjust = 0.5),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Coding experience",
x = "", y = "Count", fill = "")
p2 <- afroCountries %>%
group_by(Q3,Q24) %>%
filter(!is.na(Q3)) %>%
filter(!is.na(Q24)) %>%
summarise(Count = length(Q24)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = Q24, y = pct, group = Q3)) +
geom_point(aes(color = Q3), size = 1.5) + geom_line(aes(color = Q3), linewidth = 0.5) +
scale_fill_brewer(palette = "Set3") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 15)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 10, angle = -90,hjust = 0,vjust = 0.5),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Coding experience by country",
x = "", y = "%", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q3'. You can override using the `.groups`
## argument.
grid.arrange(p1,p2,ncol = 2)
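The percentages in the country panel come from `prop.table()` applied right after `summarise()`, so they are shares within each country rather than shares of the whole sample. A minimal sketch on made-up data (the `toy` tibble and its values are illustrative assumptions, not survey figures) shows the mechanics, and also how the `.groups` argument silences the `summarise()` message:

```r
library(dplyr)

# Hypothetical toy data standing in for afroCountries (values are made up)
toy <- tibble(
  Q3  = c("Nigeria", "Nigeria", "Nigeria", "Kenya", "Kenya"),
  Q24 = c("< 1 year", "1-2 years", "1-2 years", "< 1 year", "< 1 year")
)

shares <- toy %>%
  filter(!is.na(Q3), !is.na(Q24)) %>%
  group_by(Q3, Q24) %>%
  # "drop_last" keeps the Q3 grouping, which is what summarise() does by default
  summarise(Count = n(), .groups = "drop_last") %>%
  # prop.table() runs once per remaining group, so pct sums to 100 within each Q3
  mutate(pct = round(prop.table(Count) * 100, 2)) %>%
  ungroup()

shares
```

Because the grouping left after `summarise()` depends on the order of the variables in `group_by()`, swapping `Q3` and `Q24` would instead give shares within each experience band.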
By profession, data engineers have the highest percentage at 5 to 10 years of coding experience, while most research scientists have between 3 and 5 years. The profession with the lowest percentage of experience using these languages is researcher. (20)
afroCountries %>%
group_by(Q6,Q24) %>%
filter(!is.na(Q24)) %>%
summarise(Count = length(Q24)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q24, y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Coding experience", x = "", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Most respondents choose the programming language that best fits the needs of their current job. There is clear competition between Python and R.
Python is the most used programming language among students, software engineers, salespeople, research scientists, research assistants, data scientists, data analysts, data engineers, consultants and chief officers. Statisticians and data analysts, by contrast, lean more towards R.
afroCountries %>%
group_by(Q6,Q17) %>%
filter(!is.na(Q17)) %>%
summarise(Count = length(Q17)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = reorder(Q17,-pct), y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = 45, hjust = 1),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Most used programming language", x = "", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
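A note on the axis order in the chart above: `reorder(Q17, -pct)` sorts the languages by the mean of `-pct` across professions, because the default `FUN` of `stats::reorder()` is `mean`. A language with a high share in a single profession can therefore be placed ahead of one that is used more widely. The sketch below uses made-up values (not survey data) to show the difference between ordering by mean and by sum:

```r
# Made-up shares: "A" appears in three groups at 40%, "B" in one group at 90%
lang <- c("A", "A", "A", "B")
pct  <- c(40, 40, 40, 90)

# Default FUN = mean: mean(-pct) is -40 for "A" and -90 for "B", so "B" sorts first
by_mean <- levels(reorder(lang, -pct))

# Passing FUN = sum orders by the total instead: -120 for "A", -90 for "B"
by_sum <- levels(reorder(lang, -pct, FUN = sum))

by_mean
by_sum
```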
When it comes to recommending a programming language, the main professions listed in the previous chart recommend Python as the best option for aspiring data scientists.
afroCountries %>%
group_by(Q6,Q18) %>%
filter(!is.na(Q18)) %>%
summarise(Count = length(Q18)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = reorder(Q18,-pct), y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Recommended programming language", x = "", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
In both cases, most used and most recommended, Python is the top choice, followed by Java and R.
afroCountries %>%
group_by(Q17,Q18) %>%
filter(!is.na(Q17)) %>%
filter(!is.na(Q18)) %>%
summarise(Count = length(Q17)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = reorder(Q17,pct), y = Q18, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
geom_text(aes(label = as.character(Count)), color = "white", size = 3) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
axis.text.x = element_text(angle = 35, hjust = 1),
legend.text = element_text(size = 11)) +
labs(title = "Most used vs. Recommended programming languages",
x = "Most used", y = "Recommended",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q17'. You can override using the `.groups`
## argument.
By gender, men spend more of their time coding: almost 35% of them spend 50% to 74% of their time on it, whereas 31% of women spend 25% to 49% of their time coding. This may be related to men reporting more coding experience.
afroCountries %>%
group_by(Q1,Q23) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q23)) %>%
summarise(Count = length(Q23)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = Q23, y = pct, group = Q1)) +
geom_point(aes(color = Q1), size = 2) + geom_line(aes(color = Q1), linewidth = 1) +
scale_fill_brewer(palette = "Set3") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 12),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Time spent actively coding",
x = "of time", y = "% of people", fill = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
Time spent coding depends on the profession. Students, software engineers, marketing analysts, data scientists, data analysts and data engineers spend the largest share of their time on it, from 50% to 74%.
afroCountries %>%
group_by(Q6,Q23) %>%
filter(!is.na(Q23)) %>%
summarise(Count = length(Q23)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q23, y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Time spent coding",
x = "of time", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
The choice of IDE also depends on the profession. Students, data scientists and data analysts are the heaviest IDE users; the most used IDEs are, in order, Jupyter/IPython, RStudio, Notepad++ and MATLAB.
afroCountries %>%
select(Q6,30:45)%>%
gather(2:16, key = "questions", value = "IDEs")%>%
group_by(Q6,IDEs)%>%
filter(!is.na(IDEs))%>%
summarise(Count = length(IDEs))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(IDEs,-pct), y = Q6, fill = Count)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "IDEs",
x = "", y = "", fill = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
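The multi-select questions are stored with one column per answer option, which is why the chunk above stacks them with `gather()` before counting. `gather()` still works, but tidyr documents it as superseded by `pivot_longer()`. The following is a minimal sketch of the same reshape on made-up data; the column names `Q13_Part_1` and `Q13_Part_2` are assumptions for illustration, since the original selects the columns by position:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide layout: one column per option, NA when the option was not picked
toy <- tibble(
  Q6         = c("Student", "Student", "Data Scientist"),
  Q13_Part_1 = c("Jupyter/IPython", NA, "Jupyter/IPython"),
  Q13_Part_2 = c("RStudio", "RStudio", NA)
)

ide_counts <- toy %>%
  pivot_longer(starts_with("Q13_Part_"),
               names_to = "questions", values_to = "IDEs",
               values_drop_na = TRUE) %>%  # same effect as filter(!is.na(IDEs))
  count(Q6, IDEs, name = "Count")          # same counts as group_by() + summarise()

ide_counts
```

Selecting by name rather than by position also makes it harder to gather one column too few or too many by accident.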
Most respondents do not use hosted notebooks. The exceptions are Kaggle Kernels, JupyterHub/Binder and Google Colab, which are used by students and data scientists.
afroCountries %>%
select(Q6,Q14_Part_1:Q14_Part_11)%>%
gather(2:12, key = "questions", value = "Hosted_Notebook")%>%
group_by(Q6,Hosted_Notebook)%>%
filter(!is.na(Hosted_Notebook))%>%
summarise(Count = length(Hosted_Notebook))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Hosted_Notebook,-pct), y = Q6, fill = Count)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Hosted notebook used at school or work",
subtitle = "(past 5 years)",
x = "", y = "", fill = "",
caption = "")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
For data analysis, students and data scientists, who make up the majority of respondents, mainly use local or hosted development environments such as RStudio and JupyterLab. Their second most used tool is basic statistical software such as Microsoft Excel and Google Sheets.
afroCountries %>%
group_by(Q6,Q12_MULTIPLE_CHOICE) %>%
filter(!is.na(Q12_MULTIPLE_CHOICE)) %>%
summarise(Count = length(Q12_MULTIPLE_CHOICE)) %>%
ggplot(aes(x = reorder(Q12_MULTIPLE_CHOICE,-Count), y = Q6, fill = Count)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Primary tools for data analysis",
x = "", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Both men and women surveyed consider themselves data scientists, although the proportion is higher among men than among women.
afroCountries %>%
filter(Q1 == "Female"| Q1 == "Male") %>%
group_by(Q1,Q26) %>%
filter(!is.na(Q26)) %>%
summarise(Count = length(Q26)) %>%
ggplot(aes(x = Q26, y = Count, fill = Q1))+
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 8)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Think of themselves as a data scientist",
x = "", y = "Count", fill = "",
caption = "Personal views")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
African Kagglers have the highest percentage of positive answers when asked whether they consider themselves data scientists, compared with the other continents surveyed. Oceania and Europe have the lowest percentages.
newMultipleChoice %>%
group_by(Continent,Q26) %>%
filter(!is.na(Q26)) %>%
summarise(Count = length(Q26)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q26, y = pct, group = Continent)) +
geom_line(aes(color = Continent), linewidth = 0.5) +
geom_point(aes(color = Continent), size = 2) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Think of themselves as a data scientist",
x = "", y = "%", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
African Kagglers have only recently started using machine learning methods: the largest group has just one year of experience.
afroCountries %>%
group_by(Q6,Q25) %>%
filter(!is.na(Q25)) %>%
summarise(Count = length(Q25)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q25, y = Q6, fill = Count)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Usage of machine learning at work/school",
x = "", y = "",
caption = "Coding experience")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
The same applies to machine learning products. Most respondents have never used any machine learning product. Among the few who have, most use SAS.
afroCountries %>%
select(Q6,152:194)%>%
gather(2:44, key = "questions", value = "ML_Products")%>%
group_by(Q6,ML_Products)%>%
filter(!is.na(ML_Products))%>%
summarise(Count = length(ML_Products))%>%
mutate(pct = prop.table(Count)*100)%>%
top_n(5,pct) %>%
ggplot() +
geom_point(mapping = aes(x = reorder(ML_Products,-Count), y = Q6,
size = pct, color = ML_Products)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 8, angle = 45, hjust = 1),
axis.text.y = element_text(size = 8),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Machine learning products (past 5 years)",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
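In the chunk above, `top_n(5, pct)` runs while the data is still grouped by `Q6`, so it keeps the five highest-share products within each profession (more when there are ties), not the five most used products overall. dplyr documents `top_n()` as superseded by `slice_max()`; a minimal sketch of the equivalent step on made-up counts (names and numbers are illustrative assumptions):

```r
library(dplyr)

# Hypothetical counts per profession and product (not survey figures)
toy <- tibble(
  Q6          = c("Student", "Student", "Student", "Analyst", "Analyst"),
  ML_Products = c("None", "SAS", "Azure ML", "None", "SAS"),
  Count       = c(10, 4, 1, 3, 3)
)

top2 <- toy %>%
  group_by(Q6) %>%
  mutate(pct = prop.table(Count) * 100) %>%
  # slice_max() works per group and keeps ties by default (with_ties = TRUE)
  slice_max(pct, n = 2) %>%
  ungroup()

top2
```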
But (un)fortunately, Africa is not the only one in this situation. About 25% of all respondents have not used any machine learning product either.
newMultipleChoice %>%
select(Continent,152:194)%>%
gather(2:44, key = "questions", value = "ML_Products")%>%
group_by(Continent,ML_Products)%>%
filter(!is.na(ML_Products))%>%
summarise(Count = length(ML_Products))%>%
mutate(pct = prop.table(Count)*100)%>%
top_n(5,Count) %>%
ggplot() +
geom_point(mapping = aes(x = Continent, y = reorder(ML_Products,pct),
size = pct, color = ML_Products)) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 8),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Machine learning products (past 5 years)",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
Scikit-Learn, TensorFlow and Keras are the most used general machine learning frameworks in most countries, mainly among students, software engineers, research scientists, data engineers, data analysts and data scientists. The least used framework is MXNet.
afroCountries %>%
select(Q6,Q19_Part_1:Q19_Part_19)%>%
gather(2:19, key = "questions", value = "ML_Framework")%>%
group_by(Q6,ML_Framework)%>%
filter(!is.na(ML_Framework))%>%
summarise(Count = length(ML_Framework))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(ML_Framework,-Count), y = Q6, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Machine learning framework (past 5 years)",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Comparing the proportions of men and women, both are largely confident that they can understand and explain the outputs of many, but not all, machine learning models, with women showing the higher proportion. Women also have the higher proportion of respondents who view machine learning models as “black boxes” that experts could nonetheless explain.
afroCountries %>%
group_by(Q1,Q48) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q48)) %>%
filter(!is.na(Q1)) %>%
ggplot(aes(x = Q1, fill = Q48)) +
geom_bar(position = "fill") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
scale_fill_brewer(palette = "Set3") +
coord_flip() +
labs(title = "Do you consider ML as 'black boxes'?",
x = "", y = "", fill = "", caption = "Personal views") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
legend.position = "bottom",
axis.text = element_text(size = 12),
legend.text = element_text(size = 10)) +
guides(fill = guide_legend(ncol = 1))
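`geom_bar(position = "fill")` rescales every bar to a total of 1, so the chart shows shares within each gender but prints no numbers. To read off the exact proportions, the same shares can be tabulated directly. A minimal sketch on made-up answers (the labels and values are illustrative assumptions, not the survey's wording):

```r
library(dplyr)

# Hypothetical responses: two women and four men
toy <- tibble(
  Q1  = c("Female", "Female", "Male", "Male", "Male", "Male"),
  Q48 = c("Black box", "Can explain", "Black box",
          "Can explain", "Can explain", "Can explain")
)

shares <- toy %>%
  count(Q1, Q48, name = "Count") %>%
  group_by(Q1) %>%
  # Dividing by the per-gender total mirrors what position = "fill" draws
  mutate(share = Count / sum(Count)) %>%
  ungroup()

shares
```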
In African industry, the most common metrics are those that consider accuracy first, followed by revenue and/or business goals. Kenya and South Africa are the countries most inclined to use the latter. In Morocco, a large percentage answered that the question does not apply, because they are not involved with an organization that builds ML models.
afroCountries %>%
select(Q42_Part_1:Q42_Part_5,Q3) %>%
gather(1:5, key = "questions", value = "metrics")%>%
group_by(metrics,Q3) %>%
filter(!is.na(metrics)) %>%
filter(!is.na(Q3)) %>%
ggplot(aes(x = Q3, fill = metrics)) +
geom_bar(position = "fill") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 15)) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Metrics used to measure model success",
x = "", y = "", fill = "", caption = "Machine learning usage") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 11),
legend.text = element_text(size = 11),
legend.position = "bottom") +
guides(fill = guide_legend(ncol = 1))
Across continents, metrics that consider accuracy are preferred everywhere, with percentages above 30%. Metrics that consider revenue and business goals come second, at 24% to 29%.
newMultipleChoice %>%
select(Continent,Q42_Part_1:Q42_Part_5)%>%
gather(2:6, key = "questions", value = "Metrics")%>%
group_by(Continent,Metrics)%>%
filter(!is.na(Metrics))%>%
summarise(Count = length(Metrics))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Continent, y = reorder(Metrics,pct), fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)),
hjust = 0.5,vjust = 0.5, size = 4, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 20))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Metrics used by organizations",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
When coding, the most used data type is numerical data among students, statisticians, research scientists, principal investigators, others, the unemployed, marketing analysts, managers, developer advocates, data analysts, chief officers and business analysts. Data scientists, however, work more with tabular data.
afroCountries %>%
select(Q6,Q32) %>%
group_by(Q6,Q32) %>%
filter(!is.na(Q32)) %>%
summarise(Count = length(Q32)) %>%
mutate(pct = round(prop.table(Count)*100,2)) %>%
ggplot(aes(x = reorder(Q32,-Count), y = Q6, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "none",
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 12, angle = -90, hjust = 0, vjust = 0.5),
legend.text = element_text(size = 11)) +
labs(title = "Most used data types",
x = "Type of data", y = "", fill = "",
caption = "Data")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Dataset aggregators and platforms are the main sources of data, followed by Google Search and GitHub. By gender, men account for a larger share than women across all data sources.
afroCountries %>%
select(Q1,266:276) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
gather(2:12, key = "questions", value = "DataSource")%>%
group_by(Q1,DataSource) %>%
filter(!is.na(DataSource)) %>%
summarise(Count = length(DataSource))%>%
ggplot(aes(x = reorder(DataSource,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Sources used to get public datasets",
x = "", y = "Count", fill = "",
caption = "Data")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
A large part of the survey participants have never used cloud computing services, mainly students, statisticians, software engineers, research assistants, the unemployed, data scientists and data analysts. Those who do use them rely on Amazon Web Services, Google Cloud Platform or Microsoft Azure.
afroCountries %>%
select(Q6,Q15_Part_1:Q15_Part_7)%>%
gather(2:8, key = "questions", value = "Cloud_services")%>%
group_by(Q6,Cloud_services)%>%
filter(!is.na(Cloud_services))%>%
summarise(Count = length(Cloud_services))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Cloud_services,-Count), y = Q6, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 9),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Cloud computing services at work/school",
x = "", y = "", fill = "",
caption = "")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Here too, most respondents do not use cloud computing products, in particular students, data scientists and data analysts. Among those who do, Google Compute Engine is the most common, mainly among software engineers.
afroCountries %>%
select(Q6,Q27_Part_1:Q27_Part_20) %>%
gather(2:21, key = "questions", value = "cloud")%>%
group_by(Q6,cloud)%>%
filter(!is.na(cloud))%>%
filter(!is.na(Q6))%>%
summarise(Count = length(cloud))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(cloud,-Count), y = Q6, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.5, size = 2.5, color = "white") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Cloud computing products (past 5 years)",
x = "", y = "", fill = "",
caption = "Cloud computing products")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
This chart shows the proportion of men and women who use the different visualization libraries; as with coding experience, men account for the larger share. The most used library is Matplotlib, followed by ggplot2 and Seaborn.
afroCountries %>%
select(Q1,Q22)%>%
filter(Q1 == "Female" | Q1 == "Male") %>%
group_by(Q1,Q22)%>%
filter(!is.na(Q22))%>%
summarise(Count = length(Q22))%>%
ggplot(aes(x = reorder(Q22,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
legend.text = element_text(size = 11),
legend.position = "top") +
labs(title = "Most used visualization libraries",
x = "", y = "Count", fill = "",
caption = "Other tools")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
The choice of library also depends on the programming language used. Users of R, SQL, SAS/STATA and JavaScript use ggplot2, while users of Python and Java use Matplotlib.
afroCountries %>%
select(Q17,Q22)%>%
group_by(Q17,Q22)%>%
filter(!is.na(Q17)) %>%
filter(!is.na(Q22))%>%
summarise(Count = length(Q22))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = reorder(Q22,-Count), y = Q17, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)),
hjust = 0.5,vjust = 0.25, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x,width = 10)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.y = element_text(size = 9),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5),
legend.text = element_text(size = 11),
legend.position = "none") +
labs(title = "Most used visualization library",
x = "", y = "Most used programming language", fill = "",
caption = "Visualization libraries")
## `summarise()` has grouped output by 'Q17'. You can override using the `.groups`
## argument.
Again the proportions of men and women are shown, and men account for the larger share. The most used relational database products were MySQL, Microsoft SQL Server and PostgreSQL, while none of the respondents uses Ingres.
afroCountries %>%
select(Q1,196:223) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
gather(2:29, key = "questions", value = "RDB_Products") %>%
group_by(Q1,RDB_Products) %>%
filter(!is.na(RDB_Products))%>%
filter(!is.na(Q1))%>%
summarise(Count = length(RDB_Products))%>%
ggplot(aes(x = reorder(RDB_Products,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 8,
angle = -90, hjust = 0, vjust = 0.5),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Relational database products (past 5 years)",
x = "", y = "Count", fill = "",
caption = "Relational database")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
A large number of African respondents do not use any big data and analytics tools. The few who do chose Google BigQuery. Here again, the proportion of men is higher than that of women.
afroCountries %>%
select(Q1,225:249) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
gather(2:26, key = "questions", value = "BigData_Products") %>%
group_by(Q1,BigData_Products) %>%
filter(!is.na(BigData_Products)) %>%
filter(!is.na(Q1)) %>%
summarise(Count = length(BigData_Products))%>%
ggplot(aes(x = reorder(BigData_Products,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 9, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Big data and analytics tools (past 5 years)",
x = "", y = "Count", fill = "",
caption = "Big data and analytics tools")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
Across continents, big data and analytics tools are not very popular, especially in Africa and Asia, where the share of respondents who use none approaches half of the total.
newMultipleChoice %>%
select(Continent,225:249) %>%
gather(2:26, key = "questions", value = "BigData_Products") %>%
group_by(Continent,BigData_Products) %>%
filter(!is.na(BigData_Products)) %>%
summarise(Count = length(BigData_Products))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Continent, y = reorder(BigData_Products,pct), fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)),
hjust = 0.5,vjust = 0.5, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Big data and analytics tools (past 5 years)",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
African Kagglers rely on several channels to keep up with new trends in machine learning and data science. Here they are compared by gender: for men, the median percentage of machine learning and data science training is slightly higher for self-taught learning, online courses and work, while for women the median is higher for university learning.
## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion
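The coercion warning above is what `as.numeric()` emits when a text value cannot be parsed as a number. The conversion step itself is not shown in this report, so the following is only a guess at its shape: a minimal sketch, on made-up values, of converting a percentage column stored as text and then computing the per-gender medians that the boxplots display.

```r
library(dplyr)

# Hypothetical raw column: percentages read in as text, with one unparseable entry
toy <- tibble(
  Q1         = c("Female", "Female", "Male", "Male", "Male"),
  Q35_Part_1 = c("20", "40", "50", "n/a", "70")
)

medians <- toy %>%
  # as.numeric() turns "n/a" into NA and would emit the coercion warning;
  # suppressWarnings() is used here only to keep the sketch quiet
  mutate(Q35_Part_1 = suppressWarnings(as.numeric(Q35_Part_1))) %>%
  group_by(Q1) %>%
  # na.rm = TRUE drops the values that failed to convert
  summarise(median_pct = median(Q35_Part_1, na.rm = TRUE))

medians
```

`geom_boxplot()` likewise drops the `NA` values before drawing, so the medians in the plots are computed over the answers that converted cleanly.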
p1 <- multipleChoice18 %>%
select(Q1,Q35_Part_1) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "",y = Q35_Part_1, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "Self-taught", x = "", y = "%")
p2 <- multipleChoice18 %>%
select(Q1,Q35_Part_2) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "", y = Q35_Part_2, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "Online courses", x = "", y = "%")
p3 <- multipleChoice18 %>%
select(Q1,Q35_Part_3) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "",y = Q35_Part_3, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "Work", x = "", y = "%")
p4 <- multipleChoice18 %>%
select(Q1,Q35_Part_4) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "", y = Q35_Part_4, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "University", x = "", y = "%")
p5 <- multipleChoice18 %>%
select(Q1,Q35_Part_5) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "",y = Q35_Part_5, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "Kaggle competitions", x = "", y = "%")
p6 <- multipleChoice18 %>%
select(Q1,Q35_Part_6) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
ggplot(aes(x = "", y = Q35_Part_6, fill = Q1)) +
geom_boxplot() +
scale_fill_brewer(palette = "Paired") +
theme(plot.title = element_text(size = 13, hjust = 0.5),
legend.text = element_text(size = 9),
legend.title = element_blank()) +
labs(title = "Other", x = "", y = "%")
grid.arrange(p1,p2, p3, p4, p5, p6, ncol = 3)
## Warning: Removed 7916 rows containing non-finite values (`stat_boxplot()`).
## Removed 7916 rows containing non-finite values (`stat_boxplot()`).
## Removed 7916 rows containing non-finite values (`stat_boxplot()`).
## Removed 7916 rows containing non-finite values (`stat_boxplot()`).
## Removed 7916 rows containing non-finite values (`stat_boxplot()`).
## Removed 7916 rows containing non-finite values (`stat_boxplot()`).
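The six panels above differ only in the Q35 column and the title, so they could presumably be drawn from a single long-format table with `facet_wrap()`. The sketch below is one way to do that, not the kernel's own method. It assumes the Q35 columns are already numeric and runs on a hypothetical two-column miniature (`toy18`); the `panel_labels` lookup is likewise an illustrative name.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical miniature of multipleChoice18: gender (Q1) plus two of the
# six Q35 columns, already converted to numeric percentages.
toy18 <- data.frame(
  Q1         = c("Female", "Male", "Male", "Female", "Prefer not to say"),
  Q35_Part_1 = c(20, 40, 50, 10, 30),
  Q35_Part_2 = c(30, 20, 25, 40, 30),
  stringsAsFactors = FALSE
)

# Panel titles keyed by column name (extend to all six parts as needed).
panel_labels <- c(Q35_Part_1 = "Self-taught", Q35_Part_2 = "Online courses")

long18 <- toy18 %>%
  filter(Q1 %in% c("Female", "Male")) %>%
  # One row per (respondent, learning source) pair.
  pivot_longer(starts_with("Q35_Part_"),
               names_to = "source", values_to = "pct") %>%
  # Replace column names with readable titles, keeping the column order.
  mutate(source = factor(source,
                         levels = names(panel_labels),
                         labels = unname(panel_labels)))

p <- ggplot(long18, aes(x = "", y = pct, fill = Q1)) +
  geom_boxplot(na.rm = TRUE) +          # drop missing values silently
  facet_wrap(~ source, ncol = 3) +      # one panel per learning source
  scale_fill_brewer(palette = "Paired") +
  theme(legend.title = element_blank()) +
  labs(x = "", y = "%")

# print(p) renders the panels with a single shared legend.
```

Compared with `grid.arrange()`, facets share one legend and one y scale, which makes the medians easier to compare across panels.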
Respondents in African countries mostly use Coursera for their data science courses, followed by DataCamp and Udacity.
afroCountries %>%
select(Q3,Q36_Part_1:Q36_Part_13)%>%
gather(2:14, key = "questions", value = "OnlinePlatform")%>%
group_by(Q3,OnlinePlatform)%>%
filter(!is.na(OnlinePlatform))%>%
summarise(Count = length(OnlinePlatform))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = OnlinePlatform, y = pct, group = Q3)) +
geom_point(aes(color = Q3), size = 2) + geom_line(aes(color = Q3), linewidth = 0.5) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
labs(title = "Platform for data science courses",
x = "", y = "%", fill = "",
caption = "")
## `summarise()` has grouped output by 'Q3'. You can override using the `.groups`
## argument.
As for the split between women and men in those countries, both still prefer Coursera, with a higher percentage among men. The same holds for Kaggle Learn and edX.
afroCountries %>%
select(Q1,Q36_Part_1:Q36_Part_13)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
gather(2:14, key = "questions", value = "OnlinePlatform")%>%
group_by(Q1,OnlinePlatform)%>%
filter(!is.na(OnlinePlatform))%>%
summarise(Count = length(OnlinePlatform))%>%
ggplot(aes(x = reorder(OnlinePlatform,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Online platform used for learning",
x = "", y = "Count", fill = "",
caption = "Online learning")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
For both men and women (the proportion of men again being larger), the Kaggle forums are the favorite media source for data science, followed by Twitter and the Siraj Raval YouTube Channel.
afroCountries %>%
select(Q1,Q38_Part_1:Q38_Part_22)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
# Gather all 22 Q38 parts (columns 2 to 23 after select), not just the first 13
gather(2:23, key = "questions", value = "OnlinePlatform")%>%
group_by(Q1,OnlinePlatform)%>%
filter(!is.na(OnlinePlatform))%>%
summarise(Count = length(OnlinePlatform))%>%
ggplot(aes(x = reorder(OnlinePlatform,-Count), y = Count, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 20))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Favorite media sources for data science",
x = "", y = "Count", fill = "",
caption = "Online learning")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
Most respondents, chiefly students and data scientists, consider online learning more effective than traditional institutions.
afroCountries %>%
group_by(Q6,Q39_Part_1) %>%
filter(!is.na(Q39_Part_1)) %>%
summarise(Count = length(Q39_Part_1)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = reorder(Q39_Part_1,-Count), y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "Online learning vs. Traditional institution",
x = "", y = "",
caption = "")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
The same holds for in-person bootcamps.
afroCountries %>%
group_by(Q6,Q39_Part_2) %>%
filter(!is.na(Q39_Part_2)) %>%
summarise(Count = length(Q39_Part_2)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = reorder(Q39_Part_2,-pct), y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = as.character(Count)), color = "white", size = 3.5) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 11),
axis.text.y = element_text(size = 9),
legend.text = element_text(size = 11)) +
labs(title = "In-person bootcamp vs. Traditional institution",
x = "", y = "",
caption = "")
## `summarise()` has grouped output by 'Q6'. You can override using the `.groups`
## argument.
Most respondents think that independent projects are much more important than academic achievements. For this most-chosen answer, the proportion of women was higher than that of men.
afroCountries %>%
select(Q1,Q40) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
group_by(Q1,Q40) %>%
filter(!is.na(Q40)) %>%
filter(!is.na(Q1)) %>%
summarise(Count = length(Q40))%>%
mutate(pct = prop.table(Count)*100) %>%
ggplot(aes(x = reorder(Q40,-pct), y = pct, fill = Q1)) +
geom_col() +
scale_fill_brewer(palette = "Paired") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "top",
legend.text = element_text(size = 11)) +
labs(title = "Independent projects vs. Academic achievements",
x = "", y = "%", fill = "",
caption = "Expertise in data science")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
45.49% of African respondents consider independent projects much more important than academic achievements.
newMultipleChoice %>%
group_by(Continent, Q40)%>%
filter(!is.na(Q40))%>%
summarise(Count = length(Q40))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Continent, y = reorder(Q40,pct), fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)),
hjust = 0.5,vjust = 0.25, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 30))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Expertise in data science",
subtitle = "Independent projects vs. academic achievements",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
afroCountries %>%
group_by(Q1,Q9)%>%
filter(Q1 == "Female"|Q1 == "Male")%>%
filter(!is.na(Q9))%>%
summarise(Count = length(Q9))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Q9, y = pct, group = Q1)) +
geom_point(aes(color = Q1), size = 2) + geom_line(aes(color = Q1), linewidth = 1) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
theme(plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, vjust = 0.5, hjust = 0),
axis.text.y = element_text(size = 12),
legend.position = "top",
legend.title = element_blank(),
legend.text = element_text(size = 11)) +
scale_color_brewer(palette = "Paired") +
labs(title = "Yearly compensation",
x = "$", y = "%", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q1'. You can override using the `.groups`
## argument.
Among respondents willing to share their income, the typical bracket was about US$ 0-10,000, followed by about US$ 10,000-20,000, across all countries and for both women and men.
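One caveat when reading these compensation plots: if Q9 is stored as text, the brackets are placed on the axis in alphabetical order, so a label such as "100-125,000" would sort before "20-30,000". A minimal sketch of one possible fix is to turn the labels into a factor ordered by each bracket's lower bound. The sample labels and the `lower_bound()` helper below are assumptions for illustration; the exact label format in the survey file should be checked before applying this.

```r
# Hypothetical Q9-style bracket labels, in arbitrary order.
brackets <- c("100-125,000", "0-10,000", "20-30,000", "10-20,000", "0-10,000")

# Lower bound of a bracket: strip thousands separators, then keep the digits
# that come before the first "-" or "+".
lower_bound <- function(x) {
  as.numeric(sub("[-+].*$", "", gsub(",", "", x)))
}

# Factor whose levels follow the numeric order of the lower bounds.
pay <- factor(brackets,
              levels = unique(brackets[order(lower_bound(brackets))]))

levels(pay)
```

Inside a pipeline, the same expression could be applied with `mutate()` before the data reaches `ggplot()`, so the axis follows the numeric order of the brackets.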
afroCountries %>%
group_by(Q1,Q9,Q3) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q9)) %>%
summarise(Count = length(Q9)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q3, y = Q9, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)), color = "white", size = 2) +
facet_grid(Q1~.) +
coord_flip() +
scale_y_discrete(labels = function(x) str_wrap(x, width = 35)) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 11),
legend.text = element_text(size = 11)) +
labs(title = "Yearly compensation",
x = "", y = "$", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q1', 'Q9'. You can override using the
## `.groups` argument.
The best-paying African countries are Kenya and South Africa, with men reporting more than US$ 300,000.
afroCountries %>%
group_by(Q1,Q9,Q4) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q4)) %>%
filter(!is.na(Q9)) %>%
summarise(Count = length(Q9)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q4, y = Q9, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)), color = "white", size = 2) +
facet_grid(Q1~.) +
coord_flip() +
scale_y_discrete(labels = function(x) str_wrap(x, width = 35)) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 11),
legend.text = element_text(size = 11)) +
labs(title = "Yearly compensation by degree",
x = "", y = "$", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q1', 'Q9'. You can override using the
## `.groups` argument.
Regardless of degree, most respondents of both sexes earn less than US$ 10,000 a year. At first glance, men with doctorates are paid much more than women.
afroCountries %>%
group_by(Q1,Q9,Q6) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q6)) %>%
filter(!is.na(Q9)) %>%
summarise(Count = length(Q9)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q9, y = Q6, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)), color = "white", size = 2) +
facet_grid(Q1~.) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
scale_y_discrete(labels = function(x) str_wrap(x, width = 30)) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90,
hjust = 0, vjust = 0.5),
axis.text.y = element_text(size = 7),
legend.text = element_text(size = 11)) +
labs(title = "Yearly compensation vs. current role",
x = "$", y = "", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q1', 'Q9'. You can override using the
## `.groups` argument.
To earn more, you should be a data scientist, statistician or data engineer.
afroCountries %>%
group_by(Q1,Q9,Q8) %>%
filter(Q1 == "Female" | Q1 == "Male") %>%
filter(!is.na(Q8)) %>%
filter(!is.na(Q9)) %>%
summarise(Count = length(Q9)) %>%
mutate(pct = round(prop.table(Count)*100,2))%>%
ggplot(aes(x = Q9, y = Q8, fill = pct)) +
geom_tile(size = 0.5, show.legend = TRUE) +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)), color = "white", size = 2) +
facet_grid(Q1~.) +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
scale_y_discrete(labels = function(x) str_wrap(x, width = 10)) +
theme(legend.position = "none",
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(size = 11, angle = -90, hjust = 0),
axis.text.y = element_text(size = 11),
legend.text = element_text(size = 11)) +
labs(title = "Yearly compensation by gender and experience in current role",
x = "$", y = "Years of experience", fill = "",
caption = "About us")
## `summarise()` has grouped output by 'Q1', 'Q9'. You can override using the
## `.groups` argument.
Most African respondents earn less than $10,000. North America and Oceania have the highest percentages of people earning more than $100,000 a year. In Europe, most fall between $0 and $60,000.
newMultipleChoice %>%
group_by(Continent, Q9)%>%
filter(!is.na(Q9))%>%
summarise(Count = length(Q9))%>%
mutate(pct = prop.table(Count)*100)%>%
ggplot(aes(x = Continent, y = Q9, fill = pct)) +
geom_tile(stat = "identity") +
scale_fill_gradient(low = "salmon1", high = "blue") +
geom_text(aes(label = sprintf("%.2f%%", pct)),
hjust = 0.5,vjust = 0.25, size = 3, color = "white") +
scale_x_discrete(labels = function(x) str_wrap(x, width = 5))+
scale_y_discrete(labels = function(x) str_wrap(x, width = 30))+
theme(plot.title = element_text(size = 15, hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(size = 11),
axis.text.y = element_text(size = 11),
legend.position = "none",
legend.text = element_text(size = 11)) +
labs(title = "Yearly compensation",
x = "", y = "", fill = "",
caption = "Africa and the world")
## `summarise()` has grouped output by 'Continent'. You can override using the
## `.groups` argument.
Most African respondents are paid less than US$ 10,000.
North America and Oceania have the largest share of people paid more
than US$ 100,000. In Europe, most fall between US$ 0 and
US$ 60,000.
African respondents come from Egypt, Kenya, Morocco, Nigeria, South Africa and Tunisia.
African countries are among those with the highest ratio of women to men.
Most respondents are between 22 and 29 years old.
The proportion of women has fallen compared with last year.
Female respondents are more highly educated than their male counterparts.
Men are paid much more than women with the same level of experience.
Data science is still in its early stages.
South Africa and Kenya have the best-paid jobs, and data science is one of the most lucrative occupations.
Overall, respondents have few years of coding experience but are willing to learn. Respondents from South Africa have the most programming experience, and most respondents from Nigeria have less than one year of experience.
Coursera is the most popular platform.
Python (by far), R and SQL are the most widely used programming languages.
Most African respondents consider themselves data scientists.