Introduction

In this project, I work with data from freeCodeCamp’s 2017 New Coder Survey to identify the best markets in which an online learning company should advertise its courses.

New coders with varying interests took part in the survey, which makes the dataset well suited to this analysis.

Understanding the data

First, I import the data and take an initial look at its structure.

library(readr)
library(tibble)
library(DataExplorer)

fcc <- read_csv("2017-fCC-New-Coders-Survey-Data.csv")
fcc %>% glimpse()
## Rows: 18,175
## Columns: 136
## $ Age                           <dbl> 27, 34, 21, 26, 20, 28, 29, 29, 23, 24, ~
## $ AttendedBootcamp              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BootcampFinish                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ BootcampLoanYesNo             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ BootcampName                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ BootcampRecommend             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ChildrenNumber                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CityPopulation                <chr> "more than 1 million", "less than 100,00~
## $ CodeEventConferences          <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, 1, NA~
## $ CodeEventDjangoGirls          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventFCC                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventGameJam              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventGirlDev              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventHackathons           <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, NA~
## $ CodeEventMeetup               <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N~
## $ CodeEventNodeSchool           <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N~
## $ CodeEventNone                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventOther                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventRailsBridge          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventRailsGirls           <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N~
## $ CodeEventStartUpWknd          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventWkdBootcamps         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventWomenCode            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ CodeEventWorkshops            <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, N~
## $ CommuteTime                   <chr> "15 to 29 minutes", NA, "15 to 29 minute~
## $ CountryCitizen                <chr> "Canada", "United States of America", "U~
## $ CountryLive                   <chr> "Canada", "United States of America", "U~
## $ EmploymentField               <chr> "software development and IT", NA, "soft~
## $ EmploymentFieldOther          <chr> NA, NA, NA, NA, NA, NA, "Market research~
## $ EmploymentStatus              <chr> "Employed for wages", "Not working but l~
## $ EmploymentStatusOther         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ExpectedEarning               <dbl> NA, 35000, 70000, 40000, 140000, NA, 300~
## $ FinanciallySupporting         <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, N~
## $ FirstDevJob                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, N~
## $ Gender                        <chr> "female", "male", "male", "male", "femal~
## $ GenderOther                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ HasChildren                   <dbl> NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, N~
## $ HasDebt                       <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, N~
## $ HasFinancialDependents        <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, N~
## $ HasHighSpdInternet            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1~
## $ HasHomeMortgage               <dbl> 0, 0, NA, 1, NA, NA, 1, NA, NA, NA, 0, N~
## $ HasServedInMilitary           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ HasStudentDebt                <dbl> 0, 1, NA, 0, NA, NA, 1, NA, NA, NA, 1, N~
## $ HomeMortgageOwe               <dbl> NA, NA, NA, 40000, NA, NA, 120000, NA, N~
## $ HoursLearning                 <dbl> 15, 10, 25, 14, 10, 12, 16, 15, 5, 2, 15~
## $ ID.x                          <chr> "02d9465b21e8bd09374b0066fb2d5614", "5bf~
## $ ID.y                          <chr> "eb78c1c3ac6cd9052aec557065070fbf", "21d~
## $ Income                        <dbl> NA, NA, 13000, 24000, NA, NA, 40000, NA,~
## $ IsEthnicMinority              <dbl> NA, 0, 1, 0, 0, 0, NA, 1, 0, 0, 0, 1, 0,~
## $ IsReceiveDisabilitiesBenefits <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
## $ IsSoftwareDev                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0~
## $ IsUnderEmployed               <dbl> 0, NA, 0, 1, NA, NA, 0, NA, 0, NA, 1, NA~
## $ JobApplyWhen                  <chr> NA, "Within 7 to 12 months", "Within 7 t~
## $ JobInterestBackEnd            <dbl> NA, NA, 1, 1, 1, NA, NA, NA, NA, 1, NA, ~
## $ JobInterestDataEngr           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ JobInterestDataSci            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ JobInterestDevOps             <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N~
## $ JobInterestFrontEnd           <dbl> NA, NA, 1, 1, 1, NA, NA, NA, NA, 1, NA, ~
## $ JobInterestFullStack          <dbl> NA, 1, 1, 1, 1, NA, 1, NA, NA, 1, NA, NA~
## $ JobInterestGameDev            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ JobInterestInfoSec            <dbl> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, N~
## $ JobInterestMobile             <dbl> NA, NA, 1, NA, 1, NA, NA, NA, NA, NA, NA~
## $ JobInterestOther              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ JobInterestProjMngr           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ JobInterestQAEngr             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ JobInterestUX                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ JobPref                       <chr> "start your own business", "work for a n~
## $ JobRelocateYesNo              <dbl> NA, 1, 1, NA, 1, NA, NA, NA, NA, 1, NA, ~
## $ JobRoleInterest               <chr> NA, "Full-Stack Web Developer", "Front-E~
## $ JobWherePref                  <chr> NA, "in an office with other developers"~
## $ LanguageAtHome                <chr> "English", "English", "Spanish", "Portug~
## $ MaritalStatus                 <chr> "married or domestic partnership", "sing~
## $ MoneyForLearning              <dbl> 150, 80, 1000, 0, 0, 200, 0, 0, 700, 100~
## $ MonthsProgramming             <dbl> 6, 6, 5, 5, 24, 12, 12, 4, 29, 18, 5, 1,~
## $ NetworkID                     <chr> "6f1fbc6b2b", "f8f8be6910", "2ed189768e"~
## $ Part1EndTime                  <dttm> 2017-03-09 00:36:22, 2017-03-09 00:37:0~
## $ Part1StartTime                <dttm> 2017-03-09 00:32:59, 2017-03-09 00:33:2~
## $ Part2EndTime                  <dttm> 2017-03-09 00:59:46, 2017-03-09 00:38:5~
## $ Part2StartTime                <dttm> 2017-03-09 00:36:26, 2017-03-09 00:37:1~
## $ PodcastChangeLog              <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N~
## $ PodcastCodeNewbie             <dbl> NA, 1, NA, NA, NA, NA, NA, 1, NA, 1, 1, ~
## $ PodcastCodePen                <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 1~
## $ PodcastDevTea                 <dbl> 1, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA~
## $ PodcastDotNET                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastGiantRobots            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastJSAir                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ PodcastJSJabber               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ PodcastNone                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastOther                  <chr> NA, NA, "Codenewbie", NA, NA, NA, NA, NA~
## $ PodcastProgThrowdown          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastRubyRogues             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastSEDaily                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastSERadio                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastShopTalk               <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, N~
## $ PodcastTalkPython             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ PodcastTheWebAhead            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceCodecademy            <dbl> 1, 1, 1, NA, NA, NA, 1, NA, NA, NA, 1, N~
## $ ResourceCodeWars              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceCoursera              <dbl> NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, N~
## $ ResourceCSS                   <dbl> NA, 1, 1, NA, NA, NA, NA, NA, NA, 1, NA,~
## $ ResourceEdX                   <dbl> NA, NA, NA, NA, NA, 1, NA, 1, NA, NA, NA~
## $ ResourceEgghead               <dbl> NA, NA, NA, 1, NA, NA, NA, NA, 1, NA, NA~
## $ ResourceFCC                   <dbl> 1, 1, 1, 1, NA, NA, 1, 1, NA, 1, 1, 1, 1~
## $ ResourceHackerRank            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceKA                    <dbl> NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1,~
## $ ResourceLynda                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceMDN                   <dbl> 1, NA, 1, 1, NA, NA, NA, NA, 1, NA, NA, ~
## $ ResourceOdinProj              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceOther                 <chr> NA, NA, NA, NA, NA, NA, "Sololearn", NA,~
## $ ResourcePluralSight           <dbl> NA, NA, NA, NA, NA, 1, 1, NA, NA, 1, NA,~
## $ ResourceSkillcrush            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ResourceSO                    <dbl> NA, 1, NA, 1, 1, NA, 1, NA, 1, 1, NA, NA~
## $ ResourceTreehouse             <dbl> NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, N~
## $ ResourceUdacity               <dbl> NA, NA, 1, NA, NA, NA, NA, 1, NA, 1, NA,~
## $ ResourceUdemy                 <dbl> 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, 1, ~
## $ ResourceW3S                   <dbl> 1, 1, NA, NA, NA, NA, 1, NA, NA, NA, 1, ~
## $ SchoolDegree                  <chr> "some college credit, no degree", "some ~
## $ SchoolMajor                   <chr> NA, NA, NA, NA, "Information Technology"~
## $ StudentDebtOwe                <dbl> NA, NA, NA, NA, NA, NA, 8000, NA, NA, NA~
## $ YouTubeCodeCourse             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeCodingTrain            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ YouTubeCodingTut360           <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, 1,~
## $ YouTubeComputerphile          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeDerekBanas             <dbl> NA, NA, 1, NA, NA, NA, NA, 1, NA, 1, NA,~
## $ YouTubeDevTips                <dbl> NA, NA, 1, 1, NA, NA, NA, 1, NA, 1, 1, 1~
## $ YouTubeEngineeredTruth        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeFCC                    <dbl> NA, 1, NA, 1, NA, NA, NA, 1, NA, 1, 1, 1~
## $ YouTubeFunFunFunction         <dbl> NA, NA, NA, 1, NA, NA, NA, 1, NA, 1, NA,~
## $ YouTubeGoogleDev              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeLearnCode              <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, 1, NA~
## $ YouTubeLevelUpTuts            <dbl> NA, NA, 1, 1, NA, NA, NA, NA, NA, 1, NA,~
## $ YouTubeMIT                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, N~
## $ YouTubeMozillaHacks           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeOther                  <chr> NA, NA, NA, NA, NA, "CodingEntrepreneurs~
## $ YouTubeSimplilearn            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ YouTubeTheNewBoston           <dbl> NA, NA, NA, NA, NA, 1, NA, 1, NA, 1, NA,~

First, I want to know which programming niches new coders most want to work in. As freeCodeCamp is an e-learning platform that provides web development courses, I expect most respondents to be interested in this area.

#split-and-combine workflow
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fcc %>%
  group_by(JobRoleInterest) %>%
  summarise(freq = n()*100/nrow(fcc)) %>%
  arrange(desc(freq))
## # A tibble: 3,212 x 2
##    JobRoleInterest                                       freq
##    <chr>                                                <dbl>
##  1 <NA>                                                61.5  
##  2 Full-Stack Web Developer                             4.53 
##  3 Front-End Web Developer                              2.48 
##  4 Data Scientist                                       0.836
##  5 Back-End Web Developer                               0.781
##  6 Mobile Developer                                     0.644
##  7 Game Developer                                       0.627
##  8 Information Security                                 0.506
##  9 Full-Stack Web Developer,   Front-End Web Developer  0.352
## 10 Front-End Web Developer, Full-Stack Web Developer    0.308
## # ... with 3,202 more rows

It looks like most respondents who answered are primarily interested in web (full-stack, front-end, back-end), game, and mobile development. Several respondents also list multiple interests, often within the same category (for example, both full-stack and front-end web development).
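
Because answers such as rows 9 and 10 above contain the same roles in a different order (and with different spacing), counting whole strings splits one interest across many rows. A minimal sketch of counting individual roles instead, using base R and a few made-up answers (the `answers` vector is hypothetical, not survey data):

```r
# Hypothetical answers; the first two list the same roles in a different order
answers <- c(
  "Full-Stack Web Developer,   Front-End Web Developer",
  "Front-End Web Developer, Full-Stack Web Developer",
  "Data Scientist"
)

# Split on commas, flatten to one role per element, and trim stray spaces
roles <- trimws(unlist(strsplit(answers, ",", fixed = TRUE)))

# Count how often each individual role is mentioned
role_counts <- sort(table(roles), decreasing = TRUE)
print(role_counts)
```

Counting this way treats every mention equally, so a respondent with several interests contributes to several roles.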

It would be interesting to check how many users have mentioned only one specific interest vs. multiple interests.

# Split each string in the 'JobRoleInterest' column
splitted_interests <- fcc %>%
  select(JobRoleInterest) %>%
  tidyr::drop_na() %>%
  rowwise() %>% # mutate() works on whole columns by default; rowwise() makes it operate one row at a time
  mutate(opts = length(stringr::str_split(JobRoleInterest, ",")[[1]]))

# Frequency table for the var describing the number of options
n_of_options <- splitted_interests %>%
  ungroup() %>%  # needed because rowwise() was applied above
  group_by(opts) %>%
  summarize(freq = n()*100/nrow(splitted_interests))

n_of_options
## # A tibble: 13 x 2
##     opts    freq
##    <int>   <dbl>
##  1     1 31.7   
##  2     2 10.9   
##  3     3 15.9   
##  4     4 15.2   
##  5     5 12.0   
##  6     6  6.72  
##  7     7  3.86  
##  8     8  1.76  
##  9     9  0.987 
## 10    10  0.472 
## 11    11  0.186 
## 12    12  0.300 
## 13    13  0.0286

As it turns out, only 31.65% of the respondents who answered this question named a single programming niche, while the majority listed multiple interests.
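
The same count can be obtained without `rowwise()`. The following is a small base R sketch on invented answers (the `interests` vector is hypothetical) showing how the number of options per respondent and its percentage table are derived:

```r
# Hypothetical answers: three respondents, one of whom did not answer
interests <- c(
  "Full-Stack Web Developer",
  "Front-End Web Developer, Mobile Developer, Data Scientist",
  NA
)

# Keep only the answered rows, as drop_na() does in the analysis
answered <- interests[!is.na(interests)]

# Split each answer on commas and count the pieces per respondent
n_options <- lengths(strsplit(answered, ",", fixed = TRUE))

# Percentage of respondents for each number of options
pct <- prop.table(table(n_options)) * 100
print(pct)
```

`lengths()` is vectorised over the list returned by `strsplit()`, so no row-by-row grouping is needed.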

Next, I look at how many respondents are interested in web or mobile development.

# Frequency table (table() drops respondents with a missing JobRoleInterest)
web_or_mobile <- stringr::str_detect(fcc$JobRoleInterest, "Web Developer|Mobile Developer")
freq_table <- table(web_or_mobile)
freq_table <- freq_table * 100 / sum(freq_table)

# Graph for the frequency table above
df <- tibble::tibble(x = c("Other Subjects", "Web or Mobile Development"),
                     y = as.numeric(freq_table))
library(ggplot2)
ggplot(data = df, aes(x = x, y = y)) +
  geom_col(color = "blue", fill = "light blue") +
  ggtitle("Frequency Distribution of Interests") +
  xlab("Subject") +
  ylab("Frequency (%)")

It turns out that most people in this survey (roughly 86%) are interested in either web or mobile development.
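
This percentage is computed only over respondents who answered the question, because missing answers are dropped before the shares are taken. A minimal base R sketch on invented data (the `roles` vector is hypothetical) makes that explicit:

```r
# Hypothetical answers, including one respondent who skipped the question
roles <- c(
  "Full-Stack Web Developer",
  "Data Scientist",
  "Mobile Developer, Game Developer",
  NA
)

# grepl() returns FALSE for NA input, so missing answers are masked explicitly
# to mirror the NA that stringr::str_detect() returns for them
web_or_mobile <- ifelse(is.na(roles), NA,
                        grepl("Web Developer|Mobile Developer", roles))

# table() ignores NA by default, so shares are relative to answered rows only
pct <- prop.table(table(web_or_mobile)) * 100
print(pct)
```

Here two of the three respondents who answered match the pattern; the fourth respondent does not enter the denominator at all.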

To decide which markets are worth advertising in, I need to know where these new coders are located, which countries have the most of them, and how much money they are willing to spend on learning.

Location & Densities of New Coders

Next, I will check where the new coders are located and which countries have the highest number of coders interested in web and mobile development.

The data set provides information about the location of each participant at the country level. The CountryCitizen variable describes each participant’s country of origin, and the CountryLive variable describes the country each participant lives in (which may differ from the country of origin).

For the purpose of advertising, it is more relevant to work with the CountryLive variable because it is where people actually live at the moment when the ads are run.

# Isolate the participants that answered what role they'd be interested in
fcc_good <- fcc %>%
  tidyr::drop_na(JobRoleInterest) 

# Frequency tables with absolute and relative frequencies
fcc_good %>%
group_by(CountryLive) %>%
summarise(`Absolute frequency` = n(),
          `Percentage` = n() * 100 /  nrow(fcc_good) ) %>%
  arrange(desc(Percentage))
## # A tibble: 138 x 3
##    CountryLive              `Absolute frequency` Percentage
##    <chr>                                   <int>      <dbl>
##  1 United States of America                 3125      44.7 
##  2 India                                     528       7.55
##  3 United Kingdom                            315       4.51
##  4 Canada                                    260       3.72
##  5 <NA>                                      154       2.20
##  6 Poland                                    131       1.87
##  7 Brazil                                    129       1.84
##  8 Germany                                   125       1.79
##  9 Australia                                 112       1.60
## 10 Russia                                    102       1.46
## # ... with 128 more rows

About 44.69% of potential customers are located in the US. India has the second-highest share, but at just 7.55% it is not far ahead of the United Kingdom (4.51%) or Canada (3.72%).

Money Spent in Learning

The MoneyForLearning column describes in American dollars the amount of money spent by participants from the moment they started coding until the moment they completed the survey. This number can be divided by the months spent in learning to arrive at a cost per month value.

I will narrow down the analysis to four countries: the US, India, the United Kingdom, and Canada. They have the highest counts in the frequency table, and English is widely used in all of them, so more users will be able to consume the content easily.

I will now create a new column that describes the amount of money a student has spent per month so far.

# Replace 0s with 1s to avoid division by 0
fcc_good <- fcc_good %>%
  mutate(MonthsProgramming = replace(MonthsProgramming,  MonthsProgramming == 0, 1) )

# New column for the amount of money each student spends each month
fcc_good <- fcc_good %>%
  mutate(money_per_month = MoneyForLearning/MonthsProgramming) 

fcc_good %>%
  summarise(na_count = sum(is.na(money_per_month)) ) %>%
  pull(na_count)
## [1] 675

I will delete the 675 rows with NA values.

# Keep only the rows with non-NAs in the `money_per_month` column 
fcc_good  <-  fcc_good %>% tidyr::drop_na(money_per_month)
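
To make the per-month calculation concrete, here is a small base R sketch on invented values (the `survey` data frame is hypothetical; only the column names come from the survey). It shows the zero-month replacement, the division, and how a missing input turns into an NA that is then dropped:

```r
# Hypothetical respondents; column names follow the survey data
survey <- data.frame(
  MoneyForLearning  = c(1000, 200, NA, 0),
  MonthsProgramming = c(5, 0, 12, 3)
)

# Treat "0 months" as 1 month to avoid division by zero
months <- replace(survey$MonthsProgramming, survey$MonthsProgramming == 0, 1)

# Money spent per month of programming
survey$money_per_month <- survey$MoneyForLearning / months

# A missing input propagates to the result and is counted here
na_count <- sum(is.na(survey$money_per_month))

# Keep only rows where the per-month value could be computed
complete <- survey[!is.na(survey$money_per_month), ]
print(complete)
```

The second respondent ends up with 200 per month because their zero months were replaced by one.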

Next, I will group the data by country and measure the average amount of money that students spend per month in each country. After removing the rows with NA values in the CountryLive column, I will check whether there is still enough data for the four countries of interest.

# Remove the rows with NA values in 'CountryLive'
fcc_good  <-  fcc_good %>% tidyr::drop_na(CountryLive)

# Frequency table to check if there is enough data
fcc_good %>% group_by(CountryLive) %>%
  summarise(freq = n() ) %>%
  arrange(desc(freq)) %>%
  head()
## # A tibble: 6 x 2
##   CountryLive               freq
##   <chr>                    <int>
## 1 United States of America  2933
## 2 India                      463
## 3 United Kingdom             279
## 4 Canada                     240
## 5 Poland                     122
## 6 Germany                    114

Next, I compute the average amount a student spends per month in each country.

# Mean sum of money spent by students each month
countries_mean  <-  fcc_good %>% 
  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
  group_by(CountryLive) %>%
  summarize(mean = mean(money_per_month)) %>%
  arrange(desc(mean))

countries_mean
## # A tibble: 4 x 2
##   CountryLive               mean
##   <chr>                    <dbl>
## 1 United States of America 228. 
## 2 India                    135. 
## 3 Canada                   114. 
## 4 United Kingdom            45.5
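
The chain of `|` comparisons in the filter can be written more compactly with `%in%`. A minimal base R sketch of the same filter-then-average step, on invented spending values (the `spend` data frame and `targets` vector are hypothetical):

```r
# Hypothetical spending data
spend <- data.frame(
  CountryLive     = c("India", "India", "Canada", "Poland", "Canada"),
  money_per_month = c(10, 30, 50, 99, 150)
)

# Countries of interest; %in% replaces a chain of `==` joined by `|`
targets <- c("India", "Canada")
subset_spend <- spend[spend$CountryLive %in% targets, ]

# Mean per country, sorted from highest to lowest
means <- aggregate(money_per_month ~ CountryLive, data = subset_spend, FUN = mean)
means <- means[order(-means$money_per_month), ]
print(means)
```

Keeping the country list in one vector also means it only has to be changed in one place if the set of target markets changes.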

Now is a good time to check for outliers, in case a few extreme spending values are distorting the means.

# Isolate only the countries of interest
only_4  <-  fcc_good %>% 
  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom' | CountryLive == 'Canada')

# Add a row index so that outlier rows can be matched and removed later
only_4 <- only_4 %>%
  mutate(index = row_number())

# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
  geom_boxplot() +
  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
  xlab("Country") +
  ylab("Money per month (US dollars)") +
  theme_bw()

The box plots show that two people report spending more than $50,000 per month. This seems extremely unlikely, so I will remove every value of $20,000 per month or more.

# Keep only those participants who spend less than 20,000 per month
fcc_good  <- fcc_good %>% 
  filter(money_per_month < 20000)
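
To illustrate why a couple of extreme values matter so much for a mean, here is a small base R sketch with invented numbers (the vector is hypothetical, not taken from the survey). It compares the mean before and after applying the same 20,000 cutoff, with the median shown for reference:

```r
# Hypothetical monthly spending with one implausible value
money_per_month <- c(0, 20, 50, 100, 60000)

# The mean is pulled far upwards by the single extreme value
mean_all <- mean(money_per_month)

# The median is barely affected by it
median_all <- median(money_per_month)

# Apply the same cutoff as in the analysis and recompute the mean
kept <- money_per_month[money_per_month < 20000]
mean_kept <- mean(kept)

print(c(mean_all = mean_all, median_all = median_all, mean_kept = mean_kept))
```

In this toy case the mean drops from 12034 to 42.5 once the extreme value is removed, while the median of the full vector was already 50.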

Now to recompute the mean values and plot the box plots again.

# Mean sum of money spent by students each month
countries_mean = fcc_good %>% 
  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
  group_by(CountryLive) %>%
  summarize(mean = mean(money_per_month)) %>%
  arrange(desc(mean))

countries_mean
## # A tibble: 4 x 2
##   CountryLive               mean
##   <chr>                    <dbl>
## 1 United States of America 184. 
## 2 India                    135. 
## 3 Canada                   114. 
## 4 United Kingdom            45.5
# Isolate only the countries of interest
only_4  <-  fcc_good %>% 
  filter(CountryLive == 'United States of America' | CountryLive == 'India' | CountryLive == 'United Kingdom'|CountryLive == 'Canada') %>%
  mutate(index = row_number())

# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
  geom_boxplot() +
  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
  xlab("Country") +
  ylab("Money per month (US dollars)") +
  theme_bw()

There are a few extreme outliers for India (values over $2,500 per month), but it’s unclear whether this is good data or not. Maybe these people attended several bootcamps, which tend to be very expensive. This can be inspected further.

# Inspect the extreme outliers for India
india_outliers  <-  only_4 %>%
  filter(CountryLive == 'India' & 
           money_per_month >= 2500)

india_outliers
## # A tibble: 6 x 138
##     Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName
##   <dbl>            <dbl>          <dbl>             <dbl> <chr>       
## 1    24                0             NA                NA <NA>        
## 2    20                0             NA                NA <NA>        
## 3    28                0             NA                NA <NA>        
## 4    22                0             NA                NA <NA>        
## 5    19                0             NA                NA <NA>        
## 6    27                0             NA                NA <NA>        
## # ... with 133 more variables: BootcampRecommend <dbl>, ChildrenNumber <lgl>,
## #   CityPopulation <chr>, CodeEventConferences <dbl>,
## #   CodeEventDjangoGirls <dbl>, CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## #   CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## #   CodeEventNodeSchool <dbl>, CodeEventNone <lgl>, CodeEventOther <lgl>,
## #   CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## #   CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## #   CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## #   CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>,
## #   EmploymentFieldOther <chr>, EmploymentStatus <chr>,
## #   EmploymentStatusOther <chr>, ExpectedEarning <dbl>,
## #   FinanciallySupporting <dbl>, FirstDevJob <dbl>, Gender <chr>,
## #   GenderOther <lgl>, HasChildren <dbl>, HasDebt <dbl>,
## #   HasFinancialDependents <dbl>, HasHighSpdInternet <dbl>,
## #   HasHomeMortgage <dbl>, HasServedInMilitary <dbl>, HasStudentDebt <dbl>,
## #   HomeMortgageOwe <dbl>, HoursLearning <dbl>, ID.x <chr>, ID.y <chr>,
## #   Income <dbl>, IsEthnicMinority <dbl>, IsReceiveDisabilitiesBenefits <dbl>,
## #   IsSoftwareDev <dbl>, IsUnderEmployed <dbl>, JobApplyWhen <chr>,
## #   JobInterestBackEnd <dbl>, JobInterestDataEngr <dbl>,
## #   JobInterestDataSci <dbl>, JobInterestDevOps <dbl>,
## #   JobInterestFrontEnd <dbl>, JobInterestFullStack <dbl>,
## #   JobInterestGameDev <dbl>, JobInterestInfoSec <dbl>,
## #   JobInterestMobile <dbl>, JobInterestOther <lgl>, JobInterestProjMngr <dbl>,
## #   JobInterestQAEngr <dbl>, JobInterestUX <dbl>, JobPref <chr>,
## #   JobRelocateYesNo <dbl>, JobRoleInterest <chr>, JobWherePref <chr>,
## #   LanguageAtHome <chr>, MaritalStatus <chr>, MoneyForLearning <dbl>,
## #   MonthsProgramming <dbl>, NetworkID <chr>, Part1EndTime <dttm>,
## #   Part1StartTime <dttm>, Part2EndTime <dttm>, Part2StartTime <dttm>,
## #   PodcastChangeLog <dbl>, PodcastCodeNewbie <dbl>, PodcastCodePen <dbl>,
## #   PodcastDevTea <dbl>, PodcastDotNET <dbl>, PodcastGiantRobots <dbl>,
## #   PodcastJSAir <dbl>, PodcastJSJabber <dbl>, PodcastNone <lgl>,
## #   PodcastOther <chr>, PodcastProgThrowdown <dbl>, PodcastRubyRogues <dbl>,
## #   PodcastSEDaily <dbl>, PodcastSERadio <dbl>, PodcastShopTalk <dbl>,
## #   PodcastTalkPython <dbl>, PodcastTheWebAhead <dbl>,
## #   ResourceCodecademy <dbl>, ResourceCodeWars <dbl>, ResourceCoursera <dbl>,
## #   ResourceCSS <dbl>, ResourceEdX <dbl>, ResourceEgghead <dbl>,
## #   ResourceFCC <dbl>, ResourceHackerRank <dbl>, ResourceKA <dbl>, ...

None of these six participants attended a bootcamp. Overall, it’s hard to tell from the data whether they really spent that much money on learning. The actual question of the survey was “Aside from university tuition, about how much money have you spent on learning to code so far (in US dollars)?”, so they might have misunderstood and thought university tuition was included. It seems safer to remove these six rows.

# Remove the outliers for India
only_4 <-  only_4 %>% 
  filter(!(index %in% india_outliers$index))
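
The index column is what makes this removal safe: row positions shift after every filter, while a stored index stays attached to its row. A minimal base R sketch of the same pattern on invented data (the `students` data frame is hypothetical):

```r
# Hypothetical students
students <- data.frame(
  country         = c("India", "India", "Canada", "India"),
  money_per_month = c(40, 3000, 90, 5000)
)

# Stable row identifier, created once before any rows are removed
students$index <- seq_len(nrow(students))

# Rows flagged as outliers
outliers <- students[students$country == "India" &
                       students$money_per_month >= 2500, ]

# Keep every row whose index is not among the outliers
cleaned <- students[!(students$index %in% outliers$index), ]
print(cleaned)
```

Only the two flagged rows disappear; the remaining rows keep their original index values.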

A similar inspection is needed for students in the US who spend $6,000 or more per month.

# Examine the extreme outliers for the US
us_outliers = only_4 %>%
  filter(CountryLive == 'United States of America' & 
           money_per_month >= 6000)

us_outliers
## # A tibble: 11 x 138
##      Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName         
##    <dbl>            <dbl>          <dbl>             <dbl> <chr>                
##  1    26                1              0                 0 The Coding Boot Camp~
##  2    32                1              0                 0 The Iron Yard        
##  3    34                1              1                 0 We Can Code IT       
##  4    31                0             NA                NA <NA>                 
##  5    46                1              1                 1 Sabio.la             
##  6    32                0             NA                NA <NA>                 
##  7    26                1              0                 1 Codeup               
##  8    33                1              0                 1 Grand Circus         
##  9    29                0             NA                NA <NA>                 
## 10    27                0             NA                NA <NA>                 
## 11    50                0             NA                NA <NA>                 
## # ... with 133 more variables: BootcampRecommend <dbl>, ChildrenNumber <lgl>,
## #   CityPopulation <chr>, CodeEventConferences <dbl>,
## #   CodeEventDjangoGirls <dbl>, CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## #   CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## #   CodeEventNodeSchool <dbl>, CodeEventNone <lgl>, CodeEventOther <lgl>,
## #   CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## #   CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## #   CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## #   CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>,
## #   EmploymentFieldOther <chr>, EmploymentStatus <chr>,
## #   EmploymentStatusOther <chr>, ExpectedEarning <dbl>,
## #   FinanciallySupporting <dbl>, FirstDevJob <dbl>, Gender <chr>,
## #   GenderOther <lgl>, HasChildren <dbl>, HasDebt <dbl>,
## #   HasFinancialDependents <dbl>, HasHighSpdInternet <dbl>,
## #   HasHomeMortgage <dbl>, HasServedInMilitary <dbl>, HasStudentDebt <dbl>,
## #   HomeMortgageOwe <dbl>, HoursLearning <dbl>, ID.x <chr>, ID.y <chr>,
## #   Income <dbl>, IsEthnicMinority <dbl>, IsReceiveDisabilitiesBenefits <dbl>,
## #   IsSoftwareDev <dbl>, IsUnderEmployed <dbl>, JobApplyWhen <chr>,
## #   JobInterestBackEnd <dbl>, JobInterestDataEngr <dbl>,
## #   JobInterestDataSci <dbl>, JobInterestDevOps <dbl>,
## #   JobInterestFrontEnd <dbl>, JobInterestFullStack <dbl>,
## #   JobInterestGameDev <dbl>, JobInterestInfoSec <dbl>,
## #   JobInterestMobile <dbl>, JobInterestOther <lgl>, JobInterestProjMngr <dbl>,
## #   JobInterestQAEngr <dbl>, JobInterestUX <dbl>, JobPref <chr>,
## #   JobRelocateYesNo <dbl>, JobRoleInterest <chr>, JobWherePref <chr>,
## #   LanguageAtHome <chr>, MaritalStatus <chr>, MoneyForLearning <dbl>,
## #   MonthsProgramming <dbl>, NetworkID <chr>, Part1EndTime <dttm>,
## #   Part1StartTime <dttm>, Part2EndTime <dttm>, Part2StartTime <dttm>,
## #   PodcastChangeLog <dbl>, PodcastCodeNewbie <dbl>, PodcastCodePen <dbl>,
## #   PodcastDevTea <dbl>, PodcastDotNET <dbl>, PodcastGiantRobots <dbl>,
## #   PodcastJSAir <dbl>, PodcastJSJabber <dbl>, PodcastNone <lgl>,
## #   PodcastOther <chr>, PodcastProgThrowdown <dbl>, PodcastRubyRogues <dbl>,
## #   PodcastSEDaily <dbl>, PodcastSERadio <dbl>, PodcastShopTalk <dbl>,
## #   PodcastTalkPython <dbl>, PodcastTheWebAhead <dbl>,
## #   ResourceCodecademy <dbl>, ResourceCodeWars <dbl>, ResourceCoursera <dbl>,
## #   ResourceCSS <dbl>, ResourceEdX <dbl>, ResourceEgghead <dbl>,
## #   ResourceFCC <dbl>, ResourceHackerRank <dbl>, ResourceKA <dbl>, ...
# Drop the US outliers listed above from the working data
only_4  <-  only_4 %>% 
  filter(!(index %in% us_outliers$index))

Out of these 11 extreme outliers, six people attended bootcamps, which justifies the large sums of money spent on learning. For the other five, it’s hard to figure out from the data where they could have spent that much money on learning.

Also, the data shows that eight respondents had been programming for no more than three months when they completed the survey. They most likely paid a large sum up front for a bootcamp that was going to last several months, so the computed amount per month is unrealistically high; the true figure should be significantly lower (because they probably didn’t spend anything in the couple of months after the survey).
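
A short worked example shows how strongly an up-front payment inflates the per-month figure for someone who has only just started. All numbers below are invented for illustration, including the assumed 12-month learning period:

```r
# Hypothetical up-front bootcamp payment, in US dollars
bootcamp_cost <- 15000

# Months of programming reported at survey time
months_so_far <- 3

# Assumed total learning period over which the payment is really spread
program_months <- 12

# Per-month spending as computed in this analysis
observed_per_month <- bootcamp_cost / months_so_far

# Per-month spending if the same payment is spread over the assumed period
spread_per_month <- bootcamp_cost / program_months

print(c(observed = observed_per_month, spread = spread_per_month))
```

Under these assumptions the observed value (5000) is four times the spread-out value (1250), which is why such respondents look like extreme spenders.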

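In the next code block, I will remove the US outliers who did not attend a bootcamp, as well as those who had been programming for three months or less when they completed the survey.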

# Remove the US respondents spending $6,000+ per month who didn't attend a bootcamp
no_bootcamp = only_4 %>%
    filter(CountryLive == 'United States of America' & 
           money_per_month >= 6000 &
             AttendedBootcamp == 0)
# Assign back to only_4 so that this removal is not lost in the next step
only_4  <-  only_4 %>% 
  filter(!(index %in% no_bootcamp$index))
# Remove the respondents that had been programming for 3 months or less
less_than_3_months = only_4 %>%
    filter(CountryLive == 'United States of America' & 
           money_per_month >= 6000 &
           MonthsProgramming <= 3)
only_4  <-  only_4 %>% 
  filter(!(index %in% less_than_3_months$index))
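The two removals can also be expressed as one step: flag every row that meets either condition, then drop the flagged rows with dplyr's anti_join(). The following is only a sketch on a toy tibble with hypothetical values. The names toy, flagged and cleaned are made up for illustration, and the column names mirror the ones used in this analysis.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(
  index = 1:5,
  CountryLive = c("United States of America", "United States of America",
                  "United States of America", "Canada",
                  "United States of America"),
  money_per_month = c(8000, 9000, 7000, 5000, 100),
  AttendedBootcamp = c(0, 1, 1, 1, 0),
  MonthsProgramming = c(12, 2, 10, 2, 1)
)

# Flag US respondents spending $6,000+ per month who either skipped a
# bootcamp or had been programming for 3 months or less
flagged <- toy %>%
  filter(CountryLive == "United States of America" &
           money_per_month >= 6000 &
           (AttendedBootcamp == 0 | MonthsProgramming <= 3))

# Keep every row whose index is not among the flagged rows
cleaned <- toy %>%
  anti_join(flagged, by = "index")

cleaned$index
```

Filtering first and anti-joining afterwards keeps rows with missing values in the flag columns, just as the `%in%` approach does, because filter() only returns rows where the condition is TRUE.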

Looking again at the last box plot above, there is also an extreme outlier for Canada — a person who spends roughly $5,000 per month.

# Examine the extreme outliers for Canada
canada_outliers = only_4 %>%
  filter(CountryLive == 'Canada' & 
           money_per_month >= 4500 &
           MonthsProgramming <= 3)
canada_outliers
## # A tibble: 1 x 138
##     Age AttendedBootcamp BootcampFinish BootcampLoanYesNo BootcampName
##   <dbl>            <dbl>          <dbl>             <dbl> <chr>       
## 1    24                1              0                 0 Bloc.io     
## # ... with 133 more variables: BootcampRecommend <dbl>, ChildrenNumber <lgl>,
## #   CityPopulation <chr>, CodeEventConferences <dbl>,
## #   CodeEventDjangoGirls <dbl>, CodeEventFCC <dbl>, CodeEventGameJam <dbl>,
## #   CodeEventGirlDev <dbl>, CodeEventHackathons <dbl>, CodeEventMeetup <dbl>,
## #   CodeEventNodeSchool <dbl>, CodeEventNone <lgl>, CodeEventOther <lgl>,
## #   CodeEventRailsBridge <dbl>, CodeEventRailsGirls <dbl>,
## #   CodeEventStartUpWknd <dbl>, CodeEventWkdBootcamps <dbl>,
## #   CodeEventWomenCode <dbl>, CodeEventWorkshops <dbl>, CommuteTime <chr>,
## #   CountryCitizen <chr>, CountryLive <chr>, EmploymentField <chr>,
## #   EmploymentFieldOther <chr>, EmploymentStatus <chr>,
## #   EmploymentStatusOther <chr>, ExpectedEarning <dbl>,
## #   FinanciallySupporting <dbl>, FirstDevJob <dbl>, Gender <chr>,
## #   GenderOther <lgl>, HasChildren <dbl>, HasDebt <dbl>,
## #   HasFinancialDependents <dbl>, HasHighSpdInternet <dbl>,
## #   HasHomeMortgage <dbl>, HasServedInMilitary <dbl>, HasStudentDebt <dbl>,
## #   HomeMortgageOwe <dbl>, HoursLearning <dbl>, ID.x <chr>, ID.y <chr>,
## #   Income <dbl>, IsEthnicMinority <dbl>, IsReceiveDisabilitiesBenefits <dbl>,
## #   IsSoftwareDev <dbl>, IsUnderEmployed <dbl>, JobApplyWhen <chr>,
## #   JobInterestBackEnd <dbl>, JobInterestDataEngr <dbl>,
## #   JobInterestDataSci <dbl>, JobInterestDevOps <dbl>,
## #   JobInterestFrontEnd <dbl>, JobInterestFullStack <dbl>,
## #   JobInterestGameDev <dbl>, JobInterestInfoSec <dbl>,
## #   JobInterestMobile <dbl>, JobInterestOther <lgl>, JobInterestProjMngr <dbl>,
## #   JobInterestQAEngr <dbl>, JobInterestUX <dbl>, JobPref <chr>,
## #   JobRelocateYesNo <dbl>, JobRoleInterest <chr>, JobWherePref <chr>,
## #   LanguageAtHome <chr>, MaritalStatus <chr>, MoneyForLearning <dbl>,
## #   MonthsProgramming <dbl>, NetworkID <chr>, Part1EndTime <dttm>,
## #   Part1StartTime <dttm>, Part2EndTime <dttm>, Part2StartTime <dttm>,
## #   PodcastChangeLog <dbl>, PodcastCodeNewbie <dbl>, PodcastCodePen <dbl>,
## #   PodcastDevTea <dbl>, PodcastDotNET <dbl>, PodcastGiantRobots <dbl>,
## #   PodcastJSAir <dbl>, PodcastJSJabber <dbl>, PodcastNone <lgl>,
## #   PodcastOther <chr>, PodcastProgThrowdown <dbl>, PodcastRubyRogues <dbl>,
## #   PodcastSEDaily <dbl>, PodcastSERadio <dbl>, PodcastShopTalk <dbl>,
## #   PodcastTalkPython <dbl>, PodcastTheWebAhead <dbl>,
## #   ResourceCodecademy <dbl>, ResourceCodeWars <dbl>, ResourceCoursera <dbl>,
## #   ResourceCSS <dbl>, ResourceEdX <dbl>, ResourceEgghead <dbl>,
## #   ResourceFCC <dbl>, ResourceHackerRank <dbl>, ResourceKA <dbl>, ...

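This respondent's situation resembles that of some of the US outliers: they attended a bootcamp (Bloc.io) and had been programming for no more than three months when they completed the survey. The same approach can be taken to remove this outlier.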

# Remove the extreme outliers for Canada
only_4  <-  only_4 %>% 
  filter(!(index %in% canada_outliers$index))

With the outliers removed, I recompute the mean values and generate the final box plots.

# Mean sum of money spent by students each month
countries_mean = only_4 %>%
  group_by(CountryLive) %>%
  summarize(mean = mean(money_per_month)) %>%
  arrange(desc(mean))

countries_mean
## # A tibble: 4 x 2
##   CountryLive               mean
##   <chr>                    <dbl>
## 1 United States of America 143. 
## 2 Canada                    93.1
## 3 India                     65.8
## 4 United Kingdom            45.5
# Box plots to visualize distributions
ggplot( data = only_4, aes(x = CountryLive, y = money_per_month)) +
  geom_boxplot() +
  ggtitle("Money Spent Per Month Per Country\n(Distributions)") +
  xlab("Country") +
  ylab("Money per month (US dollars)") +
  theme_bw()

Deciding the Best Countries to Advertise in

The US is the clear winner here: its respondents already spend far more per month (about $143 on average) than those in the other three countries. Second place is a harder call between India and Canada. Canadian respondents spend more on average (about $93 versus $66 per month), but India is the larger market, so it may still be worth the money to advertise there.
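Market size and spending can be compared side by side by counting respondents per country next to the mean. The sketch below uses a toy tibble with hypothetical values, not the survey data, and the names toy and market_summary are made up for illustration.

```r
library(dplyr)

# Toy data with hypothetical values
toy <- tibble(
  CountryLive = c("United States of America", "United States of America",
                  "United States of America", "India", "India", "Canada"),
  money_per_month = c(200, 100, 0, 60, 40, 90)
)

# Number of respondents and mean monthly spending per country,
# ordered by market size
market_summary <- toy %>%
  group_by(CountryLive) %>%
  summarize(respondents = n(),
            mean = mean(money_per_month)) %>%
  arrange(desc(respondents))

market_summary
```

Running the same summary on only_4 would show how many respondents stand behind each country's mean, which is the trade-off behind the choice between India and Canada.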

Summary

In this project, I studied data from new coders surveyed by freeCodeCamp to identify the two best countries to advertise in when launching an e-learning company. The US is the best choice, both in the number of potentially interested users and in their ability and willingness to spend on a programming course. Second place is a close call between India and Canada, at least from the data at hand. Either could be a good choice, but more information is needed to decide with confidence.