Introduction

Interest in data analysis and data science has grown markedly in recent years. With data proliferating and data-driven decision making spreading across many fields, it is not surprising that both terms have gained popularity. In this project, we track the popularity of data analysis and data science over time and investigate notable trends, using Google Trends search data.

Data Preparation

We began by importing two separate CSV files, ‘geoMap.csv’ and ‘multiTimeline.csv’, into R and performing the necessary data wrangling and cleaning to ensure that they were ready for analysis. The first file, ‘geoMap.csv’, contains information about the popularity of the search terms in different countries. The second file, ‘multiTimeline.csv’, provides a month-by-month breakdown of the popularity of the search terms since January 2004.

Our goal in this section was to make sure that the data was in a suitable format for analysis. This included removing any unnecessary columns, handling missing or incorrect values, and merging the two datasets if necessary.

# Read in the data files
geoMap <- read.csv("/Users/paul/iCloud Drive (Archive)/Downloads/geoMap.csv", header = TRUE, sep = ",")
multiTimeline <- read.csv("/Users/paul/iCloud Drive (Archive)/Downloads/multiTimeline.csv", header = TRUE, sep = ",")

# Check the structure of the data files
str(geoMap)
## 'data.frame':    250 obs. of  3 variables:
##  $ Country                          : chr  "Botswana" "Malawi" "Lesotho" "Zimbabwe" ...
##  $ Data.science...1.1.04...4.11.23. : chr  "" "" "" "" ...
##  $ Data.analysis...1.1.04...4.11.23.: chr  "" "" "" "" ...
str(multiTimeline)
## 'data.frame':    232 obs. of  4 variables:
##  $ Month                     : chr  "2004-01" "2004-02" "2004-03" "2004-04" ...
##  $ Data.science...Worldwide. : chr  "1" "1" "2" "0" ...
##  $ Data.analysis...Worldwide.: int  31 36 34 35 34 34 31 29 33 37 ...
##  $ Growth.in.popularity      : int  0 5 -2 1 -1 0 -3 -2 4 4 ...
# Rename columns in geoMap
colnames(geoMap) <- c("Country", "Data_science", "Data_analysis")

# Rename columns in multiTimeline
colnames(multiTimeline) <- c("Month", "Data_science", "Data_analysis", "Growth_in_popularity")

Data Cleaning

Before proceeding with the analysis, we checked for missing or duplicate data in both data frames and removed them as necessary.
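
As a minimal sketch of how such a check could look, the snippet below counts rows with missing cells and rows that repeat an earlier row. The `toy` data frame and the `count_problems()` helper are hypothetical and not part of the original analysis. Blank strings are treated as missing here because geoMap stores cells without data as empty strings.

```r
# Hypothetical stand-in for geoMap: one blank row and one exact duplicate
toy <- data.frame(
  Country       = c("A", "B", "B", "C"),
  Data_science  = c("40", "55", "55", ""),
  Data_analysis = c("60", "45", "45", ""),
  stringsAsFactors = FALSE
)

# Count rows that contain a missing or blank cell, and rows that repeat an earlier row
count_problems <- function(df) {
  blank_or_na <- is.na(df) | df == ""
  c(missing_rows   = sum(rowSums(blank_or_na, na.rm = TRUE) > 0),
    duplicate_rows = sum(duplicated(df)))
}

problems <- count_problems(toy)
print(problems)
```

On the toy data this reports one row with missing cells and one duplicated row.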

It is also a good idea to preview the data to see what it contains. We use the head() function, which displays the first few rows of a data frame:

head(geoMap)
##       Country Data_science Data_analysis
## 1    Botswana                           
## 2      Malawi                           
## 3     Lesotho                           
## 4    Zimbabwe                           
## 5 Timor-Leste                           
## 6    Eswatini
head(multiTimeline)
##     Month Data_science Data_analysis Growth_in_popularity
## 1 2004-01            1            31                    0
## 2 2004-02            1            36                    5
## 3 2004-03            2            34                   -2
## 4 2004-04            0            35                    1
## 5 2004-05            1            34                   -1
## 6 2004-06            1            34                    0
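# Identify and remove rows with missing values in both data frames.
# geoMap stores missing values as empty strings, which na.omit() ignores, so mark them as NA first.
geoMap[geoMap == ""] <- NA
geoMap_clean <- na.omit(geoMap)
multiTimeline_clean <- na.omit(multiTimeline)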
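
The str() output earlier shows that Data_science in multiTimeline was read as character rather than integer, which usually means the column contains at least one non-numeric entry. The sketch below shows one possible way to coerce such a column. It assumes the offending entries are "<1" (the label Google Trends uses for very low interest) and empty strings. The `trends_to_numeric()` helper and the choice of 0.5 as a replacement for "<1" are hypothetical, not part of the original analysis.

```r
# Hypothetical helper: convert Google Trends strings to numbers.
# Assumptions: "" means no data (NA) and "<1" is replaced by 0.5.
trends_to_numeric <- function(x, below_one = 0.5) {
  x <- trimws(as.character(x))
  out <- rep(NA_real_, length(x))
  is_low <- !is.na(x) & x == "<1"
  is_num <- !is.na(x) & grepl("^[0-9]+(\\.[0-9]+)?$", x)
  out[is_low] <- below_one
  out[is_num] <- as.numeric(x[is_num])
  out
}

res <- trends_to_numeric(c("1", "<1", "", "34"))
print(res)
```

Applied to the report's data, this would look like `multiTimeline_clean$Data_science <- trends_to_numeric(multiTimeline_clean$Data_science)`. Anything that is neither a plain number nor "<1" becomes NA, so unexpected entries surface as missing values instead of being silently misread.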

Data Wrangling

# Load the packages used for wrangling and mapping
library(dplyr)
library(ggplot2)

# Convert Month column to date format
multiTimeline_clean$Month <- as.Date(paste0(multiTimeline_clean$Month, "-01"))

# Create a new column for seasonality (calendar month as an ordered factor)
multiTimeline_clean$Season <- factor(format(multiTimeline_clean$Month, "%b"), 
                                      levels = month.abb, ordered = TRUE)

# Group data by season and calculate the average growth in popularity
seasonal_avg_growth <- multiTimeline_clean %>%
  group_by(Season) %>%
  summarize(avg_growth = mean(Growth_in_popularity))

# world_map is not defined in the report; assuming it comes from ggplot2::map_data()
# (which requires the maps package)
world_map <- map_data("world")

# Join the cleaned geoMap data frame with the world map data
geoMap_joined <- inner_join(world_map, geoMap_clean, by = c("region" = "Country"))
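
An inner join silently drops every country whose name is spelled differently in the two sources. The sketch below shows one way to list the unmatched names and recode them before joining. All vectors are made up for illustration, and the spelling pairs (such as "United States" versus "USA") are assumptions about how the two sources label countries that should be checked against the real data.

```r
# Hypothetical country labels from the search data and from the map data
trends_names <- c("United States", "United Kingdom", "India", "Germany")
map_names    <- c("USA", "UK", "India", "Germany", "France")

# Names that an inner join would drop
unmatched <- setdiff(trends_names, map_names)
print(unmatched)

# Hypothetical lookup table: search-data spelling -> map spelling
recode <- c("United States" = "USA", "United Kingdom" = "UK")
fixed_names <- ifelse(trends_names %in% names(recode),
                      unname(recode[trends_names]),
                      trends_names)

# After recoding, check again for names without a match
still_unmatched <- setdiff(fixed_names, map_names)
print(still_unmatched)
```

Running the same `setdiff()` check on the real Country and region columns before the join shows how many countries would otherwise vanish from the map.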

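The seasonal summary in this section groups month-over-month growth by calendar month. As a rough illustration of the same idea, the sketch below uses base R on made-up data so that it runs without dplyr. The `toy` values are invented purely to show the mechanics and are not results from the report. The month labels are derived from the month number, a variant that does not depend on the system locale the way `format(x, "%b")` does.

```r
# Synthetic monthly data: two years of made-up growth values
toy <- data.frame(
  Month  = seq(as.Date("2004-01-01"), by = "month", length.out = 24),
  Growth = rep(c(1, 0, 2, -1, 0, -2, -3, -4, 6, 2, 0, -1), times = 2)
)

# Calendar month as an ordered factor (Jan < Feb < ... < Dec), locale-independent
toy$Season <- factor(month.abb[as.integer(format(toy$Month, "%m"))],
                     levels = month.abb, ordered = TRUE)

# Average growth per calendar month; base R counterpart of group_by() + summarize()
avg_growth <- tapply(toy$Growth, toy$Season, mean)

# Months with the highest and lowest average growth in the toy data
peak_month <- names(avg_growth)[which.max(avg_growth)]
low_month  <- names(avg_growth)[which.min(avg_growth)]
print(c(peak = peak_month, low = low_month))
```

Averaging growth rather than raw popularity keeps the long-term upward trend from dominating the comparison between months.
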
Data Analysis

In this section, we will analyze the data to gain insights into the popularity of data analysis and data science over time, as well as any interesting trends and correlations.

Recent Surge in Popularity of Data Analysis

In recent years, search interest in data analysis has surged, as shown by its rising values from mid-2017 to 2020. Possible reasons include growing demand for data-driven decision making across industries and the rise of big data and analytics technologies. Interestingly, data analysis saw a sharp drop in popularity in the latter half of 2020, which may reflect the impact of the COVID-19 pandemic on businesses and the job market. Overall, the graph shows how the popularity of these terms has evolved over time.

Forecasting the Popularity of Data Analysis and Data Science

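# Load the packages used for forecasting and plotting
library(forecast)
library(ggplot2)

# Convert data to time series format (monthly data starting in January 2004)
ts_data <- ts(multiTimeline_clean$Data_analysis, start = c(2004, 1), frequency = 12)

# Fit ARIMA model (auto.arima() selects the model order automatically)
arima_model <- auto.arima(ts_data)

# Make predictions for the next 24 months
forecast_data <- forecast(arima_model, h = 24)

# Plot the forecast (the series modelled here is Data_analysis only)
autoplot(forecast_data) +
  ggtitle("Forecast of Popularity of Data Analysis") +
  ylab("Popularity") +
  xlab("Year") +
  theme_bw()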

In conclusion, the ARIMA forecast points to a continued increase in popularity over the coming years. The model above is fitted to the data analysis series; repeating the same steps on the Data_science column gives the corresponding forecast for data science. The forecasts show an upward trend for both terms, with data science expected to surpass data analysis in popularity by the end of 2024. This suggests that demand for professionals skilled in these fields will continue to rise, making data a promising career path. However, forecasts are based on historical data and cannot account for unexpected events or changes in the industry. Even so, the ARIMA model can help individuals and organizations make informed decisions based on the predicted trends.
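
Because forecasts rest entirely on historical data, it can help to check how well a model would have predicted months that are already known. The sketch below holds out the last 12 months of a series, fits a model on the rest, and measures the forecast error. It is an illustration under stated assumptions, not the report's method: it uses the built-in `AirPassengers` dataset as a stand-in for the search data, and a fixed seasonal ARIMA order with base R's `arima()` instead of the order chosen by `auto.arima()`.

```r
# Built-in monthly series (1949-1960), used only as a stand-in for the search data
y <- log(AirPassengers)

# Keep the last 12 months aside as a test set
h <- 12
train  <- window(y, end = c(1959, 12))
actual <- window(y, start = c(1960, 1))

# Fixed seasonal ARIMA order chosen for illustration only;
# auto.arima() in the report selects the order automatically
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the held-out months and measure the mean absolute error
pred <- predict(fit, n.ahead = h)$pred
mae  <- mean(abs(pred - actual))
print(mae)
```

The same split could be applied to `ts_data`, comparing the 12 held-out values of Data_analysis with the model's predictions before trusting a 24-month forecast.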

Conclusion

After analyzing Google Trends data, we found that data science and data analysis are popular and growing fields of interest across the globe. Our examination of temporal trends showed that data science has been more popular than data analysis since 2021, although data analysis has been steadily gaining popularity over the last two years. Geographically, data analysis has been more popular in European countries, particularly the United Kingdom, Germany, and France, while data science is more popular in Asian countries such as India and China.

Furthermore, our investigation into the geographic trends in the popularity of these terms revealed that the United States and Canada show a high popularity for both terms, while Europe and Asia exhibit different preferences between data science and data analysis. These insights highlight the global reach of these terms and the growing interest in data-driven decision making across various industries and regions.

Moreover, our forecast based on the ARIMA model suggests that the popularity of data science and data analysis is expected to continue its upward trend in the near future. Aspiring data professionals can expect these fields to be in high demand, especially in the United States and Canada where they are already quite popular.

Lastly, our analysis of seasonality showed that interest in data analysis and data science follows a yearly cycle. Interest peaks in the autumn months, particularly September, and dips during the summer, with the lowest interest in August. This pattern could be explained by the start of the academic year, when students return to school and look to improve their skills in these fields, and by people taking vacations during the summer.

Overall, our findings suggest that both data science and data analysis are important fields to consider for those looking to enter the data-driven job market. Their popularity is likely to keep growing in the coming years, and our insights into temporal, geographic, and seasonal trends can inform individuals and organizations as they make decisions about education, employment, and the planning of projects or campaigns related to these fields. By considering these factors, they can better anticipate and respond to changes in demand and stay ahead of the curve in these fast-evolving fields.

Report written by Paul Carmody. LinkedIn: https://www.linkedin.com/in/carmodypaul/