Data analysis

Introduction:

The rise of data analysis and data science has been a topic of interest for many in recent years. With the proliferation of data and the increased focus on data-driven decision making in various fields, it is not surprising that these fields have gained popularity. In this project, we track the popularity of data analysis and data science over time and investigate notable trends, using Google Trends search data.

Data Preparation

We began by importing two separate CSV files, ‘geoMap.csv’ and ‘multiTimeline.csv’, into R and performing the necessary data wrangling and cleaning to ensure that they were ready for analysis. The first file, ‘geoMap.csv’, contains information about the popularity of the search terms in different countries. The second file, ‘multiTimeline.csv’, provides a month-by-month breakdown of the popularity of the search terms since January 2004.

Our goal in this section was to make sure that the data was in a suitable format for analysis. This included removing any unnecessary columns, handling missing or incorrect values, and merging the two datasets if necessary.

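# Load the packages used below, then read in the data files
library(tidyverse)  # dplyr, tidyr, ggplot2, readr
library(forecast)   # auto.arima(), forecast()
geoMap <- read.csv("/Users/paul/iCloud Drive (Archive)/Downloads/geoMap.csv", header = TRUE, sep = ",")
multiTimeline <- read.csv("/Users/paul/iCloud Drive (Archive)/Downloads/multiTimeline.csv", header = TRUE, sep = ",")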

# Check the structure of the data files
str(geoMap)

## 'data.frame':    250 obs. of  3 variables:
##  $ Country                          : chr  "Botswana" "Malawi" "Lesotho" "Zimbabwe" ...
##  $ Data.science...1.1.04...4.11.23. : chr  "" "" "" "" ...
##  $ Data.analysis...1.1.04...4.11.23.: chr  "" "" "" "" ...

str(multiTimeline)

## 'data.frame':    232 obs. of  4 variables:
##  $ Month                     : chr  "2004-01" "2004-02" "2004-03" "2004-04" ...
##  $ Data.science...Worldwide. : chr  "1" "1" "2" "0" ...
##  $ Data.analysis...Worldwide.: int  31 36 34 35 34 34 31 29 33 37 ...
##  $ Growth.in.popularity      : int  0 5 -2 1 -1 0 -3 -2 4 4 ...

# Rename columns in geoMap
colnames(geoMap) <- c("Country", "Data_science", "Data_analysis")

# Rename columns in multiTimeline
colnames(multiTimeline) <- c("Month", "Data_science", "Data_analysis", "Growth_in_popularity")

Data Cleaning

Before proceeding with the analysis, we checked for missing or duplicate data in both data frames and removed them as necessary.

First, we preview both data frames with the head() function, which displays the first few rows of a data frame:

head(geoMap)

##       Country Data_science Data_analysis
## 1    Botswana                           
## 2      Malawi                           
## 3     Lesotho                           
## 4    Zimbabwe                           
## 5 Timor-Leste                           
## 6    Eswatini

head(multiTimeline)

##     Month Data_science Data_analysis Growth_in_popularity
## 1 2004-01            1            31                    0
## 2 2004-02            1            36                    5
## 3 2004-03            2            34                   -2
## 4 2004-04            0            35                    1
## 5 2004-05            1            34                   -1
## 6 2004-06            1            34                    0
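
The preview suggests that Growth_in_popularity is the month-over-month change in Data_analysis. This is an inference from the printed values, not something the source files state. A minimal check, using only the first ten values shown in the str() output and assuming the first month is assigned a growth of 0:

```r
# First ten values of each column, copied from the str() output above
data_analysis <- c(31, 36, 34, 35, 34, 34, 31, 29, 33, 37)
growth_shown  <- c(0, 5, -2, 1, -1, 0, -3, -2, 4, 4)

# Month-over-month change, with 0 for the first month (assumed convention)
growth_derived <- c(0, diff(data_analysis))

# TRUE if the derived values reproduce the printed column
all(growth_derived == growth_shown)
```

If the check holds for the full column, the growth figures used later describe changes in data analysis searches only, not data science.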

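# Treat empty strings in geoMap as missing values (they are read in as "", not NA),
# then remove rows with missing values and duplicate rows from both data frames
geoMap_clean <- geoMap
geoMap_clean[geoMap_clean == ""] <- NA
geoMap_clean <- unique(na.omit(geoMap_clean))
multiTimeline_clean <- unique(na.omit(multiTimeline))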
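
One point worth checking here is that na.omit() only drops rows containing NA, while the str() output shows geoMap storing missing values as empty strings. The sketch below illustrates the difference on a toy data frame; the country names come from the preview above, and the "45%" value is made up for illustration.

```r
# Toy data frame mimicking geoMap: missing values stored as empty strings
# (the "45%" value is invented for this example)
toy <- data.frame(
  Country = c("Botswana", "Malawi", "Lesotho"),
  Data_science = c("", "45%", ""),
  stringsAsFactors = FALSE
)

# na.omit() keeps every row because "" is not NA
n_before <- nrow(na.omit(toy))

# Recode empty strings as NA, then drop incomplete rows
toy$Data_science[toy$Data_science == ""] <- NA
n_after <- nrow(na.omit(toy))

c(before = n_before, after = n_after)
```

An alternative, under the same assumption about the file, would be to pass na.strings = "" to read.csv() so that empty cells arrive as NA in the first place.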

Data Wrangling

# Convert Month column to date format
multiTimeline_clean$Month <- as.Date(paste0(multiTimeline_clean$Month, "-01"))

# Create a new column for seasonality
multiTimeline_clean$Season <- factor(format(multiTimeline_clean$Month, "%b"), 
                                      levels = month.abb, ordered = TRUE)

# Group data by season and calculate the average growth in popularity
seasonal_avg_growth <- multiTimeline_clean %>%
  group_by(Season) %>%
  summarize(avg_growth = mean(Growth_in_popularity))

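# Load world map polygons (assumption: world_map comes from ggplot2::map_data("world"),
# which requires the maps package and names its country column "region")
world_map <- map_data("world")

# Join the geoMap dataframe with the world map data
geoMap_joined <- inner_join(world_map, geoMap, by = c("region" = "Country"))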

Data Analysis

In this section, we will analyze the data to gain insights into the popularity of data analysis and data science over time, as well as any interesting trends and correlations.

Trends in popularity of terms over time

# Replace the "<1" placeholder in Data_science with 0.5 while the values are still text,
# then convert the whole column to numeric (this avoids a coercion warning)
multiTimeline_clean$Data_science <- as.numeric(
  ifelse(multiTimeline_clean$Data_science == "<1", "0.5", multiTimeline_clean$Data_science))
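
Because ifelse() evaluates both branches on the whole vector, calling as.numeric() on a column that still contains "<1" triggers a coercion warning even though those entries are later replaced. A minimal sketch on a made-up vector shows one way to avoid this, by substituting the placeholder first and converting once:

```r
# Made-up sample of the Data_science column, including the "<1" placeholder
raw <- c("1", "<1", "2", "0")

# Substitute the placeholder while the values are still text, then convert once
clean <- as.numeric(ifelse(raw == "<1", "0.5", raw))

clean
```

The choice of 0.5 for "<1" follows the report; any value between 0 and 1 would be an equally arbitrary stand-in.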

# Plot both search terms over time (Google Trends values range from 0 to 100)
multiTimeline_clean %>%
  select(Month, Data_analysis, Data_science) %>%
  pivot_longer(-Month, names_to = "Category", values_to = "Popularity") %>%
  ggplot(aes(x = Month, y = Popularity, color = Category)) +
  geom_line() +
  scale_y_continuous(breaks = seq(0, 100, by = 25)) +
  labs(x = "Month", y = "Popularity", title = "Popularity of Data Science and Data Analysis over Time") +
  theme_bw() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_date(date_labels = "%Y-%m", date_breaks = "1 year") +
  # Add a vertical line at 1/1/2022 and label it "Note"
  geom_vline(xintercept = as.Date("2022-01-01"), linetype = "dashed", color = "red", linewidth = 1) +
  annotate(geom = "text", x = as.Date("2022-01-02"),
           y = max(multiTimeline_clean$Data_analysis, multiTimeline_clean$Data_science, na.rm = TRUE),
           label = "Note", color = "red", hjust = 0, vjust = 1)

Upon analyzing the data, we observed that data analysis was consistently more popular than data science until around mid-2017. In recent years, however, data science has gained ground and has almost caught up with data analysis. This trend may be due to the increasing adoption of machine learning and artificial intelligence in various industries.

NOTE: The significant spike in popularity around January 1st, 2022 may have been influenced by changes Google made to its data collection, so caution is advised when interpreting it. Furthermore, both data analysis and data science exhibit seasonal trends, with peaks and dips occurring at specific times of the year.

Recent Surge in Popularity of Data Analysis

In recent years, we can observe a significant surge in popularity of data analysis. This is evident from the increasing values of data analysis from mid-2017 to 2020. There could be several reasons for this trend, including an increase in demand for data-driven decision making across various industries, as well as the rise of big data and analytics technologies. It is interesting to note that in the latter half of 2020, data analysis experienced a sharp drop in popularity, which could be due to various factors such as the impact of the COVID-19 pandemic on the job market and businesses. Overall, this graph provides valuable insights into the popularity of these terms over time and how they have evolved.

Forecasting the popularity of data analysis and science

# Convert the data analysis series to a monthly time series
ts_data <- ts(multiTimeline_clean$Data_analysis, start = c(2004, 1), frequency = 12)

# Fit ARIMA model
arima_model <- auto.arima(ts_data)

# Make predictions for the next 24 months
forecast_data <- forecast(arima_model, h = 24)

# Plot the forecast
autoplot(forecast_data) +
  ggtitle("Forecast of Popularity of Data Analysis") +
  ylab("Popularity") +
  xlab("Year") +
  theme_bw()

In conclusion, the ARIMA model predicts that the popularity of data analysis will continue to increase over the next two years. The model is fitted to the data analysis series only, so the expectation that data science will surpass data analysis by the end of 2024 is not something this forecast can show; testing it would require fitting a second model to the data science series. If the upward trend holds, demand for professionals skilled in these fields should continue to rise, making this a promising path for those interested in a career in data. However, forecasts are based on historical data and do not take into account unexpected events or changes in the industry, so they are best treated as a guide for decisions rather than a guarantee.
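
Before relying on a forecast like this, it can help to check how well the approach predicts data it has not seen. The sketch below is one way to do that, not the method used in this report: it uses base R's arima() with a hand-picked seasonal order instead of auto.arima(), and a synthetic series (trend plus yearly cycle plus noise) standing in for the real Google Trends column.

```r
# Synthetic monthly series standing in for a search-interest column
# (not real Google Trends data): linear trend + yearly cycle + noise
set.seed(1)
n <- 120
idx <- seq_len(n)
y <- ts(30 + 0.2 * idx + 5 * sin(2 * pi * idx / 12) + rnorm(n, sd = 2),
        start = c(2004, 1), frequency = 12)

# Hold out the last 12 months to check forecast accuracy
train <- window(y, end = c(2012, 12))
test  <- window(y, start = c(2013, 1))

# Seasonal ARIMA with a hand-picked order (an assumption for this sketch;
# auto.arima() would search for the order instead)
fit <- arima(train, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the held-out year and compute the mean absolute error
pred <- predict(fit, n.ahead = 12)
mae <- mean(abs(pred$pred - test))
mae
```

Applied to the real data, the same holdout idea could be repeated for the data science series to compare the two forecasts directly.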

Trends in Seasonality

# Plot the average growth in popularity by season
ggplot(seasonal_avg_growth, aes(x = Season, y = avg_growth)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Season",
       y = "Average Growth in Popularity",
       title = "Average Growth in Popularity by Season") +
  theme_minimal()

The analysis revealed some interesting insights into the seasonality of interest in data analysis and data science. The data shows that there is a consistent peak in interest in the fall months, particularly in September, with an average growth in popularity of 3.95. This could be attributed to the start of the academic year, with students returning to school and looking to improve their skills in these fields.

Additionally, the data shows a dip in interest during the summer months, with an average growth in popularity of -0.11 in August. This could be due to people taking vacations and being less focused on work-related activities during this time.

Overall, the analysis suggests that interest in data analysis and data science is influenced by seasonal factors, and it may be important to consider these factors when planning projects or campaigns related to these fields. By understanding the cyclical nature of interest in these topics, individuals and organizations can better anticipate and respond to changes in demand.
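
The monthly averages discussed above can also be computed without dplyr. The following is a minimal base R sketch on a made-up two-year growth series; the numbers are invented for illustration and are not the report's data. It builds the month labels from the month number rather than from format(x, "%b"), since "%b" depends on the system locale and may not match month.abb on non-English systems.

```r
# Made-up two-year series of month-over-month growth values
growth <- c(1, 0, -1, 2, 0, -2, -1, -3, 4, 1, 0, -1,
            3, 0,  1, 0, 0,  0, -1,  1, 6, 1, 0, -3)
months <- seq(as.Date("2020-01-01"), by = "month", length.out = 24)

# Label each date with its month as an ordered factor (Jan < Feb < ... < Dec)
month_label <- factor(month.abb[as.integer(format(months, "%m"))],
                      levels = month.abb, ordered = TRUE)

# Average growth per calendar month across years
avg_by_month <- tapply(growth, month_label, mean)
avg_by_month[c("Aug", "Sep")]
```

With only a handful of observations per month, a single unusual year can move these averages noticeably, which is worth keeping in mind when reading the bar chart.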

Geographic Trends

In this section, we will explore the geographic trends in the popularity of data analysis and data science. We will use the geoMap_joined data frame that we created in the Data Wrangling section to create maps showing the popularity of these terms in different countries.

# Match country names in geoMap to the region names used by world_map
# (assumption: world_map comes from ggplot2::map_data("world"), which uses "USA" and "UK")
geoMap$Country[geoMap$Country == "United States"] <- "USA"
geoMap$Country[geoMap$Country == "United Kingdom"] <- "UK"

# Convert percentage columns to numeric proportions before joining
geoMap <- geoMap %>%
  mutate(Data_science = parse_number(Data_science)/100,
         Data_analysis = parse_number(Data_analysis)/100)

# Find the row corresponding to the UK
uk_row <- which(geoMap$Country == "UK")

# Update the UK values for data science and data analysis (as proportions)
geoMap[uk_row, "Data_science"] <- 0.37
geoMap[uk_row, "Data_analysis"] <- 0.63

# Join geoMap with world_map by the common country name column
geoMap_joined <- left_join(world_map, geoMap, by = c("region" = "Country"))
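
Joins on country names fail silently when the two sources spell a name differently, so it can be useful to list unmatched names before plotting. The sketch below assumes the map data follows the naming of map_data("world") (for example "USA" and "UK"); the vectors are small illustrative subsets, and the recode lookup is a hypothetical helper, not part of the original analysis.

```r
# Region names as they appear in the map data (assumed subset of map_data("world"))
map_regions <- c("USA", "UK", "Canada", "India")

# Country names as Google Trends reports them (illustrative subset)
trends_countries <- c("United States", "United Kingdom", "Canada", "India")

# Hypothetical lookup from Trends names to map names
recode <- c("United States" = "USA", "United Kingdom" = "UK")

# Apply the lookup where a mapping exists, otherwise keep the original name
harmonised <- unname(ifelse(trends_countries %in% names(recode),
                            recode[trends_countries], trends_countries))

# Names that still fail to match would be left unfilled on the map
setdiff(harmonised, map_regions)
```

Running the same setdiff() on the full geoMap$Country and world_map$region columns would show which other countries need an entry in the lookup.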

# Create the Data_science map using ggplot
# (geom_map takes its coordinates from `map`, so x and y are not set in aes)
ggplot() +
  geom_map(data = geoMap_joined, map = world_map,
           aes(map_id = region, fill = Data_science),
           color = "#ffffff", linewidth = 0.1) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  ggtitle("Data Science (%) by Country") +
  theme_void()

# Create the Data_analysis map using ggplot
ggplot() +
  geom_map(data = geoMap_joined, map = world_map,
           aes(map_id = region, fill = Data_analysis),
           color = "#ffffff", linewidth = 0.1) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  ggtitle("Data Analysis (%) by Country") +
  theme_void()

The maps reveal interesting insights into the geographic trends in the popularity of data analysis and data science. The United States and Canada show a high popularity for both terms, while Europe, especially the United Kingdom, Germany, and France, also shows a high popularity for data analysis. In contrast, data science is more popular in Asian countries such as Indonesia and China, with India preferring to search for data analysis. Overall, the maps highlight the global reach of these terms and the growing interest in data-driven decision making across various industries and regions.

Conclusion

After conducting our analysis of Google Trends data, we found that data science and data analysis are popular and growing fields of interest across the globe. Our examination of the temporal trends showed that data analysis was consistently more popular than data science until around mid-2017, after which data science almost caught up. Geographically, data analysis is more popular in European countries, particularly in the United Kingdom, Germany, and France, while data science is more popular in Asian countries such as Indonesia and China.

Furthermore, our investigation into the geographic trends in the popularity of these terms revealed that the United States and Canada show a high popularity for both terms, while Europe and Asia exhibit different preferences between data science and data analysis. These insights highlight the global reach of these terms and the growing interest in data-driven decision making across various industries and regions.

Moreover, our ARIMA forecast, which was fitted to the data analysis series, suggests that popularity is expected to continue its upward trend in the near future. Aspiring data professionals can expect these fields to be in high demand, especially in the United States and Canada where they are already quite popular.

Lastly, our analysis of seasonality showed that interest in these fields is influenced by seasonal factors, with a peak in the autumn months, particularly in September, and a dip during the summer months, with the lowest interest in August. This cyclical pattern could be attributed to factors such as the start of the academic year, with students returning to school and looking to improve their skills in these fields, and people taking vacations during the summer months.

Overall, our findings suggest that both data science and data analysis are important fields to consider for those looking to enter the data-driven job market. The popularity of these fields is likely to continue to grow in the coming years, and our insights into temporal trends, geographic trends, and seasonal trends can inform individuals and organizations as they make decisions about education, employment, and planning projects or campaigns related to these fields. By considering these factors, individuals and organizations can better anticipate and respond to changes in demand and stay ahead of the curve in these fast-evolving fields.

Report written by Paul Carmody. LinkedIn: https://www.linkedin.com/in/carmodypaul/