Exploratory Data Analysis for International Journals | PhD Insight
Key Points
Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.
EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.
In this article, I will explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, as I did for my article in an impact factor journal.
I will use a generated dataset for this tutorial that contains information about 1000 students from different countries, their academic performance, and their satisfaction with their university.
You will learn how to load and view the data in R, summarize the data using descriptive statistics, visualize the data using charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.
In the realm of data analysis and statistics, R has emerged as a powerful tool for students, researchers, and professionals alike. In this article, we will delve into the fascinating world of data analysis and visualization using R. Our journey will include generating a synthetic dataset, performing data visualization, identifying missing values and outliers, data transformation, and conducting hypothesis testing and correlation analysis. So, fasten your seatbelts and let’s embark on this data-driven adventure!
To begin our exploration, we’ll first create a synthetic dataset. Synthetic data allows us to simulate real-world scenarios, and in this case, we’ll generate data for 1000 students. The R code snippet below demonstrates how we can achieve this:
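A minimal sketch of such a generator could look like the following. The seed value, the sampling ranges, and the object name student_data are assumptions chosen to resemble the output shown later in this article; they are not necessarily the values that produced the results printed here.

```r
# Sketch of a synthetic student dataset. Seed and value ranges are assumptions.
set.seed(123)  # any fixed seed makes the random draws reproducible
n <- 1000

student_data <- data.frame(
  id           = 1:n,
  country      = sample(c("USA", "UK", "Canada", "China", "India", "Brazil"),
                        n, replace = TRUE),
  gender       = sample(c("Male", "Female"), n, replace = TRUE),
  age          = sample(18:25, n, replace = TRUE),
  major        = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                        n, replace = TRUE),
  gpa          = round(runif(n, min = 2, max = 4), 1),
  sat          = sample(seq(1000, 1600, by = 50), n, replace = TRUE),
  toefl        = sample(seq(80, 120, by = 5), n, replace = TRUE),
  ielts        = round(runif(n, min = 5, max = 9), 1),
  gre          = sample(seq(260, 340, by = 10), n, replace = TRUE),
  satisfaction = sample(1:5, n, replace = TRUE),
  stringsAsFactors = FALSE  # keep text columns as character, not factor
)
```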
In the code snippet above, we set the seed for reproducibility and create a dataset with various student attributes, including country of origin, gender, age, major, GPA, standardized test scores, and satisfaction level.
Before diving into data visualization, it’s crucial to understand the dataset’s structure. We can use R functions like names(), dim(), and str() to achieve this. Additionally, we can display the top five rows of the dataset using the head() function. This information provides us with an initial overview of the data:
## [1] "id" "country" "gender" "age" "major"
## [6] "gpa" "sat" "toefl" "ielts" "gre"
## [11] "satisfaction"
## [1] 1000 11
## 'data.frame': 1000 obs. of 11 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : chr "India" "Brazil" "USA" "UK" ...
## $ gender : chr "Female" "Male" "Female" "Male" ...
## $ age : int 18 24 18 19 19 18 20 18 19 20 ...
## $ major : chr "CS" "Eng" "CS" "Bio" ...
## $ gpa : num 2.6 3.7 3.2 3.6 2.6 2.3 3.8 2 3.1 3.9 ...
## $ sat : num 1300 1250 1350 1250 1400 1100 1250 1400 1350 1100 ...
## $ toefl : num 90 85 90 110 105 100 80 120 95 100 ...
## $ ielts : num 7.2 5.8 8.1 7.4 7.2 5.7 8.6 8.9 9 5.1 ...
## $ gre : num 260 280 320 330 340 280 290 270 260 310 ...
## $ satisfaction: int 2 5 1 2 3 4 3 5 1 3 ...
##   id country gender age major gpa  sat toefl ielts gre satisfaction
## 1  1   India Female  18    CS 2.6 1300    90   7.2 260            2
## 2  2  Brazil   Male  24   Eng 3.7 1250    85   5.8 280            5
## 3  3     USA Female  18    CS 3.2 1350    90   8.1 320            1
## 4  4      UK   Male  19   Bio 3.6 1250   110   7.4 330            2
## 5  5      UK   Male  19  Math 2.6 1400   105   7.2 340            3
These functions allow us to inspect the variable names, dataset dimensions, data structure, and initial data rows.
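The inspection calls described above can be sketched as follows. To keep the example self-contained, it builds a small data frame from the first five rows printed above, with only a few of the columns; the object name student_data is an assumption.

```r
# Small stand-in for the full dataset: first five rows, subset of columns.
student_data <- data.frame(
  id      = 1:5,
  country = c("India", "Brazil", "USA", "UK", "UK"),
  gender  = c("Female", "Male", "Female", "Male", "Male"),
  age     = c(18L, 24L, 18L, 19L, 19L),
  gpa     = c(2.6, 3.7, 3.2, 3.6, 2.6),
  stringsAsFactors = FALSE
)

names(student_data)    # column names
dim(student_data)      # number of rows and columns
str(student_data)      # type and first values of each column
head(student_data, 5)  # first five rows (head() shows six by default)
```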
Now, we enter the exciting realm of data visualization. Visualizing data is essential for gaining insights and identifying patterns. We’ll create various plots to explore the dataset, including bar charts, histograms, and box plots.
Let’s start with bar charts, which can help us visualize the distribution of categorical variables. We’ll create bar charts for the ‘country,’ ‘gender,’ and ‘major’ variables:
This chart provides an overview of the number of students from different countries.
The gender bar chart displays the gender distribution among the students.
This chart shows the distribution of students across various majors.
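A bar chart of this kind can be sketched with ggplot2 as follows. The toy data frame, the object names, and the axis labels are assumptions made for illustration; the same pattern applies to the full dataset.

```r
library(ggplot2)

# Toy categorical data standing in for the full dataset (an assumption).
set.seed(1)
toy <- data.frame(
  country = sample(c("USA", "UK", "India"), 60, replace = TRUE),
  gender  = sample(c("Male", "Female"), 60, replace = TRUE),
  major   = sample(c("CS", "Bio", "Math"), 60, replace = TRUE)
)

# geom_bar() counts the rows in each category by default.
p_country <- ggplot(toy, aes(x = country)) +
  geom_bar() +
  labs(title = "Students by country", x = "Country", y = "Count")

p_gender <- ggplot(toy, aes(x = gender)) +
  geom_bar() +
  labs(title = "Students by gender", x = "Gender", y = "Count")

p_major <- ggplot(toy, aes(x = major)) +
  geom_bar() +
  labs(title = "Students by major", x = "Major", y = "Count")

print(p_country)  # printing a ggplot object draws it
```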
Histograms are ideal for visualizing the distribution of continuous variables. We’ll create histograms for ‘age,’ ‘GPA,’ ‘SAT scores,’ ‘TOEFL scores,’ ‘IELTS scores,’ and ‘GRE scores’:
The age histogram illustrates the age distribution of students.
This histogram visualizes the distribution of student GPAs.
The SAT score histogram shows the distribution of students’ SAT scores.
This histogram presents the distribution of TOEFL scores among students.
The IELTS score histogram visualizes the distribution of IELTS scores.
This histogram illustrates the distribution of GRE scores among students.
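The histograms above follow one pattern, sketched here with ggplot2 for two of the variables. The toy data, bin widths, and colors are assumptions; a sensible bin width depends on the scale of each score.

```r
library(ggplot2)

# Toy numeric data standing in for the full dataset (an assumption).
set.seed(1)
toy <- data.frame(
  age = sample(18:25, 200, replace = TRUE),
  gpa = round(runif(200, min = 2, max = 4), 1)
)

# binwidth sets the width of each bar in the units of the variable.
p_age <- ggplot(toy, aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
  labs(title = "Age distribution", x = "Age", y = "Count")

p_gpa <- ggplot(toy, aes(x = gpa)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  labs(title = "GPA distribution", x = "GPA", y = "Count")

print(p_age)
print(p_gpa)
```

For the test-score variables, only the column name in aes() and the bin width would change.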
Box plots are excellent for comparing the distribution of a numerical variable across the levels of a categorical variable. We’ll create box plots for ‘age’ by ‘country’ and ‘satisfaction’ by ‘country’:
This box plot helps us compare the distribution of age among students from different countries.
The box plot of satisfaction levels lets us compare the median satisfaction and its spread across countries.
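Grouped box plots like these can be sketched with ggplot2 by mapping the categorical variable to the x axis and the numerical variable to the y axis. The toy data and object names below are assumptions.

```r
library(ggplot2)

# Toy data standing in for the full dataset (an assumption).
set.seed(1)
toy <- data.frame(
  country      = sample(c("USA", "UK", "India"), 150, replace = TRUE),
  age          = sample(18:25, 150, replace = TRUE),
  satisfaction = sample(1:5, 150, replace = TRUE)
)

# One box per country; the box spans the quartiles, the line marks the median.
p_age_country <- ggplot(toy, aes(x = country, y = age)) +
  geom_boxplot() +
  labs(title = "Age by country", x = "Country", y = "Age")

p_sat_country <- ggplot(toy, aes(x = country, y = satisfaction)) +
  geom_boxplot() +
  labs(title = "Satisfaction by country", x = "Country", y = "Satisfaction")

print(p_age_country)
print(p_sat_country)
```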
Data quality is paramount in any analysis. Identifying missing values and outliers is a crucial step. We can use R functions to check for missing values and identify outliers in numerical variables:
##           id      country       gender          age        major          gpa
##            0            0            0            0            0            0
##          sat        toefl        ielts          gre satisfaction
##            0            0            0            0            0
The output gives the count of missing values for each variable. Every count is zero, so the dataset has no missing values.
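One common way to obtain per-column counts like these is to combine is.na() with colSums(). The sketch below uses a toy data frame with one deliberately missing value; the object names are assumptions.

```r
# Toy data with one missing GPA value (an assumption for illustration).
toy <- data.frame(
  id  = 1:5,
  gpa = c(2.6, NA, 3.2, 3.6, 2.6),
  sat = c(1300, 1250, 1350, 1250, 1400)
)

# is.na() marks each missing cell as TRUE; colSums() adds them up per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```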
We’ll create a function called ‘identify_outliers’ to identify outliers in numerical variables such as ‘age,’ ‘GPA,’ ‘SAT,’ ‘TOEFL,’ ‘IELTS,’ ‘GRE,’ and ‘satisfaction.’ This function calculates the lower and upper bounds for outliers based on the interquartile range (IQR).
## [[1]]
## integer(0)
##
## [[2]]
## numeric(0)
##
## [[3]]
## numeric(0)
##
## [[4]]
## numeric(0)
##
## [[5]]
## numeric(0)
##
## [[6]]
## numeric(0)
##
## [[7]]
## integer(0)
This code checks for outliers in the specified numerical variables and returns the outlier values. Every element of the returned list is empty, so no values were flagged as outliers.
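An IQR-based check of this kind can be sketched as follows. The multiplier of 1.5 is the conventional choice and is an assumption here, as are the toy data and the exact shape of the function; the version used for the output above may differ.

```r
# Return the values of x that fall outside [Q1 - k * IQR, Q3 + k * IQR].
# k = 1.5 is the conventional multiplier (an assumption, not from the article).
identify_outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Toy data: one implausible age, no extreme GPA values.
toy <- data.frame(
  age = c(18, 19, 19, 20, 21, 45),
  gpa = c(2.6, 3.1, 3.2, 3.6, 3.7, 3.8)
)

# Apply the check to each numerical column; the result is a named list.
outliers <- lapply(toy[c("age", "gpa")], identify_outliers)
print(outliers)
```

Here only the age of 45 is flagged, while the gpa element comes back empty, like the elements of the list printed above.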
Data transformation is a fundamental step in data analysis. In this section, we’ll perform data transformation by creating a new variable and filtering the data:
We’ll calculate the average test score based on the SAT, TOEFL, IELTS, and GRE scores. This new variable gives a single composite measure of a student’s test performance, although, because the four tests are scored on very different scales, the unweighted average is dominated by the SAT.
To ensure data quality, we keep only the rows where the ‘test_score’ variable is not missing. This step eliminates incomplete data points.
We’ll select a subset of columns for further analysis, including ‘id,’ ‘country,’ ‘gender,’ ‘major,’ ‘gpa,’ ‘test_score,’ and ‘satisfaction.’
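These three steps map naturally onto dplyr’s mutate(), filter(), and select(). The sketch below uses four rows taken from the head output shown earlier, with one SAT score replaced by NA to show the filter at work; the name student_data1 follows the correlation output later in this article, and the rest is an assumption.

```r
library(dplyr)

# Four rows from the earlier head output; the last SAT score is set to NA
# here purely to demonstrate the filter step.
toy <- data.frame(
  id           = 1:4,
  country      = c("India", "Brazil", "USA", "UK"),
  gender       = c("Female", "Male", "Female", "Male"),
  major        = c("CS", "Eng", "CS", "Bio"),
  gpa          = c(2.6, 3.7, 3.2, 3.6),
  sat          = c(1300, 1250, 1350, NA),
  toefl        = c(90, 85, 90, 110),
  ielts        = c(7.2, 5.8, 8.1, 7.4),
  gre          = c(260, 280, 320, 330),
  satisfaction = c(2L, 5L, 1L, 2L)
)

student_data1 <- toy %>%
  mutate(test_score = (sat + toefl + ielts + gre) / 4) %>%  # unweighted mean
  filter(!is.na(test_score)) %>%                            # keep complete rows
  select(id, country, gender, major, gpa, test_score, satisfaction)

print(student_data1)
```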
Data grouping and summarization are essential for gaining insights into specific subgroups of the data. We’ll group the dataset by ‘country’ and ‘major’ and then summarize the data by calculating the mean and standard deviation of ‘gpa,’ ‘test_score,’ and ‘satisfaction’ for each group:
We calculate the mean and standard deviation for ‘gpa,’ ‘test_score,’ and ‘satisfaction’ within each group.
## # A tibble: 36 × 8
## # Groups:   country [6]
##    country major mean_gpa sd_gpa mean_test_score sd_test_score mean_satisfaction
##    <chr>   <chr>    <dbl>  <dbl>           <dbl>         <dbl>             <dbl>
##  1 Brazil  Art       2.87  0.587            482.          48.6              2.48
##  2 Brazil  Bio       2.91  0.538            466.          49.9              3.36
##  3 Brazil  CS        3.13  0.532            472.          49.3              2.62
##  4 Brazil  Econ      3.03  0.589            473.          48.3              2.76
##  5 Brazil  Eng       3.04  0.539            473.          52.7              2.45
##  6 Brazil  Math      2.89  0.622            457.          36.2              2.77
##  7 Canada  Art       2.98  0.555            483.          51.5              2.69
##  8 Canada  Bio       2.95  0.671            465.          44.5              2.96
##  9 Canada  CS        2.93  0.566            467.          51.7              3.21
## 10 Canada  Econ      3.17  0.592            467.          45.5              2.81
## # ℹ 26 more rows
## # ℹ 1 more variable: sd_satisfaction <dbl>
This summary provides insights into the academic performance and satisfaction levels of students grouped by country and major.
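A grouped summary with this shape can be sketched with dplyr’s group_by() and summarise(). The toy values and the object name summary_tbl are assumptions; the output column names match the tibble above.

```r
library(dplyr)

# Toy data: two countries, two majors, two students per combination.
toy <- data.frame(
  country      = rep(c("Brazil", "Canada"), each = 4),
  major        = rep(c("Art", "Bio"), each = 2, times = 2),
  gpa          = c(2.8, 3.0, 2.9, 3.1, 3.2, 2.6, 3.4, 3.0),
  test_score   = c(480, 470, 460, 455, 490, 465, 475, 450),
  satisfaction = c(2, 3, 4, 3, 1, 5, 3, 2)
)

summary_tbl <- toy %>%
  group_by(country, major) %>%
  summarise(
    mean_gpa          = mean(gpa),
    sd_gpa            = sd(gpa),
    mean_test_score   = mean(test_score),
    sd_test_score     = sd(test_score),
    mean_satisfaction = mean(satisfaction),
    sd_satisfaction   = sd(satisfaction),
    .groups = "drop_last"  # result stays grouped by country only
  )

print(summary_tbl)
```

With .groups = "drop_last" the result stays grouped by country, which is consistent with the Groups line in the output above.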
Hypothesis Testing and Correlation Analysis
Now, we move on to hypothesis testing and correlation analysis, crucial aspects of data analysis:
We’ll perform a t-test to determine if there is a significant difference in GPA between students from China and India.
##
## Welch Two Sample t-test
##
## data: gpa by country
## t = 0.31124, df = 397.49, p-value = 0.7558
## alternative hypothesis: true difference in means between group China and group India is not equal to 0
## 95 percent confidence interval:
## -0.09189383 0.12646282
## sample estimates:
## mean in group China mean in group India
## 3.007576 2.990291
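This test helps us understand whether GPA differs significantly between the two groups. Here it does not: with t = 0.31 and p = 0.76, well above the conventional 0.05 threshold, we cannot reject the null hypothesis of equal means, and the 95 percent confidence interval for the difference (about -0.09 to 0.13) includes zero.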
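A two-sample comparison like this can be sketched with base R’s t.test(), which runs the Welch version by default because it does not assume equal variances. The toy data and object names are assumptions.

```r
# Toy data: three countries, 30 students each, GPAs drawn at random.
set.seed(42)
toy <- data.frame(
  country = rep(c("China", "India", "USA"), each = 30),
  gpa     = round(runif(90, min = 2, max = 4), 1)
)

# Keep only the two groups being compared.
two_countries <- subset(toy, country %in% c("China", "India"))

# Formula interface: compare gpa between the two levels of country.
result <- t.test(gpa ~ country, data = two_countries)
print(result)

result$p.value   # p-value of the test
result$conf.int  # confidence interval for the difference in means
```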
We’ll use a correlation test to measure the correlation between GPA and the ‘test_score’ variable, which represents the average of SAT, TOEFL, IELTS, and GRE scores.
##
## Pearson's product-moment correlation
##
## data: student_data1$gpa and student_data1$test_score
## t = 0.21888, df = 998, p-value = 0.8268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05508843 0.06889181
## sample estimates:
## cor
## 0.006928313
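This correlation analysis assesses the linear relationship between GPA and test performance. The estimated correlation is about 0.007 with p = 0.83, and the 95 percent confidence interval (about -0.06 to 0.07) includes zero, so there is no evidence of a linear association in this synthetic dataset.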
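A Pearson correlation test of this form can be sketched with base R’s cor.test(). The toy data below are assumptions and are drawn independently, so any correlation found is due to chance.

```r
# Toy data: GPA and test_score generated independently of each other.
set.seed(42)
student_data1 <- data.frame(
  gpa        = round(runif(100, min = 2, max = 4), 1),
  test_score = rnorm(100, mean = 470, sd = 50)
)

# Pearson's product-moment correlation with a test of rho = 0.
result <- cor.test(student_data1$gpa, student_data1$test_score,
                   method = "pearson")
print(result)

result$estimate  # sample correlation coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
```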
In this extensive exploration of student data analysis and visualization using R, we’ve covered various aspects of data analysis, including data generation, data visualization with bar charts, histograms, and box plots, identifying missing values and outliers, data transformation, data summarization, hypothesis testing, and correlation analysis. R’s versatility and power make it an invaluable tool for data analysis and statistics.
We’ve taken a deep dive into a synthetic dataset, but the principles and techniques presented here can be applied to real-world data scenarios. Whether you’re a college student, a researcher, or a data enthusiast, R can empower you to extract meaningful insights and make informed decisions based on data. As you continue your journey in data analysis, remember that the world of data is vast and full of discoveries waiting to be made. Happy analyzing!