Exploratory Data Analysis for International Journals | PhD Insight

Key Points

  • Exploratory data analysis (EDA) is crucial in any data analysis project. It involves exploring, summarizing, and visualizing your data to gain insights, identify patterns, and detect outliers.

  • EDA can also help you formulate hypotheses, choose appropriate statistical tests, and communicate your findings effectively.

  • In this article, I will explain how I perform EDA in R using the tidyverse, a collection of packages for data manipulation, visualization, and modeling, and how I applied it in my article for an impact-factor journal.

  • I will use a generated dataset for this tutorial that contains information about 1000 students from different countries, their academic performance, and their satisfaction with their university.

  • You will learn how to load and view the data in R, summarize it with descriptive statistics, visualize it with charts and graphs, identify missing values and outliers, transform and filter the data, perform hypothesis testing and correlation analysis, and generate an EDA report using R Markdown.

Introduction

In the realm of data analysis and statistics, R has emerged as a powerful tool for students, researchers, and professionals alike. In this article, we will delve into the fascinating world of data analysis and visualization using R. Our journey will include generating a synthetic dataset, performing data visualization, identifying missing values and outliers, data transformation, and conducting hypothesis testing and correlation analysis. So, fasten your seatbelts and let’s embark on this data-driven adventure!

Generating a Synthetic Dataset

To begin our exploration, we’ll first create a synthetic dataset. Synthetic data allows us to simulate real-world scenarios, and in this case, we’ll generate data for 1000 students. The R code snippet below demonstrates how we can achieve this:
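A minimal sketch of such a generator follows. The seed value, the data frame name `student_data`, the category levels, and the value ranges are all assumptions inferred from the output shown later in this article, so the sketch will not reproduce those exact figures.

```r
# Fix the random seed so the same data are generated on every run.
set.seed(123)  # assumed seed value

n <- 1000  # number of students

# Category levels and value ranges are assumptions inferred from the
# printed output in this article, not the original generator.
student_data <- data.frame(
  id           = 1:n,
  country      = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                        n, replace = TRUE),
  gender       = sample(c("Female", "Male"), n, replace = TRUE),
  age          = sample(18:24, n, replace = TRUE),
  major        = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                        n, replace = TRUE),
  gpa          = round(runif(n, min = 2, max = 4), 1),
  sat          = sample(seq(1000, 1600, by = 50), n, replace = TRUE),
  toefl        = sample(seq(80, 120, by = 5), n, replace = TRUE),
  ielts        = round(runif(n, min = 5, max = 9), 1),
  gre          = sample(seq(260, 340, by = 10), n, replace = TRUE),
  satisfaction = sample(1:5, n, replace = TRUE),
  stringsAsFactors = FALSE
)
```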

In the code snippet above, we set the seed for reproducibility and create a dataset with various student attributes, including country of origin, gender, age, major, GPA, standardized test scores, and satisfaction level.

Analyzing the Dataset

Before diving into data visualization, it’s crucial to understand the dataset’s structure. We can use R functions like names, dim, and str to achieve this. Additionally, we can display the top five rows of the dataset using the head function. This information provides us with an initial overview of the data:
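The calls could look like the sketch below. To keep it self-contained, it uses a five-row stand-in built from some of the columns printed in this article; the name `student_data` is an assumption.

```r
# Stand-in: the first five rows shown in this article, with a subset of columns.
student_data <- data.frame(
  id      = 1:5,
  country = c("India", "Brazil", "USA", "UK", "UK"),
  gender  = c("Female", "Male", "Female", "Male", "Male"),
  age     = c(18L, 24L, 18L, 19L, 19L),
  gpa     = c(2.6, 3.7, 3.2, 3.6, 2.6)
)

names(student_data)    # variable names
dim(student_data)      # number of rows, then number of columns
str(student_data)      # type and first values of each variable
head(student_data, 5)  # first five rows (head() returns six by default)
```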

##  [1] "id"           "country"      "gender"       "age"          "major"       
##  [6] "gpa"          "sat"          "toefl"        "ielts"        "gre"         
## [11] "satisfaction"
## [1] 1000   11
## 'data.frame':    1000 obs. of  11 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country     : chr  "India" "Brazil" "USA" "UK" ...
##  $ gender      : chr  "Female" "Male" "Female" "Male" ...
##  $ age         : int  18 24 18 19 19 18 20 18 19 20 ...
##  $ major       : chr  "CS" "Eng" "CS" "Bio" ...
##  $ gpa         : num  2.6 3.7 3.2 3.6 2.6 2.3 3.8 2 3.1 3.9 ...
##  $ sat         : num  1300 1250 1350 1250 1400 1100 1250 1400 1350 1100 ...
##  $ toefl       : num  90 85 90 110 105 100 80 120 95 100 ...
##  $ ielts       : num  7.2 5.8 8.1 7.4 7.2 5.7 8.6 8.9 9 5.1 ...
##  $ gre         : num  260 280 320 330 340 280 290 270 260 310 ...
##  $ satisfaction: int  2 5 1 2 3 4 3 5 1 3 ...
##   id country gender age major gpa  sat toefl ielts gre satisfaction
## 1  1   India Female  18    CS 2.6 1300    90   7.2 260            2
## 2  2  Brazil   Male  24   Eng 3.7 1250    85   5.8 280            5
## 3  3     USA Female  18    CS 3.2 1350    90   8.1 320            1
## 4  4      UK   Male  19   Bio 3.6 1250   110   7.4 330            2
## 5  5      UK   Male  19  Math 2.6 1400   105   7.2 340            3

These functions allow us to inspect the variable names, dataset dimensions, data structure, and initial data rows.

Data Visualization

Now, we enter the exciting realm of data visualization. Visualizing data is essential for gaining insights and identifying patterns. We’ll create various plots to explore the dataset, including bar charts, histograms, and box plots.

Bar Charts

Let’s start with bar charts, which can help us visualize the distribution of categorical variables. We’ll create bar charts for the ‘country,’ ‘gender,’ and ‘major’ variables:

1. Bar chart of country:

This chart provides an overview of the number of students from different countries.

2. Bar chart of gender:

The gender bar chart displays the gender distribution among the students.

3. Bar chart of major:

This chart shows the distribution of students across various majors.
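The three bar charts can be sketched with ggplot2 as follows. The stand-in data, the object names, and the styling choices (fill colour, titles) are assumptions made for illustration.

```r
library(ggplot2)

# Stand-in data with the three categorical variables (assumed levels).
set.seed(1)  # assumed seed
n <- 200
student_data <- data.frame(
  country = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                   n, replace = TRUE),
  gender  = sample(c("Female", "Male"), n, replace = TRUE),
  major   = sample(c("Art", "Bio", "CS", "Econ", "Eng", "Math"),
                   n, replace = TRUE)
)

# geom_bar() counts the rows in each category and draws one bar per level.
p_country <- ggplot(student_data, aes(x = country)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Students by country", x = "Country", y = "Count")

p_gender <- ggplot(student_data, aes(x = gender)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Students by gender", x = "Gender", y = "Count")

p_major <- ggplot(student_data, aes(x = major)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Students by major", x = "Major", y = "Count")

print(p_country)
print(p_gender)
print(p_major)
```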

Histograms

Histograms are ideal for visualizing the distribution of continuous variables. We’ll create histograms for ‘age,’ ‘GPA,’ ‘SAT scores,’ ‘TOEFL scores,’ ‘IELTS scores,’ and ‘GRE scores’:

4. Histogram of age:

The age histogram illustrates the age distribution of students.

5. Histogram of GPA:

This histogram visualizes the distribution of student GPAs.

6. Histogram of SAT scores:

The SAT score histogram shows the distribution of students’ SAT scores.

7. Histogram of TOEFL scores:

This histogram presents the distribution of TOEFL scores among students.

8. Histogram of IELTS scores:

The IELTS score histogram visualizes the distribution of IELTS scores.

9. Histogram of GRE scores:

This histogram illustrates the distribution of GRE scores among students.
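Because the six histograms differ only in the variable and the bin width, one way to sketch them is with a small helper. The helper `plot_hist`, the stand-in data, and the bin widths are hypothetical choices, not the original code.

```r
library(ggplot2)

# Stand-in data with three of the numeric variables (assumed ranges).
set.seed(1)  # assumed seed
n <- 200
student_data <- data.frame(
  age = sample(18:24, n, replace = TRUE),
  gpa = round(runif(n, min = 2, max = 4), 1),
  sat = sample(seq(1000, 1600, by = 50), n, replace = TRUE)
)

# Hypothetical helper: histogram of one numeric column, chosen by name.
plot_hist <- function(data, var, binwidth) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(binwidth = binwidth, fill = "steelblue", colour = "white") +
    labs(title = paste("Distribution of", var), x = var, y = "Count")
}

p_age <- plot_hist(student_data, "age", binwidth = 1)
p_gpa <- plot_hist(student_data, "gpa", binwidth = 0.1)
p_sat <- plot_hist(student_data, "sat", binwidth = 50)

print(p_age)
print(p_gpa)
print(p_sat)
```

The same helper would apply to the TOEFL, IELTS, and GRE columns with a bin width suited to each scale.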

Box Plots

Box plots are well suited to comparing the distribution of a numeric variable across the levels of a categorical variable. We’ll create box plots for ‘age’ by ‘country’ and ‘satisfaction’ by ‘country’:

10. Box plot of age by country:

This box plot helps us compare the distribution of age among students from different countries.

11. Box plot of satisfaction by country:

This box plot compares the median and spread of satisfaction levels across countries.
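Both box plots can be sketched with `geom_boxplot()`. The stand-in data and object names below are assumptions.

```r
library(ggplot2)

# Stand-in data (assumed levels and ranges).
set.seed(1)  # assumed seed
n <- 200
student_data <- data.frame(
  country      = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                        n, replace = TRUE),
  age          = sample(18:24, n, replace = TRUE),
  satisfaction = sample(1:5, n, replace = TRUE)
)

# One box per country: the box spans the quartiles, the line is the median.
p_age_country <- ggplot(student_data, aes(x = country, y = age)) +
  geom_boxplot() +
  labs(title = "Age by country", x = "Country", y = "Age")

p_sat_country <- ggplot(student_data, aes(x = country, y = satisfaction)) +
  geom_boxplot() +
  labs(title = "Satisfaction by country", x = "Country", y = "Satisfaction")

print(p_age_country)
print(p_sat_country)
```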

Identifying Missing Values and Outliers

Data quality is paramount in any analysis. Identifying missing values and outliers is a crucial step. We can use R functions to check for missing values and identify outliers in numerical variables:

12. Checking the number of missing values for each variable:
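One common way to get these counts is `colSums(is.na(...))`; whether the original code used exactly this call is an assumption. The stand-in data below contains one deliberately inserted `NA` so that the result is not all zeros.

```r
# Stand-in data; one GPA is set to NA on purpose for illustration.
student_data <- data.frame(
  id  = 1:5,
  gpa = c(2.6, 3.7, NA, 3.6, 2.6),
  sat = c(1300, 1250, 1350, 1250, 1400)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
missing_counts <- colSums(is.na(student_data))
print(missing_counts)
```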

##           id      country       gender          age        major          gpa 
##            0            0            0            0            0            0 
##          sat        toefl        ielts          gre satisfaction 
##            0            0            0            0            0

This code snippet provides the count of missing values for each variable. Every count is zero, so the dataset has no missing values.

13. Identifying outliers using the IQR method:

We’ll create a function called ‘identify_outliers’ to identify outliers in numerical variables such as ‘age,’ ‘GPA,’ ‘SAT,’ ‘TOEFL,’ ‘IELTS,’ ‘GRE,’ and ‘satisfaction.’ This function calculates the lower and upper bounds for outliers based on the interquartile range (IQR).

14. Check for outliers in each numerical variable:
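A minimal sketch of the function and the check follows. It assumes the conventional 1.5 × IQR fences, which the article does not state explicitly, and uses a small stand-in data frame in which one age is deliberately extreme.

```r
# Return the values of x that fall outside the IQR fences.
# The 1.5 multiplier is the conventional choice and is an assumption here.
identify_outliers <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Stand-in data; the age of 45 is inserted on purpose as an outlier.
student_data <- data.frame(
  age = c(18, 19, 19, 20, 20, 21, 45),
  gpa = c(2.6, 3.7, 3.2, 3.6, 2.6, 2.3, 3.8)
)

# Apply the function to each numeric column; the result is a list
# with one vector of outlier values per variable.
outliers <- lapply(student_data[c("age", "gpa")], identify_outliers)
print(outliers)
```

An empty vector such as `numeric(0)` in the result means that no value of that variable lies outside the fences.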

## [[1]]
## integer(0)
## 
## [[2]]
## numeric(0)
## 
## [[3]]
## numeric(0)
## 
## [[4]]
## numeric(0)
## 
## [[5]]
## numeric(0)
## 
## [[6]]
## numeric(0)
## 
## [[7]]
## integer(0)

This code checks for outliers in the specified numerical variables and returns the outlier values. Every element of the result is empty, so no outliers were detected in any of the seven variables.

Data Transformation

Data transformation is a fundamental step in data analysis. In this section, we’ll perform data transformation by creating a new variable and filtering the data:

15. Creating a new variable called ‘test_score’:

We’ll calculate the average test score from the SAT, TOEFL, IELTS, and GRE scores, which gives a single summary of a student’s test performance. Because the four tests are reported on very different scales, this unweighted average is dominated by the SAT score.

16. Filtering the rows where ‘test_score’ is not missing:

To ensure data quality, we keep only the rows where the ‘test_score’ variable is not missing. This step eliminates incomplete data points.

17. Selecting specific columns for analysis:

We’ll select a subset of columns for further analysis, including ‘id,’ ‘country,’ ‘gender,’ ‘major,’ ‘gpa,’ ‘test_score,’ and ‘satisfaction.’
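Steps 15 to 17 can be sketched as a single dplyr pipeline. The input below is the five rows printed earlier in this article; the input name `student_data` is an assumption, while `student_data1` is the name that appears in the correlation output.

```r
library(dplyr)

# The first five rows shown earlier in this article.
student_data <- data.frame(
  id           = 1:5,
  country      = c("India", "Brazil", "USA", "UK", "UK"),
  gender       = c("Female", "Male", "Female", "Male", "Male"),
  age          = c(18L, 24L, 18L, 19L, 19L),
  major        = c("CS", "Eng", "CS", "Bio", "Math"),
  gpa          = c(2.6, 3.7, 3.2, 3.6, 2.6),
  sat          = c(1300, 1250, 1350, 1250, 1400),
  toefl        = c(90, 85, 90, 110, 105),
  ielts        = c(7.2, 5.8, 8.1, 7.4, 7.2),
  gre          = c(260, 280, 320, 330, 340),
  satisfaction = c(2L, 5L, 1L, 2L, 3L)
)

student_data1 <- student_data %>%
  # 15. Unweighted average of the four test scores.
  mutate(test_score = (sat + toefl + ielts + gre) / 4) %>%
  # 16. Keep only rows where test_score could be computed.
  filter(!is.na(test_score)) %>%
  # 17. Keep the columns used in the rest of the analysis.
  select(id, country, gender, major, gpa, test_score, satisfaction)

print(student_data1)
```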

Grouping and Summarizing Data

Data grouping and summarization are essential for gaining insights into specific subgroups of the data. We’ll group the dataset by ‘country’ and ‘major’ and then summarize the data by calculating the mean and standard deviation of ‘gpa,’ ‘test_score,’ and ‘satisfaction’ for each group:

18. Grouping the dataset by ‘country’ and ‘major’:

19. Summarizing the dataset:

We calculate the mean and standard deviation for ‘gpa,’ ‘test_score,’ and ‘satisfaction’ within each group.
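A sketch of the grouping and summarizing steps with dplyr follows. The summary column names match the output in this article; the stand-in data and the object name `summary_data` are assumptions, so the numbers will differ.

```r
library(dplyr)

# Stand-in data with fewer countries and majors (assumed values).
set.seed(1)  # assumed seed
n <- 300
student_data1 <- data.frame(
  country      = sample(c("China", "India", "USA"), n, replace = TRUE),
  major        = sample(c("CS", "Math"), n, replace = TRUE),
  gpa          = round(runif(n, min = 2, max = 4), 1),
  test_score   = runif(n, min = 350, max = 550),
  satisfaction = sample(1:5, n, replace = TRUE)
)

summary_data <- student_data1 %>%
  # 18. One group per country-major combination.
  group_by(country, major) %>%
  # 19. Mean and standard deviation within each group.
  summarise(
    mean_gpa          = mean(gpa),
    sd_gpa            = sd(gpa),
    mean_test_score   = mean(test_score),
    sd_test_score     = sd(test_score),
    mean_satisfaction = mean(satisfaction),
    sd_satisfaction   = sd(satisfaction)
  )

print(summary_data)
```

By default `summarise()` drops the last grouping variable, which is why the result remains grouped by country only.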

## # A tibble: 36 × 8
## # Groups:   country [6]
##    country major mean_gpa sd_gpa mean_test_score sd_test_score mean_satisfaction
##    <chr>   <chr>    <dbl>  <dbl>           <dbl>         <dbl>             <dbl>
##  1 Brazil  Art       2.87  0.587            482.          48.6              2.48
##  2 Brazil  Bio       2.91  0.538            466.          49.9              3.36
##  3 Brazil  CS        3.13  0.532            472.          49.3              2.62
##  4 Brazil  Econ      3.03  0.589            473.          48.3              2.76
##  5 Brazil  Eng       3.04  0.539            473.          52.7              2.45
##  6 Brazil  Math      2.89  0.622            457.          36.2              2.77
##  7 Canada  Art       2.98  0.555            483.          51.5              2.69
##  8 Canada  Bio       2.95  0.671            465.          44.5              2.96
##  9 Canada  CS        2.93  0.566            467.          51.7              3.21
## 10 Canada  Econ      3.17  0.592            467.          45.5              2.81
## # ℹ 26 more rows
## # ℹ 1 more variable: sd_satisfaction <dbl>

This summary provides insights into the academic performance and satisfaction levels of students grouped by country and major.

Hypothesis Testing and Correlation Analysis

Now, we move on to hypothesis testing and correlation analysis, crucial aspects of data analysis:

20. T-test to compare the mean GPA of students from China and India:

We’ll perform a t-test to determine if there is a significant difference in GPA between students from China and India.
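The line `data: gpa by country` in the output points to the formula interface of `t.test()`, which runs Welch’s test by default. A sketch under that assumption, with stand-in data and the hypothetical object names `china_india` and `t_result`:

```r
# Stand-in data (assumed values), so the statistics will differ from the article.
set.seed(1)  # assumed seed
n <- 600
student_data1 <- data.frame(
  country = sample(c("Brazil", "Canada", "China", "India", "UK", "USA"),
                   n, replace = TRUE),
  gpa     = round(runif(n, min = 2, max = 4), 1)
)

# Keep only the two countries being compared.
china_india <- subset(student_data1, country %in% c("China", "India"))

# Welch two-sample t-test of GPA between the two groups
# (var.equal = FALSE is the default).
t_result <- t.test(gpa ~ country, data = china_india)
print(t_result)
```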

## 
##  Welch Two Sample t-test
## 
## data:  gpa by country
## t = 0.31124, df = 397.49, p-value = 0.7558
## alternative hypothesis: true difference in means between group China and group India is not equal to 0
## 95 percent confidence interval:
##  -0.09189383  0.12646282
## sample estimates:
## mean in group China mean in group India 
##            3.007576            2.990291

The p-value of 0.7558 is well above 0.05 and the 95 percent confidence interval (-0.09 to 0.13) includes zero, so the test provides no evidence of a difference in mean GPA between students from China and India.

21. Correlation analysis between GPA and test scores:

We’ll use a correlation test to measure the correlation between GPA and the ‘test_score’ variable, which represents the average of SAT, TOEFL, IELTS, and GRE scores.
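The `data:` line of the output shows the two vectors passed to the test, so the call can be sketched with `cor.test()`, which computes Pearson’s correlation by default. The stand-in data and the object name `cor_result` are assumptions.

```r
# Stand-in data (assumed values), so the statistics will differ from the article.
set.seed(1)  # assumed seed
n <- 500
student_data1 <- data.frame(
  gpa        = round(runif(n, min = 2, max = 4), 1),
  test_score = runif(n, min = 350, max = 550)
)

# Pearson correlation with a test of the null hypothesis that it equals 0.
cor_result <- cor.test(student_data1$gpa, student_data1$test_score)
print(cor_result)
```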

## 
##  Pearson's product-moment correlation
## 
## data:  student_data1$gpa and student_data1$test_score
## t = 0.21888, df = 998, p-value = 0.8268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05508843  0.06889181
## sample estimates:
##         cor 
## 0.006928313

With a correlation of about 0.007 and a p-value of 0.8268, the analysis shows no evidence of a linear relationship between GPA and test performance in this dataset.

Conclusion

In this extensive exploration of student data analysis and visualization using R, we’ve covered various aspects of data analysis, including data generation, data visualization with bar charts, histograms, and box plots, identifying missing values and outliers, data transformation, data summarization, hypothesis testing, and correlation analysis. R’s versatility and power make it an invaluable tool for data analysis and statistics.

We’ve taken a deep dive into a synthetic dataset, but the principles and techniques presented here can be applied to real-world data scenarios. Whether you’re a college student, a researcher, or a data enthusiast, R can empower you to extract meaningful insights and make informed decisions based on data. As you continue your journey in data analysis, remember that the world of data is vast and full of discoveries waiting to be made. Happy analyzing!