Exploratory Data Analysis

Jiratchaya Nuanpirom

EDA: Why?

Exploratory Data Analysis (EDA) is a way to get to know your data, and it can be an iterative cycle once you have cleaned the data. According to Wickham, Çetinkaya-Rundel, and Grolemund (2023), the aims of EDA are:

  • Generate questions about your data

  • Search for answers by visualizing, transforming, and modelling your data

  • Use what you learn to refine your questions and/or generate new questions.

  • This is important even if you already have research questions in mind.

  • Choose suitable methods for data cleaning.

10 Simple Steps for EDA in R

Adapted from Peng (2020), here are ten quick steps for exploratory data analysis in R.

  1. Formulate your question

  2. Read in your data

  3. Check the packaging

  4. Glimpse

  5. Look at the top and the bottom of the data

  6. Check for the n

  7. Validate with at least one external data source

  8. Try the easy solution first

  9. Challenge your solution

  10. Follow up

Activity

For today’s EDA exercise, we will examine mortality rates from infectious and parasitic diseases. The data is adapted from the World Health Organization (WHO) Mortality Database and was collected from 1950 to 2021 in many countries around the world.

Step 1: Formulate your question

A sharp question or hypothesis narrows down the possible ways to answer it during the EDA process.

 

Can you formulate the questions that you want this data to answer?

 

Tip

The most important question you can answer with the EDA process is “Do I have the right data to answer this question?”. Although this question is difficult to answer at the beginning, it is a good place to start your EDA!

Step 2: Load your data

Now we will load the data set, which is in CSV format, into the R environment using the read_csv() function as follows.

 

# Load libraries
library(tidyverse)
# Retrieve data set
mortality_rate <- read_csv(file = "https://raw.githubusercontent.com/JirathNuan/r-handviz-workshop/main/datasets/WHOMortalityDatabase_Infectious-and-parasitic-diseases_2023-05-03.csv", comment = "#")

 

Things to check after loading:

  • Was the data frame object mortality_rate loaded correctly?

  • Were there any errors or warnings during loading?

  • In RStudio, how many observations (rows) and variables (columns) does the Environment panel show for the data set?
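One way to answer the first two questions programmatically is to ask readr what went wrong and which column types it used. The sketch below uses a small made-up CSV (toy_csv and its values are hypothetical, not the WHO file) so that it runs without a network connection; the same problems() and spec() calls can be applied to mortality_rate.

```r
library(readr)

# Hypothetical toy CSV standing in for the WHO file. The leading "#" line
# mimics the metadata lines that comment = "#" tells read_csv() to skip.
toy_csv <- I("# exported 2023-05-03
country_name,year,number
Thailand,2019,120
Japan,2019,abc
")

# Declare the expected column types so that bad values are reported
toy <- read_csv(
  toy_csv,
  comment = "#",
  col_types = cols(
    country_name = col_character(),
    year = col_integer(),
    number = col_double()
  )
)

# problems() lists every value that could not be parsed as the declared type
parse_issues <- problems(toy)
parse_issues

# spec() shows the column types readr used for this data frame
spec(toy)
```

Here the value "abc" cannot be read as a number, so it becomes NA and appears as one row in parse_issues. An empty result from problems() suggests the file was parsed without issues.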

Step 3: Check the data dimension

  • We can check the number of rows and columns of the data frame using dim() as follows.
# See how many rows and columns in the data
dim(mortality_rate)

 

Then see the column names with colnames()

# see column names
colnames(mortality_rate)

Step 4: Glimpse the data

As mentioned in the R Basic Programming lecture, we can use the glimpse() function to see each column in the data frame, along with its type and its first few values, as follows.

 

# Glimpse data frame
glimpse(mortality_rate)

 

As we can see from the output, glimpse() shows how many rows and columns there are, just like the dim() function. In addition, glimpse() shows the type of each column, so we can check that the types were assigned correctly during loading.

Step 5: Look at the top and the bottom of the data

Two other useful functions are head() and tail(), which show the beginning and the end of the data frame, respectively.

 

# Show the first 10 lines of the data frame
head(mortality_rate, n = 10)
# Show the last 10 lines of the data frame
tail(mortality_rate, n = 10)

 

With these functions, we can check:

  • Was the data loaded correctly?

  • Were the column names defined correctly?

  • Were column names missing from the beginning of the file?

  • Was the data loaded completely, from the first to the last line of the file?

Step 6: Check for the n

  • In this step, we count the things we want to examine (n).

  • In general, counting things is a good way to find out if something went wrong after you loaded the data, for example whether duplicate values or NA values are present.

# Counting number of countries in the data set
count_countries <- length(unique(mortality_rate$country_name))
count_countries

# For each country, how many data points 
tbl_countries <- table(mortality_rate$country_name)
tbl_countries

# How many countries in Asia 
mortality_rate %>% 
  filter(region_name == "Asia") %>% 
  distinct(country_name) %>% 
  count()

# From 2000 to 2021, how many data points (rows) are available for Asia in each year? Note that count() counts rows here, not countries.
mortality_rate %>% 
  filter(region_name == "Asia" &
           year >= 2000 & 
           year <= 2021) %>% 
  group_by(year) %>% 
  count()

# Building on the count above, adding arrange() shows which year has the most data points (not the highest mortality rate)
mortality_rate %>% 
  filter(region_name == "Asia" &
           year >= 2000 & 
           year <= 2021) %>% 
  group_by(year) %>% 
  count() %>% 
  arrange(desc(n))
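The counts above are counts of rows. If the question is how many countries reported in each year, or whether duplicates and NA values are present, a sketch like the following may help. It uses a small made-up tibble (toy and its values are hypothetical) with the same column names as mortality_rate, and assumes that a country can contribute several rows per year, for example one per sex.

```r
library(dplyr)

# Hypothetical toy data with the same column names as mortality_rate
toy <- tibble(
  country_name = c("Thailand", "Thailand", "Japan", "Japan", "Japan"),
  year         = c(2019, 2019, 2019, 2020, 2020),
  sex          = c("Male", "Female", "Male", "Male", "Male"),
  number       = c(10, 12, NA, 7, 7)
)

# Distinct countries per year (rather than rows per year)
countries_per_year <- toy %>%
  group_by(year) %>%
  summarise(n_countries = n_distinct(country_name))
countries_per_year

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))
n_duplicates
```

Replacing toy with mortality_rate applies the same three checks to the workshop data.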

Step 7: Validate with at least one external data source

  • It is very important to make sure that your data matches something outside the data set. For example, you can cross-check it against mortality databases from the health agencies of the countries you are interested in. This way you can make sure that the measurements are roughly what they should be, and it also serves as a check on what else in your data set might be wrong.

  • Within our own data, we can quickly check the distribution of values with functions such as quantile() and summary(), as follows.

# Summarize the distribution of the death rate per 100,000 population
summary(mortality_rate$death_rate_per_100k)
# Show the quantiles of the death rate
quantile(mortality_rate$death_rate_per_100k, na.rm = TRUE)

  • With summary(), you can even summarize every column in the data frame at once, as follows.

# Summarize data frame
summary(mortality_rate)
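Once an external source is at hand, the comparison itself can be a simple join. The sketch below is only an illustration: both tables are made up, the external table and its ref_rate_per_100k column are hypothetical, and the 20% tolerance is an arbitrary choice rather than a recommended threshold.

```r
library(dplyr)

# Hypothetical extract of our data
ours <- tibble(
  country_name = c("A", "B", "C"),
  year = 2019,
  death_rate_per_100k = c(10.0, 25.0, 40.0)
)

# Hypothetical reference values from an external source
external <- tibble(
  country_name = c("A", "B"),
  year = 2019,
  ref_rate_per_100k = c(10.5, 50.0)
)

# Join on country and year, then flag rows that disagree by more than 20%
# or that have no external value to compare against
comparison <- ours %>%
  left_join(external, by = c("country_name", "year")) %>%
  mutate(
    rel_diff = abs(death_rate_per_100k - ref_rate_per_100k) / ref_rate_per_100k,
    flag = is.na(rel_diff) | rel_diff > 0.2
  )
comparison
```

In this toy example, country A agrees with the reference, country B differs by half, and country C has no reference value, so B and C are flagged for a closer look.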

Step 8: Try the easy solution first

  • Suppose the question we want to answer is:

Over the data collection period (1950 to 2021), which region has the highest number of deaths?

To answer the question, we need to group the data by region and examine which region has the highest number of deaths.

# Examining the number of deaths by region
mortality_rate %>% 
  filter(!is.na(number)) %>% 
  group_by(region_name) %>% 
  summarise(total_death_per_region = sum(number)) %>% 
  ggplot(aes(x = region_name, y = total_death_per_region)) +
  geom_col()

Does the number of deaths differ between sexes in each region?

# Examine the number of deaths by region and sex
mortality_rate %>%
  filter(!is.na(number)) %>% 
  group_by(region_name, sex) %>%
  summarise(total_death_per_region = sum(number)) %>% 
  ggplot(aes(x = region_name, y = total_death_per_region, fill = sex)) +
  geom_col(position = position_dodge(0.9, preserve = "single"))

Step 9: Challenge your solution

  • Trying the easy solution first is good, because it is a fast way to answer the question. But it is always a good idea to challenge the results, especially if they fit your expectations.

  • Even if our simple solution works well, some obstacles remain that we need to deal with. For example:

    • Was the data collected every year in each country?

    • With the previous solution, how should we handle NA values or records with unknown sex? Should they be excluded?

    • Do you observe any unusual values in your data? Why did this happen?
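The first question above, whether data was collected every year in each country, can be checked by comparing the number of distinct years per country with the span between its first and last year. This is a minimal sketch on made-up data (toy is hypothetical); the column names country_name and year are those of mortality_rate.

```r
library(dplyr)

# Hypothetical toy data: country B has no record for 2020
toy <- tibble(
  country_name = c("A", "A", "A", "B", "B"),
  year         = c(2019, 2020, 2021, 2019, 2021)
)

# For each country, compare the years present with the years spanned
coverage <- toy %>%
  group_by(country_name) %>%
  summarise(
    first_year = min(year),
    last_year  = max(year),
    n_years    = n_distinct(year)
  ) %>%
  mutate(n_missing = (last_year - first_year + 1) - n_years)
coverage
```

A value of n_missing greater than zero means the country has gaps between its first and last reported year.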

The following example looks for unusual values (outliers) in the number of deaths.

# Examine the number of deaths by region over the last 10 years (2012 - 2021)
mortality_rate %>% 
  filter(year >= 2012) %>% 
  ggplot(aes(x = as.character(year), 
             y = number)) +
  geom_boxplot() +
  facet_wrap(. ~ region_name, scales = "free_y")

 

❓ Are there any unusual values in the box plot? If so, how can we examine them more closely?
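One possible way to examine them more closely is to pull out the rows that a box plot would draw as individual points. The sketch below applies the common 1.5 × IQR rule within each region on a small made-up tibble (toy and its values are hypothetical); it is one convention for flagging outliers, not the only one.

```r
library(dplyr)

# Hypothetical toy data with one obviously unusual value in Asia
toy <- tibble(
  region_name = rep(c("Asia", "Europe"), each = 6),
  number      = c(10, 12, 11, 13, 12, 500, 20, 22, 21, 19, 23, 20)
)

# Flag values that lie more than 1.5 * IQR outside the quartiles of their region
outliers <- toy %>%
  filter(!is.na(number)) %>%
  group_by(region_name) %>%
  mutate(
    q1  = quantile(number, 0.25),
    q3  = quantile(number, 0.75),
    iqr = q3 - q1,
    is_outlier = number < q1 - 1.5 * iqr | number > q3 + 1.5 * iqr
  ) %>%
  ungroup() %>%
  filter(is_outlier)
outliers
```

With mortality_rate, keeping columns such as country_name and year in the result shows which records are behind the unusual values.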

Step 10: Follow up questions

We have now formulated questions about our data, tried a simple solution, and challenged that solution.

 

At this point, it makes sense to ask a few follow-up questions:

  • Do you have the right data?

  • Do you need other data?

  • Do you have the right question?

 

Exploratory data analysis is designed to get you thinking about your data and your question. At this point, we can refine our question or collect relevant data, over and over to get to the truth.

Wicked fast and convenient EDA package: GGally

We can use the ggpairs() function from the GGally package (pronounced: g-g-ally) to get a general overview of our data. Instead of plotting the relationships between variables one pair at a time, ggpairs() lets us explore the initial relationships and each variable in more detail in a single figure.

# Load GGally package 
# We asked the participants to install before the workshop
library(GGally)
# remove some columns and tidy data
mortality_rate_2 <- mortality_rate %>% 
  select(-c(country_name, age_std_death_rate_per_100k,
            age_group, number)) %>% 
  filter(year >= 2012)
mortality_rate_2$year <- as.character(mortality_rate_2$year)
# Explore relationship in our data, colored by region name
p_mortality_overview <- ggpairs(mortality_rate_2, 
                                aes(colour = region_name),
                                progress = TRUE,
                                cardinality_threshold = 20)
# Export plot to file
ggsave("mortality_rate_overview.png", p_mortality_overview,
       width = 20, height = 10, units = "in", dpi = 300, scale = 0.8)

Practice

We’ll apply the EDA approach to explore the characteristics of the data set from Ghazalpour et al. (2006). The data set describes several physiological quantitative traits of female mice of a specific F2 intercross.

First, load the data into the R environment:

liverMice <- read_csv("https://github.com/JirathNuan/r-handviz-workshop/raw/main/datasets/mouseLiver_data_ClinicalTraits.csv")

📓 First, let’s explore the overall aspects of the data using ggpairs() from the GGally library. Are there any interesting variables to dig deeper into?

📓 Given the data set, what type of visualization would be appropriate, and why?

References

Ghazalpour, Anatole, Sudheer Doss, Bin Zhang, Susanna Wang, Christopher Plaisier, Ruth Castellanos, Alec Brozell, et al. 2006. “Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight.” Edited by Greg Gibson. PLoS Genetics 2 (8): e130. https://doi.org/10.1371/journal.pgen.0020130.
Peng, Roger D. 2020. Exploratory Data Analysis with R. https://bookdown.org/rdpeng/exdata/.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. https://r4ds.hadley.nz/.