Exploratory Data Analysis (EDA) is a way to get to know your data, and it can be an iterative cycle once you have cleaned the data. According to Wickham, Çetinkaya-Rundel, and Grolemund (2023), the aims of EDA are to:
1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.
This is important even if you already have research questions in mind, and it also helps you choose suitable methods for data cleaning.
Following Peng (2020), here are the key steps of exploratory data analysis in R:
1. Formulate your question
2. Read in your data
3. Check the packaging
4. Glimpse
5. Look at the top and the bottom of the data
6. Check your "n"s
7. Validate with at least one external data source
8. Try the easy solution first
9. Challenge your solution
10. Follow up
For today’s EDA exercise, we will examine mortality rates from infectious and parasitic diseases. The data are adapted from the World Health Organization (WHO) Mortality Database and were collected from 1950 to 2021 in many countries around the world.
A good, sharp question or hypothesis helps narrow down the possible ways to find answers during the EDA process.
❓ Can you formulate the questions that you will attempt to answer with this data?
Tip
The most important question you can answer with the EDA process is “Do I have the right data to answer this question?”. Though this question is difficult to answer at the beginning, it is a good place to start doing EDA!
Now we will load the data set, in CSV format, into the R environment using the read_csv() function, as follows.
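A minimal sketch, assuming the workshop file is named mortality_rate.csv (the actual file name may differ):

```r
# Load the tidyverse, which provides read_csv() and the verbs used below
library(tidyverse)

# Read the WHO mortality data into a data frame;
# the file name is an assumption for illustration
mortality_rate <- read_csv("mortality_rate.csv")
```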
What we may have found:
- Was the data frame object mortality_rate loaded correctly?
- Were there any errors or warnings during loading?
- In RStudio, how many observations and variables does the data set have, according to the Environment panel?
We can check the number of rows and columns of the data frame with dim(), and then see the column names with colnames(), as follows.
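For example, assuming the mortality_rate object loaded above:

```r
# Number of rows and columns
dim(mortality_rate)

# Column names
colnames(mortality_rate)
```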
As mentioned in the R Basic Programming lecture, we can use the glimpse() function to see each column in the data frame, as well as the column types and some of the values in each column, as follows.
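For example:

```r
# Compact overview: one line per column, with its type and first few values
glimpse(mortality_rate)
```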
As we can see from the output, glimpse() shows how many rows and columns there are, just like the dim() function. In addition, glimpse() also shows the class of each column, so we can make sure the classes were specified correctly during loading.
Other useful functions for looking at the beginning and end of the data frame are head() and tail(), respectively.
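For example (both show six rows by default; pass n to change this):

```r
# First rows of the data frame
head(mortality_rate)

# Last rows of the data frame
tail(mortality_rate)
```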
With these functions, we can determine whether the data was loaded correctly:
- Were the column names defined correctly?
- Were column names missing at the beginning of the file?
- Was the data loaded completely, from the first to the last line of the file?
In this step, you count the things whose number (n) you want to examine. In general, counting things is a good way to find out if something went wrong after loading the data, for example whether duplicate values or NA values are present.
```r
# Count the number of countries in the data set
count_countries <- length(unique(mortality_rate$country_name))
count_countries

# How many data points are there for each country?
tbl_countries <- table(mortality_rate$country_name)
tbl_countries

# How many countries are in Asia?
mortality_rate %>%
  filter(region_name == "Asia") %>%
  distinct(country_name) %>%
  count()

# From 2000 to 2021, how many data points are available
# per year for countries in Asia?
mortality_rate %>%
  filter(region_name == "Asia" &
           year >= 2000 &
           year <= 2021) %>%
  group_by(year) %>%
  count()
```
```r
# From the above examination, we can see which year has the most
# data points by adding arrange()
mortality_rate %>%
  filter(region_name == "Asia" &
           year >= 2000 &
           year <= 2021) %>%
  group_by(year) %>%
  count() %>%
  arrange(desc(n))
```
It is very important to make sure that your data matches something outside the data set. For example, you can cross-check with the mortality databases of health agencies in the countries of interest. This way you can make sure that the measurements are about what they should be, and it serves as a check on what else in your data set might be wrong.
Based on our data, we can easily check the distribution of values with functions such as quantile() and summary(); with summary(), you can even summarize every column in the data frame at once, as in the sketch below.
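A minimal sketch, assuming the number column holds the number of deaths:

```r
# Distribution of the number of deaths (quartiles, min, and max; NAs removed)
quantile(mortality_rate$number, na.rm = TRUE)

# Summarize every column of the data frame at once
summary(mortality_rate)
```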
Since the data were collected from 1950 to 2021, which region has the highest number of deaths? To answer this question, we need to group the data by region and find which region has the highest number of deaths.
Does the number of deaths differ between sexes in each region?
```r
# Examine the number of deaths by region and sex
mortality_rate %>%
  filter(!is.na(number)) %>%
  group_by(region_name, sex) %>%
  summarise(total_death_per_region = sum(number)) %>%
  ggplot(aes(x = region_name, y = total_death_per_region, fill = sex)) +
  geom_col(position = position_dodge(0.9, preserve = "single"))
```
Trying an easy solution first is good, because it is a fast and simple way to answer the questions. But it is always a good idea to challenge the results, especially if they fit your expectations.
Even if our simple solutions work well, there are still some obstacles that challenge them. For example:
- Was the data collected every year in each country?
- With the previous solution, how can we handle NA values, or data with unknown sex? Should they be excluded?
- Do you observe any unusual values in your data? Why did this happen?
Below is an example of examining unusual values (outliers) in the number of deaths.
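A minimal sketch using ggplot2 (part of the tidyverse), assuming the columns described above:

```r
# Box plot of the number of deaths per region;
# points beyond the whiskers are potential outliers
mortality_rate %>%
  filter(!is.na(number)) %>%
  ggplot(aes(x = region_name, y = number)) +
  geom_boxplot()
```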
❓ Are there any unusual values in the box plot? If so, how can we examine them more closely?
We have now formulated questions about our data and tried to answer them, first with simple solutions and then by challenging those solutions.
At this point, it makes sense to ask a few follow-up questions:
Do you have the right data?
Do you need other data?
Do you have the right question?
Exploratory data analysis is designed to get you thinking about your data and your question. At this point, we can refine our questions or collect more relevant data, over and over, to get to the truth.
GGally
We can use the ggpairs() function from the GGally package (pronounced: g-g-ally) to get a general overview of our data. Instead of plotting the relationship between each pair of variables one plot at a time, ggpairs() lets us explore the initial relationships, and each individual variable, in more detail in a single plot matrix.
```r
# Load the GGally package
# (we asked participants to install it before the workshop)
library(GGally)

# Remove some columns and tidy the data
mortality_rate_2 <- mortality_rate %>%
  select(-c(country_name, age_std_death_rate_per_100k,
            age_group, number)) %>%
  filter(year >= 2012)
mortality_rate_2$year <- as.character(mortality_rate_2$year)

# Explore the relationships in our data, colored by region name
p_mortality_overview <- ggpairs(mortality_rate_2,
                                aes(colour = region_name),
                                progress = TRUE,
                                cardinality_threshold = 20)

# Export the plot to a file
ggsave("mortality_rate_overview.png", p_mortality_overview,
       width = 20, height = 10, units = "in", dpi = 300, scale = 0.8)
```
We will apply the EDA approach to explore the initial characteristics of the data set from Ghazalpour et al. (2006). The data set describes several physiological quantitative traits of female mice from a specific F2 intercross.
First, load the data into the R environment:
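A minimal sketch; the file name mice_traits.csv is a placeholder, so replace it with the file provided in the workshop materials:

```r
library(tidyverse)

# The file name below is hypothetical; use the file
# distributed with the workshop materials
mice_traits <- read_csv("mice_traits.csv")
```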
📓 First, let’s try to explore the overall aspects of the data using ggpairs() from the GGally library. Are there any interesting variables to dig deeper into?
📓 Given the data set, what type of visualization would be appropriate, and why?
17 May 2023