The econid R package is a foundational building block of the econdataverse family of packages, which helps economists and financial professionals work with sovereign-level economic data. It is designed for domain experts in economics and finance who need to analyze and join data across multiple sources but who aren’t necessarily R programming experts.
Economic and financial datasets present unique challenges when working with country-level data:
Datasets often combine different types of entities in the same “country” column:
The same entity might appear in various formats:
Researchers often need to:
econid addresses these challenges through:
The design philosophy of the package follows tidyverse design principles and the tidy tools manifesto. We strive to practice human-centered design, with clear documentation and examples and graceful handling of edge cases. We invite you to submit suggestions for improvements and extensions on the package’s GitHub Issues page.
We have designed the package to handle only the most common entities financial and economic professionals might encounter in a dataset (249 in total), not to handle every edge case. However, the package allows users to extend the standardization list with custom entities to flexibly accommodate any unconventional use case.
To install the package from CRAN, you can use the install.packages() function:
install.packages("econid")
To install a development version from GitHub, you can use the remotes package:
remotes::install_github("Teal-Insights/r-econid")
Then, load the package in your R session or Quarto or RMarkdown notebook:
library(econid)
Below is a high-level overview of how econid works in practice, followed by a more detailed description of the main function and its parameters. The examples and tests illustrate typical usage patterns.
Use these patterns to explore the package and integrate it into your data cleaning workflows. For finer-grained operations (e.g., fuzzy filter and search), keep an eye on the package for future enhancements.
Input validation
The package checks that your input dataset and the specified columns exist. It also ensures you only request valid output columns (e.g., "entity_name", "entity_id", "entity_type", "iso2c", and "iso3c"). Any invalid columns raise an error.
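As a minimal sketch of the validation behavior described above, the snippet below requests an output column that is not in the valid set and captures the resulting error. The toy data frame and the column name "not_a_column" are made up for illustration. The exact wording of the error message may vary by package version, so it is not reproduced here.

```r
library(econid)

# Toy data frame (hypothetical values, for illustration only)
df_check <- data.frame(entity = c("United States", "China"))

# "not_a_column" is deliberately invalid; per the description above,
# this request should raise an error rather than return a data frame
validation_result <- tryCatch(
  standardize_entity(
    df_check,
    entity,
    output_cols = c("entity_id", "not_a_column")
  ),
  error = function(e) conditionMessage(e)
)

# If an error was raised, validation_result holds its message as a string
is.character(validation_result)
```

Wrapping the call in tryCatch() is just one way to surface the error in a script; in interactive use the error is reported directly.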
Name and code matching
The function standardize_entity() looks in your dataset for names (and optionally codes) that might match an entity. It:
Merging standardized columns
Once the function finds a match, it returns a new or augmented data frame with standardized columns (e.g., "entity_id", "entity_name", "entity_type", etc.). You control exactly which standardized columns appear via the output_cols argument.
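Handling missing and custom cases
Custom entities can be registered with add_entity_pattern() before standardization.
Values with no match receive NA in the standardized columns.
Unmatched values can be filled from the input columns via the fill_mapping parameter.
Unmatched rows can be assigned a default entity type (default_entity_type).
Ambiguous matches trigger a warning when warn_ambiguous is TRUE.
standardize_entity() Function
# Basic example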
df <- data.frame(
  entity = c("United States", "China", "NotACountry"),
  code = c("USA", "CHN", "ZZZ"),
  obs_value = c(1, 2, 3)
)
# Using with dplyr pipeline
library(dplyr)
df |>
  standardize_entity(entity, code) |>
  filter(!is.na(entity_id)) |>
  mutate(entity_category = case_when(
    entity_type == "economy" ~ "Country",
    TRUE ~ "Other"
  )) |>
  select(entity_name, entity_category, obs_value)
## entity_name entity_category obs_value
## 1 United States Country 1
## 2 China Country 2
You can also use the function directly without a pipeline:
standardize_entity(
  data = df,
  entity, code,
  output_cols = c("entity_id", "entity_name", "entity_type"),
  fill_mapping = c(entity_name = "entity"),
  default_entity_type = NA_character_,
  warn_ambiguous = TRUE
)
## entity_id entity_name entity_type entity code obs_value
## 1 USA United States economy United States USA 1
## 2 CHN China economy China CHN 2
## 3 <NA> NotACountry <NA> NotACountry ZZZ 3
data
A data frame (or tibble) containing the entities to be
standardized.
...
Columns containing entity names and/or IDs. These can be specified using unquoted column names (e.g., entity_name) or quoted column names (e.g., "entity_name"). Must specify at least one column. If multiple columns are specified, the function tries each in sequence, prioritizing matches from earlier columns.
output_cols (optional)
A character vector of columns to include in the final output. Valid
options:
"entity_id"
"entity_name"
"entity_type"
"iso3c"
"iso2c"
Defaults to c("entity_id", "entity_name", "entity_type").
prefix (optional)
A character string to prefix the output column names. Useful when
standardizing multiple entities in the same dataset (e.g., “country”,
“counterpart”).
fill_mapping (optional)
A named character vector specifying how to fill missing values when no entity match is found. Names should be output column names (without prefix), and values should be input column names (from ...).
default_entity_type (optional)
A character scalar ("economy", "organization", "aggregate", or "other") to assign as the entity type where no match is found. This value only applies if "entity_type" is requested in output_cols. The four valid values were selected to cover the most common economic use cases:
"economy": A legal or quasi-legal jurisdiction such as a country or autonomous region (e.g., “United States”, “Democratic Autonomous Administration of North and East Syria”)
"organization": An institution or organization such as a bank or international agency (e.g., “World Bank”, “IMF”)
"aggregate": A geographic or economic aggregate such as a region or development group (e.g., “Sub-Saharan Africa”, “Low Income Countries”)
"other": Anything that doesn’t fit into the other categories (e.g., “Elon Musk”, “The Moon”)
warn_ambiguous (optional)
A logical indicating whether to warn if a single row in data can match more than one entity. Defaults to TRUE.
overwrite (optional)
A logical indicating whether to overwrite existing entity columns. Defaults to TRUE.
warn_overwrite (optional)
A logical indicating whether to warn when overwriting existing entity columns. Defaults to TRUE.
.before (optional)
Column name or position to insert the standardized columns before. If
NULL (default), columns are inserted at the beginning of the dataframe.
Can be a character vector specifying the column name or a numeric value
specifying the column index.
A data frame (or tibble) the same size as data, augmented with the requested standardized columns.
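To illustrate the output_cols and .before arguments described above, here is a small sketch that requests ISO codes instead of the default columns and places them before a named column. The data frame and its values are invented for the example, and the printed result is omitted because the exact layout may differ across package versions.

```r
library(econid)

# Toy data frame (hypothetical values, for illustration only)
df_iso <- data.frame(
  entity = c("United States", "China"),
  obs_value = c(1, 2)
)

# Request the entity ID plus both ISO code columns, and ask for the
# standardized columns to be inserted before "obs_value"
iso_result <- standardize_entity(
  df_iso,
  entity,
  output_cols = c("entity_id", "iso2c", "iso3c"),
  .before = "obs_value"
)

# Inspect which columns were added and where they were placed
names(iso_result)
```

Passing a column name to .before keeps the call readable if the column order of the input changes; a numeric position is the documented alternative.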
The standardize_entity() function can be used to standardize multiple entities in the same dataset by using the prefix parameter:
df <- data.frame(
  country_name = c("United States", "France"),
  counterpart_name = c("China", "Germany")
)
df |>
  standardize_entity(country_name) |>
  standardize_entity(counterpart_name, prefix = "counterpart")
## counterpart_entity_id counterpart_entity_name counterpart_entity_type
## 1 CHN China economy
## 2 DEU Germany economy
## entity_id entity_name entity_type country_name counterpart_name
## 1 USA United States economy United States China
## 2 FRA France economy France Germany
add_entity_pattern() Function
The add_entity_pattern() function allows you to add custom entity patterns to the package. This is useful if you need to standardize entities that are not in the default list.
add_entity_pattern(
"BJ-CITY",
"Beijing City",
entity_type = "economy",
aliases = c("Beijing Municipality")
)
df_custom <- data.frame(entity = c("United States", "Beijing Municipality"))
result_custom <- standardize_entity(df_custom, entity)
print(result_custom)
## entity_id entity_name entity_type entity
## 1 USA United States economy United States
## 2 BJ-CITY Beijing City economy Beijing Municipality
reset_custom_entity_patterns() Function
The reset_custom_entity_patterns() function allows you to clear all custom entity patterns that have been added during the current R session. This is useful when you want to start fresh with only the default entity patterns.
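The following sketch shows one way the reset might be used, reusing the "BJ-CITY" pattern from the add_entity_pattern() example. It assumes a fresh R session in which that custom pattern has not already been registered, and it assumes that "Beijing Municipality" does not match any default pattern, so the alias should come back unmatched after the reset.

```r
library(econid)

# Register the custom pattern from the earlier example
add_entity_pattern(
  "BJ-CITY",
  "Beijing City",
  entity_type = "economy",
  aliases = c("Beijing Municipality")
)

df_reset <- data.frame(entity = c("United States", "Beijing Municipality"))

# With the custom pattern registered, the alias resolves to "BJ-CITY"
with_custom <- standardize_entity(df_reset, entity)

# Clear all custom patterns added during this session
reset_custom_entity_patterns()

# Only the default patterns remain, so the alias is expected to be unmatched
after_reset <- standardize_entity(df_reset, entity)

with_custom$entity_id
after_reset$entity_id
```

Comparing the two entity_id vectors is a quick check that the reset took effect before you continue with the default patterns.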
We welcome your feedback and contributions! Please submit suggestions for improvements and extensions on the package’s GitHub Issues page.