The term “city” in China is a multifaceted concept. It may denote a county-level, prefectural, or provincial administrative unit. Scholars working on China often struggle to convert these names or their corresponding geocodes, particularly when handling data that span multiple years. The difficulty is compounded because the central government periodically renames or abolishes some units (国家统计局 2022a).
Inspired by Vincent Arel-Bundock’s `countrycode` package, we developed `regioncode`. This package performs similar functions but is tailored specifically to region name/code conversions within China for the period 1986–2019.
Why regioncode?
The Chinese government assigns a unique geocode to each county, city (prefecture), and provincial-level administrative unit. These “administrative division codes” are regularly adjusted and updated to align with national and regional development plans (民政部 2022). However, these adjustments pose challenges for researchers conducting longitudinal studies or merging geo-based data from different years. For instance, inconsistencies between map data and statistical data can produce erroneous output when statistical data are rendered on a Chinese map.
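To make the structure of these codes concrete, the following base R sketch assumes the usual six-digit layout of administrative division codes: two digits for the province, two for the prefecture, and two for the county. The helpers `code_level()` and `split_code()` are hypothetical names written for this illustration; they are not part of `regioncode`, and the level they report is only a rough guide, since some historical codes do not follow the pattern strictly.

```r
# Hypothetical helpers (not part of regioncode), assuming the usual
# six-digit layout: PP (province) + CC (prefecture) + DD (county).

# Guess the administrative level from the trailing zeros of a code.
code_level <- function(code) {
  code <- as.integer(code)
  ifelse(code %% 10000 == 0, "province",
         ifelse(code %% 100 == 0, "prefecture", "county"))
}

# Split a code into its three two-digit components.
split_code <- function(code) {
  code <- as.integer(code)
  data.frame(
    province   = code %/% 10000,
    prefecture = (code %/% 100) %% 100,
    county     = code %% 100
  )
}

code_level(c(370000, 370100, 310117))
split_code(310117)
```

Because the prefecture and county digits are nested inside the provincial prefix, upgrading or merging a unit typically changes its code, which is what makes multi-year data hard to align.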
A One-Step Solution: regioncode
`regioncode` offers a one-step solution to these challenges. In its current version, it converts among the formal names, commonly used names, and administrative division codes of Chinese provinces and prefectures, covering the thirty-four years from 1986 to 2019.
To install the released version from CRAN:
install.packages("regioncode")
Or the development version from GitHub:
remotes::install_github("sammo3182/regioncode")
We demonstrate the basic application of `regioncode` with a toy data set randomly sampled from Wang (2020)’s China’s Corruption Investigations Dataset. In `regioncode`, administrative division codes are denoted as `code`, and the formal names of regions are referred to as `name`. The current version supports conversion between any pair of these elements. Users merely need to pass a character vector of names or a numeric vector of geocodes to the function and specify the desired output type with the `convert_to` argument.
The following example converts the 2019 geocodes in the sample data to their 1989 versions. Users must set the `year_from` argument to the year that the input data refer to. The `year_to` and `convert_to` arguments then determine the target year and the output format.
library(regioncode)
data("corruption")
# Conversion to the 1989 version
regioncode(data_input = corruption$prefecture_id,
           convert_to = "code", # default setting
           year_from = 2019,
           year_to = 1989)
## [1] 370100 329001 310227 420500 452200 433000 350300 512500 460025 420600
# Comparison
library(tibble) # provides tibble()
tibble(
  code2019 = corruption$prefecture_id,
  code1989 = regioncode(data_input = corruption$prefecture_id,
                        convert_to = "code", # default setting
                        year_from = 2019,
                        year_to = 1989),
  name2019 = regioncode(data_input = corruption$prefecture_id,
                        convert_to = "name", # names in the input year
                        year_from = 2019,
                        year_to = 2019),
  name1989 = regioncode(data_input = corruption$prefecture_id,
                        convert_to = "name", # names as of 1989
                        year_from = 2019,
                        year_to = 1989)
)
## # A tibble: 10 × 4
## code2019 code1989 name2019 name1989
## <dbl> <dbl> <chr> <chr>
## 1 370100 370100 济南市 济南市
## 2 321200 329001 泰州市 泰州市
## 3 310117 310227 松江区 松江县
## 4 420500 420500 宜昌市 宜昌市
## 5 451300 452200 来宾市 柳州地区
## 6 431200 433000 怀化市 怀化地区
## 7 350300 350300 莆田市 莆田市
## 8 511500 512500 宜宾市 宜宾地区
## 9 469021 460025 定安县 定安县
## 10 420600 420600 襄阳市 襄樊市
Note that if a region had its own geocode in, for example, 1989 and was later absorbed into a new region by 2019, the geocode of the new region is used for the later year. If a large area was divided into several regions, its later-year code is matched to the first of those regions in ascending order of their numeric geocodes.
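The tie-breaking rule for divided areas can be sketched in base R. The six-digit codes below are made up for illustration, and `splits`, `first_match`, and `later_code()` are hypothetical names; this is one way such a rule could be expressed, not the package's internal implementation.

```r
# Made-up codes: old area 100100 was split into two later regions,
# while 100200 kept a single successor.
splits <- data.frame(
  old = c(100100, 100100, 100200),
  new = c(100300, 100150, 100200)
)

# For each old code, keep the smallest later code (ascending numeric order).
first_match <- aggregate(new ~ old, data = splits, FUN = min)

# Look up the later-year code for a vector of old codes; unknown codes give NA.
later_code <- function(old) {
  first_match$new[match(old, first_match$old)]
}

later_code(c(100200, 100100, 999999))
```

A consequence of this kind of rule is that the mapping from earlier to later years is single-valued, so information about the other successor regions is not carried along.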
In the current version, `regioncode` automatically identifies the input format: numeric vectors for geocodes and character vectors for names. The following example demonstrates conversions from various types of input to alternative output formats:
## # A tibble: 10 × 2
## id name
## <dbl> <chr>
## 1 370100 济南市
## 2 321200 泰州市
## 3 310117 松江区
## 4 420500 宜昌市
## 5 451300 来宾市
## 6 431200 怀化市
## 7 350300 莆田市
## 8 511500 宜宾市
## 9 469021 定安县
## 10 420600 襄阳市
# Codes to names
regioncode(data_input = corruption$prefecture_id,
           convert_to = "name",
           year_from = 2019,
           year_to = 1989)
## [1] "济南市" "泰州市" "松江县" "宜昌市" "柳州地区" "怀化地区"
## [7] "莆田市" "宜宾地区" "定安县" "襄樊市"
# Names to codes of the same year
regioncode(data_input = corruption$prefecture,
           convert_to = "code",
           year_from = 2019,
           year_to = 2019)
## [1] 370100 321200 310117 420500 451300 431200 350300 511500 469021 420600
# Names to names of a different year
regioncode(data_input = corruption$prefecture,
           convert_to = "name",
           year_from = 2019,
           year_to = 1989)
## [1] "济南市" "泰州市" "松江县" "宜昌市" "柳州地区" "怀化地区"
## [7] "莆田市" "宜宾地区" "定安县" "襄樊市"
The `regioncode` package also offers specialized conversion functions to assist users with more complex data and diverse requirements.
Data sets frequently omit the administrative level when recording geographical information, for example “北京” instead of “北京市,” or “内蒙” instead of “内蒙古自治区.” We refer to these as “incomplete names.” To convert such data, set the `incomplete_name` argument to `TRUE`. As long as two characters are available to identify the city or province, `regioncode` can carry out the conversion. In the following example, we randomly truncate 70% of the input city names to incomplete names and show how `regioncode` handles them:
## [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
## [9] "定安县" "襄阳市"
fake_incomplete <- corruption$prefecture
# Pick 7 of the 10 names at random and keep only their first two characters
index_incomplete <- sample(seq_along(corruption$prefecture), 7)
fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |>
  substr(start = 1, stop = 2)
fake_incomplete
## [1] "济南" "泰州" "松江" "宜昌市" "来宾" "怀化" "莆田" "宜宾市"
## [9] "定安" "襄阳市"
# Conversion to full names in 2008
regioncode(data_input = fake_incomplete,
           convert_to = "name",
           year_from = 2019,
           year_to = 2008,
           incomplete_name = TRUE)
## [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
## [9] "定安县" "襄樊市"
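The idea behind completing truncated names can be illustrated with a simple prefix match in base R. This is only a sketch of the general technique under the assumption that each truncated name is a prefix of exactly one official name; `complete_name()` and `full_names` are hypothetical, and the package's own matching logic may differ.

```r
# A few official names taken from the toy data above.
full_names <- c("济南市", "泰州市", "松江区", "宜昌市")

# Hypothetical helper: return the unique official name that starts with
# the given (possibly truncated) name, or NA if none or several match.
complete_name <- function(partial, candidates) {
  vapply(partial, function(p) {
    hits <- candidates[startsWith(candidates, p)]
    if (length(hits) == 1) hits else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

complete_name(c("济南", "松江", "宜昌市", "北京"), full_names)
```

Returning `NA` for ambiguous or unknown prefixes, rather than guessing, keeps mismatches visible to the user.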
Municipalities (“直辖市”) in China are geographically cities but administratively provincial. Different geographic data may categorize them differently. Some data may treat municipalities as equivalent to prefectures.
To convert this type of data, `regioncode` provides a dedicated argument, `zhixiashi`. The default value is `FALSE`, which treats municipalities as provinces. When it is set to `TRUE`, municipalities are treated as prefectures, and their provincial codes are used as geocodes.
The following example illustrates the municipalities identifier with a mixed string of names of municipalities, their districts, and a prefecture:
names_municipality <- c("北京市", # Beijing, a municipality
                        "海淀区", # A district of Beijing
                        "上海市", # Shanghai, a municipality
                        "静安区", # A district of Shanghai
                        "济南市") # A prefecture of Shandong
# When `zhixiashi` is FALSE, the municipalities themselves are not matched
regioncode(data_input = names_municipality,
           year_from = 2019,
           year_to = 2019,
           convert_to = "code",
           zhixiashi = FALSE)
## [1] NA 110108 NA 310106 370100
# When `zhixiashi` is TRUE, municipalities are recognized
regioncode(data_input = names_municipality,
           year_from = 2019,
           year_to = 2019,
           convert_to = "code",
           zhixiashi = TRUE)
## [1] 110000 110108 310000 310106 370100
The Statistical Yearbook of Urban and Rural Construction classifies Chinese cities into different levels, largely based on their populations (国家统计局 2022b). From 1989 to 2014, there were four levels of cities, and the system expanded to a 7-level scale after 2014, as detailed in the following table:
Criterion | Population | Rank |
---|---|---|
Old (1989) | > 1 million | 特大城市 |
Old (1989) | 500,000 ~ 1 million | 大城市 |
Old (1989) | 200,000 ~ 500,000 | 中等城市 |
Old (1989) | < 200,000 | 小城市 |
New (2014) | > 10 million | 超大城市 |
New (2014) | 5 million ~ 10 million | 特大城市 |
New (2014) | 3 million ~ 5 million | I型大城市 |
New (2014) | 1 million ~ 3 million | II型大城市 |
New (2014) | 500,000 ~ 1 million | 中等城市 |
New (2014) | 200,000 ~ 500,000 | I型小城市 |
New (2014) | < 200,000 | II型小城市 |
The `regioncode` function can return the rank of cities according to their populations in a given year. Users simply need to set `convert_to = "rank"` to perform the conversion. For regions in and before 1989, the old ranking system is applied. For other region-years, the function returns the new ranks. For some cities, we cannot find population figures in the official sources, and their rank is returned as `NA`.
The following example compares the ranks from the same input in different years:
tibble(
  city = corruption$prefecture,
  rank1989 = regioncode(data_input = corruption$prefecture,
                        year_from = 2019,
                        year_to = 1989,
                        convert_to = "rank"),
  rank2014 = regioncode(data_input = corruption$prefecture,
                        year_from = 2019,
                        year_to = 2014,
                        convert_to = "rank")
)
## # A tibble: 10 × 3
## city rank1989 rank2014
## <chr> <chr> <chr>
## 1 济南市 特大城市 I型大城市
## 2 泰州市 小城市 II型大城市
## 3 松江区 特大城市 超大城市
## 4 宜昌市 中等城市 II型大城市
## 5 来宾市 <NA> 中等城市
## 6 怀化市 小城市 I型小城市
## 7 莆田市 小城市 II型大城市
## 8 宜宾市 中等城市 II型大城市
## 9 定安县 <NA> <NA>
## 10 襄阳市 中等城市 II型大城市
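For readers who want to apply the thresholds to their own population figures, the classification can be sketched with `cut()` in base R. The function `classify_city()` is a hypothetical helper, not part of `regioncode`. Two assumptions are made here: the top tier of the old scheme is labeled 特大城市, as in the `rank1989` column above, and each bracket includes its lower bound, since the table does not say how exact boundary values are treated.

```r
# Hypothetical helper: classify populations (number of persons) into city ranks.
# Assumption: brackets are closed on the left, e.g. exactly 1 million falls
# into the bracket that starts at 1 million.
classify_city <- function(population, scheme = c("new", "old")) {
  scheme <- match.arg(scheme)
  if (scheme == "old") {
    breaks <- c(0, 2e5, 5e5, 1e6, Inf)
    labels <- c("小城市", "中等城市", "大城市", "特大城市")
  } else {
    breaks <- c(0, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf)
    labels <- c("II型小城市", "I型小城市", "中等城市",
                "II型大城市", "I型大城市", "特大城市", "超大城市")
  }
  # cut() returns a factor; missing populations stay NA.
  as.character(cut(population, breaks = breaks, labels = labels, right = FALSE))
}

classify_city(c(150000, 750000, 12e6, NA), scheme = "new")
classify_city(c(750000, 1.5e6), scheme = "old")
```

The same population can therefore fall into differently named tiers depending on the scheme, which is why the two rank columns above are not directly comparable.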
Pinyin is a phonetic romanization of Chinese characters. Some data may store region names in pinyin instead of Chinese characters. The default name output of `regioncode` is in Chinese characters. However, thanks to Peng Zhao and Qu Cheng’s `pinyin` package, users can obtain pinyin output from the `regioncode` function by setting the argument `to_pinyin = TRUE`. This option also corrects the romanization for areas with special spellings, such as Shanxi vs. Shaanxi, Inner Mongolia, and the special administrative regions. It works for official names, incomplete names, and administrative area outputs. The following example demonstrates how it handles these cases:
tibble(
  city = corruption$prefecture,
  cityPY = regioncode(data_input = corruption$prefecture,
                      year_from = 2019,
                      year_to = 1989,
                      convert_to = "name",
                      to_pinyin = TRUE),
  areaPY = regioncode(data_input = corruption$prefecture,
                      year_from = 2019,
                      year_to = 1989,
                      convert_to = "area",
                      to_pinyin = TRUE)
)
## # A tibble: 10 × 3
## city cityPY areaPY
## <chr> <chr> <chr>
## 1 济南市 ji_nan hua_dong
## 2 泰州市 tai_zhou hua_dong
## 3 松江区 song_jiang hua_dong
## 4 宜昌市 yi_chang hua_zhong
## 5 来宾市 liu_zhou hua_nan
## 6 怀化市 huai_hua hua_zhong
## 7 莆田市 pu_tian hua_dong
## 8 宜宾市 yi_bin xi_nan
## 9 定安县 ding_an hua_nan
## 10 襄阳市 xiang_fan hua_zhong
# Regions with special spelling
regioncode(data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
           year_from = 2019,
           year_to = 2008,
           convert_to = "name",
           incomplete_name = TRUE,
           province = TRUE,
           to_pinyin = TRUE)
## 山西
## "shan_xi" "shaan_xi" "inner_mongolia" "hong_kong"
##
## "macao"
The `regioncode` function also supports conversions at the provincial level. By setting the argument `province = TRUE`, users can convert all geocodes and names at this level. Chinese provinces have abbreviations, and when the input data contain only abbreviations, users can set the `convert_to` argument to `abbreTocode`, `abbreToname`, or `abbreToarea` to obtain the desired data type. To receive abbreviations as output, simply set `convert_to = "abbre"`.
The following example demonstrates the conversion of a vector of province geocodes to their official names and abbreviations:
tibble(
  province = corruption$province_id,
  prov_name = regioncode(data_input = corruption$province_id,
                         convert_to = "name",
                         year_from = 2019,
                         year_to = 1989,
                         province = TRUE),
  prov_abbre = regioncode(data_input = corruption$province_id,
                          convert_to = "codeToabbre",
                          year_from = 2019,
                          year_to = 1989,
                          province = TRUE)
)
## # A tibble: 10 × 3
## province prov_name prov_abbre
## <dbl> <chr> <chr>
## 1 370000 山东省 鲁
## 2 320000 江苏省 苏
## 3 310000 上海市 沪
## 4 420000 湖北省 鄂
## 5 450000 广西壮族自治区 桂
## 6 430000 湖南省 湘
## 7 350000 福建省 闽
## 8 510000 四川省 蜀
## 9 460000 海南省 琼
## 10 420000 湖北省 鄂
The current version of `regioncode` supports two types of region conversion beyond the provincial level: administrative areas and linguistic zones.
Chinese regions are divided into seven areas for social, political, and military reasons (孙平 2020):
Region | Provincial-level Administrative Unit |
---|---|
华北 | 北京市, 天津市, 山西省, 河北省, 内蒙古自治区 |
东北 | 黑龙江省, 吉林省, 辽宁省 |
华东 | 上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省 |
华中 | 河南省, 湖北省, 湖南省 |
华南 | 广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区 |
西南 | 重庆市, 四川省, 贵州省, 云南省, 西藏自治区 |
西北 | 陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区 |
In certain cases, users may wish to identify the area to which a prefecture or province belongs. `regioncode` can convert the codes and names of regions (both prefectures and provinces) into areas when the output format is set to `"area"`:
regioncode(data_input = corruption$prefecture,
           year_from = 2019,
           year_to = 1989,
           convert_to = "area")
## [1] "华东" "华东" "华东" "华中" "华南" "华中" "华东" "西南" "华南" "华中"
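At the provincial level, the area assignment amounts to a lookup in the table above. The base R sketch below reproduces only three of the seven areas to stay short; `areas`, `lookup`, and `area_of()` are hypothetical names for illustration and do not reflect how the package stores its data.

```r
# Three of the seven areas, copied from the table above.
areas <- list(
  "华东" = c("上海市", "江苏省", "浙江省", "安徽省",
             "福建省", "台湾省", "江西省", "山东省"),
  "华中" = c("河南省", "湖北省", "湖南省"),
  "西南" = c("重庆市", "四川省", "贵州省", "云南省", "西藏自治区")
)

# Flatten the list into a two-column lookup table.
lookup <- data.frame(
  province = unlist(areas, use.names = FALSE),
  area     = rep(names(areas), lengths(areas)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: area of each province; NA if the name is not listed.
area_of <- function(province) {
  lookup$area[match(province, lookup$province)]
}

area_of(c("山东省", "湖北省", "四川省"))
```

A prefecture can be mapped the same way once its province is known, for instance from the first two digits of its geocode.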
China is a multilingual country with various dialects. These dialects may be used across several prefectures in a province or even across different provinces. For political and sociolinguistic studies, `regioncode` includes a function to return the approximate linguistic zones of given geocodes or prefectural names. In the current version, `regioncode` offers two levels of linguistic zone identification: dialect groups (`dia_group`, “方言大类”) and dialect sub-groups (`dia_sub_group`, “分区片”), according to the 1987 Language Atlas of China (Li, Xiong, and Zhang 1987). (When `province = TRUE`, the linguistic conversion is available only at the dialect group level.)
The following example converts the toy data to dialect groups and sub-groups:
tibble(
  city = corruption$prefecture,
  dialectGroup = regioncode(data_input = corruption$prefecture,
                            year_from = 2019,
                            year_to = 1989,
                            to_dialect = "dia_group"),
  dialectSubGroup = regioncode(data_input = corruption$prefecture,
                               year_from = 2019,
                               year_to = 1989,
                               to_dialect = "dia_sub_group")
)
## # A tibble: 10 × 3
## city dialectGroup dialectSubGroup
## <chr> <chr> <chr>
## 1 济南市 冀鲁官话 沧惠片-1,石济片-8
## 2 泰州市 江淮官话 泰如片-1
## 3 松江区 吴语 太湖片-1
## 4 宜昌市 西南官话 成渝片-3,成渝片-9
## 5 来宾市 西南官话 桂柳片-10
## 6 怀化市 湘语 岑江片-2,吉溆片-3,娄邵片-1,黔北片-3,长益片-3
## 7 莆田市 莆仙区 莆仙区-4
## 8 宜宾市 西南官话 灌赤片-10
## 9 定安县 琼文区 府城片-1
## 10 襄阳市 西南官话 鄂北片-10
Note that the linguistic distribution in China is too complex to gauge precisely at the prefectural level, and it changes continually with population dynamics. The linguistic zone output from `regioncode` is thus intended for reference rather than for rigorous linguistic research.
`regioncode` offers a convenient method for converting Chinese administrative division codes and official names and for carrying out various specialized conversions. Development of the package is ongoing, and future versions aim to add more administrative levels and richer data. Collaboration is welcome, and questions, comments, or bug reports can be directed to GitHub Issues.
We extend our appreciation to LI Ruizhe, ZHU Meng, SHI Yuyang, XU Yujia, PAN Yuxin, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LIU Xueyan for their contributions to data collection and function editing of this package.
Yue Hu
Department of Political Science,
Tsinghua University,
Email: yuehu@tsinghua.edu.cn
Website: https://www.drhuyue.site
Xinyi Ye
Department of Political Science,
Tsinghua University,
Email: yexy23@mails.tsinghua.edu.cn