The term “city” in China is ambiguous: it may denote a county-level, prefectural, or provincial administrative unit. Scholars working on China are often frustrated when converting these names or their corresponding geocodes, particularly when handling data that span multiple years. The difficulty is compounded because the central government periodically changes or abolishes the names of some units (国家统计局 2022a).

Inspired by Vincent Arel-Bundock’s countrycode package, we developed regioncode. This package aims to perform similar functions but is tailored specifically for region name/code conversions within China for the period 1986–2019.

Why `regioncode`?

The Chinese government assigns a unique geocode to each county, city (prefecture), and provincial-level administrative unit. These “administrative division codes” are regularly adjusted and updated to align with national and regional development plans (民政部 2022). These adjustments pose challenges for researchers who conduct longitudinal studies or merge geo-based data from different years. For instance, inconsistencies between map data and statistical data can produce erroneous output when statistical data are rendered on a map of China.

A One-Step Solution: regioncode

regioncode offers a one-step solution to these challenges. In its current version, it converts among the formal names, commonly used names, and administrative division codes of Chinese provinces and prefectures, covering the thirty-four years from 1986 to 2019.

Installation

To install:

The latest released version: install.packages("regioncode").
The latest developing version: remotes::install_github("sammo3182/regioncode").

Basic Usage

We demonstrate the basic application of regioncode with a toy dataset randomly sampled from Wang’s (2020) China’s Corruption Investigations Dataset. In regioncode, administrative division codes are denoted as code, and the formal names of regions as name. The current version converts between any pair of these elements. Users only need to pass a character vector of names or a numeric vector of geocodes to the function and specify the desired output type with the convert_to argument.

The following example illustrates the conversion of 2019 geocodes in the sample data to their 1989 version. It is essential for users to correctly set the year_from argument to reference the appropriate year. Subsequently, the year_to and convert_to arguments can be used to determine the desired year’s projection and the format type.

library(regioncode)
library(tibble) # provides tibble(), used in the comparisons below

data("corruption")

# Conversion to the 1989 version
regioncode(data_input = corruption$prefecture_id, 
           convert_to = "code", # default setting
           year_from = 2019,
           year_to = 1989)

##  [1] 370100 329001 310227 420500 452200 433000 350300 512500 460025 420600

# Comparison
tibble(
  code2019 = corruption$prefecture_id,
  code1989 = regioncode(data_input = corruption$prefecture_id,
           convert_to = "code", # default setting
           year_from = 2019,
           year_to = 1989),
  name2019 = regioncode(data_input = corruption$prefecture_id,
           convert_to = "name", # return names instead of codes
           year_from = 2019,
           year_to = 2019),
  name1989 = regioncode(data_input = corruption$prefecture_id,
           convert_to = "name", # names as of 1989
           year_from = 2019,
           year_to = 1989)
)

## # A tibble: 10 × 4
##    code2019 code1989 name2019 name1989
##       <dbl>    <dbl> <chr>    <chr>   
##  1   370100   370100 济南市   济南市  
##  2   321200   329001 泰州市   泰州市  
##  3   310117   310227 松江区   松江县  
##  4   420500   420500 宜昌市   宜昌市  
##  5   451300   452200 来宾市   柳州地区
##  6   431200   433000 怀化市   怀化地区
##  7   350300   350300 莆田市   莆田市  
##  8   511500   512500 宜宾市   宜宾地区
##  9   469021   460025 定安县   定安县  
## 10   420600   420600 襄阳市   襄樊市

Note that if a region had its own geocode in, for example, 1989 but had been merged into another region by 2019, the geocode of the absorbing region is used for 2019. If a large area was divided into several regions, the later-year code is that of the first resulting region in ascending order of the regions’ numeric geocodes.
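The split rule can be illustrated with a small base R sketch. The codes below are invented for illustration, and the snippet is only an assumption about how such a rule could be expressed, not the package's internal implementation:

```r
# Hypothetical split: the old unit 100 was later divided into three units,
# while unit 200 was left unchanged. These are not real geocodes.
splits <- data.frame(
  code_old = c(100, 100, 100, 200),
  code_new = c(130, 110, 120, 200)
)

# For each old code, keep the smallest new code, i.e. the first
# successor in ascending order of numeric geocode.
first_successor <- tapply(splits$code_new, splits$code_old, min)
first_successor
```

Here the old unit 100 maps to 110, the smallest of its three successors.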

In the current version, regioncode automatically identifies the input format: numerics for geocodes and characters for names. The following example demonstrates the conversions from various types of input to alternative formats of outputs:

# Original name
tibble(
  id = corruption$prefecture_id,
  name = corruption$prefecture
)

## # A tibble: 10 × 2
##        id name  
##     <dbl> <chr> 
##  1 370100 济南市
##  2 321200 泰州市
##  3 310117 松江区
##  4 420500 宜昌市
##  5 451300 来宾市
##  6 431200 怀化市
##  7 350300 莆田市
##  8 511500 宜宾市
##  9 469021 定安县
## 10 420600 襄阳市

# Codes to name
regioncode(data_input = corruption$prefecture_id, 
           convert_to = "name",
           year_from = 2019,
           year_to = 1989)

##  [1] "济南市"   "泰州市"   "松江县"   "宜昌市"   "柳州地区" "怀化地区"
##  [7] "莆田市"   "宜宾地区" "定安县"   "襄樊市"

# Name to codes of the same year
regioncode(data_input = corruption$prefecture, 
           convert_to = "code",
           year_from = 2019,
           year_to = 2019)

##  [1] 370100 321200 310117 420500 451300 431200 350300 511500 469021 420600

# Name to name of a different year
regioncode(data_input = corruption$prefecture, 
           convert_to = "name",
           year_from = 2019,
           year_to = 1989)

##  [1] "济南市"   "泰州市"   "松江县"   "宜昌市"   "柳州地区" "怀化地区"
##  [7] "莆田市"   "宜宾地区" "定安县"   "襄樊市"
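As a rough sketch of what such a conversion amounts to, the following base R snippet builds a tiny crosswalk from four rows of the comparison table shown earlier and chooses the lookup column by input type. The `crosswalk` table and the helper `convert_region()` are hypothetical names introduced for illustration; the internals of regioncode may differ:

```r
# Four rows copied from the comparison table above
crosswalk <- data.frame(
  code2019 = c(370100, 321200, 310117, 451300),
  code1989 = c(370100, 329001, 310227, 452200),
  name2019 = c("济南市", "泰州市", "松江区", "来宾市"),
  name1989 = c("济南市", "泰州市", "松江县", "柳州地区"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of regioncode)
convert_region <- function(x, convert_to = "code", year_to = 1989) {
  # Numeric input is treated as geocodes, character input as names
  from <- if (is.numeric(x)) "code2019" else "name2019"
  to <- paste0(convert_to, year_to)
  # match() returns NA for inputs that are absent from the crosswalk
  crosswalk[[to]][match(x, crosswalk[[from]])]
}

convert_region(c(451300, 310117))
convert_region(c("来宾市", "松江区"), convert_to = "name")
```

Inputs that do not appear in the crosswalk come back as NA rather than raising an error.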

Advanced Applications

The regioncode package also offers specialized conversion functions to assist users with more complex data and diverse requirements, including:

Conversion from/to incomplete names.
Different handling of municipalities.
Return of population-based city ranks.
Return of pinyin format of outputs.
Conversion of provincial data.
Return of administrative areas.
Return of linguistic zones.

Incomplete Naming of Prefectures

Data sources frequently omit the administrative level when recording geographical information, for example “北京” instead of “北京市,” or “内蒙” instead of “内蒙古自治区.” We refer to these as “incomplete names.” To convert such data, set the incomplete_name argument to “TRUE.” As long as two characters suffice to identify the city or province, regioncode can perform the conversion. In the following example, we randomly truncate 70% of the input city names (7 of 10) to their first two characters and show how regioncode handles them:

# Original full names
corruption$prefecture

##  [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
##  [9] "定安县" "襄阳市"

fake_incomplete <- corruption$prefecture

index_incomplete <- sample(seq_along(corruption$prefecture), 7)

fake_incomplete[index_incomplete] <- fake_incomplete[index_incomplete] |> 
  substr(start = 1, stop = 2)

fake_incomplete

##  [1] "济南"   "泰州"   "松江"   "宜昌市" "来宾"   "怀化"   "莆田"   "宜宾市"
##  [9] "定安"   "襄阳市"

# Conversion to full names in 2008
regioncode(data_input = fake_incomplete, 
           convert_to = "name",
           year_from = 2019,
           year_to = 2008,
           incomplete_name = TRUE)

##  [1] "济南市" "泰州市" "松江区" "宜昌市" "来宾市" "怀化市" "莆田市" "宜宾市"
##  [9] "定安县" "襄樊市"
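One simple way to think about incomplete-name matching is as a prefix search against the list of full names. The sketch below is an assumption made for illustration: `complete_name()` is a hypothetical helper, and the matching actually performed by regioncode may be more elaborate:

```r
# A few full names taken from the toy data
full_names <- c("济南市", "泰州市", "松江区", "襄阳市")

# Illustrative helper (not part of regioncode): return the full name
# that starts with the given fragment, or NA unless exactly one matches.
complete_name <- function(x, candidates = full_names) {
  vapply(x, function(fragment) {
    hit <- candidates[startsWith(candidates, fragment)]
    if (length(hit) == 1) hit else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

complete_name(c("济南", "襄阳市", "松江", "北"))
```

Full names pass through unchanged, while a fragment with no match (or more than one) yields NA.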

Municipalities

Municipalities (“直辖市”) in China are geographically cities but administratively provincial. Different geographic data may categorize them differently. Some data may treat municipalities as equivalent to prefectures.

To convert this type of data, regioncode introduces a specific argument zhixiashi. The default value is “FALSE,” treating municipalities as provinces. When set to “TRUE,” municipalities are considered as prefectures, and their provincial codes are utilized as geocodes.

The following example illustrates the municipalities identifier with a mixed string of names of municipalities, their districts, and a prefecture:

names_municipality <- c("北京市", # Beijing, a municipality
                        "海淀区", # A district of Beijing
                        "上海市", # Shanghai, a municipality
                        "静安区", # A district of Shanghai
                        "济南市") # A prefecture of Shandong

# When `zhixiashi` is FALSE, municipalities return NA; districts and prefectures are still recognized
regioncode(data_input = names_municipality, 
           year_from = 2019,
           year_to = 2019, 
           convert_to = "code",
           zhixiashi = FALSE)

## [1]     NA 110108     NA 310106 370100

# When `zhixiashi` is TRUE, municipalities are recognized
regioncode(data_input = names_municipality,
           year_from = 2019,
           year_to = 2019,
           convert_to = "code",
           zhixiashi = TRUE)

## [1] 110000 110108 310000 310106 370100
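The effect of the switch can be mimicked with two lookup tables. This is a minimal sketch under the assumption that municipalities are simply added to the prefecture-level lookup when the flag is on; `lookup_code()` is a hypothetical helper, and the codes are those shown in the output above:

```r
# Codes copied from the example output above
municipality_codes <- c("北京市" = 110000, "上海市" = 310000)
prefecture_codes <- c("海淀区" = 110108, "静安区" = 310106, "济南市" = 370100)

# Illustrative helper (not part of regioncode)
lookup_code <- function(x, zhixiashi = FALSE) {
  codes <- if (zhixiashi) c(prefecture_codes, municipality_codes) else prefecture_codes
  # Names missing from the lookup yield NA
  unname(codes[x])
}

region_names <- c("北京市", "海淀区", "上海市", "静安区", "济南市")
lookup_code(region_names)
lookup_code(region_names, zhixiashi = TRUE)
```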

City Ranking

The Statistical Yearbook of Urban and Rural Construction classifies Chinese cities into different levels, largely based on their populations (国家统计局 2022b). From 1989 to 2014, there were four levels of cities, and the system expanded to a 7-level scale after 2014, as detailed in the following table:

Criterion	Population	Rank
Old (1989)	> 1 million	特大城市
Old (1989)	500,000 ~ 1 million	大城市
Old (1989)	200,000 ~ 500,000	中等城市
Old (1989)	< 200,000	小城市
New (2014)	> 10 million	超大城市
New (2014)	5 million ~ 10 million	特大城市
New (2014)	3 million ~ 5 million	I型大城市
New (2014)	1 million ~ 3 million	II型大城市
New (2014)	500,000 ~ 1 million	中等城市
New (2014)	200,000 ~ 500,000	I型小城市
New (2014)	< 200,000	II型小城市

The regioncode function can return the population-based rank of a city for a given year; users simply need to set convert_to = "rank". For region-years in and before 1989, the old ranking system is applied. For other region-years, the function returns the new ranks. When a city’s population cannot be found in official sources, its rank is NA.

The following example compares the ranks from the same input in different years:

tibble(
  city = corruption$prefecture,
  rank1989 = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989, 
           convert_to = "rank"),
  rank2014 = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 2014, 
           convert_to = "rank")
)

## # A tibble: 10 × 3
##    city   rank1989 rank2014  
##    <chr>  <chr>    <chr>     
##  1 济南市 特大城市 I型大城市 
##  2 泰州市 小城市   II型大城市
##  3 松江区 特大城市 超大城市  
##  4 宜昌市 中等城市 II型大城市
##  5 来宾市 <NA>     中等城市  
##  6 怀化市 小城市   I型小城市 
##  7 莆田市 小城市   II型大城市
##  8 宜宾市 中等城市 II型大城市
##  9 定安县 <NA>     <NA>      
## 10 襄阳市 中等城市 II型大城市
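To make the 2014 thresholds concrete, the sketch below bins population figures with base R's cut(). The populations are invented, `rank_2014()` is a hypothetical helper, and treating each interval as including its lower bound is our assumption, since the table does not say how boundary values are handled:

```r
# Illustrative helper (not part of regioncode): classify populations
# by the 2014 thresholds listed in the table above.
rank_2014 <- function(population) {
  cut(population,
      breaks = c(0, 2e5, 5e5, 1e6, 3e6, 5e6, 1e7, Inf),
      labels = c("II型小城市", "I型小城市", "中等城市", "II型大城市",
                 "I型大城市", "特大城市", "超大城市"),
      right = FALSE) |> # intervals are [lower, upper): an assumption
    as.character()
}

# Invented populations; a missing population yields a missing rank
rank_2014(c(150000, 700000, 12000000, NA))
```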

Pinyin

Pinyin is a phonetic romanization of Chinese characters. Some data may store region names in pinyin instead of Chinese characters. By default, regioncode returns names in Chinese characters. However, thanks to Peng Zhao and Qu Cheng’s pinyin package, users can obtain pinyin output from the regioncode function by setting the argument to_pinyin = TRUE. This option also corrects the romanization for areas with special spellings, such as Shanxi vs. Shaanxi, Inner Mongolia, and the special administrative regions. It works for official names, incomplete names, and administrative area outputs. The following example shows the option applied to different output types:

tibble(
  city = corruption$prefecture,
  cityPY = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989, 
           convert_to = "name",
           to_pinyin = TRUE
           ),
  areaPY = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989, 
           convert_to = "area",
           to_pinyin = TRUE
           )
)

## # A tibble: 10 × 3
##    city   cityPY     areaPY   
##    <chr>  <chr>      <chr>    
##  1 济南市 ji_nan     hua_dong 
##  2 泰州市 tai_zhou   hua_dong 
##  3 松江区 song_jiang hua_dong 
##  4 宜昌市 yi_chang   hua_zhong
##  5 来宾市 liu_zhou   hua_nan  
##  6 怀化市 huai_hua   hua_zhong
##  7 莆田市 pu_tian    hua_dong 
##  8 宜宾市 yi_bin     xi_nan   
##  9 定安县 ding_an    hua_nan  
## 10 襄阳市 xiang_fan  hua_zhong

# Regions with special spelling
regioncode(data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"), 
           year_from = 2019,
           year_to = 2008, 
           convert_to = "name",
           incomplete_name = TRUE,
           province = TRUE,
           to_pinyin = TRUE
           )

##             山西                                                    
##        "shan_xi"       "shaan_xi" "inner_mongolia"      "hong_kong" 
##                  
##          "macao"
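The need for special spellings is easy to see: without tone marks, 山西 and 陕西 romanize identically. The sketch below hard-codes a generic toneless romanization (an assumption standing in for what a romanizer might return) and then applies an override table with the conventional spellings from the output above. `romanize()` is a hypothetical helper, not the package's implementation:

```r
# Toneless pinyin, hard-coded for illustration
default_py <- c("山西" = "shan_xi", "陕西" = "shan_xi", "内蒙古" = "nei_meng_gu")

# Conventional spellings, as shown in the output above
special_py <- c("陕西" = "shaan_xi", "内蒙古" = "inner_mongolia")

# Illustrative helper (not part of regioncode)
romanize <- function(x) {
  out <- default_py[x]
  needs_fix <- x %in% names(special_py)
  out[needs_fix] <- special_py[x[needs_fix]]
  unname(out)
}

romanize(c("山西", "陕西", "内蒙古"))
```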

Provinces

The regioncode function also supports conversions at the provincial level. By setting the argument province = TRUE, users can convert geocodes and names at this level. Chinese provinces also have abbreviations; when the input data contain only abbreviations, users can set the convert_to argument to abbreTocode, abbreToname, or abbreToarea to obtain the desired output type. To receive abbreviation outputs, simply set convert_to = "abbre".

The following example demonstrates the conversion of a vector of province geocodes to their official names and abbreviations:

tibble(
  province = corruption$province_id,
  prov_name = regioncode(data_input = corruption$province_id, 
           convert_to = "name",
           year_from = 2019,
           year_to = 1989,
           province = TRUE),
  prov_abbre = regioncode(data_input = corruption$province_id, 
           convert_to = "abbre",
           year_from = 2019,
           year_to = 1989,
           province = TRUE)
)

## # A tibble: 10 × 3
##    province prov_name      prov_abbre
##       <dbl> <chr>          <chr>     
##  1   370000 山东省         鲁        
##  2   320000 江苏省         苏        
##  3   310000 上海市         沪        
##  4   420000 湖北省         鄂        
##  5   450000 广西壮族自治区 桂        
##  6   430000 湖南省         湘        
##  7   350000 福建省         闽        
##  8   510000 四川省         蜀        
##  9   460000 海南省         琼        
## 10   420000 湖北省         鄂
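When only prefecture- or county-level codes are at hand, the provincial code can be derived arithmetically, because the first two digits of a six-digit administrative division code identify the province. The helper `province_of()` below is a hypothetical name for this sketch, which uses codes from the toy data:

```r
# Illustrative helper (not part of regioncode): keep the two leading
# digits of a six-digit geocode and zero out the remaining four.
province_of <- function(code) {
  code %/% 10000 * 10000
}

# 济南市, 泰州市, and 定安县 from the toy data
province_of(c(370100, 321200, 469021))
```

The results match the province_id values of the corresponding rows in the table above.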

Geographic Units Beyond Provinces

The current version of regioncode encompasses two types of region conversion beyond the provincial level: administrative area and linguistic zones.

Administrative Area

Chinese regions are divided into seven areas for social, political, and military reasons (孙平 2020):

Region	Provincial-level Administrative Unit
华北	北京市, 天津市, 山西省, 河北省, 内蒙古自治区
东北	黑龙江省, 吉林省, 辽宁省
华东	上海市, 江苏省, 浙江省, 安徽省, 福建省, 台湾省, 江西省, 山东省
华中	河南省, 湖北省, 湖南省
华南	广东省, 海南省, 广西壮族自治区, 香港特别行政区, 澳门特别行政区
西南	重庆市, 四川省, 贵州省, 云南省, 西藏自治区
西北	陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区

In certain cases, users may wish to identify the area to which a prefecture or province belongs. regioncode offers a function to convert codes and names of the region (both prefectures and provinces) into areas by setting the output format as “area”:

regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989, 
           convert_to = "area")

##  [1] "华东" "华东" "华东" "华中" "华南" "华中" "华东" "西南" "华南" "华中"
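For provincial names, the area lookup is a direct inversion of the table above. The following base R sketch shows one way to build it; the object names are ours, and the snippet is not the package's implementation:

```r
# The seven areas and their provincial-level units, copied from the table above
areas <- list(
  "华北" = c("北京市", "天津市", "山西省", "河北省", "内蒙古自治区"),
  "东北" = c("黑龙江省", "吉林省", "辽宁省"),
  "华东" = c("上海市", "江苏省", "浙江省", "安徽省", "福建省", "台湾省", "江西省", "山东省"),
  "华中" = c("河南省", "湖北省", "湖南省"),
  "华南" = c("广东省", "海南省", "广西壮族自治区", "香港特别行政区", "澳门特别行政区"),
  "西南" = c("重庆市", "四川省", "贵州省", "云南省", "西藏自治区"),
  "西北" = c("陕西省", "甘肃省", "青海省", "宁夏回族自治区", "新疆维吾尔自治区")
)

# Invert the list into a named vector: province name -> area
area_of <- setNames(rep(names(areas), lengths(areas)),
                    unlist(areas, use.names = FALSE))

unname(area_of[c("山东省", "湖北省", "海南省")])
```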

Linguistic Zone

China is a multilingual country with various dialects. These dialects may be used across several prefectures in a province or even across different provinces. For political and sociolinguistic studies, regioncode includes a function to return the approximate linguistic zones of given geocodes or prefectural names. In the current version, regioncode offers two levels of linguistic zone identification: dialect groups (dia_group, “方言大类”) and dialect sub-groups (dia_sub_group, “分区片”), following the 1987 Language Atlas of China (Li, Xiong, and Zhang 1987). (When province = TRUE, the linguistic conversion is available only at the dialect group level.)

The following example converts the toy data to dialect groups and sub-groups:

tibble(
  city = corruption$prefecture,
  dialectGroup = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989,
           to_dialect = "dia_group"),
  dialectSubGroup = regioncode(data_input = corruption$prefecture, 
           year_from = 2019,
           year_to = 1989,
           to_dialect = "dia_sub_group")
)

## # A tibble: 10 × 3
##    city   dialectGroup dialectSubGroup                             
##    <chr>  <chr>        <chr>                                       
##  1 济南市 冀鲁官话     沧惠片-1,石济片-8                           
##  2 泰州市 江淮官话     泰如片-1                                    
##  3 松江区 吴语         太湖片-1                                    
##  4 宜昌市 西南官话     成渝片-3,成渝片-9                           
##  5 来宾市 西南官话     桂柳片-10                                   
##  6 怀化市 湘语         岑江片-2,吉溆片-3,娄邵片-1,黔北片-3,长益片-3
##  7 莆田市 莆仙区       莆仙区-4                                    
##  8 宜宾市 西南官话     灌赤片-10                                   
##  9 定安县 琼文区       府城片-1                                    
## 10 襄阳市 西南官话     鄂北片-10

Note that the linguistic distribution in China is too complex for precise gauging at the prefectural level, and it continually changes with population dynamics. The linguistic zone output from regioncode is thus for reference rather than rigorous linguistic research.
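The sub-group output packs several entries into one comma-separated string, each followed by a number. For downstream analysis it may be convenient to split such strings into a small table. The sketch below does so in base R; `parse_sub_groups()` is a hypothetical helper, and since the text above does not spell out what the trailing number measures, the column is simply called n:

```r
# Illustrative helper (not part of regioncode): split a string such as
# "沧惠片-1,石济片-8" into one row per sub-group.
parse_sub_groups <- function(x) {
  entries <- strsplit(x, ",", fixed = TRUE)[[1]]
  parts <- strsplit(entries, "-", fixed = TRUE)
  data.frame(
    sub_group = vapply(parts, `[`, character(1), 1),
    n = as.integer(vapply(parts, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# The string returned for 济南市 in the example above
parsed <- parse_sub_groups("沧惠片-1,石济片-8")
parsed
```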

regioncode: One-Step Solution for Chinese Region Conversions

HU Yue, YE Xinyi

2024-03-10

