In 1898 the Russian statistician Ladislaus von Bortkiewicz (1868-1931) used a dataset on deaths in Prussian Army Corps to illustrate what he called the Law of Small Numbers (von Bortkiewicz (1898)). At the time Statistics could only be confidently applied to large datasets where, for example, Gaussian approximations might apply. Von Bortkiewicz wanted to show that methods could be devised to deal with datasets where such approximate methods clearly could not be used as well. He took annual data on deaths due to horse-kicks in 14 Army Corps over 20 years from 1875 to 1894. They had been reported in the official state publication Preussische Statistik. At the start of the period a unified Germany had just been created following their victory over France in the Franco-Prussian war of 1870-71. Germany was not involved in a war again until 1914 when the First World War began, although from the 1880s there were colonisation expeditions in Africa and the western Pacific.
In more recent times the preference for most intermediate to advanced textbooks has, for good reasons, shifted to larger and larger data sets. Nevertheless the fascination with historically interesting datasets remains even if they are small. One such dataset is the horse-kick data and deserves a fresh look with modern tools, particularly at the level of detail at which it was originally reported and not as a gross summary of about 5 or 6 frequencies.
The original data set was of death by horse-kick only, and covered the 20 calendar years from 1875 to 1894. This package offers an extension of the data in two ways: Firstly, it extends the series to 1907, making it for 33 years rather than 20, and, secondly, it provides data in parallel on two further accidental causes of death, by falling from a horse or by drowning. The bigger story is set out in detail in the article (Unwin and Venables (2025)) which this package is intended to accompany.
The following table shows the first few lines of the data set, called
hkdeaths
.
year | corps | regiments | NCOs | kick | drown | fall | vonB_kick |
---|---|---|---|---|---|---|---|
1875 | G | 8 | 0 | 0 | 3 | 0 | 0 |
1875 | I | 6 | 0 | 0 | 5 | 0 | 0 |
1875 | II | 4 | 0 | 0 | 6 | 0 | 0 |
1875 | III | 4 | 0 | 0 | 7 | 1 | 0 |
1875 | IV | 4 | 0 | 0 | 7 | 0 | 0 |
1875 | V | 4 | 0 | 0 | 11 | 0 | 0 |
Most of the column names are self-explanatory, some need further explanation:
regiments is the number of regiments in the corps as given by von Bortkiewicz. The corps sizes are unequal and regiments provides one relative measure of corps size.
NCOs is the number of non-commissioned officer deaths by horse-kick up till 1899. These deaths are included in the kick total.
vonB_kick is the original von Bortkiewicz data. It spans only the first 20 years, and contains a minor transcription error when compared to the original source, Preussiche Statistik.
To get some idea of what is going on in the data, we can show the change over time of the corps annual aggregate deaths for each of the three causes.
hazards <- hkdeaths |>
group_by(year) |>
summarise(horse_kicks = sum(kick),
falls = sum(fall),
drownings = sum(drown), .groups = "drop")
hazards_long <- hazards |>
pivot_longer(cols = c(horse_kicks, falls, drownings),
names_to = "cause", values_to = "deaths")
ggplot(hazards_long) + aes(x = year, y = deaths, colour = cause) +
geom_line(linewidth = 1.5) +
scale_colour_viridis_d() +
labs(title = "Annual Deaths from three Causes", x = "Year", y = "Deaths")
There is a suprisingly consistent fall in the death rate by drowning over the period, of about 2% per annum:
drown_1 <- glm(drownings ~ I(year - 1891), data = hazards, family = poisson(link = log))
exp(coef(drown_1)) |> rbind() |> as.data.frame(check.names = FALSE) |> kable(digits = 2)
(Intercept) | I(year - 1891) |
---|---|
56.89 | 0.98 |
So the estimated mean deaths by drowning in the middle of the series, 1891, was 56.9 and the estimated decline is 2.2% per annum.
It is natural to ask if this decline in drowning death rate is also consistent across corps. This can be studied using a facet plot.
ggplot(hkdeaths) + aes(x = year, y = drown) +
geom_line(colour = "#21908CFF", linewidth=1.0) + facet_wrap(~ corps) +
geom_smooth(method = "glm", formula = y ~ I(x-1891), method.args = list(family = poisson),
linewidth = 1.0, colour = "#440154FF", se = FALSE) +
labs(x = "Year", y = "Deaths", title = "Annual Drowning Deaths per Corps")
Perhaps corps XIV is the one exception, showing a slight increase in drowning deaths per annum over the period. All other corps show a decline to some degree.