Note on Captive Atlantic Flows: Estimating Missing Data by Slave-Voyage Routes

This essay provides new estimates of the number of captives carried in the Atlantic slave trade during each decade from the 1650s to the 1860s. It relies on two categories of known data—on the routes of voyages and the numbers of captives recorded on those voyages—as a basis for estimation of missing data and totals of captive flows. It uses techniques of Bayesian statistics to estimate missing data on routes and flows of captives. As a framework for the Bayesian estimates, it focuses on analysis of 40 distinct routes linking the African coast to the Americas and traces the captive flows—that is, the number of captives embarked on or disembarked from voyages along those routes. The dataset that provides the basis for this research note is available at: https://doi.org/10.7910/DVN/6HLXO3.


Routes of Atlantic Slave Trade
The underlying data on the Atlantic slave trade are documented most fully through individual voyages: for each of some 35,000 known voyages, data have been assembled systematically (Eltis et al., "Slave Voyages"). From this base, scholars have aggregated the voyage-based data in various ways to interpret aspects of the historical slave trade.
The data have been summarized within several frameworks: by individual voyage, by time period, by national carrier, by region of African departure, by region of American arrival, and by routes linking regions of African departure and American arrival (Curtin 1969, Lovejoy 1982, Manning 1990). Our approach emphasizes systematic analysis of routes linking African and American regions.
The underlying voyage-based data, while ample, are incomplete in two different ways. First, there were additional voyages for which records have not survived. 1 Second, data are missing or incomplete for the voyages that are known. As a result, a great deal of effort has gone into estimating the magnitudes of missing data on voyages and on the captives carried on those voyages. The data have generally been reported, with some variation, in terms of conventional regions of African origin and American delivery of African captives.
In this study, we focus on the second type of incompleteness. We do so by linking the eight African regions of departure to five regions of arrival (four in the Americas, plus "Africa" for voyages that did not reach the Americas), yielding a total of 40 possible routes for slave voyages. 2 Table 1 shows our labeling of the routes as pairings of regions: thus, route 1-2 is from Senegambia to the Caribbean; route 5-4 is from the Bight of Benin to Brazil.
[Table 1. Route labels by region; legible entries: Bight of Biafra, 7 West Central Africa, 8 Southeast Africa.]
Our previous study, in addition to providing descriptive detail on the routes in general and on specific routes, reached three principal conclusions (Manning and Liu 2019, 460). First, of the 40 routes that we documented, ten accounted for 85% of the slaving voyages for which the routes were known, in the period from the 1650s to the 1860s. These ten principal routes are displayed in Map 1, which labels each route by its code number and indicates the number of known voyages along each of these routes.
Second, we showed that, within most routes, the average numbers of captives embarked and disembarked per voyage were remarkably stable over time, from the 1650s to the 1840s. Third, we showed that a slight simplification of the second observation (the assumption that the average numbers of captive embarkations and arrivals remained unchanged for each route) yielded a remarkably precise estimation of the total embarkations and arrivals of captives carried on slave ships for which we have full documentation (Manning and Liu 2019, 462-64).

Known vs. Missing Data
Our first step is exploring known data on the Atlantic slave trade, consisting of voyages for which regions of departure and arrival were known, as well as voyages for which we have data on the number of captives embarked, disembarked, or both. Figure 1 summarizes known data on over 33,000 slave voyages, by decade from the 1650s to the 1860s, at five levels of documentation.
In this study, we expand the relationships that we found among known data, extending the analysis to voyages for which data were missing on the route, on the numbers of captives carried per voyage, or on both. For this analysis, we need to be quite specific on the meaning of "missing data." Known voyages are those listed in the WHCDB-2017 dataset. Missing data consist of specific information on known voyages for which documentation is not available. Missing data on routes are cases of known voyages for which the region of departure, the region of arrival, or both are unknown. Missing data on captives are cases of known voyages for which the number of embarking captives, the number of arriving captives, or both are unknown. Our task in this study is to model the missing data on routes and, thereafter, to model the missing data on captives. At the conclusion we will be able to confirm the total of voyages for which routes are known and add the voyages for which routes have been modeled, so that we will have an estimation of the route along which every known voyage traveled. In addition, we will be able to confirm the numbers of captives reported on known voyages, and estimate the numbers of captives aboard voyages for which they were not reported, so that we will have an estimation of the total number of captives who traveled on all known voyages of each route, by decade.
Source: Manning and Liu (2019), 460.
Map 1. Top 10 slave routes, with number of known voyages, 1650s-1860s.
jwsr.pitt.edu | DOI 10.5195/JWSR.2020.971

Modeling Missing Data in Principle
The steps in our work involve imputing the missing values for routes and captive flows in the WHCDB-2017 dataset, to obtain estimates of embarkations and arrivals of captives in the transatlantic slave trade by route and by decade from the 1650s to the 1860s. We are interested in two key types of statistics: the route distribution (voyage pattern) per decade, and the expected embarkation and disembarkation population per voyage in each route. The proportion of voyages for which we lack information on routes is relatively small (roughly 32% of the voyages); the proportion of voyages for which we lack complete information on captive embarkations and arrivals is large (about 85% of voyages). Thus, while the models of both routes and captive flows are essential to our imputations, the imputation of regions should be more reliable than the imputation of population. We also find that the region or route information can be a significant factor in estimating the captive flow of a voyage. To estimate the above parameters, we propose a two-stage estimation strategy.
First, we conduct a regional imputation, completing the missing regional information for all voyages in the database to obtain a complete pattern of routes for the voyages of each decade. Second, we conduct a population imputation, building a Markov Chain Monte Carlo (MCMC) model to estimate the expected embarkation and disembarkation population per voyage in each route. 3 The two stages of our imputation of data address missing documentation of routes and missing data on flows of captive population. Combining the route and captive flow information, we will have a picture of the volume and distribution of the Atlantic slave trade from the 1650s to the 1860s.
In the first stage, we model the known voyages so as to impute missing routes. We observe that the proportion of voyages with complete information on regions is relatively high: 68% for the database as a whole and above 60% in most decades. At the same time, we cannot assume that the missing data are distributed completely at random among the routes, as the distribution is demonstrably non-random in several regards. 4 Even though we fail to find effective factors to explain most patterns of missing data, the relatively high proportion of complete regional data makes us comfortable in comparing the distribution of known captive embarkations and arrivals for ships with unknown routes to the pattern for known routes, thereby gaining at least a partial basis for imputing unknown routes. The basic idea is to expand the current distribution of documented records to include estimations for unknown records in each decade. 5 In this step, we rely only on documented routes and not on other explanatory factors. 6 Further, thanks to the ample regional data for each decade, the imputation can be performed decade by decade, thus accounting for the fact that the distribution of routes changed significantly over time.
In a nutshell, the first step in our analysis is thus the imputation of all missing routes, so that all voyages in the analysis are attributed either known or imputed routes. 7 Success in this imputation gives us clearly identified routes for each voyage, whether the route was documented or imputed.
3 Markov Chain Monte Carlo (MCMC) analysis provides a systematic and comprehensive method for estimating the parameters for a dataset containing missing data; it imputes the missing data at the same time. It also provides the inference interval along with the estimation.
4 For instance, absence of data on routes and on captive flows clearly depended on time and on the national carrier of slave voyages.
5 Another advantage of this algorithm is that it will not result in revision of known data. That is, it will not move any known voyages from one route to another.
6 In fact, there are voyages with population information but incomplete information on region. We made some adjustments for those cases. Details can be found in later sections.
7 For further details, see the section below entitled "Imputation, Stage 1."
In the second stage, we turn to modeling captive flows to impute the missing population information. We focus on three analytical points. First, on the distribution of captive flows for each of the various routes, we found that the route-specific averages of embarkation and arrival varied among routes. These variations in average captive flow per route can be seen by comparing the 40 routes shown in Figure 4. 8 Second, averaging the numbers of embarkations, arrivals, and losses per voyage shows that variations within a given route can usually be neglected. Therefore, we aggregate the data in terms of route and focus on the population characteristics of each route instead of a single voyage or a single decade. Meanwhile, considering that the proportion of voyages in the dataset with documented captive flows is less than 20%, we rely on the assumption of constant captive flows in each route across decades to make the fullest use of the complete population records. Third, the distribution of routes is quite concentrated.
The top ten routes, for which we have confidence in the validity of our estimates, include more than 80% of voyages, which therefore provides us with confidence in our overall estimation. Even the routes that we label as "weakly conforming" to our assumption of stable captive flows yield results that fit with our framework. 9 To summarize this discussion of our second-stage analysis, we define our model of captive flows by assuming that the expected numbers of embarkations, arrivals, and loss rates remain constant across decades for each of the 40 routes, and thus may be taken as estimates of parameters identifying the expected number of captives for each route. These parameters, when multiplied by the numbers of voyages, for each route and each decade, will yield our estimates of the numbers of embarkations, arrivals, and loss rates for voyages where data are missing. Relying on this model, we reaffirm our conclusion that the variation in slave trade over time was mainly in the number of voyages per decade and, especially, in the distribution of voyages among competing routes of slave trade, while the number of captives per voyage along a given route varied only slightly. By finishing this step, we can replace all of the missing information in the database, WHCDB-2017, with imputed data for routes and populations, giving us a coherent estimate of the totality of the Atlantic slave trade from 1650 through the 1860s.
Applying these principles for modeling our existing and missing data with relevant statistical techniques, we conduct our imputation in two stages. Stage 1 is to estimate unknown routes using a multinomial model for the distribution of routes, to give us a full set of routes. 10 Stage 2 is to estimate unknown flows of captives, using a Poisson model for the distribution of embarkation population and a binomial model for the distribution of arriving populations. This gives us a nearly full set of estimates for embarkations, arrivals, and rates of loss at sea.
8 For instance, embarkations on the African coast were smaller for Senegambia than for Angola; arrivals on the American coasts were larger for the Caribbean than for North America. In an important further point, the variations in average captive flows within a given route are smaller and more random than the variations in average captive flows among routes.
9 For further details on our handling of these weakly conforming routes, see the section below, "Imputation, stage 2."
10 At the last step in Stage 1, all the known flows of captives are linked to now-known routes.
Table 2 lists the full set of datasets and sub-datasets in our analysis. Data0 through Data3 are sets of original information on documented voyages. Data4, 5a, 5b, and 6 are generated through stage 1 of the imputation. To indicate the expansion in available information that is achieved with imputation of missing data on routes and captive populations, Table 2 shows the number of voyages that result for each of the post-imputation data sets (from Data4 to Data6), as compared with the original number of documented voyages (Data0 to Data3).

Imputation, stage 1: estimation of all routes by multinomial model
We use a multinomial model to impute the allocation among the 40 routes of the remaining documented voyages (roughly 10,500, the difference between Data0 and Data1). The multinomial distribution is a classic model for categorical data, which makes it suitable for modeling the distribution of routes. The resulting voyage pattern by route and by decade, pre- and post-imputation, is shown in Figures 2 and 3. We emphasize that, in this step, we conduct the imputation decade by decade, to account for the changing distribution of routes across decades. Since more than 60% of routes are known for each decade, we are confident in modeling the distribution of routes independently by decade. We assume that the voyage pattern follows the multinomial distribution $\mathrm{Multinomial}(n, p)$ with $\sum_{i,j} p_{ij} = 1$, where $n$ is the total number of voyages and $p_{ij}$ is the probability of a voyage from region $i$ to region $j$. Let $i$ be the embarkation region code and $j$ be the disembarkation region code.
Besides the voyages with complete routes, whose number on route $i$-$j$ we write as $n_{ij}$, we have three types of observations: voyages with only the embarkation region, voyages with only the disembarkation region, and voyages with neither. The estimator of $p_{ij}$ is $\hat{p}_{ij} = n_{ij} / \sum_{i,j} n_{ij}$, and the estimator of the voyage pattern is $\hat{p} = (\hat{p}_{ij})$. For the voyages with only embarkation information, namely, the voyages from region $i$, the distribution of the disembarkation region is $j \mid i \sim \mathrm{Multinomial}(1, (\hat{p}_{i1}, \ldots, \hat{p}_{i5}) / \sum_{j} \hat{p}_{ij})$. For the voyages with only disembarkation information, namely, the voyages to region $j$, the distribution of the embarkation region is $i \mid j \sim \mathrm{Multinomial}(1, (\hat{p}_{1j}, \ldots, \hat{p}_{8j}) / \sum_{i} \hat{p}_{ij})$. For the voyages with neither, namely, the voyages without regions, the distribution of the route is $(i, j) \sim \mathrm{Multinomial}(1, \hat{p})$.
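The conditional draws described above can be illustrated with a short Python sketch for a single decade. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function name `impute_routes`, the use of `None` for a missing region code, and the fallback for regions with no documented route are all hypothetical choices made here.

```python
import random
from collections import Counter


def impute_routes(voyages, seed=0):
    """Fill missing (embarkation, disembarkation) codes for one decade.

    voyages: list of (i, j) tuples; None marks a missing region code.
    Documented codes are never altered; only missing codes are drawn.
    """
    rng = random.Random(seed)
    # Count fully documented routes and turn counts into proportions.
    counts = Counter(v for v in voyages if v[0] is not None and v[1] is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one fully documented route")
    p_hat = {route: c / total for route, c in counts.items()}

    def draw(keep):
        """Draw one documented route, restricted to routes passing `keep`.

        Assumption: if no documented route matches the known region,
        fall back on the distribution over all documented routes.
        """
        routes = [r for r in p_hat if keep(r)] or list(p_hat)
        weights = [p_hat[r] for r in routes]
        return rng.choices(routes, weights=weights, k=1)[0]

    completed = []
    for i, j in voyages:
        if i is not None and j is not None:
            completed.append((i, j))                      # known route, unchanged
        elif i is not None:
            completed.append((i, draw(lambda r: r[0] == i)[1]))  # draw j given i
        elif j is not None:
            completed.append((draw(lambda r: r[1] == j)[0], j))  # draw i given j
        else:
            completed.append(draw(lambda r: True))        # draw the whole route
    return completed


# Toy decade: three voyages on route 1-2, two on route 5-4, three incomplete.
example = [(1, 2)] * 3 + [(5, 4)] * 2 + [(1, None), (None, 4), (None, None)]
filled = impute_routes(example)
```

Because the draw is restricted to documented routes, no known voyage is moved from one route to another, which matches the property noted in the text.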
In the full dataset of 33,345 voyages, we have 22,803 voyages with complete regional information. In addition, the dataset has 1708 voyages with only embarkation regions, 6048 voyages with only disembarkation regions, and 104 voyages with neither of them. The missing regions can be assigned by the above model. Further, we find 3506 voyages containing captive disembarkation but incomplete route information, and there is a strong correlation between the disembarkation population and the region: that is, a voyage with a high disembarkation population has a better chance of having departed from certain specific regions. To capture this feature, we divide the dataset into two subsets: the voyages with captive disembarkations less than or equal to 300, and those with captive disembarkations above 300. 11 We assume the multinomial parameters are different for those groups. With 22 decades in total and three groups (missing arrival population, arrival population <= 300, arrival population > 300) within each decade, we divide the dataset into 66 subsets, although some of them are empty. We apply our model (implemented in the R programming language) to the 66 subsets separately in order to get the final imputation of regions and routes for all the voyages (see R code in Appendix).
Source: Data1
11 We focus only on the impact of disembarkation population and ignore that of embarkation population, since the number of voyages with only embarkation population but incomplete route information is fewer than 500 in total (the precise number is 428).

Imputation, stage 2: MCMC estimate of missing and total captive populations
Based on the results of stage 1, we know the route information for all known voyages. We now allocate all known captive flows among the complete set of voyages after imputation. Figure 4 shows the known captive flows after imputation of routes; this is the database after stage 1 of the imputation.
The other characteristics of each route included the regional system for supply of captives, the size of ships, the participation of European and African merchants in trade, the level of demand for captives in varying regions of the Americas, and the sailing conditions by route, including the timing of voyages by route. All of these other factors differed for each route and may have combined to give a stable character to the slave trade for each route (Miller 1981).
For two groups of voyages, discrepancies among decennial flows appear relatively serious. That is, within these groups, the means for the key variables have a greater variance than that for the dataset overall. We label these as voyages that are "weakly conforming" rather than "strongly conforming" to assumptions of constant flow along each route. In total, the number of voyages that we describe as weakly conforming to our assumption of constant flow is less than 5% of all known voyages. There are four categories of "weakly conforming" routes. The first consists of the voyages during the decades of the 1850s and 1860s (especially for departures from the Bight of Benin and West Central Africa). In those cases we have made an adjustment of our imputation of slave-trade parameters in routes 5-2 and 7-2, using a step function to capture the change in captive flows in the 1850s and 1860s.

Figure 4. Known captive flows for all voyages by route. Average embarkation (red) and arrival (blue) populations per voyage, by route by decade.
Source: Data 5a (7318 voyages) and 5b (17,514 voyages).
This was the era of fully illegal slave trade, during which the previous conditions of routes were disrupted by anti-slavery squadrons; in addition, the number of fully documented routes was well under 100 for the 1850s and 1860s, so that the small samples are not dependable for estimation.
The second problem area is that of the routes from Southeast Africa to all regions of the Americas (routes 8-1 to 8-4 in Figure 4), from the 1780s to the 1860s: the combination of weak documentation and long voyages yields relatively erratic patterns for each of these routes. Third, an additional variation is found with routes terminating in "Africa" from the 1810s to the 1860s, consisting mainly of voyages captured by anti-slavery squadrons, for which the captives were embarked mainly in the Bight of Benin, Bight of Biafra, and West Central Africa and disembarked principally in Sierra Leone and St. Helena. The patterns of these voyages, represented in the right-hand column of Figure 4, appear on inspection to have a relatively high variance. Fourth, besides the above cases, certain routes include fewer than ten documented records for which we have complete data on captive flows. This is more a limitation of the data in the database than a challenge to our constant-captive-flows assumption: the number of such routes (fifteen) may seem large, but in fact all those routes together account for less than 2% of the voyages in the dataset. 12
In practice, of these four types of weakly conforming routes, we found that only the first case required us to adjust our procedure, which we did by calculating separate parameters for routes 5-2 and 7-2 for the decades of the 1850s and 1860s.
Based on inspection of the data, we make three assumptions in our model, as given above: for all voyages along each route, across time, we assume a constant expected 1) number of embarkations, 2) number of arrivals, and 3) rate of loss of captives. Since the decade and route of every voyage are now known (the routes because of the stage 1 imputation), that leaves only the distribution of captives among routes to be estimated at this stage of the imputation. For voyages on which the embarkation population is unknown, we assume that the data follow the Poisson distribution. In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume. 13 We assume that the expected level of embarkation varies for different routes and decades: it may be affected by the vessels, the current economic situation, and even the weather. This expected level of embarkation determines the parameter in the Poisson distribution, and we use the average embarkation population per voyage to estimate it.
12 The 15 routes meeting this criterion are 1-3, 1-5, 2-3, 2-4, 3-3, 3-4, 3-5, 4-3, 4-4, 4-5, 5-1, 5-3, 6-3, 8-1, and 8-5.
13 The distribution describes the properties of each route rather than each voyage: that is, each voyage within a given route is assumed to have the same expected number of embarkations, arrivals, and losses.
For voyages in which the arrival population is unknown, but the embarkation population is known, we assume that the arrival population follows the binomial distribution. The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p; the Binomial-Poisson hierarchical model is a classic way to model survival. The parameter p in the binomial distribution is a measure of the expected survival, in other words, 1 - loss rate.
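In symbols, the two-level model described above can be written as follows. The notation is introduced here only for illustration and is an assumption rather than the authors' own: $E_{rk}$ and $A_{rk}$ are the embarkation and arrival populations of voyage $k$ on route $r$, $\lambda_r$ is the expected embarkation population per voyage on route $r$, and $q_r$ is the loss rate on route $r$.

```latex
E_{rk} \sim \mathrm{Poisson}(\lambda_r), \qquad
A_{rk} \mid E_{rk} \sim \mathrm{Binomial}(E_{rk},\; 1 - q_r)
```

Under this reading, the binomial success probability is the survival rate $1 - q_r$, and a standard property of the Poisson distribution (thinning) gives $A_{rk} \sim \mathrm{Poisson}(\lambda_r (1 - q_r))$ marginally, so that the expected number of arrivals per voyage on route $r$ is $\lambda_r (1 - q_r)$.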
We divide all the records into four groups, according to whether the embarkation population, the arrival population, both, or neither are documented, so that the estimation of parameters can utilize the information on incomplete voyages. The undocumented populations are the values we need to estimate. Our strategy is to utilize all voyages with known data to estimate the parameters in the model, then predict the unknown numbers by the model. We run the model route by route, and the results give each parameter its interpretation. In the model for the known data, the embarkation population of a voyage follows the Poisson distribution and the arrival population, given the embarkation population, follows the binomial distribution. The parameter for the Poisson distribution is the expected embarkation population per voyage for a specific route, and the parameter for the binomial distribution is determined by the expected corresponding loss rate. We then impute the missing populations from the fitted model.
The imputation is done by the RStan procedure (the R interface to Stan). We employ a non-informative prior to fit the model. We run it for 4 chains with random initial values and 20,000 iterations for each route, of which the first 5000 iterations are for burn-in. The model fitted by MCMC can also be interpreted as a multi-level model, and the multi-level structure accommodates the over-dispersion relative to the Poisson assumption. 14 Furthermore, the model itself ensures that all the imputed voyages have a positive loss rate. 15
In the results of the imputation as displayed in Table 3, the parameters are calculated through common procedures for all data, with two exceptions that are discussed below. The table shows the estimated parameter for each route, the 95% confidence range for each parameter, and the number of voyages included in the calculation of each parameter. Overall, parameters are calculated on the basis of 5125 voyages out of the total of 33,345 voyages. We believe this is a statistically adequate basis for projecting embarkation, survival, and disembarkation for the full 33,345 voyages.
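To make the route-by-route estimation more concrete, the following Python sketch estimates the two parameters for one route from fully documented voyages only. It deliberately swaps in a simpler technique than the one described in the text: closed-form conjugate updating (a Gamma prior for the Poisson mean, a Beta prior for the loss rate) instead of MCMC in Stan, and it ignores partially documented voyages. The function name, the weak prior values, and the toy data are assumptions made for illustration.

```python
import random


def route_posterior(pairs, draws=4000, seed=1, a0=0.001, b0=0.001):
    """Posterior summaries for one route.

    pairs: list of (embarked, arrived) counts for fully documented voyages.
    Returns posterior mean and a 95% interval for the expected embarkation
    per voyage ("lambda") and for the loss rate ("loss_rate").
    """
    n = len(pairs)
    total_emb = sum(e for e, _ in pairs)
    total_arr = sum(a for _, a in pairs)
    losses = total_emb - total_arr

    # Gamma(a0, rate b0) prior on lambda -> Gamma(a0 + sum E, rate b0 + n).
    shape, rate = a0 + total_emb, b0 + n
    # Beta(1, 1) prior on the loss rate -> Beta(1 + losses, 1 + survivors).
    alpha, beta = 1 + losses, 1 + total_arr

    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    lam = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(draws))
    loss = sorted(rng.betavariate(alpha, beta) for _ in range(draws))

    def summary(samples, mean):
        return {
            "mean": mean,
            "lo": samples[int(0.025 * draws)],
            "hi": samples[int(0.975 * draws)],
        }

    return {
        "lambda": summary(lam, shape / rate),
        "loss_rate": summary(loss, alpha / (alpha + beta)),
    }


# Toy data (invented): three fully documented voyages on one route.
post = route_posterior([(300, 270), (320, 280), (280, 260)])
```

With the toy data, the posterior mean of the embarkation parameter is close to the sample average of 300 and the posterior mean loss rate is close to 90/900. A full treatment along the lines of the text would also bring in voyages with only one of the two counts documented.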
The error levels we display in Table 3 are underestimates of error margins. The first source of additional inaccuracy is that our estimation is based on a fixed imputation of regions, so that we do not take into account the "variance" of the distribution of voyages. Secondly, our estimation of totals is based on the captive flow character of each route, yet each single voyage will also be affected by some randomness. The specific interpretation of the margin of error as reported here is that it is the margin of error in the expected total captive flow.
There are two exceptions in our calculation, for the "weakly conforming" data within routes 5-2 and 7-2 (labeled 52 and 72 in Table 3), from the Bight of Benin and West Central Africa to the Caribbean. In these exceptions, parameters are calculated separately (with the same algorithm, but only for the data of the 1850s and 1860s). The fully illegal slave trade to Cuba dominated these routes in the 1850s and 1860s.
In Table 3, the rows labeled 52 and 72 give estimates for routes 5-2 and 7-2 for the period 1650s-1840s, while the rows labeled 52 (late) and 72 (late) give estimates for the 1850s-1860s. As can be seen, cargoes were larger and survival rates lower for these "late" routes. For other routes and voyages that we identified earlier as potentially "weakly conforming," we found that, in practice, the data conformed surprisingly strongly to our assumption of constant flow. The WHCDB-2017 dataset includes 354 voyages disembarking in "Africa," all in the decades 1810s to 1860s; they came in roughly equal numbers from the Bight of Benin, Bight of Biafra, and West Central Africa and ... North America. We still need to emphasize that for certain routes, highlighted in Table 3, the number of voyages was very small, so that the estimates for those routes lack precision. 16
14 The "non-informative prior" means that we have only the data and no prior belief about the parameters. The Poisson distribution assumes that the mean and variance should be equal, but in the data we always find the sample variance to be greater than the sample mean: this phenomenon is called over-dispersion.
15 The model is run route by route, but there are different types of voyages on each route.
Note to Table 3: Estimates show 95% confidence range. The 52 and 72 rows give the parameters of routes 5-2 and 7-2 in the 1650s-1840s; the 52 (late) and 72 (late) rows give the parameters of routes 5-2 and 7-2 in the 1850s-1860s. Highlighted rows have fewer than ten documented cases. Note that this is a summary of Data7.
In the concluding step of our imputations, we multiply each of the parameters from Table 3 by the number of voyages in each decade for each route, as given in Data4. 17 Table 4 summarizes the results of those calculations: it gives estimates for the transatlantic slave trade by decade, giving numbers of known voyages and imputed embarkation and arrival populations and loss rate.
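The concluding multiplication, including the step function for the "late" parameters, can be sketched in Python as follows. All numbers below are invented for illustration and are not values from Table 3 or Data4; the function names and the choice to derive arrivals from the embarkation parameter and the loss rate are also assumptions of this sketch.

```python
def route_params(route, decade, base, late, late_start=1850):
    """Step function: use the 'late' parameters from late_start onward,
    but only for routes that have a separate late estimate."""
    if decade >= late_start and route in late:
        return late[route]
    return base[route]


def decade_totals(voyage_counts, base, late):
    """Aggregate captive flows by decade.

    voyage_counts: {(route, decade): number of voyages}
    base, late:    {route: (embarkations per voyage, loss rate)}
    Returns {decade: (embarked, arrived, aggregate loss rate)}.
    """
    sums = {}
    for (route, decade), n in voyage_counts.items():
        emb_per_voyage, loss_rate = route_params(route, decade, base, late)
        embarked = n * emb_per_voyage
        arrived = embarked * (1 - loss_rate)
        e, a = sums.get(decade, (0.0, 0.0))
        sums[decade] = (e + embarked, a + arrived)
    return {d: (e, a, (1 - a / e) if e else 0.0) for d, (e, a) in sums.items()}


# Hypothetical parameters: (embarkations per voyage, loss rate).
base = {"5-2": (300.0, 0.10), "7-2": (350.0, 0.08)}
late = {"5-2": (500.0, 0.15)}  # separate estimate for the 1850s-1860s
counts = {("5-2", 1840): 2, ("5-2", 1850): 1, ("7-2", 1850): 1}
totals = decade_totals(counts, base, late)
```

In this toy case the 1850s total mixes one voyage at the late 5-2 parameters with one voyage at the unchanged 7-2 parameters, which is how a step function confined to specific routes enters the decade totals.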
In a further summary of the results of our imputations, Table 5 displays data on the busiest ten of the 40 routes in the Atlantic slave trade. The table shows the number of voyages along the ten routes with the largest amount of traffic for six successive periods. Total departures and arrivals for each route in each period are estimated as the number of voyages along that route multiplied by the appropriate embarkation or arrival parameter. These ten routes accounted for 83.5% of the voyages and over 85% of the captives in the WHCDB-2017 dataset, and the remaining 30 routes accounted for the other 16.5% of the voyages.

Conclusion
Figure 5 presents the data from Table 4, 1650-1870. 18 In each succeeding case, the imputations are based on progressively larger numbers of cases, thus increasing the precision of the analysis. In addition, the two MCMC analyses allow estimation of error margins. In sum, the combination of the separate analyses shows that the various approaches are relatively consistent with each other, but that different statistical techniques provide different perspectives on the size of missing data. In particular, we argue that our approach of analyzing missing data through the framework of routes shows the significance of the routes of slave trade, not only as a technique for estimation but also as a framework through which to consider the complex history of the Atlantic slave trade. The analysis by routes included the largest number of cases in each imputation, and arguably gives the most statistically valid result.
18 Table 4. Also note that the Slave Voyages site has updates after TASTDB-2010 with revisions and expansions, which could be explored with the methods used here.
As a final observation, we emphasize a result that may seem surprising but that confirms the logic of our analysis.
That is, the aggregated results show a declining average loss rate of captives over time, even though we have assumed that the number and proportion of losses per individual voyage remained unchanged over time for each route. This means that the captive loss rates declined over time not because of improved health conditions on the vessels but because, with time, larger proportions of voyages took low-mortality routes.
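A small worked example shows how the aggregate loss rate can fall while every route's own loss rate stays fixed. The two routes, their parameters, and the voyage counts below are invented for illustration only.

```python
def aggregate_loss_rate(voyages_by_route, params):
    """Aggregate loss rate over a set of routes.

    voyages_by_route: {route: number of voyages}
    params:           {route: (embarkations per voyage, loss rate)}
    """
    embarked = sum(n * params[r][0] for r, n in voyages_by_route.items())
    lost = sum(n * params[r][0] * params[r][1] for r, n in voyages_by_route.items())
    return lost / embarked


# Hypothetical routes with constant per-route parameters.
params = {"high-mortality": (300.0, 0.20), "low-mortality": (300.0, 0.10)}

# Only the mix of voyages between the two routes changes over time.
early_mix = {"high-mortality": 80, "low-mortality": 20}
late_mix = {"high-mortality": 20, "low-mortality": 80}

early_rate = aggregate_loss_rate(early_mix, params)
late_rate = aggregate_loss_rate(late_mix, params)
```

Here the aggregate rate moves from 18% to 12% purely because the share of voyages on the low-mortality route rises, which is the composition effect described in the text.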