FAQs

This page describes the Counter-Trafficking Data Collaborative concept, the Global Dataset creation process, and the future of the project.

About Us

What is the Counter-Trafficking Data Collaborative?

The Counter-Trafficking Data Collaborative (CTDC) is the first global data hub on human trafficking, publishing harmonized data from counter-trafficking organizations around the world. Launched in November 2017, CTDC aims to break down information-sharing barriers and equip the counter-trafficking community with up-to-date, reliable data on human trafficking.

Datasets

What data are available?

Data on CTDC are available either to download or to visualize. The visualizations are powered by aggregate statistics, which are available for download, and anonymized versions of the CTDC datasets are publicly available to download from the site. Additional datasets will be published on CTDC as the platform receives more data from partners, and each dataset type will be detailed here:

The global victim of trafficking dataset

The CTDC global victim of trafficking dataset is the largest of its kind in the world, and currently exists in two forms. The data are based on case management data, gathered from identified cases of human trafficking, disaggregated at the level of the individual. The cases are recorded in a case management system during the provision of protection and assistance services, or are logged when individuals contact a counter-trafficking hotline. The number of observations in the dataset increases as new records are added by the contributing organizations. The version of the dataset that is available to download from the website in CSV format has been k-anonymized. The complete, non-k-anonymized version is not available for download; it is presented throughout the website in visualizations and charts showing detailed analysis.

The global synthetic dataset

In September 2021, CTDC released its first downloadable Global Synthetic Dataset, representing data from over 156,000 victims and survivors of trafficking across 189 countries and territories (where victims were first identified and supported by CTDC partners). The privacy-preserving synthetic data solution, developed at Microsoft Research in the Python programming language, is freely available via GitHub. Please refer to the Definitions section for more information on synthetic data.

The global victim-perpetrator synthetic dataset

In December 2022, CTDC released the second synthetic dataset, the Global Victim-Perpetrator Synthetic Dataset, which was produced using an extension of the algorithm with added support for differential privacy. Please refer to the Definitions section for more information on synthetic data.

Where do the data come from?

The data come from a variety of sources. The data featured in the global victim of trafficking dataset come from the assistance activities of the contributing organizations, including case management services and counter-trafficking hotline logs.

How does human trafficking case data relate to prevalence data?

There are currently no global or regional estimates of the prevalence of human trafficking. National estimates have been produced in a few countries, but they are based on modelling of existing administrative data from identified cases and should therefore be considered only as basic baseline estimates. Historically, it has been difficult to produce estimates of the prevalence of trafficking from new primary data collected, for example, through surveys. This is due to trafficking’s complicated legal definition and the challenges of putting difficult, sensitive questions to respondents in household surveys in an ethical manner.

The only comparable global estimate is the 2017 Global Estimate of Modern Slavery, which estimates the prevalence of the related crimes of forced labour and forced marriage. This estimate was produced by the International Labour Organization (ILO) and the Walk Free Foundation (WFF) in collaboration with IOM. The 2017 report estimates that 40 million people were victims of modern slavery on any given day in 2016. Of these, approximately 25 million people were in forced labour and another 15 million people were in a forced marriage.

CTDC case-level data are from victims of human trafficking who have been identified or assisted by the contributing organizations. As with all data from identified cases, it is challenging to infer to what extent trends within identified victim populations are representative of the total victim population, since trafficking is a crime intended to be undetected and identified cases are not random samples of the population. This does not mean that they are unrepresentative of the population, however, and testimony from survivors of trafficking is one of the best, and one of the few, sources of information available on this complex crime. Such testimony provides detailed data and an opportunity for analysis of the profile and form of trafficking.

How are the global datasets created?

Each dataset has been created through a process of comparing and harmonizing existing data models of contributing partners and data classification systems. Initial areas of compatibility were identified to create a unified system for organizing and mapping data to a single standard. Each contributing organization transforms its data to this shared standard and any identifying information is removed before the datasets are made available.

How is the individual-level data protected?

Step 1
Counter-trafficking case data contain highly sensitive information, and maintaining privacy and confidentiality is of paramount importance for CTDC. For example, all explicit identifiers, such as names, have been removed from the global victim dataset, and some data, such as age, have been transformed into ranges. No personally identifying information is transferred to or hosted by CTDC, and organizations that want to contribute are asked to anonymize their data in accordance with the standards set by CTDC.
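The kind of transformation described in Step 1 can be illustrated with a minimal Python sketch. It is not CTDC's actual procedure: the field names, the list of explicit identifiers, and the age bands below are all hypothetical and chosen only for illustration.

```python
# Hypothetical field names treated as explicit identifiers.
EXPLICIT_IDENTIFIERS = {"name", "phone_number"}

# Hypothetical age bands (inclusive); not the bands used by CTDC.
AGE_BANDS = [(0, 17), (18, 29), (30, 39), (40, 49)]


def age_to_range(age):
    """Replace an exact age with a coarse range label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high in AGE_BANDS:
        if low <= age <= high:
            return f"{low}-{high}"
    # Anything above the last band is grouped into one open-ended range.
    return f"{AGE_BANDS[-1][1] + 1}+"


def deidentify(record):
    """Drop explicit identifiers and swap the exact age for an age range."""
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in EXPLICIT_IDENTIFIERS and field != "age"
    }
    if record.get("age") is not None:
        cleaned["age_range"] = age_to_range(record["age"])
    return cleaned


case = {"name": "Example Name", "phone_number": "000", "age": 19, "gender": "Female"}
print(deidentify(case))  # {'gender': 'Female', 'age_range': '18-29'}
```

In this sketch the exact age never appears in the output record, so the transformation cannot be reversed from the released data alone.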

Step 2
In addition to the safeguards outlined in Step 1, the global victim dataset has been anonymized to a higher level through a mathematical approach called k-anonymization. For a full description of k-anonymization, please refer to the Definitions section.

IOM collects and processes data in accordance with its own Data Protection Policy. The other contributors adhere to relevant national and international standards through their policies for collecting and processing personal data.

Geographic Information Systems (GIS)

How are GIS used?

CTDC uses Geographic Information Systems (GIS) to map the main geographic trends at country level, without pointing to specific route coordinates. More information about this can be found in the Definitions section.

Definitions

Data anonymization

Data anonymization refers to the process by which information which could lead to an individual being identified is removed from data. Anonymization of data is a process of data de-identification which means that the resulting data cannot be linked back to the original data; in other words, it cannot be ‘re-identified’. Often, data anonymization includes data transformation, which involves processes of data structure/format change.

Synthetic data

Since 2019, Microsoft Research has worked with IOM to develop and refine an algorithm to generate synthetic data from CTDC’s sensitive victim case data. Rather than systematically redacting cases, which results in a substantial amount of data being suppressed, the algorithm generates a synthetic dataset that accurately preserves the statistical properties and relationships in the original data. However, the records of the synthetic dataset no longer correspond to actual individuals, and each is constructed entirely from common attribute combinations. This means that none of the attribute combinations in the synthetic dataset can be linked to distinctive individuals (or even small groups of distinctive individuals) in the sensitive dataset or in the world at large. In September 2021, CTDC released its first downloadable Global Synthetic Dataset, representing data from over 156,000 victims and survivors of trafficking across 189 countries and territories (where victims were first identified and supported by CTDC partners). In December 2022, CTDC released the second synthetic dataset, the Global Victim-Perpetrator Synthetic Dataset, which was produced using an extension of the algorithm with added support for differential privacy.

K-anonymization

K-anonymization is a data anonymization technique that redacts cases falling into sets with fewer than k members, where each set is defined by a unique combination of values of the different variables in a dataset. This means that it is not possible to query a dataset and return fewer than a pre-determined (k) number of results, regardless of the query. The appropriate threshold for the number of results depends on the nature of the dataset and its size. Based on research and testing, k=10 for CTDC data, which means cases have been redacted from the Global K-Anonymized Dataset such that queries to the Global Dataset cannot return fewer than 10 results.
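The redaction rule described above can be sketched in a few lines of Python. This is a minimal illustration of k-anonymization by suppressing whole records, not the code used to produce the CTDC dataset; the function name, field names, and toy records are hypothetical.

```python
from collections import Counter


def k_anonymize(records, variables, k=10):
    """Redact every record whose combination of values across `variables`
    is shared by fewer than k records."""
    def combination(record):
        return tuple(record.get(name) for name in variables)

    # Count how many records fall into each set of identical value combinations.
    set_sizes = Counter(combination(r) for r in records)
    return [r for r in records if set_sizes[combination(r)] >= k]


# Toy data: 12 records share one combination of values, 3 share another.
records = (
    [{"gender": "Female", "age_range": "18-29", "citizenship": "AA"}] * 12
    + [{"gender": "Male", "age_range": "30-39", "citizenship": "BB"}] * 3
)

released = k_anonymize(records, ["gender", "age_range", "citizenship"], k=10)
print(len(released))  # 12: the set with only 3 members has been redacted
```

Because every surviving combination of values is shared by at least k records, any query that filters on those variables and matches anything at all matches at least k records.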

Codebook

A codebook is a comprehensive record made available for anyone wishing to understand or analyse the dataset. It is particularly valuable for researchers and analysts. A codebook describes the content and variables of a dataset, including definitions and methodological considerations. It also contains the possible values and formats for all variables. Codebooks are provided on CTDC in order to understand the different data sources of the combined dataset, as well as the particularities of each of the contributions.

Data dictionary

A data dictionary describes the structure of a database or dataset by listing and classifying all variables, and specifying the format within which data is stored. It also includes lookup tables for relevant variables. It is usually aimed at helping programmers or database administrators work with a dataset. Data dictionaries are provided on CTDC especially for the use of future data contributors, so that they understand the format and values that they need to adhere to.

Data standardization

A standardized dataset is a dataset for which common data definitions, formats, categories and structures of all data elements have been agreed. For the CTDC Global Dataset, data from different contributing organizations are combined and standardized in order to produce a unified dataset which adheres to these common standards.
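As a rough illustration of what combining and standardizing contributions can involve, the Python sketch below maps two differently structured records onto one shared set of field names and values. The partner names, field names, codes, and the shared standard itself are hypothetical; the actual CTDC standard is the one described in the codebook and data dictionary.

```python
# Hypothetical allowed values for one field of the shared standard.
STANDARD_GENDER_VALUES = {"Female", "Male", "Unknown"}

# Hypothetical per-partner mappings from local field names and codes to the standard.
PARTNER_MAPPINGS = {
    "partner_a": {
        "fields": {"sex": "gender", "nationality": "citizenship"},
        "values": {"gender": {"F": "Female", "M": "Male"}},
    },
    "partner_b": {
        "fields": {"victim_gender": "gender", "country_of_citizenship": "citizenship"},
        "values": {"gender": {"female": "Female", "male": "Male"}},
    },
}


def standardize(record, partner):
    """Rename fields and recode values so the record follows the shared standard."""
    mapping = PARTNER_MAPPINGS[partner]
    standardized = {}
    for local_name, value in record.items():
        field = mapping["fields"].get(local_name)
        if field is None:
            continue  # fields outside the shared standard are dropped
        # Recode the value if a lookup exists for this field; otherwise keep it.
        standardized[field] = mapping["values"].get(field, {}).get(value, value)
    if standardized.get("gender") not in STANDARD_GENDER_VALUES:
        standardized["gender"] = "Unknown"
    return standardized


combined = [
    standardize({"sex": "F", "nationality": "AA"}, "partner_a"),
    standardize(
        {"victim_gender": "male", "country_of_citizenship": "BB", "case_notes": "..."},
        "partner_b",
    ),
]
print(combined)
# [{'gender': 'Female', 'citizenship': 'AA'}, {'gender': 'Male', 'citizenship': 'BB'}]
```

Keeping the mappings as data rather than code means a new contributor could be added in this sketch by supplying one more mapping entry.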

De-identification

De-identification of data refers to the process of removing or obscuring information from individual-level data in a way that minimizes the risk of an individual being identified through the data. There are different methods of data de-identification, some of which do not transform the data but allow for it to be “re-identified” and some of which permanently remove identifying features from data (such as anonymization).

GIS

GIS stands for Geographic Information System. It is software that helps to visualize, analyse, and interpret geographic data to understand relationships, patterns and trends. A GIS typically allows multiple layers of geographic information to be displayed on a single map. CTDC uses GIS through the ArcGIS mapping software. This software maps the main human trafficking trends based on identified or assisted victim data, at country, state and regional levels, without pointing to specific route coordinates.
