Introduction

Established in 2012, the German Cancer Consortium (DKTK) is an alliance connecting university medical center-based comprehensive cancer centers (CCC) and the German Cancer Research Center (DKFZ) with the goal of fostering translational cancer research [1]. For that purpose, the Clinical Communication Platform (CCP), a key instrument for cross-center networked research, operates a federated data warehouse system populated with real-world data (RWD) on patients and biosamples. With increasing data volume, the CCP’s function is shifting from finding and recruiting patients for clinical trials to compiling patient cohorts and using their data directly for research purposes. Activities in clinical data science and the rising potential of machine learning algorithms promise enhanced (translational) value of such RWD [2,3,4,5,6]. To invigorate research activities in clinical epidemiology and outcomes research in Germany and to offer a joint interface for international collaboration, we introduce the pan-cancer multicenter clinical cohort of the DKTK’s CCP.

Starting with nine sites, the CCP has grown beyond the original borders of the DKTK network and currently connects fourteen university hospital-based cancer centers, including the largest CCCs designated by the German Cancer Aid (DKH) (Footnote 1). The cohort is considered representative of a specific section of real-world cancer care in Germany, as it mirrors the most advanced care standards of tertiary care centers with pioneering potential for other hospitals.

In this cohort profile, we describe the technical basis of the CCP infrastructure as well as methodological aspects of the federated analysis applied here. Our analysis details patient demographics, cohort growth and disease-specific statistics. To convey the quality and quantity of the available data, this cohort overview is complemented by an in-depth inquiry into four disease-specific sub-cohorts (pancreatic cancer, laryngeal cancer, kidney cancer and cancer of the thyroid gland), for which we provide exemplary diagnosis- and treatment-related analyses as an in-silico validation instrument. Finally, we discuss how future projects can use the cohort data to tap their translational potential.

Methods

Ethics and patient consent

The cohort profile is the result of a federated analysis of patient data that required only aggregated, non-personal information to be exchanged among the sites and allowed all personal data to remain safely within each hospital. All patients were treated and observed according to institutional guidelines. In this setting, no ethics vote or informed consent is legally required. Additionally, ten participating centers approved the project (ethics committees of seven centers independently approved the project, three more centers accepted the initial vote).

Data infrastructure: federated concept of the CCP

The cohort is based on the CCP’s federated system of so-called bridgeheads, which have been customized for the collection, storage and analysis of multi-center RWD [8]. Bridgeheads serve as local data hubs that facilitate effective cooperation and the exchange of (pseudonymized) data. For the participating university hospitals, this infrastructure guarantees sovereignty over their data [9]. Local IT administration safeguards data entering and leaving their servers and ensures that local rules and regulations are properly applied. Figure 1 illustrates the federated concept of the CCP’s data infrastructure. The institutions that contributed data to the cohort profile are Charité Universitätsmedizin Berlin, Hospital of the Carl Gustav Carus Technical University Dresden, University Hospital Essen, University Hospital Frankfurt, University Hospital Freiburg, University Medical Center of the Johannes Gutenberg University Mainz, Hospital of the Technical University Munich, Hospital of the Ludwig Maximilians University Munich, Hospital of the Eberhard-Karls University Tübingen, University Medical Center Hamburg-Eppendorf, Comprehensive Cancer Center Hannover, Mannheim University Medical Center, Comprehensive Cancer Center Ulm, and University Hospital Würzburg.

Fig. 1

CCP-Bridgehead infrastructure and federated analysis

Bridgeheads hold a specified set of data in a standardized and extensible format covering the most significant information on the patients’ diagnoses and events over their course of disease and treatment. In addition to this clinical information, the availability of liquid or tissue biosamples is also covered. The data set builds on the Unified Basic Oncological Data Set (German: Einheitlicher Onkologischer Basisdatensatz) defined and maintained by the Association of German Tumor Centers (German: Arbeitsgemeinschaft Deutscher Tumorzentren, ADT for short), the Association of Population Based Cancer Registries in Germany (German: Gesellschaft der epidemiologischen Krebsregister in Deutschland e.V.) and the Platform §65c (a panel of experts consisting of one representative from each of the state cancer registries in Germany). For reportable events, healthcare providers are required by law to report ADT-formatted data to the clinical cancer registries of the German federal states. While many other routine documentation data sources provide only diagnoses, treatments or outcomes, the CCP’s clinical data provide patient- and diagnosis-related as well as treatment- and outcome-related information. For example, the CCP data make it possible to map the treatment modalities of patients with primary cancer of the colon who developed liver metastases after first-line systemic therapy and subsequently to analyze their overall survival.

Bridgeheads support various methods of local or cross-site pseudonymization with fault-tolerant, privacy-preserving record linkage [10, 11]. This allows the bridgeheads to be extended with data from other sources, e.g., studies conducted within the DKTK or the primary routine documentation systems within each hospital. The federated infrastructure of the CCP thus has the advantage that further data elements can be joined from already connected sources (e.g., dose information for substances administered in systemic therapy from the source tumor registry data) and that adjacent data sources containing, e.g., laboratory parameters or radiological imaging can be connected.
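To make the idea of fault-tolerant, privacy-preserving record linkage more concrete, the following Python sketch shows one widely used family of techniques: identifiers are split into character bigrams, encoded into a keyed Bloom filter, and compared by Dice similarity, so that small spelling differences still yield a high score without exchanging clear-text names. This is an illustration only. The text above does not specify which algorithm the bridgeheads use, and all names and parameters (`encode`, `dice`, `size`, `num_hashes`, the shared secret) are assumptions.

```python
import hashlib
import hmac


def bigrams(text):
    """Split a normalized, padded string into its set of character bigrams."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(text, secret, size=1024, num_hashes=4):
    """Encode a string as the set of bit positions of a keyed Bloom filter.

    `secret` is a hypothetical key shared by the linking parties; without it
    the bit positions cannot be recomputed from a guessed name.
    """
    bits = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return frozenset(bits)


def dice(a, b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


SECRET = b"example-key-for-illustration-only"  # hypothetical shared key
same = dice(encode("Meier", SECRET), encode("Meier", SECRET))
typo = dice(encode("Meier", SECRET), encode("Meyer", SECRET))
other = dice(encode("Meier", SECRET), encode("Schmidt", SECRET))
```

In such a scheme, a threshold on the similarity score decides whether two encoded records are treated as the same patient; choosing that threshold trades missed links against false links.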

Data quality assessment

RWD research is often accompanied by data quality issues [12,13,14], for example, when patient data are incomplete or not documented at all. This reduces the power of data analyses and may lead to biased results if the missingness is not at random, for instance, when histopathological information is more often missing in certain patient groups because of guidelines or for technical reasons. Concerning the cohort presented here, it is important to note that patient data were derived from DKTK sites and other university hospital-based cancer centers, representing a specific selection of tertiary cancer care in Germany.

When working with data from 14 different sites, some degree of heterogeneity regarding collection and annotation of data can be expected. To monitor and, if necessary, improve data quality, especially with respect to completeness and syntactic validity, the CCP bridgeheads are connected to a central metadata repository (MDR) that contains the definitions of the agreed upon data elements. The MDR is used to automatically generate standardized quality reports that are used for cross-site comparative data quality assessments and to unveil data inconsistencies [15] such as invalid values for post-operative residual tumor status or the missingness of the mandatory ICD-coded diagnosis.
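As an illustration of how metadata definitions can drive such quality reports, the following Python sketch checks records for completeness and syntactic validity against a small set of element definitions. It is not the CCP’s MDR or its report generator; the element names, the permitted residual tumor codes and the function `quality_report` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ElementDefinition:
    """Hypothetical stand-in for one data element defined in a metadata repository."""
    name: str
    mandatory: bool
    permitted_values: Optional[FrozenSet[str]] = None  # None: any non-empty value is accepted


# Assumed definitions for illustration; the real definitions live in the MDR.
DEFINITIONS = [
    ElementDefinition("icd10_code", mandatory=True),
    ElementDefinition(
        "residual_tumor",
        mandatory=False,
        permitted_values=frozenset({"R0", "R1", "R2", "RX"}),
    ),
]


def quality_report(records, definitions=DEFINITIONS):
    """Count missing and syntactically invalid values per data element."""
    report = {}
    for definition in definitions:
        missing = 0
        invalid = 0
        for record in records:
            value = record.get(definition.name)
            if value is None or value == "":
                missing += 1
            elif (definition.permitted_values is not None
                  and value not in definition.permitted_values):
                invalid += 1
        report[definition.name] = {
            "records": len(records),
            "missing": missing,
            "invalid": invalid,
            "mandatory": definition.mandatory,
        }
    return report


example_records = [
    {"icd10_code": "C25.0", "residual_tumor": "R0"},
    {"icd10_code": "", "residual_tumor": "R9"},   # missing diagnosis, invalid R code
    {"icd10_code": "C64"},                         # residual tumor status not documented
]
example_report = quality_report(example_records)
```

Because each site would run the same definitions, the resulting counts are directly comparable across sites, which is the property a cross-site comparative assessment relies on.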

The data presented here are facility-based. Compared with German cancer registry data, the CCP data have similar but less complex demands for harmonization. Most importantly, harmonization within the CCP only comprises a within-facility “best-of” information selection, whereas registries perform a more comprehensive “best-of” selection from concurrent sources. However, to keep comparability with registry data high, data management and processing are based on the standards set by common practice of the German cancer registries [16].

Additionally, we deemed it essential to have some information about the conditions under which documentation was conducted. We launched a survey among the cancer registry units of the participating sites; 13 out of 14 sites answered the email-administered questionnaire. Where required, telephone calls were made to provide explanations. Most importantly, we asked whether the registry units had established processes to check the validity, completeness and plausibility [16] of their data. While a detailed description of the survey’s findings is beyond the scope of this paper, it is important to state that the majority of the study sites (12 out of 13) conduct internal and software-based data plausibility and validity checks. The responding units also indicated that 87.5% of cases are locally registered within the first five months after the event.

Statistical analysis

For statistical analyses a federated procedure was applied: Instead of transferring patient data from multiple sites to a central database to conduct statistical analysis, analyses were performed at local facilities. Only aggregated and non-disclosive data were transferred for manual cross-site result aggregation. This approach is conceptually based on what is proposed by federated learning (FL) software implementations following the principle ‘not bringing the data to the analysis, but bringing analysis to the data’ [17, 18]. Following a coordinated statistical analysis plan, an analysis script was designed in the statistical programming language R (Version 4.0.4) [19]. At each site, the script was executed on a local copy of the CCP-bridgehead data. Data processing and successive analyses were conducted between December 2021 and April 2022.
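The principle of exchanging only aggregates can be sketched as follows. The example is written in Python rather than R and is not the analysis script used for this cohort profile; the function names and the choice of transferred quantities (count, sum and sum of squares of age) are assumptions. It shows that a pooled mean and standard deviation can be obtained without any patient-level value leaving a site.

```python
import math


def local_summary(ages):
    """Runs at one site: reduce patient-level ages to three aggregate numbers."""
    return {
        "n": len(ages),
        "sum": sum(ages),
        "sum_sq": sum(age * age for age in ages),
    }


def pool(summaries):
    """Runs centrally: combine site aggregates into pooled n, mean and sample SD."""
    n = sum(s["n"] for s in summaries)
    if n < 2:
        raise ValueError("at least two observations are needed for a standard deviation")
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # Sample variance from sums; clamp at zero to absorb rounding error.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return n, mean, math.sqrt(variance)


# Hypothetical ages at two sites, for illustration only.
site_a = local_summary([60, 70])
site_b = local_summary([50, 80, 90])
pooled_n, pooled_mean, pooled_sd = pool([site_a, site_b])
```

The pooled result is identical to what a central analysis of all five values would give, which is why this kind of aggregation needs no access to the underlying records. A production setting would additionally have to ensure that small counts are not disclosive.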

The statistical analysis focused on overview figures characterizing the patient cohort stratified by disease (according to the tenth version of the International Statistical Classification of Diseases and Related Health Problems, ICD-10), i.e., counts of respective primary diagnoses and available biosamples, mean and standard deviation of age at primary diagnosis, the percentage of patients who survived a five-year period after diagnosis (5-year overall survival) including 95% confidence intervals, the percentage of female patients as an indicator of gender distribution, and an estimator of cohort coverage. Coverage estimation was conducted to evaluate what share of patients with the respective diagnosis received examination or treatment in medical centers of the participating sites. Coverage is calculated by dividing the number of cohort patients with the respective diagnosis by the incident cases registered in the cancer registry database of the German Center for Cancer Registry Data (Zentrum für Krebsregisterdaten, https://www.krebsdaten.de) (Footnote 2). The coverage measure indicates whether the cohort can be regarded as representative of the population of cancer patients in Germany. High coverage (values beyond the 90th percentile of the coverage distribution, i.e., 13.3%) is interpreted as an indicator that the diagnosis-specific patient group is overrepresented in the cohort compared with the general population of cancer patients in Germany.
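The coverage calculation and the percentile-based flag can be sketched in a few lines of Python. The diagnosis labels and counts below are invented placeholders, not cohort or registry figures, and the interpolation method used for the percentile is an assumption, since the text does not state how the quantile was computed.

```python
import statistics


def coverage_percent(cohort_cases, incident_cases):
    """Cohort patients with a diagnosis as a percentage of registered incident cases."""
    if incident_cases <= 0:
        raise ValueError("incident_cases must be positive")
    return 100 * cohort_cases / incident_cases


def flag_high_coverage(coverages):
    """Return the 90th percentile and the diagnoses whose coverage lies beyond it."""
    # quantiles(n=10) yields nine cut points; the last one is the 90th percentile.
    threshold = statistics.quantiles(list(coverages.values()), n=10, method="inclusive")[-1]
    flagged = {name for name, value in coverages.items() if value > threshold}
    return threshold, flagged


# Placeholder data: eleven hypothetical diagnoses with 0..10 cohort cases
# per 100 incident cases each (NOT real cohort or registry figures).
placeholder = {f"diagnosis_{i}": coverage_percent(i, 100) for i in range(11)}
threshold, flagged = flag_high_coverage(placeholder)
```

Because the threshold is derived from the coverage distribution itself, roughly one tenth of the diagnoses will always be flagged; the flag therefore marks relative, not absolute, overrepresentation.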

As an in-silico validation (Footnote 3), and in order to illustrate the depth of the cohort data, four disease-specific sub-cohorts were formed (pancreatic cancer, laryngeal cancer, kidney cancer and cancer of the thyroid gland), for which an in-depth descriptive and visual analysis was conducted. The sub-cohorts represent frequent diseases from different organ systems (digestive, respiratory, genitourinary and endocrine) with different therapeutic approaches and outcomes. The frequency of each combination of tumor localization and stage at diagnosis is depicted in so-called Voronoi treemaps [21]. Additionally, the sequences of therapies, stratified by stage at diagnosis, are visualized as alluvial diagrams.

Results

Cohort overview

Local bridgeheads from fourteen participating cancer centers in Germany comprise data of NP = 600,915 patients and NDX = 905,453 diagnoses. As illustrated in Fig. 2, a four-step data quality assurance process was applied to assemble the cohort from patients with (1) a histologically confirmed primary diagnosis (excluded: NP = 90,475; Footnote 4), (2) a minimum of two documented visits, defined as examination or anti-cancer treatment events (excluded: NP = 127,507), (3) logically consistent event dates (e.g., primary diagnosis had to be no later than first treatment; excluded: NP = 22,870; Footnote 5) and (4) a year of diagnosis of 2013 or later (excluded: NP = 127,072). The latter starting date was chosen because in 2013 the Cancer Screening and Registry Act (German: Krebsfrüherkennungs- und Registergesetz, KFRG) entered into force, which led to the establishment of a comprehensive cancer documentation practice (Footnote 6). These exclusion criteria led to a final cohort size of NP = 232,991 (NDX = 242,756), or 39% of all patients held in the bridgeheads.
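The attrition reported above can be re-traced with a short Python check that uses only the patient numbers given in the text. It assumes that the four filters were applied one after another, as the four-step description suggests; the function name `attrition` is chosen for this example.

```python
INITIAL_PATIENTS = 600_915

# Exclusions per filter step, as reported in the text.
EXCLUSIONS = [
    ("no histologically confirmed primary diagnosis", 90_475),
    ("fewer than two documented visits", 127_507),
    ("logically inconsistent event dates", 22_870),
    ("year of diagnosis before 2013", 127_072),
]


def attrition(initial, exclusions):
    """Return (reason, excluded, remaining) for each sequential filter step."""
    remaining = initial
    steps = []
    for reason, excluded in exclusions:
        remaining -= excluded
        steps.append((reason, excluded, remaining))
    return steps


steps = attrition(INITIAL_PATIENTS, EXCLUSIONS)
final_cohort = steps[-1][2]
share_percent = 100 * final_cohort / INITIAL_PATIENTS

for reason, excluded, remaining in steps:
    print(f"{reason}: excluded {excluded:,}, remaining {remaining:,}")
print(f"final cohort: {final_cohort:,} ({share_percent:.1f}% of all patients)")
```

The four exclusions sum to 367,924 patients, which leaves the reported 232,991 patients, about 38.8% of the initial population.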

Fig. 2

Assembly of the cohort through a four-step process to ensure data quality. The numbers of patients (NP) and diagnoses (NDX) at each filter step that are included (solid lined boxes) and excluded (dashed lined boxes) are indicated

Figure 3a illustrates the distribution of patient counts across all fourteen participating medical centers. The median number of contributed patients is NP = 13,194.5 (interquartile range (IQR) = [7802; 18,285]). Contributions vary markedly between sites, ranging from NP = 51,326 at the largest center to NP = 3632 at the smallest. By year of diagnosis, the patient count increases from NP = 24,090 in 2013 to NP = 32,191 in 2017, followed by a slight decrease until 2020 (NP = 24,162) and a sharp decline thereafter (NP = 8078 in 2021).

Fig. 3

Distribution of patient numbers in the cohort (a) over time (year of diagnosis 2013–2021) and by participating medical center (color-coded) and (b) regarding demographic factors (age at diagnosis, gender status)

The distributions of the cohort’s demographics, age at diagnosis and gender, are depicted in Fig. 3b. Overall, the cohort comprises more male (NP = 127,543) than female patients (NP = 105,425). Sixteen patients were coded with unknown (NP = 12), other (NP = 3) or missing (NP = 1) gender status. Importantly, the age distribution differs between female and male patients. Due to the high prevalence of breast cancer in younger females (mean age at diagnosis = 59.3 (SD = 13.8)), the frequency distribution of female patients shows an earlier, steeper onset than the frequency distribution of male patients (mean age at diagnosis = 68.0 (SD = 8.2)). In contrast, the most common cancer diagnosis in male patients is prostate cancer, a disease that more often affects older men. Also, more men than women are affected by highly prevalent cancer diagnoses such as lung cancer and colorectal cancer.

Table 1 provides an overview over the total number of patients with their primary diagnosis, the mean and standard deviation of age at diagnosis, the number of available biosamples, the percentage of female patients with the respective diagnosis and the estimated coverage with respect to the total number of incident cases in Germany.

Table 1 Cohort characteristics grouped by primary diagnosis

The cohort covers diagnoses of solid cancers (NDX = 172,190) from all organ systems (lip, oral cavity and pharynx: NDX = 13,211; digestive organs: NDX = 34,295; respiratory and intrathoracic Organs: NDX = 19,993; bone and articular cartilage: NDX = 1297; malignant melanoma: NDX = 13,964; mesothelial and soft tissue: NDX = 5153; breast and female genital organs: NDX = 29,419; male genital organs: NDX = 24,576; urinary tract: NDX = 10,374; eye, brain and CNS: NDX = 11,947; endocrine glands: NDX = 5172; ill-defined and unspecified: NDX = 2789) as well as a wide spectrum of malignancies of the hematopoietic and lymphatic system (NDX = 23,003).

More specifically, seven of the cohort’s ten most frequent diagnoses are also among the ten most frequent diagnoses in the population: prostate cancer (NDX = 22,523; cohort rank 1 vs. population rank 2), breast cancer (NDX = 18,409; cohort rank 2 vs. population rank 1), lung cancer (NDX = 15,575; cohort rank 3 vs. population rank 3), malignant melanoma of the skin (NDX = 13,964; cohort rank 4 vs. population rank 5), colon cancer (NDX = 6218; cohort rank 6 vs. population rank 4), pancreatic cancer (NDX = 6009; cohort rank 7 vs. population rank 7) and cancer of the bladder (NDX = 5279; cohort rank 10 vs. population rank 8) (Footnote 7). Only malignant tumors of the brain (NDX = 9005; cohort rank 5 vs. population rank 17), cancer of the liver and bile duct (NDX = 5907; cohort rank 8 vs. population rank 13) and diffuse non-Hodgkin lymphoma (NDX = 5837; cohort rank 9 vs. population rank 15) are among the ten most frequent diagnoses of the cohort but not among the top ten cancer diagnoses in the population.

The median diagnosis-specific coverage is 5.7% (IQR = [3.7%; 10.1%]). The estimated coverage of many frequent diagnoses such as lung cancer (3.0%), breast cancer (2.9%) and prostate cancer (4.1%) lies within the IQR (cf. estimated coverage in Table 1). However, the coverage of neuro-oncological cancers such as malignant tumors of the brain (14.3%) and the eye (36.2%) as well as cancer of the spinal cord and other unspecified parts of the CNS (16.8%) lies beyond the 90th percentile of the coverage distribution (13.3%). Rather rare entities such as malignant tumors of the bones and articular cartilage (14.5% and 19.7%), cancer of endocrine glands (15.7%), cancer of the placenta (14.8%) and cancer of connective and soft tissue (13.4%) also feature a high estimated coverage in the cohort.

Analysis of diagnosis-specific sub-cohorts

In order to prove the validity of the data, we performed a disease- and therapy-related analysis that covers four sub-cohorts (pancreatic cancer, laryngeal cancer, kidney cancer and cancer of the thyroid gland), reproducing known cancer-specific traits of patients’ clinical courses. The cancer entities differ with respect to the distribution of stage at diagnosis and therapeutic approaches.

Voronoi treemaps (Fig. 4) illustrate the frequency of each tumor location-specific subtype and its UICC (Union Internationale Contre le Cancer) stage at the time of diagnosis. Pancreatic cancer is most frequently detected in an advanced stage (UICC stage I: 11.3%, II: 31.0%, III: 11.8%, IV: 45.9%). For laryngeal cancer, the UICC stage at diagnosis is more homogeneously distributed (UICC stage 0: 1.0%, I: 26.8%, II: 20.8%, III: 20.0%, IV: 31.4%). Cancers of the kidney and thyroid gland are often detected in an earlier stage (kidney cancer UICC stage at diagnosis: I: 51.2%, II: 7.6%, III: 15.4%, IV: 25.8%; cancer of the thyroid gland UICC stage at diagnosis: I: 58.8%, II: 13.9%, III: 13.1%, IV: 14.3%).

Fig. 4

Frequency of UICC stage at time of diagnosis by specified tumor localization for cancer of the (a) pancreas, (b) larynx, (c) kidney and (d) thyroid gland. Notes: number of diagnoses (NDX) according to ICD-10 (including codes with digits after the decimal point); tumor localization-specific subtypes according to ICD-O (C25.0 head of pancreas; C25.1 body of pancreas; C25.2 tail of pancreas; C25.4 endocrine pancreas; C25.7 other parts; C25.8 overlapping lesion; C25.9 unspecified; C32.0 glottis; C32.1 supraglottis; C32.2 subglottis; C32.8 overlapping lesion; C32.9 unspecified; C64.9 kidney; C64.91 kidney, upper third; C64.92 kidney, middle third; C64.93 kidney, lower third; C73.9 thyroid gland unspecific; C73.91 lobe of thyroid gland; C73.92 isthmus of thyroid gland) are color-coded; UICC stage is coded by color intensity (darker color indicates a more advanced stage at the time of diagnosis); area size indicates the relative frequency of a stage-localization combination

Figure 5 presents an analysis of therapy sequences faceted by the four disease-specific sub-cohorts and stratified by UICC stage at diagnosis. The visualization illustrates the number of patients who received therapies (x-axis) from the first up to the sixth anti-cancer treatment (y-axis), including the flows of patients between the different modes of therapy. Surgery is the most frequent mode of therapy for early-stage cancer of the pancreas, larynx and kidney (UICC stage I and II). Higher clinical stages at diagnosis (UICC stage III and IV) were more frequently treated with other modalities, which differed among cancers. While systemic therapy dominated in patients with advanced-stage pancreatic cancer, radiation and systemic therapies prevailed in patients with laryngeal or kidney cancer. For early-stage cancer of the thyroid gland (UICC stage I and II), nuclear medicine treatment, such as radioactive iodine therapy, is most frequently documented; in later stages, surgery and systemic therapy are. The decreasing size of the frequency bars in Fig. 5 indicates the successive relative reduction of patients receiving additional therapies. This observation holds for all disease-specific sub-cohorts and all strata. The flow of patients between therapies is indicated by the colored alluvial connections between the bars. These alluvial connections illustrate that for early-stage (UICC stage I and II) malignancies the mode of the first therapy (e.g., surgery) is re-applied when successive therapy is required. For malignancies detected in later stages (UICC stage III and IV), however, the alluvial streams reflect multimodal anti-cancer treatment, more often moving from one mode of therapy to another (e.g., from surgery to radiotherapy in stage IV thyroid cancers).
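The quantities underlying such an alluvial diagram, i.e., bar heights per therapy step and flows between consecutive steps, can be derived from per-patient therapy sequences as in the following Python sketch. It is an illustration under stated assumptions: the therapy labels, the example sequences and the function names are hypothetical, and the sketch does not reproduce the script used to generate Fig. 5.

```python
from collections import Counter


def patients_per_step(sequences, max_steps=6):
    """Number of patients who received at least a k-th therapy, for k = 1..max_steps."""
    return [sum(1 for seq in sequences if len(seq) >= k) for k in range(1, max_steps + 1)]


def therapy_flows(sequences, max_steps=6):
    """Count transitions (step, from_mode, to_mode) between consecutive therapies."""
    flows = Counter()
    for seq in sequences:
        truncated = seq[:max_steps]
        for step, (current, following) in enumerate(zip(truncated, truncated[1:]), start=1):
            flows[(step, current, following)] += 1
    return flows


# Hypothetical therapy sequences of three patients in one stratum.
sequences = [
    ["surgery", "radiotherapy"],
    ["surgery", "surgery", "systemic therapy"],
    ["systemic therapy"],
]
bars = patients_per_step(sequences)
flows = therapy_flows(sequences)
```

The bar heights can only stay equal or decrease from one step to the next, because every patient with a k-th therapy also has all earlier ones; this is the successive reduction visible in the figure. A flow whose source and target mode are identical corresponds to the re-application of the same mode of therapy.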

Fig. 5

Alluvial diagram of mode-of-therapy sequences stratified by diagnosis and UICC stage for cancer of the (a) pancreas, (b) larynx, (c) kidney and (d) thyroid gland. Illustrated is the percentage of patients who received therapy (x-axis) from the first up to the sixth therapy sequence (bars at y-axis), including the flows of patients (colored alluvial connections) between the different sequences and therapies. Notes: number of diagnoses (NDX) according to ICD-10 and therapeutic events (NTX) in the sub-cohort

Discussion and limitations

The present report profiles the pan-cancer multicenter cohort of the DKTK’s CCP, which contains 232,991 patients. CNS, non-Hodgkin lymphoma and hepatobiliary cancers ranked higher (with respect to diagnosis frequency and coverage) in the cohort than in the national population of cancer patients. We also find high coverage of rare malignancies such as malignant tumors of bones and articular cartilage, cancer of endocrine glands, cancer of the placenta or connective and soft tissue sarcomas. Taken together, these findings indicate a cluster of specialized care in our network of tertiary cancer centers. However, the frequency distribution of the remaining diagnoses in the cohort resembled the distribution in the national population, supporting the assumption that the cohort can in part be considered representative. The cohort features a continuous influx of patients that allows monitoring of their clinical pathways and outcomes. The sharp decline in patients with primary diagnosis in 2021 may be a direct result of the coronavirus pandemic, e.g., because infection control measures delayed elective diagnosis.

Disease-specific sub-cohorts, for which we exemplified diagnosis- and treatment-related analyses, provide detailed insights mirroring known properties of the respective diseases. Our findings concerning the stage distributions are in line with existing data. For example, early-stage pancreatic cancer is often asymptomatic and thus remains undetected for longer periods of time, which may explain why most pancreatic cancers are diagnosed in an advanced stage [23]. Likewise, documented pancreatic cancer diagnoses predominantly comprise ductal adenocarcinoma, originating from the exocrine pancreas, which is far more frequent than cancer of endocrine origin [24]. For laryngeal cancer, too, the data are in line with findings from epidemiological cancer registries [25].

While the cohort may serve to monitor clinical outcomes in cancer patients, a recent national research project shows that improved clinical outcomes are positively associated with treatment and treatment options in specialized cancer centers [26]. It must also be considered that university hospital patients are more often part of clinical trials. This circumstance may affect outcomes because clinical trials often require histopathological proof and molecular analysis of the tumor; such deep phenotyping techniques subsequently allow more often for personalized treatment approaches [27, 28].

In summary, the granularity and size of the cohort data are a potential catalyst of translational cancer research. The cohort provides rapid access to comprehensive patient groups of interest and may enhance the understanding of the clinical history of various (even rare) malignancies. Consequently, the cohort may inform decisions in clinical trial design and will contribute to the evaluation of scientific findings under real-world conditions. Moreover, with the application of analytic scripts, data evaluation and visualization can be performed rapidly.

The cohort clearly benefits from its underlying IT infrastructure, which may serve as a core for future extension of data elements (e.g., laboratory values, genetic information, comorbidities, co-medication, medical history, radiological imaging data) if required for specific research purposes. It enables researchers to access a rich source of harmonized data across the participating sites of the consortium without violating privacy regulations or impairing the data sovereignty of the hospitals. As a complement to epidemiological data of cancer patients with near complete coverage [23], the cohort of the DKTK’s CCP bridges big data clinical epidemiology and deep-insight real-world cancer research. The cohort dataset is also connected to liquid and tissue biosamples stored in local biobanks, which helps to unfold the translational potential of the multi-center pan-cancer cohort.

Of course, the facility-based nature of the cohort data also has its limitations. While certified cancer centers document the corresponding follow-up examinations and ex-domo treatments of their patients, there is not yet an equivalent level of comprehensive documentation of the journeys of patients treated in non-certified units. Thus, for cancers treated in non-certified units, the proportion of covered follow-up and ex-domo treatment events can be expected to be lower than for diagnoses treated in certified units. The validity assessment applied here in the disease-specific sub-cohorts is limited to face validity, i.e., the plausibility of the data given what we know about disease epidemiology and treatment approaches. In order to strengthen data validity, future assessments should include the predictive validity of the data [4]. Another limitation concerns the specification of inclusion criteria: While the pan-cancer cohort profile presented here was limited to patients diagnosed in 2013 or later with a minimum of two documented disease-related events, these criteria must be reconsidered for research focused on specific diseases. For example, a study about long-term survival in prostate cancer patients (a disease for which many German cancer centers have been certified since 2008) would reasonably include patients diagnosed in 2007 onwards and would exclude patients with fewer than a specified minimum of follow-up examinations. It must also be considered that, unlike patient demographics and diagnostic information, not all data elements can be used directly for analyses; some require intricate preprocessing beforehand. For example, treatment-related information can be used for the inference of lines of therapy, which might be a useful component to normalize patient data.

In conclusion, we have demonstrated that the cohort of the DKTK’s CCP is a patient population representative of German university medicine-based tertiary cancer centers, providing valuable insights into the real-world study of contemporary oncological treatment and outcomes in Germany. Access to biobanks and other data sources builds a broad basis for future comprehensive and in-depth analyses.