Introduction

Non-communicable diseases are considered the major threat for health worldwide [1]. Cancer and cardiovascular disease (CVD) remain the most important causes of death in Germany, accounting for 63% of all years of life lost in 2017 [2]. However, the past years have demonstrated that globally we face major challenges with immediate impacts and long-term consequences for health. These in particular include (1) the ageing population and the increase in social disparities, (2) the obesity epidemic, (3) the climate crisis, (4) the medical and public health impacts of the COVID-19 pandemic and (5) other emerging diseases. All five of these major challenges jointly change living conditions, environmental exposures, risk factor profiles, susceptibility and health service access.

Large-scale cohort studies are essential to understand the inherited and acquired determinants of health in populations and to shape the future of prevention and early disease detection. Furthermore, they provide us with insights about how living and working conditions of study participants change over time. Based on the advances in biomedical sciences, cohort studies are nowadays able to address the fundamental challenges of future health research in an unprecedented fashion by real-world assessments of population health and by generating and testing innovative hypotheses based on large-scale standardized observational data. Cohort studies are the prime source of inference in areas where randomized clinical trials are infeasible or unethical. Within the last 2 decades, a number of mega cohorts have been initiated worldwide to foster the understanding and prevention of non-communicable diseases [3,4,5,6,7,8]. Jointly with experimental and clinical studies, they guide novel approaches to personalized prevention, precision medicine and policies and population-based prevention to improve public health in a changing world.

The German National Cohort (NAKO, “NAKO Gesundheitsstudie”) is a large, multidisciplinary, prospective population-based cohort study [5, 9] (Table 1). The overarching scientific goals of NAKO are: (1) The identification of etiological pathways from life-style and environmental risk factors to major diseases and functional impairments. (2) The description and understanding of the causes of geographic and socio-economic disparities in health status and disease risks. (3) The development of risk prediction models for identifying individuals at increased risk for major diseases in a framework of personalized prevention strategies. (4) The evaluation of markers for early detection of disease and pre-disease phenotypes, in order to develop effective tools for disease prevention.

Table 1 Current and planned data collection in the German National Cohort

NAKO is the largest epidemiologic study in Germany to date and a joint interdisciplinary endeavour of 27 German scientific institutions, including 15 universities, 4 Helmholtz health centres, 4 institutes of the Leibniz Association and 4 other national research institutions (see https://nako.de/allgemeines/der-verein-nako-e-v/organe-und-gremien/wissenschaftliche-projektleiter-der-mitgliedsinstitutionen/mitgliederversammlung/).

We report here on the NAKO baseline recruitment and assessment, and we describe the successes and challenges in setting up such a large population-based cohort as a national resource.

Methods

Study population and recruitment

NAKO set out to recruit a total of 200,000 residents aged 20 to 69 years at baseline within a 5-year period. Study participants were recruited through a network of 18 study centres from 16 study regions throughout Germany that include urban and industrialised areas as well as rural regions (Fig. 1). The goal was to recruit 10,000 participants each in 16 study centres and 20,000 participants each in 2 larger centres. All study participants were identified based on age and sex-stratified samples randomly drawn from compulsory registries of residents within the study areas. For both sexes, the design intended the overall recruitment of 10,000 participants in each 10-year age-group between 20 and 39 years, and 26,667 participants in each 10-year age-group between 40 and 69 years. The local study centres invited the participants for standardised assessments.
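As a plausibility check, the two sets of design targets above can be summed and compared with the overall goal of 200,000. The short sketch below does this; it assumes that the age-group targets apply to each sex separately, and all variable names are illustrative.

```python
# Planned recruitment per study centre: 16 centres with 10,000 and 2 larger centres with 20,000.
centre_targets = [10_000] * 16 + [20_000] * 2
total_by_centre = sum(centre_targets)

# Planned recruitment per 10-year age group (assumed here to be per sex).
age_group_targets = {
    "20-29": 10_000,
    "30-39": 10_000,
    "40-49": 26_667,
    "50-59": 26_667,
    "60-69": 26_667,
}
total_by_age_sex = 2 * sum(age_group_targets.values())  # women and men

print(total_by_centre)   # 200000
print(total_by_age_sex)  # 200002, because 80,000 / 3 is rounded up to 26,667
```

Both routes arrive at the planned 200,000; the surplus of 2 in the age-sex total stems only from rounding the target for the three older age groups.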

Fig. 1 Study centres and infrastructures of the German National Cohort

Examinations and data collection

The baseline study programme included (1) a standardised, computer-assisted face-to-face interview, (2) biomedical examinations, (3) questionnaires to be filled in by the participants (mainly via touchscreen), (4) collection of biosamples, and (5) in 5 centres, a whole-body MRI of 30,000 participants. By design, data collection comprised two levels of intensity [5]. The standard Level 1 programme was offered to all 200,000 participants and additional in-depth examinations to 20% randomly selected participants (Level 2 programme). Magnetic resonance imaging (MRI) was planned to be performed in 30,000 participants. All MRI-participants were also invited to participate in the extended Level 2 programme. An overview of the examination modules as part of the Level 1 and Level 2 programme is presented in Table 2. An overview of the 16 interview and the 30 touchscreen questionnaire modules is presented in Table 3.

Table 2 Disease groups and functions in focus of the baseline examination modules of the Level 1 and Level 2 programme
Table 3 Questionnaire data collected within the German National Cohort

All participants gave informed consent after receiving detailed information on the study content and procedures. Data collection was performed by specifically trained and certified study personnel. Biological samples of blood, urine, saliva, stool, and nasal swabs were obtained and processed on site.

Finally, a total of 30,000 participants were offered whole-body MRI using dedicated 3 Tesla scanners (Magnetom Skyra, Siemens Healthineers, Erlangen, Germany) at five MRI centres in Augsburg, Berlin, Essen, Mannheim and Neubrandenburg (MRI programme) [10]. The identically installed MRI scanners remained technically constant (hardware and software) throughout the baseline recruitment period. The scanning protocol included sequences for the brain, the cardiovascular and musculoskeletal system as well as for the thorax and abdomen. Comprehensive measures ensured homogeneous, high-quality images. Moreover, procedures for the management of incidental findings were developed, including the communication of findings to study participants to inform them about potential health problems that would require further medical attention [11].

A letter containing the basic results (e.g. blood parameters, blood pressure, anthropometric and accelerometry data) was sent to all study participants after the visit at the study centre.

Central data management

Data were collected through standardised data entry forms and protocols for interviews and questionnaires as well as for all examinations at the study centres. For most medical devices, an automated transfer of examination data to the central database was specifically programmed and implemented. Thus, all data, including raw data recorded by devices during examinations, were directly integrated in a central study database serviced at two data integration centres. These data integration centres are physically located at the University of Greifswald and at the German Cancer Research Center (DKFZ) in the Helmholtz Association, Heidelberg. The independent trust centre at the medical faculty of the University of Greifswald is responsible for the storage of personal identifying data, including addresses, and for consent management. For the MRI programme, a dedicated infrastructure for incidental finding ascertainment and data storage was built up [10].

Collection and storage of biosamples

Whole blood, serum, EDTA plasma, erythrocytes, RNA, urine, saliva, nasal swabs and stool were collected from all study participants at baseline as part of the Level 1 programme. The collection and local processing of the samples were highly standardised and included the use of an automated liquid handling system for sample aliquoting [12]. More than two thirds of each individual’s aliquots collected during baseline recruitment are stored in a central biorepository at Helmholtz Munich [13] that is dedicated exclusively to NAKO. It includes −80 °C semi-automated storage and −180 °C storage in a fully automated sample handling robotic system for more than 20 million aliquots. One third of serum and plasma samples is stored at the local study centres for use in local analyses and as back-up storage.

Follow-up

The follow-up of the cohort is essential to achieve the objectives of NAKO. Table 1 describes the approaches for data collection on incident events building upon the baseline recruitment described in this paper. All participants are followed via postal questionnaires (active follow-up), with the first 3-year follow-up initiated in 2017, and via record linkage with secondary data sources such as cancer registries and health insurance records (passive follow-up). All deaths in the study population are monitored from two main sources: notifications from study centres and regular vital status screening in German and other European residential registries. For all deaths, information on the exact date and place is collected, and the death certificate as well as medical and forensic reports are accessed and coded in the ICD-10 framework in three versions: (1) as documented by the examining physician, (2) where necessary, re-arranged by the internationally established coding software IRIS, and (3) augmented by medical and forensic reports.

Quality management

All data collection and biomedical examinations used standardised instruments and followed the procedures as described in specific standard operating procedure manuals (SOP). The study personnel underwent extensive training and was certified for examinations and interviews. Repeated centralised training and recertification took place at regular intervals. Comprehensive quality management included internal quality management organised by the central executive office and the study centres, and external quality management conducted by the Robert Koch Institute, Berlin.

Data protection and ethics

Personal data are processed according to the concept on data privacy protection and IT security developed for the NAKO (see NAKO Gesundheitsstudie—Datenschutz in der NAKO). Essential principles of data protection addressed are (1) the separation of identifying data from other personal data, (2) the participants’ right of self-determination and control of own personal data, and (3) data reduction and data economy.

Adherence to legal requirements and to generally accepted ethical rules is a fundamental principle for the conduct of NAKO. Structures, documents, and processes to ensure compliance of the conduct of the study with the ethical principles have been implemented. The ‘Code of Ethics’ of NAKO describes general ethical rules and principles for collection and use of study data (see NAKO Gesundheitsstudie—Ethik in der NAKO). An external ethics advisory board consists of members who represent ethical, social, scientific, medical, and legal matters in the area of life sciences as well as the study participants themselves. All study documents, including study protocols, participant information documents and declaration of consent forms for the baseline examination including the MRI examinations, have been approved by all responsible local ethics committees. All ethics-related documents and processes are revised regularly and adapted as needed.

A modular consent declaration consisting of separate sections for participation in the main examination programme as well as specific programme modules, collection and storage of biosamples, retrieval of secondary data and repeated contact is applied to document the participants’ consent in detail (see NAKO Gesundheitsstudie—Einwilligung). Processes for collecting and documenting informed consent, for data processing consistent with the given consent and for handling of consent withdrawal have been implemented.

The linkage to external registries is ad persona. It can only be done for individuals who gave the specific informed consent. Linkage to health and pension insurances is done via the health insurance and pension insurance numbers given by the participants during the consent process. Linkage to other registries is done via name, date of birth and address. To ensure data security and participant privacy, the linkage process is operationalized solely via the independent trust centre. Secondary data are retransmitted by the providers to the data integration centres only in pseudonymized form. These raw data are exclusively available to the competence network secondary and registry data of the NAKO, which is responsible for processing and coarsening. Only these processed data are incorporated into the research database and available for scientific analysis.
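The principle of consent-gated, pseudonymized linkage described above can be illustrated with a minimal sketch. It is not the NAKO implementation: the keyed-hash pseudonym, the key, the function names and the record layout are all assumptions chosen for illustration, and the real process involves separate institutions rather than a single function.

```python
import hashlib
import hmac

# Hypothetical secret that would be held only by the trust centre (illustration only).
TRUST_CENTRE_KEY = b"example-key-not-for-production"


def pseudonym(participant_id: str) -> str:
    """Derive a stable pseudonym from an identifying ID via keyed hashing (assumed scheme)."""
    digest = hmac.new(TRUST_CENTRE_KEY, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def link_records(participants, registry_records):
    """Return registry payloads for consenting participants, keyed by pseudonym only.

    participants: list of dicts with 'id', 'insurance_no' and 'consent_linkage'
    registry_records: dict mapping insurance number -> payload (e.g. coded diagnoses)
    """
    linked = {}
    for person in participants:
        if not person["consent_linkage"]:
            continue  # no specific consent, hence no linkage
        payload = registry_records.get(person["insurance_no"])
        if payload is not None:
            # Identifying fields are dropped; only the pseudonym travels with the data.
            linked[pseudonym(person["id"])] = payload
    return linked


participants = [
    {"id": "P001", "insurance_no": "A123", "consent_linkage": True},
    {"id": "P002", "insurance_no": "B456", "consent_linkage": False},
]
registry = {"A123": {"icd10": ["I10"]}, "B456": {"icd10": ["E11"]}}

result = link_records(participants, registry)
print(len(result))  # 1: only the consenting participant is linked
```

The point of the sketch is the separation of roles: the party holding identifying data and the key never needs to hand identifiers to analysts, who see pseudonyms and payloads only.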

Use and access

Access to and use of NAKO data and biosamples is regulated by a Use and Access Policy adopted by the General Assembly of NAKO (NAKO-e.-V._Nutzungsordnung_v2_2019-03-21.pdf), which is binding for all users. A transfer unit is responsible for the technical and administrative tasks related to the use and access procedures. Below, we briefly describe the current procedures. An electronic application portal (https://transfer.nako.de) was developed to support applications for the use of NAKO data and, in a separate procedure, of NAKO biosamples. A Use and Access Committee (UAC) evaluates the applications. The UAC's recommendation to accept or reject the application is presented to the members of the NAKO for potential objections. If no objection is raised, NAKO decides on the acceptance or rejection of the application. For biosamples, the appropriateness of the proposed methodologies and the prioritisation given the limited amount of biomaterial are assessed in addition. After a data usage agreement has been signed by all involved institutions, the transfer unit provides the applicant with the data for analysis and supports the transfer of biosamples. Adherence to the provisions of the EU General Data Protection Regulation (GDPR) is mandatory for all users.

Results

Recruitment, age distribution and response rates

The study centres started baseline recruitment between March and September 2014 and completed it between October 2018 and September 2019. Overall, 205,415 participants were recruited, of whom 59,971 received Level 2 examinations and 30,861 received MRI examinations, clearly exceeding the originally planned targets for the baseline examination. Level 1 participants who had not been drawn into the random sample of Level 2 participants were also offered Level 2 examinations if they received the MRI examination.

The participants’ age at the day of the examination ranged from 19 to 74 years. By the end of the baseline recruitment, 362 participants (0.18%) had revoked their informed consent completely, resulting in 205,053 participants as of October 2019. Recruitment complied with the planned age-sex stratification of the cohort (Table 4). This was achieved by continuous monitoring of the age and sex distribution and intensified efforts to recruit within the age range 20–50 years towards the end (Fig. 2). The cohort shows a well-balanced age-sex distribution.
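The cohort size after consent withdrawal follows directly from the two reported counts; the small sketch below reproduces the figures quoted above (variable names are illustrative).

```python
recruited = 205_415   # participants recruited at baseline
withdrawn = 362       # complete withdrawals of informed consent by October 2019

remaining = recruited - withdrawn
withdrawn_pct = round(100 * withdrawn / recruited, 2)

print(remaining)      # 205053
print(withdrawn_pct)  # 0.18
```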

Table 4 Number of recruited NAKO participants by sex and age groups between 2014 and 2019. N = 205,053 after excluding 362 participants who withdrew their consent until October 2019
Fig. 2 Age distribution of NAKO for men and women. N = 205,053 after excluding 362 participants who withdrew their consent until October 2019

The overall response at baseline for NAKO was 17%. The response varied between 9 and 32% across study centres.

Baseline examination and MRI

Level 1 participants spent on average 215 min (3.6 h) at the study centre compared to 203 min planned. Level 2 participants spent 340 min (5.7 h) compared to 306 min planned. Of this, examination and interview time amounted to about 155 min for the Level 1 programme (of which ~ 30 min for interview and ~ 125 min for examinations) and 280 min for the Level 2 programme (of which ~ 30 min for interview and ~ 250 min for examinations), plus an additional 15 min for informed consent ascertainment. The self-administered touchscreen questionnaires took ~ 45 min if all modules were completed.
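The reported components add up to the average visit durations, as the following sketch shows. The component values are the approximate figures given above; the dictionary layout and function names are illustrative.

```python
# Durations in minutes, taken from the text (component values are approximate).
CONSENT = 15
TOUCHSCREEN = 45  # if all modules were completed

programmes = {
    "Level 1": {"interview": 30, "examinations": 125, "observed": 215, "planned": 203},
    "Level 2": {"interview": 30, "examinations": 250, "observed": 340, "planned": 306},
}


def visit_total(p):
    """Sum interview, examinations, consent and touchscreen time for one programme."""
    return p["interview"] + p["examinations"] + CONSENT + TOUCHSCREEN


def overrun(p):
    """Observed minus planned duration in minutes."""
    return p["observed"] - p["planned"]


for name, p in programmes.items():
    print(f"{name}: components sum to {visit_total(p)} min, "
          f"observed {p['observed']} min, {overrun(p)} min over plan")
```

Under these figures, the Level 1 visit exceeded the plan by 12 min and the Level 2 visit by 34 min.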

Level 1 participants completed 16 out of 16 interview modules. 94% of the Level 1 participants completed all 8 examination modules and 89% all 30 touchscreen modules. In addition, Level 2 participants completed on average 8 out of 10 further examination modules, and 79% of them completed all 10 modules. Table 5 lists the performed examinations and the percentage of participants who actually received the respective examination modules as compared to the originally planned target. Downtime of measurement equipment and data capturing procedures, sick leave of certified examiners and staff turnover in the study centres were the main reasons for reduced completeness of the examination modules.

Table 5 Number of Level 1 and the Level 2 examination modules in 205,415 NAKO participants and completeness at baseline 2014–2019

Overall, 30,861 participants underwent the whole-body MRI protocol at one of the five MRI centres; the protocol was completely acquired with all imaging sequences in 94% of subjects. While the MRI centres in Augsburg and Neubrandenburg examined local study participants, participants from adjacent study centres were invited to the MRI centres in Berlin-Nord, Essen and Mannheim. As such, at the MRI centre located in Berlin-Nord, 3,878 examined participants were from Berlin-Mitte and Berlin-Sued, at the MRI centre in Essen 1,277 participants were from Muenster and Duesseldorf, and at the MRI centre in Mannheim 750 participants were from Saarbruecken and Freiburg.

Biosample collection

The study centres processed and immediately froze at −80 °C more than 19 million biosample aliquots as part of the baseline examinations. These were collected with a high degree of completeness. Per participant, 30 serum aliquots (0.25 mL each; 99% completeness), 48 EDTA plasma aliquots (0.25 mL each; 98% completeness), and 12 urine aliquots (0.25 mL each; 98% completeness) were collected.

Discussion

NAKO is a large prospective central European cohort of young, middle-aged and older women and men living within urban and rural regions of Germany. NAKO achieved its goal to recruit the planned number of participants within 5 years. In addition, NAKO also adhered to the planned sex and age distribution. This was possible through dedicated efforts. Participants with a migration background are well represented and comprised 16% within the first 100,000 participants [14]. Even though the participants with migration backgrounds are a very diverse group overall, subgroups of migrants can be studied separately with respect to region of origin due to the size of NAKO. The response proportion was substantially lower than the 50% anticipated at the planning stage, and we observed considerable variation between the study centres. We consider a number of factors that may be responsible, including the urbanisation of the study region, the population composition in terms of migration background and education, and differences in local recruitment measures and strategies. We will assess the response proportions and regional differences in forthcoming publications.

NAKO was able to achieve a high degree of completeness for many of the examinations. Thus, an exceptional and rich database was created. The study is thereby a prime example for deep phenotyping in a large population-based cohort. The MRI scans provide a wealth of novel information to detect early changes within deeply phenotyped individuals. Jointly, it will allow assessing the role of early changes in function for the prediction of diseases and to derive signs of multimorbidity long before definite diagnoses are manifest. Quality assurance building on longstanding experience [15] as well as novel approaches employing machine learning and artificial intelligence techniques are underway to enrich the raw data by derived variables.

The data are complemented by a rich set of biosamples allowing for future analyses using targeted and untargeted high-throughput approaches as well as traditional laboratory analyses. The instant local processing combined with the complete cooling chain and the storage of blood and urine samples at −180 °C will enable NAKO to provide high quality biosamples throughout the next decades. Characterisation of inherited and acquired molecular traits and their regulation is the cornerstone of innovative approaches to personalised prevention and medicine [16]. Understanding the role of genetics and molecular phenotypes such as transcriptome, proteome, metabolome, microbiome and other biomarkers can reveal new insights into physiological mechanisms as well as the pathophysiology of disease development and progression. Life-style and environmental exposures determine and modify molecular phenotypes and form the basis for their impact on disease development [17]. The high quality of biosamples collected during the baseline examination and the subsequent data collections of NAKO provide an immense potential for implementation of well-established and novel high-throughput omics technologies. For example, we consider genotyping and whole genome sequencing, genome-wide methylation, RNA sequencing, metabolomics, proteomics, and serolomics. The NAKO biosampling is also remarkable due to its collection of samples for assessing the microbiome and virome of the gut, nose and oral cavity.

Follow-up

An important design aspect of NAKO is that all participants are re-invited every 5 years to the study centres for repeated examinations. The extensive assessment by functional examinations permits the analysis of continuous changes over time, e.g., regarding vascular, cardiac, metabolic, neurocognitive, pulmonary and sensory function, but also in terms of changes in exposures. NAKO is currently inviting all participants to the first re-examination, which started between November 2018 and January 2020. The first re-examination was planned to be completed by April 2023; however, due to the COVID-19 pandemic, participation rates were lower than planned and temporary closures of study centres were unavoidable. Nevertheless, 66,870 study participants were re-examined (66% of forecasted numbers) including 9,889 MRI examinations (55% of forecasted numbers) until December 2021.

A further exceptional aspect is the platform-building feature of NAKO that enables ad hoc surveys in a well-defined group to answer imminent public health questions. For example, a supplementary COVID-19 questionnaire on SARS-CoV-2 infections and pandemic-related topics was sent out via email or letter to all study participants between April 30th and June 30th 2020. Overall, 160,227 questionnaires were completed, resulting in a response of 80.6%. While infection rates were very low during the first wave of the COVID-19 pandemic, clear increases in depression, anxiety and stress scores as compared to baseline levels were documented, in particular affecting the younger adults [18].

Strengths and challenges

NAKO is characterized by a number of important design aspects that distinguish it from other international cohort studies and enable large-scale observational health research in Germany. NAKO is the largest cohort and harbours the largest biobank in Germany to date and is one of the largest in Europe. When compared to other large-scale epidemiological endeavours in Europe, NAKO has a substantially wider scope and assesses lifestyle characteristics and other exposures not only at baseline, but repeatedly and with innovative approaches. For example, the European Prospective Investigation into Cancer and Nutrition (EPIC) assessed lifestyle characteristics and anthropometric variables only once at baseline and did not include innovative tests and examinations like NAKO; disease follow-up in EPIC differs by country but is on a European level largely focussed on linkage with cancer and mortality registries (https://epic.iarc.fr/). Many other large cohorts are restricted to questionnaires and did not include any phenotyping in person (http://www.millionwomenstudy.org/; https://nurseshealthstudy.org/). In NAKO, (i) the focus on repeated deep phenotyping will allow assessing the development of major chronic diseases and multimorbidity in an ageing society based on changes in risk factors and intermediate phenotypes of multiple diseases. (ii) The large number of participants below 40 years of age at the baseline examination (N = 41,642) is exceptional in comparison to cohorts internationally. (iii) NAKO collects and stores high quality biosamples repeatedly, which will allow innovative genomic analyses and the repeated application of omics technologies within the same individual. (iv) Repeated MRI scanning will allow characterizing changes over time with unprecedented depth. (v) The record linkage with secondary data sources including health insurance data, cancer registries, pension funds and the statutory occupational insurance will enrich the primary data. Thus, supplemental information on disease incidence, medical treatments and occupational histories will be ascertained which would otherwise not be available. (vi) NAKO will provide unique data for assessing the medical and public health impact of the COVID-19 pandemic building upon examination data collected before, during and after the pandemic.

There are, however, also a number of weaknesses. They include the low response proportions and therefore a possibly reduced generalizability when it comes to the estimation of risk factor distributions or disease prevalence within the German population. Also, while the completeness of the modules overall is excellent, missingness due to different combinations of examination modules might pose a problem for certain analyses. Specific research questions may need to build on smaller sample sizes and consider the patterns of missingness.

One special element of NAKO is the fact that virtually the whole German epidemiological community has collaborated in the design and preparation of the cohort, and many of them are directly involved in the field work. Thus, NAKO constitutes an extraordinary basis for scientific cooperation and networking amongst epidemiologists and other health scientists in Germany. However, coordinating such a large and complex project jointly within the framework of the NAKO e.V. and collecting data at 18 study centres poses a number of challenges. Among them are (1) the consistent standardized data collection with high quality in 18 study centres simultaneously and over more than a decade, (2) the regional differences in institutional settings and state-based regulations including data protection, but also on other issues, such as COVID-19 abatement measures, (3) the different response rates between the centres, (4) quality assurance procedures involving more than 100 experts from the epidemiological community, (5) the complexity of data integration and synchronization of data availability.

Other methodological challenges arise from the underlying sampling frame of the study population, the varying response proportions, the complexity of the data, and the potential non-random structures in missing data. These aspects are addressed in more detail by Kuss et al. [19], who provide guidance on data analysis strategies. The large-scale data will allow applying causal inference frameworks [20] as well as developing and testing best-practice approaches to data analysis [21, 22].

Outlook

Large-scale cohort studies such as the UK Biobank show the immense potential to support biomedical discoveries and novel therapeutic approaches [23]. In addition, cohort studies are also prime resources to understand the health impacts in a rapidly changing world. The repeated survey of psychosocial, socioeconomic and behavioural factors over time will permit the analysis of the effect of environmental and societal changes and restrictions on health outcomes and wellbeing. Importantly, this will also provide insights into health impacts resulting from the COVID-19 pandemic including infections and containment measures. The NAKO study regions include urban and rural areas capturing diverse central European environmental conditions. Adding complex time and space resolved environmental exposure data will offer the opportunity to study the health impacts of climate change [24] and the benefits of climate change mitigation efforts on health within the next decades. The age composition of the cohort provides longitudinal data on persons who will be up to 80 years old in the coming years. We will therefore be able to provide data on factors relevant for healthy ageing and an ageing population. It also includes participants aged 20 years who can be followed in their health course for decades. The rich phenotyping data on subclinical findings and function will permit insights into trajectories of health and disease including multimorbidity. Thereby, NAKO will provide foundations for tailored recommendations on health and disease management strategies and form an exceptionally strong basis for designing interventional studies to promote healthy ageing and to slow down progression to overt disease.

Conclusion

NAKO provides a central platform for future epidemiological research with a strong potential to push the development of new strategies for prevention, early detection and risk stratification of major chronic diseases. With its broad spectrum of examinations, the systematic re-assessments of all study participants and its high-quality biomaterials, NAKO provides an excellent tool for future, population-based longitudinal research. The large-scale, embedded MRI programme is another asset of NAKO. The German National Cohort, NAKO, thus constitutes an extraordinary basis for scientific cooperation and networking amongst epidemiologists internationally.

Supplementary information

The German National Cohort (NAKO) Consortium consists of persons responsible for the planning and execution of the study, namely PIs and Co-PIs, the heads of study centres, infrastructures and competence units and the persons responsible for the study modules – both current and former, if they were active during baseline. An overview of the people contributing to NAKO can be found in the Supplementary Tables 1 to 5.