
Introduction

The Latino population is a diverse ethnic group recently admixed from Native American, European and African ancestries, with a high prevalence of metabolic disorders including type 2 diabetes. Although genetic studies in the Latino population are limited, they have revealed unexpected pathways and potential therapeutic targets for type 2 diabetes [1,2,3,4]. Examples include a Native American haplotype within the SLC16A11 gene identified as the main genetic contributor to type 2 diabetes in the Latino population [1, 4], a rare risk variant within HNF1A unique to the Latino population [2] and a loss-of-function (LoF) Latino-enriched variant within IGF2 associated with a 22% decrease in the odds of type 2 diabetes in heterozygous carriers [3].

Unlike genetically homogenous populations, the complex linkage disequilibrium (LD) structure of admixed populations imposes challenges in implementing statistical methods that are crucial to maximise genetic discoveries [5]. This is especially relevant for genotype imputation, a method used to estimate the genotype probabilities at genetic variants that have not been experimentally genotyped [6]. A major factor limiting the accuracy of genotype imputation in Latino samples has been the poor representation of their haplotypes in the reference panels (i.e. 352 from the latest version of the 1000 Genomes [1000G] imputation model) [7]. The multi-ancestry National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) programme has released a reference panel for genotype imputation that includes the highest sequencing coverage (i.e. 30×) and the largest number of reference samples (i.e. 97,256) to date, of which ~15% are from Latino individuals. It has been shown to increase the number of well-imputed low-frequency variants in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL) [8, 9].

We hypothesised that by boosting the identification of variants in Latino samples with the recently released TOPMed reference panel, we would improve our knowledge of the genetic architecture of type 2 diabetes in the Latino population. The 1000G panel was chosen as a comparison since, besides TOPMed, it has the largest number of Latino samples. We performed a type 2 diabetes genome-wide association study (GWAS) meta-analysis, as well as association analyses on a collection of related phenotypes from TOPMed-imputed Latino datasets, to allow the interpretation of our novel variants that had low frequencies or were absent in other publicly available biobanks that mainly contained individuals of European ancestry. Finally, we leveraged the generated GWAS data to develop, in combination with GWAS data from other ancestries, a type 2 diabetes polygenic score (PS) for the Latino population.

Methods

Detailed descriptions of the methods are given in electronic supplementary material (ESM) Methods.

Discovery sample

We aggregated data from six Latino cohorts with a sample size of 18,885 individuals (8150 with type 2 diabetes [cases] and 10,735 without [controls]): the Slim Initiative for Genomic Medicine in the Americas (SIGMA) [1,2,3]; the Mexican Biobank (MXBB) [10]; the Mass General Brigham (MGB) Biobank [11]; and the Genetic Epidemiology Research on Aging (GERA) [12] (Fig. 1 and ESM Table 1). We selected Latino samples based on their genetically estimated ancestry using principal components (PCs) and Admixture v1.3.0 [13] (ESM Fig. 1). All human research was approved by the relevant Institutional Review Boards and conducted according to the Declaration of Helsinki. All participants provided written informed consent.

Fig. 1

General overview of the study. Six cohorts of admixed Latino ancestry, representing a total of 8150 type 2 diabetes cases and 10,735 controls, were imputed with the TOPMed and 1000G Phase 3 panels (grey box). A type 2 diabetes GWAS meta-analysis of the imputed variants resulted in the identification of two novel loci, which were tested for replication in six additional Latino cohorts (green box). They were also interrogated for association with a collection of phenotypes in eight Latino cohorts (blue box) and for functional evidence in multiple available resources (purple box). The generated Latino type 2 diabetes GWAS data were used, in combination with GWAS from other ancestries, to construct ancestry-specific and multi-ancestry type 2 diabetes PSs (brown box). CMDK, Common Metabolic Disease Knowledge; sum stats, summary statistics

Genotyping and imputation

Genotyping was done using several commercially available genome-wide arrays, and for a subset of the samples (N=9520), we integrated whole-exome sequencing (WES) (ESM Table 1). We applied pre-imputation quality control to each dataset separately. Clean datasets were phased using SHAPEIT2 v2 [14]. For comparison purposes, we imputed the phased haplotypes using both the 1000G Phase 3 version 5 [15] and the TOPMed freeze 8 [8] reference panels.

Imputation performance evaluation

We evaluated the performance of TOPMed and 1000G imputations by summarising the chromosome-wise r2 quality measure and the number of well-imputed (r2≥0.8) variants at different allele frequency (AF) thresholds. We used available WES data from the SIGMA3 cohort and estimated the proportion of the sequenced variants in chromosome 22 that were well-imputed with TOPMed and 1000G panels at different WES AF thresholds. We used SnpEff v4.3 [16] to annotate the WES variants. We calculated the effective sample size (Neff) needed to reach 80% statistical power to detect genome-wide significant associations (α=5 × 10−8) at different effect sizes and AFs covered by the imputations (Fig. 2c).
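The power calculation described above can be illustrated with a minimal sketch. It assumes the commonly used definition Neff = 4 / (1/Ncases + 1/Ncontrols) and a normal approximation for an additive log-odds effect; neither the formula nor the function names below are taken from the ESM Methods, so the output should be read as an order-of-magnitude guide only.

```python
from math import log
from statistics import NormalDist


def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study (assumed definition)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def neff_for_power(odds_ratio: float, maf: float,
                   power: float = 0.80, alpha: float = 5e-8) -> float:
    """Approximate Neff needed to detect an additive effect.

    Assumption: the test statistic is normal with non-centrality
    beta^2 * Neff * MAF * (1 - MAF) / 2, where beta = ln(OR).
    """
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_power = norm.inv_cdf(power)
    beta = log(odds_ratio)
    ncp_per_unit_neff = beta ** 2 * maf * (1.0 - maf) / 2.0
    return (z_alpha + z_power) ** 2 / ncp_per_unit_neff


# Discovery sample of this study: 8150 cases and 10,735 controls
discovery_neff = effective_sample_size(8150, 10735)
# Hypothetical query: OR 2.0 at MAF 0.1%
required_neff = neff_for_power(odds_ratio=2.0, maf=0.001)
```

Under these assumptions the required Neff for a rare variant (MAF 0.1%) with OR 2.0 is on the order of 10^5 individuals, far above the discovery Neff.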

Fig. 2

Performance of the TOPMed reference panel for the imputation of Latino samples. (a) Number of chromosome-wide well-imputed variants (imputation r2≥0.8) by AF for each analysed cohort when using the 1000GP3 (blue) or the TOPMed (black) reference panels. (b) Average chromosome-wide imputation quality by AF for each analysed cohort when using the 1000GP3 (blue) or the TOPMed (black) reference panels. (c) Effective sample size required for reaching 80% statistical power to detect genome-wide significant signals at different effect sizes (OR). The dotted lines show the discovery effective sample size of this study. (d) Percentage of the exome sequenced variants in chromosome 22 that could be imputed when using the 1000GP3 (blue) or the TOPMed (black) reference panels. (e) Percentage of the exome sequenced LoF and deleterious predicted variants based on CADD score in chromosome 22 that could be imputed when using the 1000GP3 (blue) or the TOPMed (black) reference panels

Type 2 diabetes GWAS meta-analysis

Association analyses were performed in each cohort with SNPTEST v2.5.4 [17]. Models were adjusted for sex, age, BMI and ten PCs to account for population structure. We ran additional models without adjusting for BMI. Only well-imputed variants (r2≥0.5) were meta-analysed, weighting each cohort by the inverse of its squared SE, in METAL v2011-03-25 [18]. We used a standard GWAS significance threshold of p<5 × 10−8.
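For readers unfamiliar with inverse-variance weighting, the following minimal sketch shows the fixed-effects calculation for a single variant. It is an illustration of the general technique, not the METAL source code, and the function name is hypothetical.

```python
from math import erfc, sqrt


def inverse_variance_meta(betas, ses):
    """Fixed-effects meta-analysis of per-cohort log-odds ratios.

    Each cohort is weighted by 1 / SE^2; returns the pooled effect,
    its SE, the z score and a two-sided normal p value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = sqrt(1.0 / total_weight)
    z = beta / se
    p = erfc(abs(z) / sqrt(2.0))
    return beta, se, z, p
```

Cohorts with smaller SEs (typically larger cohorts or better-imputed variants) therefore contribute more to the pooled estimate.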

We performed LD-based clumping on the genome-wide significant variants to keep one representative variant per region of LD. If the lead SNP lay within a previously reported type 2 diabetes locus, we defined it as conditionally distinct if showing evidence of residual association (p<5 × 10−5) after conditioning on each of the reported variants.
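The clumping step can be sketched as a greedy procedure: take the most significant variant, assign every variant in LD with it to the same clump, and repeat. The r2 threshold of 0.1 and the absence of a physical distance window are assumptions made for illustration; the settings actually used are given in the ESM Methods.

```python
def clump(pvalues, ld_r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy LD-based clumping.

    pvalues: dict mapping variant ID -> association p value.
    ld_r2:   dict mapping frozenset({id_a, id_b}) -> pairwise r2
             (missing pairs are treated as r2 = 0).
    Returns the lead variants, most significant first.
    """
    candidates = sorted(
        (v for v, p in pvalues.items() if p < p_threshold),
        key=pvalues.get,
    )
    leads, clumped = [], set()
    for variant in candidates:
        if variant in clumped:
            continue  # already represented by a stronger lead variant
        leads.append(variant)
        for other in candidates:
            if other == variant:
                continue
            if ld_r2.get(frozenset((variant, other)), 0.0) >= r2_threshold:
                clumped.add(other)
    return leads
```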

Variants with sub-genome-wide significance (p<1 × 10−6) that were only imputed with the TOPMed panel, showed increased frequency in the Latino population and were >250 kb from other reported genome-wide significant variants from European or East Asian ancestry large consortia [19, 20] were considered for further investigation.

Replication sample

Variants associated with type 2 diabetes at genome-wide and sub-genome-wide significance were tested for replication in six independent cohorts: the Cameron County Hispanic Cohort (CCHC) [21]; the Urban American Indians and Arizona Pima Indians cohorts [22]; the Population Architecture using Genomics and Epidemiology (PAGE) study [23]; the All of Us Research Program [24]; and the Progress in Diabetes Genetics in Youth (PRODIGY), which comprises the Treatment Options for Type 2 Diabetes in Adolescents and Youth (TODAY) [25], the SEARCH for Diabetes in Youth studies [26], the Type 2 Diabetes Genetics Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) cohorts and the Mexican Metabolic Syndrome (METS) cohort [27] (ESM Table 2).

Association with type 2 diabetes-related phenotypes

Given the lack of large-scale publicly available biobanks with Latino samples that may allow for better characterisation of our novel signals, we assembled a collection of cohorts to perform association analyses with 46 type 2 diabetes-related glycaemic, anthropometric and lipid traits. In addition to five of the Latino cohorts analysed in the type 2 diabetes meta-analysis (i.e. SIGMA1, SIGMA2, SIGMA3, MXBB and MGB Biobank), we included three extra cohorts, which we also imputed to the TOPMed panel: the METS and the Mexican Hypertriglyceridemia (MHTG) cohorts, as well as the genetically identified Latino samples from the UK Biobank (UKBB) [28]. We also analysed the Nightingale NMR-based panel of 168 metabolomic biomarkers from the UKBB. Association analyses were done with a maximum of 26,400 adult Latino individuals, depending on the trait, of whom 19,459 were diabetes-free.

Credible sets

For each novel variant, we identified the set of variants with 99% probability of containing the causal variant. We used a Bayesian method [29], considering variants in LD with the lead variant (r2>0.1). We calculated LD using genetic data from 1996 Hispanic/Latino samples from TOPMed freeze 5b.
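One common way to build such a credible set uses Wakefield approximate Bayes factors with equal prior probability for each variant. The sketch below follows that approach; whether it matches the method in [29] exactly is not confirmed here, and the prior variance of 0.04 on the log-odds scale is an assumption.

```python
from math import exp, log


def credible_set(stats, prior_variance=0.04, coverage=0.99):
    """Return (credible set, posterior probabilities).

    stats: dict mapping variant ID -> (beta, se) for variants in LD
           with the lead variant.
    """
    log_abf = {}
    for variant, (beta, se) in stats.items():
        v = se ** 2
        r = prior_variance / (prior_variance + v)
        z = beta / se
        # Wakefield approximate Bayes factor in favour of association
        log_abf[variant] = 0.5 * (log(1.0 - r) + r * z ** 2)
    # Subtract the maximum before exponentiating for numerical stability
    max_log = max(log_abf.values())
    weights = {k: exp(val - max_log) for k, val in log_abf.items()}
    total = sum(weights.values())
    posterior = {k: w / total for k, w in weights.items()}
    chosen, cumulative = [], 0.0
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    for variant, prob in ranked:
        chosen.append(variant)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen, posterior
```

When one variant carries nearly all of the posterior probability, the set collapses to that single variant.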

Genomic annotation

We used the 99% credible sets to annotate their genomic effect using the VEP v100 [30] (GRCh38.p7) and SNPNEXUS release Dec 2020 [31] applications. We used the Genotype–Tissue Expression project (GTEx) V8 [32] to assess the influence of the variants in gene-level expression, the TIGER Portal v7 [33] to evaluate the gene-level expression in pancreatic islets and the Islet Gene View (accessed 17 Dec 2022) [34] to assess the gene co-expression in human islets. We also assessed their association with a variety of phenotypes and diseases using the Common Metabolic Disease Knowledge Portal (cmdgenkp.org, accessed 17 Dec 2022) and other resources.

Expression of genes near novel variants

We assessed the expression levels of the genes ±500 kb around the novel signals in human islets under different conditions pertaining to type 1 and type 2 diabetes. Gene expression differences between groups were assessed using p values and adjusted p values (Benjamini–Hochberg correction) determined by the Wald test using the DESeq2 pipeline [35]. Transcripts per million (TPM) were normalised using Salmon v1.4.0 [36].
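The Benjamini–Hochberg adjustment applied to the Wald test p values can be written in a few lines. This is a generic sketch of the procedure rather than the DESeq2 implementation, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```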

Polygenic scores

Polygenic scoring using single ancestry summary statistics and LD reference panels was calculated via Bayesian Regression and Continuous Shrinkage priors as implemented in PRS-CS release 4 Jun 2021 [37]. We used the UKBB LD reference panel and GWAS summary statistics from European [20], East Asian [19] and Latino populations. GWAS Latino summary statistics were calculated using a meta-analysis with five of the discovery cohorts (i.e. SIGMA1, SIGMA2, SIGMA3, MGB and GERA). Then, we used the estimated posterior SNP effect sizes for each ancestry to calculate and evaluate the performance of the polygenic scores (PSs) in a training cohort (i.e. MXBB). The best model was tested in a target cohort (i.e. the METS cohort).
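Once posterior effect sizes are available, the score itself is a weighted sum of allele dosages. The sketch below shows this final step only (it does not reproduce the PRS-CS shrinkage) and standardises the scores so that effects can be expressed per SD; the data structures are hypothetical.

```python
from statistics import mean, pstdev


def polygenic_scores(dosages, weights):
    """Compute standardised polygenic scores.

    dosages: list with one dict per individual, variant ID -> effect
             allele dosage (0 to 2); missing variants count as 0.
    weights: dict variant ID -> posterior effect size.
    """
    raw = [
        sum(w * person.get(variant, 0.0) for variant, w in weights.items())
        for person in dosages
    ]
    mu, sd = mean(raw), pstdev(raw)
    if sd == 0:
        raise ValueError("scores have zero variance; cannot standardise")
    return [(score - mu) / sd for score in raw]
```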

Given that the ancestry-specific PSs were not highly correlated (r2<0.3), we also used PRS-CSx release 29 Jul 2021 [38], a method that improves multi-ancestry polygenic prediction by integrating GWAS summary statistics from multiple populations. We assessed the performance of the ancestry-specific vs the multi-ancestry PS.

Results

Overall strategy

Figure 1 summarises our overall strategy. We meta-analysed six type 2 diabetes GWAS of Latino ancestry, comprising 8150 cases and 10,735 controls from hospital and population-based studies. All cohorts were imputed with TOPMed and 1000G panels and the imputation performance was evaluated. To replicate the novel loci, we analysed 13,617 type 2 diabetes cases and 20,822 controls from six independent cohorts of Latino ancestry. To gain further insight into the novel loci, we created a collection of type 2 diabetes-related phenotypes that included 26,400 Latino participants with 46 glycaemic and anthropometric traits, as well as 168 metabolomic traits. We used publicly available resources to interrogate our top signals, including functional annotation of the credible sets, and gene expression assessment of nearby genes in pancreatic islets from either type 1 or type 2 diabetes cases and controls or treated under conditions relevant for diabetes pathophysiology. We then used the generated Latino GWAS data, in combination with GWAS from other ancestries, to construct ancestry-specific and multi-ancestry type 2 diabetes PSs.

TOPMed imputation performance

On average, imputation using the TOPMed panel resulted in 41 million (M) high-quality (r2≥0.8) variants, of which 24M were rare (minor allele frequency [MAF]<0.1%). This represents a 6.5-fold increase in the number of imputed rare variants compared with 1000G (Fig. 2a). The quality of imputation consistently improved when using TOPMed, particularly for low-frequency and rare variants (Fig. 2b).

We used WES data to confirm the improvement of TOPMed imputation to detect low-frequency and rare variants. The TOPMed panel allowed the identification of >80% of the WES variants with MAF≥0.1% compared with 60% for the same MAF cut-off with the 1000G panel (Fig. 2d). It also improved the identification of likely pathogenic variants predicted as deleterious that usually occur at low frequency (Fig. 2e).

Type 2 diabetes GWAS meta-analysis

To illustrate the gain in discovery when using TOPMed imputation, we tested the genetic variants for association with type 2 diabetes in six Latino cohorts. Our discovery sample comprised 18,885 Latino non-related individuals (8150 cases, 10,735 controls).

We identified 26 genome-wide significant variants (p<5 × 10−8) associated with type 2 diabetes at 13 loci. Twenty-five of these were previously reported type 2 diabetes-associated variants, including those consistently identified in multiple populations (e.g. variants at KCNQ1 and TCF7L2) and others enriched in the Latino population (e.g. variants at SLC16A11) (Fig. 3a, ESM Fig. 2 and ESM Table 3).

Fig. 3

Type 2 diabetes GWAS meta-analysis in the Latino population. (a) Manhattan plot of the meta-analysis association statistics, highlighting the loci with genome-wide significance (red) or sub-genome-wide significance (orange) for type 2 diabetes. (b) Regional association plot of the novel ORC5/LHFPL3 locus associated with type 2 diabetes risk. (c) Forest plot of the GWAS association statistics for the novel ORC5/LHFPL3 locus in the discovery (black), the replication (blue) and overall (red) cohorts

We identified a novel locus between the ORC5 and LHFPL3 genes on chromosome 7. The intergenic lead variant, rs2891691, has low frequency in Latino people and is associated with a twofold increase in the odds of developing type 2 diabetes in the discovery sample (MAF 1.7%; OR 2.0 [95% CI 1.59, 2.52], p=3.4 × 10−9) (Fig. 3b,c). Although it was also imputed with the 1000G panel, TOPMed’s higher imputation quality strengthened the association (1000G, mean ± SD imputation r2=0.948 ± 0.057, p=2.3 × 10−8; TOPMed, mean ± SD imputation r2=0.983 ± 0.009, p=3.4 × 10−9).

This variant is rare in Europeans (MAF 0.04%), yet prevalent among African (MAF 16%) and East Asian populations (MAF 7.6%). However, its association with type 2 diabetes does not replicate in either Africans (p=0.149) or East Asians (p=0.095). A fixed effects meta-analysis of the three ancestries showed no association of the variant with type 2 diabetes (p=0.734) but a significant heterogeneity in the allelic effects (p=5 × 10−8). To further investigate the source of such heterogeneity, we used MR-MEGA v1.0.5 software [39], which implements a multi-ancestry meta-regression approach to model allelic effects as a function of axes of genetic variation. This meta-regression approach showed a significant association of rs2891691 with type 2 diabetes (p=1.1 × 10−7), as well as significant heterogeneity of the allelic effects between populations driven by ancestry (p=2.9 × 10−8). The residual heterogeneity accounting for other factors, such as phenotype definition or uncorrected population structure, was not significant (p=0.944) (ESM Fig. 3). These results show that the effects of rs2891691 on type 2 diabetes are specific to the Latino population and suggest that the lead variant we identified is in LD with the causal variant in Latino but not African or East Asian populations, a phenomenon also observed in a previous type 2 diabetes multi-ancestry meta-analysis [40]. The heterogeneity in the allelic effects across ancestries can also be explained by differences in environmental exposures.

A sex dimorphism in RELN gene expression has been documented, with higher RELN expression in women [41] and sex hormones likely mediating RELN expression. Because of the proximity of RELN to rs2891691, we evaluated the sex-specific association with type 2 diabetes and tested for heterogeneity between sex-specific allelic effects using GWAMA v2.2.2 [42]. rs2891691 showed a larger effect and a stronger association with type 2 diabetes in women (Neff 10,228; OR 2.4 [95% CI 1.73, 3.22], p=6.6 × 10−8) than in men (Neff 7206; OR 1.5 [95% CI 1.08, 2.19], p=0.018), yet the between-sex heterogeneity did not reach statistical significance (p=0.076) (ESM Table 4).
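A between-sex heterogeneity test of this kind can be approximated from the reported ORs and CIs alone. The sketch below recovers log-odds ratios and SEs from the published intervals and applies a one degree of freedom test on their difference (equivalent to Cochran's Q for two groups). It is not the GWAMA implementation, and because the inputs are rounded it is not expected to reproduce the reported p value exactly.

```python
from math import erfc, log, sqrt

Z_975 = 1.959964  # normal quantile for a 95% confidence interval


def log_or_and_se(odds_ratio, ci_low, ci_high):
    """Recover the log-odds ratio and its SE from an OR and 95% CI."""
    beta = log(odds_ratio)
    se = (log(ci_high) - log(ci_low)) / (2.0 * Z_975)
    return beta, se


def two_group_heterogeneity(beta_a, se_a, beta_b, se_b):
    """z score and two-sided p value for a difference in effects."""
    z = (beta_a - beta_b) / sqrt(se_a ** 2 + se_b ** 2)
    return z, erfc(abs(z) / sqrt(2.0))


# Rounded sex-specific estimates for rs2891691 as reported in the text
beta_women, se_women = log_or_and_se(2.4, 1.73, 3.22)
beta_men, se_men = log_or_and_se(1.5, 1.08, 2.19)
z_het, p_het = two_group_heterogeneity(beta_women, se_women, beta_men, se_men)
```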

Replication analysis

The replication analysis comprised 13,617 type 2 diabetes cases and 20,822 controls (ESM Table 2). The meta-analysis of the replication cohorts, where the variant was present, was nominally significant and showed a consistent direction of effect with the discovery sample (OR 1.18 [95% CI 1.02, 1.36], p=0.025) (Fig. 3c, ESM Table 5).

By querying our Latino collection of type 2 diabetes-related phenotypes, we found that the rs2891691 risk allele C was nominally associated with lower fasting glucose levels (p=0.026) (ESM Table 6). Such a negative correlation might be induced by collider bias, since for glycaemic traits specifically we only analysed diabetes-free individuals. Indeed, a positive association of the rs2891691 risk allele with 2 h glucose adjusted for BMI has been previously reported in Latino ancestry participants (β=3.4 mg/dl [0.2 mmol/l], p=0.006) [43], as has an association with low potassium levels in East Asian ancestry participants (p=8.5 × 10−5) [44]. Accumulated epidemiological evidence points to a relationship between low potassium levels and decreased insulin secretion and risk of type 2 diabetes [45, 46].

The 99% credible set consisted only of the lead variant rs2891691 (ESM Table 7), yet we cannot rule out other variants that were not called because of genotyping complexity or that could not be imputed with the TOPMed panel, such as structural variants, variable number tandem repeats or copy number variants.

To better characterise the role of the ORC5/LHFPL3 locus, we assessed gene expression using the GTEx [32] and TIGER [33] portals. ORC5 is expressed ubiquitously, while LHFPL3 is specifically expressed in the brain (ESM Fig. 5a, b). We then assessed the expression levels of genes ±500 kb around the novel signal in human islets under different conditions relevant to diabetes pathophysiology. ORC5 was downregulated after 2 h and 8 h exposure to IFN-α, and upregulated by exposure to brefeldin A (ESM Fig. 6a, c). Both IFN-α and brefeldin A are endoplasmic reticulum stress inducers that reduce the insulin content with a rise in the proinsulin/insulin ratio [47] and inhibit glucose-stimulated insulin secretion [48], respectively.

Prioritising sub-genome-wide significant variants

We next searched for variants that were associated with type 2 diabetes at sub-genome-wide significance (p<5 × 10−6) but that deserved further study as they lay in previously unreported type 2 diabetes loci, were enriched or Latino-specific, and/or exclusively imputed with the TOPMed panel (Fig. 4a and ESM Table 8). Three out of the 23 sub-genome-wide lead variants lay in or near the known type 2 diabetes loci TACC2, FGFR2 and CCND2. We considered them as distinct variants as they retained locus-wide significance (p<5 × 10−5) after conditioning on the nearest known associated variant.

Fig. 4

Sub-genome-wide significant HDAC2 novel type 2 diabetes loci. (a) Scatter plot of the effect allele frequencies (EAFs) from the sub-genome-wide significant variants in Latino (LAT) vs European (EUR) populations, highlighting those that are distinct from the known lead type 2 diabetes-associated variants (purple) and those that are in novel loci (yellow). (b) Regional association plot of the novel HDAC2 locus associated with type 2 diabetes risk. (c) Forest plot of the association statistics in the discovery (black) and the replication (blue) cohorts. (d) Violin plots of serum 3-hydroxybutyrate levels in non-carriers (blue) and carriers (yellow) of the rs1016378028 variant. Whiskers range from upper and lower fences (1.5 × IQR); points represent outliers. (e) HDAC2 gene expression in human islets from donors with type 1 and type 2 diabetes and control islets treated (brown) or not (green) with different cytokines or other stressor compounds. **p<0.01, ***p<0.001 vs no treatment (adjusted p values, Benjamini–Hochberg correction). Whiskers range from upper and lower fences (1.5 × IQR); points represent outliers. (f) HDAC2 gene expression in multiple tissues from GTEx and TIGER portals. Each box plot shows expression in a different tissue or cell line. Whiskers range from upper and lower fences (1.5 × IQR); points represent outliers

Three additional sub-genome-wide significant variants were located ±1 Mb away from any reported type 2 diabetes locus (Fig. 4a and ESM Table 9). Of interest, rs1016378028 is a low-frequency variant (MAF 1.3%; OR 1.77 [95% CI 1.41, 2.21], p=7.0 × 10−7) that is Latino private (MAF<0.01% in other populations) and is only imputed with the TOPMed panel. It is intronic of HDAC2, a gene under strong purifying selection (probability of being LoF intolerant [pLI]=1, gnomAD, gnomAD-sg.org, accessed 17 December 2022) and that is highly and mostly expressed in pancreatic islets (tiger.bsc.es, accessed 17 December 2022) (Fig. 4f) [33].

Although the replication results did not show statistical significance, the direction of the effect was consistent with the discovery effect (OR 1.17 [95% CI 0.94, 1.45], p=0.1547) (Fig. 4b,c, ESM Table 5). The Diabetes Meta-Analysis of Trans-Ethnic association studies (DIAMANTE) European meta-analysis [20] reported a suggestive signal ~80 kb upstream of rs1016378028 (rs4945979, p=4.8 × 10−6). After conditioning for the rs4945979 variant, the statistical significance of our identified variant remained essentially the same (OR 1.75 [95% CI 1.4, 2.2], p=4.5 × 10−7).

The rs1016378028 risk allele was significantly associated with higher levels of acetone (p=1.2 × 10−7), 3-hydroxybutyrate (p=1.01 × 10−5) and acetoacetate (p=3.3 × 10−5) (Fig. 4d and ESM Table 10). It was also nominally associated with lower hip circumference (p=0.02) and higher WHR (p=0.03) (ESM Table 6).

HDAC2 expression in human islets is downregulated after exposure to IFN-α (8 h log2-fold change=−0.38, p=6 × 10−7; 18 h log2-fold change=−0.28, p=3 × 10−4) or IFN-γ+IL-1β (log2-fold change=−0.39, p=3 × 10−7) (Fig. 4e). These cytokines mimic the proinflammatory milieu of type 1 diabetes, inhibit beta cell function [49, 50], induce beta cell stress and may trigger beta cell dedifferentiation in type 2 diabetes [51, 52].

Development of PSs for the Latino population

We then developed a PS for type 2 diabetes in Latino people using our TOPMed imputed GWAS meta-analysis data. This PS explained 1.6% of the type 2 diabetes status variance (Fig. 5a), which is expected given the relatively small sample size of the Latino summary statistics compared with European and East Asian ancestries. The PSs derived from the DIAMANTE European GWAS [20] and from the Asian Genetic Epidemiology Network (AGEN) East Asian GWAS [19] explained 5.1% and 4.4% of the type 2 diabetes variance in the Latino population, respectively. The European and East Asian PSs showed a weak correlation (r2<0.2) with our Latino TOPMed-derived PS, suggesting that they could provide orthogonal information and improve the overall predictive performance. We developed a PS that incorporated GWAS data from the three ancestries using PRS-CSx [38], a method that allows for the integration of summary statistics and LD reference panels from different ancestries. The multi-ancestry PS including the three GWAS summary statistics explained 7.6% of the type 2 diabetes variance in the Latino target sample. Our Latino GWAS added 1% to the explained variance compared with the PS using only European and East Asian GWAS, which explained 6.6% of the variance.

Fig. 5

PS for the risk of type 2 diabetes in Latino population. (a) Variance explained by a PS using these Latino GWAS association statistics (LAT, green), the AGEN East Asian GWAS association statistics (EAS, grey), the DIAMANTE European GWAS association statistics (EUR, red), a combination of DIAMANTE European and AGEN East Asian GWAS association statistics (yellow) and a combination of DIAMANTE European, AGEN East Asian and these Latino GWAS association statistics (blue). METSB was used as the testing cohort. (b) Receiver operating characteristic curves for the type 2 diabetes risk prediction explained by a model including sex, age and ten PCs of ancestry (black), a model including covariates and a PS constructed using DIAMANTE European (red) and a model including covariates and a PS constructed using a combination of DIAMANTE European, AGEN East Asian and these Latino GWAS association statistics (blue). (c) Distribution of a multi-ancestry PS using a combination of DIAMANTE European, AGEN East Asian and these Latino GWAS association statistics in type 2 diabetes cases (blue) and controls (black). The table shows the OR per SD attributed to the multi-ancestry PS, as well as the OR for high-risk individuals

Each SD of the multi-ancestry PS was associated with an OR of 1.9 (95% CI 1.6, 2.2, p=3.7 × 10−19) (Fig. 5c). People in the top 2.5 percentile of the PS had a fourfold higher risk of developing type 2 diabetes (OR 4.01 [95% CI 1.87, 8.62], p=3.7 × 10−4) (Fig. 5c). The receiver operating characteristic AUC of the full model including the multi-ancestry PS was 0.748 (95% CI 0.72, 0.775) compared with 0.729 (95% CI 0.701, 0.758) for the model including the European-only PS, representing a 2% improvement in the prediction accuracy (p=0.008) (Fig. 5b).
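The AUC reported here has a simple interpretation: it is the probability that a randomly chosen case has a higher predicted risk than a randomly chosen control. A minimal sketch of that rank-based definition follows; it is quadratic in sample size and intended only as an illustration, with a hypothetical function name.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the fraction of case-control pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    correct = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                correct += 1.0
            elif case == control:
                correct += 0.5
    return correct / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to a score that does not separate cases from controls, and 1.0 to perfect separation.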

Discussion

The Latino population has been underrepresented in most genetic studies. Yet, recent studies of type 2 diabetes in Latino populations have been fruitful, even with sample-size orders of magnitude smaller than those in studies of European or East Asian ancestries. The poor representation of Latino samples with genotype and phenotype data constrains nearly every step of a gene–disease association framework, including genotype imputation, a cost-effective technique to improve the resolution of a GWAS. This is more problematic for low-frequency and rare variation. Instead, next-generation sequencing technologies have typically been chosen but these are more expensive, precluding the study of large samples. This study was motivated by the recent release of the TOPMed imputation panel, which includes the largest number of Latino haplotypes compared with all available panels.

In this study, we aggregate genotype and WES data from six datasets to test the improvement in accuracy of the TOPMed imputation compared with 1000G. To illustrate how this panel can boost the discovery of complex disease variants we performed a type 2 diabetes GWAS meta-analysis using the imputed data. TOPMed imputation not only improved the statistical significance of our findings but allowed for the testing of up to 24 M rare variants, compared with 3 M properly imputed with the 1000G panel. The high quality of TOPMed imputation at low/rare frequencies is especially relevant for the study of disease-causing variation, because deleterious variants usually span such a spectrum. We show that by imputing with TOPMed, it is possible to test >90% of the variants with a MAF≥0.1% predicted to be deleterious by the Combined Annotation Dependent Depletion (CADD) score; previously, it was only possible to detect these variants by relying on more expensive sequencing technologies. While ascertaining variants at frequencies <0.1% may still require whole-genome sequencing (WGS) or WES, we estimate that the power to identify associated variants may be limited unless we undertake sequencing efforts with sample sizes orders of magnitude larger than our study. For example, for MAF<0.1%, the effective sample size required to reach statistical power to detect associations with an effect of OR>2.0 is above 170,000 individuals (Fig. 2c). Since the cost of sequencing such a large sample size is a major constraint for the study of underrepresented populations, we propose that highly accurate imputation with dense reference panels may be a more cost-effective approach.

In this study, we identified a novel low-frequency variant associated with type 2 diabetes, rs2891691, which lies between the ORC5 and LHFPL3 genes and showed increased accuracy of imputation and association power when using the TOPMed panel. ORC5 encodes subunit 5 of the origin recognition complex, which is implicated in DNA replication origins, transcription silencing and heterochromatin formation [53]. Lipoma HMGIC fusion partner-like 3 (LHFPL3) is a member of the tetraspanin superfamily, which functions as a membrane protein organiser. The rs2891691 risk allele is present in 1% of Latino people. Overall, in discovery and replication cohorts, carriers have 1.37-fold increased odds of developing type 2 diabetes, with a possibly higher risk in women.

We identified a second low-frequency variant, rs1016378028, associated with a 1.7-fold increased risk of type 2 diabetes, which is not imputed with the 1000G panel. This variant was prioritised from a subset of variants at a sub-genome-wide significant threshold that showed additional evidence of association. rs1016378028 is a Latino private variant (MAF: Latino, 1.3%; East Asian, 0.2%; other populations, <0.05%), and lies within HDAC2, a gene that is highly intolerant of protein-changing variation and is mostly expressed in pancreatic islets [33].

Histone deacetylase 2 (HDAC2) is a histone deacetylase involved in gene transcription repression. HDACs play a regulatory role in insulin signalling, beta cell function and pancreatic endocrine cell development. At low glucose levels, HDAC2 is recruited to the insulin promoter to downregulate its expression [54]. In human islets, HDAC2 expression negatively correlates with insulin gene expression (r=−0.56, false discovery rate 3.7 × 10−16) and positively correlates with expression of IAPP, which encodes a satiety hormone (r=0.38, false discovery rate 1.8 × 10−7) [34]. HDAC2 also deacetylates IRS-1, uncoupling its downstream phosphorylation cascade. Both insulin expression and insulin signalling are partially restored after treatment with HDAC2 inhibitors [55, 56]. We show that cytokine treatment of pancreatic islets downregulated HDAC2 expression.

Because there are no comprehensive phenome-wide association data to guide the interpretation of variants enriched in Latino populations, we aggregated phenotypic glycaemic and cardiometabolic data from 26,400 Latino individuals to follow-up the identified variants. We found that rs1016378028 risk allele carriers have higher levels of ketone bodies, which are produced through the breakdown of fatty acids and serve as an alternative energy source to glucose. Uncoupled hepatic production of ketone bodies may be a pathological consequence of relative insulin deficiency in diabetes [57]. While the mechanism linking rs1016378028, diabetes and 3-hydroxybutyrate levels remains to be determined, our results suggest this variant as a potential genetic type 2 diabetes risk factor.

We leveraged our GWAS results and existing publicly available data to develop an improved PS for Latino ancestry. PSs developed in a particular ancestry group transfer poorly to other populations, exacerbating disparities between populations. We provide an improved PS for the Latino population by combining GWAS and LD data from East Asian and European populations with our Latino GWAS. This PS showed a similar performance to that previously reported in European ancestry [58], with individuals in the top 2.5 percentile showing a fourfold increased risk of type 2 diabetes. Evaluating this PS in additional external datasets of Latino ancestry may prove useful in assessing its potential clinical utility.

Leveraging new resources to reanalyse Latino data, such as imputation with the TOPMed panel, proved to be successful in identifying additional type 2 diabetes-related loci. We acknowledge that the TOPMed panel allows the testing of an increased number of variants and additional evidence will be needed to confirm associations at the standard GWAS significance. Further efforts are needed to increase the power of discovery and to follow-up on novel findings in diverse populations. Until then, translation of identified genetic variation-to-function and application to the clinic in Latino populations will remain highly compromised compared with the resources available for European populations. In this study we gathered a high number of Latino samples with extensive biomarker and clinical characterisation; however, larger sample sizes are still needed to achieve sufficient statistical power to detect low-frequency variants. Efforts must be expanded to build shareable resources with a high representation of different ancestries, enabling ancestry-specific effects to be interpreted within the local ancestry context, which is instrumental to identify causal genes, to improve the biological mechanistic insight and to develop targeted therapies.

Overall, this study confirms the superior imputation performance of TOPMed, representing a cost-effective and unique opportunity to analyse low-frequency and rare genetic variants in Latino samples at scale. It also presents the largest type 2 diabetes GWAS meta-analysis performed in individuals of Latino ancestry imputed with the TOPMed reference panel. Despite the sample size being orders of magnitude smaller compared with studies performed in other populations, the novel discoveries presented here suggest that more novel genetic associations and new biology of type 2 diabetes will be revealed as the sample size of discovery samples, reference panels and large-scale biobanks with phenome-wide data increase in studies including non-European populations.