Network expansion of genetic associations defines a pleiotropy map of human cell biology

Network expansion of genetic associations defines a pleiotropy map of human cell biology

Systematic augmentation of GWAS with network propagation

Recent studies have shown that a comprehensive protein interaction network is critical for network propagation efforts9. Here, we combined the International Molecular Exchange physical protein interaction dataset19 from IntAct (protein–protein interactions)20, Reactome (pathways)21 and SIGNOR (directed signaling pathways)22. To facilitate re-use of these data (referred to as ‘OTAR interactome’) we have made the data available via a Neo4j Graph Database ( The physical interactions were combined with functional associations from the STRING database (v.11)23 to give a final network containing 571,917 edges connecting 18,410 proteins (nodes) (Fig. 1a). GWAS trait associations were mapped to genes using the locus-to-gene (L2G) score from Open Targets Genetics, a machine learning approach that integrates features such as SNP fine-mapping, gene distance and molecular quantitative trait locus (QTL) information to identify causal genes (Fig. 1b)11. Genes with L2G scores higher than 0.5 are expected to be causal for the respective trait association in 50% of cases.

Fig. 1: Implementation and benchmarking of network-based augmentation of GWAS.
figure 1

a, Edge and node counts of the combined interactome and its components. OTAR is the Open Targets combined physical protein interaction network that is provided via a Neo4j Graph Database. b, Graphic representation of some L2G components: SNP-to-gene distance, data from QTLs and variant effect predictions. The integration of information into the L2G score has been described previously11. c, Graphical representation of the network-based approach: network propagation of the initial input, clustering using a random walker to find gene communities and scoring of modules using the distribution of PageRank score. KS, Kolmogorov–Smirnov. d, Number of starting genes linked to traits, grouped in therapeutic areas. In the violin plot, the red dots represent the median, the limits of the thick line correspond to quartiles 1 and 3 (25% and 75% of the distribution) and the limits of the thin line are 1.5× the interquartile range. e, Benchmarking of the method, using as a starting signal genes from the Open Targets Genetics portal with a L2G score >0.5. AUC values are calculated using as positive hits the DISEASE database, with increasing cutoff values for its gene-to-trait score (Methods), as well as clinical trials data from the ChEMBL database (clinical phase II or higher). We also re-calculated the AUC values and determined Z-scores reflecting the deviation in AUCs relative to those observed after randomization of the list of true positives (TPs). In the boxplots, the middle lines represent the median, the limits of the box are quartiles 1 and 3 and the whiskers represent 1.5× the interquartile range.

For each GWAS, associated genes were used as seeds in the interaction network. Of 7,660 GWAS genes linked to at least one trait, 7,248 correspond to proteins present in the interaction network. We then used the Personalized PageRank (PPR) algorithm to score all other protein coding genes in the network where genes connected via short paths to GWAS genes receive higher scores (Fig. 1c). Genes in the top 25% of network propagation scores were used to identify gene modules, from which we selected those significantly enriched for high network propagation scores (Benjamini–Hochberg (BH)-adjusted P < 0.05 with Kolmogorov–Smirnov test) and with at least two GWAS-linked genes (Methods). We applied this approach to 1,002 traits (Supplementary Table 1) with GWAS in the Open Targets Genetics portal that had at least two genes mapped to the interactome. These GWAS were spread across 21 therapeutic areas, and differed in the number of GWAS-linked genes (median 6, range 2–763) (Fig. 1d).

To measure the capacity of the network expansion to recover trait-associated genes, we defined a ‘gold standard’ set of disease-associated genes (from that are known drug targets for specific human diseases (from the ChEMBL database, Methods). To avoid circularity in benchmarking, we excluded gold standard genes that overlapped with GWAS-linked genes for the respective diseases. The network propagation score predicted disease-associated genes with an average area under the receiver operating characteristic (ROC) curve (AUC) >0.7 for the most stringent definition of disease-associated genes as well as known drug targets (Fig. 1e and example ROC curves in Supplementary Fig. 1). The performance was higher than that observed with random permutation of the gold standard gene sets (Fig. 1e and Supplementary Fig. 2; true positive permutations), suggesting that it is not strongly biased by the placement of the gold standard genes within the network. We also tested the impact of changing the interaction network, either by using subsets of the network defined here or by using the previously defined composite PCNet network9 (Supplementary Fig. 3). Overall, the combined network performed best with an accuracy similar to that of the larger PCNet (Supplementary Fig. 3).

In total, we obtained network propagation scores for 1,002 traits and gene modules for 906 traits (Supplementary Table 1).

Network propagation identifies related human traits

Identifying groups of traits likely to have a common genetic basis is of value because drugs used to treat one disease may also have effects in related diseases. Genetic sharing between human traits is often determined by correlation of SNP-level statistics from GWAS; however, this approach does not identify how the shared genetics corresponds to shared biological processes. In addition, many GWAS do not report the full summary statistics needed for such comparisons. By contrast, network propagation scores can be calculated from the set of candidate genes available for any GWAS. To benchmark trait–trait associations derived from network propagation, we used the similarity of annotations from the Experimental Factor Ontology (EFO), which include aspects of disease type, anatomy and cell type among others. For example, pairs of related neurological traits tend to share many annotation terms in the EFO. Using these annotations, we defined 796 pairs of traits that are functionally related and therefore likely to have a common genetic basis (Methods). An additional benchmark was obtained from trait-to-trait genetic correlations calculated from SNP-based analyses24,25,26. Using these benchmarks, we show that similarity in the network propagation scores can identify functionally and genetically related pairs of traits (Supplementary Fig. 4).

To explore trait–trait relationships on the basis of the similarity of their perturbed biological processes, we used the pairwise distance of network propagation scores to build a tree by hierarchical clustering (Fig. 2a), and defined 54 subgroups of traits. The traits tend to group according to functional similarity with 34 of 54 having an EFO term annotated to more than 50% of the traits in the group (Fig. 2a). In Fig. 2b we show examples of traits that are grouped together according to the network propagation scores. These include known relationships between immune-associated traits such as cellulitis or psoriasis and immunoglobulin G measurements; the relationship between skin neoplasms and skin pigmentation or eye color; or the clustering of cardiovascular diseases (acute coronary symptoms) with lipoprotein measurements and cholesterol.

Fig. 2: Trait–trait genetic and functional similarities determined from network expansion of GWAS data.
figure 2

a, Tree showing the Manhattan distance between all traits, using the full PPR score. Hierarchical clustering was performed using a cutoff of h = 1, leading to 54 clusters, colored depending on the predominant EFO ancestry term. The right-hand panel is a barplot showing the 54 clusters with the frequencies for the predominant EFO ancestry terms and a heatmap showing the counts for ChEMBL targets and drugs. The text label next to each cluster corresponds to the second most predominant EFO terms that, on average, label 35% of the traits within the clusters that have a text label. b, Examples of traits grouped using the Manhattan distance, extracted from the tree in a. CSF, colony-stimulating factor; Ig, immunolglobulin; LDL, low-density lipoprotein.

We obtained drug indications from the ChEMBL database for the diseases in each cluster (Fig. 2a). This allows us to find clusters in which drugs may be considered for repurposing, as well as groups of traits in which drug development is most needed. Eighteen clusters representing 64 traits contain no associated drug and represent less well-explored areas of drug development. All trait clusters, genes and corresponding drugs are available in Supplementary Table 1.

Pleiotropy of gene modules across human traits

We can study the pleiotropy of human cell biology by identifying which gene modules tend to be associated with many human traits. This allows us to understand how perturbations in specific aspects of cell biology may have broad consequences across multiple traits. In total, we found 2,021 associations between gene modules and traits, of which 886 (43.8%) are gene modules linked to a single trait and the remaining can be collapsed to 73 gene modules linked to two or more traits (Fig. 3a, Supplementary Table 2 and Methods). The 73 modules associated with more than one trait did not have a significantly larger number of genes (P = 0.72, Kolmogorov–Smirnov test), whereas the traits linked with the 73 pleiotropic gene modules tend to have a higher number of significant initial GWAS seed genes (Supplementary Fig. 5). Therefore, traits with a larger number of linked loci are more likely to be associated with pleiotropic gene modules.

Fig. 3: Multitrait gene module associations for studies of shared biological processes and drug-repurposing opportunities.
figure 3

a, Heatmap showing the overlap between gene modules across traits. Traits were clustered using hierarchical clustering (Methods) and subgroups were defined by a cutoff of 0.6 average correlation coefficient. A module was considered the same across different traits when most genes are in common (Jaccard index > 0.7). Significant trait–module relations are marked in yellow or pink, with yellow indicating modules overrepresented in one of the subgroups of traits (one-sided Fisher’s exact test, adjusted P < 0.05) and pink otherwise. The heatmap in the right-hand panel shows the number of genes in modules from each subgroup of traits that are drug targets (phase III or higher, ChEMBL database), linked with clinical variants (ClinVar database) or with mouse KO phenotypes (International Mouse Phenotyping Consortium database). b, Barplot showing the number of traits linked with the top six most pleiotropic gene modules. The GOBP description is based on the results of a GOBP enrichment test (Methods). c, Simplified heatmap of the clusters in a concerning bone-related and fasciitis traits. The represented network includes genes from the modules indicated in blue letters and the represented interactions have been filtered for visualization (Methods). Blue nodes, relevant mouse KO phenotypes; green nodes, diseases with clinical variants enriched in this gene module; red nodes, drugs in clinical trials. Genes linked to blue, green or yellow nodes have the linked mouse phenotypes, clinical variants in the linked disease or are targets of the linked drug. Genes that are the targets of drugs in clinical trials have yellow nodes. GWAS-linked genes (L2G score > 0.5) have borders colored in an orange to red gradient (count of GWAS-linked traits). d, Simplified heatmap of one the clusters in a concerning allergic reactions (node and edge color code are the same as in c). In this case, two modules were merged to build the interaction network in the right-hand panel. mRNA, messenger RNA; SRP, signal recognition particle.

The six most pleiotropic gene modules were linked to between 56 and 110 traits in our study, and were enriched (Gene Ontology Biological Process (GOBP) enrichment with one-sided Fisher’s exact test, BH-adjusted P < 0.05) for genes involved in protein ubiquitination, extracellular matrix organization, RNA processing and G protein-coupled receptor (GPCR) signaling (Fig. 3b). Gene deletion studies in yeast have identified some of the same cellular processes as being highly pleiotropic15. Genes within pleiotropic modules linked to ten or more traits are enriched in genes that are ubiquitously expressed (fold enrichment = 1.42, P = 1.71 × 10−16, Fisher’s exact test, one-sided), have many deletion phenotypes (fold enrichment = 1.56, P = 1.71 × 10−30, Fisher’s exact test, one-sided) and higher numbers of genetic interaction (Fisher’s exact test, one-sided P = 4.155 × 1010). Targeting pleiotropic processes with drugs could, therefore, have broad application, but may also raise safety concerns. However, despite these enrichments, there is no simple correlation between the number of traits linked to a gene module and the enrichment of ubiquitously expressed genes (Pearson’s r = 0.0793) or genes with many deletion phenotypes (Pearson’s r = −0.0345). This analysis allows us to connect gene deletion phenotypes with human traits (Supplementary Fig. 6). For example, a pleiotropic module linked to traits such as ‘autism spectrum disorder’ and ‘osteoarthritis’ has a high fraction of gene deletion phenotypes impacting on protein transport, and a module linked with Alzheimer’s disease, balding measurement and bone density has genes with a high fraction of gene deletion phenotypes associated with cellular senescence (Supplementary Fig. 6).

We then related pleiotropy as defined by the module–trait associations derived here with pleiotropy defined by CRISPR gene deletion studies. For each Gene Ontology (GO) term, we calculated the enrichment in genes linked with many traits in our analysis with the enrichment in genes having many gene deletion phenotypes. GO terms specifically enriched in pleiotropic genes based on our definition are dominated by terms that relate to multicellularity, such as membrane signaling, cell-to-cell communication and cell migration (Supplementary Fig. 7). For pleiotropy that is specifically found with CRISPR screens, we find terms related to essential processes such as cell cycle, ribosome biogenesis and RNA metabolism (Supplementary Fig. 7).

For each of the 73 pleiotropic gene modules, we highlighted those that are overrepresented in each group of related traits (Fig. 3a and Methods, one-sided Fisher’s exact test, BH-adjusted P < 0.05). To facilitate the study of cell biology and drug-repurposing opportunities we annotated (Fig. 3a and Supplementary Table 2) the genes found in overlapping modules for each of the clusters with data from: ChEMBL (targets of drugs in at least phase III clinical trials), ClinVar (genes linked to clinical variants) and mouse knockout (KO) phenotypes (phenotypic relevance and possible biological link). We explore a few examples of these modules in the following sections.

Shared mechanisms and drug-repurposing opportunities

We identified two groups of traits (bone and fasciitis related) that are predicted to have a common determining gene module (Fig. 3c and Supplementary Table 3). This module is enriched in Wnt signaling genes, which have been previously linked to bone homeostasis27 and to different types of fasciitis as well as Dupuytren’s contracture28. We collected genes harboring likely pathogenic variants from ClinVar (Methods), hereafter referred to as ClinVar variants. This gene module is enriched in genes harboring ClinVar variants from patients with tooth agenesis and bone-related diseases (osteoporosis and osteopenia). Several genes with ClinVar variants, such as LRP6, SOST, WNT1, WNT10A and WNT10B, are not linked to bone diseases via GWAS. Genetic manipulation of several genes within this module causes changes in bone density in mouse models29. In addition, this module contains the target (SOST) of Romosozumab, a drug proven effective to treat osteoporosis.

In a second example (Fig. 3d and Supplementary Table 3), we identified a group of ten respiratory (for example, asthma) and cutaneous (for example, eczema) immune-related diseases that share three gene modules: a highly pleiotropic module related to regulation of transcription and proteasome, and two more specific modules related to pattern recognition receptor signaling and cytokine production with Janus kinase/signal transducer and activator of transcription (JAK–STAT) involvement. These modules were significantly enriched (one-sided Fisher’s exact test, P < 0.05) in genes having likely pathogenic variants from patients with asthma. The two most specific gene modules were grouped and are shown in Fig. 3d highlighting several genes with known pathogenic variants not associated with these diseases via GWAS (for example, IRAK3, TNF, ALOX5, TBX21). IRAK3, encoding a protein pseudokinase, is an example of a druggable gene not identified by GWAS for asthma, but with protein missense variants linked to this disease30, and mice model studies have implicated the regulation of IRAK3 in airway inflammation induced by interleukin-33 (IL-33)31. Although no drug for IRAK3 is used in the clinic, this analysis suggests it may serve as a relevant drug target for asthma and other related diseases.

We identified a total of 41 targets of 126 drugs targeting the genes in the module shown in Fig. 3d. To identify drugs that could have repurposing potential, we excluded those already targeting therapeutic areas that include the ten diseases linked to this gene module. This resulted in 18 drugs (Supplementary Table 3) targeting 5 genes including: 14 drugs targeting PTGS2, used to treat primarily rheumatic disease and osteoarthritis; interferon alfacon1 or alfa-2B (targeting IFNAR1 and IFNAR2), designed to counteract viral infections; galiximab and antibody for CD80 (phase III trials for lymphoma); and the antibody RA-18C3 targeting IL1A for colorectal cancer. These drugs may be suited to repurposing for respiratory or cutaneous autoimmune-related diseases. As an example, RA-18C3 has shown benefit in a small phase II trial for hidradenitis suppurativa (acne inversa)32.

Gene module analysis of related immune-mediated diseases

Traits related to the immune system are well represented in our analysis, falling into three different groups: one cluster containing systemic and organ-specific diseases; one cluster of immune cell measurements; and a third, more heterogeneous, cluster (Fig. 3a and Supplementary Table 2). In Fig. 4a we represent the first of these clusters, which can be further subdivided into a subgroup linking IBD, multiple sclerosis and systemic lupus erythematosus, and one linking celiac disease, vitiligo and other diseases. We found six gene modules that are specifically enriched with at least one of these two groups of traits, including gene modules related to GPCR signaling, neutrophil activation and interferon signaling. Genes present in these modules show higher relative expression (Fig. 4a, right) in key immune tissues.

Fig. 4: Gene module analysis of autoimmune diseases.
figure 4

a, Heatmap showing the overlap between gene modules across traits (color-coded as in Fig. 3a,c,d). The GOBP description is based on the results of a GOBP enrichment test (one-sided Fisher’s exact test, BH adjustment, Methods). The heatmap in the right-hand panel shows the gene set enrichment analysis carried out on the expression data from different tissues extracted from Human Protein Atlas (HPA) for the gene modules in blue (two-sided Kolmogorov–Smirnov test, Methods). After BH adjustment for multiple testing, the P value of the test was log transformed and given a positive value if the median distribution for the foreground was higher than the background and a negative value if it was lower. b, Shared modules as a network, nodes are gene modules associated with different immune-related traits colored blue or red for the two trait subgroups; edges represent a high degree of overlap at the gene level (Jaccard index > 0.7). Gene modules linked to different traits are given in black circles. Gene modules are linked with the yellow node ‘ChEMBL-drugs’ when they contain targets for drugs in clinical trials (phases III and IV, ChEMBL); linked with green nodes when they are enriched in genes with clinical variants for a given disease; and linked with purple nodes when they are enriched for the corresponding KO phenotypes (one-sided Fisher’s exact test, adjusted P < 0.05). c, Network corresponding to genes found in gene modules enriched for Type I interferon (INF) signaling, phospholipase C-activating GPCR signaling, neutrophil activation (integrins) and protein kinase A (PKA) activity. Edge filtering, node and edge colors are the same as in Fig. 3c,d.

The six gene modules are shown in Fig. 4b with a connection between them when there is a significant gene-level overlap (Fig. 4b; Methods). For representation (Fig. 4c), we selected genes from modules linked with at least three immune-mediated diseases and kept a subset of interactions of high confidence (Methods). We found multiple genes with ClinVar variants from patients with primary immune deficiencies (for example, IRF9, IRF7, STAT1, STAT2) that are not GWAS-linked genes but are in their network vicinity, providing evidence of the importance of this gene module for these diseases.

To pinpoint drugs with repurposing potential, we excluded those targeting diseases in the same therapeutic areas as the immune-mediated group of diseases, identifying 49 drugs with 20 targets. These include ulimorelin, an agonist of the ghrelin hormone secretagogue receptor GHSR used to treat gastrointestinal obstruction. Ghrelin hormone signaling has been studied in the context of age-related chronic inflammation33, psoriasis34 and IBD (reviewed in ref. 35) indicating a potential repurposing opportunity. The 49 drugs with repurposing potential are listed in Supplementary Table 3 with information on target genes and clinical trials.

Network-assisted candidate gene prioritization for IBD

Although the gene modules we have described can highlight biological pathways shared between genetically related traits, identifying causal genes at individual GWAS loci is important for prioritizing therapeutic targets. Existing methods such as GRAIL36, DEPICT37 and MAGMA38 prioritize genes based on biological pathways but do not fully use genome-wide protein interaction networks, which can provide finer-grained information over GO terms.

Here, we use network propagation to prioritize genes at IBD GWAS loci, similar to our previous work on Alzheimer’s disease39. We used two alternative methods of defining seed genes for the network. First, we manually curated 37 genes with high confidence of being causally related to either Crohn’s disease or ulcerative colitis (Supplementary Table 4) and second, we used the Open Targets L2G score to automatically select 110 genes with L2G > 0.5 at established IBD loci40,41 (Methods and Supplementary Table 4). To obtain network propagation scores, we compared each gene’s score with 1,000 runs using the same number of randomly selected input genes, to give the PPR percentile value (Methods). We obtained unbiased network propagation values for each seed gene by excluding them one at a time (Methods).

The curated seed genes had far higher network scores than other genes within 200 kb (P = 7.4 × 10−6, one-tailed Wilcoxon rank sum test), indicating that most seed genes have close interactions with other seed genes (Fig. 5a). The same was true when considering seed genes exclusively in the L2G gene set (Fig. 5b; P = 3 × 10−10, one-tailed Wilcoxon rank sum test), indicating that many of these are also strong IBD candidate genes. Finally, we examined the enrichment of low SNP P values within 10 kb of genes having high network scores. This revealed a progressive enrichment of low P values near genes with higher network scores (Fig. 5c), which held for the large number of genes linked to SNPs not reaching the typical genome-wide significance threshold of 5 × 10−8 for locus discovery.

Fig. 5: An IBD-specific network is enriched for likely causal genes.
figure 5

a, Curated IBD seed genes (N = 37) tend to have a higher network propagation score (PPR percentile) than other genes within 200 kb at the same loci. b, Genes selected by high Open Targets L2G score also tend to have high PRR percentile, highlighting network evidence as complementary to typical locus features. In the boxplots, the middle lines represents the median, the limits of the box are quartiles 1 and 3 and the whiskers represents 1.5× the interquartile range. c, Genome-wide, genes with low P-value SNPs within 10 kb are enriched for high PPR percentile (one-sided Fisher’s exact test). Data are presented as the mean ± s.d.

Curated genes with strong network support include the drug targets TYK2, ICAM1 and ITGA4, and NOD2 and IL23R, which have missense variants implicating them as modulators of IBD42,43,44. A small number of curated genes had lower network support, which could be due to these genes affecting IBD via pathways distinct from the biological functions covered most well by the curated gene set. Across IBD loci without curated genes, our network scores rank 42 candidates as being more highly functionally connected than the remaining genes at the locus (Supplementary Table 4 and Methods). Although many of these were already strong IBD candidate genes, some have found strong support only recently. A clear example is the RIPK2 locus. Although OSGIN2 is nearest to IBD lead SNP rs7015630 (38 kb distal), it has no apparent functional links with IBD (network score 43%). By contrast, RIPK2 (108 kb distal, network score 99%) encodes for a mediator of inflammatory signaling via interaction with the bacterial sensor NOD2 (ref. 45). Network information can also provide a comparison point for other evidence sources. At the DLD-SLC26A3 locus, there is moderate evidence of genetic colocalization between IBD and an expression quantitative trait loci (eQTL) for DLD in various tissues (Open Targets Genetics portal). However, DLD has no clear functional links with IBD and receives a low network score (14%). By contrast, SLC26A3 is a chloride anion transporter highly expressed in the human colon, with a high network score (98.4% in the L2G seed gene network), and its expression has been recently associated with clinical outcomes in ulcerative colitis46. IBD candidate genes that have high network scores but have not been well characterized in the context of IBD include PTPRC (a phosphatase required for T cell activation) and BTBD8, which is functionally connected to autophagy by the network analysis (via WIPI2 and ATG16L1).

To study the pleiotropy of the curated and candidate genes we looked at the eight gene modules linked by our analysis to IBD (Supplementary Fig. 8). Of the 37 curated and 42 candidate genes, 35 (14 curated and 21 candidate) are found within these modules. Interestingly, we found that most of these genes are in modules that are only linked to IBD; in particular, a module that is enriched for genes related to receptor signaling via the JAK–STAT pathway (Supplementary Fig. 9). Conversely, the most pleiotropic modules linked to IBD have very few IBD candidate genes within them. As expected, these pleiotropic modules tend to be associated with traits that are related to the immune system, with the exception of the most pleiotropic module, which is enriched for genes related to protein ubiquitination (Supplementary Fig. 8). This analysis suggests that the JAK–STAT-related module is likely to be the best source of novel candidate disease genes and drug targets that are more inclined to be specific to IBD.