Connecting to Spark

To work with GraphFrames, we need a connection to Spark. For computation on a compute cluster, this is done by connecting to the Spark installation on the cluster.

For demonstration purposes, however, we use a local connection here, which requires a local Spark installation.

Spark can be installed locally via (this needs to be carried out only once):

sparklyr::spark_install("3.0")

We can connect to local instances of Spark as well as to remote Spark clusters. Here we connect to our local installation of Spark via

sc <- sparklyr::spark_connect(master = "local", version = "3.0")

Store BioPlex PPIs in a GraphFrame

Get the latest version of the 293T PPI network:

bp.293t <- BioPlex::getBioPlex(cell.line = "293T", version = "3.0")
## Using cached version from 2023-01-14 23:15:28
head(bp.293t)
##    GeneA  GeneB UniprotA UniprotB SymbolA SymbolB           pW          pNI
## 1    100 728378   P00813   A5A3E0     ADA   POTEF 6.881844e-10 0.0001176357
## 2 222389   6137 Q8N7W2-2   P26373   BEND7   RPL13 1.340380e-18 0.2256644741
## 3 222389   5928 Q8N7W2-2 Q09028-3   BEND7   RBBP4 7.221401e-21 0.0000641669
## 4 222389  25873 Q8N7W2-2   Q9Y3U8   BEND7   RPL36 7.058372e-17 0.1281827343
## 5 222389   6124 Q8N7W2-2   P36578   BEND7    RPL4 1.632313e-22 0.2006379109
## 6 222389   6188 Q8N7W2-2   P23396   BEND7    RPS3 3.986270e-26 0.0010264311
##        pInt
## 1 0.9998824
## 2 0.7743355
## 3 0.9999358
## 4 0.8718173
## 5 0.7993621
## 6 0.9989736

and turn it into a graph object:

bp.gr <- BioPlex::bioplex2graph(bp.293t)
bp.gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689 
## Number of Edges = 115868

Switch to a graphframes backend:

gf <- BioPlexAnalysis::graph2graphframe(bp.gr, sc)
gf
## GraphFrame
## Vertices:
##   Database: spark_connection
##   $ id       <chr> "P00813", "Q8N7W2", "Q6ZMN8", "P20138", "P55039", "Q17R55", "…
##   $ entrezid <chr> "100", "222389", "645121", "945", "1819", "148109", "54363", …
##   $ symbol   <chr> "ADA", "BEND7", "CCNI2", "CD33", "DRG2", "FAM187B", "HAO1", "…
##   $ isoform  <chr> "P00813", "Q8N7W2-2", "Q6ZMN8", "P20138", "P55039", "Q17R55",…
## Edges:
##   Database: spark_connection
##   $ src  <chr> "P00813", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7…
##   $ dst  <chr> "A5A3E0", "P26373", "Q09028", "Q9Y3U8", "P36578", "P23396", "Q070…
##   $ pW   <dbl> 6.881844e-10, 1.340380e-18, 7.221401e-21, 7.058372e-17, 1.632313e…
##   $ pNI  <dbl> 1.176357e-04, 2.256645e-01, 6.416690e-05, 1.281827e-01, 2.006379e…
##   $ pInt <dbl> 0.9998824, 0.7743355, 0.9999358, 0.8718173, 0.7993621, 0.9989736,…

PageRank

PageRank is an algorithm used in Google Search for ranking websites in search results, but it has also been adopted for other purposes. According to Google, PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

For the analysis of PPI networks, personalized PageRank appears capable of robustly evaluating the importance of the vertices of a network relative to a set of already known relevant nodes (Ivan and Grolmusz, 2011).
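
To make the idea of personalized PageRank more concrete, the following base R sketch runs a power iteration on a toy directed graph in which all teleportation goes back to a single source node. The function name `personalized_pagerank`, the toy node names `s`, `a`, and `b`, and the choice to return the mass of nodes without outgoing edges to the source are assumptions made for illustration; this is not the implementation used by the graphframes package.

```r
# Personalized PageRank by power iteration on a small directed graph.
# `edges`: data frame with columns src and dst; `source`: name of the seed node.
# (Illustrative helper, not part of graphframes or BioPlexAnalysis.)
personalized_pagerank <- function(edges, source, reset_prob = 0.15,
                                  max_iter = 500L, tol = 1e-12) {
    src <- as.character(edges$src)
    dst <- as.character(edges$dst)
    nodes <- sort(unique(c(src, dst)))
    n <- length(nodes)

    # Transition matrix: each out-edge of a node gets weight 1 / out-degree
    P <- matrix(0, n, n, dimnames = list(nodes, nodes))
    for (i in seq_along(src)) P[src[i], dst[i]] <- 1
    out.deg <- rowSums(P)
    has.out <- out.deg > 0
    P[has.out, ] <- P[has.out, , drop = FALSE] / out.deg[has.out]

    # Reset vector: all teleportation mass goes to the source node
    e <- setNames(numeric(n), nodes)
    e[source] <- 1

    r <- e
    for (iter in seq_len(max_iter)) {
        # Assumption: mass on nodes without out-edges is returned to the source
        dangling <- sum(r[!has.out])
        r.new <- reset_prob * e +
            (1 - reset_prob) * (as.vector(r %*% P) + dangling * e)
        converged <- sum(abs(r.new - r)) < tol
        r <- r.new
        if (converged) break
    }
    r
}

# Toy graph: the source s points to two nodes that have no outgoing edges
toy <- data.frame(src = c("s", "s"), dst = c("a", "b"))
pr <- personalized_pagerank(toy, source = "s")
round(pr, 3)
```

Under these assumptions, the source keeps a score of about 0.541 and each of its two neighbors receives about 0.230, so the scores concentrate on the source and decay with distance from it.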

As an example we look at the ACADVL gene. Mutations in ACADVL are associated with very long-chain acyl-coenzyme A dehydrogenase deficiency.

We first look up the corresponding node ID:

dplyr::filter(graphframes::gf_vertices(gf), symbol == "ACADVL")
## # Source: spark<?> [?? x 4]
##   id     entrezid symbol isoform 
##   <chr>  <chr>    <chr>  <chr>   
## 1 P49748 37       ACADVL P49748-2

We then apply personalized PageRank as implemented in the graphframes package:

gf <- graphframes::gf_pagerank(gf, 
                               reset_prob = 0.15,
                               max_iter = 10L,
                               source_id = "P49748")
gf
## GraphFrame
## Vertices:
##   Database: spark_connection
##   $ id       <chr> "Q8N7W2", "Q6ZMN8", "P20138", "Q9NZL4", "Q86X40", "O15264", "…
##   $ entrezid <chr> "222389", "645121", "945", "23640", "123355", "5603", "10095"…
##   $ symbol   <chr> "BEND7", "CCNI2", "CD33", "HSPBP1", "LRRC28", "MAPK13", "ARPC…
##   $ isoform  <chr> "Q8N7W2-2", "Q6ZMN8", "P20138", "Q9NZL4", "Q86X40", "O15264",…
##   $ pagerank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## Edges:
##   Database: spark_connection
##   $ src    <chr> "B4DP52", "O95221", "P42331", "Q8WVK2", "Q9Y6Q1", "UNKNOWN", "O…
##   $ dst    <chr> "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0…
##   $ pW     <dbl> 4.434764e-06, 7.293189e-06, 1.295321e-01, 3.934934e-06, 6.49981…
##   $ pNI    <dbl> 5.063683e-02, 4.506911e-02, 9.913222e-02, 4.948126e-02, 1.72294…
##   $ pInt   <dbl> 0.9493587, 0.9549236, 0.7713357, 0.9505148, 0.9827641, 0.994325…
##   $ weight <dbl> 0.052631579, 0.142857143, 0.100000000, 0.013698630, 0.090909091…

Inspect the results:

dplyr::filter(graphframes::gf_vertices(gf), pagerank > 0)
## # Source: spark<?> [?? x 5]
##   id     entrezid symbol isoform  pagerank
##   <chr>  <chr>    <chr>  <chr>       <dbl>
## 1 P49748 37       ACADVL P49748-2    0.541
## 2 P25054 324      APC    P25054      0.230
## 3 Q9BTZ2 10901    DHRS4  Q9BTZ2-4    0.230
dplyr::filter(graphframes::gf_edges(gf), src == "P49748")
## # Source: spark<?> [?? x 6]
##   src    dst          pW       pNI  pInt weight
##   <chr>  <chr>     <dbl>     <dbl> <dbl>  <dbl>
## 1 P49748 Q9BTZ2 6.54e-12 0.0000250 1.00     0.5
## 2 P49748 P25054 7.28e-10 0.00138   0.999    0.5
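
The `weight` column added to the edges appears to be the reciprocal of the out-degree of the source node: ACADVL (P49748) has two outgoing edges, each with weight 0.5. The base R sketch below computes such weights for a small edge list, assuming that this is how the column is derived; the node IDs `H1` to `H5` are hypothetical and serve only as a second example.

```r
# Toy edge list: the two ACADVL edges shown above plus a hypothetical
# node H1 with four outgoing edges
edges <- data.frame(
    src = c("P49748", "P49748", "H1", "H1", "H1", "H1"),
    dst = c("Q9BTZ2", "P25054", "H2", "H3", "H4", "H5")
)

# Out-degree of the source node of each edge
out.deg <- ave(rep(1, nrow(edges)), edges$src, FUN = length)

# Each edge carries an equal share of its source node's outgoing mass
edges$weight <- 1 / out.deg
edges
```

With this weighting, the weights of all edges leaving a given node sum to 1, so a node distributes its score evenly among its interaction partners.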

Switch back to a graphNEL backend:

gr <- BioPlexAnalysis::graphframe2graph(gf)
gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689 
## Number of Edges = 115868
head(graph::nodeData(gr), n = 2)
## $B4DP52
## $B4DP52$entrezid
## [1] "534"
## 
## $B4DP52$symbol
## [1] "ATP6V1G2"
## 
## $B4DP52$isoform
## [1] "B4DP52"
## 
## $B4DP52$pagerank
## [1] 0
## 
## 
## $O95221
## $O95221$entrezid
## [1] "338674"
## 
## $O95221$symbol
## [1] "OR5F1"
## 
## $O95221$isoform
## [1] "O95221"
## 
## $O95221$pagerank
## [1] 0
head(graph::edgeData(gr), n = 2)
## $`B4DP52|A0AVT1`
## $`B4DP52|A0AVT1`$weight
## [1] 0.05263158
## 
## $`B4DP52|A0AVT1`$pW
## [1] 4.434764e-06
## 
## $`B4DP52|A0AVT1`$pNI
## [1] 0.05063683
## 
## $`B4DP52|A0AVT1`$pInt
## [1] 0.9493587
## 
## 
## $`B4DP52|A5A3E0`
## $`B4DP52|A5A3E0`$weight
## [1] 0.05263158
## 
## $`B4DP52|A5A3E0`$pW
## [1] 2.554695e-11
## 
## $`B4DP52|A5A3E0`$pNI
## [1] 0.06102967
## 
## $`B4DP52|A5A3E0`$pInt
## [1] 0.9389703