GraphFrames.Rmd
To work with GraphFrames, we need a connection to Spark. For computation on a compute cluster, this can be achieved by connecting to the Spark installation on the cluster.
For demonstration purposes, however, we use a local connection here, which requires a local Spark installation.
The local installation needs to be carried out only once:
sparklyr::spark_install("3.0")
We can connect to local instances of Spark as well as to remote Spark clusters. Here we connect to our local installation of Spark via
sc <- sparklyr::spark_connect(master = "local", version = "3.0")
Get the latest version of the 293T PPI network:
bp.293t <- BioPlex::getBioPlex(cell.line = "293T", version = "3.0")
## Using cached version from 2023-01-14 23:15:28
head(bp.293t)
## GeneA GeneB UniprotA UniprotB SymbolA SymbolB pW pNI
## 1 100 728378 P00813 A5A3E0 ADA POTEF 6.881844e-10 0.0001176357
## 2 222389 6137 Q8N7W2-2 P26373 BEND7 RPL13 1.340380e-18 0.2256644741
## 3 222389 5928 Q8N7W2-2 Q09028-3 BEND7 RBBP4 7.221401e-21 0.0000641669
## 4 222389 25873 Q8N7W2-2 Q9Y3U8 BEND7 RPL36 7.058372e-17 0.1281827343
## 5 222389 6124 Q8N7W2-2 P36578 BEND7 RPL4 1.632313e-22 0.2006379109
## 6 222389 6188 Q8N7W2-2 P23396 BEND7 RPS3 3.986270e-26 0.0010264311
## pInt
## 1 0.9998824
## 2 0.7743355
## 3 0.9999358
## 4 0.8718173
## 5 0.7993621
## 6 0.9989736
and turn it into a graph object:
bp.gr <- BioPlex::bioplex2graph(bp.293t)
bp.gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689
## Number of Edges = 115868
Switch to a graphframes backend:
gf <- BioPlexAnalysis::graph2graphframe(bp.gr, sc)
gf
## GraphFrame
## Vertices:
## Database: spark_connection
## $ id <chr> "P00813", "Q8N7W2", "Q6ZMN8", "P20138", "P55039", "Q17R55", "…
## $ entrezid <chr> "100", "222389", "645121", "945", "1819", "148109", "54363", …
## $ symbol <chr> "ADA", "BEND7", "CCNI2", "CD33", "DRG2", "FAM187B", "HAO1", "…
## $ isoform <chr> "P00813", "Q8N7W2-2", "Q6ZMN8", "P20138", "P55039", "Q17R55",…
## Edges:
## Database: spark_connection
## $ src <chr> "P00813", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7W2", "Q8N7…
## $ dst <chr> "A5A3E0", "P26373", "Q09028", "Q9Y3U8", "P36578", "P23396", "Q070…
## $ pW <dbl> 6.881844e-10, 1.340380e-18, 7.221401e-21, 7.058372e-17, 1.632313e…
## $ pNI <dbl> 1.176357e-04, 2.256645e-01, 6.416690e-05, 1.281827e-01, 2.006379e…
## $ pInt <dbl> 0.9998824, 0.7743355, 0.9999358, 0.8718173, 0.7993621, 0.9989736,…
PageRank is an algorithm used in Google Search for ranking websites in their results, but it has also been adopted for other purposes. According to Google, PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
For the analysis of PPI networks, personalized PageRank appears to be capable of robustly evaluating the importance of the vertices of a network relative to a set of nodes already known to be relevant (Ivan and Grolmusz, 2011).
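To make the idea concrete, the following base R sketch runs personalized PageRank by power iteration on a small directed toy graph. It is an illustration only, not the implementation used by Spark or the graphframes package: the function name `personalized_pagerank`, the toy edge list, and the node names are hypothetical, and the treatment of nodes without outgoing edges (their score is returned to the source) is an assumption of this sketch.

```r
# Personalized PageRank by power iteration (illustrative sketch).
# All names below are hypothetical and not part of any package API.
personalized_pagerank <- function(edges, nodes, source_id,
                                  reset_prob = 0.15, max_iter = 200L,
                                  tol = 1e-10) {
  n <- length(nodes)
  # Adjacency matrix with M[to, from] = 1 for every edge from -> to.
  M <- matrix(0, n, n, dimnames = list(nodes, nodes))
  for (i in seq_len(nrow(edges))) {
    M[edges$dst[i], edges$src[i]] <- 1
  }
  # Normalize columns so that each node splits its score evenly
  # among its outgoing edges (weight = 1 / out-degree).
  out_deg <- colSums(M)
  M <- sweep(M, 2, pmax(out_deg, 1), "/")
  # Assumption: nodes without outgoing edges return their score to the source.
  if (any(out_deg == 0)) {
    M[source_id, out_deg == 0] <- 1
  }
  # Restart vector: the walk always restarts at the source node.
  e <- setNames(as.numeric(nodes == source_id), nodes)
  p <- e
  for (iter in seq_len(max_iter)) {
    p_new <- reset_prob * e + (1 - reset_prob) * as.vector(M %*% p)
    converged <- sum(abs(p_new - p)) < tol
    p <- p_new
    if (converged) break
  }
  p
}

# Toy graph: D points into the graph but cannot be reached from A.
nodes <- c("A", "B", "C", "D")
edges <- data.frame(src = c("A", "A", "B", "C", "D"),
                    dst = c("B", "C", "C", "A", "A"),
                    stringsAsFactors = FALSE)
p <- personalized_pagerank(edges, nodes, source_id = "A")
round(p, 3)
```

In this toy graph, node D receives a score of zero because no path leads from the source A to D, which is why a personalized ranking typically assigns non-zero scores only to the part of the network reachable from the source.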
As an example we look at the ACADVL gene. Mutations in ACADVL are associated with very long-chain acyl-coenzyme A dehydrogenase deficiency.
We first look up the corresponding node ID:
dplyr::filter(graphframes::gf_vertices(gf), symbol == "ACADVL")
## # Source: spark<?> [?? x 4]
## id entrezid symbol isoform
## <chr> <chr> <chr> <chr>
## 1 P49748 37 ACADVL P49748-2
And then apply the personalized PageRank as implemented in the graphframes package.
gf <- graphframes::gf_pagerank(gf,
reset_prob = 0.15,
max_iter = 10L,
source_id = "P49748")
gf
## GraphFrame
## Vertices:
## Database: spark_connection
## $ id <chr> "Q8N7W2", "Q6ZMN8", "P20138", "Q9NZL4", "Q86X40", "O15264", "…
## $ entrezid <chr> "222389", "645121", "945", "23640", "123355", "5603", "10095"…
## $ symbol <chr> "BEND7", "CCNI2", "CD33", "HSPBP1", "LRRC28", "MAPK13", "ARPC…
## $ isoform <chr> "Q8N7W2-2", "Q6ZMN8", "P20138", "Q9NZL4", "Q86X40", "O15264",…
## $ pagerank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## Edges:
## Database: spark_connection
## $ src <chr> "B4DP52", "O95221", "P42331", "Q8WVK2", "Q9Y6Q1", "UNKNOWN", "O…
## $ dst <chr> "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0AVT1", "A0…
## $ pW <dbl> 4.434764e-06, 7.293189e-06, 1.295321e-01, 3.934934e-06, 6.49981…
## $ pNI <dbl> 5.063683e-02, 4.506911e-02, 9.913222e-02, 4.948126e-02, 1.72294…
## $ pInt <dbl> 0.9493587, 0.9549236, 0.7713357, 0.9505148, 0.9827641, 0.994325…
## $ weight <dbl> 0.052631579, 0.142857143, 0.100000000, 0.013698630, 0.090909091…
Inspect the results:
dplyr::filter(graphframes::gf_vertices(gf), pagerank > 0)
## # Source: spark<?> [?? x 5]
## id entrezid symbol isoform pagerank
## <chr> <chr> <chr> <chr> <dbl>
## 1 P49748 37 ACADVL P49748-2 0.541
## 2 P25054 324 APC P25054 0.230
## 3 Q9BTZ2 10901 DHRS4 Q9BTZ2-4 0.230
## # Source: spark<?> [?? x 6]
## src dst pW pNI pInt weight
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 P49748 Q9BTZ2 6.54e-12 0.0000250 1.00 0.5
## 2 P49748 P25054 7.28e-10 0.00138 0.999 0.5
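One way to make sense of these scores is to solve the personalized PageRank equation directly for the three vertices with a non-zero score. The base R sketch below is a plausibility check under stated assumptions, not a description of the GraphFrames internals: it uses only the two outgoing edges of P49748 listed above (weight 0.5 each), treats the two neighbours as having no outgoing edges, and assumes that their score is returned to the source.

```r
# Three-node neighbourhood of the source, taken from the edge listing above.
ids <- c("P49748", "Q9BTZ2", "P25054")
reset_prob <- 0.15

# Column-stochastic transition matrix: M[to, from] = P(step from -> to).
M <- matrix(0, 3, 3, dimnames = list(ids, ids))
M["Q9BTZ2", "P49748"] <- 0.5
M["P25054", "P49748"] <- 0.5
# Assumption: nodes without outgoing edges hand their score back to the source.
M["P49748", c("Q9BTZ2", "P25054")] <- 1

# Restart vector concentrated on the source node.
e <- c(1, 0, 0)

# Solve p = reset_prob * e + (1 - reset_prob) * M %*% p for p.
p <- solve(diag(3) - (1 - reset_prob) * M, reset_prob * e)
names(p) <- ids
round(p, 3)
```

Under these assumptions the solution is approximately 0.541 for P49748 and 0.230 for each of the two neighbours, in line with the scores reported above.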
Switch back to a graphNEL backend:
gr <- BioPlexAnalysis::graphframe2graph(gf)
gr
## A graphNEL graph with directed edges
## Number of Nodes = 13689
## Number of Edges = 115868
## $B4DP52
## $B4DP52$entrezid
## [1] "534"
##
## $B4DP52$symbol
## [1] "ATP6V1G2"
##
## $B4DP52$isoform
## [1] "B4DP52"
##
## $B4DP52$pagerank
## [1] 0
##
##
## $O95221
## $O95221$entrezid
## [1] "338674"
##
## $O95221$symbol
## [1] "OR5F1"
##
## $O95221$isoform
## [1] "O95221"
##
## $O95221$pagerank
## [1] 0
## $`B4DP52|A0AVT1`
## $`B4DP52|A0AVT1`$weight
## [1] 0.05263158
##
## $`B4DP52|A0AVT1`$pW
## [1] 4.434764e-06
##
## $`B4DP52|A0AVT1`$pNI
## [1] 0.05063683
##
## $`B4DP52|A0AVT1`$pInt
## [1] 0.9493587
##
##
## $`B4DP52|A5A3E0`
## $`B4DP52|A5A3E0`$weight
## [1] 0.05263158
##
## $`B4DP52|A5A3E0`$pW
## [1] 2.554695e-11
##
## $`B4DP52|A5A3E0`$pNI
## [1] 0.06102967
##
## $`B4DP52|A5A3E0`$pInt
## [1] 0.9389703