Liger单细胞多组学分析：整合单细胞RNA-seq数据集

2014/09/16来源：傻狗的BLOG！评论2,740

介绍完liger的数学原理，我们便可以来跑一跑liger的流程，并对liger的数学原理和参数选择有更深刻的理解。我们可以在GitHub上找到liger的码源以及相关的教程，本文也主要翻译自GitHub上的教程，并加入了一些自己的函数和理解。

载入包和数据

本文的教程希望整合控制组以及干扰素刺激组的PBMC单细胞RNA-seq数据，数据来源于Kang et al, 2017，liger的作者则提供了以RData格式存储的经过down sample的表达矩阵，我们可以从这里下载并直接使用。当然，如果你使用10x的数据的话，则也可以使用read10X函数将数据读入。

############################################
## Project: Liger-learning
## Script Purpose: run pipeline of liger integration and further analysis
## Data: 2020.10.24
## Author: Yiming Sun
############################################

# general setting
setwd('/data/User/sunym/project/liger_learning/')
# library
library(liger)
library(Seurat)
library(dplyr)
library(tidyverse)
library(viridis)
library(parallel)
#my function
MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){
  res <- matrix(ncol = 2,nrow = length(lambda))
  colnames(res) <- c('lambda','alignment')
  res[,'lambda'] <- lambda
  cl <- makeCluster(parl_num)
  clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment())
  clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv)
  alignment_score <- parLapply(cl = cl,lambda,function(x){
    object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters)
    object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution)
    return(calcAlignment(object))
  })
  stopCluster(cl)
  res[,'alignment'] <- (alignment_score %>% unlist())
  res <- as.data.frame(res)
  res[,1] <- as.numeric(as.character(res[,1]))
  res[,2] <- as.numeric(as.character(res[,2]))
  p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line()
  return(list(res,p))
}
# 1.Integrating Multiple Single-Cell RNA-seq Datasets
#load data
ctrl_dge <- readRDS("./data/PBMC_control.RDS")
stim_dge <- readRDS("./data/PBMC_interferon-stimulated.RDS")

这里我写了一个函数MySuggestLambda用来优化λ的选择，其实liger包自带有suggestK和suggestLambda函数帮助我们优化K和λ的选择，但是suggestLambda函数似乎有什么bug，在一台服务器上跑完优化后会报一个莫名其妙的错误，而在另一台服务器上则完全正常，我还没有发现bug在哪里所以就自己写了个类似的函数帮助我优化λ的选择，如果你使用liger默认的函数也会报错的话不妨试试我的函数，虽然他内存运用效率特别低，建议配置好50G左右的swap区，因为当R跑多线程时，如果内存不够rsession会被直接挤掉，配置好swap能拯救一下它（血泪教训）。

标准化

在降维和聚类（整合）之前，我们通常都要对表达矩阵进行标准化、挑选variable gene最后再归一化。需要注意的是，liger使用的是非负矩阵分解的方法，因此我们在做归一化的时候注意不要做中心化，因为这会导致负数的出现。

#initialize a liger object
ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge))
#explore liger object
dim([email protected]$ctrl)
head(colnames([email protected]$ctrl))
head(rownames([email protected]$ctrl))
dim([email protected]$stim)
#normalize data
ifnb_liger <- normalize(ifnb_liger)
#select variable gene
ifnb_liger <- selectGenes(ifnb_liger)
#scale data but not center
ifnb_liger <- scaleNotCenter(ifnb_liger)

可以简单关注一下liger的数据结构，原始的表达矩阵存储在raw.data中，聚类的信息存储在clusters中，降维的信息存储在tsne.coords中。这里就不得不提一下seurat，seurat有一个meta.data的表格来存储各个细胞的各种信息，因此在我们做个性化的比较分析的时候特别有用。liger虽然也有cell.data，但我们似乎不太能够利用liger自带的函数指定cell.data中的变量来进行比较，我想到的解决方法是把liger导出为seurat，在seurat中做一些个性化的比较，而liger则单纯用来做整合、降维以及聚类。

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片1

非负矩阵分解

Liger利用非负矩阵分解的方法进行数据集之间的整合，我们需要提供的参数为K和λ。通常，K反映了数据集中真实cluster的个数，而λ则反映了数据集之间的异质性，如果你对你的数据完全没有把握，那么默认参数K=20，λ=5往往也能有不错的表现。当然我们也可以根据KL散度和Alignment score来优化K和λ的选择。

#find the optimized k
suggestK(ifnb_liger,k.test = seq(10,30,1),lambda = 5,plot.log2 = FALSE,num.cores = 10)

设置λ=5，调用10个进程，测试K从10到30时所计算得到的KL散度，最终选择20作为最优的K。

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片2

使用liger自带的函数优化λ的选择，取K=20，测试lambda的范围从1到10，调用10个进程，最终选择5作为最优的λ。

#find the optimized lambda
#use liger function
suggestLambda(object = ifnb_liger,k = 20,lambda.test = seq(1,10,1),num.cores = 10)

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片3

使用我自己写的函数来优化λ的选择，

#use my function
suggest_lambda <- MySuggestLambda(ifnb_liger,lambda = seq(1,10,1),k = 20,parl_num = 10)
suggest_lambda[[2]]

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片4

官方的函数应该是在Alignment score上做了平均而使得曲线更加平滑了，总之建议使用官方的函数，如果出现报错的话，可以试试我的函数。最后我们选择K=20，λ=5进行非负矩阵分解。

#use lambda 5 k 20
#integrate NMF
ifnb_liger <- optimizeALS(ifnb_liger,k = 20,lambda = 5,max.iters = 30,thresh = 1e-06)

聚类

做完非负矩阵分解后就可以进行聚类了，liger聚类的方法也比较有趣，可以参考liger的数学原理。quantile_norm函数会构建SFN图（shared factor neighborhood graph）对细胞进行聚类，在聚类完成后将同一cluster中不同数据集来源的细胞在坐标轴（W V矩阵）上进行分位数归一化。

#Quantile Normalization and Joint Clustering
ifnb_liger <- quantile_norm(ifnb_liger,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE,
                            max_sample = 1000,refine.knn = TRUE,eps = 0.9)

这里需要讲两个参数，knn_k是在构建SFN图中选择邻居细胞的个数，do.center是指是否做中心化，一般默认为FALSE，但当我们处理一些比较dense的数据时，如DNA甲基化数据，metagene的loading值通常会很高，此时可以做一下中心化来进行矫正。当然，我们也可以用一些图聚类算法来进一步优化我们的聚类，这一步可做可不做。

# you can use louvain cluster to detect and assign cluster
ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25)

可视化与下游分析

UMAP降维图

#Visualization and Downstream Analysis
ifnb_liger <- runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3)
all.plots <- plotByDatasetAndCluster(ifnb_liger, axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
all.plots[[1]]   all.plots[[2]]

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片5

Gene loading plot

我们可以找到在各个cluster中高表达的metagene，以及这些metagene在不同的数据集中（W和V矩阵）对其贡献最高的基因。

#find the maximun loading factor on each cluster and the maximun loading gene on the factor
gene_loadings <- plotGeneLoadings(ifnb_liger, do.spec.plot = FALSE, return.plots = TRUE)
gene_loadings[[4]]

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片6

metagene_4在cluster_3中特异性高表达，那么metagene_4在W矩阵中对其贡献最大的基因（Shared），则很有可能是cluster_3的marker基因。类似的，metagene_4在V矩阵中对其贡献最大的基因，则反映了同一细胞类型数据集之间（或处理之间）的差异。我们也可以很容易得到图片背后的数据。

#the metadata of the factor/gene loading plot 
#UMAP
head(gene_loadings[[4]][[1]]$data)
#gene loading value on the specific factor
head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),])
head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),])
head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),])
# conclusion: gene_loadings object is a list of all the factor, specific the factor you want
# conclusion: the factor object listed by gene_loadings has two lists, the first is the UMAP with the factor loading value, the second is the dataset-sepcific matrix W
# the dataset specific matrix W is a list contains the all the dataset and shared gene loading value
# what a mess!

总之就是list套list，很麻烦，可以对照着图和输出理解一下。

> head(gene_loadings[[4]][[1]]$data)
                          Factor4       tSNE1      tSNE2
ctrlTCAGCGCTGGTCAT-1 0.0018384202  -9.5633178 -1.5025842
ctrlTTATGGCTTCATTC-1           NA  -7.4026990 -0.5618219
ctrlACCCACTGCTTAGG-1 0.0007611497   3.7179575  4.9839707
ctrlATGGGTACCCCGTT-1 0.0001820607 -11.0730367 -4.6024990
ctrlTGACTGGACAGTCA-1           NA   0.5273923 -8.4171533
ctrlGTGTAGTGGTTGTG-1 0.0013735353   8.8018236  5.2826267
> #gene loading value on the specific factor
> head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),])
       loadings      xpos top_k
FAM43A 2.522310 1.0000000  TRUE
FGFBP2 2.411532 0.9995883  TRUE
SPON2  2.264957 0.9991766 FALSE
ZNF638 1.925689 0.9987649 FALSE
KARS   1.916834 0.9983532 FALSE
PRDX5  1.889175 0.9979415  TRUE
> head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),])
       loadings      xpos top_k
GNLY  136.98236 1.0000000  TRUE
NKG7  119.07990 0.9995883  TRUE
CLIC3 117.71097 0.9991766  TRUE
GZMB  108.58467 0.9987649  TRUE
PRF1  108.14754 0.9983532  TRUE
GZMA   93.22887 0.9979415  TRUE
> head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),])
        loadings      xpos top_k
PRF1    3.273654 1.0000000  TRUE
SDF2L1  2.913147 0.9995883 FALSE
CD38    2.204181 0.9991766  TRUE
IFNG    2.111837 0.9987649 FALSE
HSPA5   2.081298 0.9983532 FALSE
CIAPIN1 2.002635 0.9979415 FALSE

差异表达分析

转录组分析怎么离得开差异表达分析嘛，我们可以用runWilcoxon比较不同的cluster之间的差异表达基因，也可以比较同一个cluster的不同数据集来源之间的差异表达基因。

#differential gene express
# for different cluster
cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters")
head(cluster.results)
datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets")
head(datasets.results)

只需要指定compare.method就可以进行不同的比较，不过这个函数其实也只能做这两种比较，如果我想做之前提到的更加个性化的比较，就只能导入到seurat中再进行比较。

> cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters")
[1] "Performing Wilcoxon test on ALL datasets: ctrl, stim"
> head(cluster.results)
        feature group   avgExpr        logFC statistic
1 RP11-206L10.2     0 -23.01898 -0.005557170   4259604
2 RP11-206L10.9     0 -23.01914 -0.005979592   4259602
3     LINC00115     0 -23.00609 -0.031260202   4252898
4         NOC2L     0 -21.83167  0.292228474   4340248
5        KLHL17     0 -23.00556  0.003658342   4262141
6       PLEKHN1     0 -23.01194 -0.039663835   4249915
        auc        pval       padj pct_in pct_out
1 0.4998057 0.570575909 0.69419204    100     100
2 0.4998055 0.570110484 0.69419204    100     100
3 0.4990189 0.138680541 0.24813295    100     100
4 0.5092682 0.004585054 0.01435755    100     100
5 0.5001034 0.819517383 0.88363326    100     100
6 0.4986689 0.044537037 0.10099905    100     100
> datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets")
[1] "Performing Wilcoxon test on ALL datasets: ctrl, stim"
> head(datasets.results)
        feature  group   avgExpr       logFC statistic
1 RP11-206L10.2 0-ctrl -23.02585 -0.01408078  665437.5
2 RP11-206L10.9 0-ctrl -23.02585 -0.01376717  665437.5
3     LINC00115 0-ctrl -22.99983  0.01283157  666564.5
4         NOC2L 0-ctrl -21.87312 -0.08498766  662383.5
5        KLHL17 0-ctrl -22.98625  0.03960509  667718.0
6       PLEKHN1 0-ctrl -23.02585 -0.02853147  664846.0
        auc       pval      padj pct_in pct_out
1 0.4995560 0.30577296 0.6990712    100     100
2 0.4995560 0.30577296 0.6990712    100     100
3 0.5004020 0.59231247 0.8964793    100     100
4 0.4972633 0.61893894 0.9110340    100     100
5 0.5012680 0.09102028 0.4580475    100     100
6 0.4991119 0.14726281 0.5773302    100     100

对差异表达进行筛选。

> # filter the data to find the meaningful DEG for your own experiment
> cluster.results <- cluster.results[cluster.results$padj < 0.05,]
> cluster.results <- cluster.results[cluster.results$logFC > 3,]
> wilcoxon.cluster_3 <- cluster.results[cluster.results$group == 3, ]
> wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ]
> markers <- wilcoxon.cluster_3[1:20, ]
> head(markers)
      feature group    avgExpr     logFC statistic       auc
41861    GNLY     3  -5.282466 16.147323   2379942 0.9647642
46904   CLIC3     3 -12.962070  9.485599   1946458 0.7890417
47130    PRF1     3 -12.260573  9.830082   1970124 0.7986352
49239    GZMB     3  -7.840488 13.697178   2231966 0.9047789
52832    NKG7     3  -6.594620 14.433185   2324832 0.9424243
48310   KLRC1     3 -18.000059  4.916239   1606731 0.6513252
               pval          padj pct_in pct_out
41861  0.000000e 00  0.000000e 00    100     100
46904  0.000000e 00  0.000000e 00    100     100
47130  0.000000e 00  0.000000e 00    100     100
49239  0.000000e 00  0.000000e 00    100     100
52832  0.000000e 00  0.000000e 00    100     100
48310 1.839405e-289 4.099114e-286    100     100

可以看一下PRF1在不同的cluster之间的表达情况。

# feature plot on UMAP
PRF1 <- plotGene(ifnb_liger, "PRF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
PRF1[[1]]   PRF1[[2]]

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片7

可以看到PRF1在两个数据集的cluster_3中均显著高表达。也可以用小提琴图来展示。

# vlnplot
PRF1 <- plotGeneViolin(ifnb_liger,gene = 'PRF1',return.plots = TRUE)
PRF1[[1]]   PRF1[[2]]

Liger单细胞多组学分析：整合单细胞RNA-seq数据集-图片8

完整代码

############################################
## Project: Liger-learning
## Script Purpose: run pipeline of liger integration and further analysis
## Data: 2020.10.24
## Author: Yiming Sun
############################################

# general setting
setwd('/data/User/sunym/project/liger_learning/')
# library
library(liger)
library(Seurat)
library(dplyr)
library(tidyverse)
library(viridis)
library(parallel)
#my function
MySuggestLambda <- function(object,lambda,k,knn_k = 20,thresh = 1e-04,max.iters = 100,k2 = 500,ref_dataset = NULL,resolution = 1,parl_num){
  res <- matrix(ncol = 2,nrow = length(lambda))
  colnames(res) <- c('lambda','alignment')
  res[,'lambda'] <- lambda
  cl <- makeCluster(parl_num)
  clusterExport(cl,c('object','k','knn_k','thresh','max.iters','k2','ref_dataset','resolution'),envir = environment())
  clusterExport(cl,c('optimizeALS','quantileAlignSNF','calcAlignment'),envir = .GlobalEnv)
  alignment_score <- parLapply(cl = cl,lambda,function(x){
    object <- optimizeALS(object,k = k,lambda = x,thresh = thresh,max.iters = max.iters)
    object <- quantileAlignSNF(object,knn_k = knn_k,k2 = k2,ref_dataset = ref_dataset,resolution = resolution)
    return(calcAlignment(object))
  })
  stopCluster(cl)
  res[,'alignment'] <- (alignment_score %>% unlist())
  res <- as.data.frame(res)
  res[,1] <- as.numeric(as.character(res[,1]))
  res[,2] <- as.numeric(as.character(res[,2]))
  p <- ggplot(res,aes(x = lambda,y = alignment)) geom_point() geom_line()
  return(list(res,p))
}
# 1.Integrating Multiple Single-Cell RNA-seq Datasets
#load data
ctrl_dge <- readRDS("./data/PBMC_control.RDS")
stim_dge <- readRDS("./data/PBMC_interferon-stimulated.RDS")
#initialize a liger object
ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge))
#explore liger object
dim([email protected]$ctrl)
head(colnames([email protected]$ctrl))
head(rownames([email protected]$ctrl))
dim([email protected]$stim)
#normalize data
ifnb_liger <- normalize(ifnb_liger)
#select variable gene
ifnb_liger <- selectGenes(ifnb_liger)
#scale data but not center
ifnb_liger <- scaleNotCenter(ifnb_liger)
#find the optimized k
suggestK(ifnb_liger,k.test = seq(10,30,1),lambda = 5,plot.log2 = FALSE,num.cores = 10)
#find the optimized lambda
#use liger function
suggestLambda(object = ifnb_liger,k = 20,lambda.test = seq(1,10,1),num.cores = 10)
#use my function
suggest_lambda <- MySuggestLambda(ifnb_liger,lambda = seq(1,10,1),k = 20,parl_num = 10)
suggest_lambda[[2]]
#use lambda 5 k 20
#integrate NMF
ifnb_liger <- optimizeALS(ifnb_liger,k = 20,lambda = 5,max.iters = 30,thresh = 1e-06)
#Quantile Normalization and Joint Clustering
ifnb_liger <- quantile_norm(ifnb_liger,knn_k = 20,quantiles = 50,min_cells = 20,do.center = FALSE,
                            max_sample = 1000,refine.knn = TRUE,eps = 0.9)
# you can use louvain cluster to detect and assign cluster
ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25)
head([email protected])
#Visualization and Downstream Analysis
ifnb_liger <- runUMAP(ifnb_liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.3)
all.plots <- plotByDatasetAndCluster(ifnb_liger, axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
#the place of each cell and the cluster it belongs
head(all.plots[[1]]$data)
pdf(file = './res/fig_201024/plot_by_dataset_and_cluster.pdf',width = 8,height = 4)
all.plots[[1]]   all.plots[[2]]
dev.off()
#find the maximun loading factor on each cluster and the maximun loading gene on the factor
gene_loadings <- plotGeneLoadings(ifnb_liger, do.spec.plot = FALSE, return.plots = TRUE)
pdf(file = './res/fig_201024/gene_loading_plot.pdf',width = 6.5,height = 6)
gene_loadings[[4]]
dev.off()
#the metadata of the factor/gene loading plot 
#UMAP
head(gene_loadings[[4]][[1]]$data)
#gene loading value on the specific factor
head(gene_loadings[[4]][[2]][[1]]$data[order(gene_loadings[[4]][[2]][[1]]$data$loadings,decreasing = TRUE),])
head(gene_loadings[[4]][[2]][[2]]$data[order(gene_loadings[[4]][[2]][[2]]$data$loadings,decreasing = TRUE),])
head(gene_loadings[[4]][[2]][[3]]$data[order(gene_loadings[[4]][[2]][[3]]$data$loadings,decreasing = TRUE),])
# conclusion: gene_loadings object is a list of all the factor, specific the factor you want
# conclusion: the factor object listed by gene_loadings has two lists, the first is the UMAP with the factor loading value, the second is the dataset-sepcific matrix W
# the dataset specific matrix W is a list contains the all the dataset and shared gene loading value
# what a mess!
#differential gene express
# for different cluster
cluster.results <- runWilcoxon(ifnb_liger, compare.method = "clusters")
head(cluster.results)
datasets.results <- runWilcoxon(ifnb_liger, compare.method = "datasets")
head(datasets.results)
# filter the data to find the meaningful DEG for your own experiment
cluster.results <- cluster.results[cluster.results$padj < 0.05,]
cluster.results <- cluster.results[cluster.results$logFC > 3,]
wilcoxon.cluster_3 <- cluster.results[cluster.results$group == 3, ]
wilcoxon.cluster_3 <- wilcoxon.cluster_3[order(wilcoxon.cluster_3$padj), ]
markers <- wilcoxon.cluster_3[1:20, ]
head(markers)
# feature plot on UMAP
PRF1 <- plotGene(ifnb_liger, "PRF1", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = T)
pdf(file = './res/fig_201024/feature_plot_PRF1.pdf',width = 8,height = 4)
PRF1[[1]]   PRF1[[2]]
dev.off()
# vlnplot
PRF1 <- plotGeneViolin(ifnb_liger,gene = 'PRF1',return.plots = TRUE)
pdf(file = './res/fig_201024/vlnplot_PRF1.pdf',width = 6,height = 3)
PRF1[[1]]   PRF1[[2]]
dev.off()
# compare cluster marker express winthin and across dataset
IFIT3 <- plotGene(ifnb_liger, "IFIT3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
IFITM3 <- plotGene(ifnb_liger, "IFITM3", axis.labels = c('UMAP 1', 'UMAP 2'), return.plots = TRUE)
pdf(file = './res/fig_201024/cluster_marker_within_between_dataset.pdf',width = 8,height = 8)
plot_grid(IFIT3[[1]],IFIT3[[2]],IFITM3[[1]],IFITM3[[2]], ncol=2)
dev.off()

载入包和数据

标准化

非负矩阵分解

聚类

可视化与下游分析

完整代码

发表评论