【原】单细胞实战之Scissor和BayesPrism映射——入门到进阶(初级篇4）

健明 2025-02-15 发布于广东

展开全文

我们在上一讲内容中学习了pseudobulks和GSVA富集分析~ 接下来我们我们学习一下两个“映射”工具——Scissor和BayesPrism。同时本次分析还需要TCGA-COAD的数据，可以自己清洗数据，也可以直接下载网盘内容分享的文件：TCGA-COAD_sur_model.Rdata

映射工具

在做基础实验的时候，研究者都希望能够改变各种条件来进行对比分析，从而探索自己所感兴趣的方向。在做数据分析的时候也是一样的，我们希望有一个数据集能够附加了很多临床信息/表型，然后二次分析者们就可以进一步挖掘。然而现实情况总是数据集质量非常不错，但是附加的临床信息/表型却十分有限，这种状况在单细胞数据分析时代中更加常见。因此如何将大量的含有临床信息/表型的bulkRNA测序数据和单细胞数据构成联系，这也是算法开发者们所重点关注的方向之一。

目前关于映射的工具有很多，从bulkRNA到单细胞上我们常用Scissor(“剪刀”法)，从单细胞上映射到bulk层面最常用的就是BayesPrism(贝叶斯棱镜法)了，这次内容我们就来学习一下Scissor法和BayesPrism。

1.Scissor

图片的内容展示了Scissor法的整一个分析流程，通过结合bulk RNA-seq表型数据和单细胞转录组数据，用于识别与特定表型（如生存率、治疗反应）相关的细胞亚群。首先，计算bulk样本与单细胞之间的相关性矩阵，并结合细胞相似性网络进行正则化回归，筛选出与表型显著相关的Scissor⁺ (正相关) 和Scissor⁻ (负相关)细胞。随后，通过UMAP可视化、统计检验和通路富集分析，进一步解析这些细胞的特征及其在生物学功能中的作用。从文献中对Scissor的概述来看，Scissor的核心计算步骤采用的是皮尔逊相关性分析，那么皮尔逊相关性分析的主要优势是计算高效、适用于大规模数据，且能够很好地衡量bulk数据和单细胞数据的线性关系，局限是仅适用于线性相关性，对非线性关系的处理较弱。

但测序数据就是高维数据，笔者从目前的知识储备来理解，线性相关性方法是在大样本情况下更加稳健，高维数据往往包含大量噪声，非线性方法（如Spearman秩相关性等方法）在小样本、非线性情况下可能表现良好，但在高维数据中容易受数据稀疏性影响，导致计算不稳定或过拟合。还有就是Scissor的核心任务是将 bulk RNA-seq数据与单细胞数据进行匹配，即找到与bulk数据模式最接近的单细胞群体。由于 bulk RNA-seq 反映的是大量细胞的加权平均表达，其主要变化趋势通常表现就是为线性模式。

2.BayesPrism

那么从单细胞数据映射到bulkRN数据我们通常采用的是BayesPrism，其作为一种工具可以根据scRNA数据(作为先验模型)去推断bulkRNA数据中肿瘤微环境组成(不同免疫细胞组分/不同细胞群)和基因表达情况。开发者展示的图片就很形象了，左边图展示了把标注了不同细胞类型的单细胞数据作为先验信息(prior info)的基因信息和bulkRNA测序数据输入进贝叶斯棱镜算法，右图就是通过了棱镜算法之后把bulkRNA测序数据中每个样本里所占的标注细胞类型的细胞百分比展示出来。从文献中对BayesPrism的概述来看，其通过贝叶斯建模，利用单细胞RNA-seq作为先验信息，联合推断bulk RNA-seq数据中的细胞类型比例和细胞类型特异性基因表达。该模型显式建模(显式建模是指在数学模型中明确地表达某些变量之间的关系，而不是隐含地或通过经验性调整来描述它们。在 BayesPrism中，显式建模的核心思想是直接在模型中纳入单细胞RNA-seq 和 bulk RNA-seq 之间的基因表达差异，而不是假设它们完全一致)并边缘化(边缘化是概率论中的一个概念，指的是通过对某些未观测变量进行积分或求和，使其影响被消除，从而得到目标变量的分布。简单来说，就是通过整合掉某些变量，使得模型对关键变量的推断更加稳健)单细胞与bulk之间的基因表达差异，以减少技术和生物学偏差，从而提高推断准确性。在模型优化和推断过程中，BayesPrism采用后验分布推断(贝叶斯统计中，后验分布指的是在观测到数据之后，进而调整参数分布。它是基于先验知识和观测数据综合得出的结果)进行细胞类型比例估计，并结合单细胞参考数据预测bulk 样本中各细胞类型的基因表达水平，从而克服传统去卷积方法的局限性。

分析步骤——Scissor

1.导入

rm(list=ls())
source('scRNA_scripts/lib.R')
source('scRNA_scripts/mycolors.R')
library(BiocParallel)
library(Scissor)
register(MulticoreParam(workers = 8, progressbar = TRUE))  # 单线程模式
load("TCGA-COAD_sur_model.Rdata")
scRNA <- qread("./4-Corrective-data/sce.all.qs")

dir.create("./7-Scissors")
setwd("./7-Scissors")

2.预处理数据

# Check TCGA-COAD
head(meta)[1:5,1:5]
#                                ID  OS.time OS                     histology gender
# TCGA-AA-3818-01A TCGA-AA-3818-01A 1.000000  1          Colon Adenocarcinoma FEMALE
# TCGA-AA-3543-01A TCGA-AA-3543-01A 1.000000  0 Colon Mucinous Adenocarcinoma   MALE
# TCGA-AA-3856-01A TCGA-AA-3856-01A 1.000000  0          Colon Adenocarcinoma   MALE
# TCGA-AA-A00R-01A TCGA-AA-A00R-01A 1.000000  0          Colon Adenocarcinoma FEMALE
# TCGA-AA-A01V-01A TCGA-AA-A01V-01A 1.033333  0          Colon Adenocarcinoma   MALE
head(exprSet)[1:5,1:5]
#            TCGA-AA-3818-01A TCGA-AA-3543-01A TCGA-AA-3856-01A TCGA-AA-A00R-01A TCGA-AA-A01V-01A
# WASH7P          0.003626661        0.6383137      0.001720891       0.06671424        1.1729266
# AL627309.6      0.919089866        1.8961174      0.148456647       2.29267751        0.6987401
# WASH9P          1.927197805        2.1416464      1.727985983       2.11144083        2.3731305
# MTND1P23        1.792300493        1.4819065      2.676033573       3.46413905        5.0904344
# MTND2P28        8.726735194        7.2472009      8.578934482       8.10135516        8.8684236

table(scRNA$celltype)
#       B cells              T/NK cells epithelial/cancer cells                  plasma 
#          3686                   10160                    2268                    1540 
# myeloid cells                   VSMCs     proliferative cells             fibroblasts 
#          3198                     395                     765                     984 
#    mast cells       endothelial cells 
#           568                     516 
sel.clust = "celltype"
scRNA <- SetIdent(scRNA, value = sel.clust)
table(scRNA@active.ident)

2.由于Scissor不支持V5，因此需要修改代码

修改了数据读取方式：sc_exprs <- as.matrix(sc_dataset@assay@layers$data)

RUNScissor <- function (bulk_dataset, sc_dataset, phenotype, tag = NULL, alpha = NULL, 
                        cutoff = 0.2, family = c("gaussian", "binomial", "cox"), 
                        Save_file = "Scissor_inputs.RData", Load_file = NULL) 
{
  library(Seurat)
  library(Matrix)
  library(preprocessCore)
  
  if (is.null(Load_file)) {
    common <- intersect(rownames(bulk_dataset), rownames(sc_dataset))
    if (length(common) == 0) {
      stop("There is no common genes between the given single-cell and bulk samples.")
    }
    
    if (class(sc_dataset) == "Seurat") {
      sc_exprs <- as.matrix(sc_dataset@assays$RNA@layers$data)
      rownames(sc_exprs) <- rownames(sc_dataset)
      colnames(sc_exprs) <- colnames(sc_dataset)
      network <- as.matrix(sc_dataset@graphs$RNA_snn)
    } else {
      sc_exprs <- as.matrix(sc_dataset)
      Seurat_tmp <- CreateSeuratObject(sc_dataset)
      Seurat_tmp <- FindVariableFeatures(Seurat_tmp, selection.method = "vst", verbose = FALSE)
      Seurat_tmp <- ScaleData(Seurat_tmp, verbose = FALSE)
      Seurat_tmp <- RunPCA(Seurat_tmp, features = VariableFeatures(Seurat_tmp), verbose = FALSE)
      Seurat_tmp <- FindNeighbors(Seurat_tmp, dims = 1:10, verbose = FALSE)
      network <- as.matrix(Seurat_tmp@graphs$RNA_snn)
    }
    
    diag(network) <- 0
    network[which(network != 0)] <- 1
    dataset0 <- cbind(bulk_dataset[common, ], sc_exprs[common, ])
    dataset0 <- as.matrix(dataset0)
    dataset1 <- normalize.quantiles(dataset0)
    rownames(dataset1) <- rownames(dataset0)
    colnames(dataset1) <- colnames(dataset0)
    Expression_bulk <- dataset1[, 1:ncol(bulk_dataset)]
    Expression_cell <- dataset1[, (ncol(bulk_dataset) + 1):ncol(dataset1)]
    X <- cor(Expression_bulk, Expression_cell)
    
    quality_check <- quantile(X)
    print("|**************************************************|")
    print("Performing quality-check for the correlations")
    print("The five-number summary of correlations:")
    print(quality_check)
    print("|**************************************************|")
    
    if (quality_check[3] < 0.01) {
      warning("The median correlation between the single-cell and bulk samples is relatively low.")
    }
    
    if (family == "binomial") {
      Y <- as.numeric(phenotype)
      z <- table(Y)
      if (length(z) != length(tag)) {
        stop("The length differs between tags and phenotypes. Please check Scissor inputs and selected regression type.")
      } else {
        print(sprintf("Current phenotype contains %d %s and %d %s samples.", z[1], tag[1], z[2], tag[2]))
        print("Perform logistic regression on the given phenotypes:")
      }
    }
    
    if (family == "gaussian") {
      Y <- as.numeric(phenotype)
      z <- table(Y)
      if (length(z) != length(tag)) {
        stop("The length differs between tags and phenotypes. Please check Scissor inputs and selected regression type.")
      } else {
        tmp <- paste(z, tag)
        print(paste0("Current phenotype contains ", paste(tmp[1:(length(z) - 1)], collapse = ", "), ", and ", tmp[length(z)], " samples."))
        print("Perform linear regression on the given phenotypes:")
      }
    }
    
    if (family == "cox") {
      Y <- as.matrix(phenotype)
      if (ncol(Y) != 2) {
        stop("The size of survival data is wrong. Please check Scissor inputs and selected regression type.")
      } else {
        print("Perform cox regression on the given clinical outcomes:")
      }
    }
    
    save(X, Y, network, Expression_bulk, Expression_cell, file = Save_file)
  } else {
    load(Load_file)
  }
  
  if (is.null(alpha)) {
    alpha <- c(0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
  }
  
  for (i in 1:length(alpha)) {
    set.seed(123)
    fit0 <- APML1(X, Y, family = family, penalty = "Net", alpha = alpha[i], Omega = network, nlambda = 100, nfolds = min(10, nrow(X)))
    fit1 <- APML1(X, Y, family = family, penalty = "Net", alpha = alpha[i], Omega = network, lambda = fit0$lambda.min)
    
    if (family == "binomial") {
      Coefs <- as.numeric(fit1$Beta[2:(ncol(X) + 1)])
    } else {
      Coefs <- as.numeric(fit1$Beta)
    }
    
    Cell1 <- colnames(X)[which(Coefs > 0)]
    Cell2 <- colnames(X)[which(Coefs < 0)]
    percentage <- (length(Cell1) + length(Cell2)) / ncol(X)
    print(sprintf("alpha = %s", alpha[i]))
    print(sprintf("Scissor identified %d Scissor+ cells and %d Scissor- cells.", length(Cell1), length(Cell2)))
    print(sprintf("The percentage of selected cell is: %s%%", formatC(percentage * 100, format = "f", digits = 3)))
    
    if (percentage < cutoff) {
      break
    }
    cat("\n")
  }
  
  print("|**************************************************|")
  return(list(para = list(alpha = alpha[i], lambda = fit0$lambda.min, family = family), Coefs = Coefs, Scissor_pos = Cell1, Scissor_neg = Cell2))
}

2.1 二分类数据

请注意单细胞数据集的细胞量，细胞量稍一大就会报错；
似乎对bulkRNA和phenotype的数据格式没有特殊要求，或者可以考虑全部改成matrix；
tag中的顺序需要与phenotype中顺序一致；

# 单细胞数据
# 按比例取样
set.seed(123)
sc_dataset = scRNA[, sample(1:ncol(scRNA),round(ncol(scRNA)/10)) ]
sc_dataset <- FindNeighbors(sc_dataset, dims = 1:20) 
table(sc_dataset$celltype)

# 单细胞数据 
# sc_dataset <- scRNA

# bulk数据##############
# 二分类数据，后续要使用binomial
# 分析时数据中不能存在na数据，去除或者改成0(需要提前确认！)
class(exprSet)
bulk_dataset <- as.matrix(exprSet)
#bulk_dataset <- na.omit(bulk_dataset)
#bulk_dataset[is.na(bulk_dataset)] <- 0
class(bulk_dataset)

# 提取phenotype
identical(rownames(meta),colnames(exprSet))
meta$cluster <- ifelse(meta$OS == "1", "1", "0")  # 先将 OS 变量转换为二分类
phenotype <- as.numeric(meta[,"cluster"])
class(phenotype)
phenotype

# tag
table(phenotype)
# phenotype
#   0   1 
# 332  92 
tag <- c("Live","Dead") 

# 保存数据
save(bulk_dataset,phenotype,tag,file = "bulkData.Rdata")
qsave(sc_dataset,"sc_dataset.qs")

开始运行+可视化

alpha默认值是0.05，越大惩罚力度越大；alpha需要多调整
cutoff默认20%，不要超过单细胞总数的50%
family = binomial

# V5
infos1 <- RUNScissor(bulk_dataset, 
                     sc_dataset, 
                     phenotype, 
                     tag = tag,
                     alpha = 0.001, # 默认0.05
                     cutoff = 0.2, # Scissor选择的细胞不能超过单细胞总数的50%，默认20%
                     family = "binomial",  # 生存数据需要使用cox
                     Save_file = './result.RData')
# [1] "|**************************************************|"
# [1] "Performing quality-check for the correlations"
# [1] "The five-number summary of correlations:"
#        0%       25%       50%       75%      100% 
# 0.1395088 0.3866064 0.4255620 0.4629632 0.7986783 
# [1] "|**************************************************|"
# [1] "Current phenotype contains 332 Live and 92 Dead samples."
# [1] "Perform logistic regression on the given phenotypes:"
# [1] "alpha = 0.001"
# [1] "Scissor identified 2281 Scissor+ cells and 2368 Scissor- cells."
# [1] "The percentage of selected cell is: 96.532%"
# 
# [1] "|**************************************************|"
infos1 <- qread("infos1.qs")

# 可视化
Scissor_select <- rep(0, ncol(sc_dataset))
names(Scissor_select) <- colnames(sc_dataset)
Scissor_select[infos1$Scissor_pos] <- "Scissor+"
Scissor_select[infos1$Scissor_neg] <- "Scissor-"
sc_dataset <- AddMetaData(sc_dataset, 
                          metadata = Scissor_select, 
                          col.name = "scissor")
UMAP_scissor <- DimPlot(sc_dataset, reduction = 'umap', 
                        group.by = 'scissor',
                        cols = c('grey','#0d1b46','tomato'), 
                        pt.size = 0.001, order = c("Scissor+","Scissor-"))
UMAP_scissor+DimPlot(sc_dataset,group.by = "celltype")
ggsave("scissor.pdf",width = 18,height = 7)

我们粗略的看一下这张图(肉眼看比例哈，因为这里的alpha值是随便设定的)，scissor+细胞主要集中于mast，endothelial，VSMCs，T/NK细胞还有部分的myeloid细胞，这个结果其实让我有点犹豫和困惑，比如T细胞几乎都是阳性细胞，并且癌症细胞中没有阳性细胞。接下来我们把时间也加上去重新分析看看。

2.1 生存数据

不需要tag数据
需要提取时间和状态数据，重命名为time和state
官方中time数据类型为num，state数据类型为dbl，因此phenotype应当需要为数值型

# 按比例取样
set.seed(123)
sc_dataset = scRNA[, sample(1:ncol(scRNA),round(ncol(scRNA)/5)) ]
sc_dataset <- FindNeighbors(sc_dataset, dims = 1:30) 
table(sc_dataset$celltype)

# 单细胞数据(如果数据量很小的话) 
# sc_dataset <- scRNA

# bulk数据##############
# 分析时数据中不能存在na数据，去除或者改成0(需要提前确认！)
class(exprSet)
bulk_dataset <- as.matrix(exprSet)
#bulk_dataset <- na.omit(bulk_dataset)
#bulk_dataset[is.na(bulk_dataset)] <- 0

# 提取phenotype
identical(rownames(meta),colnames(exprSet))
phenotype <- meta[,c("OS.time","OS")]
colnames(phenotype) <- c("time", "status")
str(phenotype)
# 'data.frame': 424 obs. of  2 variables:
#  $ time  : num  1 1 1 1 1.03 ...
#  $ status: int  1 0 0 0 0 0 0 0 0 0 ...
head(phenotype)
#                      time status
# TCGA-AA-3818-01A 1.000000      1
# TCGA-AA-3543-01A 1.000000      0
# TCGA-AA-3856-01A 1.000000      0
# TCGA-AA-A00R-01A 1.000000      0
# TCGA-AA-A01V-01A 1.033333      0
# TCGA-AA-3496-01A 1.033333      0

# 保存数据
save(bulk_dataset,phenotype,tag,file = "bulkData.Rdata")
qsave(sc_dataset,"sc_dataset.qs")

开始运行+可视化

alpha默认值是0.05，越大惩罚力度越大；alpha需要多调整
cutoff默认20%，不要超过单细胞总数的50%
family = cox

# V5
infos1 <- RUNScissor(bulk_dataset, 
                     sc_dataset, 
                     phenotype, 
                     #tag = tag,
                     alpha = 0.00025, # 默认0.05
                     cutoff = 0.2, # Scissor选择的细胞不能超过单细胞总数的50%，默认20%
                     family = "cox",  # 生存数据需要使用cox
                     Save_file = './result.RData')
# [1] "|**************************************************|"
# [1] "Performing quality-check for the correlations"
# [1] "The five-number summary of correlations:"
#        0%       25%       50%       75%      100% 
# 0.1395088 0.3866064 0.4255620 0.4629632 0.7986783 
# [1] "|**************************************************|"
# [1] "Perform cox regression on the given clinical outcomes:"
# [1] "alpha = 0.00025"
# [1] "Scissor identified 1554 Scissor+ cells and 398 Scissor- cells."
# [1] "The percentage of selected cell is: 40.532%"
# 
# [1] "|**************************************************|"
infos1 <- qread("infos1.qs")

我们粗略的看一下这张图，scissor+细胞主要集中于mast，endothelial，VSMCs，fibroblasts,部分T/NK和B细胞还有部分的myeloid细胞。除了增加了fibroblasts进入了阳性细胞行列，其他细胞的情况是类似的。

那么其他类似的细胞系暂且不谈，笔者思考了一下为什么成纤维细胞在这两种不同的分析方式中存在不同的情况。有一种可能性是在二分类分析中，成纤维细胞被归为 Scissor-，说明其在存活患者中更富集，或在死亡患者中表达较低，这是一个整体表达的情况。然而，在 Scissor结合生存分析时，成纤维细胞变为Scissor+，表明其与较短生存时间相关，可能在长期预后上促进肿瘤进展或免疫抑制。因此其实这个结果挺有趣的，但到底成纤维细胞和疾病进展(我认为癌症中的死亡很大程度上是因为疾病进展)是怎么样的因果关系还是值得推敲的(比如是成纤维细胞加速了疾病进展？还是疾病本身进展的同时导致了呈现成纤维细胞的异常？当然最有可能的是这两种情况都存在) 。

分析步骤——BayesPrism

1.导入

rm(list = ls())
library(BayesPrism)
library(TCGAbiolinks)
library(stringr)
library(SummarizedExperiment)
library(Seurat)
library(qs)
library(BiocParallel)
BiocParallel::register(MulticoreParam(workers = 8, progressbar = TRUE))
load("TCGA-COAD_sur_model.Rdata")
scRNA <- qread("./4-Corrective-data/sce.all.qs")

dir.create("./8-BayesPrism/")
setwd("./8-BayesPrism")

2.TCGA预处理

#请注意我们要做BayesPrism分析
#请确保使用一致的基因注释（TCGA RNA-seq 使用 GENCODE v22）。
#建议使用未规范化和未转换（ unnormalized and untransformed）的count data。当原始计数不可用时，线性归一化（例如 TPM、RPM、RPKM、FPKM）也是可以接受的，因为 BayesPrism 对reference和mixture之间的线性乘法差异具有鲁棒性（换个单词 即Robust）。理想情况下，如果使用规范化数据，最好提供参考数据和通过相同方法转换的批量数据。应避免log转换。因此建议使用小洁老师旧的探针注释方式
proj = "TCGA-COAD"
exp <- 2^exprSet - 1
exp_pre <- t(exp)

3.scRNA数据预处理

# 单细胞数据
sc_dataset <- scRNA
# 按比例抽样
set.seed(123)
sc_dataset = sc_dataset[, sample(1:ncol(sc_dataset),round(ncol(sc_dataset)/5)) ]
table(sc_dataset$celltype)

#sc_dat <- as.data.frame(sc_dataset@assays$RNA@counts) # V4
sc_dat <- GetAssayData(sc_dataset,layer = "counts") # V5
sc_dat_pre <- t(sc_dat)
table(sc_dataset$celltype)

# 保存一下
qsave(sc_dat_pre,"sc_dat_pre.qs")
qsave(sc_dataset,"sc_dataset.qs")
qsave(exp_pre,"exp_pre.qs")

4.对状态和细胞类型进行质量控制

# 对细胞状态进行质量控制
# plot.cor.phi(input=sc_dat_pre,
#               input.labels=sc_dataset$status,
#               title="cell state correlation",
#               #specify pdf.prefix if need to output to pdf
#               #pdf.prefix="gbm.cor.cs",
#               cexRow=0.2, cexCol=0.2,
#               margins=c(2,2))
# 对细胞类型进行质量控制
plot.cor.phi (input=sc_dat_pre, 
              input.labels=sc_dataset$celltype, 
              title="cell type correlation",
              #specify pdf.prefix if need to output to pdf
              #pdf.prefix="gbm.cor.ct",
              cexRow=0.5, cexCol=0.5,
)

笔者认为细胞状态参数有助于我们对肿瘤细胞或髓系细胞在不同样本中的状态进行细分，从而更好地揭示细胞群体的异质性。例如，整合转移、复发和原发样本的数据，并对细胞进行标注，有助于更细致地探索各细胞亚群的情况。

不同的细胞状态要保留至少大于20或者50个细胞
纳入分析的数据建议 1）不同细胞类型是具有足够多显著差异表达基因的细胞群体，例如大于50个或100个。对于转录组相似度过高的细胞群体，建议将其视为细胞状态，在最终的Gibbs采样之前将其聚合。因此，细胞状态通常适用于那些在表型空间中形成连续体的细胞，而非形成离散簇的细胞。2）为具有显著异质性的细胞类型定义多个细胞状态，例如恶性细胞等，以便解析其转录组。

可以看出每个细胞群分的还算清晰

5.过滤离群基因-单细胞和bulk

表达量较高的基因，如核糖体蛋白基因和线粒体基因，可能主导数据的分布并偏倚推断结果。这些基因通常在区分细胞类型方面信息较少，并且可能引入较大的虚假方差。因此，它们可能对去卷积分析产生不利影响。因此需要去除这些基因。

#查看单细胞数据的离群基因
#sc.stat 显示每个基因的归一化平均表达(x 轴)和最大特异性(y 轴)的对数，以及每个基因是否属于一个潜在的异常值类别。Other _ Rb 是由核糖体假基因组成的一组精选基因。
sc.stat <- plot.scRNA.outlier(
  input=sc_dat_pre, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=sc_dataset$celltype,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE #return the data used for plotting. 
  #pdf.prefix="gbm.sc.stat" specify pdf.prefix if need to output to pdf
)
head(sc.stat)  

#查看bulk细胞数据的离群基因
bk.stat <- plot.bulk.outlier(
  bulk.input=exp_pre,#make sure the colnames are gene symbol or ENSMEBL ID 
  sc.input=sc_dat_pre, #make sure the colnames are gene symbol or ENSMEBL ID 
  cell.type.labels=sc_dataset$celltype,
  species="hs", #currently only human(hs) and mouse(mm) annotations are supported
  return.raw=TRUE
  #pdf.prefix="gbm.bk.stat" specify pdf.prefix if need to output to pdf
)
head(bk.stat)

#过滤基因
sc.dat.filtered <- cleanup.genes (input=sc_dat_pre,
                                   input.type="count.matrix",
                                   species="hs", 
                                   gene.group=c( "Rb","Mrp","other_Rb",
                                                 "chrM","MALAT1","chrX","chrY") ,
                                   exp.cells=5)
plot.bulk.vs.sc (sc.input = sc.dat.filtered,
                 bulk.input = exp_pre
                 #pdf.prefix="gbm.bk.vs.sc" specify pdf.prefix if need to output to pdf
)

6.根据 “蛋白编码基因”确定子集。

接下来，我们检查不同类型基因(scRNA和bulkRNA)的表达一致性。

sc.dat.filtered.pc <-  select.gene.type(sc.dat.filtered,
                                        gene.type = "protein_coding")
#根据'signature gene’确定子集
#或者，在细胞类型被定义为其中一些显示非常相似的转录或存在严重的批次效应的情况下，例如，reference 和mixture 来自不同的测定手段，选择signature gene是可行的。这是因为特征基因的选择可以丰富用于反卷积的基因信息，同时减少技术批量效应造成的噪声影响。

基本上蛋白基因的表达一致性分数还是挺高的。

7.构建BayesPrism模型

myPrism <- new.prism(
   reference=sc.dat.filtered.pc, 
   mixture=exp_pre,
   input.type="count.matrix", 
   cell.type.labels = sc_dataset$celltype, 
   cell.state.labels = NULL, # 不同的细胞状态不能分配到同一细胞类型中
   key= "epithelial/cancer cells",   #"Malignant cells",#key是 cell.type.label 中对应于恶性细胞类型的字符向量。如无恶性细胞，即设置为 NULL，此时，所有类型的细胞将被 treated equally。
   outlier.cut=0.01,
   outlier.fraction=0.1,
 )

8.运行BayesPrism

#后续即运行BayesPrism控制 Gibbs 采样和优化的参数可以使用 Gibbs.control 和 opt.control 指定。?run.prism获取details。这里建议使用默认参数
bp.res <- run.prism(prism = myPrism, n.cores=20)

qsave(bp.res,file = "bp.res.qs")

9. 提取细胞型分数 θ 的后验平均值/变异分数(CV)/Z分数

bp.res <- qread("bp.res.qs")
# 提取细胞型分数θ的后验平均值
theta <- get.fraction (bp=bp.res,
            which.theta="final",
            state.or.type="type")

head(theta)
#                  epithelial/cancer cells       plasma      B cells   T/NK cells myeloid cells proliferative cells
# TCGA-AA-3818-01A               0.9271370 6.325108e-05 1.375325e-04 0.0004325820    0.01680760        0.0004305814
# TCGA-AA-3543-01A               0.8562443 1.974167e-04 3.342300e-05 0.0085692794    0.05991585        0.0042596279
# TCGA-AA-3856-01A               0.8191022 9.401481e-04 2.925654e-02 0.0004518154    0.07363352        0.0119511269
# TCGA-AA-A00R-01A               0.6588630 5.125138e-05 3.889203e-05 0.0106335478    0.08606630        0.1664692259
# TCGA-AA-A01V-01A               0.9290664 6.600326e-05 3.966345e-03 0.0001245812    0.01926844        0.0005808367
# TCGA-AA-3496-01A               0.7466722 6.637973e-04 4.654608e-03 0.0028542579    0.06975685        0.0024060684
#                  fibroblasts   mast cells endothelial cells        VSMCs
# TCGA-AA-3818-01A  0.02857603 6.929316e-05        0.02026500 0.0060810801
# TCGA-AA-3543-01A  0.05132464 7.272148e-04        0.01832902 0.0003992496
# TCGA-AA-3856-01A  0.03616049 4.760719e-04        0.02599443 0.0020336560
# TCGA-AA-A00R-01A  0.04469131 2.575208e-05        0.03237835 0.0007823437
# TCGA-AA-A01V-01A  0.02405473 1.266463e-04        0.02253263 0.0002133477
# TCGA-AA-3496-01A  0.11691740 2.093585e-04        0.05484618 0.0010192445
write.csv(theta,"theta.csv")

得到了每个样本的分析就很好办了~ 比如就可以按照不同样本中不同细胞类型的中位值将样本重新分组，然后进行表达量或者生存分析~

参考资料：

Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022 Apr;40(4):527-538. Nucleic Acids Res. 2022 Nov 28;50(21):12112-12130.
scAB detects multiresolution cell states with clinical significance by integrating single-cell genomics and bulk sequencing data.
Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat Cancer. 2022 Apr;3(4):505-517.
Scissor：https://github.com/sunduanchen/Scissor
scAB：https://github.com/jinworks/scAB
BayesPrism：https://github.com/Danko-Lab/BayesPrism/tree/main
生信技能树：https://mp.weixin.qq.com/s/swPDu2PoGTBEutzaR7YYSg https://mp.weixin.qq.com/s/3dZQxDdY6M1WwMMcus5Gkg