加载页面中...
使用我的GSE157827数据进行初步分析-5 | lwstkhyl

使用我的GSE157827数据进行初步分析-5

2026.3.18-2026.4.8研究进展

GSE174367

my_GSE157827_5_1

先看这几个1/2/3都是什么部分:

cd /public/home/GENE_proc/wth/GSE174367/fastq/
for fq in SRR14513977_{1,2,3}.fastq.gz; do
  echo "===== $fq ====="
  zcat "$fq" | awk '
    NR%4==2 && ++n<=200000 {len[length($0)]++}
    END{
      for (l in len) print l, len[l]
    }' | sort -n
  echo
done
  • SRR14513977_1.fastq.gz:都是8bp,是sample index read
  • 正常来讲另外两个应该一个是28bp,另一个是90-100bp,但很神奇的是另外两个都是100bp
  • GPT说是没切成28bp的形式,而是原始格式,需要自己看一下哪个是barcode+UMI,哪个是cDNA
import gzip
import sys
import math
from collections import Counter
N = 200000
MAX_POS = 40
def rc(seq):
    comp = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(comp)[::-1]
def iter_seqs(fq, n=N):
    count = 0
    with gzip.open(fq, "rt") as f:
        for i, line in enumerate(f):
            if i % 4 == 1:
                yield line.strip()
                count += 1
                if count >= n:
                    break
def entropy(counter):
    total = sum(counter.values())
    if total == 0:
        return 0.0
    h = 0.0
    for v in counter.values():
        p = v / total
        h -= p * math.log2(p)
    return h
def load_whitelist(path):
    wl = set()
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            wl.add(line.strip())
    return wl
def analyze_fastq(fq, whitelist=None):
    seqs = list(iter_seqs(fq, N))
    if not seqs:
        print(f"\n===== {fq} =====")
        print("No reads found")
        return
    lens = Counter(map(len, seqs))
    dom_len, dom_n = lens.most_common(1)[0]
    print(f"\n===== {fq} =====")
    print(f"sampled reads: {len(seqs)}")
    print("length distribution:")
    for l, c in sorted(lens.items()):
        print(f"  {l} bp\t{c}")
    print(f"dominant length: {dom_len} bp ({dom_n/len(seqs):.2%})")
    p16 = [s[:16] for s in seqs if len(s) >= 16]
    p28 = [s[:28] for s in seqs if len(s) >= 28]
    c16 = Counter(p16)
    c28 = Counter(p28)
    print(f"unique first16: {len(c16)} / {len(p16)} ({len(c16)/len(p16):.2%})")
    print(f"unique first28: {len(c28)} / {len(p28)} ({len(c28)/len(p28):.2%})")
    print("top 10 first16:")
    for s, c in c16.most_common(10):
        print(f"  {c}\t{s}")
    print("top 10 first28:")
    for s, c in c28.most_common(10):
        print(f"  {c}\t{s}")
    if whitelist is not None:
        hit = sum((x in whitelist) for x in p16)
        hit_rc = sum((rc(x) in whitelist) for x in p16)
        print(f"whitelist hit rate (first16): {hit/len(p16):.2%}")
        print(f"whitelist hit rate (reverse-complement first16): {hit_rc/len(p16):.2%}")
    tails = [s[28:60] for s in seqs if len(s) >= 60]
    polyA = sum(("AAAAAAAAAAAA" in x) for x in tails)
    polyT = sum(("TTTTTTTTTTTT" in x) for x in tails)
    print(f"reads with >=12A in pos29-60: {polyA/len(tails):.2%}")
    print(f"reads with >=12T in pos29-60: {polyT/len(tails):.2%}")
    pos_counts = [Counter() for _ in range(MAX_POS)]
    for s in seqs:
        for i, b in enumerate(s[:MAX_POS]):
            pos_counts[i][b] += 1
    print("\nPer-cycle summary (first 40 cycles):")
    print("pos\tmajor_base\tmajor_frac\tentropy\tA\tC\tG\tT\tN")
    for i, cnt in enumerate(pos_counts, start=1):
        total = sum(cnt.values())
        if total == 0:
            continue
        major_base, major_n = cnt.most_common(1)[0]
        A = cnt.get("A", 0) / total
        C = cnt.get("C", 0) / total
        G = cnt.get("G", 0) / total
        T = cnt.get("T", 0) / total
        Nn = cnt.get("N", 0) / total
        H = entropy(cnt)
        print(f"{i}\t{major_base}\t{major_n/total:.3f}\t{H:.3f}\t{A:.3f}\t{C:.3f}\t{G:.3f}\t{T:.3f}\t{Nn:.3f}")
if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python inspect_10x_cb_umi.py whitelist.txt.gz file1.fastq.gz file2.fastq.gz ...")
        sys.exit(1)
    wl = load_whitelist(sys.argv[1])
    for fq in sys.argv[2:]:
        analyze_fastq(fq, wl)
cd /public/home/GENE_proc/wth/GSE174367/fastq/
python ../inspect_10x_cb_umi.py \
  /public/home/wangtianhao/Desktop/STAR_ref/whitelist/3M-february-2018.txt \
  SRR14513977_2.fastq.gz \
  SRR14513977_3.fastq.gz

最后算出来_2的前16bp对whitelist的命中率高达96%,并且它的unique first16只有14.52%,但加上12bp以后,unique first28达到91.62%,说明前16位是barcode,后12bp是UMI

先跑一个测试:

module load miniconda3/base
conda activate STAR
cd /public/home/GENE_proc/wth/GSE174367/
STAR \
    --runMode alignReads \
    --runThreadN 8 \
    --genomeDir /public/home/wangtianhao/Desktop/STAR_ref/hg38/ \
    --readFilesIn fastq/SRR14513977_3.fastq.gz fastq/SRR14513977_2.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix star_test/SRR14513977 \
    --soloType CB_UMI_Simple \
    --soloCBstart 1 \
    --soloCBlen 16 \
    --soloUMIstart 17 \
    --soloUMIlen 12 \
    --soloBarcodeReadLength 0 \
    --soloCBwhitelist /public/home/wangtianhao/Desktop/STAR_ref/whitelist/3M-february-2018.txt \
    --soloFeatures GeneFull \
    --clipAdapterType CellRanger4 \
    --soloCellFilter EmptyDrops_CR \
    --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts \
    --soloUMIfiltering MultiGeneUMI_CR \
    --soloUMIdedup 1MM_CR

my_GSE157827_5_2

再跑一个完整的样本(包含多个SRR)

my_GSE157827_5_3

这个GSM5292838在作者的统计表里有4605个细胞(我的有4642个细胞),可以说是完全吻合,由此这个计数参数应该是没有什么问题的

总结:

  • 6w多个细胞,来自18个样本,虽然GEO数据库中有19个,但根据论文的补充材料,作者只使用了其中的18个,实测可能是因为差的那个测序质量不太好(nCount_RNA比较低)
  • 测序深度较深,nCount_RNA/nFeature_RNA较大(虽然细胞数只有前组数据的1/3,但fastq文件大小却接近3/4)
  • 在该数据对应的论文中,作者分了7种细胞类型,在之前的6种基础上,增加了“少突胶质前体细胞”(OPC, oligodendrocyte progenitor cells)——具有分化生成成熟少突胶质细胞的能力,在大脑早期发育的髓鞘形成过程中起重要作用

hERV ~ gene

主要思路:

  • 在细胞类型层面做hERV~gene共表达分析(不在亚群层面上):构建共表达网络,把网络中邻近的基因进行功能富集
  • 在各细胞类型里找hERV特异表达的亚群,聚焦到某些亚群中高表达的subfamilies做一些定性分析:看这个亚群中特异表达的hERV在共表达网络中连接的基因,对比这个亚群与其它亚群的差异表达基因,看是否与AD有关
  • 验证hERVK与TLR8表达相关(先所有细胞,再看细胞类型),参照那篇论文,为下一步hERV位点分析选择位点?

共表达网络test

在之前的代码中,我做的是一个hERV和一个gene的相关性或线性拟合,现在想做一下网络层面的关系:

  • 在某个大细胞类型里,哪些基因会一起变化、组成模块,这些模块和hERV是否相关
  • 某个hERV高表达亚群里,究竟激活了哪一类网络程序
  • WGCNA/hdWGCNA(用于单细胞数据的WGCNA):先用表达相关性构建加权网络,再把高度共表达的基因聚成模块,用模块特征值去和外部性状做关联
  • 在构建完网络之后,可以在某个和hERV明显相关的模块里,再去找和这个hERV相关性最高的基因(一个hERV和一个gene的线性拟合)

方案1:先用基因构建共表达网络,再把hERV当作外部性状去挂到网络上(GPT推荐)

  • 在细胞大类中,单独构建基因的共表达网络,得到若干个基因模块
    • 选在该细胞类型中有一定表达比例的基因
  • 把hERV的表达和模块特征值做关联
    • 选定一些trait:hERV家族表达量、AD/NC等
    • 模块特征值和trait的相关分析,画热图
  • 找到和某个hERV家族显著相关的模块
    • 例如某个模块和hERVK表达显著相关,同时和AD状态相关,那么这个模块就是一个很有价值的“hERV-AD关联模块”
  • 后续对这些模块进行亚群级别的分析(具体方法在方案1中有讲)

方案2:把hERV当成普通基因作为网络节点直接参与建网

  • 取网络中与hERV相邻的一些基因进行亚群级别的分析
  • GPT不推荐,说是会存在一些问题:
    • hERV家族数量远少于基因,且表达更稀疏
    • hERV节点本身不能直接做GO
    • 在WGCNA时,hERV要和普通基因一同作为feature进行标准化,这样就和之前单独标准化的思路不同,不过考虑到hERV表达量很低,影响也许不大?

采用哪种方案比较好

  • 方案1会更好解释家族水平hERV
  • 方案2看起来更适合探讨位点级别的局部网络

共表达网络的5个对象

  • 节点(node):通常是gene
  • 边(edge):表示两个基因在同一大细胞类型内、在不同样本/细胞中的表达是否同步变化
  • 模块(module):一群彼此更紧密共表达的基因,往往对应某种生物过程
  • 模块特征值(module eigengene, ME):本质上是模块表达矩阵的一个总结量,可以理解成“这个模块整体表达水平的代表值”,用于回答“这个模块在AD中是否表达更高”等问题
  • hub gene:模块里连接度高、最核心的基因,通常是模块的代表基因或关键解释点。hdWGCNA里常用kME去衡量这种模块内连接性

共表达网络其它的一些专业名词

  • metacell:把若干个彼此相近、而且来自同一分组里的单细胞做平均,得到一个更稳定的小样本单元,如果设置为group.by = c("wgcna_celltype", "orig.ident")就是先限制在同一个大细胞类型里,再限制在同一个sample里,然后在这些细胞里找相近细胞做平均。既保留了细胞类型特异性,又降低了单细胞水平分析的噪声
  • trait:和模块做关联的外部变量——某个module的整体表达水平,是否随着某个trait变化
  • soft power:WGCNA不会直接把基因之间的相关性(0.2/0.5/0.8这种)当网络中边的值,而是会加一个指数soft power,决定要把“高相关”和“低相关”的差距拉大到什么程度。这个指数变换可以把弱相关压小、强相关放大,如果power过小,网络对强弱相关的区分就不够;power慢慢增大,使网络结构更接近WGCNA想要的“scale-free”特征;如果power过大,整体网络会越来越稀,很多真实联系可能被压掉
  • scale-free:无标度,指的是一种网络结构——大多数节点连接很少,只有少数节点连接特别多,也就是网络里会有少量枢纽节点(hub gene),而不是每个节点都连得差不多
  • TOM:拓扑重叠矩阵(Topological Overlap Matrix),不是只看两个基因自己相关不相关,而是进一步看它们是不是拥有很多共同邻居
    • 基因A和基因B直接相关,当然说明它们可能有关
    • 但如果A和B不仅彼此相关,而且还都和C、D、E一起相关,那说明它们在网络中的“位置”也很像
    • 这种“共同邻居很多”的关系,TOM会给更高分

    所以它比单纯相关性更稳一些,因为它用了局部网络结构信息,而不是只看一条边

方案1

构建纯gene的共表达网络

herv_assay <- "HERV_family"
celltype_name <- "Oligo"
herv_traits <- c("HERV3-int", "HERVH-int", "HERVE-int", "HERVK-int", "HERVK3-int", "HERVK9-int", "HERVK11-int", "HERVK13-int", "HERVK14-int", "HERVK22-int")
DefaultAssay(seu) <- "RNA"
seu$AD_binary <- ifelse(seu$group == "AD", 1, 0)
# 把trait写入metadata
mat <- LayerData(seu, assay = herv_assay, layer = "data")[herv_traits, , drop = FALSE]
mat <- t(as.matrix(mat))
colnames(mat) <- paste0("herv_", make.names(herv_traits))
seu <- AddMetaData(seu, metadata = as.data.frame(mat))
# SetupForWGCNA
seu <- SetupForWGCNA(  # 运行时间几分钟
  seu,
  gene_select = "fraction",
  fraction = 0.05,  # 过滤掉表达比例<5%的基因
  assay = "RNA",
  wgcna_name = "gene_net"
)
# 构建metacell(按样本分开)
seu <- MetacellsByGroups(  # 运行时间20min,需要10G+内存
  seurat_obj = seu,
  group.by = c("celltype", "orig.ident"),
  ident.group = "celltype",
  assay = "RNA",
  layer = "data",
  reduction = "pca",
  dims = 1:30,
  k = 25,
  max_shared = 10,
  min_cells = 50,
  mode = "average",
  wgcna_name = "gene_net"
)
seu <- NormalizeMetacells(seu, wgcna_name = "gene_net")
# 设置网络表达矩阵
seu <- SetDatExpr(  # 运行时间几分钟
  seu,
  group_name = celltype_name,
  group.by = "celltype",
  use_metacells = TRUE,
  assay = "RNA",
  layer = "data",
  wgcna_name = "gene_net"
)
# 选soft power
seu <- TestSoftPowers(  # 运行时间几分钟
  seu,
  networkType = "signed",
  wgcna_name = "gene_net"
)
power_table <- GetPowerTable(seu)
soft_power <- power_table$Power[which(power_table$SFT.R.sq >= 0.8)[1]]
if (is.na(soft_power)) {
  soft_power <- power_table$Power[which.max(power_table$SFT.R.sq)]
}
plot_list <- PlotSoftPowers(seu)
pdf(file.path(res_dir, "oligo_soft_power.pdf"), width = 16, height = 16)
p <- patchwork::wrap_plots(plot_list, ncol = 2)
print(p)
dev.off()
save(seu, soft_power, herv_traits, file = file.path(res_dir, "oligo_WGCNA_before_ConstructNetwork.RData"))
# 构建网络
seu <- ConstructNetwork(  # 运行时间20min,需要50G+内存
  seu,
  soft_power = soft_power,
  networkType = "signed",
  tom_name = "seu_TOM",
  wgcna_name = "gene_net"
)
# 计算module eigengenes
seu <- ModuleEigengenes(  # 运行时间20min,需要50G+内存
  seu,
  group.by.vars = "orig.ident",
  assay = "RNA",
  wgcna_name = "gene_net"
)
# module~trait关联
trait_cols <- c(
  paste0("herv_", make.names(herv_traits)),
  "AD_binary"
)
seu$celltype <- factor(seu$celltype)
seu <- ModuleTraitCorrelation(
  seu,
  traits = trait_cols,
  group.by = "celltype",
  wgcna_name = "gene_net"
)
mt <- GetModuleTraitCorrelation(seu)
# 结果
View(mt$cor)  # 相关系数
View(mt$fdr)  # FDR校正后p值
pdf(file.path(res_dir, "oligo_trait_corr.pdf"), width = 16, height = 16)
p <- PlotModuleTraitCorrelation(seu, label = "fdr", label_symbol = "stars")
print(p)
dev.off()

my_GSE157827_5_5

  • 横轴是power,纵轴是scale-free topology model fit(signed R²),常取第一个让R²达到设定阈值(比如0.8左右)的power。如果数据本身比较噪、样本数少、或者是mixed network,达不到理想值也很常见,此时就选一个相对合理、且曲线已经进入平台区的power
  • 横轴是power,纵轴是平均连接度(Mean connectivity),在“结构合理”和“不要过稀”之间找平衡
  • 在这个结果中,power=5是一个比较合适的折中点,SFT.R.sq已达到阈值,连通性还保留得可以

my_GSE157827_5_6

  • turquoise模块最值得关注,几乎所有的hERV都是正相关
  • blue模块和很多hERV是反向的
  • 比较意外的是,这几个模块和AD的相关性竟然不是很高
  • 不过总体来说相关度不算特别大,几乎都是-0.2~0.2的水平,

从gene network里提取“和某个hERV trait相关的模块”和hub genes

# 计算kME/hub genes
seu <- ModuleConnectivity(
  seu,
  group.by = "celltype",
  group_name = celltype_name,
  assay = "RNA",
  wgcna_name = "gene_net"
)
save(seu, file = file.path(res_dir, "oligo_WGCNA.RData"))
# 目标hERV
trait_keep <- grep("^herv_", rownames(GetModuleTraitCorrelation(seu)$cor$all_cells), value = TRUE)
n_hubs <- 50  # 每个模块取多少个hub genes
pick_col <- function(df, candidates) {
  x <- intersect(candidates, colnames(df))
  if (length(x) == 0) stop("找不到列: ", paste(candidates, collapse = ", "))
  x[1]
}
mat_to_df <- function(mat, value_name) {
  df <- as.data.frame(as.table(as.matrix(mat)), stringsAsFactors = FALSE)
  colnames(df) <- c("module", "trait", value_name)
  df
}
# 选目标模块
fdr_cutoff <- 0.05  # 显著性阈值
cor_cutoff <- 0.05
mt <- GetModuleTraitCorrelation(seu)
cor_df <- mat_to_df(mt$cor$all_cells,  "cor")
fdr_df <- mat_to_df(mt$fdr$all_cells,  "fdr")
sig_tbl <- left_join(cor_df, fdr_df, by = c("module", "trait")) %>%
  # filter(
  #   trait %in% trait_keep,
  #   module != "grey",
  #   !is.na(fdr),
  #   fdr < fdr_cutoff,
  #   abs(cor) >= cor_cutoff
  # ) %>%
  arrange(trait, desc(abs(cor)))
# write.csv(sig_tbl, file.path(res_dir, "oligo_module_trait_significant.csv"), row.names = FALSE)
# target_modules <- unique(sig_tbl$module)
target_modules <- c("turquoise", "blue")  # 手动
# 提取模块基因 + hub genes
modules_df <- GetModules(seu)
hub_df <- GetHubGenes(seu, n_hubs = n_hubs)
gene_col_mod <- pick_col(modules_df, c("gene_name", "gene", "feature"))
module_col_mod <- pick_col(modules_df, c("module", "color"))
gene_col_hub <- pick_col(hub_df, c("gene_name", "gene", "feature"))
module_col_hub <- pick_col(hub_df, c("module", "color"))
# 背景基因设置为网络中的所有基因
universe_genes <- unique(modules_df[[gene_col_mod]])
universe_genes <- universe_genes[!is.na(universe_genes)]
# 只保留显著模块
module_gene_list <- lapply(target_modules, function(m) {
  unique(modules_df[modules_df[[module_col_mod]] == m, gene_col_mod, drop = TRUE])
})
names(module_gene_list) <- target_modules
module_hub_list <- lapply(target_modules, function(m) {
  unique(hub_df[hub_df[[module_col_hub]] == m, gene_col_hub, drop = TRUE])
})
names(module_hub_list) <- target_modules
# 导出模块基因和hub genes
# for (m in target_modules) {
#   write.csv(
#     data.frame(gene = module_gene_list[[m]]),
#     file.path(res_dir, paste0("oligo_genes_", m, ".csv")),
#     row.names = FALSE
#   )
#   write.csv(
#     data.frame(hub_gene = module_hub_list[[m]]),
#     file.path(res_dir, paste0("oligo_hub_genes_", m, ".csv")),
#     row.names = FALSE
#   )
# }
# 汇总结果
module_summary <- data.frame(
  module = target_modules,
  n_genes = sapply(module_gene_list, length),
  n_hubs  = sapply(module_hub_list, length),
  top_hubs = sapply(module_hub_list, function(x) paste(head(x, 10), collapse = ", ")),
  stringsAsFactors = FALSE
)
# write.csv(module_summary, file.path(res_dir, "oligo_module_summary.csv"), row.names = FALSE)
save(target_modules, universe_genes, module_gene_list, sig_tbl, module_hub_list, file = file.path(res_dir, "oligo_target_modules.RData"))

最后发现turquoise有3k多个基因,而blue只有300多个

# GO富集
run_go_one <- function(gene_vec, universe_genes = NULL, ont = "BP") {
  gene_vec <- unique(na.omit(gene_vec))
  if (length(gene_vec) < 10) return(NULL)
  ego <- enrichGO(
    gene          = gene_vec,
    universe      = universe_genes,
    OrgDb         = org.Hs.eg.db,
    keyType       = "SYMBOL",
    ont           = ont,
    pAdjustMethod = "BH",
    pvalueCutoff  = 1,
    qvalueCutoff  = 1,
    readable      = TRUE
  )
  if (is.null(ego) || nrow(as.data.frame(ego)) == 0) return(NULL)
  ego
}
go_list <- lapply(target_modules, function(m) {
  run_go_one(
    gene_vec = module_gene_list[[m]],
    universe_genes = universe_genes,
    ont = "BP"
  )
})
names(go_list) <- target_modules
for (m in target_modules) {
  ego <- go_list[[m]]
  if (is.null(ego)) next
  go_df <- as.data.frame(ego)
  write.csv(
    go_df,
    file.path(res_dir, paste0("GO_BP_", m, ".csv")),
    row.names = FALSE
  )
}
res <- list()
# 画图展示结果
for (m in target_modules) {
  ego <- go_list[[m]]
  if (is.null(ego)) next
  res[[m]] <- dotplot(ego, showCategory = 15) +
    ggtitle(paste0("GO BP - ", m))
}
pdf(file.path(res_dir, "oligo_module_trait_GO_summary.pdf"), width = 15, height = 24)
p <- wrap_plots(res, ncol = 2)
print(p)
dev.off()
# 汇总GO结果
top_go_tbl <- lapply(target_modules, function(m) {
  ego <- go_list[[m]]
  if (is.null(ego)) return(NULL)
  go_df <- as.data.frame(ego)
  go_df <- go_df[order(go_df$p.adjust), ]
  trait_df <- sig_tbl %>%
    filter(trait == m) %>%
    arrange(desc(abs(cor)))
  data.frame(
    module = m,
    trait = paste(trait_df$module, collapse = "; "),
    cor = paste(round(trait_df$cor, 3), collapse = "; "),
    n_gene = length(module_gene_list[[m]]),
    hub_gene = paste(module_hub_list[[m]], collapse = "; "),
    GO = paste(go_df$Description[1:20], collapse = "; "),
    GO_padj = paste(go_df$p.adjust[1:20], collapse = "; "),
    stringsAsFactors = FALSE
  )
}) %>% bind_rows()
write.csv(top_go_tbl, file.path(res_dir, "oligo_module_trait_GO_summary2.csv"), row.names = FALSE)

my_GSE157827_5_7

my_GSE157827_5_8

因为turquoise的基因数太多,所以p值就很高。又把两个模块的基因让GPT分析了一下,GPT也觉得turquoise更偏向总体细胞状态,blue的功能同质性更高:

  • CNP/OLIG1/NKX6-2/HSPA2/APOD:少突成熟/髓鞘相关
  • FTH1/FTL/CRYAB/SELENOP:代谢、应激、铁稳态/保护反应
  • 有很多核糖体、线粒体呼吸链相关基因

说明blue很可能对应的是一种代谢活跃/翻译活跃/线粒体活跃/相对成熟稳态的少突程序,而AD和hERV又是负相关,说明在AD或hERV较高的状态下,这套“代谢-翻译-氧化磷酸化”程序是下降的


试一试改参数让模块更细一些?改ConstructNetwork中的两个参数:

  • fraction = 0.1:过滤掉更多的基因
  • minModuleSize = 20:允许更小的模块保留下来
  • mergeCutHeight = 0.05:减少相似模块合并
herv_assay <- "HERV_family"
celltype_name <- "Oligo"
herv_traits <- c("HERV3-int", "HERVH-int", "HERVE-int", "HERVK-int", "HERVK3-int", "HERVK9-int", "HERVK11-int", "HERVK13-int", "HERVK14-int", "HERVK22-int")
DefaultAssay(seu) <- "RNA"
seu$AD_binary <- ifelse(seu$group == "AD", 1, 0)
mat <- LayerData(seu, assay = herv_assay, layer = "data")[herv_traits, , drop = FALSE]
mat <- t(as.matrix(mat))
colnames(mat) <- paste0("herv_", make.names(herv_traits))
seu <- AddMetaData(seu, metadata = as.data.frame(mat))
seu <- SetupForWGCNA(
  seu,
  gene_select = "fraction",
  fraction = 0.1,
  assay = "RNA",
  wgcna_name = "gene_net"
)
seu <- MetacellsByGroups(
  seurat_obj = seu,
  group.by = c("celltype", "orig.ident"),
  ident.group = "celltype",
  assay = "RNA",
  layer = "data",
  reduction = "pca",
  dims = 1:30,
  k = 25,
  max_shared = 10,
  min_cells = 50,
  mode = "average",
  wgcna_name = "gene_net"
)
seu <- NormalizeMetacells(seu, wgcna_name = "gene_net")
seu <- SetDatExpr(
  seu,
  group_name = celltype_name,
  group.by = "celltype",
  use_metacells = TRUE,
  assay = "RNA",
  layer = "data",
  wgcna_name = "gene_net"
)
seu <- TestSoftPowers(
  seu,
  networkType = "signed",
  wgcna_name = "gene_net"
)
power_table <- GetPowerTable(seu)
soft_power <- power_table$Power[which(power_table$SFT.R.sq >= 0.8)[1]]
if (is.na(soft_power)) {
  soft_power <- power_table$Power[which.max(power_table$SFT.R.sq)]
}
seu <- ConstructNetwork(
  seu,
  soft_power = soft_power,
  networkType = "signed",
  overwrite_tom = TRUE,
  tom_name = "seu_TOM",
  deepSplit = 4,
  minModuleSize = 20,
  mergeCutHeight = 0.05,
  pamStage = FALSE,
  wgcna_name = "gene_net"
)
seu <- ModuleEigengenes(
  seu,
  group.by.vars = "orig.ident",
  assay = "RNA",
  wgcna_name = "gene_net"
)
trait_cols <- c(
  paste0("herv_", make.names(herv_traits)),
  "AD_binary"
)
seu <- ModuleTraitCorrelation(
  seu,
  traits = trait_cols,
  group.by = "celltype",
  wgcna_name = "gene_net"
)
get_module_size <- function(seu, wgcna_name) {  # 看模块大小
  df <- GetModules(seu, wgcna_name = wgcna_name)
  gene_col <- intersect(c("gene_name", "gene", "feature"), colnames(df))[1]
  mod_col  <- intersect(c("module", "color"), colnames(df))[1]
  out <- as.data.frame(table(df[[mod_col]]), stringsAsFactors = FALSE)
  colnames(out) <- c("module", "n_genes")
  out <- out[order(out$n_genes, decreasing = TRUE), ]
  out
}
get_module_size(seu, "gene_net")
pdf(file.path(res_dir, "oligo_trait_corr.pdf"), width = 16, height = 16)
p <- PlotModuleTraitCorrelation(seu, label = "fdr", label_symbol = "stars")
print(p)
dev.off()
save(seu, file = file.path(res_dir, "oligo_WGCNA_try2.RData"))
     module n_genes
3      grey    3157
2 turquoise    1407
1      blue     201
5     brown      71
4    yellow      64
6     green      58
8       red      50
7     black      33
9      pink      21

my_GSE157827_5_10

target_modules <- c("turquoise", "blue", "brown", "yellow", "green", "red")

my_GSE157827_5_11

my_GSE157827_5_12

my_GSE157827_5_13

  • turquoise和几乎所有hERV trait都是正相关,但基因数仍比较大(1400多个基因),GO结果里也有一些像small GTPase signaling/signal transduction negative regulation/actin cytoskeleton organization/IL-8 production这类看起来可能和AD有关的条目,但padj基本都不显著,属于广义状态程序
  • blue的GO较集中且显著,都是关于能量代谢/蛋白合成的,hub genes也有不少核糖体/代谢相关基因:当hERV较高或AD状态更明显时,这套“翻译-氧化磷酸化-ATP合成”程序会下降
  • green是少突分化/髓鞘形成模块,yellow是细胞黏附/神经突起/运动与分化模块,和hERVK都是正相关,和AD略偏负相关
  • brown是蛋白折叠/热休克/蛋白稳态模块,和AD/hERV都是轻度正相关,更像是一个病理相关的模块

亚群层面分析

  • 对于hERV高表达的亚群,分为AD富集/不富集两种,如果该亚群hERV高表达且AD多,就对应brown/turquoise/blue这类AD和hERV同步变化的基因模块(蛋白稳态/广义hERV激活状态/稳态代谢受损),如果AD少就更像green/yellow这种不同步变化的(少突成熟/髓鞘化/分化或形态重塑)
    • 理想情况是上面说的这样的,不过在我的亚群聚类时,至少在oligo的亚群里面,并没有出现某个亚群的AD特别富集,所以还是以模块表达分数为准
  • 先看这个亚群的模块表达分数(就是上面得到的几个关键模块),看哪个模块在这个亚群里表达更高
  • 把这个亚群和其它亚群做差异表达分析,得到一组亚群特异上调基因,看这些基因与各模块基因的重叠度,综合模块表达分数,确定这个亚群在哪个模块上特异表达
  • 对DEG与目标模块内的基因取交集,再GO
load(file.path(res_dir, "oligo_WGCNA_try2.RData"))
seu_net <- seu
seu <- readRDS("/public/home/GENE_proc/wth/my_data/oligo.rds")
seu$subcluster <- seu$RNA_snn_res.0.5
wgcna_name <- "gene_net"
subcluster_col <- "subcluster"
target_clusters <- c("3", "5", "9")
modules_use <- c("brown", "turquoise", "blue", "green", "yellow")
# 取出模块基因
modules_df <- GetModules(seu_net, wgcna_name = wgcna_name)
gene_col <- intersect(c("gene_name", "gene", "feature"), colnames(modules_df))[1]
modules_df <- subset(modules_df, module %in% modules_use & module != "grey")
module_list <- split(modules_df[[gene_col]], modules_df$module)
module_list <- lapply(module_list, unique)
module_list <- lapply(module_list, intersect, rownames(seu[["RNA"]]))
# 给每个细胞打模块分数
for (m in modules_use) {
  seu <- AddModuleScore(
    seu,
    features = list(module_list[[m]]),
    assay = "RNA",
    name = paste0(m, "_score_")
  )
}
score_cols <- setNames(paste0(modules_use, "_score_1"), modules_use)
Idents(seu) <- seu[[subcluster_col]][, 1]
# 各亚群的模块平均分数
score_avg <- seu@meta.data[, c(subcluster_col, unname(score_cols)), drop = FALSE]
colnames(score_avg)[1] <- "subcluster"
score_avg <- score_avg %>%
  group_by(subcluster) %>%
  summarise(across(everything(), mean))
write.csv(score_avg, file.path(res_dir, "subcluster_module_scores.csv"), row.names = FALSE)
score_mat <- as.matrix(score_avg[, -1])
rownames(score_mat) <- score_avg$subcluster
colnames(score_mat) <- names(score_cols)
pdf(file.path(res_dir, "subcluster_module_scores.pdf"), width = 16, height = 16)
p <- pheatmap(
  scale(score_mat),
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  main = "Scaled by module across subclusters"
)
print(p)
dev.off()

my_GSE157827_5_18

# 特异亚群与其它亚群的DEG
universe_genes <- unique(unlist(module_list))
Idents(seu) <- seu$subcluster
for (cl in target_clusters) {
  other_ids <- setdiff(levels(Idents(seu)), cl)
  deg <- FindMarkers(
    seu,
    ident.1 = cl,
    ident.2 = other_ids,
    assay = "RNA",
    slot = "data",
    test.use = "MAST",
    # latent.vars = c("nCount_RNA", "percent_mito"),
    min.pct = 0.1,
    logfc.threshold = 0
  )
  deg$gene <- rownames(deg)
  fc_col <- grep("^avg_log", colnames(deg), value = TRUE)[1]
  deg_up <- subset(deg, p_val_adj < 0.05 & deg[[fc_col]] > 0.1)
  write.csv(deg, file.path(res_dir, paste0("DEG_subcluster_", cl, "_all.csv")), row.names = FALSE)
  write.csv(deg_up, file.path(res_dir, paste0("DEG_subcluster_", cl, "_up.csv")), row.names = FALSE)
  target_mean <- colMeans(
    seu@meta.data[seu[[subcluster_col]][, 1] == cl, unname(score_cols), drop = FALSE]
  )
  other_mean <- colMeans(
    seu@meta.data[seu[[subcluster_col]][, 1] != cl, unname(score_cols), drop = FALSE]
  )
  score_tbl <- data.frame(
    module = names(score_cols),
    target_mean = unname(target_mean[unname(score_cols)]),
    other_mean  = unname(other_mean[unname(score_cols)]),
    stringsAsFactors = FALSE
  )
  score_tbl$score_diff <- score_tbl$target_mean - score_tbl$other_mean
  up_genes <- deg_up$gene
  overlap_tbl <- data.frame(
    module = names(score_cols),
    overlap_n = 0,
    gene_ratio = 0,
    fisher_p = 1,
    overlap_genes = "",
    stringsAsFactors = FALSE
  )
  for (m in names(score_cols)) {
    mod_genes <- module_list[[m]]
    hit_genes <- intersect(up_genes, mod_genes)
    a <- length(hit_genes)
    b <- length(setdiff(up_genes, mod_genes))
    c <- length(setdiff(mod_genes, up_genes))
    d <- length(setdiff(universe_genes, union(up_genes, mod_genes)))
    overlap_tbl[overlap_tbl$module == m, "overlap_n"] <- a
    overlap_tbl[overlap_tbl$module == m, "gene_ratio"] <- ifelse(length(up_genes) > 0, a / length(up_genes), 0)
    overlap_tbl[overlap_tbl$module == m, "overlap_genes"] <- paste(hit_genes, collapse = ";")
    overlap_tbl[overlap_tbl$module == m, "fisher_p"] <- fisher.test(
      matrix(c(a, b, c, d), nrow = 2),
      alternative = "greater"
    )$p.value
  }
  res_tbl <- left_join(score_tbl, overlap_tbl, by = "module") %>%
    arrange(desc(score_diff), fisher_p)
  write.csv(res_tbl, file.path(res_dir, paste0("subcluster_", cl, "_module_integration.csv")), row.names = FALSE)
}
save(module_list, universe_genes, file = file.path(res_dir, "0401.RData"))

之前做的oligo各cluster的hERV平均表达量热图:

my_GSE157827_5_20

cluster 5:

my_GSE157827_5_19

  • 从表达量热图中可以看到它的hERVK/hERVK9比较高
  • turquoise score_diff比较高,且上调DEG与turquoise的重叠较多
  • 最像广义hERV激活状态

cluster 3:

my_GSE157827_5_21

  • HERVK3比较高
  • 上调基因包括ARHGAP24/CTNNA2/PTPRM/DCC/BCAN/ITPR2/PLEKHH2,属于细胞黏附/细胞骨架/发育的模块,与yellow模块的GO结果吻合
  • 像结构重塑/形态变化/分化相关亚群

cluster 9:

my_GSE157827_5_22

  • HERVK家族普遍较高
  • 虽然上调基因有一些偏向神经元的标记基因,不过画了一下dotplot发现整体上更偏向oligo而不是神经元
  • blue模块分数较高
target_module_map <- c(
  "5" = "turquoise",
  "3" = "yellow",
  "9" = "blue"
)
plot_list <- list()
for (cl in names(target_module_map)) {
  deg_up <- read.csv(
    paste0("C:\\Users\\17185\\Desktop\\fsdownload\\DEG_subcluster_", cl, "_up.csv"),
    stringsAsFactors = FALSE
  )
  target_module <- target_module_map[[cl]]
  gene_use <- intersect(deg_up$gene, module_list[[target_module]])
  write.csv(
    data.frame(gene = gene_use),
    paste0("C:\\Users\\17185\\Desktop\\fsdownload\\subcluster_", cl, "_", target_module, "_intersect_genes.csv"),
    row.names = FALSE
  )
  ego <- enrichGO(
    gene = gene_use,
    universe = universe_genes,
    OrgDb = org.Hs.eg.db,
    keyType = "SYMBOL",
    ont = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff = 1,
    qvalueCutoff = 1,
    readable = TRUE
  )
  ego_df <- as.data.frame(ego)
  write.csv(
    ego_df,
    paste0("C:\\Users\\17185\\Desktop\\fsdownload\\subcluster_", cl, "_", target_module, "_GO_BP.csv"),
    row.names = FALSE
  )
  plot_list[[cl]] <- dotplot(ego, showCategory = 15) +
    ggtitle(paste0("Subcluster ", cl, " : DEG ∩ ", target_module))
}
pdf(file.path(res_dir, "oligo_359_GO_summary.pdf"), width = 16, height = 16)
p <- wrap_plots(plot_list, ncol = 2)
print(p)
dev.off()

my_GSE157827_5_23

  • cluster 9的GO结果是唯一真正显著的,且GO富集条目与blue亚群吻合,说明它最可能代表的是:一个翻译活跃、线粒体呼吸活跃、能量代谢强、同时又保留成熟髓鞘marker的少突亚群

    结合前面的hERV表达量热图,可以理解成:cluster 9虽然也是hERV-high,但它对应的不是 “HERVK-int驱动的广义激活状态”,而更像一种“成熟髓鞘化背景下的hERV表达模式”

  • cluster 5的GO结果虽然不显著,但也符合turquoise模块的GO结果,在功能上呈现出脂质代谢、膜转运、去泛素化和FGF反应等倾向。说明交集基因虽然多,但功能比较分散

    它是最像hERV激活型的亚群,只是它的功能解释更适合写成脂质与膜代谢重塑、广义信号调节

  • cluster 3的GO结果与5类似,它确实更偏向细胞骨架/黏附/actin重塑/与外界刺激或炎症反应有关的调节,像是一个结构重塑/黏附变化/分化过程中的亚群

总结:hERV高表达在少突细胞内并不对应单一病理程序,而是和不同的细胞状态相联系

  • cluster 5:hERVK/K9主导的广义激活型

    功能倾向:脂质代谢、膜转运、信号/修饰重塑

  • cluster 3:HERVK3相关的结构重塑型

    功能倾向:actin、黏附、形态/细胞识别

  • cluster 9:多种hERV相关的成熟髓鞘化/高代谢型

    功能:翻译、氧化磷酸化、ATP合成

方案2
mixed_herv_features <- c("HERV3-int", "HERVH-int", "HERVE-int", "HERVK-int", "HERVK3-int", "HERVK9-int", "HERVK11-int", "HERVK13-int", "HERVK14-int", "HERVK22-int")
# 过滤掉表达比例<5%的基因(为防止把加进去的hERV也过滤掉,所以就不在SetupForWGCNA时过滤)
rna_counts <- LayerData(seu, assay = "RNA", layer = "counts")
feature_rowsum <- Matrix::rowSums(rna_counts > 0)
retained_f_low <- feature_rowsum > ncol(seu) * 0.05
genes_keep <- names(feature_rowsum)[retained_f_low]
seu[["RNA"]] <- subset(seu[["RNA"]], features = genes_keep)
# 构建混合assay:RNA gene + hERV feature
rna_counts <- LayerData(seu, assay = "RNA", layer = "counts")
rna_data  <- LayerData(seu, assay = "RNA", layer = "data")
herv_counts <- LayerData(seu, assay = herv_assay, layer = "counts")[mixed_herv_features, , drop = FALSE]
herv_data <- LayerData(seu, assay = herv_assay, layer = "data")[mixed_herv_features, , drop = FALSE]
rownames(herv_counts) <- paste0("HERV__", rownames(herv_counts))
rownames(herv_data) <- paste0("HERV__", rownames(herv_data))
mixed_counts <- rbind(rna_counts, herv_counts)
mixed_data <- rbind(rna_data, herv_data)
seu[["RNA_HERV"]] <- CreateAssayObject(counts = mixed_counts)
LayerData(seu, assay = "RNA_HERV", layer = "data") <- mixed_data
DefaultAssay(seu) <- "RNA_HERV"
mixed_features <- rownames(seu[["RNA_HERV"]])
# 重复前述步骤
seu <- SetupForWGCNA(
  seu,
  gene_select = "custom",
  gene_list = mixed_features,
  assay = "RNA_HERV",
  wgcna_name = "mixed_net"
)
seu <- MetacellsByGroups(
  seurat_obj = seu,
  group.by = c("celltype", "orig.ident"),
  ident.group = "celltype",
  assay = "RNA_HERV",
  layer = "data",
  reduction = "pca",
  dims = 1:30,
  k = 25,
  max_shared = 10,
  min_cells = 50,
  mode = "average",
  wgcna_name = "mixed_net"
)
seu <- NormalizeMetacells(seu, wgcna_name = "mixed_net")
seu <- SetDatExpr(
  seu,
  group_name = celltype_name,
  group.by = "celltype",
  use_metacells = TRUE,
  assay = "RNA_HERV",
  layer = "data",
  wgcna_name = "mixed_net"
)
grep("^HERV__", rownames(seu))
seu <- TestSoftPowers(
  seu,
  networkType = "signed",
  wgcna_name = "mixed_net"
)
power_table_mixed <- GetPowerTable(seu)
soft_power_mixed <- power_table_mixed$Power[which(power_table_mixed$SFT.R.sq >= 0.8)[1]]
if (is.na(soft_power_mixed)) {
  soft_power_mixed <- power_table_mixed$Power[which.max(power_table_mixed$SFT.R.sq)]
}
plot_list <- PlotSoftPowers(seu)
pdf(file.path(res_dir, "oligo_soft_power2.pdf"), width = 16, height = 16)
p <- patchwork::wrap_plots(plot_list, ncol = 2)
print(p)
dev.off()
save(seu, file = file.path(res_dir, "oligo_WGCNA2_before_ConstructNetwork.RData"))
seu <- ConstructNetwork(
  seu,
  assay = "RNA_HERV",
  soft_power = soft_power_mixed,
  networkType = "signed",
  tom_name = "seu_TOM",
  overwrite_tom = TRUE,
  wgcna_name = "mixed_net"
)
seu <- ScaleData(
  seu,
  assay = "RNA_HERV",
  features = rownames(seu[["RNA_HERV"]]),
)
seu <- ModuleEigengenes(
  seu,
  # group.by.vars = "orig.ident",
  assay = "RNA_HERV",
  wgcna_name = "mixed_net"
)
seu <- ModuleConnectivity(
  seu,
  group.by = "celltype",
  group_name = celltype_name,
  assay = "RNA_HERV",
  wgcna_name = "mixed_net"
)
save(seu, file = file.path(res_dir, "oligo_WGCNA2.RData"))

my_GSE157827_5_9

module_genes_df <- GetModules(seu)
hub_genes_df <- GetHubGenes(seu, n_hubs = 50)
modules <- unique(module_genes_df$module)
genes_list <- list()
for(single_module in modules){
  module_genes <- module_genes_df %>%
    filter(module == single_module) %>%
    pull(gene_name) %>%
    unique()
  genes_list[[single_module]]$module_genes <- module_genes
  hub_genes <- hub_genes_df %>%
    filter(module == single_module) %>%
    pull(gene_name) %>%
    unique()
  genes_list[[single_module]]$hub_genes <- hub_genes
}
TOM <- GetTOM(seu)
target_herv <- grepl("^HERV--", colnames(TOM))  # 目标hERV
target_gene <- !grepl("^HERV--", rownames(TOM))  # gene
universe_genes <- rownames(TOM)[target_gene]
herv_neighbors <- as.data.frame(TOM[target_gene, target_herv]) %>% 
  rownames_to_column("rowname") %>% 
  as_tibble() %>%
  column_to_rownames("rowname")
res_list <- list()
for(herv in colnames(herv_neighbors)){
  herv_gene <- herv_neighbors[, herv, drop = FALSE]
  colnames(herv_gene) <- "value"
  herv_module <- subset(module_genes_df, gene_name == herv)$module
  print(paste0(herv, "所属的模块为:", herv_module))
  module_genes <- subset(module_genes_df, module == herv_module)$gene_name
  res_list[[herv]]$module <- herv_module
  cor_gene <- subset(herv_gene, rownames(herv_gene) %in% module_genes) %>%
    filter(abs(value) > 0.02) %>%
    arrange(desc(value)) %>%
    head(n = 300) %>%
    rownames_to_column("gene")
  res_list[[herv]]$cor <- cor_gene
  res_list[[herv]]$cor_range <- c(max(cor_gene$value), min(cor_gene$value))
  res_list[[herv]]$gene <- cor_gene$gene
}
save(genes_list, res_list, universe_genes, file = file.path(res_dir, "oligo_WGCNA2_res.RData"))
"HERV--HERV3-int所属的模块为:grey"
"HERV--HERVH-int所属的模块为:blue"
"HERV--HERVE-int所属的模块为:grey"
"HERV--HERVK-int所属的模块为:turquoise"
"HERV--HERVK3-int所属的模块为:grey"
"HERV--HERVK9-int所属的模块为:turquoise"
"HERV--HERVK11-int所属的模块为:blue"
"HERV--HERVK13-int所属的模块为:grey"
"HERV--HERVK14-int所属的模块为:grey"
"HERV--HERVK22-int所属的模块为:turquoise"

GO富集分析:

  • 对HERVK/HERVK9/HERVK22/HERVH/HERVK11这几个不在grey模块(grey模块就是指比较模糊/没法聚类的基因)里的hERV相关性前300的基因做GO
  • 对它们所在的模块做GO
# 前300基因
target_hervs <- c("HERV--HERVK-int", "HERV--HERVK9-int", "HERV--HERVK22-int", "HERV--HERVK11-int", "HERV--HERVH-int")
go_list <- lapply(target_hervs, function(m) {
  run_go_one(
    gene_vec = res_list[[m]]$gene,
    universe_genes = universe_genes,
    ont = "BP"
  )
})
names(go_list) <- target_hervs
res <- list()
for (m in target_hervs) {
  ego <- go_list[[m]]
  if (is.null(ego)) next
  res[[m]] <- dotplot(ego, showCategory = 15) +
    ggtitle(paste0("GO BP - ", m))
}
pdf(file.path(res_dir, "oligo_hERV_gene_summary.pdf"), width = 15, height = 24)
p <- wrap_plots(res, ncol = 2)
print(p)
dev.off()
oligo_hERV_gene_summary <- lapply(target_hervs, function(m) {
  ego <- go_list[[m]]
  if (is.null(ego)) return(NULL)
  go_df <- as.data.frame(ego)
  go_df <- go_df[order(go_df$p.adjust), ]
  herv_list <- res_list[[m]]
  data.frame(
    herv = m,
    module = as.character(herv_list$module),
    cor_gene = paste(herv_list$gene, collapse = "; "),
    max_cor = herv_list$cor_range[1],
    min_cor = herv_list$cor_range[2],
    GO = paste(go_df$Description[1:20], collapse = "; "),
    GO_padj = paste(go_df$p.adjust[1:20], collapse = "; "),
    stringsAsFactors = FALSE
  )
}) %>% bind_rows()
write.csv(oligo_hERV_gene_summary, file.path(res_dir, "oligo_hERV_gene_summary.csv"), row.names = FALSE)

my_GSE157827_5_15

my_GSE157827_5_16

my_GSE157827_5_17

# 它们所在的模块
target_modules <- unique(as.character(oligo_hERV_gene_summary$module))
go_list <- lapply(target_modules, function(m) {
  run_go_one(
    gene_vec = genes_list[[m]]$module_genes,
    universe_genes = universe_genes,
    ont = "BP"
  )
})
names(go_list) <- target_modules
res <- list()
for (m in target_modules) {
  ego <- go_list[[m]]
  if (is.null(ego)) next
  res[[m]] <- dotplot(ego, showCategory = 15) +
    ggtitle(paste0("GO BP - ", m))
}
pdf(file.path(res_dir, "oligo_hERV_module_summary.pdf"), width = 16, height = 8)
p <- wrap_plots(res, ncol = 2)
print(p)
dev.off()
oligo_hERV_module_summary <- lapply(target_modules, function(m) {
  ego <- go_list[[m]]
  if (is.null(ego)) return(NULL)
  go_df <- as.data.frame(ego)
  go_df <- go_df[order(go_df$p.adjust), ]
  module_list <- genes_list[[m]]
  data.frame(
    module = m,
    n_gene = length(module_list$module_genes),
    hub_gene = paste(module_list$hub_genes, collapse = "; "),
    GO = paste(go_df$Description[1:20], collapse = "; "),
    GO_padj = paste(go_df$p.adjust[1:20], collapse = "; "),
    stringsAsFactors = FALSE
  )
}) %>% bind_rows()
write.csv(oligo_hERV_module_summary, file.path(res_dir, "oligo_hERV_module_summary.csv"), row.names = FALSE)

my_GSE157827_5_14

综合上面的结果,HERVH-int和HERVK11-int的GO结果最显著,且它们都在blue这个module,对该模块的GO分析结果也显著集中于蛋白翻译+线粒体能量代谢。另外几个都在turquoise模块,不过这个模块的GO结果与方案1的类似,都比较模糊

  • 一个比较大的问题就是与hERV相关的前300个基因的相关性在0.05±0.02左右,这个相关性应如何衡量?

    这个相关性实际上不是前面线性拟合时得到的相关性,而是TOM相似度,这种计算方法得到的数值会比普通相关系数小很多,因此这个值其实不能说是“弱相关”

接下来的GO分析与上面方案1的类似,都是计算目标亚群与其它亚群的DEG,然后看目标亚群的高表达hERV相邻的基因与DEG的重叠度,取两者重合的部分作GO

共表达网络

选hERV特异亚群和显著神经相关模块进行GO分析

以下结果,如果family和group的没有太大差别,就只展示family的结果,后期也以family为主进行分析

hERV表达量热图是按行看——每个hERV家族在各亚群中的相对表达水平,模块分数热图是按列看,每个模块在各亚群中的相对表达水平(其实只是行列倒过来了,意义相同)

Oligo

汇总一下之前的记录:

my_GSE157827_5_48

my_GSE157827_5_49

my_GSE157827_5_50

my_GSE157827_5_51

Astro

my_GSE157827_5_27

my_GSE157827_5_26

  • 0在blue表达量高,blue和hERVK有正相关,和AD微弱负相关,可作为目标亚群
  • 1在red表达量高,但red和hERV相关性较弱,不如0+blue

my_GSE157827_5_38

  • blue和red都有神经相关的条目,其中blue神经相关条目看起来非常多

my_GSE157827_5_44

  • cluster0有15000个细胞,占astro的3/4,AD占比很平均
  • cluster1有3000多个细胞,AD稍微多一些(比astro整体AD占比多3%)
  • 但是它们的GO聚类结果很好,blue和神经功能显著相关,red也较显著,和信号传导相关
OPC

my_GSE157827_5_28

my_GSE157827_5_29

  • 6在turquoise表达量高,turquoise和hERVK有正相关,和AD无相关,可作为目标亚群
  • 8没有特别高的,只有yellow有点点高,但yellow和hERV相关性不如6+turquoise

my_GSE157827_5_39

  • yellow和astro的blue几乎一模一样,turquoise更偏普通功能,也有染色体结构/转录翻译这些
  • 然而不知道为什么OPC的6/8亚群的DEG只有十几个,取交集后就只剩下个位数了
Excit

my_GSE157827_5_30

my_GSE157827_5_31

  • 6hERV下调,在blue/green/magenta/pink模块也下调,在black/red/yellow模块上调
    • blue/green/magenta/pink与hERV都正相关
    • black与hERV负相关程度很大,red其次,yellow相关程度较低
    • 因此可以看看6与blue/green/magenta/pink/black/red这几个模块

my_GSE157827_5_40

  • 每个模块都有一些看起来和神经/免疫相关的,但都不太显著
    • black和red和神经比较相关
    • pink和免疫有关
    • green和DNA/RNA有关
  • 考虑到6在black和red下调,所以应该取6亚群下调基因与它们的交集

my_GSE157827_5_47

  • 6有近3000个细胞,NC略多(1.4%),但其上调基因确实显著富集到神经相关条目上,black和pink模块与hERV负相关,6在这两个模块上调,且hERV表达量低
Inhib

my_GSE157827_5_32

my_GSE157827_5_33

  • 2没有特别负相关的模块,不过brown与hERV负相关,且2在其中表达很低,可以看看2和brown的关系

my_GSE157827_5_41

  • 感觉没什么特别的
Endo

my_GSE157827_5_34

my_GSE157827_5_35

  • 1在blue上调,red/turquoise下调
  • 4在brown上调,yellow下调
  • 主线应该就是1和blue,yellow相关性不太稳定

my_GSE157827_5_42

  • blue基因数似乎太多了,感觉也没什么特别的
Mic

my_GSE157827_5_36

my_GSE157827_5_37

  • blue和hERV显著负相关,3hERV表达高,但blue模块表达量也高,这对吗?
  • 6在turquoise上调,brown/pink/black/red下调,然而turquoise/brown/pink/black/red都和hERV正相关,最主要6就100多个细胞,不太行
  • 只有2有1000多个细胞,在blue微微下调,在turquoise/brown/pink微上调
  • 如果mic很值得一看的话,大概就是blue/turquoise/brown/pink

my_GSE157827_5_43

  • blue的结构比较显著,与神经和翻译有相关,其它几个感觉有点普通了,可以先做个取交集看看
  • 考虑到2在blue下调,所以应该取2亚群下调基因与它的交集

my_GSE157827_5_45

  • 2AD多(13%),在blue这个神经相关模块下调,且hERVK等家族表达量高,hERVK与blue显著负相关;在pink这个信号响应?模块上调,pink与hERV正相关
  • 补一下mic的group热图,与hERVK的关系看得更明显

    my_GSE157827_5_46

所有亚群的GO分析

对所有模块进行一遍GO分析,看有没有免疫相关模块、炎症相关模块(Mic),补充一下之前的结论

astro:也存在免疫/细胞因子响应方向的hERV关联,但强度和清晰度不如Mic,更像是“胶质支持/分化+轻度炎症响应”的背景状态

  • blue:与hERV明显正相关,更偏向一个胶质分化/细胞黏附/神经支持相关状态
  • green:与hERV也有正相关,TNF/cytokine response
  • yellow:显著富集到蛋白质相关条目,与hERV有弱相关
  • turquoise:与DNA修复有关,与hERV强正相关,但GO不显著

Oligo:hERV更稳定地关联到一个“广义信号调节/状态重塑”模块;但在具体亚群层面,某些hERV-high状态(比如cluster 9)又可以进一步落到“高翻译/高代谢/成熟髓鞘化”程序上

  • turquoise和hERV相关性最强,GO条目分布比较广(该亚群基因数很多)
  • green:成熟少突/髓鞘化模块,和hERV相关性一般
  • brown:蛋白折叠/伴侣蛋白模块,和hERV相关性一般

Mic:hERV高表达更倾向于和先天免疫/炎症样程序增强相联系,同时与翻译/神经互作样程序下降相联系

  • black:最像“先天免疫/抗病原体”模块,与hERV相关性较高
  • pink:最像“炎症受体/信号响应”模块,虽然GO条目不是很集中,但hub gene中有TLR2/SP100/MAP3K8/EPSTI1,与炎症刺激/先天免疫信号响应有关
  • yellow:蛋白折叠+免疫效应,可能说明Mic的hERV-high状态,可能不是单纯炎症,也可能伴随proteostasis/stress response
  • blue:有cytoplasmic translation/synaptic signaling/glutamate receptor signaling/ribosome biogenesis条目,与hERV是负相关,可能说明Mic的hERV升高并不是简单伴随“全局活性增强”,而更像伴随某种homeostatic/neuron-interacting/translation-high状态的下降

Excit:hERV还可能与核酸代谢/复制样异常程序相关

  • green模块的GO条目与DNA复制/核酸加工/应激样有关,且与hERV相关性较强,但GO不显著,也不是很标准的表观染色质调控模块

Inhib:与hERV相关性整体较弱,turquoise模块与hERV相关性较强,模块显著富集到突触相关条目,且与AD有弱负相关;brown与hERV有较强负相关,与蛋白折叠/应激有关,但GO不显著,可作为补充说明

Endo:与hERV相关性整体较弱,不过也有涉及到DNA合成/细胞分裂的yellow模块与hERV相关性较强,但GO不显著;brown显著富集到神经相关条目、green显著富集到蛋白质相关条目,但它们与hERV相关性较弱

有没有“STAT1等抗病毒通路相关的hERV”

  • STAT1指的是标准抗病毒程序:细胞感受到病毒核酸/dsRNA,从而激活免疫通路(type I / type II interferon)
  • 上面提到的先天免疫/炎症样/抗原呈递样模块:细胞感受到损伤、病原相关分子、细胞碎片、炎症因子,通过TLR/NF-κB/MAPK/cytokine等通路进入激活状态,同时可能伴随吞噬、补体、抗原呈递等功能增强
  • Mic中没有看到一个很典型的“STAT1-IFIT-OAS-MX1”式干扰素模块,有defense response/innate immune/TLR2/EPSTI1/MAP3K8(先天免疫/炎症样和抗原呈递样模块),不算特别像标准interferon/STAT1 signature
  • OPC的green模块的hub gene里有STAT1,GO也有response to external biotic stimulus/response to other organism,与hERV也是正相关,但GO富集并不显著

总结byGPT

hERV关联不是统一的,而是明显细胞类型依赖

  • Astro、Oligo、OPC、Excit的模块-hERV相关性幅度更大
  • Mic相关性中等,但模块的生物学解释很有意思
  • Endo、Inhib整体较弱

AD和hERV不是简单同向

  • 很多模块里都能看到hERV和模块相关明显,但AD_binary相关很弱,甚至方向相反
  • hERV更像是“细胞状态轴”的标志,而不是简单的AD/NC二分类标志

主线:

  • Oligo:在整体层面,hERV更稳定地关联到“广义信号调节/状态重塑”模块。但在具体亚群层面,hERV高表达并不对应单一转录程序,而是分化为至少三类状态——一类偏膜脂与信号重塑,一类偏细胞骨架/黏附重塑,另一类则表现为成熟髓鞘化背景下的高翻译和高氧化磷酸化状态
    • cluster 5:更像hERV激活型,偏脂质代谢、膜转运、信号/修饰重塑
    • cluster 3:更像结构重塑/黏附变化/分化相关状态
    • cluster 9:最清楚,GO直接落在cytoplasmic translation/oxidative phosphorylation/ATP metabolic process,这说明它是一个成熟髓鞘化、高代谢、高翻译活性的少突状态
  • Mic:在整体层面,hERV高表达状态与抗病毒/先天免疫样反应增强相关,同时伴随翻译/神经互作相关程序的下降
    • cluster 2:AD偏多(比全部Mic中的AD比例多13%),hERVK等家族偏高,与hERV正相关的pink模块上调,而pink模块富集到defense response to virus/response to virus/defense response to symbiont相关(DEG包括TLR2/MX2/MX1);与hERV负相关的blue模块下调,blue模块与learning/cognition/trans-synaptic signaling/neuron projection morphogenesis相关
    • 除了pink和blue模块,在整体层面,Mic中的hERV与有关先天免疫/抗病原体的black模块、有关蛋白折叠和免疫效应的yellow模块有相关

支持性结果:

  • Excit:提供了一个和神经胶质细胞不同的模式——是hERV较低的神经元状态,保留了更强的突触传递和囊泡循环程序;反过来说,hERV更多和激活/重塑/炎症类状态一起出现,hERV升高可能与神经元突触功能程序的减弱相关
    • cluster 6的hERV表达量低,在red和black模块上调,而red和black的GO都是突触和信号传递相关的
  • astro:hERV相关模块主要连接到gliogenesis、cell-cell signaling和synapse organization,提示其更可能反映胶质细胞参与神经支持和细胞间通讯的状态,也存在免疫/细胞因子响应方向的hERV关联,但强度和清晰度不如Mic,更像“胶质支持/分化+轻度炎症响应”的背景状态

总述:hERV重调在AD脑内具有明显的细胞类型和细胞状态依赖性。在不同细胞类型中,hERV并非统一对应同一种病理程序,而是分别耦联到不同的转录模块:在oligo中与膜脂重塑、细胞骨架重塑及成熟髓鞘化/高代谢状态相关;在mic中与广义先天免疫/炎症样激活、抗原呈递样程序增强相关,并与稳态和翻译程序下降相关;在astro中更多关联于胶质分化、神经支持与轻度细胞因子响应;而在excit中,hERV升高则更可能与突触传递程序减弱相联系。除此之外,在OPC中也存在一定STAT1/外源刺激响应线索,在inhib中与突触功能和蛋白折叠/应激有关,在endo中与DNA合成/细胞分裂、神经和蛋白质相关功能有关,但与hERV的相关性较弱或者是GO富集结果不显著,需要更多验证

Figure1&S1

my_GSE157827_5_4

  • 1B:在小鼠胚胎发育不同阶段中,前100个“高表达且动态变化明显”的蛋白编码基因/逆转座子家族,说明逆转座子的整体动态表达模式,与普通蛋白编码基因很相似,也随发育阶段发生有规律的切换

    颜色不是绝对表达量,而是标准化后的相对高低(这个位点/家族在该列条件下相对它自己更高/低)

  • 1D左:逆转座子-基因junction更常落在编码蛋白基因上,而不是非编码转录本上
  • 1D右:这些逆转座子的类型

在我的研究中:只画热图,作为展示hERV整体分布的图。具体来说,把横轴的发育阶段换成Astro_NC/Astro_AD/Endo_NC/Endo_AD…,纵轴是是hERV家族(选一些高表达且在这些列之间变化较大的),右面加一个注释条标明超家族,先不做聚类,每个单元格的颜色表示该hERV家族在该细胞类型中的相对表达量,以此展示每个细胞类型中高表达的hERV有区别,经过测试,我选择了行标准化,即“同一个hERV家族在不同细胞类型中,哪里相对更高、哪里相对更低”(因为列标准化会出来很多hERV家族在所有细胞类型中表达都高,就没法体现出阶梯状的差异)

mat <- GetAssayData(seu, assay = "HERV_family", layer = "data")
celltype_order <- c("Astro", "Oligo", "OPC", "Excit", "Inhib", "Mic", "Endo")
group_order <- c("AD", "NC")
# 提取metadata,并构造列分组
meta <- seu@meta.data[, c("celltype", "group"), drop = FALSE]
mat <- mat[, rownames(meta), drop = FALSE]
meta$celltype <- factor(as.character(meta$celltype), levels = celltype_order)
meta$group <- factor(as.character(meta$group), levels = group_order)
mat <- mat[, rownames(meta), drop = FALSE]
meta$cellgroup <- paste(meta$celltype, meta$group, sep = "_")
col_order <- expand.grid(
  celltype = celltype_order,
  group = group_order,
  stringsAsFactors = FALSE
)
col_order$cellgroup <- paste(col_order$celltype, col_order$group, sep = "_")
col_order <- col_order[order(
  match(col_order$celltype, celltype_order),
  match(col_order$group, group_order)
), , drop = FALSE]
col_order <- col_order[col_order$cellgroup %in% meta$cellgroup, , drop = FALSE]
# 计算每个celltype×group的平均表达
avg_list <- lapply(col_order$cellgroup, function(x) {
  cells <- rownames(meta)[meta$cellgroup == x]
  Matrix::rowMeans(mat[, cells, drop = FALSE])
})
avg_mat <- do.call(cbind, avg_list)
colnames(avg_mat) <- col_order$cellgroup
rownames(avg_mat) <- rownames(mat)
# 选择“高表达且变化较大”的 hERV family
row_mean <- rowMeans(avg_mat)
row_sd <- apply(avg_mat, 1, sd)
min_mean <- 0.05  # “高表达”的阈值
top_n <- 40  # 画多少个family
keep <- row_mean >= min_mean
stat_df <- data.frame(
  family = rownames(avg_mat),
  mean_expr = row_mean,
  sd_expr = row_sd,
  stringsAsFactors = FALSE
)
stat_df <- stat_df[keep, ]
stat_df <- stat_df[order(-stat_df$sd_expr, -stat_df$mean_expr), ]  # 先按平均表达过滤,再按不同列之间的标准差排序
family_use <- head(stat_df$family, top_n)
plot_mat <- avg_mat[family_use, , drop = FALSE]
# 计算Z-score
plot_mat_z <- t(scale(t(as.matrix(plot_mat))))
plot_mat_z <- plot_mat_z[apply(plot_mat_z, 1, function(x) all(is.finite(x))), , drop = FALSE]
# 按“最高表达出现在哪一列”排序
peak_col <- max.col(plot_mat_z, ties.method = "first")
peak_val <- apply(plot_mat_z, 1, max)
ord <- order(peak_col, -peak_val)
plot_mat_z <- plot_mat_z[ord, , drop = FALSE]
# 加右侧超家族注释条
family_anno <- read.csv("/public/home/GENE_proc/wth/metadata/hERV/hERV_locus2family.csv")
family_anno$subfamily_raw <- gsub("_", "-", family_anno$subfamily_raw)
row_class <- family_anno$class[match(rownames(plot_mat_z), family_anno$subfamily_raw)]
row_class[is.na(row_class)] <- "Unknown"
class_cols <- c(
  "ERV1" = "#E6E6E6",
  "ERVK" = "#969696",
  "ERVL" = "#252525",
  "Unknown" = "#BDBDBD"
)
right_anno <- rowAnnotation(
  Class = row_class,
  col = list(Class = class_cols),
  show_annotation_name = TRUE,
  annotation_name_gp = gpar(fontsize = 10),
  simple_anno_size = unit(2, "mm")   # 右侧色条宽度
)
# 顶部注释
celltype_cols <- setNames(hcl.colors(length(celltype_order), "Set 2"), celltype_order)
group_cols <- c("NC" = "#2B6CB0", "AD" = "#B22222")
col_order$celltype <- factor(col_order$celltype, levels = celltype_order)  # 固定图例标签顺序
col_order$group <- factor(col_order$group, levels = group_order)
top_anno <- HeatmapAnnotation(
  Celltype = col_order$celltype,
  Group = col_order$group,
  col = list(
    Celltype = celltype_cols,
    Group = group_cols
  ),
  annotation_legend_param = list(
    Celltype = list(
      at = celltype_order
    ),
    Group = list(
      at = group_order
    )
  ),
  annotation_name_gp = gpar(fontsize = 8),
  simple_anno_size = unit(2, "mm")  # 顶部色条高度
  # , gap = unit(0.25, "mm")  # 两条注释之间留一点小空隙
)
# 画图
ht <- Heatmap(
  plot_mat_z,
  name = "Z-score",
  col = colorRamp2(c(-2, 0, 2), c("#3B4CC0", "white", "#B40426")),
  top_annotation = top_anno,
  right_annotation = right_anno,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  show_row_names = TRUE,
  show_column_names = TRUE,
  row_names_gp = gpar(fontsize = 10),
  column_names_gp = gpar(fontsize = 12),
  column_names_rot = 45,
  border = FALSE,
  # column_title = "Average expression of hERV families",
  # row_title = "hERV family",
  rect_gp = gpar(col = NA)  # 去掉单元格边框
)
pdf(file.path(res_dir, "hERV_family_heatmap.pdf"), width = 15, height = 8)
draw(ht, heatmap_legend_side = "right", annotation_legend_side = "right", padding = unit(c(5, 10, 5, 5), "mm"))
dev.off()

my_GSE157827_5_24


把z-score改成AD-NC的log2FC:即用hERV_family_celltype.csv

de_df <- read_xlsx("C:\\Users\\17185\\Desktop\\hERV_AD_snRNA-seq_analyze\\res\\hERV_DE\\hERV_family_celltype.xlsx")
celltype_order <- c("Astro", "Oligo", "OPC", "Excit", "Inhib", "Mic", "Endo")
de_df <- de_df %>%
  filter(celltype %in% celltype_order) %>%
  select(celltype, subfamily, avg_log2FC)
logfc_df <- de_df %>%
  pivot_wider(
    names_from = celltype,
    values_from = avg_log2FC
  )
logfc_df <- logfc_df[, c("subfamily", celltype_order[celltype_order %in% colnames(logfc_df)])]
logfc_mat <- as.matrix(logfc_df[, -1])
rownames(logfc_mat) <- logfc_df$subfamily
logfc_mat[is.na(logfc_mat)] <- 0
row_sd <- apply(logfc_mat, 1, sd)
row_maxabs <- apply(abs(logfc_mat), 1, max)
stat_df <- data.frame(
  family = rownames(logfc_mat),
  sd_logFC = row_sd,
  max_abs_logFC = row_maxabs,
  stringsAsFactors = FALSE
)
stat_df <- stat_df[order(-stat_df$sd_logFC, -stat_df$max_abs_logFC), ]
top_n <- 40
family_use <- head(stat_df$family, top_n)
plot_mat <- logfc_mat[family_use, , drop = FALSE]
plot_mat_z <- t(scale(t(plot_mat)))
plot_mat_z[!is.finite(plot_mat_z)] <- 0
peak_col <- max.col(plot_mat_z, ties.method = "first")
peak_val <- apply(plot_mat_z, 1, max)
ord <- order(peak_col, -peak_val)
plot_mat <- plot_mat[ord, , drop = FALSE]
family_anno <- read.csv("C:\\Users\\17185\\Desktop\\hERV_AD_snRNA-seq_analyze\\metadata\\hERV\\hERV_locus2family.csv")
family_anno$subfamily_raw <- gsub("_", "-", family_anno$subfamily_raw)
row_class <- family_anno$class[match(rownames(plot_mat), family_anno$subfamily_raw)]
row_class[is.na(row_class)] <- "Unknown"
class_cols <- c(
  "ERV1" = "#E6E6E6",
  "ERVK" = "#969696",
  "ERVL" = "#252525",
  "Unknown" = "#BDBDBD"
)
right_anno <- rowAnnotation(
  Class = row_class,
  col = list(Class = class_cols),
  show_annotation_name = TRUE,
  annotation_name_gp = gpar(fontsize = 10),
  simple_anno_size = unit(2, "mm")
)
celltype_cols <- setNames(hcl.colors(length(celltype_order), "Set 2"), celltype_order)
top_anno <- HeatmapAnnotation(
  Celltype = factor(colnames(plot_mat), levels = celltype_order),
  col = list(Celltype = celltype_cols),
  annotation_legend_param = list(
    Celltype = list(at = celltype_order)
  ),
  annotation_name_gp = gpar(fontsize = 8),
  simple_anno_size = unit(2, "mm")
)
lim <- quantile(abs(plot_mat), 0.98, na.rm = TRUE)
lim <- max(lim, 0.2)
ht <- Heatmap(
  plot_mat,
  name = "log2FC",
  col = colorRamp2(c(-lim, 0, lim), c("#3B4CC0", "white", "#B40426")),
  top_annotation = top_anno,
  right_annotation = right_anno,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  show_row_names = TRUE,
  show_column_names = TRUE,
  row_names_gp = gpar(fontsize = 10),
  column_names_gp = gpar(fontsize = 12),
  column_names_rot = 45,
  border = FALSE,
  column_title = "logFC of hERV families All",
  # row_title = "hERV family",
  rect_gp = gpar(col = NA),
  na_col = "grey90",
)

my_GSE157827_5_25

组合图

my_heatmap <- function(seu, de_res_path, herv_map_path = "/public/home/GENE_proc/wth/metadata/hERV/hERV_locus2family.csv"){
  dataset <- identity_seu(seu, "dataset")
  mat <- GetAssayData(seu, assay = "HERV_family", layer = "data")
  celltype_order <- c("Astro", "Oligo", "OPC", "Excit", "Inhib", "Mic", "Endo")
  group_order <- c("AD", "NC")
  meta <- seu@meta.data[, c("celltype", "group"), drop = FALSE]
  mat <- mat[, rownames(meta), drop = FALSE]
  meta$celltype <- factor(as.character(meta$celltype), levels = celltype_order)
  meta$group <- factor(as.character(meta$group), levels = group_order)
  mat <- mat[, rownames(meta), drop = FALSE]
  meta$cellgroup <- paste(meta$celltype, meta$group, sep = "_")
  col_order <- expand.grid(
    celltype = celltype_order,
    group = group_order,
    stringsAsFactors = FALSE
  )
  col_order$cellgroup <- paste(col_order$celltype, col_order$group, sep = "_")
  col_order <- col_order[order(
    match(col_order$celltype, celltype_order),
    match(col_order$group, group_order)
  ), , drop = FALSE]
  col_order <- col_order[col_order$cellgroup %in% meta$cellgroup, , drop = FALSE]
  avg_list <- lapply(col_order$cellgroup, function(x) {
    cells <- rownames(meta)[meta$cellgroup == x]
    Matrix::rowMeans(mat[, cells, drop = FALSE])
  })
  avg_mat <- do.call(cbind, avg_list)
  colnames(avg_mat) <- col_order$cellgroup
  rownames(avg_mat) <- rownames(mat)
  row_mean <- rowMeans(avg_mat)
  row_sd <- apply(avg_mat, 1, sd)
  min_mean <- 0.05
  top_n <- 40
  keep <- row_mean >= min_mean
  stat_df <- data.frame(
    family = rownames(avg_mat),
    mean_expr = row_mean,
    sd_expr = row_sd,
    stringsAsFactors = FALSE
  )
  stat_df <- stat_df[keep, ]
  stat_df <- stat_df[order(-stat_df$sd_expr, -stat_df$mean_expr), ]
  family_use <- head(stat_df$family, top_n)
  plot_mat <- avg_mat[family_use, , drop = FALSE]
  plot_mat_z <- t(scale(t(as.matrix(plot_mat))))
  plot_mat_z <- plot_mat_z[apply(plot_mat_z, 1, function(x) all(is.finite(x))), , drop = FALSE]
  peak_col <- max.col(plot_mat_z, ties.method = "first")
  peak_val <- apply(plot_mat_z, 1, max)
  ord <- order(peak_col, -peak_val)
  plot_mat_z <- plot_mat_z[ord, , drop = FALSE]
  family_anno <- read.csv(herv_map_path)
  family_anno$subfamily_raw <- gsub("_", "-", family_anno$subfamily_raw)
  row_class <- family_anno$class[match(rownames(plot_mat_z), family_anno$subfamily_raw)]
  row_class[is.na(row_class)] <- "Unknown"
  class_cols <- c(
    "ERV1" = "#E6E6E6",
    "ERVK" = "#969696",
    "ERVL" = "#252525",
    "Unknown" = "#BDBDBD"
  )
  right_anno <- rowAnnotation(
    Class = row_class,
    col = list(Class = class_cols),
    show_annotation_name = TRUE,
    annotation_name_gp = gpar(fontsize = 0),
    simple_anno_size = unit(2, "mm")
  )
  celltype_cols <- setNames(hcl.colors(length(celltype_order), "Set 2"), celltype_order)
  group_cols <- c("NC" = "#2B6CB0", "AD" = "#B22222")
  col_order$celltype <- factor(col_order$celltype, levels = celltype_order)
  col_order$group <- factor(col_order$group, levels = group_order)
  top_anno <- HeatmapAnnotation(
    Celltype = col_order$celltype,
    Group = col_order$group,
    col = list(
      Celltype = celltype_cols,
      Group = group_cols
    ),
    annotation_legend_param = list(
      Celltype = list(
        at = celltype_order
      ),
      Group = list(
        at = group_order
      )
    ),
    annotation_name_gp = gpar(fontsize = 0),
    simple_anno_size = unit(2, "mm")
  )
  up <- Heatmap(
    plot_mat_z,
    name = "Z-score",
    col = colorRamp2(c(-2, 0, 2), c("#3B4CC0", "white", "#B40426")),
    top_annotation = top_anno,
    right_annotation = right_anno,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    show_row_names = TRUE,
    show_column_names = FALSE,
    row_names_gp = gpar(fontsize = 10),
    column_names_rot = 45,
    border = FALSE,
    column_title = paste0("Average expression of hERV families - ", dataset),
    column_title_gp = gpar(fontsize = 16),
    rect_gp = gpar(col = NA)
  )
  de_df <- read.csv(de_res_path)
  de_df <- de_df %>%
    filter(celltype %in% celltype_order) %>%
    select(celltype, subfamily, avg_log2FC)
  logfc_df <- de_df %>%
    pivot_wider(
      names_from = celltype,
      values_from = avg_log2FC
    )
  logfc_df <- logfc_df[, c("subfamily", celltype_order[celltype_order %in% colnames(logfc_df)])]
  logfc_mat <- as.matrix(logfc_df[, -1])
  rownames(logfc_mat) <- logfc_df$subfamily
  row_sd <- apply(logfc_mat, 1, sd)
  row_maxabs <- apply(abs(logfc_mat), 1, max)
  stat_df <- data.frame(
    family = rownames(logfc_mat),
    sd_logFC = row_sd,
    max_abs_logFC = row_maxabs,
    stringsAsFactors = FALSE
  )
  stat_df <- stat_df[order(-stat_df$sd_logFC, -stat_df$max_abs_logFC), ]
  top_n <- 40
  family_use <- head(stat_df$family, top_n)
  plot_mat <- logfc_mat[family_use, , drop = FALSE]
  plot_mat_z <- t(scale(t(plot_mat)))
  plot_mat_z[!is.finite(plot_mat_z)] <- 0
  peak_col <- max.col(plot_mat_z, ties.method = "first")
  peak_val <- apply(plot_mat_z, 1, max)
  ord <- order(peak_col, -peak_val)
  plot_mat <- plot_mat[ord, , drop = FALSE]
  row_class <- family_anno$class[match(rownames(plot_mat), family_anno$subfamily_raw)]
  row_class[is.na(row_class)] <- "Unknown"
  right_anno <- rowAnnotation(
    Class = row_class,
    col = list(Class = class_cols),
    show_annotation_name = TRUE,
    annotation_name_gp = gpar(fontsize = 0),
    simple_anno_size = unit(2, "mm")
  )
  celltype_cols <- setNames(hcl.colors(length(celltype_order), "Set 2"), celltype_order)
  top_anno <- HeatmapAnnotation(
    Celltype = factor(colnames(plot_mat), levels = celltype_order),
    col = list(Celltype = celltype_cols),
    annotation_legend_param = list(
      Celltype = list(at = celltype_order)
    ),
    annotation_name_gp = gpar(fontsize = 0),
    simple_anno_size = unit(2, "mm")
  )
  lim <- quantile(abs(plot_mat), 0.98, na.rm = TRUE)
  lim <- max(lim, 0.2)
  down <- Heatmap(
    plot_mat,
    name = "log2FC",
    col = colorRamp2(c(-lim, 0, lim), c("#3B4CC0", "white", "#B40426")),
    top_annotation = top_anno,
    right_annotation = right_anno,
    cluster_rows = FALSE,
    cluster_columns = FALSE,
    show_column_names = FALSE,
    show_row_names = TRUE,
    row_names_gp = gpar(fontsize = 10),
    border = FALSE,
    column_title = paste0("logFC of hERV families - ", dataset),
    column_title_gp = gpar(fontsize = 16),
    rect_gp = gpar(col = NA),
    na_col = "grey90",
  )
  up <- as.ggplot(up)
  down <- as.ggplot(down)
  pdf(file.path(res_dir, paste0("hERV_family_heatmap_", dataset, ".pdf")), width = 10, height = 12)
  print(up / down)
  dev.off()
}

其它问题

dotplot表达量负数

毕设正文和图片5

Average Expression:经过标准化后的平均表达量。DotPlot默认会对每个基因在不同细胞群之间的平均表达做缩放,缩放之后,数值表示的是这个基因在某个群里的表达,相对于该基因在所有群中的整体水平是高还是低:

  • 大于0:这个群里该基因表达高于该基因的整体平均水平
  • 小于0:这个群里该基因表达低于该基因的整体平均水平
  • 0附近:接近平均水平

如果在DotPlot设置参数scale = FALSE,就是原始平均表达量

测序方法对hERV计数的影响

使用的数据的建库方法是10x Chromium Single Cell 3′ v3——“用poly(dT)引物捕获带poly(A)尾的转录本”,可能会低估或者漏掉不带polyA的ncRNA(非编码RNA),同时TE里面有相当一部分是non-polyA的,所以在hERV计数上大概率会偏低。不过我这个分析主要是AD-NC/不同细胞类型之间的组间比较,或者是共表达这种定性的找趋势,所以在这类不需要精确计数的分析上应该影响不大。当然也不可能说完全没有影响:

  • 之前别人的bulk RNA-seq分析用的是total RNA的测序方法(而不是我这种polyA+的测序方法),这可能是结论有差异的原因之一
  • 我这里能被polyA+的测序方法测出来的hERV读段,可能是hERV中有一定特殊性的片段,比如是剪接到了正常gene中的、或是借用了周围的polyA

hERV被基因内化为外显子或是启动子/增强子对测序的影响

  • 内化为外显子:更偏向polyA+,因为这时它本质上已经成了宿主基因转录本的一部分,后续剪接、加polyA、输出等过程大多按宿主mRNA或某些lncRNA的规则来走
  • 内化为替代启动子:也常常更偏向polyA+,因为很多情况下是 LTR 驱动了一个宿主基因的替代转录本,这个转录本后面接宿主外显子,最后通常会形成比较稳定的加polyA尾的RNA
  • 内化为增强子:更偏向non-polyA,因为增强子大多数不产生稳定RNA

古老型hERV更容易被基因内化为外显子或是启动子/增强子?

  • 总体上,较老的TE/ERV更容易在进化中被宿主共选为调控元件。比如hERVK(HML-2)属于hERV中较年轻、最接近仍保留病毒样转录潜力的家族之一,而hERVL/ERVL就属于更古老的谱系,但在同一家族内部,也会有一些特殊的位点(比如hERVH就有一些年轻的高转录位点)
  • hERV的年龄更可能改变“它以什么形式被转录出来”,而不是直接规定它一定是polyA还是non-polyA
    • 古老型如果更多作为外显子/启动子(被内化进宿主基因转录本),polyA+比例会上升
    • 古老型如果更多作为增强子产生调控RNA,non-polyA比例会上升
    • 年轻型如果更多保持自主病毒样转录,polyA+也可能很高
  • polyA测序更容易检测到
    • 嵌在宿主polyA转录本里的hERV片段
    • 本身能形成polyA尾的hERV转录本
  • 从热图上看,也是既有ERVL/HERVL,也有hERVK这类,似乎并没有说高表达的更偏向年轻/古老

Human hippocampal neurogenesis

Human hippocampal neurogenesis in adulthood, ageing and Alzheimer’s disease

主要结论:成人海马神经发生是存在的;AD不只是“新生神经元少了”,更像是从前体细胞开始,整条神经发生轴线的表观调控网络逐步失稳;而认知保持得特别好的人,保留了一套与这种失稳相反的“韧性网络”

  • 人类成年海马确实存在一条可识别的神经发生轨迹:用snRNA-seq和snATAC-seq联合分析,识别出了NSCs(神经干细胞)、neuroblasts(神经母细胞)、immature neurons(未成熟神经元),并且通过RNA velocity和调控网络分析支持它们构成一条从干/祖细胞到未成熟神经元的连续轨迹
  • 随着衰老和AD进展,最稳定、最明显的变化不是RNA,而是染色质开放性:与DEGs相比,DARs(差异染色质开放区域)的变化更多、更强,也更能区分正常衰老、临床前阶段和AD。尤其在PCI(可理解为向AD过渡的临床前病理状态)里,一些和突触可塑性、神经传递、神经元结构维护有关的开放染色质区域已经开始下调,而这些变化在AD中更严重
  • 最早的异常更可能起始于NSC,后面在neuroblast和immature neuron里放大
  • SuperAgers(在情景记忆测试中表现出色的老年人)有更独特的neurogenesis profile,尤其是在immature neurons/neuroblasts上更明显,而且这种“resilience signature”主要也是体现在开放染色质和调控网络上,这种韧性与海马内astrocytes、CA1 neurons、谷氨酸能突触通路之间的相互作用有联系,“认知功能未明显衰退”和“认知衰退”分岔点可能就在这些细胞间调控网络上

TE和免疫系统

Transposable elements as instructors of the immune system

主要结论:TE在多个层面作为免疫系统行为的调控元件

  • TE被转录后产生RNA、cDNA、蛋白或肽段,这些产物能被先天免疫识别,或者被MHC呈递给T细胞
  • TE还能作为顺式调控元件,重塑某些免疫细胞的基因表达程序,或充当某些免疫细胞状态转换和组织适应程序中的增强子:免疫系统可能反复利用TE作为调控方式
    • TE转录出来的RNA或RNA-DNA产物可以激活RIG-I、MDA5、cGAS–STING等先天免疫通路,从而诱导I型干扰素反应
    • 在T细胞里,某些含LINE-1的转录本和T细胞的活化、分化、耗竭状态有关
    • 在胸腺里,TE来源的转录本和肽段甚至可能参与中枢耐受,防止免疫系统攻击自身正常细胞
    • 某些ERV类TE影响辅助性T细胞分化
    • 在不同组织免疫细胞的scATAC-seq里找到了一批与组织适应有关的TE家族/亚家族,发现这些TE所在开放染色质里富集了BATF、JUNB等关键转录因子结合位点
  • 未来可能用于癌症免疫治疗,也可能用于炎症和自身免疫病的干预

TE增强子调控网络如何计算:

  • 先找候选增强子:用ATAC-seq或scATAC-seq找开放染色质区域
  • 把开放峰和TE注释重叠:把ATAC峰和RepeatMasker之类的TE注释表相交,筛出“落在TE上的开放峰”,这些就是候选的TE-derived enhancers。然后再统计哪些TE家族/亚家族在某个细胞状态中显著富集
  • motif enrichment推断谁在用这些TE:如果某些开放的TE峰里富集BATF、JUNB、TH1相关TF(转录因子)的motif(序列中的特定模式),就提示这些TF可能通过这些TE位点来驱动该细胞程序
  • 把峰连到基因:做peak-gene correlation、co-accessibility,或者用同一细胞的RNA+ATAC多组学数据去关联“这个开放峰变化”和“哪个基因表达变化同步”
  • 把这些关系拼成网络,并做实验验证

JETs(外显子和TE序列拼接形成的非经典转录本)在肿瘤里已经显示出很有意思的抗原潜力:一些肿瘤特异的JETs可以被翻译成蛋白,并作为肽段经MHC-I呈递,还能诱导T细胞反应。虽然他还没有被证实为神经系统疾病标志物,不过潜力很大