The microbiome GWAS was performed on ES-Ana relative abundance in the discovery cohort

最后发布时间 : 2026-06-12 17:01:18 浏览量 :

The microbiome GWAS was performed on ES-Ana relative abundance in the discovery cohort by PLINK [145], via the linear regression model adjusting for the first five genetic PCs (PC1 ~ PC5), age and gender. A total of 207 SNPs were significantly associated with the ES-Ana relative abundance under the suggestive significance level of
. Following linkage disequilibrium (LD)–based clumping (LD
< 0.1), we finally identified 41 lead variants. Finally, the Manhattan plot was created by the CMplot R package, and the regional association plot of a single SNP with ES-Ana relative abundance was visualized by LocusZoom 1.4 [158].

通过线性回归模型，PLINK[145]对发现队列中的ES-Ana相对丰度进行了免疫系统GWAS分析，并调整了前五个遗传PC（PC1 ~ PC5）、年龄和性别。在暗示显著水平下，共有207个SNP与ES-Ana相对丰度显著相关 10 − 5 .在基于链联不平衡（LD）的聚团（LD）之后 𝑟 2 < 0.1），我们最终确定了41个先导变异。最后，CMplot R包创建了曼哈顿图，LocusZoom 1.4可视化了一个具有ES-Ana相对丰度的单一SNP的区域关联图[158]。

这段论文描述的是一个微生物组 GWAS（microbiome GWAS），研究的是 ES-Ana（某个菌或功能的相对丰度）与宿主 SNP 的关联。它使用的是 线性回归（linear regression），而不是病例-对照 GWAS 常用的 logistic 回归。

1. 回归模型的数学形式

论文中说：

adjusting for the first five genetic PCs (PC1~PC5), age and gender

那么对于每一个 SNP，都建立一个独立的线性模型：

$\boxed{ Y_i=\beta_0+\beta_1G_i+\beta_2Age_i+\beta_3Gender_i+\beta_4PC1_i+\cdots+\beta_8PC5_i+\varepsilon_i }$

其中：

(Y_i)：第 (i) 个样本的 ES-Ana relative abundance（菌群相对丰度）
(G_i)：第 (i) 个样本该 SNP 的基因型编码
Age：年龄
Gender：性别
PC1～PC5：PCA 得到的人群结构校正变量
(\varepsilon_i)：误差项

真正检验的是：

$H_0:\beta_1=0$

即：

这个 SNP 是否与 ES-Ana 丰度有关。

2. SNP 是怎么编码的？

对于每个 SNP，例如

rs572688
Allele1 = G
Allele2 = A

通常按加性模型（Additive model）编码：

基因型	G 等位基因数	编码
AA	0	0
AG	1	1
GG	2	2

因此：

G_i ∈ {0,1,2}

每增加一个 G 等位基因，就增加 β 个单位的 ES-Ana 丰度。

3. 举一个具体例子

假设有 5 个样本：

Sample	ES-Ana	rs572688	Age	Sex	PC1
S1	0.15	AA	30	M	0.01
S2	0.21	AG	28	F	-0.03
S3	0.30	GG	45	M	0.02
S4	0.17	AA	50	F	-0.01
S5	0.26	AG	35	M	0.00

编码后：

Sample	Y	SNP
S1	0.15	0
S2	0.21	1
S3	0.30	2
S4	0.17	0
S5	0.26	1

拟合：

$Y=\beta_0+\beta_1SNP+\beta_2Age+\beta_3Sex+\beta_4PC1+\cdots$

如果得到

β1 = 0.08
P = 1e-6

表示：

每增加一个 G 等位基因，ES-Ana 平均增加 0.08。

4. 为什么要加 PC1～PC5？

这是为了控制群体分层（Population Stratification）。

例如：

北方人 ────── SNP A 多
         │
         └──── ES-Ana 也高

南方人 ────── SNP A 少
         │
         └──── ES-Ana 也低

如果不校正 PCA，就可能误认为：

SNP A → ES-Ana

实际上只是：

Population → SNP
Population → ES-Ana

因此加入：

PC1
PC2
PC3
PC4
PC5

作为协变量消除这种混杂。

5. 为什么加 Age 和 Gender？

因为菌群容易受到这些因素影响。

例如：

Age ─────► ES-Ana
Gender ─► ES-Ana

如果不调整：

可能把年龄效应错误归因到 SNP。

因此模型变成：

ES-Ana
   ▲
   │
SNP  Age  Gender  PC1...PC5

共同解释 ES-Ana。

6. 在 PLINK 中如何实现？

假设：

geno.bed
geno.bim
geno.fam

菌群丰度文件：

pheno.txt

FID IID ES_Ana
1   S1  0.15
1   S2  0.21
1   S3  0.30
...

协变量：

covar.txt

FID IID Age Sex PC1 PC2 PC3 PC4 PC5
1 S1 30 1 0.01 ...
1 S2 28 2 -0.03 ...

运行：

plink2 \
  --bfile geno \
  --pheno pheno.txt \
  --pheno-name ES_Ana \
  --covar covar.txt \
  --covar-name Age,Sex,PC1,PC2,PC3,PC4,PC5 \
  --glm hide-covar \
  --out ESAna_GWAS

PLINK 会自动：

for SNP in 全部 SNP:
    拟合
    ES_Ana ~ SNP + Age + Sex + PC1 + PC2 + PC3 + PC4 + PC5

输出每个 SNP 的：

β
SE
P

7. 为什么会得到 207 个 SNP，再变成 41 个 lead variants？

假设 chr1 上有：

rs1001
rs1002
rs1003
rs1004

由于它们处于同一个 LD block：

r² = 0.95

实际上代表的是同一遗传信号。

论文先筛出：

P < suggestive threshold

得到：

207 SNP

然后做 LD clumping：

r² < 0.1

保留每个 LD 区域最显著的一个 SNP：

207
 ↓
41 lead SNPs

因此 41 个 lead variants 代表 41 个相对独立的遗传位点，而不是只有 41 个显著 SNP。

整个流程可以概括为

宿主基因型（0/1/2）
          │
          ▼
   每个 SNP 单独建模
          │
          ▼
ES-Ana ~ SNP + Age + Gender + PC1 + PC2 + PC3 + PC4 + PC5
          │
          ▼
      得到 β、SE、P
          │
          ▼
  按 P 值筛选显著 SNP（207 个）
          │
          ▼
      LD Clumping（r² < 0.1）
          │
          ▼
      得到 41 个 Lead SNP

这实际上就是 Microbiome GWAS（mbGWAS） 的标准分析流程：**把菌群丰度作为表型（phenotype），把每个 SNP 作为自变量，逐个位点进行线性回归分析，并校正年龄、性别和 PCA 等协变量。**如果 ES-Ana 是相对丰度数据，还应注意其分布是否满足线性模型假设；很多研究会先进行对数变换、秩正态化（rank-based inverse normal transformation）或其他适当转换，以减少偏态分布对回归结果的影响。

public GWAS summary statistics for AD