Flujo de trabajo RNA-seq - HackMD (2024)

# **Flujo de trabajo RNA-Seq****Análisis transcriptómico de la almeja mano de león (*Nodipecten subnodosus*) bajo condiciones térmicas oscilatorias en dos localidades de la península de Baja California**![image](https://hackmd.io/_uploads/rJyoFWJO0.png)> [name=Giselle Moreno López]------## **I. Preprocesamiento de los datos**Lo primero que se realiza es una evaluación de la calidad de las lecturas. Para ello, se utiliza el software *FastQC*. Se puede utilizar a *MultiQC* para ver las métricas de todas las lecturas, sin embargo, es necesario ver una a una por si *MultiQC* sesga los gráficos.**1. Primera evaluación en FastQC**> Archivos de entrada:> - [ ] Lecturas crudas en formato fq, se recomienda que estén comprimidas (generalmente tienen un nombre largo, como N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz)```#!/bin/bash #SBATCH -p cicese #SBATCH --job-name=fastqc #SBATCH --output=fastqc-%j.log #SBATCH --error=fastqc-%j.err #SBATCH -N 1 #SBATCH --ntasks-per-node=8 #SBATCH -t 6-00:00:00export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/:$PATH module load gcc-7.2.0 fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Paulina/mano_leon/experimental/Giselle/trimmed/ph30/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/trimmomatic/trimm2/multiq```> > Archivos de salida:> - [ ] Archivos .html con las métricas de cada archivo> - [ ] Archivos .zip con las imágenes y métricas> - [ ] Archivos .log y .err con información de la tarea enviada**2.** **Visualización de los datos en MultiQC***MultiQC* es un visualizador de gráficos que nos permite ver múltiples lecturas a la vez. Nos da un panorama general de como se ven las lecturas.> Archivos de entrada:> - [ ] Archivos .zip con las imágenes y métricas```#!/bin/bash#SBATCH -p cicese#SBATCH --job-name=multiqc#SBATCH --output=mul-%j.log#SBATCH --error=mul-%j.err#SBATCH -N 1#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24#SBATCH -t 06-00:00:00export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATHsource activate qiime2-2023.2multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip-o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export```> Archivos de salida:> - [ ] Archivo .html con todas las métricas de las lecturas> - [ ] Carpeta multiqc_data con la información de cada parámetro (calidad, sobrerepresentación, %GC, etc.) en formato JSON> - [ ] Carpeta multiqc_plots con todos los gráficos en pdf, png y svg> - [ ] Archivos .log y .err con información de la tarea enviada**3. Recorte de las lecturas en Trimmomatic**Una vez se deciden cuáles serán los parámetros, se realiza el recorte de los índices y adaptadores y se eliminan lecturas de baja calidad (Phred score <30) en *Trimmomatic*. En mi caso, agregué todos los adaptadores para que se eliminarán de mis lecturas.> Archivos de entrada:> - [ ] Lecturas crudas en formato .fq.gz```#!/bin/bash##########################################################TRIMOMMATIC## Directivas#SBATCH --job-name=trimming#SBATCH --output=trimmomatic-%j.log#SBATCH --error=trimmomatic-%j.err#SBATCH -N 2#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24#SBATCH -t 6-00:00:00#SBATCH -p cicese# Ruta a TRIMMOMATICTRIMMOMATIC=/LUSTRE/apps/bioinformatica/Trimmomatic-0.39#Ruta al archivo con los adaptadorestrueseq="$TRIMMOMATIC/adapters/TruSeq3-PE-2.fa"trueseq="$TRIMMOMATIC/adapters/TruSeq3-PE.fa"trueseq="$TRIMMOMATIC/adapters/TruSeq2-PE.fa"trueseq="$TRIMMOMATIC/adapters/NexteraPE-PE.fa"trueseq="$TRIMMOMATIC/adapters/TruSeq2-SE.fa"trueseq="$TRIMMOMATIC/adapters/TruSeq3-SE.fa"cd ${SLURM_SUBMIT_DIR}java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_1_CKDL230014573-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_1.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_2_1_CKDL230014579-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_2.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_3_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_3.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_4_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_4.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_5_CKDL230014574-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_5.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_6_CKDL230014580-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_6.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_7_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_7.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_8_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_8.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_9_CKDL230014575-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_9.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_10_2_CKDL230020994-1A_HJLJKBBXX_L3_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_10.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_11_1_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_11.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36java -jar $TRIMMOMATIC/trimmomatic-0.39.jar PE /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_1.fq.gz /LUSTRE/bioinformatica_data/genomica_funcional/Raw/N_subnodosus/experimental/N_12_CKDL230014589-1A_HJLMVCCX2_L5_2.fq.gz -baseout /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/N_12.fq.gz ILLUMINACLIP:$trueseq:2:30:10:8:true SLIDINGWINDOW:4:15 CROP:100 HEADCROP:15 MINLEN:36exit 0```> Archivos de salida:> - [ ] Archivo .fq.gz con un nuevo nombre> - [ ] Archivos .log y .err con información de la tarea enviadaUna vez que se realiza el recorte de las lecturas, podemos realizar de nuevo la evaluación de la calidad en FastQC y MultiQC. **FastQC**> Archivos de entrada:> - [ ] Lecturas preprocesadas en formato .fq.gz```#!/bin/bash #SBATCH -p cicese #SBATCH --job-name=fastqc #SBATCH --output=fastqc-%j.log #SBATCH --error=fastqc-%j.err #SBATCH -N 2 #SBATCH --ntasks-per-node=24 #SBATCH -t 6-00:00:00export PATH=$PATH:/LUSTRE/apps/bioinformatica/FastQC_v0.11.7/:$PATH module load gcc-7.2.0 fastqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/*.fq.gz -t 8 -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq```> Archivos de salida:> - [ ] Archivos .html con las métricas de cada archivo> - [ ] Archivos .zip con las imágenes y métricas> - [ ] Archivos .log y .err con información de la tarea enviada **MultiQC**> Archivos de entrada:> - [ ] Archivos .zip con las imágenes y métricas```#!/bin/bash#SBATCH -p cicese#SBATCH --job-name=multifqc #SBATCH --output=mul-%j.log #SBATCH --error=mul-%j.err #SBATCH -N 4#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24 #SBATCH -t 06-00:00:00export PATH=/LUSTRE/apps/Anaconda/2023/miniconda3/bin:$PATH source activate qiime2-2023.2multiqc /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq/*zip -o /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/trimmomatic/trimm2/multiq --data-format json --export```> Archivos de salida:> - [ ] Archivo .html con todas las métricas de las lecturas> - [ ] Carpeta multiqc_data con la información de cada parámetro (calidad, sobrerepresentación, %GC, etc.) en formato JSON> - [ ] Carpeta multiqc_plots con todos los gráficos en pdf, png y svg> - [ ] Archivos .log y .err con información de la tarea enviada---## **II. Ensamble *de novo***Ahora que sabemos que nuestras lecturas están libres de adaptadores, índices y de secuencias de baja calidad, procederemos a realizar el ensamble de novo. Utilizaremos el software Trinity.Trinity es un paquete de programas cuya metodología permite la reconstrucción de transcriptomas basados en datos de RNA-Seq y reporta transcritos ensamblados con sus posibles isoformas. El ensamble de Trinity está basado principalmente en tres módulos: Inchworm, Chrysalis y Butterfly.**1. Ensamblado *de novo***Antes de realizar el ensamble, es necesario que todas las lecturas sean descomprimidas. Se puede utilizar el siguiente comando:```gunzip *.gz```Una vez que se descomprimieron todas las lecturas, se deben comprimir las 1P y 2P en archivos diferentes, usando el siguiente comando:```cat *1P.fq | gzip > R1.fastq.gzcat *2P.fq | gzip > R2.fastq.gz```Una vez hecho esto, podemos iniciar. > Archivos de entrada:> - [ ] Lecturas en formato .fastq.gz```#!/bin/bash# Directivas#SBATCH --job-name=Tri-denovo # nombre del trabajo que aparecera#SBATCH --output=Tri-%j.log # salida estandar num de trabajo#SBATCH --error=Tri-%j.err # salida estandar de error#SBATCH --nodes 2 # número de nodos#SBATCH --mem=100GB # RAM 1GB por cada millon de reads#SBATCH --ntasks-per-node=24 # número de tareas por nodo#SBATCH -t 6-00:00:00 # número maximo#SBATCH -p cicese #cola#export some softexport PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/module load conda-2023module load trinityrnaseq-v2.15.1#my varsR1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R1.fastq.gz"R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamblev3/trimm2/R2.fastq.gz"OUTDIR="Trinity_ensamble"LIBT="RF"Trinity --no_version_check --seqType fq --max_memory 100G \--no_bowtie \--CPU 24 --left $R1 --right $R2 \--normalize_reads \--output $OUTDIR#--full_cleanup"# see the time spentSTART=`date +%s`# Run trinity###$TRINITY $PARMEND=`date +%s`RUNTIME=$(($END-$START))echo "run time -> $RUNTIME"```> Archivos de salida:> - [ ] Archivo Trinity_ensamble.Trinity.fasta> - [ ] Archivo Trinity_ensamble.Trinity.fasta.gene_trans_map> - [ ] Carpeta Trinity_ensamble con la información de cada módulo> - [ ] Archivos .log y .err con información de la tarea enviada**Evaluación rápida del ensamble**Mediante Trinity, se puede hacer un análisis rápido de las métricas del ensamble (promedio del contig, número de genes e isoformas, N50, etc.). > Archivos de entrada:> - [ ] Archivo Trinity_ensamble.Trinity.fasta```#!/bin/bash# Directivas#SBATCH --job-name=Trin-est # nombre del trabajo que aparecera#SBATCH --output=stats-%j.log # salida estandar num de trabajo#SBATCH --error=stats-%j.err # salida estandar de error#SBATCH --nodes 2 # número de nodos#SBATCH --mem=100GB ##SBATCH --ntasks-per-node=24 # número de tareas por nodo#SBATCH -t 6-00:00:00 # número maximo#SBATCH -p cicese #cola#SBATCH --exclusivecd $SLURM_SUBMIT_DIR#export PATH=$PATH:/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/cd /LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble_bueno/ensamblev3time perl /LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/util/TrinityStats.pl Trinity_ensamble.Trinity.fastaexit 0 ```> Archivos de salida:> - [ ] Archivos .log y .err con información de la tarea enviada**2. Evaluación del ensamble**Transrate es un software que evalua la calidad del ensamble > Archivos de entrada:> - [ ] Archivo Trinity_ensamble.Trinity.fasta> - [ ] Archivos R1.fastq.gz y R2.fastq.gz```#!/bin/bash#SBATCH -p cicese#SBATCH --job-name=tst-rvm#SBATCH --output=ruby-%j.log#SBATCH --error=ruby-%j.err#SBATCH -N 2#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24#SBATCH -t 06-00:00:00##se deben concatenar todas las trimm_p forward en un archivo y las reverse en otro## correr este script dentro de la carpeta TRASNRATETRANSRATE=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/binENSAMBLE="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/Trinity_ensamble.Trinity.fasta"R1="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R1.fastq.gz"R2="/LUSTRE/bioinformatica_data/genomica_funcional/Gis/ensamble/ensamble_v2/Filtros/filtradas/seq_cal/R2.fastq.gz"cd $SLURM_SUBMIT_DIRexport PATH=/LUSTRE/apps/bioinformatica/.local/bin:$PATHexport PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/bin:$PATHexport LD_LIBRARY_PATH=/LUSTRE/apps/bioinformatica/ruby2/ruby-2.2.0/lib:$LD_LIBRARY_PATH$TRANSRATE/transrate \--assembly $ENSAMBLE \--left $R1 \--right $R2 \--threads 8 \--output gcontigs ```> Archivos de salida:> - [ ] Carpeta gcontigs> > - [ ] Carpeta Trinity_ensamble.Trinity con los conteos buenos y malos de los transcritos> > - [ ] Archivo assemblies.csv> - [ ] Archivos .log y .err con información de la tarea enviada---## **III. Alineamiento y cuantificación**Una vez que nuestro ensamble tiene buenas métricas, podemos hacer la alineación, que consiste en alinear las secuencias cortas de nuestras lecturas al transcriptoma ensamblado, además de contar el número de secuencias cortas que se alinean a un determinado contig. Mediante RSEM y Bowtie2 incluidos en Trinity podemos realizar este paso.**1. Alineamiento**RSEM alinea las lecturas al transcriptoma de referencia utilizando Bowtie2.Antes de realizar el alineamiento, es necesario generar una matriz que contenga las lecturas y tratamientos, como en el siguiente ejemplo:```Basal_BLABasal_BLA_1N_1_1P.fqN_1_2P.fqBasal_BLABasal_BLA_2N_2_1P.fqN_2_2P.fqBasal_BLABasal_BLA_3N_3_1P.fqN_3_2P.fqBasal_LOLBasal_LOL_1N_4_1P.fqN_4_1P.fqBasal_LOLBasal_LOL_2N_5_1P.fqN_5_2P.fqBasal_LOLBasal_LOL_3N_6_1P.fqN_6_2P.fqOsc_reg_BLAOsc_reg_BLA_1N_7_1P.fqN_7_2P.fqOsc_reg_BLAOsc_reg_BLA_2N_8_1P.fqN_8_2P.fqOsc_reg_BLAOsc_reg_BLA_3N_9_1P.fqN_9_2P.fqOsc_reg_LOLOsc_reg_LOL_1N_10_1P.fqN_10_2P.fqOsc_reg_LOLOsc_reg_LOL_2N_11_1P.fqN_11_2P.fqOsc_reg_LOLOsc_reg_LOL_3N_12_1P.fqN_12_2P.fq```> Archivos de entrada:> - [ ] Archivo samples.txt> - [ ] Archivo good.Trinity_ensamble.Trinity.fasta proveniente de Transrate```#!/bin/bash# Directivas#SBATCH --job-name=Trin-cuantif#SBATCH --output=cuantif-%j.log#SBATCH --error=cuantif-%j.err#SBATCH --nodes 4#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24#SBATCH -t 6-00:00:00#SBATCH --exclusive#export some softexport PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/module load conda-2023module load trinityrnaseq-v2.15.1UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/utilSAMPLES=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/cuantificacioncd $SLURM_SUBMIT_DIR$UTIL/align_and_estimate_abundance.pl \--seqType fq \--samples_file samples.txt \--transcripts good.Trinity_ensamble.Trinity.fasta \--est_method RSEM \--aln_method bowtie2 \--trinity_mode \--prep_reference \--output_dir RSEM_outdir \--thread_count=24exit 0```> Archivos de salida:> - [ ] Carpetas de cada condición con los resultados de RSEM (conteos de genes e isoformas)> - [ ] Archivos .log y .err con información de la tarea enviadaUna vez que se realizó el alineamiento y conteo de las lecturas, se procede a realizar la matriz con todos los conteos mediante el siguiente comando (sin slurm):```ls */*isoforms.results > isoforms```> Aquí no es necesario tener algún archivo de entrada, el script por sí solo buscará los archivos de conteo en las carpetas de cada condición o tratamiento. Posteriormente, se utiliza el siguiente script:```#!/bin/bash# Directivas #SBATCH --job-name=abundance_mat#SBATCH --output=abundance_mat-%j.log#SBATCH --error=abundance_mat-%j.err#SBATCH --nodes 2#SBATCH --mem=100GB#SBATCH --ntasks-per-node=24#SBATCH -t 6-00:00:00#SBATCH --exclusive#export some softexport PATH=$PATH:/LUSTRE/apps/bioinformatica/bowtie2/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/samtools-1.17/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/jellyfish-2.3.0/export PATH=$PATH:/LUSTRE/apps/bioinformatica/salmon/bin/export PATH=$PATH:/LUSTRE/apps/bioinformatica/RSEM/bin/module load conda-2023module load trinityrnaseq-v2.15.1UTIL=/LUSTRE/apps/bioinformatica/trinityrnaseq-v2.15.1/utilTRANS=/LUSTRE/bioinformatica_data/genomica_funcional/Gis/exp_dif/ensamble_pgcd $SLURM_SUBMIT_DIR$UTIL/abundance_estimates_to_matrix.pl \--est_method RSEM \--out_prefix count \--quant_files $TRANS/isoforms \--gene_trans_map $TRANS/good.Trinity_ensamble.Trinity.fasta.gene_trans_map \--name_sample_by_basedirexit 0```> Archivos de salida:> - [ ] Matriz de conteos de genes> - [ ] Matriz de conteos de isoformas (es la que utilizaremos para el análisis DEseq2)> - [ ] Archivos .log y .err con información de la tarea enviada---## **IV. Análisis de expresión diferencial génica**Cuando tenemos las matrices de conteo de los genes e isoformas, podemos iniciar nuestro análisis de expresión diferencial génica (DGE). Podemos utilizar DEseq2 o EdgeR en RStudio.### **DGE en DEseq2**El análisis de expresión diferencial con DESeq2 implica varios pasos, donde DESeq2 modelará los recuentos brutos, utilizando factores de normalización (factores de tamaño) para tener en cuenta las diferencias en la profundidad de la biblioteca. Luego, estimará las dispersiones genéticas y reducirá estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos. Finalmente, DESeq2 se ajustará al modelo binomial negativo y realizará pruebas de hipótesis utilizando la prueba de Wald o prueba de ratio de verosimilitud.Es importante que los valores de recuento estén sin normalizar, ya que permiten evaluar correctamente la precisión de la medición. El modelo DESeq2 corrige internamente el tamaño de la biblioteca.> **Conceptos importantes**> * log2FoldChange:> * baseMean (expresión media):> * valor de p ajustado:> * FDR (tasa de descubrimiento falso):> **Manual de DEseq2**: [https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#standard-workflow](https://)### DEseq2Abriremos RStudio, de preferencia debemos de mantener un espacio de trabajo determinado (una carpeta llamada DEseq o DGE) donde tengamos nuestros scripts, matrices y los gráficos que resulten del análisis.Ejemplo de matriz de conteos:![image](https://hackmd.io/_uploads/SJSunoMrT.png)> Archivos de entrada:> - [ ] Script para RStudio (.R)> - [ ] Matriz de conteos de isoformas (.tsv)El siguiente script se ejecuta en RStudio, dando clic directamente en el archivo .R```# DIFFERENTIAL EXPRESSION ANALYSIS# Script modificado de Ricardo Gomez-Reyes# Limpiar la memoria de la sesion de R ====rm(list = ls()) if(!is.null(dev.list())) dev.off()if (!require("BiocManager", quietly = TRUE))install.packages("BiocManager")BiocManager::install("DESeq2")# Configuramos# options(stringsAsFactors = FALSE)# Cargamos nuestras librerias de trabajo ====# sacar herramientas de nuestra caja de herramientasif (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("vsn")#install.packages("tidyverse")library(tidyverse)library(DESeq2)library(vsn).cran_packages <- c('tidyverse', 'DESeq2') .inst <- .cran_packages %in% installed.packages() if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org') } sapply(c(.cran_packages), require, character.only = TRUE)#################################################################################################################################### Establecemos directorio (escritorio) de trabajo ==== path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2" count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)# Cargamos archivos en R ====#library(readr)count <- read_tsv(count_f)colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))rownames(colData) <- colData$Library_IDrownames=TRUE#colData <- colData[-1] #remover la primera columna# Modificamos caracteres ====colNames <- gsub("_L001.isoforms.results", "", names(count))#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#colnames(count) <- c("transcript_id", colNames[-1])query.ids <- count$transcript_idlibrary(dplyr)count <- count %>% select_if(is.double) %>% as(., "matrix")rownames(count) <- query.ids# 1) Filter data by removing low-abundance genes ---- by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freqsum(keep) # N transcriptsnrow(count <- count[keep,])count <- round(count) # redondea los valores # 2) Format metadata and count matrix ----row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownamesrow.names=TRUEcolData <- colData %>% arrange(match(Library_ID, colnames(count)))#all(colnames(count) %in% rownames(colData))#colData <- dplyr::rename(colData, samples = Library_ID)#rownames(colData) <- colData$Library_IDany(colnames(count) == colData$Library_ID) # sanity check#colData <- mutate_if(colData, is.character, as.factor)colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID")# Usamos objeto colData para elaborar el diseno experimentalf_col <- "Condition"names(colData)[names(colData) %in% f_col] <- "Condition"# Si es posible, especificar el factor "control" como nivel de referencia:colData <- colData %>% mutate(Condition = relevel(Condition, ref = "Ctrl")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticas# PREVALENCE ----apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence#libreriaslibrary(dplyr)library(ggplot2)data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedfprevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ====qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))}apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_dfprobs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Condition), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptopptop <- ptop + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white'))ptop# BOTTOM ====apply(count, 2, function(x) sum(x > 0)) -> Total_genes# How singletones are per sample? ----apply(count, 2, function(x) sum(x > 0)) -> filtered_genescbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genesnames(n_genes) <- c('Library_ID','Raw', 'Filt')n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genesn_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Condition)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottompbottom <- pbottom + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank())pbottom###aqui obtuve tres plotslibrary(patchwork) # para unir las dos graficasptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))# DESEQ2 =====ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Condition + Site+ Condition:Site) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition# (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ...# 3) Run DESeq in the following functions order ----# Negative Binomial GLM# # For the analysis we need to estimate the effective library size to normalize for.# Imagine, a gene has the same number of counts in two samples. But the library size # (total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors():dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds)# Next we need to estimate for each condition a function that allows to predict the # dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:dds <- estimateDispersions(dds) #boxplot(log2(counts(ddsFullCountTable)+0.5)) #boxplot(log2(counts(dds, normalized=TRUE)+0.5))# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)res <- results(dds) # d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)# # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() +# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +# labs(y = "Gene count (Log10)", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>%# ggplot(aes(x=Design, y=n)) +# geom_col(width = 0.4) +# labs(y = "Library Size", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # p2 /p1# normal vs gamma-poisson distribution# library(bayestestR)# # prior <- distribution_normal(nrow(m), mean = 5)# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution# # # p <- bayesfactor_parameters(posterior, prior,# direction = "two-sided", null = 0, verbose = FALSE)# # # plot(p)# centrality <- point_estimate(posterior)# # plot(centrality)# Note: also run as a single batch dds <- DESeq(ddsFullCountTable)res <- results(dds)#4) (OPTIONAL) Test transformation for downstream analyses---- #However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10ntr <- DESeq2::normTransform(dds) DESeq2::plotPCA(ntr, intgroup = "Condition") DESeq2::plotPCA(vst, intgroup = "Condition") if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("vsn") raw_df <- vsn::meanSdPlot(assay(dds), plot = F) vst_df <- vsn::meanSdPlot(assay(vst), plot = F) ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%# ggplot(aes(px, py, color = col)) +# labs(x = "Ranks", y = "sd", color = "") +# geom_line(orientation = NA, position = position_identity(), size = 2) +# theme_bw(base_family = "GillSans", base_size = 20) +# theme(legend.position = "top")# 5) (OPTIONAL) Correlation heatmap ----sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')sample_dist = dist(t(assay(vst)), method='euclidean')hc_samples = hclust(sample_dist, method='complete')hc_order <- hc_samples$labels[hc_samples$order] heatmap(sample_cor, col = cm.colors(1)) sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_longlibrary(ggplot2) library(ggh4x)sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat# 6) PCA =====library(DESeq2)ncol(data <- log2(count+1))ncol(data <- assay(vst))PCA = prcomp(t(data), center = FALSE, scale. = FALSE)percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)sd_ratio <- sqrt(percentVar[2] / percentVar[1])PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])library(ggforce)PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')# EXPLORATORY DATA ====summary(res)out of 127719 with nonzero total read countadjusted p-value < 0.1LFC > 0 (up) : 143, 0.11%LFC < 0 (down) : 226, 0.18%outliers [1] : 648, 0.51%low counts [2] : 61847, 48%(mean count < 4)hist(res$pvalue)res <- res %>% as_tibble(rownames = "transcript")# ========================= Pruebas HeatMap ========================================library(DESeq2)library(ggplot2)library(ComplexHeatmap)#library(org.Hs.eg.db)#library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de hom*o Sapienssignificant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05))sigs.df <- as.data.frame(significant)row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")####################################################################################Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]#sig <- sigs.df[, -1]################################################################################mat <- counts(dds, normalized =T)[rownames(sigs.df), ]mat.z <- t(apply(mat, 1, scale))colnames(mat.z) <- rownames(colData)Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)```Posteriormente, es necesario obtener un gráfico de volcano, el cual se obtiene de la siguiente manera (mismo script):> Archivos de salida:> Gráficos:> - [ ] Conteos crudos en un boxplot y gráfico de barras> - [ ] Heatmaps de correlación de Pearson> - [ ] Heatmap de los genes transcritos en cada condición> - [ ] Volcano plot> - [ ] En la consola, el número y porcentaje de transcritos (up, down y outliers)##### Asimismo, se puede realizar el análisis DGE por contrastes. En mi caso, lo realicé para Basales, Osc, LOL y BLA. Sin embargo, para fines prácticos, colocaré solo el script de Basales y de BLA.El siguiente script se corre en RStudio.> Archivos de entrada:> - [ ] Script R Basales.R> - [ ] Archivo good.count.isoform.counts.matrix> - [ ] Archivo metadata_Ns.tsv> - [ ] Trinotate_report.tsv```#BASALES# DIFFERENTIAL EXPRESSION ANALYSIS# Script modificado de Ricardo Gomez-Reyes# Limpiar la memoria de la sesion de R ====rm(list = ls()) if(!is.null(dev.list())) dev.off()#if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")#BiocManager::install("DESeq2")library(DESeq2)# Configuramos# options(stringsAsFactors = FALSE)# Cargamos nuestras librerias de trabajo ====# sacar herramientas de nuestra caja de herramientas#if (!require("BiocManager", quietly = TRUE))# install.packages("BiocManager")#BiocManager::install("vsn")library(vsn)#install.packages("tidyverse")library(tidyverse)library(DESeq2)library(vsn).cran_packages <- c('tidyverse', 'DESeq2').inst <- .cran_packages %in% installed.packages()if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org')}sapply(c(.cran_packages), require, character.only = TRUE)#################################################################################################################################### Establecemos directorio (escritorio) de trabajo ====path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2"count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)# Cargamos archivos en R ====#library(readr)count <- read_tsv(count_f)colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))rownames(colData) <- colData$Library_IDrownames=TRUE#colData <- colData[-1] #remover la primera columna# Modificamos caracteres ====colNames <- gsub("_L001.isoforms.results", "", names(count))#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#colnames(count) <- c("transcript_id", colNames[-1])query.ids <- count$transcript_idlibrary(dplyr)count <- count %>% select_if(is.double) %>% as(., "matrix")rownames(count) <- query.ids# 1) Filter data by removing low-abundance genes ----by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freqsum(keep) # N transcriptsnrow(count <- count[keep,])count <- round(count) # redondea los valores # 2) Format metadata and count matrix ----row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownamesrow.names=TRUEcolData <- colData %>% arrange(match(Library_ID, colnames(count)))#all(colnames(count) %in% rownames(colData))#colData <- dplyr::rename(colData, samples = Library_ID)#rownames(colData) <- colData$Library_IDany(colnames(count) == colData$Library_ID) # sanity check#colData <- mutate_if(colData, is.character, as.factor)colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID")# Usamos objeto colData para elaborar el diseno experimentalcolData=colData %>% filter(Condition=="Bas") #filtro de condicion(Basal) por sitio f_col <- "Site"names(colData)[names(colData) %in% f_col] <- "Site"# Si es posible, especificar el factor "control" como nivel de referencia:colData <- colData %>% mutate(Condition= relevel(Condition, ref = "LOL")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticaskeep=colnames(count) %in% colData$Library_IDcount=count[,keep ]# PREVALENCE ----apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence#libreriaslibrary(dplyr)library(ggplot2)data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedfprevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ====qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))}apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_dfprobs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptopptop <- ptop + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white'))ptop# BOTTOM ====apply(count, 2, function(x) sum(x > 0)) -> Total_genes# How singletones are per sample? ----apply(count, 2, function(x) sum(x > 0)) -> filtered_genescbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genesnames(n_genes) <- c('Library_ID','Raw', 'Filt')n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genesn_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Site)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottompbottom <- pbottom + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank())pbottom######### aqui obtuve tres plots:# el boxplot, grafico de barras (arriba) y el que tiene a ambos (abajo)########library(patchwork) # para unir las dos graficasptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))# DESEQ2 =====ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Site) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition #####OJOOOOO, AQUI cambie #Site# por condition # (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ...# 3) Run DESeq in the following functions order ----# Negative Binomial GLM# # For the analysis we need to estimate the effective library size to normalize for.# Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors():dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds)# Next we need to estimate for each condition a function that allows to predict the dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:dds <- estimateDispersions(dds)################aqui empieza a estimar las dispersiones geneticas y reduce estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos###############boxplot(log2(counts(ddsFullCountTable)+0.5))#boxplot(log2(counts(dds, normalized=TRUE)+0.5))# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)res <- results(dds)# d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)# # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() +# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +# labs(y = "Gene count (Log10)", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>%# ggplot(aes(x=Design, y=n)) +# geom_col(width = 0.4) +# labs(y = "Library Size", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # p2 /p1# normal vs gamma-poisson distribution# library(bayestestR)# # prior <- distribution_normal(nrow(m), mean = 5)# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution# # # p <- bayesfactor_parameters(posterior, prior,# direction = "two-sided", null = 0, verbose = FALSE)# # # plot(p)# centrality <- point_estimate(posterior)# # plot(centrality)# Note: also run as a single batch dds <- DESeq(ddsFullCountTable)res <- results(dds)#4) (OPTIONAL) Test transformation for downstream analyses----#However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10ntr <- DESeq2::normTransform(dds)DESeq2::plotPCA(ntr, intgroup = "Site")DESeq2::plotPCA(vst, intgroup = "Site")# if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")#BiocManager::install("vsn")library(vsn)raw_df <- vsn::meanSdPlot(assay(dds), plot = F)vst_df <- vsn::meanSdPlot(assay(vst), plot = F)ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%# ggplot(aes(px, py, color = col)) +# labs(x = "Ranks", y = "sd", color = "") +# geom_line(orientation = NA, position = position_identity(), size = 2) +# theme_bw(base_family = "GillSans", base_size = 20) +# theme(legend.position = "top")# 5) (OPTIONAL) Correlation heatmap ----sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')sample_dist = dist(t(assay(vst)), method='euclidean')hc_samples = hclust(sample_dist, method='complete')hc_order <- hc_samples$labels[hc_samples$order]heatmap(sample_cor, col = cm.colors(1))sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_longlibrary(ggplot2)library(ggh4x)sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat# 6) PCA =====library(DESeq2)ncol(data <- log2(count+1))ncol(data <- assay(vst))PCA = prcomp(t(data), center = FALSE, scale. = FALSE)percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)sd_ratio <- sqrt(percentVar[2] / percentVar[1])PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])library(ggforce)PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Condition)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')# EXPLORATORY DATA ====summary(res)#################################################################################### BASALES: Por sitio, BLA x LOL ######################################### out of 126956 with nonzero total read count#adjusted p-value < 0.1#LFC > 0 (up) : 189, 0.15%#LFC < 0 (down) : 258, 0.2%#outliers [1] : 491, 0.39%#low counts [2] : 44291, 35%################################################################################################################################################hist(res$pvalue)res <- res %>% as_tibble(rownames = "transcript")# ========================= Pruebas HeatMap ========================================library(DESeq2)library(ggplot2)library(ComplexHeatmap)#library(org.Hs.eg.db)#library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de hom*o Sapienssignificant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05))sigs.df <- as.data.frame(significant)row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")####################################################################################Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]#sig <- sigs.df[, -1]################################################################################mat <- counts(dds, normalized =T)[rownames(sigs.df), ]mat.z <- t(apply(mat, 1, scale))colnames(mat.z) <- rownames(colData)Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)#write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)write.csv(sigs.df, file = "DEG_Basales.csv")######### ## :) ## ##########=============== Seguimiento con Script de Ricardo ========================= sigfc <- "Sign and FC";pv <- "Sign";fc <- "FC"colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30")colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS"))res$cc <- 'NS'res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fcres[which(abs(res$padj) <= 0.05), 'cc'] <- pvres[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfcp <- res %>% ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) + geom_point() p <- p + scale_color_manual(name = "", values = colors_fc) + labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) + theme_bw(base_family = "GillSans", base_size = 18) + theme(legend.position = "top") + geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) + geom_vline(xintercept = 0, linetype="dashed", alpha=0.5) p# POSITIVE LFC == UP EXPRESSED IN CANCER# NEGATIVE LFC == UP EXPRESSED IN CONTROLres %>% mutate(g = sign(log2FoldChange)) %>% dplyr::count(cc, g) %>% mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>% filter(cc != 'NS') %>% group_by(g) %>% mutate(pct = n / sum(n)) %>% ggplot() + geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) + scale_fill_manual(name = "", values = colors_fc[-4]) + theme_bw(base_family = "GillSans", base_size = 18) + labs(y = '% Transcripts', x = '') + theme(legend.position = 'top') + coord_flip()# Annotation ====query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript)query.ids <- res$transcripturl <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R"source(url)annot_path <- paste0(path, "/anotacion/")annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T)annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids)blastp_df <- split_blast(annot, hit = "BLASTP") %>% group_by(transcript) %>% arrange(identity) %>% sample_n(1) %>% ungroup()blastp_df %>% dplyr::count(genus, sort = T)tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name)p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0)#############aqui salio un volcanoplot###########```> Archivos de salida:> - [ ] transcritos_DE.txt> - [ ] DEG_Basales.csv> - [ ] Archivo metadata_Ns.tsvEn el caso de que se requiera obtener las GO de los DGE up-down, se añade como última parte del script lo siguiente:```##############ENRIQUECIMIENTO FUNCIONAL################################# Functional enrichment ====.bioc_packages <- c('biomaRt','GOSemSim') # DESeq2'.inst <- .bioc_packages %in% installed.packages()#if(any(!.inst)) {# if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")BiocManager::install(.bioc_packages[!.inst], ask = F)#}sapply(c(.bioc_packages), require, character.only = TRUE)# Local functionsurl <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R"source(url)go_df <- split_gene_ontology(annot, hit = "BLASTP") table(go_df$ontology)gene2GO <- split(go_df$go, go_df$transcript)########yo agregué esta líneas para que agarrara a GOSemSim & orgif (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("GOSemSim")#install.packages("GOSemSim") library(GOSemSim)if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("org.Hs.eg.db")library(org.Hs.eg.db)############################ aqui sigue el scripthsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP")#hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt'))library(readr)# Assuming the file is a TSV file (tab-separated values)/read_rds es formato binariohsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt'))res_up <- res %>% filter(log2FoldChange >=2 & padj <0.05) #filtro para los sobre-expresadosres <- res %>% drop_na(padj)res_down <- res %>% filter(log2FoldChange <2 & padj <0.05) #filtro para los sub-expresadosres_up %>% pull(padj) -> query.pres_up %>% pull(transcript) -> query.names#topGO es requerida, por lo que lo llamare por BiocmanagerBiocManager::install("topGO")allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)allRes_up %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12)#AQUI me dio un grafico de enriquecimiento funcional con los procesos UPres_down %>% pull(padj) -> query.pres_down %>% pull(transcript) -> query.namesallRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)#null para devolver todosallRes_down %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12)#write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)#write.csv(sigs.df, file= "matriz.csv", row.names= TRUE )```Si se realiza este paso, se generan los archivos:> - [ ] down_DE.txt y up_DE.txt> - [ ] matriz.csv#### Script de BLA```##BLA# DIFFERENTIAL EXPRESSION ANALYSIS# Script modificado de Ricardo Gomez-Reyes# Limpiar la memoria de la sesion de R ====rm(list = ls()) if(!is.null(dev.list())) dev.off()#if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")#BiocManager::install("DESeq2")library(DESeq2)# Configuramos# options(stringsAsFactors = FALSE)# Cargamos nuestras librerias de trabajo ====# sacar herramientas de nuestra caja de herramientas#if (!require("BiocManager", quietly = TRUE))# install.packages("BiocManager")#BiocManager::install("vsn")library(vsn)#install.packages("tidyverse")library(tidyverse)library(DESeq2)library(vsn).cran_packages <- c('tidyverse', 'DESeq2').inst <- .cran_packages %in% installed.packages()if(any(!.inst)) { install.packages(.cran_packages[!.inst], dep=TRUE, repos='http://cran.us.r-project.org')}sapply(c(.cran_packages), require, character.only = TRUE)#################################################################################################################################### Establecemos directorio (escritorio) de trabajo ====path <- "D:/OneDrive/Documentos/Tesis_Giselle/DEseq2"count_f <- list.files(pattern = "good.count.isoform.counts.matrix", full.names = T)# Cargamos archivos en R ====#library(readr)count <- read_tsv(count_f)colData <- read_tsv(list.files(pattern = "metadata_Ns.tsv", full.names = T))rownames(colData) <- colData$Library_IDrownames=TRUE#colData <- colData[-1] #remover la primera columna# Modificamos caracteres ====colNames <- gsub("_L001.isoforms.results", "", names(count))#colNames <- sapply(strsplit(colNames, "_"), `[`, 1)#colnames(count) <- c("transcript_id", colNames[-1])query.ids <- count$transcript_idlibrary(dplyr)count <- count %>% select_if(is.double) %>% as(., "matrix")rownames(count) <- query.ids# 1) Filter data by removing low-abundance genes ----by_count <- 1; by_freq <- 2 keep <- rowSums(count > by_count) >= by_freqsum(keep) # N transcriptsnrow(count <- count[keep,])count <- round(count) # redondea los valores # 2) Format metadata and count matrix ----row.names(colData) <- colData$Library_ID #para colocar los nombres de Library_ID en rownamesrow.names=TRUEcolData <- colData %>% arrange(match(Library_ID, colnames(count)))#all(colnames(count) %in% rownames(colData))#colData <- dplyr::rename(colData, samples = Library_ID)#rownames(colData) <- colData$Library_IDany(colnames(count) == colData$Library_ID) # sanity check#colData <- mutate_if(colData, is.character, as.factor)colData <- colData %>% tibble::rownames_to_column(var = "ROW_ID") %>% mutate(across(-ROW_ID, as.factor)) %>% column_to_rownames(var = "ROW_ID")# Usamos objeto colData para elaborar el diseno experimentalcolData=colData %>% filter(Site=="BLA") #filtro de condicion(Basal) por sitio f_col <- "Condition"names(colData)[names(colData) %in% f_col] <- "Condition"# Si es posible, especificar el factor "control" como nivel de referencia:colData <- colData %>% mutate(Condition= relevel(Condition, ref = "Bas")) #usa el factor Basal como nivel de referencia en las comparaciones estadisticaskeep=colnames(count) %in% colData$Library_IDcount=count[,keep ]# PREVALENCE ----apply(count, 1, function(x) sum(x > 0)) %>% table() #calcular elementos por cada fila >0prevelancedf = apply(count, 1, function(x) sum(x > 0)) # calcula el n?mero de elementos mayores que cero en cada fila de la matriz count y los almacena en un vector llamado prevalence#libreriaslibrary(dplyr)library(ggplot2)data.frame(Prevalence = prevelancedf, TotalAbundance = rowSums(count)) %>% # mean_se as_tibble(rownames = "Name") %>% arrange(desc(TotalAbundance)) -> prevelancedfprevelancedf %>% dplyr::count(Prevalence) %>% ggplot(aes(Prevalence, n)) + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12) + scale_y_continuous("Number of transcript", labels = scales::comma) # BOXPLOT ====qprobs <- function(x) { x <- x[x > 1] quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))}apply(log2(count+1), 2, qprobs) %>% t() %>% as_tibble(rownames = 'Library_ID') %>% left_join(colData) -> probs_dfprobs_df %>% ggplot(., aes(x = Library_ID, ymin = `5%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `95%`, group = Library_ID)) + geom_errorbar(width = 0.3, position = position_dodge(0.6)) + geom_boxplot(aes(fill = Site), width = 0.5, stat = 'identity', position = position_dodge(0.6)) + labs(y = expression(log[2]~ 'Abundance'), x = '') + theme_classic(base_family = "GillSans", base_size = 18) + theme( legend.position = 'top', panel.border = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.line.x = element_blank()) -> ptopptop <- ptop + facet_grid(~ Condition, scales = 'free') + theme( strip.background = element_rect(fill = 'grey', color = 'white'))ptop# BOTTOM ====apply(count, 2, function(x) sum(x > 0)) -> Total_genes# How singletones are per sample? ----apply(count, 2, function(x) sum(x > 0)) -> filtered_genescbind(as_tibble(Total_genes, rownames = 'name'), as_tibble(filtered_genes)) -> n_genesnames(n_genes) <- c('Library_ID','Raw', 'Filt')n_genes %>% mutate(pct_genes_retained = Filt/Raw) -> n_genesn_genes %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = unique(Library_ID))) %>% ggplot() + geom_col(aes(x = Library_ID, y = Raw, fill = Site)) + labs(x = 'Samples', y = 'Total Transcript') + scale_y_continuous(labels = scales::comma) + theme_minimal(base_family = "GillSans", base_size = 12) + theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10)) -> pbottompbottom <- pbottom + facet_grid(~ Site, scales = 'free') + theme( strip.background = element_blank(), strip.text = element_blank())pbottom######### aqui obtuve tres plots:# el boxplot, grafico de barras (arriba) y el que tiene a ambos (abajo)########library(patchwork) # para unir las dos graficasptop / pbottom + patchwork::plot_layout(heights = c(1,1.2))# DESEQ2 =====ddsFullCountTable <- DESeqDataSetFromMatrix( countData = count, colData = colData, design = ~ Condition) ##para modelar la influencia de m?ltiples factores en los datos de exp g?nica, si no unicamente dejar ~ Condition #####OJOOOOO, AQUI cambie #Site# por condition # (Above): transform or not the count matrix? # In order to test for differential expression, we operate on raw counts and use discrete distributions ...# 3) Run DESeq in the following functions order ----# Negative Binomial GLM# # For the analysis we need to estimate the effective library size to normalize for.# Imagine, a gene has the same number of counts in two samples. But the library size(total number of reads) was twice as high in the first sample. Then we would conclude that the gene was higher expressed in the second sample. # You can estimate the size factors from the count data using the function estimateSizeFactors():dds <- estimateSizeFactors(ddsFullCountTable) # sizeFactors(dds)# Next we need to estimate for each condition a function that allows to predict the dispersion (= variance across samples). The core assumption of this method is that the # mean is a good predictor of the variance, i.e., that genes with a similar expression level also have similar variance across the samples:dds <- estimateDispersions(dds)################aqui empieza a estimar las dispersiones geneticas y reduce estas estimaciones para generar estimaciones de dispersión más precisas para modelar los recuentos###############boxplot(log2(counts(ddsFullCountTable)+0.5))#boxplot(log2(counts(dds, normalized=TRUE)+0.5))# Now we can test for differential expression of the genes between the two groups by calling the function nbinomWaldTest(). We provide the dds and the names of the groups of our samples to this function. This might take a few minutes:dds <- nbinomWaldTest(dds, maxit = 200) # (Above) use Wald test when contrasting a unique pair (ex. control vs cancer)res <- results(dds)# d1 <- plotCounts(dds, gene=which.min(res$padj), intgroup="Design", returnData=TRUE)# d2 <- plotCounts(dds, gene=which.max(res$padj), intgroup="Design", returnData=TRUE)# # p1 <- ggplot(rbind(d1,d2), aes(x=Design, y=count)) + # geom_jitter(position=position_jitter(w=0.1,h=0), size = 5, alpha = 0.5) + # scale_y_log10() +# stat_summary(fun.data = "mean_cl_boot", colour = "red", linewidth = 0.7, size = 1) +# labs(y = "Gene count (Log10)", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # # p2 <- rbind(d1,d2) %>% # group_by(Design) %>% # tally(count) %>%# ggplot(aes(x=Design, y=n)) +# geom_col(width = 0.4) +# labs(y = "Library Size", x = "Treatment") +# theme_bw(base_family = "GillSans", base_size = 20)# # p2 /p1# normal vs gamma-poisson distribution# library(bayestestR)# # prior <- distribution_normal(nrow(m), mean = 5)# posterior <- distribution_gamma(nrow(m), 2) # Generate a skewed distribution# # # p <- bayesfactor_parameters(posterior, prior,# direction = "two-sided", null = 0, verbose = FALSE)# # # plot(p)# centrality <- point_estimate(posterior)# # plot(centrality)# Note: also run as a single batch dds <- DESeq(ddsFullCountTable)res <- results(dds)#4) (OPTIONAL) Test transformation for downstream analyses----#However for other downstream analyses – e.g. for visualization or clustering – it might be useful to work with transformed versions of the count data. Ex:vst <- DESeq2::varianceStabilizingTransformation(dds) # vst if cols > 10ntr <- DESeq2::normTransform(dds)DESeq2::plotPCA(ntr, intgroup = "Site")DESeq2::plotPCA(vst, intgroup = "Site")# if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")#BiocManager::install("vsn")library(vsn)raw_df <- vsn::meanSdPlot(assay(dds), plot = F)vst_df <- vsn::meanSdPlot(assay(vst), plot = F)ntr_df <- vsn::meanSdPlot(assay(ntr), plot = F)# rbind(data.frame(py = vst_df$sd, px = vst_df$rank, col = "vst"),# data.frame(py = ntr_df$sd, px = ntr_df$rank, col = "ntr"),# data.frame(py = raw_df$sd, px = raw_df$rank, col = "raw")) %>%# ggplot(aes(px, py, color = col)) +# labs(x = "Ranks", y = "sd", color = "") +# geom_line(orientation = NA, position = position_identity(), size = 2) +# theme_bw(base_family = "GillSans", base_size = 20) +# theme(legend.position = "top")# 5) (OPTIONAL) Correlation heatmap ----sample_cor = cor(assay(vst), method='pearson', use='pairwise.complete.obs')sample_dist = dist(t(assay(vst)), method='euclidean')hc_samples = hclust(sample_dist, method='complete')hc_order <- hc_samples$labels[hc_samples$order]heatmap(sample_cor, col = cm.colors(1))sample_cor %>% as_tibble(rownames = 'Library_ID') %>% pivot_longer(cols = colnames(sample_cor), values_to = 'cor') %>% left_join(colData) %>% mutate(Library_ID = factor(Library_ID, levels = hc_order)) -> sample_cor_longlibrary(ggplot2)library(ggh4x)sample_cor_long %>% ggplot(aes(x = Library_ID, y = name, fill = cor)) + geom_tile(color = 'white', linewidth = 0.2) + # geom_raster() + # geom_text(aes(label = cor), color = 'white') + scale_fill_viridis_c(option = "F", name = "Pearson", direction = -1) + scale_x_discrete(position = 'top') + ggh4x::scale_x_dendrogram(hclust = hc_samples, position = 'top') + ggh4x::scale_y_dendrogram(hclust = hc_samples) + theme_classic(base_size = 12, base_family = "GillSans") + labs(x = '', y = '') + theme(axis.text.x = element_text(angle = 50, hjust = -0.15, vjust = 1)) # -> pheat# 6) PCA =====library(DESeq2)ncol(data <- log2(count+1))ncol(data <- assay(vst))PCA = prcomp(t(data), center = FALSE, scale. = FALSE)percentVar <- round(100*PCA$sdev^2/sum(PCA$sdev^2),1)sd_ratio <- sqrt(percentVar[2] / percentVar[1])PCAdf <- data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2])library(ggforce)PCAdf %>% mutate(Library_ID = rownames(.)) %>% left_join(colData) %>% ggplot(., aes(PC1, PC2)) + # coord_fixed(ratio = sd_ratio) + geom_abline(slope = 0, intercept = 0, linetype="dashed", alpha=0.5) + geom_vline(xintercept = -74, linetype="dashed", alpha=0.5) + ggforce::geom_mark_ellipse(aes(group = as.factor(Site)), fill = 'grey', con.colour = 'grey') + geom_point(size = 7, alpha = 0.7, aes(color = Condition)) + geom_text( family = "GillSans", mapping = aes(label = Site), size = 2.5) + labs(caption = '', color = "") + xlab(paste0("PC1, VarExp: ", percentVar[1], "%")) + ylab(paste0("PC2, VarExp: ", percentVar[2], "%")) + theme_minimal(base_family = "GillSans", base_size = 20) + theme(plot.title = element_text(hjust = 0.5), legend.position = 'top')# EXPLORATORY DATA ====summary(res)#################################################################################### Osc: Por sitio, BLA x LOL ######################################### out of 126387 with nonzero total read count#adjusted p-value < 0.1#LFC > 0 (up) : 775, 0.61%#LFC < 0 (down) : 504, 0.4%#outliers [1] : 520, 0.41%#low counts [2] : 46531, 37%################################################################################################################################################hist(res$pvalue)res <- res %>% as_tibble(rownames = "transcript")# ========================= Pruebas HeatMap ========================================library(DESeq2)library(ggplot2)library(ComplexHeatmap)#library(org.Hs.eg.db)#library("org.Hs.eg.db") paquete de Bioconductor de anotacion de genes del genoma de hom*o Sapienssignificant <- res %>% filter((log2FoldChange > 2 & padj < 0.05) | (log2FoldChange < 2 & padj < 0.05))sigs.df <- as.data.frame(significant)row.names(sigs.df) <- sigs.df$transcript #para colocar los nombres de Transcript en rownamesrow.names=TRUE#sigs.df$symbol <- mapIds(org.Hs.eg.db, keys = rownames(sigs.df), keytype = "TRINITY", column = "SYMBOL")####################################################################################Silencie las siguientes dos lineas porque filtraban mis datos, reduciendolos a 5#sigs.df <- sigs.df[(sigs.df$baseMean >100) & (abs(sigs.df$log2FoldChange) > 1), ]#sig <- sigs.df[, -1]################################################################################mat <- counts(dds, normalized =T)[rownames(sigs.df), ]mat.z <- t(apply(mat, 1, scale))colnames(mat.z) <- rownames(colData)Heatmap(mat.z, cluster_rows = T, cluster_columns = T, column_labels = colnames(mat.z), name = "Z-score", row_labels = sigs.df[rownames(mat.z),]$transcript)#write.table(mat.z, file = "transcritos_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)write.csv(sigs.df, file = "DEG_BLA.csv")######### ## :) ## ##########=============== Seguimiento con Script de Ricardo ========================= sigfc <- "Sign and FC";pv <- "Sign";fc <- "FC"colors_fc <- c("red2", "#4169E1", "forestgreen", "grey30")colors_fc <- structure(colors_fc, names = c(sigfc, pv, fc, "NS"))res$cc <- 'NS'res[which(abs(res$log2FoldChange) > 2), 'cc'] <- fcres[which(abs(res$padj) <= 0.05), 'cc'] <- pvres[which(res$padj <= 0.05 & abs(res$log2FoldChange) > 2), 'cc'] <- sigfcp <- res %>% ggplot(aes(y = -log10(pvalue), x = log2FoldChange, color = cc)) + geom_point() p <- p + scale_color_manual(name = "", values = colors_fc) + labs(x= expression(Log[2] ~ "Fold Change"), y = expression(-Log[10] ~ "P")) + theme_bw(base_family = "GillSans", base_size = 18) + theme(legend.position = "top") + geom_abline(slope = 0, intercept = -log10(0.05), linetype="dashed", alpha=0.5) + geom_vline(xintercept = 0, linetype="dashed", alpha=0.5) p# POSITIVE LFC == UP EXPRESSED IN CANCER# NEGATIVE LFC == UP EXPRESSED IN CONTROLres %>% mutate(g = sign(log2FoldChange)) %>% dplyr::count(cc, g) %>% mutate(g = ifelse(g == "1", "Osc", "Ctrl")) %>% filter(cc != 'NS') %>% group_by(g) %>% mutate(pct = n / sum(n)) %>% ggplot() + geom_col(aes(x = g, y = pct, fill = cc), width = 0.5) + scale_fill_manual(name = "", values = colors_fc[-4]) + theme_bw(base_family = "GillSans", base_size = 18) + labs(y = '% Transcripts', x = '') + theme(legend.position = 'top') + coord_flip()# Annotation ====query.ids <- res %>% filter(cc == "Sign and FC") %>% pull(transcript)query.ids <- res$transcripturl <- "https://raw.githubusercontent.com/RJEGR/Small-RNASeq-data-analysis/master/FUNCTIONS.R"source(url)annot_path <- paste0(path, "/anotacion/")annot_f <- list.files(path = annot_path, pattern = "Trinotate_report.tsv", full.names = T)annot <- read_tsv(annot_f, na = ".") %>% filter(transcript_id %in% query.ids)blastp_df <- split_blast(annot, hit = "BLASTP") %>% group_by(transcript) %>% arrange(identity) %>% sample_n(1) %>% ungroup()blastp_df %>% dplyr::count(genus, sort = T)tex_dat <- res %>% left_join(blastp_df) %>% drop_na(name)p + ggrepel::geom_text_repel(data = tex_dat, aes(label = name), min.segment.length = 0)#############aqui salio un volcanoplot#########################ENRIQUECIMIENTO FUNCIONAL################################# Functional enrichment ====.bioc_packages <- c('biomaRt','GOSemSim') # DESeq2'.inst <- .bioc_packages %in% installed.packages()#if(any(!.inst)) {# if (!require("BiocManager", quietly = TRUE))#install.packages("BiocManager")BiocManager::install(.bioc_packages[!.inst], ask = F)#}sapply(c(.bioc_packages), require, character.only = TRUE)# Local functionsurl <- "https://raw.githubusercontent.com/RJEGR/Cancer_sete_T_assembly/main/functions.R"source(url)go_df <- split_gene_ontology(annot, hit = "BLASTP") table(go_df$ontology)gene2GO <- split(go_df$go, go_df$transcript)########yo agregué esta líneas para que agarrara a GOSemSim & orgif (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("GOSemSim")#install.packages("GOSemSim") library(GOSemSim)if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")BiocManager::install("org.Hs.eg.db")library(org.Hs.eg.db)############################ aqui sigue el scripthsGO <- GOSemSim::godata('org.Hs.eg.db', ont="BP")#hsGO <- read_rds(paste0(annot_path, '/Trinotate_GO_annotations.txt'))library(readr)# Assuming the file is a TSV file (tab-separated values)/read_rds es formato binariohsGO <- read_tsv(paste0(annot_path, '/Trinotate_GO_annotations.txt'))res_up <- res %>% filter(log2FoldChange >=2 & padj <0.05) #filtro para los sobre-expresadosres <- res %>% drop_na(padj)res_down <- res %>% filter(log2FoldChange <2 & padj <0.05) #filtro para los sub-expresadosres_up %>% pull(padj) -> query.pres_up %>% pull(transcript) -> query.names#topGO es requerida, por lo que lo llamare por BiocmanagerBiocManager::install("topGO")allRes_up <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)allRes_up %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12)#AQUI me dio un grafico de enriquecimiento funcional con los procesos UPres_down %>% pull(padj) -> query.pres_down %>% pull(transcript) -> query.namesallRes_down <- GOenrichment(query.p, query.names, gene2GO, Nodes = 20)#null para devolver todosallRes_down %>% as_tibble() %>% arrange(Annotated) %>% mutate(Term = factor(Term, levels= unique(Term))) %>% ggplot(aes(x = Term, y = Annotated, fill = p.adj.ks)) + # , fill = p.adj.ks coord_flip() + geom_col() + theme_minimal(base_family = "GillSans", base_size = 12)#write.table(res_down, file = "down_DE.txt", sep = "\t", quote = FALSE, row.names = TRUE)#write.csv(sigs.df, file= "matriz.csv", row.names= TRUE )```### Diagrama de VennUna vez que se tienen los contrastes, se puede realizar un diagrama de Venn para ver cuales y cuantos son los genes que se comparten entre los contrastes.El siguientre script se trabaja en RStudio.> Archivos de entrada:> - [ ] Script R Basales.R> - [ ] Archivos DGE de cada contraste, DEG_Osc.csv```####----VENN Diagram-----####----Annotation-----#library(dplyr)# Set the working directory to the folder containing the CSV filesetwd("D:/OneDrive/Documentos/Tesis_Giselle/DEseq2")# Read the CSV file into a data framedf1 <- read.csv("DEG_Basales.csv") ##Bas_BLA vs Bas_LOL*df2 <- read.csv("DEG_Osc.csv") ##Osc_BLA vs Osc_LOL*df3 <- read.csv("DEG_BLA.csv") ##Bas_BLA* vs Osc_BLAdf4 <- read.csv("DEG_LOL.csv") ##Bas_LOL* vs Osc_LOL#* es el grupo ctrl###call de packages#install.packages("VennDiagram")library(VennDiagram)###Extract unique valueslibrary(VennDiagram)# Create character vectors for each categorytranscript_Bas <- unique(df1$transcript)transcript_Osc <- unique(df2$transcript)transcript_BLA <- unique(df3$transcript)transcript_LOL <- unique(df4$transcript)# Create a list of character vectorsvenn_list <- list( Bas = transcript_Bas, Osc = transcript_Osc, BLA = transcript_BLA, LOL = transcript_LOL)# Define colors for each categorycategory_colors <- c("red", "blue", "green", "purple")# Create the Venn diagramvenn.plot <- venn.diagram( x = venn_list, category.names = c("Bas (BLA vs LOL*)", "Osc (BLA vs LOL*)", "BLA (Bas* vs Osc)", "LOL (Bas* vs Osc)"), filename = NULL, output = TRUE, fill = category_colors)# Display the Venn diagramgrid.draw(venn.plot)# Extraer los tres genes compartidos entre los contrastesgenes_compartidos <- Reduce(intersect, venn_list)# Mostrar los genes compartidosprint(genes_compartidos)####[1] "TRINITY_DN13726_c0_g1_i15", es el gen Ral GTPase-activating protein subunit beta# "TRINITY_DN2478_c1_g1_i14" Src kinase-associated phosphoprotein 2-A#[3] "TRINITY_DN3091_c0_g1_i17" gen sin identificar# Calcular la intersección entre las categorías "Osc" y "Bas"interseccion_osc_bas <- intersect(venn_list$Osc, venn_list$Bas)# Crear una tabla con la intersección y sus conteostabla_interseccion <- data.frame( Gen = interseccion_osc_bas, Count_Osc = sum(venn_list$Osc %in% interseccion_osc_bas), Count_Bas = sum(venn_list$Bas %in% interseccion_osc_bas))# Mostrar la tablaprint(tabla_interseccion)```> Archivos de salida:> - [ ] Diagrama de venn (guardar manualmente)---Debido a que mi bitácora supera el número de caracteres de HackMD, a partir de la Anotación funcional esta disponible aquí: https://hackmd.io/@3Hs8iYAxRfK3NnYlJdTdyQ/BJmmIAfuC

Last changed by

Published on HackMD

Sign in

or

By clicking below, you agree to our terms of service.

Sign in via Facebook Sign in via Twitter Sign in via GitHub Sign in via Dropbox

Sign in with Wallet Wallet ( )

Connect another wallet

New to HackMD? Sign up

Flujo de trabajo RNA-seq - HackMD (2024)
Top Articles
Latest Posts
Article information

Author: Virgilio Hermann JD

Last Updated:

Views: 6223

Rating: 4 / 5 (41 voted)

Reviews: 80% of readers found this page helpful

Author information

Name: Virgilio Hermann JD

Birthday: 1997-12-21

Address: 6946 Schoen Cove, Sipesshire, MO 55944

Phone: +3763365785260

Job: Accounting Engineer

Hobby: Web surfing, Rafting, Dowsing, Stand-up comedy, Ghost hunting, Swimming, Amateur radio

Introduction: My name is Virgilio Hermann JD, I am a fine, gifted, beautiful, encouraging, kind, talented, zealous person who loves writing and wants to share my knowledge and understanding with you.