Page 95 - Molecular features of low-grade developmental brain tumours

THE CODING AND NON-CODING TRANSCRIPTIONAL LANDSCAPE OF SEGA
Bioinformatics analysis of RNA-Sequencing data
Read quality was assessed using FastQC v0.11.5 (Babraham Institute, Babraham, Cambridgeshire, UK). Trimmomatic v0.36 was used to trim and filter low-quality reads 45. Low-quality leading and trailing bases were removed from each read, and the quality of the body of each read was assessed by sliding-window trimming, using a window of 4 nucleotides and a Phred score threshold of 20 and 15 in our RNA and small RNA datasets, respectively. Reads that dropped below 80 nucleotides in our RNA dataset or 17 nucleotides in our small RNA dataset, as well as reads with no partner forward or reverse read, were excluded from further analysis. For the small RNA-Seq dataset, paired-end reads were aligned to the reference genome, GRCh38, using TopHat2 v2.0.13 46. No mismatches were allowed between the trimmed reads and the reference genome, and small RNA reads were allowed to align a maximum of ten times 46. Next, transcripts for each sample were assembled de novo using Cufflinks v2.2.1 with the default settings, except that the expression of each transcript was not corrected for length 47. The transcript assembly for each sample, along with a custom reference annotation consisting of short RNA species extracted from Gencode v25 48 and miRNAs from miRBase21 49, was passed on to Cuffmerge v2.2.1 47. Cuffmerge compared the de novo transcript assembly of each sample with the reference annotation of known miRNAs and short non-coding RNAs. This allowed each assembled transcript to be classified as a known short non-coding RNA species, a miRNA, or a novel short non-coding RNA. Next, all assembled novel transcripts longer than 100 nucleotides were removed from the analysis. Subsequently, the chromosomal locations of the novel short non-coding RNAs were compared to the locations of known genes, based on Gencode v25, and the novel RNAs were classified as unannotated intergenic or unannotated gene-derived.
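The read trimming and filtering step described at the start of this section can be illustrated with a minimal Python sketch. It is a simplified approximation of sliding-window trimming, not Trimmomatic's actual implementation, and it omits the separate leading and trailing base trimming. The function names (`sliding_window_trim`, `keep_pair`) and the default parameter values are hypothetical and chosen only for illustration; the window size, Phred threshold and minimum length are passed in so that the dataset-specific values given in the text can be substituted.

```python
def sliding_window_trim(quals, window=4, min_mean_q=20):
    """Return the number of 5' bases to keep for one read.

    quals      : list of per-base Phred quality scores
    window     : window size in nucleotides
    min_mean_q : minimum mean Phred score required within a window

    The read is scanned from the 5' end; it is cut at the start of the
    first window whose mean quality falls below the threshold.
    Reads shorter than one window are returned untrimmed (a simplification).
    """
    required_total = window * min_mean_q
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) < required_total:
            return start
    return len(quals)


def keep_pair(fwd_quals, rev_quals, min_len, window=4, min_mean_q=20):
    """Keep a read pair only if both mates are still long enough after trimming.

    A pair is discarded when either mate drops below min_len, which mirrors
    the exclusion of reads without a surviving partner read.
    """
    fwd_len = sliding_window_trim(fwd_quals, window, min_mean_q)
    rev_len = sliding_window_trim(rev_quals, window, min_mean_q)
    return fwd_len >= min_len and rev_len >= min_len


if __name__ == "__main__":
    # Toy read: ten good bases followed by four poor ones.
    toy = [30] * 10 + [10] * 4
    print(sliding_window_trim(toy))
    # Toy pair: the reverse mate degrades halfway and fails an 80 nt cutoff.
    print(keep_pair([30] * 100, [30] * 50 + [2] * 50, min_len=80))
```

Because the cut is made on the window mean rather than on single bases, an isolated low-quality base inside an otherwise good stretch does not truncate the read.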
These elements were then merged to create a final reference annotation that consisted of miRNAs, short RNA species, unannotated intergenic short RNAs and unannotated gene-derived short RNAs. This reference annotation file, along with the original small RNA read alignment files, was passed to featureCounts from the Subread package, and the number of reads that aligned to each transcript was counted 50.
For the RNA dataset, paired-end reads were aligned to the reference genome, GRCh38, using TopHat2 v2.0.13 with the default settings 46. The number of reads that aligned to each gene, based on Gencode v25, was determined using the featureCounts program from the Subread package 50. The RNA and small RNA count matrices were passed on to the R package DESeq2 and normalized using the median of ratios method 51. Genes and small RNAs with a Benjamini-Hochberg adjusted p-value < 0.05 were considered differentially expressed. The biotypes of all differentially expressed genes (DEGs) were assessed using BioMart 52. Based on the biotype assigned by BioMart, the genes were further grouped into five categories: i) protein-coding: all genes with protein-coding ability; ii) pseudogenes: all genes classified as a polymorphic pseudogene, processed pseudogene, unprocessed pseudogene, or transcribed processed pseudogene; iii) long non-coding RNAs: all genes classed as long non-coding RNAs; iv) undefined: genes that could not be classified; v) other: genes that had one or more biotypes or did not fit into any of the aforementioned categories.
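The two statistical steps named above, median of ratios normalization and Benjamini-Hochberg adjustment, can be sketched in plain Python as follows. This is an illustrative re-implementation under simplifying assumptions, not the DESeq2 code used in the analysis: it skips DESeq2's dispersion modelling, testing and independent filtering, and the function names (`size_factors`, `normalize`, `benjamini_hochberg`) and toy inputs are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median of ratios size factors.

    counts : list of rows (one per gene), each a list of raw counts per sample.
    Only genes with non-zero counts in every sample contribute, because the
    log geometric mean is undefined otherwise.
    """
    n_samples = len(counts[0])
    log_rows = [[math.log(c) for c in row]
                for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(row) / n_samples for row in log_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [row[j] - g for row, g in zip(log_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    sf = size_factors(counts)
    return [[c / s for c, s in zip(row, sf)] for row in counts]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy matrix: sample 2 was sequenced twice as deeply as sample 1.
    toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(size_factors(toy_counts))
    padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
    print([p < 0.05 for p in padj])
```

Taking the median of the per-gene ratios makes the size factor robust to a small number of strongly differentially expressed genes, which would distort a normalization based on total read count.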