To date, we have produced an assembly of the Mytilus galloprovincialis genome using WGS data (paired-end, mate-pair and fosmid-end libraries). Briefly, our procedure collapsed repeats, resulting in a 1.88 Gbp assembly named mg3. It comprises 22,290 scaffolds of length >= 500 bp, with a scaffold N50 of 226.97 kb and a contig N50 of 39.65 kb.
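As a reminder of how the N50 statistic is defined, here is a minimal Python sketch. It is not the pipeline used to produce mg3: the function names and the toy lengths are illustrative assumptions, and only the 500 bp cutoff comes from the text.

```python
def filter_min_length(lengths, min_length=500):
    """Keep only sequences at least min_length bp long (500 bp cutoff, as for mg3)."""
    return [n for n in lengths if n >= min_length]


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths in bp (hypothetical, not taken from mg3).
toy_scaffolds = [100, 400, 600, 800, 1000, 1600]
kept = filter_min_length(toy_scaffolds)
print(n50(kept))
```

The same function applies to contig lengths to obtain the contig N50.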

Gene completeness is 88.92% for complete genes and 99.6% for partial genes. These statistics are based on 248 Conserved Eukaryotic Genes (CEGs). Additionally, mapping the ESTs from all mussel species (NCBI) against mg3 showed that 89% of them align confidently over at least 70% of their length. Thus, mg3 is amenable to annotation.
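The EST criterion can be illustrated with a short Python sketch that computes the fraction of ESTs whose aligned intervals cover at least 70% of their length. This is a sketch under stated assumptions, not the mapping procedure actually used. The input layout (EST id mapped to its length and a list of half-open aligned intervals), the function names and the toy data are all hypothetical. The notion of a "confident" alignment is not modelled here.

```python
def covered_fraction(intervals, length):
    """Fraction of a sequence covered by half-open (start, end) intervals,
    counting overlapping intervals only once."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, length)           # clip to the sequence length
        if end > start:
            covered += end - start
            current_end = end
    return covered / length


def fraction_passing(ests, min_fraction=0.70):
    """Fraction of ESTs aligned over at least min_fraction of their length."""
    passing = sum(
        1
        for length, intervals in ests.values()
        if covered_fraction(intervals, length) >= min_fraction
    )
    return passing / len(ests)


# Hypothetical toy input: EST id -> (EST length, aligned intervals on the EST).
toy_ests = {
    "est1": (100, [(0, 50), (40, 80)]),  # 80% covered
    "est2": (200, [(0, 100)]),           # 50% covered
    "est3": (100, [(10, 95)]),           # 85% covered
    "est4": (50, []),                    # unaligned
}
print(fraction_passing(toy_ests))
```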

In 2016 we incorporated PacBio reads (10x coverage of the genome) and produced a hybrid assembly (using Illumina and PacBio reads) named mg5. This version was refined by polishing it with Proovread, using all unitigs as input. Finally, we polished the assembly again, this time using RNAseq data from the Pura sample and NCBI libraries, to produce mg8.

After producing a first annotation of the mg8 assembly, we performed a new round of scaffolding using the annotation and the RNAseq data. This step resulted in the mg9 assembly, which is more contiguous than mg8.

The mg9 version of the assembly has been structurally annotated using a combination of ab initio gene predictions and homology searches against proteins and transcripts. After functionally annotating the resulting genes, we performed a final filtering step to remove some monoexonic proteins that were supported only by ab initio predictions and had no functional annotation. The final gene set, MGAL9D, comprises 58,127 protein-coding genes whose 79,956 transcripts encode 75,783 unique proteins.
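The logic of the final filtering step can be sketched in Python as follows. This is only an illustration under assumptions: the record layout, the evidence labels and the function names are hypothetical. The text states that only some such models were removed, so the actual criteria were probably more selective than this all-or-nothing rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    exon_count: int
    evidence: frozenset  # hypothetical labels: "ab_initio", "protein", "transcript"
    has_function: bool   # True if a functional annotation was assigned


def is_low_confidence(gene):
    """Monoexonic, supported only by ab initio prediction, and without function."""
    return (
        gene.exon_count == 1
        and gene.evidence == frozenset({"ab_initio"})
        and not gene.has_function
    )


def filter_gene_set(genes):
    """Return the gene models that are not flagged as low confidence."""
    return [g for g in genes if not is_low_confidence(g)]


# Hypothetical toy models, not taken from the mg9 annotation.
toy_genes = [
    GeneModel("g1", 1, frozenset({"ab_initio"}), False),             # removed
    GeneModel("g2", 1, frozenset({"ab_initio"}), True),              # kept: has function
    GeneModel("g3", 1, frozenset({"ab_initio", "protein"}), False),  # kept: homology support
    GeneModel("g4", 4, frozenset({"ab_initio"}), False),             # kept: multi-exonic
]
print([g.gene_id for g in filter_gene_set(toy_genes)])
```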

The mg10 version of the assembly is the result of a scaffolding round with the PacBio reads and a PacBio assembly of the myticin regions. The resulting assembly contains 10,577 scaffolds with a scaffold N50 of 208 kb. Finally, we incorporated some protein-coding annotations that had previously been discarded due to the presence of either NMD or TE-like domains, but that we have further curated in this new version. The final mg10 gene set, MGAL10A, comprises 60,338 protein-coding genes whose 82,780 transcripts encode 78,481 unique protein products.



| Assembly | Browser | Download | Blast server |
|---|---|---|---|
| mg3 | mg3 browser | data files (fasta, gff) | |
| mg5 | mg5 browser | data files (fasta, gff) | |
| mg8 | mg8 browser | data files (fasta) | |
| mg9 | mg9 browser | data files (fasta) | Blast server to the genome and annotation |
| mg10 | mg10 browser | data files (fasta) | Blast server to the genome and annotation |