First, we produced an assembly of the Mytilus galloprovincialis genome using only WGS data (paired-end, mate-pair and fosmid-end libraries). Briefly, our procedure collapsed repeats, resulting in a 1.88 Gbp assembly named mg3. It comprises 22,290 scaffolds of length >= 500 bp, with a scaffold N50 of 226.97 kb and a contig N50 of 39.65 kb. In terms of gene completeness, a search for 248 Conserved Eukaryotic Genes (CEGs) using CEGMA shows that 88.92% are present completely and 99.6% at least partially.
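For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of the standard definition: the length L such that sequences of length L or longer together cover at least half of the total assembly. The function name `n50` and the `min_length` parameter are illustrative, not part of our pipeline, and we assume here that the 500 bp cutoff is applied before the statistic is computed.

```python
def n50(lengths, min_length=0):
    """Return the N50 of a collection of sequence lengths.

    Sequences shorter than min_length are ignored (assumption: the
    length cutoff is applied before computing the statistic).
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences pass the length filter")

    half_total = sum(kept) / 2
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for n in kept:
        running += n
        if running >= half_total:
            return n


if __name__ == "__main__":
    # Toy scaffold lengths in bp (hypothetical, not taken from mg3).
    toy_lengths = [800_000, 500_000, 300_000, 200_000, 100_000, 400]
    print(n50(toy_lengths, min_length=500))  # prints 500000
```

The same function applies to contig lengths to obtain the contig N50; only the input list changes.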
In 2016 we incorporated PacBio read data (10x coverage of the genome) and produced a hybrid assembly (using Illumina and PacBio reads) named mg5. This version was refined by polishing it with Proovread, using all unitigs as input. Finally, we polished the assembly again, this time using RNA-seq data from the Pura sample and NCBI libraries, to produce mg8. After producing a first annotation of the mg8 assembly, we performed a new round of scaffolding with the annotation and the RNA-seq data. This step resulted in the mg9 assembly, which is more contiguous than mg8. The mg9 version of the assembly was structurally annotated using a combination of ab initio gene predictions and homology searches against proteins and transcripts. After functionally annotating the resulting genes, we filtered out some monoexonic, ab initio predicted proteins without functional annotation. The final gene set, MGAL9D, comprises 58,127 protein-coding genes whose 79,956 transcripts encode 75,783 unique proteins.
The mg10 version of the assembly is the result of a scaffolding round with the PacBio reads and a PacBio assembly of the myticin regions. The resulting assembly contains 10,577 scaffolds with a scaffold N50 of 208 kb. Finally, we incorporated some new protein-coding annotations that had previously been discarded due to the presence of either NMD or TE-like domains, but that we have further curated in this new version. The final mg10 gene set, MGAL10A, comprises 60,302 protein-coding genes whose 83,073 transcripts encode 78,736 unique protein products.
The different versions of the assembly and annotation can be downloaded from the "Download page" linked below. Please make sure you get the latest version (mg10) unless you explicitly need one of the previous versions.