About the NHX format

Fang-Yu Rao

31 Aug, 2015 07:43 PM


I have recently been trying to use the "forester" library (https://sites.google.com/site/cmzmasek/home/software/archaeopteryx) to process the TreeFam dataset. After downloading the dataset (treefam_family_data) from "http://www.treefam.org/download", I tried to parse a file named "TF101001.nhx.emf", which is in NHX format. However, the "forester" library cannot parse this file and reports that something in it is wrong.

In particular, the library prints 'error in NH/NHX formatted data: failed to parse number from "TC013167-PA"'. Looking more closely at the file, I found that "TC013167-PA" follows a colon (:) that is itself preceded by "TCOGS2", so the parser seems to read "TC013167-PA" as a number attached to "TCOGS2". As far as I understand the NHX format, a colon cannot be part of a name. Am I misunderstanding something about the NHX format here? Any help or suggestions would be greatly appreciated.

P.S. I have also attached the file "TF101001.nhx.emf" to this discussion.

Best regards,

  1. Support Staff 1 Posted by Matthieu Muffat... on 04 Sep, 2015 10:24 AM

    Dear Fang-Yu

    There was indeed a problem in the Ensembl Genomes release of Tribolium castaneum: all of their gene names were prefixed with "TCOGS2:". It has since been fixed on their side, and the next TreeFam release will use the correct names.

    In the meantime, I'd suggest removing the "TCOGS2:" string from all the files with:
    find treefam_family_data/ -type f | xargs sed -i 's/TCOGS2://g'
    (It may take some time to run!)
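
A slightly more defensive variant of the same clean-up can be sketched as below. It only rewrites files that actually contain "TCOGS2:" and passes file names NUL-terminated, so names containing spaces are handled. The function name strip_tcogs2_prefix is hypothetical (it is not part of TreeFam or forester), and the sketch assumes GNU grep, xargs and sed.

```shell
#!/bin/sh
# Hypothetical helper, not part of TreeFam or forester.
# Removes the stray "TCOGS2:" prefix from every file under a directory,
# touching only the files that contain it.
# Assumes GNU tools: grep -Z, xargs -0 -r, sed -i.
strip_tcogs2_prefix() {
    dir="$1"
    # grep: -r recurse, -l print only names of matching files,
    #       -Z terminate each name with NUL.
    # xargs: -0 read NUL-terminated names, -r run nothing if none matched.
    grep -rlZ 'TCOGS2:' "$dir" | xargs -0 -r sed -i 's/TCOGS2://g'
}
```

It would be called as "strip_tcogs2_prefix treefam_family_data/". Running "grep -rl 'TCOGS2:' treefam_family_data/ | wc -l" beforehand shows how many files would be changed.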

    Best regards,

  2. 2 Posted by Fang-Yu Rao on 09 Sep, 2015 02:37 PM

    Dear Matthieu,

    Thank you very much for the useful information! :-) I will remove the "TCOGS2:" string from all the files as you suggested.

    Best wishes,

  3. Matthieu Muffato closed this discussion on 02 Oct, 2017 01:25 PM.
