Issues with some of the TFs and the Perl API get_families.pl script

Patrick McGrath

02 Jun, 2015 03:40 PM

Hello,

I have found issues with the alignment files for some of the TreeFam families.

For example, if I use treefam_scan.pl to find the best-hit TF for the following sequence:

MQSRWWSCGIRLVTWTWIWSLAFLGAWCIPANEVNLLDSRSVMGDLGWVAYPKNGWEEIGEVDENYAPIH TYQVCRVMEQNQNNWLHTNWILTEGAQRVFIELKFTLRDCNSLPGGVGTCKETFNMYYYETDGDEEEMEE GEMRDGTGMTEEEDRAMKESRYIKIDTIAADESFTELDLGDRVMKLNTEVRDLGPLTRKGFYLAFQDLGA CIALVSVRVFYKRCPFLVKSLAEFPDTIPGSEASQLVEVVGRCVNNSLPLYEPPRMHCSTEGEWLVPIGK CVCQPGFEEINGSCQVCKVGFYRSLLESLACSKCPPHSVARQMGATACSCEDGYFKLDSDPSNMACTRPP SAPRNAISNVNETSVFLEWSIPMDTGGRKDVRYNVICRQVLPDGRGLEECGPNVRFLPRRTGLSNTSVMV ADLQSHTNYSFLLEAVNGVSDLAKGHAKQYVSLNVTTNQAAPSPVSVVRKGHTGKSSIALSWAEPDRPNG IILEYEIKYFEKEQDSSYTIIKSKDTEMVVEGLKPSSAYIFQVRARTSAGYGAFSRRFEFQTSPYLTATS ERAQASIVAVAITLALVLLAVVAGFLLSGRRCGYSKAKQDPEEEKMHFHNGHIKFPGVRTYIDPLTYEDP NQAVHEFAQEIDVSYISIERIIGAGEFGEVCSGPLRLPGKREIQVAIKTLKAGYTEQQRRDFLWEASIMG QFNHPNIIRLEGVVTKSKPVMIITEYMENGSLDTFLKKNDGQFTVIQLVGMLRGIASGMRYLSDMGYVHR DLAARNILVNSNLVCKVSDFGLSRVLEDDPEAAYTTRGGKIPIRWTAPEAIAYRKFTSASDVWSYGIVMW EVMSYGERPYWEMSNQDVIKAVEESYRLPGPMDCPEALYHLMMDCWQRERSNRPKFDEIVCLLDKFIRNP SSLKKLVNSSHRVSNLLVEHASVEGNCSTQSQTVGEWLDSIKMGRYTELFMEGGYSSLETVAQMTSEDLR RVGVNLAGHQKKIITSIQEMRVHMNSTNSTVNI*

I find TF314013 to be the best hit. However, when I search the bulk downloads for the alignment and tree files (taken from here), I'm unable to find any of the relevant files for TF314013.

When I use the web API on that sequence, it again returns TF314013 as the best hit, but it is unable to add the sequence to the gene tree (consistent with a problem finding the alignment and/or gene tree for TF314013). Instead I get an error: "There was a Problem submitting the job. Try again"

I tried to use the Perl script get_families.pl to download the alignment and gene tree files for TF314013 and got an error:

DBD::mysql::st execute failed: Unknown column 'gtr.gene_align_id' in 'field list' at /opt/local/ensembl-v70/ensembl-compara-release-70/modules/Bio/EnsEMBL/Compara/DBSQL/BaseAdaptor.pm line 147.

If I use MySQL to show the columns of the gene_tree_root table in the treefam_production_9_69 database, there is no gene_align_id column.

Has anyone encountered similar issues, or can anyone recommend a solution? I am not very familiar with the Ensembl API, so I'm hopeful there is an easy fix, such as changing some of the parameters.

Thanks,
-Patrick

  1. Support Staff 1 Posted by Matthieu Muffat... on 09 Jun, 2015 03:03 PM

    Hi Patrick,

    The TreeFam 9 database uses the database schema from Ensembl Compara 69, not 70. The e70 code expects a gene_align_id column, which was not present in e69.

    Can you please select the "release/69" branch and try again?

    Matthieu

  2. 2 Posted by Patrick McGrath on 09 Jun, 2015 04:05 PM

    Thanks... I'll try that. When I originally ran the script using Ensembl 79,
    I received an error:

    For treefam_production_9_69 there is a difference in the software release
    (79) and the database release (70). You should update one of these to
    ensure that your script does not crash.

    I assumed that meant I should use Ensembl Compara 70.

  3. Support Staff 3 Posted by Matthieu Muffat... on 09 Jun, 2015 04:09 PM

    That was a sensible decision :)
    I've looked at the database and there is conflicting information in the table that reports the schema version: it says both 69 and 70.

    v69 was used to produce the database and should be used to read from it. I'll look into removing the entry that says 70. You can disregard the warning in the meantime.

    Matthieu
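
The schema-version conflict described in this reply can be checked with a single query. What follows is a minimal sketch, assuming the version is stored Ensembl-style in a "meta" table under the key "schema_version" (the thread only says "the table that reports the schema version", so the table and key names are assumptions). The in-memory SQLite database and its two rows are a hypothetical stand-in for treefam_production_9_69; against the real server one would pass a cursor from a MySQL driver instead.

```python
import sqlite3

# Assumption: Ensembl-style databases record their schema version in the
# `meta` table under meta_key = 'schema_version'.
SCHEMA_VERSION_SQL = (
    "SELECT meta_value FROM meta WHERE meta_key = 'schema_version'"
)


def schema_versions(cursor):
    """Return the sorted, distinct schema versions recorded in the database."""
    cursor.execute(SCHEMA_VERSION_SQL)
    return sorted({str(row[0]) for row in cursor.fetchall()})


def demo_connection():
    """Build an in-memory stand-in reproducing the conflict (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    # Simplified table: only the two columns the query needs.
    conn.execute("CREATE TABLE meta (meta_key TEXT, meta_value TEXT)")
    conn.executemany(
        "INSERT INTO meta VALUES (?, ?)",
        [("schema_version", "69"), ("schema_version", "70")],
    )
    return conn


if __name__ == "__main__":
    versions = schema_versions(demo_connection().cursor())
    if len(versions) > 1:
        print("Conflicting schema versions recorded:", ", ".join(versions))
    else:
        print("Schema version:", versions[0] if versions else "not recorded")
```

If more than one version comes back, the database name (the "69" in treefam_production_9_69) and the advice above indicate which API release to use.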

  4. 4 Posted by Patrick McGrath on 27 Sep, 2017 01:19 AM

    I was curious whether this issue was ever resolved: some of the TreeFam families identified by the treefam_scan.pl script (such as TF314013) do not have tree files available in the download section of this website.

    I can use the MySQL database to get a Newick-format file for TF314013; however, it appears to lack support values for each node, unlike the tree files hosted on this web server. Is there a way to get the phylogeny files, including node support values, for TF314013 and the other missing families?

    Thanks....

  5. Support Staff 5 Posted by Matthieu Muffat... on 02 Oct, 2017 01:19 PM

    This is because of the way the data were generated for TF314013. The family (like some others) was too big to be built directly (alignment, trees, orthologues), so it had to be broken down into smaller pieces (sub-families). As a result, TF314013 doesn't have the same set of annotations as smaller families.

    415 families are affected. They have the tree_type "supertree" in the database.

    Matthieu
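
The affected families can be listed by filtering on that tree_type. What follows is a minimal sketch, assuming the TF accession is stored in the stable_id column of gene_tree_root (an assumption; the thread confirms only the table name and the "supertree" value). The in-memory SQLite table is a hypothetical stand-in for treefam_production_9_69, and TF000001 is a made-up accession used only for contrast.

```python
import sqlite3

# Assumption: gene_tree_root.stable_id holds the TreeFam accession (TFxxxxxx).
SUPERTREE_SQL = (
    "SELECT stable_id FROM gene_tree_root "
    "WHERE tree_type = 'supertree' ORDER BY stable_id"
)


def supertree_families(cursor):
    """Return the accessions of all families stored as supertrees."""
    cursor.execute(SUPERTREE_SQL)
    return [row[0] for row in cursor.fetchall()]


def is_supertree(cursor, family_id):
    """Tell whether a family (e.g. a treefam_scan.pl best hit) is a supertree."""
    return family_id in set(supertree_families(cursor))


def demo_connection():
    """Build an in-memory stand-in with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    # Simplified table: only the two columns the query needs.
    conn.execute("CREATE TABLE gene_tree_root (stable_id TEXT, tree_type TEXT)")
    conn.executemany(
        "INSERT INTO gene_tree_root VALUES (?, ?)",
        [
            ("TF314013", "supertree"),  # family discussed in this thread
            ("TF000001", "tree"),       # hypothetical ordinary family
        ],
    )
    return conn


if __name__ == "__main__":
    cur = demo_connection().cursor()
    for family in ("TF314013", "TF000001"):
        kind = "supertree" if is_supertree(cur, family) else "regular tree"
        print(family, "is stored as a", kind)
```

Running the same check on a best hit before looking in the bulk downloads shows up front whether its alignment and tree files should be expected there.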

  6. 6 Posted by Patrick McGrath on 02 Oct, 2017 02:24 PM

    Thanks.... so what should you do if hmmscan hits one of these trees?
    How do you know which subfamily to use?

  7. Support Staff 7 Posted by Matthieu Muffat... on 16 Oct, 2017 12:01 PM

    With the Perl API, one can fetch a tree for TF314013 that contains all the subfamilies.
    As it stands, the subfamilies are just a product of our own limitations in computing and displaying the alignments, trees, etc. The only family that ought to be used is TF314013.