Methods and Websites

From MDWiki

Revision as of 03:41, 24 April 2007

Websites with useful information or software

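  • RCSB Protein Database: PDB (http://www.rcsb.org/pdb/Welcome.do)
  • Structural Classification of Proteins: SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
  • Structural Comparison of Proteins: Dali (http://www.ebi.ac.uk/dali/), CE (http://cl.sdsc.edu/)
  • Multiple Structure Alignment: CEMC (http://bioinformatics.albany.edu/~cemc/)
  • Protein families database: Pfam at St Louis (http://pfam.wustl.edu), Pfam at Sanger (http://www.sanger.ac.uk/Software/Pfam/)
  • Clusters of Orthologous Groups: COG (http://www.ncbi.nlm.nih.gov/COG/)
  • Protein domain prediction: InterPro (http://www.ebi.ac.uk/interpro/)
  • Webserver that converts sequence identifiers into species names (http://foo.maths.uq.edu.au/~huber/BIOL3004/gi2name.pl)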

The website replaces sequence identifiers with organism taxonomies. Upload the original FASTA file containing all sequences, plus a second file in which the identifiers should be replaced (e.g. the Newick tree file, but any other text format works). In the result you get back, the identifiers in the second file have been replaced by taxonomies wherever possible.
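
If the webserver is unavailable, the same kind of replacement can be sketched locally. The Python below is only an illustration, not the webserver's actual method: it assumes each FASTA header has the form ">identifier description", uses the description as the replacement name, and the function names are hypothetical.

```python
import re


def read_fasta_names(fasta_text):
    """Map each identifier (first word of a '>' header) to the rest of the header.

    Assumption: headers look like '>1234567 Escherichia coli K12'.
    """
    names = {}
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            ident, _, description = line[1:].strip().partition(" ")
            if ident and description:
                names[ident] = description.strip()
    return names


def replace_identifiers(text, names):
    """Replace every known identifier in 'text' (e.g. a Newick tree) by its name."""
    if not names:
        return text
    # Longest identifiers first, and only match whole tokens, so that one
    # identifier is never replaced inside a longer one.
    alternatives = "|".join(
        re.escape(ident) for ident in sorted(names, key=len, reverse=True)
    )
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])(?:" + alternatives + r")(?![A-Za-z0-9_])"
    )

    def to_label(match):
        # Spaces, commas, colons and brackets have a meaning in Newick files,
        # so anything outside a safe character set becomes an underscore.
        return re.sub(r"[^A-Za-z0-9_.-]+", "_", names[match.group(0)])

    return pattern.sub(to_label, text)
```

Identifiers that do not occur in the FASTA file are left unchanged, which mirrors the "where possible" behaviour described above.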


Additional tips for Deep Evolutionary Analysis



CD/DVD software: basic usage

blast clustalx muscle seaview phylip-3.36 treeview rasmol pymol


Blast

Open a command prompt (under Accessories); it should start in H: (your student directory).


E:\blast\blastall -p blastp -d e:\blast\databases\img_bacteria -i yourfile.fasta -o usefuloutputname.blast


-p the blast program to use: blastp, blastn


-d the database to use: img_bacteria, img_archaea or img_eukaryota. You can search several databases by putting quotes around them: -d "img_archaea img_bacteria"


-i input, query sequence (in FASTA format)


-o output file to write blast results to.
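
When the same search has to be run for many query files, the command line can be assembled in a script. The following Python sketch only builds the argument list from the four options described above; the function names and the default program location (E:\blast\blastall) are assumptions for illustration, so adjust them to where the DVD is mounted.

```python
import subprocess


def blastall_command(query, output, databases, program="blastp",
                     blastall="E:\\blast\\blastall"):
    """Build the blastall argument list from the options described above.

    'databases' is a list of database names or paths. They are joined with
    spaces into ONE argument, which is what quoting does on the command line.
    The default 'blastall' location is an assumption.
    """
    return [
        blastall,
        "-p", program,              # blast program: blastp, blastn
        "-d", " ".join(databases),  # one or several databases
        "-i", query,                # query sequence in FASTA format
        "-o", output,               # file to write the blast results to
    ]


def run_blast(query, output, databases):
    """Run the search; raises CalledProcessError if blastall reports failure."""
    subprocess.run(blastall_command(query, output, databases), check=True)
```

For example, run_blast("yourfile.fasta", "usefuloutputname.blast", ["img_archaea", "img_bacteria"]) would search both databases in a single run.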


Psi-blast

Psi-blast is very similar, but you need to use the "blastpgp" program and be aware of the -j and -h options. Psi-blast is generally slower because it runs a normal blast search first, then builds a profile from the hits and searches again with that profile in each later round.


E:\blast\blastpgp -d e:\blast\databases\img_bacteria -i yourfile.fasta -o usefuloutputname.blast -j 3 -h 0.000001


-j maximum number of rounds to do (it will stop earlier, once the searches don't find more matches)


-h significance (E-value) cut-off for including a hit in the profile used for the next round


The documentation with all possible options is on the CD/DVD under blastdoc.



Obtaining FASTA format files of the sequences found with blast

Call up a command prompt.


E:\blast\fastacmd -d e:\blast\databases\img_bacteria -i filewith_img_numbers -o newsequences.fasta


-i the input file should be a line-by-line listing of the "accession numbers" from the same img database you used in the blast search. Each number needs to have lcl| in front of it:


lcl|1234567
lcl|1234589
lcl|1456789


ExtractIDs.doc (attached to this page) shows a fast, painless way to prepare the input file from your blast result.
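
As an alternative to preparing the list by hand, the identifiers can be pulled out of the blast result with a short script. This Python sketch is not necessarily the method described in ExtractIDs.doc; it assumes that every hit appears in the blast output as lcl| followed by digits, and the function names are hypothetical.

```python
import re


def extract_ids(blast_report):
    """Return each 'lcl|<digits>' identifier once, in order of first appearance."""
    # dict.fromkeys keeps insertion order and drops repeated identifiers
    # (a hit is usually listed in the summary and again in its alignment).
    return list(dict.fromkeys(re.findall(r"lcl\|\d+", blast_report)))


def write_id_file(blast_path, id_path):
    """Read a blast result file and write one identifier per line for fastacmd -i."""
    with open(blast_path) as handle:
        identifiers = extract_ids(handle.read())
    with open(id_path, "w") as handle:
        for identifier in identifiers:
            handle.write(identifier + "\n")
    return len(identifiers)
```

For example, write_id_file("usefuloutputname.blast", "filewith_img_numbers") produces the line-by-line listing that fastacmd expects. If the query sequence itself carries an lcl| identifier, it will be included too, so check the file before using it.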


The complete fastacmd documentation is at http://biowulf.nih.gov/apps/blast/doc/fastacmd.html.


Clustal

Click on the clustalx.exe icon in the clustal folder. Load sequences in FASTA format (you can use "browse" to go to your student area files).


Select options from the various clustal menu items.


Alignment output defaults to .aln (which can be loaded back into clustal later); also select phylip output format (.phy) for phylip analysis.


Remember to change the output format options from branch to NODE before bootstrapping in clustal. If not, you will not be able to see the reliability of the branches in treeview (shown with the internal edge labels).


Phylip

Click on the icon for the appropriate program in the phylip exe folder. Type in the input file name, e.g. H:\BIOL3004\mydata.phy. Most phylip programs take .phy input files; neighbor takes a distance matrix produced by protdist, dnadist or similar.


Bootstrapping in phylip is a multi-stage process (details to come).


Complete phylip documentation is also on the DVD: open the phylip.html document in the phylip folder, which links to the documentation for specific programs. On the web, you can find it at http://evolution.genetics.washington.edu/phylip/phylip.html.



Treeview

Click on treev32.exe in the Treeview folder. The input file should be a tree in New Hampshire (Newick) format.


From clustal: filename.ph filename.phb


From phylip: outtree (renamed appropriately)
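
If Treeview refuses to open a file, a quick sanity check can tell whether the file is a complete tree at all. The Python sketch below is a rough check only, not a full parser: it tests that the text ends with a semicolon and that the parentheses are balanced, and it does not handle quoted labels that contain parentheses. The function name is hypothetical.

```python
def looks_like_newick(text):
    """Rough check: the tree ends with ';' and every '(' has a matching ')'."""
    tree = text.strip()
    if not tree.endswith(";"):
        return False
    depth = 0
    for character in tree:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared before its matching '('.
                return False
    return depth == 0
```

A file that fails this check is often a truncated outtree or a file that holds an alignment rather than a tree.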



Rasmol

Click on rasmol.exe in the Rasmol folder. The input is a protein structure file, such as a PDB file.