Tutorial¶
Below we will go through an example analysis using some simulated data that is
available in the example/
folder in the
QCF GitHub repository.
Step 0: Get qcf
¶
If you don’t have the repo, clone it and install qcf. If you already have the QCF repository then go ahead and skip to the next step.
git clone https://github.com/pblischak/QCF.git
cd QCF
make
make test
sudo make install
Step 1: Look at Example Files¶
Now we’ll run the example analyses. First, we’ll change into the example/ folder in the QCF repo.
# cd into the example folder (wherever it is on your computer)
cd /path/to/QCF/example
# check out what files are there
ls
You should see the Phylip files containing the sequence data, the genes.txt
file containing
a list of all of the genes, and the map.txt
file.
# Aside: Making the gene list file can be done in the folder with all of the
# Phylip files using the following code within a terminal
ls -1 *.phy > genes.txt
Step 2: Run qcf
¶
The most basic analysis that we can do with qcf
is to just
calculate QCF scores with the genes.txt
and map.txt
files.
qcf -i genes.txt -m map.txt --prefix example1
If you want to incorporate uncertainty into the estimation of the QCF scores, there is an option to perform bootstrap resampling of sites within each gene. This will make the analysis slower depending on how many bootstrap replicates you perform.
# add -b <#> for bootstrapping
qcf -i genes.txt -m map.txt -b 500 --prefix example2
To calculate confidence intervals for the QCF scores, we’ll add the --printRaw
flag when calling qcf
. This will generate an extra output file that can be used
with the qcf_boot.py
script to
# add the --printRaw flag
qcf -i genes.txt -m map.txt -b 500 --prefix example3 --printRaw
Using the raw output from the previous step, we’ll use the Python script
qcf_boot.py
to resample gene-level quartets to calculate QCF values
and their 95% confidence intervals.
qcf_boot.py -i example3-raw.csv -b 500 --prefix resampled
This will generate the file resampled-boot.CFs.csv
. This file can be used to infer
a species tree with scripts from the TICR pipeline (Stenz et al. 2015), which are packaged
with QCF in the scripts/
folder (should be available if you ran sudo make install
).
Note
Citing TICR
If you use these scripts please be sure to cite the TICR pipeline:
Stenz, N. W. M., B. Larget, D. A. Baum, and C. Ane. 2015. Exploring Tree-Like and Non-Tree-Like Patterns Using Genome Sequences: An Example Using the Inbreeding Plant Species Arabidopsis thaliana (L.) Heynh. Systematic Biology 64:809–823.
To use these scripts, you will also need to install QuartetMaxCut, which is available here. The TICR README has a lot of helpful information for using these scripts as well.
# Get a tree topology using QuartetMaxCut
# Usage:
# perl get-pop-tree.pl <bootstrap QCF file>
perl get-pop-tree.pl resampled-boot.CFs.csv
# Estimate branch lengths in coalescent units
# Usage:
# Rscript --vanilla getTreeBranchLengths.R <bootstrap file prefix> <outgroup>
Rscript --vanilla getTreeBranchLengths.R resampled-boot 6
The resampled-boot.CFs.csv
file is also formatted to be analyzed using the SNaQ
species network inference method in the PhyloNetworks package. Documentation for running
SNaQ is available on the PhyloNetworks website.
Analyzing Genes in Parallel¶
If you have a large number of genes, it is possible to analyze smaller numbers of
genes separately and in parallel to make analyses more computationally efficient.
To do this, instead of listing all genes in one file, create several files listing
different groups of genes, analyze each one on its own (you can use the same map
file for each), and then combine them using the qcf_boot.py
script. Because
the results are combined using qcf_boot.py
, each analysis will have to be run
with the --printRaw
flag.
# First we'll analyze gene set 1
qcf -i genes1.txt -m map.txt -b 500 --prefix out1 --printRaw
# Now gene set 2
qcf -i genes2.txt -m map.txt -b 500 --prefix out2 --printRaw
Now we’ll calculate QCFs and their confidence intervals across the independent runs we just completed. The qcf_boot.py script is written such that it can combine the raw data across any number of independent runs.
#
qcf_boot.py -i out1-raw.csv out2-raw.csv -b 500 --prefix resampled2
If you have more than 2 input files, you can list them all after the -i
flag:
qcf_boot.py -i out1-raw.csv out2-raw.csv out3-raw.csv <...more files...> \
-b 500 --prefix resampled3
An easy way to list them all would be to do something like this:
qcf_boot.py -i $(ls *-raw.csv) -b 500 --prefix resampled4