Population genomics pipeline
The whole pipeline is freely available and easly reproducible through fully documented Jupyter Notebooks on Github. If you want to reproduce the analysis, please check the code thoroughly rather than copy/paste or execute directly Jupyter cells. Contact us if you have any questions!
We have designed and implemented a custom pipeline for analyzing the Drosophila Genome Nexus (Lack et al. 2015; Lack et al. 2016) and Human 1000GP Phase III (Sudmant et al. 2015; 1000 Genomes Project Consortium 2015) data, which could eventually be escalated to any available genomic data source. The pipeline pre-calculates the Derived Allele Frequency (DAF) and number of divergent synonymous and nonsynonymous sites, necessary to later perform on-the-fly MKTs, in 13,753 protein coding genes for 16 D. melanogaster populations and in 20,643 protein coding genes for 26 human populations of distinct geographical origin.
Figure 1. Population genetics pipeline overview.
2. Data retrieval
Human population genomic data:
Genome variation data and information of the ancestral state of the variants generated by the 1000GP Phase III (The 1000 Genomes Project Consortium), together with divergence between humans and chimpanzees, were retrieved from PopHuman (Casillas et al. 2018) in Variant Call Format (VCF). This included 84.4 million variants detected across 2,504 individuals from 26 different populations, mapped to the human reference genome version GRCh37/hg19. It was discarded reportedly inbred individuals (Gazal et al. 2015) and non-accessible nucleotides (The 1000 Genomes Project Consortium 2015). Genome annotations were retrieved from GENCODE (release 27). Recombination rate values associated to each protein coding gene were obtained from Bhérer (Bhérer et al. 2017) and correspond to the sex average estimates.
Genome variation data generated by the Drosophila Genome Nexus (Lack et al. 2015), together with divergence data between D. melanogaster and D. simulans, was retrieved from PopFly (Hervas et al. 2017) in multi-FASTA format. Only populations with at least 4 sampled genome sequences with less than 20% of missing or ambiguous nucleotides (after filtering by identity by descent, admixture, and heterozygosity) were included. We used the genome reference sequence and annotations corresponding to the 5.57 FlyBase annotation. DAF spectrum by functional classes was estimated by resampling a number of lines with nucleotide information (excluding undetermined sites) in each position without replacement. This was done to maximize the number of informative sites. The number of lines resampled in each population was chosen depending on the number of lines sequenced and the quality of those sequences. Positions and genes without valid information for at least this number of lines were discarded for the analysis. Information of the ancestral state was inferred from the comparison with D. simulans. Gene-associated recombination rate estimates were retrieved from Comeron (Comeron et al. 2012).
To compute humans divergence metrics, we used the differences between humans and chimpanzees (Pan troglodytes), as identified from a precomputed hg19-panTro4 alignment obtained from the VISTA browser (Poliakov et al. 2014) in multi-FASTA format (MFA). Specifically, the pairwise alignment was converted to VCF using custom scripts and merged with the 1000GP VCF files using Bcftools. For compute divergence statistics in Drosophila, we used the D. melanogaster and D. simulans alignment (Hu et al. 2013) in multi-FASTA format.
3. DAF and divergence estimation
MKT relies on polymorphism and divergence data on two types of sites in the genome: one putatively selected (usually 0-fold degenerated sites), and one putatively neutral (usually 4-fold degenerated sites). This implies knowing the functional class of each nucleotide in the genome, which is not trivial, since the same nucleotide can act as a coding site for one transcript while as an intron UTR site for another. To know the functional class of each nucleotide in both D. melanogaster and the human genome, each protein-coding position was re-annotated according to his aminoacid degenerancy from a reference sequence (FlyBase 5.57 and Gencode 27, respectively). In the case when multiple transcripts overlapped the same position, the most constrained functional class (0-fold degenerated) was considered. In this way we estimate the number of synonymous and non-synonymous changes on both data.