Getting Started
You will need:
- An AWS account
- A Sentieon License
- Conda
Creating and Setting up your Amazon Web Services (AWS) Account
If you do not have one already, please create an Amazon Web Services (AWS) account; the pipeline’s infrastructure is made up of several AWS Services (see Pipeline Infrastructure).
Depending on the current status of your AWS account and the number of samples on which you plan to call variants, you may need to raise your EC2 instance limits so that enough instances can run at that scale. You can request an increase on the Limits page under the EC2 dashboard in the AWS console. Requests may take some time to process, so submit them early.
By default, the pipeline makes use of the following instance types:
- c5.9xlarge, c5.18xlarge, r4.2xlarge, r4.4xlarge.
Pricing for each of these EC2 instance types is listed on the AWS Instance Pricing page.
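Before requesting a limit increase, it can help to check what your account currently allows. The following is a minimal sketch, assuming the AWS CLI is already installed and configured with credentials, and that your account's EC2 limits are exposed through the Service Quotas service. The function names and the `PIPELINE_INSTANCE_TYPES` variable are hypothetical helpers, not part of the pipeline.

```shell
# Instance types the pipeline uses by default (taken from the list above).
PIPELINE_INSTANCE_TYPES="c5.9xlarge,c5.18xlarge,r4.2xlarge,r4.4xlarge"

# Print which of the default instance types are offered in a region.
# $1: AWS region name (an assumption; use the region you plan to run in).
list_offered_instance_types() {
    aws ec2 describe-instance-type-offerings \
        --region "$1" \
        --location-type region \
        --filters "Name=instance-type,Values=$PIPELINE_INSTANCE_TYPES" \
        --query "InstanceTypeOfferings[].InstanceType" \
        --output text
}

# Print the EC2 quotas currently applied to the account in a region.
# $1: AWS region name.
list_ec2_quotas() {
    aws service-quotas list-service-quotas \
        --region "$1" \
        --service-code ec2 \
        --query "Quotas[].[QuotaName,Value]" \
        --output table
}
```

For example, `list_offered_instance_types us-east-1` followed by `list_ec2_quotas us-east-1` would show whether the default types are available and which limits apply; the region name here is only a placeholder.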
Download and Upload Reference Files to S3
The pipeline performs many operations that require reference files (e.g. the human reference genome FASTA and its indices). These must be uploaded to AWS S3 before the pipeline can be run. The standard reference files are provided by the Broad Institute’s GATK Resource Bundle. Currently, the pipeline supports two builds of the human reference genome: GRCh37 (hg19) and GRCh38 (hg38). GRCh37 files are located on the Broad Institute’s FTP site, while GRCh38 files are hosted on Google Cloud Storage.
To upload the reference files to AWS S3, you will need to install the AWS Command Line Interface; please see AWS CLI Installation. For instructions on uploading files to S3, please see the AWS S3 documentation.
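Once the reference files are downloaded, the upload itself might look like the sketch below. It assumes the AWS CLI is installed and configured, and that the files sit together in one local directory. The function name `upload_references`, the directory, and the bucket name in the usage example are all hypothetical; substitute your own.

```shell
# Upload a local directory of reference files to an S3 location.
# $1: local directory holding the downloaded reference files
# $2: destination S3 URI (bucket and prefix of your choosing)
upload_references() {
    if [ ! -d "$1" ]; then
        echo "error: $1 is not a directory" >&2
        return 1
    fi
    # "aws s3 sync" copies only files that are missing or changed at the
    # destination, so an interrupted upload can simply be re-run.
    aws s3 sync "$1" "$2"
}
```

A call such as `upload_references ./reference/GRCh38 s3://my-reference-bucket/GRCh38/` would then copy the directory contents to the bucket. `aws s3 sync` is used here rather than `aws s3 cp --recursive` so that re-running the command skips files that are already in place.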
Obtain a Sentieon License File
The pipeline currently relies solely on Sentieon for its haplotyping and joint genotyping steps, so you must contact Sentieon and obtain a license before you can use the pipeline. Sentieon also offers a free trial.
Install Conda and your Dev Environment
In order to run the pipeline, you will need to install Conda.
- If you currently have Python 2.7 installed, pick the Python 2.7 installer.
- If you currently have Python 3.6 installed, pick the Python 3.6 installer.
- Run the installer. The defaults should be fine.
Then, create a Python 3.6 environment:
$ conda create -n psy-ngs python=3.6
Activate the newly created environment (you may need to start a new terminal session first):
$ source activate psy-ngs
You can verify that the environment is active by checking the Python version (if it differs from that of your base installation):
$ python --version
You should also see the environment name prepended to your shell prompt, e.g.:
(psy-ngs) $ echo "See the environment name?"
After activating the environment, install the pipeline’s Python dependencies:
(psy-ngs) $ cd path/to/this/repo/
(psy-ngs) $ pip install -r requirements.txt
Next Steps
You are almost ready to run the pipeline. Next, you will need to configure it with your run specifications. Please see Your Pipeline Run to continue.