Tools

Git
Snakemake
bcftools
jq
vd
rsync
Docker
WDL
Makefile
FTP
For file conversion
- Misc
- PDF

Git

Commit all modification and added files: git commit -am "informative message"
To show all the history of a file: git log -p -- file
To retrieve a specific version of a file: git show COMMIT:file
Revert repo to a specific commit: git checkout COMMIT
Undo a commit: git reset HEAD~ and then for the real commit git commit -c ORIG_HEAD.
Add all untracked files: git st -s | grep '??' | cut -f2 -d ' ' | xargs git add
Unstage a file (without reverting its potential changes): git restore --staged FILE
Unstage all files git reset
Add remote e.g. after a fork: git remote add mine git@github.com:jmonlong/REPO.git
Activate credential saving (avoid retyping username/passwords, e.g. when SSH keys don’t work): git config credential.helper store

Aliases

git config --global alias.co checkout
git config --global alias.ci commit
git config --global alias.st status
git config --global alias.br branch
git config --global user.email '<EMAIL>'
git config --global user.name 'Jean Monlong'

Branches

List branches: git branch
List all branches: git branch -a
Update remote branch list: git remote prune origin
Create branch: git checkout -b hotfix
Link it to a remote branch: git branch -u origin/hotflix
Creat a new local branch from remote: git co -t origin/hotfix
Merge the current branch with another branch: git merge hotfix
Delete a branch: git branch -d hotfix
Delete remote branch: git push origin :hotfix

Submodules

How-To at devconnected.com
Clone a repo with submodules: git clone --recursive https://github.com/XXX/XXX.git
Update sub-modules: git submodule update --init --recursive

Fetch new submodule commits:

## in the submodule
git fetch
git checkout <COMMIT>
## in the main repo
git add .
git commit -m "updated the submodule"
git push

Check status of all repos

I have an alias calling the following commands:

WD=`pwd`
for ff in `find . -maxdepth 5 -name .git`
do
    GDIR=`dirname $ff`
    echo $GDIR
    cd $WD/$GDIR
    git st -s
    git st | grep ahead
done
cd $WD

Snakemake

Documentation

A few trick I’d like to remember or try soon:

Double “{{”/“}}” to escape { } in expand or NOT to define wildcards.
Define regexp constraints on wildcards as {data,\d+} or using wildcard_constraints (within a rule or globally).
Passing a Python/R script: directly.
Output file marked as temp() are deleted when not needed by any rules anymore.
Output file marked as touch() for flag files.
Local rules when running on a HPC.
Use remote (S3) files with remote()

Configuration files used as config["samples"]. Can also be used as snakemake --config yourparam=1.5. For sample metadata, a tabular configuration can also be used using Pandas. It’s also possible to define a separate config file for the cluster configuration (e.g. resources for each rule).

To use more flexible inputs, use a function and unpack:

def input_func(wildcards):
    return {'file1': '{wildcards.val}.txt'.format(wildcards=wildcards)}
rule myrule:
    input:
        unpack(input_func)
    output:
        "output.{val}.txt"
    shell: " ... {input.file1} ..."

bcftools

Rename sample

echo SAMPLENAME > temp.txt
bcftools reheader -s temp.txt input.vcf.gz > output.vcf.gz

Remove INFO fields and keep GT only

bcftools annotate -x INFO,^FORMAT/GT input.vcf.gz

Extract a sample and keep only sites with an alt allele

bcftools view -c 1 -s SAMPLENAME input.vcf.gz

Add AN,AC,AF to the INFO

bcftools +fill-tags input.vcf.gz -Oz -o output.vcf.gz --threads 4 -- -t AN,AC,AF

jq

Lesson at programminghistorian.org

Select elements based on one field’s value: jq 'select(.field==value)'
Keep only desired fields: jq '{id: .id, title: .title}'
Write in TSV: jq '.array | @tsv'

vd

vd can read many file formats, including TSV, CSV, JSON. I use it to explore TSV files as a more powerful less. It’s great to format wide columns but also to quickly explore summary stats of the table.

Keybindings:

Ctr-H or z? triggers the help page
_ expand/contract column.
z_ <N> set current column width to N.
/ regex search in current column
g/ regex search in all columns
n/N move to next/previous match
[/] sort ascending/descending by current column
Select rows and filter:
- | select by regexp in current column
- , select rows matching current cell in current column
- z" copy selected rows to new sheet
F toggle a frequency table/histogram of the current column.
- Enter to focus on a subset defined by a row in frequency table.
I toggle Describe sheet with summary statistics for each column.
. toggle dot plot.
- Make sure to set a column as numeric with #.
- Eventually select x-axis or labels with ! first.

rsync

rsync is not completely intuitive to me. Here are some of the commands I could make work.

To recurrently sync all the files that match the patterns in rsyncIncludes.txt:

rsync -r --include='*/' --include-from=rsyncIncludes.txt --exclude='*' --prune-empty-dirs SRC DEST

To recurrently sync all the files that match the patterns in rsyncIncludes.txt EXCEPT some with a specific pattern. Practical example: all the R scripts but not the ones created by BatchJobs in *-files directories:

rsync -r --exclude="*-files" --include='*/' --include='*.R' --exclude='*' --prune-empty-dirs SRC DEST

Docker

Build a docker instance

Write a Dockerfile :

WORKDIR /root sets the working directory.
COPY PopSV_1.0.tar.gz ./ copies a file in the instance. The / is important !
There is a cache management system so it’s important to keep related commands in the same RUN.

To run in the folder with the Dockerfile.

docker build -t jmonlong/popsv-docker .

Ignore (big) files fro the build context using a .dockerignore file.

To make setup a time zone for a ubuntu-based container, use tzdata in the Dockerfile:

RUN apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get -y install tzdata
ENV TZ=America/Los_Angeles

The time zone can also be changed in run command using -e TZ=America/Los_Angeles. List of time zone codes

To build smaller images, use less layers and a smaller base image (e.g. Alpine images).

Launch a docker instance

To launch an interactive instance with a shared folder:

docker run -t -i -v /home/ubuntu/analysis1:/root/analysis1 jmonlong/popsv-docker

-t and -i are used for interactive run.
-v links folder in the host with folder in the image. It must be absolute paths.
Sometimes use bash as the command to force interactive, or --entrypoint /bin/bash if the image uses an ENTRYPOINT.
To make sure the files created by the container have the appropriate owner use: -u `id -u $USER`.

Increase memory

In Mac OS, I had some problems with the docker stopping because of memory issues. I fixed by changing:

docker-machine stop
VBoxManage modifyvm default --cpus 3
VBoxManage modifyvm default --memory 8192
docker-machine start

Clean up space

To remove all images:

docker rm -vf $(docker ps -a -q)
docker rmi -f $(docker images -a -q)

To clean cache too:

docker system prune -a

WDL

A few useful commands/trick/remainders for WDL (see full specs).

size(file_name, 'G') to get the size in Gb (e.g. for dynamic disk space allocation). Used with ceil/round to get a round number.
and=&&, or=||
~{true="yes" false="no" boolean_value} to inject different strings in the command section, depending on the value of a Boolean.
~{sep="," array_value} to “join” an array into a string.
Int value = if othervalue < 5 then 1 else 4
glob("*.bam") to cath output files for example.
String out_prefix = basename(in_gam_file, ".gam") to get the basename of a file and strip a suffix too.
String out_prefix = sub(sub(sub(basename(in_gam_file), "\\.gz$", ""), "\\.gaf$", ""), "\\.gam$", "") another way for when the extension is variable.
File sel_file = select_first([OPTIONAL_INPUT_FILE, task.output_file]) to select the first defined arg in an array (e.g. when a task can recreate an optional input).
flatten([array1, [new_element]]) to add new_element to an existing array1.
Array[Pair[File,File]] paired_files_list = zip(file_list_1, file_list_2), e.g. to scatter across two lists. In the scatter, access with .left/.right.
Start command section with set -eux -o pipefail to stop the job at the first error, even in a pipe.
Simple workflow with typical structure

If an input is a (long) array, use it through a file with one element per line (see in WDL Spec):

input {
    Array[File] in_vcf_list
}
command <<<
while read invcf
do
    command $invcf
done < ~{write_lines(in_vcf_list)}
## or
command -f ~{write_lines(in_vcf_list)}
>>>

Makefile

GNU make manual
Makefile basics from Isaacs.
Automatic variables
- $@ the target
- $< the first prerequisite
- $^ all prerequisites
- $(@D) the directory part of $@ (works the same with <,^, etc).
- $(@F) the filename part of $@ (works the same with <,^, etc).
File name functions
- $(notdir src/foo.c) returns foo.c.
- $(addsuffix .c,foo) returns foo.c.
- $(basename dir/foo.test.c) returns dir/foo.test.
- objects := $(wildcard *.o) to list files in a variable.
String functions
- $(patsubst %.c,%.o,file.c) to substitute file extensions
- $(subst from,to,text) replaces all occurrences of from to to in text.
- $(word n,text) returns the n-th word in text.
Shell function: $(shell cat foo) runs a shell command.

FTP

Basic commands:

cd/ls to change directory/list files in the remote location
get to copy a file from the remote location to the local machine
put to copy a file from the local machine to the remote location
delete to remove a file in the remote location
lcd to change directory in the local machine
help to list FTP commands

For file conversion

Misc

SVG to PDF: inkscape --file=in.svg --export-area-drawing --without-gui --export-pdf=out.pdf
Video to mp3: ffmpeg -i in.m4a -acodec mp3 -ac 2 -ab 192k out.mp3
HTML page to PDF: pandoc -o out.pdf --include-in-header h.tex URL where h.tex could contain LaTeX packages declarations like \usepackage{fullpage}.

PDF

to EPS

I ended up using Inkscape in command-line mode. The result is not so bad (better than the pdf2eps etc).

inkscape document.pdf --export-eps=document.eps

Apparently, pdftops is even better.

pdftops -eps document.pdf

to PDF/A

In the end I had to use Acrobat Reader Pro… Still, converting the PDF using the following commands beforehand helped (otherwise Acrobat Reader Pro couldn’t convert it):

gs -dPDFA=1 -dBATCH -dNOPAUSE -dEmbedAllFonts=true -dSubsetFonts=false -dHaveTrueTypes=true -dPDFSETTINGS=/prepress -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=mainPDFA.pdf main.pdf

On the other hand, passing by a .ps stage as recommended here, produced a smaller PDF that was directly PDF/A compliant (no need for Acrobat Reader Pro) but lost all cross-reference links :(

pdftops main.pdf main.ps
gs -dPDFA -dBATCH -dNOPAUSE -dNOOUTERSAVE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=mainPDFA.pdf main.ps

To check for PDF/A compliance I used this online validator or Acrobat Reader Pro. Another way to check for problems is to look at the emb column of pdffonts main.pdf (should be all embedded) and the type column of pdfimages -list main.pdf (should be all image).

Note: this is based only on my one-time experience with the PDF of my thesis.

Adding a signature to a PDF

On Ubuntu, I digitally add signatures using Xournal. It’s ugly but the output PDF is left unchanged and it’s easy to place the signature (from a .png file for example) and export to PDF.

Remove watermark

Some articles have diagonal watermarks behind the text. It makes it more difficult to select and copy text. Personally, it’s annoying when I want to highlight/annotate a PDF.

Sometimes it’s as easy as uncompressing the PDF and replacing the text of the watermark. From StackExchange:

pdftk original.pdf output uncompressed.pdf uncompress 
sed -e "s/watermarktextstring/ /" uncompressed.pdf > unwatermarked.pdf
pdftk unwatermarked.pdf output fixed.pdf compress

Sometimes, the watermark text is not there in the PDF (e.g. the letters are split or the font uses unicodes or something). That’s my experience with Nature papers with the annoyingly big “ACCELERATED ARTICLE PREVIEW” watermark. In this situation, I had to look at the uncompress PDF (e.g. cat uncompress.pdf | less) to guess the block with the watermark and then mess it up (e.g. its text matrix field Tm):

sed -e "s/33.94110107 33.94110107 -33.94110107 33.94110107 5.8685999 122.48609924 Tm/0 0 0 0 0 0 Tm/g" < uncompress.pdf > unwatermarked.pdf

for a block that looked like:

...
BT
0 0 0 rg
/GS2 gs
/C2_1 1 Tf
0 Tc
0 Tw
33.94110107 33.94110107 -33.94110107 33.94110107 5.8685999 122.48609924 Tm
(^@$)Tj
.695 0 Td
(^@&)Tj
.722 0 Td
(^@&)Tj
.695 0 Td
(^@\()Tj
.61199998 0 Td
(^@/)Tj
.61199998 0 Td
(^@\()Tj
.695 0 Td
(^@5)Tj
.695 0 Td
(^@$)Tj
.639 0 Td
(^@7)Tj
.639 0 Td
(^@\()Tj
.695 0 Td
(^@')Tj
.361 -.85799998 Td
(^@^C)Tj
...

Hippocamplus My Second Memory