Some notes on my recent attempt to learn Python.
To read the standard input:
import fileinput
for line in fileinput.input():
    # something with line
To read a file line by line:
for line in open(input_file, 'rt'):
    line = line.rstrip('\n')
Or in one line:
lines = [line.rstrip('\n') for line in open(input_file, 'rt')]
To write a file:
f = open(output_file, 'wt') # 'a' for append mode
f.write('something' + str(an_integer) + '\n')
f.close()
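The same thing with a context manager, which closes the file automatically even if an error occurs (file name made up for the example):

```python
# 'wt' opens for writing text; the with block closes the file on exit
with open('example_out.txt', 'wt') as f:
    f.write('something' + str(42) + '\n')
```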
For gzipped files, use gzip module:
import gzip
f = gzip.open('file.txt.gz', 'rt')
f = gzip.open('file.txt.gz', 'wt')
To quickly read a file that was manually indexed (e.g. with grep -b), use seek to jump to a byte offset and tell to save the current byte offset.
f = open(reads_fn)
f.seek(bos)
line = next(f)
f.close()
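A self-contained sketch of how tell and seek pair up (file name and contents are made up):

```python
# build a small two-line file, then use tell/seek to jump back to line 2
with open('offsets_demo.txt', 'wt') as f:
    f.write('first line\n')
    f.write('second line\n')

f = open('offsets_demo.txt')
f.readline()        # consume the first line
bos = f.tell()      # byte offset of the start of the second line
f.seek(0)           # move somewhere else...
f.seek(bos)         # ...and jump straight back
line = f.readline()
f.close()
```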
To save Python objects, use the pickle module:
import pickle
pickle.dump(obj1, open("obj1.pkl", "wb"))
obj1 = pickle.load(open("obj1.pkl", "rb"))
Note the b that specifies binary mode, which pickle requires since its output is not text.
To prompt the user for input:
x = input('What is? ')
The quick way is to use sys.argv, like this:
import sys
in1 = sys.argv[1]
The fancier way is to use argparse. I usually use it like this (see the docs for a more complete example):
import argparse
parser = argparse.ArgumentParser(description='Do something cool.')
parser.add_argument('-in', dest='input', help='the input file', required=True)
parser.add_argument('-k', dest='k', default=3, type=int, help='an integer')
parser.add_argument('-bool', dest='bool', action='store_true', help='False by default')
parser.add_argument('-out', dest='output', help='the output file')
args = parser.parse_args()
print(args.input)
print(args.k)
print(args.output)
To use argparse but still read the standard input with fileinput:
import argparse
import fileinput
# argparse definition
for line in fileinput.input(files='-'):
    # something with 'line' and the arguments from argparse
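A runnable sketch of the combination (the -k option and file name are made up; pass files='-' to actually read the standard input):

```python
import argparse
import fileinput

parser = argparse.ArgumentParser(description='Count lines.')
parser.add_argument('-k', dest='k', default=1, type=int, help='made-up multiplier')
# parse_known_args leaves unrecognized arguments (e.g. file names) untouched
args, leftover = parser.parse_known_args(['-k', '2'])

# a small input file stands in for the standard input here
with open('stdin_demo.txt', 'wt') as f:
    f.write('a\nb\nc\n')

total = 0
for line in fileinput.input(files=['stdin_demo.txt']):
    total += args.k
```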
arr = [3]
arr.append(19)
for elt in arr:
    print(elt)
list(myarray) makes a copy of the array without going through the copy module.
numpy arrays are more efficient when the data needs to be manipulated/combined. For example they allow some vectorized operations:
import numpy
pred_counts = numpy.zeros((4, len(rfc.classes_)))
pred_counts[0, ] += sum(predprobs > .5)
To convert a string into an array: numpy.fromstring(line, dtype=int, sep='\t').
The opposite would be '\t'.join(map(str, arr)).
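A round trip between a tab-separated string and an array; note that numpy.fromstring with sep= is deprecated in recent NumPy, so this sketch splits the string instead:

```python
import numpy

line = '1\t2\t3'
# same result as numpy.fromstring(line, dtype=int, sep='\t')
arr = numpy.array(line.split('\t'), dtype=int)
# and back to a tab-separated string
back = '\t'.join(map(str, arr))
```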
Hash tables or dictionaries hold unordered sets of key/value pairs.
a_hash = {'bob': 42, 'kevin': 7}
It’s much faster to check the presence of an element in a set than a list. For example:
all_words = set()
all_words.add('word') # etc
'word' in all_words
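A quick way to see the difference, timing both lookups (sizes and repetitions are arbitrary):

```python
import timeit

big_list = list(range(100000))
big_set = set(big_list)

# list membership scans the list; set membership is a hash lookup
t_list = timeit.timeit(lambda: 99999 in big_list, number=100)
t_set = timeit.timeit(lambda: 99999 in big_set, number=100)
```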
'tempString' + another_string + str(an_integer)
line.split('\t')
rstrip removes trailing characters, by default whitespace. s.rstrip('\t\n') removes any combination of tabs and newlines.
a_string.replace(":", "_") to replace characters.
' '.join(a_string_array) to merge an array into one string.
For regular expressions:
import re
pattern = re.compile('something:(.+)')
mres = pattern.search(line)
mres.group(1)
To decode/encode a string, e.g. when reading/writing files and not using the text mode t:
cur_line.decode('ascii')
cur_line.encode('utf-8')
def functionname(parameters):
    """function docstring"""
    # function body
    return expression
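For instance, a small concrete function following that template (name and behavior made up):

```python
def gc_content(seq):
    """Return the fraction of G and C characters in a sequence."""
    gc = sum(1 for base in seq if base in 'GC')
    return float(gc) / len(seq)
```

For example gc_content('ATGC') returns 0.5.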
import pandas as pd
df = pd.DataFrame([{'c1': 1, 'c2': 'val2'},
                   {'c1': 0, 'c2': 'val3'}])
df = pd.read_csv('file.tsv', sep='\t', header=0)
df = df.sort_values(['c1'], ascending=False)
df.to_csv('file.tsv', sep='\t', index=False)
for row in df.itertuples():
    # something with row.c1
import threading
def task(x, results, index):
    results[index] = x  # placeholder: something with x
threads = [None] * 10
results = [None] * 10
## assuming the inputs are in an array called inputs
for ii in range(len(threads)):
    threads[ii] = threading.Thread(target=task, args=(inputs[ii], results, ii))
    threads[ii].start()
for ii in range(len(threads)):
    threads[ii].join()
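Because of the GIL, threads mostly help for I/O-bound work. The same fan-out pattern needs less bookkeeping with concurrent.futures from the standard library (the task function here is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    return x * x  # placeholder for the real work

inputs = list(range(10))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, inputs))  # results keep the input order
```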
_-separated names for modules and functions, e.g. my_function.
CamelCase for classes, e.g. MyClass.
A _ prefix for private variables, e.g. _secret_variable.
There should be a __init__.py file in each directory containing modules to import. It can be empty; if not, the code inside is run on import.
To import the classes/functions defined in a file module1.py, simply do:
import module1
module1.fun()
Minimal structure:
├── __init__.py
├── mypackage
│ ├── __init__.py
│ ├── submod1.py
│ ├── submod2.py
├── setup.py
__init__.py
files can be (and usually are) empty.
The setup.py file contains metadata about the package and lists the dependencies and files to include.
Once set up, the package can be installed locally with pip install . (note the trailing dot).
Other setup.py commands produce a tarball or upload the code to PyPI.
More details at “How to package your python code”.
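A minimal setup.py along those lines (all names and the version are placeholders):

```python
from setuptools import setup

setup(name='mypackage',              # placeholder package name
      version='0.1',
      description='Does something cool',
      packages=['mypackage'],
      install_requires=[])           # list dependencies here
```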
For example, when running a script on an HPC cluster, it's often easier to install modules in your home directory.
If the package can be installed with pip, I use pip install --user packageName.
Otherwise using the setup.py
method, I first initialize the local library by creating a directory and updating PYTHONPATH
.
mkdir -p /home/monlongj/pylib/lib/python2.7/site-packages/
PYTHONPATH=$PYTHONPATH:/home/monlongj/pylib/lib/python2.7/site-packages/
export PYTHONPATH
Then to install a module (e.g. pyfaidx) I did:
wget https://github.com/mdshw5/pyfaidx/archive/v0.4.7.1.tar.gz
tar -xzvf v0.4.7.1.tar.gz
cd pyfaidx-0.4.7.1/
python setup.py install --prefix=/home/monlongj/pylib
To use this, I must always have the local library in PYTHONPATH
.
Modules (as in module load ...) might be a cleaner solution. I use existing modules but I haven't taken the time to learn how to create and use my own.
A histogram with matplotlib:
import matplotlib.pyplot as plt
plt.hist(x)
plt.xlabel('x label')
plt.ylabel('y label')
plt.show()
Scatterplot:
plt.plot(xy[1], xy[0], 'o')  # 'o' plots points; without it plot draws lines
plt.show()
List files with glob, remove with os, remove non-empty directories with shutil.
import glob
import os
import shutil
filelist = glob.glob('temp*')
for f in filelist:
    os.remove(f)
shutil.rmtree('myDir')
if os.path.isfile(filen):
    f = open(filen)
Run commands with subprocess. Redirect output to /dev/null to avoid annoying messages.
import subprocess
bwa_cmd = ['bwa', 'mem', bwaidx_file, fastq_file]
dump = open('/dev/null', 'w')  # must be opened for writing; or use subprocess.DEVNULL
bwa_out = subprocess.check_output(bwa_cmd, stderr=dump)
dump.close()
For logging, add date to the output with:
import datetime
print(datetime.datetime.now())
Timing a function with timeit:
from timeit import Timer
def test():
    global myobj
    myobj.fun()
Timer(test).timeit(number=10)
A built-in way to monitor resident memory usage:
import tracemalloc
tracemalloc.start()
# do something
mem_size, peak_mem_size = tracemalloc.get_traced_memory()
peak_mem_gb = round(peak_mem_size/(1024*1024*1024), 5)
tracemalloc.stop()
The main BioPython documentation is available here. Below is what I ended up using.
from Bio.Seq import MutableSeq
from Bio.Alphabet import generic_dna
mseq = MutableSeq('ATGCTAGCT', generic_dna)
len(mseq)
str(mseq[3:6])
To simulate sequences, I ended up using numpy arrays (I think they could take a list as indexes).
import numpy
import random
from Bio.Seq import MutableSeq
from Bio.Alphabet import generic_dna
_nuc = numpy.array(["A", "T", "C", "G"])
def randSeq(length):
    seqArray = _nuc[[int(random.random() * 4) for i in range(length)]]
    return MutableSeq("".join(seqArray), generic_dna)
To read a fasta file:
from Bio import SeqIO
for record in SeqIO.parse(fasta_file, "fasta"):
    print(record.id + '\t' + str(len(record.seq)))
To write a fasta:
from Bio.SeqRecord import SeqRecord
recs = []
for ii in range(1000):
    seq = randSeq(100)
    recs.append(SeqRecord(seq, id=str(ii)))
SeqIO.write(recs, "reads.fa", "fasta")
To read/write a gzipped fasta file:
inf = gzip.open('input.fa.gz', 'rt')
recs = []
for rec in SeqIO.parse(inf, 'fasta'):
    ...
SeqIO.write(recs, gzip.open('reads.fa.gz', 'wt'), "fasta")
I use pyfaidx
(see repo) to quickly access slices of an indexed fasta. The indexing can be performed by samtools faidx
or functions from the package.
from pyfaidx import Fasta
fa = Fasta(args.ref)
seq = fa[str(ch)][start:end]
print(seq.id + '\t' + seq.seq)
Here is a short example on how to get reads from a region.
import pysam
bamfile = pysam.AlignmentFile(bam_fn, "rb")
reads_reg = bamfile.fetch(reference=ch_reg, start=start_reg, end=end_reg)
reads_seq = {}
for read in reads_reg:
    reads_seq[read.query_name] = read.query_alignment_sequence
To iterate over all the reads (faster) use:
for aln in bamfile.fetch(until_eof=True):
    ...
If the BAM is indexed, the number of mapped and unmapped reads is stored in bamfile.mapped
and bamfile.unmapped
.
Using the PyVCF module.
import vcf
vcf_reader = vcf.Reader(open('var.vcf', 'r'))
for record in vcf_reader:
    if record.INFO['AF'][0] < 0.1:
        print(record.ID)
The fields are accessed through:
record.CHROM
record.POS
record.ID
record.REF
record.ALT
record.QUAL
record.FILTER
record.INFO
Use xrange instead of range in Python 2: it's faster and more memory efficient (in Python 3, range already behaves like this).
sorted(a_list, key=lambda k: -something[k]) to sort with a custom key.
unique, counts = numpy.unique(array1, return_counts=True) to count occurrences.
random.sample(a_list, 10) to sample 10 elements.
Skip an iteration with continue, or leave the loop with break.
numpy.where(x > 0) to find the indexes of positive elements.
quit() to stop a program.
myVar is None to test if a variable is None.
os.getcwd() for the current working directory (after importing os).
When filling a nested dictionary, it's painful to always test if the key exists before updating its value. One trick is to use try/except. It's not that much quicker but it looks fancier so you forget about the pain:
try:
    nested[lev1].append(i)
except KeyError:
    nested[lev1] = [i]
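Another option is collections.defaultdict from the standard library, which removes the need for the try/except entirely (keys and values here are made up):

```python
from collections import defaultdict

nested = defaultdict(list)   # missing keys start out as empty lists
nested['lev1'].append(1)
nested['lev1'].append(2)
nested['lev2'].append(3)
```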
To time a function, one simple way is to use time.time()
before and after the function and report the difference. There might be issues with Windows but I use timing for internal benchmarks, never in final code.
To convert a binary number into a decimal number: int('11011011', 2)
.
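The reverse direction uses bin:

```python
x = int('11011011', 2)   # binary string to integer
b = bin(x)               # back to a binary string, with a '0b' prefix
```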
To get the index as well as the value in a list, use enumerate
:
for idx, val in enumerate(myList):
    print(idx, val)
%%capture at the beginning of a cell means the stdout/stderr will be captured (i.e. not shown).
!pip install XXX to install packages.
The main keyboard shortcuts I use:
Ctrl+Enter: run cell.
Shift+Enter: run cell and select below.
b: insert cell below.
y: change cell to code; m: to markdown; 1: to heading 1.
d d: delete cell; undo cell deletion with z.
i i: interrupt kernel.
0 0: restart kernel.
h: show keyboard help.
Ctrl+] (Ctrl+[): indent (dedent).
Ctrl+Shift+-: split cell.