Hippocamplus My Second Memory

R

Details to remember

  • Use rbindlist from data.table package for a memory-optimized and faster do.call(rbind, list(..)).
  • Use system2 instead of system to run a command. It’s more portable apparently.
  • Get RPubs working with options(rpubs.upload.method = "internal").
  • Operations on dates using strptime(x, "%a %b %d %T PST %Y'") and difftime and then as.Double(d, units="hours").

Order in condition assessment

Using & and | operators, R tries all the conditions and then performs the operations. However, sometimes we would like a smarter sequential assessment for AND. For example, we get an error if we run:

x = NULL
if(!is.null(x) & x>10) message("so big !")

That’s because it tries to do x>10 when x is NULL. Here, what we want is &&:

x = NULL
if(!is.null(x) && x>10) message("so big !")
x = 17
if(!is.null(x) && x>10) message("so big !")

Now it won’t try to do x<10 if !is.null(x) is not true (because what’s the point, anything “False AND …” is for sure False). Caution, && doesn’t work on vectors (it will only test the first element).

Read files in chunk

To avoid memory problems, I sometimes had to read a file by chunk. (This is usually not the right way to do things, more of a quick-and-dirty way to try something to gauge if it’s worth implementing something in Rcpp etc.)

Here are bits of code to help with that.

## open connection to file
con = gzfile(args[1], 'r')
## read one line at a time
while(length((line.r=readLines(con, 1)))>0){
  ## something on line.r (character containing the line), e.g. 
  line.s = unlist(strsplit(line.r, '\t'))
  ...
}

Using readLines(con, N) with larger N, a larger chunk of the file can be read at a time. In that case, it can be useful to parse the chunk (vector of character element) into a data.frame using read.table/read.csv with:

lines.r=readLines(con, 100)
nodes.df = read.csv(textConnection(lines.r), as.is=TRUE, header=TRUE)

Graphs

  • pdf("g.pdf", 9, 7)
  • png("g.png", 1300, 1000, res=200)
  • options(device=function() pdf(width=9, height=7)) to set the default device (e.g. remote graphs).

Palette with many colors

When I need many colors and want to distinguish consecutive classes (e.g. bar graphs or overlapping clusters), I use an interleaved rainbow-like palette:

interk <- function(x, k=4){ # Interleaves elements in x
  idx = unlist(lapply(1:k, function(kk) seq(kk, length(x), k)))
  x[idx]
}
pal = interk(rainbow(20, s=.8), 5)

For example:

ggplot(bar.df, aes(x=x, fill=c)) + geom_bar() + coord_flip() +
  guides(fill=guide_legend(ncol=4))

ggplot(bar.df, aes(x=x, fill=c)) + geom_bar() + coord_flip() +
  guides(fill=guide_legend(ncol=4)) + 
  scale_fill_manual(values=pal)

ggplot2 tricks

  • To plot a density distribution without the x-axis line, use stat_density(geom="line") (and eventually position="dodge" if plotting several groups).
  • To override the legend’s aes: guides(colour=guide_legend(override.aes=list(alpha=1)))
  • Multi-column legends: guides(fill=guide_legend(ncol=4))
  • Add more vertical space between legend labels (e.g. for multi-lines labels): guides(fill=guide_legend(keyheight=2))
  • Don’t show a in legend when using text (e.g. when already having points and colors): geom_text(show.legend=FALSE)
  • Keep “dodged” bars the same width: position=position_dodge(.9, preserve='single')
  • Change direction of multi-legends: theme(legend.box='vertical')
  • Remove some white space between legend and graph, or where the title used to be: theme(legend.margin=margin(-5))

Multi-panel ggplots

In general we can use grid.arrange. For example:

p1 = ggplot(...) + ...
p2 = ggplot(...) + ...
p3 = ggplot(...) + ...
grid.arrange(p1, p2, p3, heights=c(2,1), layout_matrix=rbind(c(1,1), 2:3))

Repositioned titles as panel legend

We can add A), B) etc as a title for each panel. To reposition them, for example near the corner:

p1 = p1 + ggtitle('A') + theme(plot.title=element_text(hjust=-.05, vjust=-3))

hjust/vjust may need to be adjusted to fit the graph type and aspect ratio.

Aligning the x-axis

Sometimes we might want to have two graphs, one on top of the other, with their x-axis aligned.

One easy way is to use the tracks function in the ggbio package. However, I don’t really like this package because it sometimes conflicts with ggplot2 (boo!) and you end up having to specify ggplot2:: to the functions to avoid obscure errors.

I found another way on the internet:

library(ggplot2)
library(gridExtra)
p1 <- ggplot(...
p2 <- ggplot(...
p1 <- ggplot_gtable(ggplot_build(p1))
p2 <- ggplot_gtable(ggplot_build(p2))
maxWidth = unit.pmax(p1$widths[2:3], p2$widths[2:3])
p1$widths[2:3] <- maxWidth
p2$widths[2:3] <- maxWidth
grid.arrange(p1, p2, heights = c(3, 2))

Otherwise, there is always adding manually margins to align them…

ggp.panels = grid.arrange(
    ggp.1 + theme(axis.text.x=element_blank(), axis.title.x=element_blank(),
                  plot.margin = margin(.1,.1,.1,.35, "cm")), 
    ggp.2 + theme(plot.margin = margin(.1,1.5,.1,.8, "cm")),
    heights=c(2,3))

Change font in ggplot2

library(extrafont)
font_import(pattern='Comic')
loadfonts()
qplot(x=rnorm(100)) + geom_histogram() + theme(text=element_text(family="Comic Sans MS")) + ggtitle('Ouch')

fonts() to check which fonts are imported by extrafont, names(pdfFonts()) to list the fonts available (loaded).

More in this blog post.

Waffle graphs

waffle package provides a waffle and iron function. For example:

iron(
  waffle(c(thing1=0, thing2=100), rows=5, keep=FALSE, size=0.5, colors=c("#af9139", "#544616")),
  waffle(c(thing1=25, thing2=75), rows=5, keep=FALSE, size=0.5, colors=c("#af9139", "#544616"))
)

Rmarkdown

  • \ to force a line break and add vertical spacing (e.g. in slides).

To define knitr parameters, I add a chunk at the beginning of the Rmarkdown document. For example:

```{r include=FALSE}
knitr::opts_chunk$set(echo=FALSE, message=FALSE, warning=FALSE, fig.width=10)
```

Zoom-on-click images

To get this feature on an HTML report, add anywhere (or by including an html file e.g. with includes: before_body:)

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>

<style>
.zoomDiv {
  opacity: 0;
  position:fixed;
  top: 50%;
  left: 50%;
  z-index: 50;
  transform: translate(-50%, -50%);
  box-shadow: 0px 0px 50px #888888;
  max-height:100%; 
  overflow: scroll;
}

.zoomImg {
  width: 100%;
}
</style>

<script type="text/javascript">
  $(document).ready(function() {
    $('body').prepend("<div class=\"zoomDiv\"><img src=\"\" class=\"zoomImg\"></div>");
    // onClick function for all plots (img's)
    $('img:not(.zoomImg)').click(function() {
      $('.zoomImg').attr('src', $(this).attr('src'));
      $('.zoomDiv').css({opacity: '1', width: '100%'});
    });
    // onClick function for zoomImg
    $('img.zoomImg').click(function() {
      $('.zoomDiv').css({opacity: '0', width: '0%'});
    });
  });
</script>

All the images can then be clicked on to display a zoomed in display. Click again on the image to close the zoomed image.

Note: It required pandoc version >2 to load the javascript lib.

Other options I haven’t tried yet:

Beamer presentation

Some useful options to put in the YAML header:

title: The Title
subtitle: The Subtitle
author: Jean Monlong
date: 11 Oct. 2016
output:
  beamer_presentation:
    slide_level: 2
    fig_width: 7
    includes:
      in_header: header.tex
    toc: true
    dev: png
    keep_tex: true
  • slide_level defines the header level to be considered as a new slide.

To add slide count I put this on the header.tex:

\setbeamertemplate{navigation symbols}{}
\setbeamertemplate{footline}[page number]

Wide tables

To resize wide tables I use a hook that surround a chunk with \resizebox command, defined in the non-included chunk:

```{r, include=FALSE}
knitr::knit_hooks$set(resize = function(before, options, envir) {
    if (before) {
      return('\\resizebox{\\textwidth}{!}{')
    } else {
      return('}')
    }
})
```

## Wide table

```{r, resize=TRUE}}
knitr::kable(matrix(rnorm(10),10,10), format='latex')
```

Jekyll website

The Rmd files located in the _source folder get automatically compiled by servr package using this command:

Rscript -e "servr::jekyll(script='build.R', serve=FALSE)"

Note: I now use blogdown which automatically compile the R Markdown documents (every page is a R Markdown actually).

data.table package

I’m more of a tidyverse person but for very large data the data.table package is more efficient.

tidyverse data.table
group_by(col1,col2) %>% summarize(nb=n()) dt[,.(nb=.N),by=.(col1,col2)]
group_by(col1,col2) %>% mutate(nb=n()) dt[,nb:=.N,by=.(col1,col2)]
filter(nb==2) dt[nb==2]

Rcpp

Useful links:

Tips:

  • Using external libraries: place .h/.cpp in the src directory with your cpp files
  • Manipulate (g)zipped files: I used gzstream that I took from the ndjson source code.
  • Use Rcpp::checkUserInterrupt(); once in a while to make sure the user can stop easily.
  • Use Rcout << "message" << std::endl; to print messages

Example:

#include <Rcpp.h>
using namespace Rcpp;

//' Description
//' @title Title
//' @param filename the path to the file
//' @return data.frame
//' @author Jean Monlong
//' @keywords internal
// [[Rcpp::export]]
DataFrame read_vcf_cpp(std::string filename){
    std::vector<std::string> seqnames;
    std::vector<int> starts;
    std::vector<int> ends;
    // DO STUFF AND UPDATE VECTORS
    return DataFrame::create(_["seqnames"] = seqnames, _["start"] = starts, _["end"] = ends);
}

Linux setup

Necessary to compile some R packages:

sudo apt-get install libxml2-dev libssl-dev libmariadbclient-dev libcurl4-openssl-dev

Related to XML, OpenSSL, MySQL, Curl respectively.