In-memory Data Science

Abstract

As computer memory becomes abundant, it is now possible to load large data sets into main memory for fast analysis, without the input/output overhead of secondary storage or the complexity of distributed computing. But with data sizes close to your computer's memory capacity (roughly 1-100 GB), you need an in-memory database system that is memory-safe and computationally efficient. For iterative or reproducible analysis, you also need storage-efficient version control of your data sets. Here I present a hierarchical data schema for version control of massive data sets, and R code for creating and validating data at the various processing levels, exemplified by over 100 GB of NYC taxi trip records. The hierarchical schema, inspired by remote sensing practices, represents data at levels from 0 to 4: 0, raw data; 1, syntactically correct data; 2, semantically correct data with derived variables; 3, data with regular spatial and temporal organization; 4, aggregated or model output. I compare the popular in-memory columnar databases in R, data.table and the “tidyverse” tibble, together with the serialization libraries fst and Apache Arrow (feather), and give my recommendations.

Ruda Zhang
Postdoctoral Fellow
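
Below is a minimal sketch of the workflow the abstract describes: promoting one raw taxi file from level 0 (raw data) to level 1 (syntactically correct data), then serializing it with fst and Apache Arrow (feather). The file paths, directory layout, and column names (e.g. `tpep_pickup_datetime`) are illustrative assumptions, not the exact schema used in the full pipeline.

```r
library(data.table)
library(fst)
library(arrow)

# Level 0: raw CSV exactly as downloaded.
raw <- fread("data/level0/yellow_tripdata_2015-01.csv")

# Level 1: syntactically correct data -- enforce column types and
# drop rows whose timestamps cannot be parsed.
trips <- raw[, .(
  pickup_datetime  = as.POSIXct(tpep_pickup_datetime,  tz = "America/New_York"),
  dropoff_datetime = as.POSIXct(tpep_dropoff_datetime, tz = "America/New_York"),
  trip_distance    = as.numeric(trip_distance),
  fare_amount      = as.numeric(fare_amount)
)][!is.na(pickup_datetime) & !is.na(dropoff_datetime)]

# Serialize level-1 data: fst (fast, R-only) and feather (cross-language).
write_fst(trips, "data/level1/yellow_tripdata_2015-01.fst", compress = 50)
write_feather(trips, "data/level1/yellow_tripdata_2015-01.feather")

# Read back for analysis; as.data.table = TRUE keeps data.table semantics.
trips_fst     <- read_fst("data/level1/yellow_tripdata_2015-01.fst", as.data.table = TRUE)
trips_feather <- read_feather("data/level1/yellow_tripdata_2015-01.feather")
```

Keeping each processing level in its own directory makes it straightforward to put the smaller, derived levels under version control while regenerating them from level 0 when needed.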