So I heard you just got a bunch of messy CSVs. What to do now?

Querying Data

  • Convert CSV to SQLite: good for small datasets
  • q (a Python tool): slow
  • DataFrame (duh): slow
  • Spark/Hadoop/R: hard to use for me
  • JuliaDB: good stuff, fast, even for user-defined functions
  • ClickHouse: if you need a more production-ready solution
  • Other GPU stuff like OmniSci or AresDB, if you need real-time, hard-core analytics
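
For the CSV-to-SQLite route, here is a minimal sketch using only the Python standard library. The function name csv_to_sqlite, the default table name "data", and the choice to store every column as TEXT and drop rows with the wrong number of fields are all assumptions made for illustration, not a fixed recipe.

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path, table="data"):
    """Load a CSV file with a header row into a SQLite table (all columns as TEXT)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Quote identifiers so headers with spaces or odd characters still work.
        cols = ", ".join('"{}" TEXT'.format(h.replace('"', '""')) for h in header)
        placeholders = ", ".join("?" for _ in header)
        quoted_table = '"{}"'.format(table.replace('"', '""'))
        conn = sqlite3.connect(db_path)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS {} ({})".format(quoted_table, cols)
                )
                # Assumption: rows whose field count differs from the header are
                # junk and get skipped; log or repair them if they matter.
                rows = (r for r in reader if len(r) == len(header))
                conn.executemany(
                    "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders),
                    rows,
                )
        finally:
            conn.close()
```

After that, the file can be queried with plain SQL, either from Python or from the sqlite3 command-line shell. Since everything is stored as TEXT, cast numeric columns in the query (for example CAST(price AS REAL)) or declare proper column types up front.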

Unix tools

  • sort
  • xsv
  • head
  • tail
  • less
  • grep
  • split
  • sed (so confusing that every time I want something done quickly, I just write my script in C++)
  • tmux: very useful for managing SSH sessions
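
On the sed point: the usual job is a line-by-line substitution, and that is only a few lines in a scripting language too. A minimal sketch in Python (rather than C++, to keep it short); the function name substitute_lines is made up for illustration, and the pattern uses Python's regex dialect, not sed's.

```python
import re


def substitute_lines(lines, pattern, replacement):
    """Yield each line with every regex match replaced, roughly sed 's/pattern/replacement/g'."""
    regex = re.compile(pattern)
    for line in lines:
        # Lines are processed one at a time, so large files are never fully in memory.
        yield regex.sub(replacement, line)
```

For example, sys.stdout.writelines(substitute_lines(sys.stdin, r"N/A", "")) strips every N/A from a stream, so the script can sit in a shell pipeline just like sed.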

Editors

  • vim, nano, or emacs
  • Scientific writing: GNU TeXmacs is very productive

Programming Languages

My favorites are C++, Julia, Nim, and, when the data is small, Python

Is it a graph?

graph-tool (Python) is actually fast (there is a script to convert NetworkX graphs to graph-tool)

Algorithms

Clustering

Compare clustering algorithms

Fitting

LsqFit is my favorite

GPU Computing

  • ArrayFire (basically high-level linear algebra on the GPU)
  • CUDA/OpenCL (well you know it)

TIL

  • ls takes a long time when a directory holds a huge number of files: use ls -U instead, because plain ls sorts the filenames before listing them

  • sort a BIG file: split it into chunks, sort each chunk, and merge the sorted chunks
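
The split, sort, and merge steps for a big file can be sketched in Python as below. This is an illustration under assumptions: the function name external_sort and the lines_per_chunk default are made up, lines are compared as plain Python strings (not by locale), and all chunk files are opened at once during the merge, so a very large number of chunks could hit the open-file limit.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(in_path, out_path, lines_per_chunk=1_000_000):
    """Sort a text file line by line without holding the whole file in memory."""
    chunk_paths = []
    try:
        # Step 1 and 2: split the input into chunks and sort each chunk in memory.
        with open(in_path) as src:
            while True:
                chunk = list(islice(src, lines_per_chunk))
                if not chunk:
                    break
                # A final line without a newline would run into its neighbor after merging.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)
        # Step 3: lazily merge the sorted chunk files into the output.
        files = [open(p) for p in chunk_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*files))
        finally:
            for f in files:
                f.close()
    finally:
        # Clean up the temporary chunk files even if something failed.
        for p in chunk_paths:
            os.remove(p)
```

heapq.merge reads one line at a time from each chunk, so memory use during the merge is bounded by the number of chunks, while lines_per_chunk bounds memory during the sort phase.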