Columnar vs. Row Oriented Databases - The Basics (2 min read)
A 2 minute crash-course on columnar databases
Columnar databases have become the popular choice for storing analytical data workloads. Below are some quick takeaways about the differences between column-oriented vs. row-oriented databases.
Fundamental concept
Column oriented databases, under the hood, store all values from each column together whereas row oriented databases store all the values in a row together. A good way to visualize this is to think about it as the values of each column are stored in separate files. So a query that requires a few columns will only need to parse a few files that are relatively small to return the data. Now consider that in a row oriented database, the entire row (all columns) are stored in the file. These files can get very large and therefore a query that only cares about a couple of columns will need to parse the entire file resulting in slower data retrieval .
Key Takeaways
Modern data warehouses use column-oriented architecture. Examples: redshift, Google Bigquery, Vertica etc.
Better for analytical queries: A simple way to think about it is that if you need to read MANY rows but only a FEW columns, then Column-Oriented databases are the way to go. If you need to read a few rows but many columns then row oriented databases are better suited. Ex: select sum(balance_amount) from table will be faster on a column-oriented database. On the other hand, Select * from table will not work as well.
Sort Order: In a column-oriented database, the rows stored in each file, will be in the same sort order. This enables the database to reassemble a row from the individual column files. For example you can take the 2nd entry from each file and piece them together to form the 2nd row of the table. Sort order in a column-oriented database by default is the order in which the rows is inserted. Under the hood the respective column value is appended to each column file. A sort order can be imposed using a column or a set of columns.
Writes are not as efficient in column-oriented databases. If you need to insert a row in the middle of a sorted table then all the column files need to be rewritten
Columnar-Compression: Column-oriented Storage is very favorable for compression. This is due to the fact that the number of distinct values in a column is a lot smaller than the number of rows as well as all having the same data type.