Developing a practical understanding of internal and external tables in HDInsight

There are two options when creating Hive tables in HDInsight: internal, which is the default for a CREATE TABLE statement, and external, which is created with CREATE EXTERNAL TABLE.

An internal table is one whose data is managed by Hive, so if you were to drop the table, both the table definition (its metadata) and the data itself would be deleted.

An external table is one whose data is NOT managed by Hive, so if you were to drop the table, the table definition and any references to the data would be removed, but the data itself would stay. Hive essentially becomes blind to the data, no matter where it is stored. There are some misconceptions about internal tables and whether their data is also stored in the Hive warehouse, which we will explore below.
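
To make the distinction concrete, here is a minimal HiveQL sketch of the two statements and what dropping each table does. The table names, columns, delimiter, and the `wasb:///example/sampledata/` path are assumptions made up for illustration; they are not taken from this tutorial's data set.

```sql
-- Hypothetical names, columns, and path; adjust to your own cluster and data.

-- Internal (managed) table: the default when EXTERNAL is omitted.
CREATE TABLE sample_internal (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: Hive records only metadata that points at LOCATION.
CREATE EXTERNAL TABLE sample_external (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb:///example/sampledata/';

-- Dropping the internal table removes the metadata and the data files.
DROP TABLE sample_internal;

-- Dropping the external table removes only the metadata; the files remain.
DROP TABLE sample_external;
```

Before dropping anything, running `DESCRIBE FORMATTED` on each table is a quick way to check which kind you have: the Table Type field reads MANAGED_TABLE for an internal table and EXTERNAL_TABLE for an external one.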

So, how is the data for Hive internal and external tables stored in HDInsight? Let’s figure it out!

In this tutorial we will load sample data onto Blob storage. From there, we will create an external table and an internal table using Hive. Theoretically, an external table should keep our data in its original spot, while an internal table should move the data into the Hive warehouse. Let’s look at these scenarios in practice.
