Developing a practical understanding of internal and external tables in HDInsight

One has two options in creating HIVE tables in HDInsight: Internal, which is the default in a CREATE TABLE statement, and EXTERNAL, which is executed by CREATE EXTERNAL TABLE.

An internal table is one whose data is managed by Hive, so if you were to drop the table, the table information would go, and so would the data.

An external table is one whose data is NOT managed by Hive, so if you were to drop the table, the table information and any references to data would go, but the data would stay. Hive essentially becomes blind to the data, no matter where it is stored. There are certain misconceptions around INTERNAL tables and whether the data is also stored in the HIVE warehouse, which we will explore below.

So, how is Hive internal and external data stored in HDInsight? Let’s figure it out!

In this tutorial we will load sampledata onto BLOB storage. From there, we will create an external table and an internal table using Hive. Theoretically, an external table should keep our data in its original spot, while an internal table should move the data into the Hive warehouse. Let’s look at these scenarios in practice.

First, lets look at our BLOB storage prior to loading our sampledata. This is before any import:

Next, lets add the sampledata:

With the sampledata loaded, we may now create our first table. This table will be external, which means our data should stay in its original location, and the metadata will be managed by Hive:

The Hive query:

DROP TABLE IF EXISTS sampledata_external;
--create the hvac table on comma-separated sensor data
CREATE EXTERNAL TABLE sampledata_external(date STRING, time STRING, targettemp BIGINT,
actualtemp BIGINT,
system BIGINT,
systemage BIGINT,
buildingid BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb://claytonrothschild@[removed].blob.core.windows.net/sampledata/';

Now, where is our data? Lets look at BLOB storage again:

The sampledata directory is still there, and navigating to its contents shows the data files are still there as well.

Lets query it:

SELECT * from sampledata_external;

Gives us:

Date Time NULL NULL NULL NULL NULL
6/1/13     0:00:01    66   58   13   20   4
6/2/13     1:00:01    69   68   3    20   17
6/3/13     2:00:01    70   73   17   20   18
6/4/13     3:00:01    67   63   2    23   15
6/5/13     4:00:01    68   74   16   9    3
6/6/13     5:00:01    67   56   13   28   4

What about creating an internal table? A common understanding is  that creating an internal table should move our data files into Hive warehouse, making the data managed. Lets see if this is true.

CREATE TABLE sampledata_internal(date STRING, time STRING, targettemp BIGINT,
actualtemp BIGINT,
system BIGINT,
systemage BIGINT,
buildingid BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb://claytonrothschild@[removed].blob.core.windows.net/sampledata/';

After table creation, how does our BLOB look?

Very odd! one would expect creation of an internal table would move the data from sampledata into the hive warehouse, but it didn’t.

Querying the table does work:

SELECT * FROM sampledata_internal;

We get:

Date Time NULL NULL NULL NULL NULL
6/1/13     0:00:01    66   58   13   20   4
6/2/13     1:00:01    69   68   3    20   17
6/3/13     2:00:01    70   73   17   20   18
6/4/13     3:00:01    67   63   2    23   15
6/5/13     4:00:01    68   74   16   9    3
6/6/13     5:00:01    67   56   13   28   4
6/7/13     6:00:01    70   58   12   24   2

And querying our external table does work:

SELECT * from sampledata_external;

Gives us:

Date Time NULL NULL NULL NULL NULL
6/1/13     0:00:01    66   58   13   20   4
6/2/13     1:00:01    69   68   3    20   17
6/3/13     2:00:01    70   73   17   20   18
6/4/13     3:00:01    67   63   2    23   15
6/5/13     4:00:01    68   74   16   9    3
6/6/13     5:00:01    67   56   13   28   4

So is our sampledata_internal an internal table or not? The answer is yes, it is an internal table in that HIVE manages the data. So shouldn’t it be in the Hive warehouse? No. Creating an internal table in HIVE does NOT necessarily mean that the data will be moved to the hive warehouse. This is good, what if we wanted to have multiple schemas on the same data, like in our case, an internal and external table? If the data were moved, our external table created above wouldn’t work.

So what does internal mean, if the data stays in the same location? It simply means that the data is managed by Hive, so if you were to DROP the table, the data would go also:

DROP TABLE IF EXISTS sampledata_internal;

And thus querying ANY table that uses that dataset, the external table included, no longer works:

SELECT * FROM sampledata_internal LIMIT 10;
SELECT * FROM sampledata_external LIMIT 10;
[BLANK]

What if we want our table to be internally managed AND stored in the Hive Warehouse?

Let’s create our internal table, but instead of using LOCATION at table creation, we will create the table and then use LOAD DATA INPATH:

DROP TABLE IF EXISTS sampledata_internalload;

CREATE TABLE sampledata_internalload(date STRING, time STRING, targettemp BIGINT,
actualtemp BIGINT,
system BIGINT,
systemage BIGINT,
buildingid BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH 'wasb://claytonrothschild@[removed].blob.core.windows.net/sampledata/' OVERWRITE INTO TABLE sampledata_internalload;

Looking at the BLOB, our data is now moved from its original location, and into the HIVE warehouse:

Querying the table does work:

SELECT * FROM sampledata_internalload;
Date Time NULL NULL NULL NULL NULL
6/1/13     0:00:01    66   58   13   20   4
6/2/13     1:00:01    69   68   3    20   17
6/3/13     2:00:01    70   73   17   20   18
6/4/13     3:00:01    67   63   2    23   15
6/5/13     4:00:01    68   74   16   9    3
6/6/13     5:00:01    67   56   13   28   4
6/7/13     6:00:01    70   58   12   24   2
6/8/13     7:00:01    70   73   20   26   16

While querying the other two tables does not work:

SELECT * FROM sampledata_internal LIMIT 10;
SELECT * FROM sampledata_external LIMIT 10;
[BLANK]

Which practice is best?

Hive keeps track of the HDFS logical location of of both external and internal table data using a metastore. This metastore sends this location to headnode, which then locates the physical location of the table data – specific datanodes/blocks on which to execute a given query.

Either external or internal will do just fine.

Cindy Gross lists some great scenarios on how to choose between external and internal:

Use EXTERNAL tables when:

  1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
  2. Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  3. You want to use a custom location such as a Blob (default location for HDInsight clusters).
  4. Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
  5. You are not creating table based on existing table (CREATE TABLE AS SELECT).

Use INTERNAL tables when:

  1. The data is temporary. You want Hive to completely manage the lifecycle of the table and data.

Leave a Reply

Your email address will not be published. Required fields are marked *