At my current client we use MSBuild and Octopus to set up automated builds and continuous integration for our solutions. Every commit, a build server grabs the latest code, MSBuild compiles it, a NuGet package gets dropped into a drop folder, and Octopus then pulls the package from a package feed. Then we manually deploy as needed. This usually works pretty well, except… SSIS.
Builds for one of our SSIS packages started failing in July. A couple developers took some stabs at getting it working again, but were unable to do so. Then the backlog started filling. Seven months later, and here I am trying to resolve the build. I’m basically starting from scratch, so I don’t have a lot of knowledge of what may have caused the initial failures. But no problem!
I was able to resolve the build problems by logging directly into the build server and running build commands locally to get more specific errors. Eventually, I identified a third party module that was not present on our build server as the culprit of the problems, despite a couple red herrings. I discuss my process below.
One has two options in creating HIVE tables in HDInsight: Internal, which is the default in a CREATE TABLE statement, and EXTERNAL, which is executed by CREATE EXTERNAL TABLE.
An internal table is one whose data is managed by Hive, so if you were to drop the table, the table information would go, and so would the data.
An external table is one whose data is NOT managed by Hive, so if you were to drop the table, the table information and any references to data would go, but the data would stay. Hive essentially becomes blind to the data, no matter where it is stored. There are certain misconceptions around INTERNAL tables and whether the data is also stored in the HIVE warehouse, which we will explore below.
So, how is Hive internal and external data stored in HDInsight? Let’s figure it out!
In this tutorial we will load sampledata onto BLOB storage. From there, we will create an external table and an internal table using Hive. Theoretically, an external table should keep our data in its original spot, while an internal table should move the data into the Hive warehouse. Let’s look at these scenarios in practice.