Background

It is widely known in the data industry that a majority of the work required to create a usable data asset is dedicated to the process of cleaning and properly structuring the associated raw data. Data from different systems is messy and rarely follows a standard structure, and much of analysts' time is consumed by this laborious process.

In Numetric Warehouse, we've created a set of tools that allow a data curation team to significantly reduce the time required for cleaning tasks. The tools themselves (the actual operations that can be performed on the data) are nothing revolutionary. They include standard filtering, if-then logic, value mapping functions, and other typical transformation formulas. Instead, the placement (at the entry point into the Warehouse) and reusability (because of our rule-based approach) of our cleaning tools are what make Numetric's cleaning experience so special. 

Rule-Based Cleaning

The Numetric approach to cleaning centers on rules that the data team can establish for each individual source that comes into Warehouse. Those rules are saved and reapplied every time new data from that source enters Warehouse. The result is a set-it-and-forget-it process that should only have to be done once.

Warehouse Sources vs. Warehouse Tables

It's important to distinguish between Sources and Tables within Warehouse. You'll see both referenced in the top menu within Warehouse:

As we learned in the previous article in this series, we use the term Source Table when referring to external sources of raw data. Each Source (e.g., Uploaded Files, Dropbox, Google Analytics, Numetric API, etc.) will have one or more Source Tables listed. Each has a checkbox next to the table name, which is used to tell Numetric whether to actually bring that table into the Warehouse for use in your analytics workflow. 

Why this behavior? Think of Sources as a sort of "staging area" for data that you might want to bring into Warehouse. Some data connectors will give you access to many tables from the external system, and often you'll only need a few tables from the many available. You can preview each Source table by clicking on the table name to help you decide which tables should be included in Warehouse. See the example below.

Each Source Table that is checked will  be listed as a Warehouse Table in the "Tables" section of Warehouse. It is for each of these Warehouse Tables that we will establish cleaning and transformation rules.

Cleaning Warehouse Tables

And now we get to the fun part. Let's see how easy it is to setup filtering, cleaning, and transformation rules for each of your Warehouse Tables.

Each of your selected Source Tables will be listed under "Tables" within Warehouse.

Each of these Tables can have its own set of shaping rules applied so that you can get each to look just the right way. Let's open the "Customers" table so we can see how it works. 

With a Warehouse table open, you'll have three tabs or areas where you can manipulate the data in the table (or the table itself). These are list along the top:

Transformations is where you can setup your rule-based cleaning and filtering operations. To add a transformation, you'll click the "+" button on the left side of the screen, where you can choose from the different transformations available to you.

Each transformation will allow you to provide a name so that you can keep track of which transformation is doing what. Transformations are added sequentially down the left side as you add them, and they are applied in the order they appear. You can adjust their order by simply dragging and dropping.

Types of Transformations

Numetric currently supports three types of transformations. Let's look at an example of each.

The Filter transformation allows you to filter out rows that fit certain criteria (called "Triggers"). You can filter by removing rows that fit the criteria, or by only keeping rows that fit the criteria. In the example below, I am removing all customers from California, telling Numetric to remove rows where the State field is either "CA" or "California." 

(Note that you can add multiple triggers or criteria to each filter transformation.)

The If-Then transformation allows logical operations to be performed with the data. Within an If-Then transformation, you can setup a variety of triggers (criteria), each with a potential true or false answer. When the trigger (or set of triggers) is true for a given row in the data, Numetric will apply one or more "actions." 

Note that you can add multiple triggers and multiple outcome actions. When multiple triggers are provided, you'll need to choose whether to use "And" or "Or" logic on the set of triggers. As you might expect, selecting "And" will only apply the action(s) if all of the trigger criteria are true, and selecting "Or" will apply the action(s) if any of the trigger criteria are true.

In the example below, I have setup an If-Then transformation to identify customers who live in the city (i.e., "Metro Customers"). My first trigger looks for a couple of specific area codes in the customers' phone numbers, and the second trigger looks for customers who have listed any of a few large cities in their mailing address. Because I've specified "Or" logic, if either those triggers evaluates as true, the action will be applied. In this case, I've asked Numetric to set the value of the MetroCustomer field to "Yes."

The Formula transformation is a more flexible transformation with which you can apply a wide variety of operations to the data. The sky's the limit when it comes to formula transformations, so we won't enumerate all of the possible options here. I'll provide one example below, and you can consult our more detailed formula documentation to learn more about your options.

In the example below, I have chosen to standardize the format of all of my customers' last names by converting all characters to upper case. To do this, I use the "Upper" formula, following the syntax suggested as I type the formula. Note that Column names are surrounded in curly braces ( "{ }" ).

Adding New Fields

Sometimes you would rather add a new column during a transformation rather than modifying an existing column, such as when you are creating a calculated field from another field. This process is simple and occurs as a natural part of creating a transformation. 

In the example of the If-Then transformation I used above, I had to create the MetroCustomer field because it didn't exist in the raw data. To do this, I simply typed the new column name in the column field. Because MetroCustomer didn't yet exist, I was prompted to create a new column, as shown below. I then specified the data type (Text) and I could continue creating the transformation. Just that easy!

Changing Column and Table Properties

Most of the action in Warehouse happens in the Transformations tab, but it's also important for you to know about the other two tabs, namely Column Properties and Table Properties. 

Column Properties

Here you can view some metadata about each of the columns in the table, change certain properties of the column (such as the data type), and, along the left side of the screen, change the display order of the columns in the table. Either drag and drop to rearrange or (especially with large tables where dragging would be annoying) simply enter the desired position value for a column. The column will move to the new position, moving all of the other columns down by one position.

Table Properties & Sharing

Here you can view and/or edit various attributes of the Warehouse table as a whole, including its size, its origin date, and the date it was last updated. The most important part of the Table Properties view is the sharing menu along the left side. Here is where the data management team can decide who in the organization can access the data being stored and cleaned in Warehouse. Only the users or groups who appear in the list for a given table will be able to use that table to assemble Datasets for analysis and visualization. 

To make any asset (Warehouse Table, Dataset, or Workbook) available to another user or group, simply type in the name of the user or group, and choose whether they should have "edit" access (with which they will be able to modify the asset) or just "view" access (with which they will be able to use but not change the asset). 

You can also choose to share assets with all users of a certain role. For example, the customers warehouse table we have been using in our examples here will probably be useful to pretty much everyone in the organization, so I have chosen to given everyone at Numetric "view" access to the table. 

Note: Only users who have been given the "Analyst" or "Administrator" role will be able to have edit access to Warehouse Tables. 

Publish vs. Revert

One last thing to note regarding Warehouse tables. As you setup your rules for filtering, cleaning, transforming, and reordering your tables, you'll notice the "Revert" and "Publish" buttons activate. As you might guess from their obvious names, these two buttons will allow you to either (1) revert your most recent changes (i.e., the changes you have made since you last published), or (2) publish those changes so that your new rules or other modifications will be applied to the table in Warehouse. Until you publish your changes, you'll see a "Draft" notation near the name of the table, reminding you that your changes have not yet been saved.

Note: Don't worry about losing work if you have to go elsewhere in Warehouse to check on something while you're working on a table. Those "unpublished" changes will remain until you open the table view again, essentially in a perpetual draft until you either revert or publish the changes you've made.

Conclusion

Wasn't that fun? If you invest a little bit of time up front to setup the rules required to filter, clean, and otherwise transform your messy raw data, you'll earn that time back many times over as each new update from each of those sources is cleaned and organization automatically. Now let's complete the story on organizing your data with a discussion of Dictionary.

Did this answer your question?