July 28, 2022 (2 min read)

Preparing a New Data Source for Analysis in Dataiku

Adding new data sources to an analysis is a common activity. Every new data source, no matter how curated, needs additional data prep. In this post, we walk through our standard data preparation steps for new data sources in Dataiku.

Breaking Down Data Silos

One of the many benefits of data science is the breaking down of data silos. This is achieved by bringing data from different sources into the same platform for analysis. Dataiku makes this easy because it has connectors to most data platforms and file types (full list here). It is helpful to bring in any new data source in its rawest form.

It's tempting to start joining a new data source with other data right away. However, it is best practice to complete some standard data preparation steps before joining data sets.

Standard Data Preparation 

Step 1. Create a Prepare Recipe in Dataiku

[Screenshot: Dataiku visual Prepare recipe]

Step 2. Clean the Empty Columns

When you see an empty column, always check with business users whether it was intentionally left blank or the data wasn’t properly brought into Dataiku on upload.

A quick way to identify empty columns is the Explore tab. Switch to list view and look for columns without any valid data. Then confirm that the column is empty in the full data set and not just in your sample.
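
Outside the Explore tab, the same check can be sketched in a few lines of plain Python. This is a minimal illustration, not Dataiku's own logic: it assumes the rows are available as a list of dictionaries, and the helper name `find_empty_columns` and the `Station` and `Notes` columns are hypothetical.

```python
def find_empty_columns(rows):
    """Return the names of columns that hold no usable value in any row."""
    # Collect column names in first-seen order
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    def is_blank(value):
        # Treat None and whitespace-only strings as empty
        return value is None or str(value).strip() == ""

    return [c for c in columns if all(is_blank(row.get(c)) for row in rows)]


# Hypothetical sample rows; only "Notes" is empty in every row
sample = [
    {"Station": "1", "PrecipTotal": "0.12", "Notes": ""},
    {"Station": "2", "PrecipTotal": "T", "Notes": None},
    {"Station": "1", "PrecipTotal": "0.00", "Notes": "  "},
]
empty_columns = find_empty_columns(sample)
```

Running the check on the full data set rather than a sample is what confirms that a column is truly empty.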

[Screenshot: empty columns in Dataiku]

Step 3. Rename Columns

Renaming columns is especially important when:

  • Column names are duplicated but the data content differs
  • A name is not descriptive enough, for example “Date” where “Create Date” would be clearer
  • A name contains special characters or spaces and you want to use the column in formulas later, for example “Total_Assets” instead of “Total Assets ($)”

Rename a column by clicking on the column name and selecting Rename.
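
When there are many columns, it can help to draft the new names programmatically before applying them. The sketch below is one possible convention, not a Dataiku feature: the helpers `clean_column_name` and `make_unique` are hypothetical, and the numeric suffix only flags duplicates so that you can give them descriptive names afterwards.

```python
import re


def clean_column_name(name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop underscores left over at either end, e.g. from a trailing ")"
    return cleaned.strip("_")


def make_unique(names):
    """Append _2, _3, ... to repeated names so each column can be told apart."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0) + 1
        seen[name] = count
        result.append(name if count == 1 else f"{name}_{count}")
    return result


raw_names = ["Total Assets ($)", "Create Date", "Date", "Date"]
new_names = make_unique([clean_column_name(n) for n in raw_names])
# new_names is ["Total_Assets", "Create_Date", "Date", "Date_2"]
```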

[Screenshot: renaming a column in Dataiku]

Step 4. Parse Dates

Parsing dates allows for easier charting and date manipulation, such as calculating the time between two dates.
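
To illustrate why parsed dates are easier to work with, here is a small standalone Python sketch. The function `parse_date` and the two candidate formats are assumptions for this example; your data may use different formats.

```python
from datetime import datetime


def parse_date(text, formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each candidate format in order; return None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


# As strings these two values cannot be subtracted; as dates they can
start = parse_date("2022-07-01")
end = parse_date("07/28/2022")
days_between = (end - start).days  # 27
```

Values that match none of the formats come back as `None`, which makes them easy to list and review.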

[Screenshot: parsing dates in Dataiku]

Step 5. Review Data Types and Meanings

The data quality bar shows which rows in the sample are valid for the column's meaning and what percentage of the sample is empty. Keep in mind that the data is held as its storage type, which can differ from the meaning. How you handle invalid or empty values depends on the situation: sometimes you remove the entire row, sometimes you leave the field blank, and sometimes you replace the value with the correct data.

In this example, PrecipTotal is stored as a string, but its meaning (inferred from the data) is decimal. Assuming decimal is the correct meaning, some values (“T”) do not match it. This is an opportunity to update the storage type if the data is valid, or to remove the values that are inconsistent with the meaning.
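
The kind of check behind the quality bar can be approximated in plain Python. This is a rough sketch rather than Dataiku's actual validation logic; the function `split_by_decimal_validity` and the sample values are hypothetical.

```python
def split_by_decimal_validity(values):
    """Sort raw string values into valid decimals, invalid values, and an empty count."""
    valid, invalid, empty = [], [], 0
    for value in values:
        text = "" if value is None else str(value).strip()
        if text == "":
            empty += 1
            continue
        try:
            valid.append(float(text))
        except ValueError:
            # Not a decimal, e.g. the "T" seen in PrecipTotal
            invalid.append(text)
    return valid, invalid, empty


precip_total = ["0.12", "T", "", "0.00", None, "1.5"]
valid, invalid, empty = split_by_decimal_validity(precip_total)
# valid is [0.12, 0.0, 1.5], invalid is ["T"], empty is 2
```

Listing the invalid values separately makes it easier to decide, with business users, whether to replace them, blank them out, or drop the rows.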

[Screenshot: Dataiku data quality bar]

Data Prep Success

In this post, we’ve covered the basic prep recipe steps you should do when bringing in any new data set. These steps will set you up for success in preparing your data for everything from descriptive statistics to machine learning.

Virginia Maus

Virginia is a problem solver who is passionate about using the power of data to make informed decisions. She wants to lead your most undefined, innovative, and challenging projects with clear communication and genuinely collaborative execution. To her, energy comes from building relationships and taking action to make a positive, infectious impact on the world.
