Virginia Maus | October 27, 2023 | 4 min read

Cleaning Scary Manufacturing Data in Dataiku

Manufacturing data is full of unexpected insights and countless opportunities. Taking the time to explore it can provide significant savings through reduced scrap and energy costs. It also lets you support other areas of your organization, such as providing the sales team with pricing and quote optimization or helping operations with improved uptime or predictive maintenance.

At the same time, manufacturing data can be rather spooky. Data created by machines can contain errors and task-specific nuances, and it often lacks descriptive context. Before you can unmask precious insights, there is prep work that needs to be done. Using Dataiku to prepare and cleanse the data can get it ready for descriptive analysis and advanced data analytics.

The example in this post uses a public data set from Kaggle: a synthetic data set that reflects real predictive maintenance data. In order to emphasize the important points, we made minor changes to the data set. This process can be used on any machine data set to clean and prepare it for analysis.

Bringing in Data

Dataiku supports a growing number of formats. Start by confirming that your format is on Dataiku's latest list of supported formats. Once you have uploaded or connected to the data, it is important to review the schema types. Inferring data types is a quick way to update the schema, but be sure to review the inferred types too.
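To show why the inferred types deserve a second look, here is a minimal pandas sketch of the same idea outside of Dataiku. The sample rows and the "ERR" placeholder are made up for illustration; only the column names follow the example data set.

```python
import io
import pandas as pd

# Hypothetical sample rows; the "ERR" reading is invented for illustration.
raw = io.StringIO(
    "UDI,Type,Air Temperature [K],Manufacturing Date\n"
    "1,M,298.12,2023-01-05\n"
    "2,L,298.27,2023-01-06\n"
    "3,L,ERR,2023-01-07\n"
)
df = pd.read_csv(raw)

# One bad reading turns the whole temperature column into text,
# and the dates arrive as plain strings.
inferred_types = df.dtypes

# After review, set the intended types explicitly.
# errors="coerce" turns unparseable readings into missing values.
df["Air Temperature [K]"] = pd.to_numeric(df["Air Temperature [K]"], errors="coerce")
df["Manufacturing Date"] = pd.to_datetime(df["Manufacturing Date"], format="%Y-%m-%d")
reviewed_types = df.dtypes
```

Comparing `inferred_types` with `reviewed_types` makes it easy to spot columns where a single stray value changed the inferred type.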

Updating Column Names

For consistency, use your first prep recipe to alter column names to your company’s standards. This may include:
  1. Letter case (all upper, lower, camel, etc.).
  2. Removing spaces.
  3. Removing special characters.
  4. Renaming or providing a reference to where the column definitions and other important metadata are located.

In this example, the column name Type is not descriptive enough. Renaming to Product Type helps both the analyst and end users quickly distinguish between products. There may be a corresponding reference to full definitions for each value, or you may consider replacing the values with full product names.
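If you prefer to script the renaming, the sketch below shows one way to do it in pandas. The snake_case standard and the `standardize` helper are assumptions made for this illustration, not a Dataiku feature or a required convention.

```python
import re
import pandas as pd

def standardize(name: str) -> str:
    """Hypothetical naming standard: lower case, with each run of
    non-alphanumeric characters replaced by a single underscore."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

# Empty frame that only carries the column names from the example.
df = pd.DataFrame(columns=["Type", "Air Temperature [K]", "Failure Type"])

# First make the vague name descriptive, then apply the naming standard.
df = df.rename(columns={"Type": "Product Type"})
df.columns = [standardize(c) for c in df.columns]
```

Doing the descriptive rename before the standardization step keeps the business meaning separate from the formatting rule, so either can change without touching the other.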

This is a great opportunity to get descriptive! If the field is Date, make it clear whether this is the order date, production date, creation date, bill date, ship date, etc. Consider removing empty or irrelevant columns and obfuscating any columns that need protection for security reasons.

Initial Cleansing

Your first prep recipe should include a review and cleaning of numeric fields. We do this by:

  1. Reviewing values below a minimum acceptable or above a maximum acceptable value. An example would be unexpected negative values.
  2. Rounding values to only the decimal places that are significant, including converting decimals to whole numbers where appropriate.

In this example, we round Air Temperature [K] to one decimal place instead of two.
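The two numeric steps can be sketched in pandas as follows. The sample readings and the acceptable range are assumptions for illustration; real limits should come from the people who know the equipment.

```python
import pandas as pd

col = "Air Temperature [K]"
# Hypothetical readings, including one unexpected negative value.
df = pd.DataFrame({col: [298.12, 298.27, -5.00, 304.58]})

# Assumed acceptable range for this sketch only.
MIN_K, MAX_K = 250.0, 350.0

# Step 1: flag values outside the acceptable range instead of silently dropping them.
df["air_temp_out_of_range"] = ~df[col].between(MIN_K, MAX_K)

# Step 2: keep one decimal place instead of two.
df[col] = df[col].round(1)
```

Flagging first and deciding later leaves room to review the suspicious rows with the business before anything is removed.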

Next, we review and clean our text fields using these steps:

  1. Confirm values are in the expected format and are clear to end users.
  2. Correct misspellings.
  3. Consolidate duplicate values that represent the same result.

In this example, a Failure Type was misspelled, so we merge it into the correctly spelled value and both are counted as one group.
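A scripted version of the text cleanup might look like the pandas sketch below. The sample values, the specific misspelling, and the `corrections` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical values with a misspelling and inconsistent spacing and case.
df = pd.DataFrame({
    "Failure Type": [
        "No Failure",
        "Power Failure",
        "Power Failrue",
        " power failure",
        "Tool Wear Failure",
    ]
})

# Known misspellings mapped to the correct spelling (assumed for this sketch).
corrections = {"Power Failrue": "Power Failure"}

before = df["Failure Type"].value_counts()

# Trim whitespace, normalize case, then fold misspellings into the correct value.
df["Failure Type"] = (
    df["Failure Type"].str.strip().str.title().replace(corrections)
)

after = df["Failure Type"].value_counts()
```

Comparing `before` and `after` shows the duplicate groups collapsing into one, which is a quick check that the mapping did what was intended.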

Lastly, we clean our date fields following the process below:

  1. Review the formats of all dates across the project, especially for regional differences. Parse date values in place as opposed to into a separate column.
  2. Confirm the time zones are the same across data sets if joining, or convert them to UTC (Coordinated Universal Time).
  3. Replace or remove date values that occur before a minimum acceptable or after a maximum acceptable date.

In this example, the Manufacturing Date was parsed and the Output Column entry was cleared, so the parsed value replaces the existing Manufacturing Date.
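The date steps above can be sketched in pandas as follows. The sample values, the assumption that the source records US Central local time, and the acceptable date window are all made up for illustration.

```python
import pandas as pd

col = "Manufacturing Date"
# Hypothetical values: two good dates, a placeholder date, and an unparseable entry.
df = pd.DataFrame({
    col: ["2023-01-05 08:00", "2023-01-06 14:30", "1970-01-01 00:00", "not a date"]
})

# Parse in place with an explicit format; unparseable values become NaT.
df[col] = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M", errors="coerce")

# Assumption: the source records local US Central time. Convert to UTC.
df[col] = df[col].dt.tz_localize("America/Chicago").dt.tz_convert("UTC")

# Assumed acceptable window; dates outside it are blanked out.
earliest = pd.Timestamp("2000-01-01", tz="UTC")
latest = pd.Timestamp("2030-01-01", tz="UTC")
df[col] = df[col].where(df[col].between(earliest, latest))
```

An explicit format string avoids silent day and month swaps when regional formats are mixed, which is the risk the first step is guarding against.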

Manufacturing Data Preparation for Advanced Analytics

Your next Dataiku recipe(s) should encompass the business rules and domain knowledge needed to prepare the data for advanced analytics, such as:

Control Limits

Most machinery has specified control limits. Values outside of those limits may be irrelevant data. Use the control limits to identify out-of-limit values, then work with the business to decide how to label or remove them. This allows for more informative descriptive statistics and better-performing advanced analytics.
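As a rough illustration, labeling readings against control limits could look like this in pandas. The torque values and the limits are hypothetical; actual limits come from the machine specification.

```python
import pandas as pd

col = "Torque [Nm]"
# Hypothetical readings.
df = pd.DataFrame({col: [40.2, 3.1, 78.9, 42.7]})

# Hypothetical lower and upper control limits for this sketch.
LOWER, UPPER = 10.0, 70.0

# Label every row rather than deleting anything up front.
df["torque_status"] = "in_control"
df.loc[df[col] < LOWER, "torque_status"] = "below_lower_limit"
df.loc[df[col] > UPPER, "torque_status"] = "above_upper_limit"

# Descriptive statistics on in-control readings only.
in_control_mean = df.loc[df["torque_status"] == "in_control", col].mean()
```

Keeping the label as a column lets the business decide later whether out-of-limit rows are removed, kept, or analyzed separately.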

Shut Down

There may be known time periods where data will not be relevant or correct, for example a machine malfunction or a complete shutdown for regular maintenance or a holiday. This data can be marked or removed depending on the goals of the analysis.
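One way to mark known shutdown periods is sketched below in pandas. The readings and the shutdown window are invented for illustration.

```python
import pandas as pd

# Hypothetical sensor readings.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-12-22 10:00",
        "2023-12-24 09:00",
        "2023-12-25 15:00",
        "2023-12-27 08:00",
    ]),
    "value": [41.0, 0.0, 0.0, 40.5],
})

# Hypothetical list of known shutdown windows as (start, end) pairs.
shutdowns = [(pd.Timestamp("2023-12-24"), pd.Timestamp("2023-12-26"))]

# Mark every reading that falls inside any shutdown window (bounds inclusive).
mask = pd.Series(False, index=readings.index)
for start, end in shutdowns:
    mask |= readings["timestamp"].between(start, end)
readings["during_shutdown"] = mask

# Depending on the goal, analyze only the rows outside shutdowns.
running_only = readings.loc[~readings["during_shutdown"]]
```

Marking instead of deleting keeps the shutdown rows available for analyses where downtime itself is the subject.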


Depending on the use case, adding or joining data can allow for advanced filtering, for example filtering on a particular plant, line, or part type.
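A minimal pandas sketch of such a join follows. The machine identifiers, plants, and lines are hypothetical reference data.

```python
import pandas as pd

# Hypothetical readings keyed by machine.
readings = pd.DataFrame({
    "machine_id": ["M1", "M1", "M2", "M3"],
    "torque_nm": [40.2, 41.0, 55.3, 38.9],
})

# Hypothetical reference table describing where each machine sits.
machines = pd.DataFrame({
    "machine_id": ["M1", "M2"],
    "plant": ["Plant A", "Plant B"],
    "line": ["Line 1", "Line 2"],
})

# Left join keeps every reading; validate guards against duplicate reference rows,
# and indicator adds a _merge column showing which readings found a match.
enriched = readings.merge(
    machines, on="machine_id", how="left", validate="many_to_one", indicator=True
)

unmatched = enriched[enriched["_merge"] == "left_only"]
plant_a = enriched[enriched["plant"] == "Plant A"]
```

Checking `unmatched` after the join is a cheap way to find machines that are missing from the reference data before they quietly drop out of a filtered report.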


Sometimes, the most valuable insights require formulas. These should be well documented in Dataiku or within the data dictionary.
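As an illustration of a documented formula, the sketch below derives mechanical power from torque and rotational speed using the standard relation P = torque × angular velocity. The column names and sample values are assumptions, and this particular feature is only an example of the idea.

```python
import math
import pandas as pd

def mechanical_power_w(torque_nm: pd.Series, speed_rpm: pd.Series) -> pd.Series:
    """Mechanical power in watts.

    Formula: P = torque [Nm] * angular velocity [rad/s],
    where angular velocity = rpm * 2 * pi / 60.
    """
    return torque_nm * speed_rpm * 2 * math.pi / 60

# Hypothetical readings.
df = pd.DataFrame({
    "torque_nm": [40.0, 52.5],
    "rotational_speed_rpm": [1500, 1420],
})

df["power_w"] = mechanical_power_w(df["torque_nm"], df["rotational_speed_rpm"])
```

Putting the formula and its units in the docstring means the definition travels with the code, and the same text can be copied into the data dictionary.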

Success Criteria

Defining exactly what is considered to be success vs. failure will provide clearer results. Is failure considered a rejected part? If so, why can it be rejected? How do you count those rejected for multiple reasons? Are we focusing on certain reasons for rejection? Are there levels of magnitude for rejection?
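To make these questions concrete, here is a pandas sketch of one possible definition: a part fails if it was rejected for at least one in-scope reason, and each part is counted once no matter how many reasons apply. The parts, reasons, and scope are hypothetical.

```python
import pandas as pd

# Hypothetical inspection log: one row per part and rejection reason.
rejections = pd.DataFrame({
    "part_id": ["P1", "P1", "P2", "P4"],
    "reason": ["scratch", "dimension", "scratch", "porosity"],
})
parts = pd.DataFrame({"part_id": ["P1", "P2", "P3", "P4"]})

# Assumed scope: only these rejection reasons count as failure for this analysis.
FOCUS_REASONS = {"scratch", "dimension"}

in_scope = rejections[rejections["reason"].isin(FOCUS_REASONS)]
reasons_per_part = in_scope.groupby("part_id")["reason"].nunique()

# A part fails once, even if it was rejected for several reasons.
parts["failed"] = parts["part_id"].isin(reasons_per_part.index)
parts["n_reasons"] = parts["part_id"].map(reasons_per_part).fillna(0).astype(int)

failure_rate = parts["failed"].mean()
```

Keeping `n_reasons` alongside the yes or no label preserves a simple measure of magnitude without changing the failure count.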

Operating Conditions

Are there any data sources that provide insight into the operating conditions of the machine? For example, what was the plant humidity, weather, etc.? Can these conditions be tied to a shift or time?
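When conditions are logged on their own schedule, one option is to attach the most recent condition reading to each machine reading by time. The pandas sketch below uses `merge_asof` for this; the humidity values, timestamps, and one-hour tolerance are assumptions.

```python
import pandas as pd

# Hypothetical machine readings.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-10-27 08:05", "2023-10-27 09:40", "2023-10-27 13:00",
    ]),
    "torque_nm": [40.2, 41.7, 39.8],
})

# Hypothetical hourly plant humidity log.
humidity = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-10-27 08:00", "2023-10-27 09:00", "2023-10-27 10:00",
    ]),
    "humidity_pct": [41.0, 44.5, 47.0],
})

# For each reading, take the latest humidity value at or before it,
# but only if that value is at most one hour old.
joined = pd.merge_asof(
    readings.sort_values("timestamp"),
    humidity.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta("1h"),
)
```

The tolerance keeps stale conditions from being attached to a reading, so gaps in the humidity log show up as missing values rather than misleading ones.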

From Scary Data to Unmasked Insights

Manufacturing data holds an incredible amount of valuable information. Build your analysis on a strong foundation of clean and prepared data by using Dataiku to bring in data, update column names, and do both initial and advanced manufacturing-specific cleaning. It will turn your abandoned (dare we say haunted) house into a home for valuable business insights. Check out this blog for more manufacturing data preparation ideas. 

Does your company have manufacturing data you’re scared to look at?  Contact Snow Fox Data. Our data wizards can help you turn data into powerful insights for your business.


Virginia Maus

Virginia is a problem solver who is passionate about using the power of data to make informed decisions. She wants to lead your most undefined, innovative, and challenging projects with clear communication and genuinely collaborative execution. To her, energy comes from building relationships and taking action to make a positive, infectious impact on the world.