According to a 2020 survey by Anaconda, data scientists spend 45% of their time on data preparation. The figure was 80% in 2016, according to a report by Forbes. Automation tools seem to have improved the situation, but data preparation still makes up a large part of data science work. This is because getting the best possible results from a machine learning model depends on data quality, and creating better features helps provide better-quality data.

What is feature engineering?

Feature engineering is the process of transforming raw data into useful features. 

Real-world data is almost always messy. Before deploying a machine learning algorithm to work on it, the raw data must be transformed into a suitable form. This is called data preprocessing and feature engineering is a component of this process.

A feature is an attribute of the data that is relevant to the problem you would like to solve with the data and the machine learning model. The process of feature engineering therefore depends on the problem, the available data, and the machine learning algorithm used. Features created for one problem are unlikely to be equally useful for a different problem, even on the same dataset. In addition, different algorithms require different types of features for optimal performance.

What are feature engineering processes?

Feature engineering can involve:

  • Feature construction: Constructing new features from the raw data. Feature construction requires a good knowledge of the data and the underlying problem to be solved with the data.
  • Feature selection: Selecting a subset of available features that are most relevant to the problem for model training.
  • Feature extraction: Creating new and more useful features by combining and reducing the number of existing features. Principal component analysis (PCA) and embedding are some methods for feature extraction.
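The first two processes above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the house records, the column names, the `area_per_room` feature, and the correlation threshold of 0.5 are all hypothetical, and using correlation with the target is just one simple way to judge relevance.

```python
import math

# Hypothetical toy data: each row is one house (values invented for illustration).
houses = [
    {"area_m2": 50.0, "rooms": 2, "listing_id": 17, "price": 100_000.0},
    {"area_m2": 80.0, "rooms": 3, "listing_id": 4, "price": 168_000.0},
    {"area_m2": 120.0, "rooms": 4, "listing_id": 42, "price": 236_000.0},
    {"area_m2": 200.0, "rooms": 6, "listing_id": 8, "price": 410_000.0},
]

# Feature construction: derive a new attribute from existing ones.
for row in houses:
    row["area_per_room"] = row["area_m2"] / row["rooms"]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, target, threshold=0.5):
    """Feature selection: keep features whose absolute correlation
    with the target is at least `threshold` (an assumed cut-off)."""
    target_values = [row[target] for row in rows]
    candidates = [name for name in rows[0] if name != target]
    return [
        name
        for name in candidates
        if abs(pearson([row[name] for row in rows], target_values)) >= threshold
    ]


selected = select_features(houses, target="price")
```

In this toy data the arbitrary `listing_id` column is only weakly correlated with the price and is dropped, while the constructed `area_per_room` feature is kept.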

What are some feature engineering techniques?

Some common techniques of feature engineering include:

One-hot encoding

Many ML algorithms cannot work with categorical data directly and require numerical values. For instance, if you have a ‘Color’ column in your tabular dataset and the observations are “Red”, “Blue” and “Green”, you may need to convert these into numerical values so that the model can process them. However, labeling “Red” = 1, “Blue” = 2, and “Green” = 3 is not enough because there is no ordered relation between colors (i.e. blue is not two times red).

Instead, one-hot encoding creates binary columns for the categories. Here, two columns, “Red” and “Blue”, are enough. If an observation is red, it takes 1 in the “Red” column and 0 in “Blue”. If it is green, it takes 0 in both columns and the model deduces that it is green.
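As a minimal sketch of the color example, the snippet below builds the “Red” and “Blue” indicator columns in plain Python. The function name `one_hot_encode` and the sample list are illustrative assumptions; in practice a library routine such as pandas `get_dummies` would typically be used instead.

```python
def one_hot_encode(values, kept_categories):
    """Return one row of 0/1 indicators per value, one column per kept category.

    A value that matches none of the kept categories gets 0 in every column,
    which is how the left-out category ("Green" here) is represented.
    """
    return [
        {category: int(value == category) for category in kept_categories}
        for value in values
    ]


colors = ["Red", "Blue", "Green", "Red"]  # hypothetical observations
encoded = one_hot_encode(colors, kept_categories=["Red", "Blue"])

for color, row in zip(colors, encoded):
    print(color, row)
```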


Log-transformation

Log-transformation replaces each value in a column with its logarithm. It is a useful method to handle skewed data, as shown in the image below. Log-transformation can bring the distribution closer to normal and decrease the effects of outliers. Fitting a linear model, for instance, can give more accurate results after transformation because the relationship between the two variables is closer to linear.

Source: UMD
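The effect of a log-transformation can be sketched with a small right-skewed sample. The income values and the `log_transform` helper below are hypothetical. The sketch also assumes strictly positive values, since the logarithm is undefined for zero and negative numbers; for data containing zeros, `math.log1p` is a common alternative.

```python
import math
import statistics


def log_transform(values):
    """Replace each value with its natural logarithm (values must be > 0)."""
    if any(v <= 0 for v in values):
        raise ValueError("log-transformation requires strictly positive values")
    return [math.log(v) for v in values]


# Hypothetical right-skewed sample: one very large income among typical ones.
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 1_000_000]
log_incomes = log_transform(incomes)

# In skewed data the mean sits far from the median; after the transformation
# the two are much closer together.
ratio_before = statistics.mean(incomes) / statistics.median(incomes)
ratio_after = statistics.mean(log_incomes) / statistics.median(log_incomes)
print(ratio_before, ratio_after)
```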

Outlier handling

Outliers are observations that are distant from other observations. They can be due to errors or be genuine observations. Whatever the reason, it is important to identify them because machine learning models are sensitive to the range and distribution of values. The image below demonstrates how outliers drastically change a linear model’s fit.

Source: datascienceplus

The outlier handling method depends on the dataset. Suppose you work with a dataset of house prices in a region. If you know that a house’s price cannot exceed a certain amount in that region and there are observations above that value, you can:

  • remove those observations because they are probably erroneous
  • replace the outlier values with the mean or median of the attribute
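Both options can be sketched for the house price example. The prices, the cap of 1,000,000, and the function names are hypothetical. The sketch also assumes that the replacement median is computed from the in-range values only, so that the outliers do not influence it.

```python
import statistics


def remove_outliers(values, upper_limit):
    """Drop every observation above the known upper limit."""
    return [v for v in values if v <= upper_limit]


def replace_outliers_with_median(values, upper_limit):
    """Replace observations above the limit with the median of the rest."""
    in_range = [v for v in values if v <= upper_limit]
    median = statistics.median(in_range)
    return [v if v <= upper_limit else median for v in values]


prices = [210_000, 250_000, 190_000, 9_900_000, 230_000]  # hypothetical
PRICE_CAP = 1_000_000  # assumed maximum plausible price in the region

print(remove_outliers(prices, PRICE_CAP))
print(replace_outliers_with_median(prices, PRICE_CAP))
```

Removing keeps only trusted observations but shrinks the dataset, while replacing keeps every row at the cost of introducing an artificial value.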


Binning

Binning is grouping observations into ‘bins’. Converting the ages of individuals to age groups or grouping countries by continent are examples of binning. Whether and how to bin depends on what you are trying to obtain from the data.
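The age group example can be sketched with the standard library’s `bisect` module. The bin edges and labels below are hypothetical choices; suitable boundaries depend on the problem.

```python
from bisect import bisect_right

# Assumed bin boundaries: an age equal to an edge falls into the bin
# that starts at that edge (e.g. 18 belongs to "18-34").
AGE_EDGES = [18, 35, 65]
AGE_LABELS = ["0-17", "18-34", "35-64", "65+"]


def bin_age(age):
    """Map a numeric age to its age group label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [5, 18, 34, 35, 70]  # hypothetical observations
age_groups = [bin_age(age) for age in ages]
print(age_groups)
```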

Handling missing values

Missing values are among the most common problems in the data preparation process. They may be due to error, unavailability of the data, or privacy reasons. A significant portion of machine learning algorithms are designed to work with complete data, so you should handle missing values in a dataset. If you do not, the model may automatically drop those observations, which can be undesirable.

To handle missing values, you can:

  • fill missing observations with the mean or median of the attribute if it is numerical
  • fill with the most frequent category if the attribute is categorical
  • use ML algorithms to capture the structure of the data and fill the missing values accordingly
  • predict the missing values if you have domain knowledge about the data
  • drop the missing observations
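Three of the simpler options from the list can be sketched in plain Python. The sketch assumes that missing values are represented as `None`; the sample data and function names are hypothetical.

```python
import statistics


def fill_numeric(values):
    """Fill missing numeric values with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def fill_categorical(values):
    """Fill missing categories with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    most_frequent = statistics.mode(observed)
    return [most_frequent if v is None else v for v in values]


def drop_missing(rows):
    """Keep only the rows in which no attribute is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


ages = [22, None, 35, 41, None]        # hypothetical numeric column
colors = ["Red", None, "Blue", "Red"]  # hypothetical categorical column
rows = [{"age": 22, "color": "Red"}, {"age": None, "color": "Blue"}]

print(fill_numeric(ages))
print(fill_categorical(colors))
print(drop_missing(rows))
```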

Why is it important now?

Feature engineering is an integral part of every machine learning application because the features that are created and selected have a great impact on model performance. Features that are relevant to the problem and appropriate for the model increase model accuracy. Irrelevant features, on the other hand, result in a “garbage in, garbage out” situation.

How can we increase feature engineering efficiency?

Feature engineering is a process that is time-consuming, error-prone, and demands domain knowledge. It depends on the problem, the dataset, and the model so there is not a single method that solves all feature engineering problems. However, there are some methods to automate this process:

  • Open-source Python libraries for automated feature engineering such as featuretools. Featuretools uses an algorithm called deep feature synthesis to generate feature sets for structured datasets.
  • AutoML solutions that offer automated feature engineering. For more information on AutoML, check our comprehensive guide.

However, it should be noted that automated feature engineering tools use algorithms and may not be able to incorporate valuable domain knowledge that a data scientist may have.
