
Feature Selection in Machine Learning


Contents
  1. Feature Engineering
  2. Exploratory Data Analysis (EDA)
  3. Feature Engineering on Numeric data
  4. Forward selection
  5. Backward elimination
  6. Mixed selection
  7. Regularizing Models
  8. Python code Example

Feature Engineering

The success of a model depends heavily on the variables (parameters) used to construct it. In their raw form, these variables are usually not in a state where they can be used directly for modeling.

Feature engineering is the process of transforming data from its raw state into a state suitable for modeling. It turns raw data columns into features that represent a given situation more clearly. The quality of a feature in distinctly representing an entity impacts the quality of the model in predicting that entity's behavior.

Exploratory Data Analysis (EDA) is the first step towards feature engineering, as it is critical for assessing the quality of the raw data and planning the transformations required.

Exploratory Data Analysis (EDA)

Some of the key activities performed in EDA include (a short pandas sketch follows this list) –

  1. Assigning meaningful, standardized names to the attributes
  2. Capturing meta-information about the data: column-level details such as what each column is, how it was collected, units of measurement, frequency of measurement, possible range of values, etc.
  3. Listing and addressing the challenges of using the data in its existing form, e.g. missing values, outliers, data shift, sampling bias
  4. Descriptive statistics – spread (central values, skew, tails), mixtures of Gaussians
  5. Data distribution across the different target classes (in a classification setting)
  6. Outlier analysis and a strategy for imputation
  7. Assessing the impact of the actions taken on the data
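
A minimal pandas sketch of a few of these checks on a small hypothetical dataset (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({
    "Annual Income": [52_000, 61_000, np.nan, 480_000, 45_000, 58_000],
    "Age ": [34, 29, 41, 38, np.nan, 52],
    "target": [0, 1, 0, 1, 0, 1],
})

# 1. Standardize attribute names.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 3. Surface data-quality issues: missing values per column.
print(df.isna().sum())

# 4. Descriptive statistics: central values, spread, tails.
print(df.describe())

# 5. Data distribution across target classes (classification setting).
print(df.groupby("target").size())

# 6. Simple IQR-based outlier count per numeric column.
numeric = df.select_dtypes("number")
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
print(((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum())
```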

Based on the EDA findings, typical feature engineering transformations include –

  • Transforming the raw data into useful attributes by generating derived attributes from existing ones, when the derived attributes are likely to carry more information than the originals.
  • Transforming data attributes with valid mathematical operations, such as a log transformation of a skewed distribution, when the transformed data is likely to support a simpler model without losing information (see the sketch below).
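
For example, a right-skewed column can be log-transformed, and two raw columns can be combined into a derived attribute; a minimal sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed income column and two raw columns.
df = pd.DataFrame({
    "income": [20_000, 35_000, 52_000, 480_000, 61_000],
    "total_spend": [5_000, 9_000, 14_000, 90_000, 18_000],
    "n_purchases": [10, 18, 25, 120, 30],
})

# Log transformation to compress the long right tail (log1p handles zeros).
df["log_income"] = np.log1p(df["income"])

# Derived attribute: average spend per purchase, likely more informative
# than either raw column on its own.
df["avg_spend_per_purchase"] = df["total_spend"] / df["n_purchases"]
print(df.head())
```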

Feature Engineering on Numeric data

  1. Integers and floats are the most common data types used directly in building models, but transforming them before modelling may yield better results.
  2. Feature engineering on numerical columns may take the form of –
    – scaling the data when using algorithms that rely on distance-based similarity measurements
    – transforming distributions with mathematical techniques, e.g. an approximately exponential distribution made near-normal using a log function
    – binning the numeric data followed by binarization, e.g. using one-hot encoding
  3. Binning can make linear models more powerful when the data distribution of a predictor is spread out yet has a trend.
  4. Interaction and polynomial features are another way to enrich the feature representation, especially for linear models.
  5. In the binning example, the linear model learns a constant value (intercept) in each bin; however, we can also let it learn a slope by including the original feature alongside the bin indicators (a short scikit-learn sketch of these ideas follows this list).
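
A minimal scikit-learn sketch of these ideas on a hypothetical single-feature dataset – scaling, binning followed by one-hot encoding, polynomial features, and keeping the raw feature alongside the bin indicators so a linear model can learn a slope within each bin:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer, PolynomialFeatures

# Hypothetical numeric feature with a trend but a spread-out distribution.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))

# Scaling for distance-based algorithms (k-NN, k-means, SVM).
X_scaled = StandardScaler().fit_transform(X)

# Binning followed by one-hot encoding: each sample becomes an indicator
# vector marking the bin it falls into.
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="uniform")
X_binned = binner.fit_transform(X)

# Interaction and polynomial features enrich a linear model.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Keeping the original feature alongside the bin indicators lets a linear
# model learn a slope within each bin, not just a constant per bin.
X_bin_plus_raw = np.hstack([X_binned, X])
print(X_scaled.shape, X_binned.shape, X_poly.shape, X_bin_plus_raw.shape)
```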

Feature Selection

  • Suppose you have a learning algorithm LA and a set of input attributes {X1, X2, …, Xp}
  • You expect that LA will find only some subset of the attributes useful.
  • Question: how can we use cross-validation to find a useful subset? (a sketch follows this list)
  • Some ideas:
    – Forward selection
    – Backward elimination
    – Mixed selection
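
One way to answer the cross-validation question is a greedy subset search scored by cross-validation, as in scikit-learn's SequentialFeatureSelector; a minimal sketch on hypothetical data (the choice of three features to select is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Hypothetical data: three informative predictors and two noise columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = 3 * X["a"] - 2 * X["b"] + X["c"] + rng.normal(scale=0.5, size=200)

# Greedy subset search, each candidate subset scored by 5-fold cross-validation.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X, y)
print("selected:", list(X.columns[selector.get_support()]))
```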

Forward Selection

  • Begin with the null model – a model that contains an intercept but no predictors
  • Then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS (or highest R^2)
  • Then add to that model the variable that results in the lowest RSS (or highest R^2) for the new two-variable model
  • Continue this approach until some stopping rule is satisfied (a minimal sketch follows)
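
A minimal sketch of forward selection with ordinary least squares, using in-sample R^2 (equivalent to choosing the lowest RSS) as the criterion and a fixed number of features as an assumed stopping rule; the data here is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: three informative predictors and two noise columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = 3 * X["a"] - 2 * X["b"] + X["c"] + rng.normal(scale=0.5, size=200)

selected, remaining = [], list(X.columns)
n_features_to_select = 3  # assumed stopping rule

for _ in range(n_features_to_select):
    # Try adding each remaining variable and keep the one giving the
    # highest R^2 (equivalently, the lowest RSS).
    scores = {
        col: LinearRegression().fit(X[selected + [col]], y)
                               .score(X[selected + [col]], y)
        for col in remaining
    }
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print(f"added {best}, R^2 = {scores[best]:.3f}")

print("selected:", selected)
```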

Backward Elimination

  • Start with all p variables in the model
  • For each variable, check how much RSS increases (or R^2 decreases) when it is removed, and drop the variable with the least influence, i.e., the least significant one
  • Refit the resulting (p-1)-variable model and again remove the least significant variable
  • Continue this procedure until a stopping rule is reached (a minimal sketch follows)
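
A minimal backward-elimination sketch on the same kind of hypothetical data; as the measure of significance it uses OLS p-values from statsmodels, and the 0.05 threshold is an assumed stopping rule:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: three informative predictors and two noise columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = 3 * X["a"] - 2 * X["b"] + X["c"] + rng.normal(scale=0.5, size=200)

features = list(X.columns)
p_threshold = 0.05  # assumed significance threshold

while features:
    model = sm.OLS(y, sm.add_constant(X[features])).fit()
    pvalues = model.pvalues.drop("const")
    worst = pvalues.idxmax()           # least significant remaining variable
    if pvalues[worst] <= p_threshold:
        break                          # everything left is significant; stop
    features.remove(worst)
    print(f"dropped {worst} (p = {pvalues[worst]:.3f})")

print("kept:", features)
```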

Mixed Selection

  • This is a combination of forward and backward selection
  • We start with no variables in the model and, as in forward selection, add the variable that provides the best fit
  • As new predictors are added, the significance of variables already in the model can drop
  • Thus, if at any point the significance of one of the variables in the model falls below a certain threshold, we remove that variable from the model
  • We continue these forward and backward steps until all variables in the model have sufficiently high significance and any variable outside the model would have low significance if added (a minimal sketch follows)
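
A minimal stepwise (mixed) selection sketch on hypothetical data: each round adds the candidate with the smallest p-value, then drops any variable already in the model whose p-value has risen above a threshold; the entry/removal thresholds are assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data; "d" is nearly a copy of "a", so its significance can drop.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=list("abcdef"))
X["d"] = X["a"] + rng.normal(scale=0.1, size=200)
y = 3 * X["a"] - 2 * X["b"] + X["c"] + rng.normal(scale=0.5, size=200)

p_enter, p_remove = 0.05, 0.10   # assumed entry / removal thresholds
selected = []

for _ in range(2 * X.shape[1]):  # bound the number of steps
    remaining = [c for c in X.columns if c not in selected]
    # Forward step: add the candidate with the smallest p-value, if it qualifies.
    candidates = {}
    for col in remaining:
        fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
        candidates[col] = fit.pvalues[col]
    best = min(candidates, key=candidates.get) if candidates else None
    if best is None or candidates[best] >= p_enter:
        break
    selected.append(best)
    # Backward step: drop any variable whose significance has deteriorated.
    fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
    pvalues = fit.pvalues.drop("const")
    worst = pvalues.idxmax()
    if pvalues[worst] > p_remove:
        selected.remove(worst)

print("selected:", selected)
```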

Regularizing Linear Models (Shrinkage methods)

When we have too many parameters and are exposed to the curse of dimensionality, we can resort to dimensionality reduction techniques such as transforming the data to principal components (PCA) and discarding the components with the smallest eigenvalues. Finding the right number of principal components can be a laborious process. Instead, we can employ shrinkage methods.

Shrinkage methods attempt to shrink the coefficients of the attributes, leading us towards simpler yet effective models. The two best-known shrinkage methods are ridge regression and the lasso; ridge regression is described below:

  • Ridge regression is similar to linear regression in that the objective is to find the best-fitting surface; the difference lies in how the best coefficients are found. Whereas the optimization function in linear regression is the SSE, here it is slightly different.

Linear regression cost function:

\text{SSE} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

Ridge regression cost function, with an additional penalty term:

\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

  • The term λ Σ βj² acts as a penalty on large-magnitude coefficients: when λ is set to a high value, the coefficients are suppressed significantly, and when λ is set to 0, the cost function becomes the same as the linear regression cost function (a short scikit-learn sketch follows).
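
A minimal scikit-learn sketch of this effect (the penalty weight is called alpha in scikit-learn, corresponding to λ above); the data is hypothetical, with a near-duplicate column where shrinkage visibly stabilizes the coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data with a near-duplicate (highly correlated) predictor.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=100)   # near-copy of column 0
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(scale=0.5, size=100)

print("OLS:        ", LinearRegression().fit(X, y).coef_.round(2))
for alpha in (0.1, 1.0, 100.0):
    # Small alpha stays close to ordinary least squares; large alpha
    # suppresses the coefficients toward zero.
    print(f"alpha={alpha:>5}:", Ridge(alpha=alpha).fit(X, y).coef_.round(2))
```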

This brings us to the end of the blog on Feature Selection. If you found this helpful and wish to learn more such concepts, join Great Learning Academy’s pool of free online courses today, and learn the most in-demand skills to power ahead in your career.
