Streamlining Your Machine Learning Workflow: Data Cleaning, Feature Selection, Modeling, and Interpretability
Introduction
Data Cleaning, Feature Selection, Modeling, and Interpretability are all important steps in the machine learning process.
- Data Cleaning is the process of identifying and correcting or removing errors and inconsistencies in the data. This can involve tasks such as removing duplicates, dealing with missing values, and transforming data into a format suitable for analysis. The goal of data cleaning is to ensure that the data is accurate, complete, and consistent.
- Feature Selection is the process of selecting a subset of relevant features from the original dataset. This can involve tasks such as identifying redundant or irrelevant features, and selecting the most informative features for the task at hand. The goal of feature selection is to improve the performance of the model by reducing the number of features that are used to make predictions.
- Modeling involves selecting an appropriate machine learning algorithm and training it on the cleaned and feature-selected data. The goal of modeling is to create a predictive model that can accurately predict the target variable given the input features.
- Interpretability refers to the ability to understand and explain how a model arrived at its predictions. Interpretability is becoming increasingly important in machine learning, especially in fields such as healthcare and finance where the consequences of incorrect predictions can be severe. Interpretability can involve tasks such as visualizing the model’s decision-making process, identifying important features for the model’s predictions, and understanding the impact of different features on the model’s output.
- Overall, these four steps are critical to the success of any machine learning project. Proper data cleaning, feature selection, modeling, and interpretability can help ensure that the model is accurate, reliable, and provides valuable insights.
Abstract
The objective of this article is to predict the happiness score of different countries based on a variety of factors such as economic production, social support, life expectancy, freedom, absence of corruption, and generosity. The analysis aims to identify the key factors that contribute to the overall happiness score of a country and to develop a predictive model that can accurately forecast happiness scores based on these factors. This study has the potential to provide valuable insights into the factors that promote happiness and well-being in different countries around the world, which can inform policy decisions aimed at improving quality of life.
This article combines three topics, Data Cleaning, Feature Selection, and Model Interpretability, in order to produce a clean and readable report.
01. Data Cleaning and Feature Selection
We performed exploratory data analysis, beginning with data cleaning, which involves identifying and correcting or removing any errors, inconsistencies, or missing values in the data. This ensures that the data is of high quality and suitable for analysis. The steps involved in data cleaning are (a short sketch follows the list):
- Handling missing values: Depending on the extent of missing data, we either removed rows or columns or imputed the missing values with appropriate techniques such as the mean, median, or mode.
- Handling outliers: Outliers can be handled by removing them or by transforming the data using techniques such as log transformation.
- Data normalization: Standardization or normalization of data helps in scaling the data so that different features are on the same scale.
- Data type conversion: Converting categorical data into numerical format for analysis.
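Below is a minimal sketch of these cleaning steps, assuming a small hypothetical pandas DataFrame df with placeholder columns num_col and cat_col (not part of the happiness dataset):
import numpy as np
import pandas as pd
# Hypothetical toy data with a missing value and an outlier
df = pd.DataFrame({'num_col': [1.0, 2.0, np.nan, 100.0],
                   'cat_col': ['a', 'b', 'b', None]})
# Handling missing values: impute numeric columns with the median,
# categorical columns with the mode
df['num_col'] = df['num_col'].fillna(df['num_col'].median())
df['cat_col'] = df['cat_col'].fillna(df['cat_col'].mode()[0])
# Handling outliers: a log transform compresses extreme values
df['num_col_log'] = np.log1p(df['num_col'])
# Data normalization: min-max scale to the [0, 1] range
df['num_col_scaled'] = (df['num_col'] - df['num_col'].min()) / (df['num_col'].max() - df['num_col'].min())
# Data type conversion: one-hot encode categorical data
df = pd.get_dummies(df, columns=['cat_col'])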
Feature selection involves selecting a subset of relevant features from the original set of features to improve the performance of the machine learning model. The goal of feature selection is to reduce the dimensionality of the data and remove any irrelevant or redundant features that do not contribute much to the outcome. The steps involved in feature selection include (see the sketch after this list):
- Correlation analysis: Identifying features that are highly correlated and removing one of them to reduce redundancy.
- Significance of each feature: Conducting statistical tests such as ANOVA, Chi-squared test, and t-tests to determine the significance of each feature.
- Feature ranking: Using algorithms such as decision trees, Random Forest or LASSO to rank the features in order of importance and selecting the top-ranked features.
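A minimal sketch of these three approaches, assuming a feature DataFrame X and a numeric target y (placeholder names):
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
# 1. Correlation analysis: flag highly correlated feature pairs
corr = X.corr().abs()
high_corr = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.8]
# 2. Statistical significance: Pearson correlation test per feature
p_values = {col: stats.pearsonr(X[col], y)[1] for col in X.columns}
# 3. Feature ranking: LASSO coefficients and random forest importances
lasso = LassoCV(cv=5).fit(X, y)
rf = RandomForestRegressor(random_state=0).fit(X, y)
ranking = pd.DataFrame({'lasso_coef': lasso.coef_,
                        'rf_importance': rf.feature_importances_},
                       index=X.columns).sort_values('rf_importance', ascending=False)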
In Data Cleaning and Feature Selection we performed EDA and predicted the happiness score of countries around the world based on factors such as economic production, social support, life expectancy, freedom, absence of corruption, and generosity.
The notebook uses statistical methods like p-values and t-statistics, and visualization techniques like histograms, Q-Q plots, scatter plots, and box plots from Python's matplotlib and seaborn libraries, to answer the following questions about the dataset:
- What are the data types? (Only numeric and categorical)
- Are there missing values?
- What are the likely distributions of the numeric variables?
- Which independent variables are useful to predict a target (dependent variable)? (Use at least three methods)
- Which independent variables have missing data? How much?
- Do the training and test sets have the same data?
- Are the predictor variables independent of all the other predictor variables?
- Which predictor variables are the most important?
- Do the ranges of the predictor variables make sense?
- What are the distributions of the predictor variables?
- Remove outliers and keep outliers (does it have an effect on the final predictive model)?
- Remove 1%, 5%, and 10% of your data randomly and impute the values back using at least 3 imputation methods. How well did the methods recover the missing values? That is, remove some data, check the percentage error on residuals for numeric data, and check the bias and variance of the error (a sketch of this experiment follows this list).
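A hedged sketch of that experiment, assuming a numeric feature DataFrame X; mean, median, and KNN imputation stand in for the "at least 3 imputation methods":
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
rng = np.random.default_rng(42)
for frac in (0.01, 0.05, 0.10):
    # Randomly mask a fraction of the cells
    mask = rng.random(X.shape) < frac
    X_holed = X.mask(mask)
    for name, imputer in [('mean', SimpleImputer(strategy='mean')),
                          ('median', SimpleImputer(strategy='median')),
                          ('knn', KNNImputer(n_neighbors=5))]:
        X_imputed = imputer.fit_transform(X_holed)
        # Residuals on the cells that were removed
        residuals = (X_imputed - X.to_numpy())[mask]
        # Bias: mean residual; variance: spread of the residuals
        print(f"{frac:.0%} missing, {name}: "
              f"bias={residuals.mean():.4f}, variance={residuals.var():.4f}")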
Information about the Dataset
Target Variable/Dependent Variable
- Happiness Score: A metric measured in 2015 by asking the sampled people the question: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.”
Predictor Variables/Independent Variables
- Country: Name of the country
- Region: The region or continent the country belongs to
- Happiness Rank: Ranking of the countries according to their happiness score
- Standard Error: The standard error of the happiness score
- Economy (GDP per Capita): Measures the monetary value of final goods and services; the extent to which GDP contributes to the calculation of the Happiness Score.
- Family: The extent to which family values contribute to the calculation of the Happiness Score.
- Health (Life Expectancy): The extent to which health/life expectancy contributes to the Happiness Score calculation.
- Freedom: The extent to which freedom, in all respects, contributes to the calculation of the Happiness Score.
- Trust (Government Corruption): The extent to which trust and perceived government corruption in a country affect the Happiness Score.
- Generosity: The quality of being kind and generous; its contribution to the Happiness Score.
- Dystopia Residual: The sum of the dystopia happiness score (1.85), i.e., the score of a hypothetical country ranked below the lowest-ranking country in the report, and each country's unexplained residual.
We’ll now check the distribution of all the numerical values
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the 2015 World Happiness dataset (the same CSV loaded later in this article)
Data = pd.read_csv("https://raw.githubusercontent.com/Hanagojiv/DataSci/main/2015.csv")
plt.figure(figsize=(15,7))
# Density histogram with a KDE overlay (histplot replaces the deprecated distplot)
sns.histplot(Data['Happiness Rank'], bins=10, color="green", kde=True, stat="density")
print("The mean of Happiness Rank is ", round(Data['Happiness Rank'].mean(), 2))
print("The median of Happiness Rank is ", Data['Happiness Rank'].median())
print("The mode of Happiness Rank is ", Data['Happiness Rank'].mode())
plt.xlabel("Happiness Rank", size=14)
plt.ylabel("Density", size=14)
plt.title('Distribution curve for Happiness Rank', size=20)
plt.figure(figsize=(15,7))
sns.histplot(Data['Happiness Score'], bins=10, color="green", kde=True, stat="density")
print("The mean of Happiness Score is ", round(Data['Happiness Score'].mean(), 2))
print("The median of Happiness Score is ", Data['Happiness Score'].median())
print("The mode of Happiness Score is ", Data['Happiness Score'].mode())
plt.xlabel("Happiness Score", size=14)
plt.ylabel("Density", size=14)
plt.title('Distribution curve for Happiness Score', size=20)
Checking the range of the predictor variables
plt.figure(figsize=(20,7))
sns.boxplot(data=x, palette="Set3")
Checking the Ranges of the predictor variables individually
cols = list(x.columns)
plt.figure(figsize=(20, 10))
for i in range(len(cols)):
    plt.subplot(3, 3, i + 1)
    plt.title('Box plot of ' + cols[i])
    sns.boxplot(x=x[cols[i]], palette='Set3')
plt.tight_layout()
plt.show()
Normalizing the dataset
We need to scale our numerical columns. Although we could use any scaling technique, we will use normalization because we want values in the range [0, 1], and because normalization is highly affected by outliers it also helps us detect them.
# list of numerical columns which require normalization
num_cols=['Economy (GDP per Capita)','Family','Health (Life Expectancy)', 'Dystopia Residual']
# Importing required library from sklearn for normalization
from sklearn import preprocessing
feature_to_scale = num_cols
# Preparing for normalizing
min_max_scaler = preprocessing.MinMaxScaler()
# Transform the data to fit minmax processor
x[feature_to_scale] = min_max_scaler.fit_transform(x[feature_to_scale])
Checking the Ranges of the predictor variables together after normalization of numerical variables
plt.figure(figsize=(20,7))
sns.boxplot(data=x, palette="Set3")
plt.title("Box plot of predictor variables of the dataset", size=14)
At first glance we can see that the variables exhibit some collinearity. To visualize the correlation values, let's check the heatmap next.
#the heat map of the correlation
plt.figure(figsize=(20,7))
sns.heatmap(x.corr(), annot=True, cmap='RdYlGn')
Observations:
- The heatmap makes it clear that several variables are correlated with one another, for example Health (Life Expectancy) and Economy (GDP per Capita); Family and Economy (GDP per Capita); and Trust (Government Corruption) and Freedom.
- The degree of collinearity is below 0.5 for most variable pairs.
- Health (Life Expectancy) and Economy (GDP per Capita), however, have a correlation of 0.82.
Using OLS to find p-values and identify significant features
import statsmodels.api as sm
model = sm.OLS(y, x[['Standard Error', 'Economy (GDP per Capita)', 'Family',
                     'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
                     'Generosity', 'Dystopia Residual']]).fit()
model.summary()
Observations: All predictor variables are significant.
- Standard Error has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Economy (GDP per Capita) has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Family has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Health (Life Expectancy) has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Freedom has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Trust (Government Corruption) also has a p-value below 0.05, so it is a significant feature.
- Generosity has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
- Dystopia Residual has a p-value of 0.0, which is less than 0.05, so it is a significant feature.
Feature Importance Plot
from sklearn import ensemble
# The target (Happiness Score) is continuous, so we use a regressor
model_2 = ensemble.GradientBoostingRegressor()
model_2.fit(X_train, y_train)
cols = X_train.columns
plt.figure(figsize=(15, 7))
plt.barh(cols, model_2.feature_importances_, color="#00b3b3")
plt.title('Feature Importance', size=14)
Observations:
- From the graph we can see that Dystopia Residual, Economy (GDP per Capita) and Family appear to be the most significant features.
- All other variables carry some importance, though noticeably less than Dystopia Residual, Economy (GDP per Capita), and Family.
Model Interpretability
Now let's try to understand model interpretability for all the models we have worked with so far to analyze our World Happiness dataset.
Model interpretability refers to the ability to understand and explain how a machine learning model arrives at its predictions or decisions. It is an essential aspect of machine learning, as it allows us to trust the decisions made by a model and identify any biases or errors that may be present.
Interpretable models are designed to be transparent and easy to understand, while still maintaining good predictive performance. They enable us to understand which features or inputs the model is relying on to make its predictions, and how different changes in the input data affect the output.
Interpretability can be achieved through various methods, including using simple and transparent models such as decision trees or linear models, adding interpretability methods to complex models such as neural networks, or using techniques such as feature importance analysis, partial dependence plots, and model-agnostic methods like SHAP (SHapley Additive exPlanations).
Overall, model interpretability is crucial for ensuring the trustworthiness, transparency, and accountability of machine learning models, particularly in high-stakes applications such as healthcare, finance, and criminal justice.
Let’s explore the interpretability of a machine learning model trained on a Happiness prediction dataset.
We focus on using partial dependence plots to visualize the relationship between individual features and the model's predictions. The dataset was preprocessed, and a random forest regressor was trained on the data. Partial dependence plots were then generated for each feature, and the resulting plots were analyzed to gain insights into the model's decision-making process.
We'll later demonstrate that partial dependence plots can provide a valuable tool for understanding and interpreting machine learning models, particularly in cases where interpretability is critical, such as this happiness-prediction task. The notebook's findings highlight the importance of model interpretability and suggest that interpretability tools should be an integral part of the machine learning workflow.
OLS Regression
import pandas as pd
import statsmodels.api as sm
# load data
data = pd.read_csv("https://raw.githubusercontent.com/Hanagojiv/DataSci/main/2015.csv")
# define X and y
y = data['Happiness Score']
X = data[['Standard Error','Economy (GDP per Capita)','Family', 'Health (Life Expectancy)', 'Freedom','Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']]
# add a constant to the X matrix
X = sm.add_constant(X)
# fit the OLS model
OLS_model = sm.OLS(y, X).fit()
# interpret regression coefficients
print(OLS_model.summary())
Let’s Understand the output of the OLS Regression model:
The output of the OLS regression model provides us with information about the statistical significance and magnitude of the relationship between the independent variables and the dependent variable.
Here’s a brief explanation of some of the important statistics and coefficients that we can interpret from the output:
R-squared: This is a measure of how much variance in the dependent variable (Happiness Score) is explained by the independent variables. In this case, the R-squared value is 1.000, which means that the independent variables collectively explain 100% of the variation in the Happiness Score. This is expected rather than remarkable here: in the 2015 report the Happiness Score is constructed essentially as the sum of these component contributions (including the Dystopia Residual), so the regression recovers a near-deterministic relationship.
Coefficients: These are the estimated regression coefficients for each independent variable. They represent the change in the dependent variable associated with a one-unit increase in the independent variable, while holding all other variables constant. For example, the coefficient for Economy (GDP per Capita) is 1.0001, which means that a one-unit increase in GDP per Capita is associated with an increase of 1.0001 units in the Happiness Score, while holding all other variables constant.
Standard error: This is the standard error of the estimated coefficient, which measures the precision of the estimate. A smaller standard error indicates greater precision.
t-value: This is the t-statistic for each coefficient, which measures the number of standard errors the estimated coefficient is away from zero. A higher t-value indicates stronger evidence against the null hypothesis that the coefficient is equal to zero.
p-value: This is the p-value for each coefficient, which measures the probability of observing a t-statistic as extreme as the one observed, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
Confidence interval: This is the 95% confidence interval for each coefficient, which provides a range of values within which we can be 95% confident the true coefficient lies.
Overall, the OLS regression model provides us with a statistical model that allows us to quantify the relationship between the independent variables and the dependent variable, and to test hypotheses about the significance of these relationships.
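These quantities can also be pulled out of the fitted results object programmatically; a quick sketch using the OLS_model fitted above:
# Coefficients, standard errors, t-values, p-values,
# and 95% confidence intervals from the fitted statsmodels results
print(OLS_model.params)       # coefficients
print(OLS_model.bse)          # standard errors
print(OLS_model.tvalues)      # t-statistics
print(OLS_model.pvalues)      # p-values
print(OLS_model.conf_int())   # 95% confidence intervals
print(OLS_model.rsquared)     # R-squared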
Feature Importance for OLS Model
import shap
explainer_ols = shap.Explainer(OLS_model.predict, X)
shap_values_ols = explainer_ols(X)
shap.summary_plot(shap_values_ols, X, plot_type='bar')
Random Forest Regressor
Similarly, we'll fit the Random Forest Regressor model and get its feature importances.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
# load data
data = pd.read_csv("https://raw.githubusercontent.com/Hanagojiv/DataSci/main/2015.csv")
# split data into features (X) and target variable (y)
X = data[['Standard Error','Economy (GDP per Capita)','Family', 'Health (Life Expectancy)', 'Freedom','Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']]
y = data["Happiness Score"]
# fit model
rf_model = RandomForestRegressor()
rf_model.fit(X, y)
# plot feature importance
importances = rf_model.feature_importances_
sorted_idx = importances.argsort()
features = X.columns[sorted_idx]
plt.barh(range(len(sorted_idx)), importances[sorted_idx])
plt.yticks(range(len(sorted_idx)), features)
plt.xlabel("Importance")
plt.title("Feature Importance (Random Forest)")
plt.show()
SHAP Analysis for OLS model and Random Forest Regressor
import shap
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
y = data['Happiness Score']
X = data[['Standard Error','Economy (GDP per Capita)','Family', 'Health (Life Expectancy)', 'Freedom','Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the OLS model
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
# Create the tree-based model
rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
# Create the SHAP explainer for the OLS model
explainer_ols = shap.Explainer(ols_model, X_train)
# Get the SHAP values for the OLS model
shap_values_ols = explainer_ols(X_test)
# Create the SHAP explainer for the tree-based model
explainer_rf = shap.Explainer(rf_model, X_train)
# Get the SHAP values for the tree-based model
shap_values_rf = explainer_rf(X_test)
Plotting the SHAP values for the OLS model and Tree-based model
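As a minimal sketch (using the shap_values_ols and shap_values_rf computed above), beeswarm-style summary plots show how each feature's SHAP values are distributed for the two models:
# Summary (beeswarm) plot of SHAP values for the OLS/linear model
shap.summary_plot(shap_values_ols, X_test)
# Summary (beeswarm) plot of SHAP values for the random forest model
shap.summary_plot(shap_values_rf, X_test)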
Partial Dependence Plot for Random Forest Regressor
A partial dependence plot (PDP) shows how the predicted target variable (in this case, the Happiness Score) changes as a function of one or more features, while controlling for the other variables in the model.
In the example plot, we have two subplots, one for each feature, “Economy (GDP per Capita)” and “Family”. Each subplot shows the relationship between the feature on the x-axis and the predicted target variable (Happiness Score) on the y-axis.
The shape of the PDP curve indicates the nature of the relationship between the feature and the target variable. A flat line indicates that there is no relationship between the feature and the target variable, while a curved line indicates a non-linear relationship.
For example, in the first subplot, we see that the PDP curve for “Economy (GDP per Capita)” is upward sloping, indicating a positive relationship between the two variables. This suggests that as the GDP per capita increases, so does the predicted Happiness Score. Similarly, in the second subplot, we see that the PDP curve for “Family” is also upward sloping, indicating a positive relationship between the two variables. This suggests that as the perceived social support in a country (as measured by the Family variable) increases, so does the predicted Happiness Score.
The shaded area around the curve represents the confidence interval for the partial dependence estimate. The narrower the shaded area, the more confident we are in the partial dependence estimate.
Overall, partial dependence plots are a useful tool for understanding the relationship between features and the target variable in a machine learning model. By visualizing the partial dependence of the target variable on one or more features, we can gain insights into how the features influence the model’s predictions.
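The plot itself is not reproduced here, but as a hedged sketch of how such a figure can be generated (assuming the rf_model and X_train fitted in the SHAP section above), scikit-learn's PartialDependenceDisplay draws one subplot per requested feature:
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay
# One subplot per feature: partial dependence of the predicted
# Happiness Score on each feature, averaging over the others
PartialDependenceDisplay.from_estimator(
    rf_model, X_train, ["Economy (GDP per Capita)", "Family"])
plt.show()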
Partial Dependence Plot Explanation for the OLS Model
A partial dependence plot shows the relationship between a target response variable and a set of input features, by holding all other features constant at some specified values. In other words, it allows us to see how the predicted response changes as we vary the value of one input feature, while keeping all other input features fixed.
- Each partial dependence plot consists of two main components: a line plot and a shaded area. The line plot shows the average predicted response (in this case, the happiness score) as the value of the input feature varies over its range. The shaded area around the line plot shows the range of uncertainty in the prediction, typically represented by the 95% confidence interval.
- Let’s take a closer look at one of the partial dependence plots, say the one for the “Economy (GDP per Capita)” feature. The x-axis of the plot represents the values of the “Economy (GDP per Capita)” feature, which ranges from about 0.0 to 1.8 in this dataset. The y-axis of the plot represents the average predicted happiness score for each value of “Economy (GDP per Capita)”.
- The blue line in the plot represents the average predicted happiness score as we vary the value of “Economy (GDP per Capita)” while holding all other features constant. We can see that as the value of “Economy (GDP per Capita)” increases, the predicted happiness score also increases. This suggests that higher levels of economic prosperity are associated with greater happiness.
- The shaded area around the line shows the range of uncertainty in the prediction, represented by the 95% confidence interval. The wider the shaded area, the greater the uncertainty in the prediction. For example, we can see that there is greater uncertainty in the predicted happiness score at the high end of the “Economy (GDP per Capita)” range, where there are fewer observations.
- The vertical black line in the plot shows the average value of “Economy (GDP per Capita)” across the dataset. This gives us a sense of how representative the values in the dataset are of the feature’s true range. In this case, the average value is around 0.8, suggesting that the dataset covers a reasonable range of economic prosperity levels.
- By examining the partial dependence plots for all features in the dataset, we can gain insights into how each feature contributes to the predicted happiness score. For example, we can see that “Health (Life Expectancy)” and “Family” are also strong predictors of happiness, while “Trust (Government Corruption)” has a weaker relationship with happiness. These insights can help us to better understand the factors that drive happiness in different countries, and potentially inform policies aimed at promoting well-being.
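For completeness, here is a similar sketch for the linear model (assuming the ols_model and X_train defined earlier). For a linear model, each partial dependence curve is a straight line whose slope equals the fitted coefficient:
from sklearn.inspection import PartialDependenceDisplay
# Partial dependence of the linear model's prediction on selected features
PartialDependenceDisplay.from_estimator(
    ols_model, X_train,
    ["Economy (GDP per Capita)", "Health (Life Expectancy)"])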
Conclusion:
Data Cleaning and Feature Selection:
- We learned that the dataset did not contain any null values, which made our analysis easier and more accurate.
- We discovered that some features in the dataset were normally distributed, while others were skewed. This information helped us better understand the data and choose appropriate modeling techniques.
- We identified outliers in the dataset, which we removed to improve the accuracy of our analysis.
- By normalizing the dataset values, we were able to compare and analyze the variables on a consistent scale.
- We found that certain variables, such as GDP per capita, social support, and life expectancy, had a strong positive correlation with the Happiness Score.
- We trained and tested a Linear Regression model on the dataset, which allowed us to predict the Happiness Score based on other variables with reasonable accuracy.
- We learned that different imputation methods can have varying levels of bias error and variance, which should be taken into consideration when handling missing data.
Overall, we gained insights into the factors that contribute to happiness, and how we can analyze and model this type of data to make predictions and draw conclusions. These insights could potentially be used to inform policies or interventions aimed at increasing happiness levels in populations.
Model Interpretability
We learned that model interpretability refers to the ability to understand and explain how a machine learning model arrives at its predictions or decisions. It is an essential aspect of machine learning, as it allows us to trust the decisions made by a model and identify any biases or errors that may be present.
We also learned that interpretability can be achieved through various methods, including using simple and transparent models such as decision trees or linear models, adding interpretability methods to complex models such as neural networks, or using techniques such as feature importance analysis, partial dependence plots, and model-agnostic methods like SHAP.
Moreover, we learned that interpretability is crucial for ensuring the trustworthiness, transparency, and accountability of machine learning models, particularly in high-stakes applications such as healthcare, finance, and criminal justice. It can help us identify issues such as bias, fairness, and privacy concerns, and it can also provide insights into the underlying data generating process and help to improve the model’s performance.
References:
- GeeksforGeeks Quantile-Quantile plot documentation: https://www.geeksforgeeks.org/qqplot-quantile-quantile-plot-in-python/
- Matplotlib documentation, scikit-learn documentation, pandas official documentation, Analytics Vidhya
- Many techniques used in this notebook have been adopted from the following GitHub repositories:
- AI Skunkworks: https://github.com/aiskunks/Skunks_Skool
- Prof. Nik Bear Brown: https://github.com/nikbearbrown/
- The methods, parameters, and code corrections for the models have been adapted from Stack Overflow: https://stackoverflow.com
- AI Skunks AutoML notebook references
- YouTube: H2O
- Towards Data Science
- YouTube: Model Interpretability
- "Definitions, methods, and applications in interpretable machine learning"
Authors:
- Vivek Basavanth Hanagoji (https://www.linkedin.com/in/vivekhanagoji/)
- Nik Bear Brown (https://medium.com/@NikBearBrown)