Previously we discussed the meaning and methods of data science and machine learning. There are numerous tutorials on using machine learning, but it is often confusing to figure out where to start when given a problem. Over the course of my career, I have developed a nine-step framework – the ML Framework – with a set of questions that help me lay the foundation for a project. It is to be used only as a guide, because planning every detail of the data science process upfront isn't always possible, and more often than not you'll iterate multiple times between the different steps of the process.
ML Framework
- Describe the problem
- Chart a solution
- Look for the necessary data
- Check if the data is usable
- Explore and understand the data
- Decide which features to keep or delete
- Select a machine learning algorithm
- Interpret the results
- Plan for scaling
Describe the problem
What are we trying to solve?
The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project. The outcome should be a clear goal, a good understanding of the context, a well-defined deliverable, and a plan of action with a timetable.
Chart a solution
- What does success look like?
- How do we measure the success of the project?
- What is an acceptable threshold and what is not?
- Do we have prior data and results and want to forecast the future (supervised learning), or are we looking for patterns (unsupervised learning)?
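To make the distinction concrete, here is a minimal scikit-learn sketch on synthetic data, with hypothetical features and labels: we fit a classifier when we already have known outcomes (supervised), and run clustering when we are only looking for structure (unsupervised).

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# The feature matrix X and labels y are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))              # 200 rows, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # known outcomes -> supervised

# Supervised: we have historical outcomes and want to predict new ones.
clf = LogisticRegression().fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised: no labels, we only look for patterns in the data.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters[:5])
```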
Look for the necessary data
- Is data readily available?
- What type of data is available? (hint – structured, unstructured, graph)
- How much data is available?
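When the data is structured and sits in a flat file, a quick first look with pandas can answer both the type and volume questions. This is only a sketch; data.csv is a hypothetical placeholder for whatever source you actually have.

```python
# A minimal first look at a structured (tabular) dataset with pandas.
# "data.csv" is a hypothetical file name; point it at your own source.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)    # how much data: (rows, columns)
print(df.dtypes)   # what type of data each column holds
print(df.head())   # a quick peek at the first few records
```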
Check if the data is usable
- What does the data look like?
- Are there missing values or duplicates?
- How do we deal with missing values?
- How do we deal with outliers? What about stray whitespace?
- We should expect to spend a good portion of our project time on data correction and cleansing, sometimes up to 80%.
- What level of aggregation is the data at?
- Does the data need to be transformed, combined, converted or scaled?
- Transforming the data from its raw form into data that's directly usable in the model will involve detecting and correcting different kinds of errors, combining data from multiple sources, and changing the aggregation level of the data.
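Below is a minimal pandas sketch of a few of these checks: missing values, duplicates, stray whitespace, a simple IQR-based outlier rule, and a change of aggregation level. The file and column names (data.csv, age, region, sales, date) are hypothetical placeholders.

```python
# A minimal sketch of common usability checks with pandas.
# File and column names are hypothetical; adapt them to your data.
import pandas as pd

df = pd.read_csv("data.csv")

# Missing values and duplicates
print(df.isna().sum())                              # missing values per column
df = df.drop_duplicates()                           # remove exact duplicates
df["age"] = df["age"].fillna(df["age"].median())    # one way to impute

# Whitespace and a simple outlier rule
df["region"] = df["region"].str.strip()             # trim stray whitespace
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]  # IQR rule

# Changing the aggregation level, e.g. from transactions to monthly totals
df["date"] = pd.to_datetime(df["date"])
monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()
print(monthly.head())
```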
Explore and understand the data
- How do we make sure we understand the data?
- We want to look closely at the data to get a sense of what it is about. We may use statistical measures like mean, median, variance, and standard deviation to learn about the data distribution.
- Can we visualize the data? Does it show any trends or patterns?
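A minimal exploration sketch, again with hypothetical file and column names, might look like this:

```python
# A minimal exploration sketch; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Summary statistics: mean, std, quartiles, etc. per numeric column
print(df.describe())

# Distribution of one variable: are there skew, gaps, or outliers?
df["sales"].hist(bins=30)
plt.xlabel("sales")
plt.show()

# A pairwise relationship: does a trend or pattern emerge?
df.plot.scatter(x="ad_spend", y="sales")
plt.show()
```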
Decide which features to keep or delete
- How many columns or variables of data (also called features) are available?
- How many are essential for the analysis?
- Sometimes we may have too many variables and need to reduce the number because they don't add new information to the model. Having too many variables makes the model difficult to handle, and certain techniques don't perform well with too many of them. Sometimes we may need to add features too.
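As a sketch of the idea rather than a prescription, scikit-learn's VarianceThreshold can drop near-constant features, and a correlation matrix can flag redundant pairs. The file and column names here are hypothetical.

```python
# A minimal sketch of trimming uninformative features.
# File and column names are hypothetical.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("data.csv")
numeric = df.select_dtypes(include="number").dropna()

# Drop features that barely vary; they add little information to the model
selector = VarianceThreshold(threshold=0.01)
selector.fit(numeric)
kept = numeric.columns[selector.get_support()]
print("kept features:", list(kept))

# Flag highly correlated pairs; one of each pair is often redundant
corr = numeric.corr().abs()
high = (corr > 0.9) & (corr < 1.0)
print(corr.where(high).stack())
```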
Select a machine learning algorithm
- What machine learning algorithm should we use?
- Sometimes it makes sense to try multiple algorithms and compare the results using cross-validation. Cross-validation can help eliminate a model that only performs well on the specific data it was trained on.
- How should we split the data into training and test sets?
- Datasets are often split into a training set and a test set, where the training set is used to learn a model and the test set is used to check how well the model generalizes to unseen data. How to split the data depends on its volume and complexity.
- How do we compare different models?
- ROC curves help visualize the performance of different models across different thresholds. Other metrics such as accuracy, true positive rate, and F1-score can also be used.
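Here is a minimal scikit-learn sketch on synthetic data that ties these questions together: an 80/20 train/test split, 5-fold cross-validation, and a ROC AUC comparison of two candidate models.

```python
# A minimal model-comparison sketch with scikit-learn; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set; an 80/20 split is a common starting point
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    # Cross-validation guards against a model that only fits this split
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: CV AUC={cv_scores.mean():.3f}, test AUC={test_auc:.3f}")
```

A model whose cross-validation score is much higher than its test score is likely fitting quirks of the training data rather than the underlying pattern.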
Interpret the results
- What do the results mean?
- Do we need to fine tune the model or add more data?
- Would we like to fit for the offset (i.e., intercept)?
- Overfitting and underfitting represent a fundamental problem of machine learning. The model needs to be general enough to work with data it has never seen (outside the training set), which means finding the right balance between optimization and generalization.
- Bias and variance are other levers used to tune a model. A model with high bias will underfit the training data because it makes simplistic assumptions, while a model with high variance will overfit because it models the noise in the data.
- Would we like the model to be normalized?
- Would we like to preprocess our features to add model flexibility?
- What degree of regularization would we like to use in our model?
- What happens if we change the parameters?
- Are the results helpful or do we need to get more data?
- Plot training vs. testing accuracy to check the model performance
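One way to do this in practice is to vary the regularization strength and compare training and testing accuracy: a large gap points to overfitting, while low accuracy on both points to underfitting. The sketch below uses synthetic data and logistic regression purely for illustration.

```python
# A minimal sketch: compare training vs. testing accuracy while varying
# regularization strength (C) to spot under- and overfitting. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for C in [0.001, 0.1, 1.0, 100.0]:   # smaller C = stronger regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # train >> test suggests overfitting; both low suggests underfitting
    print(f"C={C}: train={train_acc:.3f}, test={test_acc:.3f}")
```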
Plan for scaling
- Is the algorithm ready to scale?
- What infrastructure do we need to scale?
These steps will help you get started. As you understand your business requirements and the available data, some steps will require deeper introspection while you may decide to forgo others. The field of analytics is changing rapidly as new algorithms pop up quickly, and they make life easier if you know what needs to be done!
