Previously we discussed the meaning and methods of data science and machine learning. There are numerous tutorials on using machine learning, but it is often confusing to figure out where to start when given a problem. Over the course of my career, I have developed a nine-step framework – the ML Framework – with a set of questions that help me lay the foundation for a project. It is to be used only as a guide, because planning every detail of the data science process upfront isn't always possible, and more often than not you'll iterate multiple times between the different steps of the process.
ML Framework
- Describe the problem
- Chart a solution
- Look for the necessary data
- Check if the data is usable
- Explore and understand the data
- Decide which features to keep or delete
- Select a machine learning algorithm
- Interpret the results
- Plan for scaling
Describe the problem
What are we trying to solve?
The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project. The outcome should be a clear goal, a good understanding of the context, a well-defined deliverable, and a plan of action with a timetable.
Chart a solution
- What does success look like?
- How do we measure the success of the project?
- What is an acceptable threshold and what is not?
- Do we have prior data and results and want to forecast the future (supervised learning), or are we looking for patterns (unsupervised learning)?
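To make the distinction concrete, here is a minimal scikit-learn sketch on synthetic data, with hypothetical features and labels: we fit a classifier when we already have known outcomes (supervised), and run clustering when we are only looking for structure (unsupervised).

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# The feature matrix X and labels y are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))              # 200 rows, 4 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # known outcomes -> supervised

# Supervised: we have historical outcomes and want to predict new ones.
clf = LogisticRegression().fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised: no labels, we only look for patterns in the data.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters[:5])
```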
Look for the necessary data
- Is data readily available?
- What type of data is available? (hint – structured, unstructured, graph)
- How much data is available?
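When the data is structured and sits in a flat file, a quick first look with pandas can answer both the type and volume questions. This is only a sketch; data.csv is a hypothetical placeholder for whatever source you actually have.

```python
# A minimal first look at a structured (tabular) dataset with pandas.
# "data.csv" is a hypothetical file name; point it at your own source.
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)    # how much data: (rows, columns)
print(df.dtypes)   # what type of data each column holds
print(df.head())   # a quick peek at the first few records
```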
Check if the data is usable
- What does the data look like?
- Are there missing values or duplicates?
- How do we deal with missing values?
- How do we deal with outliers? What about stray whitespace?
- We should expect to spend a good portion of our project time on data correction and cleansing, sometimes up to 80%.
- What level of aggregation is the data at?
- Does the data need to be transformed, combined, converted or scaled?
- Transforming the data from its raw form into data that's directly usable in the model will involve detecting and correcting different kinds of errors, combining data from multiple sources, and changing the aggregation level of the data.
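Below is a minimal pandas sketch of a few of these checks: missing values, duplicates, stray whitespace, a simple IQR-based outlier rule, and a change of aggregation level. The file and column names (data.csv, age, region, sales, date) are hypothetical placeholders.

```python
# A minimal sketch of common usability checks with pandas.
# File and column names are hypothetical; adapt them to your data.
import pandas as pd

df = pd.read_csv("data.csv")

# Missing values and duplicates
print(df.isna().sum())                              # missing values per column
df = df.drop_duplicates()                           # remove exact duplicates
df["age"] = df["age"].fillna(df["age"].median())    # one way to impute

# Whitespace and a simple outlier rule
df["region"] = df["region"].str.strip()             # trim stray whitespace
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]  # IQR rule

# Changing the aggregation level, e.g. from transactions to monthly totals
df["date"] = pd.to_datetime(df["date"])
monthly = df.groupby(df["date"].dt.to_period("M"))["sales"].sum()
print(monthly.head())
```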
Explore and understand the data
- How do we make sure we understand the data?
- We want to look closely at the data to get a sense of what it is about. We may use statistical measures like mean, median, variance, and standard deviation to learn about the data distribution.
- Can we visualize the data? Does it show any trends or patterns?
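A minimal exploration sketch, again with hypothetical file and column names, might look like this:

```python
# A minimal exploration sketch; file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Summary statistics: mean, std, quartiles, etc. per numeric column
print(df.describe())

# Distribution of one variable: are there skew, gaps, or outliers?
df["sales"].hist(bins=30)
plt.xlabel("sales")
plt.show()

# A pairwise relationship: does a trend or pattern emerge?
df.plot.scatter(x="ad_spend", y="sales")
plt.show()
```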
Decide which features to keep or delete
- How many columns or variables of data (also called features) are available?
- How many are essential for the analysis?
- Sometimes we may have too many variables and need to reduce the number because they don't add new information to the model. Having too many variables makes the model difficult to handle, and certain techniques don't perform well with too many of them. Sometimes we may need to add features too.
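As a sketch of the idea rather than a prescription, scikit-learn's VarianceThreshold can drop near-constant features, and a correlation matrix can flag redundant pairs. The file and column names here are hypothetical.

```python
# A minimal sketch of trimming uninformative features.
# File and column names are hypothetical.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("data.csv")
numeric = df.select_dtypes(include="number").dropna()

# Drop features that barely vary; they add little information to the model
selector = VarianceThreshold(threshold=0.01)
selector.fit(numeric)
kept = numeric.columns[selector.get_support()]
print("kept features:", list(kept))

# Flag highly correlated pairs; one of each pair is often redundant
corr = numeric.corr().abs()
high = (corr > 0.9) & (corr < 1.0)
print(corr.where(high).stack())
```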
Select a machine learning algorithm
- What machine learning algorithm should we use?
- Sometimes it makes sense to try multiple algorithms and compare the results using cross-validation. Cross-validation can help eliminate a model that only performs well on the specific data it was trained on.
- How should we split the data into training and test sets?
- Datasets are often split into a training set and a test set, where the training set is used to learn a model and the test set is used to check how well the model generalizes to unseen data. How to split the data depends on its volume and complexity.
- How do we compare different models?
- ROC curves help visualize the performance of different models across different thresholds. Other metrics such as accuracy, true positive rate, and F1-score can also be used.
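Here is a minimal scikit-learn sketch on synthetic data that ties these questions together: an 80/20 train/test split, 5-fold cross-validation, and a ROC AUC comparison of two candidate models.

```python
# A minimal model-comparison sketch with scikit-learn; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set; an 80/20 split is a common starting point
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    # Cross-validation guards against a model that only fits this split
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: CV AUC={cv_scores.mean():.3f}, test AUC={test_auc:.3f}")
```

A model whose cross-validation score is much higher than its test score is likely fitting quirks of the training data rather than the underlying pattern.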
Interpret the results
- What do the results mean?
- Do we need to fine tune the model or add more data?
- Would we like to fit for the offset (i.e., intercept)?
- Overfitting and underfitting represent a fundamental problem of machine learning. The model needs to be general enough to work with data it has never seen (outside the training set), which means finding the right balance between optimization and generalization.
- Bias and variance are other levers used to tune a model. A model with high bias will underfit the training data because it makes simplistic assumptions, while a model with high variance will overfit because it models the noise in the data.
- Would we like the model to be normalized?
- Would we like to preprocess our features to add model flexibility?
- What degree of regularization would we like to use in our model?
- What happens if we change the parameters?
- Are the results helpful or do we need to get more data?
- Plot training vs. testing accuracy to check the model performance
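One way to do this in practice is to vary the regularization strength and compare training and testing accuracy: a large gap points to overfitting, while low accuracy on both points to underfitting. The sketch below uses synthetic data and logistic regression purely for illustration.

```python
# A minimal sketch: compare training vs. testing accuracy while varying
# regularization strength (C) to spot under- and overfitting. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for C in [0.001, 0.1, 1.0, 100.0]:   # smaller C = stronger regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    # train >> test suggests overfitting; both low suggests underfitting
    print(f"C={C}: train={train_acc:.3f}, test={test_acc:.3f}")
```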
Plan for scaling
- Is the algorithm ready to scale?
- What infrastructure do we need to scale?
These steps will help you get started. As you understand your business requirements and the available data, some steps will require deeper introspection while you may decide to forgo others. The field of analytics is changing rapidly as new algorithms pop up quickly, and they make life easier if you know what needs to be done!
