
Data Science - Manager's Guide Part 2




Introduction


Previously we discussed the meaning and methods of data science and machine learning. There are numerous tutorials on using machine learning, but it is often confusing to figure out where to start when given a problem. Over the course of my career, I have developed a nine-step framework – the ML Framework – with a set of questions that helps me lay the foundation. Use it only as a guide: planning every detail of the data science process upfront isn't always possible, and more often than not you'll iterate multiple times between the different steps of the process.

ML Framework

  • Describe the problem 
  • Chart a solution 
  • Look for the necessary data 
  • Check if the data is usable 
  • Explore and understand the data 
  • Decide which features to keep or delete 
  • Select a machine learning algorithm 
  • Interpret the results 
  • Plan for scaling 

Describe the problem

What are we trying to solve?

The main purpose here is making sure all stakeholders understand the what, how, and why of the project. The outcome should be a clear goal, a good understanding of the context, a well-defined deliverable, and a plan of action with a timetable.

Chart a solution

  • What does success look like? 
  • How do we measure the success of the project? 
  • What is an acceptable threshold and what is not? 
  • Do we have prior data and results and want to forecast the future (supervised learning), or are we looking for patterns (unsupervised learning)? 
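The supervised/unsupervised distinction above can be sketched in a few lines of scikit-learn. The numbers here are made up purely for illustration:

```python
from sklearn.linear_model import LinearRegression  # supervised: learn from labeled history
from sklearn.cluster import KMeans                 # unsupervised: find structure without labels

# Supervised: past (hours studied -> exam score) pairs, used to forecast a new case
X, y = [[1], [2], [3], [4]], [52, 61, 70, 79]
reg = LinearRegression().fit(X, y)
print(reg.predict([[5]]))   # forecast for 5 hours of study

# Unsupervised: no labels at all, just group similar points together
pts = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)           # cluster assignment for each point
```

If you have historical outcomes to learn from, you are in the supervised setting; if you only have raw observations, you are looking for patterns.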

Look for the necessary data
  • Is data readily available? 
  • What type of data is available? (hint – structured, unstructured, graph) 
  • How much data is available? 

Check if the data is usable

  • What does the data look like? 
  • Are there missing values, duplicates? 
  • How to deal with missing values? 
  • How to deal with outliers? What about stray whitespace? 
    • We should expect to spend a good portion of our project time doing data correction and cleansing, sometimes up to 80%. 
  • What level of aggregation is the data at? 
  • Does the data need to be transformed, combined, converted or scaled? 
    • Transforming the data from its raw form into data that's directly usable in the model involves detecting and correcting different kinds of errors, combining data from multiple sources, and changing the aggregation level of the data. 
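The cleaning questions above map directly to a few pandas operations. A minimal sketch, using a hypothetical dataset with the usual quality issues (whitespace, duplicates, missing values):

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset with common quality problems
df = pd.DataFrame({
    "region": [" east", "west", "west", None, "east "],
    "sales": [100.0, np.nan, 250.0, 80.0, 100.0],
})

df["region"] = df["region"].str.strip()                 # remove stray whitespace
df = df.drop_duplicates()                               # drop exact duplicate rows
df["sales"] = df["sales"].fillna(df["sales"].median())  # impute missing numeric values
df = df.dropna(subset=["region"])                       # drop rows missing a key field

print(df["sales"].isna().sum())  # 0 missing values remain
```

Whether to impute, drop, or flag missing values is a judgment call that depends on why the data is missing; the median fill here is just one common option.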

Explore and understand the data

  • How to make sure we understand the data? 
    • We want to look closely at the data to get a sense of what it is about. We may use statistical measures like the mean, median, variance, and standard deviation to learn about the data's distribution 
  • Can we visualize the data? Does it show any trends or patterns? 
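In pandas, a single call surfaces most of the summary statistics mentioned above. The numbers below are made up, with one deliberate outlier:

```python
import pandas as pd

# Hypothetical numeric column; 55 is a deliberate outlier
s = pd.Series([12, 15, 14, 10, 55, 13, 14])

print(s.describe())              # count, mean, std, min, quartiles, max
print("median:", s.median())     # robust to the outlier
print("variance:", s.var())      # inflated by the outlier

# A quick histogram often reveals skew or outliers (requires matplotlib):
# s.plot.hist(bins=10)
```

Comparing the mean (pulled up by 55) with the median is a quick first check for skew before any plotting.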

Decide which features to keep or delete

  • How many columns or variables of data (also called features) are available? 
  • How many are essential for the analysis?
    • Sometimes we have too many variables and need to reduce their number because some don't add new information to the model. Too many variables make the model difficult to handle, and certain techniques don't perform well with them. Sometimes we may need to add features too 
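One simple way to spot variables that add no new information is a correlation check: if two features are almost perfectly correlated, one of them is redundant. A sketch with hypothetical columns (the 0.95 threshold is an arbitrary illustration, not a rule):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp_c": [10, 20, 30, 40],
    "temp_f": [50, 68, 86, 104],   # redundant: a linear rescaling of temp_c
    "humidity": [30, 45, 50, 35],
})

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                      # ['temp_f']
print(list(df_reduced.columns))     # ['temp_c', 'humidity']
```

For many correlated features at once, dimensionality-reduction techniques such as PCA are a more systematic alternative.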

Select a machine learning algorithm

  • What machine learning algorithm should we use? 
    • Sometimes it makes sense to try multiple algorithms and compare the results using cross-validation. Cross-validation helps eliminate a model that performs well only on the specific data it was trained on. 
  • How should we split the data into training and test sets? 
    • Datasets are often split into a training set and a test set, where the training set is used to learn a model and the test set is used to check how well the model fits unseen data. Here we decide how to split the data based on its volume and complexity 
  • How to compare different models? 
    • ROC curves help visualize the performance of different models across different thresholds. Other metrics, such as accuracy, true positive rate, and F1-score, can also be used. 
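The split-then-cross-validate workflow above can be sketched with scikit-learn. This uses the library's built-in breast cancer dataset purely for illustration; any labeled dataset would do:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set; cross-validate only on the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Compare candidate algorithms by mean 5-fold cross-validation accuracy
for model in (LogisticRegression(max_iter=5000),
              DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```

The held-out test set is touched only once, at the very end, to report the final model's performance on unseen data.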

Interpret the results

  • What do the results mean? 
  • Do we need to fine tune the model or add more data? 
  • Would we like to fit for the offset (i.e., intercept)? 
    • Overfitting and Underfitting represent a fundamental problem of machine learning. The model needs to be general enough to work with random data (outside the training set) by finding the right balance between optimization and generalization. 
    • Bias and variance are other levers used to tune a model. A model with high bias will underfit the training data because it makes simplistic assumptions, while a model with high variance will overfit because it models the noise in the data. 
  • Would we like the model to be normalized? 
  • Would we like to preprocess our features to add model flexibility? 
  • What degree of regularization would we like to use in our model? 
  • What happens if we change the parameters? 
  • Are the results helpful or do we need to get more data? 
  • Plot training vs. testing accuracy to check the model performance 
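The training-vs.-testing comparison above is easy to compute by sweeping a complexity parameter. A sketch using a decision tree's `max_depth` on scikit-learn's built-in breast cancer dataset (chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 11)
train_acc, test_acc = [], []
for d in depths:
    m = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    train_acc.append(m.score(X_tr, y_tr))  # accuracy on data the model has seen
    test_acc.append(m.score(X_te, y_te))   # accuracy on unseen data

# A widening gap between the two curves as depth grows signals overfitting.
# With matplotlib: plt.plot(depths, train_acc); plt.plot(depths, test_acc)
print(train_acc[-1], test_acc[-1])
```

If both curves are low, the model is underfitting and may need more flexibility or better features; if training accuracy is high but test accuracy lags, regularization or more data usually helps.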

Plan for scaling
  • Is the algorithm ready to scale? 
  • What infrastructure do we need to scale? 

These steps will help you get started. As you come to understand your business requirements and the available data, some steps will require deeper introspection while you may decide to forgo others. The field is changing rapidly as new algorithms appear, and they make life easier if you know what needs to be done!
