
Data Science - Managers Guide Part 2




Introduction


Previously we discussed the meaning and methods of data science and machine learning. There are numerous tutorials on applying machine learning, but it is often confusing to figure out where to start when given a problem. Over the course of my career, I have developed a nine-step framework – the ML Framework – with a set of questions that helps me lay the foundation. Treat it only as a guide, because planning every detail of the data science process upfront isn’t always possible, and more often than not you’ll iterate multiple times between the different steps of the process.

ML Framework

  • Describe the problem 
  • Chart a solution 
  • Look for the necessary data 
  • Check if the data is usable 
  • Explore and understand the data 
  • Decide which features to keep or delete 
  • Select a machine learning algorithm 
  • Interpret the results 
  • Plan for scaling 

Describe the problem

What are we trying to solve?

The main purpose here is to make sure all stakeholders understand the what, how, and why of the project. The outcome should be a clear goal, a good understanding of the context, a well-defined deliverable, and a plan of action with a timetable.

Chart a solution

  • What does success look like? 
  • How do we measure the success of the project? 
  • What is an acceptable threshold and what is not? 
  • Do we have prior data with known outcomes and want to forecast future results (supervised learning), or are we looking for patterns in unlabeled data (unsupervised learning)? (See the sketch below.) 
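
As a quick illustration of the supervised vs. unsupervised distinction, here is a minimal scikit-learn sketch on made-up data; the features, labels, and cluster count are pure assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))           # 100 records, 3 hypothetical features
    y = (X[:, 0] > 0.5).astype(int)    # known past outcomes (labels)

    # Supervised: labels exist, so we can learn to forecast future outcomes.
    clf = LogisticRegression().fit(X, y)
    print("forecast:", clf.predict(X[:5]))

    # Unsupervised: no labels; we look for patterns such as natural groupings.
    groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("clusters:", groups[:5])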

Look for the necessary data

  • Is data readily available? 
  • What type of data is available? (hint – structured, unstructured, graph) 
  • How much data is available? 
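
If the data is available as files, a minimal first-look sketch with pandas answers the size and type questions; the file name here is a hypothetical stand-in for your actual source (database extract, API export, logs, etc.):

    import pandas as pd

    # "customer_data.csv" is a hypothetical file; substitute your own source.
    df = pd.read_csv("customer_data.csv")

    print(df.shape)     # how much data: (rows, columns)
    print(df.dtypes)    # what type each column holds (numeric, text, dates)
    print(df.head())    # a first glance at the records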

Check if the data is usable

  • What does the data look like? 
  • Are there missing values or duplicates? 
  • How do we deal with missing values? 
  • How do we deal with outliers? What about stray white space? 
    • We should expect to spend a good portion of our project time doing data correction and cleansing, sometimes up to 80%. 
  • What level of aggregation is the data at? 
  • Does the data need to be transformed, combined, converted or scaled? 
    • Transforming the data from its raw form into data that’s directly usable in the model involves detecting and correcting different kinds of errors, combining data from multiple sources, and changing the aggregation level of the data. 
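
A minimal pandas cleansing sketch, assuming the same hypothetical file and made-up column names (age, name, income), covers the most common checks:

    import pandas as pd

    df = pd.read_csv("customer_data.csv")    # hypothetical source

    print(df.isna().sum())                   # missing values per column
    print(df.duplicated().sum())             # number of duplicate rows

    df = df.drop_duplicates()
    df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
    df["name"] = df["name"].str.strip()               # remove stray white space

    # Flag outliers with a simple z-score rule; the threshold is a judgment call.
    z = (df["income"] - df["income"].mean()) / df["income"].std()
    df = df[z.abs() < 3]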

Explore and understand the data

  • How do we make sure we understand the data? 
    • We want to look closely at the data to get a sense of what it is about. We may use statistical measures such as the mean, median, variance, and standard deviation to learn about the data’s distribution. 
  • Can we visualize the data? Does it show any trends or patterns? 
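
A short exploration sketch along these lines, again assuming the hypothetical customer file and an illustrative income column:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("customer_data.csv")   # hypothetical source

    print(df.describe())                    # count, mean, std, min/max, quartiles
    print(df["income"].median(), df["income"].var())

    df.hist(figsize=(10, 6))                # distribution of each numeric column
    plt.tight_layout()
    plt.show()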

Decide which features to keep or delete

  • How many columns or variables of data (also called features) are available? 
  • How many are essential for the analysis?
    • Sometimes we have too many variables and need to reduce their number because some add no new information to the model. Too many variables make the model difficult to handle, and certain techniques don’t perform well with them. Sometimes we may need to add features too. A sketch of one reduction technique follows this list. 
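
One minimal feature-reduction sketch, assuming the same hypothetical file: drop near-constant columns and inspect correlations. Many other techniques exist (PCA, model-based selection); this is just an illustrative starting point:

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.read_csv("customer_data.csv")    # hypothetical source
    X = df.select_dtypes("number")

    # Drop near-constant features: they add almost no information to the model.
    selector = VarianceThreshold(threshold=0.01)
    X_reduced = selector.fit_transform(X)
    print(X.shape[1], "->", X_reduced.shape[1], "features kept")

    # Highly correlated pairs are also candidates for removal.
    print(X.corr().round(2))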

Select a machine learning algorithm

  • What machine learning algorithm should we use? 
    • Sometimes it makes sense to try multiple algorithms and compare the results using cross-validation. Cross-validation helps eliminate a model that performs well only on the specific data it was trained on. 
  • How should we split the data into training and test sets? 
    • Datasets are often split into a training set and a test set, where the training set is used to learn the model and the test set is used to check how well the model generalizes to unseen data. The right split ratio depends on the volume and complexity of the data. 
  • How do we compare different models? 
    • ROC curves help visualize the performance of different models across different thresholds. Other metrics, such as accuracy, true positive rate, and F1-score, can also be used. A sketch of this step follows the list. 
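
A minimal sketch of this step on synthetic data: split, cross-validate two candidate models, and score the chosen one on held-out data. The models and the 80/20 split are illustrative choices, not recommendations:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Compare candidate models with 5-fold cross-validation on the training set.
    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(random_state=0)):
        scores = cross_val_score(model, X_train, y_train, cv=5)
        print(type(model).__name__, round(scores.mean(), 3))

    # Check the chosen model on held-out data across all thresholds (ROC AUC).
    best = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
    print("test AUC:", round(auc, 3))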

Interpret the results

  • What do the results mean? 
  • Do we need to fine tune the model or add more data? 
  • Would we like to fit for the offset (i.e., intercept)? 
    • Overfitting and underfitting represent a fundamental problem in machine learning. The model needs to be general enough to work with new data (outside the training set), which means finding the right balance between optimization and generalization. 
    • Bias and variance are other levers used to tune a model. A model with high bias will underfit the training data because it makes simplistic assumptions, while a model with high variance will overfit because it models the noise in the data. 
  • Would we like the model to be normalized? 
  • Would we like to preprocess our features to add model flexibility? 
  • What degree of regularization would we like to use in our model? 
  • What happens if we change the parameters? 
  • Are the results helpful or do we need to get more data? 
  • Plot training vs. testing accuracy to check the model’s performance (see the sketch after this list). 
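
As one way to visualize the overfitting/underfitting trade-off, here is a sketch that sweeps the regularization strength of a logistic regression on synthetic data and plots training vs. testing accuracy; the parameter range is an arbitrary illustrative choice:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Sweep regularization strength: a large train/test gap signals overfitting,
    # low accuracy on both sets signals underfitting.
    Cs = np.logspace(-3, 3, 7)
    train_acc, test_acc = [], []
    for c in Cs:
        model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
        train_acc.append(model.score(X_train, y_train))
        test_acc.append(model.score(X_test, y_test))

    plt.semilogx(Cs, train_acc, label="train accuracy")
    plt.semilogx(Cs, test_acc, label="test accuracy")
    plt.xlabel("C (inverse regularization strength)")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()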

Plan for scaling

  • Is the algorithm ready to scale? 
  • What infrastructure do we need to scale? 

These steps will help you get started. As you come to understand your business requirements and the available data, some steps will call for deeper introspection, while you may decide to forgo others. The field is changing rapidly: new algorithms pop up all the time, and they make life easier if you know what needs to be done!
