
Data Science - Manager's Guide Part 1




This is a manager's guide to data science and machine learning. Part 1 provides a very high-level overview along with a few definitions. Part 2 provides a framework for getting started with machine learning in data science projects.

Data science is a field of study that aims to use a scientific approach to extract meaning and insights from data – collections of numbers, words, observations, and just about anything else.

Data science draws on statistics, machine learning, and computer science to create models that can transform hypotheses (assumptions) and data into actionable predictions (forecasts). Much of the theoretical basis of data science comes from statistics, along with a strong influence from software engineering methodologies.

The basic purpose of statistics is simply to enable us to make sense of large amounts of data – by consolidating and synthesizing large sets of numbers to reveal their collective characteristics and interrelationships, transforming them from an incomprehensible mass into a set of useful and enlightening indicators. Statistics has three primary components: How best can we collect data? How should it be analyzed? And what can we infer from the analysis?

The underlying fuel for statistics is data – broadly, numerical and categorical data. Numerical data are numbers and are either discrete (e.g., the number of applicants for a job) or continuous (e.g., the volume of water falling from a tap). Categorical data consists of text and labels and is either ordinal – having an intrinsic order (e.g., class standing: freshman, sophomore, junior, senior) – or nominal (e.g., the name of the town you live in).

In general, this data can be summarized using measures of central tendency, variation, and shape. Differences between groups of data can be tested using t-tests, chi-square tests, and ANOVA, and the relationships among data elements can be explored using correlation and regression.

Measures of central tendency describe the average, midpoint, or center of the data. Usually this is done using the mean (also called the arithmetic average – the most widely used measure), the median (the midpoint), or the mode (the most frequent value).
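These three measures can be computed directly with Python's standard statistics module; the sample data below is made up purely for illustration:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average -> 5
median = statistics.median(data)  # midpoint of the sorted values -> 4.0
mode = statistics.mode(data)      # most frequent value -> 3
```

Note that with an even number of values, the median is the average of the two middle values (here, 3 and 5).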

Measures of variability (spread) describe the extent of similarity or difference in a data set. Variance measures how far a set of numbers is spread out; a variance of zero indicates that all the values are identical. Standard deviation, the most commonly used measure, is the square root of the variance. Range is the difference between the highest and lowest observed values.

Measures of shape describe the distribution of data using symmetry, skewness, and kurtosis. We will discuss this in the next section.

Another commonly used term is the Z-score. 

The Z-score tells how an individual data point is positioned relative to the whole data set, including whether it is an outlier (an extreme value; e.g., in the data set 1, 2, 3, 4, 5, 87, the value 87 would be characterized as an outlier). In other words, the Z-score represents the number of standard deviations that a data point is from the mean. In normally distributed data, often illustrated by a bell-shaped curve centered on the data set's mean, 99.7% of all the data points lie within three standard deviations of the mean. So, if a data point is outside this range, you can comfortably label it an outlier.
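Here is a sketch of the Z-score calculation applied to the small example above. One caveat worth seeing in numbers: with such a tiny sample, the extreme value itself inflates the mean and standard deviation, so its Z-score comes out near 2.2 rather than above 3 – the 3-standard-deviation rule works best on larger, roughly normal data sets:

```python
import statistics

data = [1, 2, 3, 4, 5, 87]
mean = statistics.mean(data)     # 17
stdev = statistics.pstdev(data)  # roughly 31.3

# Z-score: how many standard deviations each point is from the mean
z_scores = [(x - mean) / stdev for x in data]
# The value 87 has by far the largest Z-score (about 2.23)
```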

Earlier, I mentioned normally distributed (symmetrical) data. Here the arithmetic mean is approximately equal to the median. When the mean is less than the median (left skewed) or greater than the median (right skewed), the distribution is asymmetrical. Skewness is a measure of this asymmetry.
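The mean-versus-median rule of thumb is easy to see on a small, made-up example with a long right tail:

```python
import statistics

right_skewed = [1, 2, 2, 3, 12]  # one large value drags the tail to the right

mean = statistics.mean(right_skewed)      # 4.0 - pulled toward the tail
median = statistics.median(right_skewed)  # 2  - resistant to the extreme value

# mean > median suggests a right (positive) skew
```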

A t-test allows us to test whether a sample mean (of a normally distributed data set) differs significantly from a hypothesized value (e.g., testing whether an average writing score differs significantly from a hypothesized score of 50). ANOVA, or analysis of variance, is used when there are more than two sample groups. Another test is the chi-square test, which is used when you want to see whether there is a relationship between two categorical variables (e.g., testing whether there is a relationship between the type of school attended and students' gender).
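As a sketch of the one-sample t-test on the writing-score example (the scores below are hypothetical), the t statistic is the gap between the sample mean and the hypothesized value, measured in standard errors. In practice a library such as SciPy would compute this plus the p-value; here it is done by hand to show the formula:

```python
import math
import statistics

scores = [52, 48, 55, 50, 47, 53, 49, 51]  # hypothetical writing scores
mu0 = 50                                   # hypothesized mean

n = len(scores)
sample_mean = statistics.mean(scores)  # 50.625
s = statistics.stdev(scores)           # sample standard deviation

# t = (sample mean - hypothesized mean) / standard error of the mean
t = (sample_mean - mu0) / (s / math.sqrt(n))  # roughly 0.66
```

A t value this small (compared against the t distribution with n-1 degrees of freedom) would not be significant, so these scores give no evidence of a real difference from 50.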

One of the most common models for describing relationships and making predictions is the regression model. Regression analysis is a process for estimating the relationship between a dependent variable (the outcome whose values respond to changes in the independent variable) and one or more independent variables (e.g., the relationship between reckless driving and the number of road accidents per driver is best studied through regression). Familiar methods include linear, logistic, polynomial, and stepwise regression. The choice of method depends on the number of independent variables, the type of dependent variable, and the shape of the regression line. Linear regression is the most commonly used; it establishes a relationship between the dependent variable and one or more independent variables using a best-fit straight line.
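For simple linear regression with one independent variable, the best-fit line can be computed in a few lines of plain Python; the data below is made up, and a real project would use a library such as scikit-learn or statsmodels:

```python
def fit_line(xs, ys):
    """Least-squares fit: slope = covariance(x, y) / variance(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# hypothetical data: independent variable x, dependent variable y
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
slope, intercept = fit_line(x, y)  # slope 2.0, intercept 0.0
prediction = slope * 6 + intercept  # predict y for a new x of 6 -> 12.0
```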

Correlation quantifies the degree to which two variables are related. It is positive when the values increase together, and is negative when one value decreases as the other increases. When the fluctuation of one variable reliably predicts a similar fluctuation in another variable (correlation), there’s often a tendency to think that the change in one causes the change in the other (causation). However, correlation does not imply causation. There may be, for example, an unknown factor that influences both variables equally. 
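The most common correlation measure is the Pearson coefficient, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). A minimal sketch with made-up data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_pos = pearson_r([1, 2, 3, 4], [10, 20, 30, 40])  # 1.0: values rise together
r_neg = pearson_r([1, 2, 3, 4], [40, 30, 20, 10])  # -1.0: one falls as the other rises
```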

With so many ways of analyzing and understanding the data, the next step is to use computer models to help accomplish the tasks. Machine Learning is all about using algorithms to make these predictions. 

Machine learning is based on self-learning, self-improving algorithms. In machine learning, a computer starts with a model and continues to enhance it through trial and error. It can then provide meaningful insights in the form of classification (assigning a data point to a known group, which in turn was created from a set of training data), forecasting, and clustering (grouping a set of data points so that objects in the same group, called a cluster, are more similar to each other than to those in other clusters). There are two broad types of machine learning: supervised and unsupervised.

Supervised learning is guided by feedback in the form of labeled training data. In unsupervised learning, there is no training data; the algorithm relies on techniques such as clustering and keeps improving without external feedback.
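To make "unsupervised" concrete, here is a minimal one-dimensional sketch of k-means, a classic clustering algorithm: no labels are given, and groups emerge purely from distances. The data and starting centers are made up; real projects would use a library implementation such as scikit-learn's KMeans:

```python
def kmeans_1d(points, centers, iterations=10):
    """Minimal 1-D k-means: alternate assignment and update steps."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0.0, 5.0])
# converges to centers near 2.0 and 11.0, splitting the two natural groups
```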

In the next part, we will discuss a simple framework to get started with Machine Learning.
