
Data Science - Manager's Guide Part 1




This is a manager's guide to data science and machine learning. Part 1 provides a very high-level overview along with a few definitions. Part 2 provides a framework for getting started with machine learning for data science projects.

Data science is a field of study that aims to use a scientific approach to extract meaning and insights from data – collections of numbers, words, observations, and just about anything else.

Data science uses results from statistics, machine learning, and computer science to create models that can transform hypotheses (assumptions) and data into actionable predictions (forecasts). Much of the theoretical basis of data science comes from statistics, along with a strong influence from software engineering methodologies.

The basic purpose of statistics is to help us make sense of large amounts of data – by consolidating and synthesizing large collections of numbers to reveal their collective characteristics and interrelationships, and by transforming them from an incomprehensible mass into a set of useful and enlightening indicators. Statistics has three primary components: How best can we collect data? How should it be analyzed? And what can we infer from the analysis?

The underlying fuel for statistics is data – broadly, numerical and categorical data. Numerical data are numbers and are either discrete (e.g., the number of applicants for a job) or continuous (e.g., the volume of water flowing from a tap). Categorical data consists of text and labels and is either ordinal – having an intrinsic order (e.g., class year: freshman, sophomore, junior, senior) – or nominal (e.g., the name of the town you live in).
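
To make these categories concrete, here is a minimal sketch in Python using the pandas library; the column names and values are purely hypothetical.

```python
# Hypothetical examples of the four data types, one column each.
import pandas as pd

df = pd.DataFrame({
    "applicants": [12, 7, 30],                          # numerical, discrete (counts)
    "water_volume_l": [1.25, 0.80, 2.40],               # numerical, continuous (measurements)
    "class_year": pd.Categorical(
        ["freshman", "junior", "senior"],
        categories=["freshman", "sophomore", "junior", "senior"],
        ordered=True),                                   # categorical, ordinal (has an order)
    "town": ["Springfield", "Riverton", "Lakeside"],     # categorical, nominal (labels only)
})

print(df.dtypes)
```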

In general, this data can be synthesized using measures of central tendency, variation, and shape. Differences between groups of data can be examined using the t-test, the chi-square test, and ANOVA, and the relationships among data elements can be explored using correlation and regression.

Measures of central tendency describe the average, midpoint, or center of the data. Usually this is done using the Mean (also called the arithmetic average – the most useful and widely used measure), the Median (the midpoint), or the Mode (the most frequent value).
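
As a quick illustration, here is a minimal sketch using Python's built-in statistics module; the sample values are made up.

```python
# The three measures of central tendency on a hypothetical sample.
import statistics

ages = [23, 25, 25, 29, 31, 40]   # hypothetical ages of six customers

print(statistics.mean(ages))      # arithmetic average -> 28.83...
print(statistics.median(ages))    # midpoint of the sorted values -> 27.0
print(statistics.mode(ages))      # most frequent value -> 25
```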

Measures of variability (spread) describe the extent of similarity or difference within a data set. Variance measures how far a set of numbers is spread out; a variance of zero indicates that all the values are identical. Standard deviation, the most commonly used measure, is the square root of the variance. Range is the difference between the highest and lowest observed values.
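
Continuing the same hypothetical sample, a minimal sketch of these spread measures might look like this (pvariance and pstdev treat the list as a whole population rather than a sample drawn from one):

```python
# Spread measures on the same hypothetical sample.
import statistics

ages = [23, 25, 25, 29, 31, 40]

variance = statistics.pvariance(ages)     # average squared distance from the mean
std_dev = statistics.pstdev(ages)         # square root of the variance
value_range = max(ages) - min(ages)       # highest minus lowest observed value

print(variance, std_dev, value_range)
```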

Measures of shape describe the distribution of data using symmetry, skew and kurtosis. We will discuss this in the next section. 

Another commonly used term is the Z-score. 

The Z-score helps tell how an individual data point is aligned with the whole data set, including whether it is an outlier (an outlier is an extreme value in a data set; e.g., in the data set 1, 2, 3, 4, 5 and 87, the value 87 would be characterized as an outlier). In other words, the Z-score represents the number of standard deviations that the original data point is from the mean. In normally distributed, symmetrical data – often illustrated by a bell-shaped curve centered on the data set's mean – about 99.7% of all data points lie within three standard deviations of the mean. So, if a data point is outside this range, you can comfortably label it as an outlier.
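
Here is a minimal sketch of that rule of thumb in Python; the data set is hypothetical and chosen so that one value clearly sits far from the rest.

```python
# Flag values more than three standard deviations from the mean as potential outliers.
import statistics

data = [12, 15, 14, 13, 16, 15, 14, 13, 12, 15, 14, 13, 16, 15, 14, 90]
mean = statistics.mean(data)
std = statistics.pstdev(data)

for x in data:
    z = (x - mean) / std                      # number of standard deviations from the mean
    label = "outlier" if abs(z) > 3 else ""
    print(f"{x:>3}  z = {z:+.2f}  {label}")
```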

Earlier, I mentioned normally distributed, symmetrical data. In that case, the arithmetic mean is approximately equal to the median. When the mean is less than the median (left-skewed) or the mean is greater than the median (right-skewed), the distribution is referred to as asymmetrical. Skewness is a measure of this asymmetry.
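
A minimal sketch of that mean-versus-median check, on a hypothetical right-skewed sample (a few large incomes pull the mean above the median):

```python
# Comparing mean and median to gauge skew on a hypothetical income sample (in thousands).
import statistics

incomes = [30, 32, 35, 36, 38, 40, 42, 45, 120, 250]

print(statistics.mean(incomes))    # 66.8 -> pulled up by the two large values
print(statistics.median(incomes))  # 39.0 -> mean exceeding the median suggests right skew
```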

The t-test allows us to test whether a sample mean (of a normally distributed data set) differs significantly from a hypothesized value (e.g., testing whether an average writing score differs significantly from a hypothesized score of 50). ANOVA, or analysis of variance, is used when there are more than two sample groups. Another test is the chi-square test, which is used when you want to see if there is a relationship between two categorical variables (e.g., testing whether there is a relationship between the type of school attended and students' gender).
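
A minimal sketch of all three tests using the scipy.stats library; every sample and count below is made up purely for illustration.

```python
# One-sample t-test, one-way ANOVA, and a chi-square test of independence on made-up data.
from scipy import stats

# t-test: do these writing scores differ from a hypothesized mean of 50?
scores = [52, 48, 55, 60, 47, 53, 58, 49]
t_stat, p_ttest = stats.ttest_1samp(scores, popmean=50)

# ANOVA: compare the means of more than two groups at once.
group_a = [23, 25, 27, 22]
group_b = [30, 31, 29, 33]
group_c = [24, 26, 25, 27]
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

# Chi-square: is school type related to gender? (hypothetical counts)
observed = [[30, 20],    # public school:  male, female
            [15, 35]]    # private school: male, female
chi2, p_chi, dof, expected = stats.chi2_contingency(observed)

print(p_ttest, p_anova, p_chi)   # small p-values suggest a significant difference/relationship
```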

One of the common models used to describe relationships and make predictions is the regression model. Regression analysis is a process for estimating the relationship between a dependent variable (the outcome whose values we want to explain or predict) and one or more independent variables (the factors presumed to influence it); for example, the relationship between rash driving and the number of road accidents a driver has is best studied through regression. Familiar methods include Linear, Logistic, Polynomial, and Stepwise Regression. The choice of regression method depends on the number of independent variables, the type of dependent variable, and the shape of the regression line. Linear regression is the most commonly used; it establishes a relationship between the dependent variable and one or more independent variables using a best-fit straight line.
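
As a minimal sketch, here is a linear regression fit with scikit-learn for the rash-driving example; the scores and accident counts are purely hypothetical.

```python
# Fit a best-fit straight line: rash-driving score (independent) vs. accidents (dependent).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6]])   # hypothetical rash-driving scores
y = np.array([0, 1, 1, 2, 3, 4])               # hypothetical accident counts

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)           # slope and intercept of the fitted line
print(model.predict([[7]]))                    # predicted accidents for a new driver
```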

Correlation quantifies the degree to which two variables are related. It is positive when the values increase together, and is negative when one value decreases as the other increases. When the fluctuation of one variable reliably predicts a similar fluctuation in another variable (correlation), there’s often a tendency to think that the change in one causes the change in the other (causation). However, correlation does not imply causation. There may be, for example, an unknown factor that influences both variables equally. 
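
A minimal sketch of a correlation coefficient with NumPy; the two hypothetical series rise together, so the coefficient comes out close to +1.

```python
# Correlation between two hypothetical series that increase together.
import numpy as np

temperature =     [18, 21, 24, 27, 30, 33]   # hypothetical daily temperatures
ice_cream_sales = [20, 25, 30, 38, 45, 50]   # hypothetical daily sales

r = np.corrcoef(temperature, ice_cream_sales)[0, 1]
print(r)   # close to +1, but by itself this says nothing about causation
```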

With so many ways of analyzing and understanding the data, the next step is to use computer models to help accomplish the tasks. Machine Learning is all about using algorithms to make these predictions. 

Machine learning is based on self-learning, or self-improving, algorithms. In machine learning, a computer starts with a model and continues to improve it through trial and error. It can then provide meaningful insight in the form of classification (assigning a data point to one of a set of known groups, which were in turn defined by a set of training data), forecasting, and clustering (grouping a set of data points in such a way that objects in the same group, called a cluster, are more similar to each other than to those in other clusters). There are two broad types of machine learning: supervised and unsupervised.

Supervised learning is reinforced by feedback in the form of labeled training data. In an unsupervised learning setting, there is no labeled training data; the algorithm instead looks for structure in the data on its own – for example, by clustering similar points – without external feedback.
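
A minimal sketch contrasting the two settings with scikit-learn on toy data; the points, labels, and choice of algorithms (nearest-neighbor classification and k-means clustering) are illustrative assumptions, not the only options.

```python
# Supervised vs. unsupervised learning on the same toy points.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [8, 8], [9, 8]]

# Supervised: the training data carries known labels (the "feedback").
labels = ["small", "small", "large", "large"]
clf = KNeighborsClassifier(n_neighbors=1).fit(points, labels)
print(clf.predict([[2, 1], [8, 9]]))   # assigns new points to the known groups

# Unsupervised: no labels at all; the algorithm groups similar points itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)                      # cluster membership discovered from the data alone
```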

In the next part, we will discuss a simple framework to get started with Machine Learning.
