This is a manager's guide to data science and machine learning. Part 1 provides a very high-level overview along with a few definitions. Part 2 provides a framework to get started with machine learning for data science projects.
Data science is a field of study that aims to use a scientific approach to extract meaning and insights from data – collections of numbers, words, observations and just about anything else.
Data science uses results from statistics, machine learning, and computer science to create models that can transform hypotheses (assumptions) and data into actionable predictions (forecasts). Much of the theoretical basis of data science comes from statistics, along with a strong influence from software engineering methodologies.
The basic purpose of statistics is simply to enable us to make sense of large amounts of data - by providing the ability to consolidate and synthesize large numbers of values to reveal their collective characteristics and interrelationships, and to transform them from an incomprehensible mass into a set of useful and enlightening indicators. Statistics has three primary components: How best can we collect data? How should it be analyzed? And what can we infer from the analysis?
The underlying fuel for statistics is data - broadly, numerical and categorical data. Numerical data are numbers and are either discrete (e.g.: the number of applicants for a job) or continuous (e.g.: the volume of water falling from a tap). Categorical data consists of text and labels and is either ordinal – it has an intrinsic order (e.g.: class standing of freshman, sophomore, junior, senior) – or nominal (e.g.: the name of the town you live in).
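The following is a minimal sketch of these four data types, assuming the pandas library is available; the column names and values are made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "applicants": [12, 7, 30],                        # numerical, discrete
    "water_volume_liters": [1.25, 0.8, 2.4],          # numerical, continuous
    "class_standing": pd.Categorical(                 # categorical, ordinal
        ["freshman", "junior", "senior"],
        categories=["freshman", "sophomore", "junior", "senior"],
        ordered=True),
    "town": ["Springfield", "Riverton", "Ashland"],   # categorical, nominal
})

print(df.dtypes)
```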
In general, this data can be synthesized using measures of central tendency, variation and shape. Differences between groups of data can be examined using t-tests, chi-square tests and ANOVA, and the relationships among data elements can be explored using correlation and regression.
Measures of central tendency describe the average, mid-point or center of the data. Usually this is done using the Mean (also called the arithmetic average; the most useful and widely used measure), the Median (the midpoint) or the Mode (the most frequent value).
Measures of variability (spread) describe the extent of similarity or difference in a data set. Variance measures how far a set of numbers is spread out; a variance of zero indicates that all the values are identical. Standard deviation, the most commonly used measure, is the square root of the variance. Range is the difference between the highest and lowest observed values.
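Here is a quick sketch of these central-tendency and spread measures, using Python's standard library on a small made-up sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", statistics.mean(data))           # arithmetic average
print("median:", statistics.median(data))       # midpoint
print("mode:", statistics.mode(data))           # most frequent value
print("variance:", statistics.pvariance(data))  # population variance
print("std dev:", statistics.pstdev(data))      # square root of the variance
print("range:", max(data) - min(data))          # highest minus lowest value
```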
Measures of shape describe the distribution of data in terms of symmetry, skewness and kurtosis. Skewness is discussed further below.
Another commonly used term is the Z-score.
The Z-score indicates where an individual data point falls relative to the whole data set, including whether it is an outlier (an outlier is an extreme value in a data set; e.g.: in the data set 1, 2, 3, 4, 5 and 87, the value 87 would be characterized as an outlier). In other words, the Z-score represents the number of standard deviations that a data point lies from the mean. In normally distributed, symmetrical data - often illustrated by a bell-shaped curve centered on the data set's mean - 99.7% of all data points lie within three standard deviations of the mean. So, if a data point falls outside this range, you can comfortably label it as an outlier.
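A minimal Z-score sketch, reusing the tiny outlier example above (1, 2, 3, 4, 5, 87). Note that with larger, roughly normal data, points beyond roughly ±3 standard deviations are commonly flagged as outliers; in a sample this small, 87 simply stands far from every other value.

```python
import statistics

data = [1, 2, 3, 4, 5, 87]
mean = statistics.mean(data)
stdev = statistics.pstdev(data)

for x in data:
    z = (x - mean) / stdev   # number of standard deviations from the mean
    print(f"value={x:3d}  z={z:+.2f}")
```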
Earlier, I mentioned normally distributed (symmetrical) data. In such data the arithmetic mean is approximately equal to the median. When the mean is less than the median (left skewed) or greater than the median (right skewed), the distribution is asymmetrical. Skewness is a measure of this asymmetry.
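A small sketch of this idea, assuming SciPy is installed, comparing mean, median and skewness for a roughly symmetric made-up sample and a right-skewed one:

```python
import statistics
from scipy.stats import skew

symmetric = [2, 3, 4, 5, 6, 7, 8]
right_skewed = [2, 3, 3, 4, 4, 5, 30]   # a long right tail pulls the mean above the median

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name,
          "mean=", round(statistics.mean(data), 2),
          "median=", statistics.median(data),
          "skew=", round(skew(data), 2))
```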
A t-test allows us to test whether a sample mean (of a normally distributed data set) differs significantly from a hypothesized value (e.g.: testing whether an average writing score differs significantly from a hypothesized score of 50). ANOVA, or analysis of variance, is used when there are more than two sample data sets. Another test is the chi-square. A chi-square test is used when you want to see if there is a relationship between two categorical variables (e.g.: testing to see if there is a relationship between the type of school attended and students' gender).
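The sketch below, assuming SciPy is installed and using made-up numbers, illustrates the two tests just described: a one-sample t-test against a hypothesized mean of 50, and a chi-square test of independence between two categorical variables (school type vs. gender).

```python
from scipy import stats

# One-sample t-test: do these writing scores differ from a hypothesized mean of 50?
writing_scores = [52, 48, 55, 61, 47, 53, 58, 50, 49, 56]
t_stat, p_value = stats.ttest_1samp(writing_scores, popmean=50)
print("t-test:", round(t_stat, 2), "p =", round(p_value, 3))

# Chi-square test of independence on a made-up contingency table:
# rows = school type (public, private), columns = gender (female, male).
observed = [[30, 25],
            [20, 25]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("chi-square:", round(chi2, 2), "p =", round(p, 3))
```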
One of the common models used to describe relationships and make predictions is the regression model. Regression analysis is a process for estimating the relationship between a dependent variable (the outcome whose values we want to explain or predict) and one or more independent variables (e.g.: the relationship between rash driving and the number of road accidents by a driver is best studied through regression). Familiar methods include Linear, Logistic, Polynomial and Stepwise Regression. The choice of method depends upon the number of independent variables, the type of dependent variable and the shape of the regression line. Linear Regression is the most commonly used; it establishes a relationship between the dependent variable and one or more independent variables using a best-fit straight line.
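A minimal linear-regression sketch, assuming NumPy is available and using made-up numbers for the example above (incidents of rash driving vs. number of accidents):

```python
import numpy as np

rash_driving_events = np.array([1, 2, 3, 5, 6, 8, 9, 11])   # independent variable
accidents = np.array([0, 1, 1, 2, 3, 4, 4, 6])              # dependent variable

# Fit a best-fit straight line: accidents ~ slope * rash_driving_events + intercept
slope, intercept = np.polyfit(rash_driving_events, accidents, deg=1)
print(f"accidents ~ {slope:.2f} * events + {intercept:.2f}")

# Predict the expected number of accidents for a driver with 10 rash-driving events.
print("predicted:", slope * 10 + intercept)
```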
Correlation quantifies the degree to which two variables are related. It is positive when the values increase together, and is negative when one value decreases as the other increases. When the fluctuation of one variable reliably predicts a similar fluctuation in another variable (correlation), there’s often a tendency to think that the change in one causes the change in the other (causation). However, correlation does not imply causation. There may be, for example, an unknown factor that influences both variables equally.
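A short sketch of the correlation coefficient, assuming NumPy is available and using invented numbers; it also illustrates the caution above, since the two variables move together yet neither causes the other:

```python
import numpy as np

ice_cream_sales = [20, 35, 50, 65, 80]   # made-up monthly figures
drownings       = [1, 2, 3, 4, 5]        # also rise in the summer months

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print("correlation:", round(r, 2))  # close to +1, yet summer heat, not ice cream,
                                    # is the likely common cause (correlation != causation)
```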
With so many ways of analyzing and understanding the data, the next step is to use computer models to help accomplish the tasks. Machine Learning is all about using algorithms to make these predictions.
Machine learning is based on self-learning or self-improving algorithms. In machine learning, a computer starts with a model and continues to enhance it through trial and error. It can then provide meaningful insight in the form of classification (assigning a data point to one of a set of known groups learned from training data), forecasting and clustering (grouping a set of data points so that objects in the same group, called a cluster, are more similar to each other than to those in other clusters). There are two broad types of machine learning: supervised and unsupervised.
Supervised learning is guided by feedback in the form of labelled training data. In unsupervised learning, there is no labelled training data; the algorithm looks for structure in the data on its own, typically through clustering, and improves without external feedback.
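The following sketch, assuming scikit-learn is installed and using a tiny made-up data set, contrasts the two styles: a supervised classifier learns from labelled examples, while an unsupervised clustering algorithm groups the same points with no labels at all.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised: feedback comes from the known labels in the training data.
clf = KNeighborsClassifier(n_neighbors=3).fit(points, labels)
print("classified as:", clf.predict([[2, 2], [9, 9]]))

# Unsupervised: no labels; the algorithm discovers two clusters on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster ids:", km.labels_)
```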
In the next part, we will discuss a simple framework to get started with Machine Learning.
