Asking and framing the right question is a crucial first step1,2.
Obtaining relevant datasets to answer a question can be difficult (e.g. expensive, time-consuming).
Data often comes from the "real world", which means it is full of human errors and biases.
You may need to consider how to source, ingest, and store data*.
*But we're not going to worry about this too much on this course.
Deciding on variables which should be part of the input data requires human (not artificial) intelligence.
Predicting restaurant daily sales | Predicting stock prices |
---|---|
Previous day's sales | Previous day's price |
Day of the week | Interest rates |
Holiday or not holiday | Company earnings |
Rain or no rain | News headlines |
Before applying any machine learning methods, we should explore our data.
Typically, data is messy and needs to be prepared for downstream transformations and modelling.
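As a minimal sketch of this step with pandas (the file name and preparation choices here are hypothetical):

```python
import pandas as pd

# Load the raw data (the file name is hypothetical).
df = pd.read_csv("wine.csv")

# Explore: dimensions, column types, summary statistics, missing values.
print(df.shape)
print(df.dtypes)
print(df.describe())
print(df.isna().sum())

# One simple preparation step: fill missing numeric values
# with the column median.
df = df.fillna(df.median(numeric_only=True))
```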
"A feature is a numeric representation of an aspect of raw data."3
We want to remove "uninformative information" and retain useful bits3.
Feature selection: create a subset of the original set of features.
Feature extraction: create new synthetic features by combining the original features and discarding less important ones.
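A minimal sketch of both ideas with scikit-learn (the dataset and parameter choices are ours, purely illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features

# Feature selection: keep a subset of the original features,
# here the 5 with the highest ANOVA F-score against the label.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
print(X_selected.shape)  # (178, 5)

# Feature extraction: create new synthetic features as
# combinations of the originals (here, principal components).
X_extracted = PCA(n_components=5).fit_transform(X)
print(X_extracted.shape)  # (178, 5)
```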
...or, put another way: why not just learn the latest "best" model? For example, why not just focus on deep learning?
This theoretical finding, known as the "no free lunch" theorem, states that all optimization algorithms perform equally well when their performance is averaged over all possible objective functions.
Classical methods are typically defined in contrast to ensemble and neural network/deep learning models.
In unsupervised learning we aim to use data: $$D = \{\mathbf{x}_n\}^N_{n=1},$$ with inputs $\mathbf{x}_n$ and $N$ training examples, to learn how to represent or find interesting patterns in the data.
$\mathbf{x}$ is a $D$-dimensional vector of numbers (e.g. a patient's blood pressure, heart rate, and weight).
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
---|---|---|---|---|---|---|---|---|---|---|---|
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
7.8 | 0.88 | 0 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
7.4 | 0.7 | 0 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
For example, in clustering, we assume there are latent classes or clusters in the training data with similar behaviour.
We aim to determine these latent clusters and assign each input to one.
Example Models
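One widely used example is k-means; a minimal sketch on synthetic data (the cluster count and parameters are our illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data generated with 3 latent clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means and inspect the discovered cluster assignments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)
```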
In supervised learning, we have a training dataset of labelled input-output pairs, denoted $$D = \{(\mathbf{x}_n, y_n)\}^N_{n=1},$$ with inputs $\mathbf{x}_n$ and outputs $y_n$.
We aim to use $D$ to learn the mapping from $\mathbf{x}$ to $y$ for generalization, i.e. to automatically label future inputs $\mathbf{x}^*$.
Regression aims to learn the mapping from the inputs $\mathbf{x}$ to a continuous output $y \in \mathbb{R}$.
Example Models
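For instance, a minimal linear regression sketch (the toy data is our own):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly a linear function of x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

# Learn the mapping from x to a continuous y.
model = LinearRegression().fit(X, y)

# Predict a continuous output y* for a new input x*.
print(model.predict([[5.0]]))
```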
Classification aims to learn the mapping from the inputs $\mathbf{x}$ to a categorical output $y \in \lbrace 1, \ldots, C \rbrace$, known as binary classification when $C = 2$ and multiclass classification when $C > 2$.
Example Models
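For instance, a minimal logistic regression sketch (the dataset choice is ours):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris: inputs x are 4 measurements; output y is one of C = 3 classes.
X, y = load_iris(return_X_y=True)

# Learn the mapping from x to a categorical y.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Automatically label a new input x*.
print(clf.predict(X[:1]))
```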
[Figure: the bias-variance trade-off illustrated with four panels: low bias/low variance, high bias/low variance, low bias/high variance, high bias/high variance.]
Ensemble methods aim to improve the generalisability of an algorithm by combining the predictions of several estimators5.
To achieve this there are two general approaches:
Averaging methods build several separate estimators and then average their predictions.
For example, a bagging method averages the predictions of an ensemble of base classifiers, each fit on random subsets of the dataset (observations and/or features, drawn with replacement)6.
Example Models
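A minimal bagging sketch (the dataset and parameters are our illustrative choices; scikit-learn's base estimator defaults to a decision tree):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# Each base estimator (a decision tree by default) is fit on a
# bootstrap sample of the observations and a random subset of
# the features; predictions are then averaged/voted.
bag = BaggingClassifier(
    n_estimators=50,
    max_samples=0.8,
    max_features=0.8,
    bootstrap=True,
    random_state=0,
).fit(X, y)
print(bag.score(X, y))
```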
Boosting methods typically use an ensemble of weak estimators built sequentially, with each estimator attempting to reduce the bias of its predecessor2.
Example Models
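A minimal gradient boosting sketch (the parameters are our illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Shallow trees ("weak" estimators) are added sequentially, each
# fit to correct the errors of the ensemble built so far.
gbt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=0
).fit(X, y)
print(gbt.score(X, y))
```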
Layers of interconnected artificial neurons, or other simple functions, are stacked on top of each other.
Example Models
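A minimal multi-layer perceptron sketch (the layer sizes are our illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)  # neural nets prefer scaled inputs

# Two hidden layers of neurons stacked between input and output.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000,
                    random_state=0).fit(X, y)
print(mlp.score(X, y))
```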
Regression and classification use different metrics to assess the performance of a model.
Classification | Regression |
---|---|
Accuracy | Mean squared error |
Sensitivity | Mean absolute error |
Specificity | Median absolute error |
Precision | $R^2$ (coefficient of determination) |
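Most of these are available in scikit-learn (note that scikit-learn calls sensitivity "recall", and specificity can be derived from the confusion matrix). A minimal sketch with made-up predictions:

```python
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, precision_score,
                             r2_score, recall_score)

# Classification metrics (recall == sensitivity).
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(precision_score(y_true, y_pred))

# Regression metrics.
y_true_r = [2.5, 0.0, 2.0]
y_pred_r = [3.0, -0.5, 2.0]
print(mean_squared_error(y_true_r, y_pred_r))
print(mean_absolute_error(y_true_r, y_pred_r))
print(r2_score(y_true_r, y_pred_r))
```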
We can assess how a model performs on data it is trained on (training set) and data it has not seen before (validation/test set).
While tuning our model, we could evaluate our model performance on multiple splits (e.g. K-fold cross-validation).
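A minimal K-fold cross-validation sketch (the model and K are our illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold,
# rotating so each fold is used for validation exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```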
Most models have settings we can change (hyper-parameters). Also, it may not be clear which feature pre-processing or engineering steps work best.
We could fiddle with them manually until we find a great combination (tedious)...
... or get the computer to do it for us (e.g. grid search, random search), as in the sketch below.
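A minimal grid-search sketch (the model and parameter grid are our illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Try every combination of these hyper-parameter values,
# scoring each candidate with 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```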
Once a model is chosen, we can deploy it into production for inference.
The model's performance needs to be continuously monitored, and the model retrained and recalibrated accordingly.