What is Machine Learning?

By Data Tricks, 17 April 2020

Last modified 17 April 2020

Machine learning is a subset of artificial intelligence in which computers learn patterns from existing data, rather than following explicitly programmed rules, in order to make predictions on unseen data.

There are two main types of machine learning – supervised and unsupervised. Supervised machine learning algorithms are used when the existing data has input variables and an output variable, and the task is to learn a function that predicts the output variable from the input variables. Unsupervised machine learning algorithms are used when the existing data has input variables but no output variable, and the task is to learn more about the structure and distribution of the data.
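
To make the distinction concrete, here is a minimal pure-Python sketch. The tiny datasets, the function names `predict_nearest` and `split_into_two_groups`, and the choice of a nearest-neighbour rule and a largest-gap split are illustrative assumptions, not methods taken from this article.

```python
# Supervised: each example has an input (hours studied) and an output (result).
labelled = [(1.0, "fail"), (2.0, "fail"), (6.0, "pass"), (8.0, "pass")]

def predict_nearest(x, examples):
    """Predict the output of the labelled example whose input is closest to x."""
    _, nearest_output = min(examples, key=lambda ex: abs(ex[0] - x))
    return nearest_output

# Unsupervised: inputs only, no output variable. The task is to find structure.
unlabelled = [1.0, 1.5, 2.0, 7.5, 8.0, 9.0]

def split_into_two_groups(values):
    """Split the sorted values into two groups at the largest gap between neighbours."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

prediction = predict_nearest(7.0, labelled)
low_group, high_group = split_into_two_groups(unlabelled)
print(prediction)
print(low_group, high_group)
```

The supervised function needs the known outputs to make a prediction, while the unsupervised one only describes how the inputs are distributed.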

The Machine Learning Process

Collecting Data

The machine learning process starts with collecting data. This is usually data that is held internally by an organisation, but might also include data from external sources which can be merged with any internal data.
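
As a rough illustration of merging internal and external data, the sketch below joins two small hypothetical tables on a shared key. The field names (`customer_id`, `postcode`, `median_income`) and the helper `left_join` are invented for this example.

```python
# Hypothetical internal records held by the organisation.
internal = [
    {"customer_id": 1, "postcode": "AB1", "spend": 120.0},
    {"customer_id": 2, "postcode": "CD2", "spend": 75.5},
    {"customer_id": 3, "postcode": "ZZ9", "spend": 40.0},
]

# Hypothetical external data, e.g. regional statistics keyed by postcode.
external = {
    "AB1": {"median_income": 31000},
    "CD2": {"median_income": 27500},
}

def left_join(rows, lookup, key):
    """Attach the lookup fields to each row; rows with no match are kept unchanged."""
    merged = []
    for row in rows:
        extra = lookup.get(row[key], {})
        merged.append({**row, **extra})
    return merged

dataset = left_join(internal, external, key="postcode")
for row in dataset:
    print(row)
```

The third customer has no external match, so that row comes through without the extra field, which is exactly the kind of gap the cleansing step then has to handle.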

Data Cleansing

This part of the process should not be overlooked, as data cleansing is one of the most effective ways of improving the accuracy of machine learning algorithms. Typical data cleansing tasks include handling missing data (by removing, estimating or interpolating values), dealing with outliers, and scaling variables (standardisation or normalisation).
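
Two of these tasks can be sketched in a few lines of standard-library Python. The `ages` column and the function names are assumptions made for illustration, and mean imputation is only one of several ways to estimate missing values.

```python
from statistics import mean, pstdev

ages = [22.0, None, 35.0, 58.0, None, 41.0]  # None marks a missing value

def impute_mean(values):
    """Replace missing values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardise(values):
    """Rescale to mean 0 and standard deviation 1 (assumes the values are not all equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalise(values):
    """Rescale to the range [0, 1] (assumes the values are not all equal)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

filled = impute_mean(ages)
z_scores = standardise(filled)
scaled = normalise(filled)
print(filled)
```

Standardisation and normalisation are alternatives rather than steps to chain; which one suits best depends on the algorithm being trained.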

Feature Engineering

Feature engineering is another important task that, if done well, can boost the accuracy of machine learning algorithms. It involves extracting as much useful information from the data as possible. For example, one-hot encoding can convert a non-numeric variable with multiple categories into a set of numeric 0/1 columns, or entirely new variables can be created from existing data.
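
Both ideas can be sketched in plain Python as follows. The passenger records, the column names and the `family_size` variable are hypothetical and chosen only to illustrate the technique.

```python
# Hypothetical records with one categorical and two numeric variables.
passengers = [
    {"embarked": "S", "siblings": 1, "parents": 0},
    {"embarked": "C", "siblings": 0, "parents": 0},
    {"embarked": "Q", "siblings": 3, "parents": 2},
]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

def add_family_size(rows):
    """Create a new variable from existing ones (relatives plus the passenger)."""
    return [{**row, "family_size": row["siblings"] + row["parents"] + 1} for row in rows]

features = add_family_size(one_hot(passengers, "embarked"))
for row in features:
    print(row)
```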

Splitting Data and Training

Once cleansed, the existing data should be split into two separate datasets – one for training an algorithm, and one for testing it. The data is usually split at random, with the relative sizes of the training and testing datasets depending on the particular situation, the volume of data required and the algorithm used.
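
A random split can be sketched as below. The function name `split_train_test`, the 80/20 ratio and the fixed seed are assumptions for this example; the right ratio depends on the situation, as noted above.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then split them into training and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 cleansed records
train, test = split_train_test(rows, test_fraction=0.2)
print(len(train), len(test))
```

Every record ends up in exactly one of the two datasets, so nothing used for testing has been seen during training.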

Once the data has been split into training and test datasets, the training dataset is used to train the selected machine learning algorithm. Many algorithms are available, ranging from relatively simple ones such as logistic regression and decision trees to more complex and sophisticated ones such as random forests and artificial neural networks.
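
To show what training means in practice, here is a minimal sketch of logistic regression with a single input variable, fitted by gradient descent. The training data, learning rate and number of epochs are assumptions chosen for illustration; real projects would normally rely on an established library rather than hand-written code like this.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit w and b so that sigmoid(w * x + b) approximates P(y = 1 | x)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Average gradient of the log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical training dataset: one input variable, one binary output variable.
train_x = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(train_x, train_y)

def predict(x):
    """Predict 1 when the estimated probability is at least 0.5, otherwise 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

print(predict(1.0), predict(4.0))
```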

Testing

Once the machine learning algorithm has been trained, it should then be applied to the test dataset to estimate how accurately it predicts the output variable for data it has not seen before. Training and testing algorithms in this way helps to detect over-fitting and avoids the inflated accuracy scores that result from testing an algorithm on the same data it was trained on.
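
The difference between training accuracy and test accuracy can be illustrated with a deliberately over-fitted model. Everything below (the noisy data, the `memorising_model` and the `accuracy` helper) is a made-up example, not a recommended approach.

```python
def accuracy(predict, inputs, outputs):
    """Fraction of examples for which predict(input) equals the true output."""
    correct = sum(1 for x, y in zip(inputs, outputs) if predict(x) == y)
    return correct / len(inputs)

# Hypothetical data: the output is mostly 1 for larger inputs, but the
# training labels at 3.0 and 6.0 are noisy on purpose.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]
test_x = [1.2, 2.8, 3.9, 4.8, 6.1, 7.2]
test_y = [0, 0, 0, 1, 1, 1]

memory = dict(zip(train_x, train_y))

def memorising_model(x):
    """An over-fitted 'model': return the label of the nearest training input."""
    nearest = min(memory, key=lambda seen: abs(seen - x))
    return memory[nearest]

train_accuracy = accuracy(memorising_model, train_x, train_y)
test_accuracy = accuracy(memorising_model, test_x, test_y)
print(train_accuracy, test_accuracy)
```

The model scores perfectly on the data it memorised, noise included, but noticeably worse on the test dataset. Only the second figure says anything about how it would perform on new data.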

Evaluation

Finally, a machine learning process should be cyclical. It is important to analyse and evaluate the output from the algorithm in order to make improvements to the process of data collection, cleansing and feature engineering.

If you are working on machine learning in R, you might like to read this tutorial which outlines the entire machine learning process on Kaggle’s Titanic competition.

