What is Machine Learning?

By Data Tricks, 17 April 2020

Machine learning is a subset of artificial intelligence in which computers learn autonomously from hidden patterns in existing data in order to make predictions on unseen data.

There are two main types of machine learning – supervised and unsupervised. Supervised machine learning algorithms are used when the existing data has input variables and an output variable and the task is to create a function to predict the output variable from the input variables. Unsupervised machine learning algorithms are used when the existing data has input variables but no output variable and the task is to learn more about the structure and distribution of the data.

The Machine Learning Process

Collecting Data

The machine learning process starts with collecting data. This is usually data that is held internally by an organisation, but might also include data from external sources which can be merged with any internal data.

Data Cleansing

This part of the process should not be overlooked, as data cleansing is one of the most effective ways of improving the accuracy of machine learning algorithms. Typical data cleansing tasks include handling missing data (by removing, estimating or interpolating the missing values), dealing with outliers, and scaling (standardisation or normalisation).
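As a rough illustration of two of these tasks, the sketch below fills a missing value with the column mean and then rescales the column. It uses only the Python standard library; the function names and the `ages` column are made up for this example, and mean imputation is just one of several possible estimation strategies.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardise(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalise(values):
    """Rescale linearly to the range [0, 1] (min-max normalisation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column with one missing value.
ages = [20.0, None, 30.0, 40.0]
filled = impute_mean(ages)
print(filled)             # [20.0, 30.0, 30.0, 40.0]
print(normalise(filled))  # [0.0, 0.5, 0.5, 1.0]
print(standardise(filled))
```

Both scaling functions assume the column is not constant, since a constant column would cause a division by zero.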

Feature Engineering

Another important task which, if done correctly, can boost the accuracy of machine learning algorithms is feature engineering. This involves extracting as much useful information from the data as possible. For example, one-hot encoding might be used on variables which are not numeric and have multiple categories, or entirely new variables could be created from existing data.
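One-hot encoding can be sketched in a few lines. The example below is a minimal standard-library version; the `one_hot` helper and the `ports` column are hypothetical names chosen for illustration.

```python
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical non-numeric variable with three categories.
ports = ["S", "C", "Q", "S"]
categories, encoded = one_hot(ports)
print(categories)  # ['C', 'Q', 'S']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each original value becomes one row with as many columns as there are categories, so a numeric algorithm can use the variable without assuming any ordering between categories.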

Training

Once cleansed, the existing data should be split into two separate datasets – one for training an algorithm, and one for testing it. The data is usually split at random, with the relative sizes of the training and testing datasets depending on the particular situation, the volume of data required and the algorithm used.
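A random split might look like the following sketch. The 80/20 ratio and the fixed seed are assumptions made for the example, not recommendations, and `train_test_split` here is a hand-written helper rather than a library function.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Ten made-up rows, represented here by their row numbers.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two datasets, which is what keeps the test dataset unseen during training.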

Once the data has been split into training and test datasets, the training dataset is used to train the selected machine learning algorithm. There are many algorithms available, from relatively simple ones such as logistic regression and decision trees to more complex and sophisticated ones such as random forests and artificial neural networks.
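To make "training" concrete, here is a sketch of about the simplest possible case: a decision tree with a single split (often called a decision stump) on one input variable. The data and function names are invented for illustration, and a real project would normally rely on an established library rather than code like this.

```python
def train_stump(xs, ys):
    """Find the threshold t that classifies most rows correctly
    under the rule: predict 1 if x > t, otherwise predict 0."""
    best_t, best_correct = None, -1
    for t in sorted(set(xs)):
        correct = sum((x > t) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def predict(threshold, xs):
    """Apply the learned rule to new input values."""
    return [1 if x > threshold else 0 for x in xs]

# Made-up training data: one input variable and a binary output variable.
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 0, 1, 1, 1]

threshold = train_stump(train_x, train_y)
print(threshold)                       # 3.0
print(predict(threshold, [2.5, 6.5]))  # [0, 1]
```

The "learning" is the search over candidate thresholds: the algorithm is never told where the boundary is, only which output each training row has.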

Testing

Once the machine learning algorithm has been trained, it should then be applied to the test dataset in order to measure its true accuracy by predicting the output variable in the test dataset. Testing on data the algorithm has not seen during training makes over-fitting visible and avoids the inflated accuracy scores that result from testing an algorithm on its own training data.
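Accuracy on the test dataset is simply the share of predictions that match the actual output variable. The sketch below assumes the predictions have already been produced by a trained algorithm; the two lists are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual output variable."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Hypothetical actual values and predictions for a held-out test dataset.
test_actual = [1, 0, 1, 1, 0]
test_predicted = [1, 0, 0, 1, 0]
print(accuracy(test_actual, test_predicted))  # 0.8
```

Computing the same figure on the training dataset and comparing the two is a quick check: a training accuracy far above the test accuracy suggests over-fitting.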

Evaluation

Finally, a machine learning process should be cyclical. It is important to analyse and evaluate the output from the algorithm in order to make improvements to the process of data collection, cleansing and feature engineering.

If you are working on machine learning in R, you might like to read this tutorial which outlines the entire machine learning process on Kaggle’s Titanic competition.
