The quickest way to check for missing values in an R data frame

By Data Tricks, 3 November 2020

In any machine learning problem, one of the most important tasks is dealing with missing data. Often the quickest way to check how much missing data you have in your data frame, and in which columns, is to use the sapply function.

To illustrate this, first let's create some example data:

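# Example data: 100 rows with a unique id, plus height and weight
data <- data.frame(id = sample(1000:9999, 100, replace = FALSE),
                   height = sample(150:190, 100, replace = TRUE),
                   weight = sample(60:90, 100, replace = TRUE))
# Set up to 20 random rows to NA in each column; the row indices are drawn
# with replacement, so duplicates can leave fewer than 20 NAs per column
data[sample(1:100, 20, replace = TRUE), 2] <- NA
data[sample(1:100, 20, replace = TRUE), 3] <- NA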

This creates a data frame like the following (the data are sampled at random, so your values will differ):

> head(data)
    id height weight
1 6922    183     NA
2 7695    164     72
3 5381    154     62
4 8037     NA     77
5 3705    187     69
6 4319     NA     90
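The row indices in the example are sampled with replacement, so the same row can be picked twice and a column may end up with fewer than 20 missing values. If you would prefer a reproducible data set with exactly 20 missing values per column, a minimal sketch might look like this. The seed value 42 and the name data_fixed are arbitrary choices for illustration and are not part of the original example.

```r
# Sketch: reproducible example data with exactly 20 NAs per column.
# The seed (42) and the name data_fixed are assumptions made for this example.
set.seed(42)
data_fixed <- data.frame(id = sample(1000:9999, 100, replace = FALSE),
                         height = sample(150:190, 100, replace = TRUE),
                         weight = sample(60:90, 100, replace = TRUE))

# Sampling row indices without replacement guarantees 20 distinct rows
data_fixed[sample(1:100, 20, replace = FALSE), "height"] <- NA
data_fixed[sample(1:100, 20, replace = FALSE), "weight"] <- NA
```

Referring to the columns by name rather than by position also keeps the code readable if the column order changes later.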

The sapply function can then be used to quickly check how many missing values you have, and in which columns:

sapply(data, function(x) sum(is.na(x)))

    id height weight
     0     19     17

Here sapply applies the function to each column of the data frame, so the result is a count of the missing data points in each variable. You then, of course, need to decide what to do about the missing data.
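Before deciding, it can help to see the missing values as a share of each column as well as a raw count. The sketch below uses base R's colSums and colMeans on the logical matrix returned by is.na, which gives the same counts as the sapply call. The small data frame df and its values are made up for illustration and are not the article's data.

```r
# Small illustrative data frame (hypothetical values, not the article's data)
df <- data.frame(id = 1:5,
                 height = c(183, NA, 154, NA, 187),
                 weight = c(NA, 72, 62, 77, 69))

# Count of missing values per column (same counts as the sapply approach)
na_counts <- colSums(is.na(df))

# Proportion of missing values per column
na_share <- colMeans(is.na(df))

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts)[na_counts > 0]

print(na_counts)
print(na_share)
print(cols_with_na)
```

A proportion is often easier to act on than a count, since a column with a handful of gaps and a column that is mostly empty usually call for different treatment.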

Tags: machine learning, missing data
