The quickest way to check for missing values in an R data frame

By Data Tricks, 3 November 2020

In any machine learning problem, one of the most important tasks is dealing with missing data. Often the quickest way to check how much missing data your data frame contains, and in which columns, is to use the sapply function.

To illustrate this, first let's create some example data:

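# Example data: 100 rows with a unique id and random height and weight values
data <- data.frame(id = sample(1000:9999, 100, replace = FALSE),
                   height = sample(150:190, 100, replace = TRUE),
                   weight = sample(60:90, 100, replace = TRUE))
# Set up to 20 values to NA in columns 2 (height) and 3 (weight).
# With replace = TRUE a row can be drawn twice, so fewer than 20 may end up missing.
data[sample(1:100, 20, replace = TRUE), 2] <- NA
data[sample(1:100, 20, replace = TRUE), 3] <- NA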

This creates a data frame like the following (the exact values will differ on each run because the sampling is random):

> head(data)
    id height weight
1 6922    183     NA
2 7695    164     72
3 5381    154     62
4 8037     NA     77
5 3705    187     69
6 4319     NA     90

The sapply function can then be used to quickly check how many missing values you have, and in which columns:

sapply(data, function(x) sum(is.na(x)))
    id height weight
     0     19     17
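
The counts for height and weight come out below 20 because the example draws row indices with replace = TRUE, so some rows are picked more than once. A minimal sketch of a repeatable variant, assuming an arbitrary seed (42 is not from the original post) and sampling without replacement so that each column gets exactly 20 missing values:

```r
# The seed value is an arbitrary assumption, used only to make the run repeatable
set.seed(42)

data <- data.frame(id = sample(1000:9999, 100, replace = FALSE),
                   height = sample(150:190, 100, replace = TRUE),
                   weight = sample(60:90, 100, replace = TRUE))

# Sampling row indices without replacement gives 20 distinct rows per column
data[sample(1:100, 20, replace = FALSE), "height"] <- NA
data[sample(1:100, 20, replace = FALSE), "weight"] <- NA

# Same check as in the post: count the NA values in each column
na_counts <- sapply(data, function(x) sum(is.na(x)))
print(na_counts)
```

With this variant the check should report 0 missing values for id and 20 each for height and weight.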

Here, sapply applies the anonymous function to each column of the data frame: is.na(x) marks each missing value as TRUE, and sum counts them, giving the number of missing data points in each variable. You then, of course, need to decide what to do about the missing data.
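
As a possible companion to the sapply check, the same counts can be obtained with base R's colSums, and colMeans gives the share of missing values per column, which may help when deciding what to do next. This is only a sketch: the data frame df and its values are made up for illustration and are not from the example above.

```r
# Small hypothetical data frame, made up for illustration
df <- data.frame(id = 1:5,
                 height = c(183, NA, 154, NA, 187),
                 weight = c(NA, 72, 62, 77, 69))

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_count <- colSums(is.na(df))

# colMeans() gives the proportion of missing values in each column
na_share <- colMeans(is.na(df))

# complete.cases() flags the rows that have no missing value in any column
n_complete <- sum(complete.cases(df))

print(na_count)
print(na_share)
print(n_complete)
```

Counts show how much data is missing in absolute terms, while the proportions and the number of complete rows give a rough sense of how much data would be lost if incomplete rows were simply dropped.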
