80% in Kaggle’s Titanic competition in 50 lines of R code

By Data Tricks, 16 July 2019

Anyone new to machine learning will probably have come across Kaggle’s Titanic competition. The task involves applying machine learning techniques to predict which passengers survived the tragedy.

Whilst not a comprehensive attempt to solve the problem, this tutorial guides you through some simple methods to clean the data, engineer features and train an ML algorithm in R to achieve an accuracy of over 80%.

First we need to load the packages required for preparing the data and applying machine learning algorithms. If you don’t have all of the packages, install any that are missing with install.packages("name-of-package").

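# Start from a clean workspace
rm(list=ls())

# Load plyr before dplyr so that plyr does not mask dplyr's functions
library(reshape2)
library(plyr)
library(dplyr)
library(randomForest)
library(kernlab)
library(caret)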
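If you are not sure which of these packages are already installed, a check along the following lines may help. This is only a sketch: missing_packages is a hypothetical helper name, not part of any package, and the interactive() guard is an assumption added here so that a non-interactive script never downloads anything silently.

```r
# Packages loaded by this tutorial.
required <- c("reshape2", "plyr", "dplyr", "randomForest", "kernlab", "caret")

# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)

# Install only in an interactive session (an assumption of this sketch).
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, to_install holds the names of any packages that still need installing, so you can also pass it to install.packages yourself.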

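For the next part you’ll need to download the train and test datasets (train.csv and test.csv) from Kaggle. Once we’ve loaded them into R using read.csv, we add an empty Survived column to the test set so that both data frames have the same columns, then combine them with rbind.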

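# Load both datasets from the working directory
train <- read.csv("train.csv")
test <- read.csv("test.csv")

# The test set has no Survived column, so add an empty one before binding
test$Survived <- NA
data <- rbind(train, test)

# Count the missing values in each column
apply(data, 2, function(x) sum(is.na(x)))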

The last line in the code above prints the number of missing values in each column. The columns with missing values are as follows:

Survived = 418 (expected as we have included the Test dataset)

Age = 263

Fare = 1
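The same combine-and-count steps can be tried on a tiny example without downloading anything. The sketch below uses made-up rows, not the real Kaggle files, and swaps in colSums(is.na(...)) as a vectorised alternative to apply. It also passes na.strings = c("", "NA") to read.csv, which is an assumption added here rather than something the code above does: by default read.csv leaves blank text fields as empty strings, so blanks in a text column such as Cabin are not counted as missing.

```r
# Made-up miniature versions of train.csv and test.csv (not the real data).
train_txt <- "PassengerId,Survived,Age,Fare,Cabin
1,0,22,7.25,
2,1,,71.28,C85
3,1,26,7.92,"
test_txt <- "PassengerId,Age,Fare,Cabin
4,34.5,,
5,,9.69,B45"

# na.strings = c("", "NA") makes blank fields count as missing in every column.
toy_train <- read.csv(text = train_txt, na.strings = c("", "NA"))
toy_test  <- read.csv(text = test_txt,  na.strings = c("", "NA"))

# Give the test set an empty Survived column; rbind matches columns by name.
toy_test$Survived <- NA
toy_data <- rbind(toy_train, toy_test)

# Vectorised per-column count of missing values.
na_counts <- colSums(is.na(toy_data))
print(na_counts)
```

Here the blank Cabin fields show up in the count alongside Survived, Age and Fare, which they would not with the default na.strings.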

Imputing variables

With so many missing values in the Age variable, it is worth imputing them. To keep things simple we’ll just replace every missing Age with the mean age across all passengers in the combined data, and do the same for the single missing Fare.

# Replace missing values with the mean of the observed values in that column
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)
data$Fare[is.na(data$Fare)] <- mean(data$Fare, na.rm=TRUE)
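To see what mean imputation does, it can help to run it on a handful of made-up values. The sketch below wraps the same idea in a small function; impute_mean is a hypothetical helper name and the toy data frame is invented for illustration.

```r
# Made-up values for illustration (not the real Titanic data).
toy <- data.frame(
  Age  = c(20, NA, 40, NA),
  Fare = c(10, 30, NA, 20)
)

# Hypothetical helper: replace NAs in a numeric vector with the mean of the
# observed (non-missing) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy$Age  <- impute_mean(toy$Age)
toy$Fare <- impute_mean(toy$Fare)

print(toy)
```

Both missing ages become 30 and the missing fare becomes 20. Mean imputation leaves each column’s mean unchanged but reduces its variance, which is part of why it is a simple starting point rather than a thorough treatment.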

Feature engineering
