80% in Kaggle’s Titanic competition in 50 lines of R code

By Data Tricks, 16 July 2019

Anyone new to machine learning will probably have come across Kaggle’s Titanic competition. The task is to apply machine learning techniques to predict which passengers survived the tragedy.

Whilst not a comprehensive attempt to solve the problem, this tutorial guides you through some simple methods to clean the data, engineer features and train an ML algorithm in R to achieve an accuracy of over 80%.

First we need to load the packages required for preparing the data and applying machine learning algorithms. If any of them are not installed yet, install them first with install.packages("name-of-package").

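# Clear the workspace so that we start from a clean session
rm(list=ls())
# Load plyr before dplyr so that dplyr's functions are not masked by plyr's
library(reshape2)
library(plyr)
library(dplyr)
library(randomForest)
library(kernlab)
library(caret)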
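If you are unsure which of these packages are already installed, a check along the following lines can help. This is only a sketch: the names required_pkgs, is_installed and missing_pkgs are made up for illustration, and the install step is guarded by interactive() so that a batch run never triggers downloads.

```r
# Packages used in this tutorial
required_pkgs <- c("reshape2", "plyr", "dplyr", "randomForest", "kernlab", "caret")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

# Show which packages, if any, still need to be installed
print(missing_pkgs)

# Install the missing packages (needs an internet connection)
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```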

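For the next part you’ll need to download the train and test datasets (train.csv and test.csv) from Kaggle. Once we’ve loaded them into R using read.csv, we add a Survived column of missing values to the test set so that its columns match the training set, and then combine the two with rbind.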

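# Read the Kaggle files (assumed to be in the working directory)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
# The test set has no Survived column, so add one filled with NA
test$Survived <- NA
# Stack the two sets so that the same cleaning is applied to both
data <- rbind(train, test)
# Count the missing values in each column
colSums(is.na(data))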

The last line in the code above prints the number of missing values in each column. The columns with missing values are:

Survived = 418 (expected as we have included the Test dataset)

Age = 263

Fare = 1
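To see how the combine-and-count step behaves, here is a small sketch on made-up data. The toy_ data frames and their values are invented for illustration and are not taken from the Kaggle files. It also shows two things worth knowing: is.na() does not count blank strings, which matters if text columns such as Embarked contain blanks, and the NA values in Survived can be used later to split the combined data back into its two parts.

```r
# Made-up training rows: Survived is known
toy_train <- data.frame(
  PassengerId = 1:3,
  Survived = c(0, 1, 1),
  Age = c(22, NA, 26),
  Fare = c(7.25, 71.28, 7.92),
  Embarked = c("S", "", "C")
)

# Made-up test rows: no Survived column
toy_test <- data.frame(
  PassengerId = 4:5,
  Age = c(NA, 35),
  Fare = c(NA, 8.05),
  Embarked = c("Q", "S")
)

# rbind() needs the same columns in both data frames
toy_test$Survived <- NA
toy_all <- rbind(toy_train, toy_test)

# Number of NA values per column
na_counts <- colSums(is.na(toy_all))
print(na_counts)

# Blank strings are not NA, so they have to be counted separately
blank_embarked <- sum(toy_all$Embarked == "")
print(blank_embarked)

# Split back into the original parts using the NA values in Survived
toy_train_again <- toy_all[!is.na(toy_all$Survived), ]
toy_test_again <- toy_all[is.na(toy_all$Survived), ]
```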

Imputing variables

With so many missing values in the Age variable, it is worth applying some technique to impute the values. To keep things simple we’ll just replace all missing values in the Age variable with the mean age for the population, and do the same for Fare.

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=TRUE)
data$Fare[is.na(data$Fare)] <- mean(data$Fare, na.rm=TRUE)
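As a quick check of what mean imputation does, the same replacement can be tried on a small made-up vector. The helper impute_mean and the toy values are hypothetical and only serve as an illustration.

```r
# Replace every NA in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

toy_age <- c(20, NA, 40, NA)
imputed_age <- impute_mean(toy_age)
print(imputed_age)  # 20 30 40 30

# No missing values remain after imputation
stopifnot(!anyNA(imputed_age))
```

Note that this approach leaves the mean of the column unchanged but reduces its spread, because every imputed passenger gets exactly the same value.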

Feature engineering

