Mini project - Titanic survivors prediction
Dataset: Titanic survival
Introductory project of Udacity Machine Learning Nanodegree
In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this introductory project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive.
I think the Titanic dataset is a very good dataset to get started with classification tasks in Machine Learning because of its historical background. Furthermore, the attributes of the dataset are meaningful and make easy examples for many aspects of a machine learning workflow, such as pre-processing or handling missing values. Compared to the Iris dataset, the Titanic dataset is a much better choice. There are 11 attributes for each person in the dataset:
- Survived: Outcome of survival (0 = No; 1 = Yes)
- Pclass: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- Name: Name of passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (some entries contain NaN)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (some entries contain NaN)
- Embarked: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
Requirements
This project requires Python 3.5.2, pandas 0.19.1, numpy 0.11.2, and matplotlib 1.5.3. Also, the data (csv) and plotting script (python) can be found here.
First, we import numpy for computations and pandas for data handling. The fundamental object of pandas is a DataFrame. In the code below, full_data is a pandas DataFrame. A DataFrame is laid out like a spreadsheet table (it contains rows and columns), and each row is a pandas Series. The .head() method of a DataFrame returns the first five rows of the data:
```python
import numpy as np
import pandas as pd

# RMS Titanic data visualization code
from sys import path
path.append('../src')
from titanic_visualizations_p3 import survival_stats
from IPython.display import display
%matplotlib inline

# Load the dataset
in_file = '../data/titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())
```
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Data preprocessing
Since we’re interested in the outcome of survival for each passenger
or crew member, we can remove the Survived feature from this dataset
and store it as its own separate variable outcomes
. We will use these
outcomes as our prediction targets.
```python
# Store the 'Survived' feature in a new variable and
# remove it from the dataset
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis=1)

# Show the new dataset with 'Survived' removed
display(data.head())
```
|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The very same sample of the RMS Titanic data now shows the Survived feature removed from the DataFrame. Note that data (the passenger data) and outcomes (the survival outcomes) are now paired: for any passenger data.loc[i], the corresponding survival outcome is outcomes[i].
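For example, the pairing can be checked directly (a minimal sketch, assuming the data and outcomes variables loaded above):

```python
# The passenger at index 1 and her survival outcome are paired
print(data.loc[1, 'Name'])  # Cumings, Mrs. John Bradley (...)
print(outcomes[1])          # 1, i.e. survived
```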
Evaluation method
To measure the performance of our predictions, we need a metric to score our predictions against the true outcomes of survival. Since we are interested in how accurate our predictions are, we will calculate the proportion of passengers for whom our prediction of survival is correct. The code cell below creates our accuracy_score function and tests a prediction on the first five passengers.
```python
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """
    # Ensure that the number of predictions matches number of outcomes
    if len(truth) == len(pred):
        # Calculate and return the accuracy as a percent
        return "Predictions have an accuracy of {:.2f}%." \
            .format((truth == pred).mean() * 100)
    else:
        return "Number of predictions does not match number of outcomes!"

# Test the 'accuracy_score' function
predictions = pd.Series(np.ones(5, dtype=int))
print(accuracy_score(outcomes[:5], predictions))
```
Predictions have an accuracy of 60.00%.
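As an aside, scikit-learn provides an equivalent metric; the sketch below assumes scikit-learn is installed, which the project itself does not require:

```python
from sklearn.metrics import accuracy_score as sk_accuracy

# Unlike our helper above, this returns a fraction in [0, 1]
# rather than a formatted string
print(sk_accuracy(outcomes[:5], predictions))  # 0.6
```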
Making naive predictions
If we were told to make a prediction about any passenger aboard the RMS Titanic whom we knew nothing about, then the best prediction we could make would be that they did not survive, because the majority of the passengers did not survive the ship sinking. The function below will always predict that a passenger did not survive. It is always useful to have such a ZeroR model that makes a default prediction: based on this naive model, we know roughly what to expect from our (more powerful) machine learning models.
```python
def predictions_0(data):
    """
    Model with no features. Always predicts
    a passenger did not survive.
    """
    predictions = []
    for _, passenger in data.iterrows():
        # Predict the survival of 'passenger'
        predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_0(data)
```
```python
print(accuracy_score(outcomes, predictions))
```
Predictions have an accuracy of 61.62%.
The accuracy of a naive model that always predicts the passenger did not survive is 61.62%, which means we should expect our later models to perform well above this baseline.
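This baseline can also be read directly off the labels (a one-line sketch using the outcomes Series from above):

```python
# Since 'Survived' is 0/1 and the naive model always predicts 0,
# the baseline accuracy is simply the proportion of non-survivors
print(1 - outcomes.mean())  # ~0.6162
```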
Analyze the data to find a better model
Let’s take a look at whether the feature Sex gives any indication of survival rates among passengers using the survival_stats function. This function is defined in the titanic_visualizations_p3.py Python script included with this project. The first two parameters passed to the function are the RMS Titanic data and passenger survival outcomes, respectively. The third parameter indicates which feature we want to plot survival statistics across. The code cell below plots the survival count with respect to the gender of the passenger.
```python
survival_stats(data, outcomes, 'Sex')
```
Examining the survival statistics, a large majority of males did not survive the ship sinking. However, a majority of females did survive the ship sinking. Let’s build on our previous prediction: If a passenger was female, then we will predict that they survived. Otherwise, we will predict the passenger did not survive.
```python
def predictions_1(data):
    """ Model with one feature:
        - Predict a passenger survived if they are female. """
    predictions = []
    for _, passenger in data.iterrows():
        # Predict survival based on the passenger's sex
        predictions.append(1 if passenger['Sex'] == 'female' else 0)

    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_1(data)
```
The accuracy of our new prediction model:
```python
print(accuracy_score(outcomes, predictions))
```
Predictions have an accuracy of 78.68%.
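As an aside, the same one-feature rule can be written without an explicit loop. The sketch below uses a vectorized pandas comparison; predictions_vec is a name introduced here just for illustration:

```python
# Vectorized equivalent of predictions_1: the boolean mask
# (Sex == 'female') is cast to 0/1 predictions
predictions_vec = (data['Sex'] == 'female').astype(int)
print(accuracy_score(outcomes, predictions_vec))  # same 78.68%
```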
Using just the Sex feature for each passenger, we are able to
increase the accuracy of our predictions by a significant margin.
Now, let’s consider using an additional feature to see if we can further
improve our predictions. Consider, for example, all of the male passengers
aboard the RMS Titanic: Can we find a subset of those passengers that
had a higher rate of survival? Let’s start by looking at the Age
of each male, by again using the survival_stats
function. This time,
we’ll use a fourth parameter to filter out the data so that only
passengers with the Sex ‘male’ will be included.
```python
survival_stats(data, outcomes, 'Age', ["Sex == 'male'"])
```
Examining the survival statistics, the majority of males younger than 10 survived the ship sinking, whereas most males aged 10 or older did not. Let’s continue to build on our previous prediction: if a passenger was female, then we will predict they survived. If a passenger was male and younger than 10, then we will also predict they survived. Otherwise, we will predict they did not survive.
```python
def predictions_2(data):
    """
    Model with two features:
        - Predict a passenger survived if they are female.
        - Predict a passenger survived if they are male and younger than 10.
    """
    predictions = []
    for _, passenger in data.iterrows():
        # Females and young boys are predicted to survive
        if passenger['Sex'] == 'female' or passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_2(data)
```
```python
print(accuracy_score(outcomes, predictions))
```
Predictions have an accuracy of 79.35%.
Adding the feature Age as a condition in conjunction with Sex improves the accuracy only by a small margin over using the feature Sex alone.
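The age rule can also be checked numerically (a quick sketch against the original full_data, which still contains the Survived column; young_males is a helper name introduced here):

```python
# Survival rate of males younger than 10 (a majority, per the plot above)
young_males = full_data[(full_data['Sex'] == 'male') & (full_data['Age'] < 10)]
print(young_males['Survived'].mean())
```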
More powerful model
Now we should analyze the data to find some more powerful rules that achieve at least 80% accuracy.
```python
survival_stats(data, outcomes, 'Age')
survival_stats(data, outcomes, 'Pclass', ['Sex == "female"'])
survival_stats(data, outcomes, 'Pclass', ['Sex == "male"'])
survival_stats(data, outcomes, 'SibSp')
```
```python
def predictions_3(data):
    """
    Model with multiple features. Makes a prediction
    with an accuracy of at least 80%.
    """
    predictions = []
    for _, passenger in data.iterrows():
        if passenger.Sex == 'female':
            # Third-class women of middle age were less likely to survive
            if passenger.Pclass == 3:
                if 20 < passenger.Age < 50:
                    predictions.append(0)
                else:
                    predictions.append(1)
            else:
                predictions.append(1)
        else:
            # Only first-class males in certain age ranges are predicted to survive
            if passenger.Pclass == 1:
                if passenger.Age < 10 or 20 < passenger.Age < 40:
                    predictions.append(1)
                else:
                    predictions.append(0)
            else:
                predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_3(data)
```
```python
print(accuracy_score(outcomes, predictions))
```
Predictions have an accuracy of 80.13%.
First, I plotted all "plottable" features to see which features were most decisive. Since the Sex feature plays a major role, I then looked for filters that highlight males who survived and females who did not. I believe the next deciding feature is Pclass: most people from the first class survived, while the second and third classes weren't so lucky. Breaking the data down by age, gender, and class reveals the portion of males who survived (first class, young age) and females who did not (third class, middle age).
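The class effect described above can also be confirmed numerically (a short sketch using pandas groupby on the original full_data):

```python
# Survival rate broken down by sex and passenger class
print(full_data.groupby(['Sex', 'Pclass'])['Survived'].mean())
```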
Conclusion
I think supervised learning can be used as a "lazy" way for humans to tell computers what to do. For example, instead of hand-building a complex concrete model for 30-year crack prediction from data, a civil engineer can assume a generic model and leave the pattern learning to the machine.