Predicting Titanic survivors pt. 2: know your data

See part 1 here.

In my last post, I talked about my experience doing my first Kaggle competition. The goal was to determine who survived the Titanic shipwreck. I achieved an accuracy of 77.5% using RandomForestClassifier. I’d like to do better this time.

I decided to join the Kaggle Discord, where they have a channel dedicated to the competition. First, I learned that people cheat to get a perfect score, so 100% accuracy is not a realistic target. Second, I learned that 80% is a decent score, so I will set that as my goal for this post.

I will also continue to use Google Gemini as my “copilot.” I am taking Ethan Mollick’s advice to spend at least 10 hours using an AI assistant, to better understand the value it can provide.

My first instinct is to go back to the data. If my data isn’t clean or I have bad features, I am going to get bad results. I will assume the data is clean because it is provided by Kaggle. However, in my last post, I only used four features:

  • Pclass – The ticket class, a proxy for how wealthy a passenger is. Ranges from 1 (wealthiest) to 3 (poorest)
  • Sex – Male or female. This feature is categorical, but later one-hot encoded using get_dummies
  • SibSp – Number of siblings/spouses aboard the Titanic
  • Parch – Number of parents/children aboard the Titanic
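As a rough illustration of the one-hot encoding step, here is a minimal sketch. The rows are made up and stand in for Kaggle's train.csv; only the four column names and get_dummies come from the post.

```python
import pandas as pd

# Toy rows standing in for Kaggle's train.csv (values are made up for illustration).
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 2],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]

# get_dummies leaves the numeric columns alone and expands the
# categorical Sex column into Sex_female and Sex_male indicator columns.
X = pd.get_dummies(train[features])
print(X.columns.tolist())
```

The resulting frame is fully numeric, which is what RandomForestClassifier needs as input.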

The features ignored are age, ticket number, fare, cabin number, and port of embarkation. Age could be useful, because people in their “prime” could be more likely to survive the freezing water. Ticket number and cabin number don’t seem useful. The information from fare is already encoded in the ticket class. And the port of embarkation does not seem useful.

I will repeat the exact same experiment as last time, now including the age feature. Immediately I run into an issue: 177 of the passengers have an unknown age. That is roughly 20% of the 891 passengers in the training set, too many rows to simply drop. Instead, I will use information about the ages of other passengers to guess (impute) the missing values. Gemini recommends using the median instead of the mean, because outliers might skew the mean. I think this is reasonable, because it’s quite possible there were many old folks on the ship who could shift the distribution.

Filling in NaN values with the median. A median of 28/27 years old is surprising; this might not work…
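For reference, median imputation with pandas can be sketched like this. The ages are toy values, not the real Titanic data, so the median here will not match the one in the caption above.

```python
import numpy as np
import pandas as pd

# Toy Age column with two unknown values (made-up numbers).
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# Series.median() skips NaN by default, so it is computed from the known ages only.
median_age = train["Age"].median()

# Replace every missing age with that single value.
train["Age"] = train["Age"].fillna(median_age)
print(median_age, train["Age"].tolist())
```

The weakness is visible even in this toy frame: every unknown passenger gets the same age, regardless of anything else known about them.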

After submitting the code, my accuracy drops 0.3 percentage points to 77.2%. So that seems to be a dead end. To double-check, I put together a graph that paints a clearer picture (this excludes the median imputation).

Survival rates of passengers, grouped into 10-year buckets by age.
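The numbers behind a chart like this can be computed with pd.cut and a groupby. This is a sketch on made-up rows; the bucket edges (0 to 90 in steps of 10) and the AgeBucket column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up passengers; the last one has an unknown age.
train = pd.DataFrame({
    "Age": [4, 8, 19, 25, 27, 33, 41, 62, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 1],
})

# Bucket edges 0, 10, ..., 90. right=False makes each bucket [low, high).
bins = list(range(0, 91, 10))
train["AgeBucket"] = pd.cut(train["Age"], bins=bins, right=False)

# Rows with an unknown age get no bucket and are dropped by groupby,
# which matches a chart that excludes imputed ages.
rates = train.groupby("AgeBucket", observed=True)["Survived"].mean()
print(rates)
```

Calling rates.plot(kind="bar") would turn this into a bar chart, assuming matplotlib is installed.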

I spent some time browsing the discussions for the competition. Someone named Gunes Evitan was kind enough to put together a rigorous feature engineering tutorial for the competition here.

The first thing I found useful about the feature engineering tutorial was this simple way to find all missing values in the data set.

display_missing provides a clear way to see which columns are missing data.
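A helper along these lines is short to write. The version below is my own sketch of the idea, not a copy of the tutorial's function. Only the name display_missing comes from the source; the body, the return value, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def display_missing(df):
    """Print and return the number of missing values in each column."""
    counts = df.isnull().sum()
    for col, n in counts.items():
        print(f"{col} column missing values: {n}")
    return counts

# Toy frame with gaps in Age and Cabin (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

missing = display_missing(toy)
```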

It also addressed the issue I experienced with age imputation. Instead of using the median value, they calculated the correlation between the Age column and all other columns.

Correlations between age and other features.
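One way to produce a ranking like this is to take the Age column of the correlation matrix and sort it by absolute value. The sketch below uses made-up numbers chosen so that older passengers sit in the higher classes; it shows the mechanics only and does not reproduce the real correlations.

```python
import pandas as pd

# Made-up numeric features; Age deliberately falls as Pclass rises.
toy = pd.DataFrame({
    "Age":    [40, 38, 35, 30, 28, 22, 20, 18],
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "SibSp":  [1, 0, 1, 0, 0, 1, 0, 1],
    "Parch":  [0, 0, 1, 0, 2, 0, 1, 0],
})

# Pearson correlation of Age with every other column.
age_corr = toy.corr()["Age"].drop("Age")

# Rank by strength, ignoring the sign.
ranked = age_corr.abs().sort_values(ascending=False)
print(ranked)
```

A negative sign for Pclass would mean that first class (Pclass 1) skews older, which is why the class can stand in for a missing age.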

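In this case, passenger class is used as a weak proxy for age: a missing age is filled with the median age of that passenger’s class rather than the overall median. I found this mind-blowing. To increase accuracy even more, we group by Sex as well.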

Impute the Age column using median age, grouped by Sex and Passenger Class.
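Grouped imputation can be sketched with groupby and transform. This is a minimal version on toy data, and it may differ in detail from the tutorial's code.

```python
import numpy as np
import pandas as pd

# Toy passengers: two (Sex, Pclass) groups, each with one unknown age.
train = pd.DataFrame({
    "Sex":    ["female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Age":    [30.0, 40.0, np.nan, 20.0, 24.0, 28.0, np.nan],
})

# For each (Sex, Pclass) group, fill missing ages with that group's median.
# transform returns a result aligned with the original row index.
train["Age"] = (
    train.groupby(["Sex", "Pclass"])["Age"]
         .transform(lambda ages: ages.fillna(ages.median()))
)
print(train)
```

Here the first-class woman gets 35 and the third-class man gets 24, instead of both receiving one global median.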

The tutorial author is so thorough that they even went to the trouble of figuring out what the two missing Embarked values were, using some Google-Fu. You can read about that here.

I will cover some of the additional observations/changes that were made, as the entire post is too lengthy to go through here.

  • Create a new Deck feature. A deck is a group of cabins. Deck is strongly, though not perfectly, correlated with ticket class, because some ticket classes share a deck.
  • Understand correlations between features by visualizing them.
    Correlations between features.
  • Parse the names of passengers and pick out the title Mrs, which often indicates a mother. Mothers have high survival rates (women and children go first).
  • Use k-fold cross validation to compare models, even though the data set is small.
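The Deck and title ideas from the list above can be sketched in a few lines of pandas. The rows are toy data in the style of the Kaggle files. The "M" placeholder for a missing cabin, the regular expression, and the column names Title and IsMrs are my assumptions for illustration, not necessarily what the tutorial uses.

```python
import numpy as np
import pandas as pd

# Toy rows in the style of the Kaggle data (names and cabins are illustrative).
train = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
    "Cabin": [np.nan, "C85", np.nan, "B96 B98"],
})

# Deck: the first letter of the cabin string. Unknown cabins get the
# placeholder "M" (for "missing"), an assumption made for this sketch.
train["Deck"] = train["Cabin"].fillna("M").str[0]

# Title: the text between the comma and the first period in the name.
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical indicator column flagging passengers titled Mrs.
train["IsMrs"] = (train["Title"] == "Mrs").astype(int)
print(train[["Deck", "Title", "IsMrs"]])
```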

These changes in aggregate result in an accuracy of 83%, which is very impressive. Special thanks to Gunes Evitan for their feature engineering tutorial, which you can view here.

The lesson to be learned here is that feature engineering is important. There was no change to the original model; it still uses RandomForestClassifier.
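Because the model itself stays fixed, k-fold cross-validation is one way to compare feature sets without spending a Kaggle submission on each. The sketch below is only an illustration. The data is synthetic, cv_accuracy is a hypothetical helper, and the hyperparameters are placeholders rather than the ones from the submission or the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the engineered training data (not real Titanic rows).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Sex_female": rng.integers(0, 2, size=n),
    "Age": rng.uniform(1, 70, size=n),
})
# Synthetic label: follows Sex_female, with about 20% of the labels flipped.
y = ((X["Sex_female"] == 1) ^ (rng.random(n) < 0.2)).astype(int)

# Same model type for every feature set; only the columns change.
model = RandomForestClassifier(n_estimators=100, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

def cv_accuracy(columns):
    """Mean accuracy over the 5 folds using only the given columns."""
    scores = cross_val_score(model, X[columns], y, cv=cv, scoring="accuracy")
    return scores.mean()

baseline = cv_accuracy(["Pclass", "Sex_female"])
with_age = cv_accuracy(["Pclass", "Sex_female", "Age"])
print(f"baseline: {baseline:.3f}  with Age: {with_age:.3f}")
```

On a data set this small, the fold-to-fold spread can be larger than the difference between two feature sets, so it helps to look at the individual fold scores as well as the mean.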

I hope you found this blog post useful. Feel free to share your experiences with Kaggle and feature engineering in the comment section.

