See part 1 here.
Last post, I talked about my experience doing my first Kaggle competition. The goal was to determine who survived the Titanic shipwreck. I achieved an accuracy of 77.5% using RandomForestClassifier. I’d like to do better this time.
I decided to join the Kaggle discord, where they have a channel dedicated to the competition. First, I learned that people cheat to get a perfect score, so 100% accuracy is not reasonable. Second, I learned that 80% is a decent score, so I will set that as my goal for this post.
I will also continue to use Google Gemini as my “copilot.” I am taking Ethan Mollick’s advice to spend at least 10 hours using an AI assistant, to better understand the value it can provide.
Instinctively, the first thing I want to do is go back to the data. If my data isn’t clean or I have bad features, I am going to get bad results. I will assume the data is clean because it is provided by Kaggle. However, in my last post, I only used four features:
- Pclass – The ticket class, a proxy for how wealthy a passenger is. Ranges from 1 (wealthiest) to 3 (poorest)
- Sex – Male or female. This feature is categorical, but later one-hot encoded using get_dummies
- SibSp – Number of siblings/spouses aboard the Titanic
- Parch – Number of parents/children aboard the Titanic
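For reference, here is a sketch of that part-1 setup. The column names come from Kaggle’s train.csv; the rows below are made-up stand-ins so the snippet runs on its own, and the hyperparameters are illustrative, not the exact ones I used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for Kaggle's train.csv (real column names, made-up rows)
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass":   [3, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male"],
    "SibSp":    [1, 1, 0, 0],
    "Parch":    [0, 0, 0, 0],
})

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train[features])   # one-hot encodes Sex into Sex_female / Sex_male
y = train["Survived"]

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X, y)
```

With the real train.csv you would read the DataFrame with `pd.read_csv("train.csv")` instead of building it by hand.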
The features ignored are age, ticket number, fare, cabin number, and port of embarkation. Age could be useful, because people in their “prime” could be more likely to survive the freezing water. Ticket number and cabin number don’t seem useful. The information from fare is already encoded in the ticket class. And the port of embarkation does not seem useful.
I will repeat the exact same experiment as last time, now including the age feature. Immediately I run into an issue: 177 of the passengers have an unknown age. Since this is about a fifth of the 891 passengers in the training set, I’m not going to drop those rows. Instead, I will use information about the ages of other passengers to guess the missing values (impute). Gemini recommends using the median instead of the mean, because outliers might skew the results. I think this is reasonable, because it’s quite possible there were many old folks on the ship that could change the distribution.
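Median imputation is a one-liner in pandas. This is a minimal sketch on a toy Age column, assuming the same `fillna`-based approach I used:

```python
import numpy as np
import pandas as pd

# Toy Age column with missing values standing in for the 177 unknown ages
ages = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan]})

median_age = ages["Age"].median()          # median is robust to outliers
ages["Age"] = ages["Age"].fillna(median_age)
```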
After submitting the code, my accuracy drops 0.3 points to 77.2%. So that seems to be a dead end. To double-check, I put together a graph that paints a clearer picture (this excludes the median imputation).
I spent some time browsing the discussions for the competition. Someone named Gunes Evitan was kind enough to put together a rigorous feature engineering tutorial for the competition here.
The first thing I found useful about the feature engineering tutorial was this simple way to find all missing values in the data set.
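The idiom is to chain `isnull()` with `sum()`, which counts the NaNs in every column at once. A sketch on a toy frame (the column names mirror train.csv):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":      [22.0, np.nan, 35.0],
    "Cabin":    [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})

missing = df.isnull().sum()     # NaN count per column
print(missing[missing > 0])     # show only columns with missing values
```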
It also addressed the issue I experienced with age imputation. Instead of using the median value, they calculated the correlation between the Age column and all other columns.
In this case, passenger class serves as a weak proxy for age. I found this mind-blowing. To sharpen the estimate further, the tutorial groups by Sex as well.
The tutorial author is so thorough, they even took the pains of tracking down what the two missing Embarked values were, using some Google-Fu. You can read about that here.
I will cover some of the additional observations/changes that were made, as the entire post is too lengthy to go through here.
- Create a new Deck feature. A deck is a group of cabins; decks correlate strongly with ticket class, but not perfectly, since some decks are shared across ticket classes.
- Understand correlations between features by visualizing them.
- Parse the names of passengers and pick out Mrs, which often represents a mother. Mothers have high survival rates (women and children go first).
- Use k-fold cross validation to compare models, even though the data set is small.
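The name-parsing step above can be sketched with a small regex: every name in train.csv follows the pattern “Surname, Title. Given names,” so the title sits between the comma and the first period. The regex here is my own illustration, not the tutorial’s exact code:

```python
import pandas as pd

# Names follow Kaggle's "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^\.]+)\.")[0].str.strip()
```

A `Title` column like this makes “Mrs” (a likely mother) available to the model directly.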
These changes in aggregate result in an accuracy of 83%, which is very impressive. Special thanks to Gunes Evitan for their feature engineering tutorial, which you can view here.
The lesson to be learned here is that feature engineering is important. There was no change to the original model; it still uses RandomForestClassifier.
I hope you found this blog post useful. Feel free to share your experiences with Kaggle and feature engineering in the comment section.