My First Kaggle Competition: Analyzing Titanic Survivors with Machine Learning

I'd like to share some of what I've been learning in machine learning. Over the past year I completed several Coursera courses on ML and neural networks, but I was a bit too intimidated to jump into a Kaggle competition.

Fortunately, Kaggle has a competition targeted toward beginners, called Titanic – Machine Learning from Disaster. Here is my experience.

A screencap from the introductory video.

The goal of the competition is to determine which Titanic passengers survived based on data including sex, age, and ticket price.

The data dictionary.

One thing that struck me while I was looking through the data was a sense of sadness. To us, over a century later, all of the passengers are reduced to rows in a CSV file. But they were real people. Here is Owen Braund, a 22-year-old who planned to emigrate from England to the United States. His older brother, Jim, was waiting for him across the pond. Owen died at sea, and his body was never recovered.

Owen Harris Braund. Courtesy Encyclopedia Titanica.

I digress.

While the data dictionary was helpful, Google Gemini provided a much clearer picture of the meanings of each column. Here is what it had to say about Owen in plain English:

“In summary, this row describes a 22-year-old male passenger named Mr. Owen Harris Braund. He was traveling in 3rd class, did not survive, and had one sibling or spouse on board. He paid 7.2500 for his ticket and embarked at Southampton.”

Digging in

Kaggle provides a tutorial that guides newcomers on how to do basic training and inference, as well as how to make a submission. It also provides a data explorer that visualizes the data without requiring tools like matplotlib.

Information about the training data set.

The data explorer isn’t the greatest tool, though. For example, the “Survived” column is binary, yet the visualizer bins it like a continuous variable, reporting 549 rows with a value between 0 and 0.1 and 342 rows with a value between 0.9 and 1.

The tutorial suggests looking for a pattern, such as the survival rate of women. This makes sense: women and children were the first passengers to board the lifeboats, so they should have a higher rate of survival than men.

A screencap of the tutorial code which identifies the survival rates of women and men, respectively.
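The screencap doesn’t reproduce well here, so here is roughly what that code looks like in pandas (a sketch, assuming the standard Kaggle train.csv column names and notebook file paths):

```python
import pandas as pd

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

# Survival rate among female passengers.
women = train_data.loc[train_data["Sex"] == "female", "Survived"]
rate_women = women.sum() / len(women)

# Survival rate among male passengers.
men = train_data.loc[train_data["Sex"] == "male", "Survived"]
rate_men = men.sum() / len(men)

print(f"Women who survived: {rate_women:.1%}")
print(f"Men who survived: {rate_men:.1%}")
```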

The intuition was correct. 74% of women survived, whereas only 19% of men survived. We can use this data point to increase the accuracy of our predictions.

In fact, if we predict that every woman survived and every man did not, our competition entry scores almost 77% accuracy. Not bad.
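For reference, that gender-only baseline takes only a few lines (a minimal sketch; the file path again assumes the Kaggle notebook environment):

```python
import pandas as pd

test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

# Predict 1 (survived) for every female passenger and 0 for every male passenger.
predictions = (test_data["Sex"] == "female").astype(int)

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
```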

We will try to improve on this by using a random forest model. The core idea is to use an ensemble of decision trees (if statements, basically). Each tree is trained on a random sample of the rows and considers only a random subset of the features at each split. The trees then “vote” (in the case of a classifier), and the majority wins. The randomness is designed to prevent overfitting.

We use scikit-learn, as shown below.
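Roughly, the code looks like the following sketch (the exact feature list and hyperparameters here are illustrative rather than a record of my submission):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

y = train_data["Survived"]

# Keep a handful of informative columns and one-hot encode the categorical "Sex".
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# 100 shallow trees, each fit on a bootstrap sample of the rows and
# considering a random subset of features at each split.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
```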

Note that we ignore some noisy columns in the data, like the ticket number or port of departure.

After submitting this, our accuracy only increases by about 1% compared to the simple 100% female / 0% male survival model. On the leaderboard, some competitors have achieved 100% accuracy, so we can certainly do better. In the next post in this series, I will outline some improvements.

If you haven’t tried Kaggle before, I encourage you to give it a shot. And feel free to share your learning experiences in the comments below.

