Gender Predictor

Use the power of machine learning to predict the gender of Amazon review authors.




How We Approached This Project

Below is a brief description of our methodology. You can also download a full project report here.

1. Amazon Data Analysis

The data does not state explicitly whether a reviewer is male or female, so we first had to solve this piece of data engineering ourselves.

Our dataset, in JSON format, contained roughly 1.3 million records, each including a self-chosen username, review text, product ID, and user rating.
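As a rough illustration of reading such a dataset, the sketch below parses one JSON record per line and keeps the four kinds of fields mentioned above. The field names (`reviewerName`, `reviewText`, `asin`, `overall`) and the one-record-per-line layout are assumptions made for this example; the actual keys in the dataset may differ.

```python
import json

# Assumed field names, for illustration only; the real keys may differ.
FIELDS = ("reviewerName", "reviewText", "asin", "overall")

def load_reviews(lines):
    """Yield one dict per JSON line, keeping only the fields of interest.

    Records without a username or review text are skipped, because they
    can neither be labeled nor used for training.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not record.get("reviewerName") or not record.get("reviewText"):
            continue
        yield {field: record.get(field) for field in FIELDS}

# Two made-up records; the second has no username and is dropped.
sample = [
    '{"reviewerName": "Anna K.", "reviewText": "Loved it!", "asin": "B000TEST01", "overall": 5.0}',
    '{"reviewText": "No username here", "asin": "B000TEST02", "overall": 2.0}',
]
reviews = list(load_reviews(sample))
```

Reading line by line keeps memory use flat, which matters at roughly 1.3 million records; passing an open file object as `lines` works the same way as the list used here.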

2. Data Clean-Up & Labeling

We chose the GenderGuesser Python library, which maps more than 40,000 first names to a gender. GenderGuesser labels each name as “male/female”, “mostly male/female” or “unknown”.

To ensure high quality of the data used in our prediction model, we kept only records with an unambiguous label (“male” or “female”) and discarded the rest.
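The labeling step can be sketched roughly as follows. This is a minimal illustration, not our exact code: the helper names `first_name` and `label_gender` are hypothetical, the way a first name is pulled out of a username is an assumption, and a tiny stand-in lookup replaces the real name list so the example runs without extra dependencies.

```python
import re

def first_name(username):
    """Return the first alphabetic token of a username, capitalized, or None."""
    match = re.search(r"[A-Za-z]+", username or "")
    return match.group(0).capitalize() if match else None

def label_gender(username, guess):
    """Return 'male' or 'female', or None when the label is ambiguous.

    `guess` is any callable that maps a first name to a label string.
    """
    name = first_name(username)
    if name is None:
        return None
    label = guess(name)
    # Keep only unambiguous labels; "mostly_*" and "unknown" are dropped.
    return label if label in ("male", "female") else None

# Stand-in lookup (hypothetical data). Assuming the library is the
# gender-guesser package, one could instead pass
# gender_guesser.detector.Detector().get_gender as `guess`.
TOY_NAMES = {"Anna": "female", "John": "male", "Robin": "mostly_male"}

def toy_guess(name):
    return TOY_NAMES.get(name, "unknown")

usernames = ["anna_k", "John Smith", "robin99", "xX_42_Xx"]
labels = {u: label_gender(u, toy_guess) for u in usernames}
```

Records whose label comes back as `None` would simply be left out of the training set, which trades dataset size for label quality.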

3. Text Preprocessing using NLTK

NLTK (the Natural Language Toolkit) is one of the most widely used natural language processing libraries in the industry.

We removed stopwords, expanded contractions, stemmed the text, used regular expressions to normalize formatting and strip unwanted characters, and encoded the gender labels as numbers.
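A minimal sketch of such a pipeline is shown below. To keep it runnable without downloads, it uses a small hand-written stopword set and contraction map and takes the stemmer as a parameter; these short lists, the function name `preprocess`, and the 0/1 label codes are all assumptions for illustration. With NLTK installed, `nltk.corpus.stopwords.words("english")` and `nltk.stem.PorterStemmer().stem` could be plugged in, though which stemmer the project actually used is not stated here.

```python
import re

# Small illustrative lists; a full stopword corpus and a larger
# contraction map would replace these in practice.
STOPWORDS = {"a", "an", "and", "do", "i", "is", "it", "not", "the", "to"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i'm": "i am", "can't": "cannot"}
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b"
)

# Assumed numeric encoding of the two labels.
GENDER_CODES = {"female": 0, "male": 1}

def preprocess(text, stem=lambda word: word):
    """Lowercase, expand contractions, strip non-letters, drop stopwords, stem."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0)], text)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits and punctuation
    return [stem(token) for token in text.split() if token not in STOPWORDS]

tokens = preprocess("I don't think it's the BEST phone... 5/5!")
# tokens == ["think", "best", "phone"]
```

Contractions are expanded before punctuation is stripped; doing it the other way round would turn "don't" into the two fragments "don" and "t".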

4. Applying our ML Model

We used a convolutional neural network (CNN) to predict gender with the highest accuracy. The model consists of 7 convolutional layers, 4 max-pooling layers and 2 dense layers, and uses ReLU and sigmoid activation functions.
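To make the layer counts more concrete, the sketch below lays out one hypothetical ordering of the layers and traces how the sequence length shrinks through the convolution and pooling stages. Only the counts (7 convolution, 4 max-pooling, 2 dense) and the activation names come from the description above; the ordering, kernel size, pool size and input length are assumptions chosen for illustration.

```python
# One hypothetical ordering of the stated layers. The ordering and all
# sizes are assumptions; only the layer counts and activation names
# come from the model description.
PLAN = [
    ("conv", "relu"), ("conv", "relu"), ("pool", None),
    ("conv", "relu"), ("conv", "relu"), ("pool", None),
    ("conv", "relu"), ("conv", "relu"), ("pool", None),
    ("conv", "relu"), ("pool", None),
    ("dense", "relu"), ("dense", "sigmoid"),
]

def trace_lengths(input_len, kernel_size=3, pool_size=2):
    """Return the sequence length after each convolution or pooling layer.

    Assumes 1D convolutions with stride 1 and no padding, and
    non-overlapping max-pooling that drops an incomplete final window.
    """
    lengths = []
    length = input_len
    for kind, _activation in PLAN:
        if kind == "conv":
            length = length - kernel_size + 1
        elif kind == "pool":
            length = length // pool_size
        else:
            break  # dense layers operate on the flattened features
        if length < 1:
            raise ValueError("input too short for this layer plan")
        lengths.append(length)
    return lengths

counts = {kind: sum(1 for k, _ in PLAN if k == kind)
          for kind in ("conv", "pool", "dense")}
lengths = trace_lengths(200)  # e.g. reviews padded to 200 tokens (assumed)
```

A single sigmoid unit in the final dense layer outputs a value between 0 and 1, which suits a two-class prediction like this one. In a framework such as Keras, a plan like this would map onto `Conv1D`, `MaxPooling1D` and `Dense` layers.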