Use the power of machine learning to predict the gender of Amazon review authors.
Below is a brief description of our methodology. You can also download a full project report here.
The data does not explicitly state whether a reviewer is male or female, so the team first had to derive gender labels as a data engineering step.
Our dataset, in JSON format, contained roughly 1.3 million records, including self-chosen usernames, review texts, product IDs, and user ratings.
We chose the GenderGuesser Python library, which contains a list of more than 40,000 first names and their associated genders. GenderGuesser labels each name as “male” or “female”, “mostly male” or “mostly female”, or “unknown”.
To keep the quality of the data used in our prediction model high, we kept only records with unambiguous labels, meaning “male” or “female”.
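The labeling step above can be sketched roughly as follows. This is a minimal illustration, not our exact pipeline: the field name `reviewerName`, the helper names, and the toy name table are assumptions made for the example. With the gender-guesser package installed, the stand-in `toy_guess` could be replaced by `gender_guesser.detector.Detector().get_gender`.

```python
from typing import Callable, Dict, Iterable, List

# Labels treated as unambiguous; "mostly_*" and "unknown" results are dropped.
UNAMBIGUOUS = {"male", "female"}


def first_name(username: str) -> str:
    """Return the first whitespace-separated token of a username, title-cased."""
    parts = username.strip().split()
    return parts[0].title() if parts else ""


def label_reviews(
    records: Iterable[Dict],
    guess: Callable[[str], str],
    name_field: str = "reviewerName",  # hypothetical field name
) -> List[Dict]:
    """Attach a 'gender' key and keep only records with an unambiguous label."""
    labeled = []
    for record in records:
        label = guess(first_name(record.get(name_field) or ""))
        if label in UNAMBIGUOUS:
            labeled.append({**record, "gender": label})
    return labeled


# Stand-in guesser for illustration only (hypothetical name table).
_TOY_NAMES = {"Alice": "female", "Bob": "male", "Robin": "mostly_male"}


def toy_guess(name: str) -> str:
    return _TOY_NAMES.get(name, "unknown")


sample = [
    {"reviewerName": "alice w.", "reviewText": "Loved it"},
    {"reviewerName": "Robin", "reviewText": "Fine"},
    {"reviewerName": "bookworm42", "reviewText": "Meh"},
    {"reviewerName": "Bob Smith", "reviewText": "Broke in a week"},
]
labeled_sample = label_reviews(sample, toy_guess)
```

Passing the guesser in as a function keeps the filtering logic independent of any particular name database, so the same code works with a toy table in tests and a full library in production.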
NLTK (the Natural Language Toolkit) is one of the best-known language processing libraries in the industry.
We removed stopwords, expanded contractions, stemmed the text, cleaned up formatting and stripped unwanted characters using regular expressions, and encoded the gender labels.
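A simplified sketch of these cleaning steps is shown below. It uses only the standard library so that it runs anywhere: the stopword set, contraction table, gender codes, and `naive_stem` are small illustrative stand-ins, not the NLTK resources themselves. In practice one could pass a real stemmer, for example `nltk.stem.PorterStemmer().stem`, as the `stem` argument.

```python
import re
from typing import Callable, List

# Tiny illustrative tables (assumptions); a real run would use fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
STOPWORDS = {"the", "a", "an", "is", "it", "and", "i", "am", "this", "to", "of"}
GENDER_CODES = {"female": 0, "male": 1}  # assumed encoding of the labels


def naive_stem(token: str) -> str:
    """Very rough suffix stripper, a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, stem: Callable[[str], str] = naive_stem) -> List[str]:
    """Lowercase, expand contractions, strip non-letters, drop stopwords, stem."""
    text = text.lower().replace("’", "'")
    # Expand contractions before punctuation is stripped.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [stem(t) for t in tokens]


tokens = preprocess("It's the best blender, I can't stop making smoothies!!!")
```

Contractions are expanded first because the regular expression would otherwise delete the apostrophes and leave fragments such as "can" and "t" behind.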
We used convolutional neural networks to predict gender with the highest accuracy. The model consists of 7 convolution layers, 4 max-pooling layers, and 2 dense layers, and uses ReLU and sigmoid activation functions.
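To make the layer counts concrete, the sketch below writes out one possible arrangement and traces how the sequence length shrinks through it. Only the counts (7 convolution, 4 max-pooling, 2 dense) and the activations come from our model; the ordering, kernel and pool sizes, dense widths, and the use of unpadded 1D convolutions are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: (layer kind, kernel size / pool size / dense units).
LAYERS: List[Tuple[str, int]] = [
    ("conv", 3), ("conv", 3), ("pool", 2),
    ("conv", 3), ("conv", 3), ("pool", 2),
    ("conv", 3), ("conv", 3), ("pool", 2),
    ("conv", 3), ("pool", 2),
    ("dense", 64),  # hidden layer with ReLU (assumed width)
    ("dense", 1),   # sigmoid output for the binary gender label
]


def trace_length(seq_len: int, layers: List[Tuple[str, int]] = LAYERS) -> int:
    """Return the sequence length left after the conv/pool stack.

    Assumes stride-1 convolutions without padding and non-overlapping
    pooling windows; dense layers do not change the traced length.
    """
    for kind, size in layers:
        if kind == "conv":
            seq_len = seq_len - size + 1
        elif kind == "pool":
            seq_len = seq_len // size
        if seq_len < 1:
            raise ValueError("input too short for this layer stack")
    return seq_len


remaining = trace_length(200)
```

A check like this is a quick way to confirm that the padded review length is large enough to survive every pooling step before building the model in a deep learning framework.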