Personality Prediction from the Myers-Briggs 16 Personality Types Dataset

T G Narayanan
Dec 19, 2020

— by Abhishek Madaan, Neha Rana, and T G Narayanan

What’s it about?

The Myers-Briggs Type Indicator® (MBTI®) tool defines four pairs of opposing preferences (eight in all), and every individual can be categorized into a particular combination of these preferences based on their behaviour, thinking, decision-making style, etc.

This project tackles a Machine Learning task: given the Twitter posts of 8675 individuals, predict each individual's personality type.

We have used word embedding and text-representation techniques (tf-idf, Word2Vec, GloVe) with machine learning and deep learning algorithms such as SVM, Logistic Regression, BiLSTM, Naive Bayes, and AdaBoost.

It was found that SVM + SMOTETomek performs best for this classification task.

XGBoost with resampling also gave quite promising results.

Introduction

Personality is the manner in which a person’s traits and characteristics are expressed. Each person is special and unique in their own way, and it is personality that defines a person. It is also an indicator of how a person sees and interacts with the world and their environment. Based on Carl Jung’s theory of psychological types, the MBTI (a world-renowned personality indicator) describes one’s personality preferences along four dimensions:

1. Where you focus your attention — Extraversion (E) or Introversion (I),

2. The way you take in information — Sensing (S) or Intuition (N),

3. How you make decisions — Thinking (T) or Feeling (F),

4. How you deal with the world — Judging (J) or Perceiving (P).

Using these four dimensions, a total of sixteen personality types exist:

Source: https://excellenceassured.com/16-personality-types/free-myers-briggs-type-personality-test

Importance of the project

The project focuses on analyzing the “Myers Briggs Type Indicator Dataset” and visualizing the data distribution with various plots.

Preprocessing of the dataset involves splitting instances into sentences, removing URLs from sentences, and so on. The occurrences of sentences for each label (indicator) are then recorded. Word counts in the dataset come from Count Vectorization, and the importance of those words is determined using tf-idf. Word2Vec and GloVe are used to learn word embeddings.

To handle the imbalance in the dataset, we perform resampling. Resampling can be done by oversampling, by undersampling, or by a combination of the two. These techniques create fairness between the dominating and recessive classes.

Oversampling techniques populate the dataset with additional minority-class instances. Undersampling techniques, on the other hand, reduce instances of the dominating class, bringing its count down to that of the recessive class.

However, oversampling techniques tend to cause overfitting, and undersampling techniques tend to cause loss of useful information. In our work, SMOTETomek is used, which performs oversampling using SMOTE (Synthetic Minority Oversampling Technique) and cleaning using Tomek links.

Machine learning algorithms such as Logistic Regression, SVM, XGBoost, AdaBoost, and Naive Bayes are then implemented to generate models.

A comparison study is done using the performance scores attained using those models.

Dataset Used

https://www.kaggle.com/datasnaek/mbti-type

Methodology

Data Analysis

This study uses the Myers Briggs Type Indicator Dataset. It consists of 50 statuses from each of 8675 users, with personality labels based on the Myers-Briggs personality types. The dataset has two columns: the first contains the MBTI personality type of each user, and the second contains the statuses of that user, with each status separated by three pipe characters. The distribution of the dataset based on each personality type is presented in Table 1 below.

Table 1: Distribution of Personality Traits

According to the table, the personality types ’INTJ’, ’INTP’, ’INFJ’, and ’INFP’ occur much more frequently than the other personality types. Figures 1 and 2 show the number of occurrences of each MBTI personality type in the dataset, as a bar graph and a pie chart, respectively.
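The per-type distribution behind Table 1 can be computed in a couple of lines with pandas; here a tiny inline sample stands in for the Kaggle CSV:

```python
# Count occurrences of each MBTI type; a small inline frame replaces the
# real dataset (which has 8675 rows, one per user).
import pandas as pd

df = pd.DataFrame({
    "type": ["INFP", "INFJ", "INTP", "INFP", "INTJ", "ENFP", "INFP"],
    "posts": ["..."] * 7,  # each cell holds 50 posts joined by '|||'
})

# value_counts returns the types sorted by frequency, descending.
counts = df["type"].value_counts()
print(counts)
```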

MBTI Personality Types

According to the MBTI system, every person falls into one of two options in each of four categories: Introversion (I)/Extroversion (E), Intuition (N)/Sensing (S), Thinking (T)/Feeling (F), and Judging (J)/Perceiving (P). One letter is taken from each category, and together the four letters represent one of the 16 personality types in the MBTI. The distribution of the dataset based on each type indicator is presented in Table 2 below.

Table 2: Distribution of Personality Indicator

Figure 3 shows the distribution across type indicators in the form of a bar graph.

Personality Type Distribution

Here we can see that the classes ’Introversion (I)/Extroversion (E)’ and ’Intuition (N)/Sensing (S)’ are imbalanced. The distribution of Introversion (I) is much higher than that of Extroversion (E), and the distribution of Intuition (N) is much higher than that of Sensing (S), which can lead to bias towards the majority class.

Data Preprocessing

Data preprocessing is a crucial step in any machine or deep learning project. Here, the dataset is given as a .csv file with two columns. The first column, ’type’, contains the class (label) we need to predict, and the second column contains the Twitter posts of an individual. Every instance in the dataset contains fifty sentences posted by one individual.

The first step in preprocessing is to split the posts on ’|||’, since the sentences in each instance are separated by three pipes; this yields a list of lists. As the 8 type indicators combine into 4 categories (’IE’, ’NS’, ’TF’, ’JP’), we create 4 columns, one per category, each containing a binary value, i.e. 0 or 1. When the first letter in the label is I, the ’IE’ column in the dataframe is set to 1, otherwise 0; the same goes for ’NS’, ’TF’, and ’JP’.

Next, the sentences are cleaned so they can be used easily downstream. For that, we used the NLTK library: first remove the URLs from the sentences, then remove words that do not start with a letter, then convert every capital letter to lower case, and finally lemmatize all words. Word embedding is then performed so the text can serve as input to the algorithms to be trained.
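The splitting, cleaning, and binary-labeling steps above can be condensed into a short sketch (hypothetical helper names; the NLTK lemmatization step is omitted here for brevity):

```python
import re

# One user's raw record: posts separated by three pipes.
raw = "I love coding|||Check http://example.com NOW|||42 isn't a word here"

# 1. Split the record into individual posts on the '|||' separator.
posts = raw.split("|||")

def clean(post: str) -> str:
    post = re.sub(r"https?://\S+", "", post)              # drop URLs
    words = [w for w in post.split() if w[:1].isalpha()]  # keep words starting with a letter
    return " ".join(words).lower()                        # lowercase everything

cleaned = [clean(p) for p in posts]

# 2. Binary label columns: 1 if the letter at the category's position in the
#    type string matches the category's first letter ('IE' -> I, 'NS' -> N, ...).
mbti = "INFP"
labels = {cat: int(mbti[i] == cat[0]) for i, cat in enumerate(["IE", "NS", "TF", "JP"])}
print(cleaned, labels)
```

For an ’INFP’ user this yields IE = 1, NS = 1, TF = 0, JP = 0.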

Count vectorization and tf-idf are used for this.

Experiments and Results

To handle the imbalanced dataset, we used resampling. The objective of resampling methods is to modify the dataset so as to reduce the discrepancy among the class distributions. The two resampling techniques are undersampling and oversampling: undersampling eliminates instances from the majority class, while oversampling generates instances for the minority class. For balancing, we used the SMOTETomek method, a combination of an undersampling technique (Tomek links) and an oversampling technique (SMOTE). This study uses a few machine learning techniques and a few deep learning techniques. We applied the following models in our project:

(a) SVM without SmoteTomek: This model was learned without doing re-sampling.

(b) SVM with SmoteTomek: This model was learned with re-sampling.

(c) Logistic Regression without SmoteTomek: This model was learned without re-sampling.

(d) Logistic Regression with SmoteTomek: This model was learned with re-sampling.

(e) XGBoost without SmoteTomek: This model was learned without re-sampling.

(f) XGBoost with SmoteTomek: This model was learned with re-sampling.

(g) Word2Vec was used for word embeddings. W2V was used with BiLSTM.

(h) Another word embedding we used was GloVe. GloVe was used with BiLSTM.

(i) Naive Bayes and AdaBoost were also implemented.

The training and testing split ratio used for evaluation is 75:25.
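Assuming scikit-learn, the 75:25 split and the per-indicator accuracy/F1 evaluation can be sketched as follows (synthetic data stands in for the MBTI features and one binary indicator label):

```python
# 75:25 train/test split, SVM fit, and the two evaluation metrics used.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # 75:25 split
)

clf = LinearSVC().fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```

In the project this evaluation is repeated once per indicator column (’IE’, ’NS’, ’TF’, ’JP’), which is why four accuracy and four F1 scores are reported per model.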

All of the classification results obtained by using various models can be seen in Table 3 & Table 4.

The results show that SMOTETomek + SVM outperformed the other models. This model was first implemented without re-sampling; as the table shows, its accuracy and F1-score increased significantly once re-sampling was done.

Before re-sampling:

F1-score: IE 87.939, NS 92.196, TF 74.387, JP 48.231

Accuracy: IE 78.792, NS 85.523, TF 76.855, JP 66.943

After re-sampling with SMOTETomek + SVM, a significant improvement was achieved:

F1-score: IE 89.636, NS 94.890, TF 79.532, JP 72.180

Accuracy: IE 88.645, NS 94.650, TF 79.812, JP 74.580

Our guide

Professor: Dr. Tanmoy Chakraborty (http://faculty.iiitd.ac.in/~tanmoy/)

LinkedIn: https://www.linkedin.com/in/tanmoy-chakraborty-89553324/

Twitter: @Tanmoy_Chak

Facebook: https://www.facebook.com/chak.tanmoy

Our Teaching Fellow and Teaching Assistants

  1. Teaching Fellow: Ms. Ishita Bajaj
  2. Teaching Assistants: Pragya Srivastava, Shiv Kumar Gehlot, Chhavi Jain, Vivek Reddy, Shikha Singh, and Nirav Diwan.

#MachineLearning2020 #IIITD

Authors (and their contributions)

Abhishek Madaan (https://www.linkedin.com/in/abhishek-madaan)

Models prepared using AdaBoost and GloVe + BiLSTM

Neha Rana

Models prepared using Naive Bayes and Word2Vec + BiLSTM

T G Narayanan (https://www.linkedin.com/in/tgnarayan)

Data modelling using SVM, Logistic Regression, and XGBoost

All the authors performed data preprocessing and re-sampling using SMOTETomek. Evaluations were done in terms of Accuracy and F1-score.

