Tokenizer for StackOverflow Posts

This project serves as partial fulfillment of the course CZ4045: Natural Language Processing at Nanyang Technological University, Singapore.

We developed two tokenizers for Stack Overflow posts: one based on regular expressions, and the other based on a Conditional Random Field (CRF).

The difficulty in tokenizing Stack Overflow data is that its content is highly unstructured, mixing English text with code snippets. The tokenizer designed and developed by the team is able to

1. tokenize code sections into smaller, meaningful units
2. identify irregular named entities such as "Albert Einstein"
3. identify file paths such as "src/main/resources"

These capabilities greatly improve the accuracy of tokenization and, in turn, the quality of any downstream analysis; a sketch of the regex-based approach follows.
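
For illustration only, the snippet below is a minimal sketch of what a regex-based tokenizer with these capabilities could look like. The patterns and the `tokenize` function are simplified assumptions, not the project's actual rules.

```python
import re

# Illustrative patterns only; the project's actual regex rules are more extensive.
TOKEN_PATTERN = re.compile(r"""
      [A-Z][a-z]+(?:\s[A-Z][a-z]+)+     # multi-word proper names, e.g. "Albert Einstein"
    | [\w.-]+(?:/[\w.-]+)+              # file paths, e.g. "src/main/resources"
    | \w+(?:\.\w+)+\(?\)?               # dotted identifiers or calls, e.g. "os.path.join()"
    | \w+                               # plain words and numbers
    | [^\w\s]                           # any remaining single symbol
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens matched left to right by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

print(tokenize('Albert Einstein stored config.yaml in src/main/resources today.'))
# ['Albert Einstein', 'stored', 'config.yaml', 'in', 'src/main/resources', 'today', '.']
```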

In the end, our CRF-based tokenizer achieved an F1 score of 0.9483 under 5-fold cross-validation, while the regex-based tokenizer achieved an F1 score of 0.9653.
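
As a rough illustration of the CRF approach, tokenization can be cast as per-character sequence labeling ("B" begins a token, "I" continues it, "O" marks characters outside any token). The sketch below assumes the `sklearn-crfsuite` package and a toy feature set and training data; the project's actual features and labels are not shown here.

```python
import sklearn_crfsuite  # assumed dependency: pip install sklearn-crfsuite

def char_features(text, i):
    """Simple features for the i-th character: the character itself and its neighbours."""
    c = text[i]
    return {
        'char': c,
        'is_alpha': c.isalpha(),
        'is_digit': c.isdigit(),
        'is_space': c.isspace(),
        'prev': text[i - 1] if i > 0 else 'BOS',
        'next': text[i + 1] if i < len(text) - 1 else 'EOS',
    }

def to_features(text):
    return [char_features(text, i) for i in range(len(text))]

# Toy training data: 'B' begins a token, 'I' continues it.
train_texts  = ['print(x)', 'a/b.txt']
train_labels = [list('BIIIIBBB'), list('BIIIIII')]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50)
crf.fit([to_features(t) for t in train_texts], train_labels)

def crf_tokenize(text):
    """Predict B/I/O labels for each character and stitch them back into tokens."""
    labels = crf.predict([to_features(text)])[0]
    tokens, current = [], ''
    for c, lab in zip(text, labels):
        if lab == 'B':
            if current:
                tokens.append(current)
            current = c
        elif lab == 'I':
            current += c
        else:  # 'O': character belongs to no token
            if current:
                tokens.append(current)
            current = ''
    if current:
        tokens.append(current)
    return tokens

print(crf_tokenize('len(a)'))  # segmentation quality depends on the (toy) training data
```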

We also developed a program that extracts representative keywords from posts based on token frequency.
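
A frequency-based keyword extractor can be as simple as counting tokens and filtering stopwords. The sketch below is an assumed illustration of that idea, reusing a tokenizer like the regex `tokenize` above; the stopword list and ranking details are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list; the project's filtering rules may differ.
STOPWORDS = {'the', 'a', 'an', 'is', 'to', 'in', 'of', 'and', 'how', 'it'}

def top_keywords(posts, tokenize, k=10):
    """Return the k most frequent non-stopword alphanumeric tokens across the posts."""
    counts = Counter(
        tok.lower()
        for post in posts
        for tok in tokenize(post)
        if tok.isalnum() and tok.lower() not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(k)]

posts = ['How to read a file in Python?', 'Python file IO is easy']
print(top_keywords(posts, tokenize, k=3))  # e.g. ['python', 'file', 'read']
```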

Check it out here!