Getting started with the MSR2021Replication#
Install Mallet#
To install the Mallet tool, it is first necessary to have the Apache Ant build tool installed. Download the binary from https://ant.apache.org/ and follow the installation manual to configure it.
With Ant installed and configured, open the Mallet 2.0.8 folder of the MSR2021Replication repository at mallet/mallet-2.0.8 and run the following command:
$ ant
The Mallet tool will then be available at mallet/mallet-2.0.8/bin/mallet.
Run with Jupyter notebook#
The Jupyter notebook can be used to analyze the StackOverflow dataset. To start it, run the following command from the repository root:
$ jupyter-notebook SO_dataset_analysis.ipynb
Follow the notebook instructions to import the correct dataset and run the scripts.
Run with bash#
Install python libraries#
Run the following command to install the libraries required by the scripts:
$ pip install -r notebook/requirements.txt
Open a Python3 console with the command:
$ python3
Inside the console, download the required NLTK data by running the following code:
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists
Export the variables#
Use the following commands to export the environment variables so the scripts can find the dataset and the output folder.
# Export path to the raw dataset
$ export DATASET_PATH=./tcc/so_questions.csv
# Export the output path
$ export OUTPUT_PATH=./output
# Export the number of topics for the topic model
$ export TOPICS_NUM=15
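The scripts can then read these variables from the environment. A minimal sketch of how that might look, assuming the scripts use os.environ with defaults matching the values above (load_config is a hypothetical helper, not the repository's actual code):

```python
import os

def load_config():
    # Read the exported variables, falling back to the documented
    # defaults when a variable is not set. These defaults are an
    # assumption for illustration only.
    return {
        "dataset_path": os.environ.get("DATASET_PATH", "./tcc/so_questions.csv"),
        "output_path": os.environ.get("OUTPUT_PATH", "./output"),
        "topics_num": int(os.environ.get("TOPICS_NUM", "15")),
    }
```

Reading configuration from the environment this way keeps the same scripts usable for different datasets and topic counts without code changes.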
Prepare dataset#
Before running Mallet, it is necessary to run the following script, which parses the .csv dataset, cleans it, and creates the documents used by the Mallet tool.
$ python3 prepare_dataset.py
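The preparation step can be sketched roughly as follows. This is a simplified, hypothetical version: the real prepare_dataset.py may tokenize and clean differently, and the "body" column name is an assumption about the CSV schema.

```python
import csv
import os
import re

def clean(text):
    # Lowercase, keep only alphabetic tokens, drop very short ones --
    # a stand-in for the script's actual cleaning steps.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if len(t) > 2)

def prepare(dataset_path, output_dir):
    # Write one document file per question, which is the layout
    # that `mallet import-dir` expects.
    os.makedirs(output_dir, exist_ok=True)
    with open(dataset_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # "body" is an assumed column name for the question text.
            doc_path = os.path.join(output_dir, f"doc_{i}.txt")
            with open(doc_path, "w", encoding="utf-8") as out:
                out.write(clean(row.get("body", "")))
```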
Run the Mallet tool#
Run the following Mallet commands:
$ mallet/mallet-2.0.8/bin/mallet import-dir --input $OUTPUT_PATH/so_data/ --output $OUTPUT_PATH/so.mallet --keep-sequence --remove-stopwords --extra-stopwords extra_stopwords/so.txt
$ mallet/mallet-2.0.8/bin/mallet train-topics --random-seed 100 --input $OUTPUT_PATH/so.mallet --num-topics 15 --optimize-interval 20 --output-state $OUTPUT_PATH/so-topic-state.gz --output-topic-keys $OUTPUT_PATH/so_keys.txt --output-doc-topics $OUTPUT_PATH/so_composition.txt --diagnostics-file $OUTPUT_PATH/so_results/so_diagnostics.xml
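For reference, the two invocations above can be built programmatically, which also documents what each flag does. This is a convenience sketch only (the mallet_commands helper is hypothetical; the replication itself just runs the commands in bash):

```python
def mallet_commands(output_path, num_topics):
    """Build the two Mallet invocations shown above as argv lists."""
    mallet = "mallet/mallet-2.0.8/bin/mallet"
    import_cmd = [
        mallet, "import-dir",
        "--input", f"{output_path}/so_data/",       # one file per document
        "--output", f"{output_path}/so.mallet",     # binary feature file
        "--keep-sequence",                          # required for topic modeling
        "--remove-stopwords",
        "--extra-stopwords", "extra_stopwords/so.txt",
    ]
    train_cmd = [
        mallet, "train-topics",
        "--random-seed", "100",                     # fixed seed for reproducibility
        "--input", f"{output_path}/so.mallet",
        "--num-topics", str(num_topics),
        "--optimize-interval", "20",                # hyperparameter optimization
        "--output-state", f"{output_path}/so-topic-state.gz",
        "--output-topic-keys", f"{output_path}/so_keys.txt",
        "--output-doc-topics", f"{output_path}/so_composition.txt",
        "--diagnostics-file", f"{output_path}/so_results/so_diagnostics.xml",
    ]
    return import_cmd, train_cmd
```

The fixed random seed is what makes the topic assignments reproducible across runs.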
Parse mallet results#
After running the Mallet tool, run the following script:
$ python3 manage_results.py
This script will create a document file for each topic containing all the questions related to that topic.
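The grouping step can be sketched as below. This is a hypothetical simplification of manage_results.py, assuming the newer Mallet doc-topics output format, where each line of so_composition.txt lists the document id, its file name, and one proportion per topic:

```python
def dominant_topics(composition_lines):
    """Map each document name to its most probable topic index.

    Assumes lines of the form '<doc-id> <file-name> <p0> <p1> ...';
    the real Mallet output format can vary between versions.
    """
    result = {}
    for line in composition_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.split()
        doc_name = fields[1]
        proportions = [float(p) for p in fields[2:]]
        result[doc_name] = proportions.index(max(proportions))
    return result
```

Grouping the documents by their dominant topic is then a matter of inverting this mapping and concatenating the question files per topic.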
Run with Docker#
First steps#
Download Docker and Docker Compose.
Extract the so_questions.csv.zip file present in the TCC folder.
Start the Docker container#
Run the Docker container:
make start
Install dependencies and exec docker#
To install the dependencies and enter the container:
make init
Follow the next steps in the shell opened by the make init command:#
Prepare data#
Prepare the data:
make prepare
Process data#
Process the data with Mallet:
make process
Results#
Process the results:
make results