This tutorial walks you through a complete data mining task in EasyMiner. The steps are:
- Choose action: Upload data, reuse existing datasource or open miner
- Upload dataset and select miner type
- Configure upload
- Configure columns
- Name the miner
- Define pattern for mining
- Set interest measure thresholds
- Activate pruning
- Inspect results, select rules, export or save
The tutorial uses the Titanic dataset.
After logging in, you have three options:
- Upload your data: EasyMiner accepts the Comma Separated Values (CSV) format.
- Reuse an already uploaded dataset: use this option if you want to preprocess the dataset in a different way. Note that to change the miner type, e.g. from unlimited (cloud) to limited (R), you need to upload the dataset again.
- Open an existing miner: this opens the existing dataset together with any preprocessing and any discovered rules saved to the rule clipboard.
Apart from choosing the file to import, this screen lets you set the miner (database) type. There are two types of miners. The recommended miner is suitable for smaller datasets, as it provides the fastest response. When the data are too large for this default miner type, you can try the cloud-based miner.
After uploading the dataset, you can change the upload parameters, such as the encoding or the separator.
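If the preview on the upload screen looks wrong, it can help to check the encoding and separator of the file locally before uploading it again. The following is a minimal sketch using only the Python standard library; it is not part of EasyMiner, the helper name `sniff_csv` and the list of candidate encodings are assumptions, and the sample rows are invented for illustration.

```python
import csv
import io


def sniff_csv(raw, encodings=("utf-8", "cp1250", "latin-1")):
    """Guess the encoding and separator of CSV data given as bytes.

    Hypothetical helper: tries each candidate encoding in turn and lets
    csv.Sniffer pick the separator from a small set of common characters.
    """
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the data, try the next one
        dialect = csv.Sniffer().sniff(text, delimiters=",;\t|")
        return encoding, dialect.delimiter
    raise ValueError("none of the candidate encodings could decode the data")


# Invented sample in the style of the Titanic dataset, separated by semicolons.
sample = (
    "Age;Sex;Class;Survived\n"
    "adult;male;3rd;no\n"
    "child;female;1st;yes\n"
).encode("utf-8")

encoding, separator = sniff_csv(sample)
rows = list(csv.reader(io.StringIO(sample.decode(encoding)), delimiter=separator))
print(encoding, repr(separator), rows[0])
```

The detected values are only a guess based on the first rows, so they should be compared with what the upload screen shows rather than trusted blindly.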
In the next step, you can change data types for individual columns. You can also rename the columns.
In the final step of creating the miner, you can set the name.
Before a data field can be used in a data mining task, it needs to be preprocessed. The three options for selecting data fields for preprocessing are shown in the picture. We use the term attribute to refer to a preprocessed data field.
By dragging attributes to the rule pattern, you define which attributes can appear on the left side (antecedent) and on the right side (consequent) of the rules. If the task has a target attribute, it is typically placed in the consequent and the remaining attributes are placed in the antecedent.
When an attribute is placed in the attribute palette, you can set it to Fixed value or Any value.
- If you choose Fixed value, the attribute can appear in a discovered rule only with the prespecified value.
- If you choose Any value, the attribute can appear with any of its values.
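The difference between the two value types can be pictured as the set of attribute-value pairs that the miner is allowed to use. The sketch below is an illustration only and does not reflect EasyMiner internals; the class `PatternAttribute`, its method `allowed_items` and the value domains are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PatternAttribute:
    """Hypothetical model of one attribute placed in the rule pattern."""
    name: str
    fixed_value: Optional[str] = None  # None stands for "Any value"

    def allowed_items(self, domain: List[str]) -> List[Tuple[str, str]]:
        """Return the attribute-value pairs this attribute may contribute."""
        if self.fixed_value is None:
            # Any value: every value of the attribute may appear in a rule.
            return [(self.name, value) for value in domain]
        # Fixed value: only the prespecified value may appear.
        return [(self.name, self.fixed_value)]


# Invented value domains for two Titanic-style attributes.
domains = {"Sex": ["female", "male"], "Survived": ["no", "yes"]}

antecedent = PatternAttribute("Sex")                       # Any value
consequent = PatternAttribute("Survived", fixed_value="yes")  # Fixed value

any_items = antecedent.allowed_items(domains["Sex"])
fixed_items = consequent.allowed_items(domains["Survived"])
print(any_items)
print(fixed_items)
```

With this pattern, rules such as Sex(female) → Survived(yes) and Sex(male) → Survived(yes) are candidates, while no rule can conclude Survived(no).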
The final step in preparing the rule learning task is to set minimum thresholds for the selected interest measures. The default measures are confidence and support, and the Add interest measure link lets you also add the lift measure.
- The lift measure requires that the confidence measure is also present, but the confidence threshold can be set to an arbitrarily low value.
- If you first remove all the interest measures and then click on Add interest measure, you can select the AUTO conf&supp measure, which automatically sets the thresholds to maximize the classification accuracy of the rule set. For "AUTO conf&supp measure" to be available, the rule pattern has to contain exactly one attribute in the consequent, with its value type set to "Any value".
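To make the thresholds concrete, the sketch below computes support, confidence and lift for one rule on a handful of invented rows and checks them against minimum thresholds. It assumes the standard definition of lift (confidence divided by the relative frequency of the consequent); the function names, the rows and the threshold values are illustrative assumptions, not EasyMiner defaults.

```python
def covers(conditions, row):
    """True if the row matches every attribute-value pair in conditions."""
    return all(row[name] == value for name, value in conditions.items())


def interest_measures(rows, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    n = len(rows)
    n_antecedent = sum(covers(antecedent, r) for r in rows)
    n_consequent = sum(covers(consequent, r) for r in rows)
    n_rule = sum(covers(antecedent, r) and covers(consequent, r) for r in rows)
    support = n_rule / n
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    # Assumed standard definition: lift = confidence / P(consequent).
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


# Invented rows: 3 of 4 women and 1 of 4 men survived.
rows = (
    [{"Sex": "female", "Survived": "yes"}] * 3
    + [{"Sex": "female", "Survived": "no"}]
    + [{"Sex": "male", "Survived": "yes"}]
    + [{"Sex": "male", "Survived": "no"}] * 3
)

support, confidence, lift = interest_measures(
    rows, {"Sex": "female"}, {"Survived": "yes"}
)
print(support, confidence, lift)  # 0.375 0.75 1.5

# Illustrative thresholds; a rule is reported only if it meets all of them.
thresholds = {"support": 0.1, "confidence": 0.5, "lift": 1.2}
passes = (
    support >= thresholds["support"]
    and confidence >= thresholds["confidence"]
    and lift >= thresholds["lift"]
)
```

A lift above 1 means the consequent is more frequent among the rows matching the antecedent than in the data as a whole, which is why a lift threshold can stay useful even when the confidence threshold is set very low.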
The last, optional, setting is pruning. Pruning is available only for tasks that comply with the classification pattern: the consequent has to contain exactly one attribute with the "Any value" value type. EasyMiner uses Classification Based on Associations (CBA) as the pruning algorithm. CBA has three steps:
- Sort the rules according to confidence, support and rule length.
- Remove redundant rules, which are essentially the rules that cover no remaining training example once the examples covered by rules with higher precedence have been removed.
- Append a default rule to the end of the list. The default rule has an empty antecedent and predicts the most frequent class among the examples not covered by the rule set.
Note that the pruned rule set can also be used as a classifier.
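The three steps above, and the use of the pruned list as a classifier, can be sketched as follows. This is a simplified illustration that follows the description in this tutorial; the full CBA algorithm has further details that are omitted here. The rule representation, the function names, the tie-breaking order (shorter rules first) and the data are assumptions made for the example.

```python
from collections import Counter


def covers(antecedent, row):
    """True if the row matches every attribute-value pair of the antecedent."""
    return all(row[name] == value for name, value in antecedent.items())


def cba_prune(rules, rows, target):
    """Simplified sketch of CBA-style pruning (not EasyMiner's implementation)."""
    # Step 1: sort by confidence, then support (both descending), then by
    # antecedent length (shorter first; this tie-break is an assumption).
    ranked = sorted(
        rules,
        key=lambda r: (-r["confidence"], -r["support"], len(r["antecedent"])),
    )
    # Step 2: keep a rule only if it still covers some remaining example,
    # then remove the examples it covers.
    remaining, kept = list(rows), []
    for rule in ranked:
        if any(covers(rule["antecedent"], row) for row in remaining):
            kept.append(rule)
            remaining = [r for r in remaining if not covers(rule["antecedent"], r)]
    # Step 3: default rule with an empty antecedent and the most frequent
    # class among uncovered examples (all rows if nothing is left uncovered).
    pool = remaining or rows
    default_class = Counter(row[target] for row in pool).most_common(1)[0][0]
    kept.append({"antecedent": {}, "consequent": default_class})
    return kept


def classify(rule_list, row):
    """Predict with the first rule in the list whose antecedent matches."""
    for rule in rule_list:
        if covers(rule["antecedent"], row):
            return rule["consequent"]


# Invented training rows and candidate rules.
rows = [
    {"Sex": "female", "Class": "1st", "Survived": "yes"},
    {"Sex": "female", "Class": "3rd", "Survived": "yes"},
    {"Sex": "female", "Class": "3rd", "Survived": "no"},
    {"Sex": "male", "Class": "1st", "Survived": "no"},
    {"Sex": "male", "Class": "3rd", "Survived": "no"},
    {"Sex": "male", "Class": "3rd", "Survived": "no"},
]
rules = [
    {"antecedent": {"Sex": "female", "Class": "3rd"}, "consequent": "yes",
     "confidence": 1 / 2, "support": 1 / 6},
    {"antecedent": {"Sex": "female"}, "consequent": "yes",
     "confidence": 2 / 3, "support": 2 / 6},
    {"antecedent": {"Sex": "female", "Class": "1st"}, "consequent": "yes",
     "confidence": 1.0, "support": 1 / 6},
]

pruned = cba_prune(rules, rows, target="Survived")
prediction = classify(pruned, {"Sex": "male", "Class": "1st"})
```

In this example the rule for third-class women is dropped because the more confident rule for all women already covers those rows, and the default rule predicts "no" because only men remain uncovered.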
Discovered rules are shown in the Result pane. Each rule is accompanied by the values of interest measures:
- Confidence: number of rows covered by the entire rule divided by the number of rows covered by the antecedent of the rule
- Support: number of rows covered by the entire rule divided by the number of all rows
- Green thumbs-up icon: the rule can be moved to the clipboard. Rules in the clipboard are kept even when the task is rerun with different settings.
- Add all rules: all discovered rules are moved to the clipboard.
- Task details: displays a report on the complete data mining task (description of the data, transformations, task settings, results)
- Task export: exports the model in XML format