Data Guides bio photo

Data Guides

Data Data Data Data

Decision Trees with R

Contents

Introduction and Data

This example demonstrates how to perform decision tree analysis using R. The dataset used in this demonstration is titled beetles. The variables are:

Variable Description
width A numeric vector, maximal width of aedeagus (microns)
angle A numeric vector, front angle of aedeagus (1 unit=7.5 degrees)
species Species of beetle from genus Chaetocnema (3 levels)

Data as CSV: Here is the complete data set, in CSV format.

The Task: Using only the provided measurements of a beetle’s aedeagus, determine the species of beetle from the genus Chaetocnoma. The three species are concinna (Con), heikertingeri (Hei), and heptapotamica (Hep).

Getting Started in R

To begin, you need to have R installed on your computer. You can download it at the R Project website. We also recommend the R Studio. Launch R and then type the following commands at the command line.

#You may need to install the following packages

install.packages("party")

#Load the package into R.  This can also
#be done by clicking "Packages" and selecting
#the data set

library(party)

#Read the csv into R

beetles=read.csv("beetles.csv", header=TRUE)

Partitioning the Data

Since the goal of this procedure is to predict a categorical variable based on numeric inputs, a decision tree may be appropriate here.

First, the data must be separated into training and validation sets. A common percentage for these datasets is 70% training and 30% validation. The training dataset will be used to build the decision tree, while the validation set will be used to test the predictions and evaluate the model.

set.seed(1234)
train<-sample(2,nrow(beetles),replace=TRUE,prob=c(0.7,0.3))
beetles.tr<-beetles[train==1,]
beetles.va<-beetles[train==2,]

Building the Decision Tree

Now that the data are separated, the decision tree may be created. The commands below create a decision tree with the default settings.

model<-Species~Width+Angle
beetles_tree<-ctree(model,data=beetles.tr)
table(predict(beetles_tree),beetles.tr$Species)

Confusion Matrix

The output shows that the tree possesses clear predictive power, as the species are almost completely predicted correctly. Printing the tree reveals the decision rules and properties.

print(beetles_tree)

Decision Rules

The first split was on aedeagus width of 133 microns. Both of these divisions were then split again on Angle (one at 12, and the other at 13). The weights output reveals how many observations are present in a particular leaf. To get more detailed information, the tree may be plotted.

plot(beetles_tree)

Decision Tree

All of the information is still available from the print function, but additional insights are provided. The p-values are available at each split, as well as bar graphs of the species divisions within leaves. Notice that it is not possible to get exact percentages of species within leaves. For example, Node 3 contains approximately 75% heikertingeri. For a similar chart with percentages, the type can be specified as “simple”.

plot(beetles_tree, type="simple")

Decision Tree Simple

This chart reveals that the proportion of heikertingeri in Node 3 is, in fact, 75%.

Validation of the Model

Now, the model must be tested against the validation data. This allows one to see how the model responds to new data, and serves as an assessment of the model’s validity if implemented.

beetles_treetest<-predict(beetles_tree, newdata=beetles.va)
table(beetles_treetest, beetles.va$Species)

Validate Confusion Matrix

The model seems to perform well on the validation set, correctly predicting all but one specimen.

Conclusion

We hope you found this guide to decision trees in R useful!