Stratified Boxplot in R Programming

Stratified Boxplot in detail

A boxplot is a graphical summary that represents groups of numerical data using their quartiles. Because boxplots are non-parametric, they display the variation in samples from a statistical population without assuming any specific underlying distribution. The spacing between the parts of the box indicates the degree of dispersion and skewness in the data, and points that fall beyond the whiskers are drawn individually as potential outliers. Boxplots can be oriented vertically or horizontally, and they get their name from the rectangular “box” in the center.
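
As a rough illustration of the quantities a boxplot encodes, the sketch below computes them for a single numeric vector. The choice of iris$Petal.Length is only for illustration. boxplot.stats() is the helper from the grDevices package that boxplot() uses internally; its hinges can differ slightly from the quartiles returned by quantile().

```r
# Illustrative vector; any numeric vector would do
x <- iris$Petal.Length

# Quartiles and interquartile range (roughly the length of the box)
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Five numbers drawn by boxplot(): lower whisker, lower hinge,
# median, upper hinge, upper whisker
bs <- boxplot.stats(x)
five <- bs$stats

# Values lying beyond the whiskers, drawn as individual points
outliers <- bs$out

print(quartiles)
print(iqr)
print(five)
print(outliers)
```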

A stratified boxplot draws one box for each level of a categorical variable, or for each combination of levels when the groups are defined by more than one categorical variable. This makes it easy to compare the distribution of a numeric variable across categories.

Implementation in R

Stratified boxplots in R can be created using the boxplot() function from the graphics package, which is loaded by default. The syntax of its formula method is shown below, followed by a brief description of the key parameters:

boxplot(formula, data = NULL, ..., subset, na.action = NULL,
        xlab = mklab(y_var = horizontal),
        ylab = mklab(y_var = !horizontal),
        add = FALSE, ann = !add, horizontal = FALSE, drop = FALSE,
        sep = ".", lex.order = FALSE)

Key Parameters

  • formula: A formula such as y ~ grp, where y is the numeric variable and grp is the grouping (categorical) variable.
  • data: A data frame or list containing the variables specified in the formula.
  • subset: An optional vector specifying a subset of observations.
  • na.action: A function indicating what should be done with missing values.
  • xlab, ylab: Labels for the x- and y-axes; can be suppressed with ann = FALSE.
  • add: Logical flag to add the boxplot to an existing plot.
  • horizontal: Logical flag; if TRUE, boxplots are drawn horizontally.
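  • range: Determines how far the whiskers extend from the box, as a multiple of the interquartile range (default 1.5); a value of 0 extends the whiskers to the data extremes.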
  • width: A vector specifying the relative widths of the boxes.
  • varwidth: If TRUE, the widths of the boxes are proportional to the square roots of the number of observations in each group.
  • notch: If TRUE, a notch is drawn in each side of the boxes.
  • outline: If FALSE, outliers are not plotted.
  • names: Group labels displayed under each boxplot.
  • border: Colors for the outlines of the boxplots.
  • col: Colors for the bodies of the boxplots.
  • log: A character string indicating whether the x-axis, the y-axis, or both should be on a log scale.
  • pars: A list of additional graphical parameters.
  • at: Numeric vector specifying the locations for drawing the boxplots when adding to an existing plot.
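
To make a few of these parameters concrete, here is a minimal sketch that combines several of them in one call. The built-in mtcars dataset, the group labels, and the axis titles are choices made for this illustration only. Note that with horizontal = TRUE the numeric variable runs along the x-axis, so xlab labels it.

```r
# mtcars ships with R; mpg is numeric and cyl (4, 6, 8) defines the groups
b <- boxplot(mpg ~ cyl, data = mtcars,
             horizontal = TRUE,   # draw the boxes left to right
             varwidth = TRUE,     # box width reflects sqrt(group size)
             outline = FALSE,     # do not draw points beyond the whiskers
             range = 1.5,         # whiskers reach at most 1.5 * IQR from the box
             names = c("4 cyl", "6 cyl", "8 cyl"),
             col = c("lightblue", "lightgreen", "lightpink"),
             border = "darkblue",
             xlab = "Miles per gallon",
             ylab = "Cylinders")

# boxplot() invisibly returns the statistics it used,
# including the group labels and group sizes
print(b$names)
print(b$n)
```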

Example

In this example, we will use the iris dataset to create a stratified boxplot that compares the petal lengths across the three iris species.

# Load the iris dataset
data(iris)

# Create a stratified boxplot of Petal.Length by Species
boxplot(Petal.Length ~ Species, data = iris,
        main = "Boxplot of Petal Length by Iris Species",
        xlab = "Iris Species",
        ylab = "Petal Length (cm)",
        col = c("lightblue", "lightgreen", "lightpink"),
        border = "darkblue")
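
The numbers behind this plot can also be inspected directly. The sketch below, which assumes nothing beyond the iris data already used above, calls boxplot() with plot = FALSE so that only the statistics are computed; the row labels are added here for readability and are not part of the returned object.

```r
# Compute the boxplot statistics without drawing anything
b <- boxplot(Petal.Length ~ Species, data = iris, plot = FALSE)

# One column per species; rows are lower whisker, lower hinge,
# median, upper hinge, upper whisker
summary_table <- b$stats
colnames(summary_table) <- b$names
rownames(summary_table) <- c("lower whisker", "lower hinge", "median",
                             "upper hinge", "upper whisker")
print(summary_table)

# Outlying values and the species they belong to
outliers <- data.frame(Species = b$names[b$group],
                       Petal.Length = b$out)
print(outliers)
```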

Example: Lung Capacity Data

This example uses the LungCapData dataset, read from LungCapData.csv in the working directory, to compare lung capacity by gender. We:

  1. Load the dataset.
  2. Categorize ages into groups.
  3. Create three boxplots:
    • Boxplot 1: Compares lung capacity between males and females.
    • Boxplot 2: Compares lung capacity between males and females for subjects aged 20 and above.
    • Boxplot 3: Displays stratified boxplots of lung capacity by gender within the defined age groups.

# Load the dataset (read.csv() already returns a data frame)
LungCapData <- read.csv("LungCapData.csv", header = TRUE)

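# Categorise Age into groups. cut() intervals are right-closed by default,
# i.e. (0,15], (15,20] and (20,30]; ages outside these ranges become NA
AgeGroups <- cut(LungCapData$Age,
                 breaks = c(0, 15, 20, 30),
                 labels = c("15 and under", "Over 15 to 20", "Over 20"))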

# BoxPlot 1: Lung Capacity by Gender
boxplot(LungCapData$LungCap ~ LungCapData$Gender,
        ylab = "Lung Capacity",
        main = "Lung Capacity: Males vs Females",
        col = c("skyblue", "salmon"),
        las = 1)

# BoxPlot 2: Lung Capacity by Gender for subjects aged 20 and above
boxplot(LungCap ~ Gender, data = LungCapData,
        subset = Age >= 20,
        ylab = "Lung Capacity",
        main = "Lung Capacity (Age >= 20): Males vs Females",
        col = c("skyblue", "salmon"),
        las = 1)

# BoxPlot 3: Stratified Lung Capacity by Gender across Age Groups
boxplot(LungCapData$LungCap ~ LungCapData$Gender * AgeGroups,
        ylab = "Lung Capacity",
        xlab = "Gender and Age Group",
        main = "Stratified Lung Capacity by Gender and Age Groups",
        col = c("skyblue", "salmon"),
        las = 2)
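
Because LungCapData.csv may not be at hand, the two ideas in this example (binning a numeric variable with cut() and stratifying by two factors) can be tried on data that ships with R. The sketch below is only an illustration: the ages vector is hypothetical, and the built-in ToothGrowth dataset stands in for the lung capacity data.

```r
# Hypothetical ages, chosen to sit on and around the breakpoints
ages <- c(14, 15, 16, 20, 21)
age_groups <- cut(ages, breaks = c(0, 15, 20, 30))

# The default labels show that each interval includes its upper bound
print(data.frame(ages, age_groups))

# Two-factor stratification with a built-in dataset:
# tooth length by supplement type within each dose level
tg <- transform(ToothGrowth, dose = factor(dose))
b <- boxplot(len ~ supp * dose, data = tg,
             col = c("skyblue", "salmon"),  # recycled: one colour per supplement
             xlab = "Supplement and Dose",
             ylab = "Tooth Length",
             las = 2)

# Group labels combine the factor levels, joined by sep (default ".")
print(b$names)
```

The first factor in the formula varies fastest along the axis, which is why a two-colour vector is enough to colour the boxes by supplement type.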
