3 Measures of variability

Variability is most commonly measured with the following descriptive statistics:

Range: the difference between the highest and lowest values.
Interquartile range: the range of the middle half of a distribution (third quartile minus first quartile).
Standard deviation: the square root of the variance, roughly the typical distance of values from the mean.
Variance: average of squared distances from the mean.
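
As a quick illustration, each of these four measures can be computed with a base R function. The sketch below uses a small example vector. Note that var() and sd() divide by n - 1 (the sample versions), so they differ slightly from a literal average of squared distances.

```r
# Small example vector
x <- c(12, 15, 18, 25, 30, 35)

range_value <- max(x) - min(x)  # range: highest minus lowest value
iqr_value   <- IQR(x)           # interquartile range: Q3 - Q1
var_value   <- var(x)           # sample variance (denominator n - 1)
sd_value    <- sd(x)            # sample standard deviation, sqrt(var(x))

range_value
iqr_value
var_value
sd_value
```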

3.1 Range in R

x <- c(12, 15, 18, 25, 30, 35)

range_value <- max(x) - min(x)
range_value

[1] 23

Using the range() function, which returns the minimum and maximum rather than their difference

x <- c(12, 15, 18, 25, 30, 35)
range(x)

[1] 12 35
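
Because range() returns a vector of two values, the range as a single number can be obtained by passing that result to diff(). A minimal sketch with the same vector:

```r
x <- c(12, 15, 18, 25, 30, 35)

# diff() subtracts the first element (minimum) from the second (maximum)
range_value <- diff(range(x))
range_value
# [1] 23
```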

Using the range() function with NA values

x <- c(10, 15, NA, 25, 30, NA, 40)

range(x, na.rm = TRUE)

[1] 10 40
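
For comparison, here is a short sketch of what happens when na.rm is left at its default of FALSE: any NA in the input propagates to the result. The second call shows one way to get the range as a single number while ignoring the missing values.

```r
x <- c(10, 15, NA, 25, 30, NA, 40)

# Without na.rm = TRUE, both the minimum and the maximum are NA
range(x)

# Range as a single number, with NA values removed first
range_value <- diff(range(x, na.rm = TRUE))
range_value
# [1] 30
```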

Range in a data frame

df <- data.frame(
  A = c(10, 15, 20, 25, 30),
  B = c(5, 7, 9, 12, 15),
  C = c(100, 120, 110, 130, 125),
  Gender = c("m", "f", "f", "m", "f")
)

df

   A  B   C Gender
1 10  5 100      m
2 15  7 120      f
3 20  9 110      f
4 25 12 130      m
5 30 15 125      f

Using sapply() on the numeric columns

range_df <- sapply(df[, sapply(df, is.numeric)], function(x) max(x) - min(x))
range_df

 A  B  C 
20 10 30

Using aggregate() to compute the range within each level of a categorical variable

aggregate(cbind(A, B, C) ~ Gender, data = df,
          FUN = function(x) max(x) - min(x))

  Gender  A B  C
1      f 15 8 15
2      m 15 7 30

Range in a data frame with NA

df <- data.frame(
  A = c(10, 15, 20, 25, 30),
  B = c(5, NA, 9, 12, 15),
  C = c(100, 120, 110, 130, NA),
  Gender = c("m", "f", "f", "m", "f")
)

df

   A  B   C Gender
1 10  5 100      m
2 15 NA 120      f
3 20  9 110      f
4 25 12 130      m
5 30 15  NA      f

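In R, missing values are handled by putting na.rm = TRUE inside the function. Note, however, that the formula interface of aggregate() first drops every row that contains an NA (its default is na.action = na.omit), so here rows 2 and 5 are removed before FUN is called. Group f is left with a single complete row, which is why all of its ranges are 0: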

aggregate(cbind(A, B, C) ~ Gender, data = df,
          FUN = function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE))

  Gender  A B  C
1      f  0 0  0
2      m 15 7 30
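
One possible way to keep the partially observed rows, sketched here under the assumption that missing values should be ignored column by column, is to pass na.action = na.pass. aggregate() then hands the NA values to FUN, where na.rm = TRUE removes them. The name range_by_gender is only illustrative.

```r
df <- data.frame(
  A = c(10, 15, 20, 25, 30),
  B = c(5, NA, 9, 12, 15),
  C = c(100, 120, 110, 130, NA),
  Gender = c("m", "f", "f", "m", "f")
)

# na.action = na.pass keeps rows that contain NA;
# na.rm = TRUE then drops the NA values within each column and group
range_by_gender <- aggregate(
  cbind(A, B, C) ~ Gender, data = df,
  FUN = function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE),
  na.action = na.pass
)
range_by_gender
```

With this call, group f uses all three of its rows, giving ranges of 15, 6 and 10 for A, B and C, while group m is unchanged.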