2 Chapter 1: Measures of Central Tendency

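In this chapter we will learn the descriptive statistics that summarize the center of a data set: the mode, the median, and the mean, each computed in R and in Python.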

Measures of Central Tendency are statistical values that represent the center or typical value of a data set. They help us understand the overall trend or “average” behavior of the data.
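As a quick orientation, the three measures covered in this chapter can be computed side by side with Python's standard statistics module. This is only a minimal sketch: the sample values and variable names are assumptions chosen for illustration.

```python
import statistics

# Illustrative sample (assumed values)
scores = [1, 2, 2, 3, 3, 3, 4, 5]

mean_value = statistics.mean(scores)      # arithmetic average
median_value = statistics.median(scores)  # middle of the sorted data
mode_value = statistics.mode(scores)      # most frequent value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

For this sample the three measures are close to each other (2.875, 3, and 3), which is typical of data without extreme values.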

2.1 Mode in R

The mode is the value(s) that appear most frequently in the data set.

In R, the lsr package provides a simple and clean way to calculate the mode.

Step 1: Install and Load the lsr Package

install.packages("lsr")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)

library(lsr)

Example:

data <- c(1, 2, 2, 3, 3, 3, 4, 5)
modeOf(data)

[1] 3

⚠️ Important Note

Base R has a function named mode(), but it returns the storage type of an object (such as "numeric" or "character"), not the statistical mode. Use modeOf() from the lsr package to find the most frequent value.

Example: Mode in Data Frame

df <- data.frame(
  Score = c(1, 2, 2, 3, 4, 3, NA),
  Group = c("A", "B", "B", "B", "A", "A", NA),
  Age   = c(20, 22, 21, 22, 21, 22, 21)
)

sapply(df, modeOf)

     Score Group Age 
[1,] "2"   "A"   "22"
[2,] "3"   "B"   "21"

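The sapply() function applies modeOf() to each column of the data frame. Every column here has two modes, so the result has two rows, and because the Group column is text, all values are returned as character strings.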

2.2 Mode in Python

Calculating the Mode in Python Using pandas

Use .mode() to Find the Mode

Example 1:

import pandas as pd

data = pd.Series([1, 2, 2, 3, 4, 4, 5])
mod = data.mode()
print("Mode:", mod.tolist())

Mode: [2, 4]

#or

print(mod)

0    2
1    4
dtype: int64

Example 2: Create a Sample Data Frame.

Mode for All Columns:

data = {
    'A': [1, 2, 2, 3, 4],
    'B': [5, 5, 6, 7, 7],
    'C': [10, 10, 10, 11, 12]
}

df = pd.DataFrame(data)

mode_values = df.mode()
print(mode_values)

     A  B     C
0  2.0  5  10.0
1  NaN  7   NaN

Example 3:

Mode of a Single Column:

data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eli", "Fiona"],
    "Age": [24, 30, 22, 24, 30, 24],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"]
}

df = pd.DataFrame(data)
df

      Name  Age  Gender
0    Alice   24  Female
1      Bob   30    Male
2  Charlie   22    Male
3    Diana   24  Female
4      Eli   30    Male
5    Fiona   24  Female

Mode of the "Age" column:

df["Age"].mode()

0    24
Name: Age, dtype: int64
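When pandas is not needed, the mode can also be found with the standard library. The sketch below assumes Python 3.8 or later, where statistics.multimode is available; the variable names are illustrative.

```python
from statistics import multimode

values = [1, 2, 2, 3, 4, 4, 5]

# multimode returns every value tied for the highest count,
# in the order they first appear in the data
modes = multimode(values)

print("Mode:", modes)
```

Like Series.mode() in pandas, this returns all tied values (here 2 and 4) rather than only one of them.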

2.3 Median in R

The median is the middle value of a dataset when the numbers are sorted in order.

If the number of values is odd, it’s the middle one.
If it’s even, it’s the average of the two middle numbers.

Basic median() function.

x <- c(10, 20, 30, 40, 50)
median(x)

[1] 30

With missing values (NA), use na.rm = TRUE; otherwise median() returns NA.

y <- c(5, 8, NA, 12)
median(y, na.rm = TRUE)

[1] 8

Using tapply with categorical variables in R

tapply lets you apply a function (like median) to subsets of a vector, defined by a categorical variable (factor).

Age <- c(24, 30, 22, 24, 30, 24)
Gender <- c("F", "M", "M", "F", "M", "F")

tapply(Age, Gender, median)

 F  M 
24 30

Explanation:

First argument → numeric vector (Age)
Second argument → categorical variable (Gender)
Third argument → function to apply (median)

So this gives you the median age for each gender.

Example: Data Frame

data <- data.frame(
  Gender = c("f", "f", "m", "m", "f", "m"),
  Anxiety = c(12, 15, NA, 20, 18, 25),
  Depression = c(30, NA, 28, 35, 40, NA)
)

data

  Gender Anxiety Depression
1      f      12         30
2      f      15         NA
3      m      NA         28
4      m      20         35
5      f      18         40
6      m      25         NA

Use sapply, lapply

sapply(data[, c("Anxiety","Depression")], median, na.rm = TRUE)  #1

   Anxiety Depression 
      18.0       32.5

#or
sapply(data[sapply(data, is.numeric)], median, na.rm = TRUE)     #2

   Anxiety Depression 
      18.0       32.5

#or
lapply(data[, c("Anxiety","Depression")], median, na.rm = TRUE)  #3

$Anxiety
[1] 18

$Depression
[1] 32.5

Use tapply

tapply(data$Anxiety, data$Gender, median, na.rm = TRUE)

   f    m 
15.0 22.5

tapply(data$Depression, data$Gender, median, na.rm = TRUE)

   f    m 
35.0 31.5

Use aggregate

aggregate(cbind(Anxiety, Depression) ~ Gender, data = data, median, na.rm = TRUE)

  Gender Anxiety Depression
1      f      15         35
2      m      20         35

Note that the values for group m differ from the tapply() results. The formula interface of aggregate() first drops every row that has an NA in any of the listed variables (na.action = na.omit), so the medians are computed from complete rows only and na.rm = TRUE has nothing left to remove. To keep per-column handling of missing values, add na.action = na.pass.

2.4 Median in Python

1. Using the statistics library

import statistics

data = [10, 12, 15, 17, 18]
median_value = statistics.median(data)

print("Median:", median_value)

Median: 15
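
The rule behind the median (the middle value for an odd count, the average of the two middle values for an even count) can be written by hand. This is a minimal sketch for illustration; median_by_hand is a hypothetical helper, not a library function.

```python
import statistics


def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_by_hand([10, 12, 15, 17, 18])
even_result = median_by_hand([10, 12, 17, 18])

print("Odd count:", odd_result)
print("Even count:", even_result)
print("Matches statistics.median:",
      even_result == statistics.median([10, 12, 17, 18]))
```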

2. Using the NumPy library

import numpy as np

data = [10, 12, 15, 17, 18]
median_value = np.median(data)

print("Median:", median_value)

Median: 15.0

3. Using the pandas library with a DataFrame

import pandas as pd

data = {
    "Age": [24, 30, 22, 24, 30, 24],
    "Score": [85, 90, 78, 92, 88, 95]
}

df = pd.DataFrame(data)

print("Median Age:", df["Age"].median())

Median Age: 24.0

print("Median Score:", df["Score"].median())

Median Score: 89.0

Median in data with missing values (NaN)

Sometimes data contains missing values. pandas skips NaN by default, and NumPy provides np.nanmedian() for this case (plain np.median() would return nan):

import numpy as np
import pandas as pd

data = [10, 12, np.nan, 17, 18]

# With NumPy
print("Median with NumPy:", np.nanmedian(data))

Median with NumPy: 14.5

# With pandas
s = pd.Series(data)
print("Median with Pandas:", s.median())

Median with Pandas: 14.5
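
The standard statistics module does not skip NaN values on its own, so with plain Python lists the missing values have to be filtered out first. A minimal sketch, assuming missing values are stored as float NaN; the name clean is illustrative.

```python
import math
import statistics

data = [10, 12, float("nan"), 17, 18]

# Keep only the values that are not NaN before computing the median
clean = [x for x in data if not math.isnan(x)]

median_value = statistics.median(clean)
print("Median without NaN:", median_value)
```

This gives the same result as np.nanmedian() and Series.median() on the same data.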

Example: Data Frame (Numeric + Categorical)

import pandas as pd
import numpy as np


data = {
    "Age": [24, 30, 22, 24, 30, 24],
    "Score": [85, np.nan, 78, 92, np.nan, 95],
    "Gender": ["F", "M", "M", "F", "M", "F"]
}

df = pd.DataFrame(data)
print(df)

   Age  Score Gender
0   24   85.0      F
1   30    NaN      M
2   22   78.0      M
3   24   92.0      F
4   30    NaN      M
5   24   95.0      F

Calculate the median for each column.

medians = df.median(numeric_only=True, skipna=True)
print(medians)

Age      24.0
Score    88.5
dtype: float64

Calculating the median by gender

group_medians = df.groupby("Gender").median(numeric_only=True)
print(group_medians)

         Age  Score
Gender             
F       24.0   92.0
M       30.0   78.0

2.5 Mean in R

2.5.1 Arithmetic Mean in R

In R, we use the mean() function to calculate the mean.

anxiety_scores <- c(12, 15, 20, 22, 18, 30, 25, 19, 17, 50)
mean(anxiety_scores)

[1] 22.8

Mean with missing data (NA) in R

If the data contains missing values (NA), you must use the argument na.rm = TRUE; otherwise mean() returns NA.

data <- c(5, 10, NA, 20)

mean(data, na.rm = TRUE)

[1] 11.66667

Categorical variables: tapply

The researcher wants to examine the average depression scores of two groups (men and women). To do this, we use tapply.

group <- c("f","f","f","m","m","m")
depression <- c(18, 22, 20, 25, 30, 28)

tapply(depression, group, mean)

       f        m 
20.00000 27.66667

Calculating the mean in a Data Frame with missing data (NA)

A researcher collected anxiety and depression scores from psychology students. Some students did not answer some questions, and the data is incomplete.

data <- data.frame(
  Gender = c("f", "f", "m", "m", "f", "m"),
  Anxiety = c(12, 15, NA, 20, 18, 25),
  Depression = c(30, NA, 28, 35, 40, NA)
)

data

  Gender Anxiety Depression
1      f      12         30
2      f      15         NA
3      m      NA         28
4      m      20         35
5      f      18         40
6      m      25         NA

Use lapply, sapply

Mean of each of the columns "Anxiety" and "Depression" (regardless of gender)

sapply(data[, c("Anxiety","Depression")], mean, na.rm = TRUE)  #1

   Anxiety Depression 
     18.00      33.25

#or
sapply(data[sapply(data, is.numeric)], mean, na.rm = TRUE)     #2

   Anxiety Depression 
     18.00      33.25

#or
lapply(data[, c("Anxiety","Depression")], mean, na.rm = TRUE)  #3

$Anxiety
[1] 18

$Depression
[1] 33.25

Use tapply

Mean anxiety and depression scores by gender

tapply(data$Anxiety, data$Gender, mean, na.rm = TRUE)

   f    m 
15.0 22.5

tapply(data$Depression, data$Gender, mean, na.rm = TRUE)

   f    m 
35.0 31.5

Use aggregate

aggregate(cbind(Anxiety, Depression) ~ Gender, data = data, mean, na.rm = TRUE)

  Gender Anxiety Depression
1      f      15         35
2      m      20         35

The formula interface of aggregate() removes rows with an NA in any listed variable before averaging (na.action = na.omit). Group m therefore rests on a single complete row (Anxiety 20, Depression 35), which is why its means differ from the tapply() results. Adding na.action = na.pass keeps all rows and lets na.rm = TRUE handle each column separately.

2.5.2 Geometric Mean in R

The geometric mean is most often used when the data are ratios or rates of change, such as growth rates, investment returns, or population changes.

Method 1: Using the psych package

install.packages("psych")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)

library(psych)

numbers <- c(4, 8, 16, 32)

geometric.mean(numbers)

[1] 11.31371

Method 2: Using a formula (without a package)

numbers <- c(4, 8, 16, 32)

gm <- exp(mean(log(numbers)))

print(gm)

[1] 11.31371

Application of the geometric mean:

Economics: mean annual growth rate of an investment.
Medical science: mean increase in bacterial population.
Management: mean annual sales growth percentage.

2.5.3 Harmonic Mean in R

The harmonic mean is most often used in situations where the data are ratios or rates (such as speed, productivity, or growth rate).

Method 1: Using the psych package

install.packages("psych")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)

library(psych)

numbers <- c(2, 4, 8, 16)

harmonic.mean(numbers)

[1] 4.266667

Method 2: Using a formula (without a package)

numbers <- c(2, 4, 8, 16)

hm <- length(numbers) / sum(1 / numbers)

print(hm)

[1] 4.266667

Application of harmonic mean:

Physics: Calculating the average speed when the same distance is traveled at different speeds.
Economics: Average interest rates or investment returns.
Management: Average productivity indicators in an organization.

Important

The data must be positive.
If the data contains zero or negative values, the harmonic mean is not defined.
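For data with NA, the missing values must be removed: harmonic.mean() from psych does this by default (na.rm = TRUE), while the manual formula needs them dropped first.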

2.5.4 Trimmed Mean in R

The Trimmed Mean is a version of the arithmetic mean in which a percentage of the smallest and largest data values are removed to reduce the effect of outliers.

If the data contains very large or very small values, the arithmetic mean can be misleading. In these situations, we use the trimmed mean.

numbers <- c(2, 3, 4, 5, 100)

mean(numbers)

[1] 22.8

mean(numbers, trim = 0.2)

[1] 4

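The trim value is chosen between 0 and 0.5.
Before calculation, the data is sorted in ascending order.
For example, trim = 0.1 means that 10% of the sorted data is removed from the low end and 10% from the high end; the number of values removed from each end is rounded down.
In the example above, trim = 0.2 removes one value from each end of the five, leaving 3, 4, and 5, whose mean is 4.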

Application

Social sciences: In questionnaires when some responses are very unusual.
Economics: In calculating indices when there are outliers.
Data analysis: In data that has noise or measurement error.

2.6 Mean in Python

2.6.1 Arithmetic Mean in Python

Method 1: Using statistics.mean

import statistics

numbers = [10, 20, 30, 40, 50]
mean_value = statistics.mean(numbers)

print("Mean:", mean_value)

Mean: 30

Method 2: Using NumPy

import numpy as np

numbers = [10, 20, 30, 40, 50]
mean_value = np.mean(numbers)

print("Mean:", mean_value)

Mean: 30.0

If the data is small: use statistics.mean
If the data is large or a matrix: use numpy.mean.

Method 3: Using Pandas

Example 1: Calculating the mean of a column

import pandas as pd

data = {
    "Name": ["Sevda", "MG", "Reza", "RG"],
    "Math": [20, 12, 18, 15],
    "Statistics": [17, 16, 19, 14]
}

df = pd.DataFrame(data)

mean_math = df["Math"].mean()

print("Math:", mean_math)

Math: 16.25

Example 2: Calculating the mean of all columns

means = df.mean(numeric_only=True)
print(means)

Math          16.25
Statistics    16.50
dtype: float64

Example 3: Calculating the mean of each row

df["Individual Mean"] = df.mean(numeric_only=True, axis=1)

print(df)

    Name  Math  Statistics  Individual Mean
0  Sevda    20          17             18.5
1     MG    12          16             14.0
2   Reza    18          19             18.5
3     RG    15          14             14.5

Mean with NaN

Using Pandas

The pandas library ignores NaN values by default in its statistical functions.

import pandas as pd
import numpy as np


data = {
    "Name": ["Sevda", "MG", "Reza", "RG"],
    "Math": [20, 12, np.nan, 15],
    "Statistics": [17, 16, 19, np.nan]
}

df = pd.DataFrame(data)

mean_math = df["Math"].mean()
mean_Statistics = df["Statistics"].mean()


print("Math:", mean_math)

Math: 15.666666666666666

print("Statistics:", mean_Statistics)

Statistics: 17.333333333333332

means = df.mean(numeric_only=True)
print(means)

Math          15.666667
Statistics    17.333333
dtype: float64

Using NumPy with np.nanmean()

import numpy as np

numbers = [10, 20, np.nan, 30, 40]

mean_value = np.nanmean(numbers)

print("Mean:", mean_value)

Mean: 25.0

Example: DataFrame (numeric + categorical)

import pandas as pd
import numpy as np

data = {
    "Name":["Sevda", "MG", "Reza", "RG", "SG"],
    "Class":["A", "B", "C", "A", "C"],
    "Math":[12, 18, 15, np.nan, 20],
    "physics": [17, 16, 19, 14, np.nan],
    "Psychology": [np.nan, 18, 13, 15, 19]
}

df = pd.DataFrame(data)

print(df)

    Name Class  Math  physics  Psychology
0  Sevda     A  12.0     17.0         NaN
1     MG     B  18.0     16.0        18.0
2   Reza     C  15.0     19.0        13.0
3     RG     A   NaN     14.0        15.0
4     SG     C  20.0      NaN        19.0

Mean of all numeric columns

means = df.mean(numeric_only=True)
print(means)

Math          16.25
physics       16.50
Psychology    16.25
dtype: float64

Mean of each row (for each student)

df["Individual Mean"] = df.mean(numeric_only=True, axis=1)

print(df)

    Name Class  Math  physics  Psychology  Individual Mean
0  Sevda     A  12.0     17.0         NaN        14.500000
1     MG     B  18.0     16.0        18.0        17.333333
2   Reza     C  15.0     19.0        13.0        15.666667
3     RG     A   NaN     14.0        15.0        14.500000
4     SG     C  20.0      NaN        19.0        19.500000

Calculate the group mean (based on categorical column)

group_means = df.groupby("Class").mean(numeric_only=True)
print(group_means)

       Math  physics  Psychology  Individual Mean
Class                                            
A      12.0     15.5        15.0        14.500000
B      18.0     16.0        18.0        17.333333
C      17.5     19.0        16.0        17.583333
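
Because the "Individual Mean" column was added to df in the previous step, it is averaged along with the subjects in the grouped output above. If only the subject columns are wanted, one option is to select them before aggregating. The sketch below rebuilds the same data so it runs on its own; the names subjects and subject_means are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class": ["A", "B", "C", "A", "C"],
    "Math": [12, 18, 15, np.nan, 20],
    "physics": [17, 16, 19, 14, np.nan],
    "Psychology": [np.nan, 18, 13, 15, 19],
})

# Columns to average; anything not listed is left out of the result
subjects = ["Math", "physics", "Psychology"]

# Group by class, keep only the subject columns, then take the mean
# (NaN values are skipped by default)
subject_means = df.groupby("Class")[subjects].mean()

print(subject_means)
```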

2.6.2 Geometric Mean in Python

Using statistics.geometric_mean (Python 3.8 and later)

import statistics

numbers = [2, 8, 4]

gm = statistics.geometric_mean(numbers)

print("geometric_mean:", gm)

geometric_mean: 4.0

Using SciPy: scipy.stats.gmean

from scipy.stats import gmean

numbers = [2, 8, 4]

gm = gmean(numbers)

print("gmean:", gm)

gmean: 4.0

If a number is zero: the total geometric mean is zero.

If a number is negative: the geometric mean is not defined (except in special cases).
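
The geometric mean can also be computed directly from its definition, the exponential of the mean of the logarithms, which is the same formula used in the R section. This is a minimal sketch for positive data only; the growth factors are assumed values for illustration.

```python
import math
import statistics

numbers = [2, 8, 4]

# Definition: exp of the arithmetic mean of the logs (positive data only)
gm_manual = math.exp(statistics.fmean([math.log(x) for x in numbers]))
gm_library = statistics.geometric_mean(numbers)

print("Manual :", round(gm_manual, 6))
print("Library:", round(gm_library, 6))

# Application: average yearly growth factor (assumed values:
# +10%, +20%, -10% written as multipliers)
growth_factors = [1.10, 1.20, 0.90]
average_factor = statistics.geometric_mean(growth_factors)

# Growing by the average factor every year gives the same total growth
total_growth = math.prod(growth_factors)
print("Same total growth:",
      math.isclose(average_factor ** len(growth_factors), total_growth))
```

The last check shows why the geometric mean suits growth rates: applying the average factor each year reproduces the overall change, which the arithmetic mean of the factors would not.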

2.6.3 Harmonic Mean in Python

Using statistics.harmonic_mean (Python 3.6 and later)

import statistics

numbers = [2, 4, 4]

hm = statistics.harmonic_mean(numbers)

print("hm :", hm)

hm : 3.0

Using SciPy: scipy.stats.hmean

from scipy.stats import hmean

numbers = [2, 4, 4]

hm = hmean(numbers)

print("hmean:", hm)

hmean: 3.0

All data must be positive. If a value is zero or negative, the harmonic mean is not defined.
The main application is in rates (such as average speed over equal distances).
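
The average-speed case can be checked numerically: when the same distance is driven at two speeds, total distance divided by total time equals the harmonic mean of the speeds. The distance and speeds below are assumed values for illustration.

```python
import math
import statistics

distance_km = 120   # assumed length of each leg of the trip
speeds = [40, 60]   # km/h on the way out and on the way back

# Direct calculation: total distance divided by total time
total_time = sum(distance_km / v for v in speeds)        # 3 h + 2 h
average_speed = distance_km * len(speeds) / total_time   # 240 km / 5 h

# Harmonic mean of the two speeds
hm = statistics.harmonic_mean(speeds)

print("Average speed:", average_speed)
print("Harmonic mean:", hm)
print("Arithmetic mean:", statistics.mean(speeds))
```

Both approaches give 48 km/h, while the arithmetic mean (50 km/h) overstates the average speed because more time is spent at the slower speed.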

2.6.4 Trimmed Mean in Python

The trimmed mean is the same as the regular mean, except that some of the largest and smallest data (for example, the top 10% and bottom 10%) are removed and then the mean is taken.

Using SciPy (scipy.stats.trim_mean)

from scipy.stats import trim_mean

numbers = [1, 2, 2, 3, 4, 100]

tm = trim_mean(numbers, 0.1)

print("tm :", tm)

tm : 18.666666666666668

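trim_mean(data, 0.1) : Remove 10% of the sorted data from each end
trim_mean(data, 0.2) : Remove 20% of the sorted data from each end

The number of values removed from each end is rounded down. With the 6 values above and a proportion of 0.1, nothing is removed, so the result (18.67) equals the ordinary mean and the outlier 100 is still included.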
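
To see what the trimming does, it can be written by hand. This is a minimal sketch that assumes the rounding-down rule described in the scipy.stats.trim_mean documentation; trimmed_mean is a hypothetical helper, not a library function.

```python
def trimmed_mean(values, proportion):
    """Mean after cutting `proportion` of the sorted data from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values to drop from each end, rounded down
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


numbers = [1, 2, 2, 3, 4, 100]

tm_10 = trimmed_mean(numbers, 0.1)  # cut = 0: nothing is removed
tm_20 = trimmed_mean(numbers, 0.2)  # cut = 1: drops 1 and 100

print("trim 0.1:", tm_10)
print("trim 0.2:", tm_20)
```

With a proportion of 0.2 the outlier is dropped and the result is 2.75, the mean of 2, 2, 3, and 4.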
