In this chapter we will learn descriptive statistics.
Measures of Central Tendency are statistical values that represent the center or typical value of a data set. They help us understand the overall trend or “average” behavior of the data.
The mode is the value(s) that appear most frequently in the data set.
In R, the lsr package provides a simple and clean way to calculate the mode.
Step 1: Install and Load the lsr Package
install.packages("lsr")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
library(lsr)
Example:
data <- c(1, 2, 2, 3, 3, 3, 4, 5)
modeOf(data)
[1] 3
⚠️ Important Note
Base R also has a function named mode(), but it returns the storage type of an object (such as "numeric" or "character") rather than the statistical mode.
Example: Mode in Data Frame
df <- data.frame(
  Score = c(1, 2, 2, 3, 4, 3, NA),
  Group = c("A", "B", "B", "B", "A", "A", NA),
  Age = c(20, 22, 21, 22, 21, 22, 21)
)
sapply(df, modeOf)
Score Group Age
[1,] "2" "A" "22"
[2,] "3" "B" "21"
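sapply() applies modeOf() to each column of the data frame. Every column in this example has two modes, so the result is a matrix with one row per mode; the values are printed as character strings because the Group column contains text.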
Calculating the Mode in Python Using pandas
Use .mode() to Find the Mode
Example 1:
import pandas as pd
data = pd.Series([1, 2, 2, 3, 4, 4, 5])
mod = data.mode()
print("Mode:", mod.tolist())
Mode: [2, 4]
# or
print(mod)
0 2
1 4
dtype: int64
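To make the idea behind .mode() concrete, the following sketch counts how often each value occurs using only the Python standard library. The helper name find_modes is hypothetical and is not part of pandas; it is just one possible way to reproduce the result above.

```python
from collections import Counter

def find_modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)          # maps each value to its number of occurrences
    highest = max(counts.values())    # the largest frequency in the data
    return sorted(v for v, c in counts.items() if c == highest)

data = [1, 2, 2, 3, 4, 4, 5]
print("Mode:", find_modes(data))  # Mode: [2, 4]
```

Because 2 and 4 both occur twice, both are returned, which matches the two rows printed by pandas.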
Example 2: Create a Sample Data Frame.
Mode for All Columns:
data = {
    'A': [1, 2, 2, 3, 4],
    'B': [5, 5, 6, 7, 7],
    'C': [10, 10, 10, 11, 12]
}
df = pd.DataFrame(data)

mode_values = df.mode()
print(mode_values)
A B C
0 2.0 5 10.0
1 NaN 7 NaN
Example 3:
Mode of a Single Column:
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eli", "Fiona"],
    "Age": [24, 30, 22, 24, 30, 24],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"]
}
df = pd.DataFrame(data)
df
Name Age Gender
0 Alice 24 Female
1 Bob 30 Male
2 Charlie 22 Male
3 Diana 24 Female
4 Eli 30 Male
5 Fiona 24 Female
Mode of the "Age" column:
df["Age"].mode()
0 24
Name: Age, dtype: int64
The median is the middle value of a dataset when the numbers are sorted in order.
If the number of values is odd, it’s the middle one.
If it’s even, it’s the average of the two middle numbers.
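The odd/even rule above can be sketched in a few lines of Python. The function name simple_median is hypothetical and chosen only for illustration; it assumes a non-empty list of numbers without missing values.

```python
def simple_median(values):
    """Median of a non-empty list of numbers, following the odd/even rule."""
    ordered = sorted(values)      # the data must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: take the single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(simple_median([50, 10, 30, 20, 40]))  # 30
print(simple_median([12, 5, 8, 20]))        # 10.0
```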
Basic median() function.
x <- c(10, 20, 30, 40, 50)
median(x)
[1] 30
With missing values (NA), we can use na.rm = TRUE:
y <- c(5, 8, NA, 12)
median(y, na.rm = TRUE)
[1] 8
Using tapply with categorical variables in R
tapply lets you apply a function (like median) to subsets of a vector, defined by a categorical variable (factor).
Age <- c(24, 30, 22, 24, 30, 24)
Gender <- c("F", "M", "M", "F", "M", "F")

tapply(Age, Gender, median)
F M
24 30
Explanation:
First argument → numeric vector (Age)
Second argument → categorical variable (Gender)
Third argument → function to apply (median)
So this gives you the median age for each gender.
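For comparison, the same split-and-apply idea can be sketched in plain Python. This only illustrates what tapply does conceptually, not how R implements it; the variable names mirror the R example.

```python
from collections import defaultdict
from statistics import median

age = [24, 30, 22, 24, 30, 24]
gender = ["F", "M", "M", "F", "M", "F"]

# Step 1: split the numeric values into one list per category.
groups = defaultdict(list)
for a, g in zip(age, gender):
    groups[g].append(a)

# Step 2: apply the function (here: median) to each group.
medians = {g: median(vals) for g, vals in sorted(groups.items())}
print(medians)  # {'F': 24, 'M': 30}
```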
Example: Data Frame
data <- data.frame(
  Gender = c("f", "f", "m", "m", "f", "m"),
  Anxiety = c(12, 15, NA, 20, 18, 25),
  Depression = c(30, NA, 28, 35, 40, NA)
)
data
Gender Anxiety Depression
1 f 12 30
2 f 15 NA
3 m NA 28
4 m 20 35
5 f 18 40
6 m 25 NA
Use sapply , lapply
sapply(data[, c("Anxiety","Depression")], median, na.rm = TRUE) #1
Anxiety Depression
18.0 32.5
#or
sapply(data[sapply(data, is.numeric)], median, na.rm = TRUE) #2
Anxiety Depression
18.0 32.5
#or
lapply(data[, c("Anxiety","Depression")], median, na.rm = TRUE) #3
$Anxiety
[1] 18
$Depression
[1] 32.5
Use tapply
tapply(data$Anxiety, data$Gender, median, na.rm = TRUE)
f m
15.0 22.5
tapply(data$Depression, data$Gender, median, na.rm = TRUE)
f m
35.0 31.5
Use aggregate
aggregate(cbind(Anxiety, Depression) ~ Gender, data = data, median,na.rm = TRUE)
Gender Anxiety Depression
1 f 15 35
2 m 20 35
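The aggregate() result for group m (Anxiety 20) differs from the tapply() result (22.5). The reason is that the formula interface of aggregate() uses na.action = na.omit by default, so every row with an NA in any listed column is dropped before the median is computed, whereas tapply() with na.rm = TRUE removes missing values one column at a time. The sketch below reproduces both behaviours in plain Python; the helper group_median and the use of None for NA are assumptions made for illustration only.

```python
from statistics import median

# Same values as the data frame above; None marks a missing value (NA).
rows = [
    ("f", 12, 30), ("f", 15, None), ("m", None, 28),
    ("m", 20, 35), ("f", 18, 40), ("m", 25, None),
]

def group_median(records, column, gender):
    """Median of one column (by index) for one gender, skipping missing values."""
    vals = [r[column] for r in records
            if r[0] == gender and r[column] is not None]
    return median(vals)

# Column-wise removal, comparable to tapply(..., na.rm = TRUE).
print(group_median(rows, 1, "m"))      # 22.5

# Row-wise removal: keep only rows without any missing value,
# comparable to the default behaviour of the aggregate() formula interface.
complete = [r for r in rows if None not in r]
print(group_median(complete, 1, "m"))  # 20
```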
1. Using the statistics library
import statistics
data = [10, 12, 15, 17, 18]
median_value = statistics.median(data)

print("Median:", median_value)
Median: 15
2. Using the NumPy library
import numpy as np
data = [10, 12, 15, 17, 18]
median_value = np.median(data)

print("Median:", median_value)
Median: 15.0
3. Using the Pandas library with a DataFrame
import pandas as pd
data = {
    "Age": [24, 30, 22, 24, 30, 24],
    "Score": [85, 90, 78, 92, 88, 95]
}
df = pd.DataFrame(data)

print("Median Age:", df["Age"].median())
Median Age: 24.0
print("Median Score:", df["Score"].median())
Median Score: 89.0
Median in data with missing values (NaN)
Sometimes data contains missing values. Pandas skips NaN by default when computing the median, and NumPy provides np.nanmedian() for the same purpose:
import numpy as np
import pandas as pd
data = [10, 12, np.nan, 17, 18]

# With NumPy
print("Median with NumPy:", np.nanmedian(data))
Median with NumPy: 14.5
# With Pandas
s = pd.Series(data)
print("Median with Pandas:", s.median())
Median with Pandas: 14.5
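What "ignoring NaN" means can be sketched with the standard library alone: the missing entries are filtered out first, and the median is taken over the remaining values. This is an illustration of the idea, not the internal implementation of NumPy or Pandas.

```python
import math
from statistics import median

data = [10, 12, math.nan, 17, 18]

# Keep only real numbers; math.isnan() identifies the missing entries.
clean = [x for x in data if not math.isnan(x)]

print("Values used:", clean)                 # Values used: [10, 12, 17, 18]
print("Median without NaN:", median(clean))  # Median without NaN: 14.5
```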
Example: Data Frame (Numeric + Categorical)
import pandas as pd
import numpy as np
data = {
    "Age": [24, 30, 22, 24, 30, 24],
    "Score": [85, np.nan, 78, 92, np.nan, 95],
    "Gender": ["F", "M", "M", "F", "M", "F"]
}
df = pd.DataFrame(data)
print(df)
Age Score Gender
0 24 85.0 F
1 30 NaN M
2 22 78.0 M
3 24 92.0 F
4 30 NaN M
5 24 95.0 F
Calculate the median for each column.
medians = df.median(numeric_only=True, skipna=True)
print(medians)
Age 24.0
Score 88.5
dtype: float64
Calculating the median by gender
group_medians = df.groupby("Gender").median(numeric_only=True)
print(group_medians)
Age Score
Gender
F 24.0 92.0
M 30.0 78.0
In R, we use the mean() function to calculate the mean.
anxiety_scores <- c(12, 15, 20, 22, 18, 30, 25, 19, 17, 50)
mean(anxiety_scores)
[1] 22.8
Mean with missing data (NA) in R
If the data contains missing values (NA), mean() returns NA unless you pass the argument na.rm = TRUE.
data <- c(5, 10, NA, 20)

mean(data, na.rm = TRUE)
[1] 11.66667
Categorical variables: tapply
The researcher wants to examine the average depression scores of two groups (men and women). To do this, we use tapply.
group <- c("f", "f", "f", "m", "m", "m")
depression <- c(18, 22, 20, 25, 30, 28)

tapply(depression, group, mean)
f m
20.00000 27.66667
Calculating the mean in a Data Frame with missing data (NA)
A researcher collected anxiety and depression scores from psychology students. Some students did not answer some questions, and the data is incomplete.
data <- data.frame(
  Gender = c("f", "f", "m", "m", "f", "m"),
  Anxiety = c(12, 15, NA, 20, 18, 25),
  Depression = c(30, NA, 28, 35, 40, NA)
)
data
Gender Anxiety Depression
1 f 12 30
2 f 15 NA
3 m NA 28
4 m 20 35
5 f 18 40
6 m 25 NA
Use lapply or sapply
Mean of the Anxiety and Depression columns (regardless of gender)
sapply(data[, c("Anxiety","Depression")], mean, na.rm = TRUE) #1
Anxiety Depression
18.00 33.25
#or
sapply(data[sapply(data, is.numeric)], mean, na.rm = TRUE) #2
Anxiety Depression
18.00 33.25
#or
lapply(data[, c("Anxiety","Depression")], mean, na.rm = TRUE) #3
$Anxiety
[1] 18
$Depression
[1] 33.25
Use tapply
Mean anxiety and depression scores by gender
tapply(data$Anxiety, data$Gender, mean, na.rm = TRUE)
f m
15.0 22.5
tapply(data$Depression, data$Gender, mean, na.rm = TRUE)
f m
35.0 31.5
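Use aggregate
Note that the formula interface of aggregate() removes every row that has an NA in any of the listed columns before the function is applied (na.action = na.omit by default). For group m only one complete row remains, so the Anxiety mean is 20 here rather than the 22.5 returned by tapply().
aggregate(cbind(Anxiety, Depression) ~ Gender, data = data, mean, na.rm = TRUE)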
Gender Anxiety Depression
1 f 15 35
2 m 20 35
The geometric mean is most often used when the data are ratios or rates, such as growth rates, investment returns, or population changes.
Method 1: Using the psych package
install.packages("psych")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
(as 'lib' is unspecified)
library(psych)
numbers <- c(4, 8, 16, 32)

geometric.mean(numbers)
[1] 11.31371
Method 2: Using a formula (without a package)
numbers <- c(4, 8, 16, 32)

gm <- exp(mean(log(numbers)))

print(gm)
[1] 11.31371
Application of the geometric mean:
Economics: mean annual growth rate of an investment.
Medical science: mean growth rate of a bacterial population.
Management: mean annual sales growth percentage.
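The growth-rate application can be illustrated with a small worked example. The growth factors below are hypothetical: an investment doubles in the first year and halves in the second, so it ends where it started. The sketch uses the same log-based formula as Method 2, written in Python.

```python
import math

# Hypothetical growth factors: +100% in year 1 (x2.0), -50% in year 2 (x0.5).
factors = [2.0, 0.5]

# Arithmetic mean of the factors suggests 25% growth per year.
arithmetic = sum(factors) / len(factors)

# Geometric mean: exp of the mean of the logs, as in the R formula.
geometric = math.exp(sum(math.log(f) for f in factors) / len(factors))

print("Arithmetic mean factor:", arithmetic)          # Arithmetic mean factor: 1.25
print("Geometric mean factor:", round(geometric, 6))  # Geometric mean factor: 1.0
```

A factor of 1.0 means no net growth, which agrees with the investment returning to its starting value; the arithmetic mean overstates the average growth in this case.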
The harmonic mean is most often used in situations where the data are ratios or rates (such as speed, productivity, or growth rate).
Method 1: Using the psych package
library(psych)
numbers <- c(2, 4, 8, 16)

harmonic.mean(numbers)
[1] 4.266667
Method 2: Using a formula (without a package)
numbers <- c(2, 4, 8, 16)

hm <- length(numbers) / sum(1 / numbers)

print(hm)
[1] 4.266667
Application of harmonic mean:
Physics: Calculating the average speed when the same distance is traveled at different speeds.
Economics: Average interest rates or investment returns.
Management: Average productivity indicators in an organization.
Important
The data must be positive.
If the data contains zero or negative values, the harmonic mean is not defined.
For data with NA, you must use na.rm=TRUE
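The average-speed application mentioned above can be checked with a short worked example. The speeds and the distance are hypothetical values chosen for illustration: a trip of 120 km at 60 km/h and the same 120 km back at 40 km/h.

```python
from statistics import harmonic_mean, mean

speeds = [60, 40]   # km/h, hypothetical: outbound and return leg
distance = 120      # km per leg, hypothetical

# True average speed = total distance / total time.
total_time = sum(distance / s for s in speeds)        # 2 h + 3 h = 5 h
true_average = (distance * len(speeds)) / total_time

print(true_average)           # 48.0
print(harmonic_mean(speeds))  # 48.0
print(mean(speeds))           # 50
```

The harmonic mean matches the true average speed, while the arithmetic mean of 50 km/h overstates it because more time is spent on the slower leg.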
The Trimmed Mean is a version of the arithmetic mean in which a percentage of the smallest and largest data values are removed to reduce the effect of outliers.
If the data contains very large or very small values, the arithmetic mean can be misleading. In these situations, we use the trimmed mean.
numbers <- c(2, 3, 4, 5, 100)

mean(numbers)
[1] 22.8
mean(numbers, trim = 0.2)
[1] 4
The trim value is chosen between 0 and 0.5.
For example, trim=0.1 means that 10% of the data is removed from the beginning and 10% from the end.
Before calculation, the data is sorted in ascending order.
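The trimming steps described above can be sketched by hand in Python. The function name trimmed_mean is hypothetical, and the sketch assumes that floor(n * trim) values are removed from each end of the sorted data, which reproduces the R result for this example; it does not cover trim values of 0.5 or more.

```python
import math

def trimmed_mean(values, trim):
    """Mean after dropping floor(n * trim) values from each end of the sorted data."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)                 # sort in ascending order first
    k = math.floor(len(ordered) * trim)      # number of values removed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

numbers = [2, 3, 4, 5, 100]
print(trimmed_mean(numbers, 0.0))  # 22.8
print(trimmed_mean(numbers, 0.2))  # 4.0
```

With trim = 0.2 and five values, one value is removed from each end (2 and 100), leaving 3, 4 and 5.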
Application
Social sciences: In questionnaires when some responses are very unusual.
Economics: In calculating indices when there are outliers.
Data analysis: In data that has noise or measurement error.
Method 1: Using statistics.mean
import statistics
numbers = [10, 20, 30, 40, 50]
mean_value = statistics.mean(numbers)

print("Mean:", mean_value)
Mean: 30
Method 2: Using NumPy
import numpy as np
numbers = [10, 20, 30, 40, 50]
mean_value = np.mean(numbers)

print("Mean:", mean_value)
Mean: 30.0
If the data is small: use statistics.mean
If the data is large or a matrix: use numpy.mean.
Method 3: Using Pandas
Example 1: Calculating the mean of a column
import pandas as pd
data = {
    "Name": ["Sevda", "MG", "Reza", "RG"],
    "Math": [20, 12, 18, 15],
    "Statistics": [17, 16, 19, 14]
}
df = pd.DataFrame(data)

mean_math = df["Math"].mean()

print("Math:", mean_math)
Math: 16.25
Example 2: Calculating the mean of all columns
means = df.mean(numeric_only=True)
print(means)
Math 16.25
Statistics 16.50
dtype: float64
Example 3: Calculating the mean of each row
df["Individual Mean"] = df.mean(numeric_only=True, axis=1)

print(df)
Name Math Statistics Individual Mean
0 Sevda 20 17 18.5
1 MG 12 16 14.0
2 Reza 18 19 18.5
3 RG 15 14 14.5
Mean with NaN
Pandas ignores NaN values by default in its statistical functions.
import pandas as pd
import numpy as np
data = {
    "Name": ["Sevda", "MG", "Reza", "RG"],
    "Math": [20, 12, np.nan, 15],
    "Statistics": [17, 16, 19, np.nan]
}
df = pd.DataFrame(data)

mean_math = df["Math"].mean()
mean_Statistics = df["Statistics"].mean()

print("Math:", mean_math)
Math: 15.666666666666666
print("Statistics:", mean_Statistics)
Statistics: 17.333333333333332
means = df.mean(numeric_only=True)
print(means)
Math 15.666667
Statistics 17.333333
dtype: float64
NumPy with np.nanmean()
import numpy as np
numbers = [10, 20, np.nan, 30, 40]

mean_value = np.nanmean(numbers)

print("Mean:", mean_value)
Mean: 25.0
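Why a NaN-aware function is needed can be shown with the standard library alone: a single NaN propagates through an ordinary sum, so the missing values have to be filtered out before averaging. This sketch illustrates the idea and is not NumPy's actual implementation.

```python
import math

numbers = [10, 20, math.nan, 30, 40]

# A plain average is contaminated by the NaN.
naive = sum(numbers) / len(numbers)

# Filtering the NaN out first gives the mean of the remaining four values.
clean = [x for x in numbers if not math.isnan(x)]
nan_aware = sum(clean) / len(clean)

print(math.isnan(naive))  # True
print(nan_aware)          # 25.0
```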
Example: DataFrame (numeric + categorical)
import pandas as pd
import numpy as np
data = {
    "Name": ["Sevda", "MG", "Reza", "RG", "SG"],
    "Class": ["A", "B", "C", "A", "C"],
    "Math": [12, 18, 15, np.nan, 20],
    "physics": [17, 16, 19, 14, np.nan],
    "Psychology": [np.nan, 18, 13, 15, 19]
}
df = pd.DataFrame(data)

print(df)
Name Class Math physics Psychology
0 Sevda A 12.0 17.0 NaN
1 MG B 18.0 16.0 18.0
2 Reza C 15.0 19.0 13.0
3 RG A NaN 14.0 15.0
4 SG C 20.0 NaN 19.0
means = df.mean(numeric_only=True)
print(means)
Math 16.25
physics 16.50
Psychology 16.25
dtype: float64
df["Individual Mean"] = df.mean(numeric_only=True, axis=1)

print(df)
Name Class Math physics Psychology Individual Mean
0 Sevda A 12.0 17.0 NaN 14.500000
1 MG B 18.0 16.0 18.0 17.333333
2 Reza C 15.0 19.0 13.0 15.666667
3 RG A NaN 14.0 15.0 14.500000
4 SG C 20.0 NaN 19.0 19.500000
group_means = df.groupby("Class").mean(numeric_only=True)
print(group_means)
Math physics Psychology Individual Mean
Class
A 12.0 15.5 15.0 14.500000
B 18.0 16.0 18.0 17.333333
C 17.5 19.0 16.0 17.583333
statistics.geometric_mean (Python 3.8 and later)
import statistics
numbers = [2, 8, 4]

gm = statistics.geometric_mean(numbers)

print("geometric_mean:", gm)
geometric_mean: 4.0
scipy.stats.gmean
from scipy.stats import gmean
numbers = [2, 8, 4]

gm = gmean(numbers)

print("gmean:", gm)
gmean: 4.0
If any value is zero, the geometric mean is zero.
statistics.harmonic_mean (Python 3.6 and later)
import statistics
numbers = [2, 4, 4]

hm = statistics.harmonic_mean(numbers)

print("hm :", hm)
hm : 3.0
scipy.stats.hmean
from scipy.stats import hmean
numbers = [2, 4, 4]

hm = hmean(numbers)

print("hmean:", hm)
hmean: 3.0
All data must be positive. If a value is zero or negative, the harmonic mean is not defined.
The main application is in rates (such as average speed over equal distances).
The trimmed mean is the same as the regular mean, except that some of the largest and smallest data (for example, the top 10% and bottom 10%) are removed and then the mean is taken.
scipy.stats.trim_mean
from scipy.stats import trim_mean
numbers = [1, 2, 2, 3, 4, 100]

tm = trim_mean(numbers, 0.1)

print("tm :", tm)
tm : 18.666666666666668
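trim_mean(data, 0.1): removes 10% of the values from each end. The number of values removed per end is rounded down, so with only 6 values nothing is removed and the result above equals the ordinary mean.
trim_mean(data, 0.2): removes 20% of the values from each end (one value per end when there are 6 values).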
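How many values are actually removed depends on the sample size. As far as the SciPy documentation describes it, the number cut from each end is the proportion times the number of values, with the fractional part discarded. The sketch below reproduces that arithmetic without SciPy; the helper cut_count is hypothetical and only mirrors this assumed rounding rule.

```python
def cut_count(n, proportion):
    """Number of values dropped from EACH end; the fractional part is discarded."""
    return int(proportion * n)

numbers = sorted([1, 2, 2, 3, 4, 100])

for proportion in (0.1, 0.2):
    k = cut_count(len(numbers), proportion)
    kept = numbers[k:len(numbers) - k]       # values that remain after trimming
    print(proportion, k, sum(kept) / len(kept))
# 0.1 0 18.666666666666668
# 0.2 1 2.75
```

With six values, a proportion of 0.1 removes nothing, so the outlier 100 still dominates the result; a proportion of 0.2 removes 1 and 100 and gives a mean of 2.75.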
End Reza