Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It’s crucial for decision-making in various fields, including forestry.
I. Introduction to Statistics
- Data: A collection of facts, figures, or information, usually numerical, collected for a specific purpose.
- Raw Data: Data collected in its original form, before any organization or analysis.
- Types of Data:
- Quantitative Data: Numerical data that can be measured or counted.
- Discrete Data: Can only take specific, separate values (e.g., number of trees, number of animals). Often integers.
- Continuous Data: Can take any value within a given range (e.g., height of a tree, temperature, weight).
- Qualitative (Categorical) Data: Data that describes characteristics or categories, not numerical values.
- Nominal Data: Categories with no inherent order (e.g., tree species, forest type).
- Ordinal Data: Categories with a meaningful order, but the intervals between categories are not uniform or precisely measurable (e.g., timber quality: excellent, good, fair, poor).
- Population: The entire group of individuals, objects, or data that a researcher is interested in studying.
- Sample: A subset or a representative part of the population selected for study. It’s often impractical to study the entire population.
- Variables: Characteristics or attributes that can take on different values.
- Independent Variable: The variable whose variation does not depend on that of another. It’s manipulated or chosen by the researcher.
- Dependent Variable: The variable whose value depends on that of another. It’s measured or observed.
- Descriptive Statistics: Summarizes and describes the characteristics of a dataset (e.g., mean, median, mode, standard deviation).
- Inferential Statistics: Uses sample data to make inferences or predictions about a larger population (e.g., hypothesis testing, confidence intervals).
II. Measures of Central Tendency
These measures locate the center of a dataset.
1. Mean (Arithmetic Mean)
- Definition: The average of all values in a dataset. Calculated by summing all values and dividing by the number of values.
- Formula:
- For ungrouped data: $\bar{X} = (\sum X) / n$
- $\bar{X}$ (X-bar) = Mean
- $\sum X$ (Sigma X) = Sum of all values
- $n$ = Number of values
- For grouped data (frequency distribution): $\bar{X} = (\sum fX) / (\sum f)$
- $f$ = Frequency of each class/value
- $X$ = Midpoint of each class (for continuous data) or value (for discrete data)
- Properties:
- Sensitive to extreme values (outliers).
- Used for quantitative data.
- Most commonly used measure.
- Highlight: “Mean” is the “Average.”
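The grouped-data formula above can be checked with a short script. A minimal sketch, assuming a hypothetical frequency table of tree-diameter classes (the midpoints and frequencies are made up for illustration):

```python
# Grouped-data mean: X-bar = (sum of f*X) / (sum of f)
midpoints   = [12.5, 17.5, 22.5, 27.5]  # X: class midpoints (e.g., diameter in cm)
frequencies = [4, 10, 7, 3]             # f: observations per class

total_f  = sum(frequencies)                                    # sum of f = 24
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f*X = 465
mean = total_fx / total_f
print(mean)  # 19.375
```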
2. Median
- Definition: The middle value of a dataset when arranged in ascending or descending order.
- Calculation:
- Odd number of values: The median is the $(n+1)/2$-th value.
- Even number of values: The median is the average of the $(n/2)$-th and $((n/2)+1)$-th values.
- Properties:
- Not affected by extreme values (resistant to outliers).
- Can be used for quantitative and ordinal data.
- Highlight: “Median” is the “Middle.” (Remember M for Middle)
3. Mode
- Definition: The value that appears most frequently in a dataset.
- Properties:
- Can be used for any type of data (quantitative or qualitative).
- A dataset can have no mode, one mode (unimodal), two modes (bimodal), or more (multimodal).
- Not affected by extreme values.
- Highlight: “Mode” is “Most Frequent.”
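All three measures of central tendency can be compared on one small dataset using Python's standard statistics module. The tree heights below are hypothetical; the 40 m value is included deliberately as an outlier to show the mean's sensitivity:

```python
import statistics

heights = [10, 12, 12, 13, 14, 40]  # hypothetical tree heights (m); 40 is an outlier

print(statistics.mean(heights))    # 16.833... (pulled upward by the outlier)
print(statistics.median(heights))  # 12.5 (average of the two middle values, 12 and 13)
print(statistics.mode(heights))    # 12 (the most frequent value)
```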
Table: Comparison of Measures of Central Tendency
| Feature | Mean | Median | Mode |
|---|---|---|---|
| Calculation | Sum values / Count | Middle value (ordered data) | Most frequent value |
| Effect of Outliers | Highly affected | Minimally affected | Not affected |
| Data Types | Quantitative (interval, ratio) | Quantitative, Ordinal | All (nominal, ordinal, quantitative) |
| Uniqueness | Always unique | Always unique | May not exist, or multiple modes |
| Best Used When | Data is symmetrical, no extreme outliers | Data is skewed, outliers present | Finding most common item, nominal data |
III. Measures of Dispersion (Variability)
These measures describe how spread out or varied the data points are.
1. Range
- Definition: The difference between the highest and lowest values in a dataset.
- Formula: Range = Maximum Value – Minimum Value
- Properties: Simple to calculate but highly affected by outliers. Only considers two values.
2. Variance
- Definition: The average of the squared differences from the mean. It measures the spread of data points around the mean.
- Formula (Population Variance): $\sigma^2 = (\sum (X - \mu)^2) / N$
- $\sigma^2$ (sigma-squared) = Population Variance
- $\mu$ (mu) = Population Mean
- $N$ = Population size
- Formula (Sample Variance): $s^2 = (\sum (X - \bar{X})^2) / (n - 1)$
- $s^2$ = Sample Variance
- $\bar{X}$ = Sample Mean
- $(n - 1)$ is used for unbiased estimation of population variance from a sample. This is called Bessel’s Correction.
- Properties: Units are squared, making interpretation difficult.
3. Standard Deviation
- Definition: The square root of the variance. It measures the typical distance of data points from the mean.
- Formula (Population Standard Deviation): $\sigma = \sqrt{(\sum (X - \mu)^2) / N}$
- Formula (Sample Standard Deviation): $s = \sqrt{(\sum (X - \bar{X})^2) / (n - 1)}$
- Properties:
- Expressed in the same units as the original data, making it easier to interpret than variance.
- A larger standard deviation indicates greater variability.
- Most widely used measure of dispersion.
- Highlight: “Standard Deviation” gives the “Average Distance from the Mean.”
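The population and sample formulas differ only in the divisor ($N$ versus $n - 1$). A minimal sketch with made-up measurements, checked against the standard library's own functions:

```python
import math
import statistics

data = [4.0, 8.0, 6.0, 5.0, 7.0]  # hypothetical measurements
n = len(data)
mean = sum(data) / n              # 6.0

ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 10.0
pop_var  = ss / n        # sigma^2 = 2.0 (divide by N)
samp_var = ss / (n - 1)  # s^2 = 2.5 (divide by n - 1, Bessel's Correction)
pop_sd  = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

# The standard library agrees:
assert pop_var  == statistics.pvariance(data)
assert samp_var == statistics.variance(data)
```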
4. Quartiles and Interquartile Range (IQR)
- Quartiles: Divide an ordered dataset into four equal parts.
- Q1 (First Quartile): The value below which 25% of the data falls. (Also known as 25th percentile).
- Q2 (Second Quartile): The median; the value below which 50% of the data falls. (Also known as 50th percentile).
- Q3 (Third Quartile): The value below which 75% of the data falls. (Also known as 75th percentile).
- Interquartile Range (IQR): The range of the middle 50% of the data.
- Formula: IQR = Q3 – Q1
- Properties:
- Not affected by extreme values (resistant to outliers).
- Useful for skewed distributions.
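Quartile conventions vary between textbooks and software, so exam answers can differ slightly depending on the method. The sketch below uses the common median-of-halves method on hypothetical data:

```python
def median(xs):
    xs = sorted(xs)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def iqr(xs):
    xs = sorted(xs)
    n, half = len(xs), len(xs) // 2
    q1 = median(xs[:half])          # lower half (median excluded when n is odd)
    q3 = median(xs[half + n % 2:])  # upper half
    return q3 - q1

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
# sorted: 3 5 7 8 | 12 | 13 14 18 21  ->  Q1 = 6, Q2 = 12, Q3 = 16
print(iqr(data))  # 10.0
```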
IV. Probability
Probability is the measure of the likelihood that an event will occur. It’s a numerical value between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
1. Basic Concepts
- Experiment: An action or process that leads to one or several possible outcomes (e.g., tossing a coin, rolling a die, selecting a tree for measurement).
- Outcome: A single possible result of an experiment (e.g., getting a Head, rolling a 3).
- Sample Space (S): The set of all possible outcomes of an experiment (e.g., S = {Head, Tail} for coin toss; S = {1, 2, 3, 4, 5, 6} for die roll).
- Event (E): A subset of the sample space; a particular outcome or a collection of outcomes (e.g., Event A = getting an even number {2, 4, 6}).
2. Types of Probability
- Classical Probability (A Priori): Based on logical reasoning or equally likely outcomes.
- Formula: P(E) = (Number of favorable outcomes) / (Total number of possible outcomes)
- Example: Probability of rolling a 4 on a fair die is 1/6.
- Empirical Probability (A Posteriori / Relative Frequency): Based on observed data or experiments.
- Formula: P(E) = (Number of times an event occurred) / (Total number of trials)
- Example: If a forester observes 20 diseased trees in a sample of 100, the empirical probability of a diseased tree is 20/100 = 0.2.
- Subjective Probability: Based on personal judgment, experience, or intuition. Often used when classical or empirical methods are not applicable (e.g., forester’s estimate of the probability of a wildfire).
3. Key Terms and Rules
- Complement of an Event (E’): All outcomes in the sample space that are *not* in Event E.
- P(E’) = 1 – P(E)
- Example: If P(raining) = 0.3, then P(not raining) = 1 – 0.3 = 0.7.
- Mutually Exclusive Events: Events that cannot occur at the same time (i.e., they have no common outcomes).
- Example: Getting a Head and a Tail on a single coin toss.
- Addition Rule for Mutually Exclusive Events: P(A or B) = P(A) + P(B)
- Non-Mutually Exclusive Events: Events that can occur at the same time (i.e., they have common outcomes).
- Example: Drawing a King or a Red card from a deck. (King of Hearts and King of Diamonds are both King and Red).
- General Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
- Independent Events: The occurrence of one event does not affect the probability of the other event occurring.
- Example: Tossing two coins: result of first coin doesn’t affect the second.
- Multiplication Rule for Independent Events: P(A and B) = P(A) × P(B)
- Dependent Events: The occurrence of one event *does* affect the probability of the other event occurring.
- Example: Drawing two cards *without replacement* from a deck.
- Conditional Probability: The probability of an event B occurring given that event A has already occurred.
- Formula: P(B|A) = P(A and B) / P(A) (where P(A) > 0)
- Multiplication Rule for Dependent Events: P(A and B) = P(A) × P(B|A)
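The addition and multiplication rules can be verified exactly with fractions, using the standard 52-card-deck examples above:

```python
from fractions import Fraction

# General addition rule: P(King or Red) = P(King) + P(Red) - P(King and Red)
p_king         = Fraction(4, 52)
p_red          = Fraction(26, 52)
p_king_and_red = Fraction(2, 52)   # King of Hearts, King of Diamonds
p_king_or_red  = p_king + p_red - p_king_and_red
print(p_king_or_red)               # 7/13

# Dependent events: P(two Kings without replacement) = P(A) * P(B|A)
p_two_kings = Fraction(4, 52) * Fraction(3, 51)
print(p_two_kings)                 # 1/221
```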
4. Permutations and Combinations
These are methods for counting the number of ways events can occur.
- Permutations: Order matters. Used when arrangement is important (e.g., arranging letters, selecting members for specific roles).
- Formula (n objects taken r at a time): $P(n, r) = n! / (n-r)!$
- $n!$ (n factorial) = $n \times (n-1) \times \cdots \times 2 \times 1$; by convention, $0! = 1$.
- Example: Number of ways to arrange 3 letters (A, B, C) in a row: $P(3, 3) = 3! = 6$ (ABC, ACB, BAC, BCA, CAB, CBA).
- Combinations: Order does *not* matter. Used when selecting a group of items where the arrangement within the group is not important.
- Formula (n objects taken r at a time): $C(n, r) = n! / (r! (n-r)!)$
- Example: Number of ways to choose 2 letters from (A, B, C): $C(3, 2) = 3! / (2! \, 1!) = 3$ (AB, AC, BC).
Highlight for Permutations vs. Combinations:
- Permutations: Think of Position (order matters).
- Combinations: Think of Choosing (order doesn’t matter).
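Python's math module implements both counts directly (math.perm and math.comb, available since Python 3.8), so the factorial formulas can be checked:

```python
from math import comb, factorial, perm

print(perm(3, 3))  # 6: arrangements of A, B, C
print(comb(3, 2))  # 3: ways to choose 2 letters from {A, B, C}

# Both agree with the factorial formulas:
assert perm(5, 2) == factorial(5) // factorial(5 - 2)                   # P(5,2) = 20
assert comb(5, 2) == factorial(5) // (factorial(2) * factorial(5 - 2))  # C(5,2) = 10
```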
Mnemonic to remember main probability rules:
- OR makes you ADD (Addition Rule)
- AND makes you MULTIPLY (Multiplication Rule)
V. Data Presentation
Effective presentation makes data understandable and interpretable.
1. Tabular Presentation
- Frequency Distribution Table: Organizes raw data into classes or categories and shows the number (frequency) of observations falling into each class.
- Includes: Class intervals, tallies (optional), frequency, relative frequency, cumulative frequency.
- Relative Frequency: Proportion of observations in a class (frequency / total observations).
- Cumulative Frequency: Sum of frequencies up to a particular class.
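A frequency table with relative and cumulative frequencies can be built in a few lines. A sketch using hypothetical species tallies from a plot survey (names and counts are made up for illustration):

```python
from collections import Counter

species = ["pine", "oak", "pine", "fir", "oak", "pine", "oak", "fir", "pine"]

freq = Counter(species)     # frequency of each category
total = sum(freq.values())  # total observations = 9
cum = 0
for name, f in freq.most_common():
    rel = f / total         # relative frequency (proportion of observations)
    cum += f                # cumulative frequency (running sum)
    print(f"{name:4s}  f={f}  rel={rel:.2f}  cum={cum}")
```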
2. Graphical Presentation
- Bar Chart: Used for qualitative (categorical) or discrete quantitative data. Bars are separate to emphasize distinct categories. Length of bar represents frequency/count.
- Histogram: Used for continuous quantitative data. Bars are adjacent to indicate continuity. The width of the bar represents the class interval, and the area of the bar represents the frequency.
- Pie Chart: Used to show parts of a whole for qualitative data. Each slice represents a proportion of the total. Not recommended for many categories.
- Line Graph: Shows trends over time or continuous data. Points are plotted and connected by lines.
- Scatter Plot: Shows the relationship between two quantitative variables. Each point represents a pair of values. Used to suggest correlation.
VI. Correlation and Regression (Brief Overview)
- Correlation: Measures the strength and direction of the linear relationship between two quantitative variables.
- Correlation Coefficient (r): Ranges from -1 to +1.
- +1: Perfect positive linear relationship.
- -1: Perfect negative linear relationship.
- 0: No linear relationship.
- Caution: Correlation does not imply causation.
- Regression: Describes the nature of the relationship (equation) between two or more variables, allowing for prediction.
- Simple Linear Regression: Fits a straight line to the data (Y = a + bX).
- Y = Dependent variable
- X = Independent variable
- a = Y-intercept
- b = Slope
- Used to predict the value of one variable based on the value of another.
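The least-squares slope, intercept, and correlation coefficient all follow from sums of deviations. A sketch with hypothetical (and deliberately perfectly linear) diameter-height pairs, so the fit and correlation come out exact:

```python
import math

x = [10, 15, 20, 25, 30]  # X: diameter at breast height (cm), hypothetical
y = [8, 11, 14, 17, 20]   # Y: tree height (m); exactly Y = 2 + 0.6X

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

b = sxy / sxx                   # slope = 0.6
a = my - b * mx                 # Y-intercept = 2.0
r = sxy / math.sqrt(sxx * syy)  # correlation = 1.0 (perfect positive)
print(a, b, r)
```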
VII. Key Statistical Symbols
- $\mu$ (mu): Population Mean
- $\bar{X}$ (X-bar): Sample Mean
- $\sigma$ (sigma): Population Standard Deviation
- $s$: Sample Standard Deviation
- $\sigma^2$ (sigma-squared): Population Variance
- $s^2$: Sample Variance
- $N$: Population Size
- $n$: Sample Size
- $\sum$ (Sigma): Summation
- $P(E)$: Probability of Event E
- $P(E')$: Probability of Complement of Event E
- $P(A \text{ or } B)$: Probability of A or B (Union)
- $P(A \text{ and } B)$: Probability of A and B (Intersection)
- $P(B|A)$: Probability of B given A (Conditional Probability)
VIII. Practical Application in Forestry (Relevance for Forester Exam)
- Sampling: Selecting plots for tree measurements, animal counts, soil analysis.
- Descriptive Statistics: Summarizing tree heights, diameters, timber volume, species distribution in a forest stand.
- Inferential Statistics: Estimating total forest volume from sample plots, comparing growth rates under different silvicultural treatments, predicting forest fire risk.
- Probability: Assessing the likelihood of pest outbreaks, disease spread, success of reforestation efforts, or predicting timber yield under varying conditions.
- Data Presentation: Creating graphs and tables to communicate forest inventory results, ecological trends, or management plans.
Final Revision Tip: Understand the purpose behind each statistical measure and technique. Don’t just memorize formulas; know when and why to use them. Pay close attention to distinguishing between sample and population notations, especially for variance and standard deviation.